Sunday, September 27, 2020

Metadata Locks Instrumentation in MariaDB 10.5

There are different ways to study metadata locks in MySQL and MariaDB, as I once described in detail. Until recently MariaDB did not provide the performance_schema.metadata_locks table, but it was finally added in 10.5. So now you can get the details the same way that has been the easiest and best known one since MySQL 5.7. You just have to make sure performance_schema is enabled at startup:

openxs@ao756:~/dbs/maria10.5$ bin/mysqld_safe --no-defaults --port=3311 --socket=/tmp/mariadb105.sock --performance_schema=1 &
[1] 23626
openxs@ao756:~/dbs/maria10.5$ 200927 18:26:41 mysqld_safe Logging to '/home/openxs/dbs/maria10.5/data/ao756.err'.
200927 18:26:41 mysqld_safe Starting mariadbd daemon with databases from /home/openxs/dbs/maria10.5/data

openxs@ao756:~/dbs/maria10.5$ bin/mysql --socket=/tmp/mariadb105.sock
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 3
Server version: 10.5.6-MariaDB Source distribution

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> select @@performance_schema;
+----------------------+
| @@performance_schema |
+----------------------+
|                    1 |
+----------------------+
1 row in set (0,000 sec)

MariaDB [(none)]> use performance_schema
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [performance_schema]> show tables like '%metadata%';
+-------------------------------------------+
| Tables_in_performance_schema (%metadata%) |
+-------------------------------------------+
| metadata_locks                            |
+-------------------------------------------+
1 row in set (0,001 sec)

MariaDB [performance_schema]> select * from setup_instruments where name like '%metadata%';
+---------------------------------------------------------+---------+-------+
| NAME                                                    | ENABLED | TIMED |
+---------------------------------------------------------+---------+-------+
| wait/io/file/csv/metadata                               | YES     | YES   |
| stage/sql/Waiting for schema metadata lock              | NO      | NO    |
| stage/sql/Waiting for table metadata lock               | NO      | NO    |
| stage/sql/Waiting for stored function metadata lock     | NO      | NO    |
| stage/sql/Waiting for stored procedure metadata lock    | NO      | NO    |
| stage/sql/Waiting for stored package body metadata lock | NO      | NO    |
| stage/sql/Waiting for trigger metadata lock             | NO      | NO    |
| stage/sql/Waiting for event metadata lock               | NO      | NO    |
| memory/performance_schema/metadata_locks                | YES     | NO    |
| wait/lock/metadata/sql/mdl                              | NO      | NO    |
+---------------------------------------------------------+---------+-------+
10 rows in set (0,034 sec)

So, the metadata_locks table is there and we just need to enable the wait/lock/metadata/sql/mdl instrument:

MariaDB [performance_schema]> update setup_instruments set enabled = 'YES', timed='YES' where name like 'wait%mdl';
Query OK, 1 row affected (0,030 sec)
Rows matched: 1  Changed: 1  Warnings: 0
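
By the way, to have this instrument enabled right from startup (so that you do not have to update setup_instruments after every restart), the --performance-schema-instrument startup option should presumably work here as well. I have not double checked this exact spelling on 10.5, so take it as a sketch:

bin/mysqld_safe --no-defaults --port=3311 --socket=/tmp/mariadb105.sock \
    --performance_schema=1 \
    --performance-schema-instrument='wait/lock/metadata/sql/mdl=ON' &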

MariaDB [performance_schema]> select * from metadata_locks\G
*************************** 1. row ***************************
          OBJECT_TYPE: TABLE
        OBJECT_SCHEMA: performance_schema
          OBJECT_NAME: metadata_locks
OBJECT_INSTANCE_BEGIN: 140032981253664
            LOCK_TYPE: SHARED_READ
        LOCK_DURATION: TRANSACTION
          LOCK_STATUS: GRANTED
               SOURCE:
      OWNER_THREAD_ID: 13
       OWNER_EVENT_ID: 1
1 row in set (0,001 sec)

MariaDB [performance_schema]> select connection_id();
+-----------------+
| connection_id() |
+-----------------+
|               3 |
+-----------------+
1 row in set (0,000 sec)

MariaDB [performance_schema]> select * from threads where processlist_id = 3\G
*************************** 1. row ***************************
          THREAD_ID: 13
               NAME: thread/sql/one_connection
               TYPE: FOREGROUND
     PROCESSLIST_ID: 3
   PROCESSLIST_USER: openxs
   PROCESSLIST_HOST: localhost
     PROCESSLIST_DB: performance_schema
PROCESSLIST_COMMAND: Query
   PROCESSLIST_TIME: 0
  PROCESSLIST_STATE: Sending data
   PROCESSLIST_INFO: select * from threads where processlist_id = 3
   PARENT_THREAD_ID: 1
               ROLE: NULL
       INSTRUMENTED: YES
            HISTORY: YES
    CONNECTION_TYPE: Socket
       THREAD_OS_ID: 23719
1 row in set (0,018 sec)

The output above shows that one has to join to the threads table on threads.thread_id = metadata_locks.owner_thread_id to get the details of the thread that set the metadata lock:

MariaDB [performance_schema]> SELECT OBJECT_TYPE, OBJECT_SCHEMA, OBJECT_NAME, LOCK_TYPE, LOCK_STATUS, THREAD_ID, PROCESSLIST_ID, PROCESSLIST_INFO FROM performance_schema.metadata_locks INNER JOIN performance_schema.threads ON THREAD_ID = OWNER_THREAD_ID WHERE PROCESSLIST_ID <> CONNECTION_ID()\G
*************************** 1. row ***************************
     OBJECT_TYPE: BACKUP
   OBJECT_SCHEMA: NULL
     OBJECT_NAME: NULL
       LOCK_TYPE: BACKUP_DDL
     LOCK_STATUS: PENDING
       THREAD_ID: 50
  PROCESSLIST_ID: 21
PROCESSLIST_INFO: drop database test
1 row in set (0,001 sec)

In the cases above we see the SHARED_READ lock on the table we access, performance_schema.metadata_locks, set by the thread that reads from the table. Then we see how to filter out locks set by the current connection and how to get the current statement (or other SHOW PROCESSLIST details) for the thread associated with the lock.

I noted that the SOURCE column in the metadata_locks table is empty. It seems to be the case for other locks too, like those caused by a running sysbench test:

MariaDB [performance_schema]> select * from metadata_locks limit 5\G
*************************** 1. row ***************************
          OBJECT_TYPE: TABLE
        OBJECT_SCHEMA: sbtest
          OBJECT_NAME: sbtest1
OBJECT_INSTANCE_BEGIN: 140032787904800
            LOCK_TYPE: SHARED_WRITE
        LOCK_DURATION: TRANSACTION
          LOCK_STATUS: GRANTED
               SOURCE:
      OWNER_THREAD_ID: 34
       OWNER_EVENT_ID: 1
*************************** 2. row ***************************
          OBJECT_TYPE: BACKUP
        OBJECT_SCHEMA: NULL
          OBJECT_NAME: NULL
OBJECT_INSTANCE_BEGIN: 140032778203712
            LOCK_TYPE: BACKUP_TRANS_DML
        LOCK_DURATION: STATEMENT
          LOCK_STATUS: GRANTED
               SOURCE:
      OWNER_THREAD_ID: 34
       OWNER_EVENT_ID: 1
*************************** 3. row ***************************
          OBJECT_TYPE: TABLE
        OBJECT_SCHEMA: sbtest
          OBJECT_NAME: sbtest1
OBJECT_INSTANCE_BEGIN: 140033249782768
            LOCK_TYPE: SHARED_WRITE
        LOCK_DURATION: TRANSACTION
          LOCK_STATUS: GRANTED
               SOURCE:
      OWNER_THREAD_ID: 33
       OWNER_EVENT_ID: 1
*************************** 4. row ***************************
          OBJECT_TYPE: BACKUP
        OBJECT_SCHEMA: NULL
          OBJECT_NAME: NULL
OBJECT_INSTANCE_BEGIN: 140033248057584
            LOCK_TYPE: BACKUP_TRANS_DML
        LOCK_DURATION: STATEMENT
          LOCK_STATUS: GRANTED
               SOURCE:
      OWNER_THREAD_ID: 33
       OWNER_EVENT_ID: 1
*************************** 5. row ***************************
          OBJECT_TYPE: TABLE
        OBJECT_SCHEMA: sbtest
          OBJECT_NAME: sbtest1
OBJECT_INSTANCE_BEGIN: 140033449398688
            LOCK_TYPE: SHARED_WRITE
        LOCK_DURATION: TRANSACTION
          LOCK_STATUS: GRANTED
               SOURCE:
      OWNER_THREAD_ID: 32
       OWNER_EVENT_ID: 1
5 rows in set (0,001 sec)

MariaDB [performance_schema]>

I do not think this is normal, as in MySQL 8.0.21, for example, I still see a reference to the line of source code where the lock is set:

openxs@ao756:~/dbs/8.0$ bin/mysql --socket=/tmp/mysql8.sock -uroot performance_schema
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 8
Server version: 8.0.21 Source distribution

Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> flush tables with read lock;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from metadata_locks\G
*************************** 1. row ***************************
          OBJECT_TYPE: GLOBAL
        OBJECT_SCHEMA: NULL
          OBJECT_NAME: NULL
          COLUMN_NAME: NULL
OBJECT_INSTANCE_BEGIN: 139972651847648
            LOCK_TYPE: SHARED
        LOCK_DURATION: EXPLICIT
          LOCK_STATUS: GRANTED
               SOURCE: lock.cc:1033
      OWNER_THREAD_ID: 49
       OWNER_EVENT_ID: 112
*************************** 2. row ***************************
          OBJECT_TYPE: COMMIT
        OBJECT_SCHEMA: NULL
          OBJECT_NAME: NULL
          COLUMN_NAME: NULL
OBJECT_INSTANCE_BEGIN: 139972651711440
            LOCK_TYPE: SHARED
        LOCK_DURATION: EXPLICIT
          LOCK_STATUS: GRANTED
               SOURCE: lock.cc:1108
      OWNER_THREAD_ID: 49
       OWNER_EVENT_ID: 112
...

I reported this immediately as a bug, MDEV-23827 - "performance_schema.metadata_locks.source column is always empty".

The details in this table that are not covered by the MySQL manual require some investigation. You probably noted the BACKUP_TRANS_DML lock type above. Here is another example (one that you may get while mariabackup, a.k.a. mariadb-backup in 10.5, is running):

MariaDB [performance_schema]> select * from metadata_locks\G
*************************** 1. row ***************************
          OBJECT_TYPE: BACKUP
        OBJECT_SCHEMA: NULL
          OBJECT_NAME: NULL
OBJECT_INSTANCE_BEGIN: 140032980286816
            LOCK_TYPE: BACKUP_START
        LOCK_DURATION: EXPLICIT
          LOCK_STATUS: GRANTED
               SOURCE:
      OWNER_THREAD_ID: 44
       OWNER_EVENT_ID: 1
...

I've investigated similar details in the past for MySQL and Percona Server, by code review and some gdb tests. I'd expect to see these values documented somewhere in the mdl.h file in the source code. Here is what we have:

openxs@ao756:~/git/server$ grep -rn BACKUP_TRANS_DML *
mysql-test/main/mdl.result:13:MDL_BACKUP_TRANS_DML      Backup lock
mysql-test/main/mdl.result:19:MDL_BACKUP_TRANS_DML      Backup lock
sql/mdl.h:295:#define MDL_BACKUP_TRANS_DML enum_mdl_type(8)
sql/mdl.cc:134:  { C_STRING_WITH_LEN("MDL_BACKUP_TRANS_DML") },
sql/mdl.cc:490:               MDL_BIT(MDL_BACKUP_TRANS_DML)));
sql/mdl.cc:1633:  MDL_BIT(MDL_BACKUP_DML) | MDL_BIT(MDL_BACKUP_TRANS_DML) | MDL_BIT(MDL_BACKUP_SYS_DML) | MDL_BIT(MDL_BACKUP_DDL) | MDL_BIT(MDL_BACKUP_ALTER_COPY),
sql/mdl.cc:1634:  MDL_BIT(MDL_BACKUP_DML) | MDL_BIT(MDL_BACKUP_TRANS_DML) | MDL_BIT(MDL_BACKUP_SYS_DML) | MDL_BIT(MDL_BACKUP_DDL) | MDL_BIT(MDL_BACKUP_ALTER_COPY) | MDL_BIT(MDL_BACKUP_COMMIT),
sql/sql_base.cc:2048:      mdl_type= MDL_BACKUP_TRANS_DML;
storage/perfschema/table_helper.cc:636:    case MDL_BACKUP_TRANS_DML:
storage/perfschema/table_helper.cc:637:      PFS_engine_table::set_field_varchar_utf8(f, STRING_WITH_LEN("BACKUP_TRANS_DML"));
openxs@ao756:~/git/server$

This is what we have in sql/mdl.h in MariaDB 10.5:

    258 /** Backup locks */
    259
    260 /**
    261   Block concurrent backup
    262 */
    263 #define MDL_BACKUP_START enum_mdl_type(0)
    264 /**
    265    Block new write requests to non transactional tables
    266 */
    267 #define MDL_BACKUP_FLUSH enum_mdl_type(1)
    268 /**
    269    In addition to previous locks, blocks running requests to non trans tables
    270    Used to wait until all DML usage of on trans tables are finished
    271 */
    272 #define MDL_BACKUP_WAIT_FLUSH enum_mdl_type(2)
    273 /**
    274    In addition to previous locks, blocks new DDL's from starting
    275 */
    276 #define MDL_BACKUP_WAIT_DDL enum_mdl_type(3)
    277 /**
    278    In addition to previous locks, blocks commits
    279 */
    280 #define MDL_BACKUP_WAIT_COMMIT enum_mdl_type(4)
    281
    282 /**
    283   Blocks (or is blocked by) statements that intend to modify data. Acquired
    284   before commit lock by FLUSH TABLES WITH READ LOCK.
    285 */
    286 #define MDL_BACKUP_FTWRL1 enum_mdl_type(5)
    287
    288 /**
    289   Blocks (or is blocked by) commits. Acquired after global read lock by
    290   FLUSH TABLES WITH READ LOCK.
    291 */
    292 #define MDL_BACKUP_FTWRL2 enum_mdl_type(6)
    293
    294 #define MDL_BACKUP_DML enum_mdl_type(7)
    295 #define MDL_BACKUP_TRANS_DML enum_mdl_type(8)
    296 #define MDL_BACKUP_SYS_DML enum_mdl_type(9)
...

I see creative re-use of the values from enum_mdl_type.
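
To quickly see which of these backup lock types show up on a running server, a simple aggregate over the new table is enough (just a query sketch, output not shown here):

SELECT LOCK_TYPE, LOCK_DURATION, LOCK_STATUS, COUNT(*)
  FROM performance_schema.metadata_locks
 WHERE OBJECT_TYPE = 'BACKUP'
 GROUP BY LOCK_TYPE, LOCK_DURATION, LOCK_STATUS;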

Last but not least, this is what the global read lock and a metadata lock wait look like:

MariaDB [performance_schema]> flush tables with read lock;
Query OK, 0 rows affected (0,090 sec)

MariaDB [performance_schema]> select * from metadata_locks\G
*************************** 1. row ***************************
          OBJECT_TYPE: BACKUP
        OBJECT_SCHEMA: NULL
          OBJECT_NAME: NULL
OBJECT_INSTANCE_BEGIN: 140033115590992
            LOCK_TYPE: BACKUP_FTWRL1
        LOCK_DURATION: EXPLICIT
          LOCK_STATUS: GRANTED
               SOURCE:
      OWNER_THREAD_ID: 56
       OWNER_EVENT_ID: 1
...
MariaDB [performance_schema]> select * from metadata_locks\G
*************************** 1. row ***************************
          OBJECT_TYPE: BACKUP
        OBJECT_SCHEMA: NULL
          OBJECT_NAME: NULL
OBJECT_INSTANCE_BEGIN: 140033115590992
            LOCK_TYPE: BACKUP_FTWRL1
        LOCK_DURATION: EXPLICIT
          LOCK_STATUS: GRANTED
               SOURCE:
      OWNER_THREAD_ID: 56
       OWNER_EVENT_ID: 1
*************************** 2. row ***************************
          OBJECT_TYPE: BACKUP
        OBJECT_SCHEMA: NULL
          OBJECT_NAME: NULL
OBJECT_INSTANCE_BEGIN: 140032979653216
            LOCK_TYPE: BACKUP_DDL
        LOCK_DURATION: STATEMENT
          LOCK_STATUS: PENDING
               SOURCE:
      OWNER_THREAD_ID: 50
       OWNER_EVENT_ID: 1
*************************** 3. row ***************************
          OBJECT_TYPE: TABLE
        OBJECT_SCHEMA: performance_schema
          OBJECT_NAME: metadata_locks
OBJECT_INSTANCE_BEGIN: 140033113928032
            LOCK_TYPE: SHARED_READ
        LOCK_DURATION: TRANSACTION
          LOCK_STATUS: GRANTED
               SOURCE:
      OWNER_THREAD_ID: 56
       OWNER_EVENT_ID: 1
3 rows in set (0,000 sec)

MariaDB [performance_schema]> select * from information_schema.metadata_lock_info\G
*************************** 1. row ***************************
    THREAD_ID: 23
    LOCK_MODE: MDL_BACKUP_FTWRL2
LOCK_DURATION: NULL
    LOCK_TYPE: Backup lock
 TABLE_SCHEMA:
   TABLE_NAME:
1 row in set (0,001 sec)

The last statement above shows that the old way to check metadata locks, via the metadata_lock_info plugin, provided much less detail. We could not even distinguish a granted lock from a lock wait (pending lock) with it!

Are there any locks on that bridge? Go figure...

Finally, even with the metadata_locks table in place, finding the details of the exact blocking lock may not be trivial. See this blog post for some ideas on how to do this, based on the assumption that the blocking lock is surely set earlier than the pending one.
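
Alternatively, lacking a ready-made view for this in MariaDB, a self-join of the table with itself could be a starting point. This is just a sketch: it pairs every PENDING lock with the GRANTED locks on the same object held by other threads, without checking lock type compatibility, so it may list more potential blockers than the real one:

SELECT waiting.OBJECT_TYPE, waiting.OBJECT_SCHEMA, waiting.OBJECT_NAME,
       waiting.LOCK_TYPE AS waiting_lock_type,
       waiting.OWNER_THREAD_ID AS waiting_thread_id,
       holding.LOCK_TYPE AS granted_lock_type,
       holding.OWNER_THREAD_ID AS blocking_thread_id
  FROM performance_schema.metadata_locks waiting
  JOIN performance_schema.metadata_locks holding
    ON holding.OBJECT_TYPE = waiting.OBJECT_TYPE
   AND holding.OBJECT_SCHEMA <=> waiting.OBJECT_SCHEMA
   AND holding.OBJECT_NAME <=> waiting.OBJECT_NAME
 WHERE waiting.LOCK_STATUS = 'PENDING'
   AND holding.LOCK_STATUS = 'GRANTED'
   AND holding.OWNER_THREAD_ID <> waiting.OWNER_THREAD_ID;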

Sunday, September 20, 2020

Dynamic Tracing of pthread_mutex_lock in MariaDB - First Steps

I've got several comments and questions during my session on dynamic tracing at MariaDB Server Fest 2020. One of them came from a MariaDB developer on Slack and sounded as follows:

"Can perf profile contention on pthread_mutex_lock?"

I surely stated that it can (it can profile anything), but tried to clarify what exactly is needed whenever we hit the probe. The replies were the following:

"ideally, collect stack traces of mariadbd, and sort them in descending order by time spent between pthread_mutex_lock and next pthread_mutex_unlock."

and:

"less ideally, just collect stacktraces and sort in descending order by frequency"

I am not sure about the former (I have yet to try tracing more than one probe at a time with perf and to find out how to aggregate the result properly). It is surely doable (in a more efficient way) with bcc tools if one writes a proper program, and probably even with bpftrace. I am going to show how to do this in future posts one day... But the question was about perf.
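
Just to show the direction for the former, here is a rough bpftrace sketch that I have not tested (bpftrace needs a newer kernel than my netbook has). It sums, per unique user stack, the time from each pthread_mutex_lock call to the next pthread_mutex_unlock call in the same thread, which mixes wait time with hold time, so take it only as a starting point:

sudo bpftrace -e '
uprobe:/lib/x86_64-linux-gnu/libpthread.so.0:pthread_mutex_lock
/pid == $1/ { @start[tid] = nsecs; }
uprobe:/lib/x86_64-linux-gnu/libpthread.so.0:pthread_mutex_unlock
/pid == $1 && @start[tid]/ {
    @total_ns[ustack] = sum(nsecs - @start[tid]);
    delete(@start[tid]);
}' $(pidof mariadbd)

On exit bpftrace prints the @total_ns map, which then still needs to be sorted by the totals.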

Fortunately, I can easily show how to do the latter. For a primitive test I started MariaDB 10.5, built from GitHub source, on my Ubuntu 16.04 netbook with the following options:

openxs@ao756:~/dbs/maria10.5$ bin/mysqld_safe --no-defaults --port=3311 --socket=/tmp/mariadb105.sock --performance_schema=1 --performance-schema-instrument='memory/%=ON' --thread-handling='pool-of-threads' &

 and used the following sysbench test as a load that I expected to cause some mutex waits:

openxs@ao756:~/dbs/maria10.5$ sysbench --table-size=1000000 --threads=34 --time=300 --report-interval=5 --mysql-socket=/tmp/mariadb105.sock --mysql-user=openxs --mysql-db=sbtest /usr/share/sysbench/oltp_read_only.lua run

The first step is to find out which exact pthread library is used:

openxs@ao756:~/dbs/maria10.5$ ldd bin/mariadbd
        linux-vdso.so.1 =>  (0x00007ffc03437000)
        libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007f585d0c6000)
        liblz4.so.1 => /usr/lib/x86_64-linux-gnu/liblz4.so.1 (0x00007f585ceae000)
        liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x00007f585cc8c000)
        libbz2.so.1.0 => /lib/x86_64-linux-gnu/libbz2.so.1.0 (0x00007f585ca7c000)
        libsnappy.so.1 => /usr/lib/x86_64-linux-gnu/libsnappy.so.1 (0x00007f585c874000)
        libaio.so.1 => /lib/x86_64-linux-gnu/libaio.so.1 (0x00007f585c672000)
        libnuma.so.1 => /usr/lib/x86_64-linux-gnu/libnuma.so.1 (0x00007f585c467000)
        libssl.so.1.0.0 => /lib/x86_64-linux-gnu/libssl.so.1.0.0 (0x00007f585c1ff000)
        libcrypto.so.1.0.0 => /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 (0x00007f585bdba000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f585bb9d000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f585b999000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f585b617000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f585b30e000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f585b0f8000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f585ad2e000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f585f4e6000)

Adding the probe after that is trivial:

openxs@ao756:~/dbs/maria10.5$ sudo perf probe -x  /lib/x86_64-linux-gnu/libpthread.so.0 pthread_mutex_lock
Added new event:
  probe_libpthread:pthread_mutex_lock (on pthread_mutex_lock in /lib/x86_64-linux-gnu/libpthread-2.23.so)

You can now use it in all perf tools, such as:

        perf record -e probe_libpthread:pthread_mutex_lock -aR sleep 1

I had not yet checked what the arguments are, nor tried to record the mutex address or anything besides the fact of the call to this function.
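For the record, perf probe can also fetch probe arguments from registers, and on x86_64 the first argument (the pthread_mutex_t pointer) is passed in the register that perf calls %di. So presumably one could delete the probe and re-add it with the mutex address recorded, something like this (a sketch I have not verified on this system):

sudo perf probe -d probe_libpthread:pthread_mutex_lock
sudo perf probe -x /lib/x86_64-linux-gnu/libpthread.so.0 'pthread_mutex_lock mutex=%di'

For this post the simple probe added above is enough. Then I started sysbench, and while it worked: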

openxs@ao756:~/dbs/maria10.5$ sysbench --table-size=1000000 --threads=34 --time=300 --report-interval=5 --mysql-socket=/tmp/mariadb105.sock --mysql-user=openxs --mysql-db=sbtest /usr/share/sysbench/oltp_read_only.lua run
sysbench 1.1.0-faaff4f (using bundled LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 34
Report intermediate results every 5 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 5s ] thds: 34 tps: 564.25 qps: 9079.14 (r/w/o: 7944.45/0.00/1134.69) lat (ms,95%): 99.33 err/s: 0.00 reconn/s: 0.00
[ 10s ] thds: 34 tps: 570.68 qps: 9136.92 (r/w/o: 7995.56/0.00/1141.37) lat (ms,95%): 104.84 err/s: 0.00 reconn/s: 0.00
[ 15s ] thds: 34 tps: 585.57 qps: 9345.87 (r/w/o: 8175.33/0.00/1170.53) lat (ms,95%): 80.03 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 34 tps: 588.03 qps: 9409.23 (r/w/o: 8232.98/0.00/1176.25) lat (ms,95%): 97.55 err/s: 0.00 reconn/s: 0.00
[ 25s ] thds: 34 tps: 587.39 qps: 9399.31 (r/w/o: 8224.92/0.00/1174.39) lat (ms,95%): 97.55 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 34 tps: 584.80 qps: 9360.32 (r/w/o: 8190.33/0.00/1169.99) lat (ms,95%): 77.19 err/s: 0.00 reconn/s: 0.00
[ 35s ] thds: 34 tps: 573.02 qps: 9173.87 (r/w/o: 8028.24/0.00/1145.63) lat (ms,95%): 84.47 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 34 tps: 572.60 qps: 9160.00 (r/w/o: 8014.00/0.00/1146.00) lat (ms,95%): 84.47 err/s: 0.00 reconn/s: 0.00
[ 45s ] thds: 34 tps: 529.54 qps: 8468.10 (r/w/o: 7409.02/0.00/1059.09) lat (ms,95%): 101.13 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 34 tps: 393.04 qps: 6276.27 (r/w/o: 5491.38/0.00/784.88) lat (ms,95%): 121.08 err/s: 0.00 reconn/s: 0.00
[ 55s ] thds: 34 tps: 382.60 qps: 6134.80 (r/w/o: 5369.20/0.00/765.60) lat (ms,95%): 110.66 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 34 tps: 386.36 qps: 6210.71 (r/w/o: 5436.79/0.00/773.91) lat (ms,95%): 134.90 err/s: 0.00 reconn/s: 0.00
[ 65s ] thds: 34 tps: 397.01 qps: 6320.60 (r/w/o: 5528.37/0.00/792.23) lat (ms,95%): 153.02 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 34 tps: 419.01 qps: 6699.60 (r/w/o: 5861.98/0.00/837.63) lat (ms,95%): 248.83 err/s: 0.00 reconn/s: 0.00
...

I've tried to run perf record for the mariadbd process as follows, with the -g option to collect stack traces:

openxs@ao756:~/dbs/maria10.5$ sudo perf record -e probe_libpthread:pthread_mutex_lock -g -p`pidof mariadbd` sleep 30
^C[ perf record: Woken up 735 times to write data ]
[ perf record: Captured and wrote 183.516 MB perf.data (1150509 samples) ]

I had not set the frequency, and with the default settings we see a notable drop of QPS from sysbench and a huge amount of data collected in less than 30 seconds (highlighted). So we can not speak about a really small impact with such an approach. In a production case I'd have to play with the sampling frequency and the time to run for sure.
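
For example (just a sketch, I have not measured its effect here), perf record can be told to keep only every Nth occurrence of the event with the -c option, trading some precision for lower overhead and less data:

sudo perf record -e probe_libpthread:pthread_mutex_lock -c 100 -g -p`pidof mariadbd` sleep 30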

Raw results can be checked with perf script:

openxs@ao756:~/dbs/maria10.5$ sudo perf script | more
mariadbd  9621 [000] 30901.858515: probe_libpthread:pthread_mutex_lock: (7f28dd88fd40)
                    9d40 pthread_mutex_lock (/lib/x86_64-linux-gnu/libpthread-2.23.so)
                  6a9233 close_thread_tables (/home/openxs/dbs/maria10.5/bin/mariadbd)
                  70b87c mysql_execute_command (/home/openxs/dbs/maria10.5/bin/mariadbd)
                  72a0ae Prepared_statement::execute (/home/openxs/dbs/maria10.5/bin/mariadbd)
                  72a546 Prepared_statement::execute_loop (/home/openxs/dbs/maria10.5/bin/mariadbd)
                  72af58 mysql_stmt_execute_common (/home/openxs/dbs/maria10.5/bin/mariadbd)
                  72b0b5 mysqld_stmt_execute (/home/openxs/dbs/maria10.5/bin/mariadbd)
                  7090d8 dispatch_command (/home/openxs/dbs/maria10.5/bin/mariadbd)
                  70777c do_command (/home/openxs/dbs/maria10.5/bin/mariadbd)
                  887c40 tp_callback (/home/openxs/dbs/maria10.5/bin/mariadbd)
                  a76650 worker_main (/home/openxs/dbs/maria10.5/bin/mariadbd)
                  b31911 pfs_spawn_thread (/home/openxs/dbs/maria10.5/bin/mariadbd)
                    76ba start_thread (/lib/x86_64-linux-gnu/libpthread-2.23.so)
--More--

Now we can process the results collected. The easiest way is to use perf report:

openxs@ao756:~/dbs/maria10.5$ sudo perf report -g >/tmp/perf_mutex.txt

In the resulting file we can see the following:

openxs@ao756:~/dbs/maria10.5$ cat /tmp/perf_mutex.txt | head -100
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 1M of event 'probe_libpthread:pthread_mutex_lock'
# Event count (approx.): 1150509
#
# Children      Self  Command   Shared Object        Symbol                     
# ........  ........  ........  ...................  ......................................................
#
   100.00%   100.00%  mariadbd  libpthread-2.23.so   [.] pthread_mutex_lock     
            |
            |--99.98%-- start_thread
            |          |
            |          |--99.97%-- pfs_spawn_thread
            |          |          |
            |          |          |--99.95%-- worker_main
            |          |          |          |
            |          |          |          |--86.17%-- tp_callback
            |          |          |          |          do_command
            |          |          |          |          dispatch_command
            |          |          |          |          |
            |          |          |          |          |--46.56%-- mysqld_stmt_execute
            |          |          |          |          |          mysql_stmt_execute_common
            |          |          |          |          |          Prepared_statement::execute_loop
            |          |          |          |          |          Prepared_statement::execute
            |          |          |          |          |          |
            |          |          |          |          |          |--27.70%-- mysql_execute_command
            |          |          |          |          |          |          |
            |          |          |          |          |          |          |--17.60%-- close_thread_tables
            |          |          |          |          |          |          |          |
            |          |          |          |          |          |          |          |--8.80%-- close_thread_table
            |          |          |          |          |          |          |          |          pthread_mutex_lock
            |          |          |          |          |          |          |          |
            |          |          |          |          |          |          |           --8.80%-- pthread_mutex_lock
            |          |          |          |          |          |          |
            |          |          |          |          |          |          |--9.47%-- execute_sqlcom_select
            |          |          |          |          |          |          |          |
            |          |          |          |          |          |          |          |--9.46%-- open_and_lock_tables
            |          |          |          |          |          |          |          |          open_tables
            |          |          |          |          |          |          |          |          open_table
            |          |          |          |          |          |          |          |          |
            |          |          |          |          |          |          |          |          |--8.80%-- pthread_mutex_lock
            |          |          |          |          |          |          |          |          |
            |          |          |          |          |          |          |          |          |--0.63%-- open_table_get_mdl_lock
            |          |          |          |          |          |          |          |          |          MDL_context::acquire_lock
            |          |          |          |          |          |          |          |          |          MDL_context::try_acquire_lock_impl
            |          |          |          |          |          |          |          |          |          MDL_map::find_or_insert
            |          |          |          |          |          |          |          |          |          pthread_mutex_lock
            |          |          |          |          |          |          |          |          |
            |          |          |          |          |          |          |          |           --0.03%-- tdc_acquire_share
            |          |          |          |          |          |          |          |                     pthread_mutex_lock
            |          |          |          |          |          |          |          |
            |          |          |          |          |          |          |           --0.01%-- handle_select
            |          |          |          |          |          |          |                     mysql_select
            |          |          |          |          |          |          |                     JOIN::optimize
            |          |          |          |          |          |          |                     JOIN::optimize_inner
            |          |          |          |          |          |          |                     |
            |          |          |          |          |          |          |                     |--0.01%-- join_read_const_table
            |          |          |          |          |          |          |                     |          join_read_const
            |          |          |          |          |          |          |                     |          handler::ha_index_read_idx_map
            |          |          |          |          |          |          |                     |          handler::index_read_idx_map
            |          |          |          |          |          |          |                     |          ha_innobase::index_read
            |          |          |          |          |          |          |                     |          row_search_mvcc
            |          |          |          |          |          |          |                     |          |
            |          |          |          |          |          |          |                     |          |--0.01%-- btr_cur_search_to_nth_level_func
            |          |          |          |          |          |          |                     |          |          buf_page_get_gen
            |          |          |          |          |          |          |                     |          |          buf_page_get_low
            |          |          |          |          |          |          |                     |          |          buf_read_page
            |          |          |          |          |          |          |                     |          |          buf_read_page_low
            |          |          |          |          |          |          |                     |          |          pthread_mutex_lock
            |          |          |          |          |          |          |                     |          |
            |          |          |          |          |          |          |                     |           --0.00%-- btr_pcur_move_to_next_page
            |          |          |          |          |          |          |                     |                     buf_page_get_gen
            |          |          |          |          |          |          |                     |                     buf_page_get_low
            |          |          |          |          |          |          |                     |                     buf_read_page
            |          |          |          |          |          |          |                     |                     buf_read_page_low
            |          |          |          |          |          |          |                     |                     pthread_mutex_lock
            |          |          |          |          |          |          |                     |
            |          |          |          |          |          |          |                      --0.01%-- SQL_SELECT::test_quick_select
            |          |          |          |          |          |          |                                get_key_scans_params
            |          |          |          |          |          |          |                                DsMrr_impl::dsmrr_info_const
            |          |          |          |          |          |          |                                handler::multi_range_read_info_const
            |          |          |          |          |          |          |                                ha_innobase::records_in_range
            |          |          |          |          |          |          |                                btr_estimate_n_rows_in_range_low
            |          |          |          |          |          |          |                                |
            |          |          |          |          |          |          |                                |--0.01%-- btr_cur_search_to_nth_level_func
            |          |          |          |          |          |          |                                |          buf_page_get_gen
            |          |          |          |          |          |          |                                |          buf_page_get_low
            |          |          |          |          |          |          |                                |          buf_read_page
            |          |          |          |          |          |          |                                |          buf_read_page_low
            |          |          |          |          |          |          |                                |          pthread_mutex_lock
            |          |          |          |          |          |          |                                |
            |          |          |          |          |          |          |                                 --0.00%-- buf_page_get_gen
            |          |          |          |          |          |          |                                           buf_page_get_low
            |          |          |          |          |          |          |                                           buf_read_page
            |          |          |          |          |          |          |                                           buf_read_page_low
            |          |          |          |          |          |          |                                           pthread_mutex_lock
            |          |          |          |          |          |          |
...

But the result is so large that you can not clearly see the full picture even with the smallest font set. One needs another way to visualize the data, for example, as a flame graph:

openxs@ao756:~/dbs/maria10.5$ sudo perf script | ~/git/FlameGraph/stackcollapse-perf.pl > /tmp/perf-folded.txt
openxs@ao756:~/dbs/maria10.5$ ~/git/FlameGraph/flamegraph.pl /tmp/perf-folded.txt >/tmp/mutex.svg

The resulting flame graph looks as follows:

Flame Graph for the pthread_mutex_lock calls in MariaDB server.

It makes it easier to see where most of the waits originate from.

We can also check the /tmp/perf-folded.txt file created in the process:

openxs@ao756:~/dbs/maria10.5$ head -10 /tmp/perf-folded.txt                    
mariadbd;[libstdc++.so.6.0.21];tpool::thread_pool_generic::worker_main;pthread_mutex_lock 82
mariadbd;[libstdc++.so.6.0.21];tpool::thread_pool_generic::worker_main;tpool::task::execute;pthread_mutex_lock 41
mariadbd;[libstdc++.so.6.0.21];tpool::thread_pool_generic::worker_main;tpool::task::execute;tpool::thread_pool_generic::timer_generic::execute;pthread_mutex_lock 82
mariadbd;[libstdc++.so.6.0.21];tpool::thread_pool_generic::worker_main;tpool::task::execute;tpool::thread_pool_generic::timer_generic::execute;srv_error_monitor_task;pthread_mutex_lock 19
mariadbd;[libstdc++.so.6.0.21];tpool::thread_pool_generic::worker_main;tpool::task::execute;tpool::thread_pool_generic::timer_generic::execute;srv_master_callback;pthread_mutex_lock 18
mariadbd;start_thread;buf_flush_page_cleaner;buf_flush_lists;buf_flush_do_batch;pthread_mutex_lock 46
mariadbd;start_thread;buf_flush_page_cleaner;pthread_mutex_lock 64
mariadbd;start_thread;pfs_spawn_thread;ma_checkpoint_background;pthread_mutex_lock 1
mariadbd;start_thread;pfs_spawn_thread;ma_checkpoint_background;translog_get_horizon;pthread_mutex_lock 1
mariadbd;start_thread;pfs_spawn_thread;pthread_mutex_lock 112
openxs@ao756:~/dbs/maria10.5$ cat /tmp/perf-folded.txt | sort -r -n -k 1,1 | more
mariadbd;start_thread;pfs_spawn_thread;worker_main;tp_callback;do_command;dispatch_command;THD::reset_killed;pthread_mutex_lock 231467
mariadbd;start_thread;pfs_spawn_thread;worker_main;tp_callback;do_command;dispatch_command;pthread_mutex_lock 115734
mariadbd;start_thread;pfs_spawn_thread;worker_main;tp_callback;do_command;dispatch_command;mysqld_stmt_execute;mysql_stmt_execute_common;Prepared_statement::execute_loop;Prepared_statement::execute;THD::set_statement;pthread_mutex_lock 216999
mariadbd;start_thread;pfs_spawn_thread;worker_main;tp_callback;do_command;dispatch_command;mysqld_stmt_execute;mysql_stmt_execute_common;Prepared_statement::execute_loop;Prepared_statement::execute;mysql_execute_command;MDL_context::release_locks_stored_before;MDL_context::release_lock;MDL_lock::remove_ticket;pthread_mutex_lock 7232
mariadbd;start_thread;pfs_spawn_thread;worker_main;tp_callback;do_command;dispatch_command;mysqld_stmt_execute;mysql_stmt_execute_common;Prepared_statement::execute_loop;Prepared_statement::execute;mysql_execute_command;execute_sqlcom_select;open_and_lock_tables;open_tables;open_table;tdc_acquire_share;pthread_mutex_lock 301
mariadbd;start_thread;pfs_spawn_thread;worker_main;tp_callback;do_command;dispatch_command;mysqld_stmt_execute;mysql_stmt_execute_common;Prepared_statement::execute_loop;Prepared_statement::execute;mysql_execute_command;execute_sqlcom_select;open_and_lock_tables;open_tables;open_table;pthread_mutex_lock 101268

mariadbd;start_thread;pfs_spawn_thread;worker_main;tp_callback;do_command;dispat

to find out that it contains unique stack traces as the first field and the number of occurrences of each stack trace as the second. So, to answer the real question from the developer, I have to sort the result properly, by the second field:

openxs@ao756:~/dbs/maria10.5$ cat /tmp/perf-folded.txt | sort -r -n -k 2 | head -5
mariadbd;start_thread;pfs_spawn_thread;worker_main;tp_callback;do_command;dispatch_command;THD::reset_killed;pthread_mutex_lock 231467
mariadbd;start_thread;pfs_spawn_thread;worker_main;tp_callback;do_command;dispatch_command;mysqld_stmt_execute;mysql_stmt_execute_common;Prepared_statement::execute_loop;Prepared_statement::execute;THD::set_statement;pthread_mutex_lock 216999
mariadbd;start_thread;pfs_spawn_thread;worker_main;pthread_mutex_lock 158530
mariadbd;start_thread;pfs_spawn_thread;worker_main;tp_callback;do_command;dispatch_command;pthread_mutex_lock 115734
mariadbd;start_thread;pfs_spawn_thread;worker_main;tp_callback;do_command;dispatch_command;delete_explain_query;pthread_mutex_lock 101269
openxs@ao756:~/dbs/maria10.5$
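
As a bonus, since each line of the folded file is a unique stack, one can also build a narrower flame graph for just the stacks passing through some function of interest by filtering this file first, for example (a hypothetical filter on MDL-related frames):

grep 'MDL_context' /tmp/perf-folded.txt | ~/git/FlameGraph/flamegraph.pl > /tmp/mdl_mutex.svg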

To summarize, perf (augmented with some post-processing tools) does allow us to collect stack traces leading to some probe and sort them in descending order by frequency.

Finally, this is my presentation that inspired the question:


You can watch the recording of the session if you missed it.

Sunday, September 13, 2020

BCC Tools for disk I/O Analysis and More

A lot of time has passed since I wrote about bcc tools in my dynamic tracing series. That post was mostly about building bcc tools from GitHub source on Ubuntu 16.04 and using the trace utility to add a dynamic probe to MariaDB server. Other than that, I had presented tplist and mysql_qslower with a couple of slides at several events, but that's all. There are many more tools useful for DBAs in the bundle, and today I am going to describe several of them, mostly related to I/O monitoring. It is yet another blog post in support of my upcoming MariaDB Server Fest 2020 session on dynamic tracing.

I was inspired by another cool presentation, "DB AMA: BCC - Tools for BPF-based Linux IO analysis", made by my dear friend and former colleague at Percona, Marcos Albe:

It has far fewer likes and views than it honestly deserves.

For this post I picked only some of the bcc tools discussed there. A separate post about using bpftrace for similar analysis is in progress, but I can not complete it today, mostly because bpftrace is not supported on this old Ubuntu I have at hand:

openxs@ao756:~/dbs/maria10.5$ uname -a
Linux ao756 4.4.0-187-generic #217-Ubuntu SMP Tue Jul 21 04:18:15 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

So I can not check and copy/paste the details quickly.

I started with rebuilding bcc tools from current source:

openxs@ao756:~/git/bcc$ git log -1
commit ba73657cb8c4dab83dfb89eed4a8b3866255569a
Author: Hao <1075808668@qq.com>
Date:   Sat Sep 12 02:05:29 2020 +0800

    Netqtop 3037 (#3048)

    The tool netqtop uses tracepoints NET_DEV_START_XMIT and NETIF_RECEIVE_SKB to intercept every transmitted and received packet, as a result, considerable performance descent is expected.
    Details for some performance evaluation can be found at
       https://github.com/iovisor/bcc/pull/3048
openxs@ao756:~/git/bcc$

Nothing fancy there, just many tests are still failing:

openxs@ao756:~/git/bcc/build$ make test
...
68% tests passed, 14 tests failed out of 44

Total Test time (real) = 340.86 sec

The following tests FAILED:
          4 - py_test_stat1_b (Failed)
          5 - py_test_bpf_log (Failed)
          6 - py_test_stat1_c (Failed)
          7 - py_test_xlate1_c (Failed)
          8 - py_test_call1 (Failed)
         11 - py_test_trace3_c (Failed)
         15 - py_test_debuginfo (Failed)
         16 - py_test_brb (Failed)
         17 - py_test_brb2 (Failed)
         18 - py_test_clang (Failed)
         23 - py_test_stackid (Failed)
         30 - py_test_tools_smoke (Failed)
         38 - py_test_lpm_trie (Failed)
         44 - lua_test_standalone (Failed)
Errors while running CTest
Makefile:105: recipe for target 'test' failed
make: *** [test] Error 8

But I still had a hope that most of the tools work (and that's really the case). So I executed sudo make install and ended up with the following:

openxs@ao756:~/git/bcc/build$ ls -F /usr/share/bcc/tools/
argdist*       dirtop*          lib/             pythoncalls*  tclobjnew*
bashreadline*  doc/             llcstat*         pythonflow*   tclstat*
bindsnoop*     drsnoop*         mdflush*         pythongc*     tcpaccept*
biolatency*    execsnoop*       memleak*         pythonstat*   tcpconnect*
biolatpcts*    exitsnoop*       mountsnoop*      reset-trace*  tcpconnlat*
biosnoop*      ext4dist*        mysqld_qslower*  rubycalls*    tcpdrop*
biotop*        ext4slower*      netqtop*         rubyflow*     tcplife*
bitesize*      filelife*        netqtop.c        rubygc*       tcpretrans*
bpflist*       fileslower*      nfsdist*         rubyobjnew*   tcprtt*
btrfsdist*     filetop*         nfsslower*       rubystat*     tcpstates*
btrfsslower*   funccount*       nodegc*          runqlat*      tcpsubnet*
cachestat*     funcinterval*    nodestat*        runqlen*      tcptop*
cachetop*      funclatency*     offcputime*      runqslower*   tcptracer*
capable*       funcslower*      offwaketime*     shmsnoop*     tplist*
cobjnew*       gethostlatency*  old/             slabratetop*  trace*
compactsnoop*  hardirqs*        oomkill*         sofdsnoop*    ttysnoop*
cpudist*       inject*          opensnoop*       softirqs*     vfscount*
cpuunclaimed*  javacalls*       perlcalls*       solisten*     vfsstat*
criticalstat*  javaflow*        perlflow*        sslsniff*     wakeuptime*
dbslower*      javagc*          perlstat*        stackcount*   xfsdist*
dbstat*        javaobjnew*      phpcalls*        statsnoop*    xfsslower*
dcsnoop*       javastat*        phpflow*         syncsnoop*    zfsdist*
dcstat*        javathreads*     phpstat*         syscount*     zfsslower*
deadlock*      killsnoop*       pidpersec*       tclcalls*
deadlock.c     klockstat*       profile*         tclflow*
openxs@ao756:~/git/bcc/build$

The tools I am going to try today are highlighted above. I'll run some sysbench tests like this:

openxs@ao756:~/dbs/maria10.5$ sysbench --table-size=1000000 --threads=4 --time=100 --mysql-socket=/tmp/mariadb105.sock --mysql-user=openxs --mysql-db=sbtest --report-interval=5 /usr/share/sysbench/oltp_update_index.lua run
sysbench 1.1.0-faaff4f (using bundled LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 4
Report intermediate results every 5 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 5s ] thds: 4 tps: 113.89 qps: 113.89 (r/w/o: 0.00/113.89/0.00) lat (ms,95%): 101.13 err/s: 0.00 reconn/s: 0.00
[ 10s ] thds: 4 tps: 113.60 qps: 113.60 (r/w/o: 0.00/113.60/0.00) lat (ms,95%): 110.66 err/s: 0.00 reconn/s: 0.00
[ 15s ] thds: 4 tps: 110.20 qps: 110.20 (r/w/o: 0.00/110.20/0.00) lat (ms,95%): 144.97 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 4 tps: 112.00 qps: 112.00 (r/w/o: 0.00/112.00/0.00) lat (ms,95%): 153.02 err/s: 0.00 reconn/s: 0.00
...

as a test load against a recent MariaDB 10.5.6, also built from source and started with mostly default settings and some Performance Schema instrumentation enabled:

 openxs@ao756:~/dbs/maria10.5$ bin/mysqld_safe --no-defaults --port=3311 --socket=/tmp/mariadb105.sock --performance_schema=1 --performance-schema-instrument='memory/%=ON' &
[1] 19616
openxs@ao756:~/dbs/maria10.5$ 200913 13:16:14 mysqld_safe Logging to '/home/openxs/dbs/maria10.5/data/ao756.err'.
200913 13:16:14 mysqld_safe Starting mariadbd daemon with databases from /home/openxs/dbs/maria10.5/data
...

funccount - count functions, tracepoints and USDT probe hits

This tool is useful when you are looking for what to actually trace, as it shows the number of hits per tracepoint or probe. As usual, you can just call it with the --help option to get a lot of details about the options:

openxs@ao756:~$ sudo /usr/share/bcc/tools/funccount --help
usage: funccount [-h] [-p PID] [-i INTERVAL] [-d DURATION] [-T] [-r] [-D]
                 [-c CPU]
                 pattern

Count functions, tracepoints, and USDT probes

positional arguments:
  pattern               search expression for events

optional arguments:
  -h, --help            show this help message and exit
  -p PID, --pid PID     trace this PID only
  -i INTERVAL, --interval INTERVAL
                        summary interval, seconds
  -d DURATION, --duration DURATION
                        total duration of trace, seconds
  -T, --timestamp       include timestamp on output
  -r, --regexp          use regular expressions. Default is "*" wildcards
                        only.
  -D, --debug           print BPF program before starting (for debugging
                        purposes)
  -c CPU, --cpu CPU     trace this CPU only

examples:
    ./funccount 'vfs_*'             # count kernel fns starting with "vfs"
    ./funccount -r '^vfs.*'         # same as above, using regular expressions
    ./funccount -Ti 5 'vfs_*'       # output every 5 seconds, with timestamps
    ./funccount -d 10 'vfs_*'       # trace for 10 seconds only
    ./funccount -p 185 'vfs_*'      # count vfs calls for PID 181 only
    ./funccount t:sched:sched_fork  # count calls to the sched_fork tracepoint
    ./funccount -p 185 u:node:gc*   # count all GC USDT probes in node, PID 185
    ./funccount c:malloc            # count all malloc() calls in libc
    ./funccount go:os.*             # count all "os.*" calls in libgo
    ./funccount -p 185 go:os.*      # count all "os.*" calls in libgo, PID 185
    ./funccount ./test:read*        # count "read*" calls in the ./test binary
    ./funccount -c 1 'vfs_*'        # count vfs calls on CPU 1 only

Based on the above I tried to trace everything for a couple of seconds while sysbench was running, and failed:

openxs@ao756:~$ sudo /usr/share/bcc/tools/funccount -i 1 -d 2 '*'
maximum of 1000 probes allowed, attempted 53271

Too many probes to trace. The following commands work though:

openxs@ao756:~$ sudo /usr/share/bcc/tools/funccount -T -i 1 -d 1 'ext4*'
Tracing 509 functions for "ext4*"... Hit Ctrl-C to end.

13:29:37
FUNC                                    COUNT
...
ext4_sync_file                             14
ext4_finish_bio                            14
ext4_get_block_write                       14
ext4_direct_IO                             14
...
ext4_statfs                               110
ext4_htree_store_dirent                   110
ext4_da_write_begin                       112
ext4fs_dirhash                            122
ext4_es_lookup_extent                     143
ext4_journal_check_start                  219
Detaching...

openxs@ao756:~$ sudo /usr/share/bcc/tools/funccount -T -i 1 -d 1 '*sync'
Tracing 181 functions for "*sync"... Hit Ctrl-C to end.

13:32:11
FUNC                                    COUNT
sys_fsync                                   1
SyS_fsync                                   3
journal_end_buffer_io_sync                  8
end_bio_bh_io_sync                         22
do_fsync                                  142
kill_fasync                               158
i915_gem_object_sync                      217
SyS_fdatasync                             232
blk_run_queue_async                       291
sys_fdatasync                             324
try_to_del_timer_sync                     408
run_sync                                  483
del_timer_sync                           1056
Detaching...
openxs@ao756:~$ sudo /usr/share/bcc/tools/funccount -T -i 1 -d 1 '*file'
Tracing 151 functions for "*file"... Hit Ctrl-C to end.

13:32:41
FUNC                                    COUNT
security_mmap_file                          1
apparmor_mmap_file                          1
sock_alloc_file                             3
ext4_release_file                           8
aa_audit_file                              29
ext4_sync_file                             96
anon_inode_getfile                        100
__mnt_want_write_file                     100
__mnt_drop_write_file                     101
trace_create_file                         180
alloc_file                                205
locks_remove_file                         232
tracefs_create_file                       605
Detaching...
openxs@ao756:~$

So, it's clear that a notable number of calls to ext4_sync_file happens and we may want to trace it further. You can use the funccount tool to get the top N probes for further detailed study.
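
For example, since ext4_sync_file looks interesting, one possible next step (a sketch using the funclatency tool from the same bundle, shown here without its output) is to check its latency distribution:

sudo /usr/share/bcc/tools/funclatency -m -d 10 ext4_sync_file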

biotop - block device (disk) I/O by process

We have a way to find out what I/O related functions are often used. But what processes actually use them and do a lot of disk I/O? The biotop tool helps to find out:

openxs@ao756:~$ sudo /usr/share/bcc/tools/biotop --help
usage: biotop [-h] [-C] [-r MAXROWS] [interval] [count]

Block device (disk) I/O by process

positional arguments:
  interval              output interval, in seconds
  count                 number of outputs

optional arguments:
  -h, --help            show this help message and exit
  -C, --noclear         don't clear the screen
  -r MAXROWS, --maxrows MAXROWS
                        maximum rows to print, default 20

examples:
    ./biotop            # block device I/O top, 1 second refresh
    ./biotop -C         # don't clear the screen
    ./biotop 5          # 5 second summaries
    ./biotop 5 10       # 5 second summaries, 10 times only

I tried to run it several times as follows while the sysbench test was running:

openxs@ao756:~$ sudo /usr/share/bcc/tools/biotop -C
Tracing... Output every 1 secs. Hit Ctrl-C to end

13:41:05 loadavg: 3.63 4.69 7.97 1/764 20587

PID    COMM             D MAJ MIN DISK       I/O  Kbytes  AVGms
19702  mariadbd         W 8   0   sda          1       0  10.85
253    dmcrypt_write    W 8   0   sda        130       0   1.25
0                       W 8   0   sda         63       0  14.23


...
PID    COMM             D MAJ MIN DISK       I/O  Kbytes  AVGms
20714  dd               R 8   0   sda          2       0  20.67
12517  update-manager   R 8   0   sda          3       0 519.19
253    dmcrypt_write    W 8   0   sda         80       0 151.81
4090   mysqld           R 8   0   sda          1       0 506.16
253    dmcrypt_write    R 8   0   sda          1       0   1.04
5917   node_exporter    R 8   0   sda          4       0 422.73
0                       R 8   0   sda          1       0   5.31
1218   percona-qan-age  R 8   0   sda          4       0  72.50
Detaching...

We can see several top I/O users, including mariadbd and dd, which I later ran concurrently to prove some points. You probably noted that I have PMM running on this netbook. A lot of I/O is related to dmcrypt_write - that's because I use an encrypted disk volume here. I am a bit surprised to see all zeros (0) in the Kbytes column. Maybe there is some problem getting proper counts on this system for the bcc tools.

We see the AVGms column in the output above - that's the average time for the I/O (latency), from the issue to the device to its completion, in milliseconds. The next tool may help to find out what's behind these average values, and how slow the I/O may really be at times.

biolatency - summarize block device I/O latency as a histogram

Averages may hide some rare severe problems, so it makes sense to check the output of biolatency too:

openxs@ao756:~$ sudo /usr/share/bcc/tools/biolatency --help
usage: biolatency [-h] [-T] [-Q] [-m] [-D] [-F] [interval] [count]

Summarize block device I/O latency as a histogram

positional arguments:
  interval            output interval, in seconds
  count               number of outputs

optional arguments:
  -h, --help          show this help message and exit
  -T, --timestamp     include timestamp on output
  -Q, --queued        include OS queued time in I/O time
  -m, --milliseconds  millisecond histogram
  -D, --disks         print a histogram per disk device
  -F, --flags         print a histogram per set of I/O flags

examples:
    ./biolatency            # summarize block I/O latency as a histogram
    ./biolatency 1 10       # print 1 second summaries, 10 times
    ./biolatency -mT 1      # 1s summaries, milliseconds, and timestamps
    ./biolatency -Q         # include OS queued time in I/O time
    ./biolatency -D         # show each disk device separately
    ./biolatency -F         # show I/O flags separately

I've called it as follows to get histograms for 3 seconds, including OS queued time:

openxs@ao756:~$ sudo /usr/share/bcc/tools/biolatency -mQD 1 3
Tracing block device I/O... Hit Ctrl-C to end.


disk = 'sda'
     msecs               : count     distribution
         0 -> 1          : 73       |****************************************|
         2 -> 3          : 4        |**                                      |
         4 -> 7          : 5        |**                                      |
         8 -> 15         : 64       |***********************************     |
        16 -> 31         : 24       |*************                           |
        32 -> 63         : 6        |***                                     |
        64 -> 127        : 2        |*                                       |


disk = 'sda'
     msecs               : count     distribution
         0 -> 1          : 80       |****************************************|
         2 -> 3          : 8        |****                                    |
         4 -> 7          : 2        |*                                       |
         8 -> 15         : 57       |****************************            |
        16 -> 31         : 11       |*****                                   |
        32 -> 63         : 4        |**                                      |
        64 -> 127        : 4        |**                                      |


disk = 'sda'
     msecs               : count     distribution
         0 -> 1          : 48       |****************************************|
         2 -> 3          : 1        |                                        |
         4 -> 7          : 7        |*****                                   |
         8 -> 15         : 45       |*************************************   |
        16 -> 31         : 25       |********************                    |
        32 -> 63         : 11       |*********                               |
openxs@ao756:~$

It is clear from the above that there are many really fast I/Os, but many of them take 8-15 milliseconds or even more.

We can do some analysis at the filesystem level too. The next two tools help in the case of the ext4 filesystem (like mine). There are similar tools in the bcc tools collection for other types of filesystems.

ext4slower - trace common ext4 file operations slower than a threshold

It could be useful to find out what kind of operation (read, write or sync) is slow, and against which file. The ext4slower tool may help:

openxs@ao756:~$ sudo /usr/share/bcc/tools/ext4slower --help
usage: ext4slower [-h] [-j] [-p PID] [min_ms]

Trace common ext4 file operations slower than a threshold

positional arguments:
  min_ms             minimum I/O duration to trace, in ms (default 10)

optional arguments:
  -h, --help         show this help message and exit
  -j, --csv          just print fields: comma-separated values
  -p PID, --pid PID  trace this PID only

examples:
    ./ext4slower             # trace operations slower than 10 ms (default)
    ./ext4slower 1           # trace operations slower than 1 ms
    ./ext4slower -j 1        # ... 1 ms, parsable output (csv)
    ./ext4slower 0           # trace all operations (warning: verbose)
    ./ext4slower -p 185      # trace PID 185 only

I tried to apply it to the MariaDB server while the test was running and ended up with the following:

openxs@ao756:~$ sudo /usr/share/bcc/tools/ext4slower 1 -p`pidof mariadbd`
Tracing ext4 operations slower than 1 ms
TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
13:55:17 mariadbd       19702  S 0       0          11.32 ib_logfile0
13:55:17 mariadbd       19702  S 0       0          10.97 ib_logfile0
...
13:55:17 mariadbd       19702  S 0       0          10.51 ib_logfile0
13:55:17 mariadbd       19702  S 0       0          44.53 ibdata1
13:55:17 mariadbd       19702  S 0       0          44.14 sbtest1.ibd
13:55:17 mariadbd       19702  S 0       0          20.13 ib_logfile0
...
13:55:17 mariadbd       19702  S 0       0          11.21 ib_logfile0
13:55:17 mariadbd       19702  S 0       0          43.34 ibdata1
13:55:17 mariadbd       19702  S 0       0         179.07 sbtest1.ibd
13:55:17 mariadbd       19702  S 0       0           6.46 ib_logfile0
...
13:55:18 mariadbd       19702  S 0       0          47.79 ibdata1
13:55:18 mariadbd       19702  S 0       0          18.42 ib_logfile0
13:55:18 mariadbd       19702  S 0       0          41.76 ib_logfile0
13:55:18 mariadbd       19702  S 0       0          60.47 ibdata1

Looks like only sync operations were sometimes slow. To double-check that slow writes are also reported, I started dd concurrently:

openxs@ao756:~/dbs/maria10.5$ dd if=/dev/zero of=test.dd bs=1024 count=1000000
1000000+0 records in
1000000+0 records out
1024000000 bytes (1,0 GB, 977 MiB) copied, 12,4325 s, 82,4 MB/s

and got the following reported:

openxs@ao756:~$ sudo /usr/share/bcc/tools/ext4slower 10
Tracing ext4 operations slower than 10 ms
TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
13:57:45 dd             20972  W 1024    317088     10.10 test.dd
13:57:45 dd             20972  W 1024    317576    179.25 test.dd
13:57:45 dd             20972  W 1024    322812     15.42 test.dd

13:57:45 dd             20972  W 1024    327394     14.32 test.dd
13:57:46 dd             20972  W 1024    354096     17.61 test.dd
13:57:46 mysqld         4090   W 114688  1024      272.75 ibdata1
^Copenxs@ao756:~$

We can clearly see dd writes at different offsets, 1024 bytes each. Some of them were very slow. The mysqld process belongs to the PMM server running in a Docker container.
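
If you want to post-process such events (say, aggregate latency per file with awk or a spreadsheet), the -j option documented above produces parsable comma-separated values. I did not need it here, but the call would be along these lines:

sudo /usr/share/bcc/tools/ext4slower -j 1 > ext4slower.csv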

ext4dist - summarize ext4 operation latency

We can further study histograms of latencies per operation with the ext4dist tool:

openxs@ao756:~$ sudo /usr/share/bcc/tools/ext4dist --help
usage: ext4dist [-h] [-T] [-m] [-p PID] [interval] [count]

Summarize ext4 operation latency

positional arguments:
  interval            output interval, in seconds
  count               number of outputs

optional arguments:
  -h, --help          show this help message and exit
  -T, --notimestamp   don't include timestamp on interval output
  -m, --milliseconds  output in milliseconds
  -p PID, --pid PID   trace this PID only

examples:
    ./ext4dist            # show operation latency as a histogram
    ./ext4dist -p 181     # trace PID 181 only
    ./ext4dist 1 10       # print 1 second summaries, 10 times
    ./ext4dist -m 5       # 5s summaries, milliseconds

Here we can see how latencies are distributed per operation, fsync in particular:

openxs@ao756:~$ sudo /usr/share/bcc/tools/ext4dist -T -m
Tracing ext4 operation latency... Hit Ctrl-C to end.
^C

operation = read
     msecs               : count     distribution
         0 -> 1          : 1336     |****************************************|
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 5        |                                        |
        16 -> 31         : 7        |                                        |
        32 -> 63         : 1        |                                        |

operation = write
     msecs               : count     distribution
         0 -> 1          : 10834    |****************************************|
         2 -> 3          : 3        |                                        |
         4 -> 7          : 3        |                                        |
         8 -> 15         : 4        |                                        |
        16 -> 31         : 11       |                                        |
        32 -> 63         : 6        |                                        |
        64 -> 127        : 2        |                                        |

operation = open
     msecs               : count     distribution
         0 -> 1          : 328      |****************************************|

operation = fsync
     msecs               : count     distribution
         0 -> 1          : 8        |                                        |
         2 -> 3          : 2        |                                        |
         4 -> 7          : 40       |                                        |
         8 -> 15         : 2159     |****************************************|
        16 -> 31         : 247      |****                                    |
        32 -> 63         : 319      |*****                                   |
        64 -> 127        : 152      |**                                      |
       128 -> 255        : 84       |*                                       |
       256 -> 511        : 2        |                                        |
openxs@ao756:~$

Other operations were fast enough.
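
To confirm that the fsync latency above really comes from MariaDB and not from the PMM containers, one could also limit tracing to the mariadbd process and print several interval summaries. I did not include such output here, but based on the options above the call would be:

sudo /usr/share/bcc/tools/ext4dist -m -p`pidof mariadbd` 5 3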

offwaketime - summarize blocked time by kernel stack trace + waker stack

You may not even be sure what the application is waiting for: disk I/O or something else. In such cases we can try to profile the off-CPU time with tools like offwaketime:

openxs@ao756:~$ sudo /usr/share/bcc/tools/offwaketime --help
usage: offwaketime [-h] [-p PIDS | -t TIDS | -u | -k] [-U | -K] [-d] [-f]
                   [--stack-storage-size STACK_STORAGE_SIZE]
                   [-m MIN_BLOCK_TIME] [-M MAX_BLOCK_TIME] [--state STATE]
                   [duration]

Summarize blocked time by kernel stack trace + waker stack

positional arguments:
  duration              duration of trace, in seconds

optional arguments:
  -h, --help            show this help message and exit
  -p PIDS, --pid PIDS   trace these PIDS only. Can be a comma separated list
                        of PIDS.
  -t TIDS, --tid TIDS   trace these TIDS only. Can be a comma separated list
                        of TIDS.
  -u, --user-threads-only
                        user threads only (no kernel threads)
  -k, --kernel-threads-only
                        kernel threads only (no user threads)
  -U, --user-stacks-only
                        show stacks from user space only (no kernel space
                        stacks)
  -K, --kernel-stacks-only
                        show stacks from kernel space only (no user space
                        stacks)
  -d, --delimited       insert delimiter between kernel/user stacks
  -f, --folded          output folded format
  --stack-storage-size STACK_STORAGE_SIZE
                        the number of unique stack traces that can be stored
                        and displayed (default 1024)
  -m MIN_BLOCK_TIME, --min-block-time MIN_BLOCK_TIME
                        the amount of time in microseconds over which we store
                        traces (default 1)
  -M MAX_BLOCK_TIME, --max-block-time MAX_BLOCK_TIME
                        the amount of time in microseconds under which we
                        store traces (default U64_MAX)
  --state STATE         filter on this thread state bitmask (eg, 2 ==
                        TASK_UNINTERRUPTIBLE) see include/linux/sched.h

examples:
    ./offwaketime             # trace off-CPU + waker stack time until Ctrl-C
    ./offwaketime 5           # trace for 5 seconds only
    ./offwaketime -f 5        # 5 seconds, and output in folded format
    ./offwaketime -m 1000     # trace only events that last more than 1000 usec
    ./offwaketime -M 9000     # trace only events that last less than 9000 usec
    ./offwaketime -p 185      # only trace threads for PID 185
    ./offwaketime -t 188      # only trace thread 188
    ./offwaketime -u          # only trace user threads (no kernel)
    ./offwaketime -k          # only trace kernel threads (no user)
    ./offwaketime -U          # only show user space stacks (no kernel)
    ./offwaketime -K          # only show kernel space stacks (no user)

But my first attempt to use it failed immediately:

openxs@ao756:~$ sudo /usr/share/bcc/tools/offwaketime -f 5 -p`pidof mariadbd`
could not open bpf map: stack_traces, error: Invalid argument
Traceback (most recent call last):
  File "/usr/share/bcc/tools/offwaketime", line 290, in <module>
    b = BPF(text=bpf_text)
  File "/usr/lib/python2.7/dist-packages/bcc/__init__.py", line 364, in __init__
    raise Exception("Failed to compile BPF module %s" % (src_file or "<text>"))
Exception: Failed to compile BPF module <text>

A quick search proved that this is a known problem on Ubuntu 16.04. The workaround (use the tool from the old/ subdirectory intended for older kernels) helped in this case.
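
By the way, it is easy to check which tools ship such a legacy version for older kernels:

ls /usr/share/bcc/tools/old/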

openxs@ao756:~$ sudo /usr/share/bcc/tools/old/offwaketime -f 5
...
mariadbd;entry_SYSCALL_64_fastpath;sys_futex;do_futex;futex_wait;futex_wait_queue_me;schedule;-;futex_wake;do_futex;sys_futex;entry_SYSCALL_64_fastpath;;mariadbd 14206606
percona-qan-api;entry_SYSCALL_64_fastpath;sys_futex;do_futex;futex_wait;futex_wait_queue_me;schedule;-;futex_wake;do_futex;sys_futex;entry_SYSCALL_64_fastpath;;percona-qan-api 14677762
prometheus;entry_SYSCALL_64_fastpath;sys_futex;do_futex;futex_wait;futex_wait_queue_me;schedule;-;futex_wake;do_futex;sys_futex;entry_SYSCALL_64_fastpath;;prometheus 14777599
dockerd;entry_SYSCALL_64_fastpath;sys_futex;do_futex;futex_wait;futex_wait_queue_me;schedule;-;futex_wake;do_futex;sys_futex;entry_SYSCALL_64_fastpath;;dockerd 14925945
consul;entry_SYSCALL_64_fastpath;sys_futex;do_futex;futex_wait;futex_wait_queue_me;schedule;-;futex_wake;do_futex;sys_futex;entry_SYSCALL_64_fastpath;;consul 14973943
mysqld;entry_SYSCALL_64_fastpath;sys_io_getevents;read_events;schedule;-;hrtimer_wakeup;__hrtimer_run_queues;hrtimer_interrupt;local_apic_timer_interrupt;smp_apic_timer_interrupt;apic_timer_interrupt;cpuidle_enter;call_cpuidle;cpu_startup_entry;start_secondary;swapper/1 15502360
prometheus1;entry_SYSCALL_64_fastpath;sys_futex;do_futex;futex_wait;futex_wait_queue_me;schedule;-;futex_wake;do_futex;sys_futex;entry_SYSCALL_64_fastpath;;prometheus1 16239277
sysbench;entry_SYSCALL_64_fastpath;sys_recvfrom;SYSC_recvfrom;sock_recvmsg;unix_stream_recvmsg;unix_stream_read_generic;schedule_timeout;schedule;-;autoremove_wake_function;__wake_up_common;__wake_up_sync_key;sock_def_readable;unix_stream_sendmsg;sock_sendmsg;SYSC_sendto;sys_sendto;entry_SYSCALL_64_fastpath;;mariadbd 19405099
node_exporter;entry_SYSCALL_64_fastpath;sys_futex;do_futex;futex_wait;futex_wait_queue_me;schedule;-;futex_wake;do_futex;sys_futex;entry_SYSCALL_64_fastpath;;node_exporter 23541664
mariadbd;entry_SYSCALL_64_fastpath;sys_futex;do_futex;futex_wait;futex_wait_queue_me;schedule;-;futex_wake_op;do_futex;sys_futex;entry_SYSCALL_64_fastpath;;mariadbd 32664296
openxs@ao756:~$

If I interpret the above right, we mostly see mariadbd threads waiting on futexes (mutexes).
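
The folded output produced with -f is designed to be fed to Brendan Gregg's FlameGraph scripts, so (assuming they are cloned into ./FlameGraph, which I did not do on this box) one could turn it into an off-CPU time flame graph like this:

sudo /usr/share/bcc/tools/old/offwaketime -f 5 > offwaketime.folded
./FlameGraph/flamegraph.pl offwaketime.folded > offwaketime.svg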

Finally, the profile tool: 

openxs@ao756:~$ sudo /usr/share/bcc/tools/profile --help
usage: profile [-h] [-p PID | -L TID] [-U | -K] [-F FREQUENCY | -c COUNT] [-d]
               [-a] [-I] [-f] [--stack-storage-size STACK_STORAGE_SIZE]
               [-C CPU] [--cgroupmap CGROUPMAP] [--mntnsmap MNTNSMAP]
               [duration]

Profile CPU stack traces at a timed interval

positional arguments:
  duration              duration of trace, in seconds

optional arguments:
  -h, --help            show this help message and exit
  -p PID, --pid PID     profile process with this PID only
  -L TID, --tid TID     profile thread with this TID only
  -U, --user-stacks-only
                        show stacks from user space only (no kernel space
                        stacks)
  -K, --kernel-stacks-only
                        show stacks from kernel space only (no user space
                        stacks)
  -F FREQUENCY, --frequency FREQUENCY
                        sample frequency, Hertz
  -c COUNT, --count COUNT
                        sample period, number of events
  -d, --delimited       insert delimiter between kernel/user stacks
  -a, --annotations     add _[k] annotations to kernel frames
  -I, --include-idle    include CPU idle stacks
  -f, --folded          output folded format, one line per stack (for flame
                        graphs)
  --stack-storage-size STACK_STORAGE_SIZE
                        the number of unique stack traces that can be stored
                        and displayed (default 16384)
  -C CPU, --cpu CPU     cpu number to run profile on
  --cgroupmap CGROUPMAP
                        trace cgroups in this BPF map only
  --mntnsmap MNTNSMAP   trace mount namespaces in this BPF map only

examples:
    ./profile             # profile stack traces at 49 Hertz until Ctrl-C
    ./profile -F 99       # profile stack traces at 99 Hertz
    ./profile -c 1000000  # profile stack traces every 1 in a million events
    ./profile 5           # profile at 49 Hertz for 5 seconds only
    ./profile -f 5        # output in folded format for flame graphs
    ./profile -p 185      # only profile process with PID 185
    ./profile -L 185      # only profile thread with TID 185
    ./profile -U          # only show user space stacks (no kernel)
    ./profile -K          # only show kernel space stacks (no user)
    ./profile --cgroupmap mappath  # only trace cgroups in this BPF map
    ./profile --mntnsmap mappath   # only trace mount namespaces in the map

It did not work at all on my old kernel:

openxs@ao756:~$ sudo /usr/share/bcc/tools/profile -U
Sampling at 49 Hertz of all threads by user stack... Hit Ctrl-C to end.
/virtual/main.c:15:10: fatal error: 'uapi/linux/bpf_perf_event.h' file not found
#include <uapi/linux/bpf_perf_event.h>
         ^
1 error generated.
Traceback (most recent call last):
  File "/usr/share/bcc/tools/profile", line 277, in <module>
    b = BPF(text=bpf_text)
  File "/usr/lib/python2.7/dist-packages/bcc/__init__.py", line 364, in __init__
    raise Exception("Failed to compile BPF module %s" % (src_file or "<text>"))
Exception: Failed to compile BPF module <text>

openxs@ao756:~$ sudo /usr/share/bcc/tools/old/profile -U
Sampling at 49 Hertz of all threads by user stack... Hit Ctrl-C to end.
could not open bpf map: stack_traces, error: Invalid argument
BPF initialization failed. perf_misc_flags() may be inlined in your kernel build.
This tool will be updated in the future to support Linux 4.9, which has reliable profiling support. Exiting.

The header file is missing, and the old version of the tool does not help either: proper eBPF instrumentation for perf events was added only in kernels 4.9+.
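
I could not check it on this box, but on a 4.9+ kernel the usual workflow (based on the options shown in the help above) would be to collect folded stacks for the MariaDB process and turn them into a CPU flame graph, roughly like this:

sudo /usr/share/bcc/tools/profile -F 99 -f -p`pidof mariadbd` 30 > profile.folded
./FlameGraph/flamegraph.pl profile.folded > profile.svg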

To summarize:

  1. bcc tools allow detailed disk I/O analysis at different levels, even on reasonably old Linux kernels like 4.4.
  2. The funccount tool helps to find out what exactly to trace in more detail.
  3. Latency distribution matters, and the histograms provided by some bcc tools help to understand it.
  4. There is a limit on the total number of active probes.
  5. There are alternative implementations of some bcc tools for older kernel versions, but it looks like it's time for all of us to upgrade to Linux kernels 5.x.y.