Blog of (former?) MySQL Entomologist: eBPF

Showing posts with label eBPF. Show all posts

Sunday, May 9, 2021

Off-CPU Analysis Attempt to Find the Reason of Performance Regression in MariaDB 10.4

I did not write new blog posts for more than 2 months already. Busy days... But now I am on vacation and my Percona Live Online 2021 talk on flame graphs is coming soon, so I decided to renew my experience with bcc tools and try to get some insights for one of the bugs I've reported for MariaDB using the off-CPU flame graphs.

The idea was to check why sysbench oltp_read_write test started to work notably slower in a newer version of MariaDB 10.4.x after 10.4.15. On my good, old and slow Acer netbook with recently updated Ubuntu:

openxs@ao756:~/git/server$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
openxs@ao756:~/git/server$ uname -a
Linux ao756 5.8.0-50-generic #56~20.04.1-Ubuntu SMP Mon Apr 12 21:46:35 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

I've compiled current MariaDB 10.4 from GitHub source following my usual way:

openxs@ao756:~/git/server$ git checkout 10.4
...
openxs@ao756:~/git/server$ git pull
...
openxs@ao756:~/git/server$ git submodule update --init --recursive
Submodule path 'extra/wolfssl/wolfssl': checked out '9c87f979a7f1d3a6d786b260653d566c1d31a1c4'
Submodule path 'libmariadb': checked out '180c543704d627a50a52aaf60e24ca14e0ec4686'
Submodule path 'wsrep-lib': checked out 'f271ad0c6e3c647df83c1d5ec9cd26d77cef2337'
Submodule path 'wsrep-lib/wsrep-API/v26': checked out '76cf223c690845bbf561cb820a46e06a18ad80d1'
openxs@ao756:~/git/server$ git branch
10.3
* 10.4
10.5
10.6
openxs@ao756:~/git/server$ git log -1
commit 583b72ad0ddbc46a7aaeda1c1373b89d4bded9ea (HEAD -> 10.4, origin/bb-10.4-merge, origin/10.4)
Merge: 473e85e9316 a4139f8d68b
Author: Oleksandr Byelkin <sanja@mariadb.com>
Date:   Fri May 7 11:50:24 2021 +0200

    Merge branch 'bb-10.4-release' into 10.4
openxs@ao756:~/git/server$ rm CMakeCache.txt
openxs@ao756:~/git/server$ cmake . -DCMAKE_INSTALL_PREFIX=/home/openxs/dbs/maria10.4 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DBUILD_CONFIG=mysql_release -DFEATURE_SET=community -DWITH_EMBEDDED_SERVER=OFF -DPLUGIN_TOKUDB=NO -DWITH_SSL=system
...
-- Generating done
-- Build files have been written to: /home/openxs/git/server
openxs@ao756:~/git/server$ time make -j 3
...
[100%] Building C object extra/mariabackup/CMakeFiles/mariabackup.dir/__/__/libmysqld/libmysql.c.o
[100%] Linking CXX executable mariabackup
[100%] Built target mariabackup

real    74m9,550s
user    134m10,837s
sys     7m0,387s

openxs@ao756:~/git/server$ make install && make clean
...
openxs@ao756:~/git/server$ cd ~/dbs/maria10.4
openxs@ao756:~/dbs/maria10.4$ bin/mysqld_safe --no-defaults --port=3309 --socket=/tmp/mariadb.sock --innodb_buffer_pool_size=1G --innodb_flush_log_at_trx_commit=2 &
[1] 27483
openxs@ao756:~/dbs/maria10.4$ 210507 19:15:37 mysqld_safe Logging to '/home/openxs/dbs/maria10.4/data/ao756.err'.
210507 19:15:37 mysqld_safe Starting mysqld daemon with databases from /home/openxs/dbs/maria10.4/data

openxs@ao756:~/dbs/maria10.4$ bin/mysql --socket=/tmp/mariadb.sock
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 8
Server version: 10.4.19-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> drop database if exists sbtest;
Query OK, 5 rows affected (1.297 sec)

MariaDB [(none)]> create database sbtest;
Query OK, 1 row affected (0.001 sec)

MariaDB [(none)]> exit
Bye

and compared to 10.4.15 with the following test:

openxs@ao756:~/dbs/maria10.4$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --threads=4 prepare
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)

Initializing worker threads...

Creating table 'sbtest2'...
Creating table 'sbtest3'...
Creating table 'sbtest4'...
Creating table 'sbtest1'...
Inserting 100000 records into 'sbtest1'
Inserting 100000 records into 'sbtest4'
Inserting 100000 records into 'sbtest2'
Inserting 100000 records into 'sbtest3'
Creating a secondary index on 'sbtest1'...
Creating a secondary index on 'sbtest4'...
Creating table 'sbtest5'...
Inserting 100000 records into 'sbtest5'
Creating a secondary index on 'sbtest2'...
Creating a secondary index on 'sbtest3'...
Creating a secondary index on 'sbtest5'...

openxs@ao756:~/dbs/maria10.4$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --time=300 --report-interval=10 --threads=4 run
...
    transactions:                        35630 (118.54 per sec.)
    queries:                             712600 (2370.87 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          300.5612s
    total number of events:              35630

Latency (ms):
         min:                                    3.30
         avg:                                   33.70
         max:                                 2200.17
         95th percentile:                      164.45
...

openxs@ao756:~/dbs/maria10.4.15$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --time=300 --report-interval=10 --threads=4 run
...
    transactions:                        56785 (189.25 per sec.)
    queries:                             1135700 (3784.99 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          300.0501s
    total number of events:              56785

Latency (ms):
         min:                                    3.15
         avg:                                   21.13
         max:                                  704.36
         95th percentile:                      108.68
...

So, basically with the same test with all tables fitting into the buffer pool (1G) and all but few default settings current MariaDB 10.4.19 demonstrate up to 60% drop in throughput and increase of 95th latency on this netbook (even more than 15% or so reported on faster quad core Fedora desktop previously).

If you read MDEV-24272 carefully, the regression was tracked up to a specific commit, but I tried to apply various tools to actually see where more time is spent now, specifically. Profiling with perf and creating on-CPU flame graphs had not given me any clear insight that would explain that increase in latency, so my next idea was to trace off-CPU time spent, that is, try to find out how long MariaDB server waits and where in the code that's mostly happen.

For this I've used the offcputime tool:

openxs@ao756:~$ /usr/share/bcc/tools/offcputime -h
usage: offcputime [-h] [-p PID | -t TID | -u | -k] [-U | -K] [-d] [-f]
                  [--stack-storage-size STACK_STORAGE_SIZE]
                  [-m MIN_BLOCK_TIME] [-M MAX_BLOCK_TIME] [--state STATE]
                  [duration]

Summarize off-CPU time by stack trace

positional arguments:
duration              duration of trace, in seconds

optional arguments:
-h, --help            show this help message and exit
-p PID, --pid PID     trace this PID only
-t TID, --tid TID     trace this TID only
-u, --user-threads-only
                        user threads only (no kernel threads)
-k, --kernel-threads-only
                        kernel threads only (no user threads)
-U, --user-stacks-only
                        show stacks from user space only (no kernel space
                        stacks)
-K, --kernel-stacks-only
                        show stacks from kernel space only (no user space
                        stacks)
-d, --delimited       insert delimiter between kernel/user stacks
-f, --folded          output folded format
--stack-storage-size STACK_STORAGE_SIZE
                        the number of unique stack traces that can be stored
                        and displayed (default 1024)
-m MIN_BLOCK_TIME, --min-block-time MIN_BLOCK_TIME
                        the amount of time in microseconds over which we store
                        traces (default 1)
-M MAX_BLOCK_TIME, --max-block-time MAX_BLOCK_TIME
                        the amount of time in microseconds under which we
                        store traces (default U64_MAX)
--state STATE         filter on this thread state bitmask (eg, 2 ==
                        TASK_UNINTERRUPTIBLE) see include/linux/sched.h

examples:
    ./offcputime             # trace off-CPU stack time until Ctrl-C
    ./offcputime 5           # trace for 5 seconds only
    ./offcputime -f 5        # 5 seconds, and output in folded format
    ./offcputime -m 1000     # trace only events that last more than 1000 usec
    ./offcputime -M 10000    # trace only events that last less than 10000 usec
    ./offcputime -p 185      # only trace threads for PID 185
    ./offcputime -t 188      # only trace thread 188
    ./offcputime -u          # only trace user threads (no kernel)
    ./offcputime -k          # only trace kernel threads (no user)
    ./offcputime -U          # only show user space stacks (no kernel)
    ./offcputime -K          # only show kernel space stacks (no user)

I've stored the outputs in /dev/shm to have less impact on disk I/O that I suspected as one of the reasons:

openxs@ao756:~$ ls /dev/shm
openxs@ao756:~$ mkdir /dev/shm/out
openxs@ao756:~$ ls /dev/shm
out

Basically the following commands were used to create folded (ready to use for building flame graphs) user-space only stacks and time spent off-CPU in them over 60 seconds of tracing while sysbench tests were running for clean setup on MariaDB 10.4.15 and then on current 10.4.19, and create flame graphs based on those stacks where start_thread frame is present (to clean up irrelevant details):

openxs@ao756:~$ sudo /usr/share/bcc/tools/offcputime -f -p `pidof mysqld` -U 60 > /dev/shm/out/offcpu_10415.out
WARNING: 27 stack traces lost and could not be displayed.
openxs@ao756:~$ cat /dev/shm/out/offcpu_10415.out | grep start_thread | ~/git/FlameGraph/flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us > /tmp/offcpu_10415.svg

openxs@ao756:~$ sudo /usr/share/bcc/tools/offcputime -f -p `pidof mysqld` -U 60 > /dev/shm/out/offcpu_10419.out
WARNING: 24 stack traces lost and could not be displayed.
openxs@ao756:~$ cat /dev/shm/out/offcpu_10419.out | grep start_thread | ~/git/FlameGraph/flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us > /tmp/offcpu_10419.svg

As a result I've got the following graphs (.png screenshots from real .svg files below). On 10.4.15 we spent around 43 seconds (out of 60 we monitored) off-CPU, mostly in do_command() and waiting for network I/O, and the graph was the following:

Off-CPU time for MariaDB 10.4.15

In case of 10.4.19 the graph is very different and we seem to have spent 79 seconds off-CPU, mostly in background io_handler_thread():

Off-CPU time for MariaDB 10.4.19

I was surprised to see more than 60 seconds spent off-CPU in this case. Maybe this is possible because I have 2 cores and MariaDB threads were waiting on both most of the time.

I've then tried to use differential flame graph to highlight the call stacks that the main difference is related to. I've crated it from existing folded stack traces with the following command:

openxs@ao756:~$ ~/git/FlameGraph/difffolded.pl /dev/shm/out/offcpu_10415.out /dev/shm/out/offcpu_10419.out | grep start_thread | ~/git/FlameGraph/flamegraph.pl --color=io --title="Off-CPU Time Diff Flame Graph" --countname=us > /tmp/offcpu_diff.svg

The resulting graph is presented below:

The source of increase is highlighted in red

Here we clearly see that main increase in time spent waiting in 10.4.19 is related to io_handler_thread(), but increase happened in almost all background threads.

* * *

To summarize:

When some performance regression happens you should check not only those code paths in the application where software started to spend more time working, but also where it started to wait more.
In case of https://jira.mariadb.org/browse/MDEV-24272 we clearly started to flush more to disk from the very beginning of sysbench oltp_read_write test in newer versions of MariaDB 10.4.x, and on my slow encrypted HDD this matters a lot. The load that was supposed to be CPU-bound (as we have large enough buffer pool) becomes disk-bound.
Flame graphs are cool for highlighting the difference and in this post I've shown both a classical smart way to produce them without too much impact, and a way to highlight the difference in them with a differential flam graph produced by the difffolded.pl tool created by Brendan Gregg.
Other cases when Flame Graphs may help MySQL or MariaDB DBAs are discussed during my upcoming Percona Live 2021 talk on May 12. See you there!
I'll get back to this nice regression bug to study the test case in more details with other tools, maybe more than once. Stay tuned!

Sunday, September 13, 2020

BCC Tools for disk I/O Analysis and More

A lot of time passed since I had written about bcc tools in my dynamic tracing series. That post was mostly about building bcc tools from GitHub source on Ubuntu 16.04 and using trace utility to add dynamic probe to MariaDB server. Other that that I had presented tplist and mysql_qslower with a couple of slides at several events, but that's all. There are many more tools useful for DBAs in the bundle, and today I am going to describe several of them mostly related to I/O monitoring. It is yet another blog post in support of my upcoming MariaDB Server Fest 2020 session on dynamic tracing.

I was inspired by another cool resentation, "DB AMA: BCC - Tools for BPF-based Linux IO analysis", made by my dear friend and former colleague in Percona, Marcos Albe:

It has much less likes and views that it honestly deserves.

For this post I picked up only some of bcc tools discussed there. Separate post about using bpftrace for similar analysis is in progress, but I can not complete it today, mostly because bpftrace is not supported on this old Ubuntu I have at hand:

openxs@ao756:~/dbs/maria10.5$ uname -a
Linux ao756 4.4.0-187-generic #217-Ubuntu SMP Tue Jul 21 04:18:15 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

So I can not check and copy/paste the details quickly.

I started with rebuilding bcc tools from current source:

openxs@ao756:~/git/bcc$ git log -1
commit ba73657cb8c4dab83dfb89eed4a8b3866255569a
Author: Hao <1075808668@qq.com>
Date:   Sat Sep 12 02:05:29 2020 +0800

    Netqtop 3037 (#3048)

    The tool netqtop uses tracepoints NET_DEV_START_XMIT and NETIF_RECEIVE_SKB to intercept every transmitted and received packet, as a result, considerable performance descent is expected.
    Details for some performance evaluation can be found at
       https://github.com/iovisor/bcc/pull/3048
openxs@ao756:~/git/bcc$

Nothing fancy there, just many tests are still failing:

openxs@ao756:~/git/bcc/build$ make test
...
68% tests passed, 14 tests failed out of 44

Total Test time (real) = 340.86 sec

The following tests FAILED:
          4 - py_test_stat1_b (Failed)
          5 - py_test_bpf_log (Failed)
          6 - py_test_stat1_c (Failed)
          7 - py_test_xlate1_c (Failed)
          8 - py_test_call1 (Failed)
         11 - py_test_trace3_c (Failed)
         15 - py_test_debuginfo (Failed)
         16 - py_test_brb (Failed)
         17 - py_test_brb2 (Failed)
         18 - py_test_clang (Failed)
         23 - py_test_stackid (Failed)
         30 - py_test_tools_smoke (Failed)
         38 - py_test_lpm_trie (Failed)
         44 - lua_test_standalone (Failed)
Errors while running CTest
Makefile:105: recipe for target 'test' failed
make: *** [test] Error 8

but I still had a hope that most of the tools work (and it's really the case). So I executed sudo make install and ended up with the following:

openxs@ao756:~/git/bcc/build$ ls -F /usr/share/bcc/tools/
argdist*       dirtop*          lib/             pythoncalls* tclobjnew*
bashreadline* doc/             llcstat*         pythonflow*   tclstat*
bindsnoop*     drsnoop*         mdflush*         pythongc*     tcpaccept*
biolatency*    execsnoop*       memleak*         pythonstat*   tcpconnect*
biolatpcts*    exitsnoop*       mountsnoop*      reset-trace* tcpconnlat*
biosnoop*      ext4dist*        mysqld_qslower* rubycalls*    tcpdrop*
biotop*        ext4slower*      netqtop*         rubyflow*     tcplife*
bitesize*      filelife*        netqtop.c        rubygc*       tcpretrans*
bpflist*       fileslower*      nfsdist*         rubyobjnew*   tcprtt*
btrfsdist*     filetop*         nfsslower*       rubystat*     tcpstates*
btrfsslower*   funccount*       nodegc*          runqlat*      tcpsubnet*
cachestat*     funcinterval*    nodestat*        runqlen*      tcptop*
cachetop*      funclatency*     offcputime*      runqslower*   tcptracer*
capable*       funcslower*      offwaketime*     shmsnoop*     tplist*
cobjnew*       gethostlatency* old/             slabratetop* trace*
compactsnoop* hardirqs*        oomkill*         sofdsnoop*    ttysnoop*
cpudist*       inject*          opensnoop*       softirqs*     vfscount*
cpuunclaimed* javacalls*       perlcalls*       solisten*     vfsstat*
criticalstat* javaflow*        perlflow*        sslsniff*     wakeuptime*
dbslower*      javagc*          perlstat*        stackcount*   xfsdist*
dbstat*        javaobjnew*      phpcalls*        statsnoop*    xfsslower*
dcsnoop*       javastat*        phpflow*         syncsnoop*    zfsdist*
dcstat*        javathreads*     phpstat*         syscount*     zfsslower*
deadlock*      killsnoop*       pidpersec*       tclcalls*
deadlock.c     klockstat*       profile*         tclflow*
openxs@ao756:~/git/bcc/build$

The tools I am going to try today are highlighted above. I'll run some sysbench tests like this:

openxs@ao756:~/dbs/maria10.5$ sysbench --table-size=1000000 --threads=4 --time=100 --mysql-socket=/tmp/mariadb105.sock --mysql-user=openxs --mysql-db=sbtest --report-interval=5 /usr/share/sysbench/oltp_update_index.lua run
sysbench 1.1.0-faaff4f (using bundled LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 4
Report intermediate results every 5 second(s)
Initializing random number generator from current time

Initializing worker threads...

Threads started!

[ 5s ] thds: 4 tps: 113.89 qps: 113.89 (r/w/o: 0.00/113.89/0.00) lat (ms,95%): 101.13 err/s: 0.00 reconn/s: 0.00
[ 10s ] thds: 4 tps: 113.60 qps: 113.60 (r/w/o: 0.00/113.60/0.00) lat (ms,95%): 110.66 err/s: 0.00 reconn/s: 0.00
[ 15s ] thds: 4 tps: 110.20 qps: 110.20 (r/w/o: 0.00/110.20/0.00) lat (ms,95%): 144.97 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 4 tps: 112.00 qps: 112.00 (r/w/o: 0.00/112.00/0.00) lat (ms,95%): 153.02 err/s: 0.00 reconn/s: 0.00
...

as a test load against recent MariaDB 10.5.6 also built from source and started with mostly default settings and some Performance Schema instrumentation enabled:

openxs@ao756:~/dbs/maria10.5$ bin/mysqld_safe --no-defaults --port=3311 --socket=/tmp/mariadb105.sock --performance_schema=1 --performance-schema-instrument='memory/%=ON' &
[1] 19616
openxs@ao756:~/dbs/maria10.5$ 200913 13:16:14 mysqld_safe Logging to '/home/openxs/dbs/maria10.5/data/ao756.err'.
200913 13:16:14 mysqld_safe Starting mariadbd daemon with databases from /home/openxs/dbs/maria10.5/data
...

funccount - count functions, tracepoints and USDT probes hits

This tool is useful when you looking for what actually to trace, as it shows number of hits per tracepoint or probe. As usual, you can just call it with --help option to get a lot of details about options:

openxs@ao756:~$ sudo /usr/share/bcc/tools/funccount --help
usage: funccount [-h] [-p PID] [-i INTERVAL] [-d DURATION] [-T] [-r] [-D]
                 [-c CPU]
                 pattern

Count functions, tracepoints, and USDT probes

positional arguments:
pattern               search expression for events

optional arguments:
-h, --help            show this help message and exit
-p PID, --pid PID     trace this PID only
-i INTERVAL, --interval INTERVAL
                        summary interval, seconds
-d DURATION, --duration DURATION
                        total duration of trace, seconds
-T, --timestamp       include timestamp on output
-r, --regexp          use regular expressions. Default is "*" wildcards
                        only.
-D, --debug           print BPF program before starting (for debugging
                        purposes)
-c CPU, --cpu CPU     trace this CPU only

examples:
    ./funccount 'vfs_*'             # count kernel fns starting with "vfs"
    ./funccount -r '^vfs.*'         # same as above, using regular expressions
    ./funccount -Ti 5 'vfs_*'       # output every 5 seconds, with timestamps
    ./funccount -d 10 'vfs_*'       # trace for 10 seconds only
    ./funccount -p 185 'vfs_*'      # count vfs calls for PID 181 only
    ./funccount t:sched:sched_fork # count calls to the sched_fork tracepoint
    ./funccount -p 185 u:node:gc*   # count all GC USDT probes in node, PID 185
    ./funccount c:malloc            # count all malloc() calls in libc
    ./funccount go:os.*             # count all "os.*" calls in libgo
    ./funccount -p 185 go:os.*      # count all "os.*" calls in libgo, PID 185
    ./funccount ./test:read*        # count "read*" calls in the ./test binary
    ./funccount -c 1 'vfs_*'        # count vfs calls on CPU 1 only

Based on the above I tried to trace everything for a couple of seconds while sysbench was running, and failed:

openxs@ao756:~$ sudo /usr/share/bcc/tools/funccount -i 1 -d 2 '*'
maximum of 1000 probes allowed, attempted 53271

Too many probes to trace. The following commands work though:

openxs@ao756:~$ sudo /usr/share/bcc/tools/funccount -T -i 1 -d 1 'ext4*'
Tracing 509 functions for "ext4*"... Hit Ctrl-C to end.

13:29:37
FUNC                                    COUNT
...
ext4_sync_file                             14
ext4_finish_bio                            14
ext4_get_block_write                       14
ext4_direct_IO                             14
...
ext4_statfs                               110
ext4_htree_store_dirent                   110
ext4_da_write_begin                       112
ext4fs_dirhash                            122
ext4_es_lookup_extent                     143
ext4_journal_check_start                  219
Detaching...
openxs@ao756:~$ sudo /usr/share/bcc/tools/funccount -T -i 1 -d 1 '*sync'
Tracing 181 functions for "*sync"... Hit Ctrl-C to end.

13:32:11
FUNC                                    COUNT
sys_fsync                                   1
SyS_fsync                                   3
journal_end_buffer_io_sync                  8
end_bio_bh_io_sync                         22
do_fsync                                  142
kill_fasync                               158
i915_gem_object_sync                      217
SyS_fdatasync                             232
blk_run_queue_async                       291
sys_fdatasync                             324
try_to_del_timer_sync                     408
run_sync                                  483
del_timer_sync                           1056
Detaching...
openxs@ao756:~$ sudo /usr/share/bcc/tools/funccount -T -i 1 -d 1 '*file'
Tracing 151 functions for "*file"... Hit Ctrl-C to end.

13:32:41
FUNC                                    COUNT
security_mmap_file                          1
apparmor_mmap_file                          1
sock_alloc_file                             3
ext4_release_file                           8
aa_audit_file                              29
ext4_sync_file                             96
anon_inode_getfile                        100
__mnt_want_write_file                     100
__mnt_drop_write_file                     101
trace_create_file                         180
alloc_file                                205
locks_remove_file                         232
tracefs_create_file                       605
Detaching...
openxs@ao756:~$

So, it's clear that some notable number of calls to ext4_sync_file happens and we may want to trace it further. You can use funccount tool to get top N probes for further detailed study.

biotop - block device (disk) I/O by process

We have a way to find out what I/O related functions are oftedn used. But what processes actually use them and do a lot of disk I/O? Ther biotop tool helps to find out:

openxs@ao756:~$ sudo /usr/share/bcc/tools/biotop --help
usage: biotop [-h] [-C] [-r MAXROWS] [interval] [count]

Block device (disk) I/O by process

positional arguments:
interval              output interval, in seconds
count                 number of outputs

optional arguments:
-h, --help            show this help message and exit
-C, --noclear         don't clear the screen
-r MAXROWS, --maxrows MAXROWS
                        maximum rows to print, default 20

examples:
    ./biotop            # block device I/O top, 1 second refresh
    ./biotop -C         # don't clear the screen
    ./biotop 5          # 5 second summaries
    ./biotop 5 10       # 5 second summaries, 10 times only

I tried to run it several times as follows while sysbench test was running:

openxs@ao756:~$ sudo /usr/share/bcc/tools/biotop -C
Tracing... Output every 1 secs. Hit Ctrl-C to end

13:41:05 loadavg: 3.63 4.69 7.97 1/764 20587

PID    COMM             D MAJ MIN DISK       I/O Kbytes AVGms
19702 mariadbd         W 8   0   sda          1       0 10.85
253    dmcrypt_write    W 8   0   sda        130       0   1.25
0                       W 8   0   sda         63       0 14.23

...
PID    COMM             D MAJ MIN DISK       I/O Kbytes AVGms
20714 dd               R 8   0   sda          2       0 20.67
12517 update-manager   R 8   0   sda          3       0 519.19
253    dmcrypt_write    W 8   0   sda         80       0 151.81
4090   mysqld           R 8   0   sda          1       0 506.16
253    dmcrypt_write    R 8   0   sda          1       0   1.04
5917   node_exporter    R 8   0   sda          4       0 422.73
0                       R 8   0   sda          1       0   5.31
1218   percona-qan-age R 8   0   sda          4       0 72.50
Detaching...

We can see several top I/O users, including mariadbd and dd that I later ran concurrently to prove some points. You probably noted that I have PMM running on this netbook. A lot of I/O is related to dmcrypt_write - that's because I use encrypted disk volumn here. I am a bit surprised to see all zeros (0) for Kbytes column. Maybe there is some problem to get proper counts on this system for the bcc tools.

We see AVGms column in the output above - that's the average time for the I/O (latency) from the issuer to the device, to its completion, in milliseconds. Next tool may help to find out what's behind these average values, how slow the I/O may really be at times.

biolatency - summarize block device I/O latency as a histogram

Averages may hide some rare severe problems, so it makes sense to check the output of biolatency too:

openxs@ao756:~$ sudo /usr/share/bcc/tools/biolatency --help
usage: biolatency [-h] [-T] [-Q] [-m] [-D] [-F] [interval] [count]

Summarize block device I/O latency as a histogram

positional arguments:
interval            output interval, in seconds
count               number of outputs

optional arguments:
-h, --help          show this help message and exit
-T, --timestamp     include timestamp on output
-Q, --queued        include OS queued time in I/O time
-m, --milliseconds millisecond histogram
-D, --disks         print a histogram per disk device
-F, --flags         print a histogram per set of I/O flags

examples:
    ./biolatency            # summarize block I/O latency as a histogram
    ./biolatency 1 10       # print 1 second summaries, 10 times
    ./biolatency -mT 1      # 1s summaries, milliseconds, and timestamps
    ./biolatency -Q         # include OS queued time in I/O time
    ./biolatency -D         # show each disk device separately
    ./biolatency -F         # show I/O flags separately

I've called it as follows to get histograms for 3 seconds, including OS queued time:

openxs@ao756:~$ sudo /usr/share/bcc/tools/biolatency -mQD 1 3
Tracing block device I/O... Hit Ctrl-C to end.

disk = 'sda'
     msecs               : count     distribution
         0 -> 1          : 73       |****************************************|
         2 -> 3          : 4        |**                                      |
         4 -> 7          : 5        |**                                      |
         8 -> 15         : 64       |***********************************     |
        16 -> 31         : 24       |*************                           |
        32 -> 63         : 6        |***                                     |
        64 -> 127        : 2        |*                                       |

disk = 'sda'
     msecs               : count     distribution
         0 -> 1          : 80       |****************************************|
         2 -> 3          : 8        |****                                    |
         4 -> 7          : 2        |*                                       |
         8 -> 15         : 57       |****************************            |
        16 -> 31         : 11       |*****                                   |
        32 -> 63         : 4        |**                                      |
        64 -> 127        : 4        |**                                      |

disk = 'sda'
     msecs               : count     distribution
         0 -> 1          : 48       |****************************************|
         2 -> 3          : 1        |                                        |
         4 -> 7          : 7        |*****                                   |
         8 -> 15         : 45       |*************************************   |
        16 -> 31         : 25       |********************                    |
        32 -> 63         : 11       |*********                               |
openxs@ao756:~$

It is clear from the above that there are many really fst I/O, but many of them take 8-15 milliseconds or even more.

We can do some analysis at filesystem level too. Next two tools help in case of ext4 filesystem (like mine). There are similar tools in bcc tools collections for other types of filesystems.

ext4slower - trace common ext4 file operations slower than a threshold

It could be useful to find out wehat kind of operation (read, write or sync) is slow, and with what file. ext4slower tool may help:

openxs@ao756:~$ sudo /usr/share/bcc/tools/ext4slower --help
usage: ext4slower [-h] [-j] [-p PID] [min_ms]

Trace common ext4 file operations slower than a threshold

positional arguments:
min_ms             minimum I/O duration to trace, in ms (default 10)

optional arguments:
-h, --help         show this help message and exit
-j, --csv          just print fields: comma-separated values
-p PID, --pid PID trace this PID only

examples:
    ./ext4slower             # trace operations slower than 10 ms (default)
    ./ext4slower 1           # trace operations slower than 1 ms
    ./ext4slower -j 1        # ... 1 ms, parsable output (csv)
    ./ext4slower 0           # trace all operations (warning: verbose)
    ./ext4slower -p 185      # trace PID 185 only

I tried to aly it to MariaDB server while test was running and ended up with the following:

openxs@ao756:~$ sudo /usr/share/bcc/tools/ext4slower 1 -p`pidof mariadbd`
Tracing ext4 operations slower than 1 ms
TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
13:55:17 mariadbd       19702 S 0       0          11.32 ib_logfile0
13:55:17 mariadbd       19702 S 0       0          10.97 ib_logfile0
...
13:55:17 mariadbd       19702 S 0       0          10.51 ib_logfile0
13:55:17 mariadbd       19702 S 0       0          44.53 ibdata1
13:55:17 mariadbd       19702 S 0       0          44.14 sbtest1.ibd
13:55:17 mariadbd       19702 S 0       0          20.13 ib_logfile0
...
13:55:17 mariadbd       19702 S 0       0          11.21 ib_logfile0
13:55:17 mariadbd       19702 S 0       0          43.34 ibdata1
13:55:17 mariadbd       19702 S 0       0         179.07 sbtest1.ibd
13:55:17 mariadbd       19702 S 0       0           6.46 ib_logfile0
...
13:55:18 mariadbd       19702 S 0       0          47.79 ibdata1
13:55:18 mariadbd       19702 S 0       0          18.42 ib_logfile0
13:55:18 mariadbd       19702 S 0       0          41.76 ib_logfile0
13:55:18 mariadbd       19702 S 0       0          60.47 ibdata1

Looks like only sync operation where sometimes slow. To double check that slow writes are also reported I started dd concurrently:

openxs@ao756:~/dbs/maria10.5$ dd if=/dev/zero of=test.dd bs=1024 count=1000000
1000000+0 records in
1000000+0 records out
1024000000 bytes (1,0 GB, 977 MiB) copied, 12,4325 s, 82,4 MB/s

and got the following reported:

openxs@ao756:~$ sudo /usr/share/bcc/tools/ext4slower 10
Tracing ext4 operations slower than 10 ms
TIME     COMM           PID    T BYTES   OFF_KB   LAT(ms) FILENAME
13:57:45 dd             20972 W 1024    317088     10.10 test.dd
13:57:45 dd             20972 W 1024    317576    179.25 test.dd
13:57:45 dd             20972 W 1024    322812     15.42 test.dd
13:57:45 dd             20972 W 1024    327394     14.32 test.dd
13:57:46 dd             20972 W 1024    354096     17.61 test.dd
13:57:46 mysqld         4090   W 114688 1024      272.75 ibdata1
^Copenxs@ao756:~$

We can clearly see dd writes at differenyt offsets, 1024 bytes e4ach. Some of them were very slow. mysqld process is for PMM server running in Docker container.

ext4dist - summarize ext4 operation latency

We can further study histograms of latencies per operation with the ext4dist tool:

openxs@ao756:~$ sudo /usr/share/bcc/tools/ext4dist --help
usage: ext4dist [-h] [-T] [-m] [-p PID] [interval] [count]

Summarize ext4 operation latency

positional arguments:
interval            output interval, in seconds
count               number of outputs

optional arguments:
-h, --help          show this help message and exit
-T, --notimestamp   don't include timestamp on interval output
-m, --milliseconds output in milliseconds
-p PID, --pid PID   trace this PID only

examples:
    ./ext4dist            # show operation latency as a histogram
    ./ext4dist -p 181     # trace PID 181 only
    ./ext4dist 1 10       # print 1 second summaries, 10 times
    ./ext4dist -m 5       # 5s summaries, milliseconds

Here we can see how sync latencies are distributed in general:

openxs@ao756:~$ sudo /usr/share/bcc/tools/ext4dist -T -m
Tracing ext4 operation latency... Hit Ctrl-C to end.
^C

operation = read
     msecs               : count     distribution
         0 -> 1          : 1336     |****************************************|
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 5        |                                        |
        16 -> 31         : 7        |                                        |
        32 -> 63         : 1        |                                        |

operation = write
     msecs               : count     distribution
         0 -> 1          : 10834    |****************************************|
         2 -> 3          : 3        |                                        |
         4 -> 7          : 3        |                                        |
         8 -> 15         : 4        |                                        |
        16 -> 31         : 11       |                                        |
        32 -> 63         : 6        |                                        |
        64 -> 127        : 2        |                                        |

operation = open
     msecs               : count     distribution
         0 -> 1          : 328      |****************************************|

operation = fsync
     msecs               : count     distribution
         0 -> 1          : 8        |                                        |
         2 -> 3          : 2        |                                        |
         4 -> 7          : 40       |                                        |
         8 -> 15         : 2159     |****************************************|
        16 -> 31         : 247      |****                                    |
        32 -> 63         : 319      |*****                                   |
        64 -> 127        : 152      |**                                      |
       128 -> 255        : 84       |*                                       |
       256 -> 511        : 2        |                                        |
openxs@ao756:~$

Other oerations were fast enough.

offwaketime - summarize blocked time by kernel stack trace + waker stack

You may be not even sure what the alication is waiting for, disk I/O or something else. In such cases we can try to profile the off-CPU time spent with tools like offwaketime:

openxs@ao756:~$ sudo /usr/share/bcc/tools/offwaketime --help
usage: offwaketime [-h] [-p PIDS | -t TIDS | -u | -k] [-U | -K] [-d] [-f]
                   [--stack-storage-size STACK_STORAGE_SIZE]
                   [-m MIN_BLOCK_TIME] [-M MAX_BLOCK_TIME] [--state STATE]
                   [duration]

Summarize blocked time by kernel stack trace + waker stack

positional arguments:
duration              duration of trace, in seconds

optional arguments:
-h, --help            show this help message and exit
-p PIDS, --pid PIDS   trace these PIDS only. Can be a comma separated list
                        of PIDS.
-t TIDS, --tid TIDS   trace these TIDS only. Can be a comma separated list
                        of TIDS.
-u, --user-threads-only
                        user threads only (no kernel threads)
-k, --kernel-threads-only
                        kernel threads only (no user threads)
-U, --user-stacks-only
                        show stacks from user space only (no kernel space
                        stacks)
-K, --kernel-stacks-only
                        show stacks from kernel space only (no user space
                        stacks)
-d, --delimited       insert delimiter between kernel/user stacks
-f, --folded          output folded format
--stack-storage-size STACK_STORAGE_SIZE
                        the number of unique stack traces that can be stored
                        and displayed (default 1024)
-m MIN_BLOCK_TIME, --min-block-time MIN_BLOCK_TIME
                        the amount of time in microseconds over which we store
                        traces (default 1)
-M MAX_BLOCK_TIME, --max-block-time MAX_BLOCK_TIME
                        the amount of time in microseconds under which we
                        store traces (default U64_MAX)
--state STATE         filter on this thread state bitmask (eg, 2 ==
                        TASK_UNINTERRUPTIBLE) see include/linux/sched.h

examples:
    ./offwaketime             # trace off-CPU + waker stack time until Ctrl-C
    ./offwaketime 5           # trace for 5 seconds only
    ./offwaketime -f 5        # 5 seconds, and output in folded format
    ./offwaketime -m 1000     # trace only events that last more than 1000 usec
    ./offwaketime -M 9000     # trace only events that last less than 9000 usec
    ./offwaketime -p 185      # only trace threads for PID 185
    ./offwaketime -t 188      # only trace thread 188
    ./offwaketime -u          # only trace user threads (no kernel)
    ./offwaketime -k          # only trace kernel threads (no user)
    ./offwaketime -U          # only show user space stacks (no kernel)
    ./offwaketime -K          # only show kernel space stacks (no user)

But my first attempt to use it failed immediately:

openxs@ao756:~$ sudo /usr/share/bcc/tools/offwaketime -f 5 -p`pidof mariadbd`
could not open bpf map: stack_traces, error: Invalid argument
Traceback (most recent call last):
File "/usr/share/bcc/tools/offwaketime", line 290, in <module>
b = BPF(text=bpf_text)
File "/usr/lib/python2.7/dist-packages/bcc/__init__.py", line 364, in __init__
raise Exception("Failed to compile BPF module %s" % (src_file or "<text>"))
Exception: Failed to compile BPF module <text>

Quck search proved that it is a known problem for Ubuntu 16.04. Wokraround (use a tool from old/ subdirectory for older kernels) helped in this case.

openxs@ao756:~$ sudo /usr/share/bcc/tools/old/offwaketime -f 5
...
mariadbd;entry_SYSCALL_64_fastpath;sys_futex;do_futex;futex_wait;futex_wait_queue_me;schedule;-;futex_wake;do_futex;sys_futex;entry_SYSCALL_64_fastpath;;mariadbd 14206606
percona-qan-api;entry_SYSCALL_64_fastpath;sys_futex;do_futex;futex_wait;futex_wait_queue_me;schedule;-;futex_wake;do_futex;sys_futex;entry_SYSCALL_64_fastpath;;percona-qan-api 14677762
prometheus;entry_SYSCALL_64_fastpath;sys_futex;do_futex;futex_wait;futex_wait_queue_me;schedule;-;futex_wake;do_futex;sys_futex;entry_SYSCALL_64_fastpath;;prometheus 14777599
dockerd;entry_SYSCALL_64_fastpath;sys_futex;do_futex;futex_wait;futex_wait_queue_me;schedule;-;futex_wake;do_futex;sys_futex;entry_SYSCALL_64_fastpath;;dockerd 14925945
consul;entry_SYSCALL_64_fastpath;sys_futex;do_futex;futex_wait;futex_wait_queue_me;schedule;-;futex_wake;do_futex;sys_futex;entry_SYSCALL_64_fastpath;;consul 14973943
mysqld;entry_SYSCALL_64_fastpath;sys_io_getevents;read_events;schedule;-;hrtimer_wakeup;__hrtimer_run_queues;hrtimer_interrupt;local_apic_timer_interrupt;smp_apic_timer_interrupt;apic_timer_interrupt;cpuidle_enter;call_cpuidle;cpu_startup_entry;start_secondary;swapper/1 15502360
prometheus1;entry_SYSCALL_64_fastpath;sys_futex;do_futex;futex_wait;futex_wait_queue_me;schedule;-;futex_wake;do_futex;sys_futex;entry_SYSCALL_64_fastpath;;prometheus1 16239277
sysbench;entry_SYSCALL_64_fastpath;sys_recvfrom;SYSC_recvfrom;sock_recvmsg;unix_stream_recvmsg;unix_stream_read_generic;schedule_timeout;schedule;-;autoremove_wake_function;__wake_up_common;__wake_up_sync_key;sock_def_readable;unix_stream_sendmsg;sock_sendmsg;SYSC_sendto;sys_sendto;entry_SYSCALL_64_fastpath;;mariadbd 19405099
node_exporter;entry_SYSCALL_64_fastpath;sys_futex;do_futex;futex_wait;futex_wait_queue_me;schedule;-;futex_wake;do_futex;sys_futex;entry_SYSCALL_64_fastpath;;node_exporter 23541664
mariadbd;entry_SYSCALL_64_fastpath;sys_futex;do_futex;futex_wait;futex_wait_queue_me;schedule;-;futex_wake_op;do_futex;sys_futex;entry_SYSCALL_64_fastpath;;mariadbd 32664296
openxs@ao756:~$

If I interret the above right, we see waiting on mutex for mariadbd.

Finally, the profile tool:

openxs@ao756:~$ sudo /usr/share/bcc/tools/profile --help
usage: profile [-h] [-p PID | -L TID] [-U | -K] [-F FREQUENCY | -c COUNT] [-d]
               [-a] [-I] [-f] [--stack-storage-size STACK_STORAGE_SIZE]
               [-C CPU] [--cgroupmap CGROUPMAP] [--mntnsmap MNTNSMAP]
               [duration]

Profile CPU stack traces at a timed interval

positional arguments:
duration              duration of trace, in seconds

optional arguments:
-h, --help            show this help message and exit
-p PID, --pid PID     profile process with this PID only
-L TID, --tid TID     profile thread with this TID only
-U, --user-stacks-only
                        show stacks from user space only (no kernel space
                        stacks)
-K, --kernel-stacks-only
                        show stacks from kernel space only (no user space
                        stacks)
-F FREQUENCY, --frequency FREQUENCY
                        sample frequency, Hertz
-c COUNT, --count COUNT
                        sample period, number of events
-d, --delimited       insert delimiter between kernel/user stacks
-a, --annotations     add _[k] annotations to kernel frames
-I, --include-idle    include CPU idle stacks
-f, --folded          output folded format, one line per stack (for flame
                        graphs)
--stack-storage-size STACK_STORAGE_SIZE
                        the number of unique stack traces that can be stored
                        and displayed (default 16384)
-C CPU, --cpu CPU     cpu number to run profile on
--cgroupmap CGROUPMAP
                        trace cgroups in this BPF map only
--mntnsmap MNTNSMAP   trace mount namespaces in this BPF map only

examples:
    ./profile             # profile stack traces at 49 Hertz until Ctrl-C
    ./profile -F 99       # profile stack traces at 99 Hertz
    ./profile -c 1000000 # profile stack traces every 1 in a million events
    ./profile 5           # profile at 49 Hertz for 5 seconds only
    ./profile -f 5        # output in folded format for flame graphs
    ./profile -p 185      # only profile process with PID 185
    ./profile -L 185      # only profile thread with TID 185
    ./profile -U          # only show user space stacks (no kernel)
    ./profile -K          # only show kernel space stacks (no user)
    ./profile --cgroupmap mappath # only trace cgroups in this BPF map
    ./profile --mntnsmap mappath   # only trace mount namespaces in the map

had not worked at all on my old kernel:

openxs@ao756:~$ sudo /usr/share/bcc/tools/profile -U
Sampling at 49 Hertz of all threads by user stack... Hit Ctrl-C to end.
/virtual/main.c:15:10: fatal error: 'uapi/linux/bpf_perf_event.h' file not found
#include <uapi/linux/bpf_perf_event.h>
         ^
1 error generated.
Traceback (most recent call last):
File "/usr/share/bcc/tools/profile", line 277, in <module>
    b = BPF(text=bpf_text)
File "/usr/lib/python2.7/dist-packages/bcc/__init__.py", line 364, in __init__
    raise Exception("Failed to compile BPF module %s" % (src_file or "<text>"))
Exception: Failed to compile BPF module <text>

openxs@ao756:~$ sudo /usr/share/bcc/tools/old/profile -U
Sampling at 49 Hertz of all threads by user stack... Hit Ctrl-C to end.
could not open bpf map: stack_traces, error: Invalid argument
BPF initialization failed. perf_misc_flags() may be inlined in your kernel build.
This tool will be updated in the future to support Linux 4.9, which has reliable profiling support. Exiting.

Header file is missing. Proper eBPF instrumentation to use perf events was added only in kernels 4.9+

To summarize:

bcc tools allow to do detailed disk I/O analysis at different levels, even on old enough Linux kernels like 4.4
funccount tool helps to find out what to trace with more details.
Latencies distribution mattes and histograms provided by some bcc tools may help to understand this.
There is a limit on total number of active probes.
There are alternative implementations of some bcc tools for older kernel versions, but looks like its time for all of us to upgrade to Linux kernels 5.x.y.