Sunday, January 24, 2021

Playing with recent bpftrace and MariaDB 10.5 on Fedora - Part II, Using the Existing Tools

In the previous post in this series I presented a couple of my quick and dirty attempts to use bpftrace to add uprobes to the MariaDB server and the dynamic libraries it uses, to trace queries and their execution times and to collect stack traces related to mutex waits. It looks like for any non-trivial monitoring task we'll end up with more than one probe and will need more processing to produce clean and useful results with minimal CPU and memory impact, both while collecting the data in kernel context and while post-processing them in user space.

I have a long way to go from my lame commands towards this goal, so in this post I decided to check several existing bpftrace programs not directly related to MariaDB, to see how they are structured, how they use built-in variables and functions, and so on. I'll also try to apply some of them to MariaDB 10.5 running the sysbench read write test with high enough concurrency.

The most popular tools are included in the bpftrace source code, along with examples of their usage. We can find them in the tools subdirectory:

[root@fc31 tools]# pwd
/mnt/home/openxs/git/bpftrace/tools
[root@fc31 tools]# ls
bashreadline.bt             loads_example.txt      syscount_example.txt
bashreadline_example.txt    mdflush.bt             tcpaccept.bt
biolatency.bt               mdflush_example.txt    tcpaccept_example.txt
biolatency_example.txt      naptime.bt             tcpconnect.bt
biosnoop.bt                 naptime_example.txt    tcpconnect_example.txt
biosnoop_example.txt        oomkill.bt             tcpdrop.bt
biostacks.bt                oomkill_example.txt    tcpdrop_example.txt
biostacks_example.txt       opensnoop.bt           tcplife.bt
bitesize.bt                 opensnoop_example.txt  tcplife_example.txt
bitesize_example.txt        pidpersec.bt           tcpretrans.bt
capable.bt                  pidpersec_example.txt  tcpretrans_example.txt
capable_example.txt         runqlat.bt             tcpsynbl.bt
CMakeLists.txt              runqlat_example.txt    tcpsynbl_example.txt
cpuwalk.bt                  runqlen.bt             threadsnoop.bt
cpuwalk_example.txt         runqlen_example.txt    threadsnoop_example.txt
dcsnoop.bt                  setuids.bt             vfscount.bt
dcsnoop_example.txt         setuids_example.txt    vfscount_example.txt
execsnoop.bt                statsnoop.bt           vfsstat.bt
execsnoop_example.txt       statsnoop_example.txt  vfsstat_example.txt
gethostlatency.bt           swapin.bt              writeback.bt
gethostlatency_example.txt  swapin_example.txt     writeback_example.txt
killsnoop.bt                syncsnoop.bt           xfsdist.bt
killsnoop_example.txt       syncsnoop_example.txt  xfsdist_example.txt
loads.bt                    syscount.bt

I highlighted the tools I am going to try. But to begin with, let's check the source code of one of them, biosnoop.bt, a non-trivial one at 48 lines of code. My comments follow each code fragment below:

#!/usr/bin/env bpftrace
#include <linux/blkdev.h>

/*
 * biosnoop.bt   Block I/O tracing tool, showing per I/O latency.
 *               For Linux, uses bpftrace, eBPF.
 *
 * TODO: switch to block tracepoints. Add offset and size columns.
 *
 * This is a bpftrace version of the bcc tool of the same name.
 *
 * 15-Nov-2017    Brendan Gregg    Created this.
 */

Here we can see how the shebang first line is used to run the program with bpftrace if the file is executable. The next line shows that in some cases bpftrace (like other eBPF tools) may need headers (here a kernel header) to be able to resolve references to complex structures passed as arguments.

BEGIN
{
    printf("%-12s %-7s %-16s %-6s %7s\n", "TIME(ms)", "DISK", "COMM", "PID", "LAT(ms)");
}

Next we see the BEGIN probe that, same as in awk, is executed once at the beginning of the program and in this case prints the formatted header for the output that follows.
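
Just to illustrate (this is my own trivial example, not one of the tools), a complete bpftrace program may consist of nothing but a BEGIN probe:

#!/usr/bin/env bpftrace

/* Print a message once and exit immediately. */
BEGIN
{
    printf("Hello from bpftrace!\n");
    exit();
}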

kprobe:blk_account_io_start
{
    @start[arg0] = nsecs;
    @iopid[arg0] = pid;
    @iocomm[arg0] = comm;
    @disk[arg0] = ((struct request *)arg0)->rq_disk->disk_name;
}

Here we define a kernel probe for the blk_account_io_start function and store information in 4 associative arrays indexed by arg0: the start timestamp of the call in nanoseconds (the nsecs builtin), the pid of the calling program, the calling command itself (comm), and the disk name that we get from a deeply nested structure behind the first argument of the traced function, arg0, via a type cast and pointers. That's why we needed kernel headers: to reference structure members by name and eventually dereference them at the proper offset/address.

kprobe:blk_account_io_done
/@start[arg0] != 0 && @iopid[arg0] != 0 && @iocomm[arg0] != ""/
{
    $now = nsecs;
    printf("%-12u %-7s %-16s %-6d %7d\n",
        elapsed / 1e6, @disk[arg0], @iocomm[arg0], @iopid[arg0],
        ($now - @start[arg0]) / 1e6);

    delete(@start[arg0]);
    delete(@iopid[arg0]);
    delete(@iocomm[arg0]);
    delete(@disk[arg0]);
}

In the probe above, defined for the blk_account_io_done function, we first make sure to act only if blk_account_io_start was already called for the same request (the same arg0), otherwise we could not match the timestamps properly. The point is that the I/O is completed not when the start function returns, but when blk_account_io_done is eventually called for the same request, so it's not as easy as a kprobe/kretprobe pair on the same function.
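
For functions where the work is done once the same function returns, the simpler kprobe/kretprobe pattern keyed by thread id is enough. A minimal sketch of my own (not related to block I/O) that builds a latency histogram for vfs_read() calls:

kprobe:vfs_read
{
    /* remember when this thread entered the function */
    @start[tid] = nsecs;
}

kretprobe:vfs_read
/@start[tid]/
{
    /* latency in microseconds, aggregated as a power-of-2 histogram */
    @read_us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}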

The action of the probe is simple. We calculate the time difference in milliseconds by comparing the stored start timestamp with the current nsecs value and print the collected details. Then, and this is essential, we delete the element from each associative array (a.k.a. map) with delete(), so that repeated calls from the same process to the same disk are not mixed up together and we do not use more memory than really needed.

END
{
    clear(@start);
    clear(@iopid);
    clear(@iocomm);
    clear(@disk);
}

In the END probe we clear() all associative arrays that were used by the program, essentially deleting all the items in the maps. Otherwise, as we've seen in my lame examples previously, they are dumped when the program ends, and for a tool that produces monitoring output while it runs we definitely do not need that.
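
To see the default behavior that clear() prevents, here is a tiny program of my own that collects a map and never cleans it up: when it exits after 5 seconds, bpftrace prints the @reads map automatically, because neither clear() nor delete() was called on it.

kprobe:vfs_read
{
    /* count vfs_read() calls per command name */
    @reads[comm] = count();
}

interval:s:5
{
    /* no clear() here, so @reads is dumped on exit */
    exit();
}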

Now that we better understand how "real" bpftrace programs are usually structured and designed, let's try to run biosnoop.bt, which traces block I/O and shows the issuing process (at least, the process that was on-CPU at the time of queue insert) and the latency of the I/O:

[root@fc31 tools]# ./biosnoop.bt
Attaching 4 probes...
TIME(ms)     DISK    COMM             PID    LAT(ms)
98           sda     mariadbd         6147         0
98           sda     mariadbd         6147         0
...
195          sda     mariadbd         6147        92
196          sda     mariadbd         6147         0
...
201          sda     mariadbd         6147         0
274          sda     mariadbd         6147        73
275          sda     mariadbd         6147         0
278          sda     mariadbd         6147         2
293          sda     mariadbd         6147         9
295          sda     mariadbd         6147         0
295          sda     mariadbd         6147         0
303          sda     mariadbd         6147         7
303          sda     mariadbd         6147         0
303          sda     mariadbd         6147         0
303          sda     mariadbd         6147         0
304          sda     mariadbd         6147         0
304          sda     mariadbd         6147         0
305          sda     mariadbd         6147         0
305          sda     mariadbd         6147         0
306          sda     mariadbd         6147         0
335          sda     mariadbd         6147        29
336          sda     jbd2/dm-0-8      419         14
337          sda     mariadbd         6147         1
365          sda     jbd2/dm-0-8      419         28
365          sda     mariadbd         6147        28
365          sda     kworker/2:4      24472        0
392          sda     mariadbd         6147        26
...

Here we see block I/O requests for my only disk, sda, from mariadbd and a few other processes, with timestamps counted from the start of tracing and the related latency (less than 1 millisecond in most cases, but sometimes approaching 100 milliseconds on this HDD).
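
On a busier server it may make sense to print only the slow I/Os. Here is a sketch of my own of how the blk_account_io_done probe shown earlier could be changed to report only requests that took more than 10 milliseconds, while still deleting the map entries unconditionally:

kprobe:blk_account_io_done
/@start[arg0] != 0/
{
    /* report only I/Os slower than 10 ms */
    if ((nsecs - @start[arg0]) > 10000000) {
        printf("%-7s %-16s %-6d %7d\n", @disk[arg0], @iocomm[arg0],
            @iopid[arg0], (nsecs - @start[arg0]) / 1000000);
    }

    /* always clean up, fast or slow, to avoid growing the maps */
    delete(@start[arg0]);
    delete(@iopid[arg0]);
    delete(@iocomm[arg0]);
    delete(@disk[arg0]);
}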

The next tool to check is biolatency.bt, which traces block I/O and shows the latency as a power-of-2 histogram built with the hist() function:

[root@fc31 tools]# ./biolatency.bt
Attaching 4 probes...
Tracing block device I/O... Hit Ctrl-C to end.
^C


@usecs:
[128, 256)            31 |@@                                                  |
[256, 512)           421 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       |
[512, 1K)            754 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1K, 2K)              94 |@@@@@@                                              |
[2K, 4K)              38 |@@                                                  |
[4K, 8K)              36 |@@                                                  |
[8K, 16K)             65 |@@@@                                                |
[16K, 32K)           109 |@@@@@@@                                             |
[32K, 64K)            80 |@@@@@                                               |
[64K, 128K)           25 |@                                                   |
[128K, 256K)          16 |@                                                   |

So, during the monitoring interval, until I hit Ctrl+C, the majority of block I/O calls had a latency of 256 to 1024 microseconds, less than 1 millisecond. There were a few longer calls too.
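
The logic behind such a histogram is the same start/done pair of kprobes we saw in biosnoop.bt, just aggregated with hist() instead of printing every I/O. A simplified sketch of my own (not necessarily the exact tool code):

kprobe:blk_account_io_start
{
    @start[arg0] = nsecs;
}

kprobe:blk_account_io_done
/@start[arg0]/
{
    /* latency in microseconds goes into a single global histogram */
    @usecs = hist((nsecs - @start[arg0]) / 1000);
    delete(@start[arg0]);
}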

The next tool to check is biostacks.bt, which is supposed to show block I/O latency as a histogram, together with the kernel stack trace that initiated the I/O (do you still remember /proc kernel stacks sampling that can be used for similar purposes?). This can help to explain disk I/O that is not directly requested by applications. I've got the following:

[root@fc31 tools]# ./biostacks.bt > /tmp/biostacks.txt
cannot attach kprobe, probe entry may not exist
WARNING: could not attach probe kprobe:blk_start_request, skipping.
Attaching 5 probes...
Tracing block I/O with init stacks. Hit Ctrl-C to end.

^C[root@fc31 tools]# more /tmp/biostacks.txt

...

@usecs[
    blk_account_io_start+1
    blk_mq_make_request+481
    generic_make_request+653
    submit_bio+75
    ext4_io_submit+73
    ext4_writepages+694
    do_writepages+51
    __filemap_fdatawrite_range+172
    file_write_and_wait_range+107
    ext4_sync_file+240
    do_fsync+56
    __x64_sys_fdatasync+19
    do_syscall_64+77
    entry_SYSCALL_64_after_hwframe+68
]:
[4K, 8K)               2 |@@                                                  |
[8K, 16K)              2 |@@                                                  |
[16K, 32K)             9 |@@@@@@@@@@@@                                        |
[32K, 64K)             0 |                                                    |
[64K, 128K)            1 |@                                                   |
[128K, 256K)          38 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256K, 512K)          19 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[512K, 1M)             5 |@@@@@@                                              |
[1M, 2M)               0 |                                                    |
[2M, 4M)               4 |@@@@@                                               |
[4M, 8M)               9 |@@@@@@@@@@@@                                        |
[8M, 16M)              2 |@@                                                  |
[16M, 32M)             1 |@                                                   |
[32M, 64M)             1 |@                                                   |
[64M, 128M)            1 |@                                                   |

@usecs[
    blk_account_io_start+1
    blk_mq_make_request+481
    generic_make_request+653
    submit_bio+75
    ext4_io_submit+73
    ext4_bio_write_page+609
    mpage_submit_page+97
    mpage_process_page_bufs+274
    mpage_prepare_extent_to_map+437
    ext4_writepages+668
    do_writepages+51
    __filemap_fdatawrite_range+172
    file_write_and_wait_range+107
    ext4_sync_file+240
    do_fsync+56
    __x64_sys_fdatasync+19
    do_syscall_64+77
    entry_SYSCALL_64_after_hwframe+68
]:
[16K, 32K)             8 |@                                                   |
[32K, 64K)            31 |@@@@                                                |
[64K, 128K)           14 |@@                                                  |
[128K, 256K)         346 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256K, 512K)         180 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
[512K, 1M)            22 |@@@                                                 |
[1M, 2M)               2 |                                                    |
[2M, 4M)               3 |                                                    |
[4M, 8M)               3 |                                                    |
[8M, 16M)              2 |                                                    |
[16M, 32M)             2 |                                                    |
[32M, 64M)             1 |                                                    |
[64M, 128M)            1 |                                                    |

Above I kept a couple of stacks with typically high latency, related to fdatasync calls on ext4.
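
By the way, if all we want is to see which kernel code paths initiate block I/O, without the latency histograms, a much simpler sketch of my own would just count the kernel stacks:

kprobe:blk_account_io_start
{
    /* count block I/O submissions per initiating kernel stack */
    @[kstack] = count();
}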

bitesize.bt allows us to see the usual block I/O request sizes, in bytes, per program. This is what I've got for MariaDB:

[root@fc31 tools]# ./bitesize.bt
Attaching 3 probes...
Tracing block device I/O... Hit Ctrl-C to end.
^C
I/O size (bytes) histograms by process name:

@[NetworkManager]:
[4K, 8K)               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@[jbd2/dm-0-8]:
[4K, 8K)               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K)              0 |                                                    |
[16K, 32K)             0 |                                                    |
[32K, 64K)             0 |                                                    |
[64K, 128K)            0 |                                                    |
[128K, 256K)           1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

...
@[mariadbd]:
[0]                    3 |@                                                   |
[1]                    0 |                                                    |
[2, 4)                 0 |                                                    |
[4, 8)                 0 |                                                    |
[8, 16)                0 |                                                    |
[16, 32)               0 |                                                    |
[32, 64)               0 |                                                    |
[64, 128)              0 |                                                    |
[128, 256)             0 |                                                    |
[256, 512)             0 |                                                    |
[512, 1K)              0 |                                                    |
[1K, 2K)               0 |                                                    |
[2K, 4K)               0 |                                                    |
[4K, 8K)               0 |                                                    |
[8K, 16K)              0 |                                                    |
[16K, 32K)            94 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K)            26 |@@@@@@@@@@@@@@                                      |
[64K, 128K)           16 |@@@@@@@@                                            |
[128K, 256K)           8 |@@@@                                                |
[256K, 512K)           1 |                                                    |
[512K, 1M)             2 |@                                                   |
[1M, 2M)               7 |@@@                                                 |
...

The majority of I/Os were in the 16K to 32K range, that is, one or two InnoDB data pages.
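
Such a histogram keyed by process name is easy to build ourselves; a sketch of my own along the same lines, based on the block_rq_issue tracepoint (the real tool may differ in details):

tracepoint:block:block_rq_issue
{
    /* histogram of requested bytes, one per issuing command */
    @[args->comm] = hist(args->bytes);
}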

The next well-known tool is opensnoop.bt, which traces the open() syscalls system-wide and prints various details. This is what I've got while the sysbench test was starting:

[root@fc31 tools]# ./opensnoop.bt
Attaching 6 probes...
Tracing open syscalls... Hit Ctrl-C to end.
PID    COMM               FD ERR PATH
...
25558  sysbench           29   0 /usr/local/share/sysbench/oltp_read_write.lua
25558  sysbench           29   0 /usr/local/share/sysbench/oltp_read_write.lua
6147   mariadbd           45   0 ./sbtest/db.opt
6147   mariadbd           31   0 ./sbtest/db.opt
6147   mariadbd           35   0 ./sbtest/db.opt
6147   mariadbd           29   0 ./sbtest/db.opt
6147   mariadbd           27   0 ./sbtest/db.opt

25558  sysbench           -1   2 ./oltp_common.lua
25558  sysbench           -1   2 ./oltp_common/init.lua
25558  sysbench           -1   2 ./src/lua/oltp_common.lua
25558  sysbench           -1   2 /home/openxs/.luarocks/share/lua/5.1/oltp_common.lua
25558  sysbench           -1   2 /home/openxs/.luarocks/share/lua/5.1/oltp_common/init.lua
25558  sysbench           -1   2 /home/openxs/.luarocks/share/lua/oltp_common.lua
25558  sysbench           -1   2 /home/openxs/.luarocks/share/lua/oltp_common/init.lua
25558  sysbench           -1   2 /usr/local/share/lua/5.1/oltp_common.lua
25558  sysbench           -1   2 /usr/share/lua/5.1/oltp_common.lua

25558  sysbench           32   0 /usr/local/share/sysbench/oltp_common.lua
25558  sysbench           32   0 /usr/local/share/sysbench/oltp_common.lua
6147   mariadbd           80   0 ./sbtest/db.opt
...

It's interesting to find out that sysbench tries to open the .lua files, based on test names, in several predefined locations where they do not exist (hence error 2, ENOENT), before finding them under /usr/local/share/sysbench.
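
If needed, this kind of tracing is easy to limit to a single program. A sketch of my own that prints only openat() calls issued by sysbench (the comm value is obviously specific to this test):

tracepoint:syscalls:sys_enter_openat
/comm == "sysbench"/
{
    /* pid, command and the path being opened */
    printf("%-6d %-16s %s\n", pid, comm, str(args->filename));
}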

There are also similar tools to trace stat() and sync() calls:

[root@fc31 tools]# ./statsnoop.bt
Attaching 10 probes...
Tracing stat syscalls... Hit Ctrl-C to end.
PID    COMM             ERR PATH
...
6147   mariadbd           0 /home/openxs/dbs/maria10.5/data/mysql/table_stats.MAI
6147   mariadbd           0 ./mysql/table_stats.MAI
6147   mariadbd           0 ./mysql/table_stats.MAD
6147   mariadbd           0 /home/openxs/dbs/maria10.5/data/mysql
6147   mariadbd           0 /home/openxs/dbs/maria10.5/data/mysql/column_stats.MAI
6147   mariadbd           0 ./mysql/column_stats.MAI
6147   mariadbd           0 ./mysql/column_stats.MAD

...
6147   mariadbd           0 /home/openxs/dbs/maria10.5/data/mysql/table_stats.MAI
6147   mariadbd           0 ./mysql/table_stats.MAI
6147   mariadbd           0 /home/openxs/dbs/maria10.5/data/mysql
6147   mariadbd           0 /home/openxs/dbs/maria10.5/data/mysql/column_stats.MAI
6147   mariadbd           0 ./mysql/column_stats.MAI
6147   mariadbd           0 /home/openxs/dbs/maria10.5/data/mysql
6147   mariadbd           0 /home/openxs/dbs/maria10.5/data/mysql/table_stats.MAI
6147   mariadbd           0 ./mysql/table_stats.MAI
6147   mariadbd           0 ./sbtest/sbtest3.frm
6147   mariadbd           0 ./sbtest/sbtest1.frm
6147   mariadbd           0 ./sbtest/sbtest2.frm

...

[root@fc31 tools]# ./syncsnoop.bt
Attaching 7 probes...
Tracing sync syscalls... Hit Ctrl-C to end.
TIME      PID    COMM             EVENT
12:10:36  621    auditd           tracepoint:syscalls:sys_enter_fsync
12:10:36  6147   mariadbd         tracepoint:syscalls:sys_enter_fdatasync
12:10:36  6147   mariadbd         tracepoint:syscalls:sys_enter_fdatasync

...

We see that the MariaDB server checks not only the sbtest.* tables used in the test, but also the tables with engine-independent statistics in the mysql database. In the latter output we also get a wall clock timestamp of the call (getting those is a problem with Performance Schema, by the way), provided by the time() builtin function of bpftrace. Note that the printed timestamp is asynchronous: it is the time at which user space processed the queued-up event, not the time at which the bpftrace probe called time().
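
The time() builtin accepts a strftime-like format string, so adding wall clock timestamps to any probe is trivial. A sketch of my own that timestamps every fdatasync() call:

tracepoint:syscalls:sys_enter_fdatasync
{
    /* wall clock time, then pid and command */
    time("%H:%M:%S ");
    printf("%-6d %s\n", pid, comm);
}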

Some tools may not work at all, for example:

[root@fc31 tools]# ./threadsnoop.bt
./threadsnoop.bt:19-21: ERROR: uprobe target file '/lib/x86_64-linux-gnu/libpthread.so.0' does not exist or is not executable

[openxs@fc31 ~]$ ldd /home/openxs/dbs/maria10.5/bin/mariadbd | grep thread
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f3d957bf000)

That's because the library in my case is in a different directory.
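
A quick way around this is to copy the tool and point the uprobe at the path reported by ldd above. A minimal sketch of my own that traces pthread_create() calls via the Fedora path and resolves the thread start routine with usym() (the real threadsnoop.bt prints more details):

uprobe:/lib64/libpthread.so.0:pthread_create
{
    /* arg2 of pthread_create() is the thread start routine */
    printf("%-6d %-16s %s\n", pid, comm, usym(arg2));
}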

One can trace vfs_* functions too. For example, vfscount.bt just traces and counts all VFS calls:

[root@fc31 tools]# ./vfscount.bt
Attaching 65 probes...
Tracing VFS calls... Hit Ctrl-C to end.
^C

@[vfs_test_lock]: 2
@[vfs_symlink]: 3
@[vfs_setxattr]: 3
@[vfs_getxattr]: 4
@[vfs_mkdir]: 9
@[vfs_rmdir]: 9
@[vfs_rename]: 19
@[vfs_readlink]: 105
@[vfs_unlink]: 113
@[vfs_fallocate]: 394
@[vfs_statfs]: 450
@[vfs_lock_file]: 1081
@[vfs_fsync_range]: 1752
@[vfs_statx]: 2789
@[vfs_statx_fd]: 2846
@[vfs_open]: 2925
@[vfs_getattr]: 5360
@[vfs_getattr_nosec]: 5490
@[vfs_writev]: 6340
@[vfs_readv]: 12482
@[vfs_write]: 161284
@[vfs_read]: 307655

We can also summarize calls per second instead, as vfsstat.bt does:

[root@fc31 tools]# ./vfsstat.bt
Attaching 11 probes...
Tracing key VFS calls... Hit Ctrl-C to end.
12:15:14
@[vfs_open]: 22
@[vfs_writev]: 64
@[vfs_readv]: 124
@[vfs_read]: 1631
@[vfs_write]: 2015

12:15:15
@[vfs_readv]: 96
@[vfs_write]: 1006
@[vfs_read]: 1201
@[vfs_writev]: 3093

12:15:16
@[vfs_open]: 54
@[vfs_readv]: 139
@[vfs_writev]: 153
@[vfs_read]: 2640
@[vfs_write]: 4003

12:15:17
@[vfs_open]: 6
@[vfs_writev]: 89
@[vfs_readv]: 132
@[vfs_write]: 1709
@[vfs_read]: 3904

12:15:18
@[vfs_readv]: 271
@[vfs_write]: 1689
@[vfs_writev]: 2709
@[vfs_read]: 4479

^C

The interval probe is used for this in the code:

interval:s:1
{
    time();
    print(@);
    clear(@);
}
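
The same interval pattern is easy to reuse for ad hoc tools. For example, a sketch of my own that counts only vfs_read() and vfs_write() calls made by mariadbd, printed once per second:

kprobe:vfs_read,
kprobe:vfs_write
/comm == "mariadbd"/
{
    /* count per traced function */
    @[func] = count();
}

interval:s:1
{
    /* print and reset the counters every second */
    time();
    print(@);
    clear(@);
}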

Let me check the last tool for today, writeback.bt, which traces when the kernel writeback procedure is writing dirty pages to disk, and shows details such as the time, device numbers, reason for the writeback, and the duration:

[root@fc31 tools]# ./writeback.bt
Attaching 4 probes...
Tracing writeback... Hit Ctrl-C to end.
TIME      DEVICE   PAGES    REASON           ms
12:16:02  253:0    65486    background       0.000
12:16:02  253:0    43351    periodic         0.005
12:16:02  253:0    43351    periodic         0.005
12:16:02  253:0    43351    periodic         0.000
12:16:02  253:0    65534    background       0.045
12:16:02  253:0    65534    background       0.000
12:16:05  253:0    43508    periodic         0.006
12:16:06  253:0    43575    periodic         0.004
12:16:06  8:0      43575    periodic         0.004
12:16:06  253:0    43575    periodic         0.001
12:16:07  253:0    65495    background       434.285
12:16:07  253:0    43947    periodic         0.005
12:16:07  253:0    43676    periodic         0.000
12:16:07  253:0    43549    periodic         22.272
12:16:07  253:0    43549    periodic         0.001
12:16:11  253:0    43604    periodic         301.541
12:16:11  253:0    43528    periodic         147.890
12:16:11  253:0    43433    periodic         119.225
12:16:11  253:0    43433    periodic         0.004
12:16:11  253:0    43433    periodic         0.000
^C

We clearly see notable time spent on background writeback at 12:16:07 and on periodic flushes at 12:16:11.

Yet another source of tools to check is Brendan Gregg's "BPF Performance Tools" book and the related source code examples. You can get them from GitHub as follows:

[openxs@fc31 git]$ git clone https://github.com/brendangregg/bpf-perf-tools-book.git
Cloning into 'bpf-perf-tools-book'...
remote: Enumerating objects: 600, done.
remote: Total 600 (delta 0), reused 0 (delta 0), pack-reused 600
Receiving objects: 100% (600/600), 991.41 KiB | 2.73 MiB/s, done.
Resolving deltas: 100% (394/394), done.
[openxs@fc31 git]$ cd bpf-perf-tools-book/
[openxs@fc31 bpf-perf-tools-book]$ ls
exercises  images  originals  README.md  updated

One day I'll check them in more detail and share some outputs. This blog post is already too long...

Tracing across the River Thames with minimal impact back in 2019 :)

* * *

To summarize:

  1. You should check all the tools/*.bt programs and related examples, both to know what is ready to use and to study how proper bpftrace programs are written.
  2. Some tools may rely on kernel headers being available or on specific pathnames of the libraries probed.
  3. The tools presented in this post are good for studying disk I/O issues that may impact MariaDB performance. Some are useful alternatives to strace'ing everything...
  4. See also my older post "BCC Tools for disk I/O Analysis and More", where bcc tools were used to monitor disk I/O on older Ubuntu 16.04. Some of them are more advanced and may have no bpftrace-based alternatives, but we all know that bpftrace is the future :)
