In the previous post in this series I presented a couple of my quick and dirty attempts to use bpftrace to add uprobes to the MariaDB server and the dynamic libraries it uses, in order to trace queries and their execution times and to collect stack traces related to mutex waits. It looks like for any non-trivial monitoring task we are going to end up with more than one probe, and will need to do more processing to produce clean and useful results with minimal CPU and memory impact, both while collecting the data in kernel context and while processing it in user space.
I still have a long way to go towards this goal with my own lame code, so in this post I decided to check several existing bpftrace programs not directly related to MariaDB, to see how they are structured, how they use built-in variables and functions, and so on. I'll also try to apply some of them to MariaDB 10.5 running the sysbench read-write test with high enough concurrency.
The most popular tools are included in the bpftrace source code, along with examples of their usage. We can find them in the tools subdirectory:
[root@fc31 tools]# pwd
/mnt/home/openxs/git/bpftrace/tools
[root@fc31 tools]# ls
bashreadline.bt loads_example.txt syscount_example.txt
bashreadline_example.txt mdflush.bt tcpaccept.bt
biolatency.bt mdflush_example.txt tcpaccept_example.txt
biolatency_example.txt naptime.bt tcpconnect.bt
biosnoop.bt naptime_example.txt tcpconnect_example.txt
biosnoop_example.txt oomkill.bt tcpdrop.bt
biostacks.bt oomkill_example.txt tcpdrop_example.txt
biostacks_example.txt opensnoop.bt tcplife.bt
bitesize.bt opensnoop_example.txt tcplife_example.txt
bitesize_example.txt pidpersec.bt tcpretrans.bt
capable.bt pidpersec_example.txt tcpretrans_example.txt
capable_example.txt runqlat.bt tcpsynbl.bt
CMakeLists.txt runqlat_example.txt tcpsynbl_example.txt
cpuwalk.bt runqlen.bt threadsnoop.bt
cpuwalk_example.txt runqlen_example.txt threadsnoop_example.txt
dcsnoop.bt setuids.bt vfscount.bt
dcsnoop_example.txt setuids_example.txt vfscount_example.txt
execsnoop.bt statsnoop.bt vfsstat.bt
execsnoop_example.txt statsnoop_example.txt vfsstat_example.txt
gethostlatency.bt swapin.bt writeback.bt
gethostlatency_example.txt swapin_example.txt writeback_example.txt
killsnoop.bt syncsnoop.bt xfsdist.bt
killsnoop_example.txt syncsnoop_example.txt xfsdist_example.txt
loads.bt syscount.bt
I highlighted the tools I am going to try. But to begin with, let's check the source code of one of them, biosnoop.bt, with its non-trivial, 48-line code. My comments follow each code fragment below:
#!/usr/bin/env bpftrace
#include <linux/blkdev.h>
/*
* biosnoop.bt Block I/O tracing tool, showing per I/O latency.
* For Linux, uses bpftrace, eBPF.
*
* TODO: switch to block tracepoints. Add offset and size columns.
*
* This is a bpftrace version of the bcc tool of the same name.
*
* 15-Nov-2017 Brendan Gregg Created this.
*/
Here we can see how the shebang first line is used to run the program with bpftrace when the file is executable. The next line shows that in some cases bpftrace (like other eBPF tools) may need headers (here a kernel header) to resolve references to complex structures passed as arguments.
BEGIN
{
    printf("%-12s %-7s %-16s %-6s %7s\n", "TIME(ms)", "DISK", "COMM", "PID", "LAT(ms)");
}
Next comes the BEGIN probe which, as in awk, is executed once at the beginning of the program; in this case it prints the formatted header for the further output.
kprobe:blk_account_io_start
{
    @start[arg0] = nsecs;
    @iopid[arg0] = pid;
    @iocomm[arg0] = comm;
    @disk[arg0] = ((struct request *)arg0)->rq_disk->disk_name;
}
Here we define a kernel probe for the blk_account_io_start function and store information in 4 associative arrays indexed by arg0: the nanosecond timestamp of the call (nsecs), the pid of the calling program, the calling command itself (comm), and the disk name that we get from a deeply nested structure behind the first argument of the traced function, arg0, via a type cast and pointers. That's why we needed kernel headers: to reference structure members by name and eventually dereference the proper offset/address.
kprobe:blk_account_io_done
/@start[arg0] != 0 && @iopid[arg0] != 0 && @iocomm[arg0] != ""/
{
    $now = nsecs;
    printf("%-12u %-7s %-16s %-6d %7d\n",
        elapsed / 1e6, @disk[arg0], @iocomm[arg0], @iopid[arg0],
        ($now - @start[arg0]) / 1e6);

    delete(@start[arg0]);
    delete(@iopid[arg0]);
    delete(@iocomm[arg0]);
    delete(@disk[arg0]);
}
In the probe above, defined for the blk_account_io_done function, we first make sure to do something only if blk_account_io_start was already hit for the same request (the same arg0), otherwise we could not match the times properly. The point is that the I/O is completed not when the start function returns, but later, when the kernel calls the done function for the same request, so a simple kprobe/kretprobe pair on the same function would not measure the real latency.
The action of the probe is simple. We calculate the time difference in milliseconds by comparing the stored start timestamp with the current nsecs value, and output the collected details. Then, and this is essential, we delete the element from each associative array (a.k.a. map) with delete(), so that repeated calls from the same process to the same disk are not mixed up and we do not use more memory than really needed.
END
{
    clear(@start);
    clear(@iopid);
    clear(@iocomm);
    clear(@disk);
}
In the END probe we clear() all associative arrays used by the program, essentially deleting all items in the maps. Otherwise, as we've seen in my lame examples previously, they are dumped at the end of the program, and for this tool, which produces monitoring output while it runs, we definitely do not need that.
Now that we better understand how "real" bpftrace programs are usually structured and designed, let's try to run biosnoop.bt that traces block I/O, and shows the issuing process (at least, the process that was on-CPU at the time of queue insert) and the latency of the I/O:
[root@fc31 tools]# ./biosnoop.bt
Attaching 4 probes...
TIME(ms) DISK COMM PID LAT(ms)
98 sda mariadbd 6147 0
98 sda mariadbd 6147 0
...
195 sda mariadbd 6147 92
196 sda mariadbd 6147 0
...
201 sda mariadbd 6147 0
274 sda mariadbd 6147 73
275 sda mariadbd 6147 0
278 sda mariadbd 6147 2
293 sda mariadbd 6147 9
295 sda mariadbd 6147 0
295 sda mariadbd 6147 0
303 sda mariadbd 6147 7
303 sda mariadbd 6147 0
303 sda mariadbd 6147 0
303 sda mariadbd 6147 0
304 sda mariadbd 6147 0
304 sda mariadbd 6147 0
305 sda mariadbd 6147 0
305 sda mariadbd 6147 0
306 sda mariadbd 6147 0
335 sda mariadbd 6147 29
336 sda jbd2/dm-0-8 419 14
337 sda mariadbd 6147 1
365 sda jbd2/dm-0-8 419 28
365 sda mariadbd 6147 28
365 sda kworker/2:4 24472 0
392 sda mariadbd 6147 26
...
Here we see block I/O requests to my only disk, sda, from mariadbd and a few other processes, with timestamps counted from the start of the trace and the related latency (less than 1 millisecond in most cases, but sometimes approaching 100 milliseconds on this HDD).
The next tool to check is biolatency.bt, which traces block I/O and shows latency as a power-of-2 histogram produced by the hist() function:
[root@fc31 tools]# ./biolatency.bt
Attaching 4 probes...
Tracing block device I/O... Hit Ctrl-C to end.
^C
@usecs:
[128, 256) 31 |@@ |
[256, 512) 421 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[512, 1K) 754 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1K, 2K) 94 |@@@@@@ |
[2K, 4K) 38 |@@ |
[4K, 8K) 36 |@@ |
[8K, 16K) 65 |@@@@ |
[16K, 32K) 109 |@@@@@@@ |
[32K, 64K) 80 |@@@@@ |
[64K, 128K) 25 |@ |
[128K, 256K) 16 |@ |
So, during the monitoring interval, until I hit Ctrl+C, the majority of block I/O calls had a latency of 256 to 1024 microseconds, that is, less than 1 millisecond. There were a few longer waits too.
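The whole data collection behind such a histogram fits in a few lines. Below is a minimal sketch of the same idea (not the actual biolatency.bt, and the blk_account_io_* kprobes may be named differently or be missing on other kernels): we store a timestamp per request on start and feed the delta to hist() on completion:
#!/usr/bin/env bpftrace
/* Minimal sketch of a block I/O latency histogram, assuming the
 * blk_account_io_* kprobes exist on this kernel. */
kprobe:blk_account_io_start
{
    @start[arg0] = nsecs;
}

kprobe:blk_account_io_done
/@start[arg0]/
{
    /* latency in microseconds, bucketed into a power-of-2 histogram */
    @usecs = hist((nsecs - @start[arg0]) / 1000);
    delete(@start[arg0]);
}

END
{
    clear(@start);   // do not dump leftover timestamps on exit
}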
The next tool to check is biostacks.bt, which shows block I/O latency as a histogram keyed by the kernel stack trace that initiated the I/O (do you still remember /proc kernel stack sampling that can be used for similar purposes?). This can help explain disk I/O that is not directly requested by applications. I got the following:
[root@fc31 tools]# ./biostacks.bt > /tmp/biostacks.txt
cannot attach kprobe, probe entry may not exist
WARNING: could not attach probe kprobe:blk_start_request, skipping.
Attaching 5 probes...
Tracing block I/O with init stacks. Hit Ctrl-C to end.
^C[root@fc31 tools]# more /tmp/biostacks.txt
...
@usecs[
blk_account_io_start+1
blk_mq_make_request+481
generic_make_request+653
submit_bio+75
ext4_io_submit+73
ext4_writepages+694
do_writepages+51
__filemap_fdatawrite_range+172
file_write_and_wait_range+107
ext4_sync_file+240
do_fsync+56
__x64_sys_fdatasync+19
do_syscall_64+77
entry_SYSCALL_64_after_hwframe+68
]:
[4K, 8K) 2 |@@ |
[8K, 16K) 2 |@@ |
[16K, 32K) 9 |@@@@@@@@@@@@ |
[32K, 64K) 0 | |
[64K, 128K) 1 |@ |
[128K, 256K) 38 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256K, 512K) 19 |@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[512K, 1M) 5 |@@@@@@ |
[1M, 2M) 0 | |
[2M, 4M) 4 |@@@@@ |
[4M, 8M) 9 |@@@@@@@@@@@@ |
[8M, 16M) 2 |@@ |
[16M, 32M) 1 |@ |
[32M, 64M) 1 |@ |
[64M, 128M) 1 |@ |
@usecs[
blk_account_io_start+1
blk_mq_make_request+481
generic_make_request+653
submit_bio+75
ext4_io_submit+73
ext4_bio_write_page+609
mpage_submit_page+97
mpage_process_page_bufs+274
mpage_prepare_extent_to_map+437
ext4_writepages+668
do_writepages+51
__filemap_fdatawrite_range+172
file_write_and_wait_range+107
ext4_sync_file+240
do_fsync+56
__x64_sys_fdatasync+19
do_syscall_64+77
entry_SYSCALL_64_after_hwframe+68
]:
[16K, 32K) 8 |@ |
[32K, 64K) 31 |@@@@ |
[64K, 128K) 14 |@@ |
[128K, 256K) 346 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256K, 512K) 180 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[512K, 1M) 22 |@@@ |
[1M, 2M) 2 | |
[2M, 4M) 3 | |
[4M, 8M) 3 | |
[8M, 16M) 2 | |
[16M, 32M) 2 | |
[32M, 64M) 1 | |
[64M, 128M) 1 | |
I kept only a couple of stacks with typically high latency, both related to fdatasync on ext4.
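The trick that makes this output possible is that a kernel stack (the kstack builtin) can be stored in a map and later used as a histogram key. Here is a stripped-down sketch of the idea, again assuming the blk_account_io_* kprobes are available (the real biostacks.bt is more careful and also handles request insertion vs. issue separately):
kprobe:blk_account_io_start
{
    @reqstack[arg0] = kstack;   // kernel stack that initiated this I/O
    @reqts[arg0] = nsecs;
}

kprobe:blk_account_io_done
/@reqts[arg0]/
{
    /* one latency histogram (in microseconds) per initiating kernel stack */
    @usecs[@reqstack[arg0]] = hist((nsecs - @reqts[arg0]) / 1000);
    delete(@reqstack[arg0]);
    delete(@reqts[arg0]);
}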
bitesize.bt lets us see the usual block I/O request sizes, in bytes, per program. This is what I got for MariaDB:
[root@fc31 tools]# ./bitesize.bt
Attaching 3 probes...
Tracing block device I/O... Hit Ctrl-C to end.
^C
I/O size (bytes) histograms by process name:
@[NetworkManager]:
[4K, 8K) 1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
@[jbd2/dm-0-8]:
[4K, 8K) 1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K) 0 | |
[16K, 32K) 0 | |
[32K, 64K) 0 | |
[64K, 128K) 0 | |
[128K, 256K) 1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
...
@[mariadbd]:
[0] 3 |@ |
[1] 0 | |
[2, 4) 0 | |
[4, 8) 0 | |
[8, 16) 0 | |
[16, 32) 0 | |
[32, 64) 0 | |
[64, 128) 0 | |
[128, 256) 0 | |
[256, 512) 0 | |
[512, 1K) 0 | |
[1K, 2K) 0 | |
[2K, 4K) 0 | |
[4K, 8K) 0 | |
[8K, 16K) 0 | |
[16K, 32K) 94 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K) 26 |@@@@@@@@@@@@@@ |
[64K, 128K) 16 |@@@@@@@@ |
[128K, 256K) 8 |@@@@ |
[256K, 512K) 1 | |
[512K, 1M) 2 |@ |
[1M, 2M) 7 |@@@ |
...
The majority of I/Os were in the 16K to 32K range, one or two InnoDB data pages.
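If I read the tool correctly, its data collection is essentially a one-liner over a block tracepoint. A hedged sketch of the idea (the tracepoint name and its arguments are an assumption worth verifying with bpftrace -lv on your kernel):
tracepoint:block:block_rq_issue
{
    /* histogram of request size in bytes, keyed by the issuing command */
    @[args->comm] = hist(args->bytes);
}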
The next well-known tool is opensnoop.bt, which traces open() syscalls system-wide and prints various details. This is what I got while the sysbench test was starting:
[root@fc31 tools]# ./opensnoop.bt
Attaching 6 probes...
Tracing open syscalls... Hit Ctrl-C to end.
PID COMM FD ERR PATH
...
25558 sysbench 29 0 /usr/local/share/sysbench/oltp_read_write.lua
25558 sysbench 29 0 /usr/local/share/sysbench/oltp_read_write.lua
6147 mariadbd 45 0 ./sbtest/db.opt
6147 mariadbd 31 0 ./sbtest/db.opt
6147 mariadbd 35 0 ./sbtest/db.opt
6147 mariadbd 29 0 ./sbtest/db.opt
6147 mariadbd 27 0 ./sbtest/db.opt
25558 sysbench -1 2 ./oltp_common.lua
25558 sysbench -1 2 ./oltp_common/init.lua
25558 sysbench -1 2 ./src/lua/oltp_common.lua
25558 sysbench -1 2 /home/openxs/.luarocks/share/lua/5.1/oltp_common.lua
25558 sysbench -1 2 /home/openxs/.luarocks/share/lua/5.1/oltp_common/init.lua
25558 sysbench -1 2 /home/openxs/.luarocks/share/lua/oltp_common.lua
25558 sysbench -1 2 /home/openxs/.luarocks/share/lua/oltp_common/init.lua
25558 sysbench -1 2 /usr/local/share/lua/5.1/oltp_common.lua
25558 sysbench -1 2 /usr/share/lua/5.1/oltp_common.lua
25558 sysbench 32 0 /usr/local/share/sysbench/oltp_common.lua
25558 sysbench 32 0 /usr/local/share/sysbench/oltp_common.lua
6147 mariadbd 80 0 ./sbtest/db.opt
...
It's interesting to find out that sysbench, based on the test name, tries to open .lua files in some predefined locations where they do not exist (hence error 2, ENOENT).
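For comparison, here is a much simpler sketch of the same kind of tracing, limited to the openat() syscall and not matching the exit event to report the FD or errno the way the real opensnoop.bt does:
#!/usr/bin/env bpftrace
/* Minimal sketch: print every openat() attempt with its caller. */
tracepoint:syscalls:sys_enter_openat
{
    printf("%-6d %-16s %s\n", pid, comm, str(args->filename));
}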
There are also similar tools to trace stat() and sync() calls:
[root@fc31 tools]# ./statsnoop.bt
Attaching 10 probes...
Tracing stat syscalls... Hit Ctrl-C to end.
PID COMM ERR PATH
...
6147 mariadbd 0 /home/openxs/dbs/maria10.5/data/mysql/table_stats.MAI
6147 mariadbd 0 ./mysql/table_stats.MAI
6147 mariadbd 0 ./mysql/table_stats.MAD
6147 mariadbd 0 /home/openxs/dbs/maria10.5/data/mysql
6147 mariadbd 0 /home/openxs/dbs/maria10.5/data/mysql/column_stats.MAI
6147 mariadbd 0 ./mysql/column_stats.MAI
6147 mariadbd 0 ./mysql/column_stats.MAD
...
6147 mariadbd 0 /home/openxs/dbs/maria10.5/data/mysql/table_stats.MAI
6147 mariadbd 0 ./mysql/table_stats.MAI
6147 mariadbd 0 /home/openxs/dbs/maria10.5/data/mysql
6147 mariadbd 0 /home/openxs/dbs/maria10.5/data/mysql/column_stats.MAI
6147 mariadbd 0 ./mysql/column_stats.MAI
6147 mariadbd 0 /home/openxs/dbs/maria10.5/data/mysql
6147 mariadbd 0 /home/openxs/dbs/maria10.5/data/mysql/table_stats.MAI
6147 mariadbd 0 ./mysql/table_stats.MAI
6147 mariadbd 0 ./sbtest/sbtest3.frm
6147 mariadbd 0 ./sbtest/sbtest1.frm
6147 mariadbd 0 ./sbtest/sbtest2.frm
...
[root@fc31 tools]# ./syncsnoop.bt
Attaching 7 probes...
Tracing sync syscalls... Hit Ctrl-C to end.
TIME PID COMM EVENT
12:10:36 621 auditd tracepoint:syscalls:sys_enter_fsync
12:10:36 6147 mariadbd tracepoint:syscalls:sys_enter_fdatasync
12:10:36 6147 mariadbd tracepoint:syscalls:sys_enter_fdatasync
...
We see that MariaDB server does not only check sbtest.* tables used in the test, but also the tables with engine-independent statistics in the mysql database. In the latter output we see wallclock timestamp of the call (this is a problem with Performance Schema, by the way) that is provided by the time() builtin function of bpftrace. The printed timestamp is also async, it is the time at which userspace has processed the queued up event, not the time at which the bpftrace probe calls time().
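Reduced to just two members of the sync family, the approach looks roughly like this (a sketch, not the full syncsnoop.bt, which attaches to more sync* tracepoints):
tracepoint:syscalls:sys_enter_fsync,
tracepoint:syscalls:sys_enter_fdatasync
{
    time("%H:%M:%S ");   // wallclock time, printed when userspace drains the event
    printf("%-6d %-16s %s\n", pid, comm, probe);
}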
Some tools may not work at all, for example:
[root@fc31 tools]# ./threadsnoop.bt
./threadsnoop.bt:19-21: ERROR: uprobe target file '/lib/x86_64-linux-gnu/libpthread.so.0' does not exist or is not executable
[openxs@fc31 ~]$ ldd /home/openxs/dbs/maria10.5/bin/mariadbd | grep thread
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f3d957bf000)
That's because on my Fedora system the library lives in a different directory than the one hardcoded in the tool.
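The fix is simply to edit the uprobe path in the tool to match what ldd reports. As a hypothetical minimal variant of the same probe for my Fedora box (the real threadsnoop.bt also prints the elapsed time and a header):
uprobe:/lib64/libpthread.so.0:pthread_create
{
    /* arg2 of pthread_create() is the thread start routine */
    printf("%-6d %-16s %s\n", pid, comm, usym(arg2));
}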
One can trace vfs_* functions too. For example, vfscount.bt just traces and counts all VFS calls:
[root@fc31 tools]# ./vfscount.bt
Attaching 65 probes...
Tracing VFS calls... Hit Ctrl-C to end.
^C
@[vfs_test_lock]: 2
@[vfs_symlink]: 3
@[vfs_setxattr]: 3
@[vfs_getxattr]: 4
@[vfs_mkdir]: 9
@[vfs_rmdir]: 9
@[vfs_rename]: 19
@[vfs_readlink]: 105
@[vfs_unlink]: 113
@[vfs_fallocate]: 394
@[vfs_statfs]: 450
@[vfs_lock_file]: 1081
@[vfs_fsync_range]: 1752
@[vfs_statx]: 2789
@[vfs_statx_fd]: 2846
@[vfs_open]: 2925
@[vfs_getattr]: 5360
@[vfs_getattr_nosec]: 5490
@[vfs_writev]: 6340
@[vfs_readv]: 12482
@[vfs_write]: 161284
@[vfs_read]: 307655
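The data collection in vfscount.bt boils down to a wildcarded kprobe and the count() aggregate, something like:
kprobe:vfs_*
{
    @[func] = count();   // one counter per matched vfs_* function
}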
We can, of course, summarize calls per second instead, as vfsstat.bt does:
[root@fc31 tools]# ./vfsstat.bt
Attaching 11 probes...
Tracing key VFS calls... Hit Ctrl-C to end.
12:15:14
@[vfs_open]: 22
@[vfs_writev]: 64
@[vfs_readv]: 124
@[vfs_read]: 1631
@[vfs_write]: 2015
12:15:15
@[vfs_readv]: 96
@[vfs_write]: 1006
@[vfs_read]: 1201
@[vfs_writev]: 3093
12:15:16
@[vfs_open]: 54
@[vfs_readv]: 139
@[vfs_writev]: 153
@[vfs_read]: 2640
@[vfs_write]: 4003
12:15:17
@[vfs_open]: 6
@[vfs_writev]: 89
@[vfs_readv]: 132
@[vfs_write]: 1709
@[vfs_read]: 3904
12:15:18
@[vfs_readv]: 271
@[vfs_write]: 1689
@[vfs_writev]: 2709
@[vfs_read]: 4479
^C
The interval probe is used for this in the code:
interval:s:1
{
    time();
    print(@);
    clear(@);
}
Let me check the last tool for today, writeback.bt, which traces when the kernel writeback procedure writes dirtied pages to disk, and shows details such as the time, device numbers, reason for the writeback, and the duration:
[root@fc31 tools]# ./writeback.bt
Attaching 4 probes...
Tracing writeback... Hit Ctrl-C to end.
TIME DEVICE PAGES REASON ms
12:16:02 253:0 65486 background 0.000
12:16:02 253:0 43351 periodic 0.005
12:16:02 253:0 43351 periodic 0.005
12:16:02 253:0 43351 periodic 0.000
12:16:02 253:0 65534 background 0.045
12:16:02 253:0 65534 background 0.000
12:16:05 253:0 43508 periodic 0.006
12:16:06 253:0 43575 periodic 0.004
12:16:06 8:0 43575 periodic 0.004
12:16:06 253:0 43575 periodic 0.001
12:16:07 253:0 65495 background 434.285
12:16:07 253:0 43947 periodic 0.005
12:16:07 253:0 43676 periodic 0.000
12:16:07 253:0 43549 periodic 22.272
12:16:07 253:0 43549 periodic 0.001
12:16:11 253:0 43604 periodic 301.541
12:16:11 253:0 43528 periodic 147.890
12:16:11 253:0 43433 periodic 119.225
12:16:11 253:0 43433 periodic 0.004
12:16:11 253:0 43433 periodic 0.000
^C
We clearly see notable time spent on background writeback at 12:16:07 and on periodic flushes at 12:16:11.
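To see which writeback tracepoints are available on a given kernel, and which arguments they expose, bpftrace itself can list them (the exact tracepoint names may vary between kernel versions):
bpftrace -l 'tracepoint:writeback:*'                    # list writeback-related tracepoints
bpftrace -lv 'tracepoint:writeback:writeback_written'   # show the arguments of one of them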
Yet another source of tools to check is Brendan Gregg's book and related GitHub source code examples. You can get them from GitHub as follows:
[openxs@fc31 git]$ git clone https://github.com/brendangregg/bpf-perf-tools-book.git
Cloning into 'bpf-perf-tools-book'...
remote: Enumerating objects: 600, done.
remote: Total 600 (delta 0), reused 0 (delta 0), pack-reused 600
Receiving objects: 100% (600/600), 991.41 KiB | 2.73 MiB/s, done.
Resolving deltas: 100% (394/394), done.
[openxs@fc31 git]$ cd bpf-perf-tools-book/
[openxs@fc31 bpf-perf-tools-book]$ ls
exercises images originals README.md updated
One day I'll check them in more detail and share some outputs. This blog post is already too long...
Tracing across the River Thames with minimal impact back in 2019 :)
* * *
To summarize:
- You should check all the tools/*.bt tools and related examples, both to know what is ready to use and to study how proper bpftrace programs are written.
- Some tools may rely on kernel headers being available or on specific pathnames of the libraries they probe.
- The tools presented in this post are good for studying disk I/O issues that may impact MariaDB performance. Some of them are useful alternatives to running strace on everything...
- See also my older post "BCC Tools for disk I/O Analysis and More" where bcc tools were used to monitor disk I/O on older Ubuntu 16.04. Some of them are more advanced and may have no bpftrace-based alternatives, but we all know that bpftrace is the future :)