Sunday, January 26, 2020

Using bpftrace on Fedora 29 - More Advanced Examples

It happened so that I did not have much time to devote to further eBPF and bpftrace-based tracing and profiling since October 2019, when I posted a simple example of how it can be used on Fedora 29 to "emulate" slow query log (to some extent, by recording queries and time it took to execute them) with dynamic probes. Recently I started preparing for my talk about ftrace, eBPF and bpftrace planned for FOSDEM MySQL, MariaDB and Friends Devroom, but rejected there and later accepted for "MariaDB Day Brussels 0202 2020" (that I suggest everyone to attend, at least if you do not have better plans for the second day of FOSDEM 2020).

So, I decided to check for any ways to use bpftrace as a quick profiler, maybe "embed" it into pt-pmp one day and, surely, get what MariaDB developers ask me most often while working on performance problems or hangs - stack traces. This blog post describes my "achievements" so far on this way.

I had the same version of bpftrace coming from Fedora RPM (that had not updated in the meantime):
[openxs@fc29 server]$ rpm -q -a | grep bpftrace
So, after checking this fine reference guide quickly, I came up with a simple one-liner:
[openxs@fc29 ~]$ sudo bpftrace -e 'profile:hz:99 { @[ustack] = count(); }'
Error creating map: '@'
Error creating stack id map
This might have happened because kernel.perf_event_max_stackis smaller than 127. Try to tweak this value with sysctl kernel.perf_event_max_stack=<new value>
Error creating perf event map (-1)
Attaching 1 probe...
Segmentation fault
Ups! I tried to do what error message suggested:
[openxs@fc29 ~]$ sudo sysctl -w kernel.perf_event_max_stack=1000
kernel.perf_event_max_stack = 1000
But it had not helped. This is how I hit my first problem with bpftrace. Fedora had not cared to update it when it stopped matching "something" in a new kernel, for whatever reason. This is actually a real problem for eBPF programs - what they assume may not be "there" in a kernel, so they have to be checked carefully. There is a separate "BPF Type Format (BTF)" for encoding the debug info related to BPF program (or map) that may help to find out if the program is "compatible" without trying it. In my case incompatibility was obvious and I needed some fast way to proceed with bpftrace, in a hope to come up with some useful examples for my talk.

So, i decided to build new version of bpftrace from recent GitHub source (against bcc tools I had installed and supposedly up to date). This ended up with errors no matter what I tried:
[openxs@fc29 build]$ make
[  3%] Built target parser
[  4%] Built target arch
[  5%] Building CXX object src/ast/CMakeFiles/ast.dir/irbuilderbpf.cpp.o
/mnt/home/openxs/git/bpftrace/src/ast/irbuilderbpf.cpp: In member function 'void bpftrace::ast::IRBuilderBPF::CreateSignal(llvm::Value*)':
/mnt/home/openxs/git/bpftrace/src/ast/irbuilderbpf.cpp:669:16: error: 'BPF_FUNC_send_signal' was not declared in this scope
Quick search led me to the idea that it's actually a know and recently reported problem, see Issue #1014. As a solution I decided to rebuild bcc tools from GitHub source as well (the alternative to to change bpftrace source) and removed Fedora packages. To make a long story short, this is what I did eventually (from fc -l output):
1051     rm -rf bcc/
1052     rm -rf bpftrace/
1053     sudo dnf remove bcc
1054     git clone
1055     cd bcc/
1056     git submodule update --init --recursive
1057     sudo dnf install -y bison cmake ethtool flex git iperf libstdc++-static   python-netaddr python-pip gcc gcc-c++ make zlib-devel   elfutils-libelf-devel
1058     sudo dnf install -y luajit luajit-devel
1059     sudo dnf install -y
1060     sudo pip install pyroute2
1061     sudo dnf install -y clang clang-devel llvm llvm-devel llvm-static ncurses-devel
1062     mkdir build
1063     cd build/
1064     cmake .. -DCMAKE_INSTALL_PREFIX=/usr
1065     make
1066     make test
1067     sudo make install
1068     cd ../..
1069     git clone
1070     cd bpftrace/
1071     git submodule update --init --recursive
1072     mkdir build
1073     cd build/
1074     cmake -DCMAKE_BUILD_TYPE=Release ../
1075     make
1076     ./tests/bpftrace_test
1077     fc -l
Basically I did it all as documented in related projects for my version of Fedora. Step highlighted (1056 above) is essential though, as it was missed here in any obvious way as a mandatory (while it is in recent versions). Step 1076 was successful:
[----------] Global test environment tear-down
[==========] 350 tests from 9 test cases ran. (2590 ms total)
[  PASSED  ] 350 tests.

[openxs@fc29 build]$ make testRunning tests...
Test project /mnt/home/openxs/git/bpftrace/build
    Start 1: bpftrace_test
1/3 Test #1: bpftrace_test ....................   Passed    2.42 sec
    Start 2: runtime_test
2/3 Test #2: runtime_test .....................***Failed   21.10 sec
    Start 3: tools-parsing-test
3/3 Test #3: tools-parsing-test ...............***Failed    5.88 sec

33% tests passed, 2 tests failed out of 3

Total Test time (real) =  29.41 sec

The following tests FAILED:
          2 - runtime_test (Failed)
          3 - tools-parsing-test (Failed)
Errors while running CTest
make: *** [Makefile:128: test] Error 8
while make test shown some failures (same as make test for bcc tools), but form previous experience this had not looked like a big deal. I continued basic testing that the resulting bpftrace now works (I tried to add kernel probe, guess what software runs on this workstation):
[openxs@fc29 build]$ sudo ./src/bpftrace -e 'kprobe:do_nanosleep { printf("sleep by %s\n", comm); }'
Attaching 1 probe...
sleep by percona-qan-age
sleep by percona-qan-age
sleep by skypeforlinux
sleep by mysqld
So, with some confidence built up, I've installed my new and very recent bpftrace tool I built myself:
[openxs@fc29 build]$ sudo make install
[  3%] Built target parser
-- Installing: /usr/local/bin/bpftrace
-- Installing: /usr/local/man/man8/xfsdist.8.gz

[openxs@fc29 build]$ which bpftrace

[openxs@fc29 build]$ cd ..
[openxs@fc29 bpftrace]$ git log -1
commit afafbf561522dd33fa316be3e33375bc662399ac (HEAD -> master, origin/master, origin/HEAD)
Author: bas smit <>
Date:   Fri Jan 17 12:33:43 2020 +0100

    Add `--info` flag

    This prints the supported kernel features
Quick test involved starting MariaDB 10.3 with thread pool enabled (more on "why thread pool?" in some next blog post):
[openxs@fc29 maria10.3]$ ./bin/mysqld_safe --no-defaults --thread_handling=pool-of-threads --thread_pool_idle_timeout=10 &
[1] 32643
and running sysbench:
[openxs@fc29 maria10.3]$ sysbench /usr/local/share/sysbench/oltp_update_index.lua --mysql-host= --mysql-user=root --mysql-port=3306 --threads=10 --tables=4 --table-size=1000000 --time=600 --report-interval=5 run
sysbench 1.1.0-174f3aa (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 10
Report intermediate results every 5 second(s)
Initializing random number generator from current time

Initializing worker threads...

Threads started!

[ 5s ] thds: 10 tps: 54.37 qps: 54.37 (r/w/o: 0.00/54.37/0.00) lat (ms,95%): 467.30 err/s: 0.00 reconn/s: 0.00
[ 10s ] thds: 10 tps: 40.60 qps: 40.60 (r/w/o: 0.00/40.60/0.00) lat (ms,95%): 530.08 err/s: 0.00 reconn/s: 0.00
[ 15s ] thds: 10 tps: 40.00 qps: 40.00 (r/w/o: 0.00/40.00/0.00) lat (ms,95%): 520.62 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 10 tps: 38.00 qps: 38.00 (r/w/o: 0.00/38.00/0.00) lat (ms,95%): 623.33 err/s: 0.00 reconn/s: 0.00
[ 25s ] thds: 10 tps: 34.00 qps: 34.00 (r/w/o: 0.00/34.00/0.00) lat (ms,95%): 746.32 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 10 tps: 36.40 qps: 36.40 (r/w/o: 0.00/36.40/0.00) lat (ms,95%): 549.52 err/s: 0.00 reconn/s: 0.00
[ 35s ] thds: 10 tps: 37.80 qps: 37.80 (r/w/o: 0.00/37.80/0.00) lat (ms,95%): 601.29 err/s: 0.00 reconn/s: 0.00
and adding the probe like this in another shell:
[openxs@fc29 ~]$ sudo bpftrace -e 'profile:hz:99 { @[ustack] = count(); }' > /tmp/bpf.stack
When I stopped collecting with Ctrl-C, bpftrace printed the content of associative arrays collected (with unique user space stack 9for all processes!) as index and count of time this stack was as a value). No segmentation fault, unlike with the version from package! In the resulting file we can see (among other things):
[openxs@fc29 bpftrace]$ cat /tmp/bpf.stack | more
Attaching 1 probe...

    rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned short*, bool, unsigned long, mem_block_info_t**)+31
    page_cur_search_with_match_bytes(buf_block_t const*, dict_index_t const*, dtuple_t const*, page_cur_mode_t, unsigned long*, unsigned long*, unsigned long*, unsigned long*, page_cur_t*)+1773
    btr_cur_search_to_nth_level_func(dict_index_t*, unsigned long, dtuple_t const*, page_cur_mode_t, unsigned long, btr_cur_t*, rw_lock_t*, char const*, unsigned int, mtr_t*, unsigned long)+8316
    btr_pcur_open_low(dict_index_t*, unsigned long, dtuple_t const*, page_cur_mode_t, unsigned long, btr_pcur_t*, char const*, unsigned int, unsigned long, mtr_t*) [clone .constprop.33]+146
    row_search_on_row_ref(btr_pcur_t*, unsigned long, dict_table_t const*, dtuple_t const*, mtr_t*)+85
    row_purge_poss_sec(purge_node_t*, dict_index_t*, dtuple_t const*, btr_pcur_t*, mtr_t*, bool)+503
    row_purge_remove_sec_if_poss_leaf(purge_node_t*, dict_index_t*, dtuple_t const*)+1293
    row_purge_record_func(purge_node_t*, unsigned char*, que_thr_t const*, bool)+2271
]: 1


    rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned short*, bool, unsigned long, mem_block_info_t**)+31
    btr_search_build_page_hash_index(dict_index_t*, buf_block_t*, rw_lock_t*, unsigned long, unsigned long, unsigned long)+2781
    btr_search_info_update_slow(btr_search_t*, btr_cur_t*)+622
    btr_cur_search_to_nth_level_func(dict_index_t*, unsigned long, dtuple_t const*, page_cur_mode_t, unsigned long, btr_cur_t*, rw_lock_t*, char const*, unsigned int, mtr_t*, unsigned long)+10130
    row_search_mvcc(unsigned char*, page_cur_mode_t, row_prebuilt_t*, unsigned long, unsigned long)+4823
    ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)+338
    handler::ha_index_read_map(unsigned char*, unsigned char const*, unsigned long, ha_rkey_function)+280
    handler::read_range_first(st_key_range const*, st_key_range const*, bool, bool)+98
    mysql_update(THD*, TABLE_LIST*, List<Item>&, List<Item>&, Item*, unsigned int, st_order*, unsigned long long, bool, unsigned long long*, unsigned long long*)+3560
    Prepared_statement::execute(String*, bool)+979
    Prepared_statement::execute_loop(String*, bool, unsigned char*, unsigned char*)+154
    mysql_stmt_execute_common(THD*, unsigned long, unsigned char*, unsigned char*, unsigned long, bool, bool)+349
    mysqld_stmt_execute(THD*, char*, unsigned int)+37
    dispatch_command(enum_server_command, THD*, char*, unsigned int, bool, bool)+5123
]: 1


    trx_purge(unsigned long, bool)+1054
]: 2
    ut_crc32_sw(unsigned char const*, unsigned long)+1611
    buf_calc_page_crc32(unsigned char const*)+61
    buf_flush_init_for_writing(buf_block_t const*, unsigned char*, void*, unsigned long)+872
    buf_flush_write_block_low(buf_page_t*, buf_flush_t, bool)+149
    buf_flush_page(buf_pool_t*, buf_page_t*, buf_flush_t, bool)+607
    buf_flush_page_and_try_neighbors(buf_page_t*, buf_flush_t, unsigned long, unsigned long*)+1627
    buf_do_flush_list_batch(buf_pool_t*, unsigned long, unsigned long)+392
]: 2

@[]: 6474
So, this is the default format to output stacks (one line per call, no address, frame number or anything, symbols properly resolved for my MariaDB built from GitHub source etc). But I had something like this mind eventually (from pt-pmp):
[openxs@fc29 bpftrace]$ pt-pmp
Tue Jan 21 13:52:12 EET 2020

     10 syscall(,__io_getevents_0_4(,LinuxAIOHandler::collect(,LinuxAIOHandler::poll(,os_aio_linux_handler(stl_vector.h:805),os_aio_handler(stl_vector.h:805),fil_aio_wait(,io_handler_thread(,start_thread(,clone(
      6 pthread_cond_wait,os_event::wait(sync0types.h:476),os_event::wait_low(sync0types.h:476),log_write_up_to(,trx_flush_log_if_needed_low(,trx_flush_log_if_needed(,trx_commit_complete_for_mysql(,innobase_commit(,commit_one_phase_2(,ha_commit_one_phase(,ha_commit_trans(,trans_commit_stmt(,mysql_execute_command(,Prepared_statement::execute(,Prepared_statement::execute_loop(,Prepared_statement::execute_loop(,mysql_stmt_execute_common(,mysqld_stmt_execute(,dispatch_command(,do_command(,threadpool_process_request(,tp_callback(,worker_main(,start_thread(,clone(
      3 pthread_cond_wait,os_event::wait(sync0types.h:476),os_event::wait_low(sync0types.h:476),srv_resume_thread(,srv_worker_thread(,start_thread(,clone(
      3 pthread_cond_timedwait,inline_mysql_cond_timedwait(mysql_thread.h:1215),get_event(mysql_thread.h:1215),worker_main(,start_thread(,clone(
That is, number of times collapsed stack was noted and then the stack itself, without function arguments if possible. This kind of stack is generated from gdb outputs that looks as follows:
Thread 1 (Thread 0x7f29eee66900 (LWP 32728)):
#0  0x00007f29ef25f431 in poll () from /lib64/
#1  0x000055cc4ece656f in poll (__timeout=-1, __nfds=2, __fds=0x7ffe174108e0) at /usr/include/bits/poll2.h:41
#2  handle_connections_sockets () at /mnt/home/openxs/git/server/sql/
#3  0x000055cc4ece7a7c in mysqld_main (argc=<optimized out>, argv=<optimized out>) at /mnt/home/openxs/git/server/sql/
#4  0x00007f29ef190413 in __libc_start_main () from /lib64/
#5  0x000055cc4ecd9ece in _start () at /mnt/home/openxs/git/server/sql/sql_list.h:158
So, each tread stack starts from 'Thread..." line, each frame is numbered and function call is actually the fourth word in the output (in awk terms).
Based on that I tried add probe to produce stack in perf format for mysqld process, with some decoration to have a clear marker of when new capture starts:
[openxs@fc29 bpftrace]$ sudo bpftrace -e 'profile:hz:99 /comm == "mysqld"/ {printf("# %s\n", ustack(perf));}' > /tmp/ustack.txt
If you wonder about the details of command format, functions used etc, check the Reference Guide. The result looks as follows in hexdump -C output:
00000000  41 74 74 61 63 68 69 6e  67 20 31 20 70 72 6f 62  |Attaching 1 prob|
00000010  65 2e 2e 2e 0a 23 20 0a  09 37 66 32 39 65 66 62  |e....# ..7f29efb|
00000020  38 34 33 30 30 20 6e 61  6e 6f 73 6c 65 65 70 2b  |84300 nanosleep+|
00000030  36 34 20 28 2f 75 73 72  2f 6c 69 62 36 34 2f 6c  |64 (/usr/lib64/l|
00000040  69 62 70 74 68 72 65 61  64 2d 32 2e 32 38 2e 73  |ibpthread-2.28.s|
00000050  6f 29 0a 0a 23 20 0a 09  35 35 63 63 34 66 32 30  |o)..# ..55cc4f20|
This is how you can see where are the new lines, how many etc, for proper parsing. Next step was to check pt-pmp source code and, review and adapt some awk magic used there to collapse stacks, get rid of function arguments etc, and I've ended up with this draft version:
[openxs@fc29 maria10.3]$ cat /tmp/ustack.txt | awk '
BEGIN { s = ""; }
/^\#/ { print s; s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s "," targ } else { s = targ } }
END { print s }' | sort | uniq -c | sort -r -n -k 1,1 | more

    199 __sched_yield,srv_purge_coordinator_thread,start_thread
     16 srv_get_task_queue_length,trx_purge,srv_purge_coordinator_thread,start_thread
     13 trx_purge,srv_purge_coordinator_thread,start_thread
      4 srv_purge_coordinator_thread,start_thread
      4 srv_get_task_queue_length,srv_purge_coordinator_thread,start_thread
      2 ut_crc32_sw,buf_calc_page_crc32,buf_page_is_corrupted,buf_page_io_complete,buf_read_page,buf_page_get_gen,btr_cur_search_to_nth_level_func,row_search_mvcc,ha_innobase::index_read,handler::ha_index_read_map,handler::read_range_first,handler::multi_range_read_next,Mrr_simple_index_reader::get_next,DsMrr_impl::dsmrr_next,QUICK_RANGE_SELECT::get_next,rr_quick,mysql_update,mysql_execute_command,Prepared_statement::execute,Prepared_statement::execute_loop,mysql_stmt_execute_common,mysqld_stmt_execute,dispatch_command,do_command,tp_callback,worker_main,start_thread
      2 syscall,
      2 __pwrite,
      2 mtr_t::commit,row_ins_sec_index_entry_low,row_ins_sec_index_entry,row_upd_sec_index_entry,row_upd_step,row_update_for_mysql,ha_innobase::update_row,handler::ha_update_row,mysql_update,mysql_execute_command,Prepared_statement::execute,Prepared_statement::execute_loop,mysql_stmt_execute_common,mysqld_stmt_execute,dispatch_command,do_command,tp_callback,worker_main,start_thread
The above looks reasonable (even if not yet a direct match to what pt-pmp produces, I am too lazy to do much awk these days).

What's more interesting is that with bpftrace we can also get kernel stack for thread while profiling. I took a quick look at what we may get as a result of combining both:
[openxs@fc29 bpftrace]$ sudo bpftrace -e 'profile:hz:99 /comm == "mysqld"/ {printf("# %s\n%s\n", kstack(perf), ustack(perf));}' > /tmp/kstack.txt
[sudo] password for openxs:
I just pring one after the other without any separator in perf format, so I can apply the same trick as above to the result:
openxs@fc29 maria10.3]$ cat /tmp/kstack.txt | awk '                   BEGIN { s = ""; }
/^\#/ { print s; s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s "," targ } else { s = targ } }
END { print s }' | sort | uniq -c | sort -r -n -k 1,1 | more

     93 __sched_text_start,schedule,__x64_sys_sched_yield,do_syscall_64,entry_SYSCALL_64_after_hwframe,__sched_yield,srv_purge_coordinator_thread,start_thread
     49 __sched_yield,srv_purge_coordinator_thread,start_thread
     42 do_syscall_64,entry_SYSCALL_64_after_hwframe,__sched_yield,srv_purge_coordinator_thread,start_thread
     20 srv_get_task_queue_length,trx_purge,srv_purge_coordinator_thread,start_thread
     11 trx_purge,srv_purge_coordinator_thread,start_thread
      2 srv_get_task_queue_length,srv_purge_coordinator_thread,start_thread
      1 __x86_indirect_thunk_rax,do_syscall_64,entry_SYSCALL_64_after_hwframe,__sched_yield,srv_purge_coordinator_thread,start_thread
      1 ut_crc32_sw,log_block_calc_checksum_crc32,log_write_up_to,trx_commit_complete_for_mysql,innobase_commit,commit_one_phase_2,ha_commit_trans,trans_commit_stmt,mysql_execute_command,Prepared_statement::execute,Prepared_statement::execute_loop,mysql_stmt_execute_common,mysqld_stmt_execute,dispatch_command,do_command,tp_callback,worker_main,start_thread
      1 _raw_spin_unlock_irqrestore,ata_scsi_queuecmd,scsi_queue_rq,blk_mq_dispatch_rq_list,blk_mq_do_dispatch_sched,blk_mq_sched_dispatch_requests,__blk_mq_run_hw_queue,__blk_mq_delay_run_hw_queue,blk_mq_run_hw_queue,blk_mq_sched_insert_requests,blk_mq_flush_plug_list,blk_flush_plug_list,blk_finish_plug,ext4_writepages,do_writepages,__filemap_fdatawrite_range,file_write_and_wait_range,ext4_sync_file,do_fsync,__x64_sys_fsync,do_syscall_64,entry_SYSCALL_64_after_hwframe,fsync,fil_flush_low,fil_flush_file_spaces,buf_dblwr_update,buf_page_io_complete,fil_aio_wait,io_handler_thread,start_thread
      1 page_simple_validate_new,buf_dblwr_flush_buffered_writes,pc_flush_slot,buf_flush_page_cleaner_coordinator.cold.151,start_thread
      1 page_cur_insert_rec_write_log,page_cur_insert_rec_low,page_copy_rec_list_end_no_locks,btr_page_reorganize_low,btr_can_merge_with_page,btr_compress,btr_cur_compress_if_useful,btr_cur_pessimistic_delete,row_purge_remove_sec_if_poss_tree,row_purge_record_func,row_purge_step,que_run_threads,srv_worker_thread,start_thread
      1 os_thread_yield,srv_purge_coordinator_thread,start_thread
      1 os_thread_get_curr_id,fseg_free_page_func,btr_page_free,btr_compress,btr_cur_compress_if_useful,btr_cur_pessimistic_delete,row_purge_remove_sec_if_poss_tree,row_purge_record_func,row_purge_step,que_run_threads,srv_worker_thread,start_thread
Looks cool to me, we see where we spend time in kernel while on-CPU, not only in user space code.

It's sometimes scary and not that easy to use bpftrace, but you'll get there eventually, same as we got to North Greenwich via that Emirates Air Line back in 2019. The only limit is the sky!
* * *
To summarize:
  1. bpftrace and bcc tools you get from vendor packages may not work as expected, so be ready to get nightly builds etc or, as I prefer, build from source. If this is a problem better use ftrace :) 
  2. It seems getting stack traces with bpftrace for pt-pmp is an easy next stap to make. I may be integrated as one of tracers supported here.
  3. Now we can also add kernel stack to the picure.
  4. "JUST DO BPF"!
I have many more details to share about dynamic tracing for MySQL and MariaDB DBAs. Stay tuned!


  1. Nice article...

    Sorry for not finding an updated bpftrace package for Fedora 29 (I'm the bpftrace Fedora package maintainer), but F29 EOL was 2019-11-26 ( For Fedora, my suggestion is to use always the latest release.

    In Fedora 30 and 31 you can find a more updated package (version 0.9.2), I haven't updated it yet to the latest version (0.9.3) due to some technical problems (related to libraries). In Fedora 32 (Rawhide) you can find the latest version.

    Another thing...

    I have an open PR adding support for C++ demangled names:

    If you want you can test it and if you find any problem, please let me know.

    1. Thank you for clarifying this. I have to upgrade to Fedora 32 soon. I'll monitor the PR or maybe just include it in the next source code build I'll for myself.