Saturday, January 23, 2021

Playing with recent bpftrace and MariaDB 10.5 on Fedora - Part I, Basic uprobes

There is still some non-zero probability that my talk called "Monitoring MariaDB Server with bpftrace on Linux" is accepted for the FOSDEM 2021 Monitoring and Observability devroom, so it's time to forget for a while about /proc sampling and revisit my old posts about bpftrace.

This time I am going to build recent bpftrace version from GitHub source with recent bcc tools:

[openxs@fc31 bcc]$ git log -1
commit 97cded04a9d6370ac722c6ad8e73b72c4794e851 (HEAD -> master, origin/master, origin/HEAD)
Author: Chunmei Xu <xuchunmei@linux.alibaba.com>
Date:   Fri Jan 15 09:51:27 2021 +0800

    test/test_histogram.py: fix test failed on kernel-5.10

    kernel commit(cf25e24db61cc) rename tsk->real_start_time to
    start_boottime, so test_hostogram will get failed on kernel>=5.5

    Signed-off-by: Chunmei Xu <xuchunmei@linux.alibaba.com>

...

[openxs@fc31 bpftrace]$ git log -1
commit 691c5e23259bfa82257016c65612fe9a3d6be7d4 (HEAD -> master, origin/master, origin/HEAD)
Author: Masanori Misono <m.misono760@gmail.com>
Date:   Wed Nov 25 05:51:08 2020 +0900

    Update changelog and fuzzing.md
...

and check how it works on Fedora 31. I wanted to write "up to date Fedora 31", but surely it's "up to date" for 2 months already, as it's EOL and no longer supported... This is something to fix next week by upgrading to Fedora 33 while I am on vacation.

The build process was not any different from the one described in this post. I've got some test failures for bcc tools:

...

84% tests passed, 7 tests failed out of 44

Total Test time (real) = 908.76 sec

The following tests FAILED:
          2 - c_test_static (Failed)
          3 - test_libbcc (Failed)
          4 - py_test_stat1_b (Failed)
          9 - py_test_trace1 (Failed)
         18 - py_test_clang (Failed)
         23 - py_test_stackid (Failed)
         29 - py_test_disassembler (Failed)
Errors while running CTest
make: *** [Makefile:106: test] Error 8
...

but eventually ended up with this up to date version of bpftrace that basically works for my purposes:

[openxs@fc31 build]$ /usr/local/bin/bpftrace --version
bpftrace v0.11.0-324-g691c5

[root@fc31 tools]# which bpftrace
/usr/local/bin/bpftrace


[openxs@fc31 ~]$ bpftrace --help
USAGE:
    bpftrace [options] filename
    bpftrace [options] - <stdin input>
    bpftrace [options] -e 'program'

OPTIONS:
    -B MODE        output buffering mode ('full', 'none')
    -f FORMAT      output format ('text', 'json')
    -o file        redirect bpftrace output to file
    -d             debug info dry run
    -dd            verbose debug info dry run
    -b             force BTF (BPF type format) processing
    -e 'program'   execute this program
    -h, --help     show this help message
    -I DIR         add the directory to the include search path
    --include FILE add an #include file before preprocessing
    -l [search]    list probes
    -p PID         enable USDT probes on PID
    -c 'CMD'       run CMD and enable USDT probes on resulting process
    --usdt-file-activation
                   activate usdt semaphores based on file path
    --unsafe       allow unsafe builtin functions
    -q             keep messages quiet
    -v             verbose messages
    --info         Print information about kernel BPF support
    -k             emit a warning when a bpf helper returns an error (except read functions)
    -kk            check all bpf helper functions
    -V, --version  bpftrace version
    --no-warnings  disable all warning messages

ENVIRONMENT:
    BPFTRACE_STRLEN             [default: 64] bytes on BPF stack per str()
    BPFTRACE_NO_CPP_DEMANGLE    [default: 0] disable C++ symbol demangling
    BPFTRACE_MAP_KEYS_MAX       [default: 4096] max keys in a map
    BPFTRACE_CAT_BYTES_MAX      [default: 10k] maximum bytes read by cat builtin
    BPFTRACE_MAX_PROBES         [default: 512] max number of probes
    BPFTRACE_LOG_SIZE           [default: 1000000] log size in bytes
    BPFTRACE_PERF_RB_PAGES      [default: 64] pages per CPU to allocate for ring buffer
    BPFTRACE_NO_USER_SYMBOLS    [default: 0] disable user symbol resolution
    BPFTRACE_CACHE_USER_SYMBOLS [default: auto] enable user symbol cache
    BPFTRACE_VMLINUX            [default: none] vmlinux path used for kernel symbol resolution
    BPFTRACE_BTF                [default: none] BTF file

EXAMPLES:
bpftrace -l '*sleep*'
    list probes containing "sleep"
bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...\n", pid); }'
    trace processes calling sleep
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
    count syscalls by process name

I've highlighted options that I consider "new" or changed comparing to version 0.9 I've used here.

As the first test I tried to find out if this PR mentioned in the comments to one of my posts really made it to the current code and if I can use user probe names in demangled C++ format. For this I tried to capture all queries with a probe on dispatch_command function:

[openxs@fc31 ~]$ ps aux | grep mariadb
openxs      3196  0.0  0.0 217048  3828 pts/0    S    08:45   0:00 /bin/sh ./bin/mysqld_safe --no-defaults --socket=/tmp/mariadb.sock --innodb_buffer_pool_size=1G --innodb_flush_log_at_trx_commit=2 --port=3309
openxs      3293  140  3.5 3633176 287132 pts/0  Sl   08:45   2:56 /home/openxs/dbs/maria10.5/bin/mariadbd --no-defaults --basedir=/home/openxs/dbs/maria10.5 --datadir=/home/openxs/dbs/maria10.5/data --plugin-dir=/home/openxs/dbs/maria10.5/lib/plugin --innodb_buffer_pool_size=1G --innodb_flush_log_at_trx_commit=2 --log-error=/home/openxs/dbs/maria10.5/data/fc31.err --pid-file=fc31.pid --socket=/tmp/mariadb.sock --port=3309
openxs      3494  0.0  0.0 215992   844 pts/1    S+   08:47   0:00 grep --color=auto mariadb

[openxs@fc31 ~]$ sudo bpftrace -e 'uprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:dispatch_command { printf("%s\n", str(arg2)); }'
Attaching 3 probes...
select @@version_comment limit 1
select @@version_comment limit 1
select 1+1
select 1+1
show processlist
show processlist
^C

So, demangled function name was accepted, but note "3 probes" above and duplicated SQL statements in the output. So, at least 2 of 3 probes were executed. I tried to "debug" the problem wityh the -d option:

[openxs@fc31 ~]$ sudo bpftrace -d -e 'uprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:dispatch_command { printf("%s\n", str(arg2)); }'

AST
-------------------
Program
 uprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:dispatch_command
  call: printf
   string: %s\n
   call: str
    builtin: arg2


AST after semantic analysis
-------------------
Program
 uprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:dispatch_command
  call: printf :: type[none, ctx: 0]
   string: %s\n :: type[string[3], ctx: 0]
   call: str :: type[string[64], ctx: 0, AS(user)]
    builtin: arg2 :: type[unsigned int64, ctx: 0, AS(user)]

; ModuleID = 'bpftrace'
source_filename = "bpftrace"
target datalayout = "e-m:e-p:64:64-i64:64-n32:64-S128"
target triple = "bpf-pc-linux"

%printf_t = type { i64, [64 x i8] }

; Function Attrs: nounwind
declare i64 @llvm.bpf.pseudo(i64, i64) #0

define i64 @"uprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:dispatch_command"(i8*) local_unnamed_addr section "s_uprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:dispatch_command_1" {
entry:
  %str = alloca [64 x i8], align 1
  %printf_args = alloca %printf_t, align 8
  %1 = bitcast %printf_t* %printf_args to i8*
  call void @llvm.lifetime.start.p0i8(i64 -1, i8* nonnull %1)
  %2 = getelementptr inbounds [64 x i8], [64 x i8]* %str, i64 0, i64 0
  %3 = bitcast %printf_t* %printf_args to i8*
  call void @llvm.memset.p0i8.i64(i8* nonnull align 8 %3, i8 0, i64 72, i1 false)
  call void @llvm.lifetime.start.p0i8(i64 -1, i8* nonnull %2)
  call void @llvm.memset.p0i8.i64(i8* nonnull align 1 %2, i8 0, i64 64, i1 false)
  %4 = getelementptr i8, i8* %0, i64 96
  %5 = bitcast i8* %4 to i64*
  %arg2 = load volatile i64, i64* %5, align 8
  %probe_read_user_str = call i64 inttoptr (i64 114 to i64 ([64 x i8]*, i32, i64)*)([64 x i8]* nonnull %str, i32 64, i64 %arg2)
  %6 = getelementptr inbounds %printf_t, %printf_t* %printf_args, i64 0, i32 1, i64 0
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 %6, i8* nonnull align 1 %2, i64 64, i1 false)
  call void @llvm.lifetime.end.p0i8(i64 -1, i8* nonnull %2)
  %pseudo = call i64 @llvm.bpf.pseudo(i64 1, i64 1)
  %get_cpu_id = call i64 inttoptr (i64 8 to i64 ()*)()
  %perf_event_output = call i64 inttoptr (i64 25 to i64 (i8*, i64, i64, %printf_t*, i64)*)(i8* %0, i64 %pseudo, i64 %get_cpu_id, %printf_t* nonnull %printf_args, i64 72)
  call void @llvm.lifetime.end.p0i8(i64 -1, i8* nonnull %1)
  ret i64 0
}

; Function Attrs: argmemonly nounwind
declare void @llvm.lifetime.start.p0i8(i64 immarg, i8* nocapture) #1

; Function Attrs: argmemonly nounwind
declare void @llvm.memset.p0i8.i64(i8* nocapture writeonly, i8, i64, i1 immarg) #1

; Function Attrs: argmemonly nounwind
declare void @llvm.lifetime.end.p0i8(i64 immarg, i8* nocapture) #1

; Function Attrs: argmemonly nounwind
declare void @llvm.memcpy.p0i8.p0i8.i64(i8* nocapture writeonly, i8* nocapture readonly, i64, i1 immarg) #1

attributes #0 = { nounwind }
attributes #1 = { argmemonly nounwind }

But the output does NOT list 3 probes and gives no hints. I had probably try to care better and provide (demangled) function signature

dispatch_command(enum_server_command, THD*, char*, unsigned int, bool, bool)

or check what probes are really added with

sudo cat /sys/kernel/tracing/uprobe_events

But being lazy, I ended up just double checking what mangled name to use:

openxs@ao756:~$ objdump -T /home/openxs/dbs/maria10.5/bin/mariadbd | grep dispatch_command
000000000070a170 g    DF .text  000000000000289b  Base        _Z16dispatch_command19enum_server_commandP3THDPcjbb
...

and used the same familiar mangled name in further probes:

[openxs@fc31 ~]$ sudo bpftrace -e 'uprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:_Z16dispatch_command19enum_server_commandP3THDPcjbb { printf("%s\n", str(arg2)); }'
Attaching 1 probe...
select @@version_comment limit 1
select user, host from mysql.user
SELECT DATABASE()
test
show databases
show tables
ts
select count(*) from t1
select count(*) from t
show tables

^C

The next bpftrace "oneliner" to try was my actually more advanced attempt to not only capture the test of SQL statements, but also the time to execute them via uretprobe, and make it work in the multithreaded environment. I quickly found out that one of examples in the older post has a bug and that explained "64" at the end of timestamps :) So, here is a more correct bpftrace program:

[openxs@fc31 ~]$ sudo bpftrace -e 'uprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:_Z16dispatch_command19enum_server_commandP3THDPcjbb { @sql[tid] = str(arg2); @start[tid] = nsecs; }                                                                               uretprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:_Z16dispatch_command19enum_server_commandP3THDPcjbb /@start[tid] != 0/ { printf("%s : %u %u ms\n", @sql[tid], tid, (nsecs - @start[tid])/1000000); } '
Attaching 2 probes...
select sleep(1) : 4029 1000 ms
 : 4029 0 ms
select sleep(2) : 4281 2000 ms
 : 4281 0 ms
select sleep(3) : 4283 3000 ms
 : 4283 0 ms
select sleep(4) : 4282 4000 ms
 : 4282 0 ms
^C

@sql[4029]:
@sql[4281]:
@sql[4282]:
@sql[4283]:

@start[4029]: 2609790546240
@start[4281]: 2610789764269
@start[4283]: 2611790224979
@start[4282]: 2612789761146

The output was taken while this shell script was running:

[openxs@fc31 maria10.5]$ for i in `seq 1 4`; do mysql --socket=/tmp/mariadb.sock -e"select sleep($i)" & done

Just to remind you, I've used two associative arrays, @sql[] for queries and @start[] for start times, both indexed by tid - built in bpftrace variable for thread id. Note that bpftrace automatically outputs the content of all global associative arrays at the end, unless we free them explicitly. So, reimplementing a slow query log in bpftrace properly is no longer a "one liner" program, we have tyo care about more details.

As the next test, I tried to add user probe to the library to trace pthread_mutex_lock calls only for the mariadbd binary (as I did in this post with perf): 

[openxs@fc31 ~]$ ldd /home/openxs/dbs/maria10.5/bin/mariadbd | grep thread
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f3d957bf000)
[openxs@fc31 ~]$ sudo bpftrace -e 'uprobe:/lib64/libpthread.so.0:pthread_mutex_lock /comm == "mariadbd"/ { @[ustack] = count(); }' > /tmp/bpfmutex.txt
[sudo] password for openxs:
^C^C

Here I am collecting and counting unique user stacks at the moment of calling pthread_mutex_lock(), while sysbench test is running:

...
[ 10s ] thds: 32 tps: 658.05 qps: 13199.78 (r/w/o: 9246.09/2634.40/1319.30) lat (ms,95%): 227.40 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 32 tps: 737.82 qps: 14752.19 (r/w/o: 10325.44/2951.30/1475.45) lat (ms,95%): 193.38 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 32 tps: 451.18 qps: 9023.16 (r/w/o: 6316.56/1804.03/902.57) lat (ms,95%): 320.17 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 32 tps: 379.09 qps: 7585.24 (r/w/o: 5310.19/1516.87/758.18) lat (ms,95%): 390.30 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 32 tps: 448.78 qps: 8985.48 (r/w/o: 6292.88/1795.14/897.47) lat (ms,95%): 350.33 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 32 tps: 400.33 qps: 7997.32 (r/w/o: 5595.86/1600.70/800.75) lat (ms,95%): 411.96 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 32 tps: 392.96 qps: 7865.59 (r/w/o: 5506.30/1573.36/785.93) lat (ms,95%): 369.77 err/s: 0.00 reconn/s: 0.00
[ 80s ] thds: 32 tps: 410.02 qps: 8197.77 (r/w/o: 5739.36/1638.47/819.94) lat (ms,95%): 411.96 err/s: 0.00 reconn/s: 0.00
[ 90s ] thds: 32 tps: 390.15 qps: 7803.48 (r/w/o: 5462.45/1560.62/780.41) lat (ms,95%): 427.07 err/s: 0.00 reconn/s: 0.00
[ 100s ] thds: 32 tps: 405.08 qps: 8111.76 (r/w/o: 5677.96/1623.63/810.17) lat (ms,95%): 411.96 err/s: 0.00 reconn/s: 0.00
^C

Note some drop of performance that is yet to be measured properly (collection time vs exporting to the useland /tmp/bpfmutex.txt file. It is notable for sure for such a frequent event to trace.

In the results I see mostly unique stacks like these:

[openxs@fc31 ~]$ head -100 /tmp/bpfmutex.txt
Attaching 1 probe...


@[
    __pthread_mutex_lock+0
    sync_array_wait_event(sync_array_t*, sync_cell_t*&)+167
    rw_lock_sx_lock_func(rw_lock_t*, unsigned long, char const*, unsigned int)+488
    pfs_rw_lock_sx_lock_func(rw_lock_t*, unsigned long, char const*, unsigned int) [clone .constprop.0]+140
    btr_cur_search_to_nth_level_func(dict_index_t*, unsigned long, dtuple_t const*, page_cur_mode_t, unsigned long, btr_cur_t*, rw_lock_t*, char const*, unsigned int, mtr_t*, unsigned long)+8555
    btr_pcur_open_low(dict_index_t*, unsigned long, dtuple_t const*, page_cur_mode_t, unsigned long, btr_pcur_t*, char const*, unsigned int, unsigned long, mtr_t*) [clone .constprop.0]+146
    row_search_index_entry(dict_index_t*, dtuple_t const*, unsigned long, btr_pcur_t*, mtr_t*)+47
    row_purge_remove_sec_if_poss_tree(purge_node_t*, dict_index_t*, dtuple_t const*)+497
    row_purge_record_func(purge_node_t*, unsigned char*, que_thr_t const*, bool)+1492
    row_purge_step(que_thr_t*)+738
    que_run_threads(que_thr_t*)+2264
    purge_worker_callback(void*)+355
    tpool::task_group::execute(tpool::task*)+170
    tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79
    0x7f00f7fc43d4
    0x56302093bd80
    std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0
    0x2de907894810c083
]: 1
@[
    __pthread_mutex_lock+0
    mtr_t::commit()+2660
    row_ins_sec_index_entry_low(unsigned long, unsigned long, dict_index_t*, mem_block_info_t*, mem_block_info_t*, dtuple_t*, unsigned long, que_thr_t*)+563
    row_ins_sec_index_entry(dict_index_t*, dtuple_t*, que_thr_t*, bool)+246
    row_ins_step(que_thr_t*)+1305
    row_insert_for_mysql(unsigned char const*, row_prebuilt_t*, ins_mode_t)+865
    ha_innobase::write_row(unsigned char const*)+177
    handler::ha_write_row(unsigned char const*)+464
    write_record(THD*, TABLE*, st_copy_info*, select_result*)+477
    mysql_insert(THD*, TABLE_LIST*, List<Item>&, List<List<Item> >&, List<Item>&, List<Item>&, enum_duplicates, bool, select_result*)+2967
    mysql_execute_command(THD*)+7722
    Prepared_statement::execute(String*, bool)+981
    Prepared_statement::execute_loop(String*, bool, unsigned char*, unsigned char*)+133
    mysql_stmt_execute_common(THD*, unsigned long, unsigned char*, unsigned char*, unsigned long, bool, bool)+549
    mysqld_stmt_execute(THD*, char*, unsigned int)+44
    dispatch_command(enum_server_command, THD*, char*, unsigned int, bool, bool)+9302
    do_command(THD*)+274
    do_handle_one_connection(CONNECT*, bool)+1025
    handle_one_connection+93
    pfs_spawn_thread+322
    start_thread+226
]: 1
...

Looks like I have to generate them in perf format and then maybe aggregate somehow in bpftrace itself, in the END probe, similar to the way I did with awk postprocessing inspired by pt-pmp in this post. That should reduce the negative performance impact of the tracing, hopefully to the level that makes it practical to use in production. I'd l;ike to build flame graphs one day directly based on bpftrace outputs.

The view is still not entirely clear, but I am getting there, to be as flent with bpftrace as I am with perf

* * *

To summarize:

  1. bpftrace version 0.11 supports demangled C++ function signatures. You may still have problems making sure proper function is instrumented, so I continue to use mangled names.
  2. My plan is to find out how to trace/do with bpftrace anything I usually do with perf, because bpftrace is the future of ad hoc monitoring and tracing tools for Linux.
  3. I am yet to start collecting larger MariaDB and MySQL-related bpftrace programs in some repository for reuse, but there are many generci OS level tools to check in the bpftrace/tools subdirectory. They will be covered in my next blog post.
  4. It's time for me to upgrade to Fedora 33 and retest bcc tools and bpftrace there.

5 comments:

  1. Perf stacks can be generated with `ustack(perf)` as the stack expression. This was moderately recently fixed in https://github.com/iovisor/bpftrace/commit/b26f76cee1905bd70c2afc24ace7b29c27b63f58 ( not currently in a release ) but as you built this from source it should be there for you. Happy upgrade to fc33. Its working good for me.

    ReplyDelete
    Replies
    1. I've used ustack(perf) in another post, just forgot why it may metter for summarizing stacks while trying to come up with a one liner fast while working on other things concurrently :) My fault. I'll surely come up with a better example and hopefully a flame graph next time, and on supported Fedora version...

      Delete
  2. Well , although it might look stupid , if i try to enter that "git log -1" command i get an error that it doesn't know from where to download it . Could you please tell me the exact command ??

    ReplyDelete
    Replies
    1. You have to install git package with dependencies. In case of Fedora it would be:

      sudo dnf install git

      Read https://www.atlassian.com/git/tutorials/install-git#linux for more details.

      Delete
    2. With git installed, the steps to build bcc tools first and then bpftrace based on recent bcc tolls is the following (extract from my older post):

      git clone https://github.com/iovisor/bcc.git
      cd bcc/
      git submodule update --init --recursive
      sudo dnf install -y bison cmake ethtool flex git iperf libstdc++-static python-netaddr python-pip gcc gcc-c++ make zlib-devel elfutils-libelf-devel
      sudo dnf install -y luajit luajit-devel
      sudo dnf install -y http://repo.iovisor.org/yum/extra/mageia/cauldron/x86_64/netperf-2.7.0-1.mga6.x86_64.rpm
      sudo pip install pyroute2
      sudo dnf install -y clang clang-devel llvm llvm-devel llvm-static ncurses-devel
      mkdir build
      cd build/
      cmake .. -DCMAKE_INSTALL_PREFIX=/usr
      make
      make test
      sudo make install
      cd ../..
      git clone https://github.com/iovisor/bpftrace.git
      cd bpftrace/
      git submodule update --init --recursive
      mkdir build
      cd build/
      cmake -DCMAKE_BUILD_TYPE=Release ../
      make
      ./tests/bpftrace_test
      sudo make install

      Delete