There is still some non-zero probability that my talk called "Monitoring MariaDB Server with bpftrace on Linux" is accepted for the FOSDEM 2021 Monitoring and Observability devroom, so it's time to forget for a while about /proc sampling and revisit my old posts about bpftrace.
This time I am going to build a recent bpftrace version from GitHub source, together with recent bcc tools:
[openxs@fc31 bcc]$ git log -1
commit 97cded04a9d6370ac722c6ad8e73b72c4794e851 (HEAD -> master, origin/master, origin/HEAD)
Author: Chunmei Xu <xuchunmei@linux.alibaba.com>
Date: Fri Jan 15 09:51:27 2021 +0800
test/test_histogram.py: fix test failed on kernel-5.10
kernel commit(cf25e24db61cc) rename tsk->real_start_time to
start_boottime, so test_hostogram will get failed on kernel>=5.5
Signed-off-by: Chunmei Xu <xuchunmei@linux.alibaba.com>
...
[openxs@fc31 bpftrace]$ git log -1
commit 691c5e23259bfa82257016c65612fe9a3d6be7d4 (HEAD -> master, origin/master, origin/HEAD)
Author: Masanori Misono <m.misono760@gmail.com>
Date: Wed Nov 25 05:51:08 2020 +0900
Update changelog and fuzzing.md
...
and check how it works on Fedora 31. I wanted to write "up-to-date Fedora 31", but it has not really been "up to date" for 2 months now, as it's EOL and no longer supported... This is something to fix next week by upgrading to Fedora 33 while I am on vacation.
The build process was no different from the one described in this post. I got some test failures for the bcc tools:
...
84% tests passed, 7 tests failed out of 44
Total Test time (real) = 908.76 sec
The following tests FAILED:
2 - c_test_static (Failed)
3 - test_libbcc (Failed)
4 - py_test_stat1_b (Failed)
9 - py_test_trace1 (Failed)
18 - py_test_clang (Failed)
23 - py_test_stackid (Failed)
29 - py_test_disassembler (Failed)
Errors while running CTest
make: *** [Makefile:106: test] Error 8
...
but eventually ended up with an up-to-date version of bpftrace that basically works for my purposes:
[openxs@fc31 build]$ /usr/local/bin/bpftrace --version
bpftrace v0.11.0-324-g691c5
[root@fc31 tools]# which bpftrace
/usr/local/bin/bpftrace
[openxs@fc31 ~]$ bpftrace --help
USAGE:
bpftrace [options] filename
bpftrace [options] - <stdin input>
bpftrace [options] -e 'program'
OPTIONS:
-B MODE output buffering mode ('full', 'none')
-f FORMAT output format ('text', 'json')
-o file redirect bpftrace output to file
-d debug info dry run
-dd verbose debug info dry run
-b force BTF (BPF type format) processing
-e 'program' execute this program
-h, --help show this help message
-I DIR add the directory to the include search path
--include FILE add an #include file before preprocessing
-l [search] list probes
-p PID enable USDT probes on PID
-c 'CMD' run CMD and enable USDT probes on resulting process
--usdt-file-activation
activate usdt semaphores based on file path
--unsafe allow unsafe builtin functions
-q keep messages quiet
-v verbose messages
--info Print information about kernel BPF support
-k emit a warning when a bpf helper returns an error (except read functions)
-kk check all bpf helper functions
-V, --version bpftrace version
--no-warnings disable all warning messages
ENVIRONMENT:
BPFTRACE_STRLEN [default: 64] bytes on BPF stack per str()
BPFTRACE_NO_CPP_DEMANGLE [default: 0] disable C++ symbol demangling
BPFTRACE_MAP_KEYS_MAX [default: 4096] max keys in a map
BPFTRACE_CAT_BYTES_MAX [default: 10k] maximum bytes read by cat builtin
BPFTRACE_MAX_PROBES [default: 512] max number of probes
BPFTRACE_LOG_SIZE [default: 1000000] log size in bytes
BPFTRACE_PERF_RB_PAGES [default: 64] pages per CPU to allocate for ring buffer
BPFTRACE_NO_USER_SYMBOLS [default: 0] disable user symbol resolution
BPFTRACE_CACHE_USER_SYMBOLS [default: auto] enable user symbol cache
BPFTRACE_VMLINUX [default: none] vmlinux path used for kernel symbol resolution
BPFTRACE_BTF [default: none] BTF file
EXAMPLES:
bpftrace -l '*sleep*'
list probes containing "sleep"
bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...\n", pid); }'
trace processes calling sleep
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
count syscalls by process name
I've highlighted the options that I consider "new" or changed compared to version 0.9 that I used here.
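Of these, --info and -k are easy to try right away; a quick check (commands only, output omitted) might look like this:
# show what BPF features the running kernel supports
sudo /usr/local/bin/bpftrace --info
# the same sleep-tracing example from the help text above, now with a warning
# printed whenever a BPF helper returns an error
sudo bpftrace -k -e 'kprobe:do_nanosleep { printf("PID %d sleeping...\n", pid); }'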
As the first test I tried to find out if this PR mentioned in the comments to one of my posts really made it into the current code, and if I can use user probe names in demangled C++ format. For this I tried to capture all queries with a probe on the dispatch_command function:
[openxs@fc31 ~]$ ps aux | grep mariadb
openxs 3196 0.0 0.0 217048 3828 pts/0 S 08:45 0:00 /bin/sh ./bin/mysqld_safe --no-defaults --socket=/tmp/mariadb.sock --innodb_buffer_pool_size=1G --innodb_flush_log_at_trx_commit=2 --port=3309
openxs 3293 140 3.5 3633176 287132 pts/0 Sl 08:45 2:56 /home/openxs/dbs/maria10.5/bin/mariadbd --no-defaults --basedir=/home/openxs/dbs/maria10.5 --datadir=/home/openxs/dbs/maria10.5/data --plugin-dir=/home/openxs/dbs/maria10.5/lib/plugin --innodb_buffer_pool_size=1G --innodb_flush_log_at_trx_commit=2 --log-error=/home/openxs/dbs/maria10.5/data/fc31.err --pid-file=fc31.pid --socket=/tmp/mariadb.sock --port=3309
openxs 3494 0.0 0.0 215992 844 pts/1 S+ 08:47 0:00 grep --color=auto mariadb
[openxs@fc31 ~]$ sudo bpftrace -e 'uprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:dispatch_command { printf("%s\n", str(arg2)); }'
Attaching 3 probes...
select @@version_comment limit 1
select @@version_comment limit 1
select 1+1
select 1+1
show processlist
show processlist
^C
So, the demangled function name was accepted, but note "Attaching 3 probes" above and the duplicated SQL statements in the output. So, at least 2 of the 3 probes were executed. I tried to "debug" the problem with the -d option:
[openxs@fc31 ~]$ sudo bpftrace -d -e 'uprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:dispatch_command { printf("%s\n", str(arg2)); }'
AST
-------------------
Program
uprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:dispatch_command
call: printf
string: %s\n
call: str
builtin: arg2
AST after semantic analysis
-------------------
Program
uprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:dispatch_command
call: printf :: type[none, ctx: 0]
string: %s\n :: type[string[3], ctx: 0]
call: str :: type[string[64], ctx: 0, AS(user)]
builtin: arg2 :: type[unsigned int64, ctx: 0, AS(user)]
; ModuleID = 'bpftrace'
source_filename = "bpftrace"
target datalayout = "e-m:e-p:64:64-i64:64-n32:64-S128"
target triple = "bpf-pc-linux"
%printf_t = type { i64, [64 x i8] }
; Function Attrs: nounwind
declare i64 @llvm.bpf.pseudo(i64, i64) #0
define i64 @"uprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:dispatch_command"(i8*) local_unnamed_addr section "s_uprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:dispatch_command_1" {
entry:
%str = alloca [64 x i8], align 1
%printf_args = alloca %printf_t, align 8
%1 = bitcast %printf_t* %printf_args to i8*
call void @llvm.lifetime.start.p0i8(i64 -1, i8* nonnull %1)
%2 = getelementptr inbounds [64 x i8], [64 x i8]* %str, i64 0, i64 0
%3 = bitcast %printf_t* %printf_args to i8*
call void @llvm.memset.p0i8.i64(i8* nonnull align 8 %3, i8 0, i64 72, i1 false)
call void @llvm.lifetime.start.p0i8(i64 -1, i8* nonnull %2)
call void @llvm.memset.p0i8.i64(i8* nonnull align 1 %2, i8 0, i64 64, i1 false)
%4 = getelementptr i8, i8* %0, i64 96
%5 = bitcast i8* %4 to i64*
%arg2 = load volatile i64, i64* %5, align 8
%probe_read_user_str = call i64 inttoptr (i64 114 to i64 ([64 x i8]*, i32, i64)*)([64 x i8]* nonnull %str, i32 64, i64 %arg2)
%6 = getelementptr inbounds %printf_t, %printf_t* %printf_args, i64 0, i32 1, i64 0
call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 %6, i8* nonnull align 1 %2, i64 64, i1 false)
call void @llvm.lifetime.end.p0i8(i64 -1, i8* nonnull %2)
%pseudo = call i64 @llvm.bpf.pseudo(i64 1, i64 1)
%get_cpu_id = call i64 inttoptr (i64 8 to i64 ()*)()
%perf_event_output = call i64 inttoptr (i64 25 to i64 (i8*, i64, i64, %printf_t*, i64)*)(i8* %0, i64 %pseudo, i64 %get_cpu_id, %printf_t* nonnull %printf_args, i64 72)
call void @llvm.lifetime.end.p0i8(i64 -1, i8* nonnull %1)
ret i64 0
}
; Function Attrs: argmemonly nounwind
declare void @llvm.lifetime.start.p0i8(i64 immarg, i8* nocapture) #1
; Function Attrs: argmemonly nounwind
declare void @llvm.memset.p0i8.i64(i8* nocapture writeonly, i8, i64, i1 immarg) #1
; Function Attrs: argmemonly nounwind
declare void @llvm.lifetime.end.p0i8(i64 immarg, i8* nocapture) #1
; Function Attrs: argmemonly nounwind
declare void @llvm.memcpy.p0i8.p0i8.i64(i8* nocapture writeonly, i8* nocapture readonly, i64, i1 immarg) #1
attributes #0 = { nounwind }
attributes #1 = { argmemonly nounwind }
But the output does NOT list 3 probes and gives no hints. I probably should have taken more care and provided the full (demangled) function signature
dispatch_command(enum_server_command, THD*, char*, unsigned int, bool, bool)
or checked what probes are really added with
sudo cat /sys/kernel/tracing/uprobe_events
But being lazy, I ended up just double checking what mangled name to use:
openxs@ao756:~$ objdump -T /home/openxs/dbs/maria10.5/bin/mariadbd | grep dispatch_command
000000000070a170 g DF .text 000000000000289b Base _Z16dispatch_command19enum_server_commandP3THDPcjbb
...
and used the same familiar mangled name in further probes:
[openxs@fc31 ~]$ sudo bpftrace -e 'uprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:_Z16dispatch_command19enum_server_commandP3THDPcjbb { printf("%s\n", str(arg2)); }'
Attaching 1 probe...
select @@version_comment limit 1
select user, host from mysql.user
SELECT DATABASE()
test
show databases
show tables
ts
select count(*) from t1
select count(*) from t
show tables
^C
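By the way, listing the matching probes up front might have explained the "3 probes" right away. Something like the following should work with a bpftrace build this recent, assuming probe listing supports uprobe wildcards (and /sys/kernel/tracing/uprobe_events can still be checked while a probe is attached, as mentioned above):
sudo bpftrace -l 'uprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:*dispatch_command*'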
The next bpftrace "one-liner" to try was actually my more advanced attempt to capture not only the text of SQL statements, but also the time to execute them via a uretprobe, and to make it work in a multithreaded environment. I quickly found out that one of the examples in the older post had a bug, and that explained the "64" at the end of the timestamps :) So, here is a more correct bpftrace program:
[openxs@fc31 ~]$ sudo bpftrace -e 'uprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:_Z16dispatch_command19enum_server_commandP3THDPcjbb { @sql[tid] = str(arg2); @start[tid] = nsecs; } uretprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:_Z16dispatch_command19enum_server_commandP3THDPcjbb /@start[tid] != 0/ { printf("%s : %u %u ms\n", @sql[tid], tid, (nsecs - @start[tid])/1000000); } '
Attaching 2 probes...
select sleep(1) : 4029 1000 ms
: 4029 0 ms
select sleep(2) : 4281 2000 ms
: 4281 0 ms
select sleep(3) : 4283 3000 ms
: 4283 0 ms
select sleep(4) : 4282 4000 ms
: 4282 0 ms
^C
@sql[4029]:
@sql[4281]:
@sql[4282]:
@sql[4283]:
@start[4029]: 2609790546240
@start[4281]: 2610789764269
@start[4283]: 2611790224979
@start[4282]: 2612789761146
The output was taken while this shell script was running:
[openxs@fc31 maria10.5]$ for i in `seq 1 4`; do mysql --socket=/tmp/mariadb.sock -e"select sleep($i)" & done
Just to remind you, I've used two associative arrays, @sql[] for queries and @start[] for start times, both indexed by tid, a built-in bpftrace variable for the thread id. Note that bpftrace automatically outputs the content of all global associative arrays at the end, unless we free them explicitly. So, properly reimplementing a slow query log in bpftrace is no longer a "one-liner" program; we have to care about more details.
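Here is a minimal sketch of such a more careful version (the 500 ms threshold is just for illustration; the binary path and mangled symbol are the same as above). The per-thread entries are deleted in the uretprobe, so nothing is left to dump when tracing stops:
uprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:_Z16dispatch_command19enum_server_commandP3THDPcjbb
{
    /* remember the statement text and start time per thread */
    @sql[tid] = str(arg2);
    @start[tid] = nsecs;
}

uretprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:_Z16dispatch_command19enum_server_commandP3THDPcjbb
/@start[tid] != 0/
{
    $ms = (nsecs - @start[tid]) / 1000000;
    /* report only statements that took 500 ms or longer */
    if ($ms >= 500) {
        printf("%s : %u %u ms\n", @sql[tid], tid, $ms);
    }
    /* free the per-thread entries so they are not dumped at exit */
    delete(@sql[tid]);
    delete(@start[tid]);
}
Save it as, say, slowlog.bt and run it with sudo bpftrace slowlog.bt.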
As the next test, I tried to add a user probe to the pthread library to trace pthread_mutex_lock calls only for the mariadbd binary (as I did in this post with perf):
[openxs@fc31 ~]$ ldd /home/openxs/dbs/maria10.5/bin/mariadbd | grep thread
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f3d957bf000)
[openxs@fc31 ~]$ sudo bpftrace -e 'uprobe:/lib64/libpthread.so.0:pthread_mutex_lock /comm == "mariadbd"/ { @[ustack] = count(); }' > /tmp/bpfmutex.txt
[sudo] password for openxs:
^C^C
Here I am collecting and counting unique user stacks at the moment of calling pthread_mutex_lock(), while sysbench test is running:
...
[ 10s ] thds: 32 tps: 658.05 qps: 13199.78 (r/w/o: 9246.09/2634.40/1319.30) lat (ms,95%): 227.40 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 32 tps: 737.82 qps: 14752.19 (r/w/o: 10325.44/2951.30/1475.45) lat (ms,95%): 193.38 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 32 tps: 451.18 qps: 9023.16 (r/w/o: 6316.56/1804.03/902.57) lat (ms,95%): 320.17 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 32 tps: 379.09 qps: 7585.24 (r/w/o: 5310.19/1516.87/758.18) lat (ms,95%): 390.30 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 32 tps: 448.78 qps: 8985.48 (r/w/o: 6292.88/1795.14/897.47) lat (ms,95%): 350.33 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 32 tps: 400.33 qps: 7997.32 (r/w/o: 5595.86/1600.70/800.75) lat (ms,95%): 411.96 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 32 tps: 392.96 qps: 7865.59 (r/w/o: 5506.30/1573.36/785.93) lat (ms,95%): 369.77 err/s: 0.00 reconn/s: 0.00
[ 80s ] thds: 32 tps: 410.02 qps: 8197.77 (r/w/o: 5739.36/1638.47/819.94) lat (ms,95%): 411.96 err/s: 0.00 reconn/s: 0.00
[ 90s ] thds: 32 tps: 390.15 qps: 7803.48 (r/w/o: 5462.45/1560.62/780.41) lat (ms,95%): 427.07 err/s: 0.00 reconn/s: 0.00
[ 100s ] thds: 32 tps: 405.08 qps: 8111.76 (r/w/o: 5677.96/1623.63/810.17) lat (ms,95%): 411.96 err/s: 0.00 reconn/s: 0.00
^C
Note some drop in performance that is yet to be measured properly (collection time vs. exporting to the userland /tmp/bpfmutex.txt file). It is surely notable for such a frequent event to trace.
In the results I see mostly unique stacks like these:
[openxs@fc31 ~]$ head -100 /tmp/bpfmutex.txt
Attaching 1 probe...
@[
__pthread_mutex_lock+0
sync_array_wait_event(sync_array_t*, sync_cell_t*&)+167
rw_lock_sx_lock_func(rw_lock_t*, unsigned long, char const*, unsigned int)+488
pfs_rw_lock_sx_lock_func(rw_lock_t*, unsigned long, char const*, unsigned int) [clone .constprop.0]+140
btr_cur_search_to_nth_level_func(dict_index_t*, unsigned long, dtuple_t const*, page_cur_mode_t, unsigned long, btr_cur_t*, rw_lock_t*, char const*, unsigned int, mtr_t*, unsigned long)+8555
btr_pcur_open_low(dict_index_t*, unsigned long, dtuple_t const*, page_cur_mode_t, unsigned long, btr_pcur_t*, char const*, unsigned int, unsigned long, mtr_t*) [clone .constprop.0]+146
row_search_index_entry(dict_index_t*, dtuple_t const*, unsigned long, btr_pcur_t*, mtr_t*)+47
row_purge_remove_sec_if_poss_tree(purge_node_t*, dict_index_t*, dtuple_t const*)+497
row_purge_record_func(purge_node_t*, unsigned char*, que_thr_t const*, bool)+1492
row_purge_step(que_thr_t*)+738
que_run_threads(que_thr_t*)+2264
purge_worker_callback(void*)+355
tpool::task_group::execute(tpool::task*)+170
tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79
0x7f00f7fc43d4
0x56302093bd80
std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0
0x2de907894810c083
]: 1
@[
__pthread_mutex_lock+0
mtr_t::commit()+2660
row_ins_sec_index_entry_low(unsigned long, unsigned long, dict_index_t*, mem_block_info_t*, mem_block_info_t*, dtuple_t*, unsigned long, que_thr_t*)+563
row_ins_sec_index_entry(dict_index_t*, dtuple_t*, que_thr_t*, bool)+246
row_ins_step(que_thr_t*)+1305
row_insert_for_mysql(unsigned char const*, row_prebuilt_t*, ins_mode_t)+865
ha_innobase::write_row(unsigned char const*)+177
handler::ha_write_row(unsigned char const*)+464
write_record(THD*, TABLE*, st_copy_info*, select_result*)+477
mysql_insert(THD*, TABLE_LIST*, List<Item>&, List<List<Item> >&, List<Item>&, List<Item>&, enum_duplicates, bool, select_result*)+2967
mysql_execute_command(THD*)+7722
Prepared_statement::execute(String*, bool)+981
Prepared_statement::execute_loop(String*, bool, unsigned char*, unsigned char*)+133
mysql_stmt_execute_common(THD*, unsigned long, unsigned char*, unsigned char*, unsigned long, bool, bool)+549
mysqld_stmt_execute(THD*, char*, unsigned int)+44
dispatch_command(enum_server_command, THD*, char*, unsigned int, bool, bool)+9302
do_command(THD*)+274
do_handle_one_connection(CONNECT*, bool)+1025
handle_one_connection+93
pfs_spawn_thread+322
start_thread+226
]: 1
...
Looks like I have to generate them in perf format and then maybe aggregate them somehow in bpftrace itself, in the END probe, similar to the way I did with awk postprocessing inspired by pt-pmp in this post. That should reduce the negative performance impact of the tracing, hopefully to a level that makes it practical to use in production. I'd like to build flame graphs one day directly based on bpftrace outputs.
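A hedged sketch of that direction, assuming ustack(perf) is supported in a source build this recent (the output file name is arbitrary):
sudo bpftrace -e '
uprobe:/lib64/libpthread.so.0:pthread_mutex_lock
/comm == "mariadbd"/
{
    /* collect stacks in perf-style format for easier post-processing */
    @[ustack(perf)] = count();
}

END
{
    /* maps are printed automatically on exit anyway; printing and
       clearing them explicitly just makes the intent obvious */
    print(@);
    clear(@);
}' > /tmp/bpfmutex_perf.txt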
The view is still not entirely clear, but I am getting there, on the way to becoming as fluent with bpftrace as I am with perf.
* * *
To summarize:
- bpftrace version 0.11 supports demangled C++ function signatures. You may still have problems making sure the proper function is instrumented, so I continue to use mangled names.
- My plan is to find out how to trace/do with bpftrace anything I usually do with perf, because bpftrace is the future of ad hoc monitoring and tracing tools for Linux.
- I am yet to start collecting larger MariaDB- and MySQL-related bpftrace programs in some repository for reuse, but there are many generic OS-level tools to check in the bpftrace/tools subdirectory. They will be covered in my next blog post.
- It's time for me to upgrade to Fedora 33 and retest bcc tools and bpftrace there.
* * *
Perf-format stacks can be generated with ustack(perf) as the stack expression. This was fixed fairly recently in https://github.com/iovisor/bpftrace/commit/b26f76cee1905bd70c2afc24ace7b29c27b63f58 (not currently in a release), but as you built this from source it should be there for you. Happy upgrade to fc33, it's working well for me.
I've used ustack(perf) in another post, just forgot why it may matter for summarizing stacks while trying to come up with a one-liner fast and working on other things concurrently :) My fault. I'll surely come up with a better example and hopefully a flame graph next time, and on a supported Fedora version...
Well, although it might look stupid, if I try to enter that "git log -1" command I get an error that it doesn't know where to download from. Could you please tell me the exact command?
You have to install the git package with dependencies. In the case of Fedora it would be:
sudo dnf install git
Read https://www.atlassian.com/git/tutorials/install-git#linux for more details.
With git installed, the steps to build the bcc tools first and then bpftrace based on recent bcc are the following (an extract from my older post):
git clone https://github.com/iovisor/bcc.git
cd bcc/
git submodule update --init --recursive
sudo dnf install -y bison cmake ethtool flex git iperf libstdc++-static python-netaddr python-pip gcc gcc-c++ make zlib-devel elfutils-libelf-devel
sudo dnf install -y luajit luajit-devel
sudo dnf install -y http://repo.iovisor.org/yum/extra/mageia/cauldron/x86_64/netperf-2.7.0-1.mga6.x86_64.rpm
sudo pip install pyroute2
sudo dnf install -y clang clang-devel llvm llvm-devel llvm-static ncurses-devel
mkdir build
cd build/
cmake .. -DCMAKE_INSTALL_PREFIX=/usr
make
make test
sudo make install
cd ../..
git clone https://github.com/iovisor/bpftrace.git
cd bpftrace/
git submodule update --init --recursive
mkdir build
cd build/
cmake -DCMAKE_BUILD_TYPE=Release ../
make
./tests/bpftrace_test
sudo make install
I experimented following your steps and it works. I'm eager to apply bcc/bpftrace to user-space programs in order to troubleshoot more efficiently. Thanks for sharing.