Showing posts with label probe. Show all posts
Showing posts with label probe. Show all posts

Sunday, January 30, 2022

DTrace Basics and Tracing Every MariaDB Server Function Call

In my previous post I've explained how to build MariaDB Server 10.8 from GitHub source on macOS. This kind of (totally unsupported by both MariaDB Corporation and Foundation) build may be used to study new interesting testing, troubleshooting and performance problem solving options that DTrace provides.

To start adding custom probes and doing something non-trivial one has to understand (or, better, master) some basics of the language used. Code in the sample support-files/dtrace/query-time.d program looked clear and familiar enough (at least for those who've seen bpftrace probes). It consists of several probes like this:

mysql*:::query-start
{
   self->query = copyinstr(arg0);
   self->connid = arg1;
   self->db    = copyinstr(arg2);
   self->who   = strjoin(copyinstr(arg3),strjoin("@",copyinstr(arg4)));
   self->querystart = timestamp;
}

As we can read in the manual, the format for the probe specification is provider:module:function:name. An empty component in a probe specification matches anything. You can also describe probes with a pattern matching syntax that is similar to the syntax that is used by shell. 

In the case above  (USDT) provider name should start with mysql, we do not care about module and function (actually we maybe have USDT with the name in different functions), and we refer to the probe by name. 

If you are wondering to see other kinds of specific probes, you can check dtrace -l output:

Yuliyas-Air:~ Valerii$ sudo dtrace -l | head -20
dtrace: system integrity protection is on, some features will not be available

   ID   PROVIDER            MODULE                          FUNCTION NAME
    1     dtrace                                                     BEGIN
    2     dtrace                                                     END
    3     dtrace                                                     ERROR
...
   42    profile                                                     tick-1000
   43    profile                                                     tick-5000
   44  mach_trap                                        kern_invalid entry
   45  mach_trap                                        kern_invalid return
   46  mach_trap                    _kernelrpc_mach_vm_allocate_trap entry
Yuliyas-Air:~ Valerii$ sudo dtrace -l | tail -4
dtrace: system integrity protection is on, some features will not be available

254755   pid82428          mariadbd            do_command(THD*, bool) 5e4
254917   pid82428 libsystem_malloc.dylib                            malloc return
467101 security_exception639          csparser _ZN8Security7CFErrorC2Ev [Security::CFError::CFError()] throw-cf
467102 security_exception639          csparser _ZN8Security9UnixErrorC2Eib [Security::UnixError::UnixError(int, bool)] throw-unix

We should study different providers that DTrace supports:

  • dtrace - few special probes: BEGIN, END and ERROR.
  • profile - this provider includes probes that are associated with an interrupt that fires at some regular, specified time interval .profile-N probes basically fire N times per second on every CPU. tick-N probes fire on only one CPU per interval.
  • syscall - enables you to trace every system call entry and return.
  • pid* - enables tracing of any user process, as specified by its pid. It basically allows to trace entries (with entry probe name) and returns (with return probe name) to every user function, among other things.
  • ...

There are more providers actually, but those mentioned above allow to start reading and writing DTrace scripts for MariaDB server tracing and profiling, so let's stop listing them for now.

As we can see from this output 

Yuliyas-Air:~ Valerii$ dtrace
Usage: dtrace [-aACeFHlqSvVwZ] [-arch i386|x86_64] [-b bufsz] [-c cmd] [-D name[=def]]
        [-I path] [-L path] [-o output] [-p pid] [-s script] [-U name]
        [-x opt[=val]]

        [-P provider [[ predicate ] action ]]
        [-m [ provider: ] module [[ predicate ] action ]]
        [-f [[ provider: ] module: ] func [[ predicate ] action ]]
        [-n [[[ provider: ] module: ] func: ] name [[ predicate ] action ]]
        [-i probe-id [[ predicate ] action ]] [ args ... ]

        predicate -> '/' D-expression '/'
           action -> '{' D-statements '}'
...

for each probe we can define optional predicate (in slashes), expression that must evaluate to true (non-zero integer value) when probe fires for optional action (in curly brackets) to be executed. Action is a sequence of DTrace statements separated by semicolons. Note that DTrace does NOT support loops or complex conditional execution in the actions, for good reasons discussed elsewhere. We may also have optional arguments for the probe.

In our query-start probe above there is no predicate, so the action is executed unconditionally when we hit it. We use some built-in functions (like copyinsrt()) and variables, like arg0, arg1 and timestamp. We'll discuss them in details in some other post, but arg0 ... arg9 obviously refer to the first ten arguments of the probe and timestamp is the current value of a nanosecond timestamp counter.

In the action we see several assignments to self->something. These are thread-local variables that share a common name in your D code but refer to separate data storage associated with each operating system thread.

Now we can try to add some probes. Given this information:

Yuliyas-Air:~ Valerii$ ps aux | grep mariadbd
Valerii          37341   0.0  0.0  4267752    700 s003  S+   10:47AM   0:00.01 grep mariadbd
Valerii          82428   0.0  0.1  4834244   5960   ??  S    Wed09PM 116:47.57 /Users/Valerii/dbs/maria10.8/bin/mariadbd --no-defaults --basedir=/Users/Valerii/dbs/maria10.8 --datadir=/Users/Valerii/dbs/maria10.8/data --plugin-dir=/Users/Valerii/dbs/maria10.8/lib/plugin --log-error=/Users/Valerii/dbs/maria10.8/data/Yuliyas-Air.err --pid-file=Yuliyas-Air.pid

and knowing that function name portion of the current probe's description is avaiable ad probefunc built-in variable, we can try to trace the enter into the every tracable function in this MariaDB server process:

Yuliyas-Air:~ Valerii$ sudo dtrace -n 'pid82428:::entry { printf ("\nEnter %s", probefunc); }' | more
dtrace: system integrity protection is on, some features will not be available

dtrace: description 'pid82428:::entry ' matched 240404 probes
CPU     ID                    FUNCTION:NAME
  1 128445            cerror_nocancel:entry

Enter cerror_nocancel
  1 129499        _pthread_testcancel:entry
Enter _pthread_testcancel
  1 128444                    __error:entry
Enter __error
  1 129505    _pthread_cond_updateval:entry
Enter _pthread_cond_updateval
  1 129429         pthread_mutex_lock:entry
Enter pthread_mutex_lock
  1  36852                  my_hrtime:entry
Enter my_hrtime
  1 122312              clock_gettime:entry
Enter clock_gettime
  1 122657               gettimeofday:entry
Enter gettimeofday
  1 128412    __commpage_gettimeofday:entry
Enter __commpage_gettimeofday
  1 128413 __commpage_gettimeofday_internal:entry
Enter __commpage_gettimeofday_internal
  1 128343         mach_absolute_time:entry
Enter mach_absolute_time
:

I've used faimiliar C-style printf() function to print some custom string including the name of the function when we hit the probe. Note that dtrace attached probes to 240404 different functions without any problem, in default configuration. We immediately started to gte the output that shows function calls happening in background threads all the time. I've used more to be able to copy-paste small portion of the output and then stopped tracing with Ctrl+C after quiting from more.

Note that by default for each hit of the probe we get probe identifier, CPU that it fired on, function and probe name reported. Add -q option to suppress this and see only what we explicitly print in the probe:

Yuliyas-Air:~ Valerii$ sudo dtrace -q -n 'pid82428:::entry { printf ("\nEnter %s", probefunc); }' | more
dtrace: system integrity protection is on, some features will not be available


Enter cerror_nocancel
Enter _pthread_testcancel
Enter __error
Enter _pthread_cond_updateval
Enter pthread_mutex_lock
Enter my_hrtime
Enter clock_gettime
Enter gettimeofday
Enter __commpage_gettimeofday
Enter __commpage_gettimeofday_internal
...

If we want to trace all processes with "mariadbd" as the name of the executable, we can add predicate to the probe that does not specify pid at all:

Yuliyas-Air:~ Valerii$ sudo dtrace -q -n ':::entry / execname == "mariadbd" / { printf ("\nEnter %s", probefunc); }' | more
dtrace: system integrity protection is on, some features will not be available


Enter cerror_nocancel
Enter _pthread_testcancel
Enter __error
Enter _pthread_cond_updateval
Enter pthread_mutex_lock
Enter my_hrtime
Enter clock_gettime
Enter gettimeofday
...

So, we can produce a raw trace (to be stored or processed by some user level code outside of DTrace). But what if we would like to do some more advanced processing in the probe itself and share only the results? For example, we run some MariaDB tests and would like to know how many times specific function in the code was executed (if at all) in the process?

For this we have to use aggregations and rely on aggregate functions to do the job in kernel space. For now it's enough to know that @name[key] is the aggregation name indexed by the key (you can use several keys separated by commas) and count() aggregation function simly counts the number of times it was called. 

Another inportant detail to study to create useful DTrace scripts is scripts arguments substitution. When called with -s option dtrace prowides a set of built-in macro variables to the script, with nsame starting with $. Of them $0 ... $9 have the same meenings as in shell, so $1 is the first argument of the script.

With the above taken into account, we can use the following basic DTrace script to count function executions per name in any user process (quiet option has the same effect as -q):

Yuliyas-Air:~ Valerii$ cat /tmp/codecoverage.d
#!/usr/sbin/dtrace -s
#pragma D option quiet
pid$1::$2:entry {
        @calls[probefunc] = count();
}

This script substitutes the first argument for the process id in the pid provider and the second argument for the function name in the probe, thus we create one or more probes. The aggregation's collected results are printed by default when dtrace script ends, but there is also the printa() function to do this explicitly in the probes if needed. Let's try to apply this to MariaDB server 10.8 running as process 82428 and trace all functions (hence '*' as the explicite second argument):

Yuliyas-Air:~ Valerii$ sudo /tmp/codecoverage.d 82428 '*' >/tmp/res.txt
dtrace: system integrity protection is on, some features will not be available

^C
Yuliyas-Air:~ Valerii$ ls -l /tmp/res.txt
-rw-r--r--  1 Valerii  wheel  58772 Jan 28 20:53 /tmp/res.txt
Yuliyas-Air:~ Valerii$ head -10 /tmp/res.txt


  ACL_USER* find_by_username_or_anon<ACL_USER>(ACL_USER*, unsigned long, char const*, char const*, char const*, char const*, char                1
  ACL_USER::copy(st_mem_root*)                                      1
  Apc_target::init(st_mysql_mutex*)                                 1
  Binary_string::copy(char const*, unsigned long)                   1
  CONNECT::create_thd(THD*)                                         1
  Create_func_arg1::create_func(THD*, st_mysql_const_lex_string*, List<Item>*)                1
  Create_func_sleep::create_1_arg(THD*, Item*)                      1
  Current_schema_tracker::update(THD*, set_var*)                    1
Yuliyas-Air:~ Valerii$ tail -10 /tmp/res.txt
  buf_page_get_low(page_id_t, unsigned long, unsigned long, buf_block_t*, unsigned long, mtr_t*, dberr_t*, bool)            33895
  pthread_self                                                  36422
  mtr_t::memcpy_low(buf_block_t const&, unsigned short, void const*, unsigned long)            44143
  unsigned char* mtr_t::log_write<(unsigned char)48>(page_id_t, buf_page_t const*, unsigned long, bool, unsigned long)            44167
  mach_boottime_usec                                            45007
  mtr_t::modify(buf_block_t const&)                             45609
  pthread_mutex_unlock                                          56530
  pthread_mutex_lock                                            57295
  pthread_cond_broadcast                                        64197
  _platform_memmove$VARIANT$Haswell                             72767
Yuliyas-Air:~ Valerii$ cat /tmp/res.txt | grep do_command
  do_command(THD*, bool)                                            4

I've executed just several statements (4 if you check do_command() executions) over a short enough period of time, but some functions like pthread_mutex_lock were executed thousands of times in the background. Now we have a lame code coverage testing method that works at MariaDB server scale (unlike with bpftrace on Linux). There are performance implications (huge impact when I tried to do the same while running sysbench test with many concurrent threads) and limitations of internal buffer sizes that may cause some probes to be skipped etc, but basically it works!

DTrace is almost as beautiful as Côte d'Azur :)

 To summarize:

  1. We checked basic building blocks of simple DTrace one liners and scripts in some details, including probe structure, main providers, some built-in variables, thread-local variables referenced via self-> and aggregations.
  2. It is possible to add probes to each and every function in MariaDB server 10.8 and do both raw tracing and aggregation in the kernel context. Lame code coverage scripts do work.
  3. It is possible to create parametrized, generic scripts with command line arguments and other macros substituted.
  4. DTrace language is similar to what we see later implemented in bpftrace. After some checks of the manual one can easily switch from one tool to the other for basic tasks.
  5. Built-in variablers and functions may be named differently than in bpftrace.
  6. There are many more details to cover and interesting scripts to use with MariaDB server, so stay tuned!

Sunday, January 26, 2020

Using bpftrace on Fedora 29 - More Advanced Examples

It happened so that I did not have much time to devote to further eBPF and bpftrace-based tracing and profiling since October 2019, when I posted a simple example of how it can be used on Fedora 29 to "emulate" slow query log (to some extent, by recording queries and time it took to execute them) with dynamic probes. Recently I started preparing for my talk about ftrace, eBPF and bpftrace planned for FOSDEM MySQL, MariaDB and Friends Devroom, but rejected there and later accepted for "MariaDB Day Brussels 0202 2020" (that I suggest everyone to attend, at least if you do not have better plans for the second day of FOSDEM 2020).

So, I decided to check for any ways to use bpftrace as a quick profiler, maybe "embed" it into pt-pmp one day and, surely, get what MariaDB developers ask me most often while working on performance problems or hangs - stack traces. This blog post describes my "achievements" so far on this way.

I had the same version of bpftrace coming from Fedora RPM (that had not updated in the meantime):
[openxs@fc29 server]$ rpm -q -a | grep bpftrace
bpftrace-0.9-0.fc29.x86_64
So, after checking this fine reference guide quickly, I came up with a simple one-liner:
[openxs@fc29 ~]$ sudo bpftrace -e 'profile:hz:99 { @[ustack] = count(); }'
Error creating map: '@'
Error creating stack id map
This might have happened because kernel.perf_event_max_stackis smaller than 127. Try to tweak this value with sysctl kernel.perf_event_max_stack=<new value>
Error creating perf event map (-1)
Attaching 1 probe...
Segmentation fault
Ups! I tried to do what error message suggested:
[openxs@fc29 ~]$ sudo sysctl -w kernel.perf_event_max_stack=1000
kernel.perf_event_max_stack = 1000
But it had not helped. This is how I hit my first problem with bpftrace. Fedora had not cared to update it when it stopped matching "something" in a new kernel, for whatever reason. This is actually a real problem for eBPF programs - what they assume may not be "there" in a kernel, so they have to be checked carefully. There is a separate "BPF Type Format (BTF)" for encoding the debug info related to BPF program (or map) that may help to find out if the program is "compatible" without trying it. In my case incompatibility was obvious and I needed some fast way to proceed with bpftrace, in a hope to come up with some useful examples for my talk.

So, i decided to build new version of bpftrace from recent GitHub source (against bcc tools I had installed and supposedly up to date). This ended up with errors no matter what I tried:
[openxs@fc29 build]$ make
[  3%] Built target parser
[  4%] Built target arch
[  5%] Building CXX object src/ast/CMakeFiles/ast.dir/irbuilderbpf.cpp.o
/mnt/home/openxs/git/bpftrace/src/ast/irbuilderbpf.cpp: In member function 'void bpftrace::ast::IRBuilderBPF::CreateSignal(llvm::Value*)':
/mnt/home/openxs/git/bpftrace/src/ast/irbuilderbpf.cpp:669:16: error: 'BPF_FUNC_send_signal' was not declared in this scope
       getInt64(BPF_FUNC_send_signal),
                ^~~~~~~~~~~~~~~~~~~~
...
Quick search led me to the idea that it's actually a know and recently reported problem, see Issue #1014. As a solution I decided to rebuild bcc tools from GitHub source as well (the alternative to to change bpftrace source) and removed Fedora packages. To make a long story short, this is what I did eventually (from fc -l output):
1051     rm -rf bcc/
1052     rm -rf bpftrace/
1053     sudo dnf remove bcc
1054     git clone https://github.com/iovisor/bcc.git
1055     cd bcc/
1056     git submodule update --init --recursive
1057     sudo dnf install -y bison cmake ethtool flex git iperf libstdc++-static   python-netaddr python-pip gcc gcc-c++ make zlib-devel   elfutils-libelf-devel
1058     sudo dnf install -y luajit luajit-devel
1059     sudo dnf install -y   http://repo.iovisor.org/yum/extra/mageia/cauldron/x86_64/netperf-2.7.0-1.mga6.x86_64.rpm
1060     sudo pip install pyroute2
1061     sudo dnf install -y clang clang-devel llvm llvm-devel llvm-static ncurses-devel
1062     mkdir build
1063     cd build/
1064     cmake .. -DCMAKE_INSTALL_PREFIX=/usr
1065     make
1066     make test
1067     sudo make install
1068     cd ../..
1069     git clone https://github.com/iovisor/bpftrace.git
1070     cd bpftrace/
1071     git submodule update --init --recursive
1072     mkdir build
1073     cd build/
1074     cmake -DCMAKE_BUILD_TYPE=Release ../
1075     make
1076     ./tests/bpftrace_test
1077     fc -l
Basically I did it all as documented in related projects for my version of Fedora. Step highlighted (1056 above) is essential though, as it was missed here in any obvious way as a mandatory (while it is in recent versions). Step 1076 was successful:
...
[----------] Global test environment tear-down
[==========] 350 tests from 9 test cases ran. (2590 ms total)
[  PASSED  ] 350 tests.

[openxs@fc29 build]$ make testRunning tests...
Test project /mnt/home/openxs/git/bpftrace/build
    Start 1: bpftrace_test
1/3 Test #1: bpftrace_test ....................   Passed    2.42 sec
    Start 2: runtime_test
2/3 Test #2: runtime_test .....................***Failed   21.10 sec
    Start 3: tools-parsing-test
3/3 Test #3: tools-parsing-test ...............***Failed    5.88 sec

33% tests passed, 2 tests failed out of 3

Total Test time (real) =  29.41 sec

The following tests FAILED:
          2 - runtime_test (Failed)
          3 - tools-parsing-test (Failed)
Errors while running CTest
make: *** [Makefile:128: test] Error 8
while make test shown some failures (same as make test for bcc tools), but form previous experience this had not looked like a big deal. I continued basic testing that the resulting bpftrace now works (I tried to add kernel probe, guess what software runs on this workstation):
[openxs@fc29 build]$ sudo ./src/bpftrace -e 'kprobe:do_nanosleep { printf("sleep by %s\n", comm); }'
Attaching 1 probe...
sleep by percona-qan-age
sleep by percona-qan-age
sleep by skypeforlinux
sleep by mysqld
...
^C
So, with some confidence built up, I've installed my new and very recent bpftrace tool I built myself:
[openxs@fc29 build]$ sudo make install
[  3%] Built target parser
...
-- Installing: /usr/local/bin/bpftrace
...
-- Installing: /usr/local/man/man8/xfsdist.8.gz

[openxs@fc29 build]$ which bpftrace
/usr/local/bin/bpftrace

[openxs@fc29 build]$ cd ..
[openxs@fc29 bpftrace]$ git log -1
commit afafbf561522dd33fa316be3e33375bc662399ac (HEAD -> master, origin/master, origin/HEAD)
Author: bas smit <bas@baslab.org>
Date:   Fri Jan 17 12:33:43 2020 +0100

    Add `--info` flag

    This prints the supported kernel features
Quick test involved starting MariaDB 10.3 with thread pool enabled (more on "why thread pool?" in some next blog post):
[openxs@fc29 maria10.3]$ ./bin/mysqld_safe --no-defaults --thread_handling=pool-of-threads --thread_pool_idle_timeout=10 &
[1] 32643
and running sysbench:
[openxs@fc29 maria10.3]$ sysbench /usr/local/share/sysbench/oltp_update_index.lua --mysql-host=127.0.0.1 --mysql-user=root --mysql-port=3306 --threads=10 --tables=4 --table-size=1000000 --time=600 --report-interval=5 run
sysbench 1.1.0-174f3aa (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 10
Report intermediate results every 5 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 5s ] thds: 10 tps: 54.37 qps: 54.37 (r/w/o: 0.00/54.37/0.00) lat (ms,95%): 467.30 err/s: 0.00 reconn/s: 0.00
[ 10s ] thds: 10 tps: 40.60 qps: 40.60 (r/w/o: 0.00/40.60/0.00) lat (ms,95%): 530.08 err/s: 0.00 reconn/s: 0.00
[ 15s ] thds: 10 tps: 40.00 qps: 40.00 (r/w/o: 0.00/40.00/0.00) lat (ms,95%): 520.62 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 10 tps: 38.00 qps: 38.00 (r/w/o: 0.00/38.00/0.00) lat (ms,95%): 623.33 err/s: 0.00 reconn/s: 0.00
[ 25s ] thds: 10 tps: 34.00 qps: 34.00 (r/w/o: 0.00/34.00/0.00) lat (ms,95%): 746.32 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 10 tps: 36.40 qps: 36.40 (r/w/o: 0.00/36.40/0.00) lat (ms,95%): 549.52 err/s: 0.00 reconn/s: 0.00
[ 35s ] thds: 10 tps: 37.80 qps: 37.80 (r/w/o: 0.00/37.80/0.00) lat (ms,95%): 601.29 err/s: 0.00 reconn/s: 0.00
...
and adding the probe like this in another shell:
[openxs@fc29 ~]$ sudo bpftrace -e 'profile:hz:99 { @[ustack] = count(); }' > /tmp/bpf.stack
^C
When I stopped collecting with Ctrl-C, bpftrace printed the content of associative arrays collected (with unique user space stack 9for all processes!) as index and count of time this stack was as a value). No segmentation fault, unlike with the version from package! In the resulting file we can see (among other things):
[openxs@fc29 bpftrace]$ cat /tmp/bpf.stack | more
Attaching 1 probe...

...
@[
    rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned short*, bool, unsigned long, mem_block_info_t**)+31
    page_cur_search_with_match_bytes(buf_block_t const*, dict_index_t const*, dtuple_t const*, page_cur_mode_t, unsigned long*, unsigned long*, unsigned long*, unsigned long*, page_cur_t*)+1773
    btr_cur_search_to_nth_level_func(dict_index_t*, unsigned long, dtuple_t const*, page_cur_mode_t, unsigned long, btr_cur_t*, rw_lock_t*, char const*, unsigned int, mtr_t*, unsigned long)+8316
    btr_pcur_open_low(dict_index_t*, unsigned long, dtuple_t const*, page_cur_mode_t, unsigned long, btr_pcur_t*, char const*, unsigned int, unsigned long, mtr_t*) [clone .constprop.33]+146
    row_search_on_row_ref(btr_pcur_t*, unsigned long, dict_table_t const*, dtuple_t const*, mtr_t*)+85
    row_purge_poss_sec(purge_node_t*, dict_index_t*, dtuple_t const*, btr_pcur_t*, mtr_t*, bool)+503
    row_purge_remove_sec_if_poss_leaf(purge_node_t*, dict_index_t*, dtuple_t const*)+1293
    row_purge_record_func(purge_node_t*, unsigned char*, que_thr_t const*, bool)+2271
    row_purge_step(que_thr_t*)+1057
    que_run_threads(que_thr_t*)+2280
    srv_worker_thread+956
    start_thread+234
]: 1

...

@[
    rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned short*, bool, unsigned long, mem_block_info_t**)+31
    btr_search_build_page_hash_index(dict_index_t*, buf_block_t*, rw_lock_t*, unsigned long, unsigned long, unsigned long)+2781
    btr_search_info_update_slow(btr_search_t*, btr_cur_t*)+622
    btr_cur_search_to_nth_level_func(dict_index_t*, unsigned long, dtuple_t const*, page_cur_mode_t, unsigned long, btr_cur_t*, rw_lock_t*, char const*, unsigned int, mtr_t*, unsigned long)+10130
    row_search_mvcc(unsigned char*, page_cur_mode_t, row_prebuilt_t*, unsigned long, unsigned long)+4823
    ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)+338
    handler::ha_index_read_map(unsigned char*, unsigned char const*, unsigned long, ha_rkey_function)+280
    handler::read_range_first(st_key_range const*, st_key_range const*, bool, bool)+98
    handler::multi_range_read_next(void**)+183
    Mrr_simple_index_reader::get_next(void**)+74
    DsMrr_impl::dsmrr_next(void**)+66
    QUICK_RANGE_SELECT::get_next()+39
    rr_quick(READ_RECORD*)+25
    mysql_update(THD*, TABLE_LIST*, List<Item>&, List<Item>&, Item*, unsigned int, st_order*, unsigned long long, bool, unsigned long long*, unsigned long long*)+3560
    mysql_execute_command(THD*)+8446
    Prepared_statement::execute(String*, bool)+979
    Prepared_statement::execute_loop(String*, bool, unsigned char*, unsigned char*)+154
    mysql_stmt_execute_common(THD*, unsigned long, unsigned char*, unsigned char*, unsigned long, bool, bool)+349
    mysqld_stmt_execute(THD*, char*, unsigned int)+37
    dispatch_command(enum_server_command, THD*, char*, unsigned int, bool, bool)+5123
    do_command(THD*)+368
    tp_callback(TP_connection*)+314
    worker_main(void*)+160
    start_thread+234
]: 1

...

@[
    trx_purge(unsigned long, bool)+1054
    srv_purge_coordinator_thread+1872
    start_thread+234
]: 2
@[
    ut_crc32_sw(unsigned char const*, unsigned long)+1611
    buf_calc_page_crc32(unsigned char const*)+61
    buf_flush_init_for_writing(buf_block_t const*, unsigned char*, void*, unsigned long)+872
    buf_flush_write_block_low(buf_page_t*, buf_flush_t, bool)+149
    buf_flush_page(buf_pool_t*, buf_page_t*, buf_flush_t, bool)+607
    buf_flush_page_and_try_neighbors(buf_page_t*, buf_flush_t, unsigned long, unsigned long*)+1627
    buf_do_flush_list_batch(buf_pool_t*, unsigned long, unsigned long)+392
    pc_flush_slot()+2617
    buf_flush_page_cleaner_coordinator.cold.151+1235
    start_thread+234
]: 2

...
@[]: 6474
So, this is the default format to output stacks (one line per call, no address, frame number or anything, symbols properly resolved for my MariaDB built from GitHub source etc). But I had something like this mind eventually (from pt-pmp):
[openxs@fc29 bpftrace]$ pt-pmp
Tue Jan 21 13:52:12 EET 2020

...
     10 syscall(libc.so.6),__io_getevents_0_4(libaio.so.1),LinuxAIOHandler::collect(os0file.cc:1932),LinuxAIOHandler::poll(os0file.cc:2093),os_aio_linux_handler(stl_vector.h:805),os_aio_handler(stl_vector.h:805),fil_aio_wait(fil0fil.cc:4435),io_handler_thread(srv0start.cc:326),start_thread(libpthread.so.0),clone(libc.so.6)
      6 pthread_cond_wait,os_event::wait(sync0types.h:476),os_event::wait_low(sync0types.h:476),log_write_up_to(log0log.cc:929),trx_flush_log_if_needed_low(trx0trx.cc:1218),trx_flush_log_if_needed(trx0trx.cc:1218),trx_commit_complete_for_mysql(trx0trx.cc:1218),innobase_commit(ha_innodb.cc:4637),commit_one_phase_2(handler.cc:1642),ha_commit_one_phase(handler.cc:1622),ha_commit_trans(handler.cc:1484),trans_commit_stmt(transaction.cc:505),mysql_execute_command(sql_parse.cc:6106),Prepared_statement::execute(sql_prepare.cc:4807),Prepared_statement::execute_loop(sql_prepare.cc:4235),Prepared_statement::execute_loop(sql_prepare.cc:4235),mysql_stmt_execute_common(sql_prepare.cc:3235),mysqld_stmt_execute(sql_prepare.cc:3132),dispatch_command(sql_parse.cc:1798),do_command(sql_parse.cc:1401),threadpool_process_request(threadpool_common.cc:359),tp_callback(threadpool_common.cc:359),worker_main(threadpool_generic.cc:1606),start_thread(libpthread.so.0),clone(libc.so.6)
      3 pthread_cond_wait,os_event::wait(sync0types.h:476),os_event::wait_low(sync0types.h:476),srv_resume_thread(srv0srv.cc:2533),srv_worker_thread(srv0srv.cc:2533),start_thread(libpthread.so.0),clone(libc.so.6)
      3 pthread_cond_timedwait,inline_mysql_cond_timedwait(mysql_thread.h:1215),get_event(mysql_thread.h:1215),worker_main(threadpool_generic.cc:1602),start_thread(libpthread.so.0),clone(libc.so.6)
      ...
That is, number of times collapsed stack was noted and then the stack itself, without function arguments if possible. This kind of stack is generated from gdb outputs that looks as follows:
Thread 1 (Thread 0x7f29eee66900 (LWP 32728)):
#0  0x00007f29ef25f431 in poll () from /lib64/libc.so.6
#1  0x000055cc4ece656f in poll (__timeout=-1, __nfds=2, __fds=0x7ffe174108e0) at /usr/include/bits/poll2.h:41
#2  handle_connections_sockets () at /mnt/home/openxs/git/server/sql/mysqld.cc:6753
#3  0x000055cc4ece7a7c in mysqld_main (argc=<optimized out>, argv=<optimized out>) at /mnt/home/openxs/git/server/sql/mysqld.cc:6222
#4  0x00007f29ef190413 in __libc_start_main () from /lib64/libc.so.6
#5  0x000055cc4ecd9ece in _start () at /mnt/home/openxs/git/server/sql/sql_list.h:158
So, each tread stack starts from 'Thread..." line, each frame is numbered and function call is actually the fourth word in the output (in awk terms).
 
Based on that I tried add probe to produce stack in perf format for mysqld process, with some decoration to have a clear marker of when new capture starts:
[openxs@fc29 bpftrace]$ sudo bpftrace -e 'profile:hz:99 /comm == "mysqld"/ {printf("# %s\n", ustack(perf));}' > /tmp/ustack.txt
If you wonder about the details of command format, functions used etc, check the Reference Guide. The result looks as follows in hexdump -C output:
00000000  41 74 74 61 63 68 69 6e  67 20 31 20 70 72 6f 62  |Attaching 1 prob|
00000010  65 2e 2e 2e 0a 23 20 0a  09 37 66 32 39 65 66 62  |e....# ..7f29efb|
00000020  38 34 33 30 30 20 6e 61  6e 6f 73 6c 65 65 70 2b  |84300 nanosleep+|
00000030  36 34 20 28 2f 75 73 72  2f 6c 69 62 36 34 2f 6c  |64 (/usr/lib64/l|
00000040  69 62 70 74 68 72 65 61  64 2d 32 2e 32 38 2e 73  |ibpthread-2.28.s|
00000050  6f 29 0a 0a 23 20 0a 09  35 35 63 63 34 66 32 30  |o)..# ..55cc4f20|
...
This is how you can see where are the new lines, how many etc, for proper parsing. Next step was to check pt-pmp source code and http://poormansprofiler.org/, review and adapt some awk magic used there to collapse stacks, get rid of function arguments etc, and I've ended up with this draft version:
[openxs@fc29 maria10.3]$ cat /tmp/ustack.txt | awk '
BEGIN { s = ""; }
/^\#/ { print s; s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s "," targ } else { s = targ } }
END { print s }' | sort | uniq -c | sort -r -n -k 1,1 | more

    199 __sched_yield,srv_purge_coordinator_thread,start_thread
     16 srv_get_task_queue_length,trx_purge,srv_purge_coordinator_thread,start_thread
     13 trx_purge,srv_purge_coordinator_thread,start_thread
      4 srv_purge_coordinator_thread,start_thread
      4 srv_get_task_queue_length,srv_purge_coordinator_thread,start_thread
      2 ut_crc32_sw,buf_calc_page_crc32,buf_page_is_corrupted,buf_page_io_complete,buf_read_page,buf_page_get_gen,btr_cur_search_to_nth_level_func,row_search_mvcc,ha_innobase::index_read,handler::ha_index_read_map,handler::read_range_first,handler::multi_range_read_next,Mrr_simple_index_reader::get_next,DsMrr_impl::dsmrr_next,QUICK_RANGE_SELECT::get_next,rr_quick,mysql_update,mysql_execute_command,Prepared_statement::execute,Prepared_statement::execute_loop,mysql_stmt_execute_common,mysqld_stmt_execute,dispatch_command,do_command,tp_callback,worker_main,start_thread
      2 syscall,
      2 __pwrite,
      2 mtr_t::commit,row_ins_sec_index_entry_low,row_ins_sec_index_entry,row_upd_sec_index_entry,row_upd_step,row_update_for_mysql,ha_innobase::update_row,handler::ha_update_row,mysql_update,mysql_execute_command,Prepared_statement::execute,Prepared_statement::execute_loop,mysql_stmt_execute_common,mysqld_stmt_execute,dispatch_command,do_command,tp_callback,worker_main,start_thread
...
The above looks reasonable (even if not yet a direct match to what pt-pmp produces, I am too lazy to do much awk these days).

What's more interesting is that with bpftrace we can also get kernel stack for thread while profiling. I took a quick look at what we may get as a result of combining both:
[openxs@fc29 bpftrace]$ sudo bpftrace -e 'profile:hz:99 /comm == "mysqld"/ {printf("# %s\n%s\n", kstack(perf), ustack(perf));}' > /tmp/kstack.txt
[sudo] password for openxs:
^C
I just pring one after the other without any separator in perf format, so I can apply the same trick as above to the result:
openxs@fc29 maria10.3]$ cat /tmp/kstack.txt | awk '                   BEGIN { s = ""; }
/^\#/ { print s; s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s "," targ } else { s = targ } }
END { print s }' | sort | uniq -c | sort -r -n -k 1,1 | more

     93 __sched_text_start,schedule,__x64_sys_sched_yield,do_syscall_64,entry_SYSCALL_64_after_hwframe,__sched_yield,srv_purge_coordinator_thread,start_thread
     49 __sched_yield,srv_purge_coordinator_thread,start_thread
     42 do_syscall_64,entry_SYSCALL_64_after_hwframe,__sched_yield,srv_purge_coordinator_thread,start_thread
     20 srv_get_task_queue_length,trx_purge,srv_purge_coordinator_thread,start_thread
     11 trx_purge,srv_purge_coordinator_thread,start_thread
      2 srv_get_task_queue_length,srv_purge_coordinator_thread,start_thread
      1 __x86_indirect_thunk_rax,do_syscall_64,entry_SYSCALL_64_after_hwframe,__sched_yield,srv_purge_coordinator_thread,start_thread
      1 ut_crc32_sw,log_block_calc_checksum_crc32,log_write_up_to,trx_commit_complete_for_mysql,innobase_commit,commit_one_phase_2,ha_commit_trans,trans_commit_stmt,mysql_execute_command,Prepared_statement::execute,Prepared_statement::execute_loop,mysql_stmt_execute_common,mysqld_stmt_execute,dispatch_command,do_command,tp_callback,worker_main,start_thread
...
      1 _raw_spin_unlock_irqrestore,ata_scsi_queuecmd,scsi_queue_rq,blk_mq_dispatch_rq_list,blk_mq_do_dispatch_sched,blk_mq_sched_dispatch_requests,__blk_mq_run_hw_queue,__blk_mq_delay_run_hw_queue,blk_mq_run_hw_queue,blk_mq_sched_insert_requests,blk_mq_flush_plug_list,blk_flush_plug_list,blk_finish_plug,ext4_writepages,do_writepages,__filemap_fdatawrite_range,file_write_and_wait_range,ext4_sync_file,do_fsync,__x64_sys_fsync,do_syscall_64,entry_SYSCALL_64_after_hwframe,fsync,fil_flush_low,fil_flush_file_spaces,buf_dblwr_update,buf_page_io_complete,fil_aio_wait,io_handler_thread,start_thread
      1 page_simple_validate_new,buf_dblwr_flush_buffered_writes,pc_flush_slot,buf_flush_page_cleaner_coordinator.cold.151,start_thread
      1 page_cur_insert_rec_write_log,page_cur_insert_rec_low,page_copy_rec_list_end_no_locks,btr_page_reorganize_low,btr_can_merge_with_page,btr_compress,btr_cur_compress_if_useful,btr_cur_pessimistic_delete,row_purge_remove_sec_if_poss_tree,row_purge_record_func,row_purge_step,que_run_threads,srv_worker_thread,start_thread
      1 os_thread_yield,srv_purge_coordinator_thread,start_thread
      1 os_thread_get_curr_id,fseg_free_page_func,btr_page_free,btr_compress,btr_cur_compress_if_useful,btr_cur_pessimistic_delete,row_purge_remove_sec_if_poss_tree,row_purge_record_func,row_purge_step,que_run_threads,srv_worker_thread,start_thread
...
Looks cool to me, we see where we spend time in kernel while on-CPU, not only in user space code.

It's sometimes scary and not that easy to use bpftrace, but you'll get there eventually, same as we got to North Greenwich via that Emirates Air Line back in 2019. The only limit is the sky!
* * *
To summarize:
  1. bpftrace and bcc tools you get from vendor packages may not work as expected, so be ready to get nightly builds etc or, as I prefer, build from source. If this is a problem better use ftrace :) 
  2. It seems getting stack traces with bpftrace for pt-pmp is an easy next stap to make. I may be integrated as one of tracers supported here.
  3. Now we can also add kernel stack to the picure.
  4. "JUST DO BPF"!
I have many more details to share about dynamic tracing for MySQL and MariaDB DBAs. Stay tuned!

Saturday, January 25, 2020

Dynamic Tracing of MariaDB Server With ftrace - Basic Example

Today I'd like to continue my series of blog posts about dynamic tracing of MySQL server (or Percona Server for MySQL, or MariaDB, or basically any other software, whatever you prefer) on Linux. This post ideally had to start it, as the approach discussed here is truly dynamic and suitable production servers under high load (unlike gdb tricks, most of the time), works more or less the same way starting from Linux 2.6.27, is "always there" at your fingertips (if you are root, but this is a requirement in a general case for all kinds of dynamic tracing), and does not even strictly require -debuginfo packages with symbolic information for basic usage. I mean ftrace.

To prove that this interface that appeared in 2008 is still usable, let me add the same kind of probe for tracing dispatch_command() function and printing SQL statement that somebody tries to execute (so we can avoid enabling general query log etc) with pure ftrace interface, without much explanations or background (that is presented below). In this case I use recent MariaDB Server 10.3.22 built from source on my Ubuntu 16.04 netbook (because I do not care about being shared in the feed of Planet MySQL anymore in any case).

You just have to do the following:
  1. Check that dynamic tracing with ftrace is possible:
    openxs@ao756:~$ mount | grep tracefs
    tracefs on /sys/kernel/debug/tracing type tracefs (rw,relatime)
    If you do not see tracefs mounted, mount it with the following command (as root):
    mount -t tracefs nodev /sys/kernel/tracing
  2. Find the address of (mangled) dispatch_command() function in the mysqld binary:
    openxs@ao756:~$ objdump -T /home/openxs/dbs/maria10.3/bin/mysqld | grep dispatch_command
    0000000000587b90
    g    DF .text  000000000000236e  Base        _Z16dispatch_command19enum_server_commandP3THDPcjbb
    The address you need (in many cases) is the first word in the output, in hex. So, 0x0000000000587b90
  3. If you want to print values of parameters passed, you need to know how they are passed by the compiler used on the architecture used. Let me share the quote that applies to x86_64:
    "First six arguments are in rdi, rsi, rdx, rcx, r8d, r9d; remaining arguments are on the stack."
    In my case (see this post or details below, or check the source code yourself if in doubts) we need to print third argument, so it must be in the rdx (64-bit) register. For ftrace we have to use just %dx, trust me for now, I tried ;)
  4. Check probe syntax (see the picture below for quick reference) and come up with something like this to add a probe named "dc". Note that on Ubuntu 16.04 this had not worked via sudo directly, I had to become root via "sudo su -" (weird, something to check one day):
    root@ao756:~# echo 'p:dc /home/openxs/dbs/maria10.3/bin/mysqld:0x0000000000587b90 query=+0(%dx):string' > /sys/kernel/debug/tracing/uprobe_events
    All these happens while MariaDB server is running and procerssing queries, in a truly dynamic manner. There is no impact at all so far. Creators of MariaDB never added anything ot the code (or to the build process) to make this work. It's enough to have a binary that runs of Linux kernel that supports ftrace.
  5. Enable tracing for the probe added (this is when magic happens) and tracing in general:
    root@ao756:~# echo 1 > /sys/kernel/debug/tracing/events/uprobes/dc/enable
    root@ao756:~# echo 1 > /sys/kernel/debug/tracing/tracing_on
  6. Now assuming you run the following:
    root@ao756:~# mysql -uroot --socket=/tmp/mariadb.sock
    Welcome to the MySQL monitor.  Commands end with ; or \g.
    Your MySQL connection id is 92
    Server version: 5.5.5-10.3.22-MariaDB Source distribution

    Copyright (c) 2009-2019 Percona LLC and/or its affiliates
    Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.

    Oracle is a registered trademark of Oracle Corporation and/or its
    affiliates. Other names may be trademarks of their respective
    owners.

    Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

    mysql> select 1+1;
    +-----+
    | 1+1 |
    +-----+
    |   2 |
    +-----+
    1 row in set (0.00 sec)

    mysql> select version();
    +-----------------+
    | version()       |
    +-----------------+
    | 10.3.22-MariaDB |
    +-----------------+
    1 row in set (0.00 sec)
    you may get the trace as follows, for example:
    root@ao756:~# cat /sys/kernel/debug/tracing/trace
    # tracer: nop
    #
    # entries-in-buffer/entries-written: 9851/9851   #P:2
    #
    #                              _-----=> irqs-off
    #                             / _----=> need-resched
    #                            | / _---=> hardirq/softirq
    #                            || / _--=> preempt-depth
    #                            ||| /     delay
    #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
    #              | |       |   ||||       |         |
              python-21369 [001] d... 209044.826580: : sys_wr: 0000000000000008
    ...
    (I had show the top of the output to give hint about format), or:
    root@ao756:~# cat /sys/kernel/debug/tracing/trace_pipe
              mysqld-1082  [000] d... 273258.971401: dc: (0x560d8a20fb90) query="select @@version_comment limit 1"
              mysqld-1082  [000] d... 273263.187839: dc: (0x560d8a20fb90) query="select 1+1"
              mysqld-1082  [001] d... 273269.128542: dc: (0x560d8a20fb90) query="select version()"
    ...
  7. If you want to stop this basic tracing, run these:
    root@ao756:~# echo 0 > /sys/kernel/debug/tracing/events/uprobes/dc/enable
    root@ao756:~# echo 0 > /sys/kernel/debug/tracing/tracing_on
So, with 7 simple steps we can enable dynamic tracing and see SQL statements executed and time when this happened.
Check https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/trace/uprobetracer.rst
The magic behind ftrace and steps that I really had to perform when I tried to apply this approach for the very first time will be presented in the next posts. Stay tuned! 

Monday, October 7, 2019

Dynamic Tracing of MariaDB Server With bpftrace - Basic Example

Unlike the previous post, this one is not just a comment to some slides from the "Tracing and Profiling MySQL" talk at Percona Live Europe 2019. I am going to add the details that were missing there (as I was in a hurry and had forgotten to copy/paste proper outputs while testing). I am going to show how to add dynamic probe with "latest and greatest" bpftrace tool.

The goal is the same as before - try to add dynamic probe(s) to trace query execution. More specifically, to capture text of the queries executed by clients of MySQL server. As bpftrace requires new kernel and just does not work on Ubuntu 16.04, for the demonstration I use my Fedora 29 box with kernel 5.2.x and, for a change, get queries from Fedora's own MariaDB 10.3.17 installed from rpm packages there.

I have both bcc and bpftrace installed also from packages:
[openxs@fc29 ~]$ rpm -qa | grep bcc
python3-bcc-0.8.0-1.fc29.x86_64
bcc-0.8.0-1.fc29.x86_64
bcc-tools-0.8.0-1.fc29.x86_64
[openxs@fc29 ~]$ rpm -qa | grep bpf
bpftrace-0.9-0.fc29.x86_64
[openxs@fc29 ~]$
You can check fine manual for the details, but even basic -h option provides enough to start, as long as you already know some terms and probes syntax:
[openxs@fc29 ~]$ bpftrace -h
USAGE:
    bpftrace [options] filename
    bpftrace [options] -e 'program'

OPTIONS:
    -B MODE        output buffering mode ('line', 'full', or 'none')
    -d             debug info dry run
    -dd            verbose debug info dry run
    -e 'program'   execute this program
    -h             show this help message
    -l [search]    list probes
    -p PID         enable USDT probes on PID
    -c 'CMD'       run CMD and enable USDT probes on resulting process
    -v             verbose messages
    --version      bpftrace version

ENVIRONMENT:
    BPFTRACE_STRLEN           [default: 64] bytes on BPF stack per str()
    BPFTRACE_NO_CPP_DEMANGLE  [default: 0] disable C++ symbol demangling

EXAMPLES:
bpftrace -l '*sleep*'
    list probes containing "sleep"
bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...\n", pid); }'
    trace processes calling sleep
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
    count syscalls by process name
In our case we need to define uprobe for proper mysqld binary and trace the dispatch_command() function. Before we start, note that parameters of dispatch_command() in MariaDB 10.3 are not the same as in Percona Server 5.7 I've used in the previous post. Basically, this function starts as follows in sql/sql_parse.cc:
   1570 bool dispatch_command(enum enum_server_command command, THD *thd,
   1571                       char* packet, uint packet_length, bool is_com_mult        i,
   1572                       bool is_next_command)
Note the third argument, packet. If the first argument, command. is SQL_QUERY, then packet contains the query text (as a zero-terminated string) for sure (it's also true for many other commands, but let me skip the details for now). That's why we'll use third argument in our uprobe to capture the SQL text.

Now, let's start the service and check the exact full path name for the mysql binary:
[openxs@fc29 ~]$ sudo service mariadb start
[sudo] password for openxs:
Redirecting to /bin/systemctl start mariadb.service
[openxs@fc29 ~]$ ps aux | grep mysqldmysql     9109  6.2  1.2 1699252 104108 ?      Ssl  09:30   0:00 /usr/libexec/mysqld --basedir=/usr
openxs    9175  0.0  0.0 215744   892 pts/0    S+   09:30   0:00 grep --color=auto mysqld
The first naive attempt to add the probe after cursory reading the documentation and checking few examples may look like this:
 [openxs@fc29 ~]$ sudo bpftrace -e 'uprobe:/usr/libexec/mysqld:dispatch_command { printf("%s\n", str(arg2)); }'
Attaching 1 probe...
Could not resolve symbol: /usr/libexec/mysqld:dispatch_command
It seems in my case, unlike perf, bpftrace is not "aware" of C++ names or symbolic information in a separate -debuginfo package. So, I need mangled name:
[openxs@fc29 ~]$ nm -na /usr/libexec/mysqld | grep dispatch_command
nm: /usr/libexec/mysqld: no symbols
[openxs@fc29 ~]$ nm -na /home/openxs/dbs/maria10.3/bin/mysqld | grep dispatch_command
00000000004a1eef t _Z16dispatch_command19enum_server_commandP3THDPcjbb.cold.344
00000000005c5190 T _Z16dispatch_command19enum_server_commandP3THDPcjbb
00000000005c5190 t _Z16dispatch_command19enum_server_commandP3THDPcjbb.localalias.256
Surely there is no symbols in the binary from Fedora package, so I checked the binary (of the same version) that I've built myself (as usual) and assumed that neither parameters nor mangling approach could be different. So, the next attempt to add dynamic probe would look as follows:
[openxs@fc29 ~]$ sudo bpftrace -e 'uprobe:/usr/libexec/mysqld:_Z16dispatch_command19enum_server_commandP3THDPcjbb { printf("%s\n", str(arg2)); }'
Attaching 1 probe...
show databases
show tables
t1
select @@version_comment limit 1
select user, host from mysql.user
^C
It worked and you see above the output I've got for the following session:
[openxs@fc29 ~]$ mysql -uroot testReading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 19
Server version: 10.3.17-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [test]> select user, host from mysql.user;
+------+-----------+
| user | host      |
+------+-----------+
| root | 127.0.0.1 |
| root | ::1       |
| root | fc29      |
| root | localhost |
+------+-----------+
4 rows in set (0.000 sec)
You can see some SQL statements generated when mysql command line connects, as well as packet value in some other packet ("t1") than COM_QUERY, probably. My probe had not even tried to check other parameters besides the supposed query text.

Now, the probe is defined on uprobe:/usr/libexec/mysqld:_Z16dispatch_command19enum_server_commandP3THDPcjbb - I've just used long, mangled version of the function name and full path name to the binary, and defined a dynamic probe (uprobe). There is no filter and the action for the probe is defined as { printf("%s\n", str(arg2)); } - that is, I print third argument (they are numbered starting from zero, arg0, arg1, arg2, ...) as a zero-terminated string. Without str() built in function I'd get just a pointer that could be printed as (unsigned) long integer, u64.

Basically, that's all. We have a quick and dirty way to capture all queries. No timing or anything, but it all depends on probe action that can use numerous built in variables and functions.

More "advanced" use of bpftrace, a lame attempt to capture time to execute query, may look like this:
[openxs@fc29 ~]$ sudo bpftrace -e 'uprobe:/usr/libexec/mysqld:_Z16dispatch_command19enum_server_commandP3THDPcjbb { @sql = str(arg2); @start[@sql] = nsecs; }
uretprobe:/usr/libexec/mysqld:_Z16dispatch_command19enum_server_commandP3THDPcjbb /@start[@sql] != 0/ { printf("%s : %u64 ms\n", @sql, (nsecs - @start[@sql])/1000000); } '

Attaching 2 probes...
select sleep(3) : 300064 ms
select sleep(1) : 100064 ms
^C

@sql: select sleep(1)


@start[select sleep(3)]: 10666558704666
@start[select sleep(1)]: 10685614895675


[openxs@fc29 ~]$
In this case I try to store time since probe start into the associative array with query text as an "index" and start time (in nanoseconds) as a value.Then I calculate the difference from current nsecs value upon function return, in a separate uretprobe. I've used global variables, @sql for the query text, and @start[] for the array. It even seems to work well for a single threaded load based on the above. But as soon as I try to use multiple concurrent threads:
[openxs@fc29 ~]$ for i in `seq 1 4`; do mysql -uroot test -e"select sleep($i)" & done
it becomes clear that global variables are really global and my outputs are all wrong.

A bit better version may look like this:
[openxs@fc29 ~]$ sudo bpftrace -e 'uprobe:/usr/libexec/mysqld:_Z16dispatch_command19enum_server_commandP3THDPcjbb { @sql[tid] = str(arg2); @start[tid] = nsecs; }                                                                               uretprobe:/usr/libexec/mysqld:_Z16dispatch_command19enum_server_commandP3THDPcjbb /@start[tid] != 0/ { printf("%s : %u64 %u64 ms\n", @sql[tid], tid, (nsecs - @start[tid])/1000000); } '
Attaching 2 probes...
select sleep(1) : 1120764 100064 ms
 : 1120764 064 ms
select sleep(2) : 949064 200064 ms
 : 949064 064 ms
select sleep(3) : 1120864 300064 ms
 : 1120864 064 ms
select sleep(4) : 1120664 400064 ms
 : 1120664 064 ms
select sleep(1) : 1120764 100064 ms
 : 1120764 064 ms
select sleep(2) : 949064 200064 ms
 : 949064 064 ms
select sleep(3) : 1120664 300064 ms
 : 1120664 064 ms
select sleep(4) : 1120864 400064 ms
 : 1120864 064 ms
^C

@sql[9490]:
@sql[11206]:
@sql[11207]:
@sql[11208]:


@start[11207]: 13609305005933
@start[9490]: 13610305621499
@start[11206]: 13611305753596
@start[11208]: 13612305235313


[openxs@fc29 ~]$
The output is for both sequential and concurrent execution of queries. I've used two associative arrays, @sql[] for queries and @start[] for start times, both indexed by tid - built in variable for thread id, that should not change, at least until we use pool-of-threads... You can also see that the tool by default outputs the content of all global associate arrays at the end, unless we free them explicitly.

* * *
This image of bpftrace internals is taken from the Brendan Gregg's post
bpftrace commands may be much more complicated and it may make sense to to store them in a separate file. The tool is near ideal of "quick and dirty" tests, and one day I'll write a way more complete posts with much better examples. But having a way to capture, filter and summarize queries in kernel space and send only the relevant results to a user space, all that in a safe manner, is really cool!

Sunday, October 6, 2019

Dynamic Tracing of MySQL Server With perf probe - Basic Example

I am going to write a series of blog posts based on my talks and experiences at Percona Live Europe 2019. The first one would be a kind of extended comment for a couple of slides from the "Tracing and Profiling MySQL" talk.

We can surely wait until Performance Schema instruments every other line of code or at least every important stage and wait in every storage engine we care about, but there is no real need for that. If you run any version of MySQL under Linux with more or less recent kernel (anything newer than 4.1 is good enough, in general), you can easily use dynamic tracing for any application (at least if there is symbolic information for the binaries), any time. As Brendan Gregg put it here:
"One benefit of dynamic tracing is that it can be enabled on a live system without restarting anything. You can take an already-running kernel or application and then begin dynamic instrumentation, which (safely) patches instructions in memory to add instrumentation. That means there is zero overhead or tax for this feature until you begin using it. One moment your binary is running unmodified and at full speed, and the next, it's running some extra instrumentation instructions that you dynamically added. Those instructions should eventually be removed once you've finished using your session of dynamic tracing."
One of the ways to use dynamic tracing (that is supported for a long time) is a perf profiler and its probe command. In the simplest case that I am going to illustrate here, probe is defined for a function defined in the code and refers to it by name. You can refer to the name of local variable, function parameter, local data structure member in the probe etc, and record the values of them with other probe data.

For a simple example let me consider recent Percona Server 5.7.x running on recent Ubuntu 16.04 with kernel 4.4.x. Let's assume I want to trace all calls to the dispatch_command() function and record every query every connection processes that way.

Skipping the details for now, let's assume I've found out (with gdb in my case, but it can be code review as well) that when this function is called I can see the query user wants to execute in the com_data structure passed via a pointer to the function:
(gdb) p com_data->com_query.query
$4 = 0x7fb0dba8d021 "select 2"
Based on this information and having -dbg package also installed for Percona Server I can add a probe dynamically any time using the following simple command (--add option is assumed by default):
openxs@ao756:~$ sudo perf probe -x /usr/sbin/mysqld 'dispatch_command com_data->com_query.query:string'
Added new event:
  probe_mysqld:dispatch_command (on dispatch_command in /usr/sbin/mysqld with query=com_data->com_query.query:string)

You can now use it in all perf tools, such as:

        perf record -e probe_mysqld:dispatch_command -aR sleep 1
In this probe I refer to the specific binary with -x option and full path name, and the function in that binary by name, and I say that I'd like to record the value of com_data->com_query.query as a zero-terminated string. Now I can use any variation of perf record command (with -F option to define sampling frequency, -g option to capture stack traces etc, see more here) and my probe will be one of the events captured.

For this simple example of tracing I'll use -e option to capture only the events related to the probe I defined. Probe name for this simple case by default consists of the binary name, colon (':') separator and function name. I'll use -R option to collect raw sample records and . I've also added -a option to collect samples on all CPUs. You can see the hint for possible command in the output above. 

So, I can record related events with default frequency as follows:
openxs@ao756:~$ sudo perf record -e 'probe_mysqld:dispatch_command*' -aR
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.676 MB perf.data (3 samples) ]
I let it work for some time in the foreground and then pressed Ctrl-C to stop collecting. Now I can check raw sample records with perf script command:
openxs@ao756:~$ sudo perf script >/tmp/queries.txt
openxs@ao756:~$ cat /tmp/queries.txt
          mysqld 31340 [001]  3888.849079: probe_mysqld:dispatch_command: (be9250) query="select 100"
          mysqld 31340 [001]  3891.648739: probe_mysqld:dispatch_command: (be9250) query="select user, host from mysql.user"
          mysqld 31340 [001]  3895.890141: probe_mysqld:dispatch_command: (be9250) query="select 2"
This is the detailed trace, with additional information (exact text of the query executed) added as requested. Output also included PID of the binary, CPU the sample was taken from and a timestamp.

When I am done with tracing, I can delete the probe with --del option referring it by name:
openxs@ao756:~$ sudo perf probe --del dispatch_commandRemoved event: probe_mysqld:dispatch_command
The (small, more on that later) overhead for tracing was added dynamically, only for the exact information I needed and only for the period of tracing. After the dynamic probe is removed we have exactly the same binary as originally started running with zero extra overhead. Now do this with Performance Schema :)

* * *
Slides are available at https://www.slideshare.net/valeriikravchuk1/


More details on the way other tools mentioned during the talk can be used by MySQL DBAs are coming soon in this blog. Stay tuned!