Monday, October 7, 2019

Dynamic Tracing of MariaDB Server With bpftrace - Basic Example

Unlike the previous post, this one is not just a comment to some slides from the "Tracing and Profiling MySQL" talk at Percona Live Europe 2019. I am going to add the details that were missing there (as I was in a hurry and had forgotten to copy/paste proper outputs while testing). I am going to show how to add dynamic probe with "latest and greatest" bpftrace tool.

The goal is the same as before - try to add dynamic probe(s) to trace query execution. More specifically, to capture text of the queries executed by clients of MySQL server. As bpftrace requires new kernel and just does not work on Ubuntu 16.04, for the demonstration I use my Fedora 29 box with kernel 5.2.x and, for a change, get queries from Fedora's own MariaDB 10.3.17 installed from rpm packages there.

I have both bcc and bpftrace installed also from packages:
[openxs@fc29 ~]$ rpm -qa | grep bcc
python3-bcc-0.8.0-1.fc29.x86_64
bcc-0.8.0-1.fc29.x86_64
bcc-tools-0.8.0-1.fc29.x86_64
[openxs@fc29 ~]$ rpm -qa | grep bpf
bpftrace-0.9-0.fc29.x86_64
[openxs@fc29 ~]$
You can check fine manual for the details, but even basic -h option provides enough to start, as long as you already know some terms and probes syntax:
[openxs@fc29 ~]$ bpftrace -h
USAGE:
    bpftrace [options] filename
    bpftrace [options] -e 'program'

OPTIONS:
    -B MODE        output buffering mode ('line', 'full', or 'none')
    -d             debug info dry run
    -dd            verbose debug info dry run
    -e 'program'   execute this program
    -h             show this help message
    -l [search]    list probes
    -p PID         enable USDT probes on PID
    -c 'CMD'       run CMD and enable USDT probes on resulting process
    -v             verbose messages
    --version      bpftrace version

ENVIRONMENT:
    BPFTRACE_STRLEN           [default: 64] bytes on BPF stack per str()
    BPFTRACE_NO_CPP_DEMANGLE  [default: 0] disable C++ symbol demangling

EXAMPLES:
bpftrace -l '*sleep*'
    list probes containing "sleep"
bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...\n", pid); }'
    trace processes calling sleep
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
    count syscalls by process name
In our case we need to define uprobe for proper mysqld binary and trace the dispatch_command() function. Before we start, note that parameters of dispatch_command() in MariaDB 10.3 are not the same as in Percona Server 5.7 I've used in the previous post. Basically, this function starts as follows in sql/sql_parse.cc:
   1570 bool dispatch_command(enum enum_server_command command, THD *thd,
   1571                       char* packet, uint packet_length, bool is_com_mult        i,
   1572                       bool is_next_command)
Note the third argument, packet. If the first argument, command. is SQL_QUERY, then packet contains the query text (as a zero-terminated string) for sure (it's also true for many other commands, but let me skip the details for now). That's why we'll use third argument in our uprobe to capture the SQL text.

Now, let's start the service and check the exact full path name for the mysql binary:
[openxs@fc29 ~]$ sudo service mariadb start
[sudo] password for openxs:
Redirecting to /bin/systemctl start mariadb.service
[openxs@fc29 ~]$ ps aux | grep mysqldmysql     9109  6.2  1.2 1699252 104108 ?      Ssl  09:30   0:00 /usr/libexec/mysqld --basedir=/usr
openxs    9175  0.0  0.0 215744   892 pts/0    S+   09:30   0:00 grep --color=auto mysqld
The first naive attempt to add the probe after cursory reading the documentation and checking few examples may look like this:
 [openxs@fc29 ~]$ sudo bpftrace -e 'uprobe:/usr/libexec/mysqld:dispatch_command { printf("%s\n", str(arg2)); }'
Attaching 1 probe...
Could not resolve symbol: /usr/libexec/mysqld:dispatch_command
It seems in my case, unlike perf, bpftrace is not "aware" of C++ names or symbolic information in a separate -debuginfo package. So, I need mangled name:
[openxs@fc29 ~]$ nm -na /usr/libexec/mysqld | grep dispatch_command
nm: /usr/libexec/mysqld: no symbols
[openxs@fc29 ~]$ nm -na /home/openxs/dbs/maria10.3/bin/mysqld | grep dispatch_command
00000000004a1eef t _Z16dispatch_command19enum_server_commandP3THDPcjbb.cold.344
00000000005c5190 T _Z16dispatch_command19enum_server_commandP3THDPcjbb
00000000005c5190 t _Z16dispatch_command19enum_server_commandP3THDPcjbb.localalias.256
Surely there is no symbols in the binary from Fedora package, so I checked the binary (of the same version) that I've built myself (as usual) and assumed that neither parameters nor mangling approach could be different. So, the next attempt to add dynamic probe would look as follows:
[openxs@fc29 ~]$ sudo bpftrace -e 'uprobe:/usr/libexec/mysqld:_Z16dispatch_command19enum_server_commandP3THDPcjbb { printf("%s\n", str(arg2)); }'
Attaching 1 probe...
show databases
show tables
t1
select @@version_comment limit 1
select user, host from mysql.user
^C
It worked and you see above the output I've got for the following session:
[openxs@fc29 ~]$ mysql -uroot testReading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 19
Server version: 10.3.17-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [test]> select user, host from mysql.user;
+------+-----------+
| user | host      |
+------+-----------+
| root | 127.0.0.1 |
| root | ::1       |
| root | fc29      |
| root | localhost |
+------+-----------+
4 rows in set (0.000 sec)
You can see some SQL statements generated when mysql command line connects, as well as packet value in some other packet ("t1") than COM_QUERY, probably. My probe had not even tried to check other parameters besides the supposed query text.

Now, the probe is defined on uprobe:/usr/libexec/mysqld:_Z16dispatch_command19enum_server_commandP3THDPcjbb - I've just used long, mangled version of the function name and full path name to the binary, and defined a dynamic probe (uprobe). There is no filter and the action for the probe is defined as { printf("%s\n", str(arg2)); } - that is, I print third argument (they are numbered starting from zero, arg0, arg1, arg2, ...) as a zero-terminated string. Without str() built in function I'd get just a pointer that could be printed as (unsigned) long integer, u64.

Basically, that's all. We have a quick and dirty way to capture all queries. No timing or anything, but it all depends on probe action that can use numerous built in variables and functions.

More "advanced" use of bpftrace, a lame attempt to capture time to execute query, may look like this:
[openxs@fc29 ~]$ sudo bpftrace -e 'uprobe:/usr/libexec/mysqld:_Z16dispatch_command19enum_server_commandP3THDPcjbb { @sql = str(arg2); @start[@sql] = nsecs; }
uretprobe:/usr/libexec/mysqld:_Z16dispatch_command19enum_server_commandP3THDPcjbb /@start[@sql] != 0/ { printf("%s : %u64 ms\n", @sql, (nsecs - @start[@sql])/1000000); } '

Attaching 2 probes...
select sleep(3) : 300064 ms
select sleep(1) : 100064 ms
^C

@sql: select sleep(1)


@start[select sleep(3)]: 10666558704666
@start[select sleep(1)]: 10685614895675


[openxs@fc29 ~]$
In this case I try to store time since probe start into the associative array with query text as an "index" and start time (in nanoseconds) as a value.Then I calculate the difference from current nsecs value upon function return, in a separate uretprobe. I've used global variables, @sql for the query text, and @start[] for the array. It even seems to work well for a single threaded load based on the above. But as soon as I try to use multiple concurrent threads:
[openxs@fc29 ~]$ for i in `seq 1 4`; do mysql -uroot test -e"select sleep($i)" & done
it becomes clear that global variables are really global and my outputs are all wrong.

A bit better version may look like this:
[openxs@fc29 ~]$ sudo bpftrace -e 'uprobe:/usr/libexec/mysqld:_Z16dispatch_command19enum_server_commandP3THDPcjbb { @sql[tid] = str(arg2); @start[tid] = nsecs; }                                                                               uretprobe:/usr/libexec/mysqld:_Z16dispatch_command19enum_server_commandP3THDPcjbb /@start[tid] != 0/ { printf("%s : %u64 %u64 ms\n", @sql[tid], tid, (nsecs - @start[tid])/1000000); } '
Attaching 2 probes...
select sleep(1) : 1120764 100064 ms
 : 1120764 064 ms
select sleep(2) : 949064 200064 ms
 : 949064 064 ms
select sleep(3) : 1120864 300064 ms
 : 1120864 064 ms
select sleep(4) : 1120664 400064 ms
 : 1120664 064 ms
select sleep(1) : 1120764 100064 ms
 : 1120764 064 ms
select sleep(2) : 949064 200064 ms
 : 949064 064 ms
select sleep(3) : 1120664 300064 ms
 : 1120664 064 ms
select sleep(4) : 1120864 400064 ms
 : 1120864 064 ms
^C

@sql[9490]:
@sql[11206]:
@sql[11207]:
@sql[11208]:


@start[11207]: 13609305005933
@start[9490]: 13610305621499
@start[11206]: 13611305753596
@start[11208]: 13612305235313


[openxs@fc29 ~]$
The output is for both sequential and concurrent execution of queries. I've used two associative arrays, @sql[] for queries and @start[] for start times, both indexed by tid - built in variable for thread id, that should not change, at least until we use pool-of-threads... You can also see that the tool by default outputs the content of all global associate arrays at the end, unless we free them explicitly.

* * *
This image of bpftrace internals is taken from the Brendan Gregg's post
bpftrace commands may be much more complicated and it may make sense to to store them in a separate file. The tool is near ideal of "quick and dirty" tests, and one day I'll write a way more complete posts with much better examples. But having a way to capture, filter and summarize queries in kernel space and send only the relevant results to a user space, all that in a safe manner, is really cool!

Sunday, October 6, 2019

Dynamic Tracing of MySQL Server With perf probe - Basic Example

I am going to write a series of blog posts based on my talks and experiences at Percona Live Europe 2019. The first one would be a kind of extended comment for a couple of slides from the "Tracing and Profiling MySQL" talk.

We can surely wait until Performance Schema instruments every other line of code or at least every important stage and wait in every storage engine we care about, but there is no real need for that. If you run any version of MySQL under Linux with more or less recent kernel (anything newer than 4.1 is good enough, in general), you can easily use dynamic tracing for any application (at least if there is symbolic information for the binaries), any time. As Brendan Gregg put it here:
"One benefit of dynamic tracing is that it can be enabled on a live system without restarting anything. You can take an already-running kernel or application and then begin dynamic instrumentation, which (safely) patches instructions in memory to add instrumentation. That means there is zero overhead or tax for this feature until you begin using it. One moment your binary is running unmodified and at full speed, and the next, it's running some extra instrumentation instructions that you dynamically added. Those instructions should eventually be removed once you've finished using your session of dynamic tracing."
One of the ways to use dynamic tracing (that is supported for a long time) is a perf profiler and its probe command. In the simplest case that I am going to illustrate here, probe is defined for a function defined in the code and refers to it by name. You can refer to the name of local variable, function parameter, local data structure member in the probe etc, and record the values of them with other probe data.

For a simple example let me consider recent Percona Server 5.7.x running on recent Ubuntu 16.04 with kernel 4.4.x. Let's assume I want to trace all calls to the dispatch_command() function and record every query every connection processes that way.

Skipping the details for now, let's assume I've found out (with gdb in my case, but it can be code review as well) that when this function is called I can see the query user wants to execute in the com_data structure passed via a pointer to the function:
(gdb) p com_data->com_query.query
$4 = 0x7fb0dba8d021 "select 2"
Based on this information and having -dbg package also installed for Percona Server I can add a probe dynamically any time using the following simple command (--add option is assumed by default):
openxs@ao756:~$ sudo perf probe -x /usr/sbin/mysqld 'dispatch_command com_data->com_query.query:string'
Added new event:
  probe_mysqld:dispatch_command (on dispatch_command in /usr/sbin/mysqld with query=com_data->com_query.query:string)

You can now use it in all perf tools, such as:

        perf record -e probe_mysqld:dispatch_command -aR sleep 1
In this probe I refer to the specific binary with -x option and full path name, and the function in that binary by name, and I say that I'd like to record the value of com_data->com_query.query as a zero-terminated string. Now I can use any variation of perf record command (with -F option to define sampling frequency, -g option to capture stack traces etc, see more here) and my probe will be one of the events captured.

For this simple example of tracing I'll use -e option to capture only the events related to the probe I defined. Probe name for this simple case by default consists of the binary name, colon (':') separator and function name. I'll use -R option to collect raw sample records and . I've also added -a option to collect samples on all CPUs. You can see the hint for possible command in the output above. 

So, I can record related events with default frequency as follows:
openxs@ao756:~$ sudo perf record -e 'probe_mysqld:dispatch_command*' -aR
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.676 MB perf.data (3 samples) ]
I let it work for some time in the foreground and then pressed Ctrl-C to stop collecting. Now I can check raw sample records with perf script command:
openxs@ao756:~$ sudo perf script >/tmp/queries.txt
openxs@ao756:~$ cat /tmp/queries.txt
          mysqld 31340 [001]  3888.849079: probe_mysqld:dispatch_command: (be9250) query="select 100"
          mysqld 31340 [001]  3891.648739: probe_mysqld:dispatch_command: (be9250) query="select user, host from mysql.user"
          mysqld 31340 [001]  3895.890141: probe_mysqld:dispatch_command: (be9250) query="select 2"
This is the detailed trace, with additional information (exact text of the query executed) added as requested. Output also included PID of the binary, CPU the sample was taken from and a timestamp.

When I am done with tracing, I can delete the probe with --del option referring it by name:
openxs@ao756:~$ sudo perf probe --del dispatch_commandRemoved event: probe_mysqld:dispatch_command
The (small, more on that later) overhead for tracing was added dynamically, only for the exact information I needed and only for the period of tracing. After the dynamic probe is removed we have exactly the same binary as originally started running with zero extra overhead. Now do this with Performance Schema :)

* * *
Slides are available at https://www.slideshare.net/valeriikravchuk1/


More details on the way other tools mentioned during the talk can be used by MySQL DBAs are coming soon in this blog. Stay tuned!