
Thursday, February 10, 2022

How to Summarize gdb Backtrace with pt-pmp (and Flamegraph)

This is going to be more like a note to myself and my readers than a real blog post. But still I'd like to document the trick I have been applying for years already, in its most convenient form. 

Assuming you have a backtrace (or full backtrace) created from a core file or by attaching gdb to a live (and maybe hanging) process with many threads, like MySQL or MariaDB server, what is the best way to summarize it quickly, to understand what most of the threads are doing/hanging at? Like this huge backtrace with hundreds of threads:

openxs@ao756:~$ ls -l /tmp/backtrace1.txt
-rw-rw-r-- 1 openxs openxs 2817054 лют 10 17:02 /tmp/backtrace1.txt
openxs@ao756:~$ grep LWP /tmp/backtrace1.txt | wc -l
1915

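By the way, in case you wonder how such a backtrace can be produced in the first place, here is a minimal sketch with gdb in batch mode (the binary name, PID source and paths below are just examples, adjust them to your case):

# from a live (maybe hanging) process:
gdb -p $(pidof mariadbd) -batch -ex "set pagination 0" -ex "thread apply all bt" > /tmp/backtrace1.txt
# or from a core file:
gdb /path/to/mariadbd /path/to/core -batch -ex "set pagination 0" -ex "thread apply all bt full" > /tmp/backtrace1.txt
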
Here is how. Download pt-pmp from Percona Toolkit, then source the pt-pmp script and rely on its aggregate_stacktrace function, which is quite advanced in summarizing stack traces:

openxs@ao756:~$ which pt-pmp
/usr/bin/pt-pmp
openxs@ao756:~$ . /usr/bin/pt-pmp
openxs@ao756:~$ cat /tmp/backtrace1.txt | aggregate_stacktrace > /tmp/pmp1.txt 
openxs@ao756:~$ ls -l /tmp/pmp1.txt
-rw-rw-r-- 1 openxs openxs 34174 лют 10 18:07 /tmp/pmp1.txt
openxs@ao756:~$ head -5 /tmp/pmp1.txt
    598 poll(libc.so.6),vio_io_wait(viosocket.c:945),vio_socket_io_wait(viosocket.c:108),vio_read(viosocket.c:184),my_real_read(net_serv.cc:892),my_net_read_packet_reallen(net_serv.cc:1162),my_net_read_packet(net_serv.cc:1146),do_command(sql_parse.cc:1262),do_handle_one_connection(sql_connect.cc:1336),handle_one_connection(sql_connect.cc:1241),pfs_spawn_thread(pfs.cc:1862),start_thread(libpthread.so.0),clone(libc.so.6)
     82 pthread_cond_wait,wait(os0event.cc:158),wait_low(os0event.cc:158),os_event_wait_low(os0event.cc:158),sync_array_wait_event(sync0arr.cc:471),rw_lock_s_lock_spin(sync0rw.cc:373),rw_lock_s_lock_func(sync0rw.ic:290),pfs_rw_lock_s_lock_func(sync0rw.ic:290),buf_page_get_gen(buf0buf.cc:4905),btr_cur_search_to_nth_level(btr0cur.cc:1243),btr_pcur_open_low(btr0pcur.ic:467),row_ins_scan_sec_index_for_duplicate(btr0pcur.ic:467),row_ins_sec_index_entry_low(btr0pcur.ic:467),row_ins_sec_index_entry(row0ins.cc:3251),row_ins_index_entry(row0ins.cc:3297),row_ins_index_entry_step(row0ins.cc:3297),row_ins(row0ins.cc:3297),row_ins_step(row0ins.cc:3297),row_insert_for_mysql(row0mysql.cc:1414),ha_innobase::write_row(ha_innodb.cc:8231),handler::ha_write_row(handler.cc:6089),write_record(sql_insert.cc:1941),mysql_insert(sql_insert.cc:1066),mysql_execute_command(sql_parse.cc:4170),mysql_parse(sql_parse.cc:7760),dispatch_command(sql_parse.cc:1832),do_command(sql_parse.cc:1386),do_handle_one_connection(sql_connect.cc:1336),handle_one_connection(sql_connect.cc:1241),pfs_spawn_thread(pfs.cc:1862),start_thread(libpthread.so.0),clone(libc.so.6)
     55 pthread_cond_wait,wait(os0event.cc:158),wait_low(os0event.cc:158),os_event_wait_low(os0event.cc:158),lock_wait_suspend_thread(lock0wait.cc:347),row_mysql_handle_errors(row0mysql.cc:741),row_insert_for_mysql(row0mysql.cc:1428),ha_innobase::write_row(ha_innodb.cc:8231),handler::ha_write_row(handler.cc:6089),write_record(sql_insert.cc:1941),mysql_insert(sql_insert.cc:1066),mysql_execute_command(sql_parse.cc:4170),mysql_parse(sql_parse.cc:7760),dispatch_command(sql_parse.cc:1832),do_command(sql_parse.cc:1386),do_handle_one_connection(sql_connect.cc:1336),handle_one_connection(sql_connect.cc:1241),pfs_spawn_thread(pfs.cc:1862),start_thread(libpthread.so.0),clone(libc.so.6)
     38 pthread_cond_wait,wait(os0event.cc:158),wait_low(os0event.cc:158),os_event_wait_low(os0event.cc:158),sync_array_wait_event(sync0arr.cc:471),rw_lock_x_lock_func(sync0rw.cc:733),pfs_rw_lock_x_lock_func(sync0rw.ic:544),buf_page_get_gen(buf0buf.cc:4918),btr_cur_search_to_nth_level(btr0cur.cc:1243),row_ins_sec_index_entry_low(row0ins.cc:2946),row_ins_sec_index_entry(row0ins.cc:3251),row_ins_index_entry(row0ins.cc:3297),row_ins_index_entry_step(row0ins.cc:3297),row_ins(row0ins.cc:3297),row_ins_step(row0ins.cc:3297),row_insert_for_mysql(row0mysql.cc:1414),ha_innobase::write_row(ha_innodb.cc:8231),handler::ha_write_row(handler.cc:6089),write_record(sql_insert.cc:1941),mysql_insert(sql_insert.cc:1066),mysql_execute_command(sql_parse.cc:4170),mysql_parse(sql_parse.cc:7760),dispatch_command(sql_parse.cc:1832),do_command(sql_parse.cc:1386),do_handle_one_connection(sql_connect.cc:1336),handle_one_connection(sql_connect.cc:1241),pfs_spawn_thread(pfs.cc:1862),start_thread(libpthread.so.0),clone(libc.so.6)
     32 pthread_cond_wait,wait(os0event.cc:158),wait_low(os0event.cc:158),os_event_wait_low(os0event.cc:158),sync_array_wait_event(sync0arr.cc:471),rw_lock_x_lock_func(sync0rw.cc:733),pfs_rw_lock_x_lock_func(sync0rw.ic:544),buf_page_get_gen(buf0buf.cc:4918),btr_cur_search_to_nth_level(btr0cur.cc:1243),row_ins_sec_index_entry_low(row0ins.cc:3040),row_ins_sec_index_entry(row0ins.cc:3251),row_ins_index_entry(row0ins.cc:3297),row_ins_index_entry_step(row0ins.cc:3297),row_ins(row0ins.cc:3297),row_ins_step(row0ins.cc:3297),row_insert_for_mysql(row0mysql.cc:1414),ha_innobase::write_row(ha_innodb.cc:8231),handler::ha_write_row(handler.cc:6089),write_record(sql_insert.cc:1941),mysql_insert(sql_insert.cc:1066),mysql_execute_command(sql_parse.cc:4170),mysql_parse(sql_parse.cc:7760),dispatch_command(sql_parse.cc:1832),do_command(sql_parse.cc:1386),do_handle_one_connection(sql_connect.cc:1336),handle_one_connection(sql_connect.cc:1241),pfs_spawn_thread(pfs.cc:1862),start_thread(libpthread.so.0),clone(libc.so.6)

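One more quick check that may help before any further steps is a rough count of how many of the aggregated threads have some "wait" call in their stacks. It is a very rough estimate (it also matches idle connections sitting in vio_io_wait, for example), but it gives a quick feeling of the scale:

awk '/wait/ { n += $1 } END { print n, "threads with \"wait\" somewhere in the stack" }' /tmp/pmp1.txt
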
If a summary like above is not clear enough, we can surely go one step further and create a proper flame graph based on the output above:

openxs@ao756:~$ cat /tmp/pmp1.txt | awk '{print $2, $1}' | sed -e 's/,/;/g' | ~/git/FlameGraph/flamegraph.pl --countname="threads" --reverse - >/tmp/pmp1.svg  

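The flamegraph.pl script above comes from Brendan Gregg's FlameGraph repository; the ~/git/FlameGraph path is just where I keep a clone of it:

git clone https://github.com/brendangregg/FlameGraph ~/git/FlameGraph
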
Then with some creative zooming and search we can concentrate on waits:

Frames with "wait" substring in the function name are highlighted

That's all. It is as simple as it looks: a quick overview of the backtrace before you start digging into it.

Wednesday, January 27, 2021

Playing with recent bpftrace and MariaDB 10.5 on Fedora - Part III, Creating a New Tool for Tracing Mutexes

I had some free time yesterday to do some more tests on Fedora, so I got back to the old request from one of the MariaDB developers that I first mentioned in this blog a few months ago:

"ideally, collect stack traces of mariadbd, and sort them in descending order by time spent between pthread_mutex_lock and next pthread_mutex_unlock."

The original request above has not been answered yet, and I was recently reminded about it. What I did with perf, and recently with bpftrace, was (lame) counting of the number of samples per unique stack, while I actually had to count the time spent between acquiring and releasing specific mutexes.

From the very beginning I was sure that bpftrace should allow me to get the answer easily, and after reviewing the way the existing tools are coded, yesterday I decided to finally write some real, multi-line bpftrace program, with multiple probes, myself. I wanted to fulfill the request literally, no matter how much that would "cost" for now, with bpftrace. It turned out that a couple of hours of calm vacation time is more than enough to get a draft of a solution.

I started with checking the pthread_mutex_lock manual page. From it I got the (primitive) idea of the two functions used in the process, each with a single argument, the mutex pointer/address:

int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);

Multiple threads can try to lock the same mutex, and those that find it locked will wait until an unlock eventually makes the mutex available for one of them to acquire (as decided by the scheduler). I made the assumption (correct me if I am wrong) that the same thread that locked the mutex must eventually unlock it. Based on that I came up with the following initial lame version of a bpftrace program:

[openxs@fc31 ~]$ cat pthread_mutex.bt
#!/usr/bin/env bpftrace

BEGIN
{
    printf("Tracing time from pthread_mutex_lock to _unlock, Ctrl-C to stop\n");
}

uprobe:/lib64/libpthread.so.0:pthread_mutex_lock /comm == "mariadbd"/
{
    @start[arg0] = nsecs;
    @mutexid[arg0] = tid;
    @tidstack[tid] = ustack;
}

uprobe:/lib64/libpthread.so.0:pthread_mutex_unlock
/comm == "mariadbd" && @start[arg0] != 0 && @mutexid[arg0] == tid/
{
    $now = nsecs;
    $time = $now - @start[arg0];
    @mutexstack[@tidstack[tid]] += $time;
    printf("Mutex: %u, time: %d\n", arg0, $time);
    delete(@start[arg0]);
    delete(@mutexid[arg0]);
    delete(@tidstack[tid]);
}

END
{
    clear(@start);
    clear(@mutexid);
    clear(@tidstack);
}
/* the end */

Do you recognize the biosnoop.bt style? Yes, this is what I was inspired by... So, I added two uprobes for the library providing the functions, both checking that the call is made from mariadbd. The first one, for lock, stores the start time for the given mutex address, the id of the thread that locked it, and the stack trace of the thread at the moment of locking. The second one, for unlock, computes the time difference since the same mutex was last locked, but it fires only if the unlocking thread has the same id as the locking one. Then I add this time difference to the time spent "within this stack trace", by referring to the thread stack stored as an index in the @mutexstack[] associative array. Then I print some debugging output to see what happens in the process of tracing, and remove the items that were added to the associative arrays by the first probe.

In the END probe I just clean up all the associative arrays but @mutexstack[], and, as we've seen before, its content is then just dumped to the output by bpftrace. This is what I am going to post-process later, after a quick debugging session proves that my idea gives some reasonable results.

So, with MariaDB 10.5 up and running, started like this (no real tuning for anything, no wonder QPS is not high in the tests below):

./bin/mysqld_safe --no-defaults --socket=/tmp/mariadb.sock --innodb_buffer_pool_size=1G --innodb_flush_log_at_trx_commit=2 --port=3309 &

and having zero user connections, I made pthread_mutex.bt executable and started my very first bpftrace program for the very first time (OK, honestly, a few previous runs showed some syntax errors that I corrected):

[openxs@fc31 ~]$ sudo ./pthread_mutex.bt
Attaching 4 probes...
Tracing time from pthread_mutex_lock to _unlock, Ctrl-C to stop
Mutex: 652747168, time: 6598
Mutex: 629905136, time: 46594
...
Mutex: 652835840, time: 26491
Mutex: 652835712, time: 4569
^C


@mutexstack[
    __pthread_mutex_lock+0
    tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79
    0x7f7e60fb53d4
    0x5562258c9080
    std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0
    0x2de907894810c083
]: 23055
@mutexstack[
    __pthread_mutex_lock+0
    tpool::thread_pool_generic::timer_generic::execute(void*)+188
    tpool::task::execute()+50
    tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79
    0x7f7e60fb53d4
    0x5562258c9080
    std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0
    0x2de907894810c083
]: 23803
@mutexstack[
    __pthread_mutex_lock+0
    tpool::thread_pool_generic::timer_generic::execute(void*)+210
    tpool::task::execute()+50
    tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79
    0x7f7e60fb53d4
    0x5562258c9080
    std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0
    0x2de907894810c083
]: 24555
@mutexstack[
    __pthread_mutex_lock+0
    tpool::thread_pool_generic::submit_task(tpool::task*)+88
    timer_handler+326
    start_thread+226
]: 31859
@mutexstack[
    __pthread_mutex_lock+0
    srv_monitor_task+130
    tpool::thread_pool_generic::timer_generic::execute(void*)+53
    tpool::task::execute()+50
    tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79
    0x7f7e60fb53d4
    0x5562258c9080
    std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0
    0x2de907894810c083
]: 53282
@mutexstackERROR: failed to look up stack id 0 (pid 0): -1
[]: 322499

Not bad. I see mutexes are locked and (last ERROR aside) the time is summed up per stack trace as planned. Moreover, the stack traces look reasonable for 10.5 (a generic thread pool is used inside InnoDB in this version to run background tasks, are you aware of that?). Some symbols are not resolved, but what can I do about it? I'll just skip those addresses at some later stage, maybe.

Then I decided to check which threads are locking the mutexes, and modified the print statement:

    printf("Mutex: %u, thread: %u, time: %d\n", arg0, tid, $time);

With that modification I also redirected errors to /dev/null and got this:

[openxs@fc31 ~]$ sudo ./pthread_mutex.bt 2>/dev/null
Attaching 4 probes...
Tracing time from pthread_mutex_lock to _unlock, Ctrl-C to stop
Mutex: 652835712, thread: 4476, time: 6354
Mutex: 629905136, thread: 4476, time: 37053
Mutex: 621289632, thread: 4485, time: 5254
Mutex: 621289632, thread: 4485, time: 4797
Mutex: 652835840, thread: 4485, time: 31465
Mutex: 652835712, thread: 4485, time: 4374
Mutex: 652835712, thread: 4476, time: 6048
Mutex: 629905136, thread: 4476, time: 35703
Mutex: 621289632, thread: 4485, time: 4917
Mutex: 621289632, thread: 4485, time: 4779
Mutex: 652835840, thread: 4485, time: 30316
Mutex: 652835712, thread: 4485, time: 4389
Mutex: 652835712, thread: 4476, time: 6733
Mutex: 629905136, thread: 4476, time: 40936
Mutex: 621289632, thread: 4485, time: 4719
Mutex: 621289632, thread: 4485, time: 4725
Mutex: 652835840, thread: 4485, time: 30637
Mutex: 652835712, thread: 4485, time: 4441
^C


@mutexstack[
    __pthread_mutex_lock+0
    tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79
    0x7f7e60fb53d4
    0x5562258c9080
    std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0
    0x2de907894810c083
]: 13204
@mutexstack[
    __pthread_mutex_lock+0
    tpool::thread_pool_generic::timer_generic::execute(void*)+210
    tpool::task::execute()+50
    tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79
    0x7f7e60fb53d4
    0x5562258c9080
    std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0
    0x2de907894810c083
]: 14301
@mutexstack[
    __pthread_mutex_lock+0
    tpool::thread_pool_generic::timer_generic::execute(void*)+188
    tpool::task::execute()+50
    tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79
    0x7f7e60fb53d4
    0x5562258c9080
    std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0
    0x2de907894810c083
]: 14890
@mutexstack[
    __pthread_mutex_lock+0
    tpool::thread_pool_generic::submit_task(tpool::task*)+88
    timer_handler+326
    start_thread+226
]: 19135
@mutexstack[]: 206110

I see different threads locking the same mutexes etc. One day, given more time, I'd try to figure out what these mutexes are and what the purpose of each thread was (fortunately, it can be seen based on the OS thread id in performance_schema.threads in 10.5, or inferred from the stacks for the moment).
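
For example, assuming Performance Schema is enabled (it is not by default in MariaDB) and the THREAD_OS_ID column is available as in 10.5, a quick check like this should map the OS thread ids from the output above (4476 and 4485) to named server threads:

mysql --socket=/tmp/mariadb.sock -e "SELECT THREAD_ID, NAME, PROCESSLIST_ID, THREAD_OS_ID FROM performance_schema.threads WHERE THREAD_OS_ID IN (4476, 4485)"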

I removed the debug print (no interactive output, just the final summarized data), changed ustack to ustack(perf) (assuming I'd know better how to deal with that output format later - it was not really a good idea), and ended up with this final version of the tool:

#!/usr/bin/env bpftrace

BEGIN
{
/*
    printf("Tracing time from pthread_mutex_lock to _unlock, Ctrl-C to stop\n");
*/
}

uprobe:/lib64/libpthread.so.0:pthread_mutex_lock /comm == "mariadbd"/
{
    @start[arg0] = nsecs;
    @mutexid[arg0] = tid;
    @tidstack[tid] = ustack(perf);    
}

uprobe:/lib64/libpthread.so.0:pthread_mutex_unlock
/comm == "mariadbd" && @start[arg0] != 0 && @mutexid[arg0] == tid/
{
    $now = nsecs;
    $time = $now - @start[arg0];
    @mutexstack[@tidstack[tid]] += $time;
/*
    printf("Mutex: %u, thread: %u, time: %d\n", arg0, tid, $time);
*/
    delete(@start[arg0]);
    delete(@mutexid[arg0]);
    delete(@tidstack[tid]);
}

END
{
    clear(@start);
    clear(@mutexid);
    clear(@tidstack);
}
/* the end */

I saved the stack traces to a file in /tmp, to work on the output further outside of bpftrace. Again, probably it was not the best idea, but I am not yet fluent with string processing in bpftrace anyway, so I rely on awk, sort etc.:

[openxs@fc31 ~]$ sudo ./pthread_mutex.bt 2>/dev/null >/tmp/pthread_mutex_perf_stacks.txt
^C[openxs@fc31 ~]$ cat /tmp/pthread_mutex_perf_stacks.txt
Attaching 4 probes...



@mutexstack[
        7f7e615a6e70 __pthread_mutex_lock+0 (/usr/lib64/libpthread-2.30.so)
        556223eb07cf tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        7f7e60fb53d4 0x7f7e60fb53d4 ([unknown])
        5562258c9080 0x5562258c9080 ([unknown])
        556223eb0b60 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        2de907894810c083 0x2de907894810c083 ([unknown])
]: 21352
@mutexstack[
        7f7e615a6e70 __pthread_mutex_lock+0 (/usr/lib64/libpthread-2.30.so)
        556223eb0d62 tpool::thread_pool_generic::timer_generic::execute(void*)+210 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        556223eb1c52 tpool::task::execute()+50 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        556223eb07cf tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        7f7e60fb53d4 0x7f7e60fb53d4 ([unknown])
        5562258c9080 0x5562258c9080 ([unknown])
        556223eb0b60 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        2de907894810c083 0x2de907894810c083 ([unknown])
]: 22975
...

After checking what I did previously with such stack traces to collapse them into one line per stack, pt-pmp style, and after multiple clarification runs and changes, I ended up with the following awk code:

awk '
BEGIN { s = ""; }
/^@mutexstack\[\]/ { s = ""; }
/^@mutexstack/ { s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s "," targ } else { s = targ } }
/^]/ { print $2, s }
'

I process the lines that do not contain stack frames, those around the stack block, resetting the stack s at the beginning and printing it at the end of the block. For the stack lines I take only the function name, ignoring everything else, to form targ, and concatenate it to the stack collected so far, with a comma (that was a wrong idea for future use) as a separator between function names. The original code that inspired all this came from pt-pmp, as far as I remember. I just adapted it to the format, better than in the previous posts.

Post-processing the output with this awk code gave me the following:

[openxs@fc31 ~]$ cat /tmp/pthread_mutex_perf_stacks.txt | awk '
> BEGIN { s = ""; }
> /^@mutexstack\[\]/ { s = ""; }
> /^@mutexstack/ { s = ""; }
> /^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s "," targ } else { s = targ } }
> /^]/ { print $2, s }
> '
21352 __pthread_mutex_lock,tpool::thread_pool_generic::worker_main,,,,
22975 __pthread_mutex_lock,tpool::thread_pool_generic::timer_generic::execute,tpool::task::execute,tpool::thread_pool_generic::worker_main,,,,
24568 __pthread_mutex_lock,tpool::thread_pool_generic::timer_generic::execute,tpool::task::execute,tpool::thread_pool_generic::worker_main,,,,
33469 __pthread_mutex_lock,tpool::thread_pool_generic::submit_task,timer_handler,start_thread

Unresolved addresses are removed, as are offsets from the function start. Now sorting remains, in descending order, on the first column as a key:

[openxs@fc31 ~]$ cat /tmp/pthread_mutex_perf_stacks.txt | awk '
BEGIN { s = ""; }
/^@mutexstack\[\]/ { s = ""; }
/^@mutexstack/ { s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s "," targ } else { s = targ } }
/^]/ { print $2, s }
' | sort -r -n -k 1,1
33469 __pthread_mutex_lock,tpool::thread_pool_generic::submit_task,timer_handler,start_thread
24568 __pthread_mutex_lock,tpool::thread_pool_generic::timer_generic::execute,tpool::task::execute,tpool::thread_pool_generic::worker_main,,,,
22975 __pthread_mutex_lock,tpool::thread_pool_generic::timer_generic::execute,tpool::task::execute,tpool::thread_pool_generic::worker_main,,,,
21352 __pthread_mutex_lock,tpool::thread_pool_generic::worker_main,,,,

That's what we have for the server without user connections. Now let me put it under a highly concurrent sysbench test load (good idea, isn't it?):

[openxs@fc31 maria10.5]$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --threads=32 --report-interval=10 --time=300 run
sysbench 1.1.0-174f3aa (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 32
Report intermediate results every 10 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 10s ] thds: 32 tps: 653.07 qps: 13097.92 (r/w/o: 9174.69/2613.89/1309.34) lat (ms,95%): 240.02 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 32 tps: 1025.71 qps: 20511.58 (r/w/o: 14358.12/4102.14/2051.32) lat (ms,95%): 71.83 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 32 tps: 588.21 qps: 11770.70 (r/w/o: 8238.74/2355.44/1176.52) lat (ms,95%): 235.74 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 32 tps: 306.22 qps: 6135.54 (r/w/o: 4298.14/1224.97/612.43) lat (ms,95%): 369.77 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 32 tps: 467.00 qps: 9339.64 (r/w/o: 6537.96/1867.69/933.99) lat (ms,95%): 308.84 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 32 tps: 302.19 qps: 6044.31 (r/w/o: 4230.60/1209.34/604.37) lat (ms,95%): 520.62 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 32 tps: 324.91 qps: 6496.60 (r/w/o: 4548.67/1298.12/649.81) lat (ms,95%): 467.30 err/s: 0.00 reconn/s: 0.00
[ 80s ] thds: 32 tps: 303.58 qps: 6058.55 (r/w/o: 4238.05/1213.33/607.16) lat (ms,95%): 646.19 err/s: 0.00 reconn/s: 0.00
[ 90s ] thds: 32 tps: 258.39 qps: 5176.10 (r/w/o: 3625.73/1033.58/516.79) lat (ms,95%): 634.66 err/s: 0.00 reconn/s: 0.00
[ 100s ] thds: 32 tps: 213.72 qps: 4279.43 (r/w/o: 2995.93/856.07/427.43) lat (ms,95%): 707.07 err/s: 0.00 reconn/s: 0.00
[ 110s ] thds: 32 tps: 208.29 qps: 4144.23 (r/w/o: 2896.58/831.07/416.58) lat (ms,95%): 623.33 err/s: 0.00 reconn/s: 0.00
[ 120s ] thds: 32 tps: 456.29 qps: 9135.45 (r/w/o: 6397.03/1826.05/912.38) lat (ms,95%): 363.18 err/s: 0.00 reconn/s: 0.00
[ 130s ] thds: 32 tps: 582.21 qps: 11641.73 (r/w/o: 8148.49/2328.63/1164.61) lat (ms,95%): 277.21 err/s: 0.00 reconn/s: 0.00
[ 140s ] thds: 32 tps: 560.39 qps: 11208.17 (r/w/o: 7845.84/2241.55/1120.78) lat (ms,95%): 257.95 err/s: 0.00 reconn/s: 0.00
[ 150s ] thds: 32 tps: 338.03 qps: 6768.93 (r/w/o: 4739.47/1353.41/676.05) lat (ms,95%): 442.73 err/s: 0.00 reconn/s: 0.00
[ 160s ] thds: 32 tps: 410.20 qps: 8210.38 (r/w/o: 5748.19/1641.80/820.40) lat (ms,95%): 411.96 err/s: 0.00 reconn/s: 0.00
[ 170s ] thds: 32 tps: 480.28 qps: 9599.94 (r/w/o: 6716.68/1922.81/960.45) lat (ms,95%): 325.98 err/s: 0.00 reconn/s: 0.00
[ 180s ] thds: 32 tps: 397.62 qps: 7952.16 (r/w/o: 5568.62/1588.19/795.35) lat (ms,95%): 411.96 err/s: 0.00 reconn/s: 0.00
[ 190s ] thds: 32 tps: 338.77 qps: 6769.31 (r/w/o: 4739.09/1352.78/677.44) lat (ms,95%): 475.79 err/s: 0.00 reconn/s: 0.00
[ 200s ] thds: 32 tps: 417.81 qps: 8372.59 (r/w/o: 5857.10/1679.76/835.73) lat (ms,95%): 331.91 err/s: 0.00 reconn/s: 0.00
[ 210s ] thds: 32 tps: 267.40 qps: 5340.01 (r/w/o: 3742.10/1063.10/534.80) lat (ms,95%): 634.66 err/s: 0.00 reconn/s: 0.00
[ 220s ] thds: 32 tps: 267.70 qps: 5355.78 (r/w/o: 3748.96/1071.42/535.41) lat (ms,95%): 590.56 err/s: 0.00 reconn/s: 0.00
[ 230s ] thds: 32 tps: 243.11 qps: 4859.74 (r/w/o: 3401.70/971.83/486.21) lat (ms,95%): 733.00 err/s: 0.00 reconn/s: 0.00
[ 240s ] thds: 32 tps: 173.99 qps: 3474.97 (r/w/o: 2430.94/696.05/347.98) lat (ms,95%): 1013.60 err/s: 0.00 reconn/s: 0.00
[ 250s ] thds: 32 tps: 169.71 qps: 3403.05 (r/w/o: 2384.37/679.25/339.42) lat (ms,95%): 877.61 err/s: 0.00 reconn/s: 0.00
[ 260s ] thds: 32 tps: 407.57 qps: 8151.27 (r/w/o: 5704.23/1631.89/815.15) lat (ms,95%): 272.27 err/s: 0.00 reconn/s: 0.00
...
[ 300s ] thds: 32 tps: 382.41 qps: 7641.05 (r/w/o: 5348.01/1528.43/764.62) lat (ms,95%): 434.83 err/s: 0.00 reconn/s: 0.00
SQL statistics:
    queries performed:
        read:                            1663592
        write:                           475312
        other:                           237656
        total:                           2376560
    transactions:                        118828 (396.04 per sec.)
    queries:                             2376560 (7920.89 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

Throughput:
    events/s (eps):                      396.0445
    time elapsed:                        300.0370s
    total number of events:              118828

Latency (ms):
         min:                                    2.17
         avg:                                   80.79
         max:                                 5012.10
         95th percentile:                      390.30
         sum:                              9600000.68

Threads fairness:
    events (avg/stddev):           3713.3750/40.53
    execution time (avg/stddev):   300.0000/0.01

Trust me that I started my bpftrace program after the initial 20 seconds of the test run, and let it work for at most 20 seconds. But the entire test, the next 280 seconds, was notably affected by a visible drop in QPS! I pressed Ctrl-C but got the command prompt back much later, not even after 300 seconds... I was watching the output grow in another shell:

[openxs@fc31 ~]$ ls -l /tmp/pthread_mutex_perf_stacks.txt
-rw-rw-r--. 1 openxs openxs 264177 Jan 26 13:26 /tmp/pthread_mutex_perf_stacks.txt
...
[openxs@fc31 ~]$ ls -l /tmp/pthread_mutex_perf_stacks.txt
-rw-rw-r--. 1 openxs openxs 281111 Jan 26 13:27 /tmp/pthread_mutex_perf_stacks.txt
...
[openxs@fc31 ~]$ ls -l /tmp/pthread_mutex_perf_stacks.txt
-rw-rw-r--. 1 openxs openxs 4116283 Jan 26 13:35 /tmp/pthread_mutex_perf_stacks.txt

So I ended up with 4M of text data exported to userland, for just 20 seconds of data collection, and with a performance drop lasting many minutes, for my 32-thread test on an old 4-core system. Not that impressive, and I should probably take more care to aggregate and process the data in my bpftrace program, or maybe just dump raw stack/time entries as they are collected. I'll test and see how to improve, as this way of collection is not suitable for production use on a loaded system :(
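
To illustrate the first idea, here is a sketch (not tested or measured yet, so take it as an assumption, not a recommendation) that limits the recorded stack depth and prints/clears the summary map every 10 seconds instead of exporting one huge dump at the end:

sudo bpftrace -e '
uprobe:/lib64/libpthread.so.0:pthread_mutex_lock /comm == "mariadbd"/
{
    @start[arg0] = nsecs;
    @mutexid[arg0] = tid;
    @tidstack[tid] = ustack(perf, 10);
}
uprobe:/lib64/libpthread.so.0:pthread_mutex_unlock
/comm == "mariadbd" && @start[arg0] != 0 && @mutexid[arg0] == tid/
{
    @mutexstack[@tidstack[tid]] += nsecs - @start[arg0];
    delete(@start[arg0]);
    delete(@mutexid[arg0]);
    delete(@tidstack[tid]);
}
interval:s:10
{
    print(@mutexstack);
    clear(@mutexstack);
}'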

Anyway, I have to process what was collected with such an impact. To remind you, the data were collected this way:

[openxs@fc31 ~]$ sudo ./pthread_mutex.bt 2>/dev/null >/tmp/pthread_mutex_perf_stacks.txt
[sudo] password for openxs:
^C

and then I applied the same awk plus sort command line as above to get collapsed stacks. This is what I saw as a result:

[openxs@fc31 ~]$ cat /tmp/collapsed_pthread_mutex.txt | more
104251253 __pthread_mutex_lock,buf_flush_page,buf_flush_try_neighbors,buf_do_flush_list_batch,buf_flush_lists,buf_flush_page_cleaner,start_thread
78920938
74828263
74770599
74622438
72853129
67893142
66546439 __pthread_mutex_lock,buf_do_flush_list_batch,buf_flush_lists,buf_flush_page_cleaner,start_thread
61669188
59330217
55480213
55045396
53531941
53216338
...

I have yet to find out where those unresolved and removed entries come from and what to do with them so that they do not influence the analysis. For now I need to get rid of them as useless. This is how I did it to get the "top 5" stacks with the time (in nanoseconds) spent in them:

[openxs@fc31 ~]$ cat /tmp/collapsed_pthread_mutex.txt | awk '{ if (length($2) > 0) {print} }' | head -5
104251253 __pthread_mutex_lock,buf_flush_page,buf_flush_try_neighbors,buf_do_flush_list_batch,buf_flush_lists,buf_flush_page_cleaner,start_thread
66546439 __pthread_mutex_lock,buf_do_flush_list_batch,buf_flush_lists,buf_flush_page_cleaner,start_thread
31431176 __pthread_mutex_lock,buf_flush_try_neighbors,buf_do_flush_list_batch,buf_flush_lists,buf_flush_page_cleaner,start_thread
27100601 __pthread_mutex_lock,tpool::aio_linux::getevent_thread_routine,,,,
11730055 __pthread_mutex_lock,buf_flush_lists,buf_flush_page_cleaner,start_thread
[openxs@fc31 ~]$ cat /tmp/collapsed_pthread_mutex.txt | awk '{ if (length($2) > 0) {print} }' > /tmp/collapsed_clean_pthread_mutex.txt

I saved the output into the /tmp/collapsed_clean_pthread_mutex.txt file. The next step would be to represent the result in some nice graphical way, a flame graph! I have the software in place:

[openxs@fc31 ~]$ ls /mnt/home/openxs/git/FlameGraph/
aix-perf.pl                 stackcollapse-gdb.pl
demos                       stackcollapse-go.pl
dev                         stackcollapse-instruments.pl
difffolded.pl               stackcollapse-java-exceptions.pl
docs                        stackcollapse-jstack.pl
example-dtrace-stacks.txt   stackcollapse-ljp.awk
example-dtrace.svg          stackcollapse-perf.pl
example-perf-stacks.txt.gz  stackcollapse-perf-sched.awk
example-perf.svg            stackcollapse.pl
files.pl                    stackcollapse-pmc.pl
flamegraph.pl               stackcollapse-recursive.pl
jmaps                       stackcollapse-sample.awk
pkgsplit-perf.pl            stackcollapse-stap.pl
range-perf.pl               stackcollapse-vsprof.pl
README.md                   stackcollapse-vtune.pl
record-test.sh              stackcollapse-xdebug.php
stackcollapse-aix.pl        test
stackcollapse-bpftrace.pl   test.sh
stackcollapse-elfutils.pl

But I quickly recalled that flamegraph.pl expects this kind of input format, with ";" as the separator and the number as the second column, not the first:

bash;entry_SYSCALL_64_fastpath;sys_read;vfs_read;...;schedule 8

There is also a tool to collapse raw bpftrace stacks, stackcollapse-bpftrace.pl, and I have to check how it works for my case one day... Yesterday I just wanted to complete the testing as soon as possible, so I proceeded with a quick and dirty awk hack:

[openxs@fc31 ~]$ cat /tmp/collapsed_clean_pthread_mutex.txt | awk ' { gsub(",",";",$2); print "mariadbd;"$2, $1 }' | head -5
mariadbd;__pthread_mutex_lock;buf_flush_page;buf_flush_try_neighbors;buf_do_flush_list_batch;buf_flush_lists;buf_flush_page_cleaner;start_thread 104251253
mariadbd;__pthread_mutex_lock;buf_do_flush_list_batch;buf_flush_lists;buf_flush_page_cleaner;start_thread 66546439
mariadbd;__pthread_mutex_lock;buf_flush_try_neighbors;buf_do_flush_list_batch;buf_flush_lists;buf_flush_page_cleaner;start_thread 31431176
mariadbd;__pthread_mutex_lock;tpool::aio_linux::getevent_thread_routine;;;; 27100601
mariadbd;__pthread_mutex_lock;buf_flush_lists;buf_flush_page_cleaner;start_thread 11730055

This format looks acceptable, so I've generated the flame graph with the same hack:

[openxs@fc31 ~]$ cat /tmp/collapsed_clean_pthread_mutex.txt | awk ' { gsub(",",";",$2); print "mariadbd;"$2, $1 }' | /mnt/home/openxs/git/FlameGraph/flamegraph.pl --title="pthread_mutex_waits in MariaDB 10.5" --countname=nsecs > ~/Documents/mutex.svg

and here is the result, with a search for "tpool" highlighting how much of the mutex wait time is related to the thread pool of background InnoDB threads:

One can surely create a flame graph based on stacks collected by the bpftrace program, one way or the other...
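
For completeness, the alternative path I mention above might look like this - a sketch only, as I have not verified how stackcollapse-bpftrace.pl copes with the perf-formatted stacks I collected:

cat /tmp/pthread_mutex_perf_stacks.txt | /mnt/home/openxs/git/FlameGraph/stackcollapse-bpftrace.pl | /mnt/home/openxs/git/FlameGraph/flamegraph.pl --title="pthread_mutex_waits in MariaDB 10.5" --countname=nsecs > ~/Documents/mutex_alt.svg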

I'll stop at this stage and maybe continue testing later this week. Stay tuned!

* * *

To summarize:

  • I am not yet sure if my logic in the programs above was correct. I have to think more about it.
  • I surely need to find another way to process the data, either by collapsing/processing stacks in my bpftrace program to make them smaller, or maybe by submitting raw stack/time data as they are collected to the user level. More tests to come...
  • It is easy to create custom bpftrace programs for collecting the data you need. I think memory allocations tracing is my next goal. Imagine a printout of memory allocated and not freed, per allocating thread... If only that had less impact on QPS than what my lame program above demonstrated :)

Saturday, April 4, 2020

Fun with Bugs #96 - On MySQL Bug Reports I am Subscribed to, Part XXX

My weekdays are still busy even during this lockdown period, but the weekend is a perfect time for yet another blog post about MySQL bugs I considered interesting. Very few followers read my posts on other topics anyway, especially if they have the "M....DB" word...

So, here is a new list of optimizer, replication, InnoDB and few other bugs in GA versions of MySQL that I've subscribed to since the end of February, 2020:
  • Bug #98675 - "LIMIT clause in derived table garbles EXPLAIN Note". It's not a big deal, and in the specific case presented by Øystein Grøvlen the optimizer works properly (note text aside), but some clarifications to the note and/or documentation would still help. The comments are useful to read and show a great example of proper cooperation from the Oracle side.
  • Bug #98719 - "Parallel CREATE TABLE statement binlogged in a wrong order". This bug report from Song Libing shows that Oracle engineers readily accept and verify bug reports where some code modification (adding a debug sync point, for example) is needed to demonstrate the problem with a repeatable, deterministic test case. I've seen other bug reports where code modification was considered a problem and the bug was not verified immediately as a result. Note that this bug seems not to be repeatable on MySQL 8.0.19, and this was explicitly tested, with test results shared in public. Good job, Umesh Shastry!
  • Bug #98734 - "Same digest for different queries if we use integer value in ORDER BY caluse". Different column numbers in ORDER BY often cause totally different execution plans, so assuming these queries are the same ("ORDER BY ?") for the purpose of digesting is misleading. Moreover, as noted by Lalit Choudhary from Percona, with column names instead of numbers such queries are not considered the same, so Performance Schema has to be more consistent.
  • Bug #98739 - "TempTable storage engine slow if mmap is required". Take care when large temporary tables are used for your queries in MySQL 8.0.x. By default (without temptable_use_mmap = OFF), when the table is larger than temptable_max_ram you may notice very slow query execution. Nice finding by Marcelo Altmann. This is not the only related performance regression I've seen reported recently. Looks like TempTable storage engine problems are a real main topic of this post (see the quick settings sketch right after this list)!
  • Bug #98750 - "SHOW FIELDS command regression on MySQL 8.0". This performance regression bug was reported by Juan Arruti. See also similar Bug #92195 - "checking permissions 90 time" that was wrongly(!) declared "Not a bug" back in 2018 just because of the way the problem was demonstrated. This time Przemyslaw Malkowski helped to make the point based on Performance Schema instrumentation, so there was no other option but to accept this as a real performance regression bug. Take care if you use SHOW statements with MySQL 8!
  • Bug #98782 - "Using TempTable engine for GROUP BY is slower than using MEMORY engine". In this bug report Øystein Grøvlen demonstrated that MEMORY engine is about 10% faster for temporary tables in MySQL 8. Make sure to use internal_tmp_mem_storage_engine=MEMORY if you care about performance.
  • Bug #98869 - "Temp ibt tablespace truncation at disconnection stuck InnoDB under large BP". The bug reporter, Fungo Wang, used different methods, from pt-pmp to perf and other OS-level tools, to show the performance impact of the regression probably introduced by WL#11613. Make sure to check all the comments that point out other bugs and problems.
  • Bug #98974 - "InnoDB temp table could hurt InnoDB perf badly". Yet another bug report by Fungo Wang. This time it took a lot of time and effort from the bug reporter and many MySQL Community members (including me) to get this bug properly processed, even though it started with a detailed source code analysis, a properly described test case, stack trace analysis and perf profiling details shared. The ideal bug report got far from ideal treatment, unfortunately.
  • Bug #98976 - "Case of referenced column of foreign key not corrected on import". Tim Düsterhus found that in versions with the fix for Bug #88718 - "Foreign key is always in lower case" the case is not always correct in the output of subsequent mysqldump.
  • Bug #98980 - "A state inside mysql_real_connect_nonblocking still blocks". This bug report by Jay Edgar was verified surprisingly fast. This is why it ended up in my list.
  • Bug #98990 - "avg_count_reset for monitor set owner is always NULL in I_S.INNODB_METRICS ". In this case Fungo Wang not only found a bug, but also provided a patch that was accepted by Oracle after signing the OCA.
  • Bug #99006 - "GTID Synchronisation speed when doing START SLAVE". Simon Mudd noted that in some cases with GTIDs START SLAVE may take a lot of time:
    root@myhost [(none)]> start slave;
    Query OK, 0 rows affected (3 min 17.74 sec)
    and, what's worse, nothing is logged to the error log in the process to show the reason, progress or anything useful. Probably it's expected that with many binary logs finding the proper GTID in them takes time, but some feedback would be useful. The bug does not have any clear public test case and is still under analysis.
  • Bug #99010 - "The mtr case --send command not use ps-protocol when run with --ps-protocol ". Does not sound like a big deal for anyone but developers who write MTR test cases, but in this report Ze Yang provided a source code analysis and quite detailed how-to-repeat steps, and still got useless requests for additional feedback and nothing else to this day. This is, again, unfortunate.
  • Bug #99039 - "PSI thread info not getting updated after statement execution". Who could imagine that regression bugs may be introduced even into Performance Schema? But Pranay Motupalli found one introduced into 8.0.18, 5.7.28+ and 5.6.47+ by this commit! I hope to see it fixed soon.
  • Bug #99051 - "XA commit may do engine commit before MYSQL_BIN_LOG::ordered_commit". XA transactions are one of my favorite weak areas of MySQL. In this bug report Dennis GAO described a case where the XA COMMIT operation may do an engine commit before the binlog flush on Ubuntu 18.04. He contributed a fix based on the assumption that the best way is to ensure the sequence in the plugin_hash, so that the binlog plugin is always before all the transactional engine plugins. I only hope that one day the long list of XA bugs will be fixed in one of the forks, if not in MySQL itself.
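
As for the TempTable-related regressions mentioned above (Bug #98739 and Bug #98782), a minimal sketch of the MySQL 8.0.x settings one might try as a workaround looks like this (check the exact variable availability for your minor version and test the impact on your workload first):

mysqld --temptable_use_mmap=OFF --internal_tmp_mem_storage_engine=MEMORY
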
This is my favorite point of view this year.
To summarize:
  1. I see some really good examples of bug verification by Oracle engineers recently. All supported versions are tested, and nobody tries to argue against the bug reporter's approach used to demonstrate the problem, even if it includes source code modifications. Looks really promising.
  2. Even ideal bug reports sometimes are not processed properly without extra efforts from MySQL Community, unfortunately.
  3. Percona engineers still contribute a lot to MySQL with both new bug reports and useful comments and clarifications. I hope Oracle appreciates that.
  4. Looks like code review and regression testing in Oracle still may benefit from some improvements, as we still see new regression bugs...

Sunday, January 26, 2020

Using bpftrace on Fedora 29 - More Advanced Examples

It so happened that I have not had much time to devote to further eBPF and bpftrace-based tracing and profiling since October 2019, when I posted a simple example of how it can be used on Fedora 29 to "emulate" the slow query log (to some extent, by recording queries and the time it took to execute them) with dynamic probes. Recently I started preparing my talk about ftrace, eBPF and bpftrace, planned for the FOSDEM MySQL, MariaDB and Friends Devroom, rejected there and later accepted for "MariaDB Day Brussels 0202 2020" (which I suggest everyone attends, at least if you do not have better plans for the second day of FOSDEM 2020).

So, I decided to check for any ways to use bpftrace as a quick profiler, maybe "embed" it into pt-pmp one day and, surely, get what MariaDB developers ask me for most often while working on performance problems or hangs - stack traces. This blog post describes my "achievements" so far along this way.

I had the same version of bpftrace coming from the Fedora RPM (it had not been updated in the meantime):
[openxs@fc29 server]$ rpm -q -a | grep bpftrace
bpftrace-0.9-0.fc29.x86_64
So, after checking this fine reference guide quickly, I came up with a simple one-liner:
[openxs@fc29 ~]$ sudo bpftrace -e 'profile:hz:99 { @[ustack] = count(); }'
Error creating map: '@'
Error creating stack id map
This might have happened because kernel.perf_event_max_stack is smaller than 127. Try to tweak this value with sysctl kernel.perf_event_max_stack=<new value>
Error creating perf event map (-1)
Attaching 1 probe...
Segmentation fault
Oops! I tried to do what the error message suggested:
[openxs@fc29 ~]$ sudo sysctl -w kernel.perf_event_max_stack=1000
kernel.perf_event_max_stack = 1000
But it did not help. This is how I hit my first problem with bpftrace: Fedora had not cared to update it when it stopped matching "something" in a newer kernel, for whatever reason. This is actually a real problem for eBPF programs - what they assume may not be "there" in a kernel, so they have to be checked carefully. There is a separate "BPF Type Format (BTF)" for encoding the debug info related to a BPF program (or map) that may help to find out if the program is "compatible" without trying it. In my case the incompatibility was obvious and I needed some fast way to proceed with bpftrace, in the hope of coming up with some useful examples for my talk.

So, I decided to build a new version of bpftrace from recent GitHub source (against the bcc tools I had installed and supposedly up to date). This ended up with errors no matter what I tried:
[openxs@fc29 build]$ make
[  3%] Built target parser
[  4%] Built target arch
[  5%] Building CXX object src/ast/CMakeFiles/ast.dir/irbuilderbpf.cpp.o
/mnt/home/openxs/git/bpftrace/src/ast/irbuilderbpf.cpp: In member function 'void bpftrace::ast::IRBuilderBPF::CreateSignal(llvm::Value*)':
/mnt/home/openxs/git/bpftrace/src/ast/irbuilderbpf.cpp:669:16: error: 'BPF_FUNC_send_signal' was not declared in this scope
       getInt64(BPF_FUNC_send_signal),
                ^~~~~~~~~~~~~~~~~~~~
...
A quick search led me to the idea that it's actually a known and recently reported problem, see Issue #1014. As a solution I decided to rebuild the bcc tools from GitHub source as well (the alternative was to change the bpftrace source) and removed the Fedora packages. To make a long story short, this is what I did eventually (from fc -l output):
1051     rm -rf bcc/
1052     rm -rf bpftrace/
1053     sudo dnf remove bcc
1054     git clone https://github.com/iovisor/bcc.git
1055     cd bcc/
1056     git submodule update --init --recursive
1057     sudo dnf install -y bison cmake ethtool flex git iperf libstdc++-static   python-netaddr python-pip gcc gcc-c++ make zlib-devel   elfutils-libelf-devel
1058     sudo dnf install -y luajit luajit-devel
1059     sudo dnf install -y   http://repo.iovisor.org/yum/extra/mageia/cauldron/x86_64/netperf-2.7.0-1.mga6.x86_64.rpm
1060     sudo pip install pyroute2
1061     sudo dnf install -y clang clang-devel llvm llvm-devel llvm-static ncurses-devel
1062     mkdir build
1063     cd build/
1064     cmake .. -DCMAKE_INSTALL_PREFIX=/usr
1065     make
1066     make test
1067     sudo make install
1068     cd ../..
1069     git clone https://github.com/iovisor/bpftrace.git
1070     cd bpftrace/
1071     git submodule update --init --recursive
1072     mkdir build
1073     cd build/
1074     cmake -DCMAKE_BUILD_TYPE=Release ../
1075     make
1076     ./tests/bpftrace_test
1077     fc -l
Basically I did it all as documented in the related projects for my version of Fedora. The highlighted step (1056 above) is essential though, as at the time it was not obviously listed there as mandatory (while it is in recent versions). Step 1076 was successful:
...
[----------] Global test environment tear-down
[==========] 350 tests from 9 test cases ran. (2590 ms total)
[  PASSED  ] 350 tests.

[openxs@fc29 build]$ make test
Running tests...
Test project /mnt/home/openxs/git/bpftrace/build
    Start 1: bpftrace_test
1/3 Test #1: bpftrace_test ....................   Passed    2.42 sec
    Start 2: runtime_test
2/3 Test #2: runtime_test .....................***Failed   21.10 sec
    Start 3: tools-parsing-test
3/3 Test #3: tools-parsing-test ...............***Failed    5.88 sec

33% tests passed, 2 tests failed out of 3

Total Test time (real) =  29.41 sec

The following tests FAILED:
          2 - runtime_test (Failed)
          3 - tools-parsing-test (Failed)
Errors while running CTest
make: *** [Makefile:128: test] Error 8
while make test showed some failures (the same as make test for the bcc tools), from previous experience this did not look like a big deal. I continued with a basic test that the resulting bpftrace now works (I tried to add a kernel probe; guess what software runs on this workstation):
[openxs@fc29 build]$ sudo ./src/bpftrace -e 'kprobe:do_nanosleep { printf("sleep by %s\n", comm); }'
Attaching 1 probe...
sleep by percona-qan-age
sleep by percona-qan-age
sleep by skypeforlinux
sleep by mysqld
...
^C
So, with some confidence built up, I installed the new and very recent bpftrace tool I had built myself:
[openxs@fc29 build]$ sudo make install
[  3%] Built target parser
...
-- Installing: /usr/local/bin/bpftrace
...
-- Installing: /usr/local/man/man8/xfsdist.8.gz

[openxs@fc29 build]$ which bpftrace
/usr/local/bin/bpftrace

[openxs@fc29 build]$ cd ..
[openxs@fc29 bpftrace]$ git log -1
commit afafbf561522dd33fa316be3e33375bc662399ac (HEAD -> master, origin/master, origin/HEAD)
Author: bas smit <bas@baslab.org>
Date:   Fri Jan 17 12:33:43 2020 +0100

    Add `--info` flag

    This prints the supported kernel features
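
That last commit is handy by itself: the new --info flag reports which BPF features the running kernel supports, so incompatibilities like the one I hit with the packaged version can be spotted without even running a program:

sudo bpftrace --info
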
A quick test involved starting MariaDB 10.3 with the thread pool enabled (more on "why thread pool?" in some later blog post):
[openxs@fc29 maria10.3]$ ./bin/mysqld_safe --no-defaults --thread_handling=pool-of-threads --thread_pool_idle_timeout=10 &
[1] 32643
and running sysbench:
[openxs@fc29 maria10.3]$ sysbench /usr/local/share/sysbench/oltp_update_index.lua --mysql-host=127.0.0.1 --mysql-user=root --mysql-port=3306 --threads=10 --tables=4 --table-size=1000000 --time=600 --report-interval=5 run
sysbench 1.1.0-174f3aa (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 10
Report intermediate results every 5 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 5s ] thds: 10 tps: 54.37 qps: 54.37 (r/w/o: 0.00/54.37/0.00) lat (ms,95%): 467.30 err/s: 0.00 reconn/s: 0.00
[ 10s ] thds: 10 tps: 40.60 qps: 40.60 (r/w/o: 0.00/40.60/0.00) lat (ms,95%): 530.08 err/s: 0.00 reconn/s: 0.00
[ 15s ] thds: 10 tps: 40.00 qps: 40.00 (r/w/o: 0.00/40.00/0.00) lat (ms,95%): 520.62 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 10 tps: 38.00 qps: 38.00 (r/w/o: 0.00/38.00/0.00) lat (ms,95%): 623.33 err/s: 0.00 reconn/s: 0.00
[ 25s ] thds: 10 tps: 34.00 qps: 34.00 (r/w/o: 0.00/34.00/0.00) lat (ms,95%): 746.32 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 10 tps: 36.40 qps: 36.40 (r/w/o: 0.00/36.40/0.00) lat (ms,95%): 549.52 err/s: 0.00 reconn/s: 0.00
[ 35s ] thds: 10 tps: 37.80 qps: 37.80 (r/w/o: 0.00/37.80/0.00) lat (ms,95%): 601.29 err/s: 0.00 reconn/s: 0.00
...
and adding the probe like this in another shell:
[openxs@fc29 ~]$ sudo bpftrace -e 'profile:hz:99 { @[ustack] = count(); }' > /tmp/bpf.stack
^C
When I stopped collecting with Ctrl-C, bpftrace printed the content of the associative array collected, with a unique user space stack (for all processes!) as the index and the number of times this stack was sampled as the value. No segmentation fault, unlike with the version from the package! In the resulting file we can see (among other things):
[openxs@fc29 bpftrace]$ cat /tmp/bpf.stack | more
Attaching 1 probe...

...
@[
    rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned short*, bool, unsigned long, mem_block_info_t**)+31
    page_cur_search_with_match_bytes(buf_block_t const*, dict_index_t const*, dtuple_t const*, page_cur_mode_t, unsigned long*, unsigned long*, unsigned long*, unsigned long*, page_cur_t*)+1773
    btr_cur_search_to_nth_level_func(dict_index_t*, unsigned long, dtuple_t const*, page_cur_mode_t, unsigned long, btr_cur_t*, rw_lock_t*, char const*, unsigned int, mtr_t*, unsigned long)+8316
    btr_pcur_open_low(dict_index_t*, unsigned long, dtuple_t const*, page_cur_mode_t, unsigned long, btr_pcur_t*, char const*, unsigned int, unsigned long, mtr_t*) [clone .constprop.33]+146
    row_search_on_row_ref(btr_pcur_t*, unsigned long, dict_table_t const*, dtuple_t const*, mtr_t*)+85
    row_purge_poss_sec(purge_node_t*, dict_index_t*, dtuple_t const*, btr_pcur_t*, mtr_t*, bool)+503
    row_purge_remove_sec_if_poss_leaf(purge_node_t*, dict_index_t*, dtuple_t const*)+1293
    row_purge_record_func(purge_node_t*, unsigned char*, que_thr_t const*, bool)+2271
    row_purge_step(que_thr_t*)+1057
    que_run_threads(que_thr_t*)+2280
    srv_worker_thread+956
    start_thread+234
]: 1

...

@[
    rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned short*, bool, unsigned long, mem_block_info_t**)+31
    btr_search_build_page_hash_index(dict_index_t*, buf_block_t*, rw_lock_t*, unsigned long, unsigned long, unsigned long)+2781
    btr_search_info_update_slow(btr_search_t*, btr_cur_t*)+622
    btr_cur_search_to_nth_level_func(dict_index_t*, unsigned long, dtuple_t const*, page_cur_mode_t, unsigned long, btr_cur_t*, rw_lock_t*, char const*, unsigned int, mtr_t*, unsigned long)+10130
    row_search_mvcc(unsigned char*, page_cur_mode_t, row_prebuilt_t*, unsigned long, unsigned long)+4823
    ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)+338
    handler::ha_index_read_map(unsigned char*, unsigned char const*, unsigned long, ha_rkey_function)+280
    handler::read_range_first(st_key_range const*, st_key_range const*, bool, bool)+98
    handler::multi_range_read_next(void**)+183
    Mrr_simple_index_reader::get_next(void**)+74
    DsMrr_impl::dsmrr_next(void**)+66
    QUICK_RANGE_SELECT::get_next()+39
    rr_quick(READ_RECORD*)+25
    mysql_update(THD*, TABLE_LIST*, List<Item>&, List<Item>&, Item*, unsigned int, st_order*, unsigned long long, bool, unsigned long long*, unsigned long long*)+3560
    mysql_execute_command(THD*)+8446
    Prepared_statement::execute(String*, bool)+979
    Prepared_statement::execute_loop(String*, bool, unsigned char*, unsigned char*)+154
    mysql_stmt_execute_common(THD*, unsigned long, unsigned char*, unsigned char*, unsigned long, bool, bool)+349
    mysqld_stmt_execute(THD*, char*, unsigned int)+37
    dispatch_command(enum_server_command, THD*, char*, unsigned int, bool, bool)+5123
    do_command(THD*)+368
    tp_callback(TP_connection*)+314
    worker_main(void*)+160
    start_thread+234
]: 1

...

@[
    trx_purge(unsigned long, bool)+1054
    srv_purge_coordinator_thread+1872
    start_thread+234
]: 2
@[
    ut_crc32_sw(unsigned char const*, unsigned long)+1611
    buf_calc_page_crc32(unsigned char const*)+61
    buf_flush_init_for_writing(buf_block_t const*, unsigned char*, void*, unsigned long)+872
    buf_flush_write_block_low(buf_page_t*, buf_flush_t, bool)+149
    buf_flush_page(buf_pool_t*, buf_page_t*, buf_flush_t, bool)+607
    buf_flush_page_and_try_neighbors(buf_page_t*, buf_flush_t, unsigned long, unsigned long*)+1627
    buf_do_flush_list_batch(buf_pool_t*, unsigned long, unsigned long)+392
    pc_flush_slot()+2617
    buf_flush_page_cleaner_coordinator.cold.151+1235
    start_thread+234
]: 2

...
@[]: 6474
So, this is the default format to output stacks (one line per call, no address, frame number or anything, symbols properly resolved for my MariaDB built from GitHub source etc). But I had something like this in mind eventually (from pt-pmp):
[openxs@fc29 bpftrace]$ pt-pmp
Tue Jan 21 13:52:12 EET 2020

...
     10 syscall(libc.so.6),__io_getevents_0_4(libaio.so.1),LinuxAIOHandler::collect(os0file.cc:1932),LinuxAIOHandler::poll(os0file.cc:2093),os_aio_linux_handler(stl_vector.h:805),os_aio_handler(stl_vector.h:805),fil_aio_wait(fil0fil.cc:4435),io_handler_thread(srv0start.cc:326),start_thread(libpthread.so.0),clone(libc.so.6)
      6 pthread_cond_wait,os_event::wait(sync0types.h:476),os_event::wait_low(sync0types.h:476),log_write_up_to(log0log.cc:929),trx_flush_log_if_needed_low(trx0trx.cc:1218),trx_flush_log_if_needed(trx0trx.cc:1218),trx_commit_complete_for_mysql(trx0trx.cc:1218),innobase_commit(ha_innodb.cc:4637),commit_one_phase_2(handler.cc:1642),ha_commit_one_phase(handler.cc:1622),ha_commit_trans(handler.cc:1484),trans_commit_stmt(transaction.cc:505),mysql_execute_command(sql_parse.cc:6106),Prepared_statement::execute(sql_prepare.cc:4807),Prepared_statement::execute_loop(sql_prepare.cc:4235),Prepared_statement::execute_loop(sql_prepare.cc:4235),mysql_stmt_execute_common(sql_prepare.cc:3235),mysqld_stmt_execute(sql_prepare.cc:3132),dispatch_command(sql_parse.cc:1798),do_command(sql_parse.cc:1401),threadpool_process_request(threadpool_common.cc:359),tp_callback(threadpool_common.cc:359),worker_main(threadpool_generic.cc:1606),start_thread(libpthread.so.0),clone(libc.so.6)
      3 pthread_cond_wait,os_event::wait(sync0types.h:476),os_event::wait_low(sync0types.h:476),srv_resume_thread(srv0srv.cc:2533),srv_worker_thread(srv0srv.cc:2533),start_thread(libpthread.so.0),clone(libc.so.6)
      3 pthread_cond_timedwait,inline_mysql_cond_timedwait(mysql_thread.h:1215),get_event(mysql_thread.h:1215),worker_main(threadpool_generic.cc:1602),start_thread(libpthread.so.0),clone(libc.so.6)
      ...
That is, the number of times the collapsed stack was noted, and then the stack itself, without function arguments if possible. This kind of stack is generated from gdb output that looks as follows:
Thread 1 (Thread 0x7f29eee66900 (LWP 32728)):
#0  0x00007f29ef25f431 in poll () from /lib64/libc.so.6
#1  0x000055cc4ece656f in poll (__timeout=-1, __nfds=2, __fds=0x7ffe174108e0) at /usr/include/bits/poll2.h:41
#2  handle_connections_sockets () at /mnt/home/openxs/git/server/sql/mysqld.cc:6753
#3  0x000055cc4ece7a7c in mysqld_main (argc=<optimized out>, argv=<optimized out>) at /mnt/home/openxs/git/server/sql/mysqld.cc:6222
#4  0x00007f29ef190413 in __libc_start_main () from /lib64/libc.so.6
#5  0x000055cc4ecd9ece in _start () at /mnt/home/openxs/git/server/sql/sql_list.h:158
So, each thread stack starts from the 'Thread..." line, each frame is numbered, and the function call is actually the fourth word in the output (in awk terms).
 
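Just to illustrate that observation, a heavily simplified sketch of the collapsing idea applied directly to such gdb output (the input file name is made up, and real pt-pmp does much more) might look like this:

cat /tmp/gdb_bt.txt | awk '
/^Thread/ { if (s != "") print s; s = "" }
/^#/      { f = ($3 == "in") ? $4 : $2; s = (s == "" ? f : s "," f) }
END       { if (s != "") print s }' | sort | uniq -c | sort -r -n -k 1,1
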
Based on that I tried to add a probe to produce stacks in perf format for the mysqld process, with some decoration to have a clear marker of where a new capture starts:
[openxs@fc29 bpftrace]$ sudo bpftrace -e 'profile:hz:99 /comm == "mysqld"/ {printf("# %s\n", ustack(perf));}' > /tmp/ustack.txt
If you wonder about the details of the command format, the functions used etc, check the Reference Guide. The result looks as follows in hexdump -C output:
00000000  41 74 74 61 63 68 69 6e  67 20 31 20 70 72 6f 62  |Attaching 1 prob|
00000010  65 2e 2e 2e 0a 23 20 0a  09 37 66 32 39 65 66 62  |e....# ..7f29efb|
00000020  38 34 33 30 30 20 6e 61  6e 6f 73 6c 65 65 70 2b  |84300 nanosleep+|
00000030  36 34 20 28 2f 75 73 72  2f 6c 69 62 36 34 2f 6c  |64 (/usr/lib64/l|
00000040  69 62 70 74 68 72 65 61  64 2d 32 2e 32 38 2e 73  |ibpthread-2.28.s|
00000050  6f 29 0a 0a 23 20 0a 09  35 35 63 63 34 66 32 30  |o)..# ..55cc4f20|
...
This is how you can see where the new lines are, how many there are, etc., for proper parsing. The next step was to check the pt-pmp source code and http://poormansprofiler.org/, review and adapt some awk magic used there to collapse stacks, get rid of function arguments, etc., and I've ended up with this draft version:
[openxs@fc29 maria10.3]$ cat /tmp/ustack.txt | awk '
BEGIN { s = ""; }
/^\#/ { print s; s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s "," targ } else { s = targ } }
END { print s }' | sort | uniq -c | sort -r -n -k 1,1 | more

    199 __sched_yield,srv_purge_coordinator_thread,start_thread
     16 srv_get_task_queue_length,trx_purge,srv_purge_coordinator_thread,start_thread
     13 trx_purge,srv_purge_coordinator_thread,start_thread
      4 srv_purge_coordinator_thread,start_thread
      4 srv_get_task_queue_length,srv_purge_coordinator_thread,start_thread
      2 ut_crc32_sw,buf_calc_page_crc32,buf_page_is_corrupted,buf_page_io_complete,buf_read_page,buf_page_get_gen,btr_cur_search_to_nth_level_func,row_search_mvcc,ha_innobase::index_read,handler::ha_index_read_map,handler::read_range_first,handler::multi_range_read_next,Mrr_simple_index_reader::get_next,DsMrr_impl::dsmrr_next,QUICK_RANGE_SELECT::get_next,rr_quick,mysql_update,mysql_execute_command,Prepared_statement::execute,Prepared_statement::execute_loop,mysql_stmt_execute_common,mysqld_stmt_execute,dispatch_command,do_command,tp_callback,worker_main,start_thread
      2 syscall,
      2 __pwrite,
      2 mtr_t::commit,row_ins_sec_index_entry_low,row_ins_sec_index_entry,row_upd_sec_index_entry,row_upd_step,row_update_for_mysql,ha_innobase::update_row,handler::ha_update_row,mysql_update,mysql_execute_command,Prepared_statement::execute,Prepared_statement::execute_loop,mysql_stmt_execute_common,mysqld_stmt_execute,dispatch_command,do_command,tp_callback,worker_main,start_thread
...
The above looks reasonable (even if not yet a direct match to what pt-pmp produces; I am too lazy to do much awk these days).
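As a side note, this collapsed format is already close to what flamegraph.pl from Brendan Gregg's FlameGraph repository expects: frames root-first, separated by semicolons, with the count at the end. Assuming the collapsed output above was saved to /tmp/collapsed.txt (both file names and the FlameGraph path below are just my illustrative assumptions), the conversion could look like this:
[openxs@fc29 maria10.3]$ awk '{
  n = split($2, f, ","); if (f[n] == "") n--;             # $1 is the count, $2 the comma-separated stack, leaf first
  s = f[n]; for (i = n - 1; i >= 1; i--) s = s ";" f[i];  # reverse it so that the root frame comes first
  print s, $1 }' /tmp/collapsed.txt > /tmp/folded.txt
[openxs@fc29 maria10.3]$ ~/git/FlameGraph/flamegraph.pl /tmp/folded.txt > /tmp/mysqld.svg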

What's more interesting is that with bpftrace we can also get the kernel stack for a thread while profiling. I took a quick look at what we may get as a result of combining both:
[openxs@fc29 bpftrace]$ sudo bpftrace -e 'profile:hz:99 /comm == "mysqld"/ {printf("# %s\n%s\n", kstack(perf), ustack(perf));}' > /tmp/kstack.txt
[sudo] password for openxs:
^C
I just print one after the other without any separator in perf format, so I can apply the same trick as above to the result:
[openxs@fc29 maria10.3]$ cat /tmp/kstack.txt | awk '
BEGIN { s = ""; }
/^\#/ { print s; s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s "," targ } else { s = targ } }
END { print s }' | sort | uniq -c | sort -r -n -k 1,1 | more

     93 __sched_text_start,schedule,__x64_sys_sched_yield,do_syscall_64,entry_SYSCALL_64_after_hwframe,__sched_yield,srv_purge_coordinator_thread,start_thread
     49 __sched_yield,srv_purge_coordinator_thread,start_thread
     42 do_syscall_64,entry_SYSCALL_64_after_hwframe,__sched_yield,srv_purge_coordinator_thread,start_thread
     20 srv_get_task_queue_length,trx_purge,srv_purge_coordinator_thread,start_thread
     11 trx_purge,srv_purge_coordinator_thread,start_thread
      2 srv_get_task_queue_length,srv_purge_coordinator_thread,start_thread
      1 __x86_indirect_thunk_rax,do_syscall_64,entry_SYSCALL_64_after_hwframe,__sched_yield,srv_purge_coordinator_thread,start_thread
      1 ut_crc32_sw,log_block_calc_checksum_crc32,log_write_up_to,trx_commit_complete_for_mysql,innobase_commit,commit_one_phase_2,ha_commit_trans,trans_commit_stmt,mysql_execute_command,Prepared_statement::execute,Prepared_statement::execute_loop,mysql_stmt_execute_common,mysqld_stmt_execute,dispatch_command,do_command,tp_callback,worker_main,start_thread
...
      1 _raw_spin_unlock_irqrestore,ata_scsi_queuecmd,scsi_queue_rq,blk_mq_dispatch_rq_list,blk_mq_do_dispatch_sched,blk_mq_sched_dispatch_requests,__blk_mq_run_hw_queue,__blk_mq_delay_run_hw_queue,blk_mq_run_hw_queue,blk_mq_sched_insert_requests,blk_mq_flush_plug_list,blk_flush_plug_list,blk_finish_plug,ext4_writepages,do_writepages,__filemap_fdatawrite_range,file_write_and_wait_range,ext4_sync_file,do_fsync,__x64_sys_fsync,do_syscall_64,entry_SYSCALL_64_after_hwframe,fsync,fil_flush_low,fil_flush_file_spaces,buf_dblwr_update,buf_page_io_complete,fil_aio_wait,io_handler_thread,start_thread
      1 page_simple_validate_new,buf_dblwr_flush_buffered_writes,pc_flush_slot,buf_flush_page_cleaner_coordinator.cold.151,start_thread
      1 page_cur_insert_rec_write_log,page_cur_insert_rec_low,page_copy_rec_list_end_no_locks,btr_page_reorganize_low,btr_can_merge_with_page,btr_compress,btr_cur_compress_if_useful,btr_cur_pessimistic_delete,row_purge_remove_sec_if_poss_tree,row_purge_record_func,row_purge_step,que_run_threads,srv_worker_thread,start_thread
      1 os_thread_yield,srv_purge_coordinator_thread,start_thread
      1 os_thread_get_curr_id,fseg_free_page_func,btr_page_free,btr_compress,btr_cur_compress_if_useful,btr_cur_pessimistic_delete,row_purge_remove_sec_if_poss_tree,row_purge_record_func,row_purge_step,que_run_threads,srv_worker_thread,start_thread
...
Looks cool to me: we can see where we spend time in the kernel while on-CPU, not only in user-space code.
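By the way, if matching the pt-pmp output format is not a goal, bpftrace can count the combined stacks itself in a map, so no awk post-processing is needed at all. A sketch (just the usual map dump on exit; I have not compared the overhead of keeping both stacks as map keys):
[openxs@fc29 bpftrace]$ sudo bpftrace -e 'profile:hz:99 /comm == "mysqld"/ { @[kstack, ustack] = count(); }'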

It's sometimes scary and not that easy to use bpftrace, but you'll get there eventually, same as we got to North Greenwich via that Emirates Air Line back in 2019. The only limit is the sky!
* * *
To summarize:
  1. bpftrace and bcc tools you get from vendor packages may not work as expected, so be ready to get nightly builds etc. or, as I prefer, build from source. If this is a problem, better use ftrace :)
  2. It seems getting stack traces with bpftrace for pt-pmp is an easy next step to make. It may be integrated as one of the tracers supported there.
  3. Now we can also add the kernel stack to the picture.
  4. "JUST DO BPF"!
I have many more details to share about dynamic tracing for MySQL and MariaDB DBAs. Stay tuned!

Saturday, March 11, 2017

Fun with Bugs #49 - Applying PMP to MySQL

As you may have noted, in several recent posts I've provided some additional details for slides used during my FOSDEM talk on profiling MySQL. The only part not covered yet is related to using Poor Man's Profiler (and the pt-pmp version of it). I see no reason to explain what it does and how to use it once again, but I would like to show several recent enough MySQL bug reports where this tool was essential to find, explain or demonstrate the problem.

A quick search for active bugs with "pt-pmp" in the MySQL bugs database produces 8 hits at the moment:
  • Bug #85304 - "Reduce mutex contention of fil_system->mutex". It was reported by Zhai Weixiang a few days ago, and pt-pmp output was used as a starting point for the analysis that ended up with a suggested patch.
  • Bug #85191 - "performance regression with HANDLER READ syntax", from the same reporter. In this report pt-pmp was used to prove the point and show what exactly threads were doing.
  • Bug #80979 - "Doublewrite buffer event waits should be annotated for Performance Schema", by Laurynas Biveinis. One more case when PMP shows where the time is spent by threads in some specific case, while there is no instrumentation (yet) for the related wait in Performance Schema.
  • Bug #77827 - "reduce cpu time costs of creating dummy index while change buffer is enabled", again by Zhai Weixiang. In this bug report he had used both perf to show that some notable time was spent on the operation, and pt-pmp to show the related backtraces.
  • Bug #73803 - "don't acquire global_sid_lock->rdlock if gtid is disabled". Once again, Zhai Weixiang used pt-pmp output as a starting point for further code analysis. I wonder why this bug is still "Open", by the way...
  • Bug #70237 - "the mysqladmin shutdown hangs". Guess who reported it after applying PMP when something hung. As I stated in all my 3 recent FOSDEM talks, this is exactly what you have to do before killing and restarting MySQL server in production - get backtraces of all threads, raw or at least aggregated with pt-pmp... I am not sure why the bug was not noted in time; there are even ideas of patches shared. Time for somebody to process it formally.
  • Bug #69812 - "the server stalls at function row_vers_build_for_consistent_read". Same reporter, same tool used, same result - the bug report is still "Open". Looks like I know what my next post(s) in this "Fun with Bugs" series will be devoted to...
  • Bug #62018 - "innodb adaptive hash index mutex contention". It was reported by Mark Callaghan and PMP outputs were used as a part of the evidence. The bug is "Verified" and even got a patch suggested for 5.7.x by Percona engineers, but it still has not got any proper attention from Oracle. I may have some more results related to the "cost" and "benefits" of adaptive hash indexing to share soon, so stay tuned...
Surely, there are way more bugs where PMP was used. Let me share one more that I noted while working on my recent talk on profiling (but had not found time to put it on slides and comment on):

  • Bug #78277 - "InnoDB deadlock, thread stuck on kernel calls from transparent page compression", by Mark Callaghan. Again, PMP outputs were provided to prove the point and show where threads are stuck. The bug is "Open".

For many performance-related cases applying pt-pmp and sharing the results becomes a de facto community requirement, as you can see, for example, in Bug #84025. Note that Umesh Shastry, who verified the bug, provided pt-pmp outputs in his testing results. I'd suggest having gdb and pt-pmp installed and ready to use on any production system running any version and fork of MySQL. Even if your bug is ignored by Oracle, these outputs are useful for other community members who may hit similar cases or are not too lazy to check and work on the code to provide a patch.
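For example, the whole "collect the evidence first" step may be as simple as this (a sketch assuming a single mysqld instance and enough privileges; the output file name is arbitrary, and pt-pmp happily aggregates a saved gdb output passed as an argument):
sudo gdb -p $(pidof mysqld) --batch -ex "thread apply all bt" > /tmp/mysqld_stacks.txt 2>/dev/null
pt-pmp /tmp/mysqld_stacks.txt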