Saturday, January 30, 2021

Playing with recent bpftrace and MariaDB 10.5 on Fedora - Part V, Proper Way To Summarize Time Spent per Stack

In my previous post in this series I was tracing the __lll_lock_wait() function and measuring time spent in it per unique MariaDB server stack trace. The results did not look totally wrong or useless, and I even presented them as a flame graph, but getting them took a long time, and during all that time, for minutes, sysbench test TPS and QPS results were notably smaller. So, it was useful tracing, but surely not a low impact one.

I have already stated before that the approach used is surely not optimal. There should be some better way to summarize the stack traces without such a performance impact. Today I continued to work on this problem. First of all, I finally tried to measure exactly how much time I had to wait for the results from the bpftrace command:

[openxs@fc31 ~]$ time sudo ./lll_lock_wait.bt 2>/dev/null >/tmp/lll_lock_perf_stacks1.txt

real    9m51.904s
user    8m25.576s
sys     1m8.057s
[openxs@fc31 ~]$ ls -l /tmp/lll_lock_perf_stacks1.txt
-rw-rw-r--. 1 openxs openxs 1564291 Jan 30 11:58 /tmp/lll_lock_perf_stacks1.txt

So, the test was running for 300 seconds (5 minutes), and the monitoring results (from about 20 seconds of data collection) came some 5 minutes later! This is definitely NOT acceptable for any regular use.

I started to check things and test more. I noted in the Reference Guide (which I've read entirely a couple of times yesterday and today) that the ustack() function of bpftrace can accept a second argument, the number of frames to show in the stack. Based on the previous flame graphs the most interesting stacks were not long, so I decided to check if 8 or just 5 frames may still be enough for useful results while reducing the performance impact. I changed the code here:

uprobe:/lib64/libpthread.so.0:__lll_lock_wait
/comm == "mariadbd"/
{
    @start[tid] = nsecs;
    @tidstack[tid] = ustack(perf, 5);    /* I also tried 8 */
}

and got these run times for 8 and 5 frames:

[openxs@fc31 ~]$ time sudo ./lll_lock_wait.bt 2>/dev/null >/tmp/lll_lock_perf_stacks1.txt

real    9m39.195s
user    8m9.188s
sys     1m8.430s

[openxs@fc31 ~]$ ls -l /tmp/lll_lock_perf_stacks1.txt
-rw-rw-r--. 1 openxs openxs 1085019 Jan 30 12:12 /tmp/lll_lock_perf_stacks1.txt

[openxs@fc31 ~]$ time sudo ./lll_lock_wait.bt 2>/dev/null >/tmp/lll_lock_perf_stacks2.txt

real    6m6.801s
user    5m5.537s
sys     0m42.226s

[openxs@fc31 ~]$ ls -l /tmp/lll_lock_perf_stacks2.txt
-rw-rw-r--. 1 openxs openxs 728072 Jan 30 12:19 /tmp/lll_lock_perf_stacks2.txt

So, reducing the stack to 5 frames made a notable difference in run time, and the QPS drop was a bit smaller, but still, waiting for 5 minutes is not an option in general.
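
As a side note, the Reference Guide says the frame limit also works without switching to the perf output format, so a line like this (which I did not try here) would be a valid alternative:

    @tidstack[tid] = ustack(5);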

I had not found functions powerful enough to replace awk and collapse the stacks inside the bpftrace program, so I decided to export them "raw", on every hit of the uretprobe, and then process them with the usual Linux tools. I'll skip the intermediate steps, tests and failures and just show the resulting new program for the task:

[openxs@fc31 ~]$ cat lll_lock_wait2.bt
#!/usr/bin/env bpftrace

BEGIN
{
/*    printf("Tracing time from __lll_lock_wait, Ctrl-C to stop\n"); */
}

interval:s:$1 { exit(); }

uprobe:/lib64/libpthread.so.0:__lll_lock_wait
/comm == "mariadbd"/
{
    @start[tid] = nsecs;
    @tidstack[tid] = ustack(perf);    
}

uretprobe:/lib64/libpthread.so.0:__lll_lock_wait
/comm == "mariadbd" && @start[tid] != 0/
{
    $now = nsecs;
    $time = $now - @start[tid];
    @futexstack[@tidstack[tid]] += $time;
    print(@futexstack);
    delete(@futexstack[@tidstack[tid]]);
/*
    printf("Thread: %u, time: %d\n", tid, $time);
*/
    delete(@start[tid]);
    delete(@tidstack[tid]);
}

END
{
    clear(@start);
    clear(@tidstack);
    clear(@futexstack);
}

Compared to the previous version, the changes are the interval probe and the print()/delete() of the collected entry in the uretprobe (plus clearing @futexstack in the END probe). Basically I print and then delete every stack collected, along with its time, as the function call ends, to sum them up externally later. The print() function used is the following:

print(@x[, top [, div]]) - Print the map, optionally the top entries only and with a divisor

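Just to illustrate the optional arguments (this is not something the program above needs, as it prints a single-entry map every time), printing only the 5 most expensive stacks with the values divided down from nanoseconds to microseconds would look like this:

    print(@futexstack, 5, 1000);
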
I've also added the interval probe at the beginning, referring to the first program argument as the number of seconds to run. It just forces an exit after N seconds (the default of 0 means immediate exit; a wrong format is reported as an error).

Now, if I run it while the test is running, this way:

[openxs@fc31 ~]$ time sudo ./lll_lock_wait2.bt 10 2>/dev/null | awk '
BEGIN { s = ""; }
/^@futexstack\[\]/ { s = ""; }
/^@futexstack/ { s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s ";" targ } else { s = targ } }
/^]/ { print $2, s }
' > /tmp/collapsed_lll_lock_v2_raw.txt


real    0m10.646s
user    0m5.648s
sys     0m0.843s
[openxs@fc31 ~]$ ls -l /tmp/collapsed_lll_lock_v2_raw.txt
-rw-rw-r--. 1 openxs openxs 1566 Jan 30 13:12 /tmp/collapsed_lll_lock_v2_raw.txt
[openxs@fc31 ~]$ cat /tmp/collapsed_lll_lock_v2_raw.txt
15531 __lll_lock_wait;
1233833 __lll_lock_wait;
10638 __lll_lock_wait;buf_read_page_background;btr_cur_prefetch_siblings;btr_cur_optimistic_delete_func;row_purge_remove_sec_if_poss_leaf;row_purge_record_func;row_purge_step;que_run_threads;purge_worker_callback;tpool::task_group::execute;tpool::thread_pool_generic::worker_main;;;;
7170 __lll_lock_wait;tpool::thread_pool_generic::worker_main;;;;
273330 __lll_lock_wait;tdc_acquire_share;open_table;open_tables;open_and_lock_tables;execute_sqlcom_select;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
1193083 __lll_lock_wait;trx_undo_report_row_operation;btr_cur_update_in_place;btr_cur_optimistic_update;row_upd_clust_rec;row_upd_clust_step;row_upd_step;row_update_for_mysql;ha_innobase::update_row;handler::ha_update_row;mysql_update;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
183353 __lll_lock_wait;
43231 __lll_lock_wait;MDL_lock::remove_ticket;MDL_context::release_lock;ha_commit_trans;trans_commit;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread

I get a very condensed output almost immediately! I can afford to run it for 60 seconds without much trouble:

[openxs@fc31 ~]$ time sudo ./lll_lock_wait2.bt 60 2>/dev/null | awk '
BEGIN { s = ""; }
/^@futexstack\[\]/ { s = ""; }
/^@futexstack/ { s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s ";" targ } else { s = targ } }
/^]/ { print $2, s }
' > /tmp/collapsed_lll_lock_v2_raw.txt


real    1m0.959s
user    0m44.146s
sys     0m6.126s


[openxs@fc31 ~]$ ls -l /tmp/collapsed_lll_lock_v2_raw.txt
-rw-rw-r--. 1 openxs openxs 12128 Jan 30 13:17 /tmp/collapsed_lll_lock_v2_raw.txt

The impact of these 60 seconds is visible in the sysbench output:

[openxs@fc31 maria10.5]$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --threads=32 --report-interval=10 --time=300 run
sysbench 1.1.0-174f3aa (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 32
Report intermediate results every 10 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 10s ] thds: 32 tps: 625.68 qps: 12567.66 (r/w/o: 8800.49/2512.81/1254.36) lat (ms,95%): 282.25 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 32 tps: 891.51 qps: 17825.69 (r/w/o: 12479.31/3563.26/1783.13) lat (ms,95%): 142.39 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 32 tps: 512.09 qps: 10230.44 (r/w/o: 7163.62/2042.55/1024.27) lat (ms,95%): 287.38 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 32 tps: 369.10 qps: 7390.48 (r/w/o: 5171.39/1480.90/738.20) lat (ms,95%): 350.33 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 32 tps: 417.83 qps: 8347.73 (r/w/o: 5845.14/1667.03/835.56) lat (ms,95%): 390.30 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 32 tps: 484.03 qps: 9687.55 (r/w/o: 6782.09/1937.31/968.16) lat (ms,95%): 292.60 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 32 tps: 431.98 qps: 8640.35 (r/w/o: 6049.38/1727.01/863.95) lat (ms,95%): 344.08 err/s: 0.00 reconn/s: 0.00
[ 80s ] thds: 32 tps: 354.60 qps: 7097.39 (r/w/o: 4968.86/1419.32/709.21) lat (ms,95%): 419.45 err/s: 0.00 reconn/s: 0.00
[ 90s ] thds: 32 tps: 380.98 qps: 7600.82 (r/w/o: 5317.17/1521.80/761.85) lat (ms,95%): 520.62 err/s: 0.00 reconn/s: 0.00
[ 100s ] thds: 32 tps: 423.01 qps: 8467.99 (r/w/o: 5928.71/1693.16/846.13) lat (ms,95%): 397.39 err/s: 0.00 reconn/s: 0.00
[ 110s ] thds: 32 tps: 475.66 qps: 9525.07 (r/w/o: 6669.22/1904.53/951.32) lat (ms,95%): 344.08 err/s: 0.00 reconn/s: 0.00
[ 120s ] thds: 32 tps: 409.13 qps: 8171.48 (r/w/o: 5718.38/1634.94/818.17) lat (ms,95%): 458.96 err/s: 0.00 reconn/s: 0.00
[ 130s ] thds: 32 tps: 190.91 qps: 3826.72 (r/w/o: 2682.19/762.62/381.91) lat (ms,95%): 450.77 err/s: 0.00 reconn/s: 0.00
[ 140s ] thds: 32 tps: 611.40 qps: 12229.99 (r/w/o: 8558.99/2448.20/1222.80) lat (ms,95%): 223.34 err/s: 0.00 reconn/s: 0.00
[ 150s ] thds: 32 tps: 581.99 qps: 11639.19 (r/w/o: 8148.53/2326.68/1163.99) lat (ms,95%): 287.38 err/s: 0.00 reconn/s: 0.00
[ 160s ] thds: 32 tps: 653.21 qps: 13058.51 (r/w/o: 9139.57/2612.52/1306.41) lat (ms,95%): 257.95 err/s: 0.00 reconn/s: 0.00
[ 170s ] thds: 32 tps: 561.87 qps: 11231.27 (r/w/o: 7860.53/2246.99/1123.75) lat (ms,95%): 320.17 err/s: 0.00 reconn/s: 0.00
[ 180s ] thds: 32 tps: 625.66 qps: 12526.32 (r/w/o: 8770.19/2504.82/1251.31) lat (ms,95%): 235.74 err/s: 0.00 reconn/s: 0.00
[ 190s ] thds: 32 tps: 554.50 qps: 11088.01 (r/w/o: 7760.61/2218.40/1109.00) lat (ms,95%): 325.98 err/s: 0.00 reconn/s: 0.00
[ 200s ] thds: 32 tps: 607.10 qps: 12143.89 (r/w/o: 8501.49/2428.20/1214.20) lat (ms,95%): 344.08 err/s: 0.00 reconn/s: 0.00
[ 210s ] thds: 32 tps: 424.17 qps: 8467.63 (r/w/o: 5925.43/1693.87/848.33) lat (ms,95%): 397.39 err/s: 0.00 reconn/s: 0.00
[ 220s ] thds: 32 tps: 466.80 qps: 9335.03 (r/w/o: 6533.32/1868.31/933.40) lat (ms,95%): 397.39 err/s: 0.00 reconn/s: 0.00
[ 230s ] thds: 32 tps: 365.83 qps: 7330.56 (r/w/o: 5133.49/1465.21/731.86) lat (ms,95%): 475.79 err/s: 0.00 reconn/s: 0.00
[ 240s ] thds: 32 tps: 411.27 qps: 8218.45 (r/w/o: 5754.25/1641.87/822.34) lat (ms,95%): 467.30 err/s: 0.00 reconn/s: 0.00
[ 250s ] thds: 32 tps: 127.10 qps: 2534.71 (r/w/o: 1772.60/507.70/254.40) lat (ms,95%): 450.77 err/s: 0.00 reconn/s: 0.00
[ 260s ] thds: 32 tps: 642.35 qps: 12856.29 (r/w/o: 8999.09/2572.50/1284.70) lat (ms,95%): 282.25 err/s: 0.00 reconn/s: 0.00
[ 270s ] thds: 32 tps: 603.80 qps: 12078.79 (r/w/o: 8456.89/2414.30/1207.60) lat (ms,95%): 314.45 err/s: 0.00 reconn/s: 0.00
[ 280s ] thds: 32 tps: 642.70 qps: 12857.60 (r/w/o: 9001.40/2570.80/1285.40) lat (ms,95%): 257.95 err/s: 0.00 reconn/s: 0.00
[ 290s ] thds: 32 tps: 716.57 qps: 14325.60 (r/w/o: 10026.98/2865.48/1433.14) lat (ms,95%): 144.97 err/s: 0.00 reconn/s: 0.00
[ 300s ] thds: 32 tps: 611.16 qps: 12219.69 (r/w/o: 8551.80/2445.66/1222.23) lat (ms,95%): 292.60 err/s: 0.00 reconn/s: 0.00
SQL statistics:
    queries performed:
        read:                            2124836
        write:                           607096
        other:                           303548
        total:                           3035480
    transactions:                        151774 (505.83 per sec.)
    queries:                             3035480 (10116.62 per sec.)

    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

Throughput:
    events/s (eps):                      505.8309
    time elapsed:                        300.0489s
    total number of events:              151774

Latency (ms):
         min:                                    2.17
         avg:                                   63.25
         max:                                 8775.26
         95th percentile:                      331.91
         sum:                              9599021.34

Threads fairness:
    events (avg/stddev):           4742.9375/42.76
    execution time (avg/stddev):   299.9694/0.03

but it continues only for some time after the end of monitoring, and the overall QPS is not much affected.

Now what about the top 5 stacks by wait time? Here they are (well, I had NOT cared to sum up several entries with the same stack, but you should know how to do this with awk, and flamegraph.pl will do it for us later anyway):

[openxs@fc31 ~]$ cat /tmp/collapsed_lll_lock_v2_raw.txt | sort -r -n -k 1,1 | head -5
340360731 __lll_lock_wait;btr_cur_optimistic_insert;row_ins_sec_index_entry_low;row_ins_sec_index_entry;row_ins_step;row_insert_for_mysql;ha_innobase::write_row;handler::ha_write_row;write_record;mysql_insert;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
335903154 __lll_lock_wait;btr_cur_optimistic_insert;row_ins_sec_index_entry_low;row_ins_sec_index_entry;row_upd_sec_index_entry;row_upd_step;row_update_for_mysql;ha_innobase::update_row;handler::ha_update_row;mysql_update;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
287819974 __lll_lock_wait;btr_cur_optimistic_insert;row_ins_sec_index_entry_low;row_ins_sec_index_entry;row_ins_step;row_insert_for_mysql;ha_innobase::write_row;handler::ha_write_row;write_record;mysql_insert;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
281374237 __lll_lock_wait;btr_cur_optimistic_insert;row_ins_sec_index_entry_low;row_ins_sec_index_entry;row_ins_step;row_insert_for_mysql;ha_innobase::write_row;handler::ha_write_row;write_record;mysql_insert;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
268817930 __lll_lock_wait;buf_page_init_for_read;buf_read_page_background;btr_cur_prefetch_siblings;btr_cur_optimistic_delete_func;row_purge_remove_sec_if_poss_leaf;row_purge_record_func;row_purge_step;que_run_threads;purge_worker_callback;tpool::task_group::execute;tpool::thread_pool_generic::worker_main;;;;

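By the way, if one wanted to merge the duplicated stacks before looking at the top N, a small additional awk step like this (my own later addition, not something I ran back then) would do it:

[openxs@fc31 ~]$ cat /tmp/collapsed_lll_lock_v2_raw.txt | awk '{ sum[$2] += $1 } END { for (s in sum) print sum[s], s }' | sort -r -n -k 1,1 | head -5
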
A lot of contention happens on INSERTing rows. I'd say it's expected! Now to create the flame graph we just have to re-order the output a bit:

[openxs@fc31 ~]$ cat /tmp/collapsed_lll_lock_v2_raw.txt | awk '{ if (length($2) > 0) {print $2, $1} }' | /mnt/home/openxs/git/FlameGraph/flamegraph.pl --title="Time spent in ___lll_lock_wait in MariaDB 10.5, all frames" --countname=nsecs > ~/Documents/lll_lock_v2_2.svg

Here it is:

Most mutex waits for oltp_read_write test happen on inserting rows...

Monitoring over 60 seconds allowed me to see a more realistic picture of the contention. It mostly happens on inserting rows by 32 threads into just 5 tables on a system with only 4 cores.

Now it seems we have a tool that can be used more than once in a lifetime and that provides useful information, fast.

* * *

To summarize:

  1. One should not even try to summarize all stacks in a single associative array inside bpftrace for any function that is called too often in many different contexts! There is a limit on the number of items in the map that you may hit, and the impact of storing and exporting it all is too high for monitoring more than a couple of seconds.
  2. Cleaning up stack traces should be done externally, by the usual Linux text processing tools that produce a smaller summary output. The bpftrace program should NOT be designed to collect a lot of data for a long time and then output it all at the end. It looks like exporting smaller chunks to userspace regularly is a better approach.
  3. We can use simple command line arguments for the program that are literally substituted into the probe definitions. The next step would be to make a more generic tool that takes the binary and the function name to probe as command line arguments.
  4. bpftrace is cool!

Thursday, January 28, 2021

Playing with recent bpftrace and MariaDB 10.5 on Fedora - Part IV, Creating a New Tool for Tracing Time Spent in __lll_lock_wait

So far in this series I am just trying to find out what can be done with bpftrace and how to get interesting insights or implement something I was asked about. I am not sharing "best practices" (mostly worst ones, probably, when they are my own inventions) or any final answers on what to trace given a specific MariaDB performance problem. This is still work in progress, and at a really early stage.

In the comments for my previous post you can see that I probably measured the wrong thing if the goal is to find the mutex that causes most of the contention. One of the comments suggested a different tracing idea:

"... indexed by the *callstack*, with the goal to find the place where waits happen. I think __lll_lock_wait could be a better indicator for contention, measuring uncontended pthread_mutex_locks could give some false alarms."

Today I had an hour or so again near my Fedora box with bpftrace, so I tried to modify my previous tool to store the time spent in the __lll_lock_wait function and sum these times up per unique stack that led to the __lll_lock_wait call (in the same inefficient way so far, as I had no chance yet to test and find anything more suitable for production use).

To remind you, __lll_lock_wait() is a low-level wrapper around the Linux futex system call. The prototype for this function is:

void __lll_lock_wait (int *futex, int private)

It is also from the libpthread.so library:

[openxs@fc31 ~]$ strings /lib64/libpthread.so.0 | grep '__lll_lock_wait'
__lll_lock_wait
__lll_lock_wait_private

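Another way to double check that the probe can be attached (just an aside, as bpftrace can list matching uprobes itself) would be something like this:

[openxs@fc31 ~]$ sudo bpftrace -l 'uprobe:/lib64/libpthread.so.0:__lll_lock_wait*'
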
So, my code will not require many modifications. Basic quick code to prove the concept is as simple as this:

[openxs@fc31 ~]$ cat lll_lock_wait.bt
#!/usr/bin/env bpftrace

BEGIN
{
    printf("Tracing time from __lll_lock_wait, Ctrl-C to stop\n");
}

uprobe:/lib64/libpthread.so.0:__lll_lock_wait
/comm == "mariadbd"/

{
    @start[tid] = nsecs;
    @tidstack[tid] = ustack(perf);
}

uretprobe:/lib64/libpthread.so.0:__lll_lock_wait
/comm == "mariadbd" && @start[tid] != 0/

{
    $now = nsecs;
    $time = $now - @start[tid];
    @futexstack[@tidstack[tid]] += $time;

    printf("Thread: %u, time: %d\n", tid, $time);

    delete(@start[tid]);
    delete(@tidstack[tid]);
}

END
{
    clear(@start);
    clear(@tidstack);
    clear(@futexstack);
}

The main difference is that to measure the time spent inside a single function I need both a uprobe and a uretprobe on it. Note that function call arguments are NOT available in the uretprobe.

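If the futex address were needed at return time as well, the usual workaround (a rough sketch here, not something I added to the tool) is to save arg0 in the uprobe, keyed by thread id, and look it up in the uretprobe:

uprobe:/lib64/libpthread.so.0:__lll_lock_wait
/comm == "mariadbd"/
{
    @futexaddr[tid] = arg0;    /* remember the futex address for this thread */
}

uretprobe:/lib64/libpthread.so.0:__lll_lock_wait
/comm == "mariadbd" && @futexaddr[tid] != 0/
{
    printf("Thread: %u, futex: %u\n", tid, @futexaddr[tid]);
    delete(@futexaddr[tid]);
}
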
When I made this .bt file executable and tried to run it with the MariaDB 10.5 server started but not under any load, I got NO output from the uretprobe - no wonder, there is no contention! With the sysbench test started the result was different, a flood of output:

[openxs@fc31 ~]$ sudo ./lll_lock_wait.bt 2>/dev/null
...
Thread: 7488, time: 70520
Thread: 7494, time: 73351
Thread: 5790, time: 106635
Thread: 5790, time: 10008
Thread: 7484, time: 87016
Thread: 5790, time: 18723
^C

So, the program works to some extent and reports some times per thread (but I have no mutex address in the uretprobe). I then modified it to remove the prints and keep the stack associative array for the output in the END probe:

#!/usr/bin/env bpftrace

BEGIN
{
/*    printf("Tracing time from __lll_lock_wait, Ctrl-C to stop\n"); */
}

uprobe:/lib64/libpthread.so.0:__lll_lock_wait
/comm == "mariadbd"/
{
    @start[tid] = nsecs;
    @tidstack[tid] = ustack(perf);    
}

uretprobe:/lib64/libpthread.so.0:__lll_lock_wait
/comm == "mariadbd" && @start[tid] != 0/
{
    $now = nsecs;
    $time = $now - @start[tid];
    @futexstack[@tidstack[tid]] += $time;
/*
    printf("Thread: %u, time: %d\n", tid, $time);
*/
    delete(@start[tid]);
    delete(@tidstack[tid]);
}

END
{
    clear(@start);
    clear(@tidstack);
/*    clear(@futexstack); */
}

Then I started the usual sysbench test for this series:

[openxs@fc31 maria10.5]$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --threads=32 --report-interval=10 --time=300 run
sysbench 1.1.0-174f3aa (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 32
Report intermediate results every 10 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 10s ] thds: 32 tps: 516.06 qps: 10362.55 (r/w/o: 7260.90/2066.43/1035.22) lat (ms,95%): 390.30 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 32 tps: 550.82 qps: 11025.98 (r/w/o: 7718.97/2205.28/1101.74) lat (ms,95%): 320.17 err/s: 0.00 reconn/s: 0.00

and at this moment I started my program and kept it running for more than 10 but less than 20 seconds:

[openxs@fc31 ~]$ sudo ./lll_lock_wait.bt 2>/dev/null >/tmp/lll_lock_perf_stacks.txt
^C

I got an immediate drop in QPS, and it continued till the end of the 300 seconds test:

[ 30s ] thds: 32 tps: 397.67 qps: 7950.29 (r/w/o: 5565.37/1589.58/795.34) lat (ms,95%): 434.83 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 32 tps: 439.87 qps: 8787.15 (r/w/o: 6149.45/1757.97/879.74) lat (ms,95%): 397.39 err/s: 0.00 reconn/s: 0.00
...

    transactions:                        101900 (339.38 per sec.)
    queries:                             2038016 (6787.74 per sec.)

    ignored errors:                      1      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)
...

I had to wait some more and ended up with this big output:

[openxs@fc31 maria10.5]$ ls -l /tmp/lll_lock_perf_stacks.txt
-rw-rw-r--. 1 openxs openxs 1347279 Jan 28 14:29 /tmp/lll_lock_perf_stacks.txt

[openxs@fc31 maria10.5]$ more /tmp/lll_lock_perf_stacks.txt
Attaching 4 probes...


@futexstack[
        7fadbd0bd5e0 __lll_lock_wait+0 (/usr/lib64/libpthread-2.30.so)
        14e5f 0x14e5f ([unknown])
]: 4554

...
@futexstack[
        7fadbd0bd5e0 __lll_lock_wait+0 (/usr/lib64/libpthread-2.30.so)
        55a71e6dd31c MDL_lock::remove_ticket(LF_PINS*, MDL_lock::Ticket_list MDL_lock::*, MDL_ticket*)+60 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71e6ddca5 MDL_context::release_lock(enum_mdl_duration, MDL_ticket*)+37 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71ebc5ef4 row_purge_step(que_thr_t*)+388 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71eb874b8 que_run_threads(que_thr_t*)+2264 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71ebe5773 purge_worker_callback(void*)+355 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71ed2b99a tpool::task_group::execute(tpool::task*)+170 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71ed2a7cf tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        7fadbcac43d4 0x7fadbcac43d4 ([unknown])
        55a721f10f00 0x55a721f10f00 ([unknown])
        55a71ed2ab60 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        2de907894810c083 0x2de907894810c083 ([unknown])
]: 4958
...

@futexstack[
        7fadbd0bd5e0 __lll_lock_wait+0 (/usr/lib64/libpthread-2.30.so)
        55a71eba3e20 row_ins_clust_index_entry_low(unsigned long, unsigned long, dict_index_t*, unsigned long, dtuple_t*, unsigned long, que_thr_t*)+4144 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71eba4a36 row_ins_clust_index_entry(dict_index_t*, dtuple_t*, que_thr_t*, unsigned long)+198 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71eba52b4 row_ins_step(que_thr_t*)+1956 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71ebb5ea1 row_insert_for_mysql(unsigned char const*, row_prebuilt_t*, ins_mode_t)+865 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71eafadc1 ha_innobase::write_row(unsigned char const*)+177 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71e7e5e10 handler::ha_write_row(unsigned char const*)+464 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71e598b8d write_record(THD*, TABLE*, st_copy_info*, select_result*)+477 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71e59f6d7 mysql_insert(THD*, TABLE_LIST*, List<Item>&, List<List<Item> >&, List<Item>&, List<Item>&, enum_duplicates, bool, select_result*)+2967 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71e5d894a mysql_execute_command(THD*)+7722 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71e5ec945 Prepared_statement::execute(String*, bool)+981 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71e5ecbd5 Prepared_statement::execute_loop(String*, bool, unsigned char*, unsigned char*)+133 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71e5ed8f5 mysql_stmt_execute_common(THD*, unsigned long, unsigned char*, unsigned char*, unsigned long, bool, bool)+549 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71e5edb3c mysqld_stmt_execute(THD*, char*, unsigned int)+44 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71e5d4ba6 dispatch_command(enum_server_command, THD*, char*, unsigned int, bool, bool)+9302 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71e5d5f12 do_command(THD*)+274 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71e6d3bf1 do_handle_one_connection(CONNECT*, bool)+1025 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71e6d406d handle_one_connection+93 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71ea4abf2 pfs_spawn_thread+322 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        7fadbd0b34e2 start_thread+226 (/usr/lib64/libpthread-2.30.so)
]: 5427

...

Some stacks surely look reasonable, so I continued with collapsing them into a simpler form:

[openxs@fc31 maria10.5]$ cat /tmp/lll_lock_perf_stacks.txt | awk '
BEGIN { s = ""; }
/^@futexstack\[\]/ { s = ""; }
/^@futexstack/ { s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s ";" targ } else { s = targ } }
/^]/ { print $2, s }
' | more

...
5380 __lll_lock_wait;
5406 __lll_lock_wait;
5421 __lll_lock_wait;
5427 __lll_lock_wait;row_ins_clust_index_entry_low;row_ins_clust_index_entry;row_ins_step;row_insert_for_mysql;ha_innobase::write_row;handler::ha_write_row;write_record;mysql_insert;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
5436 __lll_lock_wait;row_upd_clust_step;row_upd_step;row_update_for_mysql;ha_innobase::delete_row;handler::ha_delete_row;mysql_delete;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
5443 __lll_lock_wait;
5445 __lll_lock_wait;purge_worker_callback;tpool::task_group::execute;tpool::thread_pool_generic::worker_main;;;;
5502 __lll_lock_wait;
...

The awk code is a bit different this time. I know that for flame graphs I need function calls separated by ';', so I am doing that immediately. Non-resolved addresses are all removed, etc. Now I have to sort this to find the top N:

[openxs@fc31 maria10.5]$ cat /tmp/lll_lock_perf_stacks.txt | awk '
BEGIN { s = ""; }
/^@futexstack\[\]/ { s = ""; }
/^@futexstack/ { s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s ";" targ } else { s = targ } }
/^]/ { print $2, s }
' | sort -r -n -k 1,1 > /tmp/collapsed_llll_lock.txt

and then:

[openxs@fc31 maria10.5]$ cat /tmp/collapsed_llll_lock.txt | awk '{ if (length($2) > 0) {print} }' | head -5
28276454 __lll_lock_wait;timer_handler;start_thread
35842 __lll_lock_wait;MDL_map::find_or_insert;MDL_context::try_acquire_lock_impl;MDL_context::acquire_lock;open_table;open_tables;open_and_lock_tables;mysql_delete;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;;;;;;
35746 __lll_lock_wait;MDL_lock::remove_ticket;MDL_context::release_lock;MDL_context::release_locks_stored_before;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
35675 __lll_lock_wait;row_upd_clust_step;row_upd_step;row_update_for_mysql;ha_innobase::delete_row;handler::ha_delete_row;mysql_delete;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
35563 __lll_lock_wait;row_upd_sec_index_entry;row_upd_step;row_update_for_mysql;ha_innobase::update_row;handler::ha_update_row;mysql_update;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread

So, the above are the top 5 fully resolved stack traces where the longest time was spent inside the __lll_lock_wait function. Now let's create the flame graph. We need the output with the stack first and the nanoseconds spent in it next, separated by a space, piped for processing by the same flamegraph.pl program as before:

[openxs@fc31 maria10.5]$ cat /tmp/collapsed_llll_lock.txt | awk '{ if (length($2) > 0) {print $2, $1} }' | head -3
__lll_lock_wait;timer_handler;start_thread 28276454
__lll_lock_wait;MDL_map::find_or_insert;MDL_context::try_acquire_lock_impl;MDL_context::acquire_lock;open_table;open_tables;open_and_lock_tables;mysql_delete;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;;;;;; 35842
__lll_lock_wait;MDL_lock::remove_ticket;MDL_context::release_lock;MDL_context::release_locks_stored_before;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread 35746

Now the flame graph was created as follows:

cat /tmp/collapsed_llll_lock.txt | awk '{ if (length($2) > 0) {print $2, $1} }' | /mnt/home/openxs/git/FlameGraph/flamegraph.pl --title="Time spent in ___lll_lock_wait in MariaDB 10.5" --countname=nsecs > ~/Documents/lll_lock.svg

The resulting flame graph is presented below:

Time related to MDL is highlighted, some 3% only

I highlighted the impact of MDL-related stack traces, some 3% only. The shape is totally different from the pthread_mutex_waits one from the previous post, so this shows how a different point of view may lead to different conclusions about the contention.

* * *

To summarize:

  1. I am still not any good at finding a low impact way (if any) to get stack traces with time summed up per stack. It seems the associative array with them becomes too big if the stacks are not preprocessed, and exporting it to user space at the end of the program literally takes minutes and impacts performance during all that time.
  2. It's easy to measure the time spent inside a function with a uprobe + uretprobe pair on that function.
  3. I am not sure yet about the tools to run in production, but for checking the ideas and creating prototypes bpftrace is really easy to use and flexible.
  4. It is not clear where the stack traces with all calls unresolved to symbols may come from.

Wednesday, January 27, 2021

Playing with recent bpftrace and MariaDB 10.5 on Fedora - Part III, Creating a New Tool for Tracing Mutexes

I had some free time yesterday to do some more tests on Fedora, so I got back to the old request from one of MariaDB developers that I first mentioned in this blog a few months ago:

"ideally, collect stack traces of mariadbd, and sort them in descending order by time spent between pthread_mutex_lock and next pthread_mutex_unlock."

The original request above has not been answered yet, and I was recently reminded about it. What I did with perf, and recently with bpftrace, was (lame) counting of the number of samples per unique stack, while I actually had to measure the time spent between acquiring and releasing specific mutexes.

From the very beginning I was sure that bpftrace should allow me to get the answer easily, and after reviewing the way existing tools are coded, yesterday I decided to finally write a real, multi-line bpftrace program with multiple probes myself. I wanted to fulfill the request literally, no matter how much that would "cost" for now, with bpftrace. It turned out that a couple of hours of calm vacation time was more than enough to get a draft solution.

I started with checking the pthread_mutex_lock manual page. From it I got the (primitive) idea of the two functions used in the process, each with a single argument, the mutex pointer/address:

int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);

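Both functions can also be confirmed to be traceable by asking bpftrace to list the matching uprobes (an optional sanity check, not something the rest of this post depends on):

[openxs@fc31 ~]$ sudo bpftrace -l 'uprobe:/lib64/libpthread.so.0:pthread_mutex_*lock'
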
Multiple threads can try to lock the same mutex, and those that find it locked will wait until an unlock eventually makes the mutex available for one of them to acquire (as decided by the scheduler). I've made the assumption (correct me if I am wrong) that the same thread that locked the mutex must unlock it eventually. Based on that I came up with the following initial lame version of a bpftrace program:

[openxs@fc31 ~]$ cat pthread_mutex.bt
#!/usr/bin/env bpftrace

BEGIN
{
    printf("Tracing time from pthread_mutex_lock to _unlock, Ctrl-C to stop\n");
}

uprobe:/lib64/libpthread.so.0:pthread_mutex_lock /comm == "mariadbd"/
{
    @start[arg0] = nsecs;
    @mutexid[arg0] = tid;
    @tidstack[tid] = ustack;
}

uprobe:/lib64/libpthread.so.0:pthread_mutex_unlock
/comm == "mariadbd" && @start[arg0] != 0 && @mutexid[arg0] == tid/
{
    $now = nsecs;
    $time = $now - @start[arg0];
    @mutexstack[@tidstack[tid]] += $time;
    printf("Mutex: %u, time: %d\n", arg0, $time);
    delete(@start[arg0]);
    delete(@mutexid[arg0]);
    delete(@tidstack[tid]);
}

END
{
    clear(@start);
    clear(@mutexid);
    clear(@tidstack);
}
/* the end */

Do you recognize the biosnoop.bt style? Yes, that is what I was inspired by... So, I've added two uprobes for the library providing the functions, both checking that the call is done from mariadbd. The first one, for lock, stores the start time for the given mutex address, the id of the thread that locked it, and the stack trace of that thread at the moment of locking. The second one, for unlock, computes the time difference since the same mutex was last locked, but it fires only if the unlocking thread has the same id as the locking one. Then I add this time difference to the time spent "within this stack trace", by referring to the thread stack stored as the index into the @mutexstack[] associative array. Then I print some debugging output to see what happens in the process of tracing, and remove the items that were added by the first probe from the associative arrays.

In the END probe I just clean up all associative arrays but @mutexstack[], and, as we've seen before, its content is then just dumped to the output by bpftrace. This is what I am going to post-process later, after a quick debugging session proves my idea gives some reasonable results.

So, with MariaDB 10.5 up and running, started like this (no real tuning for anything, no wonder QPS is not high in the tests below):

./bin/mysqld_safe --no-defaults --socket=/tmp/mariadb.sock --innodb_buffer_pool_size=1G --innodb_flush_log_at_trx_commit=2 --port=3309 &

and having zero user connections, I made pthread_mutex.bt executable and started my very first bpftrace program for the very first time (OK, honestly, a few previous runs showed some syntax errors that I corrected):

[openxs@fc31 ~]$ sudo ./pthread_mutex.bt
Attaching 4 probes...
Tracing time from pthread_mutex_lock to _unlock, Ctrl-C to stop
Mutex: 652747168, time: 6598
Mutex: 629905136, time: 46594
...
Mutex: 652835840, time: 26491
Mutex: 652835712, time: 4569
^C


@mutexstack[
    __pthread_mutex_lock+0
    tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79
    0x7f7e60fb53d4
    0x5562258c9080
    std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0
    0x2de907894810c083
]: 23055
@mutexstack[
    __pthread_mutex_lock+0
    tpool::thread_pool_generic::timer_generic::execute(void*)+188
    tpool::task::execute()+50
    tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79
    0x7f7e60fb53d4
    0x5562258c9080
    std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0
    0x2de907894810c083
]: 23803
@mutexstack[
    __pthread_mutex_lock+0
    tpool::thread_pool_generic::timer_generic::execute(void*)+210
    tpool::task::execute()+50
    tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79
    0x7f7e60fb53d4
    0x5562258c9080
    std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0
    0x2de907894810c083
]: 24555
@mutexstack[
    __pthread_mutex_lock+0
    tpool::thread_pool_generic::submit_task(tpool::task*)+88
    timer_handler+326
    start_thread+226
]: 31859
@mutexstack[
    __pthread_mutex_lock+0
    srv_monitor_task+130
    tpool::thread_pool_generic::timer_generic::execute(void*)+53
    tpool::task::execute()+50
    tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79
    0x7f7e60fb53d4
    0x5562258c9080
    std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0
    0x2de907894810c083
]: 53282
@mutexstackERROR: failed to look up stack id 0 (pid 0): -1
[]: 322499

Not bad. I see mutexes are locked and (the last ERROR aside) time is summed up per stack trace as planned. Moreover, the stack traces look reasonable for 10.5 (a generic thread pool is used inside InnoDB in this version to run background tasks, are you aware of that?). Some symbols are not resolved, but what can I do about it? I'll just skip those addresses at some later stage, maybe.

I just decided to check which threads are locking the mutexes, and modified the printf:

    printf("Mutex: %u, thread: %u, time: %d\n", arg0, tid, $time);

With that modification I also redirected errors to /dev/null and got this:

[openxs@fc31 ~]$ sudo ./pthread_mutex.bt 2>/dev/null
Attaching 4 probes...
Tracing time from pthread_mutex_lock to _unlock, Ctrl-C to stop
Mutex: 652835712, thread: 4476, time: 6354
Mutex: 629905136, thread: 4476, time: 37053
Mutex: 621289632, thread: 4485, time: 5254
Mutex: 621289632, thread: 4485, time: 4797
Mutex: 652835840, thread: 4485, time: 31465
Mutex: 652835712, thread: 4485, time: 4374
Mutex: 652835712, thread: 4476, time: 6048
Mutex: 629905136, thread: 4476, time: 35703
Mutex: 621289632, thread: 4485, time: 4917
Mutex: 621289632, thread: 4485, time: 4779
Mutex: 652835840, thread: 4485, time: 30316
Mutex: 652835712, thread: 4485, time: 4389
Mutex: 652835712, thread: 4476, time: 6733
Mutex: 629905136, thread: 4476, time: 40936
Mutex: 621289632, thread: 4485, time: 4719
Mutex: 621289632, thread: 4485, time: 4725
Mutex: 652835840, thread: 4485, time: 30637
Mutex: 652835712, thread: 4485, time: 4441
^C


@mutexstack[
    __pthread_mutex_lock+0
    tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79
    0x7f7e60fb53d4
    0x5562258c9080
    std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0
    0x2de907894810c083
]: 13204
@mutexstack[
    __pthread_mutex_lock+0
    tpool::thread_pool_generic::timer_generic::execute(void*)+210
    tpool::task::execute()+50
    tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79
    0x7f7e60fb53d4
    0x5562258c9080
    std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0
    0x2de907894810c083
]: 14301
@mutexstack[
    __pthread_mutex_lock+0
    tpool::thread_pool_generic::timer_generic::execute(void*)+188
    tpool::task::execute()+50
    tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79
    0x7f7e60fb53d4
    0x5562258c9080
    std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0
    0x2de907894810c083
]: 14890
@mutexstack[
    __pthread_mutex_lock+0
    tpool::thread_pool_generic::submit_task(tpool::task*)+88
    timer_handler+326
    start_thread+226
]: 19135
@mutexstack[]: 206110

I see different threads locking the same mutexes, etc. One day, given more time, I'd try to figure out what mutexes those are and what the purpose of each thread was (it can be seen based on the OS thread id in performance_schema.threads in 10.5, fortunately, or inferred from the stacks at the moment).

I've removed the debug print (no interactive output, just the final summarized data), changed ustack to ustack(perf) (assuming I'd know better how to deal with that output format later - it was not really a good idea), and ended up with this final version of the tool:

#!/usr/bin/env bpftrace

BEGIN
{
/*
    printf("Tracing time from pthread_mutex_lock to _unlock, Ctrl-C to stop\n");
*/
}

uprobe:/lib64/libpthread.so.0:pthread_mutex_lock /comm == "mariadbd"/
{
    @start[arg0] = nsecs;
    @mutexid[arg0] = tid;
    @tidstack[tid] = ustack(perf);    
}

uprobe:/lib64/libpthread.so.0:pthread_mutex_unlock
/comm == "mariadbd" && @start[arg0] != 0 && @mutexid[arg0] == tid/
{
    $now = nsecs;
    $time = $now - @start[arg0];
    @mutexstack[@tidstack[tid]] += $time;
/*
    printf("Mutex: %u, thread: %u, time: %d\n", arg0, tid, $time);
*/
    delete(@start[arg0]);
    delete(@mutexid[arg0]);
    delete(@tidstack[tid]);
}

END
{
    clear(@start);
    clear(@mutexid);
    clear(@tidstack);
}
/* the end */

I saved the stack traces to a file in /tmp, to work on the output further outside of bpftrace. Again, probably it was not the best idea, but I am not yet fluent with string processing in bpftrace anyway, so I rely on awk, sort, etc.:

[openxs@fc31 ~]$ sudo ./pthread_mutex.bt 2>/dev/null >/tmp/pthread_mutex_perf_stacks.txt
^C[openxs@fc31 ~]$ cat /tmp/pthread_mutex_perf_stacks.txt
Attaching 4 probes...



@mutexstack[
        7f7e615a6e70 __pthread_mutex_lock+0 (/usr/lib64/libpthread-2.30.so)
        556223eb07cf tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        7f7e60fb53d4 0x7f7e60fb53d4 ([unknown])
        5562258c9080 0x5562258c9080 ([unknown])
        556223eb0b60 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        2de907894810c083 0x2de907894810c083 ([unknown])
]: 21352
@mutexstack[
        7f7e615a6e70 __pthread_mutex_lock+0 (/usr/lib64/libpthread-2.30.so)
        556223eb0d62 tpool::thread_pool_generic::timer_generic::execute(void*)+210 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        556223eb1c52 tpool::task::execute()+50 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        556223eb07cf tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        7f7e60fb53d4 0x7f7e60fb53d4 ([unknown])
        5562258c9080 0x5562258c9080 ([unknown])
        556223eb0b60 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        2de907894810c083 0x2de907894810c083 ([unknown])
]: 22975
...

After checking what I did previously to collapse such stack traces into one line per stack, pt-pmp style, and after multiple clarification runs and changes, I ended up with the following awk code:

awk '
BEGIN { s = ""; }
/^@mutexstack\[\]/ { s = ""; }
/^@mutexstack/ { s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s "," targ } else { s = targ } }
/^]/ { print $2, s }
'

I process the lines that do not contain stack frames, those around the stack block, resetting the collected stack s at the beginning of a block and printing it at the end. For the stack frame lines I take only the function name and ignore everything else to form targ, and concatenate it to the stack already collected, with a comma as a separator between function names (which was a wrong idea for later use). The original code that inspired all this came from pt-pmp, as far as I remember. I just adapted it to the format, better than in the previous posts.

Post-processing the output with this awk code gave me the following:

[openxs@fc31 ~]$ cat /tmp/pthread_mutex_perf_stacks.txt | awk '
> BEGIN { s = ""; }
> /^@mutexstack\[\]/ { s = ""; }
> /^@mutexstack/ { s = ""; }
> /^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s "," targ } else { s = targ } }
> /^]/ { print $2, s }
> '
21352 __pthread_mutex_lock,tpool::thread_pool_generic::worker_main,,,,
22975 __pthread_mutex_lock,tpool::thread_pool_generic::timer_generic::execute,tpool::task::execute,tpool::thread_pool_generic::worker_main,,,,
24568 __pthread_mutex_lock,tpool::thread_pool_generic::timer_generic::execute,tpool::task::execute,tpool::thread_pool_generic::worker_main,,,,
33469 __pthread_mutex_lock,tpool::thread_pool_generic::submit_task,timer_handler,start_thread

Non-resolved addresses are removed, same as the offsets from the function start. Now sorting remains, in descending order, on the first column as the key:

[openxs@fc31 ~]$ cat /tmp/pthread_mutex_perf_stacks.txt | awk '
BEGIN { s = ""; }
/^@mutexstack\[\]/ { s = ""; }
/^@mutexstack/ { s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s "," targ } else { s = targ } }
/^]/ { print $2, s }
' | sort -r -n -k 1,1
33469 __pthread_mutex_lock,tpool::thread_pool_generic::submit_task,timer_handler,start_thread
24568 __pthread_mutex_lock,tpool::thread_pool_generic::timer_generic::execute,tpool::task::execute,tpool::thread_pool_generic::worker_main,,,,
22975 __pthread_mutex_lock,tpool::thread_pool_generic::timer_generic::execute,tpool::task::execute,tpool::thread_pool_generic::worker_main,,,,
21352 __pthread_mutex_lock,tpool::thread_pool_generic::worker_main,,,,

That's what we have for the server without user connections. Now let me put it under a highly concurrent sysbench test load (good idea, isn't it?):

[openxs@fc31 maria10.5]$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --threads=32 --report-interval=10 --time=300 run
sysbench 1.1.0-174f3aa (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 32
Report intermediate results every 10 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 10s ] thds: 32 tps: 653.07 qps: 13097.92 (r/w/o: 9174.69/2613.89/1309.34) lat (ms,95%): 240.02 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 32 tps: 1025.71 qps: 20511.58 (r/w/o: 14358.12/4102.14/2051.32) lat (ms,95%): 71.83 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 32 tps: 588.21 qps: 11770.70 (r/w/o: 8238.74/2355.44/1176.52) lat (ms,95%): 235.74 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 32 tps: 306.22 qps: 6135.54 (r/w/o: 4298.14/1224.97/612.43) lat (ms,95%): 369.77 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 32 tps: 467.00 qps: 9339.64 (r/w/o: 6537.96/1867.69/933.99) lat (ms,95%): 308.84 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 32 tps: 302.19 qps: 6044.31 (r/w/o: 4230.60/1209.34/604.37) lat (ms,95%): 520.62 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 32 tps: 324.91 qps: 6496.60 (r/w/o: 4548.67/1298.12/649.81) lat (ms,95%): 467.30 err/s: 0.00 reconn/s: 0.00
[ 80s ] thds: 32 tps: 303.58 qps: 6058.55 (r/w/o: 4238.05/1213.33/607.16) lat (ms,95%): 646.19 err/s: 0.00 reconn/s: 0.00
[ 90s ] thds: 32 tps: 258.39 qps: 5176.10 (r/w/o: 3625.73/1033.58/516.79) lat (ms,95%): 634.66 err/s: 0.00 reconn/s: 0.00
[ 100s ] thds: 32 tps: 213.72 qps: 4279.43 (r/w/o: 2995.93/856.07/427.43) lat (ms,95%): 707.07 err/s: 0.00 reconn/s: 0.00
[ 110s ] thds: 32 tps: 208.29 qps: 4144.23 (r/w/o: 2896.58/831.07/416.58) lat (ms,95%): 623.33 err/s: 0.00 reconn/s: 0.00
[ 120s ] thds: 32 tps: 456.29 qps: 9135.45 (r/w/o: 6397.03/1826.05/912.38) lat (ms,95%): 363.18 err/s: 0.00 reconn/s: 0.00
[ 130s ] thds: 32 tps: 582.21 qps: 11641.73 (r/w/o: 8148.49/2328.63/1164.61) lat (ms,95%): 277.21 err/s: 0.00 reconn/s: 0.00
[ 140s ] thds: 32 tps: 560.39 qps: 11208.17 (r/w/o: 7845.84/2241.55/1120.78) lat (ms,95%): 257.95 err/s: 0.00 reconn/s: 0.00
[ 150s ] thds: 32 tps: 338.03 qps: 6768.93 (r/w/o: 4739.47/1353.41/676.05) lat (ms,95%): 442.73 err/s: 0.00 reconn/s: 0.00
[ 160s ] thds: 32 tps: 410.20 qps: 8210.38 (r/w/o: 5748.19/1641.80/820.40) lat (ms,95%): 411.96 err/s: 0.00 reconn/s: 0.00
[ 170s ] thds: 32 tps: 480.28 qps: 9599.94 (r/w/o: 6716.68/1922.81/960.45) lat (ms,95%): 325.98 err/s: 0.00 reconn/s: 0.00
[ 180s ] thds: 32 tps: 397.62 qps: 7952.16 (r/w/o: 5568.62/1588.19/795.35) lat (ms,95%): 411.96 err/s: 0.00 reconn/s: 0.00
[ 190s ] thds: 32 tps: 338.77 qps: 6769.31 (r/w/o: 4739.09/1352.78/677.44) lat (ms,95%): 475.79 err/s: 0.00 reconn/s: 0.00
[ 200s ] thds: 32 tps: 417.81 qps: 8372.59 (r/w/o: 5857.10/1679.76/835.73) lat (ms,95%): 331.91 err/s: 0.00 reconn/s: 0.00
[ 210s ] thds: 32 tps: 267.40 qps: 5340.01 (r/w/o: 3742.10/1063.10/534.80) lat (ms,95%): 634.66 err/s: 0.00 reconn/s: 0.00
[ 220s ] thds: 32 tps: 267.70 qps: 5355.78 (r/w/o: 3748.96/1071.42/535.41) lat (ms,95%): 590.56 err/s: 0.00 reconn/s: 0.00
[ 230s ] thds: 32 tps: 243.11 qps: 4859.74 (r/w/o: 3401.70/971.83/486.21) lat (ms,95%): 733.00 err/s: 0.00 reconn/s: 0.00
[ 240s ] thds: 32 tps: 173.99 qps: 3474.97 (r/w/o: 2430.94/696.05/347.98) lat (ms,95%): 1013.60 err/s: 0.00 reconn/s: 0.00
[ 250s ] thds: 32 tps: 169.71 qps: 3403.05 (r/w/o: 2384.37/679.25/339.42) lat (ms,95%): 877.61 err/s: 0.00 reconn/s: 0.00
[ 260s ] thds: 32 tps: 407.57 qps: 8151.27 (r/w/o: 5704.23/1631.89/815.15) lat (ms,95%): 272.27 err/s: 0.00 reconn/s: 0.00
...
[ 300s ] thds: 32 tps: 382.41 qps: 7641.05 (r/w/o: 5348.01/1528.43/764.62) lat (ms,95%): 434.83 err/s: 0.00 reconn/s: 0.00
SQL statistics:
    queries performed:
        read:                            1663592
        write:                           475312
        other:                           237656
        total:                           2376560
    transactions:                        118828 (396.04 per sec.)
    queries:                             2376560 (7920.89 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

Throughput:
    events/s (eps):                      396.0445
    time elapsed:                        300.0370s
    total number of events:              118828

Latency (ms):
         min:                                    2.17
         avg:                                   80.79
         max:                                 5012.10
         95th percentile:                      390.30
         sum:                              9600000.68

Threads fairness:
    events (avg/stddev):           3713.3750/40.53
    execution time (avg/stddev):   300.0000/0.01

Trust me that I started my bpftrace program after the initial 20 seconds of the test run, and let it work for at most 20 seconds. But the entire rest of the test, the next 280 seconds, was notably affected by a visible drop in QPS! I pressed Ctrl-C but got the command prompt back much later, not even after 300 seconds... I was watching the output file grow in another shell:

[openxs@fc31 ~]$ ls -l /tmp/pthread_mutex_perf_stacks.txt
-rw-rw-r--. 1 openxs openxs 264177 Jan 26 13:26 /tmp/pthread_mutex_perf_stacks.txt
...
[openxs@fc31 ~]$ ls -l /tmp/pthread_mutex_perf_stacks.txt
-rw-rw-r--. 1 openxs openxs 281111 Jan 26 13:27 /tmp/pthread_mutex_perf_stacks.txt
...
[openxs@fc31 ~]$ ls -l /tmp/pthread_mutex_perf_stacks.txt
-rw-rw-r--. 1 openxs openxs 4116283 Jan 26 13:35 /tmp/pthread_mutex_perf_stacks.txt

So I ended up with 4M of text data exported to userland, for just 20 seconds of data collection, and with a performance drop lasting many minutes for my 32 threads test on an old 4-core system. Not that impressive, and I should probably take better care to aggregate and process the data more inside my bpftrace program, or maybe just dump raw stack-time entries as they are collected. I'll test and see how to improve, as this way of collection is not suitable for production use on a loaded system :(

Anyway, I have to process what was collected with such an impact. To remind you, the data were collected this way:

[openxs@fc31 ~]$ sudo ./pthread_mutex.bt 2>/dev/null >/tmp/pthread_mutex_perf_stacks.txt
[sudo] password for openxs:
^C

and then I applied the same awk plus sort command line as above to get the collapsed stacks, this time saved to /tmp/collapsed_pthread_mutex.txt.
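
The exact command line did not make it into my notes, but reconstructed from the steps above it must have been essentially this (the only new part is the redirection to the file used below):

[openxs@fc31 ~]$ cat /tmp/pthread_mutex_perf_stacks.txt | awk '
BEGIN { s = ""; }
/^@mutexstack\[\]/ { s = ""; }
/^@mutexstack/ { s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s "," targ } else { s = targ } }
/^]/ { print $2, s }
' | sort -r -n -k 1,1 > /tmp/collapsed_pthread_mutex.txt

This is what I've seen as a result: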

[openxs@fc31 ~]$ cat /tmp/collapsed_pthread_mutex.txt | more
104251253 __pthread_mutex_lock,buf_flush_page,buf_flush_try_neighbors,buf_do_flush_list_batch,buf_flush_lists,buf_flush_page_cleaner,start_thread
78920938
74828263
74770599
74622438
72853129
67893142
66546439 __pthread_mutex_lock,buf_do_flush_list_batch,buf_flush_lists,buf_flush_page_cleaner,start_thread
61669188
59330217
55480213
55045396
53531941
53216338
...

I have yet to find out where those non-resolved and entirely removed entries are coming from and what to do with them so they do not influence the analysis. For now I need to get rid of them as useless. This is how I did it to get the "top 5" stacks with the time (in nanoseconds) spent in them:

[openxs@fc31 ~]$ cat /tmp/collapsed_pthread_mutex.txt | awk '{ if (length($2) > 0) {print} }' | head -5
104251253 __pthread_mutex_lock,buf_flush_page,buf_flush_try_neighbors,buf_do_flush_list_batch,buf_flush_lists,buf_flush_page_cleaner,start_thread
66546439 __pthread_mutex_lock,buf_do_flush_list_batch,buf_flush_lists,buf_flush_page_cleaner,start_thread
31431176 __pthread_mutex_lock,buf_flush_try_neighbors,buf_do_flush_list_batch,buf_flush_lists,buf_flush_page_cleaner,start_thread
27100601 __pthread_mutex_lock,tpool::aio_linux::getevent_thread_routine,,,,
11730055 __pthread_mutex_lock,buf_flush_lists,buf_flush_page_cleaner,start_thread
[openxs@fc31 ~]$ cat /tmp/collapsed_pthread_mutex.txt | awk '{ if (length($2) > 0) {print} }' > /tmp/collapsed_clean_pthread_mutex.txt

I saved the output into the /tmp/collapsed_clean_pthread_mutex.txt file. The next step would be to represent the result in some nice graphical way, a flame graph! I have the software in place:

[openxs@fc31 ~]$ ls /mnt/home/openxs/git/FlameGraph/
aix-perf.pl                 stackcollapse-gdb.pl
demos                       stackcollapse-go.pl
dev                         stackcollapse-instruments.pl
difffolded.pl               stackcollapse-java-exceptions.pl
docs                        stackcollapse-jstack.pl
example-dtrace-stacks.txt   stackcollapse-ljp.awk
example-dtrace.svg          stackcollapse-perf.pl
example-perf-stacks.txt.gz  stackcollapse-perf-sched.awk
example-perf.svg            stackcollapse.pl
files.pl                    stackcollapse-pmc.pl
flamegraph.pl               stackcollapse-recursive.pl
jmaps                       stackcollapse-sample.awk
pkgsplit-perf.pl            stackcollapse-stap.pl
range-perf.pl               stackcollapse-vsprof.pl
README.md                   stackcollapse-vtune.pl
record-test.sh              stackcollapse-xdebug.php
stackcollapse-aix.pl        test
stackcollapse-bpftrace.pl   test.sh
stackcollapse-elfutils.pl

But I quickly recalled that flamegraph.pl expects this kind of input format, with ";" as a separator and the number as the second column, not the first:

bash;entry_SYSCALL_64_fastpath;sys_read;vfs_read;...;schedule 8

There is also a tool to collapse raw bpftrace stacks, stackcollapse-bpftrace.pl, and I have to check how it works for my case one day... Yesterday I just wanted to complete testing as soon as possible, so I proceeded with a quick and dirty awk hack:

[openxs@fc31 ~]$ cat /tmp/collapsed_clean_pthread_mutex.txt | awk ' { gsub(",",";",$2); print "mariadbd;"$2, $1 }' | head -5
mariadbd;__pthread_mutex_lock;buf_flush_page;buf_flush_try_neighbors;buf_do_flush_list_batch;buf_flush_lists;buf_flush_page_cleaner;start_thread 104251253
mariadbd;__pthread_mutex_lock;buf_do_flush_list_batch;buf_flush_lists;buf_flush_page_cleaner;start_thread 66546439
mariadbd;__pthread_mutex_lock;buf_flush_try_neighbors;buf_do_flush_list_batch;buf_flush_lists;buf_flush_page_cleaner;start_thread 31431176
mariadbd;__pthread_mutex_lock;tpool::aio_linux::getevent_thread_routine;;;; 27100601
mariadbd;__pthread_mutex_lock;buf_flush_lists;buf_flush_page_cleaner;start_thread 11730055

This format looks acceptable, so I've generated the flame graph with the same hack:

[openxs@fc31 ~]$ cat /tmp/collapsed_clean_pthread_mutex.txt | awk ' { gsub(",",";",$2); print "mariadbd;"$2, $1 }' | /mnt/home/openxs/git/FlameGraph/flamegraph.pl --title="pthread_mutex_waits in MariaDB 10.5" --countname=nsecs > ~/Documents/mutex.svg

and here is the result, with a search for "tpool" highlighting how much of the mutex wait time is related to the thread pool of background InnoDB threads:

One can surely create a flame graph based on stacks collected by the bpftrace program, one way or the other...

I'll stop at this stage and maybe continue testing later this week. Stay tuned!

* * *

To summarize:

  • I am not yet sure if my logic in the programs above was correct. I have to think more about it.
  • I surely need to find another way to process the data, either by collapsing/processing stacks in my bpftrace program to make them smaller, or maybe by submitting raw stack/time data as they are collected to the user level. More tests to come...
  • It is easy to create custom bpftrace programs for collecting the data you need. I think memory allocation tracing is my next goal. Imagine a printout of memory allocated and not freed, per allocating thread... If only that had less impact on QPS than what my lame program above demonstrated :)

Tuesday, January 26, 2021

What mysql_upgrade really does in MariaDB, Part II, Bugs and Missing Features

Both /proc sampling and bpftrace are cool topics to write about, but I should not forget my third upcoming talk at FOSDEM 2021 in less than 2 weeks, "Upgrading to a newer major version of MariaDB", that is mostly devoted to mysql_upgrade internals and is based on this blog post. Today I am going to provide some more background details and, probably for the first time, act as a MariaDB entomologist and study some bug reports and feature requests related to mysql_upgrade (a.k.a. mariadb-upgrade in 10.5+) that were added over the last 10 months or so.

I tried different ways to search for MariaDB JIRA issues related to mysql_upgrade. For example, this is how I tried to find any bugs in mysql_upgrade closed over the last two months (I recall there were a few):

text ~ mysql_upgrade AND project = "MariaDB Server" AND status = Closed AND createdDate >= "2020-12-01" ORDER BY createdDate DESC

I've got a list of 7 reports, with 2 relevant bugs that are already fixed:

  • MDEV-24566 - "mysql_upgrade failed with "('mariadb.sys'@'localhost') does not exist" and mariadb 10.4/10.5 on docker". This is fixed in current docker images at https://hub.docker.com/r/mariadb/server. The problem was related to the Docker image only, and mysql_upgrade was actually affected by the initial database content there.
  • MDEV-24452 - "ALTER TABLE event take infinite time which for example breaks mysql_upgrade". Now this was a real blocker bug in recent 10.5. If any event managed to start before you executed mysql_upgrade, the utility and proper upgrade process was blocked. Good to see this fixed in upcoming 10.5.9.

So, Monty really fixes related bugs when they are reported, as promised. Now let's consider the following query, for still not closed issues reported since April 1, 2020:

text ~ "mysql_upgrade" AND project = "MariaDB Server" AND status != Closed AND createdDate >= "2020-04-01" ORDER BY createdDate DESC

I checked them one by one (as they may be related to the upgrade process but not to mysql_upgrade specifically) and placed them into two lists, bugs and feature requests. Let me start with the tasks (feature requests), so you know what was missing by design or has to be done differently:

  • MDEV-24586 - "remove scripts/mysql_to_mariadb.sql". Actually, the proper logic is implemented in scripts/mysql_system_tables_fix.sql, which forms part of mysql_upgrade, so a separate script is no longer needed.
  • MDEV-24540 - "Detect incompatible MySQL virtual columns and report in error log." MariaDB does not support migration from MySQL-generated physical database tables containing virtual columns, and produces column mismatch errors, failures in mysql_upgrade etc. It would be great to have proper error messages from mysql_upgrade in this case, explaining the real problem and possible solutions (dump, drop and reload or whatever).
  • MDEV-24453 - "mysql_upgrade does not honor --verbose parameter". It is not passed to other binaries called and this may make debugging upgrade issues more complex.
  • MDEV-24316 - "cross upgrade from MySQL - have precheck tool". According to the reporter, Daniel Black, the goal would be to check on a database/global scale, looking at tables, at features used, at settings, at character sets in tables, and determine the "migratability" of a given MySQL instance. I voted for this feature!
  • MDEV-24093 - "Detect during mysql_upgrade if type_mysql_json.so is needed and load it". After MDEV-18323, the MYSQL_JSON type is available as a dynamically loadable plugin. To make mysql_upgrade run seamlessly we need to make sure it is loaded appropriately (and unloaded when done with the upgrade). This is already in review, so it will be implemented really soon.
  • MDEV-23962 - "Remove arc directory support". I think only Monty (the bug reporter) knows what this is about. I don't :)
  • MDEV-23008 - "store mysql_upgrade version info in system table instead of local file". One of the really important feature requests from my colleague since 2005, Hartmut Holzgraefe.
  • MDEV-22357 - "Clearing InnoDB autoincrement counter when upgrading from MariaDB < 10.2.4". CHECK TABLE ... FOR UPGRADE should work differently for InnoDB tables, for mysql_upgrade to work properly.
  • MDEV-22323 - "Upgrading MariaDB". This is the "umbrella" task to cover all problematic cases of MySQL to MariaDB, Percona Server to MariaDB, and minor MariaDB server upgrades.
  • MDEV-22322 - "Percona Server -> Mariadb Upgrades". Summary of all the related issues. See MDEV-22679 etc.

Now back to more or less serious bugs that are still waiting for the fix:

  • MDEV-24579 - "Error table->get_ref_count() after update to 10.5.8". It seems DDL executed on mysql.* tables with InnoDB persistent statistics (like those executed by mysql_upgrade!) may cause problems for concurrent queries (up to an assertion failure in this case). So we should either remove those tables (I wish!), or do something with mysql_upgrade, or (IMHO even better) not let users connect and execute queries while mysql_upgrade is running, like MySQL 8 does when the server is started for the first time on an older datadir and performs the upgrade. Take care in the meantime...
  • MDEV-23652 - "Assertion failures upon reading InnoDB system table after normal upgrade from 10.2". Now this is a real bug :) An assertion failure during mysql_upgrade is surely something to fix!
  • MDEV-23636 - "mysql_upgrade [ERROR] InnoDB: Fetch of persistent statistics requested for table". I am not sure what's going on here, and why.
  • MDEV-23392 - "main.mysql_upgrade_view failed in buildbot with another wrong result". MTR test case failures are something to care about.
  • MDEV-22683 - "mysql_upgrade misses some changes to mysql schema". Over time and different versions, some structures in the mysql schema get changed, but not all the changes make it into the scripts executed by mysql_upgrade. As a result, a schema freshly created by mysql_install_db on version 10.x.y differs from a schema created on an earlier version and upgraded to 10.x.y by mysql_upgrade. The real diffs are listed, per version, in this bug report from Elena Stepanova.
  • MDEV-22655 - "CHECK TABLE ... FOR UPGRADE fails to report old datetime format". That's my favorite, unfortunately. It makes running mysql_upgrade useless for some cases of upgrade from pre-5.6 MySQL versions and leads to problems for tables partitioned by datetime etc. columns. See also MDEV-24499 - "Server upgrade causes compound index and related query to fail.".
  • MDEV-22645 - "default_role gets removed when migrating from 10.1 to 10.4". May have something to do with mydumper/myloader used, but still a problem.
  • MDEV-22482 - "pam v2: mysql_upgrade doesn't fix the ownership/privileges of auth_pam_tool". No comments.
  • MDEV-22477 - "mysql_upgrade fails with sql_notes=OFF". mysql_upgrade, or, more exactly, mysql_system_tables.sql, uses the @@warning_count variable in the upgrade logic. The variable, in turn, depends on the value of sql_notes. When it is OFF, @@warning_count is not incremented, and mysql_upgrade doesn't work as expected.

After Monty's post in April 2020, many new mysql_upgrade bugs were reported, and some were already fixed. So, we are on the way...

* * *

To summarize:

  1. There are still some bugs and missing features in mysql_upgrade.
  2. MariaDB actively works on fixing them, as once promised by Monty.
  3. Check the lists in this blog post carefully if you plan to upgrade to recent MariaDB 10.x.y versions.
  4. Please, report any problem with mysql_upgrade or upgrades in general to our JIRA.

Monday, January 25, 2021

Checking User Threads and Temporary Tables With gdb in MariaDB 10.4+, Step By Step

There were no posts about gdb tricks in this blog for a long time. This is surely unusual, but I had not done anything fancy with gdb for more than a year. Today I finally got a chance to find something new in the code and answer yet another question based on code review and some basic gdb commands.

The question was about a way to find what temporary tables, if any, are created by some connection. There is no way to do this in MariaDB 10.4; see MDEV-12459 for some related plans (and I_S.INNODB_TEMP_TABLE_INFO of MySQL, which appeared in MariaDB for a short time only). My immediate answer was that this is surely stored somewhere in the THD structure and I just have to find it (and a way to work with that information) using code review and/or gdb.

The first step was easy. I know that THD is defined in sql/sql_class.h, and there I see:

class THD: public THD_count, /* this must be first */
           public Statement,
           /*
             This is to track items changed during execution of a prepared
             statement/stored procedure. It's created by
             nocheck_register_item_tree_change() in memory root of THD,
             and freed in rollback_item_tree_changes().
             For conventional execution it's always empty.
           */
           public Item_change_list,
           public MDL_context_owner,
           public Open_tables_state
...

Temporary tables surely must be somewhere in that Open_tables_state. In the same file we can find the following:

class Open_tables_state
{
public:
  /**
    As part of class THD, this member is set during execution
    of a prepared statement. When it is set, it is used
    by the locking subsystem to report a change in table metadata.
    When Open_tables_state part of THD is reset to open
    a system or INFORMATION_SCHEMA table, the member is cleared
    to avoid spurious ER_NEED_REPREPARE errors -- system and
    INFORMATION_SCHEMA tables are not subject to metadata version
    tracking.
    @sa check_and_update_table_version()
  */
  Reprepare_observer *m_reprepare_observer;

  /**
    List of regular tables in use by this thread. Contains temporary and
    base tables that were opened with @see open_tables().
  */
  TABLE *open_tables;

  /**
    A list of temporary tables used by this thread. This includes
    user-level temporary tables, created with CREATE TEMPORARY TABLE,
    and internal temporary tables, created, e.g., to resolve a SELECT,
    or for an intermediate table used in ALTER.
  */
  All_tmp_tables_list *temporary_tables;

...

With this information I am ready to dive into a gdb session. I have MariaDB 10.4.18 at hand and create a couple of temporary tables in the connection with id 9:

MariaDB [test]> select version(), connection_id(), current_user();
+-----------------+-----------------+------------------+
| version()       | connection_id() | current_user()   |
+-----------------+-----------------+------------------+
| 10.4.18-MariaDB |               9 | openxs@localhost |
+-----------------+-----------------+------------------+
1 row in set (0,000 sec)

MariaDB [test]> create temporary table mytemp(c1 int, c2 varchar(100));
Query OK, 0 rows affected (0,034 sec)

MariaDB [test]> create temporary table mytemp2(id int, c2 int) engine=MyISAM;
Query OK, 0 rows affected (0,001 sec)

MariaDB [test]> show processlist;
+----+-------------+-----------+------+---------+------+--------------------------+------------------+----------+
| Id | User        | Host      | db   | Command | Time | State                    | Info             | Progress |
+----+-------------+-----------+------+---------+------+--------------------------+------------------+----------+
|  3 | system user |           | NULL | Daemon  | NULL | InnoDB purge worker      | NULL             |    0.000 |
|  4 | system user |           | NULL | Daemon  | NULL | InnoDB purge worker      | NULL             |    0.000 |
|  1 | system user |           | NULL | Daemon  | NULL | InnoDB purge worker      | NULL             |    0.000 |
|  2 | system user |           | NULL | Daemon  | NULL | InnoDB purge coordinator | NULL             |    0.000 |
|  5 | system user |           | NULL | Daemon  | NULL | InnoDB shutdown handler  | NULL             |    0.000 |
|  9 | openxs      | localhost | test | Query   |    0 | Init                     | show processlist |    0.000 |
+----+-------------+-----------+------+---------+------+--------------------------+------------------+----------+
6 rows in set (0,000 sec)

Now I attach gdb and immediately try to check what's inside the temporary_tables field of the thd variable in the do_command frame:

openxs@ao756:~$ sudo gdb -p `pidof mysqld`
[sudo] password for openxs:
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 26620
[New LWP 26621]
... 28 more LWPs were here
[New LWP 26658]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f59fb69080d in poll () at ../sysdeps/unix/syscall-template.S:84
84      ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) thread 31
[Switching to thread 31 (Thread 0x7f59dd6ca700 (LWP 26658))]
#0  0x00007f59fb69080d in poll () at ../sysdeps/unix/syscall-template.S:84
84      in ../sysdeps/unix/syscall-template.S
(gdb) p do_command::thd->thread_id
$1 = 9
(gdb) p do_command::thd->temporary_tables
$2 = (All_tmp_tables_list *) 0x5606b575cd28
(gdb) p *do_command::thd->temporary_tables
$3 = {<I_P_List_null_counter> = {<No data fields>}, <I_P_List_no_push_back<TMP_TABLE_SHARE>> = {<No data fields>}, m_first = 0x5606b60dfa38}
(gdb) p *do_command::thd->temporary_tables->m_first
$4 = {<TABLE_SHARE> = {table_category = TABLE_CATEGORY_TEMPORARY, name_hash = {
      key_offset = 0, key_length = 0, blength = 0, records = 0, flags = 0,
      array = {buffer = 0x0, elements = 0, max_element = 0,
        alloc_increment = 0, size_of_element = 0, malloc_flags = 0},
      get_key = 0x0, hash_function = 0x0, free = 0x0, charset = 0x0},
    mem_root = {free = 0x5606b60e3cf8, used = 0x5606b60e40e8, pre_alloc = 0x0,
      min_malloc = 32, block_size = 985, total_alloc = 2880, block_num = 6,
      first_block_usage = 0,
      error_handler = 0x5606b1e6bac0 <sql_alloc_error_handler()>,
      name = 0x5606b2538d63 "tmp_table_share"}, keynames = {count = 0,
      name = 0x0, type_names = 0x5606b60e3d78, type_lengths = 0x5606b60e3d94},
    fieldnames = {count = 2, name = 0x0, type_names = 0x5606b60e3d60,
      type_lengths = 0x5606b60e3d88}, intervals = 0x0, LOCK_ha_data = {
      m_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0,
          __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0,
            __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0},
      m_psi = 0x0}, LOCK_share = {m_mutex = {__data = {__lock = 0,
          __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0,
          __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
        __size = '\000' <repeats 39 times>, __align = 0}, m_psi = 0x0},
    tdc = 0x0, tabledef_version = {
      str = 0x5606b60e3d10 "~A\370\275_4\021К°Ё\364\267\342\023=\275",
      length = 16}, option_list = 0x0, option_struct = 0x0,
---Type <return> to continue, or q <return> to quit---field = 0x5606b60e3dQuit

You may be wondering why I jumped to Thread 31 immediately: how did I know that it corresponds to the connection with thread_id 9, as I verified with the later print? It was not pure luck; I knew I was the only user and just jumped to the last thread in order of creation. There is a better way for the general case, and it's navigating over a "list" of threads that must exist somewhere, as SHOW PROCESSLIST must have an easy way to get them all. We'll get back to that important task later in this post.

Now, in the temporary_tables->m_first field we have a table share, with a lot of details we may need. We can try to see some of those that were actually requested originally:

(gdb) p do_command::thd->temporary_tables->m_first->table_name
$5 = {str = 0x5606b60dfeef "mytemp2", length = 7}
(gdb) p do_command::thd->temporary_tables->m_first->table_name.str
$6 = 0x5606b60dfeef "mytemp2"
(gdb) p do_command::thd->temporary_tables->m_first->path
$7 = {str = 0x5606b60dfed8 "/tmp/#sql67fc_9_3", length = 17}
(gdb) p do_command::thd->temporary_tables->m_first->path.str
$8 = 0x5606b60dfed8 "/tmp/#sql67fc_9_3"

So, I can get as many details as are present or can be found from the TABLE_SHARE structure. I see them immediately for the last temporary table I've created in that session. But what about the others? There might be many of them. I expected some kind of linked list or array, but the type information presented above gave me no real hint. Where is the next or previous item? The name hints towards a list, but that's all:

(gdb) p *do_command::thd->temporary_tables
$3 = {<I_P_List_null_counter> = {<No data fields>}, <I_P_List_no_push_back<TMP_TABLE_SHARE>> = {<No data fields>}, m_first = 0x5606b60dfa38}

The type, <I_P_List_no_push_back<TMP_TABLE_SHARE>>, looks like some template class instantiated with TMP_TABLE_SHARE, and I can find the source code:

template <typename T> class I_P_List_no_push_back;


/**
   Intrusive parameterized list.
   Unlike I_List does not require its elements to be descendant of ilink
   class and therefore allows them to participate in several such lists
   simultaneously.
   Unlike List is doubly-linked list and thus supports efficient deletion
   of element without iterator.
   @param T  Type of elements which will belong to list.
   @param B  Class which via its methods specifies which members
             of T should be used for participating in this list.
             Here is typical layout of such class:
             struct B
             {
               static inline T **next_ptr(T *el)
               {
                 return &el->next;
               }
               static inline T ***prev_ptr(T *el)
               {
                 return &el->prev;
               }
             };
   @param C  Policy class specifying how counting of elements in the list
             should be done. Instance of this class is also used as a place
             where information about number of list elements is stored.
             @sa I_P_List_null_counter, I_P_List_counter
   @param I  Policy class specifying whether I_P_List should support
             efficient push_back() operation. Instance of this class
             is used as place where we store information to support
             this operation.
             @sa I_P_List_no_push_back, I_P_List_fast_push_back.
*/

template <typename T, typename B,
          typename C = I_P_List_null_counter,
          typename I = I_P_List_no_push_back<T> >
class I_P_List : public C, public I
{
  T *m_first;
...

but I got lost in all this C++ stuff. Luckily I asked in the Engineering channel and got a hint that "I" in the name means "Intrusive" and that the base type T is supposed to include pointers to the next and previous items. Moreover, in the case of TMP_TABLE_SHARE they are named tmp_next and tmp_prev. I had to read the entire structure definition, as next and prev had not worked for me...

With this hint it was easy to proceed:

(gdb) p do_command::thd->temporary_tables->m_first->tmp_next
$12 = (TMP_TABLE_SHARE *) 0x5606b60df558
(gdb) set $t = do_command::thd->temporary_tables->m_first
(gdb) p $t
$13 = (TMP_TABLE_SHARE *) 0x5606b60dfa38
(gdb) p $t->table_name.str
$14 = 0x5606b60dfeef "mytemp2"
(gdb) set $t = $t->tmp_next
(gdb) p $t
$15 = (TMP_TABLE_SHARE *) 0x5606b60df558
(gdb) p $t->table_name.str
$16 = 0x5606b60dfa0f "mytemp"
(gdb) set $t = $t->tmp_next
(gdb) p $t
$17 = (TMP_TABLE_SHARE *) 0x0

The idea is to iterate while $t is not zero, starting from temporary_tables->m_first. You can surely put it into a Python loop for automation. One day I'll do this too. For now I am happy to be able to list all temporary tables with all the details manually, with gdb commands.
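
Just for the record, here is a minimal sketch (untested, written from memory) of what such a gdb Python loop might look like, assuming gdb was built with Python support and that you are in the same state as above, with the thread of the connection of interest selected and a do_command frame on its stack. It only repeats the manual steps:

# list_tmp_tables.py - hypothetical helper, source it from a gdb session
# attached to mariadbd, with the right thread selected
import gdb

tmp = gdb.parse_and_eval("do_command::thd->temporary_tables")
if int(tmp) == 0:
    print("No temporary tables in this thread")
else:
    # walk the intrusive list of TMP_TABLE_SHARE via tmp_next, as done by hand above
    share = tmp["m_first"]
    while int(share) != 0:
        name = share["table_name"]["str"].string()
        path = share["path"]["str"].string()
        print("%s -> %s" % (name, path))
        share = share["tmp_next"]

One would load it with the source command in gdb (source list_tmp_tables.py) and get the name and path of every temporary table of that connection.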

The remaining question is: how to iterate over user threads in this MariaDB version? No more global threads variable:

(gdb) p threads
No symbol "threads" in current context.

No surprise, we had that changed in MySQL 5.7+ too.

Here I also used a hint from a way more experienced colleague, Sergei Golubchik. That's what we have now:

(gdb) p server_threads
$18 = {threads = {<base_ilist> = {first = 0x5606b60b6d28, last = {
        _vptr.ilink = 0x5606b2d03a38 <vtable for ilink+16>,
        prev = 0x7f59a80009b8, next = 0x0}}, <No data fields>}, lock = {
    m_rwlock = {__data = {__lock = 0, __nr_readers = 0, __readers_wakeup = 0,
        __writer_wakeup = 0, __nr_readers_queued = 0, __nr_writers_queued = 0,
        __writer = 0, __shared = 0, __rwelision = 0 '\000',
        __pad1 = "\000\000\000\000\000\000", __pad2 = 0, __flags = 0},
      __size = '\000' <repeats 55 times>, __align = 0}, m_psi = 0x0}}
(gdb) ptype server_threads
type = class THD_list {
  private:
    I_List<THD> threads;
    mysql_rwlock_t lock;

  public:
    void init();
    void destroy();
    void insert(THD *);
    void erase(THD *);
    int iterate<std::vector<unsigned long long> >(my_bool (*)(THD *,
    std::vector<unsigned long long> *), std::vector<unsigned long long> *);
}
(gdb) p server_threads.threads
$19 = {<base_ilist> = {first = 0x5606b60b6d28, last = {
      _vptr.ilink = 0x5606b2d03a38 <vtable for ilink+16>,
      prev = 0x7f59a80009b8, next = 0x0}}, <No data fields>}

From there I had to proceed myself. I already knew what "I" means in these templates, so I expected to find the next pointer somewhere, starting from first:

(gdb) p server_threads.threads
$19 = {<base_ilist> = {first = 0x5606b60b6d28, last = {
      _vptr.ilink = 0x5606b2d03a38 <vtable for ilink+16>,
      prev = 0x7f59a80009b8, next = 0x0}}, <No data fields>}
(gdb) p server_threads.threads.first
$20 = (ilink *) 0x5606b60b6d28
(gdb) p *server_threads.threads.first
$21 = {_vptr.ilink = 0x5606b2d08f80 <vtable for THD+16>,
  prev = 0x5606b2ed1de0 <server_threads>, next = 0x7f59980009a8}
(gdb) set $thd = (THD *)server_threads.threads.first
(gdb) p $thd->thread_id
$22 = 9

This was the initialization part, now let's check some more and iterate:

(gdb) p $thd->proc_info
$23 = 0x5606b2521cd9 "Reset for next command"
(gdb) set $thd = (THD *)$thd->next
(gdb) p $thd->thread_id
$24 = 5
(gdb) p $thd->proc_info
$25 = 0x5606b267a145 "InnoDB shutdown handler"
(gdb) set $thd = (THD *)$thd->next
(gdb) p $thd->thread_id
$26 = 2
(gdb) p $thd->main_security_ctx.user
$27 = 0x0
(gdb) p $thd->proc_info
$28 = 0x5606b26a3da9 "InnoDB purge coordinator"
(gdb) set $thd = (THD *)$thd->next
(gdb) p $thd->thread_id
$29 = 1
(gdb) p $thd->proc_info
$30 = 0x5606b26a3e20 "InnoDB purge worker"
(gdb) set $thd = (THD *)$thd->next
(gdb) p $thd->thread_id
$31 = 4
(gdb) p $thd->proc_info
$32 = 0x5606b26a3e20 "InnoDB purge worker"
(gdb) set $thd = (THD *)$thd->next
(gdb) p $thd->thread_id
$33 = 3
(gdb) p $thd->proc_info
$34 = 0x5606b26a3e20 "InnoDB purge worker"
(gdb) set $thd = (THD *)$thd->next
(gdb) p $thd->thread_id
$35 = 1095216660735
(gdb) p $thd->proc_info
$36 = 0x0
(gdb) set $thd = (THD *)$thd->next
(gdb) p $thd
$37 = (THD *) 0x0

The idea of iteration is also clear: we move to $thd->next while it's not zero. What we see matches the SHOW PROCESSLIST output, with the exception of the last thread, which also has a zero proc_info. It is some "sentinel" that is not present in the PROCESSLIST. One day I'll figure out why it is so and automate checking all threads based on Python code of this kind, suggested by Shane Bester. Tonight I am just happy to document what I recently found, as all details related to gdb usage do change with time and new versions released.
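
To give an idea of what that automation might look like, here is a minimal, untested gdb Python sketch that just repeats the manual walk above: start from server_threads.threads.first, cast each node to THD *, and stop when the next pointer is zero. The trailing "sentinel" element will show up here too:

# list_threads.py - hypothetical helper, source it from a gdb session attached to mariadbd
import gdb

thd_ptr = gdb.lookup_type("THD").pointer()
node = gdb.parse_and_eval("server_threads.threads.first")   # an 'ilink *'
while int(node) != 0:
    thd = node.cast(thd_ptr)                 # same trick as (THD *)$thd above
    state = thd["proc_info"]
    state_str = state.string() if int(state) != 0 else "NULL"
    print("thread_id=%d  state=%s" % (int(thd["thread_id"]), state_str))
    node = node["next"]                      # ilink::next, zero ends the list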

Free travels and digging into the code in gdb with a specific goal in mind - I miss these activities recently.

* * *

To summarize:

  1. It's relatively easy to find out all the details about every temporary table of any kind created in any MariaDB server user thread, in gdb.
  2. It's still fun to work on MariaDB, as you can promptly get help from developers no matter what crazy questions you may ask.
  3. Changes towards more modern C++ may make it more difficult to debug in gdb initially, for those unaware of the details of the classes' implementation and design.