Saturday, January 30, 2021

Playing with recent bpftrace and MariaDB 10.5 on Fedora - Part V, Proper Way To Summarize Time Spent per Stack

In my previous post in this series I was tracing __lll_lock_wait() function and measuring time spent in it per unique MariaDB server's stack trace. I've got the results that had not looked like totally wrong or useless and even presented them as a flame graph, but it took a long time and during all thyat time, for minutes, sysbench test TPS and QPS results were notably smaller. So, it was a useful tracing, but surely not a low impact one.

I already stated before that the approach used is surely not optimal. There should be some better way to summarize the stack traces without such a performance impact. Today I continued to work on this problem. First of all, I tried to measure finally how much time exactly I had to wait for the results from bpftrace command:

[openxs@fc31 ~]$ time sudo ./lll_lock_wait.bt 2>/dev/null >/tmp/lll_lock_perf_stacks1.txt

real    9m51.904s
user    8m25.576s
sys     1m8.057s
[openxs@fc31 ~]$ ls -l /tmp/lll_lock_perf_stacks1.txt
-rw-rw-r--. 1 openxs openxs 1564291 Jan 30 11:58 /tmp/lll_lock_perf_stacks1.txt

So, test was running for 300 seconds (5 minutes) and monitoring results (for about 20 seconds) came 5 minutes later! This is definitely NOT acceptable for any regular use.

I start to check things and test more. I noted in the Reference Guide (that I've read entirely a couple of times yesterday and today) that ustack() function of bpftrace can accept the second argument, number of frames to show in stack. Based on the previous flame graphs most interesting stacks were not long, so I decided to check if 8 or just 5 may be still enough for useful results while reducing the performance impact. I changed the code here:

uprobe:/lib64/libpthread.so.0:__lll_lock_wait
/comm == "mariadbd"/
{
    @start[tid] = nsecs;
    @tidstack[tid] = ustack(perf,5); -- <-- I also tried 8
}

and got these run times for 8 and 5 frames:

[openxs@fc31 ~]$ time sudo ./lll_lock_wait.bt 2>/dev/null >/tmp/lll_lock_perf_stacks1.txt

real    9m39.195s
user    8m9.188s
sys     1m8.430s

[openxs@fc31 ~]$ ls -l /tmp/lll_lock_perf_stacks1.txt
-rw-rw-r--. 1 openxs openxs 1085019 Jan 30 12:12 /tmp/lll_lock_perf_stacks1.txt

[openxs@fc31 ~]$ time sudo ./lll_lock_wait.bt 2>/dev/null >/tmp/lll_lock_perf_stacks2.txt

real    6m6.801s
user    5m5.537s
sys     0m42.226s

[openxs@fc31 ~]$ ls -l /tmp/lll_lock_perf_stacks2.txt
-rw-rw-r--. 1 openxs openxs 728072 Jan 30 12:19 /tmp/lll_lock_perf_stacks2.txt

So, reducing stack to 5 frames made notable impact, and QPS drop was a bit smaller, but still waiting for 5 minutes is not an option in general.

I had not found really powerful functions to replace awk and collapse stacks inside the bpftrace program, so decided to export them "raw", on every hit of uretprobe, and then process by Linux tools as unusual. I'll skip steps, tests and failures and just show the resulting new program for the task:

[openxs@fc31 ~]$ cat lll_lock_wait2.bt
#!/usr/bin/env bpftrace

BEGIN
{
/*    printf("Tracing time from __lll_lock_wait, Ctrl-C to stop\n"); */
}

interval:s:$1 { exit(); }

uprobe:/lib64/libpthread.so.0:__lll_lock_wait
/comm == "mariadbd"/
{
    @start[tid] = nsecs;
    @tidstack[tid] = ustack(perf);    
}

uretprobe:/lib64/libpthread.so.0:__lll_lock_wait
/comm == "mariadbd" && @start[tid] != 0/
{
    $now = nsecs;
    $time = $now - @start[tid];
    @futexstack[@tidstack[tid]] += $time;
    print(@futexstack);
    delete(@futexstack[@tidstack[tid]]);
/*
    printf("Thread: %u, time: %d\n", tid, $time);
*/
    delete(@start[tid]);
    delete(@tidstack[tid]);
}

END
{
    clear(@start);
    clear(@tidstack);
    clear(@futexstack);
}

Changes vs the previous version are highlighted. Basically I print and then delete every stack collected, as function call ends, along with time, to further sum them up externally. The function used is the following:

print(@x[, top [, div]]) - Print the map, optionally the top entries only and with a divisor

I've also added the interval probe at the beginning, referring to the first program argument as seconds to run. It just forces exit after N seconds (0 by default means immediate exist, wrong format will be reported as error).

Now if I run it while test is running, this way:

[openxs@fc31 ~]$ time sudo ./lll_lock_wait2.bt 10 2>/dev/null | awk '
BEGIN { s = ""; }
/^@futexstack\[\]/ { s = ""; }
/^@futexstack/ { s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s ";" targ } else { s = targ } }
/^]/ { print $2, s }
' > /tmp/collapsed_lll_lock_v2_raw.txt


real    0m10.646s
user    0m5.648s
sys     0m0.843s
[openxs@fc31 ~]$ ls -l /tmp/collapsed_lll_lock_v2_raw.txt
-rw-rw-r--. 1 openxs openxs 1566 Jan 30 13:12 /tmp/collapsed_lll_lock_v2_raw.txt
[openxs@fc31 ~]$ cat /tmp/collapsed_lll_lock_v2_raw.txt
15531 __lll_lock_wait;
1233833 __lll_lock_wait;
10638 __lll_lock_wait;buf_read_page_background;btr_cur_prefetch_siblings;btr_cur_optimistic_delete_func;row_purge_remove_sec_if_poss_leaf;row_purge_record_func;row_purge_step;que_run_threads;purge_worker_callback;tpool::task_group::execute;tpool::thread_pool_generic::worker_main;;;;
7170 __lll_lock_wait;tpool::thread_pool_generic::worker_main;;;;
273330 __lll_lock_wait;tdc_acquire_share;open_table;open_tables;open_and_lock_tables;execute_sqlcom_select;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
1193083 __lll_lock_wait;trx_undo_report_row_operation;btr_cur_update_in_place;btr_cur_optimistic_update;row_upd_clust_rec;row_upd_clust_step;row_upd_step;row_update_for_mysql;ha_innobase::update_row;handler::ha_update_row;mysql_update;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
183353 __lll_lock_wait;
43231 __lll_lock_wait;MDL_lock::remove_ticket;MDL_context::release_lock;ha_commit_trans;trans_commit;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread

I get a very condensed output almost immediately!  I can afford running for 60 seconds without much troubles:

[openxs@fc31 ~]$ time sudo ./lll_lock_wait2.bt 60 2>/dev/null | awk '
BEGIN { s = ""; }
/^@futexstack\[\]/ { s = ""; }
/^@futexstack/ { s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s ";" targ } else { s = targ } }
/^]/ { print $2, s }
' > /tmp/collapsed_lll_lock_v2_raw.txt


real    1m0.959s
user    0m44.146s
sys     0m6.126s


[openxs@fc31 ~]$ ls -l /tmp/collapsed_lll_lock_v2_raw.txt
-rw-rw-r--. 1 openxs openxs 12128 Jan 30 13:17 /tmp/collapsed_lll_lock_v2_raw.txt

The impact of these 60 seconds is visible in the sysbench output:

[openxs@fc31 maria10.5]$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --threads=32 --report-interval=10 --time=300 run
sysbench 1.1.0-174f3aa (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 32
Report intermediate results every 10 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 10s ] thds: 32 tps: 625.68 qps: 12567.66 (r/w/o: 8800.49/2512.81/1254.36) lat (ms,95%): 282.25 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 32 tps: 891.51 qps: 17825.69 (r/w/o: 12479.31/3563.26/1783.13) lat (ms,95%): 142.39 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 32 tps: 512.09 qps: 10230.44 (r/w/o: 7163.62/2042.55/1024.27) lat (ms,95%): 287.38 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 32 tps: 369.10 qps: 7390.48 (r/w/o: 5171.39/1480.90/738.20) lat (ms,95%): 350.33 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 32 tps: 417.83 qps: 8347.73 (r/w/o: 5845.14/1667.03/835.56) lat (ms,95%): 390.30 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 32 tps: 484.03 qps: 9687.55 (r/w/o: 6782.09/1937.31/968.16) lat (ms,95%): 292.60 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 32 tps: 431.98 qps: 8640.35 (r/w/o: 6049.38/1727.01/863.95) lat (ms,95%): 344.08 err/s: 0.00 reconn/s: 0.00
[ 80s ] thds: 32 tps: 354.60 qps: 7097.39 (r/w/o: 4968.86/1419.32/709.21) lat (ms,95%): 419.45 err/s: 0.00 reconn/s: 0.00
[ 90s ] thds: 32 tps: 380.98 qps: 7600.82 (r/w/o: 5317.17/1521.80/761.85) lat (ms,95%): 520.62 err/s: 0.00 reconn/s: 0.00
[ 100s ] thds: 32 tps: 423.01 qps: 8467.99 (r/w/o: 5928.71/1693.16/846.13) lat (ms,95%): 397.39 err/s: 0.00 reconn/s: 0.00
[ 110s ] thds: 32 tps: 475.66 qps: 9525.07 (r/w/o: 6669.22/1904.53/951.32) lat (ms,95%): 344.08 err/s: 0.00 reconn/s: 0.00
[ 120s ] thds: 32 tps: 409.13 qps: 8171.48 (r/w/o: 5718.38/1634.94/818.17) lat (ms,95%): 458.96 err/s: 0.00 reconn/s: 0.00
[ 130s ] thds: 32 tps: 190.91 qps: 3826.72 (r/w/o: 2682.19/762.62/381.91) lat (ms,95%): 450.77 err/s: 0.00 reconn/s: 0.00
[ 140s ] thds: 32 tps: 611.40 qps: 12229.99 (r/w/o: 8558.99/2448.20/1222.80) lat (ms,95%): 223.34 err/s: 0.00 reconn/s: 0.00
[ 150s ] thds: 32 tps: 581.99 qps: 11639.19 (r/w/o: 8148.53/2326.68/1163.99) lat (ms,95%): 287.38 err/s: 0.00 reconn/s: 0.00
[ 160s ] thds: 32 tps: 653.21 qps: 13058.51 (r/w/o: 9139.57/2612.52/1306.41) lat (ms,95%): 257.95 err/s: 0.00 reconn/s: 0.00
[ 170s ] thds: 32 tps: 561.87 qps: 11231.27 (r/w/o: 7860.53/2246.99/1123.75) lat (ms,95%): 320.17 err/s: 0.00 reconn/s: 0.00
[ 180s ] thds: 32 tps: 625.66 qps: 12526.32 (r/w/o: 8770.19/2504.82/1251.31) lat (ms,95%): 235.74 err/s: 0.00 reconn/s: 0.00
[ 190s ] thds: 32 tps: 554.50 qps: 11088.01 (r/w/o: 7760.61/2218.40/1109.00) lat (ms,95%): 325.98 err/s: 0.00 reconn/s: 0.00
[ 200s ] thds: 32 tps: 607.10 qps: 12143.89 (r/w/o: 8501.49/2428.20/1214.20) lat (ms,95%): 344.08 err/s: 0.00 reconn/s: 0.00
[ 210s ] thds: 32 tps: 424.17 qps: 8467.63 (r/w/o: 5925.43/1693.87/848.33) lat (ms,95%): 397.39 err/s: 0.00 reconn/s: 0.00
[ 220s ] thds: 32 tps: 466.80 qps: 9335.03 (r/w/o: 6533.32/1868.31/933.40) lat (ms,95%): 397.39 err/s: 0.00 reconn/s: 0.00
[ 230s ] thds: 32 tps: 365.83 qps: 7330.56 (r/w/o: 5133.49/1465.21/731.86) lat (ms,95%): 475.79 err/s: 0.00 reconn/s: 0.00
[ 240s ] thds: 32 tps: 411.27 qps: 8218.45 (r/w/o: 5754.25/1641.87/822.34) lat (ms,95%): 467.30 err/s: 0.00 reconn/s: 0.00
[ 250s ] thds: 32 tps: 127.10 qps: 2534.71 (r/w/o: 1772.60/507.70/254.40) lat (ms,95%): 450.77 err/s: 0.00 reconn/s: 0.00
[ 260s ] thds: 32 tps: 642.35 qps: 12856.29 (r/w/o: 8999.09/2572.50/1284.70) lat (ms,95%): 282.25 err/s: 0.00 reconn/s: 0.00
[ 270s ] thds: 32 tps: 603.80 qps: 12078.79 (r/w/o: 8456.89/2414.30/1207.60) lat (ms,95%): 314.45 err/s: 0.00 reconn/s: 0.00
[ 280s ] thds: 32 tps: 642.70 qps: 12857.60 (r/w/o: 9001.40/2570.80/1285.40) lat (ms,95%): 257.95 err/s: 0.00 reconn/s: 0.00
[ 290s ] thds: 32 tps: 716.57 qps: 14325.60 (r/w/o: 10026.98/2865.48/1433.14) lat (ms,95%): 144.97 err/s: 0.00 reconn/s: 0.00
[ 300s ] thds: 32 tps: 611.16 qps: 12219.69 (r/w/o: 8551.80/2445.66/1222.23) lat (ms,95%): 292.60 err/s: 0.00 reconn/s: 0.00
SQL statistics:
    queries performed:
        read:                            2124836
        write:                           607096
        other:                           303548
        total:                           3035480
    transactions:                        151774 (505.83 per sec.)
    queries:                             3035480 (10116.62 per sec.)

    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

Throughput:
    events/s (eps):                      505.8309
    time elapsed:                        300.0489s
    total number of events:              151774

Latency (ms):
         min:                                    2.17
         avg:                                   63.25
         max:                                 8775.26
         95th percentile:                      331.91
         sum:                              9599021.34

Threads fairness:
    events (avg/stddev):           4742.9375/42.76
    execution time (avg/stddev):   299.9694/0.03

but it continues only for some time after the end of monitoring and overal QPS is not much affected.

Now what about top 5 stacks by wait time? Here they are (well, I had NOT cared to sum up several entires with the same stack, but you should know how to do this with awk and flamegraph.pl will do it for us later) :

[openxs@fc31 ~]$ cat /tmp/collapsed_lll_lock_v2_raw.txt | sort -r -n -k 1,1 | head -5
340360731 __lll_lock_wait;btr_cur_optimistic_insert;row_ins_sec_index_entry_low;row_ins_sec_index_entry;row_ins_step;row_insert_for_mysql;ha_innobase::write_row;handler::ha_write_row;write_record;mysql_insert;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
335903154 __lll_lock_wait;btr_cur_optimistic_insert;row_ins_sec_index_entry_low;row_ins_sec_index_entry;row_upd_sec_index_entry;row_upd_step;row_update_for_mysql;ha_innobase::update_row;handler::ha_update_row;mysql_update;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
287819974 __lll_lock_wait;btr_cur_optimistic_insert;row_ins_sec_index_entry_low;row_ins_sec_index_entry;row_ins_step;row_insert_for_mysql;ha_innobase::write_row;handler::ha_write_row;write_record;mysql_insert;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
281374237 __lll_lock_wait;btr_cur_optimistic_insert;row_ins_sec_index_entry_low;row_ins_sec_index_entry;row_ins_step;row_insert_for_mysql;ha_innobase::write_row;handler::ha_write_row;write_record;mysql_insert;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
268817930 __lll_lock_wait;buf_page_init_for_read;buf_read_page_background;btr_cur_prefetch_siblings;btr_cur_optimistic_delete_func;row_purge_remove_sec_if_poss_leaf;row_purge_record_func;row_purge_step;que_run_threads;purge_worker_callback;tpool::task_group::execute;tpool::thread_pool_generic::worker_main;;;;

A lot of contention happens on INSERTing rows. I'd say it's expected! Now to create the flame graph we just have to re-order the output a bit:

[openxs@fc31 ~]$ cat /tmp/collapsed_lll_lock_v2_raw.txt | awk '{ if (length($2) > 0) {print $2, $1} }' | /mnt/home/openxs/git/FlameGraph/flamegraph.pl --title="Time spent in ___lll_lock_wait in MariaDB 10.5, all frames" --countname=nsecs > ~/Documents/lll_lock_v2_2.svg

Here it is:

Most mutex waits for oltp_read_write test happen on inserting rows...

Monitoring over 60 seconds allowed to see more realistic picture of contention. It mostly happens on inserting rows by 32 threads into just 5 tables on a system with 4 cores only.

Now it seems we have the tool that can be used more than once ina lifetime and seems to provide a useful information, fast.

* * *

To summarize:

  1. One should not even try to summarize all stacks in a single associative arrays for any function that is called to often in many different contests inside the bftrace! There is a limit on number of items in the ma you may hit, and the impact of storing and exporting this is too high for monitoring more than a couple of seconds.
  2. Cleaning up stack traces should be done externally, by usual Linux test processing tools thta produce smaller summary output. The bpftrace program should NOT be designed to collect a lot of outputs for a long time and then output them at the end. Looks like smaller shunks exported to the userspace regularly is a better approach.
  3. We can use simple command line arguments for the program that can be literally substituted as parts of probe definitions. Next step would be to make the more generic tool with binary and function name to probe as command line arguments.
  4. bpftrace is cool!

2 comments:

  1. Beautiful work! Have you had the chance to use bpftrace on customer's production environments? I think I only had one or two chances so far, but I'd hope it would become more ubiquitous soon enough...

    ReplyDelete
    Replies
    1. Not yet, but now I am dreaming of a chance. I need a cooperative customer with Linux 5.x.y and some performance problem at hand. So far perf is what I use in real customer cases.

      Delete