Thursday, January 28, 2021

Playing with recent bpftrace and MariaDB 10.5 on Fedora - Part IV, Creating a New Tool for Tracing Time Spent in __lll_lock_wait

So far in this series I am just trying to find out what can be done with bpftrace and how to get interesting insights or implement something I was asked about. I am not sharing "best practices" (probably mostly worst practices, when it comes to my own inventions) or any final answers on what to trace for a specific MariaDB performance problem. This is still work in progress, and at a really early stage.

In the comments to my previous post you can see that I probably measured the wrong thing if the goal is to find the mutex that causes most of the contention. One of the comments suggested a different tracing idea:

"... indexed by the *callstack*, with the goal to find the place where waits happen. I think __lll_lock_wait could be a better indicator for contention, measuring uncontended pthread_mutex_locks could give some false alarms."

Today I again had an hour or so at my Fedora box with bpftrace, so I tried to modify my previous tool to measure the time spent in the __lll_lock_wait function and sum it up (in the same inefficient way for now, as I've had no chance yet to test and find anything more suitable for production use) per unique stack trace that led to the __lll_lock_wait call.

To remind you, __lll_lock_wait() is a low-level wrapper around the Linux futex system call. The prototype for this function is:

void __lll_lock_wait (int *futex, int private)

It also comes from the libpthread.so library:

[openxs@fc31 ~]$ strings /lib64/libpthread.so.0 | grep '__lll_lock_wait'
__lll_lock_wait
__lll_lock_wait_private
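
bpftrace itself can also confirm that it sees these as valid uprobe targets. I did not capture the output for this post, but a quick sanity check (just a sketch) would be a probe listing like this:

sudo bpftrace -l 'uprobe:/lib64/libpthread.so.0:__lll_lock_wait*'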

So, my code will not require many modifications. A basic quick proof of concept is as simple as this:

[openxs@fc31 ~]$ cat lll_lock_wait.bt
#!/usr/bin/env bpftrace

BEGIN
{
    printf("Tracing time from __lll_lock_wait, Ctrl-C to stop\n");
}

uprobe:/lib64/libpthread.so.0:__lll_lock_wait
/comm == "mariadbd"/

{
    @start[tid] = nsecs;
    @tidstack[tid] = ustack(perf);
}

uretprobe:/lib64/libpthread.so.0:__lll_lock_wait
/comm == "mariadbd" && @start[tid] != 0/

{
    $now = nsecs;
    $time = $now - @start[tid];
    @futexstack[@tidstack[tid]] += $time;

    printf("Thread: %u, time: %d\n", tid, $time);

    delete(@start[tid]);
    delete(@tidstack[tid]);
}

END
{
    clear(@start);
    clear(@tidstack);
    clear(@futexstack);
}

The main difference is that to measure the time spent inside a single function I need both a uprobe and a uretprobe on it. Function call arguments are NOT available in the uretprobe.
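
If the mutex (futex) address is needed as well, the usual workaround (just a sketch, not part of the tool above; the @futexaddr and @futextime maps are names I made up for illustration) is to save the first argument in the uprobe, keyed by thread id, and use the saved value on return:

uprobe:/lib64/libpthread.so.0:__lll_lock_wait
/comm == "mariadbd"/
{
    @start[tid] = nsecs;
    /* arg0 is the int *futex argument; save it for use in the uretprobe */
    @futexaddr[tid] = arg0;
}

uretprobe:/lib64/libpthread.so.0:__lll_lock_wait
/comm == "mariadbd" && @start[tid] != 0/
{
    /* arg0 is NOT valid here, so use the address saved in the uprobe */
    @futextime[@futexaddr[tid]] += nsecs - @start[tid];
    delete(@futexaddr[tid]);
    delete(@start[tid]);
}

That way the wait time would be summed up per futex address, which is closer to the original goal of finding the hottest mutex.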

When I made my lll_lock_wait.bt executable and ran it with the MariaDB 10.5 server started but not under any load, I got NO output from the uretprobe - no wonder, as there was no contention! With the sysbench test started the result was different, a flood of output:

[openxs@fc31 ~]$ sudo ./lll_lock_wait.bt 2>/dev/null
...
Thread: 7488, time: 70520
Thread: 7494, time: 73351
Thread: 5790, time: 106635
Thread: 5790, time: 10008
Thread: 7484, time: 87016
Thread: 5790, time: 18723
^C

So, the program works to some extent and reports some time per thread (but, as noted above, I have no mutex address in the uretprobe). I then modified it to remove the prints and to keep the @futexstack associative array, so that its content is printed automatically on exit (note the commented-out clear(@futexstack) in the END probe):

#!/usr/bin/env bpftrace

BEGIN
{
/*    printf("Tracing time from __lll_lock_wait, Ctrl-C to stop\n"); */
}

uprobe:/lib64/libpthread.so.0:__lll_lock_wait
/comm == "mariadbd"/
{
    @start[tid] = nsecs;
    @tidstack[tid] = ustack(perf);    
}

uretprobe:/lib64/libpthread.so.0:__lll_lock_wait
/comm == "mariadbd" && @start[tid] != 0/
{
    $now = nsecs;
    $time = $now - @start[tid];
    @futexstack[@tidstack[tid]] += $time;
/*
    printf("Thread: %u, time: %d\n", tid, $time);
*/
    delete(@start[tid]);
    delete(@tidstack[tid]);
}

END
{
    clear(@start);
    clear(@tidstack);
/*    clear(@futexstack); */
}

Then I started the usual sysbench test for this series:

[openxs@fc31 maria10.5]$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --threads=32 --report-interval=10 --time=300 run
sysbench 1.1.0-174f3aa (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 32
Report intermediate results every 10 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 10s ] thds: 32 tps: 516.06 qps: 10362.55 (r/w/o: 7260.90/2066.43/1035.22) lat (ms,95%): 390.30 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 32 tps: 550.82 qps: 11025.98 (r/w/o: 7718.97/2205.28/1101.74) lat (ms,95%): 320.17 err/s: 0.00 reconn/s: 0.00

and at this moment I started my program and kept it running for somewhere between 10 and 20 seconds:

[openxs@fc31 ~]$ sudo ./lll_lock_wait.bt 2>/dev/null >/tmp/lll_lock_perf_stacks.txt
^C

I got an immediate drop in QPS, and it continued till the end of the 300-second test:

[ 30s ] thds: 32 tps: 397.67 qps: 7950.29 (r/w/o: 5565.37/1589.58/795.34) lat (ms,95%): 434.83 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 32 tps: 439.87 qps: 8787.15 (r/w/o: 6149.45/1757.97/879.74) lat (ms,95%): 397.39 err/s: 0.00 reconn/s: 0.00
...

    transactions:                        101900 (339.38 per sec.)
    queries:                             2038016 (6787.74 per sec.)

    ignored errors:                      1      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)
...

I had to wait some more (for the collected data to be dumped) and ended up with this big output file:

[openxs@fc31 maria10.5]$ ls -l /tmp/lll_lock_perf_stacks.txt
-rw-rw-r--. 1 openxs openxs 1347279 Jan 28 14:29 /tmp/lll_lock_perf_stacks.txt

[openxs@fc31 maria10.5]$ more /tmp/lll_lock_perf_stacks.txt
Attaching 4 probes...


@futexstack[
        7fadbd0bd5e0 __lll_lock_wait+0 (/usr/lib64/libpthread-2.30.so)
        14e5f 0x14e5f ([unknown])
]: 4554

...
@futexstack[
        7fadbd0bd5e0 __lll_lock_wait+0 (/usr/lib64/libpthread-2.30.so)
        55a71e6dd31c MDL_lock::remove_ticket(LF_PINS*, MDL_lock::Ticket_list MDL_lock::*, MDL_ticket*)+60 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71e6ddca5 MDL_context::release_lock(enum_mdl_duration, MDL_ticket*)+37 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71ebc5ef4 row_purge_step(que_thr_t*)+388 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71eb874b8 que_run_threads(que_thr_t*)+2264 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71ebe5773 purge_worker_callback(void*)+355 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71ed2b99a tpool::task_group::execute(tpool::task*)+170 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71ed2a7cf tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        7fadbcac43d4 0x7fadbcac43d4 ([unknown])
        55a721f10f00 0x55a721f10f00 ([unknown])
        55a71ed2ab60 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        2de907894810c083 0x2de907894810c083 ([unknown])
]: 4958
...

@futexstack[
        7fadbd0bd5e0 __lll_lock_wait+0 (/usr/lib64/libpthread-2.30.so)
        55a71eba3e20 row_ins_clust_index_entry_low(unsigned long, unsigned long, dict_index_t*, unsigned long, dtuple_t*, unsigned long, que_thr_t*)+4144 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71eba4a36 row_ins_clust_index_entry(dict_index_t*, dtuple_t*, que_thr_t*, unsigned long)+198 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71eba52b4 row_ins_step(que_thr_t*)+1956 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71ebb5ea1 row_insert_for_mysql(unsigned char const*, row_prebuilt_t*, ins_mode_t)+865 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71eafadc1 ha_innobase::write_row(unsigned char const*)+177 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71e7e5e10 handler::ha_write_row(unsigned char const*)+464 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71e598b8d write_record(THD*, TABLE*, st_copy_info*, select_result*)+477 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71e59f6d7 mysql_insert(THD*, TABLE_LIST*, List<Item>&, List<List<Item> >&, List<Item>&, List<Item>&, enum_duplicates, bool, select_result*)+2967 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71e5d894a mysql_execute_command(THD*)+7722 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71e5ec945 Prepared_statement::execute(String*, bool)+981 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71e5ecbd5 Prepared_statement::execute_loop(String*, bool, unsigned char*, unsigned char*)+133 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71e5ed8f5 mysql_stmt_execute_common(THD*, unsigned long, unsigned char*, unsigned char*, unsigned long, bool, bool)+549 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71e5edb3c mysqld_stmt_execute(THD*, char*, unsigned int)+44 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71e5d4ba6 dispatch_command(enum_server_command, THD*, char*, unsigned int, bool, bool)+9302 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71e5d5f12 do_command(THD*)+274 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71e6d3bf1 do_handle_one_connection(CONNECT*, bool)+1025 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71e6d406d handle_one_connection+93 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        55a71ea4abf2 pfs_spawn_thread+322 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        7fadbd0b34e2 start_thread+226 (/usr/lib64/libpthread-2.30.so)
]: 5427

...

Some stacks surely look reasonable, so I continued with collapsing them into a simpler form:

[openxs@fc31 maria10.5]$ cat /tmp/lll_lock_perf_stacks.txt | awk '
BEGIN { s = ""; }
/^@futexstack\[\]/ { s = ""; }
/^@futexstack/ { s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s ";" targ } else { s = targ } }
/^]/ { print $2, s }
' | more

...
5380 __lll_lock_wait;
5406 __lll_lock_wait;
5421 __lll_lock_wait;
5427 __lll_lock_wait;row_ins_clust_index_entry_low;row_ins_clust_index_entry;row_ins_step;row_insert_for_mysql;ha_innobase::write_row;handler::ha_write_row;write_record;mysql_insert;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
5436 __lll_lock_wait;row_upd_clust_step;row_upd_step;row_update_for_mysql;ha_innobase::delete_row;handler::ha_delete_row;mysql_delete;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
5443 __lll_lock_wait;
5445 __lll_lock_wait;purge_worker_callback;tpool::task_group::execute;tpool::thread_pool_generic::worker_main;;;;
5502 __lll_lock_wait;
...

The awk code is a bit different this time. I know that for flame graphs I need function calls separated by ';', so I am adding the separators immediately. Frames not resolved to symbols end up as empty names (hence the runs of ';' above), and so on. Now I have to sort this to find the top N:

[openxs@fc31 maria10.5]$ cat /tmp/lll_lock_perf_stacks.txt | awk '
BEGIN { s = ""; }
/^@futexstack\[\]/ { s = ""; }
/^@futexstack/ { s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s ";" targ } else { s = targ } }
/^]/ { print $2, s }
' | sort -r -n -k 1,1 > /tmp/collapsed_llll_lock.txt

and then:

[openxs@fc31 maria10.5]$ cat /tmp/collapsed_llll_lock.txt | awk '{ if (length($2) > 0) {print} }' | head -5
28276454 __lll_lock_wait;timer_handler;start_thread
35842 __lll_lock_wait;MDL_map::find_or_insert;MDL_context::try_acquire_lock_impl;MDL_context::acquire_lock;open_table;open_tables;open_and_lock_tables;mysql_delete;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;;;;;;
35746 __lll_lock_wait;MDL_lock::remove_ticket;MDL_context::release_lock;MDL_context::release_locks_stored_before;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
35675 __lll_lock_wait;row_upd_clust_step;row_upd_step;row_update_for_mysql;ha_innobase::delete_row;handler::ha_delete_row;mysql_delete;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
35563 __lll_lock_wait;row_upd_sec_index_entry;row_upd_step;row_update_for_mysql;ha_innobase::update_row;handler::ha_update_row;mysql_update;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread

So, the above are the top 5 fully resolved stack traces where the longest time was spent inside the __lll_lock_wait function. Now let's create the flame graph. We need this output with the stack first and the nanoseconds spent in it next, separated by a space, piped for processing to the same flamegraph.pl program as before:

[openxs@fc31 maria10.5]$ cat /tmp/collapsed_llll_lock.txt | awk '{ if (length($2) > 0) {print $2, $1} }' | head -3
__lll_lock_wait;timer_handler;start_thread 28276454
__lll_lock_wait;MDL_map::find_or_insert;MDL_context::try_acquire_lock_impl;MDL_context::acquire_lock;open_table;open_tables;open_and_lock_tables;mysql_delete;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;;;;;; 35842
__lll_lock_wait;MDL_lock::remove_ticket;MDL_context::release_lock;MDL_context::release_locks_stored_before;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread 35746

Then the flame graph was created as follows:

cat /tmp/collapsed_llll_lock.txt | awk '{ if (length($2) > 0) {print $2, $1} }' | /mnt/home/openxs/git/FlameGraph/flamegraph.pl --title="Time spent in ___lll_lock_wait in MariaDB 10.5" --countname=nsecs > ~/Documents/lll_lock.svg

The resulting flame graph is presented below:

[Flame graph SVG: time related to MDL is highlighted, some 3% only]

I highlighted the impact of MDL-related stack traces, some 3% only. The shape is totally different from the pthread_mutex_lock flame graph in the previous post, so this shows how a different point of view may lead to different conclusions about the contention.

* * *

To summarize:

  1. I am still not any good at finding a low-impact way (if there is one) to get stack traces with the time summed up per stack. It seems the associative array keyed by them becomes too big if the stacks are not preprocessed, and exporting it to user space at the end of the program literally takes minutes and impacts performance during all that time. One idea to try is to capture fewer frames per stack, as in the sketch after this list.
  2. It's easy to measure the time spent inside a function with a uprobe + uretprobe pair on it.
  3. I am not sure yet about the tools to run in production, but for checking ideas and creating prototypes bpftrace is really easy to use and flexible.
  4. It is not clear where the stack traces with all calls unresolved to symbols come from.
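
Regarding item 1, one thing to try next (just a sketch, not tested under any load yet, and it assumes a bpftrace version that accepts a frame limit argument to ustack()) is to capture fewer user stack frames per event, so the map keys stay smaller and the final export is cheaper:

uprobe:/lib64/libpthread.so.0:__lll_lock_wait
/comm == "mariadbd"/
{
    @start[tid] = nsecs;
    /* keep only the top 16 user stack frames to make the map key smaller */
    @tidstack[tid] = ustack(perf, 16);
}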
