Wednesday, January 27, 2021

Playing with recent bpftrace and MariaDB 10.5 on Fedora - Part III, Creating a New Tool for Tracing Mutexes

I had some free time yesterday to do some more tests on Fedora, so I got back to the old request from one of the MariaDB developers that I first mentioned in this blog a few months ago:

"ideally, collect stack traces of mariadbd, and sort them in descending order by time spent between pthread_mutex_lock and next pthread_mutex_unlock."

The original request above has not been answered yet, and I was recently reminded about it. What I did with perf, and recently with bpftrace, was (lame) counting of the number of samples per unique stack, while what I actually had to do was measure the time spent between acquiring and releasing specific mutexes.

From the very beginning I was sure that bpftrace should allow me to get the answer easily, and after reviewing the way existing tools are coded, yesterday I finally decided to write a real, multi-line bpftrace program with multiple probes myself. I wanted to fulfill the request literally, no matter how much that would "cost" for now, with bpftrace. It turned out that a couple of hours of calm vacation time was more than enough to get a draft solution.

I started by checking the pthread_mutex_lock manual page. From it I got the (primitive) idea of the two functions used in the process, each with a single argument, the mutex pointer/address:

int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);

Multiple threads can try to lock the same mutex, and those that find it locked will wait until an unlock eventually makes the mutex available for one of them to acquire (as decided by the scheduler). I made the assumption (correct me if I am wrong) that the same thread that locked the mutex must eventually unlock it. Based on that I came up with the following initial, lame version of a bpftrace program:

[openxs@fc31 ~]$ cat pthread_mutex.bt
#!/usr/bin/env bpftrace

BEGIN
{
    printf("Tracing time from pthread_mutex_lock to _unlock, Ctrl-C to stop\n");
}

uprobe:/lib64/libpthread.so.0:pthread_mutex_lock /comm == "mariadbd"/
{
    @start[arg0] = nsecs;
    @mutexid[arg0] = tid;
    @tidstack[tid] = ustack;
}

uprobe:/lib64/libpthread.so.0:pthread_mutex_unlock
/comm == "mariadbd" && @start[arg0] != 0 && @mutexid[arg0] == tid/
{
    $now = nsecs;
    $time = $now - @start[arg0];
    @mutexstack[@tidstack[tid]] += $time;
    printf("Mutex: %u, time: %d\n", arg0, $time);
    delete(@start[arg0]);
    delete(@mutexid[arg0]);
    delete(@tidstack[tid]);
}

END
{
    clear(@start);
    clear(@mutexid);
    clear(@tidstack);
}
/* the end */

Do you recognize the biosnoop.bt style? Yes, that is what I was inspired by... So, I added two uprobes for the library providing these functions, both checking that the call is made from mariadbd. The first one, for lock, stores the start time for the given mutex address, the id of the thread that locked it, and the stack trace of that thread at the moment of locking. The second one, for unlock, computes the time difference since the same mutex was last locked, but it fires only if the unlocking thread has the same id as the locking one. Then I add this time difference to the time spent "within this stack trace", using the thread's stored stack as the index into the @mutexstack[] associative array. Finally I print some debugging output to see what happens during tracing, and remove the items that the first probe added to the associative arrays.

In the END probe I just clean up all associative arrays except @mutexstack[]; as we've seen before, its content is then simply dumped to the output by bpftrace itself. That is what I am going to post-process later, after a quick debugging session proves that my idea gives some reasonable results.

So, with MariaDB 10.5 up and running, started like this (no real tuning of anything, so no wonder QPS is not high in the tests below):

./bin/mysqld_safe --no-defaults --socket=/tmp/mariadb.sock --innodb_buffer_pool_size=1G --innodb_flush_log_at_trx_commit=2 --port=3309 &

and with zero user connections, I made pthread_mutex.bt executable and started my very first bpftrace program for the very first time (OK, honestly, a few previous runs had shown some syntax errors that I corrected):

[openxs@fc31 ~]$ sudo ./pthread_mutex.bt
Attaching 4 probes...
Tracing time from pthread_mutex_lock to _unlock, Ctrl-C to stop
Mutex: 652747168, time: 6598
Mutex: 629905136, time: 46594
...
Mutex: 652835840, time: 26491
Mutex: 652835712, time: 4569
^C


@mutexstack[
    __pthread_mutex_lock+0
    tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79
    0x7f7e60fb53d4
    0x5562258c9080
    std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0
    0x2de907894810c083
]: 23055
@mutexstack[
    __pthread_mutex_lock+0
    tpool::thread_pool_generic::timer_generic::execute(void*)+188
    tpool::task::execute()+50
    tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79
    0x7f7e60fb53d4
    0x5562258c9080
    std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0
    0x2de907894810c083
]: 23803
@mutexstack[
    __pthread_mutex_lock+0
    tpool::thread_pool_generic::timer_generic::execute(void*)+210
    tpool::task::execute()+50
    tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79
    0x7f7e60fb53d4
    0x5562258c9080
    std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0
    0x2de907894810c083
]: 24555
@mutexstack[
    __pthread_mutex_lock+0
    tpool::thread_pool_generic::submit_task(tpool::task*)+88
    timer_handler+326
    start_thread+226
]: 31859
@mutexstack[
    __pthread_mutex_lock+0
    srv_monitor_task+130
    tpool::thread_pool_generic::timer_generic::execute(void*)+53
    tpool::task::execute()+50
    tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79
    0x7f7e60fb53d4
    0x5562258c9080
    std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0
    0x2de907894810c083
]: 53282
@mutexstackERROR: failed to look up stack id 0 (pid 0): -1
[]: 322499

Not bad. I see that mutexes are locked and (the last ERROR aside) the time is summed up per stack trace as planned. Moreover, the stack traces look reasonable for 10.5 (a generic thread pool is used inside InnoDB in this version to run background tasks, are you aware of that?). Some symbols are not resolved, but what can I do about it? I'll just skip those addresses at some later stage, maybe.

Then I decided to check which threads lock the mutexes, and modified the printf:

    printf("Mutex: %u, thread: %u, time: %d\n", arg0, tid, $time);

With that modification I also redirected errors to /dev/null and got this:

[openxs@fc31 ~]$ sudo ./pthread_mutex.bt 2>/dev/null
Attaching 4 probes...
Tracing time from pthread_mutex_lock to _unlock, Ctrl-C to stop
Mutex: 652835712, thread: 4476, time: 6354
Mutex: 629905136, thread: 4476, time: 37053
Mutex: 621289632, thread: 4485, time: 5254
Mutex: 621289632, thread: 4485, time: 4797
Mutex: 652835840, thread: 4485, time: 31465
Mutex: 652835712, thread: 4485, time: 4374
Mutex: 652835712, thread: 4476, time: 6048
Mutex: 629905136, thread: 4476, time: 35703
Mutex: 621289632, thread: 4485, time: 4917
Mutex: 621289632, thread: 4485, time: 4779
Mutex: 652835840, thread: 4485, time: 30316
Mutex: 652835712, thread: 4485, time: 4389
Mutex: 652835712, thread: 4476, time: 6733
Mutex: 629905136, thread: 4476, time: 40936
Mutex: 621289632, thread: 4485, time: 4719
Mutex: 621289632, thread: 4485, time: 4725
Mutex: 652835840, thread: 4485, time: 30637
Mutex: 652835712, thread: 4485, time: 4441
^C


@mutexstack[
    __pthread_mutex_lock+0
    tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79
    0x7f7e60fb53d4
    0x5562258c9080
    std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0
    0x2de907894810c083
]: 13204
@mutexstack[
    __pthread_mutex_lock+0
    tpool::thread_pool_generic::timer_generic::execute(void*)+210
    tpool::task::execute()+50
    tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79
    0x7f7e60fb53d4
    0x5562258c9080
    std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0
    0x2de907894810c083
]: 14301
@mutexstack[
    __pthread_mutex_lock+0
    tpool::thread_pool_generic::timer_generic::execute(void*)+188
    tpool::task::execute()+50
    tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79
    0x7f7e60fb53d4
    0x5562258c9080
    std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0
    0x2de907894810c083
]: 14890
@mutexstack[
    __pthread_mutex_lock+0
    tpool::thread_pool_generic::submit_task(tpool::task*)+88
    timer_handler+326
    start_thread+226
]: 19135
@mutexstack[]: 206110

I see different threads locking the same mutexes, and so on. One day, given more time, I'd try to figure out which mutexes those are and what the purpose of each thread is (fortunately, that can be seen from the OS thread id in performance_schema.threads in 10.5, or inferred from the stacks for the moment). A hypothetical addition along those lines is sketched below.
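If I wanted to go further in that direction, a small and so far untested addition to the unlock probe could also sum the wait time per mutex address and per OS thread id. The @permutex and @perthread maps below are hypothetical names of mine; the rest reuses the same maps as in the tool above:

uprobe:/lib64/libpthread.so.0:pthread_mutex_unlock
/comm == "mariadbd" && @start[arg0] != 0 && @mutexid[arg0] == tid/
{
    $time = nsecs - @start[arg0];
    @mutexstack[@tidstack[tid]] += $time;
    @permutex[arg0] += $time;     /* total wait per mutex address */
    @perthread[tid] += $time;     /* total wait per OS thread id */
    delete(@start[arg0]);
    delete(@mutexid[arg0]);
    delete(@tidstack[tid]);
}

The extra maps would be dumped at exit like any other, so the totals per mutex and per thread would show up next to the per-stack sums.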

I removed the debug printf (no interactive output, just the final summarized data), changed ustack to ustack(perf) (assuming I would know better how to deal with that output format later; it was not really a good idea), and ended up with this final version of the tool:

#!/usr/bin/env bpftrace

BEGIN
{
/*
    printf("Tracing time from pthread_mutex_lock to _unlock, Ctrl-C to stop\n");
*/
}

uprobe:/lib64/libpthread.so.0:pthread_mutex_lock /comm == "mariadbd"/
{
    @start[arg0] = nsecs;
    @mutexid[arg0] = tid;
    @tidstack[tid] = ustack(perf);    
}

uprobe:/lib64/libpthread.so.0:pthread_mutex_unlock
/comm == "mariadbd" && @start[arg0] != 0 && @mutexid[arg0] == tid/
{
    $now = nsecs;
    $time = $now - @start[arg0];
    @mutexstack[@tidstack[tid]] += $time;
/*
    printf("Mutex: %u, thread: %u, time: %d\n", arg0, tid, $time);
*/
    delete(@start[arg0]);
    delete(@mutexid[arg0]);
    delete(@tidstack[tid]);
}

END
{
    clear(@start);
    clear(@mutexid);
    clear(@tidstack);
}
/* the end */

I saved the stack traces to a file in /tmp, to work on the output further outside of bpftrace. Again, probably not the best idea, but I am not yet fluent with string processing in bpftrace anyway, so I rely on awk, sort etc.:

[openxs@fc31 ~]$ sudo ./pthread_mutex.bt 2>/dev/null >/tmp/pthread_mutex_perf_stacks.txt
^C[openxs@fc31 ~]$ cat /tmp/pthread_mutex_perf_stacks.txt
Attaching 4 probes...



@mutexstack[
        7f7e615a6e70 __pthread_mutex_lock+0 (/usr/lib64/libpthread-2.30.so)
        556223eb07cf tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        7f7e60fb53d4 0x7f7e60fb53d4 ([unknown])
        5562258c9080 0x5562258c9080 ([unknown])
        556223eb0b60 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        2de907894810c083 0x2de907894810c083 ([unknown])
]: 21352
@mutexstack[
        7f7e615a6e70 __pthread_mutex_lock+0 (/usr/lib64/libpthread-2.30.so)
        556223eb0d62 tpool::thread_pool_generic::timer_generic::execute(void*)+210 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        556223eb1c52 tpool::task::execute()+50 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        556223eb07cf tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        7f7e60fb53d4 0x7f7e60fb53d4 ([unknown])
        5562258c9080 0x5562258c9080 ([unknown])
        556223eb0b60 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        2de907894810c083 0x2de907894810c083 ([unknown])
]: 22975
...

After checking what I had done previously to collapse such stack traces into one line per stack, pt-pmp style, and after multiple clarification runs and changes, I ended up with the following awk code:

awk '
BEGIN { s = ""; }
/^@mutexstack\[\]/ { s = ""; }
/^@mutexstack/ { s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s "," targ } else { s = targ } }
/^]/ { print $2, s }
'

I process the lines that do not contain stack frames, those around the stack block, resetting the stack s at the beginning and printing it at the end of the block. For the stack frame lines I take only the function name and ignore everything else to form targ, which I concatenate to the stack collected so far, with a comma as the separator between function names (that was a wrong idea for future use). The original code that inspired all this came from pt-pmp, as far as I remember. I just adapted it to this format, better than in the previous posts.

Post-processing the output with this awk code gave me the following:

[openxs@fc31 ~]$ cat /tmp/pthread_mutex_perf_stacks.txt | awk '
> BEGIN { s = ""; }
> /^@mutexstack\[\]/ { s = ""; }
> /^@mutexstack/ { s = ""; }
> /^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s "," targ } else { s = targ } }
> /^]/ { print $2, s }
> '
21352 __pthread_mutex_lock,tpool::thread_pool_generic::worker_main,,,,
22975 __pthread_mutex_lock,tpool::thread_pool_generic::timer_generic::execute,tpool::task::execute,tpool::thread_pool_generic::worker_main,,,,
24568 __pthread_mutex_lock,tpool::thread_pool_generic::timer_generic::execute,tpool::task::execute,tpool::thread_pool_generic::worker_main,,,,
33469 __pthread_mutex_lock,tpool::thread_pool_generic::submit_task,timer_handler,start_thread

Non-resolved addresses are removed, as are the offsets from the function start. Now only the sorting remains, in descending order, on the first column as the key:

[openxs@fc31 ~]$ cat /tmp/pthread_mutex_perf_stacks.txt | awk '
BEGIN { s = ""; }
/^@mutexstack\[\]/ { s = ""; }
/^@mutexstack/ { s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s "," targ } else { s = targ } }
/^]/ { print $2, s }
' | sort -r -n -k 1,1
33469 __pthread_mutex_lock,tpool::thread_pool_generic::submit_task,timer_handler,start_thread
24568 __pthread_mutex_lock,tpool::thread_pool_generic::timer_generic::execute,tpool::task::execute,tpool::thread_pool_generic::worker_main,,,,
22975 __pthread_mutex_lock,tpool::thread_pool_generic::timer_generic::execute,tpool::task::execute,tpool::thread_pool_generic::worker_main,,,,
21352 __pthread_mutex_lock,tpool::thread_pool_generic::worker_main,,,,

That's what we have for the server without user connections. Now let me put it under a highly concurrent sysbench test load (good idea, isn't it?):

[openxs@fc31 maria10.5]$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --threads=32 --report-interval=10 --time=300 run
sysbench 1.1.0-174f3aa (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 32
Report intermediate results every 10 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 10s ] thds: 32 tps: 653.07 qps: 13097.92 (r/w/o: 9174.69/2613.89/1309.34) lat (ms,95%): 240.02 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 32 tps: 1025.71 qps: 20511.58 (r/w/o: 14358.12/4102.14/2051.32) lat (ms,95%): 71.83 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 32 tps: 588.21 qps: 11770.70 (r/w/o: 8238.74/2355.44/1176.52) lat (ms,95%): 235.74 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 32 tps: 306.22 qps: 6135.54 (r/w/o: 4298.14/1224.97/612.43) lat (ms,95%): 369.77 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 32 tps: 467.00 qps: 9339.64 (r/w/o: 6537.96/1867.69/933.99) lat (ms,95%): 308.84 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 32 tps: 302.19 qps: 6044.31 (r/w/o: 4230.60/1209.34/604.37) lat (ms,95%): 520.62 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 32 tps: 324.91 qps: 6496.60 (r/w/o: 4548.67/1298.12/649.81) lat (ms,95%): 467.30 err/s: 0.00 reconn/s: 0.00
[ 80s ] thds: 32 tps: 303.58 qps: 6058.55 (r/w/o: 4238.05/1213.33/607.16) lat (ms,95%): 646.19 err/s: 0.00 reconn/s: 0.00
[ 90s ] thds: 32 tps: 258.39 qps: 5176.10 (r/w/o: 3625.73/1033.58/516.79) lat (ms,95%): 634.66 err/s: 0.00 reconn/s: 0.00
[ 100s ] thds: 32 tps: 213.72 qps: 4279.43 (r/w/o: 2995.93/856.07/427.43) lat (ms,95%): 707.07 err/s: 0.00 reconn/s: 0.00
[ 110s ] thds: 32 tps: 208.29 qps: 4144.23 (r/w/o: 2896.58/831.07/416.58) lat (ms,95%): 623.33 err/s: 0.00 reconn/s: 0.00
[ 120s ] thds: 32 tps: 456.29 qps: 9135.45 (r/w/o: 6397.03/1826.05/912.38) lat (ms,95%): 363.18 err/s: 0.00 reconn/s: 0.00
[ 130s ] thds: 32 tps: 582.21 qps: 11641.73 (r/w/o: 8148.49/2328.63/1164.61) lat (ms,95%): 277.21 err/s: 0.00 reconn/s: 0.00
[ 140s ] thds: 32 tps: 560.39 qps: 11208.17 (r/w/o: 7845.84/2241.55/1120.78) lat (ms,95%): 257.95 err/s: 0.00 reconn/s: 0.00
[ 150s ] thds: 32 tps: 338.03 qps: 6768.93 (r/w/o: 4739.47/1353.41/676.05) lat (ms,95%): 442.73 err/s: 0.00 reconn/s: 0.00
[ 160s ] thds: 32 tps: 410.20 qps: 8210.38 (r/w/o: 5748.19/1641.80/820.40) lat (ms,95%): 411.96 err/s: 0.00 reconn/s: 0.00
[ 170s ] thds: 32 tps: 480.28 qps: 9599.94 (r/w/o: 6716.68/1922.81/960.45) lat (ms,95%): 325.98 err/s: 0.00 reconn/s: 0.00
[ 180s ] thds: 32 tps: 397.62 qps: 7952.16 (r/w/o: 5568.62/1588.19/795.35) lat (ms,95%): 411.96 err/s: 0.00 reconn/s: 0.00
[ 190s ] thds: 32 tps: 338.77 qps: 6769.31 (r/w/o: 4739.09/1352.78/677.44) lat (ms,95%): 475.79 err/s: 0.00 reconn/s: 0.00
[ 200s ] thds: 32 tps: 417.81 qps: 8372.59 (r/w/o: 5857.10/1679.76/835.73) lat (ms,95%): 331.91 err/s: 0.00 reconn/s: 0.00
[ 210s ] thds: 32 tps: 267.40 qps: 5340.01 (r/w/o: 3742.10/1063.10/534.80) lat (ms,95%): 634.66 err/s: 0.00 reconn/s: 0.00
[ 220s ] thds: 32 tps: 267.70 qps: 5355.78 (r/w/o: 3748.96/1071.42/535.41) lat (ms,95%): 590.56 err/s: 0.00 reconn/s: 0.00
[ 230s ] thds: 32 tps: 243.11 qps: 4859.74 (r/w/o: 3401.70/971.83/486.21) lat (ms,95%): 733.00 err/s: 0.00 reconn/s: 0.00
[ 240s ] thds: 32 tps: 173.99 qps: 3474.97 (r/w/o: 2430.94/696.05/347.98) lat (ms,95%): 1013.60 err/s: 0.00 reconn/s: 0.00
[ 250s ] thds: 32 tps: 169.71 qps: 3403.05 (r/w/o: 2384.37/679.25/339.42) lat (ms,95%): 877.61 err/s: 0.00 reconn/s: 0.00
[ 260s ] thds: 32 tps: 407.57 qps: 8151.27 (r/w/o: 5704.23/1631.89/815.15) lat (ms,95%): 272.27 err/s: 0.00 reconn/s: 0.00
...
[ 300s ] thds: 32 tps: 382.41 qps: 7641.05 (r/w/o: 5348.01/1528.43/764.62) lat (ms,95%): 434.83 err/s: 0.00 reconn/s: 0.00
SQL statistics:
    queries performed:
        read:                            1663592
        write:                           475312
        other:                           237656
        total:                           2376560
    transactions:                        118828 (396.04 per sec.)
    queries:                             2376560 (7920.89 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

Throughput:
    events/s (eps):                      396.0445
    time elapsed:                        300.0370s
    total number of events:              118828

Latency (ms):
         min:                                    2.17
         avg:                                   80.79
         max:                                 5012.10
         95th percentile:                      390.30
         sum:                              9600000.68

Threads fairness:
    events (avg/stddev):           3713.3750/40.53
    execution time (avg/stddev):   300.0000/0.01

Trust me that I started my bpftrace program after the initial 20 seconds of the test run, and let it work for at most 20 seconds. But the rest of the test, the next 280 seconds, was notably affected by a visible drop in QPS! I pressed Ctrl-C but got the command prompt back much later, not even after 300 seconds... I was watching the output file grow in another shell:

[openxs@fc31 ~]$ ls -l /tmp/pthread_mutex_perf_stacks.txt
-rw-rw-r--. 1 openxs openxs 264177 Jan 26 13:26 /tmp/pthread_mutex_perf_stacks.txt
...
[openxs@fc31 ~]$ ls -l /tmp/pthread_mutex_perf_stacks.txt
-rw-rw-r--. 1 openxs openxs 281111 Jan 26 13:27 /tmp/pthread_mutex_perf_stacks.txt
...
[openxs@fc31 ~]$ ls -l /tmp/pthread_mutex_perf_stacks.txt
-rw-rw-r--. 1 openxs openxs 4116283 Jan 26 13:35 /tmp/pthread_mutex_perf_stacks.txt

So I ended up with 4M of text data exported to userland, for just 20 seconds of data collection, and with a performance drop that lasted many minutes for my 32-thread test on an old 4-core system. Not that impressive, and I should take better care, probably by aggregating and processing the data more in my bpftrace program, or maybe by just dumping raw stack/time entries as they are collected. I'll test and see how to improve, as this way of collecting is not suitable for production use on a loaded system :( One simple mitigation I may try is sketched below.
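For example (a sketch only, not tested yet), the same probes could be kept, but with the default ustack format (less text to export than ustack(perf)) and with an interval probe that stops the tracing by itself after 20 seconds, so the overhead is bounded even if I am slow with Ctrl-C:

#!/usr/bin/env bpftrace

uprobe:/lib64/libpthread.so.0:pthread_mutex_lock /comm == "mariadbd"/
{
    @start[arg0] = nsecs;
    @mutexid[arg0] = tid;
    @tidstack[tid] = ustack;
}

uprobe:/lib64/libpthread.so.0:pthread_mutex_unlock
/comm == "mariadbd" && @start[arg0] != 0 && @mutexid[arg0] == tid/
{
    @mutexstack[@tidstack[tid]] += nsecs - @start[arg0];
    delete(@start[arg0]);
    delete(@mutexid[arg0]);
    delete(@tidstack[tid]);
}

interval:s:20
{
    /* stop tracing: END runs and @mutexstack is dumped automatically */
    exit();
}

END
{
    clear(@start);
    clear(@mutexid);
    clear(@tidstack);
}
/* the end */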

Anyway, I have to process what was collected at such a cost. As a reminder, the data were collected this way:

[openxs@fc31 ~]$ sudo ./pthread_mutex.bt 2>/dev/null >/tmp/pthread_mutex_perf_stacks.txt
[sudo] password for openxs:
^C

and then I applied the same awk plus sort command line as above to get collapsed stacks. This is what I saw as a result:

[openxs@fc31 ~]$ cat /tmp/collapsed_pthread_mutex.txt | more
104251253 __pthread_mutex_lock,buf_flush_page,buf_flush_try_neighbors,buf_do_flu
sh_list_batch,buf_flush_lists,buf_flush_page_cleaner,start_thread
78920938
74828263
74770599
74622438
72853129
67893142
66546439 __pthread_mutex_lock,buf_do_flush_list_batch,buf_flush_lists,buf_flush_
page_cleaner,start_thread
61669188
59330217
55480213
55045396
53531941
53216338
...

I have yet to find out where those non-resolved and completely removed entries come from, and what to do with them so that they do not influence the analysis. For now I need to get rid of them as useless. This is how I did it to get the "top 5" stacks with the time (in nanoseconds) spent in them:

[openxs@fc31 ~]$ cat /tmp/collapsed_pthread_mutex.txt | awk '{ if (length($2) > 0) {print} }' | head -5
104251253 __pthread_mutex_lock,buf_flush_page,buf_flush_try_neighbors,buf_do_flush_list_batch,buf_flush_lists,buf_flush_page_cleaner,start_thread
66546439 __pthread_mutex_lock,buf_do_flush_list_batch,buf_flush_lists,buf_flush_page_cleaner,start_thread
31431176 __pthread_mutex_lock,buf_flush_try_neighbors,buf_do_flush_list_batch,buf_flush_lists,buf_flush_page_cleaner,start_thread
27100601 __pthread_mutex_lock,tpool::aio_linux::getevent_thread_routine,,,,
11730055 __pthread_mutex_lock,buf_flush_lists,buf_flush_page_cleaner,start_thread
[openxs@fc31 ~]$ cat /tmp/collapsed_pthread_mutex.txt | awk '{ if (length($2) > 0) {print} }' > /tmp/collapsed_clean_pthread_mutex.txt

I saved the output into the /tmp/collapsed_clean_pthread_mutex.txt file. The next step would be to represent the result in some nice graphical way, a flame graph! I have the software in place:

[openxs@fc31 ~]$ ls /mnt/home/openxs/git/FlameGraph/
aix-perf.pl                 stackcollapse-gdb.pl
demos                       stackcollapse-go.pl
dev                         stackcollapse-instruments.pl
difffolded.pl               stackcollapse-java-exceptions.pl
docs                        stackcollapse-jstack.pl
example-dtrace-stacks.txt   stackcollapse-ljp.awk
example-dtrace.svg          stackcollapse-perf.pl
example-perf-stacks.txt.gz  stackcollapse-perf-sched.awk
example-perf.svg            stackcollapse.pl
files.pl                    stackcollapse-pmc.pl
flamegraph.pl               stackcollapse-recursive.pl
jmaps                       stackcollapse-sample.awk
pkgsplit-perf.pl            stackcollapse-stap.pl
range-perf.pl               stackcollapse-vsprof.pl
README.md                   stackcollapse-vtune.pl
record-test.sh              stackcollapse-xdebug.php
stackcollapse-aix.pl        test
stackcollapse-bpftrace.pl   test.sh
stackcollapse-elfutils.pl

But I quickly recalled that flamegraph.pl expects this kind of input format, with ";" as the separator and the number as the second column, not the first:

bash;entry_SYSCALL_64_fastpath;sys_read;vfs_read;...;schedule 8

There is also a tool to collapse raw bpftrace stacks, stackcollapse-bpftrace.pl, and I have to check how it works for my case one day... Yesterday I just wanted to complete the testing as soon as possible, so I proceeded with a quick and dirty awk hack:

[openxs@fc31 ~]$ cat /tmp/collapsed_clean_pthread_mutex.txt | awk ' { gsub(",",";",$2); print "mariadbd;"$2, $1 }' | head -5
mariadbd;__pthread_mutex_lock;buf_flush_page;buf_flush_try_neighbors;buf_do_flush_list_batch;buf_flush_lists;buf_flush_page_cleaner;start_thread 104251253
mariadbd;__pthread_mutex_lock;buf_do_flush_list_batch;buf_flush_lists;buf_flush_page_cleaner;start_thread 66546439
mariadbd;__pthread_mutex_lock;buf_flush_try_neighbors;buf_do_flush_list_batch;buf_flush_lists;buf_flush_page_cleaner;start_thread 31431176
mariadbd;__pthread_mutex_lock;tpool::aio_linux::getevent_thread_routine;;;; 27100601
mariadbd;__pthread_mutex_lock;buf_flush_lists;buf_flush_page_cleaner;start_thread 11730055

This format looks acceptable, so I've generated the flame graph with the same hack:

[openxs@fc31 ~]$ cat /tmp/collapsed_clean_pthread_mutex.txt | awk ' { gsub(",",";",$2); print "mariadbd;"$2, $1 }' | /mnt/home/openxs/git/FlameGraph/flamegraph.pl --title="pthread_mutex_waits in MariaDB 10.5" --countname=nsecs > ~/Documents/mutex.svg

and here is the result, with a search for "tpool" highlighting how much of the mutex wait time is related to the thread pool of background InnoDB threads:

One can surely create a flame graph based on stacks collected by the bpftrace program, one way or the other...

I'll stop at this stage and maybe continue testing later this week. Stay tuned!

* * *

To summarize:

  • I am not yet sure if my logic in the programs above was correct. I have to think more about it.
  • I surely need to find another way to process the data, either by collapsing/processing stacks in my bpftrace program to make them smaller, or maybe by submitting raw stack/time data as they are collected to the user level. More tests to come...
  • It is easy to create custom bpftrace programs for collecting the data you need. I think memory allocation tracing is my next goal. Imagine a printout of memory allocated and not freed, per allocating thread... if only that had less impact on QPS than my lame program above demonstrated :) A rough sketch of the idea follows below.
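Here is that rough, completely untested sketch, just to record the idea. It assumes libc at /lib64/libc.so.6 on this Fedora box, and the map names (@size, @astack, @bytes, @stack, @notfreed) are hypothetical ones of mine:

#!/usr/bin/env bpftrace
/* Sketch: bytes allocated via malloc() and not yet freed, per allocating stack */

uprobe:/lib64/libc.so.6:malloc /comm == "mariadbd"/
{
    @size[tid] = arg0;          /* requested size */
    @astack[tid] = ustack;      /* stack of the allocation call */
}

uretprobe:/lib64/libc.so.6:malloc /comm == "mariadbd" && @size[tid]/
{
    @bytes[retval] = @size[tid];            /* size, indexed by the returned address */
    @stack[retval] = @astack[tid];          /* allocating stack, indexed by the address */
    @notfreed[@astack[tid]] += @size[tid];  /* count as allocated and not yet freed */
    delete(@size[tid]);
    delete(@astack[tid]);
}

uprobe:/lib64/libc.so.6:free /comm == "mariadbd" && @bytes[arg0]/
{
    @notfreed[@stack[arg0]] -= @bytes[arg0];   /* freed, subtract it back */
    delete(@bytes[arg0]);
    delete(@stack[arg0]);
}

END
{
    clear(@size);
    clear(@astack);
    clear(@bytes);
    clear(@stack);
}
/* the end */

Tracing every malloc() and free() with uprobes would surely cost even more than the mutex tracing above, so this is just a starting point to refine.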

7 comments:

  1. After some thinking about it, again, I guess that what we want to measure, in the general case, is not the "time from mutex_lock to mutex_unlock". This time can be large even when the mutex is not contended.
    I think we want to test contention. In the first approximation, it would be the clock time that pthread_mutex_lock() takes. In the second approximation it would probably be the time that __lll_lock_wait takes (I might as well be wrong, disclaimer, I only study GNUism-Linuxism by looking at stack traces).


    Is it possible to write a tool that measures the duration of one function?

    Replies
    1. Seems possible, as per https://github.com/iovisor/bpftrace/blob/master/docs/tutorial_one_liners.md (Lesson 7, timing read()s).

    2. If I get you right, you want the wall clock time spent in a function, from entry to return. This is easy, indeed: we have to capture nsecs in a uprobe, upon entry, and then nsecs in a uretprobe, upon return, and take the difference, which is the time spent in the function. We can store this in some global associative array, indexed by... what? Function name, thread id, some function argument? Also, what functions should be instrumented this way, with 2 probes each? pthread_mutex_lock or __lll_lock_wait?

      I can try to do some of this even later today, but I need a clearer goal: time spent in pthread_mutex_lock per... what? What else to collect for each call, if not the stack that led us to it?

    3. ...global associative array, indexed by the *callstack*, with the goal of finding the places where waits happen. I think __lll_lock_wait could be a better indicator of contention; measuring uncontended pthread_mutex_locks could give some false alarms.

    4. I tried to do this in the next post:

      https://mysqlentomologist.blogspot.com/2021/01/playing-with-recent-bpftrace-and_28.html

    5. Skip the above! It was just a step in the right direction. In my next post I'll show a way to trace with less impact on performance and a much shorter time to get the results. Moreover, theoretically I can trace any function that new way, for any N seconds passed as an argument, etc. Also, the flame graph for a period of 60 seconds does look really reasonable to me.

      So, stay tuned. The post is coming tonight or tomorrow.

    6. Check this: http://mysqlentomologist.blogspot.com/2021/01/playing-with-recent-bpftrace-and_30.html
