Sunday, October 3, 2021

bpftrace as a code/function coverage tool for MariaDB server

I am going to speak about bpftrace again soon, this time at MariaDB Server Fest 2021. Among other useful applications of bpftrace I mention using it as a code coverage (or, more precisely, function coverage) tool to check whether some test executes all or specific functions in the MariaDB server source code. In this blog post I am going to present some related tests in more detail.

For testing I use recent enough versions of MariaDB server and bpftrace, both built from GitHub source code on my Ubuntu 20.04 "home server":

openxs@ao756:~/dbs/maria10.6$ bin/mariadbd --version
bin/mariadbd  Ver 10.6.5-MariaDB for Linux on x86_64 (MariaDB Server)
openxs@ao756:~/dbs/maria10.6$ bpftrace -V
bpftrace v0.13.0-120-gc671
openxs@ao756:~/dbs/maria10.6$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.3 LTS"

The idea I tried immediately, as a lame function coverage approach, was to attach a probe that prints the function name for every function in the mariadbd binary:

openxs@ao756:~/dbs/maria10.6$ sudo bpftrace -e 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:* { printf("%s\n", func); }'
ERROR: Can't attach to 34765 probes because it exceeds the current limit of 512 probes.
You can increase the limit through the BPFTRACE_MAX_PROBES environment variable, but BE CAREFUL since a high number of probes attached can cause your system to crash.
openxs@ao756:~/dbs/maria10.6$ bpftrace --help  2>&1 | grep MAX_PROBES
    BPFTRACE_MAX_PROBES         [default: 512] max number of probes

This way I found out that bpftrace sees 34765 different functions in this binary, but by default it can attach at most 512 probes at a time. So, let me increase BPFTRACE_MAX_PROBES, as suggested (I don't mind a system crash, it's for testing anyway):

openxs@ao756:~/dbs/maria10.6$ sudo BPFTRACE_MAX_PROBES=35000 bpftrace -e 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:* { printf("%s\n", func); }'
Attaching 34765 probes...
bpf: Failed to load program: Too many open files
processed 17 insns (limit 1000000) max_states_per_insn 0 total_states 1 peak_states 1 mark_read 0

bpf: Failed to load program: Too many open files
processed 17 insns (limit 1000000) max_states_per_insn 0 total_states 1 peak_states 1 mark_read 0

ERROR: Error loading program: uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:thd_client_ip (try -v)
Segmentation fault
openxs@ao756:~/dbs/maria10.6$ ulimit -n
1024

This time I crashed bpftrace, and before that it hit the limit on the number of open files, which is quite small by default. What if I increase that limit? Let me try:

openxs@ao756:~/dbs/maria10.6$ ulimit -n 40000
openxs@ao756:~/dbs/maria10.6$ ulimit -n
40000
openxs@ao756:~/dbs/maria10.6$ sudo BPFTRACE_MAX_PROBES=35000 bpftrace -e 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:* { printf("%s\n", func); }'
Attaching 34765 probes...
ERROR: Offset outside the function bounds ('register_tm_clones' size is 0)
Segmentation fault

Still a crash, but with a different error message before it. It looks like some functions are not traceable for a reason I have yet to understand. The segmentation fault may still be related to the number of probes to create, but we can get the same error with far fewer probes (and for a different function):

openxs@ao756:~/dbs/maria10.6$ sudo BPFTRACE_MAX_PROBES=40000 bpftrace -e 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:*do* { printf("%s\n", func); }'
Attaching 1076 probes...
ERROR: Offset outside the function bounds ('__do_global_dtors_aux' size is 0)
openxs@ao756:~/dbs/maria10.6$

The problematic function above is actually added by the compiler for calling the destructors of static objects (if we trust this source), so it's not even from the MariaDB server code itself. The register_tm_clones function we hit before is even more mysterious (read this); it is somehow related to the transactional memory model in C++. Looks like one day I'll have to just create a list of all traceable and important functions in the MariaDB server source code and add an explicit probe for each. Lame attempts to trace them all do not work.
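One way to start building such a list is to ask bpftrace itself which symbols it can attach to, and filter out the compiler-generated helpers that caused the errors above. This is just a sketch (the output file name and the exclusion list are mine, and the list is surely incomplete):

sudo bpftrace -l 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:*' \
    | grep -v -e register_tm_clones -e deregister_tm_clones -e __do_global_dtors_aux \
    > /tmp/mariadbd_probes.txt
wc -l /tmp/mariadbd_probes.txt

The resulting file could then be turned into explicit probe definitions, or used to split the functions into batches small enough to stay under the probe and open files limits.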

But we surely can trace a lot of known useful functions, like in the following example where I add a probe for each function whose name matches 'ha_*' and count how many times each was called during some sysbench test:

openxs@ao756:~/dbs/maria10.6$ sudo BPFTRACE_MAX_PROBES=1000 bpftrace -e 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:ha_* { @cnt[func] += 1; }' > /tmp/funcs2.txt
[sudo] password for openxs:
^Copenxs@ao756:~/dbs/maria10.6$

The test that I executed in another shell was the following (numbers do not really matter):

openxs@ao756:~/dbs/maria10.6$ sysbench oltp_read_write --mysql-socket=/tmp/mariadb.sock --mysql-user=openxs --tables=2 --table-size=1000 --threads=2 run --time=1600 --report-interval=10
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 2
Report intermediate results every 10 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 10s ] thds: 2 tps: 74.96 qps: 1502.59 (r/w/o: 1052.20/300.26/150.13) lat (ms,95%): 44.17 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 2 tps: 70.31 qps: 1404.51 (r/w/o: 983.07/280.82/140.61) lat (ms,95%): 36.24 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 2 tps: 82.50 qps: 1650.50 (r/w/o: 1155.30/330.20/165.00) lat (ms,95%): 40.37 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 2 tps: 82.60 qps: 1652.60 (r/w/o: 1156.80/330.60/165.20) lat (ms,95%): 33.12 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 2 tps: 50.20 qps: 1004.48 (r/w/o: 703.29/200.80/100.40) lat (ms,95%): 89.16 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 2 tps: 74.80 qps: 1494.61 (r/w/o: 1045.91/299.20/149.50) lat (ms,95%): 44.98 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 2 tps: 82.70 qps: 1654.31 (r/w/o: 1158.21/330.60/165.50) lat (ms,95%): 33.72 err/s: 0.00 reconn/s: 0.00
[ 80s ] thds: 2 tps: 81.60 qps: 1632.61 (r/w/o: 1142.80/326.60/163.20) lat (ms,95%): 44.17 err/s: 0.00 reconn/s: 0.00
^C
openxs@ao756:~/dbs/maria10.6$

I let it run for a minute or so, and now I can check which ha_* functions of the MariaDB server were called, and how many times:

openxs@ao756:~/dbs/maria10.6$ cat /tmp/funcs2.txt
Attaching 953 probes...


@cnt[ha_rollback_trans(THD*, bool)]: 2
@cnt[ha_close_connection(THD*)]: 2
@cnt[ha_heap::open(char const*, int, unsigned int)]: 6296
@cnt[ha_heap::close()]: 6298
@cnt[ha_heap::update_key_stats()]: 6320
@cnt[ha_heap::drop_table(char const*)]: 6353
@cnt[ha_lock_engine(THD*, handlerton const*)]: 6524
@cnt[ha_innobase::delete_row(unsigned char const*)]: 6551
@cnt[ha_innobase::rnd_init(bool)]: 6602
@cnt[ha_innobase::innobase_lock_autoinc()]: 6604
@cnt[ha_innobase::write_row(unsigned char const*)]: 6604
@cnt[ha_heap::table_flags() const]: 11746
@cnt[ha_heap::info(unsigned int)]: 12606
@cnt[ha_check_if_updates_are_ignored(THD*, handlerton*, char const*)]: 13094
@cnt[ha_innobase::update_row(unsigned char const*, unsigned char const*)]: 13122
@cnt[ha_heap::~ha_heap()]: 18819
@cnt[ha_heap::rnd_init(bool)]: 18865
@cnt[ha_innobase::unlock_row()]: 26252
@cnt[ha_innobase::records_in_range(unsigned int, st_key_range const*, st_key_range const*, st_page_range*)]: 26320
@cnt[ha_innobase::try_semi_consistent_read(bool)]: 26413
@cnt[ha_innobase::referenced_by_foreign_key()]: 26432
@cnt[ha_innobase::was_semi_consistent_read()]: 33022
@cnt[ha_innobase::multi_range_read_explain_info(unsigned int, char*, unsigned long)]: 46221
@cnt[ha_innobase::read_time(unsigned int, unsigned int, unsigned long long)]: 46228
@cnt[ha_innobase::multi_range_read_info_const(unsigned int, st_range_seq_if*, void*, unsigned int, unsigned int*, unsigned int*, Cost_estimate*)]: 46230
@cnt[ha_innobase::multi_range_read_init(st_range_seq_if*, void*, unsigned int, unsigned int, st_handler_buffer*)]: 46232
@cnt[ha_heap::extra(ha_extra_function)]: 50393
@cnt[ha_innobase::estimate_rows_upper_bound()]: 52828
@cnt[ha_innobase::scan_time()]: 72638
@cnt[ha_innobase::lock_count() const]: 107072
@cnt[ha_innobase::index_init(unsigned int, bool)]: 111372
@cnt[ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)]: 111432
@cnt[ha_innobase::info(unsigned int)]: 112253
@cnt[ha_innobase::info_low(unsigned int, bool)]: 112271
@cnt[ha_innobase::store_lock(THD*, st_thr_lock_data**, thr_lock_type)]: 118088
@cnt[ha_innobase::change_active_index(unsigned int)]: 118700
@cnt[ha_innobase::reset()]: 118934
@cnt[ha_commit_trans(THD*, bool)]: 123967
@cnt[ha_commit_one_phase(THD*, bool)]: 124096
@cnt[ha_check_and_coalesce_trx_read_only(THD*, Ha_trx_info*, bool)]: 124363
@cnt[ha_innobase::column_bitmaps_signal()]: 191279
@cnt[ha_innobase::table_flags() const]: 213982
@cnt[ha_innobase::build_template(bool)]: 230402
@cnt[ha_innobase::external_lock(THD*, int)]: 236446
@cnt[ha_innobase::rnd_end()]: 237608
@cnt[ha_innobase::extra(ha_extra_function)]: 415923
@cnt[ha_innobase::index_flags(unsigned int, unsigned int, bool) const]: 451710
@cnt[ha_heap::write_row(unsigned char const*)]: 628369
@cnt[ha_heap::position(unsigned char const*)]: 629184
@cnt[ha_heap::rnd_pos(unsigned char*, unsigned char*)]: 629341
@cnt[ha_heap::rnd_next(unsigned char*)]: 635061
@cnt[ha_innobase::position(unsigned char const*)]: 660590
@cnt[ha_innobase::index_next(unsigned char*)]: 2621915
@cnt[ha_innobase::multi_range_read_next(void**)]: 2708026
openxs@ao756:~/dbs/maria10.6$

I think this is really awesome! I can trace 900+ function calls with a bpftrace one-liner and get the @cnt associative array printed at the end, with the (not mangled!) function name as a key and the number of times the function was called while the test ran as a value, automatically ordered by increasing value!
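By the way, if the full map is too long, the print() function of bpftrace can be given the number of top entries to show, so a variant of the same one-liner could dump only the busiest functions in an explicit END block. A sketch I have not benchmarked here:

sudo BPFTRACE_MAX_PROBES=1000 bpftrace -e '
    uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:ha_* { @cnt[func] += 1; }
    END { print(@cnt, 10); clear(@cnt); }'

The explicit clear() in END prevents the default dump of the whole map at exit, so only the top 10 entries are printed.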

bpftrace may help MariaDB DBAs and developers in many cases!

This is already useful to find out whether any of the functions traced was called at all. I can run several bpftrace command lines concurrently, each tracing a batch of different function calls, and then summarize their outputs; I can measure the time spent in each function, get stack traces if needed, and so on. I hope you already agree that the idea of using bpftrace as a function coverage tool may work even for software as complex as MariaDB server.
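For example, measuring the time spent in one specific function from the list above could look like the following sketch (I picked ha_commit_trans; a wide wildcard like ha_* would not work well here, because these functions call each other - see ha_commit_trans and ha_commit_one_phase in the output above - and a single per-thread start timestamp would be overwritten):

sudo bpftrace -e '
    uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:ha_commit_trans* { @start[tid] = nsecs; }
    uretprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:ha_commit_trans* /@start[tid]/ {
        @ns = hist(nsecs - @start[tid]);
        delete(@start[tid]);
    }'

Here @ns ends up as a histogram of nanoseconds spent per call, printed automatically at exit.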

I will demonstrate this during my upcoming talk. See you there on Wednesday, October 6!

Wednesday, May 12, 2021

Dynamic Tracing of Memory Allocations in MySQL With bcc Tools

Last year I started my lame attempts to apply different Linux dynamic tracing tools and approaches to frequent events like memory allocations. In this blog post I already described how to use perf to add a user probe to trace malloc() calls with the number of bytes requested. Unfortunately that approach is not practical for production use for more than several seconds.

Recently I have been playing with bpftrace a lot and so far ended up with an easy way to trace calls and call stacks, and was on my way to tracing only outstanding allocations, caring only about memory areas not yet freed. If you are interested, the primitive approach may look like this:

[openxs@fc33 ~]$ cat malloc.bt
#!/usr/bin/env bpftrace

BEGIN
{
    printf("Tracing MariaDB's malloc() calls, Ctrl-C to stop\n");
}

interval:s:$1 { exit(); }

uprobe:/lib64/libc.so.6:malloc
/ comm == "mariadbd" /
{
    @size[tid] += arg0;
/*    printf("Allocating %d bytes in thread %u...\n", arg0, tid);  */
}

uretprobe:/lib64/libc.so.6:malloc
/ comm == "mariadbd" && @size[tid] > 0 /
{
    @memory[tid,retval] = @size[tid];
    @stack[ustack(perf)] += @size[tid];

    print(@stack);
    clear(@stack);

    delete(@size[tid]);
}

uprobe:/lib64/libc.so.6:free
/ comm == "mariadbd" /
{
    delete(@memory[tid, arg0]);
/*    printf("Freeing %p...\n", arg0); */
}

END
{
    clear(@size);
    clear(@memory);
    clear(@stack);
}
[openxs@fc33 ~]$

But while it works (both for the tracing that is commented out above and for summarizing the non-freed allocations) and produced some output as expected:

[openxs@fc33 ~]$ time sudo ./malloc.bt 1 2>/dev/null >/tmp/malloc_raw.txt

real    8m47.963s
user    2m53.513s
sys     5m50.685s

[openxs@fc33 maria10.5]$ ls -l /tmp/malloc_raw.txt
-rw-r--r--. 1 openxs openxs 461675 Apr 22 10:13 /tmp/malloc_raw.txt
[openxs@fc33 ~]$ tail /tmp/malloc_raw.txt
        558c68a27d12 row_purge_step(que_thr_t*)+770 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c689e8256 que_run_threads(que_thr_t*)+2166 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68a47ab7 purge_worker_callback(void*)+375 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68b88989 tpool::task_group::execute(tpool::task*)+137 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68b87bdf tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        7f599ae455f4 0x7f599ae455f4 ([unknown])
        558c69fc26b0 0x558c69fc26b0 ([unknown])
        558c68b87cb0 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        dde907894810c083 0xdde907894810c083 ([unknown])
]: 33

...

@stack[
        558c689bd25c mem_heap_create_block_func(mem_block_info_t*, unsigned long, unsigned long)+108 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68a3f2ae row_vers_old_has_index_entry(bool, unsigned char const*, mtr_t*, dict_index_t*, dtuple_t const*, unsigned long, unsigned long)+126 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68a24c60 row_purge_poss_sec(purge_node_t*, dict_index_t*, dtuple_t const*, btr_pcur_t*, mtr_t*, bool)+512 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68a2627b row_purge_remove_sec_if_poss_leaf(purge_node_t*, dict_index_t*, dtuple_t const*)+971 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68a272c3 row_purge_record_func(purge_node_t*, unsigned char*, que_thr_t const*, bool)+1459 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68a27d12 row_purge_step(que_thr_t*)+770 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c689e8256 que_run_threads(que_thr_t*)+2166 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68a47ab7 purge_worker_callback(void*)+375 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68b88989 tpool::task_group::execute(tpool::task*)+137 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68b87bdf tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        7f599ae455f4 0x7f599ae455f4 ([unknown])
        558c69fc26b0 0x558c69fc26b0 ([unknown])
        558c68b87cb0 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        dde907894810c083 0xdde907894810c083 ([unknown])
]: 1152

it took 8 minutes(!) to deal with the data collected over just 1 second of tracing under high load, and the tracing itself caused notable load:

top - 09:59:04 up  1:16,  3 users,  load average: 1.60, 7.14, 5.31
Tasks: 228 total,   2 running, 226 sleeping,   0 stopped,   0 zombie
%Cpu(s): 10.8 us, 17.5 sy,  0.0 ni, 71.6 id,  0.0 wa,  0.2 hi,  0.0 si,  0.0 st
MiB Mem :   7916.5 total,   1759.6 free,   2527.0 used,   3629.9 buff/cache
MiB Swap:   3958.0 total,   3958.0 free,      0.0 used.   4910.8 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  14813 root      20   0  307872 111852  75516 R  99.7   1.4   7:57.17 bpftrace
...

and a drop in performance for the system in the meantime:

[openxs@fc33 maria10.5]$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --threads=32 --report-interval=10 --time=300 run
sysbench 1.1.0-174f3aa (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 32
Report intermediate results every 10 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 10s ] thds: 32 tps: 759.62 qps: 15246.37 (r/w/o: 10676.06/3047.88/1522.44) lat (ms,95%): 80.03 err/s: 0.00 reconn/s: 0.00
...
[ 70s ] thds: 32 tps: 708.20 qps: 14174.96 (r/w/o: 9920.64/2837.91/1416.41) lat (ms,95%): 74.46 err/s: 0.00 reconn/s: 0.00
[ 80s ] thds: 32 tps: 354.60 qps: 7080.28 (r/w/o: 4964.46/1406.62/709.21) lat (ms,95%): 134.90 err/s: 0.00 reconn/s: 0.00
[ 90s ] thds: 32 tps: 332.91 qps: 6661.34 (r/w/o: 4657.50/1338.03/665.81) lat (ms,95%): 132.49 err/s: 0.00 reconn/s: 0.00

...

So I mention my lame tracing approach here only so that nobody tries to do the same - monitoring that causes a 2x drop in TPS for minutes is hardly acceptable. I obviously made some mistake that I have yet to identify. Probably resolving the stack traces and summarizing them in kernel context was too much for the system, and I cannot do much better in bpftrace itself, unless I use it to just produce raw traces.
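If I were to follow that "raw traces" route, the probes would keep no state at all and just emit each malloc() stack with the requested size, reusing the per-event print-and-clear trick from the script above but dropping the rest of the kernel-side bookkeeping, with all summarizing left to user-space tools. A minimal sketch (not benchmarked here):

uprobe:/lib64/libc.so.6:malloc
/ comm == "mariadbd" /
{
    /* emit the stack and the requested size right away, keep nothing in maps */
    @raw[ustack(perf)] = arg0;
    print(@raw);
    clear(@raw);
}

Whether this is actually cheaper than the aggregation above is something I'd have to measure separately.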

The approach above is also too primitive, as I traced only malloc(), while realloc() and calloc() calls may be used as well. So, hoping to see how this task is approached by really experienced people, I checked what bcc tools provide for tracing memory allocations.
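For completeness, extending my script to also account for calloc() and realloc() could look like the following fragment (just a sketch: the return values of these calls would still need their own uretprobes, not shown here, to be recorded in @memory):

uprobe:/lib64/libc.so.6:calloc
/ comm == "mariadbd" /
{
    /* calloc(nmemb, size): the requested amount is the product */
    @size[tid] += arg0 * arg1;
}

uprobe:/lib64/libc.so.6:realloc
/ comm == "mariadbd" /
{
    /* realloc(ptr, size): the old block is implicitly freed, arg1 bytes are requested */
    delete(@memory[tid, arg0]);
    @size[tid] += arg1;
}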

The memleak.py tool there is quite advanced. It allows both tracing individual calls and printing periodic summaries of outstanding allocations:

openxs@ao756:~/git/BPF-tools$ /usr/share/bcc/tools/memleak -h
usage: memleak [-h] [-p PID] [-t] [-a] [-o OLDER] [-c COMMAND]
               [--combined-only] [--wa-missing-free] [-s SAMPLE_RATE] [-T TOP]
               [-z MIN_SIZE] [-Z MAX_SIZE] [-O OBJ] [--percpu]
               [interval] [count]

Trace outstanding memory allocations that weren't freed.
Supports both user-mode allocations made with libc functions and kernel-mode
allocations made with kmalloc/kmem_cache_alloc/get_free_pages and corresponding
memory release functions.

positional arguments:
  interval              interval in seconds to print outstanding allocations
  count                 number of times to print the report before exiting

optional arguments:
  -h, --help            show this help message and exit
  -p PID, --pid PID     the PID to trace; if not specified, trace kernel
                        allocs
  -t, --trace           print trace messages for each alloc/free call
  -a, --show-allocs     show allocation addresses and sizes as well as call
                        stacks
  -o OLDER, --older OLDER
                        prune allocations younger than this age in
                        milliseconds
  -c COMMAND, --command COMMAND
                        execute and trace the specified command
  --combined-only       show combined allocation statistics only
  --wa-missing-free     Workaround to alleviate misjudgments when free is
                        missing
  -s SAMPLE_RATE, --sample-rate SAMPLE_RATE
                        sample every N-th allocation to decrease the overhead
  -T TOP, --top TOP     display only this many top allocating stacks (by size)
  -z MIN_SIZE, --min-size MIN_SIZE
                        capture only allocations larger than this size
  -Z MAX_SIZE, --max-size MAX_SIZE
                        capture only allocations smaller than this size
  -O OBJ, --obj OBJ     attach to allocator functions in the specified object
  --percpu              trace percpu allocations

EXAMPLES:

./memleak -p $(pidof allocs)
        Trace allocations and display a summary of "leaked" (outstanding)
        allocations every 5 seconds
./memleak -p $(pidof allocs) -t
        Trace allocations and display each individual allocator function call
./memleak -ap $(pidof allocs) 10
        Trace allocations and display allocated addresses, sizes, and stacks
        every 10 seconds for outstanding allocations
./memleak -c "./allocs"
        Run the specified command and trace its allocations
./memleak
        Trace allocations in kernel mode and display a summary of outstanding
        allocations every 5 seconds
./memleak -o 60000
        Trace allocations in kernel mode and display a summary of outstanding
        allocations that are at least one minute (60 seconds) old
./memleak -s 5
        Trace roughly every 5th allocation, to reduce overhead
openxs@ao756:~/git/BPF-tools$
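For example, combining the options above, printing the top 10 stacks of outstanding allocations, with individual addresses, every 5 seconds and 3 times in total should look something like this (a usage sketch based purely on the help output, not a command I ran for this post):

sudo /usr/share/bcc/tools/memleak -p $(pidof mysqld) -a --top 10 5 3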

I've applied it to MySQL 8.0.25, recently built on my Ubuntu 20.04 and running a sysbench oltp_read_write load test:

openxs@ao756:~/dbs/8.0$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=root --mysql-socket=/tmp/mysql8.sock --mysql-db=sbtest --time=300 --report-interval=10 --threads=4 run
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 4
Report intermediate results every 10 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 10s ] thds: 4 tps: 18.49 qps: 377.44 (r/w/o: 264.49/75.57/37.38) lat (ms,95%): 344.08 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 4 tps: 14.80 qps: 296.01 (r/w/o: 207.21/59.20/29.60) lat (ms,95%): 530.08 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 4 tps: 25.99 qps: 519.89 (r/w/o: 363.92/103.98/51.99) lat (ms,95%): 314.45 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 4 tps: 25.30 qps: 506.04 (r/w/o: 354.23/101.21/50.60) lat (ms,95%): 292.60 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 4 tps: 21.90 qps: 437.92 (r/w/o: 306.54/87.58/43.79) lat (ms,95%): 356.70 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 4 tps: 23.51 qps: 470.05 (r/w/o: 329.10/93.93/47.01) lat (ms,95%): 308.84 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 4 tps: 20.29 qps: 405.99 (r/w/o: 284.12/81.28/40.59) lat (ms,95%): 450.77 err/s: 0.00 reconn/s: 0.00
[ 80s ] thds: 4 tps: 20.51 qps: 408.20 (r/w/o: 286.17/81.02/41.01) lat (ms,95%): 390.30 err/s: 0.00 reconn/s: 0.00

[ 90s ] thds: 4 tps: 22.80 qps: 457.80 (r/w/o: 320.03/92.18/45.59) lat (ms,95%): 383.33 err/s: 0.00 reconn/s: 0.00
^C

Ignore the absolute numbers, but note that (unlike with my bpftrace program) there was no significant drop in QPS during the 20+ seconds I was collecting stacks for outstanding allocations in another shell:

openxs@ao756:~/git/BPF-tools$ time sudo /usr/share/bcc/tools/memleak -p $(pidof mysqld) --top 100 >/tmp/memleak.out
[sudo] password for openxs:
^C
real    0m21,142s
user    0m0,998s
sys     0m0,466s

Now, what was collected? Let's check the top 40 rows:

openxs@ao756:~/git/BPF-tools$ head -40 /tmp/memleak.out
Attaching to pid 3416, Ctrl+C to quit.
[13:51:26] Top 100 stacks with outstanding allocations:
        1536 bytes in 2 allocations from stack
                mem_heap_create_block_func(mem_block_info_t*, unsigned long, unsigned long)+0xc8 [mysqld]
                mem_heap_add_block(mem_block_info_t*, unsigned long)+0x53 [mysqld]
                row_vers_build_for_consistent_read(unsigned char const*, mtr_t*, dict_index_t*, unsigned long**, ReadView*, mem_block_info_t**, mem_block_info_t*, unsigned char**, dtuple_t const**, lob::undo_vers_t*)+0x78c [mysqld]
                row_search_mvcc(unsigned char*, page_cur_mode_t, row_prebuilt_t*, unsigned long, unsigned long)+0x2b78 [mysqld]
                ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)+0x32e [mysqld]
                handler::ha_index_read_map(unsigned char*, unsigned char const*, unsigned long, ha_rkey_function)+0x389 [mysqld]
                handler::read_range_first(key_range const*, key_range const*, bool, bool)+0x6e [mysqld]
                ha_innobase::read_range_first(key_range const*, key_range const*, bool, bool)+0x27 [mysqld]
                handler::multi_range_read_next(char**)+0x135 [mysqld]
                handler::ha_multi_range_read_next(char**)+0x2c [mysqld]
                QUICK_RANGE_SELECT::get_next()+0x5a [mysqld]
                IndexRangeScanIterator::Read()+0x3f [mysqld]
                FilterIterator::Read()+0x18 [mysqld]
                MaterializeIterator::MaterializeQueryBlock(MaterializeIterator::QueryBlock const&, unsigned long long*)+0x133 [mysqld]
                MaterializeIterator::Init()+0x319 [mysqld]
                filesort(THD*, Filesort*, RowIterator*, unsigned long, unsigned long long, Filesort_info*, Sort_result*, unsigned long long*)+0x39d [mysqld]
                SortingIterator::DoSort()+0x72 [mysqld]
                SortingIterator::Init()+0x34 [mysqld]
                Query_expression::ExecuteIteratorQuery(THD*)+0x2ea [mysqld]
                Query_expression::execute(THD*)+0x33 [mysqld]
                Sql_cmd_dml::execute_inner(THD*)+0x30b [mysqld]
                Sql_cmd_dml::execute(THD*)+0x545 [mysqld]
                mysql_execute_command(THD*, bool)+0x9f0 [mysqld]
                Prepared_statement::execute(String*, bool)+0x8b0 [mysqld]
                Prepared_statement::execute_loop(String*, bool)+0x117 [mysqld]
                mysqld_stmt_execute(THD*, Prepared_statement*, bool, unsigned long, PS_PARAM*)+0x1b1 [mysqld]
                dispatch_command(THD*, COM_DATA const*, enum_server_command)+0x175d [mysqld]
                do_command(THD*)+0x1a4 [mysqld]
                handle_connection+0x258 [mysqld]
                pfs_spawn_thread+0x162 [mysqld]
                start_thread+0xd9 [libpthread-2.31.so]
        2304 bytes in 3 allocations from stack
                mem_heap_create_block_func(mem_block_info_t*, unsigned long, unsigned long)+0xc8 [mysqld]
                mem_heap_add_block(mem_block_info_t*, unsigned long)+0x53 [mysqld]
                row_vers_build_for_consistent_read(unsigned char const*, mtr_t*, dict_index_t*, unsigned long**, ReadView*, mem_block_info_t**, mem_block_info_t*, unsigned char**, dtuple_t const**, lob::undo_vers_t*)+0x78c [mysqld]
                row_search_mvcc(unsigned char*, page_cur_mode_t, row_prebuilt_t*, unsigned long, unsigned long)+0x2b78 [mysqld]
                ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)+0x32e [mysqld]
openxs@ao756:~/git/BPF-tools$

This information may be really useful for further analysis. But it is not easy to collapse/summarize it for creating memory flame graphs, while I'd really like to use them to quickly see where most of the memory is allocated.

One day I'll get back to memleak and create some script to fold its output. But for this blog post I was looking for quick and dirty ways and, according to the page linked above, for that I had either to use the general purpose stackcount.py (that just counts the number of occurrences per unique stack), or Brendan Gregg's unsupported mallocstacks, which is similar to stackcount but sums the size_t argument to malloc() as the metric. I've used the latter and had to make a small fix to make it run on my Ubuntu 20.04 netbook:

openxs@ao756:~/git/BPF-tools$ git diff old/2017-12-23/mallocstacks.py
diff --git a/old/2017-12-23/mallocstacks.py b/old/2017-12-23/mallocstacks.py
index 8891e82..92271ed 100755
--- a/old/2017-12-23/mallocstacks.py
+++ b/old/2017-12-23/mallocstacks.py
@@ -96,7 +96,7 @@ struct key_t {
     char name[TASK_COMM_LEN];
 };
 BPF_HASH(bytes, struct key_t);
-BPF_STACK_TRACE(stack_traces, STACK_STORAGE_SIZE)
+BPF_STACK_TRACE(stack_traces, STACK_STORAGE_SIZE);

 int trace_malloc(struct pt_regs *ctx, size_t size) {
     u32 pid = bpf_get_current_pid_tgid();
openxs@ao756:~/git/BPF-tools$

Then I've used it as follows, running for some 10+ seconds against the same MySQL 8.0.25 under the same load (and with a similarly acceptable impact as with memleak):

openxs@ao756:~/git/BPF-tools/old/2017-12-23$ sudo ~/git/BPF-tools/old/2017-12-23/mallocstacks.py -p $(pidof mysqld) -f >/tmp/alloc.out
^C
openxs@ao756:~/git/BPF-tools/old/2017-12-23$ ls -l /tmp/alloc.out
-rw-rw-r-- 1 openxs openxs 859059 тра 12 10:28 /tmp/alloc.out
openxs@ao756:~/git/BPF-tools$ head -2 /tmp/alloc.out
mysqld;[unknown];std::thread::_State_impl<std::thread::_Invoker<std::tuple<Runnable, void (*)()> > >::~_State_impl();dict_stats_thread();[unknown];std::thread::_State_impl<std::thread::_Invoker<std::tuple<Runnable, void (*)()> > >::_M_run();dict_stats_thread();dict_stats_update(dict_table_t*, dict_stats_upd_option_t);dict_stats_save(dict_table_t*, index_id_t const*);dict_stats_exec_sql(pars_info_t*, char const*, trx_t*);que_eval_sql(pars_info_t*, char const*, unsigned long, trx_t*);que_run_threads(que_thr_t*);row_sel_step(que_thr_t*);row_sel(sel_node_t*, que_thr_t*);eval_cmp(func_node_t*);eval_node_alloc_val_buf(void*, unsigned long);__libc_malloc 33
mysqld;[unknown];std::thread::_State_impl<std::thread::_Invoker<std::tuple<Runnable, void (*)()> > >::~_State_impl();dict_stats_thread();[unknown];std::thread::_State_impl<std::thread::_Invoker<std::tuple<Runnable, void (*)()> > >::_M_run();dict_stats_thread();dict_stats_update(dict_table_t*, dict_stats_upd_option_t);dict_stats_save(dict_table_t*, index_id_t const*);dict_stats_exec_sql(pars_info_t*, char const*, trx_t*);trx_commit_for_mysql(trx_t*);trx_commit(trx_t*);trx_commit_low(trx_t*, mtr_t*);__libc_malloc 40
openxs@ao756:~/git/BPF-tools/old/2017-12-23$ cd
openxs@ao756:~$ cat /tmp/alloc.out | ~/git/FlameGraph/flamegraph.pl --color=mem --title="malloc() Flame Graph" --countname="bytes" >/tmp/mysql8_malloc.svg

So, we've got nice long folded stacks with the number of bytes allocated (no matter whether they were freed already or not) - pure, as-low-impact-as-possible malloc() tracing in a form immediately usable by flamegraph.pl, which produced the following output:

Flame graph showing where most of the memory was allocated from in MySQL 8.0.25 over 10+ seconds, while running standard sysbench oltp_read_write test.

Primitive, but it worked and allowed me to see that most of the allocations were related to filesort.
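For a quick cross-check without building the graph at all, the folded format makes it easy to sum the bytes for every stack that mentions a given frame, as the byte count is the last field of each line. A small sketch (the pattern is whatever frame you care about):

awk '/filesort/ { bytes += $NF } END { print bytes " bytes allocated in stacks mentioning filesort" }' /tmp/alloc.out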

* * *

To summarize:
  • Relatively low impact tracing of memory allocations (ongoing and outstanding) is possible, using bcc tools.
  • One may use the memleak tool to get quick insights into outstanding memory allocations, no matter where and how they were made, with sampling to reduce the performance impact, or rely on some custom or general tracing tools to collect some metric per stack trace and represent the result as memory flame graphs.
  • It looks like it may make sense to do just primitive tracing with bpftrace and not try to overload the tool with collecting per-stack data in maps, as stack resolution seems to take too many CPU resources.

Sunday, May 9, 2021

Off-CPU Analysis Attempt to Find the Reason of Performance Regression in MariaDB 10.4

I have not written new blog posts for more than 2 months already. Busy days... But now I am on vacation and my Percona Live Online 2021 talk on flame graphs is coming soon, so I decided to refresh my experience with bcc tools and try to get some insight into one of the bugs I've reported for MariaDB using off-CPU flame graphs.

The idea was to check why the sysbench oltp_read_write test started to work notably slower in newer versions of MariaDB 10.4.x after 10.4.15. On my good, old and slow Acer netbook with recently updated Ubuntu:

openxs@ao756:~/git/server$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
openxs@ao756:~/git/server$ uname -a
Linux ao756 5.8.0-50-generic #56~20.04.1-Ubuntu SMP Mon Apr 12 21:46:35 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

I've compiled current MariaDB 10.4 from GitHub source following my usual way:

openxs@ao756:~/git/server$ git checkout 10.4
...
openxs@ao756:~/git/server$ git pull
...
openxs@ao756:~/git/server$ git submodule update --init --recursive
Submodule path 'extra/wolfssl/wolfssl': checked out '9c87f979a7f1d3a6d786b260653d566c1d31a1c4'
Submodule path 'libmariadb': checked out '180c543704d627a50a52aaf60e24ca14e0ec4686'
Submodule path 'wsrep-lib': checked out 'f271ad0c6e3c647df83c1d5ec9cd26d77cef2337'
Submodule path 'wsrep-lib/wsrep-API/v26': checked out '76cf223c690845bbf561cb820a46e06a18ad80d1'
openxs@ao756:~/git/server$ git branch
  10.3
* 10.4
  10.5
  10.6
openxs@ao756:~/git/server$ git log -1
commit 583b72ad0ddbc46a7aaeda1c1373b89d4bded9ea (HEAD -> 10.4, origin/bb-10.4-merge, origin/10.4)
Merge: 473e85e9316 a4139f8d68b
Author: Oleksandr Byelkin <sanja@mariadb.com>
Date:   Fri May 7 11:50:24 2021 +0200

    Merge branch 'bb-10.4-release' into 10.4
openxs@ao756:~/git/server$ rm CMakeCache.txt
openxs@ao756:~/git/server$ cmake . -DCMAKE_INSTALL_PREFIX=/home/openxs/dbs/maria10.4 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DBUILD_CONFIG=mysql_release -DFEATURE_SET=community -DWITH_EMBEDDED_SERVER=OFF -DPLUGIN_TOKUDB=NO -DWITH_SSL=system
...
-- Generating done
-- Build files have been written to: /home/openxs/git/server
openxs@ao756:~/git/server$ time make -j 3
...
[100%] Building C object extra/mariabackup/CMakeFiles/mariabackup.dir/__/__/libmysqld/libmysql.c.o
[100%] Linking CXX executable mariabackup
[100%] Built target mariabackup

real    74m9,550s
user    134m10,837s
sys     7m0,387s

openxs@ao756:~/git/server$ make install && make clean
...
openxs@ao756:~/git/server$ cd ~/dbs/maria10.4
openxs@ao756:~/dbs/maria10.4$ bin/mysqld_safe --no-defaults --port=3309 --socket=/tmp/mariadb.sock --innodb_buffer_pool_size=1G --innodb_flush_log_at_trx_commit=2 &
[1] 27483
openxs@ao756:~/dbs/maria10.4$ 210507 19:15:37 mysqld_safe Logging to '/home/openxs/dbs/maria10.4/data/ao756.err'.
210507 19:15:37 mysqld_safe Starting mysqld daemon with databases from /home/openxs/dbs/maria10.4/data

openxs@ao756:~/dbs/maria10.4$ bin/mysql --socket=/tmp/mariadb.sock
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 8
Server version: 10.4.19-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> drop database if exists sbtest;
Query OK, 5 rows affected (1.297 sec)

MariaDB [(none)]> create database sbtest;
Query OK, 1 row affected (0.001 sec)

MariaDB [(none)]> exit
Bye

and compared to 10.4.15 with the following test:

openxs@ao756:~/dbs/maria10.4$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --threads=4 prepare
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)

Initializing worker threads...

Creating table 'sbtest2'...
Creating table 'sbtest3'...
Creating table 'sbtest4'...
Creating table 'sbtest1'...
Inserting 100000 records into 'sbtest1'
Inserting 100000 records into 'sbtest4'
Inserting 100000 records into 'sbtest2'
Inserting 100000 records into 'sbtest3'
Creating a secondary index on 'sbtest1'...
Creating a secondary index on 'sbtest4'...
Creating table 'sbtest5'...
Inserting 100000 records into 'sbtest5'
Creating a secondary index on 'sbtest2'...
Creating a secondary index on 'sbtest3'...
Creating a secondary index on 'sbtest5'...

openxs@ao756:~/dbs/maria10.4$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --time=300 --report-interval=10 --threads=4 run
...
    transactions:                        35630  (118.54 per sec.)
    queries:                             712600 (2370.87 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          300.5612s
    total number of events:              35630

Latency (ms):
         min:                                    3.30
         avg:                                   33.70
         max:                                 2200.17
         95th percentile:                      164.45
...

openxs@ao756:~/dbs/maria10.4.15$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --time=300 --report-interval=10 --threads=4 run
...
    transactions:                        56785  (189.25 per sec.)
    queries:                             1135700 (3784.99 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          300.0501s
    total number of events:              56785

Latency (ms):
         min:                                    3.15
         avg:                                   21.13
         max:                                  704.36
         95th percentile:                      108.68
...

So, basically, with the same test, with all tables fitting into the buffer pool (1G) and all but a few default settings, current MariaDB 10.4.19 demonstrates up to a 60% drop in throughput and an increase in 95th percentile latency on this netbook (even more than the 15% or so reported previously on a faster quad core Fedora desktop).

If you read MDEV-24272 carefully, the regression was tracked down to a specific commit, but I tried to apply various tools to actually see where exactly more time is spent now. Profiling with perf and creating on-CPU flame graphs had not given me any clear insight that would explain the increase in latency, so my next idea was to trace off-CPU time, that is, to find out how long the MariaDB server waits and where in the code that mostly happens.

For this I've used the offcputime tool:

openxs@ao756:~$ /usr/share/bcc/tools/offcputime -h
usage: offcputime [-h] [-p PID | -t TID | -u | -k] [-U | -K] [-d] [-f]
                  [--stack-storage-size STACK_STORAGE_SIZE]
                  [-m MIN_BLOCK_TIME] [-M MAX_BLOCK_TIME] [--state STATE]
                  [duration]

Summarize off-CPU time by stack trace

positional arguments:
  duration              duration of trace, in seconds

optional arguments:
  -h, --help            show this help message and exit
  -p PID, --pid PID     trace this PID only
  -t TID, --tid TID     trace this TID only
  -u, --user-threads-only
                        user threads only (no kernel threads)
  -k, --kernel-threads-only
                        kernel threads only (no user threads)
  -U, --user-stacks-only
                        show stacks from user space only (no kernel space
                        stacks)
  -K, --kernel-stacks-only
                        show stacks from kernel space only (no user space
                        stacks)
  -d, --delimited       insert delimiter between kernel/user stacks
  -f, --folded          output folded format
  --stack-storage-size STACK_STORAGE_SIZE
                        the number of unique stack traces that can be stored
                        and displayed (default 1024)
  -m MIN_BLOCK_TIME, --min-block-time MIN_BLOCK_TIME
                        the amount of time in microseconds over which we store
                        traces (default 1)
  -M MAX_BLOCK_TIME, --max-block-time MAX_BLOCK_TIME
                        the amount of time in microseconds under which we
                        store traces (default U64_MAX)
  --state STATE         filter on this thread state bitmask (eg, 2 ==
                        TASK_UNINTERRUPTIBLE) see include/linux/sched.h

examples:
    ./offcputime             # trace off-CPU stack time until Ctrl-C
    ./offcputime 5           # trace for 5 seconds only
    ./offcputime -f 5        # 5 seconds, and output in folded format
    ./offcputime -m 1000     # trace only events that last more than 1000 usec
    ./offcputime -M 10000    # trace only events that last less than 10000 usec
    ./offcputime -p 185      # only trace threads for PID 185
    ./offcputime -t 188      # only trace thread 188
    ./offcputime -u          # only trace user threads (no kernel)
    ./offcputime -k          # only trace kernel threads (no user)
    ./offcputime -U          # only show user space stacks (no kernel)
    ./offcputime -K          # only show kernel space stacks (no user)

I've stored the outputs in /dev/shm to have less impact on disk I/O that I suspected as one of the reasons:

openxs@ao756:~$ ls /dev/shm
openxs@ao756:~$ mkdir /dev/shm/out
openxs@ao756:~$ ls /dev/shm
out

Basically, the following commands were used to create folded (ready to use for building flame graphs) user-space-only stacks with the time spent off-CPU in them, over 60 seconds of tracing while the sysbench test was running on a clean setup of MariaDB 10.4.15 and then on current 10.4.19, and to create flame graphs based on those stacks where the start_thread frame is present (to clean up irrelevant details):

openxs@ao756:~$ sudo /usr/share/bcc/tools/offcputime -f -p `pidof mysqld` -U 60 > /dev/shm/out/offcpu_10415.out
WARNING: 27 stack traces lost and could not be displayed.
openxs@ao756:~$ cat /dev/shm/out/offcpu_10415.out | grep start_thread | ~/git/FlameGraph/flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us > /tmp/offcpu_10415.svg

openxs@ao756:~$ sudo /usr/share/bcc/tools/offcputime -f -p `pidof mysqld` -U 60 > /dev/shm/out/offcpu_10419.out
WARNING: 24 stack traces lost and could not be displayed.
openxs@ao756:~$ cat /dev/shm/out/offcpu_10419.out | grep start_thread | ~/git/FlameGraph/flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us > /tmp/offcpu_10419.svg

As a result I've got the following graphs (.png screenshots from real .svg files below). On 10.4.15 we spent around 43 seconds (out of 60 we monitored) off-CPU, mostly in do_command() and waiting for network I/O, and the graph was the following:

Off-CPU time for MariaDB 10.4.15

In case of 10.4.19 the graph is very different and we seem to have spent 79 seconds off-CPU, mostly in background io_handler_thread():

Off-CPU time for MariaDB 10.4.19

I was surprised to see more than 60 seconds spent off-CPU in this case. Maybe this is possible because I have 2 cores and MariaDB threads were waiting on both most of the time.

I've then tried to use a differential flame graph to highlight the call stacks the main difference is related to. I've created it from the existing folded stack traces with the following command:

openxs@ao756:~$ ~/git/FlameGraph/difffolded.pl /dev/shm/out/offcpu_10415.out /dev/shm/out/offcpu_10419.out | grep start_thread | ~/git/FlameGraph/flamegraph.pl --color=io --title="Off-CPU Time Diff Flame Graph" --countname=us > /tmp/offcpu_diff.svg

The resulting graph is presented below:

The source of increase is highlighted in red

Here we clearly see that the main increase in time spent waiting in 10.4.19 is related to io_handler_thread(), but the increase happened in almost all background threads.

* * *

To summarize:

  • When some performance regression happens you should check not only those code paths in the application where the software started to spend more time working, but also where it started to wait more.
  • In the case of https://jira.mariadb.org/browse/MDEV-24272 we clearly started to flush more to disk from the very beginning of the sysbench oltp_read_write test in newer versions of MariaDB 10.4.x, and on my slow encrypted HDD this matters a lot. The load that was supposed to be CPU-bound (as we have a large enough buffer pool) becomes disk-bound.
  • Flame graphs are cool for highlighting the difference, and in this post I've shown both a classical, smart way to produce them without too much impact, and a way to highlight the difference between them with a differential flame graph produced by the difffolded.pl tool created by Brendan Gregg.
  • Other cases when Flame Graphs may help MySQL or MariaDB DBAs are discussed during my upcoming Percona Live 2021 talk on May 12. See you there!
  • I'll get back to this nice regression bug to study the test case in more detail with other tools, maybe more than once. Stay tuned!

Sunday, February 28, 2021

5 Years of Working for MariaDB Corporation

March 1, 2016, was my first day of working for MariaDB Corporation. It means that I have worked for them a full 5 years in a row! It's the longest period I've ever spent working in the same company. I worked for more than 7 years in the MySQL Bugs Verification team of MySQL Support, but formally it was for 3 different companies over that period: MySQL AB, Sun and Oracle. So tonight, after 2 weekend on-call shifts in a row, I want to summarize what I lost over those 5 years and what I gained.

I never planned to leave Percona after just a bit more than 3 years, but I had to do something to either fix the direction of their services business development after April 2015, or at least to make some points that would be noticed and remembered. I made them by January 26, 2016, and then had to move on.

So, what I lost after leaving Percona:

  • In Percona I worked in the best Support team in the industry at that time!
  • In Percona I was proud to be working for a company that does the right things, both for business around open source software, Community and customers.
  • In Percona I was involved in decision making, I influenced bug fixing process and priorities and was a kind of authority on everything I cared to state or do.
  • I had a really good salary, regular bonuses, longer vacations or otherwise properly compensated extra working time and all the opportunities to become a public person in MySQL Community.
  • I spent a lot of time working with Peter Zaitsev and really brilliant engineers from Development, QA and Support alike.
  • After a few initial months of getting used to the amount and style of work, and then until April 14 or so, 2015, it was a really comfortable place for me to work at and do the things I like and am good at.

I lost most of the above when I left. No more decision making of any kind (it was my decision to avoid that while joining MariaDB, to begin with). No more bug processing or prioritizing. No more Percona Live conferences until 2019, when I finally managed to clarify the problems I had with my (cancelled) participation in Percona Live Europe 2015. Nobody asked me to blog about anything from 2016 until the beginning of 2020 (when the MariaDB Foundation got really interested in my public appearances). Joining MariaDB Corporation made me a suspicious member of the MySQL Community and eventually forced me to leave Planet MySQL, where my posts were not appreciated.

It takes just one short tweet of mine sharing a MySQL bug number to have some of those bugs immediately hidden from the Community and made private. It looks like people suspect I have some secret agenda and mission from MariaDB Corporation, while I have none related to MySQL, neither to the software nor to the bugs in it - I do it in my own time and based on my own views, which are not influenced by anyone...

Now what I gained from joining MariaDB Support team:

  • I still work in the best Support team in the industry, with new (to me) brilliant people, some of my former colleagues in MySQL and some of those I worked with in Percona and managed to get back into my team now in MariaDB Corporation.
  • I work harder than ever, at least twice as much as I did in Percona (at least speaking about real-life customer issues). The issues are more interesting, complex and challenging in general, and cover everything from MySQL 4.0.x to NDB cluster, MySQL 8.0.x and all versions of MariaDB Server, MaxScale and Connectors, and Galera clusters, with everything MySQL-related that Percona releases in between! This work has been properly compensated in recent years.
  • Yes, I do work on Galera issues a lot, read megabytes of logs and make sense of everything Galera. Who could have imagined back in 2015 that I'd go that way?
  • I work closely and directly with the best developers, from Codership's Galera team and MariaDB Foundation developers to Sergey Golubchik, Marko Mäkelä, Vladislav Vaintroub, Elena Stepanova and the good old MySQL Optimizer team (now MariaDB's), and Monty himself, and more... We chat, we talk and discuss technical topics almost every day! I do some of the work they ask for, and I build more things from source than ever. It's really great, and it had almost never been the case before I joined MariaDB. I love this part of the corporate culture here.
  • The blog that you are reading now is way more popular than it ever was before 2016. At good times in 2017 I had more than 1000 reads per day, for weeks and months.
  • I am presenting at conferences way more often than I ever did in Percona, from FOSDEM to Percona Live and everything MariaDB.
  • My influence and impact on the MySQL Community have increased. I was declared a MySQL Community Contributor of the Year 2019. As often happens, it's easier to make an impact when you are an outsider. They cannot ignore you, even if that's only because you are considered an "asshole" and an "enemy" with a "corporate agenda", for whatever reasons.

So far I do not regret the decision I made in favor of MariaDB back in 2016, even though it forced me to put up with or ignore many things I don't like at my current company. I am sorry that back in 2010 Monty and Kay were just 10 days too late to get me to the SkySQL of those times. I had already signed the contract with Oracle, and in 2012 there I really was mostly wasting my time, unfortunately...

That switch formally happened on March 1, 2016. It was a good decision.

Just a few random notes at the end of a hard 7-day week of work. I hope you won't blame me too much for these. I also hope I'll still have my job tomorrow :)

Sunday, February 14, 2021

What about the "Fun with Bugs" series on YouTube?

My "Fun with Bugs" series of blog posts about interesting or badly handled MySQL bug reports ended more than 7 months ago. The time had come for that. But I honestly miss that media for bitching about something wrong in MySQL once in a while...

The year of the COVID-19 pandemic, with video conferences replacing normal offline ones, forced me to start recording videos, and I came to like the process so much that I am now working on my wife's channel content as a hobby. I've created my own channel as well, for uploading some draft/bad/long/extended videos recorded in the process of working on my online talks:


Being fluent enough with recording videos at home (using different software tools and cameras) and publishing them on YouTube, I now wonder whether it makes sense to turn this activity into a regular, MySQL-focused one. My next talk(s) will be submitted to Percona Live 2021, but that means they may go live only in May, and I'd like to be on screens earlier than that.

So, I wonder: should I maybe share regular video recordings, let's say once a week every Tuesday, up to 5 minutes long at most, and devoted to some MySQL-related topics? What topics would you like me to cover? Would you mind if it were a five-minute talk about a recent MySQL bug report, or a few of them, either interesting in general or badly handled by my former Oracle colleagues? Something else to better spend megabytes of video on? Should I leave it to younger, more attractive and experienced speakers? Keep writing here, or stop bitching about the bugs once and for good?

I am waiting for your comments on this blog post and in the social media where I'll share it, until March 5, 2021. Then I'll decide how to proceed with this regular YouTube videos idea.

Friday, February 5, 2021

On Upcoming FOSDEM 2021

FOSDEM 2021 starts tomorrow. This time I have 3 talks to present. Here they are, in the order of appearance, with links to my related blog posts:

  1. "Upgrading to a newer major version of MariaDB" - it should start at 10:30 on Saturday in the MariaDB devroom and is mostly devoted to mysql_upgrade. Related blog posts are:
  2. "Monitoring MariaDB Server with bpftrace on Linux" - it should start at 12:40 on Sunday in the Monitoring and Observability devroom and is devoted to bpftrace basics. A lot more information is provided in recent blog posts:
  3. "Linux /proc filesystem for MySQL DBAs" - this talk should start at 15:00 on Sunday in MySQL devroom and is devoted to a totally different (comparing to dynamic tracing I am so big fan of) way to get insights about MySQL internal working, waits etc - sampling of files in /proc file system. I had written 3 blog posts on the topic:

Slides are uploaded to the talk pages and will be shared via SlideShare. Draft, longer versions of the talks that I've recorded but that the FOSDEM system had not accepted will also be shared on my YouTube channel soon.

Usual view on my way to FOSDEM ULB site... I miss Brussels!

Other talks in these devrooms that I consider interesting (will not be able to attend them all live though):

See you there! FOSDEM was a real driver of my more or less advanced performance studies this year.

Saturday, January 30, 2021

Playing with recent bpftrace and MariaDB 10.5 on Fedora - Part V, Proper Way To Summarize Time Spent per Stack

In my previous post in this series I was tracing the __lll_lock_wait() function and measuring the time spent in it per unique MariaDB server stack trace. I got results that did not look totally wrong or useless and even presented them as a flame graph, but it took a long time, and during all that time, for minutes, the sysbench test TPS and QPS results were notably smaller. So, it was useful tracing, but surely not low-impact tracing.

I already stated before that the approach used is surely not optimal. There should be some better way to summarize the stack traces without such a performance impact. Today I continued to work on this problem. First of all, I finally measured exactly how much time I had to wait for the results from the bpftrace command:

[openxs@fc31 ~]$ time sudo ./lll_lock_wait.bt 2>/dev/null >/tmp/lll_lock_perf_stacks1.txt

real    9m51.904s
user    8m25.576s
sys     1m8.057s
[openxs@fc31 ~]$ ls -l /tmp/lll_lock_perf_stacks1.txt
-rw-rw-r--. 1 openxs openxs 1564291 Jan 30 11:58 /tmp/lll_lock_perf_stacks1.txt

So, the test was running for 300 seconds (5 minutes) and the monitoring results (for about 20 seconds of tracing) came 5 minutes later! This is definitely NOT acceptable for any regular use.

I started to check things and test more. I noted in the Reference Guide (which I've read entirely a couple of times yesterday and today) that the ustack() function of bpftrace can accept a second argument, the number of frames to show in the stack. Based on the previous flame graphs the most interesting stacks were not long, so I decided to check whether 8 or even just 5 frames may still be enough for useful results while reducing the performance impact. I changed the code here:

uprobe:/lib64/libpthread.so.0:__lll_lock_wait
/comm == "mariadbd"/
{
    @start[tid] = nsecs;
    @tidstack[tid] = ustack(perf, 5);    /* <-- I also tried 8 */
}

and got these run times for 8 and 5 frames:

[openxs@fc31 ~]$ time sudo ./lll_lock_wait.bt 2>/dev/null >/tmp/lll_lock_perf_stacks1.txt

real    9m39.195s
user    8m9.188s
sys     1m8.430s

[openxs@fc31 ~]$ ls -l /tmp/lll_lock_perf_stacks1.txt
-rw-rw-r--. 1 openxs openxs 1085019 Jan 30 12:12 /tmp/lll_lock_perf_stacks1.txt

[openxs@fc31 ~]$ time sudo ./lll_lock_wait.bt 2>/dev/null >/tmp/lll_lock_perf_stacks2.txt

real    6m6.801s
user    5m5.537s
sys     0m42.226s

[openxs@fc31 ~]$ ls -l /tmp/lll_lock_perf_stacks2.txt
-rw-rw-r--. 1 openxs openxs 728072 Jan 30 12:19 /tmp/lll_lock_perf_stacks2.txt

So, reducing the stack to 5 frames made a notable impact, and the QPS drop was a bit smaller, but still, waiting for 5 minutes is not an option in general.

I had not found functions powerful enough to replace awk and collapse the stacks inside the bpftrace program, so I decided to export them "raw", on every hit of the uretprobe, and then process them with the usual Linux tools. I'll skip the steps, tests and failures and just show the resulting new program for the task:

[openxs@fc31 ~]$ cat lll_lock_wait2.bt
#!/usr/bin/env bpftrace

BEGIN
{
/*    printf("Tracing time from __lll_lock_wait, Ctrl-C to stop\n"); */
}

interval:s:$1 { exit(); }

uprobe:/lib64/libpthread.so.0:__lll_lock_wait
/comm == "mariadbd"/
{
    @start[tid] = nsecs;
    @tidstack[tid] = ustack(perf);    
}

uretprobe:/lib64/libpthread.so.0:__lll_lock_wait
/comm == "mariadbd" && @start[tid] != 0/
{
    $now = nsecs;
    $time = $now - @start[tid];
    @futexstack[@tidstack[tid]] += $time;
    print(@futexstack);
    delete(@futexstack[@tidstack[tid]]);
/*
    printf("Thread: %u, time: %d\n", tid, $time);
*/
    delete(@start[tid]);
    delete(@tidstack[tid]);
}

END
{
    clear(@start);
    clear(@tidstack);
    clear(@futexstack);
}

Changes vs the previous version are highlighted. Basically, I print and then delete every stack collected, as the function call ends, along with the time spent, to sum them up externally later. The function used is the following:

print(@x[, top [, div]]) - Print the map, optionally the top entries only and with a divisor

I've also added the interval probe at the beginning, using the first program argument as the number of seconds to run. It just forces an exit after N seconds (0 by default means immediate exit, and a wrong format is reported as an error).

Now if I run it while the test is running, this way (the awk part strips addresses and offsets from the perf-format frames, joins the function names with ';', and, on the line that closes each printed map entry, outputs the accumulated time followed by the collapsed stack):

[openxs@fc31 ~]$ time sudo ./lll_lock_wait2.bt 10 2>/dev/null | awk '
BEGIN { s = ""; }
/^@futexstack\[\]/ { s = ""; }
/^@futexstack/ { s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s ";" targ } else { s = targ } }
/^]/ { print $2, s }
' > /tmp/collapsed_lll_lock_v2_raw.txt


real    0m10.646s
user    0m5.648s
sys     0m0.843s
[openxs@fc31 ~]$ ls -l /tmp/collapsed_lll_lock_v2_raw.txt
-rw-rw-r--. 1 openxs openxs 1566 Jan 30 13:12 /tmp/collapsed_lll_lock_v2_raw.txt
[openxs@fc31 ~]$ cat /tmp/collapsed_lll_lock_v2_raw.txt
15531 __lll_lock_wait;
1233833 __lll_lock_wait;
10638 __lll_lock_wait;buf_read_page_background;btr_cur_prefetch_siblings;btr_cur_optimistic_delete_func;row_purge_remove_sec_if_poss_leaf;row_purge_record_func;row_purge_step;que_run_threads;purge_worker_callback;tpool::task_group::execute;tpool::thread_pool_generic::worker_main;;;;
7170 __lll_lock_wait;tpool::thread_pool_generic::worker_main;;;;
273330 __lll_lock_wait;tdc_acquire_share;open_table;open_tables;open_and_lock_tables;execute_sqlcom_select;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
1193083 __lll_lock_wait;trx_undo_report_row_operation;btr_cur_update_in_place;btr_cur_optimistic_update;row_upd_clust_rec;row_upd_clust_step;row_upd_step;row_update_for_mysql;ha_innobase::update_row;handler::ha_update_row;mysql_update;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
183353 __lll_lock_wait;
43231 __lll_lock_wait;MDL_lock::remove_ticket;MDL_context::release_lock;ha_commit_trans;trans_commit;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread

I get a very condensed output almost immediately! I can afford running it for 60 seconds without much trouble:

[openxs@fc31 ~]$ time sudo ./lll_lock_wait2.bt 60 2>/dev/null | awk '
BEGIN { s = ""; }
/^@futexstack\[\]/ { s = ""; }
/^@futexstack/ { s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s ";" targ } else { s = targ } }
/^]/ { print $2, s }
' > /tmp/collapsed_lll_lock_v2_raw.txt


real    1m0.959s
user    0m44.146s
sys     0m6.126s


[openxs@fc31 ~]$ ls -l /tmp/collapsed_lll_lock_v2_raw.txt
-rw-rw-r--. 1 openxs openxs 12128 Jan 30 13:17 /tmp/collapsed_lll_lock_v2_raw.txt
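
The awk part of that pipeline is rather dense, so here is the same collapsing logic written as a standalone commented script, just for readability (a sketch equivalent in intent to the one-liner above; I did not actually use it as a separate file):

#!/usr/bin/awk -f
# Collapse the raw output of lll_lock_wait2.bt into
# "<nanoseconds> func1;func2;..." lines, one per printed map entry.
BEGIN { s = "" }
/^@futexstack/ { s = "" }   # a new printed map entry starts: reset the collapsed stack
/^\t/ {                     # a stack frame line: field 2 is the symbol, with +offset or (...) appended
    if (index($2, "(") > 0)
        targ = substr($2, 1, index($2, "(") - 1)
    else
        targ = substr($2, 1, index($2, "+") - 1)
    s = (s != "") ? s ";" targ : targ
}
/^]/ { print $2, s }        # the closing "]: <nanoseconds>" line: print the time and the stack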

The impact of these 60 seconds is visible in the sysbench output:

[openxs@fc31 maria10.5]$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --threads=32 --report-interval=10 --time=300 run
sysbench 1.1.0-174f3aa (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 32
Report intermediate results every 10 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 10s ] thds: 32 tps: 625.68 qps: 12567.66 (r/w/o: 8800.49/2512.81/1254.36) lat (ms,95%): 282.25 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 32 tps: 891.51 qps: 17825.69 (r/w/o: 12479.31/3563.26/1783.13) lat (ms,95%): 142.39 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 32 tps: 512.09 qps: 10230.44 (r/w/o: 7163.62/2042.55/1024.27) lat (ms,95%): 287.38 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 32 tps: 369.10 qps: 7390.48 (r/w/o: 5171.39/1480.90/738.20) lat (ms,95%): 350.33 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 32 tps: 417.83 qps: 8347.73 (r/w/o: 5845.14/1667.03/835.56) lat (ms,95%): 390.30 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 32 tps: 484.03 qps: 9687.55 (r/w/o: 6782.09/1937.31/968.16) lat (ms,95%): 292.60 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 32 tps: 431.98 qps: 8640.35 (r/w/o: 6049.38/1727.01/863.95) lat (ms,95%): 344.08 err/s: 0.00 reconn/s: 0.00
[ 80s ] thds: 32 tps: 354.60 qps: 7097.39 (r/w/o: 4968.86/1419.32/709.21) lat (ms,95%): 419.45 err/s: 0.00 reconn/s: 0.00
[ 90s ] thds: 32 tps: 380.98 qps: 7600.82 (r/w/o: 5317.17/1521.80/761.85) lat (ms,95%): 520.62 err/s: 0.00 reconn/s: 0.00
[ 100s ] thds: 32 tps: 423.01 qps: 8467.99 (r/w/o: 5928.71/1693.16/846.13) lat (ms,95%): 397.39 err/s: 0.00 reconn/s: 0.00
[ 110s ] thds: 32 tps: 475.66 qps: 9525.07 (r/w/o: 6669.22/1904.53/951.32) lat (ms,95%): 344.08 err/s: 0.00 reconn/s: 0.00
[ 120s ] thds: 32 tps: 409.13 qps: 8171.48 (r/w/o: 5718.38/1634.94/818.17) lat (ms,95%): 458.96 err/s: 0.00 reconn/s: 0.00
[ 130s ] thds: 32 tps: 190.91 qps: 3826.72 (r/w/o: 2682.19/762.62/381.91) lat (ms,95%): 450.77 err/s: 0.00 reconn/s: 0.00
[ 140s ] thds: 32 tps: 611.40 qps: 12229.99 (r/w/o: 8558.99/2448.20/1222.80) lat (ms,95%): 223.34 err/s: 0.00 reconn/s: 0.00
[ 150s ] thds: 32 tps: 581.99 qps: 11639.19 (r/w/o: 8148.53/2326.68/1163.99) lat (ms,95%): 287.38 err/s: 0.00 reconn/s: 0.00
[ 160s ] thds: 32 tps: 653.21 qps: 13058.51 (r/w/o: 9139.57/2612.52/1306.41) lat (ms,95%): 257.95 err/s: 0.00 reconn/s: 0.00
[ 170s ] thds: 32 tps: 561.87 qps: 11231.27 (r/w/o: 7860.53/2246.99/1123.75) lat (ms,95%): 320.17 err/s: 0.00 reconn/s: 0.00
[ 180s ] thds: 32 tps: 625.66 qps: 12526.32 (r/w/o: 8770.19/2504.82/1251.31) lat (ms,95%): 235.74 err/s: 0.00 reconn/s: 0.00
[ 190s ] thds: 32 tps: 554.50 qps: 11088.01 (r/w/o: 7760.61/2218.40/1109.00) lat (ms,95%): 325.98 err/s: 0.00 reconn/s: 0.00
[ 200s ] thds: 32 tps: 607.10 qps: 12143.89 (r/w/o: 8501.49/2428.20/1214.20) lat (ms,95%): 344.08 err/s: 0.00 reconn/s: 0.00
[ 210s ] thds: 32 tps: 424.17 qps: 8467.63 (r/w/o: 5925.43/1693.87/848.33) lat (ms,95%): 397.39 err/s: 0.00 reconn/s: 0.00
[ 220s ] thds: 32 tps: 466.80 qps: 9335.03 (r/w/o: 6533.32/1868.31/933.40) lat (ms,95%): 397.39 err/s: 0.00 reconn/s: 0.00
[ 230s ] thds: 32 tps: 365.83 qps: 7330.56 (r/w/o: 5133.49/1465.21/731.86) lat (ms,95%): 475.79 err/s: 0.00 reconn/s: 0.00
[ 240s ] thds: 32 tps: 411.27 qps: 8218.45 (r/w/o: 5754.25/1641.87/822.34) lat (ms,95%): 467.30 err/s: 0.00 reconn/s: 0.00
[ 250s ] thds: 32 tps: 127.10 qps: 2534.71 (r/w/o: 1772.60/507.70/254.40) lat (ms,95%): 450.77 err/s: 0.00 reconn/s: 0.00
[ 260s ] thds: 32 tps: 642.35 qps: 12856.29 (r/w/o: 8999.09/2572.50/1284.70) lat (ms,95%): 282.25 err/s: 0.00 reconn/s: 0.00
[ 270s ] thds: 32 tps: 603.80 qps: 12078.79 (r/w/o: 8456.89/2414.30/1207.60) lat (ms,95%): 314.45 err/s: 0.00 reconn/s: 0.00
[ 280s ] thds: 32 tps: 642.70 qps: 12857.60 (r/w/o: 9001.40/2570.80/1285.40) lat (ms,95%): 257.95 err/s: 0.00 reconn/s: 0.00
[ 290s ] thds: 32 tps: 716.57 qps: 14325.60 (r/w/o: 10026.98/2865.48/1433.14) lat (ms,95%): 144.97 err/s: 0.00 reconn/s: 0.00
[ 300s ] thds: 32 tps: 611.16 qps: 12219.69 (r/w/o: 8551.80/2445.66/1222.23) lat (ms,95%): 292.60 err/s: 0.00 reconn/s: 0.00
SQL statistics:
    queries performed:
        read:                            2124836
        write:                           607096
        other:                           303548
        total:                           3035480
    transactions:                        151774 (505.83 per sec.)
    queries:                             3035480 (10116.62 per sec.)

    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

Throughput:
    events/s (eps):                      505.8309
    time elapsed:                        300.0489s
    total number of events:              151774

Latency (ms):
         min:                                    2.17
         avg:                                   63.25
         max:                                 8775.26
         95th percentile:                      331.91
         sum:                              9599021.34

Threads fairness:
    events (avg/stddev):           4742.9375/42.76
    execution time (avg/stddev):   299.9694/0.03

but the drop continues only for some time after the end of monitoring, and the overall QPS is not much affected.

Now what about the top 5 stacks by wait time? Here they are (well, I have NOT bothered to sum up the several entries with the same stack; a small awk sketch of how that could be done follows the output below, and flamegraph.pl will do it for us later anyway):

[openxs@fc31 ~]$ cat /tmp/collapsed_lll_lock_v2_raw.txt | sort -r -n -k 1,1 | head -5
340360731 __lll_lock_wait;btr_cur_optimistic_insert;row_ins_sec_index_entry_low;row_ins_sec_index_entry;row_ins_step;row_insert_for_mysql;ha_innobase::write_row;handler::ha_write_row;write_record;mysql_insert;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
335903154 __lll_lock_wait;btr_cur_optimistic_insert;row_ins_sec_index_entry_low;row_ins_sec_index_entry;row_upd_sec_index_entry;row_upd_step;row_update_for_mysql;ha_innobase::update_row;handler::ha_update_row;mysql_update;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
287819974 __lll_lock_wait;btr_cur_optimistic_insert;row_ins_sec_index_entry_low;row_ins_sec_index_entry;row_ins_step;row_insert_for_mysql;ha_innobase::write_row;handler::ha_write_row;write_record;mysql_insert;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
281374237 __lll_lock_wait;btr_cur_optimistic_insert;row_ins_sec_index_entry_low;row_ins_sec_index_entry;row_ins_step;row_insert_for_mysql;ha_innobase::write_row;handler::ha_write_row;write_record;mysql_insert;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
268817930 __lll_lock_wait;buf_page_init_for_read;buf_read_page_background;btr_cur_prefetch_siblings;btr_cur_optimistic_delete_func;row_purge_remove_sec_if_poss_leaf;row_purge_record_func;row_purge_step;que_run_threads;purge_worker_callback;tpool::task_group::execute;tpool::thread_pool_generic::worker_main;;;;
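
As promised, summing up identical collapsed stacks could be done with a small awk script like this (a sketch I did not bother to run for this post; it assumes the "time stack" line format shown above):

[openxs@fc31 ~]$ awk '{ t = $1; $1 = ""; sub(/^ /, ""); sum[$0] += t }
     END { for (s in sum) print sum[s], s }' /tmp/collapsed_lll_lock_v2_raw.txt | sort -r -n -k 1,1 | head -5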

A lot of contention happens on INSERTing rows. I'd say it's expected! Now, to create the flame graph we just have to re-order the output a bit, as flamegraph.pl expects the collapsed stack first and the count (nanoseconds in our case) second:

[openxs@fc31 ~]$ cat /tmp/collapsed_lll_lock_v2_raw.txt | awk '{ if (length($2) > 0) {print $2, $1} }' | /mnt/home/openxs/git/FlameGraph/flamegraph.pl --title="Time spent in ___lll_lock_wait in MariaDB 10.5, all frames" --countname=nsecs > ~/Documents/lll_lock_v2_2.svg

Here it is:

Most mutex waits for the oltp_read_write test happen on inserting rows...

Monitoring over 60 seconds allowed me to see a more realistic picture of contention. It mostly happens on inserting rows by 32 threads into just 5 tables on a system with only 4 cores.

Now it seems we have a tool that can be used more than once in a lifetime and that provides useful information, fast.

* * *

To summarize:

  1. One should not even try to summarize all stacks in a single associative array inside bpftrace for any function that is called too often in many different contexts! There is a limit on the number of items in a map that you may hit, and the impact of storing and exporting all of this is too high for monitoring longer than a couple of seconds.
  2. Cleaning up stack traces should be done externally, by the usual Linux text processing tools that produce a smaller summary output. The bpftrace program should NOT be designed to collect a lot of data for a long time and then output it all at the end. It looks like exporting smaller chunks to user space regularly is a better approach.
  3. We can use simple command line arguments for the program that are literally substituted as parts of probe definitions. The next step would be to make a more generic tool that takes the binary and the function name to probe as command line arguments (see the sketch after this list).
  4. bpftrace is cool!
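
Here is a rough, untested sketch of what such a generic tool might look like. The file name func_time.sh and the map name @stacks are made up for this illustration, and I let the shell substitute the binary path and function name into the probe definitions before bpftrace parses the program, so no new bpftrace features are assumed:

#!/bin/bash
# func_time.sh (hypothetical): print the stack and time spent for every
# completed call of FUNC in BIN, for N seconds (10 by default).
# Usage: ./func_time.sh /lib64/libpthread.so.0 __lll_lock_wait 10
BIN="$1"
FUNC="$2"
DURATION="${3:-10}"

# Note: unlike lll_lock_wait2.bt, this sketch has no /comm == "mariadbd"/ filter.
sudo bpftrace -e "
interval:s:$DURATION { exit(); }

uprobe:$BIN:$FUNC
{
    @start[tid] = nsecs;
    @tidstack[tid] = ustack(perf);
}

uretprobe:$BIN:$FUNC
/@start[tid] != 0/
{
    @stacks[@tidstack[tid]] += nsecs - @start[tid];
    print(@stacks);
    delete(@stacks[@tidstack[tid]]);
    delete(@start[tid]);
    delete(@tidstack[tid]);
}

END
{
    clear(@start);
    clear(@tidstack);
    clear(@stacks);
}
"

Its output could then be piped through the same awk post-processing and flamegraph.pl steps as shown above.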