Wednesday, May 12, 2021

Dynamic Tracing of Memory Allocations in MySQL With bcc Tools

Last year I started my lame attempts to apply different Linux dynamic tracing tools and approaches to frequent events like memory allocations. In this blog post I already described how to use perf to add a user probe that traces malloc() calls along with the number of bytes requested. Unfortunately that approach is not practical for production use for more than several seconds.

Recently I played with bpftrace a lot and so far ended up with an easy way to trace malloc() calls and call stacks, and was on my way to tracing only outstanding allocations, that is, caring only about memory areas not yet freed. If you are interested, the primitive approach may look like this:

[openxs@fc33 ~]$ cat malloc.bt
#!/usr/bin/env bpftrace

BEGIN
{
    printf("Tracing MariaDB's malloc() calls, Ctrl-C to stop\n");
}

// exit after $1 seconds (the first positional parameter)
interval:s:$1 { exit(); }

// on malloc() entry, remember the requested size for this thread
uprobe:/lib64/libc.so.6:malloc
/ comm == "mariadbd" /
{
    @size[tid] += arg0;
/*    printf("Allocating %d bytes in thread %u...\n", arg0, tid);  */
}

// on malloc() return, attribute the size to the returned address and to the user stack
uretprobe:/lib64/libc.so.6:malloc
/ comm == "mariadbd" && @size[tid] > 0 /
{
    @memory[tid,retval] = @size[tid];
    @stack[ustack(perf)] += @size[tid];

    print(@stack);
    clear(@stack);

    delete(@size[tid]);
}

// on free(), forget the outstanding allocation at this address
uprobe:/lib64/libc.so.6:free
/ comm == "mariadbd" /
{
    delete(@memory[tid, arg0]);
/*    printf("Freeing %p...\n", arg0); */
}

END
{
    clear(@size);
    clear(@memory);
    clear(@stack);
}
[openxs@fc33 ~]$

But while it works (both for the per-call tracing that is commented out above and for summarizing the non-freed allocations) and produced output as expected:

[openxs@fc33 ~]$ time sudo ./malloc.bt 1 2>/dev/null >/tmp/malloc_raw.txt

real    8m47.963s
user    2m53.513s
sys     5m50.685s

[openxs@fc33 maria10.5]$ ls -l /tmp/malloc_raw.txt
-rw-r--r--. 1 openxs openxs 461675 Apr 22 10:13 /tmp/malloc_raw.txt
[openxs@fc33 ~]$ tail /tmp/malloc_raw.txt
        558c68a27d12 row_purge_step(que_thr_t*)+770 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c689e8256 que_run_threads(que_thr_t*)+2166 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68a47ab7 purge_worker_callback(void*)+375 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68b88989 tpool::task_group::execute(tpool::task*)+137 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68b87bdf tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        7f599ae455f4 0x7f599ae455f4 ([unknown])
        558c69fc26b0 0x558c69fc26b0 ([unknown])
        558c68b87cb0 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        dde907894810c083 0xdde907894810c083 ([unknown])
]: 33

...

@stack[
        558c689bd25c mem_heap_create_block_func(mem_block_info_t*, unsigned long, unsigned long)+108 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68a3f2ae row_vers_old_has_index_entry(bool, unsigned char const*, mtr_t*, dict_index_t*, dtuple_t const*, unsigned long, unsigned long)+126 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68a24c60 row_purge_poss_sec(purge_node_t*, dict_index_t*, dtuple_t const*, btr_pcur_t*, mtr_t*, bool)+512 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68a2627b row_purge_remove_sec_if_poss_leaf(purge_node_t*, dict_index_t*, dtuple_t const*)+971 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68a272c3 row_purge_record_func(purge_node_t*, unsigned char*, que_thr_t const*, bool)+1459 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68a27d12 row_purge_step(que_thr_t*)+770 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c689e8256 que_run_threads(que_thr_t*)+2166 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68a47ab7 purge_worker_callback(void*)+375 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68b88989 tpool::task_group::execute(tpool::task*)+137 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68b87bdf tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        7f599ae455f4 0x7f599ae455f4 ([unknown])
        558c69fc26b0 0x558c69fc26b0 ([unknown])
        558c68b87cb0 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        dde907894810c083 0xdde907894810c083 ([unknown])
]: 1152

it took more than 8 minutes(!) to process the data collected during just 1 second of tracing under high load, and it caused a notable load:

top - 09:59:04 up  1:16,  3 users,  load average: 1.60, 7.14, 5.31
Tasks: 228 total,   2 running, 226 sleeping,   0 stopped,   0 zombie
%Cpu(s): 10.8 us, 17.5 sy,  0.0 ni, 71.6 id,  0.0 wa,  0.2 hi,  0.0 si,  0.0 st
MiB Mem :   7916.5 total,   1759.6 free,   2527.0 used,   3629.9 buff/cache
MiB Swap:   3958.0 total,   3958.0 free,      0.0 used.   4910.8 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  14813 root      20   0  307872 111852  75516 R  99.7   1.4   7:57.17 bpftrace
...

and a drop in performance of the system in the meantime:

[openxs@fc33 maria10.5]$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --threads=32 --report-interval=10 --time=300 run
sysbench 1.1.0-174f3aa (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 32
Report intermediate results every 10 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 10s ] thds: 32 tps: 759.62 qps: 15246.37 (r/w/o: 10676.06/3047.88/1522.44) lat (ms,95%): 80.03 err/s: 0.00 reconn/s: 0.00
...
[ 70s ] thds: 32 tps: 708.20 qps: 14174.96 (r/w/o: 9920.64/2837.91/1416.41) lat (ms,95%): 74.46 err/s: 0.00 reconn/s: 0.00
[ 80s ] thds: 32 tps: 354.60 qps: 7080.28 (r/w/o: 4964.46/1406.62/709.21) lat (ms,95%): 134.90 err/s: 0.00 reconn/s: 0.00
[ 90s ] thds: 32 tps: 332.91 qps: 6661.34 (r/w/o: 4657.50/1338.03/665.81) lat (ms,95%): 132.49 err/s: 0.00 reconn/s: 0.00

...

So I mention my lame tracing approach here only so that nobody tries to do the same: monitoring that causes a 2 times drop in TPS for minutes is hardly acceptable. I obviously made some mistake that I have yet to identify. Probably resolving stack traces and summarizing them in kernel context was too much for the system, and I cannot do much better in bpftrace itself, unless I use it just to produce raw traces.
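If I get back to this, the obvious direction to try is to let bpftrace emit only raw per-call events, with no per-stack maps at all, and do all the matching and aggregation in post-processing. A rough, untested sketch of such a raw-trace variant (same library path and process name as in the script above) might be:

#!/usr/bin/env bpftrace

// remember the requested size on malloc() entry
uprobe:/lib64/libc.so.6:malloc
/ comm == "mariadbd" /
{
    @size[tid] = arg0;
}

// one raw line per completed malloc(): thread, size, returned address
uretprobe:/lib64/libc.so.6:malloc
/ comm == "mariadbd" && @size[tid] > 0 /
{
    printf("malloc %u %d %p\n", tid, @size[tid], retval);
    delete(@size[tid]);
}

// one raw line per free(): thread, freed address
uprobe:/lib64/libc.so.6:free
/ comm == "mariadbd" /
{
    printf("free %u %p\n", tid, arg0);
}

Whether streaming that many printf() events per second from a busy server is actually cheaper than the in-kernel maps is something I have not measured yet.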

The approach above is also too primitive, as I traced only malloc(), while realloc() and calloc() calls may be used as well; the extra probes that would be needed are sketched below.
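For completeness, covering those allocators in the same primitive script might look like this untested sketch (for calloc() the requested amount is nmemb * size, while realloc() both allocates the new size and releases the old area):

// calloc(nmemb, size): the requested amount is the product of the two arguments
uprobe:/lib64/libc.so.6:calloc
/ comm == "mariadbd" /
{
    @size[tid] += arg0 * arg1;
}

// realloc(ptr, size): account for the new size and treat the old area as freed
uprobe:/lib64/libc.so.6:realloc
/ comm == "mariadbd" /
{
    delete(@memory[tid, arg0]);
    @size[tid] += arg1;
}

Matching uretprobe blocks (same as the one for malloc() above) would still be needed to attribute the bytes to the returned addresses. Anyway, in the hope of seeing how this task is approached by really experienced people, I checked what bcc tools provide for tracing memory allocations.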

The memleak.py tool there is quite advanced. It allows one both to trace individual calls and to output periodic summaries of outstanding allocations:

openxs@ao756:~/git/BPF-tools$ /usr/share/bcc/tools/memleak -h
usage: memleak [-h] [-p PID] [-t] [-a] [-o OLDER] [-c COMMAND]
               [--combined-only] [--wa-missing-free] [-s SAMPLE_RATE] [-T TOP]
               [-z MIN_SIZE] [-Z MAX_SIZE] [-O OBJ] [--percpu]
               [interval] [count]

Trace outstanding memory allocations that weren't freed.
Supports both user-mode allocations made with libc functions and kernel-mode
allocations made with kmalloc/kmem_cache_alloc/get_free_pages and corresponding
memory release functions.

positional arguments:
  interval              interval in seconds to print outstanding allocations
  count                 number of times to print the report before exiting

optional arguments:
  -h, --help            show this help message and exit
  -p PID, --pid PID     the PID to trace; if not specified, trace kernel
                        allocs
  -t, --trace           print trace messages for each alloc/free call
  -a, --show-allocs     show allocation addresses and sizes as well as call
                        stacks
  -o OLDER, --older OLDER
                        prune allocations younger than this age in
                        milliseconds
  -c COMMAND, --command COMMAND
                        execute and trace the specified command
  --combined-only       show combined allocation statistics only
  --wa-missing-free     Workaround to alleviate misjudgments when free is
                        missing
  -s SAMPLE_RATE, --sample-rate SAMPLE_RATE
                        sample every N-th allocation to decrease the overhead
  -T TOP, --top TOP     display only this many top allocating stacks (by size)
  -z MIN_SIZE, --min-size MIN_SIZE
                        capture only allocations larger than this size
  -Z MAX_SIZE, --max-size MAX_SIZE
                        capture only allocations smaller than this size
  -O OBJ, --obj OBJ     attach to allocator functions in the specified object
  --percpu              trace percpu allocations

EXAMPLES:

./memleak -p $(pidof allocs)
        Trace allocations and display a summary of "leaked" (outstanding)
        allocations every 5 seconds
./memleak -p $(pidof allocs) -t
        Trace allocations and display each individual allocator function call
./memleak -ap $(pidof allocs) 10
        Trace allocations and display allocated addresses, sizes, and stacks
        every 10 seconds for outstanding allocations
./memleak -c "./allocs"
        Run the specified command and trace its allocations
./memleak
        Trace allocations in kernel mode and display a summary of outstanding
        allocations every 5 seconds
./memleak -o 60000
        Trace allocations in kernel mode and display a summary of outstanding
        allocations that are at least one minute (60 seconds) old
./memleak -s 5
        Trace roughly every 5th allocation, to reduce overhead
openxs@ao756:~/git/BPF-tools$

I've applied it to MySQL 8.0.25, recently built on my Ubuntu 20.04 netbook and running the sysbench oltp_read_write load test:

openxs@ao756:~/dbs/8.0$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=root --mysql-socket=/tmp/mysql8.sock --mysql-db=sbtest --time=300 --report-interval=10 --threads=4 run
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 4
Report intermediate results every 10 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 10s ] thds: 4 tps: 18.49 qps: 377.44 (r/w/o: 264.49/75.57/37.38) lat (ms,95%): 344.08 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 4 tps: 14.80 qps: 296.01 (r/w/o: 207.21/59.20/29.60) lat (ms,95%): 530.08 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 4 tps: 25.99 qps: 519.89 (r/w/o: 363.92/103.98/51.99) lat (ms,95%): 314.45 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 4 tps: 25.30 qps: 506.04 (r/w/o: 354.23/101.21/50.60) lat (ms,95%): 292.60 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 4 tps: 21.90 qps: 437.92 (r/w/o: 306.54/87.58/43.79) lat (ms,95%): 356.70 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 4 tps: 23.51 qps: 470.05 (r/w/o: 329.10/93.93/47.01) lat (ms,95%): 308.84 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 4 tps: 20.29 qps: 405.99 (r/w/o: 284.12/81.28/40.59) lat (ms,95%): 450.77 err/s: 0.00 reconn/s: 0.00
[ 80s ] thds: 4 tps: 20.51 qps: 408.20 (r/w/o: 286.17/81.02/41.01) lat (ms,95%): 390.30 err/s: 0.00 reconn/s: 0.00

[ 90s ] thds: 4 tps: 22.80 qps: 457.80 (r/w/o: 320.03/92.18/45.59) lat (ms,95%): 383.33 err/s: 0.00 reconn/s: 0.00
^C

Ignore the absolute numbers, but note that (unlike with my bpftrace program) there was no very significant drop in QPS during the 20+ seconds while I was collecting stacks for outstanding allocations in another shell:

openxs@ao756:~/git/BPF-tools$ time sudo /usr/share/bcc/tools/memleak -p $(pidof mysqld) --top 100 >/tmp/memleak.out
[sudo] password for openxs:
^C
real    0m21,142s
user    0m0,998s
sys     0m0,466s

Now, what was collected? Let's check the top 40 rows:

openxs@ao756:~/git/BPF-tools$ head -40 /tmp/memleak.out
Attaching to pid 3416, Ctrl+C to quit.
[13:51:26] Top 100 stacks with outstanding allocations:
        1536 bytes in 2 allocations from stack
                mem_heap_create_block_func(mem_block_info_t*, unsigned long, unsigned long)+0xc8 [mysqld]
                mem_heap_add_block(mem_block_info_t*, unsigned long)+0x53 [mysqld]
                row_vers_build_for_consistent_read(unsigned char const*, mtr_t*, dict_index_t*, unsigned long**, ReadView*, mem_block_info_t**, mem_block_info_t*, unsigned char**, dtuple_t const**, lob::undo_vers_t*)+0x78c [mysqld]
                row_search_mvcc(unsigned char*, page_cur_mode_t, row_prebuilt_t*, unsigned long, unsigned long)+0x2b78 [mysqld]
                ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)+0x32e [mysqld]
                handler::ha_index_read_map(unsigned char*, unsigned char const*, unsigned long, ha_rkey_function)+0x389 [mysqld]
                handler::read_range_first(key_range const*, key_range const*, bool, bool)+0x6e [mysqld]
                ha_innobase::read_range_first(key_range const*, key_range const*, bool, bool)+0x27 [mysqld]
                handler::multi_range_read_next(char**)+0x135 [mysqld]
                handler::ha_multi_range_read_next(char**)+0x2c [mysqld]
                QUICK_RANGE_SELECT::get_next()+0x5a [mysqld]
                IndexRangeScanIterator::Read()+0x3f [mysqld]
                FilterIterator::Read()+0x18 [mysqld]
                MaterializeIterator::MaterializeQueryBlock(MaterializeIterator::QueryBlock const&, unsigned long long*)+0x133 [mysqld]
                MaterializeIterator::Init()+0x319 [mysqld]
                filesort(THD*, Filesort*, RowIterator*, unsigned long, unsigned long long, Filesort_info*, Sort_result*, unsigned long long*)+0x39d [mysqld]
                SortingIterator::DoSort()+0x72 [mysqld]
                SortingIterator::Init()+0x34 [mysqld]
                Query_expression::ExecuteIteratorQuery(THD*)+0x2ea [mysqld]
                Query_expression::execute(THD*)+0x33 [mysqld]
                Sql_cmd_dml::execute_inner(THD*)+0x30b [mysqld]
                Sql_cmd_dml::execute(THD*)+0x545 [mysqld]
                mysql_execute_command(THD*, bool)+0x9f0 [mysqld]
                Prepared_statement::execute(String*, bool)+0x8b0 [mysqld]
                Prepared_statement::execute_loop(String*, bool)+0x117 [mysqld]
                mysqld_stmt_execute(THD*, Prepared_statement*, bool, unsigned long, PS_PARAM*)+0x1b1 [mysqld]
                dispatch_command(THD*, COM_DATA const*, enum_server_command)+0x175d [mysqld]
                do_command(THD*)+0x1a4 [mysqld]
                handle_connection+0x258 [mysqld]
                pfs_spawn_thread+0x162 [mysqld]
                start_thread+0xd9 [libpthread-2.31.so]
        2304 bytes in 3 allocations from stack
                mem_heap_create_block_func(mem_block_info_t*, unsigned long, unsigned long)+0xc8 [mysqld]
                mem_heap_add_block(mem_block_info_t*, unsigned long)+0x53 [mysqld]
                row_vers_build_for_consistent_read(unsigned char const*, mtr_t*, dict_index_t*, unsigned long**, ReadView*, mem_block_info_t**, mem_block_info_t*, unsigned char**, dtuple_t const**, lob::undo_vers_t*)+0x78c [mysqld]
                row_search_mvcc(unsigned char*, page_cur_mode_t, row_prebuilt_t*, unsigned long, unsigned long)+0x2b78 [mysqld]
                ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)+0x32e [mysqld]
openxs@ao756:~/git/BPF-tools$

This information may be really useful for further analysis. But it is not easy to collapse/summarize for creating memory flame graphs, while I'd really like to use those to quickly see where most of the memory is allocated.
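Folding it should be mostly mechanical, though: each "N bytes in M allocations from stack" header starts a block, and the indented lines until the next header are the frames, leaf first. A rough, untested Python sketch (assuming exactly the report format shown above, and a hypothetical file name fold_memleak.py) might look like this:

#!/usr/bin/env python3
# fold memleak output into flamegraph.pl input ("root;...;leaf <bytes>")
import re
import sys

header = re.compile(r'^\s*(\d+) bytes in \d+ allocations from stack')
frames = []
size = None

def flush():
    # memleak prints the leaf frame first, flamegraph.pl wants the root first
    if size is not None and frames:
        print(";".join(reversed(frames)), size)

for line in sys.stdin:
    m = header.match(line)
    if m:
        flush()
        size = int(m.group(1))
        frames = []
    elif size is not None and line[:1] in (' ', '\t') and line.strip():
        frames.append(line.strip().split(' [')[0])  # drop the " [mysqld]" suffix

flush()

In theory its output could then be piped into flamegraph.pl the same way as the mallocstacks output below, but I have not tried that yet.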

One day I'll get back to memleak and create a proper script to fold its output. But for this blog post I was looking for quick and dirty ways and, according to the page linked above, for that I had either to use the general purpose stackcount.py (that just counts the number of occurrences per unique stack), or Brendan Gregg's unsupported mallocstacks, which is similar to stackcount but sums the size_t argument to malloc() as the metric. I used the latter and had to make a small fix to make it run on my Ubuntu 20.04 netbook:

openxs@ao756:~/git/BPF-tools$ git diff old/2017-12-23/mallocstacks.py
diff --git a/old/2017-12-23/mallocstacks.py b/old/2017-12-23/mallocstacks.py
index 8891e82..92271ed 100755
--- a/old/2017-12-23/mallocstacks.py
+++ b/old/2017-12-23/mallocstacks.py
@@ -96,7 +96,7 @@ struct key_t {
     char name[TASK_COMM_LEN];
 };
 BPF_HASH(bytes, struct key_t);
-BPF_STACK_TRACE(stack_traces, STACK_STORAGE_SIZE)
+BPF_STACK_TRACE(stack_traces, STACK_STORAGE_SIZE);

 int trace_malloc(struct pt_regs *ctx, size_t size) {
     u32 pid = bpf_get_current_pid_tgid();
openxs@ao756:~/git/BPF-tools$

Then I used it as follows, running for some 10+ seconds against the same MySQL 8.0.25 under the same load (and with a similarly acceptable impact as with memleak):

openxs@ao756:~/git/BPF-tools/old/2017-12-23$ sudo ~/git/BPF-tools/old/2017-12-23/mallocstacks.py -p $(pidof mysqld) -f >/tmp/alloc.out
^C
openxs@ao756:~/git/BPF-tools/old/2017-12-23$ ls -l /tmp/alloc.out
-rw-rw-r-- 1 openxs openxs 859059 тра 12 10:28 /tmp/alloc.out
openxs@ao756:~/git/BPF-tools$ head -2 /tmp/alloc.out
mysqld;[unknown];std::thread::_State_impl<std::thread::_Invoker<std::tuple<Runnable, void (*)()> > >::~_State_impl();dict_stats_thread();[unknown];std::thread::_State_impl<std::thread::_Invoker<std::tuple<Runnable, void (*)()> > >::_M_run();dict_stats_thread();dict_stats_update(dict_table_t*, dict_stats_upd_option_t);dict_stats_save(dict_table_t*, index_id_t const*);dict_stats_exec_sql(pars_info_t*, char const*, trx_t*);que_eval_sql(pars_info_t*, char const*, unsigned long, trx_t*);que_run_threads(que_thr_t*);row_sel_step(que_thr_t*);row_sel(sel_node_t*, que_thr_t*);eval_cmp(func_node_t*);eval_node_alloc_val_buf(void*, unsigned long);__libc_malloc 33
mysqld;[unknown];std::thread::_State_impl<std::thread::_Invoker<std::tuple<Runnable, void (*)()> > >::~_State_impl();dict_stats_thread();[unknown];std::thread::_State_impl<std::thread::_Invoker<std::tuple<Runnable, void (*)()> > >::_M_run();dict_stats_thread();dict_stats_update(dict_table_t*, dict_stats_upd_option_t);dict_stats_save(dict_table_t*, index_id_t const*);dict_stats_exec_sql(pars_info_t*, char const*, trx_t*);trx_commit_for_mysql(trx_t*);trx_commit(trx_t*);trx_commit_low(trx_t*, mtr_t*);__libc_malloc 40
openxs@ao756:~/git/BPF-tools/old/2017-12-23$ cd
openxs@ao756:~$ cat /tmp/alloc.out | ~/git/FlameGraph/flamegraph.pl --color=mem --title="malloc() Flame Graph" --countname="bytes" >/tmp/mysql8_malloc.svg

So, we've got nice long folded stacks with the number of bytes allocated (no matter whether they were freed later or not): pure, low-impact malloc() tracing in a format immediately usable by flamegraph.pl, which produced the following output:

Flame graph showing where most of the memory was allocated from in MySQL 8.0.25 over 10+ seconds, while running standard sysbench oltp_read_write test.

Primitive, but it worked and allowed me to see that most of the allocations were related to filesort.

* * *

To summarize:
  • Relatively low-impact tracing of memory allocations (ongoing and outstanding) is possible using bcc tools.
  • One may use the memleak tool to get quick insights into outstanding memory allocations, no matter where and how they were made, with sampling to reduce the performance impact, or rely on some custom or general-purpose tracing tools to collect a metric per stack trace and represent the result as memory flame graphs.
  • It looks like it may make sense to do just primitive tracing with bpftrace and not try to overload the tool with collecting per-stack data in maps, as stack resolution seems to take too much CPU.
