Showing posts with label flame graph. Show all posts

Thursday, February 10, 2022

How to Summarize gdb Backtrace with pt-pmp (and Flamegraph)

This is going to be more like a note to myself and my readers than a real blog post. But still I'd like to document the trick that I have been applying for years already, in its most convenient form.

Assuming you have a backtrace (or full backtrace) created from a core file or by attaching gdb to a live (and maybe hanging) process with many threads, like a MySQL or MariaDB server, what is the best way to summarize it quickly, to understand what most of the threads are doing or hanging at? Take this huge backtrace with almost two thousand threads:

openxs@ao756:~$ ls -l /tmp/backtrace1.txt
-rw-rw-r-- 1 openxs openxs 2817054 лют 10 17:02 /tmp/backtrace1.txt
openxs@ao756:~$ grep LWP /tmp/backtrace1.txt | wc -l
1915

Here is how. You have to download pt-pmp from Percona Toolkit. Then source the pt-pmp script into your shell and rely on its aggregate_stacktrace function, which is quite advanced in summarizing stack traces:

openxs@ao756:~$ which pt-pmp
/usr/bin/pt-pmp
openxs@ao756:~$ . /usr/bin/pt-pmp
openxs@ao756:~$ cat /tmp/backtrace1.txt | aggregate_stacktrace > /tmp/pmp1.txt 
openxs@ao756:~$ ls -l /tmp/pmp1.txt
-rw-rw-r-- 1 openxs openxs 34174 лют 10 18:07 /tmp/pmp1.txt
openxs@ao756:~$ head -5 /tmp/pmp1.txt
    598 poll(libc.so.6),vio_io_wait(viosocket.c:945),vio_socket_io_wait(viosocket.c:108),vio_read(viosocket.c:184),my_real_read(net_serv.cc:892),my_net_read_packet_reallen(net_serv.cc:1162),my_net_read_packet(net_serv.cc:1146),do_command(sql_parse.cc:1262),do_handle_one_connection(sql_connect.cc:1336),handle_one_connection(sql_connect.cc:1241),pfs_spawn_thread(pfs.cc:1862),start_thread(libpthread.so.0),clone(libc.so.6)
     82 pthread_cond_wait,wait(os0event.cc:158),wait_low(os0event.cc:158),os_event_wait_low(os0event.cc:158),sync_array_wait_event(sync0arr.cc:471),rw_lock_s_lock_spin(sync0rw.cc:373),rw_lock_s_lock_func(sync0rw.ic:290),pfs_rw_lock_s_lock_func(sync0rw.ic:290),buf_page_get_gen(buf0buf.cc:4905),btr_cur_search_to_nth_level(btr0cur.cc:1243),btr_pcur_open_low(btr0pcur.ic:467),row_ins_scan_sec_index_for_duplicate(btr0pcur.ic:467),row_ins_sec_index_entry_low(btr0pcur.ic:467),row_ins_sec_index_entry(row0ins.cc:3251),row_ins_index_entry(row0ins.cc:3297),row_ins_index_entry_step(row0ins.cc:3297),row_ins(row0ins.cc:3297),row_ins_step(row0ins.cc:3297),row_insert_for_mysql(row0mysql.cc:1414),ha_innobase::write_row(ha_innodb.cc:8231),handler::ha_write_row(handler.cc:6089),write_record(sql_insert.cc:1941),mysql_insert(sql_insert.cc:1066),mysql_execute_command(sql_parse.cc:4170),mysql_parse(sql_parse.cc:7760),dispatch_command(sql_parse.cc:1832),do_command(sql_parse.cc:1386),do_handle_one_connection(sql_connect.cc:1336),handle_one_connection(sql_connect.cc:1241),pfs_spawn_thread(pfs.cc:1862),start_thread(libpthread.so.0),clone(libc.so.6)
     55 pthread_cond_wait,wait(os0event.cc:158),wait_low(os0event.cc:158),os_event_wait_low(os0event.cc:158),lock_wait_suspend_thread(lock0wait.cc:347),row_mysql_handle_errors(row0mysql.cc:741),row_insert_for_mysql(row0mysql.cc:1428),ha_innobase::write_row(ha_innodb.cc:8231),handler::ha_write_row(handler.cc:6089),write_record(sql_insert.cc:1941),mysql_insert(sql_insert.cc:1066),mysql_execute_command(sql_parse.cc:4170),mysql_parse(sql_parse.cc:7760),dispatch_command(sql_parse.cc:1832),do_command(sql_parse.cc:1386),do_handle_one_connection(sql_connect.cc:1336),handle_one_connection(sql_connect.cc:1241),pfs_spawn_thread(pfs.cc:1862),start_thread(libpthread.so.0),clone(libc.so.6)
     38 pthread_cond_wait,wait(os0event.cc:158),wait_low(os0event.cc:158),os_event_wait_low(os0event.cc:158),sync_array_wait_event(sync0arr.cc:471),rw_lock_x_lock_func(sync0rw.cc:733),pfs_rw_lock_x_lock_func(sync0rw.ic:544),buf_page_get_gen(buf0buf.cc:4918),btr_cur_search_to_nth_level(btr0cur.cc:1243),row_ins_sec_index_entry_low(row0ins.cc:2946),row_ins_sec_index_entry(row0ins.cc:3251),row_ins_index_entry(row0ins.cc:3297),row_ins_index_entry_step(row0ins.cc:3297),row_ins(row0ins.cc:3297),row_ins_step(row0ins.cc:3297),row_insert_for_mysql(row0mysql.cc:1414),ha_innobase::write_row(ha_innodb.cc:8231),handler::ha_write_row(handler.cc:6089),write_record(sql_insert.cc:1941),mysql_insert(sql_insert.cc:1066),mysql_execute_command(sql_parse.cc:4170),mysql_parse(sql_parse.cc:7760),dispatch_command(sql_parse.cc:1832),do_command(sql_parse.cc:1386),do_handle_one_connection(sql_connect.cc:1336),handle_one_connection(sql_connect.cc:1241),pfs_spawn_thread(pfs.cc:1862),start_thread(libpthread.so.0),clone(libc.so.6)
     32 pthread_cond_wait,wait(os0event.cc:158),wait_low(os0event.cc:158),os_event_wait_low(os0event.cc:158),sync_array_wait_event(sync0arr.cc:471),rw_lock_x_lock_func(sync0rw.cc:733),pfs_rw_lock_x_lock_func(sync0rw.ic:544),buf_page_get_gen(buf0buf.cc:4918),btr_cur_search_to_nth_level(btr0cur.cc:1243),row_ins_sec_index_entry_low(row0ins.cc:3040),row_ins_sec_index_entry(row0ins.cc:3251),row_ins_index_entry(row0ins.cc:3297),row_ins_index_entry_step(row0ins.cc:3297),row_ins(row0ins.cc:3297),row_ins_step(row0ins.cc:3297),row_insert_for_mysql(row0mysql.cc:1414),ha_innobase::write_row(ha_innodb.cc:8231),handler::ha_write_row(handler.cc:6089),write_record(sql_insert.cc:1941),mysql_insert(sql_insert.cc:1066),mysql_execute_command(sql_parse.cc:4170),mysql_parse(sql_parse.cc:7760),dispatch_command(sql_parse.cc:1832),do_command(sql_parse.cc:1386),do_handle_one_connection(sql_connect.cc:1336),handle_one_connection(sql_connect.cc:1241),pfs_spawn_thread(pfs.cc:1862),start_thread(libpthread.so.0),clone(libc.so.6)

If a summary like the one above is not clear enough, we can surely go one step further and create a proper flame graph based on that output:

openxs@ao756:~$ cat /tmp/pmp1.txt | awk '{print $2, $1}' | sed -e 's/,/;/g' | ~/git/FlameGraph/flamegraph.pl --countname="threads" --reverse - >/tmp/pmp1.svg  
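For the curious, the core of what aggregate_stacktrace does can be sketched in a few lines of Python. This is a simplified illustration (the real pt-pmp function also normalizes frame names and handles more gdb output variations), not the actual implementation:

```python
from collections import Counter

def aggregate_stacks(gdb_output: str) -> list[tuple[int, str]]:
    """Collapse a gdb 'thread apply all bt' dump into (count, stack) pairs,
    most frequent stack first - a simplified take on aggregate_stacktrace."""
    stacks: Counter[str] = Counter()
    frames: list[str] = []
    for line in gdb_output.splitlines():
        line = line.strip()
        if line.startswith("Thread "):       # a new thread starts
            if frames:
                stacks[",".join(frames)] += 1
            frames = []
        elif line.startswith("#"):           # a stack frame
            # '#0  0x00007f... in poll () from libc' -> 'poll'
            parts = line.split()
            func = parts[3] if parts[1].startswith("0x") else parts[1]
            frames.append(func)
    if frames:                               # flush the last thread
        stacks[",".join(frames)] += 1
    return sorted(((n, s) for s, n in stacks.items()), reverse=True)
```

Run against a backtrace like the one above, this would produce the same kind of "598 poll,vio_io_wait,..." summary lines, modulo pt-pmp's extra frame normalization.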

Then, with some creative zooming and searching, we can concentrate on waits:

Frames with "wait" substring in the function name are highlighted

That's all. It is as simple as it looks: a quick overview of the backtrace before you start digging into it.

Monday, January 10, 2022

Differential Flame Graphs to Highlight Performance Schema Waits Impact

Yet another type of flame graph that I have not discussed yet is the differential flame graph (again invented by Brendan Gregg). It shows the difference between two flame graphs in a clear way (assuming they are comparable - it's on you to make sure the comparison makes sense and to interpret the output properly). The flame graph is drawn using the "after" profile (so the frame widths show values for the second graph) and then colorized by the delta to show how we got there. If the metric for a given frame in the same stack got larger, the frame is shown in red; if it got smaller, the frame is blue (hence the name red/blue differential flame graphs). The saturation is relative to the delta, so dark red marks a frame whose value is much bigger in the second graph, while white and very light frames can be ignored.
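The merge step behind such graphs (done by difffolded.pl from the FlameGraph repository) can be sketched in Python. This is a simplified illustration, assuming both profiles are already in the folded "stack count" form:

```python
def diff_folded(before: dict[str, int], after: dict[str, int]) -> list[str]:
    """Merge two folded-stack profiles into 'stack count_before count_after'
    lines: flamegraph.pl then draws frame widths from the second count and
    colors each frame red or blue by the delta between the two."""
    merged = []
    for stack in sorted(set(before) | set(after)):
        merged.append(f"{stack} {before.get(stack, 0)} {after.get(stack, 0)}")
    return merged
```

Stacks missing from one of the two profiles simply get a zero count on that side.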

Let me apply this approach to the flame graphs showing waits reported by MySQL's performance schema (built as described in this blog post). As a proof of concept I'll use an easy-to-interpret case where the same oltp_write_only.lua sysbench test is run with different values of innodb_flush_log_at_trx_commit and otherwise the same settings, like 16 concurrent threads, on my old 2-core Ubuntu 20.04 "home server" with a slow HDD.

So, here is the first run:

sysbench --table-size=1000000 --threads=1 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_write_only.lua cleanup
sysbench --table-size=1000000 --threads=1 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_write_only.lua prepare
bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'truncate table events_waits_history_long'
bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'set global innodb_flush_log_at_trx_commit=0'
sysbench --table-size=1000000 --threads=16 --time=120 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_write_only.lua run
bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'select event_name, timer_wait from events_waits_history_long' >/tmp/waits64_0.txt
cat /tmp/waits64_0.txt | awk '{ printf("%s %d\n", $1, $2); }' | sed 's/\//;/g' > /tmp/w64_0.out

That run ended up with these sysbench statistics:

SQL statistics:
    queries performed:
        read:                            0
        write:                           26252
        other:                           13126
        total:                           39378
    transactions:                        6563   (54.60 per sec.)
    queries:                             39378  (327.59 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          120.2017s
    total number of events:              6563

Latency (ms):
         min:                                   66.37
         avg:                                  292.94
         max:                                 1222.92
         95th percentile:                      530.08
         sum:                              1922587.72

Threads fairness:
    events (avg/stddev):           410.1875/4.69
    execution time (avg/stddev):   120.1617/0.05

and then the second:

sysbench --table-size=1000000 --threads=1 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_write_only.lua cleanup
sysbench --table-size=1000000 --threads=1 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_write_only.lua prepare
bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'truncate table events_waits_history_long'
bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'set global innodb_flush_log_at_trx_commit=1'
sysbench --table-size=1000000 --threads=16 --time=120 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_write_only.lua run
bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'select event_name, timer_wait from events_waits_history_long' >/tmp/waits64_1.txt
cat /tmp/waits64_1.txt | awk '{ printf("%s %d\n", $1, $2); }' | sed 's/\//;/g' > /tmp/w64_1.out

that produced the following:

SQL statistics:
    queries performed:
        read:                            0
        write:                           16238
        other:                           8119
        total:                           24357
    transactions:                        4059   (33.71 per sec.)
    queries:                             24357  (202.27 per sec.)
    ignored errors:                      1      (0.01 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          120.4139s
    total number of events:              4059

Latency (ms):
         min:                                  151.87
         avg:                                  474.29
         max:                                 1316.65
         95th percentile:                      773.68
         sum:                              1925150.61

Threads fairness:
    events (avg/stddev):           253.6875/3.51
    execution time (avg/stddev):   120.3219/0.13

Now, with the resulting "collapsed stacks" in the /tmp/w64_?.out files, we can build a simple differential flame graph

openxs@ao756:~/dbs/8.0$ ~/git/FlameGraph/difffolded.pl /tmp/w64_0.out /tmp/w64_1.out | ~/git/FlameGraph/flamegraph.pl --count picoseconds --title Waits > /tmp/w64_01_diff.svg

that, when captured as a .png file for this blog post, looks as follows:

Differential flame graph visualizing the impact of innodb_flush_log_at_trx_commit value (the difference 1 makes vs 0) on waits reported by performance_schema for the same oltp_write_only.lua test

We clearly see the negative impact on performance (54 TPS vs 33 TPS). Time spent on InnoDB log I/O increased, and idle time somewhat increased as well. The highest relative increase was for the /wait/synch/cond/sql/MYSQL_BIN_LOG::COND_done condition variable (5+%), probably related to the binlog group commit implementation, where we cannot write to the binary log until the redo log is flushed. A wild guess, surely.

I deliberately used flame graphs that are as simple as possible, to make interpreting the difference from just a screenshot easier. This lame test demonstrated that we really can see what we expected - somewhat increased redo log I/O waits, highlighted in red. Good enough for the proof of concept, and way easier to speculate about than the usual off-CPU differential flame graph like this one:

that I built while attempting to reproduce one real-life MariaDB performance problem...

Saturday, January 8, 2022

Visualizing MySQL Plan Execution Time With Flame Graphs

In the previous post I already showed one non-classical but useful application of flame graphs as a visualization of something interesting for MySQL DBAs (besides the usual stack traces from profilers): time spent on waits, or even statement execution by stages with related waits, as reported by performance_schema. Today I am going to abuse the tools even more and try to show the real impact of each step of query execution as reported by the EXPLAIN ANALYZE statement, supported since MySQL 8.0.18.

The idea to use flame graphs for SQL plan visualization was (to the best of my knowledge) first suggested by Tanel Poder in this article, in the context of Oracle RDBMS. There are several free software implementations of the idea here and there.

Unfortunately, I have not found any similar posts for MySQL, and the tools mentioned above do not work with MySQL. It made me wonder: why?

For quite some time we have had a way to get several metrics for each execution step of the query plan in MySQL. Let's start with some stupid query (used to prove the concept and test the possible steps to implement it):

openxs@ao756:~/dbs/8.0$ bin/mysql -uroot --socket=/tmp/mysql8.sock test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 33
Server version: 8.0.27 Source distribution

Copyright (c) 2000, 2021, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> explain analyze select user, host from mysql.user u1 where u1.user not in (select distinct user from mysql.user) order by host desc\G
*************************** 1. row ***************************
EXPLAIN: -> Nested loop antijoin  (cost=3.75 rows=25) (actual time=0.139..0.139 rows=0 loops=1)
    -> Covering index scan on u1 using PRIMARY (reverse)  (cost=0.75 rows=5) (actual time=0.058..0.064 rows=5 loops=1)
    -> Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`)  (actual time=0.003..0.003 rows=1 loops=5)
        -> Materialize with deduplication  (cost=1.25..1.25 rows=5) (actual time=0.071..0.071 rows=4 loops=1)
            -> Filter: (mysql.`user`.`User` is not null)  (cost=0.75 rows=5) (actual time=0.022..0.035 rows=5 loops=1)
                -> Covering index scan on user using PRIMARY  (cost=0.75 rows=5) (actual time=0.020..0.031 rows=5 loops=1)

1 row in set (0,02 sec)

We see a hierarchical representation of the query execution steps (it's Oracle's famous TREE format of EXPLAIN output, the only one supported for EXPLAIN ANALYZE). Each row represents some step, explains what the step was doing, quite verbosely, and then provides several useful metrics like cost, estimated number of rows, and the actual time to return the first and all rows for this step, etc. The actual time to return all rows at this step is what DBAs are usually interested in. The steps form a tree, but the way it is represented is a bit unusual - we do not see JSON or any other structured format as some other RDBMSes provide. Instead, each row's level in the hierarchy is represented by the number of spaces before the very informative "->" prompt. The first row with its additional "EXPLAIN: " prompt aside, each nesting level adds 4 spaces, as highlighted by the light green background above.

Maybe this unusual representation (not in a table with plan steps like in Oracle, and not as JSON) prevented people from quickly implementing a flame graph-based query "profiler" for MySQL. But this definitely is not going to stop me from hacking up something that may even work. I have not really been a developer for 16+ years already, so I will not use Node.js or Python or anything fancy - just old Unix text processing tools like sed and awk, and surely the power of SQL (that small part of it I managed to master).

So, with the output like the above saved in a file:

openxs@ao756:~/dbs/8.0$ cat /tmp/explain.txt
*************************** 1. row ***************************
EXPLAIN: -> Nested loop antijoin  (cost=3.75 rows=25) (actual time=0.115..0.115 rows=0 loops=1)
    -> Covering index scan on u1 using PRIMARY (reverse)  (cost=0.75 rows=5) (actual time=0.045..0.049 rows=5 loops=1)
    -> Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`)  (actual time=0.002..0.002 rows=1 loops=5)
        -> Materialize with deduplication  (cost=1.25..1.25 rows=5) (actual time=0.063..0.063 rows=4 loops=1)
            -> Filter: (mysql.`user`.`User` is not null)  (cost=0.75 rows=5) (actual time=0.025..0.034 rows=5 loops=1)
                -> Covering index scan on user using PRIMARY  (cost=0.75 rows=5) (actual time=0.024..0.032 rows=5 loops=1)

openxs@ao756:~/dbs/8.0$

The first stage is simple and is handled by the following command line that I came up with quite fast:

openxs@ao756:~/dbs/8.0$  cat /tmp/explain.txt | awk 'NR > 1' | sed 's/EXPLAIN: //' | sed 's/(cost[^)][^)]*)//' | sed 's/(actual time=//' | sed 's/\(..*\)\.\...*/\1/' | sed 's/  \([^ ][^ ]*\)$/; \1/'
-> Nested loop antijoin ; 0.115
    -> Covering index scan on u1 using PRIMARY (reverse) ; 0.045
    -> Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`); 0.002
        -> Materialize with deduplication ; 0.063
            -> Filter: (mysql.`user`.`User` is not null) ; 0.025
                -> Covering index scan on user using PRIMARY ; 0.024

openxs@ao756:~/dbs/8.0$

At each step of the pipeline I removed the first line of the output, removed that stupid "EXPLAIN: " "prompt", removed the cost and number-of-rows estimations, extracted just the first "actual time" value (separated by ";" as flamegraph.pl wants), and kept the detailed step description with its initial spaces.
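The same extraction can also be done with a single regular expression. Here is a hypothetical Python equivalent of the sed pipeline (the function name and regex are mine, for illustration only):

```python
import re

# indent, step description, optional cost block, first 'actual time' number
STEP_RE = re.compile(
    r"^(\s*)-> (.*?)\s*(?:\(cost=[^)]*\)\s*)?\(actual time=([\d.]+)\.\."
)

def parse_step(line: str):
    """Return (level, description, first_actual_time_ms) for one
    EXPLAIN ANALYZE line, or None if the line is not a plan step."""
    m = STEP_RE.match(line.replace("EXPLAIN: ", ""))
    if not m:
        return None
    indent, desc, first_ms = m.groups()
    return len(indent) // 4, desc, float(first_ms)
```

Non-step lines (like the mysql client's row counter) simply do not match and return None.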

The next stage of processing took me a lot of time and effort, as I had to convert the output above into a different format that is suitable for loading into the database:

openxs@ao756:~/dbs/8.0$ cat /tmp/plan.txt
1;0;0; Nested loop antijoin ; 0.115
2;1;1; Covering index scan on u1 using PRIMARY (reverse) ; 0.045
3;1;1; Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`); 0.002
4;3;2; Materialize with deduplication ; 0.063
5;4;3; Filter: (mysql.`user`.`User` is not null) ; 0.025
6;5;4; Covering index scan on user using PRIMARY ; 0.024

The format is simple: semicolon-separated row number (step in the plan), row number of the "parent" step in the hierarchy (with 0 used for the very first step, obviously), the level of the step in the hierarchy (just in case), the same detailed step description, and the value of the metric for the step (time in milliseconds). This file is then loaded into the following MySQL table:

mysql> desc plan;
+--------+---------------+------+-----+---------+-------+
| Field  | Type          | Null | Key | Default | Extra |
+--------+---------------+------+-----+---------+-------+
| seq    | int           | YES  |     | NULL    |       |
| parent | int           | YES  |     | NULL    |       |
| level  | int           | YES  |     | NULL    |       |
| step   | varchar(1024) | YES  |     | NULL    |       |
| val    | decimal(10,3) | YES  |     | NULL    |       |
+--------+---------------+------+-----+---------+-------+
5 rows in set (0,00 sec)

mysql> truncate plan;
Query OK, 0 rows affected (1,77 sec)

mysql> load data infile '/tmp/plan.txt' into table plan fields terminated by ';';
Query OK, 6 rows affected (0,13 sec)
Records: 7  Deleted: 0  Skipped: 0  Warnings: 0

mysql> select * from plan;
+------+--------+-------+------------------------------------------------------------------------------------------+-------+
| seq  | parent | level | step                                                                                     | val   |
+------+--------+-------+------------------------------------------------------------------------------------------+-------+
|    1 |      0 |     0 |  Nested loop antijoin                                                                    | 0.115 |
|    2 |      1 |     1 |  Covering index scan on u1 using PRIMARY (reverse)                                       | 0.045 |
|    3 |      1 |     1 |  Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`) | 0.002 |
|    4 |      3 |     2 |  Materialize with deduplication                                                          | 0.063 |
|    5 |      4 |     3 |  Filter: (mysql.`user`.`User` is not null)                                               | 0.025 |
|    6 |      5 |     4 |  Covering index scan on user using PRIMARY                                               | 0.024 |
+------+--------+-------+------------------------------------------------------------------------------------------+-------+
6 rows in set (0,00 sec)

Now, how to end up with that format? I spent a lot of time with awk and created the following script:

awk '
function st_push(val) {
st_array[st_pos++] = val;
}

function st_pop() {
return (st_size() > 0) ? st_array[--st_pos] : "ERROR";
}

function st_size() {
return st_pos;
}

BEGIN { p[0] = 0; level[0] = 0; parent[0] = 0; }

NF > 0 {
  # print "> " NR ";" $1;
  split($0,a,"->");
  lvl = length(a[1])/4;

  level[NR] = lvl;
  parent[NR] = NR-1;

  if (level[NR] > level[NR-1]) {
    st_push(parent[NR]);
    p[level[NR]] = NR-1;
  }
  else if (level[NR] < level[NR-1]) {
    for (i=0; i<=(level[NR-1]-level[NR]); i++) {
      parent[NR] = st_pop();
      # print "poped " parent[NR];
    }
  }
  else {
    parent[NR]=parent[NR-1]
  }
  print NR ";" parent[NR] ";" lvl ";" a[2]
}'

that starts with 3 functions implementing a stack and then uses 3 associative arrays: p[] to store the current parent row number for a specific nesting level, level[] to store the hierarchy level of a specific row, and parent[] to store the number of the row that is the parent of a given row in the plan.

To find the level of a row, I split() the input row using "->" as the field separator (assuming it never appears in the step description), which leaves the initial spaces in a[1] and the rest (the step description and metric value) in a[2]. Dividing the length of a[1] by 4 gives the level.

That was the simple part. Then I build the parent row number using a stack that saves the previous parent row number for each level. It took some testing with a more primitive input format to come up with "tree traversal" code for such a weird tree representation format, but it seems to work on the few different nested structures I tested (correct me if it fails for more complex plans). I am not a developer any more, so the code above may be wrong.
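For comparison, here is how the same parent assignment could look in Python, with an explicit stack of (level, seq) pairs. This is a sketch under the same 4-spaces-per-level assumption, not a drop-in replacement for the awk script:

```python
def assign_parents(lines: list[str]) -> list[tuple[int, int, int, str]]:
    """Turn indented '-> step' lines into (seq, parent, level, step) rows.
    The stack keeps (level, seq) of candidate ancestors; anything at the
    same or deeper level than the current line cannot be its parent."""
    rows = []
    stack: list[tuple[int, int]] = []
    for seq, line in enumerate(lines, start=1):
        indent, _, rest = line.partition("->")
        level = len(indent) // 4
        while stack and stack[-1][0] >= level:
            stack.pop()
        parent = stack[-1][1] if stack else 0   # 0 marks the root step
        rows.append((seq, parent, level, rest.strip()))
        stack.append((level, seq))
    return rows
```

On the six plan lines above this yields the same seq/parent/level triples as in /tmp/plan.txt (1;0;0, 2;1;1, 3;1;1, 4;3;2, 5;4;3, 6;5;4).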

So, with the plan in the table, where each line has its sequence number as seq and the sequence number of its parent line as parent, one needs a simple enough recursive CTE to build every path from the root and report the time spent at each step of the path:

mysql> with recursive cte_plan as (select seq, parent, level, concat(seq, concat(' -', step)) as step, round(val*1000) as val from plan where level = 0 union all select p.seq, p.parent, p.level, concat(c.step, concat(';', concat(p.seq, concat(' -',p.step)))), round(p.val*1000) as val from plan p join cte_plan c on p.parent = c.seq) select step, val from cte_plan\G
*************************** 1. row ***************************
step: 1 - Nested loop antijoin
 val: 115
*************************** 2. row ***************************
step: 1 - Nested loop antijoin ;2 - Covering index scan on u1 using PRIMARY (reverse)
 val: 45
*************************** 3. row ***************************
step: 1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`)
 val: 2
*************************** 4. row ***************************
step: 1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`);4 - Materialize with deduplication
 val: 63
*************************** 5. row ***************************
step: 1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`);4 - Materialize with deduplication ;5 - Filter: (mysql.`user`.`User` is not null)
 val: 25
*************************** 6. row ***************************
step: 1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`);4 - Materialize with deduplication ;5 - Filter: (mysql.`user`.`User` is not null) ;6 - Covering index scan on user using PRIMARY
 val: 24
6 rows in set (0,01 sec)

Note that I've multiplied the metric value by 1000 because the flamegraph.pl script expects metrics to be integer numbers, so they are actually in microseconds. I also added the step number to each step, as we may have the exact same step in different places in the hierarchy if we read from the same table more than once in the query.
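Outside the database, the same path expansion is a short loop. Here is a hypothetical Python equivalent of the recursive CTE, assuming rows are ordered so that each parent precedes its children (as they are in this plan table):

```python
def folded_paths(rows: list[tuple[int, int, str, float]]) -> list[str]:
    """Expand (seq, parent, step, val_ms) rows into folded lines
    'path value', prefixing each step with its seq (identical steps may
    occur at different places in the plan) and converting milliseconds
    to integer microseconds for flamegraph.pl."""
    path = {0: ""}                      # parent 0 is the virtual root
    out = []
    for seq, parent, step, val_ms in rows:
        node = f"{seq} - {step}"
        path[seq] = f"{path[parent]};{node}" if path[parent] else node
        out.append(f"{path[seq]} {round(val_ms * 1000)}")
    return out
```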

The final code to produce the input for flamegraph.pl is like this:

openxs@ao756:~/dbs/8.0$ ~/dbs/8.0/bin/mysql -uroot --socket=/tmp/mysql8.sock test -B -e"with recursive cte_plan as (select seq, parent, level, concat(seq, concat(' -', step)) as step, round(val*1000) as val from plan where level = 0 union all select p.seq, p.parent, p.level, concat(c.step, concat(';', concat(p.seq, concat(' -',p.step)))), round(p.val*1000) as val from plan p join cte_plan c on p.parent = c.seq) select step, val from cte_plan;" | awk ' NR > 1' > /tmp/processed_plan.txt
openxs@ao756:~/dbs/8.0$ cat /tmp/processed_plan.txt
1 - Nested loop antijoin        115
1 - Nested loop antijoin ;2 - Covering index scan on u1 using PRIMARY (reverse) 45
1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`)   2
1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`);4 - Materialize with deduplication       63
1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`);4 - Materialize with deduplication ;5 - Filter: (mysql.`user`.`User` is not null)         25
1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`);4 - Materialize with deduplication ;5 - Filter: (mysql.`user`.`User` is not null) ;6 - Covering index scan on user using PRIMARY  24

This is how I create the final flame graph representing the plan steps:

openxs@ao756:~/dbs/8.0$ cat /tmp/processed_plan.txt | ~/git/FlameGraph/flamegraph.pl  --title "EXPLAIN Steps" --inverted --countname microseconds > /tmp/explain2.svg

Surely these steps can be simplified, put into a script, etc. I checked the intermediate stages one by one, left some debugging details in, and so on. You can do better, but the final result is this nice icicle graph:

Time spent on each step of query execution represented as an icicle graph

where I highlighted the impact of the "Covering index scan on u1 using PRIMARY (reverse)" step. As usual with flame graphs, in the original .svg file you can search, zoom in and out, etc. For complex queries it may be really useful.

As a DBA, I'd ask for something better than these weird stages (even if scripted). I wish we had all execution steps in tables or views, like Oracle's v$sql_plan and v$sql_plan_statistics_all, or at least in JSON format (like EXPLAIN FORMAT=JSON produces for the plan itself). One day, maybe (I am going to file a feature request for this later). Yet another day I plan to use the output of ANALYZE FORMAT=JSON in MariaDB to get similar information :)

Thursday, January 6, 2022

Visualizing Performance Schema Events with Flame Graphs

Happy New Year 2022, dear readers! It's not the first time I am writing something about, or illustrating some point with, a flame graph. No wonder: the flame graph concept and related tools by Brendan Gregg provide a great way to visualize metrics in any nested hierarchy, with function call stacks being just the most popular and well-known example.

In this new blog post, which I am writing while preparing for my upcoming FOSDEM 2022 talk "Flame Graphs for MySQL DBAs", I am going to explore a different hierarchy, one that is obvious from these simple SQL queries and their outputs:

mysql> select event_name, timer_wait from events_waits_history_long order by 1 desc limit 5;
+----------------------------------------------+------------+
| event_name                                   | timer_wait |
+----------------------------------------------+------------+
| wait/synch/sxlock/innodb/trx_purge_latch     |     747273 |
| wait/synch/sxlock/innodb/index_tree_rw_lock  |     767343 |
| wait/synch/sxlock/innodb/hash_table_locks    |     637557 |
| wait/synch/sxlock/innodb/hash_table_locks    |     280980 |
| wait/synch/sxlock/innodb/dict_operation_lock |    1731372 |
+----------------------------------------------+------------+
5 rows in set (0,02 sec)

Another one, from the summary table:

mysql> select event_name, sum_timer_wait from performance_schema.events_waits_summary_global_by_event_name order by 2 desc limit 5;
+---------------------------------------------------------+-------------------+
| event_name                                              | sum_timer_wait    |
+---------------------------------------------------------+-------------------+
| wait/synch/cond/mysqlx/scheduler_dynamic_worker_pending | 20961067338483726 |
| idle                                                    |  1461111486462000 |
| wait/io/socket/sql/client_connection                    |  1373040224524194 |
| wait/synch/cond/sql/MYSQL_BIN_LOG::COND_done            |  1238396604397386 |
| wait/io/file/innodb/innodb_temp_file                    |   330058399794447 |
+---------------------------------------------------------+-------------------+
5 rows in set (0,05 sec)

We clearly see a hierarchy of waits of different types. It takes just a few simple steps to convert this kind of output to the format expected by the flamegraph.pl tool:

# The input is stack frames and sample counts formatted as single lines.  Each
# frame in the stack is semicolon separated, with a space and count at the end
# of the line.  These can be generated for Linux perf script output using
# stackcollapse-perf.pl, for DTrace using stackcollapse.pl, and for other tools
# using the other stackcollapse programs.  Example input:
#
#  swapper;start_kernel;rest_init;cpu_idle;default_idle;native_safe_halt 1
#
# An optional extra column of counts can be provided to generate a differential
# flame graph of the counts, colored red for more, and blue for less.  This
# can be useful when using flame graphs for non-regression testing.
# See the header comment in the difffolded.pl program for instructions.

Let me show a really simple way. We can get the lines without extra decorations:

openxs@ao756:~/dbs/8.0$ bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'select event_name, timer_wait from events_waits_history_long' >/tmp/waits.txt
openxs@ao756:~/dbs/8.0$ head -10 /tmp/waits.txt
event_name      timer_wait
wait/synch/mutex/innodb/log_writer_mutex        933255
wait/synch/mutex/innodb/log_flush_notifier_mutex        937938
wait/synch/mutex/innodb/log_flusher_mutex       561960
wait/synch/mutex/innodb/trx_sys_serialisation_mutex     345873
wait/synch/mutex/innodb/log_write_notifier_mutex        441540
wait/synch/mutex/mysqlx/lock_list_access        908502
wait/synch/mutex/innodb/log_checkpointer_mutex  927903
wait/synch/mutex/innodb/flush_list_mutex        160560
wait/synch/mutex/innodb/log_limits_mutex        120420

and then, with the assumption that there are no spaces inside the event_name, we can apply a simple combination of sed and awk commands to replace '/' with the expected ';' and make sure there is just one space before the metric (time in picoseconds in this case). The rest is for the flamegraph.pl script to handle:

openxs@ao756:~/dbs/8.0$ cat /tmp/waits.txt | awk '{ printf("%s %d\n", $1, $2); }' | sed 's/\//;/g' | ~/git/FlameGraph/flamegraph.pl --inverted --colors io --title "Waits" --countname picoseconds --width 1000 > /tmp/wait.svg
openxs@ao756:~/dbs/8.0$
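The same fold can be sketched in a few lines of Python that also pre-aggregate duplicate event names (like the two hash_table_locks rows above) before handing the result to flamegraph.pl. This is just an illustration; the function name is mine, and tab-separated mysql -B output is assumed:

```python
from collections import defaultdict

def fold(rows):
    """Turn 'event_name<TAB>timer_wait' rows (mysql -B output) into
    flamegraph.pl folded format, summing duplicate event names."""
    total = defaultdict(int)
    for line in rows:
        name, _, value = line.rstrip('\n').rpartition('\t')
        if not name or not value.isdigit():
            continue                          # skip the header line
        total[name.replace('/', ';')] += int(value)
    return ['%s %d' % (name, waited) for name, waited in total.items()]
```

flamegraph.pl sums identical folded lines itself, so the pre-aggregation is not strictly required, but it keeps the intermediate file smaller.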

I've used a few options above, including --inverted and --colors io, to end up with "icicles". The resulting graph, with "frames" named like "log" highlighted, is below:

Icicle graph for waits, showing the impact of "idle" and InnoDB logging-related waits

The graph above is not very fancy looking. That's because the hierarchy of event names in performance_schema is not very deep. You can surely produce similar highlights with proper SQL filtering, but the .svg file is interactive, and you can get a lot of insights after running just one query. You can surely add more fun by adding more data and using more complex SQL statements and text processing.

Ever since MySQL started to support recursive CTEs, I've wanted to use them to navigate the more complex transactions/statements/stages/waits hierarchy that is present in the performance_schema. So, today I tried to use a "proof of concept" recursive CTE to build a somewhat more interesting flame graph and get myself ready to summarize time spent per statement, with the option to drill down to related waits if needed. I ended up with this kind of query to show statements by type, by stage, and by wait related to a stage. No comments to begin with, just lame SQL (somewhat inspired by the sys.ps_trace_thread() procedure's cursor here):

with recursive ssw as (
with sw as (
  select event_id, event_name, nesting_event_id, timer_wait from performance_schema.events_statements_history_long union
  select event_id, event_name, nesting_event_id, timer_wait from performance_schema.events_stages_history_long union
  select event_id, event_name, nesting_event_id, timer_wait from performance_schema.events_waits_history_long
)
select 0 as level, sw.* from sw
  where sw.event_name not like 'wait%'
  and sw.event_name not like 'stage%'
  and nesting_event_id is null
union all
select ssw.level + 1 as level, sw.event_id, concat(ssw.event_name,concat('/',sw.event_name)) as event_name, sw.nesting_event_id, sw.timer_wait
  from ssw
  inner join sw on sw.nesting_event_id = ssw.event_id
)
select event_name, timer_wait from ssw
order by event_id;
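To make the recursion easier to follow, here is a toy Python re-implementation of the same idea (not the query itself), walking a few made-up (event_id, event_name, nesting_event_id, timer_wait) rows:

```python
# Toy model of the recursive CTE: start at top-level events
# (nesting_event_id is None) and walk down the nesting links,
# concatenating event names with '/'. All rows are made up.
rows = [
    # (event_id, event_name, nesting_event_id, timer_wait)
    (1, 'statement/sql/select', None, 1000),
    (2, 'stage/sql/starting', 1, 400),
    (3, 'wait/io/file/innodb/innodb_data_file', 2, 150),
]

def expand(rows):
    by_parent = {}
    for eid, name, parent, waited in rows:
        by_parent.setdefault(parent, []).append((eid, name, waited))

    def walk(parent_id, prefix):
        for eid, name, waited in by_parent.get(parent_id, []):
            path = prefix + '/' + name if prefix else name
            yield path, waited
            yield from walk(eid, path)       # recurse into nested events

    return list(walk(None, ''))
```

Each emitted (path, timer_wait) pair corresponds to one row of the CTE's result, ready for the same awk/sed folding as before.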

My idea is to start with statements that are at the top level (nesting_event_id is NULL) and then concatenate statement events with stages and waits down the hierarchy, to get a longer "stack trace" for the flame graph. I proceeded with the following steps (note that I needed a longer column for the deeper hierarchy):

openxs@ao756:~/dbs/8.0$ bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e"with recursive ssw as (
with sw as (
  select event_id, cast(event_name as char(1024)) as event_name, nesting_event_id, timer_wait from performance_schema.events_statements_history_long union
  select event_id, event_name, nesting_event_id, timer_wait from performance_schema.events_stages_history_long union
  select event_id, event_name, nesting_event_id, timer_wait from performance_schema.events_waits_history_long
)
select 0 as level, sw.* from sw
  where sw.event_name not like 'wait%'
  and sw.event_name not like 'stage%'
  and nesting_event_id is null
union all
select ssw.level + 1 as level, sw.event_id, concat(ssw.event_name,concat('/',sw.event_name)) as event_name, sw.nesting_event_id, sw.timer_wait
  from ssw
  inner join sw on sw.nesting_event_id = ssw.event_id
)
select event_name, timer_wait from ssw
order by event_id
" > /tmp/sqlstages2.txt

openxs@ao756:~/dbs/8.0$ ls -l /tmp/sqlstages2.txt
-rw-rw-r-- 1 openxs openxs 1372235 січ  6 22:36 /tmp/sqlstages2.txt
openxs@ao756:~/dbs/8.0$ tail -10 /tmp/sqlstages2.txt
statement/com/Close stmt/stage/sql/cleaning up  439000
statement/com/Close stmt        7371000
statement/com/Close stmt/stage/sql/starting     6554000
statement/com/Close stmt/stage/sql/cleaning up  433000
statement/com/Close stmt        6733000
statement/com/Close stmt/stage/sql/starting     5979000
statement/com/Close stmt/stage/sql/cleaning up  433000
statement/com/Quit      5343000
statement/com/Quit/stage/sql/starting   4552000
statement/com/Quit/stage/sql/cleaning up        409000

and then:

cat /tmp/sqlstages2.txt | awk '{ printf("%s %d\n", $1, $2); }' | sed 's/\//;/g' | ~/git/FlameGraph/flamegraph.pl --colors io --title "Waits" --countname picoseconds > /tmp/stages2.svg

I ended up with a graph like this:

Flame graph of time spent per statement type/stage/wait while running sysbench test

after running sysbench test:

openxs@ao756:~/dbs/8.0$ sysbench --table-size=1000000 --threads=4 --time=20 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_read_write.lua run
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 4
Report intermediate results every 5 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 5s ] thds: 4 tps: 4.00 qps: 85.72 (r/w/o: 60.74/13.99/10.99) lat (ms,95%): 1771.29 err/s: 0.00 reconn/s: 0.00
[ 10s ] thds: 4 tps: 8.00 qps: 166.02 (r/w/o: 116.61/27.40/22.00) lat (ms,95%): 733.00 err/s: 0.00 reconn/s: 0.00
[ 15s ] thds: 4 tps: 10.20 qps: 207.40 (r/w/o: 144.60/33.20/29.60) lat (ms,95%): 634.66 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 4 tps: 10.20 qps: 204.00 (r/w/o: 142.80/33.20/28.00) lat (ms,95%): 520.62 err/s: 0.00 reconn/s: 0.00
SQL statistics:
    queries performed:
        read:                            2324
        write:                           539
        other:                           457
        total:                           3320
    transactions:                        166    (8.20 per sec.)
    queries:                             3320   (164.05 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          20.2331s
    total number of events:              166

Latency (ms):
         min:                                  177.42
         avg:                                  484.72
         max:                                 2039.24
         95th percentile:                      960.30
         sum:                                80463.77

Threads fairness:
    events (avg/stddev):           41.5000/0.87
    execution time (avg/stddev):   20.1159/0.07

Note on the graph above that the stages' time does NOT sum up to the select or Prepare total time spent - it means a large part of the code (where time IS spent) is NOT instrumented as stages of execution. There is still a lot of work to do on Performance Schema, it seems...

I could include the exact SQL statement text in the graph as well (maybe with a separate stored procedure, not just a lame combination of a single SQL statement and simple text processing at the command line), but that would not help much unless I used the width of multiple screens.
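If I ever do that, the statement text would need sanitizing first: the folded format reserves ';' as the frame separator and the last space before the metric, and '/' is already used as the hierarchy separator here. A hypothetical helper (names and replacement characters are my own choice) might look like:

```python
def sql_to_frame(sql, limit=60):
    """Make raw SQL text usable as a single flame graph frame:
    collapse whitespace, then replace the characters that the
    folded format or our '/'-separated hierarchy reserves."""
    frame = ' '.join(sql.split())             # collapse whitespace
    for ch, repl in ((';', ','), (' ', '_'), ('/', '|')):
        frame = frame.replace(ch, repl)
    return frame[:limit]                      # keep frames readable
```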

To summarize, flame graphs are great for a quick overview of the contribution of individual "stages" to a cumulative measure collected in a hierarchy. Recursive CTEs are also cool. One day I'll proceed with further steps along this path, but for tonight the proof of concept and the "invention" of "P_S Flame Graphs" for MySQL DBAs is enough. Stay tuned!

Wednesday, May 12, 2021

Dynamic Tracing of Memory Allocations in MySQL With bcc Tools

Last year I started my lame attempts to apply different Linux dynamic tracing tools and approaches to frequent events like memory allocations. In this blog post I already described how to use perf to add a user probe to trace malloc() calls with the number of bytes requested. Unfortunately, this approach is not practical for production use for more than several seconds.

Recently I've been playing with bpftrace a lot, and so far I've ended up with an easy way to trace calls and call stacks, and was on my way to tracing only outstanding allocations, caring only about memory areas not yet freed. If you are interested, the primitive approach may look like this:

[openxs@fc33 ~]$ cat malloc.bt
#!/usr/bin/env bpftrace

BEGIN
{
    printf("Tracing MariaDB's malloc() calls, Ctrl-C to stop\n");
}

interval:s:$1 { exit(); }

uprobe:/lib64/libc.so.6:malloc
/ comm == "mariadbd" /
{
    @size[tid] += arg0;
/*    printf("Allocating %d bytes in thread %u...\n", arg0, tid);  */
}

uretprobe:/lib64/libc.so.6:malloc
/ comm == "mariadbd" && @size[tid] > 0 /
{
    @memory[tid,retval] = @size[tid];
    @stack[ustack(perf)] += @size[tid];

    print(@stack);
    clear(@stack);

    delete(@size[tid]);
}

uprobe:/lib64/libc.so.6:free
/ comm == "mariadbd" /
{
    delete(@memory[tid, arg0]);
/*    printf("Freeing %p...\n", arg0); */
}

END
{
    clear(@size);
    clear(@memory);
    clear(@stack);
}
[openxs@fc33 ~]$

But while it works (both for the tracing that is commented out above and for summarizing the non-freed allocations) and produced the expected outputs:

[openxs@fc33 ~]$ time sudo ./malloc.bt 1 2>/dev/null >/tmp/malloc_raw.txt

real    8m47.963s
user    2m53.513s
sys     5m50.685s

[openxs@fc33 maria10.5]$ ls -l /tmp/malloc_raw.txt
-rw-r--r--. 1 openxs openxs 461675 Apr 22 10:13 /tmp/malloc_raw.txt
[openxs@fc33 ~]$ tail /tmp/malloc_raw.txt
        558c68a27d12 row_purge_step(que_thr_t*)+770 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c689e8256 que_run_threads(que_thr_t*)+2166 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68a47ab7 purge_worker_callback(void*)+375 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68b88989 tpool::task_group::execute(tpool::task*)+137 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68b87bdf tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        7f599ae455f4 0x7f599ae455f4 ([unknown])
        558c69fc26b0 0x558c69fc26b0 ([unknown])
        558c68b87cb0 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        dde907894810c083 0xdde907894810c083 ([unknown])
]: 33

...

@stack[
        558c689bd25c mem_heap_create_block_func(mem_block_info_t*, unsigned long, unsigned long)+108 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68a3f2ae row_vers_old_has_index_entry(bool, unsigned char const*, mtr_t*, dict_index_t*, dtuple_t const*, unsigned long, unsigned long)+126 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68a24c60 row_purge_poss_sec(purge_node_t*, dict_index_t*, dtuple_t const*, btr_pcur_t*, mtr_t*, bool)+512 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68a2627b row_purge_remove_sec_if_poss_leaf(purge_node_t*, dict_index_t*, dtuple_t const*)+971 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68a272c3 row_purge_record_func(purge_node_t*, unsigned char*, que_thr_t const*, bool)+1459 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68a27d12 row_purge_step(que_thr_t*)+770 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c689e8256 que_run_threads(que_thr_t*)+2166 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68a47ab7 purge_worker_callback(void*)+375 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68b88989 tpool::task_group::execute(tpool::task*)+137 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        558c68b87bdf tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        7f599ae455f4 0x7f599ae455f4 ([unknown])
        558c69fc26b0 0x558c69fc26b0 ([unknown])
        558c68b87cb0 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0 (/home/openxs/dbs/maria10.5/bin/mariadbd)
        dde907894810c083 0xdde907894810c083 ([unknown])
]: 1152

it took 8 minutes(!) to process the data collected during just over 1 second of tracing under high load, and the processing itself caused a notable load:

top - 09:59:04 up  1:16,  3 users,  load average: 1.60, 7.14, 5.31
Tasks: 228 total,   2 running, 226 sleeping,   0 stopped,   0 zombie
%Cpu(s): 10.8 us, 17.5 sy,  0.0 ni, 71.6 id,  0.0 wa,  0.2 hi,  0.0 si,  0.0 st
MiB Mem :   7916.5 total,   1759.6 free,   2527.0 used,   3629.9 buff/cache
MiB Swap:   3958.0 total,   3958.0 free,      0.0 used.   4910.8 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  14813 root      20   0  307872 111852  75516 R  99.7   1.4   7:57.17 bpftrace
...

and a drop in performance for the system in the meantime:

[openxs@fc33 maria10.5]$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --threads=32 --report-interval=10 --time=300 run
sysbench 1.1.0-174f3aa (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 32
Report intermediate results every 10 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 10s ] thds: 32 tps: 759.62 qps: 15246.37 (r/w/o: 10676.06/3047.88/1522.44) lat (ms,95%): 80.03 err/s: 0.00 reconn/s: 0.00
...
[ 70s ] thds: 32 tps: 708.20 qps: 14174.96 (r/w/o: 9920.64/2837.91/1416.41) lat (ms,95%): 74.46 err/s: 0.00 reconn/s: 0.00
[ 80s ] thds: 32 tps: 354.60 qps: 7080.28 (r/w/o: 4964.46/1406.62/709.21) lat (ms,95%): 134.90 err/s: 0.00 reconn/s: 0.00
[ 90s ] thds: 32 tps: 332.91 qps: 6661.34 (r/w/o: 4657.50/1338.03/665.81) lat (ms,95%): 132.49 err/s: 0.00 reconn/s: 0.00

...

So I mention my lame tracing approach here only so that no one tries to do the same - monitoring that causes a 2 times drop in TPS for minutes is hardly acceptable. I obviously made some mistake that I have yet to identify. Probably resolving stack traces and summarizing them in kernel context was too much for the system, and I cannot do better in bpftrace itself unless I use it only to produce raw traces.
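If one does go the raw-trace route, the collected stacks can at least be folded offline. Here is a rough sketch (mine, not a supported tool) that folds the '@stack[ ... ]: N' blocks in the ustack(perf) format shown above into flamegraph.pl input:

```python
import re

# Frames look like: "  558c68a27d12 row_purge_step(que_thr_t*)+770 (/path/mariadbd)"
FRAME = re.compile(r'^\s*[0-9a-f]+ (.*?)(?:\+\d+)? \([^()]*\)\s*$')

def fold_bpftrace_stacks(lines):
    """Fold '@stack[ ... ]: N' blocks printed by bpftrace into
    'root;...;leaf N' lines suitable for flamegraph.pl."""
    folded = {}
    frames = None
    for line in lines:
        if line.strip().startswith('@stack['):
            frames = []                            # a new stack begins
        elif frames is not None:
            m = re.match(r'^\]: (\d+)', line.strip())
            if m:                                  # end of stack: record bytes
                key = ';'.join(reversed(frames))   # bpftrace prints leaf first
                folded[key] = folded.get(key, 0) + int(m.group(1))
                frames = None
            else:
                f = FRAME.match(line)
                if f:
                    frames.append(f.group(1))
    return ['%s %d' % (k, v) for k, v in folded.items()]
```

This moves the expensive part out of the kernel-side maps and into cheap offline post-processing.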

The approach above is also too primitive, as I traced only malloc(), while theoretically realloc() and calloc() calls may be used as well. So, in a hope to see how this task is approached by really experienced people, I checked what bcc tools provide for tracing memory allocations.

The memleak.py tool there is quite advanced. It allows both tracing individual calls and printing periodic summaries of outstanding allocations:

openxs@ao756:~/git/BPF-tools$ /usr/share/bcc/tools/memleak -h
usage: memleak [-h] [-p PID] [-t] [-a] [-o OLDER] [-c COMMAND]
               [--combined-only] [--wa-missing-free] [-s SAMPLE_RATE] [-T TOP]
               [-z MIN_SIZE] [-Z MAX_SIZE] [-O OBJ] [--percpu]
               [interval] [count]

Trace outstanding memory allocations that weren't freed.
Supports both user-mode allocations made with libc functions and kernel-mode
allocations made with kmalloc/kmem_cache_alloc/get_free_pages and corresponding
memory release functions.

positional arguments:
  interval              interval in seconds to print outstanding allocations
  count                 number of times to print the report before exiting

optional arguments:
  -h, --help            show this help message and exit
  -p PID, --pid PID     the PID to trace; if not specified, trace kernel
                        allocs
  -t, --trace           print trace messages for each alloc/free call
  -a, --show-allocs     show allocation addresses and sizes as well as call
                        stacks
  -o OLDER, --older OLDER
                        prune allocations younger than this age in
                        milliseconds
  -c COMMAND, --command COMMAND
                        execute and trace the specified command
  --combined-only       show combined allocation statistics only
  --wa-missing-free     Workaround to alleviate misjudgments when free is
                        missing
  -s SAMPLE_RATE, --sample-rate SAMPLE_RATE
                        sample every N-th allocation to decrease the overhead
  -T TOP, --top TOP     display only this many top allocating stacks (by size)
  -z MIN_SIZE, --min-size MIN_SIZE
                        capture only allocations larger than this size
  -Z MAX_SIZE, --max-size MAX_SIZE
                        capture only allocations smaller than this size
  -O OBJ, --obj OBJ     attach to allocator functions in the specified object
  --percpu              trace percpu allocations

EXAMPLES:

./memleak -p $(pidof allocs)
        Trace allocations and display a summary of "leaked" (outstanding)
        allocations every 5 seconds
./memleak -p $(pidof allocs) -t
        Trace allocations and display each individual allocator function call
./memleak -ap $(pidof allocs) 10
        Trace allocations and display allocated addresses, sizes, and stacks
        every 10 seconds for outstanding allocations
./memleak -c "./allocs"
        Run the specified command and trace its allocations
./memleak
        Trace allocations in kernel mode and display a summary of outstanding
        allocations every 5 seconds
./memleak -o 60000
        Trace allocations in kernel mode and display a summary of outstanding
        allocations that are at least one minute (60 seconds) old
./memleak -s 5
        Trace roughly every 5th allocation, to reduce overhead
openxs@ao756:~/git/BPF-tools$

I've applied it to MySQL 8.0.25, recently built on my Ubuntu 20.04 and running a sysbench oltp_read_write load test:

openxs@ao756:~/dbs/8.0$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=root --mysql-socket=/tmp/mysql8.sock --mysql-db=sbtest --time=300 --report-interval=10 --threads=4 run
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 4
Report intermediate results every 10 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 10s ] thds: 4 tps: 18.49 qps: 377.44 (r/w/o: 264.49/75.57/37.38) lat (ms,95%): 344.08 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 4 tps: 14.80 qps: 296.01 (r/w/o: 207.21/59.20/29.60) lat (ms,95%): 530.08 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 4 tps: 25.99 qps: 519.89 (r/w/o: 363.92/103.98/51.99) lat (ms,95%): 314.45 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 4 tps: 25.30 qps: 506.04 (r/w/o: 354.23/101.21/50.60) lat (ms,95%): 292.60 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 4 tps: 21.90 qps: 437.92 (r/w/o: 306.54/87.58/43.79) lat (ms,95%): 356.70 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 4 tps: 23.51 qps: 470.05 (r/w/o: 329.10/93.93/47.01) lat (ms,95%): 308.84 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 4 tps: 20.29 qps: 405.99 (r/w/o: 284.12/81.28/40.59) lat (ms,95%): 450.77 err/s: 0.00 reconn/s: 0.00
[ 80s ] thds: 4 tps: 20.51 qps: 408.20 (r/w/o: 286.17/81.02/41.01) lat (ms,95%): 390.30 err/s: 0.00 reconn/s: 0.00

[ 90s ] thds: 4 tps: 22.80 qps: 457.80 (r/w/o: 320.03/92.18/45.59) lat (ms,95%): 383.33 err/s: 0.00 reconn/s: 0.00
^C

Ignore the absolute numbers, but note that (unlike with my bpftrace program) there was no very significant drop in QPS over the 20+ seconds I was collecting stacks for outstanding allocations in another shell:

openxs@ao756:~/git/BPF-tools$ time sudo /usr/share/bcc/tools/memleak -p $(pidof mysqld) --top 100 >/tmp/memleak.out
[sudo] password for openxs:
^C
real    0m21,142s
user    0m0,998s
sys     0m0,466s

Now, what was collected? Let's check the top 40 lines:

openxs@ao756:~/git/BPF-tools$ head -40 /tmp/memleak.out
Attaching to pid 3416, Ctrl+C to quit.
[13:51:26] Top 100 stacks with outstanding allocations:
        1536 bytes in 2 allocations from stack
                mem_heap_create_block_func(mem_block_info_t*, unsigned long, unsigned long)+0xc8 [mysqld]
                mem_heap_add_block(mem_block_info_t*, unsigned long)+0x53 [mysqld]
                row_vers_build_for_consistent_read(unsigned char const*, mtr_t*, dict_index_t*, unsigned long**, ReadView*, mem_block_info_t**, mem_block_info_t*, unsigned char**, dtuple_t const**, lob::undo_vers_t*)+0x78c [mysqld]
                row_search_mvcc(unsigned char*, page_cur_mode_t, row_prebuilt_t*, unsigned long, unsigned long)+0x2b78 [mysqld]
                ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)+0x32e [mysqld]
                handler::ha_index_read_map(unsigned char*, unsigned char const*, unsigned long, ha_rkey_function)+0x389 [mysqld]
                handler::read_range_first(key_range const*, key_range const*, bool, bool)+0x6e [mysqld]
                ha_innobase::read_range_first(key_range const*, key_range const*, bool, bool)+0x27 [mysqld]
                handler::multi_range_read_next(char**)+0x135 [mysqld]
                handler::ha_multi_range_read_next(char**)+0x2c [mysqld]
                QUICK_RANGE_SELECT::get_next()+0x5a [mysqld]
                IndexRangeScanIterator::Read()+0x3f [mysqld]
                FilterIterator::Read()+0x18 [mysqld]
                MaterializeIterator::MaterializeQueryBlock(MaterializeIterator::QueryBlock const&, unsigned long long*)+0x133 [mysqld]
                MaterializeIterator::Init()+0x319 [mysqld]
                filesort(THD*, Filesort*, RowIterator*, unsigned long, unsigned long long, Filesort_info*, Sort_result*, unsigned long long*)+0x39d [mysqld]
                SortingIterator::DoSort()+0x72 [mysqld]
                SortingIterator::Init()+0x34 [mysqld]
                Query_expression::ExecuteIteratorQuery(THD*)+0x2ea [mysqld]
                Query_expression::execute(THD*)+0x33 [mysqld]
                Sql_cmd_dml::execute_inner(THD*)+0x30b [mysqld]
                Sql_cmd_dml::execute(THD*)+0x545 [mysqld]
                mysql_execute_command(THD*, bool)+0x9f0 [mysqld]
                Prepared_statement::execute(String*, bool)+0x8b0 [mysqld]
                Prepared_statement::execute_loop(String*, bool)+0x117 [mysqld]
                mysqld_stmt_execute(THD*, Prepared_statement*, bool, unsigned long, PS_PARAM*)+0x1b1 [mysqld]
                dispatch_command(THD*, COM_DATA const*, enum_server_command)+0x175d [mysqld]
                do_command(THD*)+0x1a4 [mysqld]
                handle_connection+0x258 [mysqld]
                pfs_spawn_thread+0x162 [mysqld]
                start_thread+0xd9 [libpthread-2.31.so]
        2304 bytes in 3 allocations from stack
                mem_heap_create_block_func(mem_block_info_t*, unsigned long, unsigned long)+0xc8 [mysqld]
                mem_heap_add_block(mem_block_info_t*, unsigned long)+0x53 [mysqld]
                row_vers_build_for_consistent_read(unsigned char const*, mtr_t*, dict_index_t*, unsigned long**, ReadView*, mem_block_info_t**, mem_block_info_t*, unsigned char**, dtuple_t const**, lob::undo_vers_t*)+0x78c [mysqld]
                row_search_mvcc(unsigned char*, page_cur_mode_t, row_prebuilt_t*, unsigned long, unsigned long)+0x2b78 [mysqld]
                ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)+0x32e [mysqld]
openxs@ao756:~/git/BPF-tools$

This information may be really useful for further analysis. But it is not easy to collapse/summarize for creating memory flame graphs, while I'd really like to use them to quickly see where most of the memory is allocated.
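As a starting point, such a folding script might look like the sketch below (assuming exactly the memleak output format shown above; not tested against other versions):

```python
import re

HEADER = re.compile(r'^\s+(\d+) bytes in \d+ allocations? from stack$')
FRAME = re.compile(r'^\s+(.*?)\+0x[0-9a-f]+ \[')

def fold_memleak(lines):
    """Collapse memleak's per-stack reports into flamegraph.pl
    folded format: 'root;...;leaf bytes'."""
    folded = {}
    size, frames = None, []

    def flush():
        nonlocal size, frames
        if size is not None and frames:
            key = ';'.join(reversed(frames))   # memleak prints leaf first
            folded[key] = folded.get(key, 0) + size
        size, frames = None, []

    for line in lines:
        m = HEADER.match(line)
        if m:
            flush()                            # previous stack is complete
            size = int(m.group(1))
        elif size is not None:
            f = FRAME.match(line)
            if f:
                frames.append(f.group(1))
    flush()
    return ['%s %d' % (k, v) for k, v in folded.items()]
```

Piping the result of fold_memleak() into flamegraph.pl --color=mem would then give an "outstanding allocations" flame graph directly.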

One day I'll get back to memleak and create some script to fold its output. But for this blog post I was looking for quick and dirty ways, and, according to the page linked above, for that I had either to use the general purpose stackcount.py (that just counts the number of occurrences per unique stack), or Brendan Gregg's unsupported mallocstacks, which is similar to stackcount but sums the size_t argument to malloc() as the metric. I've used the latter and had to make a small fix to make it run on my Ubuntu 20.04 netbook:

openxs@ao756:~/git/BPF-tools$ git diff old/2017-12-23/mallocstacks.py
diff --git a/old/2017-12-23/mallocstacks.py b/old/2017-12-23/mallocstacks.py
index 8891e82..92271ed 100755
--- a/old/2017-12-23/mallocstacks.py
+++ b/old/2017-12-23/mallocstacks.py
@@ -96,7 +96,7 @@ struct key_t {
     char name[TASK_COMM_LEN];
 };
 BPF_HASH(bytes, struct key_t);
-BPF_STACK_TRACE(stack_traces, STACK_STORAGE_SIZE)
+BPF_STACK_TRACE(stack_traces, STACK_STORAGE_SIZE);

 int trace_malloc(struct pt_regs *ctx, size_t size) {
     u32 pid = bpf_get_current_pid_tgid();
openxs@ao756:~/git/BPF-tools$

Then I've used it as follows, running it for some 10+ seconds against the same MySQL 8.0.25 under the same load (and with a similar, acceptable impact as with memleak):

openxs@ao756:~/git/BPF-tools/old/2017-12-23$ sudo ~/git/BPF-tools/old/2017-12-23/mallocstacks.py -p $(pidof mysqld) -f >/tmp/alloc.out
^C
openxs@ao756:~/git/BPF-tools/old/2017-12-23$ ls -l /tmp/alloc.out
-rw-rw-r-- 1 openxs openxs 859059 тра 12 10:28 /tmp/alloc.out
openxs@ao756:~/git/BPF-tools$ head -2 /tmp/alloc.out
mysqld;[unknown];std::thread::_State_impl<std::thread::_Invoker<std::tuple<Runnable, void (*)()> > >::~_State_impl();dict_stats_thread();[unknown];std::thread::_State_impl<std::thread::_Invoker<std::tuple<Runnable, void (*)()> > >::_M_run();dict_stats_thread();dict_stats_update(dict_table_t*, dict_stats_upd_option_t);dict_stats_save(dict_table_t*, index_id_t const*);dict_stats_exec_sql(pars_info_t*, char const*, trx_t*);que_eval_sql(pars_info_t*, char const*, unsigned long, trx_t*);que_run_threads(que_thr_t*);row_sel_step(que_thr_t*);row_sel(sel_node_t*, que_thr_t*);eval_cmp(func_node_t*);eval_node_alloc_val_buf(void*, unsigned long);__libc_malloc 33
mysqld;[unknown];std::thread::_State_impl<std::thread::_Invoker<std::tuple<Runnable, void (*)()> > >::~_State_impl();dict_stats_thread();[unknown];std::thread::_State_impl<std::thread::_Invoker<std::tuple<Runnable, void (*)()> > >::_M_run();dict_stats_thread();dict_stats_update(dict_table_t*, dict_stats_upd_option_t);dict_stats_save(dict_table_t*, index_id_t const*);dict_stats_exec_sql(pars_info_t*, char const*, trx_t*);trx_commit_for_mysql(trx_t*);trx_commit(trx_t*);trx_commit_low(trx_t*, mtr_t*);__libc_malloc 40
openxs@ao756:~/git/BPF-tools/old/2017-12-23$ cd
openxs@ao756:~$ cat /tmp/alloc.out | ~/git/FlameGraph/flamegraph.pl --color=mem --title="malloc() Flame Graph" --countname="bytes" >/tmp/mysql8_malloc.svg

So, we've got nice long folded stacks with the number of bytes allocated (no matter if already freed or not) - pure, as-low-impact-as-possible malloc() tracing in a form immediately usable by flamegraph.pl, which produced the following output:

Flame graph showing where most of the memory was allocated from in MySQL 8.0.25 over 10+ seconds, while running standard sysbench oltp_read_write test.

Primitive, but it worked and allowed me to see that most of the allocations were related to filesort.

* * *

To summarize:
  • Relatively low impact tracing of memory allocations (ongoing and outstanding) is possible, using bcc tools.
  • One may use the memleak tool to get quick insights into outstanding memory allocations, no matter where and how they were made, with sampling to reduce the performance impact, or rely on some custom or general tracing tools to collect some metric per stack trace and represent the result as memory flame graphs.
  • It looks like it may make sense to do just primitive tracing with bpftrace and not try to overload the tool with collecting per-stack data in the maps, as stack resolution seems to take too many CPU resources.

Sunday, May 9, 2021

Off-CPU Analysis Attempt to Find the Reason of Performance Regression in MariaDB 10.4

I have not written new blog posts for more than 2 months already. Busy days... But now I am on vacation and my Percona Live Online 2021 talk on flame graphs is coming soon, so I decided to renew my experience with bcc tools and try to get some insights into one of the bugs I've reported for MariaDB using off-CPU flame graphs.

The idea was to check why the sysbench oltp_read_write test started to work notably slower in newer versions of MariaDB 10.4.x after 10.4.15. On my good, old and slow Acer netbook with recently updated Ubuntu: 

openxs@ao756:~/git/server$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"
openxs@ao756:~/git/server$ uname -a
Linux ao756 5.8.0-50-generic #56~20.04.1-Ubuntu SMP Mon Apr 12 21:46:35 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

I've compiled the current MariaDB 10.4 from the GitHub source following my usual routine:

openxs@ao756:~/git/server$ git checkout 10.4
...
openxs@ao756:~/git/server$ git pull
...
openxs@ao756:~/git/server$ git submodule update --init --recursive
Submodule path 'extra/wolfssl/wolfssl': checked out '9c87f979a7f1d3a6d786b260653d566c1d31a1c4'
Submodule path 'libmariadb': checked out '180c543704d627a50a52aaf60e24ca14e0ec4686'
Submodule path 'wsrep-lib': checked out 'f271ad0c6e3c647df83c1d5ec9cd26d77cef2337'
Submodule path 'wsrep-lib/wsrep-API/v26': checked out '76cf223c690845bbf561cb820a46e06a18ad80d1'
openxs@ao756:~/git/server$ git branch
  10.3
* 10.4
  10.5
  10.6
openxs@ao756:~/git/server$ git log -1
commit 583b72ad0ddbc46a7aaeda1c1373b89d4bded9ea (HEAD -> 10.4, origin/bb-10.4-merge, origin/10.4)
Merge: 473e85e9316 a4139f8d68b
Author: Oleksandr Byelkin <sanja@mariadb.com>
Date:   Fri May 7 11:50:24 2021 +0200

    Merge branch 'bb-10.4-release' into 10.4
openxs@ao756:~/git/server$ rm CMakeCache.txt
openxs@ao756:~/git/server$ cmake . -DCMAKE_INSTALL_PREFIX=/home/openxs/dbs/maria10.4 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DBUILD_CONFIG=mysql_release -DFEATURE_SET=community -DWITH_EMBEDDED_SERVER=OFF -DPLUGIN_TOKUDB=NO -DWITH_SSL=system
...
-- Generating done
-- Build files have been written to: /home/openxs/git/server
openxs@ao756:~/git/server$ time make -j 3
...
[100%] Building C object extra/mariabackup/CMakeFiles/mariabackup.dir/__/__/libmysqld/libmysql.c.o
[100%] Linking CXX executable mariabackup
[100%] Built target mariabackup

real    74m9,550s
user    134m10,837s
sys     7m0,387s

openxs@ao756:~/git/server$ make install && make clean
...
openxs@ao756:~/git/server$ cd ~/dbs/maria10.4
openxs@ao756:~/dbs/maria10.4$ bin/mysqld_safe --no-defaults --port=3309 --socket=/tmp/mariadb.sock --innodb_buffer_pool_size=1G --innodb_flush_log_at_trx_commit=2 &
[1] 27483
openxs@ao756:~/dbs/maria10.4$ 210507 19:15:37 mysqld_safe Logging to '/home/openxs/dbs/maria10.4/data/ao756.err'.
210507 19:15:37 mysqld_safe Starting mysqld daemon with databases from /home/openxs/dbs/maria10.4/data

openxs@ao756:~/dbs/maria10.4$ bin/mysql --socket=/tmp/mariadb.sock
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 8
Server version: 10.4.19-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> drop database if exists sbtest;
Query OK, 5 rows affected (1.297 sec)

MariaDB [(none)]> create database sbtest;
Query OK, 1 row affected (0.001 sec)

MariaDB [(none)]> exit
Bye

Then I ran the following test and compared the results to those from 10.4.15:

openxs@ao756:~/dbs/maria10.4$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --threads=4 prepare
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)

Initializing worker threads...

Creating table 'sbtest2'...
Creating table 'sbtest3'...
Creating table 'sbtest4'...
Creating table 'sbtest1'...
Inserting 100000 records into 'sbtest1'
Inserting 100000 records into 'sbtest4'
Inserting 100000 records into 'sbtest2'
Inserting 100000 records into 'sbtest3'
Creating a secondary index on 'sbtest1'...
Creating a secondary index on 'sbtest4'...
Creating table 'sbtest5'...
Inserting 100000 records into 'sbtest5'
Creating a secondary index on 'sbtest2'...
Creating a secondary index on 'sbtest3'...
Creating a secondary index on 'sbtest5'...

openxs@ao756:~/dbs/maria10.4$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --time=300 --report-interval=10 --threads=4 run
...
    transactions:                        35630  (118.54 per sec.)
    queries:                             712600 (2370.87 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          300.5612s
    total number of events:              35630

Latency (ms):
         min:                                    3.30
         avg:                                   33.70
         max:                                 2200.17
         95th percentile:                      164.45
...

openxs@ao756:~/dbs/maria10.4.15$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --time=300 --report-interval=10 --threads=4 run
...
    transactions:                        56785  (189.25 per sec.)
    queries:                             1135700 (3784.99 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          300.0501s
    total number of events:              56785

Latency (ms):
         min:                                    3.15
         avg:                                   21.13
         max:                                  704.36
         95th percentile:                      108.68
...

So, basically on the same test, with all tables fitting into the buffer pool (1G) and all but a few settings at their defaults, current MariaDB 10.4.19 demonstrates a serious drop in throughput (10.4.15 is some 60% faster) and a notable increase in 95th percentile latency on this netbook (an even bigger regression than the 15% or so reported on a faster quad core Fedora desktop previously).

If you read MDEV-24272 carefully, the regression was tracked down to a specific commit, but I tried to apply various tools to actually see where, specifically, more time is spent now. Profiling with perf and creating on-CPU flame graphs had not given me any clear insight that would explain the increase in latency, so my next idea was to trace off-CPU time, that is, to find out how long the MariaDB server waits and where in the code that mostly happens.

For this I've used the offcputime tool from the BCC tools collection:

openxs@ao756:~$ /usr/share/bcc/tools/offcputime -h
usage: offcputime [-h] [-p PID | -t TID | -u | -k] [-U | -K] [-d] [-f]
                  [--stack-storage-size STACK_STORAGE_SIZE]
                  [-m MIN_BLOCK_TIME] [-M MAX_BLOCK_TIME] [--state STATE]
                  [duration]

Summarize off-CPU time by stack trace

positional arguments:
  duration              duration of trace, in seconds

optional arguments:
  -h, --help            show this help message and exit
  -p PID, --pid PID     trace this PID only
  -t TID, --tid TID     trace this TID only
  -u, --user-threads-only
                        user threads only (no kernel threads)
  -k, --kernel-threads-only
                        kernel threads only (no user threads)
  -U, --user-stacks-only
                        show stacks from user space only (no kernel space
                        stacks)
  -K, --kernel-stacks-only
                        show stacks from kernel space only (no user space
                        stacks)
  -d, --delimited       insert delimiter between kernel/user stacks
  -f, --folded          output folded format
  --stack-storage-size STACK_STORAGE_SIZE
                        the number of unique stack traces that can be stored
                        and displayed (default 1024)
  -m MIN_BLOCK_TIME, --min-block-time MIN_BLOCK_TIME
                        the amount of time in microseconds over which we store
                        traces (default 1)
  -M MAX_BLOCK_TIME, --max-block-time MAX_BLOCK_TIME
                        the amount of time in microseconds under which we
                        store traces (default U64_MAX)
  --state STATE         filter on this thread state bitmask (eg, 2 ==
                        TASK_UNINTERRUPTIBLE) see include/linux/sched.h

examples:
    ./offcputime             # trace off-CPU stack time until Ctrl-C
    ./offcputime 5           # trace for 5 seconds only
    ./offcputime -f 5        # 5 seconds, and output in folded format
    ./offcputime -m 1000     # trace only events that last more than 1000 usec
    ./offcputime -M 10000    # trace only events that last less than 10000 usec
    ./offcputime -p 185      # only trace threads for PID 185
    ./offcputime -t 188      # only trace thread 188
    ./offcputime -u          # only trace user threads (no kernel)
    ./offcputime -k          # only trace kernel threads (no user)
    ./offcputime -U          # only show user space stacks (no kernel)
    ./offcputime -K          # only show kernel space stacks (no user)

I've stored the outputs in /dev/shm to reduce the impact on disk I/O, which I suspected as one of the causes of the regression:

openxs@ao756:~$ ls /dev/shm
openxs@ao756:~$ mkdir /dev/shm/out
openxs@ao756:~$ ls /dev/shm
out

Basically, the following commands were used to collect folded user-space stacks (ready for building flame graphs) with the time spent off-CPU in them, over 60 seconds of tracing while the sysbench test was running, first on a clean setup of MariaDB 10.4.15 and then on current 10.4.19. Flame graphs were then built only from those stacks where the start_thread frame is present (to filter out irrelevant details):

openxs@ao756:~$ sudo /usr/share/bcc/tools/offcputime -f -p `pidof mysqld` -U 60 > /dev/shm/out/offcpu_10415.out
WARNING: 27 stack traces lost and could not be displayed.
openxs@ao756:~$ cat /dev/shm/out/offcpu_10415.out | grep start_thread | ~/git/FlameGraph/flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us > /tmp/offcpu_10415.svg

openxs@ao756:~$ sudo /usr/share/bcc/tools/offcputime -f -p `pidof mysqld` -U 60 > /dev/shm/out/offcpu_10419.out
WARNING: 24 stack traces lost and could not be displayed.
openxs@ao756:~$ cat /dev/shm/out/offcpu_10419.out | grep start_thread | ~/git/FlameGraph/flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us > /tmp/offcpu_10419.svg
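
The off-CPU totals quoted below can be read off the flame graph headers, but they can also be computed straight from the folded output: each line ends with the off-CPU time of that stack in microseconds (the unit offcputime reports), so summing the last field over all lines gives the total. A minimal sketch, using the same file path as above:

```shell
# Each folded line is "comm;frame1;...;frameN <microseconds>"; sum the
# last field of every line and print the grand total in seconds.
awk '{ total += $NF } END { printf "total off-CPU: %.1f s\n", total / 1000000 }' \
    /dev/shm/out/offcpu_10415.out
```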

As a result I've got the following graphs (.png screenshots from real .svg files below). On 10.4.15 we spent around 43 seconds (out of 60 we monitored) off-CPU, mostly in do_command() and waiting for network I/O, and the graph was the following:

Off-CPU time for MariaDB 10.4.15

In case of 10.4.19 the graph is very different and we seem to have spent 79 seconds off-CPU, mostly in background io_handler_thread():

Off-CPU time for MariaDB 10.4.19

I was surprised to see more than 60 seconds spent off-CPU in this case. This is possible because off-CPU time is summed over all threads: even on my 2-core netbook, many MariaDB threads can be blocked at the same time, so the total can exceed the wall-clock duration of the trace.

I then tried a differential flame graph to highlight the call stacks the main difference is related to. I created it from the existing folded stack traces with the following command:

openxs@ao756:~$ ~/git/FlameGraph/difffolded.pl /dev/shm/out/offcpu_10415.out /dev/shm/out/offcpu_10419.out | grep start_thread | ~/git/FlameGraph/flamegraph.pl --color=io --title="Off-CPU Time Diff Flame Graph" --countname=us > /tmp/offcpu_diff.svg
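
difffolded.pl emits one line per stack with two trailing counts, before and after ("stack count1 count2"), and flamegraph.pl uses the delta between them for coloring. When the graphical diff is hard to read, the biggest regressions can also be listed in plain text by sorting on that delta; a sketch, assuming the same folded files as above:

```shell
# Sort stacks by the increase in off-CPU microseconds (count2 - count1)
# and show the 10 biggest regressions in text form.
~/git/FlameGraph/difffolded.pl /dev/shm/out/offcpu_10415.out /dev/shm/out/offcpu_10419.out |
    awk '{ print $NF - $(NF-1), $1 }' | sort -rn | head -10
```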

The resulting graph is presented below:

The source of increase is highlighted in red

Here we clearly see that the main increase in time spent waiting in 10.4.19 is related to io_handler_thread(), but the increase happened in almost all background threads.

* * *

To summarize:

  • When a performance regression happens, you should check not only the code paths where the software started to spend more time working, but also those where it started to wait more.
  • In the case of https://jira.mariadb.org/browse/MDEV-24272 we clearly started to flush more to disk from the very beginning of the sysbench oltp_read_write test in newer versions of MariaDB 10.4.x, and on my slow encrypted HDD this matters a lot. The load that was supposed to be CPU-bound (as the buffer pool is large enough) becomes disk-bound.
  • Flame graphs are great for highlighting such differences. In this post I've shown both a classical, low-impact way to produce them and a way to highlight the difference between two runs with a differential flame graph produced by the difffolded.pl tool created by Brendan Gregg.
  • Other cases where flame graphs may help MySQL or MariaDB DBAs will be discussed in my upcoming Percona Live 2021 talk on May 12. See you there!
  • I'll get back to this nice regression bug to study the test case in more detail with other tools, maybe more than once. Stay tuned!