Happy New Year 2022, dear readers! It's not the first time I am writing something about or illustrate some points with a flame graph. No wonder. Flame graph concept and related tools by Brendan Gregg provide a great way for visualizing metrics in any nested hierarchies, with function calls stacks being just the most popular and known example of them.
In this new blog post that I am writing while preparing to my upcoming FOSDEM 2022 talk "Flame Graphs for MySQL DBAs" I am going to explore a different hierarchy that is obvious from these simple SQL queries and their outputs:
mysql> select event_name, timer_wait from events_waits_history_long order by 1 desc limit 5;
+----------------------------------------------+------------+
| event_name | timer_wait |
+----------------------------------------------+------------+
| wait/synch/sxlock/innodb/trx_purge_latch | 747273 |
| wait/synch/sxlock/innodb/index_tree_rw_lock | 767343 |
| wait/synch/sxlock/innodb/hash_table_locks | 637557 |
| wait/synch/sxlock/innodb/hash_table_locks | 280980 |
| wait/synch/sxlock/innodb/dict_operation_lock | 1731372 |
+----------------------------------------------+------------+
5 rows in set (0,02 sec)
Another one, from the summary table:
mysql> select event_name, sum_timer_wait from performance_schema.events_waits_summary_global_by_event_name order by 2 desc limit 5;
+---------------------------------------------------------+-------------------+
| event_name | sum_timer_wait |
+---------------------------------------------------------+-------------------+
| wait/synch/cond/mysqlx/scheduler_dynamic_worker_pending | 20961067338483726 |
| idle | 1461111486462000 |
| wait/io/socket/sql/client_connection | 1373040224524194 |
| wait/synch/cond/sql/MYSQL_BIN_LOG::COND_done | 1238396604397386 |
| wait/io/file/innodb/innodb_temp_file | 330058399794447 |
+---------------------------------------------------------+-------------------+
5 rows in set (0,05 sec)
We clearly see a hierarchy of waits of different types. It takes just few simple steps to convert this kind of output to the formal expected by the flamegraph.pl tool:
# The input is stack frames and sample counts formatted as single lines. Each
# frame in the stack is semicolon separated, with a space and count at the end
# of the line. These can be generated for Linux perf script output using
# stackcollapse-perf.pl, for DTrace using stackcollapse.pl, and for other tools
# using the other stackcollapse programs. Example input:
#
# swapper;start_kernel;rest_init;cpu_idle;default_idle;native_safe_halt 1
#
# An optional extra column of counts can be provided to generate a differential
# flame graph of the counts, colored red for more, and blue for less. This
# can be useful when using flame graphs for non-regression testing.
# See the header comment in the difffolded.pl program for instructions.
Let me show a really simple way. We can get the lines without extra decorations:
openxs@ao756:~/dbs/8.0$ bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'select event_name, timer_wait from events_waits_history_long' >/tmp/waits.txt
openxs@ao756:~/dbs/8.0$ head -10 /tmp/waits.txt
event_name timer_wait
wait/synch/mutex/innodb/log_writer_mutex 933255
wait/synch/mutex/innodb/log_flush_notifier_mutex 937938
wait/synch/mutex/innodb/log_flusher_mutex 561960
wait/synch/mutex/innodb/trx_sys_serialisation_mutex 345873
wait/synch/mutex/innodb/log_write_notifier_mutex 441540
wait/synch/mutex/mysqlx/lock_list_access 908502
wait/synch/mutex/innodb/log_checkpointer_mutex 927903
wait/synch/mutex/innodb/flush_list_mutex 160560
wait/synch/mutex/innodb/log_limits_mutex 120420
and then with an assumption that there are no spaces inside the event_name we can apply simple combination of sed and awk commands to replace '/' with expected ';' and make sure there is just one space before the metric (time in picoseconds in this case). The rest is for the flamegraph.pl script to handle:
openxs@ao756:~/dbs/8.0$ cat /tmp/waits.txt | awk '{ printf("%s %d\n", $1, $2); }' | sed 's/\//;/g' | ~/git/FlameGraph/flamegraph.pl --inverted --colors io --title "Waits" --countname picoseconds --width 1000 > /tmp/wait.svg
openxs@ao756:~/dbs/8.0$
I've used a few options above, including --inverted and --colors io to end up with "icicles". The resulting graph with "frames" named like "log" highlighted is below:
Icicle graph for waits, showing the impact of "idle" and InnoDB logging-related waits |
The graph above is not very funny looking. That's because the hierarchy of event names in performance_schema is not very deep. You can surely make similar highlights with proper SQL filtering, but .svg file is interactive and you can get a lot of insights after running just one query. You can surely add more fun by adding more data and using more complex SQL statements and text processing.
Since MySQL started to support recursive CTEs I always wanted to use it to navigate through more complex transactions/statements/stages/waits hierarchy that is present in the performance_schema. So, today I tried to use a "proof o concept" recursive CTE to build a bit more funny flame graph and get myself ready to summarize time spent per statement with the option to drill down to related waits if needed. I ended up with this kind of query to show statements by type by stage and by wait related to a stage. No comments to begin with, just a lame SQL (somewhat inspired by the sys.ps_trace_thread()'s cursor here:
with recursive ssw as (
with sw as (
select event_id, event_name, nesting_event_id, timer_wait from performance_schema.events_statements_history_long union
select event_id, event_name, nesting_event_id, timer_wait from performance_schema.events_stages_history_long union
select event_id, event_name, nesting_event_id, timer_wait from performance_schema.events_waits_history_long
)
select 0 as level, sw.* from sw
where sw.event_name not like 'wait%'
and sw.event_name not like 'stage%'
and nesting_event_id is null
union all
select ssw.level + 1 as level, sw.event_id, concat(ssw.event_name,concat('/',sw.event_name)) as event_name, sw.nesting_event_id, sw.timer_wait
from ssw
inner join sw on sw.nesting_event_id = ssw.event_id
)
select event_name, timer_wait from ssw
order by event_id;
My idea is to start with statements that are at the top level (nested_event_id is NULL) and then concatenate statement events with stages and waits in the hierarchy, to get a longer stack trace for the flame graph. With the following steps (note that I needed longer column for deeper hierarchy):
openxs@ao756:~/dbs/8.0$ bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e"with recursive ssw as (
with sw as (
select event_id, cast(event_name as char(1024)) as event_name, nesting_event_id, timer_wait from performance_schema.events_statements_history_long union
select event_id, event_name, nesting_event_id, timer_wait from performance_schema.events_stages_history_long union
select event_id, event_name, nesting_event_id, timer_wait from performance_schema.events_waits_history_long
)
select 0 as level, sw.* from sw
where sw.event_name not like 'wait%'
and sw.event_name not like 'stage%'
and nesting_event_id is null
union all
select ssw.level + 1 as level, sw.event_id, concat(ssw.event_name,concat('/',sw.event_name)) as event_name, sw.nesting_event_id, sw.timer_wait
from ssw
inner join sw on sw.nesting_event_id = ssw.event_id
)
select event_name, timer_wait from ssw
order by event_id
" > /tmp/sqlstages2.txt
openxs@ao756:~/dbs/8.0$ ls -l /tmp/sqlstages2.txt
-rw-rw-r-- 1 openxs openxs 1372235 січ 6 22:36 /tmp/sqlstages2.txt
openxs@ao756:~/dbs/8.0$ tail -10 /tmp/sqlstages2.txt
statement/com/Close stmt/stage/sql/cleaning up 439000
statement/com/Close stmt 7371000
statement/com/Close stmt/stage/sql/starting 6554000
statement/com/Close stmt/stage/sql/cleaning up 433000
statement/com/Close stmt 6733000
statement/com/Close stmt/stage/sql/starting 5979000
statement/com/Close stmt/stage/sql/cleaning up 433000
statement/com/Quit 5343000
statement/com/Quit/stage/sql/starting 4552000
statement/com/Quit/stage/sql/cleaning up 409000
and then:
cat /tmp/sqlstages2.txt | awk '{ printf("%s %d\n", $1, $2); }' | sed 's/\//;/g' | ~/git/FlameGraph/flamegraph.pl --colors io --title "Waits" --countname picoseconds > /tmp/stages2.svg
I ended up with a graph like this:
Flame graph of time spent per statement type/stage/wait while running sysbench test |
after running sysbench test:
openxs@ao756:~/dbs/8.0$ sysbench --table-size=1000000 --threads=4 --time=20 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_read_write.lua run
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)
Running the test with following options:
Number of threads: 4
Report intermediate results every 5 second(s)
Initializing random number generator from current time
Initializing worker threads...
Threads started!
[ 5s ] thds: 4 tps: 4.00 qps: 85.72 (r/w/o: 60.74/13.99/10.99) lat (ms,95%): 1771.29 err/s: 0.00 reconn/s: 0.00
[ 10s ] thds: 4 tps: 8.00 qps: 166.02 (r/w/o: 116.61/27.40/22.00) lat (ms,95%): 733.00 err/s: 0.00 reconn/s: 0.00
[ 15s ] thds: 4 tps: 10.20 qps: 207.40 (r/w/o: 144.60/33.20/29.60) lat (ms,95%): 634.66 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 4 tps: 10.20 qps: 204.00 (r/w/o: 142.80/33.20/28.00) lat (ms,95%): 520.62 err/s: 0.00 reconn/s: 0.00
SQL statistics:
queries performed:
read: 2324
write: 539
other: 457
total: 3320
transactions: 166 (8.20 per sec.)
queries: 3320 (164.05 per sec.)
ignored errors: 0 (0.00 per sec.)
reconnects: 0 (0.00 per sec.)
General statistics:
total time: 20.2331s
total number of events: 166
Latency (ms):
min: 177.42
avg: 484.72
max: 2039.24
95th percentile: 960.30
sum: 80463.77
Threads fairness:
events (avg/stddev): 41.5000/0.87
execution time (avg/stddev): 20.1159/0.07
Note on the graph above that stages time does NOT sum up to the select or Prepare total time spent - it means large part of the code (where time IS spent) is NOT instrumented as stages of execution. There is still a lot of work to do on Performance Schema it seems...
I could include exact SQL statements text into the graph as well (maybe with a separate stored procedure, not just a lame combination of single SQL statement and simple text processing command line), but that would not help much unless I use width of multiple screens.
To summarize, flame graphs are great for quick overview of the impact of individual "stages" into a cumulative measure collected in a hierarchy. Recursive CTEs are also cool. One day I'll proceed with further steps on this way, but for tonight the proof of concept and "invention" of "P_S Flame Graphs" for MySQL DBAs is enough. Stay tuned!
No comments:
Post a Comment