Thursday, January 6, 2022

Visualizing Performance Schema Events with Flame Graphs

Happy New Year 2022, dear readers! It's not the first time I am writing something about or illustrate some points with a flame graph. No wonder. Flame graph concept and related tools by Brendan Gregg provide a great way for visualizing metrics in any nested hierarchies, with function calls stacks being just the most popular and known example of them.

In this new blog post that I am writing while preparing to my upcoming FOSDEM 2022 talk "Flame Graphs for MySQL DBAs" I am going to explore a different hierarchy that is obvious from these simple SQL queries and their outputs:

mysql> select event_name, timer_wait from events_waits_history_long order by 1 desc limit 5;
+----------------------------------------------+------------+
| event_name                                   | timer_wait |
+----------------------------------------------+------------+
| wait/synch/sxlock/innodb/trx_purge_latch     |     747273 |
| wait/synch/sxlock/innodb/index_tree_rw_lock  |     767343 |
| wait/synch/sxlock/innodb/hash_table_locks    |     637557 |
| wait/synch/sxlock/innodb/hash_table_locks    |     280980 |
| wait/synch/sxlock/innodb/dict_operation_lock |    1731372 |
+----------------------------------------------+------------+
5 rows in set (0,02 sec)

Another one, from the summary table:

mysql> select event_name, sum_timer_wait from performance_schema.events_waits_summary_global_by_event_name order by 2 desc limit 5;
+---------------------------------------------------------+-------------------+
| event_name                                              | sum_timer_wait    |
+---------------------------------------------------------+-------------------+
| wait/synch/cond/mysqlx/scheduler_dynamic_worker_pending | 20961067338483726 |
| idle                                                    |  1461111486462000 |
| wait/io/socket/sql/client_connection                    |  1373040224524194 |
| wait/synch/cond/sql/MYSQL_BIN_LOG::COND_done            |  1238396604397386 |
| wait/io/file/innodb/innodb_temp_file                    |   330058399794447 |
+---------------------------------------------------------+-------------------+
5 rows in set (0,05 sec)

We clearly see a hierarchy of waits of different types. It takes just few simple steps to convert this kind of output to the formal expected by the flamegraph.pl tool:

# The input is stack frames and sample counts formatted as single lines.  Each
# frame in the stack is semicolon separated, with a space and count at the end
# of the line.  These can be generated for Linux perf script output using
# stackcollapse-perf.pl, for DTrace using stackcollapse.pl, and for other tools
# using the other stackcollapse programs.  Example input:
#
#  swapper;start_kernel;rest_init;cpu_idle;default_idle;native_safe_halt 1
#
# An optional extra column of counts can be provided to generate a differential
# flame graph of the counts, colored red for more, and blue for less.  This
# can be useful when using flame graphs for non-regression testing.
# See the header comment in the difffolded.pl program for instructions.

Let me show a really simple way. We can get the lines without extra decorations:

openxs@ao756:~/dbs/8.0$ bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'select event_name, timer_wait from events_waits_history_long' >/tmp/waits.txt
openxs@ao756:~/dbs/8.0$ head -10 /tmp/waits.txt
event_name      timer_wait
wait/synch/mutex/innodb/log_writer_mutex        933255
wait/synch/mutex/innodb/log_flush_notifier_mutex        937938
wait/synch/mutex/innodb/log_flusher_mutex       561960
wait/synch/mutex/innodb/trx_sys_serialisation_mutex     345873
wait/synch/mutex/innodb/log_write_notifier_mutex        441540
wait/synch/mutex/mysqlx/lock_list_access        908502
wait/synch/mutex/innodb/log_checkpointer_mutex  927903
wait/synch/mutex/innodb/flush_list_mutex        160560
wait/synch/mutex/innodb/log_limits_mutex        120420

and then with an assumption that there are no spaces inside the event_name we can apply simple combination of sed and awk commands to replace '/' with expected ';' and make sure there is just one space before the metric (time in picoseconds in this case). The rest is for the flamegraph.pl script to handle:

openxs@ao756:~/dbs/8.0$ cat /tmp/waits.txt | awk '{ printf("%s %d\n", $1, $2); }' | sed 's/\//;/g' | ~/git/FlameGraph/flamegraph.pl --inverted --colors io --title "Waits" --countname picoseconds --width 1000 > /tmp/wait.svg
openxs@ao756:~/dbs/8.0$

I've used a few options above, including --inverted and --colors io to end up with "icicles". The resulting graph with "frames" named like "log" highlighted is below:

Icicle graph for waits, showing the impact of "idle" and InnoDB logging-related waits

The graph above is not very funny looking. That's because the hierarchy of event names in performance_schema is not very deep. You can surely make similar highlights with proper SQL filtering, but .svg file is interactive and you can get a lot of insights after running just one query. You can surely add more fun by adding more data and using more complex SQL statements and text processing.

Since MySQL started to support recursive CTEs I always wanted to use it to navigate through more complex transactions/statements/stages/waits hierarchy that is present in the performance_schema. So, today I tried to use a "proof o concept" recursive CTE to build a bit more funny flame graph and get myself ready to summarize time spent per statement with the option to drill down to related waits if needed. I ended up with this kind of query to show statements by type by stage and by wait related to a stage. No comments to begin with, just a lame SQL (somewhat inspired by the sys.ps_trace_thread()'s cursor here:

with recursive ssw as (
with sw as (
  select event_id, event_name, nesting_event_id, timer_wait from performance_schema.events_statements_history_long union
  select event_id, event_name, nesting_event_id, timer_wait from performance_schema.events_stages_history_long union
  select event_id, event_name, nesting_event_id, timer_wait from performance_schema.events_waits_history_long
)
select 0 as level, sw.* from sw
  where sw.event_name not like 'wait%'
  and sw.event_name not like 'stage%'
  and nesting_event_id is null
union all
select ssw.level + 1 as level, sw.event_id, concat(ssw.event_name,concat('/',sw.event_name)) as event_name, sw.nesting_event_id, sw.timer_wait
  from ssw
  inner join sw on sw.nesting_event_id = ssw.event_id
)
select event_name, timer_wait from ssw
order by event_id;

My idea is to start with statements that are at the top level (nested_event_id is NULL) and then concatenate statement events with stages and waits in the hierarchy, to get a longer stack trace for the flame graph. With the following steps (note that I needed longer column for deeper hierarchy):

openxs@ao756:~/dbs/8.0$ bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e"with recursive ssw as (
with sw as (
  select event_id, cast(event_name as char(1024)) as event_name, nesting_event_id, timer_wait from performance_schema.events_statements_history_long union
  select event_id, event_name, nesting_event_id, timer_wait from performance_schema.events_stages_history_long union
  select event_id, event_name, nesting_event_id, timer_wait from performance_schema.events_waits_history_long
)
select 0 as level, sw.* from sw
  where sw.event_name not like 'wait%'
  and sw.event_name not like 'stage%'
  and nesting_event_id is null
union all
select ssw.level + 1 as level, sw.event_id, concat(ssw.event_name,concat('/',sw.event_name)) as event_name, sw.nesting_event_id, sw.timer_wait
  from ssw
  inner join sw on sw.nesting_event_id = ssw.event_id
)
select event_name, timer_wait from ssw
order by event_id
" > /tmp/sqlstages2.txt

openxs@ao756:~/dbs/8.0$ ls -l /tmp/sqlstages2.txt
-rw-rw-r-- 1 openxs openxs 1372235 січ  6 22:36 /tmp/sqlstages2.txt
openxs@ao756:~/dbs/8.0$ tail -10 /tmp/sqlstages2.txt
statement/com/Close stmt/stage/sql/cleaning up  439000
statement/com/Close stmt        7371000
statement/com/Close stmt/stage/sql/starting     6554000
statement/com/Close stmt/stage/sql/cleaning up  433000
statement/com/Close stmt        6733000
statement/com/Close stmt/stage/sql/starting     5979000
statement/com/Close stmt/stage/sql/cleaning up  433000
statement/com/Quit      5343000
statement/com/Quit/stage/sql/starting   4552000
statement/com/Quit/stage/sql/cleaning up        409000

and then:

cat /tmp/sqlstages2.txt | awk '{ printf("%s %d\n", $1, $2); }' | sed 's/\//;/g' | ~/git/FlameGraph/flamegraph.pl --colors io --title "Waits" --countname picoseconds > /tmp/stages2.svg

I ended up with a graph like this:

Flame graph of time spent per statement type/stage/wait while running sysbench test

after running sysbench test:

openxs@ao756:~/dbs/8.0$ sysbench --table-size=1000000 --threads=4 --time=20 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_read_write.lua run
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 4
Report intermediate results every 5 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 5s ] thds: 4 tps: 4.00 qps: 85.72 (r/w/o: 60.74/13.99/10.99) lat (ms,95%): 1771.29 err/s: 0.00 reconn/s: 0.00
[ 10s ] thds: 4 tps: 8.00 qps: 166.02 (r/w/o: 116.61/27.40/22.00) lat (ms,95%): 733.00 err/s: 0.00 reconn/s: 0.00
[ 15s ] thds: 4 tps: 10.20 qps: 207.40 (r/w/o: 144.60/33.20/29.60) lat (ms,95%): 634.66 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 4 tps: 10.20 qps: 204.00 (r/w/o: 142.80/33.20/28.00) lat (ms,95%): 520.62 err/s: 0.00 reconn/s: 0.00
SQL statistics:
    queries performed:
        read:                            2324
        write:                           539
        other:                           457
        total:                           3320
    transactions:                        166    (8.20 per sec.)
    queries:                             3320   (164.05 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          20.2331s
    total number of events:              166

Latency (ms):
         min:                                  177.42
         avg:                                  484.72
         max:                                 2039.24
         95th percentile:                      960.30
         sum:                                80463.77

Threads fairness:
    events (avg/stddev):           41.5000/0.87
    execution time (avg/stddev):   20.1159/0.07

Note on the graph above that stages time does NOT sum up to the select or Prepare total time spent - it means large part of the code (where time IS spent) is NOT instrumented as stages of execution. There is still a lot of work to do on Performance Schema it seems...

I could include exact SQL statements text into the graph as well (maybe with a separate stored procedure, not just a lame combination of single SQL statement and simple text processing command line), but that would not help much unless I use width of multiple screens.

To summarize, flame graphs are great for quick overview of the impact of individual "stages" into a cumulative measure collected in a hierarchy. Recursive CTEs are also cool. One day I'll proceed with further steps on this way, but for tonight the proof of concept and "invention" of "P_S Flame Graphs" for MySQL DBAs is enough. Stay tuned!

No comments:

Post a Comment