Saturday, January 15, 2022

Accessing Complex Structures in MariaDB Server Code in bpftrace Probes - First Steps

I have already written many blog posts about bpftrace. All my public talks devoted to bpftrace included a slide about "problems", and one of them usually sounded like this:

...access to complex structures (bpftrace needs headers)...

So, today I decided to demonstrate how to resolve this minor problem. As an example, I tried to reproduce the gdb debugging steps from this older post. I want to trace and report table- and row-level locks set during execution of various SQL statements against InnoDB tables in MariaDB Server 10.6.6 built from more or less current GitHub code. To reduce the performance impact I'll use a recent version of bpftrace, also built from GitHub source:

openxs@ao756:~/git/bpftrace/build$ src/bpftrace --version
bpftrace v0.14.0-50-g4228

The first thing to find out (given the same test case as in that older blog post) is what functions to add uprobes to and what information is available in them. Let's start with gdb and set a couple of breakpoints, on the lock_table and lock_rec_lock functions:

openxs@ao756:~/dbs/maria10.6$ sudo gdb -p `pidof mariadbd`
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 14850
[New LWP 14852]
[New LWP 14853]
[New LWP 14854]
[New LWP 14855]
[New LWP 14859]
[New LWP 14861]
[New LWP 14862]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
--Type <RET> for more, q to quit, c to continue without paging--
0x00007f126bc04aff in __GI___poll (fds=0x55ef5e36c838, nfds=3,
    timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29      ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
(gdb) b lock_table
Breakpoint 1 at 0x55ef5c44a840: lock_table. (4 locations)
(gdb) b lock_rec_lock
Breakpoint 2 at 0x55ef5c1c4896: lock_rec_lock. (2 locations)
(gdb) c
Continuing.
[New Thread 0x7f1245ffd700 (LWP 14899)]
[New Thread 0x7f1246fff700 (LWP 14900)]
[New Thread 0x7f1268518700 (LWP 14904)]
[Switching to Thread 0x7f1268518700 (LWP 14904)]

Thread 11 "mariadbd" hit Breakpoint 1, lock_table (table=0x7f1218065b20,
    mode=LOCK_IS, thr=0x7f121806cc10)
    at /home/openxs/git/server/storage/innobase/lock/lock0lock.cc:3481
warning: Source file is more recent than executable.
3481    {
(gdb) p table
$1 = (dict_table_t *) 0x7f1218065b20
(gdb) p table->name
$2 = {m_name = 0x7f1218020908 "test/tt", static part_suffix = "#P#"}
(gdb) c
Continuing.

Thread 11 "mariadbd" hit Breakpoint 1, lock_table (table=0x7f1218065b20,
    mode=LOCK_IS, thr=0x7f121806cc10)
    at /home/openxs/git/server/storage/innobase/include/que0que.ic:37
warning: Source file is more recent than executable.
37              return(thr->graph->trx);
(gdb) c
Continuing.

Thread 11 "mariadbd" hit Breakpoint 2, lock_rec_lock (impl=false, mode=2,
    block=0x7f12480325a0, heap_no=2, index=0x7f1218066f90, thr=0x7f121806cc10)
    at /home/openxs/git/server/storage/innobase/include/que0que.ic:37
37              return(thr->graph->trx);
(gdb) p index
$3 = (dict_index_t *) 0x7f1218066f90
(gdb) p index->name
$4 = {m_name = 0x7f1218067120 "PRIMARY"}
(gdb) p index->table->name
$5 = {m_name = 0x7f1218020908 "test/tt", static part_suffix = "#P#"}
(gdb) p index->table
$6 = (dict_table_t *) 0x7f1218065b20
(gdb) q
A debugging session is active.

        Inferior 1 [process 14850] will be detached.

Quit anyway? (y or n) y
Detaching from program: /home/openxs/dbs/maria10.6/bin/mariadbd, process 14850
[Inferior 1 (process 14850) detached]
openxs@ao756:~/dbs/maria10.6$

From the above we already see that each breakpoint is set in more than one place (so we should expect more than one function name to match the uprobe). We also see some structures passed as function arguments that we want to access, so we need a way to define them for tracing.

Now, let's try something naive and add a probe (this should be done as root or via sudo) for just lock_table:

openxs@ao756:~/git/bpftrace/build$ src/bpftrace -e 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table { printf("lock_table: %p, %d, %p\n", arg0, arg1, arg2); }'
ERROR: bpftrace currently only supports running as the root user.
openxs@ao756:~/git/bpftrace/build$ sudo src/bpftrace -e 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table { printf("lock_table: %p, %d, %p\n", arg0, arg1, arg2); }'
[sudo] password for openxs:
Attaching 20 probes...
^C

we can already suspect something bad, as we ended up with 20 probes. We can even list them in a readable way:

openxs@ao756:~/git/bpftrace/build$ sudo src/bpftrace -l 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table' | c++filt
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table(dict_table_t*, lock_mode, que_thr_t*)
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int)
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_names(THD*, DDL_options_st const&, TABLE_LIST*, TABLE_LIST*, unsigned long, unsigned int)
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_names(THD*, DDL_options_st const&, TABLE_LIST*, TABLE_LIST*, unsigned long, unsigned int) [clone .cold]
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_create(dict_table_t*, unsigned int, trx_t*, ib_lock_t*)
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_for_trx(dict_table_t*, trx_t*, lock_mode)
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_has_locks(dict_table_t*)
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_resurrect(dict_table_t*, trx_t*, lock_mode)
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_resurrect(dict_table_t*, trx_t*, lock_mode) [clone .cold]
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_lock_list_init(ut_list_base<ib_lock_t, ut_list_node<ib_lock_t> lock_table_t::*>*)
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_print(_IO_FILE*, ib_lock_t const*)
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_print(_IO_FILE*, ib_lock_t const*) [clone .cold]
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_wsrep(dict_table_t*, lock_mode, que_thr_t*, trx_t*)
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_wsrep(dict_table_t*, lock_mode, que_thr_t*, trx_t*) [clone .cold]
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_dequeue(ib_lock_t*, bool)
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_dequeue(ib_lock_t*, bool) [clone .cold]
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_dequeue(ib_lock_t*, bool) [clone .constprop.0]
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_dequeue(ib_lock_t*, bool) [clone .constprop.0] [clone .cold]
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_tables_precheck(THD*, TABLE_LIST*)
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_tables_open_and_lock_tables(THD*, TABLE_LIST*)
openxs@ao756:~/git/bpftrace/build$ sudo src/bpftrace -l 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table' | wc -l
20

So, as expected, our probe matches any function with "lock_table" in its name, and using the first row of the output (the demangled function signature) does NOT help:

openxs@ao756:~/git/bpftrace/build$ sudo src/bpftrace -l 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table' | head -1 | c++filt
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table(dict_table_t*, lock_mode, que_thr_t*)
openxs@ao756:~/git/bpftrace/build$ sudo src/bpftrace -l 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table(dict_table_t*, lock_mode, que_thr_t*)'
stdin:1:1-59: ERROR: syntax error, unexpected (, expecting {
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table(dict_table_t*, lock_mode, que_thr_t*)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can NOT use the demangled signature, but we can use its mangled equivalent:

openxs@ao756:~/git/bpftrace/build$ sudo src/bpftrace -l 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table' | head -1
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:_Z10lock_tableP12dict_table_t9lock_modeP9que_thr_t
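By the way, the mangled name is unambiguous because the Itanium C++ ABI encodes the identifier's length and the parameter types directly into the symbol. A tiny Python sketch (handling only this trivial case of a plain, non-namespaced function, so by no means a general demangler) shows how the function name is recovered from the symbol:

```python
import re

def mangled_basename(symbol):
    """Extract the identifier from a simple Itanium-mangled symbol.

    Handles only the trivial case _Z<len><name><params...>: after the
    _Z prefix comes the decimal length of the identifier, then the
    identifier itself, then the encoded parameter types. This is NOT
    a general demangler (no namespaces, templates, operators, etc.).
    """
    m = re.match(r'_Z(\d+)', symbol)
    if not m:
        return None  # not a mangled C++ name of this simple shape
    start = m.end()
    length = int(m.group(1))
    return symbol[start:start + length]

print(mangled_basename('_Z10lock_tableP12dict_table_t9lock_modeP9que_thr_t'))
# -> lock_table
```

So the mangled symbol pins down exactly one overload, which is why it makes a usable uprobe name where the demangled signature does not.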

We can use the mangled name in the probe like this:

openxs@ao756:~/git/bpftrace/build$ sudo src/bpftrace -e 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:_Z10lock_tableP12dict_table_t9lock_modeP9que_thr_t
  { printf("lock_table: %p, %d, %p\n", arg0, arg1, arg2); }'

Attaching 1 probe...
lock_table: 0x7f1218065b20, 0, 0x7f121806cc10
lock_table: 0x7f121801d690, 4, 0x7f1218068d60
lock_table: 0x7f121801d690, 1, 0x7f1218068d60
lock_table: 0x7f122400d350, 1, 0x7f121807a158
lock_table: 0x7f122400d350, 1, 0x7f121807a158
lock_table: 0x7f12240101e0, 1, 0x7f121807a900
lock_table: 0x7f12240101e0, 1, 0x7f121807a900
lock_table: 0x7f12240101e0, 1, 0x7f121807a900
lock_table: 0x7f12240101e0, 1, 0x7f121807a900
lock_table: 0x7f12240101e0, 1, 0x7f121807a900
lock_table: 0x7f12240101e0, 1, 0x7f121807a900
^C

I got the above output while executing this SQL statement:

MariaDB [test]> insert into t(val) select 100 from tt;
Query OK, 4 rows affected (0.038 sec)
Records: 4  Duplicates: 0  Warnings: 0

These are a few more table-level lock requests than one would expect, and from more than one thread. This can be explained, but the real problem is that we see the lock mode and the locking thread address, but NOT the actual name of the locked table (it is hidden inside a complex structure). From the gdb output we know that the first argument (arg0) of the function is a pointer to a dict_table_t structure, but how is it defined?

I do not know InnoDB source code by heart, so I have to search (with some assumptions in mind this is easier):

openxs@ao756:~/git/server$ grep -rn 'struct dict_table_t' * | grep '\.h'
storage/innobase/include/row0import.h:34:struct dict_table_t;
storage/innobase/include/srv0start.h:33:struct dict_table_t;
storage/innobase/include/dict0mem.h:1788:struct dict_table_t {
storage/innobase/include/dict0types.h:39:struct dict_table_t;

I've highlighted the line where the structure definition begins, so I can see what's there (only the important part, which comes much later in the definition, is quoted):

...
public:
        /** Id of the table. */
        table_id_t                              id;
        /** dict_sys.id_hash chain node */
        dict_table_t*                           id_hash;
        /** Table name in name_hash */
        table_name_t                            name;
...

Even the name is not a simple scalar data type, so we have to find its definition in the code (I did not care much to verify that this is really the 10.6 branch, optimistically assuming that such basic things do not change that often in MariaDB):

openxs@ao756:~/git/server$ vi +102 storage/innobase/include/dict0types.h

struct table_name_t
{
        /** The name in internal representation */
        char*   m_name;

        /** Default constructor */
        table_name_t() {}
        /** Constructor */
        table_name_t(char* name) : m_name(name) {}

        /** @return the end of the schema name */
        const char* dbend() const
        {
                const char* sep = strchr(m_name, '/');
                ut_ad(sep);
                return sep;
        }

        /** @return the length of the schema name, in bytes */
        size_t dblen() const { return size_t(dbend() - m_name); }

        /** Determine the filename-safe encoded table name.
        @return the filename-safe encoded table name */
        const char* basename() const { return dbend() + 1; }

        /** The start of the table basename suffix for partitioned tables */
        static const char part_suffix[4];

        /** Determine the partition or subpartition name suffix.
        @return the partition name
        @retval NULL    if the table is not partitioned */
        const char* part() const { return strstr(basename(), part_suffix); }

        /** @return whether this is a temporary or intermediate table name */
        inline bool is_temporary() const;
};

I do not care about function members, just the data stored in the structure (highlighted). Moreover, read this part of the bpftrace reference carefully:

You can define your own structs when needed. In some cases, kernel structs are not declared in the kernel headers package, and are declared manually in bpftrace tools (or partial structs are: enough to reach the member to dereference).

The thing is, while we can theoretically use #include in bpftrace, the structures used in the MariaDB server are defined in so many headers, and are so specific and sometimes so deeply nested, that it's usually easier to use properly sized placeholders in explicit struct definitions. You can get the proper sizes from gdb:

(gdb) p sizeof(table_id_t)
$1 = 8
(gdb) p sizeof(dict_table_t *)
$2 = 8
(gdb) p sizeof(long long)
$3 = 8

Taking the above into account, we can try to define structures like these:

struct table_name_t
{
        char*   m_name;
        char part_suffix[4];
}

struct dict_table_t {
        long long                              id;
        struct dict_table_t*                           id_hash;
        struct table_name_t                            name;
}

in our bpftrace code to be able to properly dereference the table name as a null-terminated string:

openxs@ao756:~/git/bpftrace/build$ sudo src/bpftrace -e '
>
> struct table_name_t
> {
        char*   m_name;
        char part_suffix[4];
> }
>
> struct dict_table_t {
        long long                              id;
        struct dict_table_t*                   id_hash;
        struct table_name_t                    name;
> }
>
> uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:_Z10lock_tableP12dict_table_t9lock_modeP9que_thr_t
>   { printf("lock_table: %s, %d, %p\n", str(((struct dict_table_t *)arg0)->name.m_name), arg1, arg2); }'
Attaching 1 probe...
lock_table: test/tt, 0, 0x7f121806cc10
lock_table: test/t, 4, 0x7f1218068d60
lock_table: test/t, 1, 0x7f1218068d60
lock_table: mysql/innodb_table_stats, 1, 0x7f122800e7d8
lock_table: mysql/innodb_table_stats, 1, 0x7f122800e7d8
lock_table: mysql/innodb_index_stats, 1, 0x7f122800ef80
lock_table: mysql/innodb_index_stats, 1, 0x7f122800ef80
lock_table: mysql/innodb_index_stats, 1, 0x7f122800ef80
lock_table: mysql/innodb_index_stats, 1, 0x7f122800ef80
lock_table: mysql/innodb_index_stats, 1, 0x7f122800ef80
lock_table: mysql/innodb_index_stats, 1, 0x7f122800ef80
^C

Looks good enough as a proof of concept already. We see how to define simplified structures containing just enough information to find and dereference items of the complex nested structures used all over the MariaDB server code.
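One way to double-check that such a placeholder really lines up with what gdb reported is to model the same partial layout in Python's ctypes. This is just a sketch assuming x86_64/LP64 with natural alignment (which is what the sizeof output above suggests); note that part_suffix is a static member in the real C++ struct, so it occupies no per-object space and can be omitted from the model:

```python
import ctypes

# Placeholder mirrors of the (partial) InnoDB structs; only the
# leading members needed to reach name.m_name are modeled.
class table_name_t(ctypes.Structure):
    _fields_ = [
        ("m_name", ctypes.c_char_p),   # char* (8 bytes on LP64)
        # part_suffix is static in the C++ struct: no per-object space
    ]

class dict_table_t(ctypes.Structure):
    _fields_ = [
        ("id", ctypes.c_longlong),     # table_id_t, 8 bytes per gdb
        ("id_hash", ctypes.c_void_p),  # dict_table_t*, 8 bytes
        ("name", table_name_t),        # table_name_t
    ]

# name (and hence name.m_name) should sit 16 bytes into dict_table_t:
print(dict_table_t.name.offset)       # -> 16
print(ctypes.sizeof(table_name_t))    # -> 8
```

If the offsets printed here disagreed with what the probe dereferences, str() would read garbage instead of a table name, so a sanity check like this is cheap insurance.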

I'd surely want to add a probe to lock_rec_lock too, and for this I need some more gdb output from the breakpoint set earlier:

Thread 11 "mariadbd" hit Breakpoint 2, lock_rec_lock (impl=false, mode=2,
    block=0x7f12480325a0, heap_no=2, index=0x7f1218066f90, thr=0x7f121806cc10)
    at /home/openxs/git/server/storage/innobase/include/que0que.ic:37
37              return(thr->graph->trx);
(gdb) p index
$3 = (dict_index_t *) 0x7f1218066f90
(gdb) p index->name
$4 = {m_name = 0x7f1218067120 "PRIMARY"}
(gdb) p index->table->name
$5 = {m_name = 0x7f1218020908 "test/tt", static part_suffix = "#P#"}
(gdb) p index->table
$6 = (dict_table_t *) 0x7f1218065b20
(gdb) p index->id
$7 = 48
(gdb) p sizeof(index->id)
$8 = 8
(gdb) p sizeof(index->heap)
$9 = 8
(gdb) p sizeof(index->name)
$10 = 8

(gdb) p impl
$11 = false
(gdb) p sizeof(impl)
$12 = 1
(gdb) p sizeof(mode)
$13 = 4

I can surely find the definition of the dict_index_t structure and work based on it:

struct dict_index_t {
  /** Maximum number of fields */
  static constexpr unsigned MAX_N_FIELDS= (1U << 10) - 1;

        index_id_t      id;     /*!< id of the index */
        mem_heap_t*     heap;   /*!< memory heap */
        id_name_t       name;   /*!< index name */
        dict_table_t*   table;  /*!< back pointer to table */
...

but the sizes above are actually enough to come up with proper code like this:

sudo src/bpftrace -e '

struct table_name_t
{
        char*   m_name;
        char part_suffix[4];
}

struct dict_table_t {
        long long                              id;
        struct dict_table_t*                           id_hash;
        struct table_name_t                            name;
}

struct dict_index_t {
        long long        id;
        long long        heap;
        char*            name;
        struct dict_table_t*    table;
}

uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:_Z10lock_tableP12dict_table_t9lock_modeP9que_thr_t
  { printf("lock_table: %s, mode: %d, thread: %p\n",
      str(((struct dict_table_t *)arg0)->name.m_name),
      arg1,
      arg2); }

uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_rec_lock
  { printf("lock_rec_lock: impl (%d) mode %d index %s rec of %s, thread: %p\n",
      arg0,
      arg1,
      str(((struct dict_index_t *)arg4)->name),
      str(((struct dict_index_t *)arg4)->table->name.m_name),
      arg5); }

that produces the following: 

openxs@ao756:~/git/bpftrace/build$ sudo src/bpftrace -e '
>
> struct table_name_t
> {
        char*   m_name;
        char part_suffix[4];
> }
>
> struct dict_table_t {
        long long                              id;
        struct dict_table_t*                           id_hash;
        struct table_name_t                            name;
> }
>
> struct dict_index_t {
        long long        id;
        long long        heap;
        char*            name;
        struct dict_table_t*    table;
> }
>
> uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:_Z10lock_tableP12dict_table_t9lock_modeP9que_thr_t
  { printf("lock_table: %s, mode: %d, thread: %p\n",
      str(((struct dict_table_t *)arg0)->name.m_name),
      arg1,
      arg2); }
>
> uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_rec_lock
  { printf("lock_rec_lock: impl (%d) mode %d index %s rec of %s, thread: %p\n",
      arg0,
      arg1,
      str(((struct dict_index_t *)arg4)->name),
      str(((struct dict_index_t *)arg4)->table->name.m_name),
      arg5); }
>
> '
Attaching 3 probes...
lock_table: test/tt, mode: 0, thread: 0x7f121806cc10
lock_rec_lock: impl (0) mode 2 index PRIMARY rec of test/tt, thread: 0x7f121806cc10
lock_table: test/t, mode: 4, thread: 0x7f1218068d60
lock_table: test/t, mode: 1, thread: 0x7f1218068d60
lock_rec_lock: impl (0) mode 2 index PRIMARY rec of test/tt, thread: 0x7f121806cc10
lock_rec_lock: impl (0) mode 2 index PRIMARY rec of test/tt, thread: 0x7f121806cc10
lock_rec_lock: impl (0) mode 2 index PRIMARY rec of test/tt, thread: 0x7f121806cc10
lock_rec_lock: impl (0) mode 2 index PRIMARY rec of test/tt, thread: 0x7f121806cc10
lock_table: mysql/innodb_table_stats, mode: 1, thread: 0x7f123400ebc8
lock_rec_lock: impl (0) mode 3 index PRIMARY rec of mysql/innodb_table_stats, thread: 0x7f123400ebc8
lock_rec_lock: impl (0) mode 3 index PRIMARY rec of mysql/innodb_table_stats, thread: 0x7f123400ebc8
lock_table: mysql/innodb_table_stats, mode: 1, thread: 0x7f123400ebc8
lock_rec_lock: impl (0) mode 1026 index PRIMARY rec of mysql/innodb_table_stats, thread: 0x7f123400ebc8
lock_table: mysql/innodb_index_stats, mode: 1, thread: 0x7f123400f370
lock_rec_lock: impl (0) mode 3 index PRIMARY rec of mysql/innodb_index_stats, thread: 0x7f123400f370
lock_rec_lock: impl (0) mode 3 index PRIMARY rec of mysql/innodb_index_stats, thread: 0x7f123400f370
lock_table: mysql/innodb_index_stats, mode: 1, thread: 0x7f123400f370
lock_rec_lock: impl (0) mode 1026 index PRIMARY rec of mysql/innodb_index_stats, thread: 0x7f123400f370
lock_table: mysql/innodb_index_stats, mode: 1, thread: 0x7f123400f370
lock_rec_lock: impl (0) mode 3 index PRIMARY rec of mysql/innodb_index_stats, thread: 0x7f123400f370
lock_rec_lock: impl (0) mode 3 index PRIMARY rec of mysql/innodb_index_stats, thread: 0x7f123400f370
lock_table: mysql/innodb_index_stats, mode: 1, thread: 0x7f123400f370
lock_rec_lock: impl (0) mode 1026 index PRIMARY rec of mysql/innodb_index_stats, thread: 0x7f123400f370
lock_table: mysql/innodb_index_stats, mode: 1, thread: 0x7f123400f370
lock_rec_lock: impl (0) mode 3 index PRIMARY rec of mysql/innodb_index_stats, thread: 0x7f123400f370
lock_rec_lock: impl (0) mode 3 index PRIMARY rec of mysql/innodb_index_stats, thread: 0x7f123400f370
lock_table: mysql/innodb_index_stats, mode: 1, thread: 0x7f123400f370
lock_rec_lock: impl (0) mode 1026 index PRIMARY rec of mysql/innodb_index_stats, thread: 0x7f123400f370
^C

while this SQL is executed, for example:

MariaDB [test]> insert into t(val) select 100 from tt;
Query OK, 4 rows affected (0.080 sec)
Records: 4  Duplicates: 0  Warnings: 0

Please check that older post for proper interpretation of the output, including the constants used to represent lock modes etc. The idea here was to show that bpftrace understands a subset of the C/C++ struct statement, and to show ways to create such simplified structures to be able to dereference and access deeply nested arguments if needed for tracing.
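For convenience while reading the trace output above, here is a sketch of decoding those numeric modes. The base values follow InnoDB's lock_mode enum as found in the source tree I built from (LOCK_IS=0 through LOCK_AUTO_INC=4), and record lock requests may have extra flag bits OR-ed into the mode argument, which is why values like 1026 appear; double-check these constants against your exact server version:

```python
# Illustrative helper: decode the numeric lock modes seen in the
# bpftrace output above. Base modes follow InnoDB's lock_mode enum;
# for lock_rec_lock the mode argument may carry additional flag bits
# OR-ed in (e.g. 1026 = 1024 | 2), hence the masking below.
LOCK_MODES = {
    0: "LOCK_IS",        # intention shared (table)
    1: "LOCK_IX",        # intention exclusive (table)
    2: "LOCK_S",         # shared
    3: "LOCK_X",         # exclusive
    4: "LOCK_AUTO_INC",  # AUTO_INCREMENT table lock
}
LOCK_MODE_MASK = 0xF     # low bits hold the base mode

def decode_mode(mode):
    base = mode & LOCK_MODE_MASK
    flags = mode & ~LOCK_MODE_MASK
    name = LOCK_MODES.get(base, "UNKNOWN")
    return f"{name}+0x{flags:x}" if flags else name

print(decode_mode(4))     # -> LOCK_AUTO_INC (the insert into t)
print(decode_mode(1026))  # -> LOCK_S+0x400 (record-lock flag bits OR-ed in)
```

With a mapping like this in hand, the earlier trace reads naturally: mode 0 on test/tt is the intention shared lock, mode 4 on test/t is the AUTO_INC lock for the insert, and so on.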

A bit more advanced code may show more details about the individual rows locked...

* * *

To summarize:

  1. With some effort (source code checks, gdb breakpoints and prints) one can eventually figure out what structures to define in bpftrace code to access members of the complex structures typical for MariaDB and MySQL server code.
  2. Including the existing headers is not always possible and is hardly practical for complex software like MariaDB that uses C++ nowadays.
  3. bpftrace, with some effort like that presented above, allows us to study what happens in complex server operations (like InnoDB locking) easily, safely and with minimal impact on a running system.
  4. bpftrace is cool, but you already know that!

Monday, January 10, 2022

Differential Flame Graphs to Highlight Performance Schema Waits Impact

Yet another type of flame graph that I have not discussed yet is the differential flame graph (again invented by Brendan Gregg). It shows the difference between two flame graphs in a clear way (assuming they are comparable - it's on you to make sure the comparison makes sense and to interpret the output properly). The flame graph is drawn using the "after" profile (so that the frame widths show values for the second graph), and then colorized by the delta to show how we got there. If the metric for a specific frame in the same stack is larger, the frame is shown in red. If it's smaller, the frame is blue (hence the name red/blue differential flame graphs). The saturation is relative to the delta, so dark red is for a frame that has a much bigger value in the second graph. White and very light frames can be ignored.
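The coloring rule is simple enough to sketch directly. The following is just an illustration of the idea (flamegraph.pl's actual palette logic differs in details): map each frame's delta to a red or blue shade whose saturation grows with the relative size of the change:

```python
def diff_color(before, after, max_delta):
    """Map a frame's metric change to an (R, G, B) triple: red when
    the 'after' value is larger, blue when smaller, white when
    unchanged; saturation grows with |delta| relative to the
    largest delta seen across all frames (max_delta)."""
    delta = after - before
    if max_delta == 0 or delta == 0:
        return (255, 255, 255)             # unchanged -> white
    # 0.0 .. 1.0: how strong this change is relative to the biggest one
    strength = min(abs(delta) / max_delta, 1.0)
    other = int(255 * (1.0 - strength))    # fade the non-dominant channels
    if delta > 0:
        return (255, other, other)         # shades of red (got worse/bigger)
    return (other, other, 255)             # shades of blue (got smaller)

print(diff_color(100, 100, 50))  # -> (255, 255, 255): white, no change
print(diff_color(100, 150, 50))  # -> (255, 0, 0): dark red, maximal increase
print(diff_color(100, 90, 50))   # -> (204, 204, 255): light blue, small decrease
```

This is why a quick glance at a red/blue differential graph works: the eye is drawn straight to the darkest red frames, which are exactly the ones whose metric grew the most between the two runs.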

Let me apply this approach to the flame graphs showing waits reported by MySQL's performance_schema (built as described in this blog post). As a proof of concept I'll use an easy-to-interpret case where the same oltp_write_only.lua sysbench test is run with different values of innodb_flush_log_at_trx_commit and otherwise the same settings, like 16 concurrent threads, on my old 2-core Ubuntu 20.04 "home server" with a slow HDD.

So, here is the first run:

sysbench --table-size=1000000 --threads=1 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_write_only.lua cleanup
sysbench --table-size=1000000 --threads=1 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_write_only.lua prepare
bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'truncate table events_waits_history_long'
bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'set global innodb_flush_log_at_trx_commit=0'
sysbench --table-size=1000000 --threads=16 --time=120 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_write_only.lua run
bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'select event_name, timer_wait from events_waits_history_long' >/tmp/waits64_0.txt
cat /tmp/waits64_0.txt | awk '{ printf("%s %d\n", $1, $2); }' | sed 's/\//;/g' > /tmp/w64_0.out

That ended up with this sysbench statistics:

SQL statistics:
    queries performed:
        read:                            0
        write:                           26252
        other:                           13126
        total:                           39378
    transactions:                        6563   (54.60 per sec.)
    queries:                             39378  (327.59 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          120.2017s
    total number of events:              6563

Latency (ms):
         min:                                   66.37
         avg:                                  292.94
         max:                                 1222.92
         95th percentile:                      530.08
         sum:                              1922587.72

Threads fairness:
    events (avg/stddev):           410.1875/4.69
    execution time (avg/stddev):   120.1617/0.05

and then the second:

sysbench --table-size=1000000 --threads=1 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_write_only.lua cleanup
sysbench --table-size=1000000 --threads=1 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_write_only.lua prepare
bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'truncate table events_waits_history_long'
bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'set global innodb_flush_log_at_trx_commit=1'
sysbench --table-size=1000000 --threads=16 --time=120 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_write_only.lua run
bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'select event_name, timer_wait from events_waits_history_long' >/tmp/waits64_1.txt
cat /tmp/waits64_1.txt | awk '{ printf("%s %d\n", $1, $2); }' | sed 's/\//;/g' > /tmp/w64_1.out

that produced the following:

SQL statistics:
    queries performed:
        read:                            0
        write:                           16238
        other:                           8119
        total:                           24357
    transactions:                        4059   (33.71 per sec.)
    queries:                             24357  (202.27 per sec.)
    ignored errors:                      1      (0.01 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          120.4139s
    total number of events:              4059

Latency (ms):
         min:                                  151.87
         avg:                                  474.29
         max:                                 1316.65
         95th percentile:                      773.68
         sum:                              1925150.61

Threads fairness:
    events (avg/stddev):           253.6875/3.51
    execution time (avg/stddev):   120.3219/0.13

Now, with the resulting "collapsed stacks" in the /tmp/w64_?.out files, we can build a simple differential flame graph:

openxs@ao756:~/dbs/8.0$ ~/git/FlameGraph/difffolded.pl /tmp/w64_0.out /tmp/w64_1.out | ~/git/FlameGraph/flamegraph.pl --count picoseconds --title Waits > /tmp/w64_01_diff.svg

that, when captured as a .png file for this blog post looks as follows:

Differential flame graph visualizing the impact of innodb_flush_log_at_trx_commit value (the difference 1 makes vs 0) on waits reported by performance_schema for the same oltp_write_only.lua test

We clearly see the negative impact on performance (54 TPS vs 33 TPS). We see that the time spent on InnoDB log I/O increased, and that idle time somewhat increased as well. The highest relative increase was for the /wait/synch/cond/sql/MYSQL_BIN_LOG::COND_done condition variable (5+%), and that was probably related to the binlog group commit implementation, where we cannot write to the binary log until the redo log is flushed. A wild guess, surely.

I deliberately used flame graphs that are as simple as possible, to make interpretation of the difference based on just a screenshot easier. This lame test demonstrated that we really can see what we expected - somewhat increased redo log I/O waits, and they are highlighted in red. Good enough for the proof of concept, and way easier to speculate about than a usual Off-CPU differential flame graph like this one:

that I built while attempting to reproduce one real life MariaDB performance problem...

Saturday, January 8, 2022

Visualizing MySQL Plan Execution Time With Flame Graphs

In the previous post I've already shown one non-classical but useful application of flame graphs as a visualization of something interesting for MySQL DBAs (besides the usual stack traces from profilers): time spent on waits, or even statement execution by stages with related waits, as reported by performance_schema. Today I am going to abuse the tools even more and try to show the real impact of each step of query execution as reported by the EXPLAIN ANALYZE statement that is supported since MySQL 8.0.18.

The idea to use flame graphs for SQL plan visualization was (to the best of my knowledge) first suggested by Tanel Poder in this article, in the context of Oracle RDBMS. There are several free software implementations of the idea here and there, for example.

Unfortunately I have not found any similar posts for MySQL, and the tools mentioned above do not work with MySQL. It made me wonder why.

For quite some time we have had a way to get several metrics for each execution step of a query plan in MySQL. Let's start with some stupid query (used to prove the concept and test possible steps to implement it):

openxs@ao756:~/dbs/8.0$ bin/mysql -uroot --socket=/tmp/mysql8.sock test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 33
Server version: 8.0.27 Source distribution

Copyright (c) 2000, 2021, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> explain analyze select user, host from mysql.user u1 where u1.user not in (select distinct user from mysql.user) order by host desc\G
*************************** 1. row ***************************
EXPLAIN: -> Nested loop antijoin  (cost=3.75 rows=25) (actual time=0.139..0.139 rows=0 loops=1)
    -> Covering index scan on u1 using PRIMARY (reverse)  (cost=0.75 rows=5) (actual time=0.058..0.064 rows=5 loops=1)
    -> Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`)  (actual time=0.003..0.003 rows=1 loops=5)
        -> Materialize with deduplication  (cost=1.25..1.25 rows=5) (actual time=0.071..0.071 rows=4 loops=1)
            -> Filter: (mysql.`user`.`User` is not null)  (cost=0.75 rows=5) (actual time=0.022..0.035 rows=5 loops=1)
                -> Covering index scan on user using PRIMARY  (cost=0.75 rows=5) (actual time=0.020..0.031 rows=5 loops=1)

1 row in set (0,02 sec)

We see a hierarchical representation of the query execution steps (it's Oracle's famous TREE format of EXPLAIN output, the only one supported for EXPLAIN ANALYZE). Each row represents some step, explains what the step was doing, quite verbosely, and then provides several useful metrics like the cost, the estimated number of rows, and the actual time to return the first and all rows for this step, etc. The actual time to return all rows at a step is what DBAs are usually interested in. The steps form a tree, but the way it's represented is a bit unusual - we do not see JSON or any other structured format as some other RDBMSes provide. Instead, each row has its level in the hierarchy represented by the number of spaces before the very informative "->" prompt. The first row with its additional "EXPLAIN: " prompt aside, each nesting level adds 4 spaces, as highlighted by the light green background above.
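Just to make the indentation rule concrete, here is an illustrative sketch in Python (purely for explanation; the actual processing in this post sticks to classic Unix text tools) of how the 4-spaces-per-level rule and the "actual time" totals could be turned into flamegraph.pl's collapsed-stack format. It is deliberately simplified: it keys frames on the step description only and ignores cost, rows and loops:

```python
import re

def parse_tree(lines):
    """Parse EXPLAIN ANALYZE TREE lines into (depth, step, total_ms).

    Depth is the indentation before '->' divided by 4; total_ms is
    the second number in 'actual time=first..total'.
    """
    steps = []
    for line in lines:
        m = re.match(r'(\s*)-> (.*?)\s*\(.*actual time=[\d.]+\.\.([\d.]+)', line)
        if m:
            depth = len(m.group(1)) // 4
            steps.append((depth, m.group(2), float(m.group(3))))
    return steps

def to_collapsed(steps):
    """Emit 'frame1;frame2;... value' lines, one per plan step,
    rebuilding the ancestor stack from the depth of each row."""
    stack, out = [], []
    for depth, step, total in steps:
        stack = stack[:depth] + [step]   # truncate to parent, push self
        out.append('%s %.3f' % (';'.join(stack), total))
    return out

demo = [
    "-> Nested loop antijoin  (cost=3.75 rows=25) (actual time=0.115..0.115 rows=0 loops=1)",
    "    -> Covering index scan on u1 using PRIMARY (reverse)  (cost=0.75 rows=5) (actual time=0.045..0.049 rows=5 loops=1)",
]
for row in to_collapsed(parse_tree(demo)):
    print(row)
```

A real implementation would also have to decide how to apportion a parent's time among its children (the totals above are cumulative per subtree), but the tree-from-indentation part is this simple.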

Maybe this unusual representation (not in a table of plan steps like in Oracle, and not as JSON) prevented people from quickly implementing a flame graph-based query "profiler" for MySQL. But it is definitely not going to stop me from hacking something that may even work. I have not really been a developer for 16+ years now, so I will not use Node.js or Python or anything fancy - just old Unix text processing tools like sed and awk, and surely the power of SQL (the small part of it I managed to master).

So, with the output like the above saved in a file:

openxs@ao756:~/dbs/8.0$ cat /tmp/explain.txt
*************************** 1. row ***************************
EXPLAIN: -> Nested loop antijoin  (cost=3.75 rows=25) (actual time=0.115..0.115 rows=0 loops=1)
    -> Covering index scan on u1 using PRIMARY (reverse)  (cost=0.75 rows=5) (actual time=0.045..0.049 rows=5 loops=1)
    -> Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`)  (actual time=0.002..0.002 rows=1 loops=5)
        -> Materialize with deduplication  (cost=1.25..1.25 rows=5) (actual time=0.063..0.063 rows=4 loops=1)
            -> Filter: (mysql.`user`.`User` is not null)  (cost=0.75 rows=5) (actual time=0.025..0.034 rows=5 loops=1)
                -> Covering index scan on user using PRIMARY  (cost=0.75 rows=5) (actual time=0.024..0.032 rows=5 loops=1)

openxs@ao756:~/dbs/8.0$

The first stage is simple and is handled by the following command line that I came up with quite quickly:

openxs@ao756:~/dbs/8.0$  cat /tmp/explain.txt | awk 'NR > 1' | sed 's/EXPLAIN: //' | sed 's/(cost[^)][^)]*)//' | sed 's/(actual time=//' | sed 's/\(..*\)\.\...*/\1/' | sed 's/  \([^ ][^ ]*\)$/; \1/'
-> Nested loop antijoin ; 0.115
    -> Covering index scan on u1 using PRIMARY (reverse) ; 0.045
    -> Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`); 0.002
        -> Materialize with deduplication ; 0.063
            -> Filter: (mysql.`user`.`User` is not null) ; 0.025
                -> Covering index scan on user using PRIMARY ; 0.024

openxs@ao756:~/dbs/8.0$

At each step of the pipeline I sequentially removed the first line of the output, removed the "EXPLAIN: " prefix, removed the cost and number of rows estimations, and then extracted just the actual time to return all rows from the step (separated by ";", as flamegraph.pl wants), keeping the detailed step description with its initial spaces.

The next stage of processing took me a lot of time and effort, as I had to convert the output above into a different format that is suitable for loading into the database:

openxs@ao756:~/dbs/8.0$ cat /tmp/plan.txt
1;0;0; Nested loop antijoin ; 0.115
2;1;1; Covering index scan on u1 using PRIMARY (reverse) ; 0.045
3;1;1; Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`); 0.002
4;3;2; Materialize with deduplication ; 0.063
5;4;3; Filter: (mysql.`user`.`User` is not null) ; 0.025
6;5;4; Covering index scan on user using PRIMARY ; 0.024

The format is simple: a semicolon-separated row number (the step in the plan), the row number of the "parent" step in the hierarchy (with 0 used for the very first step, obviously), the level of the step in the hierarchy (just in case), the same detailed step description, and the value of the metric for the step (time to return all rows, in milliseconds). This file is then loaded into the following MySQL table:

mysql> desc plan;
+--------+---------------+------+-----+---------+-------+
| Field  | Type          | Null | Key | Default | Extra |
+--------+---------------+------+-----+---------+-------+
| seq    | int           | YES  |     | NULL    |       |
| parent | int           | YES  |     | NULL    |       |
| level  | int           | YES  |     | NULL    |       |
| step   | varchar(1024) | YES  |     | NULL    |       |
| val    | decimal(10,3) | YES  |     | NULL    |       |
+--------+---------------+------+-----+---------+-------+
5 rows in set (0,00 sec)

mysql> truncate plan;
Query OK, 0 rows affected (1,77 sec)

mysql> load data infile '/tmp/plan.txt' into table plan fields terminated by ';';
Query OK, 6 rows affected (0,13 sec)
Records: 7  Deleted: 0  Skipped: 0  Warnings: 0

mysql> select * from plan;
+------+--------+-------+------------------------------------------------------------------------------------------+-------+
| seq  | parent | level | step                                                                                     | val   |
+------+--------+-------+------------------------------------------------------------------------------------------+-------+
|    1 |      0 |     0 |  Nested loop antijoin                                                                    | 0.115 |
|    2 |      1 |     1 |  Covering index scan on u1 using PRIMARY (reverse)                                       | 0.045 |
|    3 |      1 |     1 |  Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`) | 0.002 |
|    4 |      3 |     2 |  Materialize with deduplication                                                          | 0.063 |
|    5 |      4 |     3 |  Filter: (mysql.`user`.`User` is not null)                                               | 0.025 |
|    6 |      5 |     4 |  Covering index scan on user using PRIMARY                                               | 0.024 |
+------+--------+-------+------------------------------------------------------------------------------------------+-------+
6 rows in set (0,00 sec)

Now, how to end up with that format? I spent a lot of time with awk and created the following script:

awk '
function st_push(val) {
st_array[st_pos++] = val;
}

function st_pop() {
return (st_size() > 0) ? st_array[--st_pos] : "ERROR";
}

function st_size() {
return st_pos;
}

BEGIN { p[0] = 0; level[0] = 0; parent[0] = 0; }

NF > 0 {
  # print "> " NR ";" $1;
  split($0,a,"->");
  lvl = length(a[1])/4;

  level[NR] = lvl;
  parent[NR] = NR-1;

  if (level[NR] > level[NR-1]) {
    st_push(parent[NR]);
    p[level[NR]] = NR-1;
  }
  else if (level[NR] < level[NR-1]) {
    for (i=0; i<=(level[NR-1]-level[NR]); i++) {
      parent[NR] = st_pop();
      # print "popped " parent[NR];
    }
  }
  else {
    parent[NR]=parent[NR-1]
  }
  print NR ";" parent[NR] ";" lvl ";" a[2]
}'

The script starts with 3 functions implementing a stack, and uses 3 associative arrays: p[] to store the current parent row number for a specific nesting level, level[] to store the hierarchy level of a specific row, and parent[] to store the number of the row that is the parent of a given row in the plan.

To find the level of a row I split() the input row using "->" as the field separator (assuming it is never used in the step description) and end up with the initial spaces in a[1] and the rest (the step description and the metric value) in a[2]. I divide the length of a[1] by 4 to find the level.

This was the simple part. Then I try to determine the parent row number, using a stack to save the previous parent row number for each level. It took some testing with a more primitive input format to come up with "tree traversal" code for such a weird tree representation, and it seems to work on the few different nested structures I tested (correct me if it fails for more complex plans). I am not a developer any more, so the code above may be wrong.
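For comparison, here is a simpler sketch of the same traversal without an explicit stack: remember the most recent row number seen at each level, and the parent of a row at level N is simply the last row seen at level N-1. This is my own alternative illustration (the last[] array name is made up), not the script used above, and it relies on the same 4-spaces-per-level assumption:

```shell
# Sketch: derive "seq;parent;level;step" lines from stage-1 output
# by tracking the most recent row number seen at each nesting level.
awk '
NF > 0 {
    split($0, a, "->");                       # a[1] = leading spaces
    lvl = length(a[1]) / 4;                   # 4 spaces per level
    last[lvl] = NR;                           # latest row at this level
    parent = (lvl > 0) ? last[lvl - 1] : 0;   # parent = last row one level up
    print NR ";" parent ";" lvl ";" a[2]
}' <<'EOF'
-> Nested loop antijoin ; 0.115
    -> Covering index scan on u1 ; 0.045
    -> Single-row index lookup ; 0.002
        -> Materialize with deduplication ; 0.063
EOF
```

For this sample input it prints the same seq;parent;level numbering as plan.txt above; for example, the last line comes out as `4;3;2; Materialize with deduplication ; 0.063`.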

So, with the plan in the table, where each line has its seq number and the seq number of its parent line as parent, one needs a simple enough recursive CTE to build all the paths from the root and attach the time spent at the final step of each path:

mysql> with recursive cte_plan as (select seq, parent, level, concat(seq, concat(' -', step)) as step, round(val*1000) as val from plan where level = 0 union all select p.seq, p.parent, p.level, concat(c.step, concat(';', concat(p.seq, concat(' -',p.step)))), round(p.val*1000) as val from plan p join cte_plan c on p.parent = c.seq) select step, val from cte_plan\G
*************************** 1. row ***************************
step: 1 - Nested loop antijoin
 val: 115
*************************** 2. row ***************************
step: 1 - Nested loop antijoin ;2 - Covering index scan on u1 using PRIMARY (reverse)
 val: 45
*************************** 3. row ***************************
step: 1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`)
 val: 2
*************************** 4. row ***************************
step: 1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`);4 - Materialize with deduplication
 val: 63
*************************** 5. row ***************************
step: 1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`);4 - Materialize with deduplication ;5 - Filter: (mysql.`user`.`User` is not null)
 val: 25
*************************** 6. row ***************************
step: 1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`);4 - Materialize with deduplication ;5 - Filter: (mysql.`user`.`User` is not null) ;6 - Covering index scan on user using PRIMARY
 val: 24
6 rows in set (0,01 sec)

Note that I've multiplied the metric values by 1000 because the flamegraph.pl script expects metrics to be integer numbers, so they are actually in microseconds. I also added the step number to each step, as we may have the exact same step in different places in the hierarchy if we read from the same table more than once in the query.

The final code to produce the input for flamegraph.pl is like this:

openxs@ao756:~/dbs/8.0$ ~/dbs/8.0/bin/mysql -uroot --socket=/tmp/mysql8.sock test -B -e"with recursive cte_plan as (select seq, parent, level, concat(seq, concat(' -', step)) as step, round(val*1000) as val from plan where level = 0 union all select p.seq, p.parent, p.level, concat(c.step, concat(';', concat(p.seq, concat(' -',p.step)))), round(p.val*1000) as val from plan p join cte_plan c on p.parent = c.seq) select step, val from cte_plan;" | awk ' NR > 1' > /tmp/processed_plan.txt
openxs@ao756:~/dbs/8.0$ cat /tmp/processed_plan.txt
1 - Nested loop antijoin        115
1 - Nested loop antijoin ;2 - Covering index scan on u1 using PRIMARY (reverse) 45
1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`)   2
1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`);4 - Materialize with deduplication       63
1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`);4 - Materialize with deduplication ;5 - Filter: (mysql.`user`.`User` is not null)         25
1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`);4 - Materialize with deduplication ;5 - Filter: (mysql.`user`.`User` is not null) ;6 - Covering index scan on user using PRIMARY  24

This is how I created the final flame graph representing the plan steps:

openxs@ao756:~/dbs/8.0$ cat /tmp/processed_plan.txt | ~/git/FlameGraph/flamegraph.pl  --title "EXPLAIN Steps" --inverted --countname microseconds > /tmp/explain2.svg

Surely these steps can be simplified, put into a script, etc. I checked the intermediate stages one by one and left some debugging details in. You can do better, but the final result is this nice icicle graph:

Time spent on each step of query execution represented as an icicle graph

where I highlighted the impact of the "Covering index scan on u1 using PRIMARY (reverse)" step. As usual with flame graphs, in the original .svg file you can search, zoom in and out, etc. For complex queries it may be really useful.
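For what it's worth, the two text-processing stages could indeed be combined into a single script. Here is a hedged sketch (the explain_to_plan function name is mine, and it uses the simpler last-row-per-level parent tracking rather than the stack-based awk above; it also assumes ".." never appears in a step description):

```shell
#!/bin/sh
# Sketch: raw EXPLAIN ANALYZE output on stdin,
# "seq;parent;level;step;time" lines (ready for LOAD DATA) on stdout.
explain_to_plan() {
    sed -e '1d' \
        -e 's/EXPLAIN: //' \
        -e 's/(cost[^)][^)]*)//' \
        -e 's/(actual time=//' \
        -e 's/\(..*\)\.\...*/\1/' \
        -e 's/  \([^ ][^ ]*\)$/; \1/' |
    awk '
    NF > 0 {
        split($0, a, "->");
        lvl = length(a[1]) / 4;
        last[lvl] = NR;
        print NR ";" ((lvl > 0) ? last[lvl - 1] : 0) ";" lvl ";" a[2]
    }'
}

# Hypothetical usage on sample output saved from the mysql client:
explain_to_plan <<'EOF'
*************************** 1. row ***************************
EXPLAIN: -> Nested loop antijoin  (cost=3.75 rows=25) (actual time=0.115..0.115 rows=0 loops=1)
    -> Covering index scan on u1 using PRIMARY (reverse)  (cost=0.75 rows=5) (actual time=0.045..0.049 rows=5 loops=1)
        -> Materialize with deduplication  (cost=1.25..1.25 rows=5) (actual time=0.063..0.063 rows=4 loops=1)
EOF
```

For this sample input the function emits three seq;parent;level;step;time lines in the same format as plan.txt above.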

As a DBA I'd ask for something better than these weird stages (even if scripted). I wish we had all execution steps in tables or views, like Oracle's v$sql_plan and v$sql_plan_statistics_all, or at least in JSON format (like EXPLAIN FORMAT=JSON produces for the plan itself). One day, maybe (I am going to file a feature request for this later). Yet another day I plan to use the output of ANALYZE FORMAT=JSON in MariaDB to get similar information :)

Thursday, January 6, 2022

Visualizing Performance Schema Events with Flame Graphs

Happy New Year 2022, dear readers! It's not the first time I am writing about flame graphs or illustrating some points with one. No wonder: the flame graph concept and related tools by Brendan Gregg provide a great way to visualize metrics in any nested hierarchy, with function call stacks being just the most popular and well-known example.

In this new blog post, which I am writing while preparing for my upcoming FOSDEM 2022 talk "Flame Graphs for MySQL DBAs", I am going to explore a different hierarchy that is obvious from these simple SQL queries and their outputs:

mysql> select event_name, timer_wait from events_waits_history_long order by 1 desc limit 5;
+----------------------------------------------+------------+
| event_name                                   | timer_wait |
+----------------------------------------------+------------+
| wait/synch/sxlock/innodb/trx_purge_latch     |     747273 |
| wait/synch/sxlock/innodb/index_tree_rw_lock  |     767343 |
| wait/synch/sxlock/innodb/hash_table_locks    |     637557 |
| wait/synch/sxlock/innodb/hash_table_locks    |     280980 |
| wait/synch/sxlock/innodb/dict_operation_lock |    1731372 |
+----------------------------------------------+------------+
5 rows in set (0,02 sec)

Another one, from the summary table:

mysql> select event_name, sum_timer_wait from performance_schema.events_waits_summary_global_by_event_name order by 2 desc limit 5;
+---------------------------------------------------------+-------------------+
| event_name                                              | sum_timer_wait    |
+---------------------------------------------------------+-------------------+
| wait/synch/cond/mysqlx/scheduler_dynamic_worker_pending | 20961067338483726 |
| idle                                                    |  1461111486462000 |
| wait/io/socket/sql/client_connection                    |  1373040224524194 |
| wait/synch/cond/sql/MYSQL_BIN_LOG::COND_done            |  1238396604397386 |
| wait/io/file/innodb/innodb_temp_file                    |   330058399794447 |
+---------------------------------------------------------+-------------------+
5 rows in set (0,05 sec)

We clearly see a hierarchy of waits of different types. It takes just a few simple steps to convert this kind of output to the format expected by the flamegraph.pl tool:

# The input is stack frames and sample counts formatted as single lines.  Each
# frame in the stack is semicolon separated, with a space and count at the end
# of the line.  These can be generated for Linux perf script output using
# stackcollapse-perf.pl, for DTrace using stackcollapse.pl, and for other tools
# using the other stackcollapse programs.  Example input:
#
#  swapper;start_kernel;rest_init;cpu_idle;default_idle;native_safe_halt 1
#
# An optional extra column of counts can be provided to generate a differential
# flame graph of the counts, colored red for more, and blue for less.  This
# can be useful when using flame graphs for non-regression testing.
# See the header comment in the difffolded.pl program for instructions.

Let me show a really simple way. We can get the lines without extra decorations:

openxs@ao756:~/dbs/8.0$ bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'select event_name, timer_wait from events_waits_history_long' >/tmp/waits.txt
openxs@ao756:~/dbs/8.0$ head -10 /tmp/waits.txt
event_name      timer_wait
wait/synch/mutex/innodb/log_writer_mutex        933255
wait/synch/mutex/innodb/log_flush_notifier_mutex        937938
wait/synch/mutex/innodb/log_flusher_mutex       561960
wait/synch/mutex/innodb/trx_sys_serialisation_mutex     345873
wait/synch/mutex/innodb/log_write_notifier_mutex        441540
wait/synch/mutex/mysqlx/lock_list_access        908502
wait/synch/mutex/innodb/log_checkpointer_mutex  927903
wait/synch/mutex/innodb/flush_list_mutex        160560
wait/synch/mutex/innodb/log_limits_mutex        120420

and then, assuming that there are no spaces inside the event_name, we can apply a simple combination of awk and sed commands to replace '/' with the expected ';' and make sure there is just one space before the metric (time in picoseconds in this case). The rest is for the flamegraph.pl script to handle:

openxs@ao756:~/dbs/8.0$ cat /tmp/waits.txt | awk '{ printf("%s %d\n", $1, $2); }' | sed 's/\//;/g' | ~/git/FlameGraph/flamegraph.pl --inverted --colors io --title "Waits" --countname picoseconds --width 1000 > /tmp/wait.svg
openxs@ao756:~/dbs/8.0$

I've used a few options above, including --inverted and --colors io, to end up with "icicles". The resulting graph, with "frames" whose names match "log" highlighted, is below:

Icicle graph for waits, showing the impact of "idle" and InnoDB logging-related waits

The graph above does not look very fancy. That's because the hierarchy of event names in the performance_schema is not very deep. You can surely make similar highlights with proper SQL filtering, but the .svg file is interactive and you can get a lot of insights after running just one query. You can surely add more fun by adding more data and using more complex SQL statements and text processing.

Since MySQL started to support recursive CTEs I always wanted to use them to navigate the more complex transactions/statements/stages/waits hierarchy that is present in the performance_schema. So, today I tried to use a "proof of concept" recursive CTE to build a bit more interesting flame graph and get myself ready to summarize time spent per statement, with the option to drill down to related waits if needed. I ended up with this kind of query to show statements by type, by stage, and by wait related to a stage. No comments to begin with, just some lame SQL (somewhat inspired by the cursor in sys.ps_trace_thread() here):

with recursive ssw as (
with sw as (
  select event_id, event_name, nesting_event_id, timer_wait from performance_schema.events_statements_history_long union
  select event_id, event_name, nesting_event_id, timer_wait from performance_schema.events_stages_history_long union
  select event_id, event_name, nesting_event_id, timer_wait from performance_schema.events_waits_history_long
)
select 0 as level, sw.* from sw
  where sw.event_name not like 'wait%'
  and sw.event_name not like 'stage%'
  and nesting_event_id is null
union all
select ssw.level + 1 as level, sw.event_id, concat(ssw.event_name,concat('/',sw.event_name)) as event_name, sw.nesting_event_id, sw.timer_wait
  from ssw
  inner join sw on sw.nesting_event_id = ssw.event_id
)
select event_name, timer_wait from ssw
order by event_id;

My idea is to start with statements that are at the top level (nesting_event_id is NULL) and then concatenate statement events with the stages and waits in the hierarchy, to get a longer "stack trace" for the flame graph. I proceeded with the following steps (note that I needed a longer column for the deeper hierarchy):

openxs@ao756:~/dbs/8.0$ bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e"with recursive ssw as (
with sw as (
  select event_id, cast(event_name as char(1024)) as event_name, nesting_event_id, timer_wait from performance_schema.events_statements_history_long union
  select event_id, event_name, nesting_event_id, timer_wait from performance_schema.events_stages_history_long union
  select event_id, event_name, nesting_event_id, timer_wait from performance_schema.events_waits_history_long
)
select 0 as level, sw.* from sw
  where sw.event_name not like 'wait%'
  and sw.event_name not like 'stage%'
  and nesting_event_id is null
union all
select ssw.level + 1 as level, sw.event_id, concat(ssw.event_name,concat('/',sw.event_name)) as event_name, sw.nesting_event_id, sw.timer_wait
  from ssw
  inner join sw on sw.nesting_event_id = ssw.event_id
)
select event_name, timer_wait from ssw
order by event_id
" > /tmp/sqlstages2.txt

openxs@ao756:~/dbs/8.0$ ls -l /tmp/sqlstages2.txt
-rw-rw-r-- 1 openxs openxs 1372235 Jan  6 22:36 /tmp/sqlstages2.txt
openxs@ao756:~/dbs/8.0$ tail -10 /tmp/sqlstages2.txt
statement/com/Close stmt/stage/sql/cleaning up  439000
statement/com/Close stmt        7371000
statement/com/Close stmt/stage/sql/starting     6554000
statement/com/Close stmt/stage/sql/cleaning up  433000
statement/com/Close stmt        6733000
statement/com/Close stmt/stage/sql/starting     5979000
statement/com/Close stmt/stage/sql/cleaning up  433000
statement/com/Quit      5343000
statement/com/Quit/stage/sql/starting   4552000
statement/com/Quit/stage/sql/cleaning up        409000

and then:

cat /tmp/sqlstages2.txt | awk '{ printf("%s %d\n", $1, $2); }' | sed 's/\//;/g' | ~/git/FlameGraph/flamegraph.pl --colors io --title "Waits" --countname picoseconds > /tmp/stages2.svg

I ended up with a graph like this:

Flame graph of time spent per statement type/stage/wait while running sysbench test

after running sysbench test:

openxs@ao756:~/dbs/8.0$ sysbench --table-size=1000000 --threads=4 --time=20 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_read_write.lua run
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 4
Report intermediate results every 5 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 5s ] thds: 4 tps: 4.00 qps: 85.72 (r/w/o: 60.74/13.99/10.99) lat (ms,95%): 1771.29 err/s: 0.00 reconn/s: 0.00
[ 10s ] thds: 4 tps: 8.00 qps: 166.02 (r/w/o: 116.61/27.40/22.00) lat (ms,95%): 733.00 err/s: 0.00 reconn/s: 0.00
[ 15s ] thds: 4 tps: 10.20 qps: 207.40 (r/w/o: 144.60/33.20/29.60) lat (ms,95%): 634.66 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 4 tps: 10.20 qps: 204.00 (r/w/o: 142.80/33.20/28.00) lat (ms,95%): 520.62 err/s: 0.00 reconn/s: 0.00
SQL statistics:
    queries performed:
        read:                            2324
        write:                           539
        other:                           457
        total:                           3320
    transactions:                        166    (8.20 per sec.)
    queries:                             3320   (164.05 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          20.2331s
    total number of events:              166

Latency (ms):
         min:                                  177.42
         avg:                                  484.72
         max:                                 2039.24
         95th percentile:                      960.30
         sum:                                80463.77

Threads fairness:
    events (avg/stddev):           41.5000/0.87
    execution time (avg/stddev):   20.1159/0.07

Note on the graph above that the stage times do NOT sum up to the total time spent in select or Prepare - it means a large part of the code (where time IS spent) is NOT instrumented as stages of execution. There is still a lot of work to do on Performance Schema, it seems...
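That gap can actually be measured from the folded file itself. Here is a hedged awk sketch (my own, with made-up array names) that, for each statement event name, compares the statement's own time with the sum of the stage time nested under it:

```shell
# Sketch: read folded "path <picoseconds>" lines and report, per
# statement event name, how much time is not covered by stages.
awk '{
    val  = $NF;                                  # metric is the last field
    path = $0; sub(/[ \t]+[0-9]+$/, "", path);   # strip the trailing metric
    if (path ~ /\/stage\//) {                    # a stage nested in a statement
        parent = path; sub(/\/stage\/.*/, "", parent);
        staged[parent] += val;
    } else {                                     # a top-level statement line
        total[path] += val;
    }
}
END {
    for (s in total)
        printf "%s: total=%d staged=%d uninstrumented=%d\n",
               s, total[s], staged[s], total[s] - staged[s];
}' <<'EOF'
statement/com/Quit      5343000
statement/com/Quit/stage/sql/starting   4552000
statement/com/Quit/stage/sql/cleaning up        409000
EOF
```

For the three sample lines above this reports total=5343000, staged=4961000 and uninstrumented=382000 picoseconds for statement/com/Quit. It's a rough sketch: waits nested inside stages are ignored, and all executions of the same statement type are summed together.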

I could include the exact SQL statement text in the graph as well (maybe with a separate stored procedure, not just this lame combination of a single SQL statement and simple text processing), but that would not help much unless I used the width of multiple screens.

To summarize, flame graphs are great for a quick overview of the contribution of individual "stages" to a cumulative measure collected over a hierarchy. Recursive CTEs are also cool. One day I'll proceed with further steps along this path, but for tonight the proof of concept and the "invention" of "P_S Flame Graphs" for MySQL DBAs is enough. Stay tuned!