Blog of (former?) MySQL Entomologist: locks

Showing posts with label locks. Show all posts

Saturday, January 15, 2022

Accessing Complex Structures in MariaDB Server Code in bpftrace Probes - First Steps

I had already written many blog posts about bpftrace. All my public talks devoted to bpftrace included a slide about "problems", and one of them usually sounded like this:

...access to complex structures (bpftrace needs headers)...

So, today I decided to demonstrate how to resolve this minor problem. As an example I tried to reproduce gdb debugging steps from this older post. I want to trace and report table and row level locks set during execution of various SQL statements against InnoDB tables in MariaDB Server 10.6.6 built from more or less current GitHub code. To reduce the performance impact I'll use recent version of bpftrace build from GitHub source:

openxs@ao756:~/git/bpftrace/build$ src/bpftrace --version
bpftrace v0.14.0-50-g4228

First thing to find out (given the same test case as in that older blog post) is what functions to add uprobe on and what information is available. Let's start with gdb and set a couple of breakpoints on lock_tables and lock_row_lock functions:

openxs@ao756:~/dbs/maria10.6$ sudo gdb -p `pidof mariadbd`
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 14850
[New LWP 14852]
[New LWP 14853]
[New LWP 14854]
[New LWP 14855]
[New LWP 14859]
[New LWP 14861]
[New LWP 14862]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
--Type <RET> for more, q to quit, c to continue without paging--
0x00007f126bc04aff in __GI___poll (fds=0x55ef5e36c838, nfds=3,
    timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29      ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
(gdb) b lock_table
Breakpoint 1 at 0x55ef5c44a840: lock_table. (4 locations)
(gdb) b lock_rec_lock
Breakpoint 2 at 0x55ef5c1c4896: lock_rec_lock. (2 locations)
(gdb) c
Continuing.
[New Thread 0x7f1245ffd700 (LWP 14899)]
[New Thread 0x7f1246fff700 (LWP 14900)]
[New Thread 0x7f1268518700 (LWP 14904)]
[Switching to Thread 0x7f1268518700 (LWP 14904)]

Thread 11 "mariadbd" hit Breakpoint 1, lock_table (table=0x7f1218065b20,
    mode=LOCK_IS, thr=0x7f121806cc10)
    at /home/openxs/git/server/storage/innobase/lock/lock0lock.cc:3481
warning: Source file is more recent than executable.
3481    {
(gdb) p table
$1 = (dict_table_t *) 0x7f1218065b20
(gdb) p table->name
$2 = {m_name = 0x7f1218020908 "test/tt", static part_suffix = "#P#"}
(gdb) c
Continuing.

Thread 11 "mariadbd" hit Breakpoint 1, lock_table (table=0x7f1218065b20,
    mode=LOCK_IS, thr=0x7f121806cc10)
    at /home/openxs/git/server/storage/innobase/include/que0que.ic:37
warning: Source file is more recent than executable.
37              return(thr->graph->trx);
(gdb) c
Continuing.

Thread 11 "mariadbd" hit Breakpoint 2, lock_rec_lock (impl=false, mode=2,
    block=0x7f12480325a0, heap_no=2, index=0x7f1218066f90, thr=0x7f121806cc10)
    at /home/openxs/git/server/storage/innobase/include/que0que.ic:37
37              return(thr->graph->trx);
(gdb) p index
$3 = (dict_index_t *) 0x7f1218066f90
(gdb) p index->name
$4 = {m_name = 0x7f1218067120 "PRIMARY"}
(gdb) p index->table->name
$5 = {m_name = 0x7f1218020908 "test/tt", static part_suffix = "#P#"}
(gdb) p index->table
$6 = (dict_table_t *) 0x7f1218065b20
(gdb) q
A debugging session is active.

        Inferior 1 [process 14850] will be detached.

Quit anyway? (y or n) y
Detaching from program: /home/openxs/dbs/maria10.6/bin/mariadbd, process 14850
[Inferior 1 (process 14850) detached]
openxs@ao756:~/dbs/maria10.6$

From the above we already see that breakpoints are set in more than one place (so we expect more that one function name to match for the uprobe). We also see some structures used as functions arguments to access, and so we need a way to define them for tracing.

Now, if we try something lame to add a probe (should be done as root or via sudo) for just lock_tables:

openxs@ao756:~/git/bpftrace/build$ src/bpftrace -e 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table { printf("lock_table: %p, %d, %p\n", arg0, arg1, arg2); }'
ERROR: bpftrace currently only supports running as the root user.
openxs@ao756:~/git/bpftrace/build$ sudo src/bpftrace -e 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table { printf("lock_table: %p, %d, %p\n", arg0, arg1, arg2); }'
[sudo] password for openxs:
Attaching 20 probes...
^C

we can already suspect something bad, as we ended up with 20 probes. We can even list them in a readable way:

openxs@ao756:~/git/bpftrace/build$ sudo src/bpftrace -l 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table' | c++filt
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table(dict_table_t*, lock_mode, que_thr_t*)
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int)
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_names(THD*, DDL_options_st const&, TABLE_LIST*, TABLE_LIST*, unsigned long, unsigned int)
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_names(THD*, DDL_options_st const&, TABLE_LIST*, TABLE_LIST*, unsigned long, unsigned int) [clone .cold]
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_create(dict_table_t*, unsigned int, trx_t*, ib_lock_t*)
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_for_trx(dict_table_t*, trx_t*, lock_mode)
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_has_locks(dict_table_t*)
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_resurrect(dict_table_t*, trx_t*, lock_mode)
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_resurrect(dict_table_t*, trx_t*, lock_mode) [clone .cold]
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_lock_list_init(ut_list_base<ib_lock_t, ut_list_node<ib_lock_t> lock_table_t::*>*)
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_print(_IO_FILE*, ib_lock_t const*)
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_print(_IO_FILE*, ib_lock_t const*) [clone .cold]
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_wsrep(dict_table_t*, lock_mode, que_thr_t*, trx_t*)
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_wsrep(dict_table_t*, lock_mode, que_thr_t*, trx_t*) [clone .cold]
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_dequeue(ib_lock_t*, bool)
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_dequeue(ib_lock_t*, bool) [clone .cold]
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_dequeue(ib_lock_t*, bool) [clone .constprop.0]
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_dequeue(ib_lock_t*, bool) [clone .constprop.0] [clone .cold]
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_tables_precheck(THD*, TABLE_LIST*)
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_tables_open_and_lock_tables(THD*, TABLE_LIST*)
openxs@ao756:~/git/bpftrace/build$ sudo src/bpftrace -l 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table' | wc -l
20

So, as expected, out probe maptaches any function with "lock_tables" in its name, and using the first row of the output (demangled function signature) does NOT help:

openxs@ao756:~/git/bpftrace/build$ sudo src/bpftrace -l 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table' | head -1 | c++filt
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table(dict_table_t*, lock_mode, que_thr_t*)
openxs@ao756:~/git/bpftrace/build$ sudo src/bpftrace -l 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table(dict_table_t*, lock_mode, que_thr_t*)'
stdin:1:1-59: ERROR: syntax error, unexpected (, expecting {
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table(dict_table_t*, lock_mode, que_thr_t*)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We can NOT use demandled signature, but we can use mangled equivalent:

openxs@ao756:~/git/bpftrace/build$ sudo src/bpftrace -l 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table' | head -1
uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:_Z10lock_tableP12dict_table_t9lock_modeP9que_thr_t

Like this:

openxs@ao756:~/git/bpftrace/build$ sudo src/bpftrace -e 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:_Z10lock_tableP12dict_table_t9lock_modeP9que_thr_t
{ printf("lock_table: %p, %d, %p\n", arg0, arg1, arg2); }'
Attaching 1 probe...
lock_table: 0x7f1218065b20, 0, 0x7f121806cc10
lock_table: 0x7f121801d690, 4, 0x7f1218068d60
lock_table: 0x7f121801d690, 1, 0x7f1218068d60
lock_table: 0x7f122400d350, 1, 0x7f121807a158
lock_table: 0x7f122400d350, 1, 0x7f121807a158
lock_table: 0x7f12240101e0, 1, 0x7f121807a900
lock_table: 0x7f12240101e0, 1, 0x7f121807a900
lock_table: 0x7f12240101e0, 1, 0x7f121807a900
lock_table: 0x7f12240101e0, 1, 0x7f121807a900
lock_table: 0x7f12240101e0, 1, 0x7f121807a900
lock_table: 0x7f12240101e0, 1, 0x7f121807a900
^C

I've got the above output whilke execvuting rhis SQL statement:

MariaDB [test]> insert into t(val) select 100 from tt;
Query OK, 4 rows affected (0.038 sec)
Records: 4 Duplicates: 0 Warnings: 0

A bit more table level lock requests than one would expect, and from more than one thread. This can be explained, but the real problem is that we see lock mode, locking thread address but NOT the actual name of the locked table (that is hidden inside a complex structure). Fro0m gdb output we know that the first argument (arg0) of the function is a pointer to dict_table_t structure, but how is it defined?

I do not know InnoDB source code by heart, so I have to search (with some assumptions in mind this is easier):

openxs@ao756:~/git/server$ grep -rn 'struct dict_table_t' * | grep '\.h'
storage/innobase/include/row0import.h:34:struct dict_table_t;
storage/innobase/include/srv0start.h:33:struct dict_table_t;
storage/innobase/include/dict0mem.h:1788:struct dict_table_t {
storage/innobase/include/dict0types.h:39:struct dict_table_t;

I've highlighted the line where structure definition begins, so I can see what's there (only the much later important part is quoted):

...
public:
        /** Id of the table. */
        table_id_t                              id;
        /** dict_sys.id_hash chain node */
        dict_table_t*                           id_hash;
        /** Table name in name_hash */
        table_name_t                            name;
...

Even the name is not a simple scalar data type, so we can find the definition in te code (I had not cared much to make sure this is really a 10.6 branhc, assuming that some basic things do not change that often in MariaDB, optimistic approach):

openxs@ao756:~/git/server$ vi +102 storage/innobase/include/dict0types.h

struct table_name_t
{
        /** The name in internal representation */
        char*   m_name;

        /** Default constructor */
        table_name_t() {}
        /** Constructor */
        table_name_t(char* name) : m_name(name) {}

        /** @return the end of the schema name */
        const char* dbend() const
        {
                const char* sep = strchr(m_name, '/');
                ut_ad(sep);
                return sep;
        }

        /** @return the length of the schema name, in bytes */
        size_t dblen() const { return size_t(dbend() - m_name); }

        /** Determine the filename-safe encoded table name.
        @return the filename-safe encoded table name */
        const char* basename() const { return dbend() + 1; }

        /** The start of the table basename suffix for partitioned tables */
        static const char part_suffix[4];

        /** Determine the partition or subpartition name suffix.
        @return the partition name
        @retval NULL    if the table is not partitioned */
        const char* part() const { return strstr(basename(), part_suffix); }

        /** @return whether this is a temporary or intermediate table name */
        inline bool is_temporary() const;
};

I do not care about fun ction members, just data stored in the strucute (highlughted). Moreover, read this part of bpftrace reference carefully:

You can define your own structs when needed. In some cases, kernel structs are not declared in the kernel headers package, and are declared manually in bpftrace tools (or partial structs are: enough to reach the member to dereference).

Thing is, while we theoretically can use #include in bpftrace, structured used in MariaDB server are defined in so many heards amd are so specific and sometimes deeply nested, that usually it's easier to use properly sised placeholderns in explicit struct definitions. You can get proper sizes from gdb:

(gdb) p sizeof(table_id_t)
$1 = 8
(gdb) p sizeof(dict_table_t *)
$2 = 8
(gdb) p sizeof(long long)
$3 = 8

Taking the above into account, we can try to define structured like these:

struct table_name_t
{
        char*   m_name;
        char part_suffix[4];
}

struct dict_table_t {
        long long                              id;
        struct dict_table_t*                           id_hash;
        struct table_name_t                            name;
}

in our bpftrace code to be able to pproperly derefence the table name as a null-terminated string:

openxs@ao756:~/git/bpftrace/build$ sudo src/bpftrace -e '
>
> struct table_name_t
> {
>         char*   m_name;
>         char part_suffix[4];
> }
>
> struct dict_table_t {
>         long long                              id;
>         struct dict_table_t*                   id_hash;
>         struct table_name_t                    name;
> }
>
> uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:_Z10lock_tableP12dict_table_t9lock_modeP9que_thr_t
>   { printf("lock_table: %s, %d, %p\n", str(((struct dict_table_t *)arg0)->name.m_name), arg1, arg2); }'
Attaching 1 probe...
lock_table: test/tt, 0, 0x7f121806cc10
lock_table: test/t, 4, 0x7f1218068d60
lock_table: test/t, 1, 0x7f1218068d60
lock_table: mysql/innodb_table_stats, 1, 0x7f122800e7d8
lock_table: mysql/innodb_table_stats, 1, 0x7f122800e7d8
lock_table: mysql/innodb_index_stats, 1, 0x7f122800ef80
lock_table: mysql/innodb_index_stats, 1, 0x7f122800ef80
lock_table: mysql/innodb_index_stats, 1, 0x7f122800ef80
lock_table: mysql/innodb_index_stats, 1, 0x7f122800ef80
lock_table: mysql/innodb_index_stats, 1, 0x7f122800ef80
lock_table: mysql/innodb_index_stats, 1, 0x7f122800ef80
^C

Looks good enough as a proof of concept already. We see how to define simplified structured containing just enough information to find and derefernce items of complex nested structures used all over the MariaDB server code.

I'd surely want to add a probe to lock_row_lock too, and for this I need some more gdb outputs from the breakpoint set:

Thread 11 "mariadbd" hit Breakpoint 2, lock_rec_lock (impl=false, mode=2,
    block=0x7f12480325a0, heap_no=2, index=0x7f1218066f90, thr=0x7f121806cc10)
    at /home/openxs/git/server/storage/innobase/include/que0que.ic:37
37              return(thr->graph->trx);
(gdb) p index
$3 = (dict_index_t *) 0x7f1218066f90
(gdb) p index->name
$4 = {m_name = 0x7f1218067120 "PRIMARY"}
(gdb) p index->table->name
$5 = {m_name = 0x7f1218020908 "test/tt", static part_suffix = "#P#"}
(gdb) p index->table
$6 = (dict_table_t *) 0x7f1218065b20
(gdb) p index->id
$7 = 48
(gdb) p sizeof(index->id)
$8 = 8
(gdb) p sizeof(index->heap)
$9 = 8
(gdb) p sizeof(index->name)
$10 = 8
(gdb) p impl
$11 = false
(gdb) p sizeof(impl)
$12 = 1
(gdb) p sizeof(mode)
$13 = 4

I can surely find the defintion of dict_index_t structure and work based on it:

struct dict_index_t {
/** Maximum number of fields */
static constexpr unsigned MAX_N_FIELDS= (1U << 10) - 1;

        index_id_t      id;     /*!< id of the index */
        mem_heap_t*     heap;   /*!< memory heap */
        id_name_t       name;   /*!< index name */
        dict_table_t*   table; /*!< back pointer to table */
...

but sizes above are actually enough to come up with a proper code like this:

sudo src/bpftrace -e '

struct table_name_t
{
        char*   m_name;
        char part_suffix[4];
}

struct dict_table_t {
        long long                              id;
        struct dict_table_t*                           id_hash;
        struct table_name_t                            name;
}

struct dict_index_t {
        long long        id;
        long long        heap;
        char*            name;
        struct dict_table_t*    table;
}

uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:_Z10lock_tableP12dict_table_t9lock_modeP9que_thr_t
{ printf("lock_table: %s, mode: %d, thread: %p\n",
      str(((struct dict_table_t *)arg0)->name.m_name),
      arg1,
      arg2); }

uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_rec_lock
{ printf("lock_rec_lock: impl (%d) mode %d index %s rec of %s, thread: %p\n",
      arg0,
      arg1,
      str(((struct dict_index_t *)arg4)->name),
      str(((struct dict_index_t *)arg4)->table->name.m_name),
      arg5); }

that produces the following:

openxs@ao756:~/git/bpftrace/build$ sudo src/bpftrace -e '
>
> struct table_name_t
> {
>         char*   m_name;
>         char part_suffix[4];
> }
>
> struct dict_table_t {
>         long long                              id;
>         struct dict_table_t*                           id_hash;
>         struct table_name_t                            name;
> }
>
> struct dict_index_t {
>         long longid;
>         long longheap;
>         char*name;
>         struct dict_table_t*table;
> }
>
> uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:_Z10lock_tableP12dict_table_t9lock_modeP9que_thr_t
>   { printf("lock_table: %s, mode: %d, thread: %p\n",
>       str(((struct dict_table_t *)arg0)->name.m_name),
>       arg1,
>       arg2); }
>
> uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_rec_lock
>   { printf("lock_rec_lock: impl (%d) mode %d index %s rec of %s, thread: %p\n",
>       arg0,
>       arg1,
>       str(((struct dict_index_t *)arg4)->name),
>       str(((struct dict_index_t *)arg4)->table->name.m_name),
>       arg5); }
>
> '
Attaching 3 probes...
lock_table: test/tt, mode: 0, thread: 0x7f121806cc10
lock_rec_lock: impl (0) mode 2 index PRIMARY rec of test/tt, thread: 0x7f121806cc10
lock_table: test/t, mode: 4, thread: 0x7f1218068d60
lock_table: test/t, mode: 1, thread: 0x7f1218068d60
lock_rec_lock: impl (0) mode 2 index PRIMARY rec of test/tt, thread: 0x7f121806cc10
lock_rec_lock: impl (0) mode 2 index PRIMARY rec of test/tt, thread: 0x7f121806cc10
lock_rec_lock: impl (0) mode 2 index PRIMARY rec of test/tt, thread: 0x7f121806cc10
lock_rec_lock: impl (0) mode 2 index PRIMARY rec of test/tt, thread: 0x7f121806cc10
lock_table: mysql/innodb_table_stats, mode: 1, thread: 0x7f123400ebc8
lock_rec_lock: impl (0) mode 3 index PRIMARY rec of mysql/innodb_table_stats, thread: 0x7f123400ebc8
lock_rec_lock: impl (0) mode 3 index PRIMARY rec of mysql/innodb_table_stats, thread: 0x7f123400ebc8
lock_table: mysql/innodb_table_stats, mode: 1, thread: 0x7f123400ebc8
lock_rec_lock: impl (0) mode 1026 index PRIMARY rec of mysql/innodb_table_stats, thread: 0x7f123400ebc8
lock_table: mysql/innodb_index_stats, mode: 1, thread: 0x7f123400f370
lock_rec_lock: impl (0) mode 3 index PRIMARY rec of mysql/innodb_index_stats, thread: 0x7f123400f370
lock_rec_lock: impl (0) mode 3 index PRIMARY rec of mysql/innodb_index_stats, thread: 0x7f123400f370
lock_table: mysql/innodb_index_stats, mode: 1, thread: 0x7f123400f370
lock_rec_lock: impl (0) mode 1026 index PRIMARY rec of mysql/innodb_index_stats, thread: 0x7f123400f370
lock_table: mysql/innodb_index_stats, mode: 1, thread: 0x7f123400f370
lock_rec_lock: impl (0) mode 3 index PRIMARY rec of mysql/innodb_index_stats, thread: 0x7f123400f370
lock_rec_lock: impl (0) mode 3 index PRIMARY rec of mysql/innodb_index_stats, thread: 0x7f123400f370
lock_table: mysql/innodb_index_stats, mode: 1, thread: 0x7f123400f370
lock_rec_lock: impl (0) mode 1026 index PRIMARY rec of mysql/innodb_index_stats, thread: 0x7f123400f370
lock_table: mysql/innodb_index_stats, mode: 1, thread: 0x7f123400f370
lock_rec_lock: impl (0) mode 3 index PRIMARY rec of mysql/innodb_index_stats, thread: 0x7f123400f370
lock_rec_lock: impl (0) mode 3 index PRIMARY rec of mysql/innodb_index_stats, thread: 0x7f123400f370
lock_table: mysql/innodb_index_stats, mode: 1, thread: 0x7f123400f370
lock_rec_lock: impl (0) mode 1026 index PRIMARY rec of mysql/innodb_index_stats, thread: 0x7f123400f370
^C

while this SQL is executed, for example:

MariaDB [test]> insert into t(val) select 100 from tt;
Query OK, 4 rows affected (0.080 sec)
Records: 4 Duplicates: 0 Warnings: 0

Please, check that older post from proper interpretation of the output, including contants used to represent lock mode etc. The idea here was to show that bpftrace understands a subset of C/C++ struct statement and show the ways to create such simplified strucutres to be able to dereference and access deeply nexted arguments if needed for tracing.

A bit more advanced code may show more details about the individual rows locked...

* * *

To summarize:

With some efforts like source code checks and gdb breakpoints/prints one can eventually figure out what structures to define to be able to access members of complex structures typical for MariaDB and MySQL server code in the bpftrace code.
Inlcuding existing headers is NOT possible and is hardly practical for complex software like MariaDB that uses C++ nowadays.
bpftrace, with some efforts like those presented above, allows to study what happens in complex server operations (like InnoDB locking) easily, safely and with minimal impact on a running system.
bpftrace is cool, but you already know that!

Saturday, February 23, 2019

Fun with Bugs #80 - On MySQL Bug Reports I am Subscribed to, Part XVI

Today I'd like to continue my review of public MySQL bug reports with a list of some bugs I've subscribed to over last 3 weeks. It's already long enough and includes nice cases to check and share. Note that I usually subscribe to a bug either because it directly affects me or customers I work with, or I consider it technically interesting (so I mostly care about InnoDB, replication, partitioning and optimizer bugs), or it's a "metabug" - a problem in the way public bug report is handled by Oracle engineers. These are my interests related to MySQL bugs.

As usual, I start with the oldest bugs and try to mention bug reporters by name with links to their other reports whenever this may give something useful to a reader. I try to check if MariaDB is also affected in some cases. Check also my summary comments at the end of this blog post.

Bug #94148 - "Unnecessary Shared lock on parent table During UPDATE on a child table". In this bug report Uday Varagani reasonably pointed out that formally there is no need to lock parent row when column NOT included in the foreign key gets updated. This happens though when this column is included into the index used to support foreign key constraint. IMHO it's a reasonable feature request and both Trey Raymond and Sveta Smirnova tried their best to highlight this, but this report now has a "Need Feedback" status with a request to explain new algorithm suggested. It's simple - "Stop it", check that column changed is NOT the one foreign key is defined on, even if it's in the same index...I see no reason NOT to verify this as a reasonable feature request. Is it a new policy that every feature request should come with details on how to implement it? I truly doubt.
Bug #94224 - "[5.6] Optimizer reconsiders index based on index definition order, not value". Domas Mituzas found yet another case (see also Bug #36817 - "Non optimal index choice, depending on index creation order" from Jocelyn Fournier, the bug I verified more than 10 years ago) when in MySQL order of index definition matters more for optimizer than anything else. My quick check shows that MariaDB 10.3.7 is not affected:
MariaDB [test]> explain select distinct b from t1 where c not in (0) and d > 0;+------+-------------+-------+-------+---------------+--------------------+---------+------+------+-------------+| id   | select_type | table | type | possible_keys | key            | key_len
| ref | rows | Extra                    |+------+-------------+-------+-------+---------------+--------------------+--------+------+------+-------------+
|    1 | SIMPLE      | t1    | index | NULL          | non_covering_index | 9    | NULL |    1 | Using where |
+------+-------------+-------+-------+---------------+--------------------+---------+------+------+-------------+
1 row in set (0.002 sec)

MariaDB [test]> alter table t1 add index covering_index (b, c, d);
Query OK, 0 rows affected (0.149 sec)
Records: 0 Duplicates: 0 Warnings: 0

MariaDB [test]> explain select distinct b from t1 where c not in (0) and d > 0;
+------+-------------+-------+-------+---------------+----------------+---------+------+------+--------------------------+
| id   | select_type | table | type | possible_keys | key            | key_len
| ref | rows | Extra                    |
+------+-------------+-------+-------+---------------+----------------+---------+------+------+--------------------------+
|    1 | SIMPLE      | t1    | index | NULL          | covering_index | 14
| NULL |    1 | Using where; Using index |
+------+-------------+-------+-------+---------------+----------------+---------+------+------+--------------------------+
1 row in set (0.025 sec)
Fortunately MySQL 8 is no longer affected. Unfortunately we do not see a public comment showing the results of testing on MySQL 5.7 (or any version, for that matter), from engineer who verified the bug. I already pointed out that this "metabug" becomes popular in my previous blog post.
Bug #94243 - "WL#9508 introduced non-idiomatic potentially-broken C macros". Laurynas Biveinis from Percona found new code that in ideal world wound not pass any serious code review.
Bug #94251 - "Aggregate function result is dependent by window is defined directly or as named". This bug was reported by Владислав Сокол. From what I see:
MariaDB [test]> WITH RECURSIVE cte AS (
    -> SELECT 1 num
    -> UNION ALL
    -> SELECT num+1 FROM cte WHERE num < 5
    -> )
    -> SELECT num, COUNT(*) OVER (frame) cnt_named, COUNT(*) OVER (ORDER BY num
DESC) cnt_direct
    -> FROM cte
    -> WINDOW frame AS (ORDER BY num DESC);
+------+-----------+------------+
| num | cnt_named | cnt_direct |
+------+-----------+------------+
|    1 |         5 |          5 |
|    2 |         4 |          4 |
|    3 |         3 |          3 |
|    4 |         2 |          2 |
|    5 |         1 |          1 |
+------+-----------+------------+
5 rows in set (0.117 sec)

MariaDB [test]> WITH RECURSIVE cte AS (
    -> SELECT 1 num
    -> UNION ALL
    -> SELECT num+1 FROM cte WHERE num < 5
    -> )
    -> SELECT num, COUNT(*) OVER (frame) cnt_named, COUNT(*) OVER (ORDER BY num
DESC) cnt_direct
    -> FROM cte
    -> WINDOW frame AS (ORDER BY num DESC)
    -> ORDER BY num desc;
+------+-----------+------------+
| num | cnt_named | cnt_direct |
+------+-----------+------------+
|    5 |         1 |          1 |
|    4 |         2 |          2 |
|    3 |         3 |          3 |
|    2 |         4 |          4 |
|    1 |         5 |          5 |
+------+-----------+------------+
5 rows in set (0.003 sec)
MariaDB 10.3.7 is NOT affected.
Bug #94283 - "MySQL 8.0.15 is slower than MySQL 5.7.25". Percona's CTO Vadim Tkachenko reported that MySQL 8.0.15 is notably slower than 5.7.25 on a simple oltp_read_write sysbench test. He had recently written a separate blog post about this, with more details.There is one detail to clarify based on today's comment from Peter Zaitsev (was the same default character set used), but as my dear friend Sinisa Milivojevic verified the bug without any questions, requests or his own test outputs shared, we can assume that Oracle officially accepted this performance regression (even though "regression" tag was not set).

Check also later Bug #94387 - "MySQL 8.0.15 is slower than MySQL 5.7.25 in read only workloads", yet another performance regression report from Vadim, where he found that on read only (sysbench oltp_point_select) all in memory workloads MySQL 8.0.15 may also be slower than MySQL 5.7.25.
Bug #94302 - "reset master could not break dump thread in some cases". This bug was reported by Ashe Sun. This is definitely a corner case, as it happens only master is still writing to the very first binary log. We can not find out from public comments in the bug report if any other versions besides 5.7.x are affected. This is yet another "metabug" - during my days in Oracle's MySQL bugs verification team we had to check on all versions still supported and present the results explicitly.
Bug #94319 - "Format_description_log_event::write can cause segfaults". Nice bug report by Manuel Ung from Facebook.
Bug #94330 - "Test for possible compressed failures before upgrade?". Change of zlib version starting from MySQL 5.7.24 means that some operations for InnoDB tables with ROW_FORMAT=COMPRESSED that previously worked may start to fail. In this report Monty Solomon asks for some way to determine if there will be a problem with existing compressed tables before upgrading to 5.7.24. The bug is still "Open".
Bug #94338 - "Dirty read-like behavior in READ COMMITTED transaction". Bug reporter, Masaki Oguro, stated that MySQL 8 is not affected (only 5.6 and 5.7) and the bug is verified on these versions, so we should assume it's really the case. But I miss public comment showing the result of testing on recent MySQL 8.0.15.
Bug #94340 - "backwards incompatible changes in 8.0: Error number: 3747". Simon Mudd complains about incompatible change in 8.0.13 that does not allow slave to easily switch from SBR to RBR without restart (and was not clearly documented as a change in behavior). Make sure to read all comments.
Bug #94370 - "Performance regression of btr_cur_prefetch_siblings". Nice bug report with a patch from Zhai Weixiang.
Bug #94383 - "simple ALTER cause unnecessary InnoDB index rebuilds, 5.7.23 or later 5.7 rlses". In this bug report Mikhail Izioumtchenko presented the detailed analysis and suggested diagnostics patches to show what really happens and why. This bug is also a regression of a kind, so while testing results are presented, I still think that it could be processed better according to the good old rules I have in mind.
Bug #94394 - "Absence of mysql.user leads to auto-apply of --skip-grant-tables". Great finding by Ceri Williams from Percona. Sveta Smirnova provided a separate MTR test case and clarified the impact of the bug. Surely this is also a regression comparing to MySQL 5.7, as there you can not start MySQL if mysql.user table is missing. I leave it to a reader to decide if there is any security-related impact of this bug...
Bug #94396 - "Error message too broad: The used command is not allowed with this MySQL version". This bug was reported by my former colleague in Percona Support, famous Bill Karwin. Informative error messages matter for good user experience.

We rely on MySQL in a same way as that guys on top of dolphins pyramid on this strange monument in some court somewhere at the Lanes. Reliable foundation matters, so regressions should better be avoided.

To summarize:

Looks like it's time for Oracle to spend some efforts to make MySQL 8 great again, by fixing some of the bugs mentioned above, especially performance regressions vs MySQL 5.7 found recently by Vadim Tkachenko from Percona.
Oracle continues to introduce backward-incompatible changes in behavior in minor MySQL 8.0.x releases at GA stage. This is not really good for any production environment.
Asking bug reporters to provide "the basics of such a new algorithm" when they complain that current one is wrong or not optimal is a new word in bugs processing!
When I joined MySQL bugs verification team in 2005 we've set up a culture of bugs processing that included, among other things, presenting in a public comment any successful or unsuccessful attempt to verify the bug, by copy-pasting all commands and statements used along with the outputs, whenever possible and with enough context to show what was really checked. I've studied this approach from Oracle's Tom Kyte over the previous 10 years when I followed him closely. I used to think it's standard for more than a decade already, a kind of my (and not only my) "heritage". It's sad to see this approach is no longer followed by many Oracle engineers who process bugs, in too many cases.
Oracle engineers still do not use "regression" tag when setting "Verified" status for obviously regression bugs. I think bug reporters should care then to always set it when they report regression of any kind.

Saturday, November 11, 2017

Fun with Bugs #57 - On MySQL Bug Reports I am Subscribed to, Part I

I've decided to stop reviewing MySQL Release Notes in this series, but it does not mean that I am not interested in MySQL bugs any more. At the moment I am subscribed to 91 active MySQL bugs reported by other MySQL users, and in this blog post I am going to present 15 of them, the most recently reported ones. I'd really want to see them fixed or at least properly processed as soon as possible.

In some cases I am going to add my speculations on how the bug had better be handled, or maybe highlight some important details about it. It is not my job any more to process/"verify" any community bug reports for any kind of MySQL, but I did that for many years and I've spent more than 5 years "on the other side", being a member of Community, so in some cases I let myself to share some strong opinion on what may be done differently from the Oracle side.

As a side note, I started to subscribe to MySQL bugs mostly after I left Oracle, as before that I got email notification about each and every change in every MySQL bug report ever created...

Here is the list, starting from the most recent ones:

Bug #88422 - "MySQL 5.7 innodb purge thread get oldest readview could block other transaction". It is one of that bug reports without a test case from reporter. It is tempting to set it to "Verified" just "based on code review", as the code in 5.7 is quite obviously shows both holding the trx_sys->mutex and linear complexity of search depending on number of read views in the worst case (when most of them are closed):
/**
Get the oldest (active) view in the system.
@return oldest view if found or NULL */

ReadView*
MVCC::get_oldest_view() const
{
        ReadView*       view;

        ut_ad(mutex_own(&trx_sys->mutex));

        for (view = UT_LIST_GET_LAST(m_views);
             view != NULL;
             view = UT_LIST_GET_PREV(m_view_list, view)) {

                if (!view->is_closed()) {
                        break;
                }
        }

        return(view);
}
But probably current Oracle bugs verification rules do not let to just mark it as verified. After all, somebody will have to create a test case... So, my dear old friend Sinisa Milivojevic decided to try to force bug reporter to provide a test case instead of spending some time trying to create one himself. I am not going to blame him for that, why to try the easy way :) But I consider this his statement in the comment dated [10 Nov 16:21]:
"... 5.7 methods holds no mutex what so ever..."
a bit wrong, as we can see the mutex is acquired when get_oldest_view() method is called:
void
MVCC::clone_oldest_view(ReadView* view)
{
        mutex_enter(&trx_sys->mutex);

        ReadView*       oldest_view = get_oldest_view();
        if (oldest_view == NULL) {
...
Bug #88381 - "Predicate cannot be pushed down "past" window function". Here bug reporter had provided enough hints for a test case. One can probably just check 'Handler%' status variables before and after query execution to come to the conclusion. Moreover, it seems Oracle developer, Dag Wanvik, accepted this as a known limitation, but the bug still remains "Open" and nobody knows if it was copied to the internal bugs database, got prioritized and if any work on this is planned any time soon. We shell see. You may also want to monitor MDEV-10855.
Bug #88373 - "Renaming a column breaks replication from 5.7 to 8.0 because of impl. collation". This bug was quickly verified by Umesh Shastry. I expect a lot of "fun" for users upgrading to MySQL 8.0 when it becomes GA, especially in replication setups.
Bug #88328 - "Performance degradation with the slave_parallel_workers increase". There is no test case, just some general description and ideas about the root case when semi-sync replication is used. I expect this bug to stay "Open" for a long time, as it is a topic for a good research and blog posts like this one, that is, a work for real expert!
Bug #88223 - "Replication with no tmpdir space and InnoDB as tmp_storage_engine can break". Here we have clear and simple test case from Sveta Smirnova (no wonder, she also worked at bugs verification team in MySQL, Sun and Oracle). I hope Umesh will verify it soon. As a side note, it is explained (in the comments) elsewhere that InnoDB as internal_tmp_disk_storage_engine may not be the best possible option. We do not have this variable and do not plan to support InnoDB for internal temporary tables in MariaDB 10.2+.
Bug #88220 - "compressing and uncompressing InnoDB tables seems to be inconsistent". See also other, older bug reports mentioned there that are duplicates/closely related, but were not getting proper attention.
Bug #88150 - "'Undo log record is too big.' error occurring in very narrow range of str length". It was reported by my colleague Geoff Montee and is already fixed in recent versions of MariaDB (see MDEV-14051 for the details and some nice examples of gdb usage by a developer)!
Bug #88127 - "Index not used for 'order by' query with utf8mb4 character set". Here I am just curious when bugs like that would be caught up by Oracle QA before any public releases.
Bug #88071 - "An arresting Performance degradation when set sort_buffer_size=32M". here the test case is clear - just run sysbench oltp test at high concurrency with different values of sort_buffer_size. Still, Sinisa Milivojevic decided to explain when RAM limit may play a role instead of just showing how it works great (if it does) on any server with enough RAM... Let's see how this attempt to force bug reporter to work/explain more may end up...
Bug #87947 - "Optimizer chooses ref over range when access when range access is faster". Nice example of a case when optimizer trace may be really useful. Øystein Grøvlen kindly explained that "range access and ref access are not comparable costs". I wish we get better cost model for such cases in MySQL one day.
Bug #87837 - "MySQL 8 does not start after upgrade to 8.03". It is expected actually, and even somewhat documented in the release notes that MySQL 8.0.3 is not compatible to any older version. So, it is more like MySQL Installer (that I do not care much about) bug, but I still subscribed to it as yet another source of potential fun during further upgrade attempts.
Bug #87716 - "SELECT FOR UPDATE with BETWEEN AND gets row lock excessively". I think I already studied once why with IN() rows are locked differently by InnoDB comparing to BETWEEN that selects the same rows. But I'd like to know what's the Oracle's take on this, and I'd like to study this specific test case in details one day as well.
Bug #87670 - "Force index for group by is not always honored". Clear and simple test case, so no wonder it was immediately verified.
Bug #87621 - "Huge InnoDB slowdown when selecting strings without indexes ". I'd like to check with perf one day where the time is spent mostly during this test. For now I think this is a result of the way "long" data are stored on separate pages in InnoDB. What;'s interesting here is also a test case where R is used to generate data set.
Bug #87589 - "Documentation incorrectly states that LOAD DATA LOCAL INFILE does not use tmpdir". This was yet another report from my colleague Geoff Montee. lsof is your friend, maybe I have to talk about it one day at FOSDEM (call for papers is still open :) I like to find and follow bugs and missing details in MySQL manual, maybe because I would never be able to contribute to it as a writer directly...

So, this list shows my typical recent interests related to MySQL bugs - mostly InnoDB, optimizer, replication problems, fine manual and just some fun details like the way some Oracle engineers try to avoid working extra hard while processing bugs... I am also happy to know that in some cases MariaDB is able to deliver fixes faster.

Saturday, September 2, 2017

MySQL Support Engineer's Chronicles, Issue #8

This week is special and full of anniversaries for me. This week 5 years ago I left Oracle behind and joined Percona... Same week 5 years ago I had written something about MySQL in this blog for the first time in my life. 5 years ago I've created my Facebook account that I actively (ab)use for discussing work-related issues. So, to summarize, it's a five years anniversary of my coming out as a MySQL Entomologist, somebody who writes and speaks about MySQL and bugs in MySQL in public! These 5 years were mostly awesome.

I decided to celebrate with yet another post in this relatively new series and summarize in short what interesting things I studied, noticed or had to work on this week while providing support to customers of all kinds of MySQL.

This week started for me with the need to find out why mariabackup fails on Windows, for one of customers. If you missed it, MariaDB created a tool for online backup based on Percona's XtraBackup that supports few additional features (like data at rest encryption of MariaDB Server) and works on Windows as well, included it into MariaDB server and even declared it "Stable" as of MariaDB 10.1.26. In the process of working on that problem I had to use procmon tool, based on this KB article. The root cause of the problem was NTFS compression used for the target directory (see MDEV-13691 by Vladislav Vaintroub , who forces lazy me to improve my rudimentary Windows user skills from time to time, for some related details). So, better do not use NTFS compression of the backup destination if you need to back up big enough (50G+) tables. I really enjoyed working with procmon that helped to find out what could cause (somewhat random) "error 22" failures.

I was (positively) surprised to find out there there is a MariaDB KB article on such a specific topic as troubleshooting on Windows. Besides this one, I had to use the following KB articles while working with customers this week:

and found something new (for me) there. I never cared to find out what join_cache_level is used for previously, for example.

Besides mariabackup, this week I had to discuss various problems related to backing up TokuDB tables, so you should expect my blog posts related to this topic soon.

My colleague Geoff Montee published a useful post this week, "Automatically Dropping Old Partitions in MySQL and MariaDB". Make sure to check comments and his followup. Geoff had also reported nice Bug #87589 - "Documentation incorrectly states that LOAD DATA LOCAL INFILE does not use tmpdir", that is still "Open" for some reason.

During such a great week I had to report some MySQL bug, so I did it. Check Bug #87624 - "Wrong statement-based binary logging for CONCAT() in stored procedure". I've seen all these "Illegal mix of collations..." way too often over years.

Other bugs that attracted my attention were:

Bug #84108 - "Mysql 5.7.14 hangs and kills itself during drop database statement". This bug should probably become "Open" and properly processed, as current "Duplicate" status is questionable, at best. Arjen Lentz attracted mine (and not only mine) attention to this old enough and improperly handled bug report.
Bug #87619 - "InnoDB partition table will lock into the near record as a condition in the use". Nice to see this regression bug "Verified" after all. It seems native partitioning in MySQL 5.7 came with some cost of extra/different locking.

Time to stop writing and prepare for further celebrations, fun and anniversaries. Stay tuned!

Sunday, July 23, 2017

Why Thread May Hang in "Waiting for table level lock" State - Part I

Last time I had to use gdb to check what's going on in MySQL server and found something useful with it to share in the blog post it was April 2017, and I miss this kind of experience already. So, today I decided to try it again in a hope to get some insights in cases when other tools may not apply or may not be as helpful as one could expect. Here is the long enough story, based on recent customer issue I worked on this week.

* * *

Had you seen anything like this output of SHOW PROCESSLIST statement:

Id User Host db Command Time State 
Info Progress
...
28 someuser01 xx.xx.xx.xx:39644 somedb001 Sleep 247  
NULL 0.000
29 someuser01 xx.xx.xx.yy:44100 somedb001 Query 276 
Waiting for table level lock DELETE FROM t1 WHERE (some_id = 'NNNNN') AND ... 0.000
...
33 someuser01 xx.xx.zz.tt:35886 somedb001 Query 275 
Waiting for table level lock DELETE FROM t2 WHERE (some_id = 'XXXXX') 0.000
...
38 someuser01 xx.xx.xx.xx:57055 somedb001 Query 246 
Waiting for table level lock DELETE FROM t3 WHERE (some_id in (123456)) AND ...  0.000
...

recently? That is, many threads accessing InnoDB(!) tables and hanging in the "Waiting for table level lock" state for a long time without any obvious reason in the SHOW PROCESSLIST or SHOW ENGINE INNODB STATUS?

I've seen it this week, along with a statement from customer that the server is basically unusable. Unlike in many other cases I had not found any obvious problem from the outputs mentioned above (other than high concurrency and normal InnoDB row lock waits etc for other sessions). There were no unexpected table level locks reported by InnoDB, and the longest running transactions were those in that waiting state.

Unfortunately fine MySQL manual is not very helpful when describing this thread state:

"Waiting for lock_type lock
The server is waiting to acquire a THR_LOCK lock or a lock from the metadata locking subsystem, where lock_type indicates the type of lock.
This state indicates a wait for a THR_LOCK:

Waiting for table level lock

These states indicate a wait for a metadata lock:

Waiting for event metadata lock

Waiting for global read lock ..."

What that "THR_LOCK" should tell the reader is beyond me (and probably most of MySQL users). Fortunately, I know from experience that this state usually means that thread is trying to access some MyISAM table. The tables mentioned in currently running statements were all InnoDB, though. Server had performance_schema enabled, but only with default instrumentation and consumers defined, so I had no chance to get anything related to metadata locks or much hope to see all previous statement executed by each thread anyway. I surely insisted on checking the source code etc, but this rarely gives immediate results.

So, to check my assumption and try to prove MyISAM tables are involved, and get some hint what sessions they were locked by, I suggested to run mysqladmin debug, the command that is rarely used these days, with the output that Oracle explicitly refused to document properly (see my Bug #71346)! I do know what to expect there (from the days when MyISAM was widely used), and when in doubts I can always check the source code.

The mysqladmin debug command outputs a lot of information into the error log. Among other things I found the following section:

Thread database.table_name          Locked/Waiting        Lock_type

...
928     somedb001.somewhat_seq      Waiting - write       High priority write lock
...
12940   somedb001.somewhat_seq      Locked - write        High priority write lock

where somedb001.somewhat_seq surely was the MyISAM table. There were actually many tables mentioned as locked, with waits on them, and names ending with _seq suffix. One can now check what MySQL thread with id=12940 is doing at all levels, where it comes from, should it be killed etc.

Now it became more clear what's going on. Probably application developers tried to implement sequences with MyISAM tables and, probably, triggers to use them when data are not provided for the columns! The idea is quite popular and described in many blog posts. Check this old blog post by Peter Zaitsev, for example. This assumption was confirmed, and essentially they did something like I did in the following deliberately primitive example:

-- this is our sequence table
mysql> show create table misam\G
*************************** 1. row ***************************
       Table: misam
Create Table: CREATE TABLE `misam` (
`id` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8mb4
1 row in set (0.00 sec)

mysql> create table inno like misam;
Query OK, 0 rows affected (0.30 sec)

mysql> alter table inno engine=InnoDB;
Query OK, 0 rows affected (2.73 sec)
Records: 0 Duplicates: 0 Warnings: 0

mysql> show create table inno\G
*************************** 1. row ***************************
       Table: inno
Create Table: CREATE TABLE `inno` (
`id` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
1 row in set (0.03 sec)

mysql> select * from misam;
Empty set (0.00 sec)

-- let's set up the first value for the sequence
mysql> insert into misam values(1);
Query OK, 1 row affected (0.05 sec)

-- now let's create trigger to insert the value from sequencemysql> delimiter //
mysql> create trigger tbi before insert on inno
    -> for each row
    -> begin
    -> if ifnull(new.id,0) = 0 then
    ->   update misam set id=last_insert_id(id+1);
    ->   set new.id = last_insert_id();
    -> end if;
    -> end
    -> //
Query OK, 0 rows affected (0.25 sec)

mysql> delimiter ;
mysql> start transaction;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into inno values(0);
Query OK, 1 row affected (0.20 sec)

mysql> insert into inno values(0);
Query OK, 1 row affected (0.10 sec)

mysql> select * from inno;
+----+
| id |
+----+
| 2 |
| 3 |
+----+
2 rows in set (0.00 sec)

mysql> select * from misam;
+----+
| id |
+----+
| 3 |
+----+
1 row in set (0.00 sec)

So, this is how it is supposed to work, and, you know, it works from two concurrent sessions without any locking noticed etc.

This is the end of the real story, for now. We have a lot of things to discuss there with customer, including how to find out previous commands executed in each sessions, what useful details we can get from the performance_schema etc. The root cause was clear, but we still have to find out why the sessions are spending so much time holding the blocking locks on MyISAM tables and what we can do about that, or with architecture based on such implementation of sequences (that I kindly ask you to avoid whenever possible! Use auto_increment values, please, if you care about concurrency.) I expect more blog posts eventually inspired by that real story, but now it's time to move to more generic technical details.

* * *

I have two problems with this story if I try to approach it in a more generic way.

First, do we have any other way to see MyISAM table level locks, besides mysqladmin debug command? (Oracle kindly plans to replace with something, let me quote Shane Bester's comment the bug that originates from hist internal feature request: "We need to replace it with appropriate performance_schema or information_schema tables.")

Second, how exactly to reproduce the situation customer reported? My initial attempts to get such a status with the trigger in place failed, I was not able to capture threads spending any notable time in "Waiting for table level lock" state even with concurrent single row inserts and the trigger above in place. I was thinking about explicit LOCK TABLES misam WRITE etc, but I doubt real code does use that. Fortunately, the following INSERT running from one session:

mysql> insert into inno values(sleep(100));
Query OK, 1 row affected (1 min 40.20 sec)

allowed me to run INSERT in the other session that got blocked:

mysql> insert into inno values(0);
Query OK, 1 row affected (1 min 36.36 sec)

and while it was blocked I've seen the thread state I wanted in the output of the SHOW PROCESSLIST:

mysql> show processlist;+----+------+-----------+------+---------+------+------------------------------+-------------------------------------+-----------+---------------+
| Id | User | Host      | db   | Command | Time | State                        | Info                                | Rows_sent | Rows_examined |
+----+------+-----------+------+---------+------+------------------------------+-------------------------------------+-----------+---------------+
| 16 | root | localhost | test | Query   |   17 | User sleep                   | insert into inno values(sleep(100)) |         0 |             0 |
| 17 | root | localhost | test | Query   |   13 | Waiting for table level lock | insert into inno values(0)          |         0 |             0 |
| 22 | root | localhost | test | Query   |    0 | starting                     | show processlist                    |         0 |             0 |
+----+------+-----------+------+---------+------+------------------------------+-------------------------------------+-----------+---------------+
3 rows in set (0.00 sec)

So, I know how exactly to reproduce the problem and what it can be caused by. Any slow running single INSERT statement (caused by something complex executed in trigger, some slow function used in the list of values inserted, or, maybe just INSERT ... SELECT ... from the big table) will give us the desired thread state.

Coming back to the first generic problem mentioned above, is there any way besides running mysqladmin debug and checking the output vs processlist etc, to identify the thread that holds MyISAM table level locks? One should expect performance_schema to help with this, at least as of 5.7 (I've used Percona Server 5.7.18-15 on my Ubuntu 14.04 netbook for testing today, while the original problem was on MariaDB 10.2.x. Percona is for fun, while MariaDB is for real work...). Indeed, we have table_handles there, and until recently I ignored its existence (it's new in 5.7, and I am not even sure if MariaDB 10.2.x has it already).

So, I reproduced the problem again and got the following there:

mysql> select * from performance_schema.table_handles;
+-------------+---------------+-------------+-----------------------+-----------------+----------------+---------------+----------------+
| OBJECT_TYPE | OBJECT_SCHEMA | OBJECT_NAME | OBJECT_INSTANCE_BEGIN | OWNER_THREAD_ID | OWNER_EVENT_ID | INTERNAL_LOCK | EXTERNAL_LOCK |
+-------------+---------------+-------------+-----------------------+-----------------+----------------+---------------+----------------+
| TABLE       | test          | t           |       140392772351024 |               0 |              0 | NULL          | NULL           |
| TABLE       | test          | t1          |       140392772352560 |               0 |              0 | NULL          | NULL           |
| TABLE       | test          | ttime       |       140392772355632 |               0 |              0 | NULL          | NULL           |
| TABLE       | test          | misam       |       140392772358704 |              55 |             10 | WRITE         | WRITE EXTERNAL |
| TABLE       | test          | misam       |       140392896981552 |              54 |             10 | WRITE         | WRITE EXTERNAL |
| TABLE       | test          | inno        |       140392772361776 |              55 |             10 | NULL          | WRITE EXTERNAL |
| TABLE       | test          | inno        |       140392896983088 |              54 |             10 | NULL          | WRITE EXTERNAL || TABLE       | test          | inno        |       140392826836016 |               0 |              0 | NULL          | NULL           |
| TABLE       | test          | misam       |       140392826837552 |               0 |              0 | NULL          | NULL           |
+-------------+---------------+-------------+-----------------------+-----------------+----------------+---------------+----------------+
9 rows in set (0.00 sec)

What I am supposed to do with the above to find out the MySQL thread id of the blocking session? I am supposed to join to performance_schema.threads table, looking for thread_id value that is the same as owner_thread_id above, 54 and 55. I get the following, note that processlist_id is what you are looking for in the processlist:

mysql> select * from performance_schema.threads where thread_id in (54, 55)\G
*************************** 1. row ***************************
          THREAD_ID: 54
               NAME: thread/sql/one_connection
               TYPE: FOREGROUND
     PROCESSLIST_ID: 27
   PROCESSLIST_USER: root
   PROCESSLIST_HOST: localhost
     PROCESSLIST_DB: test
PROCESSLIST_COMMAND: Sleep
   PROCESSLIST_TIME: 90
PROCESSLIST_STATE: NULL
   PROCESSLIST_INFO: NULL
   PARENT_THREAD_ID: NULL
               ROLE: NULL
       INSTRUMENTED: YES
            HISTORY: YES
    CONNECTION_TYPE: Socket
       THREAD_OS_ID: 32252
*************************** 2. row ***************************
          THREAD_ID: 55
               NAME: thread/sql/one_connection
               TYPE: FOREGROUND
     PROCESSLIST_ID: 28
   PROCESSLIST_USER: root
   PROCESSLIST_HOST: localhost
     PROCESSLIST_DB: test
PROCESSLIST_COMMAND: Sleep
   PROCESSLIST_TIME: 101
PROCESSLIST_STATE: NULL
   PROCESSLIST_INFO: NULL
   PARENT_THREAD_ID: NULL
               ROLE: NULL
       INSTRUMENTED: YES
            HISTORY: YES
    CONNECTION_TYPE: Socket
       THREAD_OS_ID: 27844
2 rows in set (0.10 sec)

I do not see a way to distinguish lock wait from actually holding the lock. All I know at the moment is that there is engine-level lock on misam table (and inno table, for that matter). If the MySQL thread id is NOT for the thread that is "Waiting for table level lock", then it must be the thread that holds the lock! Some smart join with proper WHERE clause would let me to find out what I need directly. Maybe in one of the next parts I'll even present it, but writing it from the top of my head in critical situation is above my current skills related to performance_schema.

* * *

Now, what those of us should do who do not have performance_schema enabled, or have to use version before 5.7? Or those with access to gdb and spare 5 minutes?

Surely we should attach gdb to the mysqld process and, if in doubts, read some parts of the source code to know where to start. I started with the following fragments of code that were easy to find (as long as you know that COM_DEBUG command is actually sent by mysqladmin debug to server):

openxs@ao756:~/git/percona-server$ grep -rni com_debug *
include/mysql/plugin_audit.h.pp:160: COM_DEBUG,
include/mysql.h.pp:47: COM_DEBUG,
include/my_command.h:39: COM_DEBUG,
libmysql/libmysql.c:899: DBUG_RETURN(simple_command(mysql,COM_DEBUG,0,0,0));
rapid/plugin/group_replication/libmysqlgcs/src/bindings/xcom/xcom/task_debug.h:62: "[XCOM_DEBUG] ",
sql/sql_parse.cc:1884: case COM_DEBUG:
^C
...
# from the line highlighted above... case COM_DEBUG:
    thd->status_var.com_other++;
    if (check_global_access(thd, SUPER_ACL))
      break;                                    /* purecov: inspected */
    mysql_print_status();    query_logger.general_log_print(thd, command, NullS);
    my_eof(thd);
    break;

openxs@ao756:~/git/percona-server$ grep -rni mysql_print_status *
sql/sql_test.h:40:void mysql_print_status();
sql/sql_parse.cc:59:#include "sql_test.h"         // mysql_print_status
sql/sql_parse.cc:1888:    mysql_print_status();
sql/sql_test.cc:456:void mysql_print_status()sql/mysqld.cc:69:#include "sql_test.h"     // mysql_print_status
sql/mysqld.cc:2387:        mysql_print_status();   // Print some debug info
openxs@ao756:~/git/percona-server$

# from the line highlighted above...
void mysql_print_status()
{
char current_dir[FN_REFLEN];
STATUS_VAR current_global_status_var;

printf("\nStatus information:\n\n");
(void) my_getwd(current_dir, sizeof(current_dir),MYF(0));
printf("Current dir: %s\n", current_dir);
printf("Running threads: %u Stack size: %ld\n",
         Global_THD_manager::get_instance()->get_thd_count(),
         (long) my_thread_stack_size);
thr_print_locks();                            // Write some debug info
...
printf("\nTable status:\n\
...
display_table_locks();...
# finally, let's find the code of display_table_locks...openxs@ao756:~/git/percona-server$ grep -rni display_table_locks *
sql/sql_test.cc:371:static void display_table_locks(void)sql/sql_test.cc:501: display_table_locks();
sql/mysqld.cc:9537: { &key_memory_locked_thread_list, "display_table_locks", PSI_FLAG_THREAD},
openxs@ao756:~/git/percona-server$

static void display_table_locks(void)
{
LIST *list;
Saved_locks_array saved_table_locks(key_memory_locked_thread_list);
saved_table_locks.reserve(table_cache_manager.cached_tables() + 20);

mysql_mutex_lock(&THR_LOCK_lock);
for (list= thr_lock_thread_list; list; list= list_rest(list))
{
    THR_LOCK *lock=(THR_LOCK*) list->data;
    mysql_mutex_lock(&lock->mutex);
    push_locks_into_array(&saved_table_locks, lock->write.data, FALSE,
                          "Locked - write");
    push_locks_into_array(&saved_table_locks, lock->write_wait.data, TRUE,
                          "Waiting - write");
    push_locks_into_array(&saved_table_locks, lock->read.data, FALSE,
                          "Locked - read");
    push_locks_into_array(&saved_table_locks, lock->read_wait.data, TRUE,
                          "Waiting - read");
    mysql_mutex_unlock(&lock->mutex);
}
...

That's enough to start. We'll have to finally study what THR_LOCK is. There is a global (double) linked lists of all them, thr_lock_thread_list, and this is where we can get details from, in a way similar (let's hope) to those server does while processing COM_DEBUG command for us. So, let's attach gdb and let's fun begin:

openxs@ao756:~/git/percona-server$ sudo gdb -p 23841
[sudo] password for openxs:
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
...
Loaded symbols for /lib/x86_64-linux-gnu/libnss_files.so.2
0x00007fafd93d9c5d in poll () at ../sysdeps/unix/syscall-template.S:81
81      ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) set $list=(LIST *)thr_lock_thread_list
(gdb) set $lock=(THR_LOCK*) $list->data
(gdb) p *($lock)
$1 = {list = {prev = 0x0, next = 0x7fafd2fa4780, data = 0x7fafbd545c80},
mutex = {m_mutex = {__data = {__lock = 0, __count = 0, __owner = 0,
        __nusers = 0, __kind = 3, __spins = 0, __elision = 0, __list = {
          __prev = 0x0, __next = 0x0}},
      __size = '\000' <repeats 16 times>, "\003", '\000' <repeats 22 times>,
      __align = 0}, m_psi = 0x7fafd1f97c80}, read_wait = {data = 0x0,
    last = 0x7fafbd545cc8}, read = {data = 0x0, last = 0x7fafbd545cd8},
write_wait = {data = 0x0, last = 0x7fafbd545ce8}, write = {
    data = 0x7fafbd69f188, last = 0x7fafbd69f190}, write_lock_count = 0,
read_no_write_count = 0, get_status = 0xf84910 <mi_get_status>,
copy_status = 0xf84a70 <mi_copy_status>,
update_status = 0xf84970 <mi_update_status>,
restore_status = 0xf84a50 <mi_restore_status>,
check_status = 0xf84a80 <mi_check_status>}

In the above we started from the first item in the list. If needed we can move on to the next with set $list=(LIST *)($list->next) etc. We then interpret $list->data as a (THD_LOCK*). There we can see read, read_wait, write and write_wait structures. One of them will have data item that is not 0x0. This way we can identify is it a lock or lock wait, and what kind of lock is it. In my case i was looking for a write lock, and here it is, the very first one. So, I continue:

(gdb) p *($lock)->write.data
$2 = {owner = 0x7fafbd468d38, next = 0x0, prev = 0x7fafbd545cf8,
lock = 0x7fafbd545c80, cond = 0x0, type = TL_WRITE,
status_param = 0x7fafbd69ee20, debug_print_param = 0x7fafbd515020,
m_psi = 0x7fafa4c06cc0}

We can check the owner field:

(gdb) p *($lock)->write.data.owner
$3 = {thread_id = 16, suspend = 0x7fafbd469370}

and thread_id there is what we were looking for, the MySQL thread id of the blocking thread. If we want to get the details about the table locks we can do this as follows, by studying debug_print_param in data:

(gdb) set $table=(TABLE *) &(*($lock)->write.data->debug_print_param)
(gdb) p $table
$4 = (TABLE *) 0x7fafbd515020
(gdb) p *($table)
$5 = {s = 0x7fafbd4d7030, file = 0x7fafbd535230, next = 0x0, prev = 0x0,
cache_next = 0x0, cache_prev = 0x7fafc4df05a0, in_use = 0x7fafbd468000,
field = 0x7fafbd4d7440, hidden_field_count = 0, record = {
    0x7fafbd4d7430 "\377", 0x7fafbd4d7438 "\n"}, write_row_record = 0x0,
insert_values = 0x0, covering_keys = {map = 0}, quick_keys = {map = 0},
merge_keys = {map = 0}, possible_quick_keys = {map = 0},
keys_in_use_for_query = {map = 0}, keys_in_use_for_group_by = {map = 0},
keys_in_use_for_order_by = {map = 0}, key_info = 0x7fafbd4d7508,
next_number_field = 0x0, found_next_number_field = 0x0, vfield = 0x0,
hash_field = 0x0, fts_doc_id_field = 0x0, triggers = 0x0,
pos_in_table_list = 0x7fafbd47a9b8, pos_in_locked_tables = 0x7fafbd4e5030,
group = 0x0, alias = 0x7fafbbbad120 "misam",
null_flags = 0x7fafbd4d7430 "\377", bitmap_init_value = 0x0, def_read_set = {
    bitmap = 0x7fafbd4d75a8, n_bits = 1, last_word_mask = 4294967294,
    last_word_ptr = 0x7fafbd4d75a8, mutex = 0x0}, def_write_set = {
    bitmap = 0x7fafbd4d75ac, n_bits = 1, last_word_mask = 4294967294,
    last_word_ptr = 0x7fafbd4d75ac, mutex = 0x0}, tmp_set = {
    bitmap = 0x7fafbd4d75b0, n_bits = 1, last_word_mask = 4294967294,
    last_word_ptr = 0x7fafbd4d75b0, mutex = 0x0}, cond_set = {
    bitmap = 0x7fafbd4d75b4, n_bits = 1, last_word_mask = 4294967294,
    last_word_ptr = 0x7fafbd4d75b4, mutex = 0x0},
def_fields_set_during_insert = {bitmap = 0x7fafbd4d75b8, n_bits = 1,
    last_word_mask = 4294967294, last_word_ptr = 0x7fafbd4d75b8, mutex = 0x0},
---Type <return> to continue, or q <return> to quit---
read_set = 0x7fafbd515128, write_set = 0x7fafbd515148,
fields_set_during_insert = 0x7fafbd5151a8, query_id = 0, quick_rows = {
    0 <repeats 64 times>}, const_key_parts = {0 <repeats 64 times>},
quick_key_parts = {0 <repeats 64 times>}, quick_n_ranges = {
    0 <repeats 64 times>}, quick_condition_rows = 0, lock_position = 0,
lock_data_start = 0, lock_count = 1, temp_pool_slot = 0, db_stat = 39,
current_lock = 1, nullable = 0 '\000', null_row = 0 '\000',
status = 3 '\003', copy_blobs = 0 '\000', force_index = 0 '\000',
force_index_order = 0 '\000', force_index_group = 0 '\000',
distinct = 0 '\000', const_table = 0 '\000', no_rows = 0 '\000',
key_read = 0 '\000', no_keyread = 0 '\000', locked_by_logger = 0 '\000',
no_replicate = 0 '\000', locked_by_name = 0 '\000',
fulltext_searched = 0 '\000', no_cache = 0 '\000',
open_by_handler = 0 '\000', auto_increment_field_not_null = 0 '\000',
insert_or_update = 0 '\000', alias_name_used = 0 '\000',
get_fields_in_item_tree = 0 '\000', m_needs_reopen = 0 '\000',
created = true, max_keys = 0, reginfo = {join_tab = 0x0, qep_tab = 0x0,
    lock_type = TL_WRITE, not_exists_optimize = false,
    impossible_range = false}, mem_root = {free = 0x7fafbd4d7420,
    used = 0x7fafbd535220, pre_alloc = 0x0, min_malloc = 32, block_size = 992,
    block_num = 6, first_block_usage = 0, max_capacity = 0,
    allocated_size = 2248, error_for_capacity_exceeded = 0 '\000',
    error_handler = 0xc4b240 <sql_alloc_error_handler()>, m_psi_key = 106},
---Type <return> to continue, or q <return> to quit---
blob_storage = 0x0, grant = {grant_table = 0x0, version = 0,
    privilege = 536870911, m_internal = {m_schema_lookup_done = true,
      m_schema_access = 0x0, m_table_lookup_done = true,
      m_table_access = 0x0}}, sort = {filesort_buffer = {m_next_rec_ptr = 0x0,
      m_rawmem = 0x0, m_record_pointers = 0x0, m_sort_keys = 0x0,
      m_num_records = 0, m_record_length = 0, m_sort_length = 0,
      m_size_in_bytes = 0, m_idx = 0}, io_cache = 0x0, merge_chunks = {
      m_array = 0x0, m_size = 0}, addon_fields = 0x0,
    sorted_result_in_fsbuf = false, sorted_result = 0x0,
    sorted_result_end = 0x0, found_records = 0}, part_info = 0x0,
all_partitions_pruned_away = false, mdl_ticket = 0x7fafbd4c4f10,
m_cost_model = {m_cost_model_server = 0x0, m_se_cost_constants = 0x0,
    m_table = 0x0}, should_binlog_drop_if_temp_flag = false}

We already see table alias, but more details are in the s filed (of TABLE_SHARE * type):

(gdb) p $table->s$26 = (TABLE_SHARE *) 0x7fafbd4d7030
...
(gdb) p $table->s->table_cache_key
$27 = {str = 0x7fafbd4d73b8 "test", length = 11}
...
(gdb) p $table->s->db
$30 = {str = 0x7fafbd4d73b8 "test", length = 4}
(gdb) p $table->s->path
$31 = {str = 0x7fafbd4d73c8 "./test/misam", length = 12}
(gdb) p $table->s->table_name
$32 = {str = 0x7fafbd4d73bd "misam", length = 5}
# to remind you, this is how to get the thread id directly
(gdb) p (*($lock)->write.data.owner).thread_id
$37 = 16

I skilled some of my tests and intermediate results. To summarize, for me it took only a bit more time to search the code and try commands in gdb than to find out that metadata_locks does not help and one has to use table_handles in performance_schema. Next time in gdb I'll do it instantly, and the number of key strokes is far less :)

Surely, for production environments I'll have to write, test and note somewhere proper queries to the performance_schema. For next attempts with gdb and, maybe, FOSDEM talks I'll have to read the code to remember key details about the data structures used above... More posts to come, stay tuned!