Showing posts with label howto. Show all posts
Showing posts with label howto. Show all posts

Thursday, February 10, 2022

How to Summarize gdb Backtrace with pt-pmp (and Flamegraph)

This is going to be more like a note to myself and my readers than a real blog post. But still I'd like o document the trick that I am applying for years already, in its most convenient form. 

Assuming you have a backtrace (or full backtrace) created from a core file or by attaching gdb to a live (and maybe hanging) process with many threads, like MySQL or MariaDB server, what is the best way to summarize it quickly, to understand what most of the threads are doing/hanging at? Like this huge backtrace with hundreds of threads:

openxs@ao756:~$ ls -l /tmp/backtrace1.txt
-rw-rw-r-- 1 openxs openxs 2817054 лют 10 17:02 /tmp/backtrace1.txt
openxs@ao756:~$ grep LWP /tmp/backtrace1.txt | wc -l
1915

Here it is. You have to download pt-pmp from Percona Toolkit. Then source the pt-pmp script and rely on the aggregate_stacktrace function from it, quite advanced in summarizing stack traces:

openxs@ao756:~$ which pt-pmp
/usr/bin/pt-pmp
openxs@ao756:~$ . /usr/bin/pt-pmp
openxs@ao756:~$ cat /tmp/backtrace1.txt | aggregate_stacktrace > /tmp/pmp1.txt 
openxs@ao756:~$ ls -l /tmp/pmp1.txt
-rw-rw-r-- 1 openxs openxs 34174 лют 10 18:07 /tmp/pmp1.txt
openxs@ao756:~$ head -5 /tmp/pmp1.txt
    598 poll(libc.so.6),vio_io_wait(viosocket.c:945),vio_socket_io_wait(viosocket.c:108),vio_read(viosocket.c:184),my_real_read(net_serv.cc:892),my_net_read_packet_reallen(net_serv.cc:1162),my_net_read_packet(net_serv.cc:1146),do_command(sql_parse.cc:1262),do_handle_one_connection(sql_connect.cc:1336),handle_one_connection(sql_connect.cc:1241),pfs_spawn_thread(pfs.cc:1862),start_thread(libpthread.so.0),clone(libc.so.6)
     82 pthread_cond_wait,wait(os0event.cc:158),wait_low(os0event.cc:158),os_event_wait_low(os0event.cc:158),sync_array_wait_event(sync0arr.cc:471),rw_lock_s_lock_spin(sync0rw.cc:373),rw_lock_s_lock_func(sync0rw.ic:290),pfs_rw_lock_s_lock_func(sync0rw.ic:290),buf_page_get_gen(buf0buf.cc:4905),btr_cur_search_to_nth_level(btr0cur.cc:1243),btr_pcur_open_low(btr0pcur.ic:467),row_ins_scan_sec_index_for_duplicate(btr0pcur.ic:467),row_ins_sec_index_entry_low(btr0pcur.ic:467),row_ins_sec_index_entry(row0ins.cc:3251),row_ins_index_entry(row0ins.cc:3297),row_ins_index_entry_step(row0ins.cc:3297),row_ins(row0ins.cc:3297),row_ins_step(row0ins.cc:3297),row_insert_for_mysql(row0mysql.cc:1414),ha_innobase::write_row(ha_innodb.cc:8231),handler::ha_write_row(handler.cc:6089),write_record(sql_insert.cc:1941),mysql_insert(sql_insert.cc:1066),mysql_execute_command(sql_parse.cc:4170),mysql_parse(sql_parse.cc:7760),dispatch_command(sql_parse.cc:1832),do_command(sql_parse.cc:1386),do_handle_one_connection(sql_connect.cc:1336),handle_one_connection(sql_connect.cc:1241),pfs_spawn_thread(pfs.cc:1862),start_thread(libpthread.so.0),clone(libc.so.6)
     55 pthread_cond_wait,wait(os0event.cc:158),wait_low(os0event.cc:158),os_event_wait_low(os0event.cc:158),lock_wait_suspend_thread(lock0wait.cc:347),row_mysql_handle_errors(row0mysql.cc:741),row_insert_for_mysql(row0mysql.cc:1428),ha_innobase::write_row(ha_innodb.cc:8231),handler::ha_write_row(handler.cc:6089),write_record(sql_insert.cc:1941),mysql_insert(sql_insert.cc:1066),mysql_execute_command(sql_parse.cc:4170),mysql_parse(sql_parse.cc:7760),dispatch_command(sql_parse.cc:1832),do_command(sql_parse.cc:1386),do_handle_one_connection(sql_connect.cc:1336),handle_one_connection(sql_connect.cc:1241),pfs_spawn_thread(pfs.cc:1862),start_thread(libpthread.so.0),clone(libc.so.6)
     38 pthread_cond_wait,wait(os0event.cc:158),wait_low(os0event.cc:158),os_event_wait_low(os0event.cc:158),sync_array_wait_event(sync0arr.cc:471),rw_lock_x_lock_func(sync0rw.cc:733),pfs_rw_lock_x_lock_func(sync0rw.ic:544),buf_page_get_gen(buf0buf.cc:4918),btr_cur_search_to_nth_level(btr0cur.cc:1243),row_ins_sec_index_entry_low(row0ins.cc:2946),row_ins_sec_index_entry(row0ins.cc:3251),row_ins_index_entry(row0ins.cc:3297),row_ins_index_entry_step(row0ins.cc:3297),row_ins(row0ins.cc:3297),row_ins_step(row0ins.cc:3297),row_insert_for_mysql(row0mysql.cc:1414),ha_innobase::write_row(ha_innodb.cc:8231),handler::ha_write_row(handler.cc:6089),write_record(sql_insert.cc:1941),mysql_insert(sql_insert.cc:1066),mysql_execute_command(sql_parse.cc:4170),mysql_parse(sql_parse.cc:7760),dispatch_command(sql_parse.cc:1832),do_command(sql_parse.cc:1386),do_handle_one_connection(sql_connect.cc:1336),handle_one_connection(sql_connect.cc:1241),pfs_spawn_thread(pfs.cc:1862),start_thread(libpthread.so.0),clone(libc.so.6)
     32 pthread_cond_wait,wait(os0event.cc:158),wait_low(os0event.cc:158),os_event_wait_low(os0event.cc:158),sync_array_wait_event(sync0arr.cc:471),rw_lock_x_lock_func(sync0rw.cc:733),pfs_rw_lock_x_lock_func(sync0rw.ic:544),buf_page_get_gen(buf0buf.cc:4918),btr_cur_search_to_nth_level(btr0cur.cc:1243),row_ins_sec_index_entry_low(row0ins.cc:3040),row_ins_sec_index_entry(row0ins.cc:3251),row_ins_index_entry(row0ins.cc:3297),row_ins_index_entry_step(row0ins.cc:3297),row_ins(row0ins.cc:3297),row_ins_step(row0ins.cc:3297),row_insert_for_mysql(row0mysql.cc:1414),ha_innobase::write_row(ha_innodb.cc:8231),handler::ha_write_row(handler.cc:6089),write_record(sql_insert.cc:1941),mysql_insert(sql_insert.cc:1066),mysql_execute_command(sql_parse.cc:4170),mysql_parse(sql_parse.cc:7760),dispatch_command(sql_parse.cc:1832),do_command(sql_parse.cc:1386),do_handle_one_connection(sql_connect.cc:1336),handle_one_connection(sql_connect.cc:1241),pfs_spawn_thread(pfs.cc:1862),start_thread(libpthread.so.0),clone(libc.so.6)

If a summary like above is not clear enough, we can surely go one step further and create a proper flame graph based on the output above:

openxs@ao756:~$ cat /tmp/pmp1.txt | awk '{print $2, $1}' | sed -e 's/,/;/g' | ~/git/FlameGraph/flamegraph.pl --countname="threads" --reverse - >/tmp/pmp1.svg  

Then with some creative zooming and search we can concentrate on waits:

Frames with "wait" substring in the function name are highlighted

That's all. As simple as it looks. Quick overview of the backtrace before you start digging there.

Friday, February 4, 2022

How to Add Probe Inside a Function and Access Local Variables with bpftrace

During several my talks on bpftrace I mentioned that it is possible to add probe to "every other" line of the code, same as with perf, but had never demonstrated how to do this. Looks like it's time to show both this and access to local variables declared inside the function we trace, as I am going to speak about bpftrace again at FOSDEM (and have to share something new and cool).

Consider the following debugging session with latest MariaDB server 10.8 built from GitHub source (as usual, for my own work I use only my own builds from current source) on Ubuntu 20.04:

openxs@ao756:~/dbs/maria10.8$ sudo gdb -p `pidof mariadbd`
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 200650
[New LWP 200652]
[New LWP 200653]
[New LWP 200654]
[New LWP 200655]
[New LWP 200656]
[New LWP 200659]
[New LWP 200662]
[New LWP 200663]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
--Type <RET> for more, q to quit, c to continue without paging--
0x00007f7b8424caff in __GI___poll (fds=0x557960733d38, nfds=3,
    timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29      ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
(gdb) b do_command
Breakpoint 1 at 0x55795e867570: file /home/openxs/git/server/sql/sql_parse.cc, line 1197.
(gdb) c
Continuing.

Basically I've attached to running mariadbd process and set a breakpoint on do_command(), and then let it continue working. In another terminal I've connected to server and got the breakpoint hit immediately:

[New Thread 0x7f7b5dbf0700 (LWP 200680)]
[New Thread 0x7f7b6c084700 (LWP 200797)]
[Switching to Thread 0x7f7b6c084700 (LWP 200797)]

Thread 11 "mariadbd" hit Breakpoint 1, do_command (thd=0x7f7b4c001738,
    blocking=blocking@entry=true)
    at /home/openxs/git/server/sql/sql_parse.cc:1197
1197    {
(gdb) bt
#0  do_command (thd=0x7f7b4c001738, blocking=blocking@entry=true)
    at /home/openxs/git/server/sql/sql_parse.cc:1197
#1  0x000055795e97df07 in do_handle_one_connection (connect=<optimized out>,
    put_in_cache=true) at /home/openxs/git/server/sql/sql_connect.cc:1418
#2  0x000055795e97e23d in handle_one_connection (arg=arg@entry=0x55796079f798)
    at /home/openxs/git/server/sql/sql_connect.cc:1312
#3  0x000055795ecc7e8d in pfs_spawn_thread (arg=0x557960733dd8)
    at /home/openxs/git/server/storage/perfschema/pfs.cc:2201
#4  0x00007f7b84686609 in start_thread (arg=<optimized out>)
    at pthread_create.c:477
#5  0x00007f7b84259293 in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

From this state in gdb we can trace execution step by step:

(gdb) n
[New Thread 0x7f7b5ebf2700 (LWP 200800)]
1213      if (thd->async_state.m_state == thd_async_state::enum_async_state::RESUMED)
(gdb) n
1229      thd->lex->current_select= 0;
(gdb) n
1237      if (!thd->skip_wait_timeout)
(gdb) n
1238        my_net_set_read_timeout(net, thd->get_net_wait_timeout());
(gdb) n
1241      thd->clear_error(1);
(gdb) n
1243      net_new_transaction(net);
(gdb) n
1246      thd->start_bytes_received= thd->status_var.bytes_received;
(gdb) n
1264      packet_length= my_net_read_packet(net, 1);
(gdb) n
1266      if (unlikely(packet_length == packet_error))
(gdb) n
1300      packet= (char*) net->read_pos;
(gdb) n
1309      if (packet_length == 0)                       /* safety */
(gdb) n
1316      packet[packet_length]= '\0';                  /* safety */
(gdb) n
1319      command= fetch_command(thd, packet);
(gdb) p packet
$1 = 0x7f7b4c008d08 "\003select @@version_comment limit 1"
(gdb) q
A debugging session is active.

        Inferior 1 [process 200650] will be detached.

Quit anyway? (y or n) y
Detaching from program: /home/openxs/dbs/maria10.8/bin/mariadbd, process 200650
[Inferior 1 (process 200650) detached]

In that session above I moved to the next step until I've got to the place in the code where local variable packet got a value, zero-terminated string. I know from the previous experience that it's value is the SQL statement to be executed (in most cases). So, I printed it to find out that mysql command line client starts with finding the server version comment to output.

Question is: can we do the same with bpftrace, attache the probe to the instruction at/around line 1316 of the sql/sql_parse.cc and print the value of the local variable named packet as a string? 

It should be possible, as we can do this with perf or even raw ftrace probes. This is how we can find all "tracable" lines in the code of do_command() function with perf:

openxs@ao756:~/dbs/maria10.8$ sudo perf probe -x /home/openxs/dbs/maria10.8/bin/mysqld --line do_command
<do_command@/home/openxs/git/server/sql/sql_parse.cc:0>
      0  dispatch_command_return do_command(THD *thd, bool blocking)
         {
      2    dispatch_command_return return_value;
           char *packet= 0;
           ulong packet_length;
           NET *net= &thd->net;
           enum enum_server_command command;
           DBUG_ENTER("do_command");

         #ifdef WITH_WSREP
           DBUG_ASSERT(!thd->async_state.pending_ops() ||
                       (WSREP(thd) &&
                        thd->wsrep_trx().state() == wsrep::transaction::s_abort>
         #else
           DBUG_ASSERT(!thd->async_state.pending_ops());
         #endif

           if (thd->async_state.m_state == thd_async_state::enum_async_state::R>
           {
             /*
              Resuming previously suspended command.
              Restore the state
             */
     23      command = thd->async_state.m_command;
     24      packet = thd->async_state.m_packet.str;
             packet_length = (ulong)thd->async_state.m_packet.length;
             goto resume;
           }

...
    104    packet= (char*) net->read_pos;
           /*
             'packet_length' contains length of data, as it was stored in packet
             header. In case of malformed header, my_net_read returns zero.
             If packet_length is not zero, my_net_read ensures that the returned
             number of bytes was actually read from network.
             There is also an extra safety measure in my_net_read:
             it sets packet[packet_length]= 0, but only for non-zero packets.
           */
    113    if (packet_length == 0)                       /* safety */
           {
             /* Initialize with COM_SLEEP packet */
    116      packet[0]= (uchar) COM_SLEEP;
    117      packet_length= 1;
           }
           /* Do not rely on my_net_read, extra safety against programming erro>
    120    packet[packet_length]= '\0';                  /* safety */


    123    command= fetch_command(thd, packet);
...
    221    DBUG_ASSERT(thd->m_digest == NULL);
           DBUG_ASSERT(thd->m_statement_psi == NULL);
         #ifdef WITH_WSREP
           if (packet_length != packet_error)
           {
             /* there was a command to process, and before_command() has been c>
             if (unlikely(wsrep_service_started))
    228        wsrep_after_command_after_result(thd);
           }
         #endif /* WITH_WSREP */
           DBUG_RETURN(return_value);

openxs@ao756:~/dbs/maria10.8$

So, basically we need a probe on line 120 to match the place where I printed the value in gdb. Let's create it with perf:

openxs@ao756:~/dbs/maria10.8$ sudo perf probe -x /home/openxs/dbs/maria10.8/bin/mariadbd 'do_command:120 packet:string'
Added new event:
  probe_mariadbd:do_command_L120 (on do_command:120 in /home/openxs/dbs/maria10.8/bin/mariadbd with packet:string)

You can now use it in all perf tools, such as:

        perf record -e probe_mariadbd:do_command_L120 -aR sleep 1

Now, we can check what user probe was really created by that command line:

openxs@ao756:~/dbs/maria10.8$ sudo cat /sys/kernel/tracing/uprobe_events
p:probe_mariadbd/do_command_L120 /home/openxs/dbs/maria10.8/bin/mariadbd:0x00000000007b761f packet_string=+0(%r10):string

So, the probe is attached to some address, 0x7b761f, in the mariadb binary. Looks like we refer to the register, r10, while trying to access local variable, packet.

Now let's delete this probe (we'll add it back with bpftrace soon) and try to make sense of that address:

openxs@ao756:~/dbs/maria10.8$ sudo perf probe --del 'probe_mariadbd:do_command*'
Removed event: probe_mariadbd:do_command_L120
openxs@ao756:~/dbs/maria10.8$ objdump -tT /home/openxs/dbs/maria10.8/bin/mariadbd | grep do_command
00000000007b7570 g     F .text  00000000000007e9              _Z10do_commandP3THDb
00000000007b7570 g    DF .text  00000000000007e9  Base        _Z10do_commandP3THDb

Note that based on the above the entry address of the do_command() function in the binary is 0x7b7570, a bit smaller than 0x7b761f. The difference is 0x7b761f - 0x7b7570 = 0xaf or 175 in decimal. 

Can we come up to the address for the probe  without perf? Yes, we can do it by checking the assembly code of the function, for example, in gdb:

openxs@ao756:~/dbs/maria10.8$ ps aux | grep 'mariadbd ' | grep -v grep
openxs    200650  0.0  2.4 1341032 94976 ?       Sl   січ30   0:05 /home/openxs/dbs/maria10.8/bin/mariadbd --no-defaults --basedir=/home/openxs/dbs/maria10.8 --datadir=/home/openxs/dbs/maria10.8/data --plugin-dir=/home/openxs/dbs/maria10.8/lib/plugin --log-error=/home/openxs/dbs/maria10.8/data/ao756.err --pid-file=ao756.pid --socket=/tmp/mariadb.sock --port=3309
openxs@ao756:~/dbs/maria10.8$ sudo gdb -p `pidof mariadbd`
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 200650
[New LWP 200652]
[New LWP 200653]
[New LWP 200654]
[New LWP 200655]
[New LWP 200662]
[New LWP 200663]
[New LWP 204349]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
--Type <RET> for more, q to quit, c to continue without paging--
0x00007f7b8424caff in __GI___poll (fds=0x557960733d38, nfds=3,
    timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29      ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
(gdb) disass /m do_command
Dump of assembler code for function do_command(THD*, bool):
1142    static bool wsrep_command_no_result(char command)

1143    {
1144      return (command == COM_STMT_PREPARE          ||
   0x000055795e8679d0 <+1120>:  cmp    $0x16,%al
   0x000055795e8679d2 <+1122>:  je     0x55795e8679dc <do_command(THD*, bool)+1132>
   0x000055795e8679d4 <+1124>:  cmp    $0x1c,%al
   0x000055795e8679d6 <+1126>:  jne    0x55795e867b50 <do_command(THD*, bool)+1504>
   0x000055795e8679dc <+1132>:  mov    $0x1,%esi

1145              command == COM_STMT_FETCH            ||
1146              command == COM_STMT_SEND_LONG_DATA   ||
1147              command == COM_STMT_CLOSE);
1148    }
1149    #endif /* WITH_WSREP */
1150    #ifndef EMBEDDED_LIBRARY
1151    static enum enum_server_command fetch_command(THD *thd, char *packet)

1152    {
1153      enum enum_server_command
--Type <RET> for more, q to quit, c to continue without paging--
   0x000055795e867628 <+184>:   movzbl %al,%r15d

1154        command= (enum enum_server_command) (uchar) packet[0];
1155      DBUG_ENTER("fetch_command");

1156
1157      if (command >= COM_END ||
   0x000055795e86762c <+188>:   lea    -0x20(%r15),%edx
   0x000055795e867630 <+192>:   cmp    $0xd9,%edx
   0x000055795e867636 <+198>:   jbe    0x55795e8676f0 <do_command(THD*, bool)+384>
   0x000055795e86763c <+204>:   cmp    $0xff,%al
   0x000055795e86763e <+206>:   je     0x55795e8676f0 <do_command(THD*, bool)+384>

1158          (command >= COM_MDB_GAP_BEG && command <= COM_MDB_GAP_END))
1159        command= COM_END;                           // Wrong command

1160
1161      DBUG_PRINT("info",("Command on %s = %d (%s)",

1162                         vio_description(thd->net.vio), command,
1163                         command_name[command].str));
--Type <RET> for more, q to quit, c to continue without paging--
1164      DBUG_RETURN(command);

1165    }
1166
1167    /**
1168      Read one command from connection and execute it (query or simple command).
1169      This function is to be used by different schedulers (one-thread-per-connection,
1170      pool-of-threads)
1171
1172      For profiling to work, it must never be called recursively.
1173
1174      @param thd - client connection context
1175
1176      @param blocking - wait for command to finish.
1177                        if false (nonblocking), then the function might
1178                        return when command is "half-finished", with
1179                        DISPATCH_COMMAND_WOULDBLOCK.
1180                        Currenly, this can *only* happen when using
1181                        threadpool. The command will resume, after all outstanding
1182                        async operations (i.e group commit) finish.
--Type <RET> for more, q to quit, c to continue without paging--
1183                        Threadpool scheduler takes care of "resume".
1184
1185      @retval
1186        DISPATCH_COMMAND_SUCCESS - success
1187      @retval
1188        DISPATCH_COMMAND_CLOSE_CONNECTION  request of THD shutdown
1189              (s. dispatch_command() description)
1190      @retval
1191        DISPATCH_COMMAND_WOULDBLOCK - need to wait for asyncronous operations
1192                                      to finish. Only returned if parameter
1193                                      'blocking' is false.
1194    */
1195
1196    dispatch_command_return do_command(THD *thd, bool blocking)
1197    {
   0x000055795e867570 <+0>:     endbr64

1198      dispatch_command_return return_value;

1199      char *packet= 0;

1200      ulong packet_length;
--Type <RET> for more, q to quit, c to continue without paging--

1201      NET *net= &thd->net;

1202      enum enum_server_command command;

1203      DBUG_ENTER("do_command");

1204
1205    #ifdef WITH_WSREP
1206      DBUG_ASSERT(!thd->async_state.pending_ops() ||

1207                  (WSREP(thd) &&
1208                   thd->wsrep_trx().state() == wsrep::transaction::s_aborted));
1209    #else
1210      DBUG_ASSERT(!thd->async_state.pending_ops());
1211    #endif
1212
1213      if (thd->async_state.m_state == thd_async_state::enum_async_state::RESUMED)
   0x000055795e867574 <+4>:     push   %rbp
   0x000055795e867575 <+5>:     mov    %rsp,%rbp
   0x000055795e867578 <+8>:     push   %r15
--Type <RET> for more, q to quit, c to continue without paging--
   0x000055795e86757a <+10>:    push   %r14
   0x000055795e86757c <+12>:    push   %r13
   0x000055795e86757e <+14>:    mov    %esi,%r13d
   0x000055795e867581 <+17>:    push   %r12
   0x000055795e867583 <+19>:    push   %rbx
   0x000055795e867584 <+20>:    mov    %rdi,%rbx
   0x000055795e867587 <+23>:    sub    $0x78,%rsp
   0x000055795e86758b <+27>:    cmpl   $0x2,0x6270(%rdi)
   0x000055795e867592 <+34>:    je     0x55795e867770 <do_command(THD*, bool)+512>
   0x000055795e867598 <+40>:    mov    0x58(%rdi),%rax
   0x000055795e86759c <+44>:    cmpb   $0x0,0x279e(%rdi)
   0x000055795e8675a3 <+51>:    lea    0x290(%rdi),%r14

1214      {
1215        /*
1216         Resuming previously suspended command.
1217         Restore the state
1218        */
1219        command = thd->async_state.m_command;
   0x000055795e867770 <+512>:   mov    0x6278(%rdi),%r10
   0x000055795e867777 <+519>:   mov    0x6280(%rdi),%r12
   0x000055795e86777e <+526>:   movzbl %sil,%r8d
--Type <RET> for more, q to quit, c to continue without paging--
   0x000055795e867782 <+530>:   mov    %rdi,%rsi
   0x000055795e867785 <+533>:   mov    0x6274(%rdi),%r15d

1220        packet = thd->async_state.m_packet.str;

1221        packet_length = (ulong)thd->async_state.m_packet.length;

1222        goto resume;

1223      }
1224
1225      /*
1226        indicator of uninitialized lex => normal flow of errors handling
1227        (see my_message_sql)
1228      */
1229      thd->lex->current_select= 0;
   0x000055795e8675aa <+58>:    movq   $0x0,0xd38(%rax)

1230
1231      /*
1232        This thread will do a blocking read from the client which
1233        will be interrupted when the next command is received from
1234        the client, the connection is closed or "net_wait_timeout"
--Type <RET> for more, q to quit, c to continue without paging--
1235        number of seconds has passed.
1236      */
1237      if (!thd->skip_wait_timeout)
   0x000055795e8675b5 <+69>:    je     0x55795e867710 <do_command(THD*, bool)+416>

1238        my_net_set_read_timeout(net, thd->get_net_wait_timeout());
1239
1240      /* Errors and diagnostics are cleared once here before query */
1241      thd->clear_error(1);
1242
1243      net_new_transaction(net);
   0x000055795e8675dd <+109>:   mov    0x13f0(%rbx),%rax
   0x000055795e8675e4 <+116>:   mov    $0x1,%esi
   0x000055795e8675e9 <+121>:   mov    %r14,%rdi
   0x000055795e8675ec <+124>:   movl   $0x0,0x2f0(%rbx)

1244
1245      /* Save for user statistics */
1246      thd->start_bytes_received= thd->status_var.bytes_received;
   0x000055795e8675f6 <+134>:   mov    %rax,0x4040(%rbx)

1247
--Type <RET> for more, q to quit, c to continue without paging--
1248      /*
1249        Synchronization point for testing of KILL_CONNECTION.
1250        This sync point can wait here, to simulate slow code execution
1251        between the last test of thd->killed and blocking in read().
1252
1253        The goal of this test is to verify that a connection does not
1254        hang, if it is killed at this point of execution.
1255        (Bug#37780 - main.kill fails randomly)
1256
1257        Note that the sync point wait itself will be terminated by a
1258        kill. In this case it consumes a condition broadcast, but does
1259        not change anything else. The consumed broadcast should not
1260        matter here, because the read/recv() below doesn't use it.
1261      */
1262      DEBUG_SYNC(thd, "before_do_command_net_read");

1263
1264      packet_length= my_net_read_packet(net, 1);
   0x000055795e8675fd <+141>:   callq  0x55795ebda8a0 <my_net_read_packet(NET*, my_bool)>
   0x000055795e867602 <+146>:   mov    %rax,%r12

1265
--Type <RET> for more, q to quit, c to continue without paging--
1266      if (unlikely(packet_length == packet_error))
   0x000055795e867605 <+149>:   cmp    $0xffffffffffffffff,%rax
   0x000055795e867609 <+153>:   je     0x55795e8678d0 <do_command(THD*, bool)+864>

1267      {
1268        DBUG_PRINT("info",("Got error %d reading command from socket %s",

1269                           net->error,
1270                           vio_description(net->vio)));
1271
1272        /* Instrument this broken statement as "statement/com/error" */
1273        thd->m_statement_psi= MYSQL_REFINE_STATEMENT(thd->m_statement_psi,
   0x000055795e8678d0 <+864>:   mov    0x3b78(%rbx),%rdi

1274                                                     com_statement_info[COM_END].
1275                                                     m_key);
1276
1277
1278        /* Check if we can continue without closing the connection */
1279
1280        /* The error must be set. */
--Type <RET> for more, q to quit, c to continue without paging--
1281        DBUG_ASSERT(thd->is_error());

1282        thd->protocol->end_statement();
   0x000055795e8678ee <+894>:   mov    0x558(%rbx),%rdi
   0x000055795e8678f5 <+901>:   callq  0x55795e79f7e0 <Protocol::end_statement()>

1283
1284        /* Mark the statement completed. */
1285        MYSQL_END_STATEMENT(thd->m_statement_psi, thd->get_stmt_da());
1286        thd->m_statement_psi= NULL;
   0x000055795e86791b <+939>:   cmpb   $0x3,0x324(%rbx)
   0x000055795e867922 <+946>:   mov    $0x1,%r13d
   0x000055795e867928 <+952>:   movq   $0x0,0x3b78(%rbx)

1287        thd->m_digest= NULL;
   0x000055795e867933 <+963>:   movq   $0x0,0x3b30(%rbx)

1288
1289        if (net->error != 3)
   0x000055795e86793e <+974>:   jne    0x55795e86794a <do_command(THD*, bool)+986>

--Type <RET> for more, q to quit, c to continue without paging--
1290        {
1291          return_value= DISPATCH_COMMAND_CLOSE_CONNECTION;     // We have to close it.
1292          goto out;
1293        }
1294
1295        net->error= 0;
   0x000055795e867940 <+976>:   movb   $0x0,0x324(%rbx)

1296        return_value= DISPATCH_COMMAND_SUCCESS;

1297        goto out;
   0x000055795e867947 <+983>:   xor    %r13d,%r13d

1298      }
1299
1300      packet= (char*) net->read_pos;
   0x000055795e86760f <+159>:   mov    0x2b0(%rbx),%r10

1301      /*
1302        'packet_length' contains length of data, as it was stored in packet
1303        header. In case of malformed header, my_net_read returns zero.
1304        If packet_length is not zero, my_net_read ensures that the returned
--Type <RET> for more, q to quit, c to continue without paging--
1305        number of bytes was actually read from network.
1306        There is also an extra safety measure in my_net_read:
1307        it sets packet[packet_length]= 0, but only for non-zero packets.
1308      */
1309      if (packet_length == 0)                       /* safety */
   0x000055795e867616 <+166>:   test   %rax,%rax
   0x000055795e867619 <+169>:   je     0x55795e8676e0 <do_command(THD*, bool)+368>

1310      {
1311        /* Initialize with COM_SLEEP packet */
1312        packet[0]= (uchar) COM_SLEEP;
   0x000055795e8676e0 <+368>:   movb   $0x0,(%r10)

1313        packet_length= 1;
   0x000055795e8676e4 <+372>:   mov    $0x1,%r12d
   0x000055795e8676ea <+378>:   jmpq   0x55795e86761f <do_command(THD*, bool)+175>
   0x000055795e8676ef <+383>:   nop

1314      }
1315      /* Do not rely on my_net_read, extra safety against programming errors. */
--Type <RET> for more, q to quit, c to continue without paging--
1316      packet[packet_length]= '\0';                  /* safety */
   0x000055795e86761f <+175>:   movb   $0x0,(%r10,%r12,1)

1317
1318
1319      command= fetch_command(thd, packet);
   0x000055795e867624 <+180>:   movzbl (%r10),%eax

1320
1321    #ifdef WITH_WSREP
1322      DEBUG_SYNC(thd, "wsrep_before_before_command");

1323      /*
1324        If this command does not return a result, then we
1325        instruct wsrep_before_command() to skip result handling.
1326        This causes BF aborted transaction to roll back but keep
1327        the error state until next command which is able to return
1328        a result to the client.
1329      */
1330      if (unlikely(wsrep_service_started) &&
   0x000055795e867644 <+212>:   lea    0x1862105(%rip),%rcx        # 0x5579600c9750 <wsrep_service_started>
   0x000055795e86764b <+219>:   cmpb   $0x0,(%rcx)
--Type <RET> for more, q to quit, c to continue without paging--q
Quit
(gdb) quit
A debugging session is active.

        Inferior 1 [process 200650] will be detached.

Quit anyway? (y or n) y
Detaching from program: /home/openxs/dbs/maria10.8/bin/mariadbd, process 200650
[Inferior 1 (process 200650) detached]

At this stage we can stop debugging and quit from gdb. We see that line 1316 in the code (were packet is properly formed and ready for printing) matches (decimal) offset 175 from the beginning of the function. Now we can add that to the start of the function from objdump to find the address for the probe, 0x7b761f.

The only remaining detail is how to print the content of the register r10 as a zero terminated string. Just read fine bpftrace manual for this, and you can come up with the following:

openxs@ao756:~/dbs/maria10.8$ cd ~/git/bpftrace/
openxs@ao756:~/git/bpftrace$ cd build/
openxs@ao756:~/git/bpftrace/build$ sudo src/bpftrace -e 'uprobe:/home/openxs/dbs/maria10.8/bin/mariadbd:0x00000000007b761f { printf("Function: %s packet: %s\n", func, str(reg("r10"))); }'
[sudo] password for openxs:
Attaching 1 probe...
Function: do_command(THD*, bool) packet: select @@version_comment limit 1
Function: do_command(THD*, bool) packet: select 1 + 1
^C

This is what I've got from the probe while in another terminal I've connected and executed the following:

openxs@ao756:~/dbs/maria10.8$ bin/mysql
ERROR 2002 (HY000): Can't connect to local server through socket '/tmp/mysql.sock' (2)
openxs@ao756:~/dbs/maria10.8$ bin/mysql --socket=/tmp/mariadb.sock
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 4
Server version: 10.8.1-MariaDB Source distribution

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> select 1 + 1;
+-------+
| 1 + 1 |
+-------+
|     2 |
+-------+
1 row in set (0.000 sec)

So, as long as you know the address of instruction inside the function code, you can attach a probe to it. You can safely experiment, as if the address is surely wrong:

openxs@ao756:~/git/bpftrace/build$ sudo src/bpftrace -e 'uprobe:/home/openxs/dbs/maria10.8/bin/mariadbd:0x00000000007b761e { printf("Function: %s packet: %s\n", func, str(reg("r10"))); }'
Attaching 1 probe...
ERROR: Could not add uprobe into middle of instruction: /home/openxs/dbs/maria10.8/bin/mariadbd:_Z10do_commandP3THDb+174
openxs@ao756:~/git/bpftrace/build$ sudo src/bpftrace -e 'uprobe:/home/openxs/dbs/maria10.8/bin/mariadbd:0x00000000008b761f { printf("Function: %s packet: %s\n", func, str(reg("r10"))); }'
Attaching 1 probe...
ERROR: Could not add uprobe into middle of instruction: /home/openxs/dbs/maria10.8/bin/mariadbd:_ZN14partition_info25set_up_default_partitionsEP3THDP7handlerP14HA_CREATE_INFOj+799

the tool will tell you and do no harm.


Nice conbination of old Tower and new buildings in the City of London. Combining bpftrace with good old gdb is also nice.

To summarize:

  1.  It is possible to add a user probe to every assembly insturction inside the function with bpftrace (or perf, or ftrace, as long as you know what you are doing).
  2. It is possible to access CPU registers in the bpftrace probes.
  3. One can surely access, print etc local varaibles inside functions, as long as the address or register where it is stored is determined.
  4. It is surely easier and faster to do all the above with some knowledge of gdb and assembly language for your CPU.
  5. It is still fun to try to solve more complex problems with bpftrace, so more posts are expected. Stay tuned!


Thursday, January 27, 2022

First steps with MariaDB Server and DTrace on macOS

FOSDEM 2022 is going to happen on the next weekend and I am still missing blog posts supporting my unusual upcoming talk there, this time devoted to building and using MariaDB server on macOS. So tonight I am going to contribute something new to refer to during my talk or followup questions. I'll try to document my way to build MariaDB Server from GitHub source on old (early 2015) MacBook Air running macOS 10.13.6 High Sierra. I am also going to show why one may want to use macOS there instead of installing Linux and having way more usual environment on the same decent hardware.

I've inherited this Air from my daughter in September, 2021 after she had got a new, M1-based, and initially used it mostly for Zoom calls and web-browsing. Soon I recalled that I've used MacBook for more than 3 years in the past while working for Sun and Oracle, and it was my main working platform not only for content consumption, emails and virtual machines. It was regularly used for bugs verification, MySQL builds and tests of all kinds. It would be a waste of capable OS and good hardware (still more powerful formally than all my other machines but Fedora 33 desktop) NOT to try to use it properly.

That's why after few software upgrades to end up with more recent 10.13 minor release:

Yuliyas-Air:maria10.6 Valerii$ uname -a
Darwin Yuliyas-Air 17.7.0 Darwin Kernel Version 17.7.0: Mon Aug 31 22:11:23 PDT 2020; root:xnu-4570.71.82.6~1/RELEASE_X86_64 x86_64

I read the manual, registered as a developer, downloaded and installed proper version of XCode and decided to proceed with MacPorts.

Then I updated ports tree and proceed by the manual, with installing git and other surely needed tools and packages:

sudo port install git cmake jemalloc judy openssl boost gnutls

That was just the beginning, and eventually, with dependencies, port updates, problems and workarounds, I ended up like this (and that works for building current code of 10.1 to 10.8 for sure, with some ports maybe not needed or used for other builds, but who cares):

Yuliyas-Air:maria10.8 Valerii$ port installed
The following ports are currently installed:
  autoconf @2.71_1 (active)
  automake @1.16.5_0 (active)
  bison @3.8.2_0
  bison @3.8.2_2 (active)
  bison-runtime @3.8.2_0 (active)
  boehmgc @8.0.6_0 (active)
  boost @1.76_0 (active)
  boost171 @1.71.0_3+no_single+no_static+python39 (active)
  boost176 @1.76.0_2+no_single+no_static+python39
  boost176 @1.76.0_3+no_single+no_static+python39 (active)
  bzip2 @1.0.8_0 (active)
  cmake @3.21.4_0
  cmake @3.22.1_0 (active)
  curl @7.80.0_0+ssl (active)
  curl-ca-bundle @7.80.0_0
  curl-ca-bundle @7.80.0_1 (active)
  cyrus-sasl2 @2.1.27_5+kerberos (active)
  db48 @4.8.30_4 (active)
  expat @2.4.1_0
  expat @2.4.2_0
  expat @2.4.3_0 (active)
  gdb @11.1_0 (active)
  gdbm @1.22_0 (active)
  gettext @0.21_0 (active)
  gettext-runtime @0.21_0 (active)
  gettext-tools-libs @0.21_0 (active)
  git @2.34.1_1+credential_osxkeychain+diff_highlight+doc+pcre+perl5_28
  git @2.34.1_2+credential_osxkeychain+diff_highlight+doc+pcre+perl5_28 (active)
  gmp @6.2.1_0 (active)
  gnutls @3.6.16_1
  gnutls @3.6.16_2 (active)
  icu @67.1_4 (active)
  jemalloc @5.2.1_1 (active)
  judy @1.0.5_1 (active)
  kerberos5 @1.19.2_1 (active)
  libarchive @3.5.2_1 (active)
  libb2 @0.98.1_1 (active)
  libcomerr @1.45.6_0 (active)
  libcxx @5.0.1_4 (active)
  libedit @20210910-3.1_1 (active)
  libevent @2.1.12_1 (active)
  libffi @3.4.2_2 (active)
  libiconv @1.16_1 (active)
  libidn @1.38_0 (active)
  libidn2 @2.3.2_0
  libidn2 @2.3.2_1 (active)
  libpsl @0.21.1-20210726_2 (active)
  libtasn1 @4.18.0_0 (active)
  libtextstyle @0.21_0 (active)
  libtool @2.4.6_13 (active)
  libunistring @0.9.10_0
  libunistring @1.0_0 (active)
  libuv @1.42.0_1
  libuv @1.43.0_0 (active)
  libxml2 @2.9.12_1 (active)
  libxslt @1.1.34_6 (active)
  lmdb @0.9.29_0 (active)
  luajit @2.1.0-beta3_5 (active)
  lz4 @1.9.3_1 (active)
  lzma @4.65_1 (active)
  lzo2 @2.10_0 (active)
  m4 @1.4.19_1 (active)
  mariadb @5.5.68_0 (active)
  mariadb-server @5.5.68_0 (active)
  mysql57 @5.7.36_1 (active)
  mysql_select @0.1.2_4 (active)
  ncurses @6.3_0 (active)
  nettle @3.7.3_0 (active)
  openssl @3_1
  openssl @3_2 (active)
  openssl3 @3.0.0_6+legacy
  openssl3 @3.0.1_0+legacy (active)
  openssl11 @1.1.1l_5 (active)
  p5-dbd-mysql @4.50.0_0 (active)
  p5.28-authen-sasl @2.160.0_0 (active)
  p5.28-cgi @4.530.0_0 (active)
  p5.28-clone @0.450.0_0 (active)
  p5.28-dbd-mysql @4.50.0_0+mysql57 (active)
  p5.28-dbi @1.643.0_0 (active)
  p5.28-digest-hmac @1.40.0_0 (active)
  p5.28-digest-sha1 @2.130.0_4 (active)
  p5.28-encode @3.160.0_0 (active)
  p5.28-encode-locale @1.50.0_0 (active)
  p5.28-error @0.170.290_0 (active)
  p5.28-gssapi @0.280.0_3 (active)
  p5.28-html-parser @3.760.0_0 (active)
  p5.28-html-tagset @3.200.0_4 (active)
  p5.28-http-date @6.50.0_0 (active)
  p5.28-http-message @6.350.0_0 (active)
  p5.28-io-html @1.4.0_0 (active)
  p5.28-io-socket-ssl @2.72.0_0
  p5.28-io-socket-ssl @2.73.0_0 (active)
  p5.28-lwp-mediatypes @6.40.0_0 (active)
  p5.28-mozilla-ca @20211001_0 (active)
  p5.28-net-libidn @0.120.0_5 (active)
  p5.28-net-smtp-ssl @1.40.0_0 (active)
  p5.28-net-ssleay @1.900.0_4 (active)
  p5.28-term-readkey @2.380.0_0 (active)
  p5.28-time-local @1.300.0_0 (active)
  p5.28-timedate @2.330.0_0 (active)
  p5.28-uri @5.100.0_0 (active)
  p11-kit @0.24.0_1 (active)
  pcre2 @10.39_0 (active)
  perl5.28 @5.28.3_4 (active)
  perl5.30 @5.30.3_3 (active)
  pkgconfig @0.29.2_0 (active)
  popt @1.18_1 (active)
  python3_select @0.0_2 (active)
  python39 @3.9.9_0+lto+optimizations
  python39 @3.9.10_0+lto+optimizations (active)
  python_select @0.3_9 (active)
  readline @8.1.000_0 (active)
  readline-5 @5.2.014_2 (active)
  rsync @3.2.3_1 (active)
  sqlite3 @3.37.0_0
  sqlite3 @3.37.1_0
  sqlite3 @3.37.2_0 (active)
  sysbench @1.0.20_0 (active)
  tcp_wrappers @20_4 (active)
  texinfo @6.8_0 (active)
  umem @1.0.1_1 (active)
  xxhashlib @0.8.1_0
  xxhashlib @0.8.1_1 (active)
  xz @5.2.5_0 (active)
  zlib @1.2.11_0 (active)
  zstd @1.5.0_0
  zstd @1.5.1_0 (active)
Yuliyas-Air:maria10.8 Valerii$

This is surely not the minimal needed set of ports. I've highlighted a couple (like openssl11) that were really needed to build and install 10.8 successfully eventually.

Then I cloned the code with usual steps and ended up with this:

Yuliyas-Air:server Valerii$ git log -1
commit c1cef1afa9962544de4840c9a796ae0a9b5e92e6 (HEAD -> 10.8, origin/bb-10.8-wlad, origin/bb-10.8-release, origin/HEAD, origin/10.8)
Merge: db2013787d2 9d93b51effd
Author: Vladislav Vaintroub <wlad@mariadb.com>
Date:   Wed Jan 26 13:57:00 2022 +0100

    Merge remote-tracking branch 'origin/bb-10.8-wlad' into 10.8
Yuliyas-Air:server Valerii$ git submodule update --init --recursive
Submodule path 'libmariadb': checked out 'ddb031b6a1d8b6e26a0f10f454dc1453a48a6ca8'
Yuliyas-Air:server Valerii$ cd buildtmp/
Yuliyas-Air:buildtmp Valerii$ rm -rf *
Yuliyas-Air:buildtmp Valerii$ cmake .. -DCMAKE_INSTALL_PREFIX=/Users/Valerii/dbs/maria10.8 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DBUILD_CONFIG=mysql_release -DFEATURE_SET=community -DWITH_EMBEDDED_SERVER=OFF -DPLUGIN_TOKUDB=NO -DWITH_SSL=/opt/local/libexec/openssl11 -DENABLE_DTRACE=1
...
-- The following OPTIONAL packages have not been found:

 * PMEM
 * Snappy

-- Configuring done
CMake Warning (dev):
  Policy CMP0042 is not set: MACOSX_RPATH is enabled by default.  Run "cmake
  --help-policy CMP0042" for policy details.  Use the cmake_policy command to
  set the policy and suppress this warning.

  MACOSX_RPATH is not specified for the following targets:

   libmariadb

This warning is for project developers.  Use -Wno-dev to suppress it.

-- Generating done
-- Build files have been written to: /Users/Valerii/git/server/buildtmp
Yuliyas-Air:buildtmp Valerii$

Note that I've used somewhat nontrivial cmake command line for out of source build, and the output above was not from the clean state, but from the state after 10.1 to 10.7 where all checked out and built successfully, one by one, with problems found and resolved (more on that on my slides and during the talk).

Two key options in that command line are -DWITH_SSL=/opt/local/libexec/openssl11 to use supported OpenSSL version 1.1 (even though for 10.8 default version 3 should already work too) and -DENABLE_DTRACE=1 to enable static DTrace probes in the code of MariaDB that are still there. The rest are typical for my builds and blog posts.

Then I proceeded with usual make to end up with the first problem:

Yuliyas-Air:buildtmp Valerii$ make
...
[ 46%] Built target heap
[ 46%] Building CXX object storage/innobase/CMakeFiles/innobase.dir/btr/btr0btr.cc.o
In file included from /Users/Valerii/git/server/storage/innobase/btr/btr0btr.cc:28:
In file included from /Users/Valerii/git/server/storage/innobase/include/btr0btr.h:31:
In file included from /Users/Valerii/git/server/storage/innobase/include/dict0dict.h:32:
In file included from /Users/Valerii/git/server/storage/innobase/include/dict0mem.h:45:
In file included from /Users/Valerii/git/server/storage/innobase/include/buf0buf.h:33:
/Users/Valerii/git/server/storage/innobase/include/fil0fil.h:1497:11: error:
      'asm goto' constructs are not supported yet
  __asm__ goto("lock btsl $31, %0\t\njnc %l1" : : "m" (n_pending)
          ^
1 error generated.
make[2]: *** [storage/innobase/CMakeFiles/innobase.dir/btr/btr0btr.cc.o] Error 1
make[1]: *** [storage/innobase/CMakeFiles/innobase.dir/all] Error 2
make: *** [all] Error 2

caused by the fact that clang 10.0.0 from XCode: 

Yuliyas-Air:maria10.8 Valerii$ clang --version
Apple LLVM version 10.0.0 (clang-1000.10.44.4)
Target: x86_64-apple-darwin17.7.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

does NOT support asm goto used in MariaDB Server code since 10.6 (while it was supposed to support it). I've applied a lame fix (see MDEV-27402 for more details and final diff later) and proceeded:

Yuliyas-Air:buildtmp Valerii$ make
...

[ 97%] Building CXX object client/CMakeFiles/mariadb.dir/mysql.cc.o
/Users/Valerii/git/server/client/mysql.cc:2853:59: error: expected expression
  rl_attempted_completion_function= (rl_completion_func_t*)&new_mysql_co...
                                                          ^
/Users/Valerii/git/server/client/mysql.cc:2853:38: error: use of undeclared
      identifier 'rl_completion_func_t'; did you mean 'rl_completion_matches'?
  rl_attempted_completion_function= (rl_completion_func_t*)&new_mysql_co...
                                     ^~~~~~~~~~~~~~~~~~~~
                                     rl_completion_matches
/usr/include/editline/readline.h:202:16: note: 'rl_completion_matches' declared
      here
char           **rl_completion_matches(const char *, rl_compentry_func_t *);
                 ^
/Users/Valerii/git/server/client/mysql.cc:2854:33: error: assigning to
      'Function *' (aka 'int (*)(const char *, int)') from incompatible type
      'rl_compentry_func_t *' (aka 'char *(*)(const char *, int)'): different
      return type ('int' vs 'char *')
  rl_completion_entry_function= (rl_compentry_func_t*)&no_completion;
                                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/Valerii/git/server/client/mysql.cc:2856:3: error: no matching function
      for call to 'rl_add_defun'
  rl_add_defun("magic-space", (rl_command_func_t *)&fake_magic_space, -1);
  ^~~~~~~~~~~~
/usr/include/editline/readline.h:195:7: note: candidate function not viable: no
      known conversion from 'rl_command_func_t *' (aka 'int (*)(int, int)') to
      'Function *' (aka 'int (*)(const char *, int)') for 2nd argument
int              rl_add_defun(const char *, Function *, int);
                 ^
4 errors generated.
make[2]: *** [client/CMakeFiles/mariadb.dir/mysql.cc.o] Error 1
make[1]: *** [client/CMakeFiles/mariadb.dir/all] Error 2
make: *** [all] Error 2
Yuliyas-Air:buildtmp Valerii$

This readline-related problem was also reported (see MDEV-27579) and I fixed it with a lame patch, and eventually I was able to build successfully:

...

[100%] Linking C executable wsrep_check_version
[100%] Built target wsrep_check_version
Yuliyas-Air:buildtmp Valerii$ make install && make clean
...
-- Installing: /Users/Valerii/dbs/maria10.8/share/aclocal/mysql.m4
-- Installing: /Users/Valerii/dbs/maria10.8/support-files/mysql.server
Yuliyas-Air:buildtmp Valerii$ echo $?
0

The final diff is like this:

Yuliyas-Air:server Valerii$ git diff -u
diff --git a/client/mysql.cc b/client/mysql.cc
index 6612b273d17..902589f2e83 100644
--- a/client/mysql.cc
+++ b/client/mysql.cc
@@ -2849,7 +2849,7 @@ static void initialize_readline ()
   rl_terminal_name= getenv("TERM");

   /* Tell the completer that we want a crack first. */
-#if defined(USE_NEW_READLINE_INTERFACE)
+#if defined(USE_NEW_READLINE_INTERFACE) && !defined(__APPLE_CC__)
   rl_attempted_completion_function= (rl_completion_func_t*)&new_mysql_completion;
   rl_completion_entry_function= (rl_compentry_func_t*)&no_completion;

@@ -2859,7 +2859,9 @@ static void initialize_readline ()
   setlocale(LC_ALL,""); /* so as libedit use isprint */
 #endif
   rl_attempted_completion_function= (CPPFunction*)&new_mysql_completion;
+#if !defined(__APPLE_CC__)
   rl_completion_entry_function= &no_completion;
+#endif
   rl_add_defun("magic-space", (Function*)&fake_magic_space, -1);
 #else
   rl_attempted_completion_function= (CPPFunction*)&new_mysql_completion;
diff --git a/storage/innobase/include/fil0fil.h b/storage/innobase/include/fil0fil.h
index 34a53746b42..a795313116f 100644
--- a/storage/innobase/include/fil0fil.h
+++ b/storage/innobase/include/fil0fil.h
@@ -1489,7 +1489,7 @@ inline void fil_space_t::reacquire()
 inline bool fil_space_t::set_stopping_check()
 {
   mysql_mutex_assert_owner(&fil_system.mutex);
-#if defined __clang_major__ && __clang_major__ < 10
+#if (defined __clang_major__ && __clang_major__ < 10) || defined __APPLE_CC__
   /* Only clang-10 introduced support for asm goto */
   return n_pending.fetch_or(STOPPING, std::memory_order_relaxed) & STOPPING;
 #elif defined __GNUC__ && (defined __i386__ || defined __x86_64__)
diff --git a/storage/rocksdb/rocksdb b/storage/rocksdb/rocksdb
--- a/storage/rocksdb/rocksdb
+++ b/storage/rocksdb/rocksdb
@@ -1 +1 @@
-Subproject commit bba5e7bc21093d7cfa765e1280a7c4fdcd284288
+Subproject commit bba5e7bc21093d7cfa765e1280a7c4fdcd284288-dirty
Yuliyas-Air:server Valerii$

The last diff in rocksdb submodule is related to the "missing zstd headers" problem I fined with yet another lame patch while building 10.7, see MDEV-27619

After usual mysql_install_db and startup, I've ended up with a shiny new MariaDB 10.8.0 up and running on macOS 10.13.6:

Yuliyas-Air:maria10.8 Valerii$ bin/mysql
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 5
Server version: 10.8.0-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> show variables like 'version%';
+-------------------------+------------------------------------------+
| Variable_name           | Value                                    |
+-------------------------+------------------------------------------+
| version                 | 10.8.0-MariaDB                           |
| version_comment         | MariaDB Server                           |
| version_compile_machine | x86_64                                   |
| version_compile_os      | osx10.13                                 |
| version_malloc_library  | system                                   |
| version_source_revision | c1cef1afa9962544de4840c9a796ae0a9b5e92e6 |
| version_ssl_library     | OpenSSL 1.1.1l  24 Aug 2021              |
+-------------------------+------------------------------------------+
7 rows in set (0.001 sec)

I even checked few MTR test suites, and many tests pass:

Yuliyas-Air:mysql-test Valerii$ ./mtr --suite=rocksdb
Logging: ./mtr  --suite=rocksdb
VS config:
vardir: /Users/Valerii/dbs/maria10.8/mysql-test/var
Checking leftover processes...
 - found old pid 89161 in 'mysqld.1.pid', killing it...
   process did not exist!
Removing old var directory...
Creating var directory '/Users/Valerii/dbs/maria10.8/mysql-test/var'...
Checking supported features...
MariaDB Version 10.8.0-MariaDB
 - SSL connections supported
Using suites: rocksdb
Collecting tests...
Installing system database...
...
rocksdb.index_merge_rocksdb 'write_committed' [ pass ]    911
rocksdb.shutdown 'write_prepared'        [ pass ]   1895
rocksdb.index_merge_rocksdb 'write_prepared' [ pass ]    923
rocksdb.mariadb_misc_binlog 'write_committed' [ pass ]     43
rocksdb.mariadb_misc_binlog 'write_prepared' [ pass ]     43
...
rocksdb.issue495 'write_committed'       [ pass ]     33
rocksdb.partition 'write_committed'      [ fail ]
        Test ended at 2022-01-27 22:41:15

CURRENT_TEST: rocksdb.partition
mysqltest: At line 67: query 'ALTER TABLE t1 REBUILD PARTITION p0, p1' failed: ER_METADATA_INCONSISTENCY (4064): Table 'test.t1#P#p0#TMP#' does not exist, but metadata information exists inside MyRocks. This is a sign of data inconsistency. Please check if './test/t1#P#p0#TMP#.frm' exists, and try to restore it if it does not exist.

The result from queries just before the failure was:
< snip >
DROP TABLE IF EXISTS employees_hash_1;
DROP TABLE IF EXISTS t1_hash;
DROP TABLE IF EXISTS employees_linear_hash;
DROP TABLE IF EXISTS t1_linear_hash;
DROP TABLE IF EXISTS k1;
DROP TABLE IF EXISTS k2;
DROP TABLE IF EXISTS tm1;
DROP TABLE IF EXISTS tk;
DROP TABLE IF EXISTS ts;
DROP TABLE IF EXISTS ts_1;
DROP TABLE IF EXISTS ts_3;
DROP TABLE IF EXISTS ts_4;
DROP TABLE IF EXISTS ts_5;
DROP TABLE IF EXISTS trb3;
DROP TABLE IF EXISTS tr;
DROP TABLE IF EXISTS members_3;
DROP TABLE IF EXISTS clients;
DROP TABLE IF EXISTS clients_lk;
DROP TABLE IF EXISTS trb1;
CREATE TABLE t1 (i INT, j INT, k INT, PRIMARY KEY (i)) ENGINE = ROCKSDB PARTITION BY KEY(i) PARTITIONS 4;

More results from queries before failure can be found in /Users/Valerii/dbs/maria10.8/mysql-test/var/log/partition.log

 - saving '/Users/Valerii/dbs/maria10.8/mysql-test/var/log/rocksdb.partition-write_committed/' to '/Users/Valerii/dbs/maria10.8/mysql-test/var/log/rocksdb.partition-write_committed/'

Only  148  of 555 completed.
--------------------------------------------------------------------------
The servers were restarted 23 times
Spent 154.726 of 206 seconds executing testcases

Failure: Failed 1/27 tests, 96.30% were successful.

Failing test(s): rocksdb.partition

The log files in var/log may give you some hint of what went wrong.

If you want to report this error, please read first the documentation
at http://dev.mysql.com/doc/mysql/en/mysql-test-suite.html

69 tests were skipped, 2 by the test itself.

mysql-test-run: *** ERROR: there were failing test cases

But the real reason to build MariaDB server on macOS (other than "because I can") was not even RocksDB testing, but this:

Yuliyas-Air:mysql-test Valerii$ dtrace
Usage: dtrace [-aACeFHlqSvVwZ] [-arch i386|x86_64] [-b bufsz] [-c cmd] [-D name[=def]]
        [-I path] [-L path] [-o output] [-p pid] [-s script] [-U name]
        [-x opt[=val]]

        [-P provider [[ predicate ] action ]]
        [-m [ provider: ] module [[ predicate ] action ]]
        [-f [[ provider: ] module: ] func [[ predicate ] action ]]
        [-n [[[ provider: ] module: ] func: ] name [[ predicate ] action ]]
        [-i probe-id [[ predicate ] action ]] [ args ... ]

        predicate -> '/' D-expression '/'
           action -> '{' D-statements '}'

        -arch Generate programs and Mach-O files for the specified architecture

        -a  claim anonymous tracing state
        -A  generate plist(5) entries for anonymous tracing
        -b  set trace buffer size
        -c  run specified command and exit upon its completion
        -C  run cpp(1) preprocessor on script files
        -D  define symbol when invoking preprocessor
        -e  exit after compiling request but prior to enabling probes
        -f  enable or list probes matching the specified function name
        -F  coalesce trace output by function
        -h  generate a header file with definitions for static probes
        -H  print included files when invoking preprocessor
        -i  enable or list probes matching the specified probe id
        -I  add include directory to preprocessor search path
        -l  list probes matching specified criteria
        -L  add library directory to library search path
        -m  enable or list probes matching the specified module name
        -n  enable or list probes matching the specified probe name
        -o  set output file
        -p  grab specified process-ID and cache its symbol tables
        -P  enable or list probes matching the specified provider name
        -q  set quiet mode (only output explicitly traced data)
        -s  enable or list probes according to the specified D script
        -S  print D compiler intermediate code
        -U  undefine symbol when invoking preprocessor
        -v  set verbose mode (report stability attributes, arguments)
        -V  report DTrace API version
        -w  permit destructive actions
        -W  wait for specified process and exit upon its completion
        -x  enable or modify compiler and tracing options
        -Z  permit probe descriptions that match zero probes
Yuliyas-Air:mysql-test Valerii$

I remember how cool dtrace was since workin for Sun and after last re-install of Fedora I ended up with my Illumos VM gone and no recent FreeBSD VM anyway,. so basically there was no dtrace at hand until I've got this MacBook.

So, unlike MySQL 8.0, MariaDB server still contains USDT (DTrace probes) and I've built my version with them enabled. Let's quickly check how they can be used. There are few sample D files in the source:

Yuliyas-Air:server Valerii$ ls support-files/dtrace/
locktime.d                      query-rowops.d
query-execandqc.d               query-time.d
query-filesort-time.d           statement-time.d
query-network-time.d            statement-type-aggregate.d
query-parse-time.d
Yuliyas-Air:server Valerii$ cat support-files/dtrace/query-time.d
#!/usr/sbin/dtrace -s
#
# Copyright (c) 2009 Sun Microsystems, Inc.
# Use is subject to license terms.
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; version 2 of the License.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1335 USA
#
# Shows basic query execution time, who execute the query, and on what database

#pragma D option quiet

dtrace:::BEGIN
{
   printf("%-20s %-20s %-40s %-9s\n", "Who", "Database", "Query", "Time(ms)");
}

mysql*:::query-start
{
   self->query = copyinstr(arg0);
   self->connid = arg1;
   self->db    = copyinstr(arg2);
   self->who   = strjoin(copyinstr(arg3),strjoin("@",copyinstr(arg4)));
   self->querystart = timestamp;
}

mysql*:::query-done
{
   printf("%-20s %-20s %-40s %-9d\n",self->who,self->db,self->query,
          (timestamp - self->querystart) / 1000000);
}
Yuliyas-Air:server Valerii$

We can try to run it (with yet another hack to do that I leave to the curious reader who is really going to try this) and then execute some queries against MariaDB from another terminal:

Yuliyas-Air:server Valerii$ sudo support-files/dtrace/query-time.d
dtrace: system integrity protection is on, some features will not be available

Who                  Database             Query                                    Time(ms)
Valerii@localhost                         select @@version_comment limit 1         0
Valerii@localhost                         select 1+1                               0
Valerii@localhost                         select sleep(4)                          4004
Valerii@localhost                         select @@version_comment limit 1         0
Valerii@localhost                         create database sbtest                   1
Valerii@localhost    sbtest               CREATE TABLE sbtest1(
  id INTEGER NOT NULL AUTO_INCREMENT,
  k INTEGER DEFAULT '0' NOT NULL,
  c CHAR(120) DEFAULT '' NOT NULL,
  pad CHAR(60) DEFAULT '' NOT NULL,
  PRIMARY KEY (id)
) /*! ENGINE = innodb */ 91
Valerii@localhost    sbtest               INSERT INTO sbtest1(k, c, pad) VALUES(366941, '31451373586-15688153734-79729593694-96509299839-83724898275-86711833539-78981337422-35049690573-51724173961-87474696253', '98996621624-36689827414-04092488557-09587706818-65008859162'),(277750, '21472970079-7 181
Valerii@localhost    sbtest               INSERT INTO sbtest1(k, c, pad) VALUES(124021, '80697810288-90543941719-80227288793-55278810422-59841440561-49369413842-83550451066-12907725305-62036548401-86959403176', '65708342793-83311865079-53224065384-18645733125-16333693298'),(660496, '07381386584-5 31
Valerii@localhost    sbtest               INSERT INTO sbtest1(k, c, pad) VALUES(425080, '48883413333-17783399741-03981526516-97596354402-27141206678-83563692683-30244461835-25263435890-49140039573-28211133426', '81560227417-96691828090-72817141653-15106797886-43970285630'),(322421, '68618246702-7 33
...

to end up with a lightweight general query log with query execution time repported in milliseconds. Note some statement from, yes, sysbench test preapre :)

Nice view. macOS is also nice and useful platform to run MariaDB server for fun :)

Thta's probably enough for a simgle post. Now go and get that from your MySQL 8 on any OS! USDTs rule!

To summarize:

  1. It is surely possible to build even latest ang greates still alpha MariaDB 10.8 on macOS even as old as 10.13.6, and get usable result.
  2. One of the main benefits of macOS is DTrace support.
  3. It is still possible to use a limited set of USDTs once added to MySQL and now gone in all current versions of MariaDB server.
  4. That rocksdb.partition test failure is to be studied an maybe reported as a MDEV
  5. DTrace is surely more capable than wehat those examples in the source code show. More DTrace-related posts are coming this year. Stay tuned!

Saturday, January 30, 2021

Playing with recent bpftrace and MariaDB 10.5 on Fedora - Part V, Proper Way To Summarize Time Spent per Stack

In my previous post in this series I was tracing __lll_lock_wait() function and measuring time spent in it per unique MariaDB server's stack trace. I've got the results that had not looked like totally wrong or useless and even presented them as a flame graph, but it took a long time and during all thyat time, for minutes, sysbench test TPS and QPS results were notably smaller. So, it was a useful tracing, but surely not a low impact one.

I already stated before that the approach used is surely not optimal. There should be some better way to summarize the stack traces without such a performance impact. Today I continued to work on this problem. First of all, I tried to measure finally how much time exactly I had to wait for the results from bpftrace command:

[openxs@fc31 ~]$ time sudo ./lll_lock_wait.bt 2>/dev/null >/tmp/lll_lock_perf_stacks1.txt

real    9m51.904s
user    8m25.576s
sys     1m8.057s
[openxs@fc31 ~]$ ls -l /tmp/lll_lock_perf_stacks1.txt
-rw-rw-r--. 1 openxs openxs 1564291 Jan 30 11:58 /tmp/lll_lock_perf_stacks1.txt

So, test was running for 300 seconds (5 minutes) and monitoring results (for about 20 seconds) came 5 minutes later! This is definitely NOT acceptable for any regular use.

I start to check things and test more. I noted in the Reference Guide (that I've read entirely a couple of times yesterday and today) that ustack() function of bpftrace can accept the second argument, number of frames to show in stack. Based on the previous flame graphs most interesting stacks were not long, so I decided to check if 8 or just 5 may be still enough for useful results while reducing the performance impact. I changed the code here:

uprobe:/lib64/libpthread.so.0:__lll_lock_wait
/comm == "mariadbd"/
{
    @start[tid] = nsecs;
    @tidstack[tid] = ustack(perf,5); -- <-- I also tried 8
}

and got these run times for 8 and 5 frames:

[openxs@fc31 ~]$ time sudo ./lll_lock_wait.bt 2>/dev/null >/tmp/lll_lock_perf_stacks1.txt

real    9m39.195s
user    8m9.188s
sys     1m8.430s

[openxs@fc31 ~]$ ls -l /tmp/lll_lock_perf_stacks1.txt
-rw-rw-r--. 1 openxs openxs 1085019 Jan 30 12:12 /tmp/lll_lock_perf_stacks1.txt

[openxs@fc31 ~]$ time sudo ./lll_lock_wait.bt 2>/dev/null >/tmp/lll_lock_perf_stacks2.txt

real    6m6.801s
user    5m5.537s
sys     0m42.226s

[openxs@fc31 ~]$ ls -l /tmp/lll_lock_perf_stacks2.txt
-rw-rw-r--. 1 openxs openxs 728072 Jan 30 12:19 /tmp/lll_lock_perf_stacks2.txt

So, reducing stack to 5 frames made notable impact, and QPS drop was a bit smaller, but still waiting for 5 minutes is not an option in general.

I had not found really powerful functions to replace awk and collapse stacks inside the bpftrace program, so decided to export them "raw", on every hit of uretprobe, and then process by Linux tools as unusual. I'll skip steps, tests and failures and just show the resulting new program for the task:

[openxs@fc31 ~]$ cat lll_lock_wait2.bt
#!/usr/bin/env bpftrace

BEGIN
{
/*    printf("Tracing time from __lll_lock_wait, Ctrl-C to stop\n"); */
}

interval:s:$1 { exit(); }

uprobe:/lib64/libpthread.so.0:__lll_lock_wait
/comm == "mariadbd"/
{
    @start[tid] = nsecs;
    @tidstack[tid] = ustack(perf);    
}

uretprobe:/lib64/libpthread.so.0:__lll_lock_wait
/comm == "mariadbd" && @start[tid] != 0/
{
    $now = nsecs;
    $time = $now - @start[tid];
    @futexstack[@tidstack[tid]] += $time;
    print(@futexstack);
    delete(@futexstack[@tidstack[tid]]);
/*
    printf("Thread: %u, time: %d\n", tid, $time);
*/
    delete(@start[tid]);
    delete(@tidstack[tid]);
}

END
{
    clear(@start);
    clear(@tidstack);
    clear(@futexstack);
}

Changes vs the previous version are highlighted. Basically I print and then delete every stack collected, as function call ends, along with time, to further sum them up externally. The function used is the following:

print(@x[, top [, div]]) - Print the map, optionally the top entries only and with a divisor

I've also added the interval probe at the beginning, referring to the first program argument as seconds to run. It just forces exit after N seconds (0 by default means immediate exist, wrong format will be reported as error).

Now if I run it while test is running, this way:

[openxs@fc31 ~]$ time sudo ./lll_lock_wait2.bt 10 2>/dev/null | awk '
BEGIN { s = ""; }
/^@futexstack\[\]/ { s = ""; }
/^@futexstack/ { s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s ";" targ } else { s = targ } }
/^]/ { print $2, s }
' > /tmp/collapsed_lll_lock_v2_raw.txt


real    0m10.646s
user    0m5.648s
sys     0m0.843s
[openxs@fc31 ~]$ ls -l /tmp/collapsed_lll_lock_v2_raw.txt
-rw-rw-r--. 1 openxs openxs 1566 Jan 30 13:12 /tmp/collapsed_lll_lock_v2_raw.txt
[openxs@fc31 ~]$ cat /tmp/collapsed_lll_lock_v2_raw.txt
15531 __lll_lock_wait;
1233833 __lll_lock_wait;
10638 __lll_lock_wait;buf_read_page_background;btr_cur_prefetch_siblings;btr_cur_optimistic_delete_func;row_purge_remove_sec_if_poss_leaf;row_purge_record_func;row_purge_step;que_run_threads;purge_worker_callback;tpool::task_group::execute;tpool::thread_pool_generic::worker_main;;;;
7170 __lll_lock_wait;tpool::thread_pool_generic::worker_main;;;;
273330 __lll_lock_wait;tdc_acquire_share;open_table;open_tables;open_and_lock_tables;execute_sqlcom_select;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
1193083 __lll_lock_wait;trx_undo_report_row_operation;btr_cur_update_in_place;btr_cur_optimistic_update;row_upd_clust_rec;row_upd_clust_step;row_upd_step;row_update_for_mysql;ha_innobase::update_row;handler::ha_update_row;mysql_update;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
183353 __lll_lock_wait;
43231 __lll_lock_wait;MDL_lock::remove_ticket;MDL_context::release_lock;ha_commit_trans;trans_commit;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread

I get a very condensed output almost immediately!  I can afford running for 60 seconds without much troubles:

[openxs@fc31 ~]$ time sudo ./lll_lock_wait2.bt 60 2>/dev/null | awk '
BEGIN { s = ""; }
/^@futexstack\[\]/ { s = ""; }
/^@futexstack/ { s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s ";" targ } else { s = targ } }
/^]/ { print $2, s }
' > /tmp/collapsed_lll_lock_v2_raw.txt


real    1m0.959s
user    0m44.146s
sys     0m6.126s


[openxs@fc31 ~]$ ls -l /tmp/collapsed_lll_lock_v2_raw.txt
-rw-rw-r--. 1 openxs openxs 12128 Jan 30 13:17 /tmp/collapsed_lll_lock_v2_raw.txt

The impact of these 60 seconds is visible in the sysbench output:

[openxs@fc31 maria10.5]$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --threads=32 --report-interval=10 --time=300 run
sysbench 1.1.0-174f3aa (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 32
Report intermediate results every 10 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 10s ] thds: 32 tps: 625.68 qps: 12567.66 (r/w/o: 8800.49/2512.81/1254.36) lat (ms,95%): 282.25 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 32 tps: 891.51 qps: 17825.69 (r/w/o: 12479.31/3563.26/1783.13) lat (ms,95%): 142.39 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 32 tps: 512.09 qps: 10230.44 (r/w/o: 7163.62/2042.55/1024.27) lat (ms,95%): 287.38 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 32 tps: 369.10 qps: 7390.48 (r/w/o: 5171.39/1480.90/738.20) lat (ms,95%): 350.33 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 32 tps: 417.83 qps: 8347.73 (r/w/o: 5845.14/1667.03/835.56) lat (ms,95%): 390.30 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 32 tps: 484.03 qps: 9687.55 (r/w/o: 6782.09/1937.31/968.16) lat (ms,95%): 292.60 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 32 tps: 431.98 qps: 8640.35 (r/w/o: 6049.38/1727.01/863.95) lat (ms,95%): 344.08 err/s: 0.00 reconn/s: 0.00
[ 80s ] thds: 32 tps: 354.60 qps: 7097.39 (r/w/o: 4968.86/1419.32/709.21) lat (ms,95%): 419.45 err/s: 0.00 reconn/s: 0.00
[ 90s ] thds: 32 tps: 380.98 qps: 7600.82 (r/w/o: 5317.17/1521.80/761.85) lat (ms,95%): 520.62 err/s: 0.00 reconn/s: 0.00
[ 100s ] thds: 32 tps: 423.01 qps: 8467.99 (r/w/o: 5928.71/1693.16/846.13) lat (ms,95%): 397.39 err/s: 0.00 reconn/s: 0.00
[ 110s ] thds: 32 tps: 475.66 qps: 9525.07 (r/w/o: 6669.22/1904.53/951.32) lat (ms,95%): 344.08 err/s: 0.00 reconn/s: 0.00
[ 120s ] thds: 32 tps: 409.13 qps: 8171.48 (r/w/o: 5718.38/1634.94/818.17) lat (ms,95%): 458.96 err/s: 0.00 reconn/s: 0.00
[ 130s ] thds: 32 tps: 190.91 qps: 3826.72 (r/w/o: 2682.19/762.62/381.91) lat (ms,95%): 450.77 err/s: 0.00 reconn/s: 0.00
[ 140s ] thds: 32 tps: 611.40 qps: 12229.99 (r/w/o: 8558.99/2448.20/1222.80) lat (ms,95%): 223.34 err/s: 0.00 reconn/s: 0.00
[ 150s ] thds: 32 tps: 581.99 qps: 11639.19 (r/w/o: 8148.53/2326.68/1163.99) lat (ms,95%): 287.38 err/s: 0.00 reconn/s: 0.00
[ 160s ] thds: 32 tps: 653.21 qps: 13058.51 (r/w/o: 9139.57/2612.52/1306.41) lat (ms,95%): 257.95 err/s: 0.00 reconn/s: 0.00
[ 170s ] thds: 32 tps: 561.87 qps: 11231.27 (r/w/o: 7860.53/2246.99/1123.75) lat (ms,95%): 320.17 err/s: 0.00 reconn/s: 0.00
[ 180s ] thds: 32 tps: 625.66 qps: 12526.32 (r/w/o: 8770.19/2504.82/1251.31) lat (ms,95%): 235.74 err/s: 0.00 reconn/s: 0.00
[ 190s ] thds: 32 tps: 554.50 qps: 11088.01 (r/w/o: 7760.61/2218.40/1109.00) lat (ms,95%): 325.98 err/s: 0.00 reconn/s: 0.00
[ 200s ] thds: 32 tps: 607.10 qps: 12143.89 (r/w/o: 8501.49/2428.20/1214.20) lat (ms,95%): 344.08 err/s: 0.00 reconn/s: 0.00
[ 210s ] thds: 32 tps: 424.17 qps: 8467.63 (r/w/o: 5925.43/1693.87/848.33) lat (ms,95%): 397.39 err/s: 0.00 reconn/s: 0.00
[ 220s ] thds: 32 tps: 466.80 qps: 9335.03 (r/w/o: 6533.32/1868.31/933.40) lat (ms,95%): 397.39 err/s: 0.00 reconn/s: 0.00
[ 230s ] thds: 32 tps: 365.83 qps: 7330.56 (r/w/o: 5133.49/1465.21/731.86) lat (ms,95%): 475.79 err/s: 0.00 reconn/s: 0.00
[ 240s ] thds: 32 tps: 411.27 qps: 8218.45 (r/w/o: 5754.25/1641.87/822.34) lat (ms,95%): 467.30 err/s: 0.00 reconn/s: 0.00
[ 250s ] thds: 32 tps: 127.10 qps: 2534.71 (r/w/o: 1772.60/507.70/254.40) lat (ms,95%): 450.77 err/s: 0.00 reconn/s: 0.00
[ 260s ] thds: 32 tps: 642.35 qps: 12856.29 (r/w/o: 8999.09/2572.50/1284.70) lat (ms,95%): 282.25 err/s: 0.00 reconn/s: 0.00
[ 270s ] thds: 32 tps: 603.80 qps: 12078.79 (r/w/o: 8456.89/2414.30/1207.60) lat (ms,95%): 314.45 err/s: 0.00 reconn/s: 0.00
[ 280s ] thds: 32 tps: 642.70 qps: 12857.60 (r/w/o: 9001.40/2570.80/1285.40) lat (ms,95%): 257.95 err/s: 0.00 reconn/s: 0.00
[ 290s ] thds: 32 tps: 716.57 qps: 14325.60 (r/w/o: 10026.98/2865.48/1433.14) lat (ms,95%): 144.97 err/s: 0.00 reconn/s: 0.00
[ 300s ] thds: 32 tps: 611.16 qps: 12219.69 (r/w/o: 8551.80/2445.66/1222.23) lat (ms,95%): 292.60 err/s: 0.00 reconn/s: 0.00
SQL statistics:
    queries performed:
        read:                            2124836
        write:                           607096
        other:                           303548
        total:                           3035480
    transactions:                        151774 (505.83 per sec.)
    queries:                             3035480 (10116.62 per sec.)

    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

Throughput:
    events/s (eps):                      505.8309
    time elapsed:                        300.0489s
    total number of events:              151774

Latency (ms):
         min:                                    2.17
         avg:                                   63.25
         max:                                 8775.26
         95th percentile:                      331.91
         sum:                              9599021.34

Threads fairness:
    events (avg/stddev):           4742.9375/42.76
    execution time (avg/stddev):   299.9694/0.03

but it continues only for some time after the end of monitoring and overal QPS is not much affected.

Now what about top 5 stacks by wait time? Here they are (well, I had NOT cared to sum up several entires with the same stack, but you should know how to do this with awk and flamegraph.pl will do it for us later) :

[openxs@fc31 ~]$ cat /tmp/collapsed_lll_lock_v2_raw.txt | sort -r -n -k 1,1 | head -5
340360731 __lll_lock_wait;btr_cur_optimistic_insert;row_ins_sec_index_entry_low;row_ins_sec_index_entry;row_ins_step;row_insert_for_mysql;ha_innobase::write_row;handler::ha_write_row;write_record;mysql_insert;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
335903154 __lll_lock_wait;btr_cur_optimistic_insert;row_ins_sec_index_entry_low;row_ins_sec_index_entry;row_upd_sec_index_entry;row_upd_step;row_update_for_mysql;ha_innobase::update_row;handler::ha_update_row;mysql_update;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
287819974 __lll_lock_wait;btr_cur_optimistic_insert;row_ins_sec_index_entry_low;row_ins_sec_index_entry;row_ins_step;row_insert_for_mysql;ha_innobase::write_row;handler::ha_write_row;write_record;mysql_insert;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
281374237 __lll_lock_wait;btr_cur_optimistic_insert;row_ins_sec_index_entry_low;row_ins_sec_index_entry;row_ins_step;row_insert_for_mysql;ha_innobase::write_row;handler::ha_write_row;write_record;mysql_insert;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread
268817930 __lll_lock_wait;buf_page_init_for_read;buf_read_page_background;btr_cur_prefetch_siblings;btr_cur_optimistic_delete_func;row_purge_remove_sec_if_poss_leaf;row_purge_record_func;row_purge_step;que_run_threads;purge_worker_callback;tpool::task_group::execute;tpool::thread_pool_generic::worker_main;;;;

A lot of contention happens on INSERTing rows. I'd say it's expected! Now to create the flame graph we just have to re-order the output a bit:

[openxs@fc31 ~]$ cat /tmp/collapsed_lll_lock_v2_raw.txt | awk '{ if (length($2) > 0) {print $2, $1} }' | /mnt/home/openxs/git/FlameGraph/flamegraph.pl --title="Time spent in ___lll_lock_wait in MariaDB 10.5, all frames" --countname=nsecs > ~/Documents/lll_lock_v2_2.svg

Here it is:

Most mutex waits for oltp_read_write test happen on inserting rows...

Monitoring over 60 seconds allowed to see more realistic picture of contention. It mostly happens on inserting rows by 32 threads into just 5 tables on a system with 4 cores only.

Now it seems we have the tool that can be used more than once ina lifetime and seems to provide a useful information, fast.

* * *

To summarize:

  1. One should not even try to summarize all stacks in a single associative arrays for any function that is called to often in many different contests inside the bftrace! There is a limit on number of items in the ma you may hit, and the impact of storing and exporting this is too high for monitoring more than a couple of seconds.
  2. Cleaning up stack traces should be done externally, by usual Linux test processing tools thta produce smaller summary output. The bpftrace program should NOT be designed to collect a lot of outputs for a long time and then output them at the end. Looks like smaller shunks exported to the userspace regularly is a better approach.
  3. We can use simple command line arguments for the program that can be literally substituted as parts of probe definitions. Next step would be to make the more generic tool with binary and function name to probe as command line arguments.
  4. bpftrace is cool!