Showing posts with label perf. Show all posts
Showing posts with label perf. Show all posts

Friday, February 4, 2022

How to Add Probe Inside a Function and Access Local Variables with bpftrace

During several my talks on bpftrace I mentioned that it is possible to add probe to "every other" line of the code, same as with perf, but had never demonstrated how to do this. Looks like it's time to show both this and access to local variables declared inside the function we trace, as I am going to speak about bpftrace again at FOSDEM (and have to share something new and cool).

Consider the following debugging session with latest MariaDB server 10.8 built from GitHub source (as usual, for my own work I use only my own builds from current source) on Ubuntu 20.04:

openxs@ao756:~/dbs/maria10.8$ sudo gdb -p `pidof mariadbd`
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 200650
[New LWP 200652]
[New LWP 200653]
[New LWP 200654]
[New LWP 200655]
[New LWP 200656]
[New LWP 200659]
[New LWP 200662]
[New LWP 200663]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
--Type <RET> for more, q to quit, c to continue without paging--
0x00007f7b8424caff in __GI___poll (fds=0x557960733d38, nfds=3,
    timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29      ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
(gdb) b do_command
Breakpoint 1 at 0x55795e867570: file /home/openxs/git/server/sql/sql_parse.cc, line 1197.
(gdb) c
Continuing.

Basically I've attached to running mariadbd process and set a breakpoint on do_command(), and then let it continue working. In another terminal I've connected to server and got the breakpoint hit immediately:

[New Thread 0x7f7b5dbf0700 (LWP 200680)]
[New Thread 0x7f7b6c084700 (LWP 200797)]
[Switching to Thread 0x7f7b6c084700 (LWP 200797)]

Thread 11 "mariadbd" hit Breakpoint 1, do_command (thd=0x7f7b4c001738,
    blocking=blocking@entry=true)
    at /home/openxs/git/server/sql/sql_parse.cc:1197
1197    {
(gdb) bt
#0  do_command (thd=0x7f7b4c001738, blocking=blocking@entry=true)
    at /home/openxs/git/server/sql/sql_parse.cc:1197
#1  0x000055795e97df07 in do_handle_one_connection (connect=<optimized out>,
    put_in_cache=true) at /home/openxs/git/server/sql/sql_connect.cc:1418
#2  0x000055795e97e23d in handle_one_connection (arg=arg@entry=0x55796079f798)
    at /home/openxs/git/server/sql/sql_connect.cc:1312
#3  0x000055795ecc7e8d in pfs_spawn_thread (arg=0x557960733dd8)
    at /home/openxs/git/server/storage/perfschema/pfs.cc:2201
#4  0x00007f7b84686609 in start_thread (arg=<optimized out>)
    at pthread_create.c:477
#5  0x00007f7b84259293 in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

From this state in gdb we can trace execution step by step:

(gdb) n
[New Thread 0x7f7b5ebf2700 (LWP 200800)]
1213      if (thd->async_state.m_state == thd_async_state::enum_async_state::RESUMED)
(gdb) n
1229      thd->lex->current_select= 0;
(gdb) n
1237      if (!thd->skip_wait_timeout)
(gdb) n
1238        my_net_set_read_timeout(net, thd->get_net_wait_timeout());
(gdb) n
1241      thd->clear_error(1);
(gdb) n
1243      net_new_transaction(net);
(gdb) n
1246      thd->start_bytes_received= thd->status_var.bytes_received;
(gdb) n
1264      packet_length= my_net_read_packet(net, 1);
(gdb) n
1266      if (unlikely(packet_length == packet_error))
(gdb) n
1300      packet= (char*) net->read_pos;
(gdb) n
1309      if (packet_length == 0)                       /* safety */
(gdb) n
1316      packet[packet_length]= '\0';                  /* safety */
(gdb) n
1319      command= fetch_command(thd, packet);
(gdb) p packet
$1 = 0x7f7b4c008d08 "\003select @@version_comment limit 1"
(gdb) q
A debugging session is active.

        Inferior 1 [process 200650] will be detached.

Quit anyway? (y or n) y
Detaching from program: /home/openxs/dbs/maria10.8/bin/mariadbd, process 200650
[Inferior 1 (process 200650) detached]

In that session above I moved to the next step until I've got to the place in the code where local variable packet got a value, zero-terminated string. I know from the previous experience that it's value is the SQL statement to be executed (in most cases). So, I printed it to find out that mysql command line client starts with finding the server version comment to output.

Question is: can we do the same with bpftrace, attache the probe to the instruction at/around line 1316 of the sql/sql_parse.cc and print the value of the local variable named packet as a string? 

It should be possible, as we can do this with perf or even raw ftrace probes. This is how we can find all "tracable" lines in the code of do_command() function with perf:

openxs@ao756:~/dbs/maria10.8$ sudo perf probe -x /home/openxs/dbs/maria10.8/bin/mysqld --line do_command
<do_command@/home/openxs/git/server/sql/sql_parse.cc:0>
      0  dispatch_command_return do_command(THD *thd, bool blocking)
         {
      2    dispatch_command_return return_value;
           char *packet= 0;
           ulong packet_length;
           NET *net= &thd->net;
           enum enum_server_command command;
           DBUG_ENTER("do_command");

         #ifdef WITH_WSREP
           DBUG_ASSERT(!thd->async_state.pending_ops() ||
                       (WSREP(thd) &&
                        thd->wsrep_trx().state() == wsrep::transaction::s_abort>
         #else
           DBUG_ASSERT(!thd->async_state.pending_ops());
         #endif

           if (thd->async_state.m_state == thd_async_state::enum_async_state::R>
           {
             /*
              Resuming previously suspended command.
              Restore the state
             */
     23      command = thd->async_state.m_command;
     24      packet = thd->async_state.m_packet.str;
             packet_length = (ulong)thd->async_state.m_packet.length;
             goto resume;
           }

...
    104    packet= (char*) net->read_pos;
           /*
             'packet_length' contains length of data, as it was stored in packet
             header. In case of malformed header, my_net_read returns zero.
             If packet_length is not zero, my_net_read ensures that the returned
             number of bytes was actually read from network.
             There is also an extra safety measure in my_net_read:
             it sets packet[packet_length]= 0, but only for non-zero packets.
           */
    113    if (packet_length == 0)                       /* safety */
           {
             /* Initialize with COM_SLEEP packet */
    116      packet[0]= (uchar) COM_SLEEP;
    117      packet_length= 1;
           }
           /* Do not rely on my_net_read, extra safety against programming erro>
    120    packet[packet_length]= '\0';                  /* safety */


    123    command= fetch_command(thd, packet);
...
    221    DBUG_ASSERT(thd->m_digest == NULL);
           DBUG_ASSERT(thd->m_statement_psi == NULL);
         #ifdef WITH_WSREP
           if (packet_length != packet_error)
           {
             /* there was a command to process, and before_command() has been c>
             if (unlikely(wsrep_service_started))
    228        wsrep_after_command_after_result(thd);
           }
         #endif /* WITH_WSREP */
           DBUG_RETURN(return_value);

openxs@ao756:~/dbs/maria10.8$

So, basically we need a probe on line 120 to match the place where I printed the value in gdb. Let's create it with perf:

openxs@ao756:~/dbs/maria10.8$ sudo perf probe -x /home/openxs/dbs/maria10.8/bin/mariadbd 'do_command:120 packet:string'
Added new event:
  probe_mariadbd:do_command_L120 (on do_command:120 in /home/openxs/dbs/maria10.8/bin/mariadbd with packet:string)

You can now use it in all perf tools, such as:

        perf record -e probe_mariadbd:do_command_L120 -aR sleep 1

Now, we can check what user probe was really created by that command line:

openxs@ao756:~/dbs/maria10.8$ sudo cat /sys/kernel/tracing/uprobe_events
p:probe_mariadbd/do_command_L120 /home/openxs/dbs/maria10.8/bin/mariadbd:0x00000000007b761f packet_string=+0(%r10):string

So, the probe is attached to some address, 0x7b761f, in the mariadb binary. Looks like we refer to the register, r10, while trying to access local variable, packet.

Now let's delete this probe (we'll add it back with bpftrace soon) and try to make sense of that address:

openxs@ao756:~/dbs/maria10.8$ sudo perf probe --del 'probe_mariadbd:do_command*'
Removed event: probe_mariadbd:do_command_L120
openxs@ao756:~/dbs/maria10.8$ objdump -tT /home/openxs/dbs/maria10.8/bin/mariadbd | grep do_command
00000000007b7570 g     F .text  00000000000007e9              _Z10do_commandP3THDb
00000000007b7570 g    DF .text  00000000000007e9  Base        _Z10do_commandP3THDb

Note that based on the above the entry address of the do_command() function in the binary is 0x7b7570, a bit smaller than 0x7b761f. The difference is 0x7b761f - 0x7b7570 = 0xaf or 175 in decimal. 

Can we come up to the address for the probe  without perf? Yes, we can do it by checking the assembly code of the function, for example, in gdb:

openxs@ao756:~/dbs/maria10.8$ ps aux | grep 'mariadbd ' | grep -v grep
openxs    200650  0.0  2.4 1341032 94976 ?       Sl   січ30   0:05 /home/openxs/dbs/maria10.8/bin/mariadbd --no-defaults --basedir=/home/openxs/dbs/maria10.8 --datadir=/home/openxs/dbs/maria10.8/data --plugin-dir=/home/openxs/dbs/maria10.8/lib/plugin --log-error=/home/openxs/dbs/maria10.8/data/ao756.err --pid-file=ao756.pid --socket=/tmp/mariadb.sock --port=3309
openxs@ao756:~/dbs/maria10.8$ sudo gdb -p `pidof mariadbd`
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 200650
[New LWP 200652]
[New LWP 200653]
[New LWP 200654]
[New LWP 200655]
[New LWP 200662]
[New LWP 200663]
[New LWP 204349]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
--Type <RET> for more, q to quit, c to continue without paging--
0x00007f7b8424caff in __GI___poll (fds=0x557960733d38, nfds=3,
    timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29      ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
(gdb) disass /m do_command
Dump of assembler code for function do_command(THD*, bool):
1142    static bool wsrep_command_no_result(char command)

1143    {
1144      return (command == COM_STMT_PREPARE          ||
   0x000055795e8679d0 <+1120>:  cmp    $0x16,%al
   0x000055795e8679d2 <+1122>:  je     0x55795e8679dc <do_command(THD*, bool)+1132>
   0x000055795e8679d4 <+1124>:  cmp    $0x1c,%al
   0x000055795e8679d6 <+1126>:  jne    0x55795e867b50 <do_command(THD*, bool)+1504>
   0x000055795e8679dc <+1132>:  mov    $0x1,%esi

1145              command == COM_STMT_FETCH            ||
1146              command == COM_STMT_SEND_LONG_DATA   ||
1147              command == COM_STMT_CLOSE);
1148    }
1149    #endif /* WITH_WSREP */
1150    #ifndef EMBEDDED_LIBRARY
1151    static enum enum_server_command fetch_command(THD *thd, char *packet)

1152    {
1153      enum enum_server_command
--Type <RET> for more, q to quit, c to continue without paging--
   0x000055795e867628 <+184>:   movzbl %al,%r15d

1154        command= (enum enum_server_command) (uchar) packet[0];
1155      DBUG_ENTER("fetch_command");

1156
1157      if (command >= COM_END ||
   0x000055795e86762c <+188>:   lea    -0x20(%r15),%edx
   0x000055795e867630 <+192>:   cmp    $0xd9,%edx
   0x000055795e867636 <+198>:   jbe    0x55795e8676f0 <do_command(THD*, bool)+384>
   0x000055795e86763c <+204>:   cmp    $0xff,%al
   0x000055795e86763e <+206>:   je     0x55795e8676f0 <do_command(THD*, bool)+384>

1158          (command >= COM_MDB_GAP_BEG && command <= COM_MDB_GAP_END))
1159        command= COM_END;                           // Wrong command

1160
1161      DBUG_PRINT("info",("Command on %s = %d (%s)",

1162                         vio_description(thd->net.vio), command,
1163                         command_name[command].str));
--Type <RET> for more, q to quit, c to continue without paging--
1164      DBUG_RETURN(command);

1165    }
1166
1167    /**
1168      Read one command from connection and execute it (query or simple command).
1169      This function is to be used by different schedulers (one-thread-per-connection,
1170      pool-of-threads)
1171
1172      For profiling to work, it must never be called recursively.
1173
1174      @param thd - client connection context
1175
1176      @param blocking - wait for command to finish.
1177                        if false (nonblocking), then the function might
1178                        return when command is "half-finished", with
1179                        DISPATCH_COMMAND_WOULDBLOCK.
1180                        Currenly, this can *only* happen when using
1181                        threadpool. The command will resume, after all outstanding
1182                        async operations (i.e group commit) finish.
--Type <RET> for more, q to quit, c to continue without paging--
1183                        Threadpool scheduler takes care of "resume".
1184
1185      @retval
1186        DISPATCH_COMMAND_SUCCESS - success
1187      @retval
1188        DISPATCH_COMMAND_CLOSE_CONNECTION  request of THD shutdown
1189              (s. dispatch_command() description)
1190      @retval
1191        DISPATCH_COMMAND_WOULDBLOCK - need to wait for asyncronous operations
1192                                      to finish. Only returned if parameter
1193                                      'blocking' is false.
1194    */
1195
1196    dispatch_command_return do_command(THD *thd, bool blocking)
1197    {
   0x000055795e867570 <+0>:     endbr64

1198      dispatch_command_return return_value;

1199      char *packet= 0;

1200      ulong packet_length;
--Type <RET> for more, q to quit, c to continue without paging--

1201      NET *net= &thd->net;

1202      enum enum_server_command command;

1203      DBUG_ENTER("do_command");

1204
1205    #ifdef WITH_WSREP
1206      DBUG_ASSERT(!thd->async_state.pending_ops() ||

1207                  (WSREP(thd) &&
1208                   thd->wsrep_trx().state() == wsrep::transaction::s_aborted));
1209    #else
1210      DBUG_ASSERT(!thd->async_state.pending_ops());
1211    #endif
1212
1213      if (thd->async_state.m_state == thd_async_state::enum_async_state::RESUMED)
   0x000055795e867574 <+4>:     push   %rbp
   0x000055795e867575 <+5>:     mov    %rsp,%rbp
   0x000055795e867578 <+8>:     push   %r15
--Type <RET> for more, q to quit, c to continue without paging--
   0x000055795e86757a <+10>:    push   %r14
   0x000055795e86757c <+12>:    push   %r13
   0x000055795e86757e <+14>:    mov    %esi,%r13d
   0x000055795e867581 <+17>:    push   %r12
   0x000055795e867583 <+19>:    push   %rbx
   0x000055795e867584 <+20>:    mov    %rdi,%rbx
   0x000055795e867587 <+23>:    sub    $0x78,%rsp
   0x000055795e86758b <+27>:    cmpl   $0x2,0x6270(%rdi)
   0x000055795e867592 <+34>:    je     0x55795e867770 <do_command(THD*, bool)+512>
   0x000055795e867598 <+40>:    mov    0x58(%rdi),%rax
   0x000055795e86759c <+44>:    cmpb   $0x0,0x279e(%rdi)
   0x000055795e8675a3 <+51>:    lea    0x290(%rdi),%r14

1214      {
1215        /*
1216         Resuming previously suspended command.
1217         Restore the state
1218        */
1219        command = thd->async_state.m_command;
   0x000055795e867770 <+512>:   mov    0x6278(%rdi),%r10
   0x000055795e867777 <+519>:   mov    0x6280(%rdi),%r12
   0x000055795e86777e <+526>:   movzbl %sil,%r8d
--Type <RET> for more, q to quit, c to continue without paging--
   0x000055795e867782 <+530>:   mov    %rdi,%rsi
   0x000055795e867785 <+533>:   mov    0x6274(%rdi),%r15d

1220        packet = thd->async_state.m_packet.str;

1221        packet_length = (ulong)thd->async_state.m_packet.length;

1222        goto resume;

1223      }
1224
1225      /*
1226        indicator of uninitialized lex => normal flow of errors handling
1227        (see my_message_sql)
1228      */
1229      thd->lex->current_select= 0;
   0x000055795e8675aa <+58>:    movq   $0x0,0xd38(%rax)

1230
1231      /*
1232        This thread will do a blocking read from the client which
1233        will be interrupted when the next command is received from
1234        the client, the connection is closed or "net_wait_timeout"
--Type <RET> for more, q to quit, c to continue without paging--
1235        number of seconds has passed.
1236      */
1237      if (!thd->skip_wait_timeout)
   0x000055795e8675b5 <+69>:    je     0x55795e867710 <do_command(THD*, bool)+416>

1238        my_net_set_read_timeout(net, thd->get_net_wait_timeout());
1239
1240      /* Errors and diagnostics are cleared once here before query */
1241      thd->clear_error(1);
1242
1243      net_new_transaction(net);
   0x000055795e8675dd <+109>:   mov    0x13f0(%rbx),%rax
   0x000055795e8675e4 <+116>:   mov    $0x1,%esi
   0x000055795e8675e9 <+121>:   mov    %r14,%rdi
   0x000055795e8675ec <+124>:   movl   $0x0,0x2f0(%rbx)

1244
1245      /* Save for user statistics */
1246      thd->start_bytes_received= thd->status_var.bytes_received;
   0x000055795e8675f6 <+134>:   mov    %rax,0x4040(%rbx)

1247
--Type <RET> for more, q to quit, c to continue without paging--
1248      /*
1249        Synchronization point for testing of KILL_CONNECTION.
1250        This sync point can wait here, to simulate slow code execution
1251        between the last test of thd->killed and blocking in read().
1252
1253        The goal of this test is to verify that a connection does not
1254        hang, if it is killed at this point of execution.
1255        (Bug#37780 - main.kill fails randomly)
1256
1257        Note that the sync point wait itself will be terminated by a
1258        kill. In this case it consumes a condition broadcast, but does
1259        not change anything else. The consumed broadcast should not
1260        matter here, because the read/recv() below doesn't use it.
1261      */
1262      DEBUG_SYNC(thd, "before_do_command_net_read");

1263
1264      packet_length= my_net_read_packet(net, 1);
   0x000055795e8675fd <+141>:   callq  0x55795ebda8a0 <my_net_read_packet(NET*, my_bool)>
   0x000055795e867602 <+146>:   mov    %rax,%r12

1265
--Type <RET> for more, q to quit, c to continue without paging--
1266      if (unlikely(packet_length == packet_error))
   0x000055795e867605 <+149>:   cmp    $0xffffffffffffffff,%rax
   0x000055795e867609 <+153>:   je     0x55795e8678d0 <do_command(THD*, bool)+864>

1267      {
1268        DBUG_PRINT("info",("Got error %d reading command from socket %s",

1269                           net->error,
1270                           vio_description(net->vio)));
1271
1272        /* Instrument this broken statement as "statement/com/error" */
1273        thd->m_statement_psi= MYSQL_REFINE_STATEMENT(thd->m_statement_psi,
   0x000055795e8678d0 <+864>:   mov    0x3b78(%rbx),%rdi

1274                                                     com_statement_info[COM_END].
1275                                                     m_key);
1276
1277
1278        /* Check if we can continue without closing the connection */
1279
1280        /* The error must be set. */
--Type <RET> for more, q to quit, c to continue without paging--
1281        DBUG_ASSERT(thd->is_error());

1282        thd->protocol->end_statement();
   0x000055795e8678ee <+894>:   mov    0x558(%rbx),%rdi
   0x000055795e8678f5 <+901>:   callq  0x55795e79f7e0 <Protocol::end_statement()>

1283
1284        /* Mark the statement completed. */
1285        MYSQL_END_STATEMENT(thd->m_statement_psi, thd->get_stmt_da());
1286        thd->m_statement_psi= NULL;
   0x000055795e86791b <+939>:   cmpb   $0x3,0x324(%rbx)
   0x000055795e867922 <+946>:   mov    $0x1,%r13d
   0x000055795e867928 <+952>:   movq   $0x0,0x3b78(%rbx)

1287        thd->m_digest= NULL;
   0x000055795e867933 <+963>:   movq   $0x0,0x3b30(%rbx)

1288
1289        if (net->error != 3)
   0x000055795e86793e <+974>:   jne    0x55795e86794a <do_command(THD*, bool)+986>

--Type <RET> for more, q to quit, c to continue without paging--
1290        {
1291          return_value= DISPATCH_COMMAND_CLOSE_CONNECTION;     // We have to close it.
1292          goto out;
1293        }
1294
1295        net->error= 0;
   0x000055795e867940 <+976>:   movb   $0x0,0x324(%rbx)

1296        return_value= DISPATCH_COMMAND_SUCCESS;

1297        goto out;
   0x000055795e867947 <+983>:   xor    %r13d,%r13d

1298      }
1299
1300      packet= (char*) net->read_pos;
   0x000055795e86760f <+159>:   mov    0x2b0(%rbx),%r10

1301      /*
1302        'packet_length' contains length of data, as it was stored in packet
1303        header. In case of malformed header, my_net_read returns zero.
1304        If packet_length is not zero, my_net_read ensures that the returned
--Type <RET> for more, q to quit, c to continue without paging--
1305        number of bytes was actually read from network.
1306        There is also an extra safety measure in my_net_read:
1307        it sets packet[packet_length]= 0, but only for non-zero packets.
1308      */
1309      if (packet_length == 0)                       /* safety */
   0x000055795e867616 <+166>:   test   %rax,%rax
   0x000055795e867619 <+169>:   je     0x55795e8676e0 <do_command(THD*, bool)+368>

1310      {
1311        /* Initialize with COM_SLEEP packet */
1312        packet[0]= (uchar) COM_SLEEP;
   0x000055795e8676e0 <+368>:   movb   $0x0,(%r10)

1313        packet_length= 1;
   0x000055795e8676e4 <+372>:   mov    $0x1,%r12d
   0x000055795e8676ea <+378>:   jmpq   0x55795e86761f <do_command(THD*, bool)+175>
   0x000055795e8676ef <+383>:   nop

1314      }
1315      /* Do not rely on my_net_read, extra safety against programming errors. */
--Type <RET> for more, q to quit, c to continue without paging--
1316      packet[packet_length]= '\0';                  /* safety */
   0x000055795e86761f <+175>:   movb   $0x0,(%r10,%r12,1)

1317
1318
1319      command= fetch_command(thd, packet);
   0x000055795e867624 <+180>:   movzbl (%r10),%eax

1320
1321    #ifdef WITH_WSREP
1322      DEBUG_SYNC(thd, "wsrep_before_before_command");

1323      /*
1324        If this command does not return a result, then we
1325        instruct wsrep_before_command() to skip result handling.
1326        This causes BF aborted transaction to roll back but keep
1327        the error state until next command which is able to return
1328        a result to the client.
1329      */
1330      if (unlikely(wsrep_service_started) &&
   0x000055795e867644 <+212>:   lea    0x1862105(%rip),%rcx        # 0x5579600c9750 <wsrep_service_started>
   0x000055795e86764b <+219>:   cmpb   $0x0,(%rcx)
--Type <RET> for more, q to quit, c to continue without paging--q
Quit
(gdb) quit
A debugging session is active.

        Inferior 1 [process 200650] will be detached.

Quit anyway? (y or n) y
Detaching from program: /home/openxs/dbs/maria10.8/bin/mariadbd, process 200650
[Inferior 1 (process 200650) detached]

At this stage we can stop debugging and quit from gdb. We see that line 1316 in the code (were packet is properly formed and ready for printing) matches (decimal) offset 175 from the beginning of the function. Now we can add that to the start of the function from objdump to find the address for the probe, 0x7b761f.

The only remaining detail is how to print the content of the register r10 as a zero terminated string. Just read fine bpftrace manual for this, and you can come up with the following:

openxs@ao756:~/dbs/maria10.8$ cd ~/git/bpftrace/
openxs@ao756:~/git/bpftrace$ cd build/
openxs@ao756:~/git/bpftrace/build$ sudo src/bpftrace -e 'uprobe:/home/openxs/dbs/maria10.8/bin/mariadbd:0x00000000007b761f { printf("Function: %s packet: %s\n", func, str(reg("r10"))); }'
[sudo] password for openxs:
Attaching 1 probe...
Function: do_command(THD*, bool) packet: select @@version_comment limit 1
Function: do_command(THD*, bool) packet: select 1 + 1
^C

This is what I've got from the probe while in another terminal I've connected and executed the following:

openxs@ao756:~/dbs/maria10.8$ bin/mysql
ERROR 2002 (HY000): Can't connect to local server through socket '/tmp/mysql.sock' (2)
openxs@ao756:~/dbs/maria10.8$ bin/mysql --socket=/tmp/mariadb.sock
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 4
Server version: 10.8.1-MariaDB Source distribution

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> select 1 + 1;
+-------+
| 1 + 1 |
+-------+
|     2 |
+-------+
1 row in set (0.000 sec)

So, as long as you know the address of instruction inside the function code, you can attach a probe to it. You can safely experiment, as if the address is surely wrong:

openxs@ao756:~/git/bpftrace/build$ sudo src/bpftrace -e 'uprobe:/home/openxs/dbs/maria10.8/bin/mariadbd:0x00000000007b761e { printf("Function: %s packet: %s\n", func, str(reg("r10"))); }'
Attaching 1 probe...
ERROR: Could not add uprobe into middle of instruction: /home/openxs/dbs/maria10.8/bin/mariadbd:_Z10do_commandP3THDb+174
openxs@ao756:~/git/bpftrace/build$ sudo src/bpftrace -e 'uprobe:/home/openxs/dbs/maria10.8/bin/mariadbd:0x00000000008b761f { printf("Function: %s packet: %s\n", func, str(reg("r10"))); }'
Attaching 1 probe...
ERROR: Could not add uprobe into middle of instruction: /home/openxs/dbs/maria10.8/bin/mariadbd:_ZN14partition_info25set_up_default_partitionsEP3THDP7handlerP14HA_CREATE_INFOj+799

the tool will tell you and do no harm.


Nice conbination of old Tower and new buildings in the City of London. Combining bpftrace with good old gdb is also nice.

To summarize:

  1.  It is possible to add a user probe to every assembly insturction inside the function with bpftrace (or perf, or ftrace, as long as you know what you are doing).
  2. It is possible to access CPU registers in the bpftrace probes.
  3. One can surely access, print etc local varaibles inside functions, as long as the address or register where it is stored is determined.
  4. It is surely easier and faster to do all the above with some knowledge of gdb and assembly language for your CPU.
  5. It is still fun to try to solve more complex problems with bpftrace, so more posts are expected. Stay tuned!


Friday, January 8, 2021

Linux /proc Filesystem for MySQL DBAs - Part III, 0x.tools by Tanel Poder

In this third post of my "Linux /proc Filesystem for MySQL DBAs" series (see also Part I and Part II for the context and details) I am going to present a useful set of programs to access, summarize and record /proc details created and recently shared by famous Tanel Poder, 0x.tools. I'll try to build them and apply to Percona Server 5.7.32-35 running on Ubuntu 16.04 netbook and fighting with some read-write sysbench load.

To build the tools I clone them from GitHub and just make as usual: 

openxs@ao756:~/git$ git clone https://github.com/tanelpoder/0xtools
Cloning into '0xtools'...
remote: Enumerating objects: 103, done.
remote: Counting objects: 100% (103/103), done.
remote: Compressing objects: 100% (67/67), done.
remote: Total 186 (delta 53), reused 77 (delta 32), pack-reused 83
Receiving objects: 100% (186/186), 108.43 KiB | 0 bytes/s, done.
Resolving deltas: 100% (94/94), done.
Checking connectivity... done.
openxs@ao756:~/git$ cd 0xtools/
openxs@ao756:~/git/0xtools$ make
gcc -I include -Wall -o bin/xcapture src/xcapture.c
openxs@ao756:~/git/0xtools$ sudo make install
# for now the temporary "install" method is with symlinks
ln -s `pwd`/bin/xcapture /usr/bin/xcapture
ln -s `pwd`/bin/psn /usr/bin/psn
ln -s `pwd`/bin/schedlat /usr/bin/schedlat
openxs@ao756:~/git/0xtools$ file bin/xcapture
bin/xcapture: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=ad05c149b0379a29fef765ec4706e912843846ff, not stripped
openxs@ao756:~/git/0xtools$ file bin/psn
bin/psn: Python script, ASCII text executable
openxs@ao756:~/git/0xtools$ file bin/schedlat
bin/schedlat: Python script, ASCII text executable
openxs@ao756:~/git/0xtools$ python --version
Python 2.7.12

Nothing more than that for now. Executable and two Python scripts. This installation process should work well on any Linux where you can build non-ancient MySQL from source as has some Python installed, starting from RHEL5. 

We can get some details about the programs usage as follows:

openxs@ao756:~/git/0xtools$ xcapture -h

0x.Tools xcapture v1.0 by Tanel Poder [https://0x.tools]

Usage:
  xcapture [options]

  By default, sample all /proc tasks in states R, D every second and print to stdout

  Options:
    -a             capture tasks in additional states, even the ones Sleeping (S)
    -A             capture tasks in All states, including Zombie (Z), Exiting (X), Idle (I)
    -c <c1,c2>     print additional columns (for example: -c exe,cmdline,kstack)
    -d <N>         seconds to sleep between samples (default: 1)
    -E <string>    custom task state Exclusion filter (default: XZIS)
    -h             display this help message
    -o <dirname>   write wide output into hourly CSV files in this directory instead of stdout

openxs@ao756:~/git/0xtools$ psn -h
usage: psn [-h] [-d seconds] [-p [pid]] [-t [tid]] [-r] [-a]
           [--sample-hz SAMPLE_HZ] [--ps-hz PS_HZ] [-o filename] [-i filename]
           [-s csv-columns] [-g csv-columns] [-G csv-columns] [--list]

optional arguments:
  -h, --help            show this help message and exit
  -d seconds            number of seconds to sample for
  -p [pid], --pid [pid]
                        process id to sample (including all its threads), or
                        process name regex, or omit for system-wide sampling
  -t [tid], --thread [tid]
                        thread/task id to sample (not implemented yet)
  -r, --recursive       also sample and report for descendant processes
  -a, --all-states      display threads in all states, including idle ones
  --sample-hz SAMPLE_HZ
                        sample rate in Hz (default: 20)
  --ps-hz PS_HZ         sample rate of new processes in Hz (default: 2)
  -o filename, --output-sample-db filename
                        path of sqlite3 database to persist samples to,
                        defaults to in-memory/transient
  -i filename, --input-sample-db filename
                        path of sqlite3 database to read samples from instead
                        of actively sampling
  -s csv-columns, --select csv-columns
                        additional columns to report
  -g csv-columns, --group-by csv-columns
                        columns to aggregate by in reports
  -G csv-columns, --append-group-by csv-columns
                        default + additional columns to aggregate by in
                        reports
  --list                list all available columns

The last one is also a python script that does not care to provide -h option, but shows the usage when called without arguments:

openxs@ao756:~/git/0xtools$ bin/schedlat
usage: bin/schedlat PID

There are also two shell scripts in bin/:

openxs@ao756:~/git/0xtools$ ls -F bin
psn*  run_xcapture.sh*  run_xcpu.sh*  schedlat*  xcapture*

The purposes of the tools are the following:

  • xcapture - low-overhead thread state sampler based on reading /proc files
  • psn - shows current top thread activity by sampling /proc files
  • schedlat - shows CPU scheduling latency for the given PID as a % of its runtime
  • run_xcapture.sh - a simple “daemon” script for keeping xcapture running
  • run_xcpu.sh - low-frequency continuous stack sampling for threads on CPU (using perf)

The last script, run_xcpu.sh, is out of scope of this series. We may get back to it in some next perf-related posts. It shows proper usge of perf for low frequency (1 Hz) and low impact on-CPU profiling.

Let's see what we can get with xcapture:

openxs@ao756:~/git/0xtools$ mkdir /tmp/xcap
openxs@ao756:~/git/0xtools$ xcapture -a -o /tmp/xcap

0xTools xcapture v1.0 by Tanel Poder [https://0x.tools]

Sampling /proc...

^C
openxs@ao756:~/git/0xtools$

while sysbench test is running against my Percona Server:

openxs@ao756:~$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=root --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-db=sbtest --threads=16 --time=1200 --report-interval=10 run
...
[ 1200s ] thds: 16 tps: 70.99 qps: 1420.02 (r/w/o: 993.87/284.16/141.98) lat (ms,95%): 344.08 err/s: 0.00 reconn/s: 0.00
SQL statistics:
    queries performed:
        read:                            1214556
        write:                           347016
        other:                           173508
        total:                           1735080
    transactions:                        86754  (72.28 per sec.)
    queries:                             1735080 (1445.54 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

Throughput:
    events/s (eps):                      72.2771
    time elapsed:                        1200.2964s
    total number of events:              86754

Latency (ms):
         min:                                   89.13
         avg:                                  221.33
         max:                                  855.31
         95th percentile:                      356.70
         sum:                             19200998.85

Threads fairness:
    events (avg/stddev):           5422.1250/12.10
    execution time (avg/stddev):   1200.0624/0.11

To get the impact of capturing the information, I've got the same test results without xcapture running previously:

...

    transactions:                        87376  (72.80 per sec.)
    queries:                             1747537 (1456.07 per sec.)
    ignored errors:                      1      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

Throughput:
    events/s (eps):                      72.8027
    time elapsed:                        1200.1747s
    total number of events:              87376

Latency (ms):
         min:                                   87.54
         avg:                                  219.75
         max:                                  843.19
         95th percentile:                      356.70
...

So we had 72.80 TPS and 1456 QPS over 1200 seconds without monitoring and 72.28 TPS and 1446 QPS with it. Do not mind the absolute (no tuning was performed, all default, old netbook with slow HDD), but the difference is really ignorable.

Now let's see what was captured (with -a, so all threads of all processes in the system, in all states, even S - sleeping):

openxs@ao756:~/git/0xtools$ ls -l /tmp/xcap/
total 56720
-rw-rw-r-- 1 openxs openxs 36310058 Jan  8 16:59 2021-01-08.16.csv
-rw-rw-r-- 1 openxs openxs 21768834 Jan  8 17:07 2021-01-08.17.csv
openxs@ao756:~/git/0xtools$ cat /tmp/xcap/2021-01-08.16.csv | head -30
TS,PID,TID,USERNAME,ST,COMMAND,SYSCALL,WCHAN,EXE,KSTACK
2021-01-08 16:47:24.414,1,1,root,S,(systemd),-,0,-,-
2021-01-08 16:47:24.414,2,2,root,S,(kthreadd),-,0,-,-
2021-01-08 16:47:24.414,3,3,root,S,(ksoftirqd/0),-,0,-,-
2021-01-08 16:47:24.414,5,5,root,S,(kworker/0:0H),-,0,-,-
2021-01-08 16:47:24.414,7,7,root,S,(rcu_sched),-,0,-,-
2021-01-08 16:47:24.414,8,8,root,S,(rcu_bh),-,0,-,-
2021-01-08 16:47:24.414,9,9,root,S,(migration/0),-,0,-,-
2021-01-08 16:47:24.414,10,10,root,S,(watchdog/0),-,0,-,-
2021-01-08 16:47:24.414,11,11,root,S,(watchdog/1),-,0,-,-
2021-01-08 16:47:24.414,12,12,root,S,(migration/1),-,0,-,-
2021-01-08 16:47:24.414,13,13,root,S,(ksoftirqd/1),-,0,-,-
2021-01-08 16:47:24.414,15,15,root,S,(kworker/1:0H),-,0,-,-
2021-01-08 16:47:24.414,16,16,root,S,(kdevtmpfs),-,0,-,-
2021-01-08 16:47:24.414,17,17,root,S,(netns),-,0,-,-
2021-01-08 16:47:24.414,18,18,root,S,(perf),-,0,-,-
2021-01-08 16:47:24.414,19,19,root,S,(khungtaskd),-,0,-,-
2021-01-08 16:47:24.414,20,20,root,S,(writeback),-,0,-,-
2021-01-08 16:47:24.414,21,21,root,S,(ksmd),-,0,-,-
2021-01-08 16:47:24.414,22,22,root,S,(khugepaged),-,0,-,-
2021-01-08 16:47:24.414,23,23,root,S,(crypto),-,0,-,-
2021-01-08 16:47:24.414,24,24,root,S,(kintegrityd),-,0,-,-
2021-01-08 16:47:24.414,25,25,root,S,(bioset),-,0,-,-
2021-01-08 16:47:24.414,26,26,root,S,(kblockd),-,0,-,-
2021-01-08 16:47:24.414,27,27,root,S,(ata_sff),-,0,-,-
2021-01-08 16:47:24.414,28,28,root,S,(md),-,0,-,-
2021-01-08 16:47:24.414,29,29,root,S,(devfreq_wq),-,0,-,-
2021-01-08 16:47:24.414,34,34,root,S,(kswapd0),-,0,-,-
2021-01-08 16:47:24.414,35,35,root,S,(vmstat),-,0,-,-
2021-01-08 16:47:24.414,36,36,root,S,(fsnotify_mark),-,0,-,-
openxs@ao756:~/git/0xtools$

Nothing that fancy in the first 30 lines. But you can then load CSV files into the database for rocessing, or query them with standard Linux text processing tools. It’s like SQL but with different keywords: grep for filtering, awk or sed for column projection, uniq for group by, sort for ordering and head/tail for LIMIT. Let's count number of threads per command, state and syscall, for example, asnd get top 10:

openxs@ao756:~/git/0xtools$ cat /tmp/xcap/*.csv | awk -F, '{ printf("%-20s %-20s %s\n",$6,$5,$7) }' | sort | uniq -c | sort -nbr | head -10
  71442 (console-kit-dae)    S                    -
  69174 (gmain)              S                    read
  69173 (gdbus)              S                    read
  48762 (dconf               S                    read
  47628 (dockerd)            S                    -
  46455 (mysqld)             S                    -
  19196 (sysbench)           S                    read

  15876 (gmain)              S                    -
  15876 (bioset)             S                    -
  14742 (containerd)         S                    -

No wonder MySQL threads were mostly sleeping, with 16 of them and just 2 slow CPUs.

Next stool is psn (Linux Process Snapper), a Python script for troubleshooting currently on-going issues (no historical capture). It currently reports more fields directly from /proc than xcapture captures (like filenames accessed by IO system calls). Let me try to apply psn to the mysqld process:

openxs@ao756:~/git/0xtools$ psn -p `pidof mysqld`

Linux Process Snapper v0.18 by Tanel Poder [https://0x.tools]
Sampling /proc/stat for 5 seconds... finished.


=== Active Threads ========================================

 samples | avg_threads | comm     | state
-----------------------------------------------------------
     205 |        2.05 | (mysqld) | Disk (Uninterruptible)
     114 |        1.14 | (mysqld) | Running (ON CPU)


samples: 100 (expected: 100)
total processes: 1, threads: 44
runtime: 5.01, measure time: 1.29

or to the entire system while sysbench test is running:

openxs@ao756:~/git/0xtools$ psn

Linux Process Snapper v0.18 by Tanel Poder [https://0x.tools]
Sampling /proc/stat for 5 seconds... finished.


=== Active Threads ================================================

 samples | avg_threads | comm             | state
-------------------------------------------------------------------
      94 |        2.24 | (mysqld)         | Disk (Uninterruptible)
      41 |        0.98 | (jbd*/dm-*-*)    | Disk (Uninterruptible)
      31 |        0.74 | (mysqld)         | Running (ON CPU)
      31 |        0.74 | (sysbench)       | Running (ON CPU)
      14 |        0.33 | (update-manager) | Running (ON CPU)
       5 |        0.12 | (kworker/u*:*)   | Running (ON CPU)
       2 |        0.05 | (Xorg)           | Running (ON CPU)
       1 |        0.02 | (compiz)         | Running (ON CPU)
       1 |        0.02 | (dockerd)        | Running (ON CPU)
       1 |        0.02 | (rcu_sched)      | Running (ON CPU)


samples: 42 (expected: 100)
total processes: 221, threads: 650
runtime: 5.06, measure time: 5.01

Some useful summaries of numbers of threads per state per process for the monitoring period, but nothing really fancy. This is, we can add other colunms from this long list with items that should look familiar for those studying /proc content:

openxs@ao756:~/git/0xtools$ psn --list

stat
====================================
pid                              int
comm                             str
comm2                            str
state_id                         str
state                            str
ppid                             int
pgrp                             int
session                          int
tty_nr                           int
tpgid                            int
minflt                           int
cminflt                          int
majflt                           int
cmajflt                          int
utime                           long
stime                           long
cutime                          long
cstime                          long
utime_sec                       long
stime_sec                       long
cutime_sec                      long
cstime_sec                      long
priority                         int
nice                             int
num_threads                      int
starttime                       long
vsize                           long
rss                             long
rsslim                           str
exit_signal                      int
processor                        int
rt_priority                      int
delayacct_blkio_ticks           long
guest_time                       int
cgust_time                       int
exit_code                        int

status
====================================
name                             str
umask                            str
state                            str
tgid                             int
ngid                             int
pid                              int
ppid                             int
tracerpid                        int
uid                              int
gid                              int
fdsize                           int
groups                           str
nstgid                           str
nspid                            str
nspgid                           str
nssid                            str
vmpeak_kb                        int
vmsize_kb                        int
vmlck_kb                         int
vmpin_kb                         int
vmhwm_kb                         int
vmrss_kb                         int
rssanon_kb                       int
rssfile_kb                       int
rssshmem_kb                      int
vmdata_kb                        int
vmstk_kb                         int
vmexe_kb                         int
vmlib_kb                         int
vmpte_kb                         int
vmpmd_kb                         int
vmswap_kb                        int
hugetlbpages_kb                  int
threads                          int
sigq                             str
sigpnd                           str
shdpnd                           str
sigblk                           str
sigign                           str
sigcgt                           str
capinh                           str
capprm                           str
capeff                           str
capbnd                           str
capamb                           str
seccomp                          int
cpus_allowed                     str
cpus_allowed_list                str
mems_allowed                     str
mems_allowed_list                str
voluntary_ctxt_switches          int
nonvoluntary_ctxt_switches       int

syscall
====================================
syscall_id                       int
syscall                          str
arg0                             str
arg1                             str
arg2                             str
arg3                             str
arg4                             str
arg5                             str
filename                         str
filename2                        str
filenamesum                      str
basename                         str
dirname                          str

wchan
====================================
wchan                            str

io
====================================
rchar                            int
wchar                            int
syscr                            int
syscw                            int
read_bytes                       int
write_bytes                      int
cancelled_write_bytes            int

smaps
====================================
address_range                    str
perms                            str
offset                           str
dev                              str
inode                            int
pathname                         str
size_kb                          int
rss_kb                           int
pss_kb                           int
shared_clean_kb                  int
shared_dirty_kb                  int
private_clean_kb                 int
private_dirty_kb                 int
referenced_kb                    int
anonymous_kb                     int
anonhugepages_kb                 int
shmempmdmapped_kb                int
shared_hugetld_kb                int
private_hugetld_kb               int
swap_kb                          int
swappss_kb                       int
kernelpagesize_kb                int
mmupagesize_kb                   int
locked_kb                        int
vmflags                          str

stack
====================================
kstack                           str

cmdline
====================================
cmdline                          str
openxs@ao756:~/git/0xtools$

We can add syscall to find out what threads were waiting on (not only number as in /proc, but syscall decoded, that's what requires some tricks or good tool):

openxs@ao756:~/git/0xtools$ sudo psn -p `pidof mysqld` -G syscall

Linux Process Snapper v0.18 by Tanel Poder [https://0x.tools]
Sampling /proc/stat, syscall for 5 seconds... finished.


=== Active Threads ====================================================

 samples | avg_threads | comm     | state                  | syscall
-----------------------------------------------------------------------
     136 |        1.36 | (mysqld) | Disk (Uninterruptible) | fsync
      97 |        0.97 | (mysqld) | Disk (Uninterruptible) | fdatasync
      41 |        0.41 | (mysqld) | Running (ON CPU)       | [running]
       5 |        0.05 | (mysqld) | Running (ON CPU)       | poll
       4 |        0.04 | (mysqld) | Disk (Uninterruptible) | pwrite64
       3 |        0.03 | (mysqld) | Running (ON CPU)       | futex
       2 |        0.02 | (mysqld) | Disk (Uninterruptible) | write


samples: 100 (expected: 100)
total processes: 1, threads: 44
runtime: 5.02, measure time: 1.67

We can further group by different kernel stacks:

openxs@ao756:~/git/0xtools$ sudo psn -p `pidof mysqld` -G kstack

Linux Process Snapper v0.18 by Tanel Poder [https://0x.tools]
Sampling /proc/stat, stack for 5 seconds... finished.


=== Active Threads =========================================================================================================================================================================================================================================================================================

 samples | avg_threads | comm     | state                  | kstack                                                                                                                                                                             
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     101 |        1.01 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_fsync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->jbd2_complete_transaction()->jbd2_log_wait_commit()                                     
      89 |        0.89 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_fdatasync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->jbd2_complete_transaction()->jbd2_log_wait_commit()                                 
      48 |        0.48 | (mysqld) | Running (ON CPU)       | entry_SYSCALL_64_fastpath()->SyS_poll()->do_sys_poll()->poll_schedule_timeout()                                                                                                    
      15 |        0.15 | (mysqld) | Running (ON CPU)       | int_ret_from_sys_call()->syscall_return_slowpath()->exit_to_usermode_loop()                                                                                                        
      11 |        0.11 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_fsync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->blkdev_issue_flush()->submit_bio_wait()                                                 
      11 |        0.11 | (mysqld) | Running (ON CPU)       | -                                                                                                                                                                                  
       9 |        0.09 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_fsync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->filemap_write_and_wait_range()->filemap_fdatawait_range()->__filemap_fdatawait_range()->wait_on_page_bit()
       7 |        0.07 | (mysqld) | Running (ON CPU)       | entry_SYSCALL_64_fastpath()->SyS_futex()->do_futex()->futex_wait()->futex_wait_queue_me()                                                                                          
       4 |        0.04 | (mysqld) | Running (ON CPU)       | retint_user()->prepare_exit_to_usermode()->exit_to_usermode_loop()                                                                                                                 
       3 |        0.03 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_pwrite64()->vfs_write()->__vfs_write()->new_sync_write()->ext4_file_write_iter()->__generic_file_write_iter()->generic_file_direct_write()->ext4_direct_IO()->__blockdev_direct_IO()->do_blockdev_direct_IO()
       2 |        0.02 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_fdatasync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->filemap_write_and_wait_range()->filemap_fdatawait_range()->__filemap_fdatawait_range()->wait_on_page_bit()
       2 |        0.02 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_pread64()->vfs_read()->__vfs_read()->new_sync_read()->generic_file_read_iter()->wait_on_page_bit_killable()                                       
       2 |        0.02 | (mysqld) | Running (ON CPU)       | entry_SYSCALL_64_fastpath()->SyS_io_getevents()->read_events()                                                                                                                     
       1 |        0.01 | (mysqld) | Running (ON CPU)       | entry_SYSCALL_64_fastpath()->SyS_fsync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->filemap_write_and_wait_range()->filemap_fdatawait_range()->__filemap_fdatawait_range()->wait_on_page_bit()
       1 |        0.01 | (mysqld) | Running (ON CPU)       | entry_SYSCALL_64_fastpath()->SyS_fsync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->jbd2_complete_transaction()->jbd2_log_wait_commit()                                     


samples: 100 (expected: 100)
total processes: 1, threads: 44
runtime: 5.01, measure time: 1.78

or even more, we can see wait channel and if it's disk I/O what file it is related to:

openxs@ao756:~/git/0xtools$ sudo psn -p `pidof mysqld` -G wchan,filename

Linux Process Snapper v0.18 by Tanel Poder [https://0x.tools]
Sampling /proc/syscall, wchan, stat for 5 seconds... finished.


=== Active Threads =====================================================================================================

 samples | avg_threads | comm     | state                  | wchan                 | filename
------------------------------------------------------------------------------------------------------------------------
      91 |        0.91 | (mysqld) | Disk (Uninterruptible) | jbd2_log_wait_commit  | /var/lib/mysql/ao756-bin.000092
      89 |        0.89 | (mysqld) | Disk (Uninterruptible) | jbd2_log_wait_commit  | /var/lib/mysql/ib_logfile1

      49 |        0.49 | (mysqld) | Running (ON CPU)       | 0                     |
       9 |        0.09 | (mysqld) | Disk (Uninterruptible) | wait_on_page_bit      | /var/lib/mysql/ib_logfile1
       7 |        0.07 | (mysqld) | Disk (Uninterruptible) | jbd2_log_wait_commit  | /var/lib/mysql/xb_doublewrite
       4 |        0.04 | (mysqld) | Disk (Uninterruptible) | jbd2_log_wait_commit  | /var/lib/mysql/sbtest/sbtest5.ibd
       3 |        0.03 | (mysqld) | Disk (Uninterruptible) | jbd2_log_wait_commit  | /var/lib/mysql/sbtest/sbtest4.ibd
       3 |        0.03 | (mysqld) | Running (ON CPU)       | poll_schedule_timeout |
       2 |        0.02 | (mysqld) | Disk (Uninterruptible) | wait_on_page_bit      | /var/lib/mysql/ao756-bin.000092
       1 |        0.01 | (mysqld) | Disk (Uninterruptible) | do_blockdev_direct_IO | /var/lib/mysql/xb_doublewrite
       1 |        0.01 | (mysqld) | Disk (Uninterruptible) | submit_bio_wait       | /var/lib/mysql/sbtest/sbtest2.ibd
       1 |        0.01 | (mysqld) | Disk (Uninterruptible) | submit_bio_wait       | /var/lib/mysql/sbtest/sbtest3.ibd
       1 |        0.01 | (mysqld) | Disk (Uninterruptible) | wait_on_page_bit      |
       1 |        0.01 | (mysqld) | Disk (Uninterruptible) | wait_on_page_bit      | /var/lib/mysql/sbtest/sbtest4.ibd
       1 |        0.01 | (mysqld) | Disk (Uninterruptible) | wait_on_page_bit      | /var/lib/mysql/sbtest/sbtest5.ibd
       1 |        0.01 | (mysqld) | Running (ON CPU)       | 0                     | /var/lib/mysql/ib_logfile1


samples: 100 (expected: 100)
total processes: 1, threads: 44
runtime: 5.03, measure time: 2.06

This is really cool and easy to use! This way I got reminded that actually I have binary logging enabled somewhere and it surely slows things down :)

The last tool to check today is schedlat. It had not provided me with any insights:

openxs@ao756:~/git/0xtools$ schedlat `pidof mysqld`
SchedLat by Tanel Poder [https://0x.tools]

PID=19951 COMM=mysqld

TIMESTAMP              %CPU   %LAT   %SLP
2021-01-08 18:52:38     0.0    0.0  100.0
2021-01-08 18:52:39     0.0    0.0  100.0
2021-01-08 18:52:40     0.0    0.0  100.0
2021-01-08 18:52:41     0.0    0.0  100.0
2021-01-08 18:52:42     0.0    0.4   99.6
2021-01-08 18:52:43     0.0    0.0  100.0
2021-01-08 18:52:44     0.0    0.0  100.0
2021-01-08 18:52:45     0.0    0.0  100.0
2021-01-08 18:52:46     0.0    0.0  100.0
2021-01-08 18:52:47     0.0    0.0  100.0
2021-01-08 18:52:48     0.0    0.0  100.0
2021-01-08 18:52:49     0.0    0.0  100.0
2021-01-08 18:52:50     0.0    0.0  100.0
2021-01-08 18:52:51     0.0    0.0  100.0
2021-01-08 18:52:52     0.0    0.0  100.0
2021-01-08 18:52:53     0.0    0.0  100.0
2021-01-08 18:52:54     0.0    0.0  100.0
2021-01-08 18:52:55     0.0    0.0  100.0
2021-01-08 18:52:56     0.0    0.0  100.0
2021-01-08 18:52:57     0.0    0.0  100.0
2021-01-08 18:52:58     0.0    0.0  100.0
2021-01-08 18:52:59     0.0    0.0  100.0
2021-01-08 18:53:00     0.0    0.0  100.0
2021-01-08 18:53:01     0.0    0.0  100.0
2021-01-08 18:53:02     0.0    0.0  100.0
2021-01-08 18:53:03     0.0    0.0  100.0
2021-01-08 18:53:04     0.0    0.0  100.0
2021-01-08 18:53:05     0.0    0.0  100.0
2021-01-08 18:53:06     0.0    0.0  100.0
2021-01-08 18:53:07     0.0    0.0  100.0
^CTraceback (most recent call last):
  File "/usr/bin/schedlat", line 36, in <module>
    time.sleep(1)
KeyboardInterrupt

I was not CPU-bound, the process was mostly sleeping. 

I suggest you to get inspired by the thread states chart below and article it got it copied from, and read other articled by Tanel Poder here. This is what I am going to do myself before trying to apply the tools in real life. The idea of this blog post was to show what is possible for a creative person while sampling thread states from /proc.

Nice illustration of generic threads states we care about for off-CPU sampling, provided by Brendann Gregg at http://www.brendangregg.com/offcpuanalysis.html

* * *

To summarize, why would anyone may prefer or need to use /proc sampling tools instead of other approaches:

  1. Unlike perf or eBPF tools, /proc is always there. It is present on old Linux versions, and poor man's sampling with shell scripts does not require installing anything else. 
  2. Some information about the processes is visible to all users, so there may be no need for root/sudo privileges that MySQLproduction DBAs often lack.
  3. It is easy to use to do low-overhead off-CPU profiling. While it is possible to enable tracing for off-cpu events in perf, it comes with a higher tracing overhead (and then overhead of post-processing these high-frequency events). eBPF tools can be used to reduce both, but it is still extra overhead on top of what is already there in /proc.

So I decided to add these tools and ad hoc /proc sampling scripts (and checking sar outputs, if any) to my emergency toolset while studying some complex MySQL performance problems. They fit well into the ovberall system of application-level tracing based on performance_schema, on-CPU sampling with perf and some bpftrace quick scripts. Use right tools for the job!

In the next, probably final post in this series, I'll try to present some good example of MySQL performance problem where /proc-based sampling tools really help to get useful insight.