Friday, January 8, 2021

Linux /proc Filesystem for MySQL DBAs - Part III, 0x.tools by Tanel Poder

In this third post of my "Linux /proc Filesystem for MySQL DBAs" series (see also Part I and Part II for the context and details) I am going to present a useful set of programs to access, summarize and record /proc details created and recently shared by famous Tanel Poder, 0x.tools. I'll try to build them and apply to Percona Server 5.7.32-35 running on Ubuntu 16.04 netbook and fighting with some read-write sysbench load.

To build the tools I clone them from GitHub and just make as usual: 

openxs@ao756:~/git$ git clone https://github.com/tanelpoder/0xtools
Cloning into '0xtools'...
remote: Enumerating objects: 103, done.
remote: Counting objects: 100% (103/103), done.
remote: Compressing objects: 100% (67/67), done.
remote: Total 186 (delta 53), reused 77 (delta 32), pack-reused 83
Receiving objects: 100% (186/186), 108.43 KiB | 0 bytes/s, done.
Resolving deltas: 100% (94/94), done.
Checking connectivity... done.
openxs@ao756:~/git$ cd 0xtools/
openxs@ao756:~/git/0xtools$ make
gcc -I include -Wall -o bin/xcapture src/xcapture.c
openxs@ao756:~/git/0xtools$ sudo make install
# for now the temporary "install" method is with symlinks
ln -s `pwd`/bin/xcapture /usr/bin/xcapture
ln -s `pwd`/bin/psn /usr/bin/psn
ln -s `pwd`/bin/schedlat /usr/bin/schedlat
openxs@ao756:~/git/0xtools$ file bin/xcapture
bin/xcapture: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=ad05c149b0379a29fef765ec4706e912843846ff, not stripped
openxs@ao756:~/git/0xtools$ file bin/psn
bin/psn: Python script, ASCII text executable
openxs@ao756:~/git/0xtools$ file bin/schedlat
bin/schedlat: Python script, ASCII text executable
openxs@ao756:~/git/0xtools$ python --version
Python 2.7.12

Nothing more than that for now. Executable and two Python scripts. This installation process should work well on any Linux where you can build non-ancient MySQL from source as has some Python installed, starting from RHEL5. 

We can get some details about the programs usage as follows:

openxs@ao756:~/git/0xtools$ xcapture -h

0x.Tools xcapture v1.0 by Tanel Poder [https://0x.tools]

Usage:
  xcapture [options]

  By default, sample all /proc tasks in states R, D every second and print to stdout

  Options:
    -a             capture tasks in additional states, even the ones Sleeping (S)
    -A             capture tasks in All states, including Zombie (Z), Exiting (X), Idle (I)
    -c <c1,c2>     print additional columns (for example: -c exe,cmdline,kstack)
    -d <N>         seconds to sleep between samples (default: 1)
    -E <string>    custom task state Exclusion filter (default: XZIS)
    -h             display this help message
    -o <dirname>   write wide output into hourly CSV files in this directory instead of stdout

openxs@ao756:~/git/0xtools$ psn -h
usage: psn [-h] [-d seconds] [-p [pid]] [-t [tid]] [-r] [-a]
           [--sample-hz SAMPLE_HZ] [--ps-hz PS_HZ] [-o filename] [-i filename]
           [-s csv-columns] [-g csv-columns] [-G csv-columns] [--list]

optional arguments:
  -h, --help            show this help message and exit
  -d seconds            number of seconds to sample for
  -p [pid], --pid [pid]
                        process id to sample (including all its threads), or
                        process name regex, or omit for system-wide sampling
  -t [tid], --thread [tid]
                        thread/task id to sample (not implemented yet)
  -r, --recursive       also sample and report for descendant processes
  -a, --all-states      display threads in all states, including idle ones
  --sample-hz SAMPLE_HZ
                        sample rate in Hz (default: 20)
  --ps-hz PS_HZ         sample rate of new processes in Hz (default: 2)
  -o filename, --output-sample-db filename
                        path of sqlite3 database to persist samples to,
                        defaults to in-memory/transient
  -i filename, --input-sample-db filename
                        path of sqlite3 database to read samples from instead
                        of actively sampling
  -s csv-columns, --select csv-columns
                        additional columns to report
  -g csv-columns, --group-by csv-columns
                        columns to aggregate by in reports
  -G csv-columns, --append-group-by csv-columns
                        default + additional columns to aggregate by in
                        reports
  --list                list all available columns

The last one is also a python script that does not care to provide -h option, but shows the usage when called without arguments:

openxs@ao756:~/git/0xtools$ bin/schedlat
usage: bin/schedlat PID

There are also two shell scripts in bin/:

openxs@ao756:~/git/0xtools$ ls -F bin
psn*  run_xcapture.sh*  run_xcpu.sh*  schedlat*  xcapture*

The purposes of the tools are the following:

  • xcapture - low-overhead thread state sampler based on reading /proc files
  • psn - shows current top thread activity by sampling /proc files
  • schedlat - shows CPU scheduling latency for the given PID as a % of its runtime
  • run_xcapture.sh - a simple “daemon” script for keeping xcapture running
  • run_xcpu.sh - low-frequency continuous stack sampling for threads on CPU (using perf)

The last script, run_xcpu.sh, is out of scope of this series. We may get back to it in some next perf-related posts. It shows proper usge of perf for low frequency (1 Hz) and low impact on-CPU profiling.

Let's see what we can get with xcapture:

openxs@ao756:~/git/0xtools$ mkdir /tmp/xcap
openxs@ao756:~/git/0xtools$ xcapture -a -o /tmp/xcap

0xTools xcapture v1.0 by Tanel Poder [https://0x.tools]

Sampling /proc...

^C
openxs@ao756:~/git/0xtools$

while sysbench test is running against my Percona Server:

openxs@ao756:~$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=root --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-db=sbtest --threads=16 --time=1200 --report-interval=10 run
...
[ 1200s ] thds: 16 tps: 70.99 qps: 1420.02 (r/w/o: 993.87/284.16/141.98) lat (ms,95%): 344.08 err/s: 0.00 reconn/s: 0.00
SQL statistics:
    queries performed:
        read:                            1214556
        write:                           347016
        other:                           173508
        total:                           1735080
    transactions:                        86754  (72.28 per sec.)
    queries:                             1735080 (1445.54 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

Throughput:
    events/s (eps):                      72.2771
    time elapsed:                        1200.2964s
    total number of events:              86754

Latency (ms):
         min:                                   89.13
         avg:                                  221.33
         max:                                  855.31
         95th percentile:                      356.70
         sum:                             19200998.85

Threads fairness:
    events (avg/stddev):           5422.1250/12.10
    execution time (avg/stddev):   1200.0624/0.11

To get the impact of capturing the information, I've got the same test results without xcapture running previously:

...

    transactions:                        87376  (72.80 per sec.)
    queries:                             1747537 (1456.07 per sec.)
    ignored errors:                      1      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

Throughput:
    events/s (eps):                      72.8027
    time elapsed:                        1200.1747s
    total number of events:              87376

Latency (ms):
         min:                                   87.54
         avg:                                  219.75
         max:                                  843.19
         95th percentile:                      356.70
...

So we had 72.80 TPS and 1456 QPS over 1200 seconds without monitoring and 72.28 TPS and 1446 QPS with it. Do not mind the absolute (no tuning was performed, all default, old netbook with slow HDD), but the difference is really ignorable.

Now let's see what was captured (with -a, so all threads of all processes in the system, in all states, even S - sleeping):

openxs@ao756:~/git/0xtools$ ls -l /tmp/xcap/
total 56720
-rw-rw-r-- 1 openxs openxs 36310058 Jan  8 16:59 2021-01-08.16.csv
-rw-rw-r-- 1 openxs openxs 21768834 Jan  8 17:07 2021-01-08.17.csv
openxs@ao756:~/git/0xtools$ cat /tmp/xcap/2021-01-08.16.csv | head -30
TS,PID,TID,USERNAME,ST,COMMAND,SYSCALL,WCHAN,EXE,KSTACK
2021-01-08 16:47:24.414,1,1,root,S,(systemd),-,0,-,-
2021-01-08 16:47:24.414,2,2,root,S,(kthreadd),-,0,-,-
2021-01-08 16:47:24.414,3,3,root,S,(ksoftirqd/0),-,0,-,-
2021-01-08 16:47:24.414,5,5,root,S,(kworker/0:0H),-,0,-,-
2021-01-08 16:47:24.414,7,7,root,S,(rcu_sched),-,0,-,-
2021-01-08 16:47:24.414,8,8,root,S,(rcu_bh),-,0,-,-
2021-01-08 16:47:24.414,9,9,root,S,(migration/0),-,0,-,-
2021-01-08 16:47:24.414,10,10,root,S,(watchdog/0),-,0,-,-
2021-01-08 16:47:24.414,11,11,root,S,(watchdog/1),-,0,-,-
2021-01-08 16:47:24.414,12,12,root,S,(migration/1),-,0,-,-
2021-01-08 16:47:24.414,13,13,root,S,(ksoftirqd/1),-,0,-,-
2021-01-08 16:47:24.414,15,15,root,S,(kworker/1:0H),-,0,-,-
2021-01-08 16:47:24.414,16,16,root,S,(kdevtmpfs),-,0,-,-
2021-01-08 16:47:24.414,17,17,root,S,(netns),-,0,-,-
2021-01-08 16:47:24.414,18,18,root,S,(perf),-,0,-,-
2021-01-08 16:47:24.414,19,19,root,S,(khungtaskd),-,0,-,-
2021-01-08 16:47:24.414,20,20,root,S,(writeback),-,0,-,-
2021-01-08 16:47:24.414,21,21,root,S,(ksmd),-,0,-,-
2021-01-08 16:47:24.414,22,22,root,S,(khugepaged),-,0,-,-
2021-01-08 16:47:24.414,23,23,root,S,(crypto),-,0,-,-
2021-01-08 16:47:24.414,24,24,root,S,(kintegrityd),-,0,-,-
2021-01-08 16:47:24.414,25,25,root,S,(bioset),-,0,-,-
2021-01-08 16:47:24.414,26,26,root,S,(kblockd),-,0,-,-
2021-01-08 16:47:24.414,27,27,root,S,(ata_sff),-,0,-,-
2021-01-08 16:47:24.414,28,28,root,S,(md),-,0,-,-
2021-01-08 16:47:24.414,29,29,root,S,(devfreq_wq),-,0,-,-
2021-01-08 16:47:24.414,34,34,root,S,(kswapd0),-,0,-,-
2021-01-08 16:47:24.414,35,35,root,S,(vmstat),-,0,-,-
2021-01-08 16:47:24.414,36,36,root,S,(fsnotify_mark),-,0,-,-
openxs@ao756:~/git/0xtools$

Nothing that fancy in the first 30 lines. But you can then load CSV files into the database for rocessing, or query them with standard Linux text processing tools. It’s like SQL but with different keywords: grep for filtering, awk or sed for column projection, uniq for group by, sort for ordering and head/tail for LIMIT. Let's count number of threads per command, state and syscall, for example, asnd get top 10:

openxs@ao756:~/git/0xtools$ cat /tmp/xcap/*.csv | awk -F, '{ printf("%-20s %-20s %s\n",$6,$5,$7) }' | sort | uniq -c | sort -nbr | head -10
  71442 (console-kit-dae)    S                    -
  69174 (gmain)              S                    read
  69173 (gdbus)              S                    read
  48762 (dconf               S                    read
  47628 (dockerd)            S                    -
  46455 (mysqld)             S                    -
  19196 (sysbench)           S                    read

  15876 (gmain)              S                    -
  15876 (bioset)             S                    -
  14742 (containerd)         S                    -

No wonder MySQL threads were mostly sleeping, with 16 of them and just 2 slow CPUs.

Next stool is psn (Linux Process Snapper), a Python script for troubleshooting currently on-going issues (no historical capture). It currently reports more fields directly from /proc than xcapture captures (like filenames accessed by IO system calls). Let me try to apply psn to the mysqld process:

openxs@ao756:~/git/0xtools$ psn -p `pidof mysqld`

Linux Process Snapper v0.18 by Tanel Poder [https://0x.tools]
Sampling /proc/stat for 5 seconds... finished.


=== Active Threads ========================================

 samples | avg_threads | comm     | state
-----------------------------------------------------------
     205 |        2.05 | (mysqld) | Disk (Uninterruptible)
     114 |        1.14 | (mysqld) | Running (ON CPU)


samples: 100 (expected: 100)
total processes: 1, threads: 44
runtime: 5.01, measure time: 1.29

or to the entire system while sysbench test is running:

openxs@ao756:~/git/0xtools$ psn

Linux Process Snapper v0.18 by Tanel Poder [https://0x.tools]
Sampling /proc/stat for 5 seconds... finished.


=== Active Threads ================================================

 samples | avg_threads | comm             | state
-------------------------------------------------------------------
      94 |        2.24 | (mysqld)         | Disk (Uninterruptible)
      41 |        0.98 | (jbd*/dm-*-*)    | Disk (Uninterruptible)
      31 |        0.74 | (mysqld)         | Running (ON CPU)
      31 |        0.74 | (sysbench)       | Running (ON CPU)
      14 |        0.33 | (update-manager) | Running (ON CPU)
       5 |        0.12 | (kworker/u*:*)   | Running (ON CPU)
       2 |        0.05 | (Xorg)           | Running (ON CPU)
       1 |        0.02 | (compiz)         | Running (ON CPU)
       1 |        0.02 | (dockerd)        | Running (ON CPU)
       1 |        0.02 | (rcu_sched)      | Running (ON CPU)


samples: 42 (expected: 100)
total processes: 221, threads: 650
runtime: 5.06, measure time: 5.01

Some useful summaries of numbers of threads per state per process for the monitoring period, but nothing really fancy. This is, we can add other colunms from this long list with items that should look familiar for those studying /proc content:

openxs@ao756:~/git/0xtools$ psn --list

stat
====================================
pid                              int
comm                             str
comm2                            str
state_id                         str
state                            str
ppid                             int
pgrp                             int
session                          int
tty_nr                           int
tpgid                            int
minflt                           int
cminflt                          int
majflt                           int
cmajflt                          int
utime                           long
stime                           long
cutime                          long
cstime                          long
utime_sec                       long
stime_sec                       long
cutime_sec                      long
cstime_sec                      long
priority                         int
nice                             int
num_threads                      int
starttime                       long
vsize                           long
rss                             long
rsslim                           str
exit_signal                      int
processor                        int
rt_priority                      int
delayacct_blkio_ticks           long
guest_time                       int
cgust_time                       int
exit_code                        int

status
====================================
name                             str
umask                            str
state                            str
tgid                             int
ngid                             int
pid                              int
ppid                             int
tracerpid                        int
uid                              int
gid                              int
fdsize                           int
groups                           str
nstgid                           str
nspid                            str
nspgid                           str
nssid                            str
vmpeak_kb                        int
vmsize_kb                        int
vmlck_kb                         int
vmpin_kb                         int
vmhwm_kb                         int
vmrss_kb                         int
rssanon_kb                       int
rssfile_kb                       int
rssshmem_kb                      int
vmdata_kb                        int
vmstk_kb                         int
vmexe_kb                         int
vmlib_kb                         int
vmpte_kb                         int
vmpmd_kb                         int
vmswap_kb                        int
hugetlbpages_kb                  int
threads                          int
sigq                             str
sigpnd                           str
shdpnd                           str
sigblk                           str
sigign                           str
sigcgt                           str
capinh                           str
capprm                           str
capeff                           str
capbnd                           str
capamb                           str
seccomp                          int
cpus_allowed                     str
cpus_allowed_list                str
mems_allowed                     str
mems_allowed_list                str
voluntary_ctxt_switches          int
nonvoluntary_ctxt_switches       int

syscall
====================================
syscall_id                       int
syscall                          str
arg0                             str
arg1                             str
arg2                             str
arg3                             str
arg4                             str
arg5                             str
filename                         str
filename2                        str
filenamesum                      str
basename                         str
dirname                          str

wchan
====================================
wchan                            str

io
====================================
rchar                            int
wchar                            int
syscr                            int
syscw                            int
read_bytes                       int
write_bytes                      int
cancelled_write_bytes            int

smaps
====================================
address_range                    str
perms                            str
offset                           str
dev                              str
inode                            int
pathname                         str
size_kb                          int
rss_kb                           int
pss_kb                           int
shared_clean_kb                  int
shared_dirty_kb                  int
private_clean_kb                 int
private_dirty_kb                 int
referenced_kb                    int
anonymous_kb                     int
anonhugepages_kb                 int
shmempmdmapped_kb                int
shared_hugetld_kb                int
private_hugetld_kb               int
swap_kb                          int
swappss_kb                       int
kernelpagesize_kb                int
mmupagesize_kb                   int
locked_kb                        int
vmflags                          str

stack
====================================
kstack                           str

cmdline
====================================
cmdline                          str
openxs@ao756:~/git/0xtools$

We can add syscall to find out what threads were waiting on (not only number as in /proc, but syscall decoded, that's what requires some tricks or good tool):

openxs@ao756:~/git/0xtools$ sudo psn -p `pidof mysqld` -G syscall

Linux Process Snapper v0.18 by Tanel Poder [https://0x.tools]
Sampling /proc/stat, syscall for 5 seconds... finished.


=== Active Threads ====================================================

 samples | avg_threads | comm     | state                  | syscall
-----------------------------------------------------------------------
     136 |        1.36 | (mysqld) | Disk (Uninterruptible) | fsync
      97 |        0.97 | (mysqld) | Disk (Uninterruptible) | fdatasync
      41 |        0.41 | (mysqld) | Running (ON CPU)       | [running]
       5 |        0.05 | (mysqld) | Running (ON CPU)       | poll
       4 |        0.04 | (mysqld) | Disk (Uninterruptible) | pwrite64
       3 |        0.03 | (mysqld) | Running (ON CPU)       | futex
       2 |        0.02 | (mysqld) | Disk (Uninterruptible) | write


samples: 100 (expected: 100)
total processes: 1, threads: 44
runtime: 5.02, measure time: 1.67

We can further group by different kernel stacks:

openxs@ao756:~/git/0xtools$ sudo psn -p `pidof mysqld` -G kstack

Linux Process Snapper v0.18 by Tanel Poder [https://0x.tools]
Sampling /proc/stat, stack for 5 seconds... finished.


=== Active Threads =========================================================================================================================================================================================================================================================================================

 samples | avg_threads | comm     | state                  | kstack                                                                                                                                                                             
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     101 |        1.01 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_fsync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->jbd2_complete_transaction()->jbd2_log_wait_commit()                                     
      89 |        0.89 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_fdatasync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->jbd2_complete_transaction()->jbd2_log_wait_commit()                                 
      48 |        0.48 | (mysqld) | Running (ON CPU)       | entry_SYSCALL_64_fastpath()->SyS_poll()->do_sys_poll()->poll_schedule_timeout()                                                                                                    
      15 |        0.15 | (mysqld) | Running (ON CPU)       | int_ret_from_sys_call()->syscall_return_slowpath()->exit_to_usermode_loop()                                                                                                        
      11 |        0.11 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_fsync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->blkdev_issue_flush()->submit_bio_wait()                                                 
      11 |        0.11 | (mysqld) | Running (ON CPU)       | -                                                                                                                                                                                  
       9 |        0.09 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_fsync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->filemap_write_and_wait_range()->filemap_fdatawait_range()->__filemap_fdatawait_range()->wait_on_page_bit()
       7 |        0.07 | (mysqld) | Running (ON CPU)       | entry_SYSCALL_64_fastpath()->SyS_futex()->do_futex()->futex_wait()->futex_wait_queue_me()                                                                                          
       4 |        0.04 | (mysqld) | Running (ON CPU)       | retint_user()->prepare_exit_to_usermode()->exit_to_usermode_loop()                                                                                                                 
       3 |        0.03 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_pwrite64()->vfs_write()->__vfs_write()->new_sync_write()->ext4_file_write_iter()->__generic_file_write_iter()->generic_file_direct_write()->ext4_direct_IO()->__blockdev_direct_IO()->do_blockdev_direct_IO()
       2 |        0.02 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_fdatasync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->filemap_write_and_wait_range()->filemap_fdatawait_range()->__filemap_fdatawait_range()->wait_on_page_bit()
       2 |        0.02 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_pread64()->vfs_read()->__vfs_read()->new_sync_read()->generic_file_read_iter()->wait_on_page_bit_killable()                                       
       2 |        0.02 | (mysqld) | Running (ON CPU)       | entry_SYSCALL_64_fastpath()->SyS_io_getevents()->read_events()                                                                                                                     
       1 |        0.01 | (mysqld) | Running (ON CPU)       | entry_SYSCALL_64_fastpath()->SyS_fsync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->filemap_write_and_wait_range()->filemap_fdatawait_range()->__filemap_fdatawait_range()->wait_on_page_bit()
       1 |        0.01 | (mysqld) | Running (ON CPU)       | entry_SYSCALL_64_fastpath()->SyS_fsync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->jbd2_complete_transaction()->jbd2_log_wait_commit()                                     


samples: 100 (expected: 100)
total processes: 1, threads: 44
runtime: 5.01, measure time: 1.78

or even more, we can see wait channel and if it's disk I/O what file it is related to:

openxs@ao756:~/git/0xtools$ sudo psn -p `pidof mysqld` -G wchan,filename

Linux Process Snapper v0.18 by Tanel Poder [https://0x.tools]
Sampling /proc/syscall, wchan, stat for 5 seconds... finished.


=== Active Threads =====================================================================================================

 samples | avg_threads | comm     | state                  | wchan                 | filename
------------------------------------------------------------------------------------------------------------------------
      91 |        0.91 | (mysqld) | Disk (Uninterruptible) | jbd2_log_wait_commit  | /var/lib/mysql/ao756-bin.000092
      89 |        0.89 | (mysqld) | Disk (Uninterruptible) | jbd2_log_wait_commit  | /var/lib/mysql/ib_logfile1

      49 |        0.49 | (mysqld) | Running (ON CPU)       | 0                     |
       9 |        0.09 | (mysqld) | Disk (Uninterruptible) | wait_on_page_bit      | /var/lib/mysql/ib_logfile1
       7 |        0.07 | (mysqld) | Disk (Uninterruptible) | jbd2_log_wait_commit  | /var/lib/mysql/xb_doublewrite
       4 |        0.04 | (mysqld) | Disk (Uninterruptible) | jbd2_log_wait_commit  | /var/lib/mysql/sbtest/sbtest5.ibd
       3 |        0.03 | (mysqld) | Disk (Uninterruptible) | jbd2_log_wait_commit  | /var/lib/mysql/sbtest/sbtest4.ibd
       3 |        0.03 | (mysqld) | Running (ON CPU)       | poll_schedule_timeout |
       2 |        0.02 | (mysqld) | Disk (Uninterruptible) | wait_on_page_bit      | /var/lib/mysql/ao756-bin.000092
       1 |        0.01 | (mysqld) | Disk (Uninterruptible) | do_blockdev_direct_IO | /var/lib/mysql/xb_doublewrite
       1 |        0.01 | (mysqld) | Disk (Uninterruptible) | submit_bio_wait       | /var/lib/mysql/sbtest/sbtest2.ibd
       1 |        0.01 | (mysqld) | Disk (Uninterruptible) | submit_bio_wait       | /var/lib/mysql/sbtest/sbtest3.ibd
       1 |        0.01 | (mysqld) | Disk (Uninterruptible) | wait_on_page_bit      |
       1 |        0.01 | (mysqld) | Disk (Uninterruptible) | wait_on_page_bit      | /var/lib/mysql/sbtest/sbtest4.ibd
       1 |        0.01 | (mysqld) | Disk (Uninterruptible) | wait_on_page_bit      | /var/lib/mysql/sbtest/sbtest5.ibd
       1 |        0.01 | (mysqld) | Running (ON CPU)       | 0                     | /var/lib/mysql/ib_logfile1


samples: 100 (expected: 100)
total processes: 1, threads: 44
runtime: 5.03, measure time: 2.06

This is really cool and easy to use! This way I got reminded that actually I have binary logging enabled somewhere and it surely slows things down :)

The last tool to check today is schedlat. It had not provided me with any insights:

openxs@ao756:~/git/0xtools$ schedlat `pidof mysqld`
SchedLat by Tanel Poder [https://0x.tools]

PID=19951 COMM=mysqld

TIMESTAMP              %CPU   %LAT   %SLP
2021-01-08 18:52:38     0.0    0.0  100.0
2021-01-08 18:52:39     0.0    0.0  100.0
2021-01-08 18:52:40     0.0    0.0  100.0
2021-01-08 18:52:41     0.0    0.0  100.0
2021-01-08 18:52:42     0.0    0.4   99.6
2021-01-08 18:52:43     0.0    0.0  100.0
2021-01-08 18:52:44     0.0    0.0  100.0
2021-01-08 18:52:45     0.0    0.0  100.0
2021-01-08 18:52:46     0.0    0.0  100.0
2021-01-08 18:52:47     0.0    0.0  100.0
2021-01-08 18:52:48     0.0    0.0  100.0
2021-01-08 18:52:49     0.0    0.0  100.0
2021-01-08 18:52:50     0.0    0.0  100.0
2021-01-08 18:52:51     0.0    0.0  100.0
2021-01-08 18:52:52     0.0    0.0  100.0
2021-01-08 18:52:53     0.0    0.0  100.0
2021-01-08 18:52:54     0.0    0.0  100.0
2021-01-08 18:52:55     0.0    0.0  100.0
2021-01-08 18:52:56     0.0    0.0  100.0
2021-01-08 18:52:57     0.0    0.0  100.0
2021-01-08 18:52:58     0.0    0.0  100.0
2021-01-08 18:52:59     0.0    0.0  100.0
2021-01-08 18:53:00     0.0    0.0  100.0
2021-01-08 18:53:01     0.0    0.0  100.0
2021-01-08 18:53:02     0.0    0.0  100.0
2021-01-08 18:53:03     0.0    0.0  100.0
2021-01-08 18:53:04     0.0    0.0  100.0
2021-01-08 18:53:05     0.0    0.0  100.0
2021-01-08 18:53:06     0.0    0.0  100.0
2021-01-08 18:53:07     0.0    0.0  100.0
^CTraceback (most recent call last):
  File "/usr/bin/schedlat", line 36, in <module>
    time.sleep(1)
KeyboardInterrupt

I was not CPU-bound, the process was mostly sleeping. 

I suggest you to get inspired by the thread states chart below and article it got it copied from, and read other articled by Tanel Poder here. This is what I am going to do myself before trying to apply the tools in real life. The idea of this blog post was to show what is possible for a creative person while sampling thread states from /proc.

Nice illustration of generic threads states we care about for off-CPU sampling, provided by Brendann Gregg at http://www.brendangregg.com/offcpuanalysis.html

* * *

To summarize, why would anyone may prefer or need to use /proc sampling tools instead of other approaches:

  1. Unlike perf or eBPF tools, /proc is always there. It is present on old Linux versions, and poor man's sampling with shell scripts does not require installing anything else. 
  2. Some information about the processes is visible to all users, so there may be no need for root/sudo privileges that MySQLproduction DBAs often lack.
  3. It is easy to use to do low-overhead off-CPU profiling. While it is possible to enable tracing for off-cpu events in perf, it comes with a higher tracing overhead (and then overhead of post-processing these high-frequency events). eBPF tools can be used to reduce both, but it is still extra overhead on top of what is already there in /proc.

So I decided to add these tools and ad hoc /proc sampling scripts (and checking sar outputs, if any) to my emergency toolset while studying some complex MySQL performance problems. They fit well into the ovberall system of application-level tracing based on performance_schema, on-CPU sampling with perf and some bpftrace quick scripts. Use right tools for the job!

In the next, probably final post in this series, I'll try to present some good example of MySQL performance problem where /proc-based sampling tools really help to get useful insight.

Thursday, January 7, 2021

Linux /proc Filesystem for MySQL DBAs - Part II, Threads of the mysqld Process

It's common knowledge that MySQL server (mysqld process) is multi-threaded. Let me quote:

"The MySQL Server (mysqld) executes as a single OS process, with multiple threads executing concurrent activities. MySQL does not have its own thread implementation, but relies on the thread implementation of the underlying OS."

In the previous post we've seen that on Linux the /proc/[pid]/task/ subdirectory contains subdirectories of the form [tid]/, one for each thread of the process (pid). This is what I have for the Percona Server 5.7.32-35 I use on Ubuntu 16.04 as an example in this series of blog posts:

mysql> select connection_id(), version(), @@pid_file;
+-----------------+---------------+----------------------------+
| connection_id() | version()     | @@pid_file                 |
+-----------------+---------------+----------------------------+
|             167 | 5.7.32-35-log | /var/run/mysqld/mysqld.pid |
+-----------------+---------------+----------------------------+
1 row in set (0.00 sec)

mysql> \! sudo cat /var/run/mysqld/mysqld.pid
30580
mysql> \! ls -F /proc/30580/task
2389/  2394/   30582/  30586/  30590/  30594/  30600/  30604/  30608/  3620/
2390/  2488/   30583/  30587/  30591/  30597/  30601/  30605/  30609/
2391/  30580/  30584/  30588/  30592/  30598/  30602/  30606/  31000/
2393/  30581/  30585/  30589/  30593/  30599/  30603/  30607/  31002/
mysql>

I've also shown a poor man's monitoring shell script idea to study some /proc information for each thread. Let me present it nicely structured for the case of wchan check to see where thread waits (I started sysbench test with 16 threads to put some load and see more different wait channels):

openxs@ao756:~$ for dir in `ls /proc/30580/task`
> do
>   echo -n $dir': '
>   2>/dev/null sudo strings /proc/$dir/wchan
> done
2389: futex_wait_queue_me
2390: futex_wait_queue_me
2391: futex_wait_queue_me
2393: 2394: futex_wait_queue_me
2488: futex_wait_queue_me
30580: poll_schedule_timeout
30581: do_sigtimedwait
30582: read_events
...
30591: read_events
30592: futex_wait_queue_me
30593: hrtimer_nanosleep
30594: futex_wait_queue_me
30597: futex_wait_queue_me
30598: futex_wait_queue_me
30599: hrtimer_nanosleep
30600: futex_wait_queue_me
...
30607: futex_wait_queue_me
30608: do_sigtimedwait
30609: futex_wait_queue_me
31000: jbd2_log_wait_commit
31002: futex_wait_queue_me
3620: futex_wait_queue_me
...
6514: futex_wait_queue_me
openxs@ao756:~$

Note that wchan was empty for the thread with tid 2393 based on the output, so it was running at the moment.

But as comm and cmdline files content for MySQL threads is not changed, how do we know what each thread is doing or what kind of thread is it? Probably if we attach gdb... but the topic of this series is different. Luckily, in MySQL 5.7 we have a nice insight in the performance_schema.threads table, the thread_os_id column contains exactly what we need:

mysql> select thread_id, thread_os_id, name from performance_schema.threads where type = 'BACKGROUND';
+-----------+--------------+----------------------------------------+
| thread_id | thread_os_id | name                                   |
+-----------+--------------+----------------------------------------+
|         1 |        30580 | thread/sql/main                        |
|         2 |        30581 | thread/sql/thread_timer_notifier       |
|         3 |        30582 | thread/innodb/io_ibuf_thread           |
|         4 |        30584 | thread/innodb/io_read_thread           |
|         5 |        30583 | thread/innodb/io_log_thread            |
|         6 |        30585 | thread/innodb/io_read_thread           |
|         7 |        30586 | thread/innodb/io_read_thread           |
|         8 |        30587 | thread/innodb/io_read_thread           |
|         9 |        30588 | thread/innodb/io_write_thread          |
|        10 |        30589 | thread/innodb/io_write_thread          |
|        11 |        30590 | thread/innodb/io_write_thread          |
|        12 |        30591 | thread/innodb/io_write_thread          |
|        13 |        30592 | thread/innodb/page_cleaner_thread      |
|        14 |        30593 | thread/innodb/buf_lru_manager_thread   |
|        15 |        30594 | thread/innodb/srv_monitor_thread       |
|        17 |        30603 | thread/innodb/srv_worker_thread        |
|        18 |        30602 | thread/innodb/srv_worker_thread        |
|        19 |        30601 | thread/innodb/srv_worker_thread        |
|        20 |        30600 | thread/innodb/srv_purge_thread         |
|        21 |        30599 | thread/innodb/srv_master_thread        |
|        22 |        30598 | thread/innodb/srv_error_monitor_thread |
|        23 |        30597 | thread/innodb/srv_lock_timeout_thread  |
|        24 |        30605 | thread/innodb/dict_stats_thread        |
|        25 |        30604 | thread/innodb/buf_dump_thread          |
|        26 |        30608 | thread/sql/signal_handler              |
+-----------+--------------+----------------------------------------+
25 rows in set (0.05 sec)

This is what I've used to state in the previous post that 30592 was a page cleaner thread. Here we can see that threads waiting for timer are InnoDB's master thread and LRU manager thread that are woken up periodically... What was that thread 31000 waiting on something related to commit at the filesystem level? It was a user connection thread, obviously:

mysql> select thread_id, thread_os_id, type, name from performance_schema.threads where thread_os_id = 31000;
+-----------+--------------+------------+---------------------------+
| thread_id | thread_os_id | type       | name                      |
+-----------+--------------+------------+---------------------------+
|       217 |        31000 | FOREGROUND | thread/sql/one_connection |
+-----------+--------------+------------+---------------------------+
1 row in set (0.00 sec)

But are we really forced to run scripts and access /proc directly? No, and fine MySQL manual gives some hints, for example, about ps -L option:

"For Linux, THREAD_OS_ID corresponds to the value of the gettid() function. This value is exposed, for example, using the perf or ps -L commands, or in the proc file system (/proc/[pid]/task/[tid]). For more information, see the perf-stat(1), ps(1), and proc(5) man pages. "

So, with ps -L we can see the threads of the mysqld process:

openxs@ao756:~$ ps -L aux | grep -e PID -e `pidof mysqld` | more
USER       PID   LWP %CPU NLWP %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
...
mysql    30580 30580  0.0   38  7.7 746308 299448 ?       Sl   Jan02   0:00 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid
mysql    30580 30581  0.0   38  7.7 746308 299448 ?       Sl   Jan02   0:00 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid
mysql    30580 30582  0.0   38  7.7 746308 299448 ?       Sl   Jan02   0:06 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid
mysql    30580 30583  0.0   38  7.7 746308 299448 ?       Sl   Jan02   0:07 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid
mysql    30580 30584  0.0   38  7.7 746308 299448 ?       Sl   Jan02   0:06 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid
mysql    30580 30585  0.0   38  7.7 746308 299448 ?       Sl   Jan02   0:06 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid
...

 We can clearly see tid in the column titled LWP. grep allows us to apply conditions to the rows we see in the command line output. It is similar to the WHERE clause of SELECT statement. But as the rest of the information in the output looks the same for all threads we need something better in ps command line options to show the list of columns with useful values. Note that grep over all threads is actually not mandatory, we can use -p option to show threads only for the given pid:

openxs@ao756:~$ ps -L -p `pidof mysqld` | more
  PID   LWP TTY          TIME CMD
30580 30580 ?        00:00:00 mysqld
30580 30581 ?        00:00:00 mysqld
30580 30582 ?        00:00:06 mysqld
30580 30583 ?        00:00:07 mysqld
30580 30584 ?        00:00:06 mysqld
30580 30585 ?        00:00:06 mysqld
30580 30586 ?        00:00:07 mysqld
30580 30587 ?        00:00:06 mysqld
...

The real problem is the list of columns - the default one we see above is even less useful. Time to read man ps where we can find the -o option:

 -o format

              User-defined format.  format is a single argument in the
              form of a blank-separated or comma-separated list, which
              offers a way to specify individual output columns.  The
              recognized keywords are described in the STANDARD FORMAT
              SPECIFIERS section below. ...

So, after reading the details we are ready to proceed to listing all columns we need explicitly, like this:

openxs@ao756:~$ ps -o lwp,pcpu,state,wchan -L -p `pidof mysqld`
  LWP %CPU S WCHAN
30580  0.0 S -
30581  0.0 S -
...
30592  0.0 S -
...
30609  0.0 S -
 7192  0.6 D -
 7194  0.6 S -
 7195  0.6 S -
 7377  0.6 S -
 7379  0.6 S -
 9112  0.7 S -
 9115  0.7 S -
 9119  0.7 S -
 9120  0.7 D -
 9436  1.5 S -
...

We can also sort the output by different criteria either with ps options or with external sort as a kind of ORDER BY, do GROUP BY with sort + uniq etc. For example, we can summarize thread states in the process like this:

openxs@ao756:~$ ps -o state -L -p `pidof mysqld` | sort | uniq -c | sort -nbr
     37 S
      7 R
      1 D

The idea is clear, hopefully. For quick checks and summarizing it may be enough to use ps with options to show threads (-L) and specific fields (-o) and process the output by standard Linux command line utilities.

As a side note, you can monitor threads of the mysqld process with top -H option as well:

openxs@ao756:~$ top -version
  procps-ng version 3.3.10

Usage:

  top -hv | -bcHiOSs -d secs -n max -u|U user -p pid(s) -o field -w [cols]

openxs@ao756:~$ top -H -b -n 1 -o +%CPU -p `pidof mysqld`
top - 19:40:20 up 30 days,  7:47,  4 users,  load average: 4.08, 2.87, 2.06
Threads:  44 total,   9 running,  35 sleeping,   0 stopped,   0 zombie
%Cpu(s): 25.3 us,  6.3 sy,  5.2 ni, 59.7 id,  3.5 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem :  3859384 total,   188328 free,  1052156 used,  2618900 buff/cache
KiB Swap:  4001788 total,  2464336 free,  1537452 used.  2008620 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
10002 mysql     20   0  746308 307432   8872 D  6.2  8.0   0:15.70 mysqld
10977 mysql     20   0  746308 307432   8872 S  6.2  8.0   0:01.97 mysqld
10978 mysql     20   0  746308 307432   8872 R  6.2  8.0   0:01.95 mysqld
10981 mysql     20   0  746308 307432   8872 R  6.2  8.0   0:01.92 mysqld

30580 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:00.35 mysqld
30581 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:00.00 mysqld
30582 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:07.42 mysqld
30583 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:07.77 mysqld
30584 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:07.46 mysqld
30585 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:07.52 mysqld
30586 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:07.79 mysqld
30587 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:07.50 mysqld
30588 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:07.72 mysqld
30589 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:08.08 mysqld
30590 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:07.70 mysqld
30591 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:07.80 mysqld
30592 mysql     20   0  746308 307432   8872 S  0.0  8.0   3:09.57 mysqld
30593 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:04.68 mysqld
30594 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:00.91 mysqld
30597 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:04.62 mysqld
30598 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:07.12 mysqld
30599 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:08.67 mysqld
30600 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:41.00 mysqld
30601 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:11.73 mysqld
30602 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:12.69 mysqld
30603 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:14.40 mysqld
30604 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:00.00 mysqld
30605 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:01.74 mysqld
30606 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:00.97 mysqld
30607 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:00.00 mysqld
30608 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:00.00 mysqld
30609 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:00.00 mysqld
 7195 mysql     20   0  746308 307432   8872 S  0.0  8.0   1:34.78 mysqld
10106 mysql     20   0  746308 307432   8872 R  0.0  8.0   0:11.63 mysqld
10108 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:11.58 mysqld
10854 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:06.62 mysqld
10855 mysql     20   0  746308 307432   8872 R  0.0  8.0   0:06.69 mysqld
10856 mysql     20   0  746308 307432   8872 R  0.0  8.0   0:06.64 mysqld
10858 mysql     20   0  746308 307432   8872 R  0.0  8.0   0:06.64 mysqld
10859 mysql     20   0  746308 307432   8872 S  0.0  8.0   0:06.70 mysqld
10975 mysql     20   0  746308 307432   8872 R  0.0  8.0   0:01.94 mysqld
10976 mysql     20   0  746308 307432   8872 R  0.0  8.0   0:01.96 mysqld
10979 mysql     20   0  746308 307432   8872 R  0.0  8.0   0:01.91 mysqld
10980 mysql     20   0  746308 307432   8872 D  0.0  8.0   0:01.94 mysqld
openxs@ao756:~$

In the command above I asked to run top in the batch mode and capture the output once, sort it by highest CPU usage first. Configuring columns in the output (from a relatively small subset of what is visible via /proc) is possible the interactive mode and final set can be saved in the ~/.toprc file. Usually I do not care to do that.

Multi-threaded mooring of boats somewhere at Regent's Canal

* * *

So, for quick checks and summarizing different details about MySQL server threads from /proc you can use not only some poor man's monitoring shell scripts, but also ps -L and top -H commands with proper options and then process the output with usual Linux command line utilities like grep, sort and uniq. One can surely write more advanced tools working directly with /proc content, in C, Python etc if not just shell, and in the next post in this series we'll see some of them from https://0x.tools/ applied to MySQL. 

Also performance_schema.threads table in MySQL 5.7+ (and MariaDB 10.5+) allows to easily find out the role of each OS thread in the MySQL server and link the OS level and internal metrics for it.