Showing posts with label psn. Show all posts

Saturday, January 16, 2021

Linux /proc Filesystem for MySQL DBAs - Part IV, Creating Off-CPU Flame Graphs

My upcoming FOSDEM 2021 MySQL Devroom talk based on this series of blog posts is already prepared and recorded. But while preparing it I noted that more details are worth sharing than one can cover in a 20-minute talk. So I decided to continue the series, and today I am going to show what one can do with kernel stack samples collected from /proc, for example with the psn tool we discussed in the previous post.

There we saw that psn can summarize kernel stack samples, among other per-thread details. For example, we can get the following for Percona Server running under some sysbench load:

openxs@ao756:~$ sudo psn -G kstack -p `pidof mysqld`

Linux Process Snapper v0.18 by Tanel Poder [https://0x.tools]
Sampling /proc/stat, stack for 5 seconds... finished.


=== Active Threads =========================================================================================================================================================================================================================================================================================

 samples | avg_threads | comm     | state                  | kstack                                                                                                                                                                             
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      99 |        0.99 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_fsync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->jbd2_complete_transaction()->jbd2_log_wait_commit()                                     
      81 |        0.81 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_fdatasync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->jbd2_complete_transaction()->jbd2_log_wait_commit()                                 
      70 |        0.70 | (mysqld) | Running (ON CPU)       | entry_SYSCALL_64_fastpath()->SyS_poll()->do_sys_poll()->poll_schedule_timeout()                                                                                                    
...
       1 |        0.01 | (mysqld) | Running (ON CPU)       | entry_SYSCALL_64_fastpath()->SyS_sendto()->SYSC_sendto()->sock_sendmsg()->unix_stream_sendmsg()->sock_alloc_send_pskb()->alloc_skb_with_frags()->__alloc_skb()->__kmalloc_reserve.isra.34()->__kmalloc_node_track_caller()


samples: 100 (expected: 100)
total processes: 1, threads: 44
runtime: 5.02, measure time: 1.89

So we have the number of samples in the first column, the thread's command in the third, its state in the fourth, and the kernel stack, nicely formatted, in the fifth. That is, for each kernel stack trace we get the number of samples in which it was observed, for waiting threads among others.

Now let me recall where I've seen similarly structured information processed before. Here it is: "Off-CPU Analysis" by Brendan Gregg! There he described different approaches, /proc sampling among them, but mostly concentrated on his eBPF/bcc-based tool, offcputime, which summarizes off-CPU time by stack trace. I have used it on Fedora for a couple of years already, but on this old Ubuntu 16.04 it is not supposed to work (kernel 4.8+ is needed):

openxs@ao756:~$ uname -a
Linux ao756 4.4.0-198-generic #230-Ubuntu SMP Sat Nov 28 01:30:29 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
openxs@ao756:~$ sudo /usr/share/bcc/tools/offcputime -h
usage: offcputime [-h] [-p PID | -t TID | -u | -k] [-U | -K] [-d] [-f]
                  [--stack-storage-size STACK_STORAGE_SIZE]
                  [-m MIN_BLOCK_TIME] [-M MAX_BLOCK_TIME] [--state STATE]
                  [duration]

Summarize off-CPU time by stack trace

positional arguments:
  duration              duration of trace, in seconds

optional arguments:
  -h, --help            show this help message and exit
  -p PID, --pid PID     trace this PID only
  -t TID, --tid TID     trace this TID only
  -u, --user-threads-only
                        user threads only (no kernel threads)
  -k, --kernel-threads-only
                        kernel threads only (no user threads)
  -U, --user-stacks-only
                        show stacks from user space only (no kernel space
                        stacks)
  -K, --kernel-stacks-only
                        show stacks from kernel space only (no user space
                        stacks)
  -d, --delimited       insert delimiter between kernel/user stacks
  -f, --folded          output folded format
  --stack-storage-size STACK_STORAGE_SIZE
                        the number of unique stack traces that can be stored
                        and displayed (default 1024)
  -m MIN_BLOCK_TIME, --min-block-time MIN_BLOCK_TIME
                        the amount of time in microseconds over which we store
                        traces (default 1)
  -M MAX_BLOCK_TIME, --max-block-time MAX_BLOCK_TIME
                        the amount of time in microseconds under which we
                        store traces (default U64_MAX)
  --state STATE         filter on this thread state bitmask (eg, 2 ==
                        TASK_UNINTERRUPTIBLE) see include/linux/sched.h

examples:
    ./offcputime             # trace off-CPU stack time until Ctrl-C
    ./offcputime 5           # trace for 5 seconds only
    ./offcputime -f 5        # 5 seconds, and output in folded format
    ./offcputime -m 1000     # trace only events that last more than 1000 usec
    ./offcputime -M 10000    # trace only events that last less than 10000 usec
    ./offcputime -p 185      # only trace threads for PID 185
    ./offcputime -t 188      # only trace thread 188
    ./offcputime -u          # only trace user threads (no kernel)
    ./offcputime -k          # only trace kernel threads (no user)
    ./offcputime -U          # only show user space stacks (no kernel)
    ./offcputime -K          # only show kernel space stacks (no user)

openxs@ao756:~$ sudo /usr/share/bcc/tools/offcputime -K -f 5 -p `pidof mysqld`
could not open bpf map: stack_traces, error: Invalid argument
Traceback (most recent call last):
  File "/usr/share/bcc/tools/offcputime", line 235, in <module>
    b = BPF(text=bpf_text)
  File "/usr/lib/python2.7/dist-packages/bcc/__init__.py", line 325, in __init__
    raise Exception("Failed to compile BPF text")
Exception: Failed to compile BPF text

and it does not work, as you can see, with anything other than the -h option to get the usage details. That's unfortunate, as this slow netbook, with its 2 slow cores and encrypted slow HDD, could really benefit from Off-CPU analysis :) The offcputime tool output was then used to create nice flame graphs that help identify the most frequent kinds of wait:

# /usr/share/bcc/tools/offcputime -df -p `pgrep -nx mysqld` 30 > out.stacks
...
# git clone https://github.com/brendangregg/FlameGraph
# cd FlameGraph
# ./flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us < out.stacks > out.svg

A flame graph created this way shows all off-CPU stack traces, with stack depth on the y-axis and the width of each frame corresponding to the total time spent in that stack, along with application-level stacks.

I could not get these on my Ubuntu with offcputime, but I tried to find out what kind of output the tool produces with the -f option used above. We can see this without running the tool, from this file, https://github.com/iovisor/bcc/blob/master/tools/offcputime_example.txt:

"A -f option will emit output using the "folded stacks" format, which can be
read directly by flamegraph.pl from the FlameGraph open source software
(https://github.com/brendangregg/FlameGraph). Eg:

# ./offcputime -K -f 5
bash;entry_SYSCALL_64_fastpath;sys_read;vfs_read;...;schedule 8
..."

So, the format is simple: the first column is the program name and kernel stack trace, separated by ';', followed by the number of times this stack was noted for the program. I can surely convert the output of psn to such a simple format!

I've started a sysbench test like this:

openxs@ao756:~$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=root --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-db=sbtest --threads=16 --time=1200 --report-interval=10 run
sysbench 1.1.0-faaff4f (using bundled LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 16
Report intermediate results every 10 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 10s ] thds: 16 tps: 69.66 qps: 1423.20 (r/w/o: 997.63/284.64/140.92) lat (ms,95%): 383.33 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 16 tps: 79.81 qps: 1596.57 (r/w/o: 1117.32/319.63/159.62) lat (ms,95%): 320.17 err/s: 0.00 reconn/s: 0.00
...

and in another shell, after several attempts, filtered out and aggregated exactly what I needed (I'll skip the explanations of my lame awk and sed usage; this transformation can surely be done more efficiently and with fewer commands in the pipe):

openxs@ao756:~$ sudo psn -d 60 -G kstack | grep -v Running | awk -F\| '{ print $3, $5, $1 }' |  sed 's/->/;/g' | grep '^.(' | sed 's/(//g' | sed 's/)//g' | awk '{ print $1";"$2, $3 }' > /tmp/psnstacks.txt

Kernel stacks of all the processes were collected for 60 seconds, processed to match the format needed to build a flame graph, and saved into the file. 
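For those who prefer Python to awk/sed pipes, the same transformation can be sketched roughly like this (a hypothetical reimplementation, not what I actually used; it assumes the psn table layout shown above):

```python
def psn_to_folded(psn_output):
    """Convert psn -G kstack table rows into the folded stacks format
    accepted by flamegraph.pl: 'comm;frame1;frame2 count'."""
    folded = []
    for line in psn_output.splitlines():
        cols = [c.strip() for c in line.split('|')]
        if len(cols) != 5 or not cols[0].isdigit():
            continue                      # skip headers and separator lines
        samples, _avg, comm, state, kstack = cols
        if state.startswith('Running'):
            continue                      # keep off-CPU stacks only
        comm = comm.strip('()')           # '(mysqld)' -> 'mysqld'
        stack = kstack.replace('()', '').replace('->', ';')
        folded.append('%s;%s %s' % (comm, stack, samples))
    return folded

sample = """ samples | avg_threads | comm     | state                  | kstack
      99 |        0.99 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_fsync()->do_fsync()
      70 |        0.70 | (mysqld) | Running (ON CPU)       | entry_SYSCALL_64_fastpath()->SyS_poll()"""

for row in psn_to_folded(sample):
    print(row)
```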

The next step is to process it to create an .svg file with the graph:

openxs@ao756:~$ git/FlameGraph/flamegraph.pl --color=io --title="Off-CPU Time Flame Graph based on proc sampling" --countname=hits < /tmp/psnstacks.txt > ~/Documents/psn1.svg
openxs@ao756:~$ ls -l ~/Documents/psn1.svg
-rw-rw-r-- 1 openxs openxs 25999 Jan 16 15:02 /home/openxs/Documents/psn1.svg

The result (with a typo I made in the original command) was the following (this is a .png made from a screenshot of the browser used to view psn1.svg):

You can create Off-CPU flame graphs like the one above on any Linux 2.6.x by sampling /proc!


I see that basically most of the time when mysqld threads are not running is spent waiting in jbd2_log_wait_commit during fsync/fdatasync system calls. Now we can go to the Linux kernel source code and study what it is about:

/*
 * Wait for a specified commit to complete.
 * The caller may not hold the journal lock.
 */
int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
{
    int err = 0;

    read_lock(&journal->j_state_lock);
#ifdef CONFIG_PROVE_LOCKING
    /*
     * Some callers make sure transaction is already committing and in that
     * case we cannot block on open handles anymore. So don't warn in that
     * case.
     */
    if (tid_gt(tid, journal->j_commit_sequence) &&
        (!journal->j_committing_transaction ||
         journal->j_committing_transaction->t_tid != tid)) {
        read_unlock(&journal->j_state_lock);
        jbd2_might_wait_for_commit(journal);
        read_lock(&journal->j_state_lock);
    }
#endif
...

So we are waiting here for a commit of a write to the journal of the ext4 filesystem to complete. Fair enough.

* * *

To summarize: the fact that you do not have bcc tools installed, or run a kernel old enough that they do not work at all, does not prevent you from doing off-CPU analysis based on the /proc sampling approach. With some simple scripting and command-line text processing you can get folded stacks for further analysis with flame graphs. This may help to resolve some obscure MySQL performance problems that go beyond bad execution plans or spinning and wasting CPU cycles...

Moreover, the impact of such sampling is notably lower than with bcc tools. This is a topic I am going to study in my next blog post in this series, based on testing on a Fedora 31 box where all the tools just work.

Friday, January 8, 2021

Linux /proc Filesystem for MySQL DBAs - Part III, 0x.tools by Tanel Poder

In this third post of my "Linux /proc Filesystem for MySQL DBAs" series (see also Part I and Part II for the context and details) I am going to present a useful set of programs to access, summarize and record /proc details, created and recently shared by the famous Tanel Poder: 0x.tools. I'll try to build them and apply them to Percona Server 5.7.32-35 running on an Ubuntu 16.04 netbook fighting some read-write sysbench load.

To build the tools I clone them from GitHub and just run make as usual:

openxs@ao756:~/git$ git clone https://github.com/tanelpoder/0xtools
Cloning into '0xtools'...
remote: Enumerating objects: 103, done.
remote: Counting objects: 100% (103/103), done.
remote: Compressing objects: 100% (67/67), done.
remote: Total 186 (delta 53), reused 77 (delta 32), pack-reused 83
Receiving objects: 100% (186/186), 108.43 KiB | 0 bytes/s, done.
Resolving deltas: 100% (94/94), done.
Checking connectivity... done.
openxs@ao756:~/git$ cd 0xtools/
openxs@ao756:~/git/0xtools$ make
gcc -I include -Wall -o bin/xcapture src/xcapture.c
openxs@ao756:~/git/0xtools$ sudo make install
# for now the temporary "install" method is with symlinks
ln -s `pwd`/bin/xcapture /usr/bin/xcapture
ln -s `pwd`/bin/psn /usr/bin/psn
ln -s `pwd`/bin/schedlat /usr/bin/schedlat
openxs@ao756:~/git/0xtools$ file bin/xcapture
bin/xcapture: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=ad05c149b0379a29fef765ec4706e912843846ff, not stripped
openxs@ao756:~/git/0xtools$ file bin/psn
bin/psn: Python script, ASCII text executable
openxs@ao756:~/git/0xtools$ file bin/schedlat
bin/schedlat: Python script, ASCII text executable
openxs@ao756:~/git/0xtools$ python --version
Python 2.7.12

Nothing more than that for now: one executable and two Python scripts. This installation process should work well on any Linux where you can build non-ancient MySQL from source and that has some Python installed, starting from RHEL 5.

We can get some details about the programs usage as follows:

openxs@ao756:~/git/0xtools$ xcapture -h

0x.Tools xcapture v1.0 by Tanel Poder [https://0x.tools]

Usage:
  xcapture [options]

  By default, sample all /proc tasks in states R, D every second and print to stdout

  Options:
    -a             capture tasks in additional states, even the ones Sleeping (S)
    -A             capture tasks in All states, including Zombie (Z), Exiting (X), Idle (I)
    -c <c1,c2>     print additional columns (for example: -c exe,cmdline,kstack)
    -d <N>         seconds to sleep between samples (default: 1)
    -E <string>    custom task state Exclusion filter (default: XZIS)
    -h             display this help message
    -o <dirname>   write wide output into hourly CSV files in this directory instead of stdout

openxs@ao756:~/git/0xtools$ psn -h
usage: psn [-h] [-d seconds] [-p [pid]] [-t [tid]] [-r] [-a]
           [--sample-hz SAMPLE_HZ] [--ps-hz PS_HZ] [-o filename] [-i filename]
           [-s csv-columns] [-g csv-columns] [-G csv-columns] [--list]

optional arguments:
  -h, --help            show this help message and exit
  -d seconds            number of seconds to sample for
  -p [pid], --pid [pid]
                        process id to sample (including all its threads), or
                        process name regex, or omit for system-wide sampling
  -t [tid], --thread [tid]
                        thread/task id to sample (not implemented yet)
  -r, --recursive       also sample and report for descendant processes
  -a, --all-states      display threads in all states, including idle ones
  --sample-hz SAMPLE_HZ
                        sample rate in Hz (default: 20)
  --ps-hz PS_HZ         sample rate of new processes in Hz (default: 2)
  -o filename, --output-sample-db filename
                        path of sqlite3 database to persist samples to,
                        defaults to in-memory/transient
  -i filename, --input-sample-db filename
                        path of sqlite3 database to read samples from instead
                        of actively sampling
  -s csv-columns, --select csv-columns
                        additional columns to report
  -g csv-columns, --group-by csv-columns
                        columns to aggregate by in reports
  -G csv-columns, --append-group-by csv-columns
                        default + additional columns to aggregate by in
                        reports
  --list                list all available columns

The last one is also a Python script; it does not bother to provide an -h option, but shows the usage when called without arguments:

openxs@ao756:~/git/0xtools$ bin/schedlat
usage: bin/schedlat PID

There are also two shell scripts in bin/:

openxs@ao756:~/git/0xtools$ ls -F bin
psn*  run_xcapture.sh*  run_xcpu.sh*  schedlat*  xcapture*

The purposes of the tools are the following:

  • xcapture - low-overhead thread state sampler based on reading /proc files
  • psn - shows current top thread activity by sampling /proc files
  • schedlat - shows CPU scheduling latency for the given PID as a % of its runtime
  • run_xcapture.sh - a simple “daemon” script for keeping xcapture running
  • run_xcpu.sh - low-frequency continuous stack sampling for threads on CPU (using perf)
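To illustrate the idea behind schedlat, here is a rough sketch of how one might compute scheduling latency from /proc/<pid>/schedstat (my own illustration, not schedlat's actual code; the exact formula schedlat uses may differ):

```python
import os
import time

def parse_schedstat(text):
    """Parse /proc/<pid>/schedstat: 'on_cpu_ns waited_ns timeslices'."""
    on_cpu_ns, waited_ns, _timeslices = text.split()
    return int(on_cpu_ns), int(waited_ns)

def sched_latency_pct(pid, interval=1.0):
    """Runqueue wait as a percentage of (run + wait) time over the interval."""
    path = '/proc/%d/schedstat' % pid
    with open(path) as f:
        run1, wait1 = parse_schedstat(f.read())
    time.sleep(interval)
    with open(path) as f:
        run2, wait2 = parse_schedstat(f.read())
    d_run, d_wait = run2 - run1, wait2 - wait1
    total = d_run + d_wait
    return 100.0 * d_wait / total if total else 0.0

# Only works on Linux, where per-task schedstats are exposed in /proc
if os.path.exists('/proc/self/schedstat'):
    print('self latency: %.2f%%' % sched_latency_pct(os.getpid(), 0.2))
```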

The last script, run_xcpu.sh, is out of the scope of this series. We may get back to it in some future perf-related posts. It shows proper usage of perf for low-frequency (1 Hz), low-impact on-CPU profiling.

Let's see what we can get with xcapture:

openxs@ao756:~/git/0xtools$ mkdir /tmp/xcap
openxs@ao756:~/git/0xtools$ xcapture -a -o /tmp/xcap

0xTools xcapture v1.0 by Tanel Poder [https://0x.tools]

Sampling /proc...

^C
openxs@ao756:~/git/0xtools$

while sysbench test is running against my Percona Server:

openxs@ao756:~$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=root --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-db=sbtest --threads=16 --time=1200 --report-interval=10 run
...
[ 1200s ] thds: 16 tps: 70.99 qps: 1420.02 (r/w/o: 993.87/284.16/141.98) lat (ms,95%): 344.08 err/s: 0.00 reconn/s: 0.00
SQL statistics:
    queries performed:
        read:                            1214556
        write:                           347016
        other:                           173508
        total:                           1735080
    transactions:                        86754  (72.28 per sec.)
    queries:                             1735080 (1445.54 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

Throughput:
    events/s (eps):                      72.2771
    time elapsed:                        1200.2964s
    total number of events:              86754

Latency (ms):
         min:                                   89.13
         avg:                                  221.33
         max:                                  855.31
         95th percentile:                      356.70
         sum:                             19200998.85

Threads fairness:
    events (avg/stddev):           5422.1250/12.10
    execution time (avg/stddev):   1200.0624/0.11

To estimate the impact of capturing the information, here are the results of the same test run previously without xcapture:

...

    transactions:                        87376  (72.80 per sec.)
    queries:                             1747537 (1456.07 per sec.)
    ignored errors:                      1      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

Throughput:
    events/s (eps):                      72.8027
    time elapsed:                        1200.1747s
    total number of events:              87376

Latency (ms):
         min:                                   87.54
         avg:                                  219.75
         max:                                  843.19
         95th percentile:                      356.70
...

So we had 72.80 TPS and 1456 QPS over 1200 seconds without monitoring, and 72.28 TPS and 1446 QPS with it. Do not mind the absolute numbers (no tuning was performed, everything is at defaults, on an old netbook with a slow HDD), but the difference is really negligible.
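For the record, the overhead is easy to quantify:

```python
# Throughput from the two sysbench runs quoted above
tps_without, tps_with = 72.80, 72.28
qps_without, qps_with = 1456.07, 1445.54

tps_overhead = 100.0 * (tps_without - tps_with) / tps_without
qps_overhead = 100.0 * (qps_without - qps_with) / qps_without

print('TPS overhead: %.2f%%' % tps_overhead)   # about 0.7%
print('QPS overhead: %.2f%%' % qps_overhead)   # about 0.7%
```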

Now let's see what was captured (with -a, so all threads of all processes in the system, in all states, even S - sleeping):

openxs@ao756:~/git/0xtools$ ls -l /tmp/xcap/
total 56720
-rw-rw-r-- 1 openxs openxs 36310058 Jan  8 16:59 2021-01-08.16.csv
-rw-rw-r-- 1 openxs openxs 21768834 Jan  8 17:07 2021-01-08.17.csv
openxs@ao756:~/git/0xtools$ cat /tmp/xcap/2021-01-08.16.csv | head -30
TS,PID,TID,USERNAME,ST,COMMAND,SYSCALL,WCHAN,EXE,KSTACK
2021-01-08 16:47:24.414,1,1,root,S,(systemd),-,0,-,-
2021-01-08 16:47:24.414,2,2,root,S,(kthreadd),-,0,-,-
2021-01-08 16:47:24.414,3,3,root,S,(ksoftirqd/0),-,0,-,-
2021-01-08 16:47:24.414,5,5,root,S,(kworker/0:0H),-,0,-,-
2021-01-08 16:47:24.414,7,7,root,S,(rcu_sched),-,0,-,-
2021-01-08 16:47:24.414,8,8,root,S,(rcu_bh),-,0,-,-
2021-01-08 16:47:24.414,9,9,root,S,(migration/0),-,0,-,-
2021-01-08 16:47:24.414,10,10,root,S,(watchdog/0),-,0,-,-
2021-01-08 16:47:24.414,11,11,root,S,(watchdog/1),-,0,-,-
2021-01-08 16:47:24.414,12,12,root,S,(migration/1),-,0,-,-
2021-01-08 16:47:24.414,13,13,root,S,(ksoftirqd/1),-,0,-,-
2021-01-08 16:47:24.414,15,15,root,S,(kworker/1:0H),-,0,-,-
2021-01-08 16:47:24.414,16,16,root,S,(kdevtmpfs),-,0,-,-
2021-01-08 16:47:24.414,17,17,root,S,(netns),-,0,-,-
2021-01-08 16:47:24.414,18,18,root,S,(perf),-,0,-,-
2021-01-08 16:47:24.414,19,19,root,S,(khungtaskd),-,0,-,-
2021-01-08 16:47:24.414,20,20,root,S,(writeback),-,0,-,-
2021-01-08 16:47:24.414,21,21,root,S,(ksmd),-,0,-,-
2021-01-08 16:47:24.414,22,22,root,S,(khugepaged),-,0,-,-
2021-01-08 16:47:24.414,23,23,root,S,(crypto),-,0,-,-
2021-01-08 16:47:24.414,24,24,root,S,(kintegrityd),-,0,-,-
2021-01-08 16:47:24.414,25,25,root,S,(bioset),-,0,-,-
2021-01-08 16:47:24.414,26,26,root,S,(kblockd),-,0,-,-
2021-01-08 16:47:24.414,27,27,root,S,(ata_sff),-,0,-,-
2021-01-08 16:47:24.414,28,28,root,S,(md),-,0,-,-
2021-01-08 16:47:24.414,29,29,root,S,(devfreq_wq),-,0,-,-
2021-01-08 16:47:24.414,34,34,root,S,(kswapd0),-,0,-,-
2021-01-08 16:47:24.414,35,35,root,S,(vmstat),-,0,-,-
2021-01-08 16:47:24.414,36,36,root,S,(fsnotify_mark),-,0,-,-
openxs@ao756:~/git/0xtools$

Nothing that fancy in the first 30 lines. But you can then load the CSV files into a database for processing, or query them with standard Linux text processing tools. It's like SQL but with different keywords: grep for filtering, awk or sed for column projection, uniq for GROUP BY, sort for ordering, and head/tail for LIMIT. Let's count the number of threads per command, state and syscall, for example, and get the top 10:

openxs@ao756:~/git/0xtools$ cat /tmp/xcap/*.csv | awk -F, '{ printf("%-20s %-20s %s\n",$6,$5,$7) }' | sort | uniq -c | sort -nbr | head -10
  71442 (console-kit-dae)    S                    -
  69174 (gmain)              S                    read
  69173 (gdbus)              S                    read
  48762 (dconf               S                    read
  47628 (dockerd)            S                    -
  46455 (mysqld)             S                    -
  19196 (sysbench)           S                    read
  15876 (gmain)              S                    -
  15876 (bioset)             S                    -
  14742 (containerd)         S                    -

No wonder MySQL threads were mostly sleeping, with 16 of them and just 2 slow CPUs.
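The same kind of group-by can be sketched in Python with collections.Counter, should the pipe above get unwieldy (my own illustration, assuming the CSV columns shown above):

```python
import csv
from collections import Counter
from io import StringIO

def top_states(csv_text, n=10):
    """Count samples per (command, state, syscall), like
    awk '{print $6,$5,$7}' | sort | uniq -c | sort -nr | head."""
    counts = Counter()
    reader = csv.reader(StringIO(csv_text))
    next(reader)                               # skip TS,PID,TID,... header
    for row in reader:
        counts[(row[5], row[4], row[6])] += 1  # COMMAND, ST, SYSCALL
    return counts.most_common(n)

sample = """TS,PID,TID,USERNAME,ST,COMMAND,SYSCALL,WCHAN,EXE,KSTACK
2021-01-08 16:47:24.414,1,1,root,S,(systemd),-,0,-,-
2021-01-08 16:47:24.414,2,2,root,S,(kthreadd),-,0,-,-
2021-01-08 16:47:25.414,1,1,root,S,(systemd),-,0,-,-"""

for key, count in top_states(sample):
    print(count, key)
```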

The next tool is psn (Linux Process Snapper), a Python script for troubleshooting currently ongoing issues (no historical capture). It currently reports more fields directly from /proc than xcapture captures (such as the filenames accessed by I/O system calls). Let me try to apply psn to the mysqld process:

openxs@ao756:~/git/0xtools$ psn -p `pidof mysqld`

Linux Process Snapper v0.18 by Tanel Poder [https://0x.tools]
Sampling /proc/stat for 5 seconds... finished.


=== Active Threads ========================================

 samples | avg_threads | comm     | state
-----------------------------------------------------------
     205 |        2.05 | (mysqld) | Disk (Uninterruptible)
     114 |        1.14 | (mysqld) | Running (ON CPU)


samples: 100 (expected: 100)
total processes: 1, threads: 44
runtime: 5.01, measure time: 1.29

or to the entire system while sysbench test is running:

openxs@ao756:~/git/0xtools$ psn

Linux Process Snapper v0.18 by Tanel Poder [https://0x.tools]
Sampling /proc/stat for 5 seconds... finished.


=== Active Threads ================================================

 samples | avg_threads | comm             | state
-------------------------------------------------------------------
      94 |        2.24 | (mysqld)         | Disk (Uninterruptible)
      41 |        0.98 | (jbd*/dm-*-*)    | Disk (Uninterruptible)
      31 |        0.74 | (mysqld)         | Running (ON CPU)
      31 |        0.74 | (sysbench)       | Running (ON CPU)
      14 |        0.33 | (update-manager) | Running (ON CPU)
       5 |        0.12 | (kworker/u*:*)   | Running (ON CPU)
       2 |        0.05 | (Xorg)           | Running (ON CPU)
       1 |        0.02 | (compiz)         | Running (ON CPU)
       1 |        0.02 | (dockerd)        | Running (ON CPU)
       1 |        0.02 | (rcu_sched)      | Running (ON CPU)


samples: 42 (expected: 100)
total processes: 221, threads: 650
runtime: 5.06, measure time: 5.01

These are useful summaries of the number of threads per state per process over the monitoring period, but nothing really fancy yet. The real power is that we can add other columns from this long list, whose items should look familiar to anyone studying /proc content:

openxs@ao756:~/git/0xtools$ psn --list

stat
====================================
pid                              int
comm                             str
comm2                            str
state_id                         str
state                            str
ppid                             int
pgrp                             int
session                          int
tty_nr                           int
tpgid                            int
minflt                           int
cminflt                          int
majflt                           int
cmajflt                          int
utime                           long
stime                           long
cutime                          long
cstime                          long
utime_sec                       long
stime_sec                       long
cutime_sec                      long
cstime_sec                      long
priority                         int
nice                             int
num_threads                      int
starttime                       long
vsize                           long
rss                             long
rsslim                           str
exit_signal                      int
processor                        int
rt_priority                      int
delayacct_blkio_ticks           long
guest_time                       int
cgust_time                       int
exit_code                        int

status
====================================
name                             str
umask                            str
state                            str
tgid                             int
ngid                             int
pid                              int
ppid                             int
tracerpid                        int
uid                              int
gid                              int
fdsize                           int
groups                           str
nstgid                           str
nspid                            str
nspgid                           str
nssid                            str
vmpeak_kb                        int
vmsize_kb                        int
vmlck_kb                         int
vmpin_kb                         int
vmhwm_kb                         int
vmrss_kb                         int
rssanon_kb                       int
rssfile_kb                       int
rssshmem_kb                      int
vmdata_kb                        int
vmstk_kb                         int
vmexe_kb                         int
vmlib_kb                         int
vmpte_kb                         int
vmpmd_kb                         int
vmswap_kb                        int
hugetlbpages_kb                  int
threads                          int
sigq                             str
sigpnd                           str
shdpnd                           str
sigblk                           str
sigign                           str
sigcgt                           str
capinh                           str
capprm                           str
capeff                           str
capbnd                           str
capamb                           str
seccomp                          int
cpus_allowed                     str
cpus_allowed_list                str
mems_allowed                     str
mems_allowed_list                str
voluntary_ctxt_switches          int
nonvoluntary_ctxt_switches       int

syscall
====================================
syscall_id                       int
syscall                          str
arg0                             str
arg1                             str
arg2                             str
arg3                             str
arg4                             str
arg5                             str
filename                         str
filename2                        str
filenamesum                      str
basename                         str
dirname                          str

wchan
====================================
wchan                            str

io
====================================
rchar                            int
wchar                            int
syscr                            int
syscw                            int
read_bytes                       int
write_bytes                      int
cancelled_write_bytes            int

smaps
====================================
address_range                    str
perms                            str
offset                           str
dev                              str
inode                            int
pathname                         str
size_kb                          int
rss_kb                           int
pss_kb                           int
shared_clean_kb                  int
shared_dirty_kb                  int
private_clean_kb                 int
private_dirty_kb                 int
referenced_kb                    int
anonymous_kb                     int
anonhugepages_kb                 int
shmempmdmapped_kb                int
shared_hugetld_kb                int
private_hugetld_kb               int
swap_kb                          int
swappss_kb                       int
kernelpagesize_kb                int
mmupagesize_kb                   int
locked_kb                        int
vmflags                          str

stack
====================================
kstack                           str

cmdline
====================================
cmdline                          str
openxs@ao756:~/git/0xtools$

We can add syscall to find out what the threads were waiting on (not just the syscall number as stored in /proc, but the decoded syscall name; that's what requires some tricks or a good tool):

openxs@ao756:~/git/0xtools$ sudo psn -p `pidof mysqld` -G syscall

Linux Process Snapper v0.18 by Tanel Poder [https://0x.tools]
Sampling /proc/stat, syscall for 5 seconds... finished.


=== Active Threads ====================================================

 samples | avg_threads | comm     | state                  | syscall
-----------------------------------------------------------------------
     136 |        1.36 | (mysqld) | Disk (Uninterruptible) | fsync
      97 |        0.97 | (mysqld) | Disk (Uninterruptible) | fdatasync
      41 |        0.41 | (mysqld) | Running (ON CPU)       | [running]
       5 |        0.05 | (mysqld) | Running (ON CPU)       | poll
       4 |        0.04 | (mysqld) | Disk (Uninterruptible) | pwrite64
       3 |        0.03 | (mysqld) | Running (ON CPU)       | futex
       2 |        0.02 | (mysqld) | Disk (Uninterruptible) | write


samples: 100 (expected: 100)
total processes: 1, threads: 44
runtime: 5.02, measure time: 1.67
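Under the hood, the first field of /proc/<pid>/task/<tid>/syscall is just the syscall number (or the word "running" when the thread is on CPU, or -1 when it is not in a syscall). Translating the number to a name is the part psn solves with its bundled per-architecture syscall tables; a simplified sketch with a tiny hand-made x86_64 excerpt (the table below is illustrative and incomplete):

```python
import os

# Tiny excerpt of the x86_64 syscall table; psn ships full
# per-architecture tables generated from kernel headers.
SYSCALLS_X86_64 = {0: 'read', 1: 'write', 7: 'poll',
                   17: 'pread64', 18: 'pwrite64',
                   74: 'fsync', 75: 'fdatasync', 202: 'futex'}

def current_syscall(pid, tid):
    """Return the decoded name of the syscall a thread is blocked in,
    '[running]' if it is on CPU, or the raw number if unknown."""
    with open('/proc/%d/task/%d/syscall' % (pid, tid)) as f:
        first = f.read().split()[0]
    if first == 'running':
        return '[running]'
    nr = int(first)
    if nr < 0:
        return '[no syscall]'
    return SYSCALLS_X86_64.get(nr, str(nr))

pid = os.getpid()  # the main thread's tid equals the pid
print(current_syscall(pid, pid))
```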

We can further group by different kernel stacks:

openxs@ao756:~/git/0xtools$ sudo psn -p `pidof mysqld` -G kstack

Linux Process Snapper v0.18 by Tanel Poder [https://0x.tools]
Sampling /proc/stat, stack for 5 seconds... finished.


=== Active Threads =========================================================================================================================================================================================================================================================================================

 samples | avg_threads | comm     | state                  | kstack                                                                                                                                                                             
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     101 |        1.01 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_fsync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->jbd2_complete_transaction()->jbd2_log_wait_commit()                                     
      89 |        0.89 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_fdatasync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->jbd2_complete_transaction()->jbd2_log_wait_commit()                                 
      48 |        0.48 | (mysqld) | Running (ON CPU)       | entry_SYSCALL_64_fastpath()->SyS_poll()->do_sys_poll()->poll_schedule_timeout()                                                                                                    
      15 |        0.15 | (mysqld) | Running (ON CPU)       | int_ret_from_sys_call()->syscall_return_slowpath()->exit_to_usermode_loop()                                                                                                        
      11 |        0.11 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_fsync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->blkdev_issue_flush()->submit_bio_wait()                                                 
      11 |        0.11 | (mysqld) | Running (ON CPU)       | -                                                                                                                                                                                  
       9 |        0.09 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_fsync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->filemap_write_and_wait_range()->filemap_fdatawait_range()->__filemap_fdatawait_range()->wait_on_page_bit()
       7 |        0.07 | (mysqld) | Running (ON CPU)       | entry_SYSCALL_64_fastpath()->SyS_futex()->do_futex()->futex_wait()->futex_wait_queue_me()                                                                                          
       4 |        0.04 | (mysqld) | Running (ON CPU)       | retint_user()->prepare_exit_to_usermode()->exit_to_usermode_loop()                                                                                                                 
       3 |        0.03 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_pwrite64()->vfs_write()->__vfs_write()->new_sync_write()->ext4_file_write_iter()->__generic_file_write_iter()->generic_file_direct_write()->ext4_direct_IO()->__blockdev_direct_IO()->do_blockdev_direct_IO()
       2 |        0.02 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_fdatasync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->filemap_write_and_wait_range()->filemap_fdatawait_range()->__filemap_fdatawait_range()->wait_on_page_bit()
       2 |        0.02 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_pread64()->vfs_read()->__vfs_read()->new_sync_read()->generic_file_read_iter()->wait_on_page_bit_killable()                                       
       2 |        0.02 | (mysqld) | Running (ON CPU)       | entry_SYSCALL_64_fastpath()->SyS_io_getevents()->read_events()                                                                                                                     
       1 |        0.01 | (mysqld) | Running (ON CPU)       | entry_SYSCALL_64_fastpath()->SyS_fsync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->filemap_write_and_wait_range()->filemap_fdatawait_range()->__filemap_fdatawait_range()->wait_on_page_bit()
       1 |        0.01 | (mysqld) | Running (ON CPU)       | entry_SYSCALL_64_fastpath()->SyS_fsync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->jbd2_complete_transaction()->jbd2_log_wait_commit()                                     


samples: 100 (expected: 100)
total processes: 1, threads: 44
runtime: 5.01, measure time: 1.78
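The kstack column itself is built from /proc/<pid>/task/<tid>/stack (readable by root only), where each line looks like "[<0>] jbd2_log_wait_commit+0x89/0xc0" with the innermost frame first. psn collapses those frames into the arrow-separated chains shown above; a sketch of that transformation (the function name collapse_kstack is mine):

```python
def collapse_kstack(raw):
    """Turn a raw /proc/<pid>/stack dump into psn-style
    'outer()->inner()' notation. The kernel lists the innermost
    frame first, so we reverse the order for readability."""
    frames = []
    for line in raw.strip().splitlines():
        # line format: "[<0>] function_name+0x89/0xc0"
        func = line.split()[-1].split('+')[0]
        frames.append(func + '()')
    return '->'.join(reversed(frames))

sample = """[<0>] jbd2_log_wait_commit+0x89/0xc0
[<0>] jbd2_complete_transaction+0x5c/0x90
[<0>] ext4_sync_file+0x1f0/0x3e0
[<0>] do_fsync+0x38/0x60
[<0>] SyS_fsync+0xc/0x10
[<0>] entry_SYSCALL_64_fastpath+0x1e/0xa8"""
print(collapse_kstack(sample))
```

Running this on the sample above reproduces the first kstack line from the psn output.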

Or, even better, we can see the wait channel and, when it is disk I/O, which file it is related to:

openxs@ao756:~/git/0xtools$ sudo psn -p `pidof mysqld` -G wchan,filename

Linux Process Snapper v0.18 by Tanel Poder [https://0x.tools]
Sampling /proc/syscall, wchan, stat for 5 seconds... finished.


=== Active Threads =====================================================================================================

 samples | avg_threads | comm     | state                  | wchan                 | filename
------------------------------------------------------------------------------------------------------------------------
      91 |        0.91 | (mysqld) | Disk (Uninterruptible) | jbd2_log_wait_commit  | /var/lib/mysql/ao756-bin.000092
       89 |        0.89 | (mysqld) | Disk (Uninterruptible) | jbd2_log_wait_commit  | /var/lib/mysql/ib_logfile1
       49 |        0.49 | (mysqld) | Running (ON CPU)       | 0                     |
       9 |        0.09 | (mysqld) | Disk (Uninterruptible) | wait_on_page_bit      | /var/lib/mysql/ib_logfile1
       7 |        0.07 | (mysqld) | Disk (Uninterruptible) | jbd2_log_wait_commit  | /var/lib/mysql/xb_doublewrite
       4 |        0.04 | (mysqld) | Disk (Uninterruptible) | jbd2_log_wait_commit  | /var/lib/mysql/sbtest/sbtest5.ibd
       3 |        0.03 | (mysqld) | Disk (Uninterruptible) | jbd2_log_wait_commit  | /var/lib/mysql/sbtest/sbtest4.ibd
       3 |        0.03 | (mysqld) | Running (ON CPU)       | poll_schedule_timeout |
       2 |        0.02 | (mysqld) | Disk (Uninterruptible) | wait_on_page_bit      | /var/lib/mysql/ao756-bin.000092
       1 |        0.01 | (mysqld) | Disk (Uninterruptible) | do_blockdev_direct_IO | /var/lib/mysql/xb_doublewrite
       1 |        0.01 | (mysqld) | Disk (Uninterruptible) | submit_bio_wait       | /var/lib/mysql/sbtest/sbtest2.ibd
       1 |        0.01 | (mysqld) | Disk (Uninterruptible) | submit_bio_wait       | /var/lib/mysql/sbtest/sbtest3.ibd
       1 |        0.01 | (mysqld) | Disk (Uninterruptible) | wait_on_page_bit      |
       1 |        0.01 | (mysqld) | Disk (Uninterruptible) | wait_on_page_bit      | /var/lib/mysql/sbtest/sbtest4.ibd
       1 |        0.01 | (mysqld) | Disk (Uninterruptible) | wait_on_page_bit      | /var/lib/mysql/sbtest/sbtest5.ibd
       1 |        0.01 | (mysqld) | Running (ON CPU)       | 0                     | /var/lib/mysql/ib_logfile1


samples: 100 (expected: 100)
total processes: 1, threads: 44
runtime: 5.03, measure time: 2.06
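The wchan column comes from /proc/<pid>/wchan (a single kernel function name, or "0" when the thread is runnable), and filenames are resolved by following the /proc/<pid>/fd/* symlinks for the file descriptor involved in the current syscall. A rough sketch of both lookups (function names are mine):

```python
import os

def wchan(pid):
    """Kernel function the process is sleeping in, or '0' if runnable."""
    with open('/proc/%d/wchan' % pid) as f:
        return f.read().strip()

def open_files(pid):
    """Resolve /proc/<pid>/fd/* symlinks to pathnames."""
    fd_dir = '/proc/%d/fd' % pid
    files = {}
    for fd in os.listdir(fd_dir):
        try:
            files[int(fd)] = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            pass  # the fd may be closed between listdir() and readlink()
    return files

pid = os.getpid()
print(wchan(pid))
print(open_files(pid))
```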

This is really cool and easy to use! This way I was reminded that I actually still have binary logging enabled, and it surely slows things down :)

The last tool to check today is schedlat, which shows per-second percentages of time a process spent running on CPU, waiting in the runqueue, and sleeping. In this case it did not provide me with any new insights:

openxs@ao756:~/git/0xtools$ schedlat `pidof mysqld`
SchedLat by Tanel Poder [https://0x.tools]

PID=19951 COMM=mysqld

TIMESTAMP              %CPU   %LAT   %SLP
2021-01-08 18:52:38     0.0    0.0  100.0
2021-01-08 18:52:39     0.0    0.0  100.0
2021-01-08 18:52:40     0.0    0.0  100.0
2021-01-08 18:52:41     0.0    0.0  100.0
2021-01-08 18:52:42     0.0    0.4   99.6
2021-01-08 18:52:43     0.0    0.0  100.0
2021-01-08 18:52:44     0.0    0.0  100.0
2021-01-08 18:52:45     0.0    0.0  100.0
2021-01-08 18:52:46     0.0    0.0  100.0
2021-01-08 18:52:47     0.0    0.0  100.0
2021-01-08 18:52:48     0.0    0.0  100.0
2021-01-08 18:52:49     0.0    0.0  100.0
2021-01-08 18:52:50     0.0    0.0  100.0
2021-01-08 18:52:51     0.0    0.0  100.0
2021-01-08 18:52:52     0.0    0.0  100.0
2021-01-08 18:52:53     0.0    0.0  100.0
2021-01-08 18:52:54     0.0    0.0  100.0
2021-01-08 18:52:55     0.0    0.0  100.0
2021-01-08 18:52:56     0.0    0.0  100.0
2021-01-08 18:52:57     0.0    0.0  100.0
2021-01-08 18:52:58     0.0    0.0  100.0
2021-01-08 18:52:59     0.0    0.0  100.0
2021-01-08 18:53:00     0.0    0.0  100.0
2021-01-08 18:53:01     0.0    0.0  100.0
2021-01-08 18:53:02     0.0    0.0  100.0
2021-01-08 18:53:03     0.0    0.0  100.0
2021-01-08 18:53:04     0.0    0.0  100.0
2021-01-08 18:53:05     0.0    0.0  100.0
2021-01-08 18:53:06     0.0    0.0  100.0
2021-01-08 18:53:07     0.0    0.0  100.0
^CTraceback (most recent call last):
  File "/usr/bin/schedlat", line 36, in <module>
    time.sleep(1)
KeyboardInterrupt

The server was not CPU-bound; the process was mostly sleeping.
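schedlat is a tiny script built on /proc/<pid>/schedstat, which exposes three counters: nanoseconds spent on CPU, nanoseconds spent waiting on a runqueue (the scheduling latency), and the number of timeslices run. The %CPU and %LAT columns above are just deltas of the first two counters over one-second intervals; a minimal sketch of the same idea (assuming the kernel exposes schedstat, which is the default on common distributions):

```python
import os
import time

def schedstat(pid):
    """Return (cpu_ns, runqueue_wait_ns, timeslices) for a process."""
    with open('/proc/%d/schedstat' % pid) as f:
        return tuple(int(x) for x in f.read().split())

def sample(pid, interval=1.0):
    """Print %CPU / %LAT / %SLP over one interval, like schedlat does."""
    cpu1, lat1, _ = schedstat(pid)
    time.sleep(interval)
    cpu2, lat2, _ = schedstat(pid)
    pct_cpu = (cpu2 - cpu1) / (interval * 1e9) * 100
    pct_lat = (lat2 - lat1) / (interval * 1e9) * 100
    pct_slp = max(0.0, 100 - pct_cpu - pct_lat)
    print('%%CPU %5.1f  %%LAT %5.1f  %%SLP %5.1f' % (pct_cpu, pct_lat, pct_slp))

sample(os.getpid(), 0.2)
```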

I suggest you get inspired by the thread states chart below and the article it was copied from, and read other articles by Tanel Poder here. This is what I am going to do myself before trying to apply these tools in real life. The idea of this blog post was to show what is possible for a creative person sampling thread states from /proc.

A nice illustration of the generic thread states we care about for off-CPU sampling, provided by Brendan Gregg at http://www.brendangregg.com/offcpuanalysis.html

* * *

To summarize, here is why one may prefer or need to use /proc sampling tools instead of other approaches:

  1. Unlike perf or eBPF tools, /proc is always there. It is present on old Linux versions, and poor man's sampling with shell scripts does not require installing anything else. 
  2. Some information about the processes is visible to all users, so there may be no need for the root/sudo privileges that MySQL production DBAs often lack.
  3. It is easy to use for low-overhead off-CPU profiling. While it is possible to enable tracing of off-CPU events in perf, that comes with a higher tracing overhead (and then the overhead of post-processing these high-frequency events). eBPF tools can reduce both, but they still add extra overhead on top of what is already there in /proc.
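As a taste of point 1, even a dependency-free loop over /proc/<pid>/stat already gives a crude state profile, counting how often the process was seen in each state (R = running, S = sleeping, D = uninterruptible disk sleep). This is a hypothetical sketch of such a poor man's sampler, not one of the 0x.tools scripts:

```python
import os
import time
from collections import Counter

def sample_states(pid, samples=50, interval=0.01):
    """Poor man's profiler: repeatedly read the state field of
    /proc/<pid>/stat and tally how often each state was seen."""
    states = Counter()
    for _ in range(samples):
        with open('/proc/%d/stat' % pid) as f:
            # the state letter is the first field after the
            # parenthesized comm (which itself may contain spaces)
            states[f.read().rsplit(')', 1)[1].split()[0]] += 1
        time.sleep(interval)
    return states

print(sample_states(os.getpid()))
```

With a high enough sample count this converges to the same state distribution psn reports in its avg_threads column.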

So I decided to add these tools and ad hoc /proc sampling scripts (plus checking sar outputs, if any) to my emergency toolset for studying complex MySQL performance problems. They fit well into the overall system of application-level tracing based on performance_schema, on-CPU sampling with perf, and quick bpftrace scripts. Use the right tools for the job!

In the next, probably final post in this series, I'll try to present a good example of a MySQL performance problem where /proc-based sampling tools really help to get useful insights.