Wednesday, January 6, 2021

Linux /proc Filesystem for MySQL DBAs - Part I, Basics

Happy New Year 2021, dear readers of my blog!

We used to have real winters at this time of the year. Not any more...

Among other good things that happened on December 31, 2020, I was informed that my talk "Linux /proc filesystem for MySQL DBAs" was accepted for FOSDEM 2021 MySQL devroom. So, it's time to get back to blogging that I abandoned for a while in favor of this YouTube channel, and share some details to refer to on my slides.

This is not my first talk or blog post in "Something for MySQL DBAs" series. There were some in the past, like these:

My colleague once said that eventually I have to end up with something like "/dev/null for MySQL DBAs" kind of talk. I am not yet there, but I still think that the more OS level tools DBAs know the better they can do the job.

The idea of talking about /proc filesystem was inspired by the fact that I use some files there on a regular basis while doing my job and by this great talk by Tanel Poder and his 0x.tools, a small set of open-source utilities for analyzing application performance on Linux based mostly on /proc sampling. 

I am planning to cover most part of the upcoming talk in a mini-series of three blog posts. In this post I am going to present some basic information about the /proc filesystem and some files and directories there that would be surely useful for MySQL DBAs, as well as (who could imagine) some related MySQL bug reports. In another one I'll discuss how threads in the process are represented in the /proc filesystem (as well as ps and top outputs). Finally in the third post I am going to show some use cases for the 0x.tools and summarize the benefits and use cases for /proc sampling approaches.

* * *

Basically, the proc filesystem is a pseudo-filesystem which provides an interface to kernel data structures. It is commonly mounted at /proc:   

openxs@ao756:~$ cat /etc/issue
Ubuntu 16.04.7 LTS \n \l

openxs@ao756:~$ mount | grep '/proc'
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
...

Most  of it is read-only, but some files allow to change kernel variables. For the purpose of this discussion I skip mount options and most of the /proc/* files and proceed to /proc subdirectories, one per each process running, named after PID. I have the following mysqld process (of Percona Server 5.7.x) running:

openxs@ao756:~$ ps aux | grep mysqld
...
mysql    30580  0.7  8.1 746308 313984 ?       Sl   Jan02   9:55 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid

So there will be the /proc/30580 directory with the following content:

openxs@ao756:~$ ls -F /proc/30580
ls: cannot read symbolic link '/proc/30580/cwd': Permission denied
ls: cannot read symbolic link '/proc/30580/root': Permission denied
ls: cannot read symbolic link '/proc/30580/exe': Permission denied
attr/            cpuset   limits      net/           projid_map  stat
autogroup        cwd@     loginuid    ns/            root@       statm
auxv             environ  map_files/  numa_maps      sched       status
cgroup           exe@     maps        oom_adj        schedstat   syscall
clear_refs       fd/      mem         oom_score      sessionid   task/
cmdline          fdinfo/  mountinfo   oom_score_adj  setgroups   timers
comm             gid_map  mounts      pagemap        smaps       uid_map
coredump_filter  io       mountstats  personality    stack       wchan

I highlighted the files and directories I consider most useful. Note also "Permission denied" messages above that you may get while accessing some files in /proc, even related to the processes you own. You may still need root/sudo access (or belong to some dedicated group) to read them.

Now let me give short descriptions or hints about the content of the highlighted files and directories (see man 5 proc to get more details):

  • task/[tid] - task subdirectory contains subdirectories of the form task/[tid], which contain corresponding information about each of the threads in the process, where tid is the kernel thread ID of the thread. In my case:

    openxs@ao756:~$ ls -F /proc/30580/task/
    2488/  30580/  30584/  30588/  30592/  30598/  30602/  30606/  31972/  3622/
    2493/  30581/  30585/  30589/  30593/  30599/  30603/  30607/  3618/
    2800/  30582/  30586/  30590/  30594/  30600/  30604/  30608/  3620/
    2805/  30583/  30587/  30591/  30597/  30601/  30605/  30609/  3621/

    Each of them in turn has files and subdirectories similar to those /proc/pid ones that I discuss below:

    openxs@ao756:~$ sudo ls -F /proc/30580/task/30592
    attr/       cpuset   io         net/           personality  smaps    wchan
    auxv        cwd@     limits     ns/            projid_map   stack
    cgroup      environ  loginuid   numa_maps      root@        stat
    children    exe@     maps       oom_adj        sched        statm
    clear_refs  fd/      mem        oom_score      schedstat    status
    cmdline     fdinfo/  mountinfo  oom_score_adj  sessionid    syscall
    comm        gid_map  mounts     pagemap        setgroups    uid_map


  • cmdline - this read-only file contains the command line (argv) that the process wants you to see, as strings terminated by null bytes ('\0'). This is how you can check the content to see each string separately:

    openxs@ao756:~$ strings /proc/30580/cmdline
    /usr/sbin/mysqld
    --daemonize
    --pid-file=/var/run/mysqld/mysqld.pid

  • comm - command name (up to 16 characters including the terminating null byte, longer values truncated) associated with the process. In my case:

    openxs@ao756:~$ cat /proc/30580/comm
    mysqld

    Individual threads can set different comm values. There is a useful feature request for MySQL threads to be named according to their role. It is Bug #70858 - "Set thread name" by DaniĆ«l van Eeden.

  • coredump_filter - this file can be used to control which memory segments are written to the core dump file in the event that a core dump is performed for the process with the corresponding PID. The value in the file is a bit mask of memory mapping types:
    bit 0  Dump anonymous private mappings.
    bit 1  Dump anonymous shared mappings.
    bit 2  Dump file-backed private mappings.
    bit 3  Dump file-backed shared mappings.
    bit 4  Dump ELF headers.
    bit 5  Dump private huge pages.
    ...

    By default bits 0, 1, 4 and 5 are set, so we see hex value 33 in the file:

    openxs@ao756:~$ cat /proc/30580/coredump_filter
    00000033

    We can control core dump size to some extent by writing to this file. Other, way more important files in /proc that are related to core dumps are presented in man 5 core and in other sources. Read them to decode, for example, this output that is typical for systemd-based systems (and do not look for the core.* files related to the mysqld process in the datadir desperately after that):

    openxs@ao756:~$ cat /proc/sys/kernel/core_pattern
    |/usr/share/apport/apport %p %s %c %d %P %E


  • environ - this file contains the initial environment that was set when the currently executing program was started. The entries are separated by null bytes, so you can check them like this:

    openxs@ao756:~$ sudo strings /proc/30580/environ
    LANG=en_US.UTF-8
    ...
    LC_TIME=uk_UA.UTF-8
    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
    HOME=/nonexistent
    LOGNAME=mysql
    USER=mysql
    SHELL=/bin/false
    STARTTIMEOUT=120
    STOPTIMEOUT=600
    LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1

    Note the last 3 variables above. They show systemd timeouts for the unit and jemalloc preloaded.

  • fd/ - this subdirectory contains one entry for each file which the process has open. The name is its file descriptor and it is a symbolic link to the actual file:

    openxs@ao756:~$ sudo ls -l /proc/30580/fd/
    total 0
    lr-x------ 1 mysql mysql 64 Jan  5 19:09 0 -> /dev/null
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 1 -> socket:[822912]
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 10 -> /var/lib/mysql/ib_logfile1
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 11 -> /var/lib/mysql/ibdata1
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 12 -> /var/lib/mysql/xb_doublewrite
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 13 -> /tmp/ibGEsYuO (deleted)
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 14 -> /var/lib/mysql/ibtmp1
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 15 -> /var/lib/mysql/sbtest/sbtest3.ibd
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 16 -> /var/lib/mysql/mysql/plugin.ibd
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 17 -> /var/lib/mysql/mysql/gtid_executed.ibd
    l-wx------ 1 mysql mysql 64 Jan  5 19:09 18 -> /var/lib/mysql/ao756-bin.000089
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 19 -> socket:[822935]
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 2 -> socket:[822912]
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 20 -> socket:[822936]
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 21 -> /var/lib/mysql/mysql/server_cost.ibd
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 22 -> /var/lib/mysql/mysql/engine_cost.ibd
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 23 -> /var/lib/mysql/mysql/db.MYI
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 24 -> /var/lib/mysql/mysql/db.MYD
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 25 -> /var/lib/mysql/mysql/user.MYI
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 26 -> /var/lib/mysql/mysql/user.MYD
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 27 -> /var/lib/mysql/mysql/event.MYI
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 28 -> /var/lib/mysql/mysql/event.MYD
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 29 -> /var/lib/mysql/mysql/time_zone_leap_second.ibd
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 3 -> /var/lib/mysql/ao756-bin.index
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 30 -> /var/lib/mysql/mysql/time_zone_name.ibd
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 31 -> /var/lib/mysql/mysql/time_zone.ibd
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 32 -> /var/lib/mysql/mysql/time_zone_transition_type.ibd
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 33 -> /var/lib/mysql/mysql/time_zone_transition.ibd
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 34 -> /var/lib/mysql/mysql/innodb_table_stats.ibd
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 35 -> /var/lib/mysql/sbtest/sbtest2.ibd
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 36 -> /var/lib/mysql/sbtest/sbtest4.ibd
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 39 -> /var/lib/mysql/sbtest/sbtest1.ibd
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 4 -> /var/lib/mysql/sbtest/sbtest5.ibd
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 40 -> /var/lib/mysql/mysql/servers.ibd
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 41 -> /var/lib/mysql/mysql/slave_master_info.ibd
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 42 -> /var/lib/mysql/mysql/slave_relay_log_info.ibd
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 43 -> /var/lib/mysql/mysql/slave_worker_info.ibd
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 45 -> /var/lib/mysql/mysql/innodb_index_stats.ibd
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 5 -> /var/lib/mysql/ib_logfile0
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 6 -> /tmp/ibmKNTQP (deleted)
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 7 -> /tmp/ib2GbIR1 (deleted)
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 8 -> /tmp/ibME5wSd (deleted)
    lrwx------ 1 mysql mysql 64 Jan  5 19:09 9 -> /tmp/ibyiPjXB (deleted)

    In the above we see stderr (descriptor 2) is pointing out to socket:[822912]. We can get some more information about this inode in /proc/net and in other sources:

    openxs@ao756:~$ strings /proc/net/unix | grep 822912
    0000000000000000: 00000003 00000000 00000000 0001 03 822912
    openxs@ao756:~$ sudo netstat -n --program | grep 822912
    unix  3      [ ]         STREAM     CONNECTED     822912   30580/mysqld

    The information in fd/ is similar to what you may get from lsof, but it is "always there" while lsof must be installed separately.

  • fdinfo/ - files in this subdirectory provide more information about the corresponding file descriptor. Let's check 3 -> /var/lib/mysql/ao756-bin.index:

    openxs@ao756:~$ sudo cat /proc/30580/fdinfo/3
    pos:    1691
    flags:  0100002
    mnt_id: 24
    openxs@ao756:~$ sudo ls -l /var/lib/mysql/ao756-bin.index
    -rw-r----- 1 mysql mysql 1691 Jan  2 17:35 /var/lib/mysql/ao756-bin.index

    and then some InnoDB file, for example, 5 -> /var/lib/mysql/ib_logfile0:

    openxs@ao756:~$ sudo cat /proc/30580/fdinfo/5
    pos:    0
    flags:  0100002
    mnt_id: 24
    lock:   1: POSIX  ADVISORY  WRITE 30580 fc:01:11143169 0 EOF
    openxs@ao756:~$ cat /proc/30580/mountinfo | grep ^'24 '
    24 0 252:1 / / rw,relatime shared:1 - ext4 /dev/mapper/ubuntu--vg-root rw,errors=remount-ro,data=ordered

    As we can see the details include file offset (pos), octal number that displays the file access mode and file status flags (flags), id of the mountpoint for the file that we were able to find in /proc/$PID/mountinfo (/ file system in this case), and some other details depending on file type. For redo log we see lock record with the details about the write lock set on this file by the mysqld process. You can see it among all the file locks in /proc/locks too:

    openxs@ao756:~$ cat /proc/locks | grep fc:01:11143169
    30: POSIX  ADVISORY  WRITE 30580 fc:01:11143169 0 EOF

    See "File locking in Linux" by Victor Gaydov for much more details.

  • io - this file contains I/O statistics for the process, for example:

    openxs@ao756:~$ sudo cat /proc/30580/io
    rchar: 198026951
    wchar: 2365236669
    syscr: 8014
    syscw: 92558
    read_bytes: 49184768
    write_bytes: 3394945024
    cancelled_write_bytes: 171876352

    Here we can see number of characters read and written (maybe just to the pagecache, without physical I/O), number of system calls to read and write, number of bytes read and written to the storage and, in cancelled_write_bytes, number of written bytes which this process caused to not happen, by truncating pagecache (when deleting the file created, for example).

    It's important to note that the same information is properly tracked for the individual threads. For example, this is the page cleaner thread of our Percona Server (more on why I am so sure while comm is not set is in the next post):

    openxs@ao756:~$ sudo cat /proc/30580/task/30592/io
    rchar: 0
    wchar: 912518825
    syscr: 0
    syscw: 2092
    read_bytes: 0
    write_bytes: 1813786624
    cancelled_write_bytes: 0

    and we can see that it did notable share of all writes.

  • limits - this file provides the details about the actual resource limits for the process:

    This is my personal most often used file in the /proc filesystem, at least while working on support issues with customers.

    openxs@ao756:~$ cat /proc/30580/limits
    Limit                     Soft Limit           Hard Limit           Units
    Max cpu time              unlimited            unlimited            seconds
    Max file size             unlimited            unlimited            bytes
    Max data size             unlimited            unlimited            bytes
    Max stack size            8388608              unlimited            bytes
    Max core file size        0                    unlimited            bytes
    Max resident set          unlimited            unlimited            bytes
    Max processes             14893                14893                processes
    Max open files            1024                 4096                 files
    Max locked memory         65536                65536                bytes
    Max address space         unlimited            unlimited            bytes
    Max file locks            unlimited            unlimited            locks
    Max pending signals       14893                14893                signals
    Max msgqueue size         819200               819200               bytes
    Max nice priority         0                    0
    Max realtime priority     0                    0
    Max realtime timeout      unlimited            unlimited            us

    I highlighted rows most imortant in the context of MySQL server - limit on number of processes (and thus threads and connections, thread pool aside) and limit on number of open files. Users often assume values very different from those actually used, for various reasons... Note that on modern kernels one does NOT need root/sudo to check the limits of process that belongs to a different user.

  • maps - this file contains currently mapped memory regions and their access permissions. It is also very imortant to check in case of MySQL server. Consider the following examle:

    openxs@ao756:~$ sudo cat /proc/30580/maps | more
    00400000-019d6000 r-xp 00000000 fc:01 12066751                           /usr/sbin/mysqld
    01bd6000-01cb7000 r--p 015d6000 fc:01 12066751                           /usr/sbin/mysqld
    01cb7000-01d67000 rw-p 016b7000 fc:01 12066751                           /usr/sbin/mysqld
    01d67000-01e48000 rw-p 00000000 00:00 0
    7fe19d400000-7fe1a2000000 rw-p 00000000 00:00 0
    ...
    7fe1caa17000-7fe1caa4a000 r-xp 00000000 fc:01 12064860                   /usr/lib/x86_64-linux-gnu/libjemalloc.so.1
    7fe1caa4a000-7fe1cac4a000 ---p 00033000 fc:01 12064860                   /usr/lib/x86_64-linux-gnu/libjemalloc.so.1
    7fe1cac4a000-7fe1cac4c000 r--p 00033000 fc:01 12064860                   /usr/lib/x86_64-linux-gnu/libjemalloc.so.1
    7fe1cac4c000-7fe1cac4d000 rw-p 00035000 fc:01 12064860                   /usr/lib/x86_64-linux-gnu/libjemalloc.so.1
    ...
    7fe1cae33000-7fe1cae38000 rw-s 00000000 00:0d 822928                     /[aio]
    (deleted)
    7fe1cae38000-7fe1cae42000 rw-p 00000000 00:00 0
    7fe1cae42000-7fe1cae44000 rw-s 00000000 00:0d 822929                     /[aio]
    (deleted)
    ...
    7fe1cae73000-7fe1cae74000 r--p 00025000 fc:01 1311186                    /lib/x86_64-linux-gnu/ld-2.23.so
    7fe1cae74000-7fe1cae75000 rw-p 00026000 fc:01 1311186                    /lib/x86_64-linux-gnu/ld-2.23.so
    7fe1cae75000-7fe1cae76000 rw-p 00000000 00:00 0
    7ffe2f559000-7ffe2f57d000 rw-p 00000000 00:00 0                          [stack]
    7ffe2f5c6000-7ffe2f5c8000 r--p 00000000 00:00 0                          [vvar]
    7ffe2f5c8000-7ffe2f5ca000 r-xp 00000000 00:00 0                          [vdso]
    ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

    The format of the output lines is simple: range of addresses in the address space of the process, permissions, offset, device (major:minor), inode and pathname. Permissions r/w/x have usual meanings, while s means shared and p - private. Offset represents offset in the related file, if any. Device represents where the inode is located amd inode corresponds the given pathname (inode 0 is for the memory region). The pathname field will usually be the file that is backing the mapping. This is the huge memory-mapped file for the Galera cache, for examle:

    7ed567fff000-7fd568000000 rw-s 00000000 fd:0f 2147676724                 /home/mariadb-data/galera.cache

    See also map_files/* for memory-mapped files for the process:

    openxs@ao756:~$ sudo ls -l /proc/30580/map_files/ | more
    total 0
    lr-------- 1 mysql mysql 64 Jan  6 16:42 1bd6000-1cb7000 -> /usr/sbin/mysqld
    lr-------- 1 mysql mysql 64 Jan  6 16:42 1cb7000-1d67000 -> /usr/sbin/mysqld
    lr-------- 1 mysql mysql 64 Jan  6 16:42 400000-19d6000 -> /usr/sbin/mysqld
    lr-------- 1 mysql mysql 64 Jan  6 16:42 7fe1c01ed000-7fe1c01f8000 -> /lib/x86_6
    4-linux-gnu/libnss_files-2.23.so
    lr-------- 1 mysql mysql 64 Jan  6 16:42 7fe1c01f8000-7fe1c03f7000 -> /lib/x86_6
    4-linux-gnu/libnss_files-2.23.so
    ...


  • numa_maps - this file displays information about a process's NUMA memory policy and allocation. Each line contains information about a memory range used by the process, displaying - among other information - the effective memory policy for that memory range and on which nodes the pages have been allocated. Consider the following example (single node only, sorry):

    openxs@ao756:~$ sudo cat /proc/30580/numa_maps | more
    00400000 default file=/usr/sbin/mysqld mapped=1551 active=1061 N0=1551 kernelpagesize_kB=4
    01bd6000 default file=/usr/sbin/mysqld anon=37 dirty=1 mapped=99 swapcache=36 active=0 N0=99 kernelpagesize_kB=4
    01cb7000 default file=/usr/sbin/mysqld anon=46 dirty=45 mapped=71 swapcache=1 active=0 N0=71 kernelpagesize_kB=4
    01d67000 default anon=115 dirty=93 swapcache=22 active=2 N0=115 kernelpagesize_kB=4
    ...
    7fe1caa17000 default file=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 mapped=14 mapmax=2 N0=14 kernelpagesize_kB=4
    7fe1caa4a000 default file=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1
    7fe1cac4a000 default file=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 anon=2 dirty=1 swapcache=1 active=0 N0=2 kernelpagesize_kB=4
    7fe1cac4c000 default file=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 anon=1 dirty=1 active=0 N0=1 kernelpagesize_kB=4
    ...
    7fe1cae71000 default file=/[aio]\040(deleted)
    7fe1cae72000 default anon=1 swapcache=1 active=0 N0=1 kernelpagesize_kB=4
    7fe1cae73000 default file=/lib/x86_64-linux-gnu/ld-2.23.so anon=1 swapcache=1 ac
    tive=0 N0=1 kernelpagesize_kB=4
    7fe1cae74000 default file=/lib/x86_64-linux-gnu/ld-2.23.so anon=1 dirty=1 active
    =0 N0=1 kernelpagesize_kB=4
    7fe1cae75000 default anon=1 dirty=1 active=0 N0=1 kernelpagesize_kB=4
    7ffe2f559000 default stack anon=32 dirty=30 swapcache=2 active=3 N0=32 kernelpag
    esize_kB=4
    7ffe2f5c6000 default
    7ffe2f5c8000 default

    In the above we see starting address of the range, policy (all default in my case), NUMA node (N0 in my case) for the allocation and number of pages allocated (see N0=1551), dirty pages (dirty=30), number of pages that have an associated entry on a swap device (swapcache=2) etc. See also man 7 numa for many more details.

  • oom_score - this file displays the current score that the kernel gives to this process for the purpose of selecting a process for the OOM-killer. A higher score means that the process is more likely to be selected by the OOM-killer. The range is basically 0-1000, where 1000 basically means the process uses all the memory. Process with the value 0 in this file will never be killed. My server has low enough value:

    openxs@ao756:~$ sudo cat /proc/30580/oom_score
    40

  • oom_score_adj - this file can be used to adjust the OOM score used to select which process gets killed in out-of-memory conditions. Each candidate task is assigned a value ranging from 0 (never kill) to 1000 (always kill) to determine which process is targeted. The value of oom_score_adj is added to the OOM score before it is used to determine which task to kill. Acceptable values range from -1000 (OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX). This allows user space to control the preference for OOM-killing, ranging from always preferring a certain task or completely disabling it from OOM killing.

    In practice you just write large negative valuer into the file:

    openxs@ao756:~$ sudo cat /proc/30580/oom_score
    40
    openxs@ao756:~$ sudo cat /proc/30580/oom_score_adj
    0
    openxs@ao756:~$ sudo echo -100 > /proc/30580/oom_score_adj
    -bash: /proc/30580/oom_score_adj: Permission denied
    openxs@ao756:~$ sudo su -
    root@ao756:~# echo -100 > /proc/30580/oom_score_adj
    root@ao756:~# exit
    logout
    openxs@ao756:~$ sudo cat /proc/30580/oom_score_adj
    -100
    openxs@ao756:~$ sudo cat /proc/30580/oom_score
    0


    In my case I had to become root to be able to adjust the score. Read this nice blog post, "MySQL, OOM Killer, and everything related", for more details.

  • sched - scheduler related statistcis for the process. Not that it is much documented (see some hints here and there), but the output is more or less clear:

    openxs@ao756:~$ sudo cat /proc/30580/sched
    mysqld (30580, #threads: 37)
    -------------------------------------------------------------------
    se.exec_start                                :     259734254.219078
    se.vruntime                                  :        102780.674970
    se.sum_exec_runtime                          :           354.092211
    se.statistics.sum_sleep_runtime              :      38046805.925355
    se.statistics.wait_start                     :             0.000000
    se.statistics.sleep_start                    :     259734254.219078
    se.statistics.block_start                    :             0.000000
    se.statistics.sleep_max                      :      11587559.290106
    se.statistics.block_max                      :          2466.143746
    se.statistics.exec_max                       :             4.005223
    se.statistics.slice_max                      :             2.790630
    se.statistics.wait_max                       :            22.567269
    se.statistics.wait_sum                       :            85.791532
    se.statistics.wait_count                     :                  748
    se.statistics.iowait_sum                     :          5117.788260
    se.statistics.iowait_count                   :                  524

    se.nr_migrations                             :                   50
    se.statistics.nr_migrations_cold             :                    0
    se.statistics.nr_failed_migrations_affine    :                    0
    se.statistics.nr_failed_migrations_running   :                   10
    se.statistics.nr_failed_migrations_hot       :                    5
    se.statistics.nr_forced_migrations           :                    0
    se.statistics.nr_wakeups                     :                  658
    se.statistics.nr_wakeups_sync                :                  104
    se.statistics.nr_wakeups_migrate             :                   42
    se.statistics.nr_wakeups_local               :                  468
    se.statistics.nr_wakeups_remote              :                  190
    se.statistics.nr_wakeups_affine              :                   22
    se.statistics.nr_wakeups_affine_attempts     :                   55
    se.statistics.nr_wakeups_passive             :                    0
    se.statistics.nr_wakeups_idle                :                    0
    avg_atom                                     :             0.477857
    avg_per_cpu                                  :             7.081844
    nr_switches                                  :                  741
    nr_voluntary_switches                        :                  659
    nr_involuntary_switches                      :                   82
    se.load.weight                               :                 1024
    se.avg.load_sum                              :              6000742
    se.avg.util_sum                              :                22528
    se.avg.load_avg                              :                  125
    se.avg.util_avg                              :                    0
    se.avg.last_update_time                      :      259734254219078
    policy                                       :                    0
    prio                                         :                  120
    clock-delta                                  :                   60
    mm->numa_scan_seq                            :                    0
    numa_pages_migrated                          :                    0
    numa_preferred_nid                           :                   -1
    total_numa_faults                            :                    0
    current_node=0, numa_group_id=0
    numa_faults node=0 task_private=0 task_shared=0 group_private=0 group_shared=0

    and you can reset most counters by writing 0 to the file:

    openxs@ao756:~$ sudo su -
    root@ao756:~# echo 0 > /proc/30580/sched
    root@ao756:~# exit
    logout
    openxs@ao756:~$ sudo cat /proc/30580/sched
    mysqld (30580, #threads: 37)
    -------------------------------------------------------------------
    se.exec_start                                :     259734254.219078
    se.vruntime                                  :        102780.674970
    se.sum_exec_runtime                          :           354.092211
    se.statistics.sum_sleep_runtime              :             0.000000
    se.statistics.wait_start                     :             0.000000
    se.statistics.sleep_start                    :             0.000000
    se.statistics.block_start                    :             0.000000
    se.statistics.sleep_max                      :             0.000000
    se.statistics.block_max                      :             0.000000
    se.statistics.exec_max                       :             0.000000
    se.statistics.slice_max                      :             0.000000
    se.statistics.wait_max                       :             0.000000
    se.statistics.wait_sum                       :             0.000000
    se.statistics.wait_count                     :                    0
    se.statistics.iowait_sum                     :             0.000000
    se.statistics.iowait_count                   :                    0
    se.nr_migrations                             :                   50
    se.statistics.nr_migrations_cold             :                    0
    se.statistics.nr_failed_migrations_affine    :                    0
    se.statistics.nr_failed_migrations_running   :                    0
    se.statistics.nr_failed_migrations_hot       :                    0
    se.statistics.nr_forced_migrations           :                    0
    se.statistics.nr_wakeups                     :                    0
    se.statistics.nr_wakeups_sync                :                    0
    se.statistics.nr_wakeups_migrate             :                    0
    se.statistics.nr_wakeups_local               :                    0
    se.statistics.nr_wakeups_remote              :                    0
    se.statistics.nr_wakeups_affine              :                    0
    se.statistics.nr_wakeups_affine_attempts     :                    0
    se.statistics.nr_wakeups_passive             :                    0
    se.statistics.nr_wakeups_idle                :                    0
    avg_atom                                     :             0.477857
    avg_per_cpu                                  :             7.081844
    nr_switches                                  :                  741
    nr_voluntary_switches                        :                  659
    nr_involuntary_switches                      :                   82
    se.load.weight                               :                 1024
    se.avg.load_sum                              :              6000742
    se.avg.util_sum                              :                22528
    se.avg.load_avg                              :                  125
    se.avg.util_avg                              :                    0
    se.avg.last_update_time                      :      259734254219078
    policy                                       :                    0
    prio                                         :                  120
    clock-delta                                  :                   77
    mm->numa_scan_seq                            :                    0
    numa_pages_migrated                          :                    0
    numa_preferred_nid                           :                   -1
    total_numa_faults                            :                    0
    current_node=0, numa_group_id=0
    numa_faults node=0 task_private=0 task_shared=0 group_private=0 group_shared=0

    then run some load and check the impact. The values are displayed in milliseconds; they’re tracked in nanoseconds, and scaled by one million.

  • smaps - this file shows memory consumption for each of the process's mappings (see maps above). For each mapping there is a series of lines:

    openxs@ao756:~$ sudo cat /proc/30580/smaps | more
    00400000-019d6000 r-xp 00000000 fc:01 12066751                           /usr/sbin/mysqld
    Size:              22360 kB
    Rss:                6204 kB
    Pss:                6204 kB
    Shared_Clean:          0 kB
    Shared_Dirty:          0 kB
    Private_Clean:      6204 kB
    Private_Dirty:         0 kB
    Referenced:         5704 kB
    Anonymous:             0 kB
    AnonHugePages:         0 kB
    Shared_Hugetlb:        0 kB
    Private_Hugetlb:       0 kB
    Swap:                  0 kB
    SwapPss:               0 kB
    KernelPageSize:        4 kB
    MMUPageSize:           4 kB
    Locked:                0 kB
    VmFlags: rd ex mr mw me dw sd
    01bd6000-01cb7000 r--p 015d6000 fc:01 12066751                           /usr/sb
    in/mysqld
    Size:                900 kB
    ...

    These lines show the size of the mapping, the amount of the mapping that is currently resident in RAM ("Rss"), the process's proportional share of this mapping ("Pss"), the number of clean and dirty shared pages in the mapping, and the number of clean and dirty private pages in the mapping. "Referenced" indicates the amount of memory currently marked as referenced or accessed. "Anonymous" shows the amount of memory that does not belong to any file. "Swap" shows how much would-be-anonymous memory is also used, but out on swap, "VmFlags" represents the kernel flags associated with the virtual memory area as a set of two character flags, and so on.

  • stack - this file provides a symbolic trace of the function calls in this process's kernel stack. Here is what I have for the process and one of its threads (actually MySQL server's main thread and page cleaner thread as we'll see in the next post in this series):

    openxs@ao756:~$ sudo cat /proc/30580/stack
    [<ffffffff81233b94>] poll_schedule_timeout+0x44/0x70
    [<ffffffff81235223>] do_sys_poll+0x4b3/0x570
    [<ffffffff81235329>] do_restart_poll+0x49/0x80
    [<ffffffff81097155>] sys_restart_syscall+0x25/0x30
    [<ffffffff81869c5b>] entry_SYSCALL_64_fastpath+0x22/0xd0
    [<ffffffffffffffff>] 0xffffffffffffffff
    openxs@ao756:~$ sudo cat /proc/30580/task/30592/stack
    [<ffffffff811097c0>] futex_wait_queue_me+0xc0/0x120
    [<ffffffff8110a4f6>] futex_wait+0x116/0x270
    [<ffffffff8110ca60>] do_futex+0x120/0x5a0
    [<ffffffff8110cf61>] SyS_futex+0x81/0x180
    [<ffffffff81869c5b>] entry_SYSCALL_64_fastpath+0x22/0xd0
    [<ffffffffffffffff>] 0xffffffffffffffff

    For other threads we'll see more interesting stacks when they are waiting for disk etc. Stay tuned! Note that if you'll check wchan file for the process:

    openxs@ao756:~$ sudo strings /proc/30580/wchan
    poll_schedule_timeout


    we'll see the top line of the output above (hex values aside) as a null-terminated string. This is a "wait channel", the symbolic name corresponding to the location in the kernel where the process is sleeping.

  • stat - status information about the process, used by ps. This file is a set of fields separated by spaces, so it is easy to parse by some script or load into the database. Domas Mituzas once reported a bug, Bug #72027 - "LOAD_FILE() does not work on dynamic files" (fixed only in MySQL 8.0.15+), based on failed attempts to apply LOAD_DATA() to stat file in /proc.

    Output starts with PID and comm fields, followed by the process state and other details. For example:

    openxs@ao756:~$ cat /proc/30580/stat
    30580 (mysqld) S 1 30579 30579 0 -1 4194368 1007759 0 360 0 68598 13573 0 0 20 0 37 0 218060678 764219392 76109 18446744073709551615 1 1 0 0 0 0 540679 12288 1768 0 0 0 17 1 0 0 513 0 0 0 0 0 0 0 0 0 0

    The value of state is a single character:
    • R  Running
    • S  Sleeping in an interruptible wait
    • Waiting in uninterruptible disk sleep
    • Z  Zombie
    • T  Stopped (on a signal)
    • t  Tracing stop
    • ... some other values were also used in the past
    Then we see identifiers of the parent process and process group, followed by the session id and controlling terminal (-1), and so on. Field 20 is the number of threads in the process, fields 23 and 24 virtual memory size and resident set size, in bytes. Just read the manual if you plan to interpretad these lines somewhere. Note that sudo/root is not needed to see the details about the process of the other user.

  • statm - this file provides information about memory usage, measured in (4K) pages:

    openxs@ao756:~$ cat /proc/30580/statm
    186577 76109 2515 5590 0 169800 0

    Again we see just space-separated fields, easy to parse. The fileds are total size, resident, shared, text (code) size, next one (former lib) is always 0, then data + stack size and dirty pages. Some of these values may be inaccurate because of internal kernel optimizations applied.

  • status -  this file provides much of the information from stat and statm in a format that's easier for humans to read. For example:
    openxs@ao756:~$ cat /proc/30580/status
    Name:   mysqld
    State:  S (sleeping)
    Tgid:   30580
    Ngid:   0
    Pid:    30580
    PPid:   1
    TracerPid:      0
    Uid:    116     116     116     116
    Gid:    125     125     125     125
    FDSize: 128
    Groups: 125
    NStgid: 30580
    NSpid:  30580
    NSpgid: 30579
    NSsid:  30579
    VmPeak:   746308 kB
    VmSize:   746308 kB
    VmLck:         0 kB
    VmPin:         0 kB
    VmHWM:    328248 kB
    VmRSS:    304436 kB
    VmData:   679056 kB
    VmStk:       144 kB
    VmExe:     22360 kB
    VmLib:      7784 kB
    VmPTE:      1036 kB
    VmPMD:        16 kB
    VmSwap:    13256 kB
    HugetlbPages:          0 kB
    Threads:        37
    SigQ:   0/14893
    SigPnd: 0000000000000000
    ShdPnd: 0000000000000000
    SigBlk: 0000000000084007
    SigIgn: 0000000000003000
    SigCgt: 00000001800006e8
    CapInh: 0000000000000000
    CapPrm: 0000000000000000
    CapEff: 0000000000000000
    CapBnd: 0000003fffffffff
    CapAmb: 0000000000000000
    Seccomp:        0
    Speculation_Store_Bypass:       thread vulnerable
    Cpus_allowed:   ff
    Cpus_allowed_list:      0-7
    Mems_allowed:   00000000,00000001
    Mems_allowed_list:      0
    voluntary_ctxt_switches:        687
    nonvoluntary_ctxt_switches:     82

    The values are supposed to be clear for a reader. If not - just check man 5 proc.

  • syscall - this file exposes the system call number and argument registers for the system call currently being executed by the process, followed by the values of the stack pointer and program counter registers. The values of all six argument registers are exposed, although most system calls use fewer registers. If the process is blocked, but not in a system call, thensystem call number is -1, followed by just the values of the stack pointer and program counter. If process is not blocked, then the file contains just the string "running". For example, this is my ad hoc poor man threads monitoring script of a kind applied to the mysqld process executing some sysbench load:

    openxs@ao756:~$ for dir in `ls /proc/30580/task`; do echo -n $dir': '; 2>/dev/null sudo cat /proc/$dir/syscall; done | more
    2389: 202 0x1db141c 0x80 0x834e8 0x0 0x1db1400 0x41a70 0x7fe1cace9fb0 0x7fe1ca807360
    2390: 202 0x1db141c 0x80 0x834e5 0x0 0x1db1400 0x41a70 0x7fe1c00e4fb0 0x7fe1ca807360
    2391: 202 0x1db141c 0x80 0x83505 0x0 0x1db1400 0x41a7d 0x7fe1c00a3fb0 0x7fe1ca807360
    2392: 202 0x1db141c 0x80 0x834fe 0x0 0x1db1400 0x41a7d 0x7fe1b6a39fb0 0x7fe1ca807360
    2393: 202 0x1db141c 0x80 0x83504 0x0 0x1db1400 0x41a7d 0x7fe1b6a7afb0 0x7fe1ca807360
    2394: 202 0x1db141c 0x80 0x83501 0x0 0x1db1400 0x41a7d 0x7fe1c0125fb0 0x7fe1ca807360
    2395: running
    2488: 202 0x1db141c 0x80 0x8351e 0x0 0x1db1400 0x41a8a 0x7fe1cadedfb0 0x7fe1ca807360
    2493: 202 0x1db141c 0x80 0x83518 0x0 0x1db1400 0x41a8a 0x7fe1cad2afb0 0x7fe1ca807360
    2800: 75 0x12 0x1 0x0 0x0 0x0 0x1db0870 0x7fe1c01e8f50 0x7fe1c876a8dd
    30580: 7 0x7fe1c13b9e98 0x2 0xffffffff 0x7fe1c84539d0 0x0 0x7fe1c8453700 0x7ffe2
    f57a590 0x7fe1c876880d
    30581: 128 0x7fe1c0bfed70 0x7fe1c0bfedf0 0x0 0x8 0x0 0x7ffe2f57a5d0 0x7fe1c0bfec
    c0 0x7fe1c86a3b36
    30582: 208 0x7fe1cae58000 0x1 0x100 0x7fe1c7c76000 0x7fe1b3bfe750 0x7fe1b3bff9c0
    ...


    Here you can see a running thread and various system calls for other threads. If only I could map numbers to something useful... Smart people can as you'll find out in the next posts.

Let me stop checking /proc at this stage, otherwise I'll end up with a yet another MySQL-illustrated manual page for /proc with a size of the large book chapter.

In the next post I am going to describe MySQL threads monitoring with /proc, ps and top sampling, based on the information presented here.

No comments:

Post a Comment