Happy New Year 2021, dear readers of my blog!
We used to have real winters at this time of the year. Not any more...
Among other good things that happened on December 31, 2020, I was informed that my talk "Linux /proc filesystem for MySQL DBAs" was accepted for the FOSDEM 2021 MySQL devroom. So, it's time to get back to blogging, which I had abandoned for a while in favor of this YouTube channel, and share some details to refer to from my slides.
This is not my first talk or blog post in the "Something for MySQL DBAs" series. There were some in the past, like these:
- "gdb tips and tricks for MySQL DBAs"
- "Using strace for MySQL Troubleshooting"
- "How lsof Utility May Help MySQL DBAs"
- "Windows Tools for MySQL DBAs: Basic Minidump Analysis"
- ... and so on
My colleague once said that eventually I'd end up with something like a "/dev/null for MySQL DBAs" kind of talk. I am not there yet, but I still think that the more OS-level tools DBAs know, the better they can do their job.
The idea of talking about the /proc filesystem was inspired both by the fact that I use some files there on a regular basis in my job and by this great talk by Tanel Poder and his 0x.tools, a small set of open-source utilities for analyzing application performance on Linux, based mostly on /proc sampling.
I am planning to cover most of the upcoming talk in a mini-series of three blog posts. In this post I am going to present some basic information about the /proc filesystem and some files and directories there that are surely useful for MySQL DBAs, as well as (who could have imagined) some related MySQL bug reports. In the second one I'll discuss how the threads of a process are represented in the /proc filesystem (as well as in ps and top outputs). Finally, in the third post I am going to show some use cases for 0x.tools and summarize the benefits and use cases of the /proc sampling approach.
* * *
Basically, the proc filesystem is a pseudo-filesystem which provides an interface to kernel data structures. It is commonly mounted at /proc:
openxs@ao756:~$ cat /etc/issue
Ubuntu 16.04.7 LTS \n \l
openxs@ao756:~$ mount | grep '/proc'
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
...
Most of it is read-only, but some files allow changing kernel variables. For the purpose of this discussion I skip the mount options and most of the /proc/* files and proceed to the /proc subdirectories, one per running process, named after the PID. I have the following mysqld process (of Percona Server 5.7.x) running:
openxs@ao756:~$ ps aux | grep mysqld
...
mysql 30580 0.7 8.1 746308 313984 ? Sl Jan02 9:55 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid
So there will be the /proc/30580 directory with the following content:
openxs@ao756:~$ ls -F /proc/30580
ls: cannot read symbolic link '/proc/30580/cwd': Permission denied
ls: cannot read symbolic link '/proc/30580/root': Permission denied
ls: cannot read symbolic link '/proc/30580/exe': Permission denied
attr/ cpuset limits net/ projid_map stat
autogroup cwd@ loginuid ns/ root@ statm
auxv environ map_files/ numa_maps sched status
cgroup exe@ maps oom_adj schedstat syscall
clear_refs fd/ mem oom_score sessionid task/
cmdline fdinfo/ mountinfo oom_score_adj setgroups timers
comm gid_map mounts pagemap smaps uid_map
coredump_filter io mountstats personality stack wchan
I highlighted the files and directories I consider most useful. Note also the "Permission denied" messages above that you may get while accessing some files in /proc, even for processes you own. You may still need root/sudo access (or membership in some dedicated group) to read them.
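By the way, with root privileges those symbolic links resolve just fine. A quick check, reusing the same PID (I do not show the output here, but readlink should just print the targets):

sudo readlink /proc/30580/exe   # the real binary behind the process, /usr/sbin/mysqld in this case
sudo readlink /proc/30580/cwd   # current working directory; for mysqld this is normally the datadir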
Now let me give short descriptions or hints about the content of the highlighted files and directories (see man 5 proc to get more details):
- task/[tid] - the task subdirectory contains subdirectories of the form task/[tid], one per thread in the process, where tid is the kernel thread ID of the thread. In my case:
openxs@ao756:~$ ls -F /proc/30580/task/
2488/ 30580/ 30584/ 30588/ 30592/ 30598/ 30602/ 30606/ 31972/ 3622/
2493/ 30581/ 30585/ 30589/ 30593/ 30599/ 30603/ 30607/ 3618/
2800/ 30582/ 30586/ 30590/ 30594/ 30600/ 30604/ 30608/ 3620/
2805/ 30583/ 30587/ 30591/ 30597/ 30601/ 30605/ 30609/ 3621/
Each of them, in turn, has files and subdirectories similar to the /proc/[pid] ones that I discuss below:
openxs@ao756:~$ sudo ls -F /proc/30580/task/30592
attr/ cpuset io net/ personality smaps wchan
auxv cwd@ limits ns/ projid_map stack
cgroup environ loginuid numa_maps root@ stat
children exe@ maps oom_adj sched statm
clear_refs fd/ mem oom_score schedstat status
cmdline fdinfo/ mountinfo oom_score_adj sessionid syscall
comm gid_map mounts pagemap setgroups uid_map
- cmdline - this read-only file contains the command line (argv) that the process wants you to see, as strings terminated by null bytes ('\0'). This is how you can check the content to see each string separately:
openxs@ao756:~$ strings /proc/30580/cmdline
/usr/sbin/mysqld
--daemonize
--pid-file=/var/run/mysqld/mysqld.pid
- comm - command name (up to 16 characters including the terminating null byte, longer values truncated) associated with the process. In my case:
openxs@ao756:~$ cat /proc/30580/comm
mysqld
Individual threads can set different comm values. There is a useful feature request for MySQL threads to be named according to their role: Bug #70858 - "Set thread name" by Daniël van Eeden.
- coredump_filter - this file can be used to control which memory segments are written to the core dump file in the event that a core dump is performed for the process with the corresponding PID. The value in the file is a bit mask of memory mapping types:
bit 0  Dump anonymous private mappings.
bit 1  Dump anonymous shared mappings.
bit 2  Dump file-backed private mappings.
bit 3  Dump file-backed shared mappings.
bit 4  Dump ELF headers.
bit 5  Dump private huge pages.
By default bits 0, 1, 4 and 5 are set, so we see the hex value 33 (binary 110011) in the file:
...
openxs@ao756:~$ cat /proc/30580/coredump_filter
00000033
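So, for example, to also dump file-backed private mappings (bit 2) one could set the mask to hex 37 (0x33 plus bit 2). A sketch, not something I routinely do on production servers; the change affects only this process and is inherited by its children:

echo 0x37 | sudo tee /proc/30580/coredump_filter   # 110111: add file-backed private mappings
cat /proc/30580/coredump_filter                    # verify the new mask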
So we can control the core dump size to some extent by writing to this file. Other, way more important files in /proc related to core dumps are described in man 5 core and in other sources. Read them to decode, for example, this output that is typical for systemd-based systems (and do not desperately look for core.* files from the mysqld process in the datadir after that):
openxs@ao756:~$ cat /proc/sys/kernel/core_pattern
|/usr/share/apport/apport %p %s %c %d %P %E
- environ - this file contains the initial environment that was set when the currently executing program was started. The entries are separated by null bytes, so you can check them like this:
openxs@ao756:~$ sudo strings /proc/30580/environ
LANG=en_US.UTF-8
...
LC_TIME=uk_UA.UTF-8
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
HOME=/nonexistent
LOGNAME=mysql
USER=mysql
SHELL=/bin/false
STARTTIMEOUT=120
STOPTIMEOUT=600
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1
Note the last 3 variables above: they show the systemd timeouts for the unit and the preloaded jemalloc library.
- fd/ - this subdirectory contains one entry for each file the process has open. The name of the entry is the file descriptor, and it is a symbolic link to the actual file:
openxs@ao756:~$ sudo ls -l /proc/30580/fd/
total 0
lr-x------ 1 mysql mysql 64 Jan 5 19:09 0 -> /dev/null
lrwx------ 1 mysql mysql 64 Jan 5 19:09 1 -> socket:[822912]
lrwx------ 1 mysql mysql 64 Jan 5 19:09 10 -> /var/lib/mysql/ib_logfile1
lrwx------ 1 mysql mysql 64 Jan 5 19:09 11 -> /var/lib/mysql/ibdata1
lrwx------ 1 mysql mysql 64 Jan 5 19:09 12 -> /var/lib/mysql/xb_doublewrite
lrwx------ 1 mysql mysql 64 Jan 5 19:09 13 -> /tmp/ibGEsYuO (deleted)
lrwx------ 1 mysql mysql 64 Jan 5 19:09 14 -> /var/lib/mysql/ibtmp1
lrwx------ 1 mysql mysql 64 Jan 5 19:09 15 -> /var/lib/mysql/sbtest/sbtest3.ibd
lrwx------ 1 mysql mysql 64 Jan 5 19:09 16 -> /var/lib/mysql/mysql/plugin.ibd
lrwx------ 1 mysql mysql 64 Jan 5 19:09 17 -> /var/lib/mysql/mysql/gtid_executed.ibd
l-wx------ 1 mysql mysql 64 Jan 5 19:09 18 -> /var/lib/mysql/ao756-bin.000089
lrwx------ 1 mysql mysql 64 Jan 5 19:09 19 -> socket:[822935]
lrwx------ 1 mysql mysql 64 Jan 5 19:09 2 -> socket:[822912]
lrwx------ 1 mysql mysql 64 Jan 5 19:09 20 -> socket:[822936]
lrwx------ 1 mysql mysql 64 Jan 5 19:09 21 -> /var/lib/mysql/mysql/server_cost.ibd
lrwx------ 1 mysql mysql 64 Jan 5 19:09 22 -> /var/lib/mysql/mysql/engine_cost.ibd
lrwx------ 1 mysql mysql 64 Jan 5 19:09 23 -> /var/lib/mysql/mysql/db.MYI
lrwx------ 1 mysql mysql 64 Jan 5 19:09 24 -> /var/lib/mysql/mysql/db.MYD
lrwx------ 1 mysql mysql 64 Jan 5 19:09 25 -> /var/lib/mysql/mysql/user.MYI
lrwx------ 1 mysql mysql 64 Jan 5 19:09 26 -> /var/lib/mysql/mysql/user.MYD
lrwx------ 1 mysql mysql 64 Jan 5 19:09 27 -> /var/lib/mysql/mysql/event.MYI
lrwx------ 1 mysql mysql 64 Jan 5 19:09 28 -> /var/lib/mysql/mysql/event.MYD
lrwx------ 1 mysql mysql 64 Jan 5 19:09 29 -> /var/lib/mysql/mysql/time_zone_leap_second.ibd
lrwx------ 1 mysql mysql 64 Jan 5 19:09 3 -> /var/lib/mysql/ao756-bin.index
lrwx------ 1 mysql mysql 64 Jan 5 19:09 30 -> /var/lib/mysql/mysql/time_zone_name.ibd
lrwx------ 1 mysql mysql 64 Jan 5 19:09 31 -> /var/lib/mysql/mysql/time_zone.ibd
lrwx------ 1 mysql mysql 64 Jan 5 19:09 32 -> /var/lib/mysql/mysql/time_zone_transition_type.ibd
lrwx------ 1 mysql mysql 64 Jan 5 19:09 33 -> /var/lib/mysql/mysql/time_zone_transition.ibd
lrwx------ 1 mysql mysql 64 Jan 5 19:09 34 -> /var/lib/mysql/mysql/innodb_table_stats.ibd
lrwx------ 1 mysql mysql 64 Jan 5 19:09 35 -> /var/lib/mysql/sbtest/sbtest2.ibd
lrwx------ 1 mysql mysql 64 Jan 5 19:09 36 -> /var/lib/mysql/sbtest/sbtest4.ibd
lrwx------ 1 mysql mysql 64 Jan 5 19:09 39 -> /var/lib/mysql/sbtest/sbtest1.ibd
lrwx------ 1 mysql mysql 64 Jan 5 19:09 4 -> /var/lib/mysql/sbtest/sbtest5.ibd
lrwx------ 1 mysql mysql 64 Jan 5 19:09 40 -> /var/lib/mysql/mysql/servers.ibd
lrwx------ 1 mysql mysql 64 Jan 5 19:09 41 -> /var/lib/mysql/mysql/slave_master_info.ibd
lrwx------ 1 mysql mysql 64 Jan 5 19:09 42 -> /var/lib/mysql/mysql/slave_relay_log_info.ibd
lrwx------ 1 mysql mysql 64 Jan 5 19:09 43 -> /var/lib/mysql/mysql/slave_worker_info.ibd
lrwx------ 1 mysql mysql 64 Jan 5 19:09 45 -> /var/lib/mysql/mysql/innodb_index_stats.ibd
lrwx------ 1 mysql mysql 64 Jan 5 19:09 5 -> /var/lib/mysql/ib_logfile0
lrwx------ 1 mysql mysql 64 Jan 5 19:09 6 -> /tmp/ibmKNTQP (deleted)
lrwx------ 1 mysql mysql 64 Jan 5 19:09 7 -> /tmp/ib2GbIR1 (deleted)
lrwx------ 1 mysql mysql 64 Jan 5 19:09 8 -> /tmp/ibME5wSd (deleted)
lrwx------ 1 mysql mysql 64 Jan 5 19:09 9 -> /tmp/ibyiPjXB (deleted)
In the above we see that stderr (descriptor 2) points to socket:[822912]. We can get some more information about this inode from /proc/net and other sources:
openxs@ao756:~$ strings /proc/net/unix | grep 822912
0000000000000000: 00000003 00000000 00000000 0001 03 822912
openxs@ao756:~$ sudo netstat -n --program | grep 822912
unix 3 [ ] STREAM CONNECTED 822912 30580/mysqld
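A quick practical use of fd/ is to check how close the server is to its open files limit, something like this (a trivial sketch with the same PID):

sudo ls /proc/30580/fd | wc -l             # descriptors currently in use
grep 'Max open files' /proc/30580/limits   # soft and hard limits to compare with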
The information in fd/ is similar to what you may get from lsof, but it is "always there", while lsof must be installed separately.
- fdinfo/ - files in this subdirectory provide more information about the corresponding file descriptor. Let's check 3 -> /var/lib/mysql/ao756-bin.index:
openxs@ao756:~$ sudo cat /proc/30580/fdinfo/3
pos: 1691
flags: 0100002
mnt_id: 24
openxs@ao756:~$ sudo ls -l /var/lib/mysql/ao756-bin.index
-rw-r----- 1 mysql mysql 1691 Jan 2 17:35 /var/lib/mysql/ao756-bin.index
and then some InnoDB file, for example, 5 -> /var/lib/mysql/ib_logfile0:
openxs@ao756:~$ sudo cat /proc/30580/fdinfo/5
pos: 0
flags: 0100002
mnt_id: 24
lock: 1: POSIX ADVISORY WRITE 30580 fc:01:11143169 0 EOF
openxs@ao756:~$ cat /proc/30580/mountinfo | grep ^'24 '
24 0 252:1 / / rw,relatime shared:1 - ext4 /dev/mapper/ubuntu--vg-root rw,errors=remount-ro,data=ordered
As we can see, the details include the file offset (pos), an octal number that encodes the file access mode and file status flags (flags), the id of the mount point for the file, which we can look up in /proc/$PID/mountinfo (the / filesystem in this case), and some other details depending on the file type. For the redo log we also see a lock record with the details of the write lock set on this file by the mysqld process. You can see it among all the file locks in /proc/locks too:
openxs@ao756:~$ cat /proc/locks | grep fc:01:11143169
30: POSIX ADVISORY WRITE 30580 fc:01:11143169 0 EOF
See "File locking in Linux" by Victor Gaydov for much more details. - io - this file contains I/O statistics for the process, for example:
openxs@ao756:~$ sudo cat /proc/30580/io
rchar: 198026951
wchar: 2365236669
syscr: 8014
syscw: 92558
read_bytes: 49184768
write_bytes: 3394945024
cancelled_write_bytes: 171876352
Here we can see the number of characters read and written (maybe just to/from the page cache, without physical I/O), the number of read and write system calls, the number of bytes read from and written to storage and, in cancelled_write_bytes, the number of bytes whose writeout this process prevented by truncating the page cache (by deleting a file it had created, for example).
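These counters are cumulative since the process (or thread) started, so to get rates one usually samples the file twice and compares, along these lines (a rough sketch):

sudo cat /proc/30580/io > /tmp/io.before
sleep 10
sudo cat /proc/30580/io > /tmp/io.after
diff /tmp/io.before /tmp/io.after    # the changed counters cover the last 10 seconds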
It's important to note that the same information is properly tracked for individual threads as well. For example, this is the page cleaner thread of our Percona Server (more on how I can be so sure while comm is not set in the next post):
openxs@ao756:~$ sudo cat /proc/30580/task/30592/io
rchar: 0
wchar: 912518825
syscr: 0
syscw: 2092
read_bytes: 0
write_bytes: 1813786624
cancelled_write_bytes: 0
and we can see that it did a notable share of all the writes.
- limits - this file provides the details about the actual resource limits of the process. This is the file in /proc I personally use most often, at least while working on support issues with customers:
openxs@ao756:~$ cat /proc/30580/limits
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 14893 14893 processes
Max open files 1024 4096 files
Max locked memory 65536 65536 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 14893 14893 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
I highlighted the rows most important in the context of MySQL server - the limit on the number of processes (and thus threads and connections, thread pool aside) and the limit on the number of open files. Users often assume values very different from those actually in effect, for various reasons... Note that on modern kernels one does NOT need root/sudo to check the limits of a process that belongs to a different user.
- maps - this file contains the currently mapped memory regions and their access permissions. It is also very important to check in the case of MySQL server. Consider the following example:
openxs@ao756:~$ sudo cat /proc/30580/maps | more
00400000-019d6000 r-xp 00000000 fc:01 12066751 /usr/sbin/mysqld
01bd6000-01cb7000 r--p 015d6000 fc:01 12066751 /usr/sbin/mysqld
01cb7000-01d67000 rw-p 016b7000 fc:01 12066751 /usr/sbin/mysqld
01d67000-01e48000 rw-p 00000000 00:00 0
7fe19d400000-7fe1a2000000 rw-p 00000000 00:00 0
...
7fe1caa17000-7fe1caa4a000 r-xp 00000000 fc:01 12064860 /usr/lib/x86_64-linux-gnu/libjemalloc.so.1
7fe1caa4a000-7fe1cac4a000 ---p 00033000 fc:01 12064860 /usr/lib/x86_64-linux-gnu/libjemalloc.so.1
7fe1cac4a000-7fe1cac4c000 r--p 00033000 fc:01 12064860 /usr/lib/x86_64-linux-gnu/libjemalloc.so.1
7fe1cac4c000-7fe1cac4d000 rw-p 00035000 fc:01 12064860 /usr/lib/x86_64-linux-gnu/libjemalloc.so.1
...
7fe1cae33000-7fe1cae38000 rw-s 00000000 00:0d 822928 /[aio] (deleted)
7fe1cae38000-7fe1cae42000 rw-p 00000000 00:00 0
7fe1cae42000-7fe1cae44000 rw-s 00000000 00:0d 822929 /[aio] (deleted)
...
7fe1cae73000-7fe1cae74000 r--p 00025000 fc:01 1311186 /lib/x86_64-linux-gnu/ld-2.23.so
7fe1cae74000-7fe1cae75000 rw-p 00026000 fc:01 1311186 /lib/x86_64-linux-gnu/ld-2.23.so
7fe1cae75000-7fe1cae76000 rw-p 00000000 00:00 0
7ffe2f559000-7ffe2f57d000 rw-p 00000000 00:00 0 [stack]
7ffe2f5c6000-7ffe2f5c8000 r--p 00000000 00:00 0 [vvar]
7ffe2f5c8000-7ffe2f5ca000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
The format of the output lines is simple: the range of addresses in the address space of the process, permissions, offset, device (major:minor), inode and pathname. Permissions r/w/x have the usual meanings, while s means shared and p - private. Offset is the offset into the related file, if any. Device shows where the inode is located, and the inode corresponds to the given pathname (inode 0 means the region is not backed by a file). The pathname field will usually be the file that is backing the mapping. This is the huge memory-mapped file for the Galera cache, for example:
7ed567fff000-7fd568000000 rw-s 00000000 fd:0f 2147676724 /home/mariadb-data/galera.cache
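The maps file is also easy to filter with awk. For example, to list all the files the process has mapped shared and writable (the way the Galera cache above is mapped), something like this should do (a sketch):

sudo awk '$2 ~ /^rw-s/ && $6 != "" {print $6}' /proc/30580/maps | sort -u   # shared writable file-backed mappings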
See also map_files/* for memory-mapped files for the process:
openxs@ao756:~$ sudo ls -l /proc/30580/map_files/ | more
total 0
lr-------- 1 mysql mysql 64 Jan 6 16:42 1bd6000-1cb7000 -> /usr/sbin/mysqld
lr-------- 1 mysql mysql 64 Jan 6 16:42 1cb7000-1d67000 -> /usr/sbin/mysqld
lr-------- 1 mysql mysql 64 Jan 6 16:42 400000-19d6000 -> /usr/sbin/mysqld
lr-------- 1 mysql mysql 64 Jan 6 16:42 7fe1c01ed000-7fe1c01f8000 -> /lib/x86_64-linux-gnu/libnss_files-2.23.so
lr-------- 1 mysql mysql 64 Jan 6 16:42 7fe1c01f8000-7fe1c03f7000 -> /lib/x86_64-linux-gnu/libnss_files-2.23.so
...
- numa_maps - this file displays information about a process's NUMA memory policy and allocation. Each line contains information about a memory range used by the process, displaying - among other information - the effective memory policy for that memory range and on which nodes the pages have been allocated. Consider the following example (a single node only, sorry):
openxs@ao756:~$ sudo cat /proc/30580/numa_maps | more
00400000 default file=/usr/sbin/mysqld mapped=1551 active=1061 N0=1551 kernelpagesize_kB=4
01bd6000 default file=/usr/sbin/mysqld anon=37 dirty=1 mapped=99 swapcache=36 active=0 N0=99 kernelpagesize_kB=4
01cb7000 default file=/usr/sbin/mysqld anon=46 dirty=45 mapped=71 swapcache=1 active=0 N0=71 kernelpagesize_kB=4
01d67000 default anon=115 dirty=93 swapcache=22 active=2 N0=115 kernelpagesize_kB=4
...
7fe1caa17000 default file=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 mapped=14 mapmax=2 N0=14 kernelpagesize_kB=4
7fe1caa4a000 default file=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1
7fe1cac4a000 default file=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 anon=2 dirty=1 swapcache=1 active=0 N0=2 kernelpagesize_kB=4
7fe1cac4c000 default file=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 anon=1 dirty=1 active=0 N0=1 kernelpagesize_kB=4
...
7fe1cae71000 default file=/[aio]\040(deleted)
7fe1cae72000 default anon=1 swapcache=1 active=0 N0=1 kernelpagesize_kB=4
7fe1cae73000 default file=/lib/x86_64-linux-gnu/ld-2.23.so anon=1 swapcache=1 active=0 N0=1 kernelpagesize_kB=4
7fe1cae74000 default file=/lib/x86_64-linux-gnu/ld-2.23.so anon=1 dirty=1 active=0 N0=1 kernelpagesize_kB=4
7fe1cae75000 default anon=1 dirty=1 active=0 N0=1 kernelpagesize_kB=4
7ffe2f559000 default stack anon=32 dirty=30 swapcache=2 active=3 N0=32 kernelpagesize_kB=4
7ffe2f5c6000 default
7ffe2f5c8000 default
In the above we see the starting address of the range, the policy (all default in my case), the NUMA node the pages were allocated on and their number (see N0=1551), dirty pages (dirty=30), the number of pages that have an associated entry on a swap device (swapcache=2) etc. See also man 7 numa for many more details.
- oom_score - this file displays the current score that the kernel gives to this process for the purpose of selecting a victim for the OOM killer. A higher score means that the process is more likely to be selected by the OOM killer. The range is basically 0-1000, where 1000 basically means the process uses all the memory, and a process with the value 0 in this file will never be killed. My server has a low enough value:
openxs@ao756:~$ sudo cat /proc/30580/oom_score
40
- oom_score_adj - this file can be used to adjust the OOM score used to select which process gets killed in out-of-memory conditions. Each candidate task is assigned a value ranging from 0 (never kill) to 1000 (always kill) to determine which process is targeted. The value of oom_score_adj is added to that score before it is used to determine which task to kill. Acceptable values range from -1000 (OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX). This allows user space to control the preference for OOM killing, from always preferring a certain task to completely excluding it from OOM killing.
In practice you just write a large negative value into the file:
openxs@ao756:~$ sudo cat /proc/30580/oom_score
40
openxs@ao756:~$ sudo cat /proc/30580/oom_score_adj
0
openxs@ao756:~$ sudo echo -100 > /proc/30580/oom_score_adj
-bash: /proc/30580/oom_score_adj: Permission denied
openxs@ao756:~$ sudo su -
root@ao756:~# echo -100 > /proc/30580/oom_score_adj
root@ao756:~# exit
logout
openxs@ao756:~$ sudo cat /proc/30580/oom_score_adj
-100
openxs@ao756:~$ sudo cat /proc/30580/oom_score
0
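Note that a value written like this does not survive a restart of mysqld. If the server is managed by systemd, it is usually more practical to make the setting persistent in the unit; a sketch (the unit name mysql and the value are just examples for a setup like mine):

sudo systemctl edit mysql     # then add the two lines below in the editor that opens and save
# [Service]
# OOMScoreAdjust=-100
sudo systemctl daemon-reload && sudo systemctl restart mysql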
In my case I had to become root to be able to adjust the score. Read this nice blog post, "MySQL, OOM Killer, and everything related", for more details.
- sched - scheduler-related statistics for the process. It is not that well documented (see some hints here and there), but the output is more or less clear:
openxs@ao756:~$ sudo cat /proc/30580/sched
mysqld (30580, #threads: 37)
-------------------------------------------------------------------
se.exec_start : 259734254.219078
se.vruntime : 102780.674970
se.sum_exec_runtime : 354.092211
se.statistics.sum_sleep_runtime : 38046805.925355
se.statistics.wait_start : 0.000000
se.statistics.sleep_start : 259734254.219078
se.statistics.block_start : 0.000000
se.statistics.sleep_max : 11587559.290106
se.statistics.block_max : 2466.143746
se.statistics.exec_max : 4.005223
se.statistics.slice_max : 2.790630
se.statistics.wait_max : 22.567269
se.statistics.wait_sum : 85.791532
se.statistics.wait_count : 748
se.statistics.iowait_sum : 5117.788260
se.statistics.iowait_count : 524
se.nr_migrations : 50
se.statistics.nr_migrations_cold : 0
se.statistics.nr_failed_migrations_affine : 0
se.statistics.nr_failed_migrations_running : 10
se.statistics.nr_failed_migrations_hot : 5
se.statistics.nr_forced_migrations : 0
se.statistics.nr_wakeups : 658
se.statistics.nr_wakeups_sync : 104
se.statistics.nr_wakeups_migrate : 42
se.statistics.nr_wakeups_local : 468
se.statistics.nr_wakeups_remote : 190
se.statistics.nr_wakeups_affine : 22
se.statistics.nr_wakeups_affine_attempts : 55
se.statistics.nr_wakeups_passive : 0
se.statistics.nr_wakeups_idle : 0
avg_atom : 0.477857
avg_per_cpu : 7.081844
nr_switches : 741
nr_voluntary_switches : 659
nr_involuntary_switches : 82
se.load.weight : 1024
se.avg.load_sum : 6000742
se.avg.util_sum : 22528
se.avg.load_avg : 125
se.avg.util_avg : 0
se.avg.last_update_time : 259734254219078
policy : 0
prio : 120
clock-delta : 60
mm->numa_scan_seq : 0
numa_pages_migrated : 0
numa_preferred_nid : -1
total_numa_faults : 0
current_node=0, numa_group_id=0
numa_faults node=0 task_private=0 task_shared=0 group_private=0 group_shared=0
and you can reset most counters by writing 0 to the file:
openxs@ao756:~$ sudo su -
root@ao756:~# echo 0 > /proc/30580/sched
root@ao756:~# exit
logout
openxs@ao756:~$ sudo cat /proc/30580/sched
mysqld (30580, #threads: 37)
-------------------------------------------------------------------
se.exec_start : 259734254.219078
se.vruntime : 102780.674970
se.sum_exec_runtime : 354.092211
se.statistics.sum_sleep_runtime : 0.000000
se.statistics.wait_start : 0.000000
se.statistics.sleep_start : 0.000000
se.statistics.block_start : 0.000000
se.statistics.sleep_max : 0.000000
se.statistics.block_max : 0.000000
se.statistics.exec_max : 0.000000
se.statistics.slice_max : 0.000000
se.statistics.wait_max : 0.000000
se.statistics.wait_sum : 0.000000
se.statistics.wait_count : 0
se.statistics.iowait_sum : 0.000000
se.statistics.iowait_count : 0
se.nr_migrations : 50
se.statistics.nr_migrations_cold : 0
se.statistics.nr_failed_migrations_affine : 0
se.statistics.nr_failed_migrations_running : 0
se.statistics.nr_failed_migrations_hot : 0
se.statistics.nr_forced_migrations : 0
se.statistics.nr_wakeups : 0
se.statistics.nr_wakeups_sync : 0
se.statistics.nr_wakeups_migrate : 0
se.statistics.nr_wakeups_local : 0
se.statistics.nr_wakeups_remote : 0
se.statistics.nr_wakeups_affine : 0
se.statistics.nr_wakeups_affine_attempts : 0
se.statistics.nr_wakeups_passive : 0
se.statistics.nr_wakeups_idle : 0
avg_atom : 0.477857
avg_per_cpu : 7.081844
nr_switches : 741
nr_voluntary_switches : 659
nr_involuntary_switches : 82
se.load.weight : 1024
se.avg.load_sum : 6000742
se.avg.util_sum : 22528
se.avg.load_avg : 125
se.avg.util_avg : 0
se.avg.last_update_time : 259734254219078
policy : 0
prio : 120
clock-delta : 77
mm->numa_scan_seq : 0
numa_pages_migrated : 0
numa_preferred_nid : -1
total_numa_faults : 0
current_node=0, numa_group_id=0
numa_faults node=0 task_private=0 task_shared=0 group_private=0 group_shared=0
then run some load and check the impact. The values are displayed in milliseconds; they are tracked in nanoseconds and scaled down by one million for display.
- smaps - this file shows memory consumption for each of the process's mappings (see maps above). For each mapping there is a series of lines:
openxs@ao756:~$ sudo cat /proc/30580/smaps | more
00400000-019d6000 r-xp 00000000 fc:01 12066751 /usr/sbin/mysqld
Size: 22360 kB
Rss: 6204 kB
Pss: 6204 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 6204 kB
Private_Dirty: 0 kB
Referenced: 5704 kB
Anonymous: 0 kB
AnonHugePages: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
SwapPss: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
VmFlags: rd ex mr mw me dw sd
01bd6000-01cb7000 r--p 015d6000 fc:01 12066751 /usr/sbin/mysqld
Size: 900 kB
...
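Every mapping gets such a block of counters, which makes smaps convenient for quick totals. For example, the proportional set size of the whole process can be summed up like this (a sketch):

sudo awk '/^Pss:/ {sum += $2} END {print sum " kB"}' /proc/30580/smaps
# Pss charges shared pages proportionally, so this is often more honest than VmRSS from status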
The lines for each mapping show the size of the mapping, the amount of it that is currently resident in RAM ("Rss"), the process's proportional share of this mapping ("Pss"), the number of clean and dirty shared pages, and the number of clean and dirty private pages in the mapping. "Referenced" indicates the amount of memory currently marked as referenced or accessed. "Anonymous" shows the amount of memory that does not belong to any file. "Swap" shows how much would-be-anonymous memory is also used, but out on swap. "VmFlags" represents the kernel flags associated with the virtual memory area as a set of two-character flags, and so on.
- stack - this file provides a symbolic trace of the function calls in this process's kernel stack. Here is what I have for the process itself and for one of its threads (actually the MySQL server's main thread and the page cleaner thread, as we'll see in the next post in this series):
openxs@ao756:~$ sudo cat /proc/30580/stack
[<ffffffff81233b94>] poll_schedule_timeout+0x44/0x70
[<ffffffff81235223>] do_sys_poll+0x4b3/0x570
[<ffffffff81235329>] do_restart_poll+0x49/0x80
[<ffffffff81097155>] sys_restart_syscall+0x25/0x30
[<ffffffff81869c5b>] entry_SYSCALL_64_fastpath+0x22/0xd0
[<ffffffffffffffff>] 0xffffffffffffffff
openxs@ao756:~$ sudo cat /proc/30580/task/30592/stack
[<ffffffff811097c0>] futex_wait_queue_me+0xc0/0x120
[<ffffffff8110a4f6>] futex_wait+0x116/0x270
[<ffffffff8110ca60>] do_futex+0x120/0x5a0
[<ffffffff8110cf61>] SyS_futex+0x81/0x180
[<ffffffff81869c5b>] entry_SYSCALL_64_fastpath+0x22/0xd0
[<ffffffffffffffff>] 0xffffffffffffffff
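To get a one-shot picture of where every thread currently sleeps in the kernel, it is enough to loop over task/ like this (a primitive sketch of the sampling idea that 0x.tools implements properly):

for t in /proc/30580/task/*; do
    echo "=== TID ${t##*/}"; sudo cat $t/stack
done | more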
For other threads we'll see more interesting stacks when they are waiting for disk I/O etc. Stay tuned! Note that if you check the wchan file for the process:
openxs@ao756:~$ sudo strings /proc/30580/wchan
poll_schedule_timeout
we'll see the top line of the output above (hex values aside) as a null-terminated string. This is the "wait channel", the symbolic name corresponding to the location in the kernel where the process is sleeping.
- stat - status information about the process, used by ps. This file is a single line of space-separated fields, so it is easy to parse with some script (see the small sketch after the list of states below) or load into the database. Domas Mituzas once reported a bug, Bug #72027 - "LOAD_FILE() does not work on dynamic files" (fixed only in MySQL 8.0.15+), based on failed attempts to apply LOAD_FILE() to the stat file in /proc.
Output starts with PID and comm fields, followed by the process state and other details. For example:
openxs@ao756:~$ cat /proc/30580/stat
30580 (mysqld) S 1 30579 30579 0 -1 4194368 1007759 0 360 0 68598 13573 0 0 20 0 37 0 218060678 764219392 76109 18446744073709551615 1 1 0 0 0 0 540679 12288 1768 0 0 0 17 1 0 0 513 0 0 0 0 0 0 0 0 0 0
The value of state is a single character:
- R Running
- S Sleeping in an interruptible wait
- D Waiting in uninterruptible disk sleep
- Z Zombie
- T Stopped (on a signal)
- t Tracing stop
- ... some other values were also used in the past
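Since stat is a single space-separated line, picking fields from it in shell scripts is trivial. For example, per man 5 proc, the state, the number of threads and the resident set size in pages are fields 3, 20 and 24 (a sketch; note that comm in field 2 may itself contain spaces for some processes, which would shift the numbering):

awk '{print "state=" $3, "threads=" $20, "rss_pages=" $24}' /proc/30580/stat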
- statm - this file provides information about memory usage, measured in (4K) pages:
openxs@ao756:~$ cat /proc/30580/statm
186577 76109 2515 5590 0 169800 0
Again we see just space-separated fields, easy to parse. The fields are total size, resident set size, shared pages, text (code) size, the next one (former lib) is always 0, then data + stack size and dirty pages. Some of these values may be inaccurate because of internal kernel optimizations.
- status - this file provides much of the information from stat and statm in a format that's easier for humans to read. For example:
openxs@ao756:~$ cat /proc/30580/status
Name: mysqld
State: S (sleeping)
Tgid: 30580
Ngid: 0
Pid: 30580
PPid: 1
TracerPid: 0
Uid: 116 116 116 116
Gid: 125 125 125 125
FDSize: 128
Groups: 125
NStgid: 30580
NSpid: 30580
NSpgid: 30579
NSsid: 30579
VmPeak: 746308 kB
VmSize: 746308 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 328248 kB
VmRSS: 304436 kB
VmData: 679056 kB
VmStk: 144 kB
VmExe: 22360 kB
VmLib: 7784 kB
VmPTE: 1036 kB
VmPMD: 16 kB
VmSwap: 13256 kB
HugetlbPages: 0 kB
Threads: 37
SigQ: 0/14893
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000084007
SigIgn: 0000000000003000
SigCgt: 00000001800006e8
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
Seccomp: 0
Speculation_Store_Bypass: thread vulnerable
Cpus_allowed: ff
Cpus_allowed_list: 0-7
Mems_allowed: 00000000,00000001
Mems_allowed_list: 0
voluntary_ctxt_switches: 687
nonvoluntary_ctxt_switches: 82
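For day-to-day checks it is usually enough to grep the few fields you care about, say memory and threads (a sketch):

grep -E '^(VmPeak|VmRSS|VmSwap|Threads)' /proc/30580/status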
The values are supposed to be clear to a reader. If not - just check man 5 proc.
- syscall - this file exposes the system call number and argument registers for the system call currently being executed by the process, followed by the values of the stack pointer and program counter registers. The values of all six argument registers are exposed, although most system calls use fewer registers. If the process is blocked, but not in a system call, then the system call number is -1, followed by just the values of the stack pointer and program counter. If the process is not blocked, the file contains just the string "running". For example, here is my ad hoc poor man's thread monitoring one-liner applied to the mysqld process executing some sysbench load:
openxs@ao756:~$ for dir in `ls /proc/30580/task`; do echo -n $dir': '; 2>/dev/null sudo cat /proc/$dir/syscall; done | more
2389: 202 0x1db141c 0x80 0x834e8 0x0 0x1db1400 0x41a70 0x7fe1cace9fb0 0x7fe1ca807360
2390: 202 0x1db141c 0x80 0x834e5 0x0 0x1db1400 0x41a70 0x7fe1c00e4fb0 0x7fe1ca807360
2391: 202 0x1db141c 0x80 0x83505 0x0 0x1db1400 0x41a7d 0x7fe1c00a3fb0 0x7fe1ca807360
2392: 202 0x1db141c 0x80 0x834fe 0x0 0x1db1400 0x41a7d 0x7fe1b6a39fb0 0x7fe1ca807360
2393: 202 0x1db141c 0x80 0x83504 0x0 0x1db1400 0x41a7d 0x7fe1b6a7afb0 0x7fe1ca807360
2394: 202 0x1db141c 0x80 0x83501 0x0 0x1db1400 0x41a7d 0x7fe1c0125fb0 0x7fe1ca807360
2395: running
2488: 202 0x1db141c 0x80 0x8351e 0x0 0x1db1400 0x41a8a 0x7fe1cadedfb0 0x7fe1ca807360
2493: 202 0x1db141c 0x80 0x83518 0x0 0x1db1400 0x41a8a 0x7fe1cad2afb0 0x7fe1ca807360
2800: 75 0x12 0x1 0x0 0x0 0x0 0x1db0870 0x7fe1c01e8f50 0x7fe1c876a8dd
30580: 7 0x7fe1c13b9e98 0x2 0xffffffff 0x7fe1c84539d0 0x0 0x7fe1c8453700 0x7ffe2f57a590 0x7fe1c876880d
30581: 128 0x7fe1c0bfed70 0x7fe1c0bfedf0 0x0 0x8 0x0 0x7ffe2f57a5d0 0x7fe1c0bfecc0 0x7fe1c86a3b36
30582: 208 0x7fe1cae58000 0x1 0x100 0x7fe1c7c76000 0x7fe1b3bfe750 0x7fe1b3bff9c0
...
Here you can see one running thread and various system calls for the other threads. If only I could map those numbers to something useful... Smart people can, as you'll find out in the next posts.
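Actually, one way to map the numbers to names without being that smart is the ausyscall utility from the auditd package, if it is installed, or a look into the kernel headers (the header path below is the usual one on Ubuntu):

ausyscall 202     # futex - the wait most threads above are in
ausyscall 75      # fdatasync - thread 2800 is syncing a file (descriptor 0x12 = 18, the binary log in the fd/ listing above)
grep -w 202 /usr/include/x86_64-linux-gnu/asm/unistd_64.h    # the same without auditd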
Let me stop checking /proc at this stage, otherwise I'll end up with yet another MySQL-illustrated manual page for /proc the size of a large book chapter.
In the next post I am going to describe monitoring MySQL threads by sampling /proc, ps and top output, based on the information presented here.