Blog of (former?) MySQL Entomologist, by Valerii Kravchuk

2022-02-10: How to Summarize gdb Backtrace with pt-pmp (and Flamegraph)<p>This is going to be more like a note to myself and my readers than a real blog post. But still I'd like to document the trick that I have been applying for years already, in its most convenient form. </p><p>Assuming you have a backtrace (or full backtrace) created from a core file or by attaching <b>gdb</b> to a live (and maybe hanging) process with many threads, like a MySQL or MariaDB server, what is the best way to summarize it quickly, to understand what most of the threads are doing or hanging at? Like this huge backtrace with hundreds of threads:</p><blockquote><p><span style="font-size: x-small;"><span style="font-family: courier;">openxs@ao756:~$ <b>ls -l /tmp/backtrace1.txt</b><br />-rw-rw-r-- 1 openxs openxs 2817054 лют 10 17:02 /tmp/backtrace1.txt<br />openxs@ao756:~$ <b>grep LWP /tmp/backtrace1.txt | wc -l</b><br />1915</span></span><br /></p></blockquote><p>Here is the trick. You have to download <a href="https://www.percona.com/doc/percona-toolkit/LATEST/pt-pmp.html" target="_blank"><b>pt-pmp</b></a> from Percona Toolkit, then source the <b>pt-pmp</b> script and rely on its <b>aggregate_stacktrace</b> function, which is quite good at summarizing stack traces:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~$ <b>which pt-pmp</b><br />/usr/bin/pt-pmp<br />openxs@ao756:~$ <b>. 
/usr/bin/pt-pmp</b><br />openxs@ao756:~$ <b>cat /tmp/backtrace1.txt | aggregate_stacktrace > /tmp/pmp1.txt </b><br />openxs@ao756:~$ <b>ls -l /tmp/pmp1.txt</b><br />-rw-rw-r-- 1 openxs openxs 34174 лют 10 18:07 /tmp/pmp1.txt<br />openxs@ao756:~$ <b>head -5 /tmp/pmp1.txt</b><br /> 598 poll(libc.so.6),vio_io_wait(viosocket.c:945),vio_socket_io_wait(viosocket.c:108),vio_read(viosocket.c:184),my_real_read(net_serv.cc:892),my_net_read_packet_reallen(net_serv.cc:1162),my_net_read_packet(net_serv.cc:1146),do_command(sql_parse.cc:1262),do_handle_one_connection(sql_connect.cc:1336),handle_one_connection(sql_connect.cc:1241),pfs_spawn_thread(pfs.cc:1862),start_thread(libpthread.so.0),clone(libc.so.6)<br /> 82 pthread_cond_wait,wait(os0event.cc:158),wait_low(os0event.cc:158),os_event_wait_low(os0event.cc:158),sync_array_wait_event(sync0arr.cc:471),rw_lock_s_lock_spin(sync0rw.cc:373),rw_lock_s_lock_func(sync0rw.ic:290),pfs_rw_lock_s_lock_func(sync0rw.ic:290),buf_page_get_gen(buf0buf.cc:4905),btr_cur_search_to_nth_level(btr0cur.cc:1243),btr_pcur_open_low(btr0pcur.ic:467),row_ins_scan_sec_index_for_duplicate(btr0pcur.ic:467),row_ins_sec_index_entry_low(btr0pcur.ic:467),row_ins_sec_index_entry(row0ins.cc:3251),row_ins_index_entry(row0ins.cc:3297),row_ins_index_entry_step(row0ins.cc:3297),row_ins(row0ins.cc:3297),row_ins_step(row0ins.cc:3297),row_insert_for_mysql(row0mysql.cc:1414),ha_innobase::write_row(ha_innodb.cc:8231),handler::ha_write_row(handler.cc:6089),write_record(sql_insert.cc:1941),mysql_insert(sql_insert.cc:1066),mysql_execute_command(sql_parse.cc:4170),mysql_parse(sql_parse.cc:7760),dispatch_command(sql_parse.cc:1832),do_command(sql_parse.cc:1386),do_handle_one_connection(sql_connect.cc:1336),handle_one_connection(sql_connect.cc:1241),pfs_spawn_thread(pfs.cc:1862),start_thread(libpthread.so.0),clone(libc.so.6)<br /> 55 
pthread_cond_wait,wait(os0event.cc:158),wait_low(os0event.cc:158),os_event_wait_low(os0event.cc:158),lock_wait_suspend_thread(lock0wait.cc:347),row_mysql_handle_errors(row0mysql.cc:741),row_insert_for_mysql(row0mysql.cc:1428),ha_innobase::write_row(ha_innodb.cc:8231),handler::ha_write_row(handler.cc:6089),write_record(sql_insert.cc:1941),mysql_insert(sql_insert.cc:1066),mysql_execute_command(sql_parse.cc:4170),mysql_parse(sql_parse.cc:7760),dispatch_command(sql_parse.cc:1832),do_command(sql_parse.cc:1386),do_handle_one_connection(sql_connect.cc:1336),handle_one_connection(sql_connect.cc:1241),pfs_spawn_thread(pfs.cc:1862),start_thread(libpthread.so.0),clone(libc.so.6)<br /> 38 pthread_cond_wait,wait(os0event.cc:158),wait_low(os0event.cc:158),os_event_wait_low(os0event.cc:158),sync_array_wait_event(sync0arr.cc:471),rw_lock_x_lock_func(sync0rw.cc:733),pfs_rw_lock_x_lock_func(sync0rw.ic:544),buf_page_get_gen(buf0buf.cc:4918),btr_cur_search_to_nth_level(btr0cur.cc:1243),row_ins_sec_index_entry_low(row0ins.cc:2946),row_ins_sec_index_entry(row0ins.cc:3251),row_ins_index_entry(row0ins.cc:3297),row_ins_index_entry_step(row0ins.cc:3297),row_ins(row0ins.cc:3297),row_ins_step(row0ins.cc:3297),row_insert_for_mysql(row0mysql.cc:1414),ha_innobase::write_row(ha_innodb.cc:8231),handler::ha_write_row(handler.cc:6089),write_record(sql_insert.cc:1941),mysql_insert(sql_insert.cc:1066),mysql_execute_command(sql_parse.cc:4170),mysql_parse(sql_parse.cc:7760),dispatch_command(sql_parse.cc:1832),do_command(sql_parse.cc:1386),do_handle_one_connection(sql_connect.cc:1336),handle_one_connection(sql_connect.cc:1241),pfs_spawn_thread(pfs.cc:1862),start_thread(libpthread.so.0),clone(libc.so.6)<br /> 32 
pthread_cond_wait,wait(os0event.cc:158),wait_low(os0event.cc:158),os_event_wait_low(os0event.cc:158),sync_array_wait_event(sync0arr.cc:471),rw_lock_x_lock_func(sync0rw.cc:733),pfs_rw_lock_x_lock_func(sync0rw.ic:544),buf_page_get_gen(buf0buf.cc:4918),btr_cur_search_to_nth_level(btr0cur.cc:1243),row_ins_sec_index_entry_low(row0ins.cc:3040),row_ins_sec_index_entry(row0ins.cc:3251),row_ins_index_entry(row0ins.cc:3297),row_ins_index_entry_step(row0ins.cc:3297),row_ins(row0ins.cc:3297),row_ins_step(row0ins.cc:3297),row_insert_for_mysql(row0mysql.cc:1414),ha_innobase::write_row(ha_innodb.cc:8231),handler::ha_write_row(handler.cc:6089),write_record(sql_insert.cc:1941),mysql_insert(sql_insert.cc:1066),mysql_execute_command(sql_parse.cc:4170),mysql_parse(sql_parse.cc:7760),dispatch_command(sql_parse.cc:1832),do_command(sql_parse.cc:1386),do_handle_one_connection(sql_connect.cc:1336),handle_one_connection(sql_connect.cc:1241),pfs_spawn_thread(pfs.cc:1862),start_thread(libpthread.so.0),clone(libc.so.6)<br /></span></span></p></blockquote><p>If a summary like above is not clear enough, we can surely go one step further and create a proper flame graph based on the output above:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~$ cat /tmp/pmp1.txt | awk '{print $2, $1}' | sed -e 's/,/;/g' | ~/git/FlameGraph/flamegraph.pl --countname="threads" --reverse - >/tmp/pmp1.svg <br /></span></span></p></blockquote><p>Then with some creative zooming and search we can concentrate on waits:</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEh7T4GYrioTDlpMVrGu3f8jlJAfwyZOKYuk3jGj2bfOtDuwJ9B6NdanK--E0qrvlF-1_YIN5jfRCbeHbGal4d-IibVum0OkNDYvl-CGIoXkokYmX4kKMzObtQwf3zhtbs2SPLmSsooM4IRg2mh-yy5yduGRqkfEFpmWxmsd23XO_TjIBJkZdEq3ZrFsQw=s1205" imageanchor="1" 
style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="613" data-original-width="1205" height="326" src="https://blogger.googleusercontent.com/img/a/AVvXsEh7T4GYrioTDlpMVrGu3f8jlJAfwyZOKYuk3jGj2bfOtDuwJ9B6NdanK--E0qrvlF-1_YIN5jfRCbeHbGal4d-IibVum0OkNDYvl-CGIoXkokYmX4kKMzObtQwf3zhtbs2SPLmSsooM4IRg2mh-yy5yduGRqkfEFpmWxmsd23XO_TjIBJkZdEq3ZrFsQw=w640-h326" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Frames with the "wait" substring in the function name are highlighted<br /></td></tr></tbody></table><p>That's all. It is as simple as it looks: a quick overview of the backtrace before you start digging in there.<br /></p>

2022-02-04: How to Add Probe Inside a Function and Access Local Variables with bpftrace<p>During several of my talks on <b>bpftrace</b> I mentioned that it is possible to add a probe to "every other" line of the code, same as with <b>perf</b>, but I had never demonstrated how to do this. 
Looks like it's time to show both this and access to local variables declared inside the function we trace, as I am going to speak <a href="https://fosdem.org/2022/schedule/event/mariadb_bfptrace/" target="_blank">about <b>bpftrace</b> again at <b>FOSDEM</b></a> (and have to share something new and cool).<br /></p><p>Consider the following debugging session with the latest MariaDB server 10.8 built from GitHub source (as usual, for my own work I use only my own builds from current source) on Ubuntu 20.04:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/maria10.8$ <b>sudo gdb -p `pidof mariadbd`</b><br />GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2<br />Copyright (C) 2020 Free Software Foundation, Inc.<br />License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html><br />This is free software: you are free to change and redistribute it.<br />There is NO WARRANTY, to the extent permitted by law.<br />Type "show copying" and "show warranty" for details.<br />This GDB was configured as "x86_64-linux-gnu".<br />Type "show configuration" for configuration details.<br />For bug reporting instructions, please see:<br /><http://www.gnu.org/software/gdb/bugs/>.<br />Find the GDB manual and other documentation resources online at:<br /> <http://www.gnu.org/software/gdb/documentation/>.<br /><br />For help, type "help".<br />Type "apropos word" to search for commands related to "word".<br />Attaching to process 200650<br />[New LWP 200652]<br />[New LWP 200653]<br />[New LWP 200654]<br />[New LWP 200655]<br />[New LWP 200656]<br />[New LWP 200659]<br />[New LWP 200662]<br />[New LWP 200663]<br />[Thread debugging using libthread_db enabled]<br />Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".<br />--Type <RET> for more, q to quit, c to continue without paging--<br />0x00007f7b8424caff in __GI___poll (fds=0x557960733d38, nfds=3,<br /> timeout=timeout@entry=-1) at 
../sysdeps/unix/sysv/linux/poll.c:29<br />29 ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.<br />(gdb) <b>b do_command</b><br />Breakpoint 1 at 0x55795e867570: <b>file /home/openxs/git/server/sql/sql_parse.cc, line 1197.</b><br />(gdb) <b>c</b><br />Continuing.</span></span><br /></p></blockquote><p>Basically, I've attached to the running <b>mariadbd</b> process and set a breakpoint on <b>do_command()</b>, and then let it continue working. In another terminal I've connected to the server and got the breakpoint hit immediately:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[New Thread 0x7f7b5dbf0700 (LWP 200680)]<br />[New Thread 0x7f7b6c084700 (LWP 200797)]<br />[Switching to Thread 0x7f7b6c084700 (LWP 200797)]<br /><br /><b>Thread 11 "mariadbd" hit Breakpoint 1, do_command (thd=0x7f7b4c001738,<br /> blocking=blocking@entry=true)<br /> at /home/openxs/git/server/sql/sql_parse.cc:1197<br /></b>1197 {<br />(gdb) <b>bt</b><br />#0 do_command (thd=0x7f7b4c001738, blocking=blocking@entry=true)<br /> at /home/openxs/git/server/sql/sql_parse.cc:1197<br />#1 0x000055795e97df07 in do_handle_one_connection (connect=<optimized out>,<br /> put_in_cache=true) at /home/openxs/git/server/sql/sql_connect.cc:1418<br />#2 0x000055795e97e23d in handle_one_connection (arg=arg@entry=0x55796079f798)<br /> at /home/openxs/git/server/sql/sql_connect.cc:1312<br />#3 0x000055795ecc7e8d in pfs_spawn_thread (arg=0x557960733dd8)<br /> at /home/openxs/git/server/storage/perfschema/pfs.cc:2201<br />#4 0x00007f7b84686609 in start_thread (arg=<optimized out>)<br /> at pthread_create.c:477<br />#5 0x00007f7b84259293 in clone ()<br /> at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95</span></span><br /></p></blockquote><p>From this state in <b>gdb</b> we can trace execution step by step:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">(gdb) <b>n</b><br />[New Thread 0x7f7b5ebf2700 (LWP 200800)]<br />1213 if 
(thd->async_state.m_state == thd_async_state::enum_async_state::RESUMED)<br />(gdb) <b>n</b><br />1229 thd->lex->current_select= 0;<br />(gdb) <b>n</b><br />1237 if (!thd->skip_wait_timeout)<br />(gdb) <b>n</b><br />1238 my_net_set_read_timeout(net, thd->get_net_wait_timeout());<br />(gdb) <b>n</b><br />1241 thd->clear_error(1);<br />(gdb) <b>n</b><br />1243 net_new_transaction(net);<br />(gdb) <b>n</b><br />1246 thd->start_bytes_received= thd->status_var.bytes_received;<br />(gdb) <b>n</b><br />1264 packet_length= my_net_read_packet(net, 1);<br />(gdb) <b>n</b><br />1266 if (unlikely(packet_length == packet_error))<br />(gdb) <b>n</b><br />1300 packet= (char*) net->read_pos;<br />(gdb) <b>n</b><br />1309 if (packet_length == 0) /* safety */<br />(gdb) <b>n</b><br />1316 packet[packet_length]= '\0'; /* safety */<br />(gdb) <b>n</b><br />1319 command= fetch_command(thd, packet);<br />(gdb) <b>p packet</b><br /><b>$1 = 0x7f7b4c008d08 "\003select @@version_comment limit 1"</b><br />(gdb) <b>q</b><br />A debugging session is active.<br /><br /> Inferior 1 [process 200650] will be detached.<br /><br />Quit anyway? (y or n) <b>y</b><br />Detaching from program: /home/openxs/dbs/maria10.8/bin/mariadbd, process 200650<br />[Inferior 1 (process 200650) detached]</span></span><br /></p></blockquote><p>In the session above I stepped through the code until I got to the place where the local variable <b>packet</b> got its value, a zero-terminated string. I know from previous experience that its value is the SQL statement to be executed (in most cases). So, I printed it to find out that the <b>mysql</b> command line client starts by asking for the server version comment to output.<br /></p><p>The question is: can we do the same with <b>bpftrace</b>, attach a probe to the instruction at/around line <b>1316</b> of <b>sql/sql_parse.cc</b>, and print the value of the local variable named <b>packet</b> as a string? 
<br /></p><p>It should be possible, as we can do this with <b>perf</b> or even raw <b>ftrace</b> probes. This is how we can find all "traceable" lines in the code of the <b>do_command()</b> function with <b>perf</b>:<br /></p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/maria10.8$ <b>sudo perf probe -x /home/openxs/dbs/maria10.8/bin/mysqld --line do_command</b><br /><do_command@/home/openxs/git/server/sql/sql_parse.cc:0><br /> 0 dispatch_command_return do_command(THD *thd, bool blocking)<br /> {<br /> 2 dispatch_command_return return_value;<br /> char *packet= 0;<br /> ulong packet_length;<br /> NET *net= &thd->net;<br /> enum enum_server_command command;<br /> DBUG_ENTER("do_command");<br /><br /> #ifdef WITH_WSREP<br /> DBUG_ASSERT(!thd->async_state.pending_ops() ||<br /> (WSREP(thd) &&<br /> thd->wsrep_trx().state() == wsrep::transaction::s_abort><br /> #else<br /> DBUG_ASSERT(!thd->async_state.pending_ops());<br /> #endif<br /><br /> if (thd->async_state.m_state == thd_async_state::enum_async_state::R><br /> {<br /> /*<br /> Resuming previously suspended command.<br /> Restore the state<br /> */<br /> 23 command = thd->async_state.m_command;<br /> 24 packet = thd->async_state.m_packet.str;<br /> packet_length = (ulong)thd->async_state.m_packet.length;<br /> goto resume;<br /> }<br /><br />...<br /> 104 packet= (char*) net->read_pos;<br /> /*<br /> 'packet_length' contains length of data, as it was stored in packet<br /> header. 
In case of malformed header, my_net_read returns zero.<br /> If packet_length is not zero, my_net_read ensures that the returned<br /> number of bytes was actually read from network.<br /> There is also an extra safety measure in my_net_read:<br /> it sets packet[packet_length]= 0, but only for non-zero packets.<br /> */<br /> 113 if (packet_length == 0) /* safety */<br /> {<br /> /* Initialize with COM_SLEEP packet */<br /> 116 packet[0]= (uchar) COM_SLEEP;<br /> 117 packet_length= 1;<br /> }<br /> /* Do not rely on my_net_read, extra safety against programming erro><br /><b> 120 packet[packet_length]= '\0'; /* safety */<br /></b><br /><br /> 123 command= fetch_command(thd, packet);<br />...<br /> 221 DBUG_ASSERT(thd->m_digest == NULL);<br /> DBUG_ASSERT(thd->m_statement_psi == NULL);<br /> #ifdef WITH_WSREP<br /> if (packet_length != packet_error)<br /> {<br /> /* there was a command to process, and before_command() has been c><br /> if (unlikely(wsrep_service_started))<br /> 228 wsrep_after_command_after_result(thd);<br /> }<br /> #endif /* WITH_WSREP */<br /> DBUG_RETURN(return_value);<br /><br />openxs@ao756:~/dbs/maria10.8$</span></span><br /></p></blockquote><p>So, basically we need a probe on line 120 to match the place where I printed the value in <b>gdb</b>. 
Let's create it with <b>perf</b>:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/maria10.8$ <b>sudo perf probe -x /home/openxs/dbs/maria10.8/bin/mariadbd 'do_command:120 packet:string'</b><br />Added new event:<br /> probe_mariadbd:do_command_L120 (on do_command:120 in /home/openxs/dbs/maria10.8/bin/mariadbd with packet:string)<br /><br />You can now use it in all perf tools, such as:<br /><br /> perf record -e probe_mariadbd:do_command_L120 -aR sleep 1<br /></span></span></p></blockquote><p>Now we can check which uprobe was really created by that command line:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/maria10.8$ <b>sudo cat /sys/kernel/tracing/uprobe_events</b><br />p:probe_mariadbd/do_command_L120 <b>/home/openxs/dbs/maria10.8/bin/mariadbd:0x00000000007b761f packet_string=+0(%r10):string</b></span></span><br /></p></blockquote><p>So, the probe is attached to some address, <b>0x7b761f</b>, in the <b>mariadbd</b> binary. It looks like the probe reads the <b>r10</b> register to access the local variable <b>packet</b>.<br /></p><p>Now let's delete this probe (we'll add it back with <b>bpftrace</b> soon) and try to make sense of that address:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/maria10.8$ <b>sudo perf probe --del 'probe_mariadbd:do_command*'</b><br />Removed event: probe_mariadbd:do_command_L120<br />openxs@ao756:~/dbs/maria10.8$ <b>objdump -tT /home/openxs/dbs/maria10.8/bin/mariadbd | grep do_command</b><br />00000000007b7570 g F .text 00000000000007e9 _Z10do_commandP3THDb<br />0000000000<b>7b7570</b> g DF .text 00000000000007e9 Base _Z10do_commandP3THDb</span></span><br /></p></blockquote><p>Note that based on the above the entry address of the <b>do_command()</b> function in the binary is <b>0x7b7570</b>, a bit smaller than <b>0x7b761f</b>. 
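As a sanity check, the offset of the probed instruction can be computed with plain shell arithmetic (the addresses below are the ones observed in this session; they will differ for your own build):

```shell
# Offset of the probed instruction inside do_command():
# uprobe address (from uprobe_events) minus function entry address (from objdump).
printf 'offset: 0x%x (%d decimal)\n' $(( 0x7b761f - 0x7b7570 )) $(( 0x7b761f - 0x7b7570 ))
# prints: offset: 0xaf (175 decimal)
```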
The difference is <b>0x7b761f</b> - <b>0x7b7570 = 0xaf</b> or 175 in decimal. </p><p>Can we come up with the address for the probe without <b>perf</b>? Yes, we can do it by checking the assembly code of the function, for example, in <b>gdb</b>:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/maria10.8$ <b>ps aux | grep 'mariadbd ' | grep -v grep</b><br />openxs <b>200650</b> 0.0 2.4 1341032 94976 ? Sl січ30 0:05 /home/openxs/dbs/maria10.8/bin/mariadbd --no-defaults --basedir=/home/openxs/dbs/maria10.8 --datadir=/home/openxs/dbs/maria10.8/data --plugin-dir=/home/openxs/dbs/maria10.8/lib/plugin --log-error=/home/openxs/dbs/maria10.8/data/ao756.err --pid-file=ao756.pid --socket=/tmp/mariadb.sock --port=3309<br />openxs@ao756:~/dbs/maria10.8$ <b>sudo gdb -p `pidof mariadbd`</b><br />GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2<br />Copyright (C) 2020 Free Software Foundation, Inc.<br />License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html><br />This is free software: you are free to change and redistribute it.<br />There is NO WARRANTY, to the extent permitted by law.<br />Type "show copying" and "show warranty" for details.<br />This GDB was configured as "x86_64-linux-gnu".<br />Type "show configuration" for configuration details.<br />For bug reporting instructions, please see:<br /><http://www.gnu.org/software/gdb/bugs/>.<br />Find the GDB manual and other documentation resources online at:<br /> <http://www.gnu.org/software/gdb/documentation/>.<br /><br />For help, type "help".<br />Type "apropos word" to search for commands related to "word".<br />Attaching to process 200650<br />[New LWP 200652]<br />[New LWP 200653]<br />[New LWP 200654]<br />[New LWP 200655]<br />[New LWP 200662]<br />[New LWP 200663]<br />[New LWP 204349]<br />[Thread debugging using libthread_db enabled]<br />Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".<br />--Type <RET> for more, q to 
quit, c to continue without paging--<br />0x00007f7b8424caff in __GI___poll (fds=0x557960733d38, nfds=3,<br /> timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29<br />29 ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.<br /></span><span style="font-size: x-small;">(gdb) <b>disass /m do_command</b><br />Dump of assembler code for function do_command(THD*, bool):<br />1142 static bool wsrep_command_no_result(char command)<br /><br />1143 {<br />1144 return (command == COM_STMT_PREPARE ||<br /> 0x000055795e8679d0 <+1120>: cmp $0x16,%al<br /> 0x000055795e8679d2 <+1122>: je 0x55795e8679dc <do_command(THD*, bool)+1132><br /> 0x000055795e8679d4 <+1124>: cmp $0x1c,%al<br /> 0x000055795e8679d6 <+1126>: jne 0x55795e867b50 <do_command(THD*, bool)+1504><br /> 0x000055795e8679dc <+1132>: mov $0x1,%esi<br /><br />1145 command == COM_STMT_FETCH ||<br />1146 command == COM_STMT_SEND_LONG_DATA ||<br />1147 command == COM_STMT_CLOSE);<br />1148 }<br />1149 #endif /* WITH_WSREP */<br />1150 #ifndef EMBEDDED_LIBRARY<br />1151 static enum enum_server_command fetch_command(THD *thd, char *packet)<br /><br />1152 {<br />1153 enum enum_server_command<br />--Type <RET> for more, q to quit, c to continue without paging--<br /> 0x000055795e867628 <+184>: movzbl %al,%r15d<br /><br />1154 command= (enum enum_server_command) (uchar) packet[0];<br />1155 DBUG_ENTER("fetch_command");<br /><br />1156<br />1157 if (command >= COM_END ||<br /> 0x000055795e86762c <+188>: lea -0x20(%r15),%edx<br /> 0x000055795e867630 <+192>: cmp $0xd9,%edx<br /> 0x000055795e867636 <+198>: jbe 0x55795e8676f0 <do_command(THD*, bool)+384><br /> 0x000055795e86763c <+204>: cmp $0xff,%al<br /> 0x000055795e86763e <+206>: je 0x55795e8676f0 <do_command(THD*, bool)+384><br /><br />1158 (command >= COM_MDB_GAP_BEG && command <= COM_MDB_GAP_END))<br />1159 command= COM_END; // Wrong command<br /><br />1160<br />1161 DBUG_PRINT("info",("Command on %s = %d (%s)",<br /><br />1162 
vio_description(thd->net.vio), command,<br />1163 command_name[command].str));<br />--Type <RET> for more, q to quit, c to continue without paging--<br />1164 DBUG_RETURN(command);<br /><br />1165 }<br />1166<br />1167 /**<br />1168 Read one command from connection and execute it (query or simple command).<br />1169 This function is to be used by different schedulers (one-thread-per-connection,<br />1170 pool-of-threads)<br />1171<br />1172 For profiling to work, it must never be called recursively.<br />1173<br />1174 @param thd - client connection context<br />1175<br />1176 @param blocking - wait for command to finish.<br />1177 if false (nonblocking), then the function might<br />1178 return when command is "half-finished", with<br />1179 DISPATCH_COMMAND_WOULDBLOCK.<br />1180 Currenly, this can *only* happen when using<br />1181 threadpool. The command will resume, after all outstanding<br />1182 async operations (i.e group commit) finish.<br />--Type <RET> for more, q to quit, c to continue without paging--<br />1183 Threadpool scheduler takes care of "resume".<br />1184<br />1185 @retval<br />1186 DISPATCH_COMMAND_SUCCESS - success<br />1187 @retval<br />1188 DISPATCH_COMMAND_CLOSE_CONNECTION request of THD shutdown<br />1189 (s. dispatch_command() description)<br />1190 @retval<br />1191 DISPATCH_COMMAND_WOULDBLOCK - need to wait for asyncronous operations<br />1192 to finish. 
Only returned if parameter<br />1193 'blocking' is false.<br />1194 */<br />1195<br /><b>1196 dispatch_command_return do_command(THD *thd, bool blocking)<br /></b>1197 {<br /> 0x000055795e867570 <+0>: endbr64<br /><br />1198 dispatch_command_return return_value;<br /><br />1199 char *packet= 0;<br /><br />1200 ulong packet_length;<br />--Type <RET> for more, q to quit, c to continue without paging--<br /><br />1201 NET *net= &thd->net;<br /><br />1202 enum enum_server_command command;<br /><br />1203 DBUG_ENTER("do_command");<br /><br />1204<br />1205 #ifdef WITH_WSREP<br />1206 DBUG_ASSERT(!thd->async_state.pending_ops() ||<br /><br />1207 (WSREP(thd) &&<br />1208 thd->wsrep_trx().state() == wsrep::transaction::s_aborted));<br />1209 #else<br />1210 DBUG_ASSERT(!thd->async_state.pending_ops());<br />1211 #endif<br />1212<br />1213 if (thd->async_state.m_state == thd_async_state::enum_async_state::RESUMED)<br /> 0x000055795e867574 <+4>: push %rbp<br /> 0x000055795e867575 <+5>: mov %rsp,%rbp<br /> 0x000055795e867578 <+8>: push %r15<br />--Type <RET> for more, q to quit, c to continue without paging--<br /> 0x000055795e86757a <+10>: push %r14<br /> 0x000055795e86757c <+12>: push %r13<br /> 0x000055795e86757e <+14>: mov %esi,%r13d<br /> 0x000055795e867581 <+17>: push %r12<br /> 0x000055795e867583 <+19>: push %rbx<br /> 0x000055795e867584 <+20>: mov %rdi,%rbx<br /> 0x000055795e867587 <+23>: sub $0x78,%rsp<br /> 0x000055795e86758b <+27>: cmpl $0x2,0x6270(%rdi)<br /> 0x000055795e867592 <+34>: je 0x55795e867770 <do_command(THD*, bool)+512><br /> 0x000055795e867598 <+40>: mov 0x58(%rdi),%rax<br /> 0x000055795e86759c <+44>: cmpb $0x0,0x279e(%rdi)<br /> 0x000055795e8675a3 <+51>: lea 0x290(%rdi),%r14<br /><br />1214 {<br />1215 /*<br />1216 Resuming previously suspended command.<br />1217 Restore the state<br />1218 */<br />1219 command = thd->async_state.m_command;<br /> 0x000055795e867770 <+512>: mov 0x6278(%rdi),%r10<br /> 0x000055795e867777 <+519>: mov 
0x6280(%rdi),%r12<br /> 0x000055795e86777e <+526>: movzbl %sil,%r8d<br />--Type <RET> for more, q to quit, c to continue without paging--<br /> 0x000055795e867782 <+530>: mov %rdi,%rsi<br /> 0x000055795e867785 <+533>: mov 0x6274(%rdi),%r15d<br /><br />1220 packet = thd->async_state.m_packet.str;<br /><br />1221 packet_length = (ulong)thd->async_state.m_packet.length;<br /><br />1222 goto resume;<br /><br />1223 }<br />1224<br />1225 /*<br />1226 indicator of uninitialized lex => normal flow of errors handling<br />1227 (see my_message_sql)<br />1228 */<br />1229 thd->lex->current_select= 0;<br /> 0x000055795e8675aa <+58>: movq $0x0,0xd38(%rax)<br /><br />1230<br />1231 /*<br />1232 This thread will do a blocking read from the client which<br />1233 will be interrupted when the next command is received from<br />1234 the client, the connection is closed or "net_wait_timeout"<br />--Type <RET> for more, q to quit, c to continue without paging--<br />1235 number of seconds has passed.<br />1236 */<br />1237 if (!thd->skip_wait_timeout)<br /> 0x000055795e8675b5 <+69>: je 0x55795e867710 <do_command(THD*, bool)+416><br /><br />1238 my_net_set_read_timeout(net, thd->get_net_wait_timeout());<br />1239<br />1240 /* Errors and diagnostics are cleared once here before query */<br />1241 thd->clear_error(1);<br />1242<br />1243 net_new_transaction(net);<br /> 0x000055795e8675dd <+109>: mov 0x13f0(%rbx),%rax<br /> 0x000055795e8675e4 <+116>: mov $0x1,%esi<br /> 0x000055795e8675e9 <+121>: mov %r14,%rdi<br /> 0x000055795e8675ec <+124>: movl $0x0,0x2f0(%rbx)<br /><br />1244<br />1245 /* Save for user statistics */<br />1246 thd->start_bytes_received= thd->status_var.bytes_received;<br /> 0x000055795e8675f6 <+134>: mov %rax,0x4040(%rbx)<br /><br />1247<br />--Type <RET> for more, q to quit, c to continue without paging--<br />1248 /*<br />1249 Synchronization point for testing of KILL_CONNECTION.<br />1250 This sync point can wait here, to simulate slow code execution<br />1251 
between the last test of thd->killed and blocking in read().<br />1252<br />1253 The goal of this test is to verify that a connection does not<br />1254 hang, if it is killed at this point of execution.<br />1255 (Bug#37780 - main.kill fails randomly)<br />1256<br />1257 Note that the sync point wait itself will be terminated by a<br />1258 kill. In this case it consumes a condition broadcast, but does<br />1259 not change anything else. The consumed broadcast should not<br />1260 matter here, because the read/recv() below doesn't use it.<br />1261 */<br />1262 DEBUG_SYNC(thd, "before_do_command_net_read");<br /><br />1263<br />1264 packet_length= my_net_read_packet(net, 1);<br /> 0x000055795e8675fd <+141>: callq 0x55795ebda8a0 <my_net_read_packet(NET*, my_bool)><br /> 0x000055795e867602 <+146>: mov %rax,%r12<br /><br />1265<br />--Type <RET> for more, q to quit, c to continue without paging--<br />1266 if (unlikely(packet_length == packet_error))<br /> 0x000055795e867605 <+149>: cmp $0xffffffffffffffff,%rax<br /> 0x000055795e867609 <+153>: je 0x55795e8678d0 <do_command(THD*, bool)+864><br /><br />1267 {<br />1268 DBUG_PRINT("info",("Got error %d reading command from socket %s",<br /><br />1269 net->error,<br />1270 vio_description(net->vio)));<br />1271<br />1272 /* Instrument this broken statement as "statement/com/error" */<br />1273 thd->m_statement_psi= MYSQL_REFINE_STATEMENT(thd->m_statement_psi,<br /> 0x000055795e8678d0 <+864>: mov 0x3b78(%rbx),%rdi<br /><br />1274 com_statement_info[COM_END].<br />1275 m_key);<br />1276<br />1277<br />1278 /* Check if we can continue without closing the connection */<br />1279<br />1280 /* The error must be set. 
*/<br />--Type <RET> for more, q to quit, c to continue without paging--<br />1281 DBUG_ASSERT(thd->is_error());<br /><br />1282 thd->protocol->end_statement();<br /> 0x000055795e8678ee <+894>: mov 0x558(%rbx),%rdi<br /> 0x000055795e8678f5 <+901>: callq 0x55795e79f7e0 <Protocol::end_statement()><br /><br />1283<br />1284 /* Mark the statement completed. */<br />1285 MYSQL_END_STATEMENT(thd->m_statement_psi, thd->get_stmt_da());<br />1286 thd->m_statement_psi= NULL;<br /> 0x000055795e86791b <+939>: cmpb $0x3,0x324(%rbx)<br /> 0x000055795e867922 <+946>: mov $0x1,%r13d<br /> 0x000055795e867928 <+952>: movq $0x0,0x3b78(%rbx)<br /><br />1287 thd->m_digest= NULL;<br /> 0x000055795e867933 <+963>: movq $0x0,0x3b30(%rbx)<br /><br />1288<br />1289 if (net->error != 3)<br /> 0x000055795e86793e <+974>: jne 0x55795e86794a <do_command(THD*, bool)+986><br /><br />--Type <RET> for more, q to quit, c to continue without paging--<br />1290 {<br />1291 return_value= DISPATCH_COMMAND_CLOSE_CONNECTION; // We have to close it.<br />1292 goto out;<br />1293 }<br />1294<br />1295 net->error= 0;<br /> 0x000055795e867940 <+976>: movb $0x0,0x324(%rbx)<br /><br />1296 return_value= DISPATCH_COMMAND_SUCCESS;<br /><br />1297 goto out;<br /> 0x000055795e867947 <+983>: xor %r13d,%r13d<br /><br />1298 }<br />1299<br />1300 packet= (char*) net->read_pos;<br /> 0x000055795e86760f <+159>: mov 0x2b0(%rbx),%r10<br /><br />1301 /*<br />1302 'packet_length' contains length of data, as it was stored in packet<br />1303 header. 
In case of malformed header, my_net_read returns zero.<br />1304 If packet_length is not zero, my_net_read ensures that the returned<br />--Type <RET> for more, q to quit, c to continue without paging--<br />1305 number of bytes was actually read from network.<br />1306 There is also an extra safety measure in my_net_read:<br />1307 it sets packet[packet_length]= 0, but only for non-zero packets.<br />1308 */<br />1309 if (packet_length == 0) /* safety */<br /> 0x000055795e867616 <+166>: test %rax,%rax<br /> 0x000055795e867619 <+169>: je 0x55795e8676e0 <do_command(THD*, bool)+368><br /><br />1310 {<br />1311 /* Initialize with COM_SLEEP packet */<br />1312 packet[0]= (uchar) COM_SLEEP;<br /> 0x000055795e8676e0 <+368>: movb $0x0,(%r10)<br /><br />1313 packet_length= 1;<br /> 0x000055795e8676e4 <+372>: mov $0x1,%r12d<br /> 0x000055795e8676ea <+378>: jmpq 0x55795e86761f <do_command(THD*, bool)+175><br /> 0x000055795e8676ef <+383>: nop<br /><br />1314 }<br />1315 /* Do not rely on my_net_read, extra safety against programming errors. 
*/<br />--Type <RET> for more, q to quit, c to continue without paging--<br /><b>1316 packet[packet_length]= '\0'; /* safety */<br /> 0x000055795e86761f <+175>: movb $0x0,(%r10,%r12,1)<br /><br /></b>1317<br />1318<br />1319 command= fetch_command(thd, packet);<br /> 0x000055795e867624 <+180>: movzbl (%r10),%eax<br /><br />1320<br />1321 #ifdef WITH_WSREP<br />1322 DEBUG_SYNC(thd, "wsrep_before_before_command");<br /><br />1323 /*<br />1324 If this command does not return a result, then we<br />1325 instruct wsrep_before_command() to skip result handling.<br />1326 This causes BF aborted transaction to roll back but keep<br />1327 the error state until next command which is able to return<br />1328 a result to the client.<br />1329 */<br />1330 if (unlikely(wsrep_service_started) &&<br /> 0x000055795e867644 <+212>: lea 0x1862105(%rip),%rcx # 0x5579600c9750 <wsrep_service_started><br /> 0x000055795e86764b <+219>: cmpb $0x0,(%rcx)<br />--Type <RET> for more, q to quit, c to continue without paging--<b>q</b><br />Quit<br />(gdb)<b> quit</b><br />A debugging session is active.<br /><br /> Inferior 1 [process 200650] will be detached.<br /><br />Quit anyway? (y or n) <b>y</b><br />Detaching from program: /home/openxs/dbs/maria10.8/bin/mariadbd, process 200650<br />[Inferior 1 (process 200650) detached]</span></span><br /></p></blockquote><p>At this stage we can stop debugging and quit from <b>gdb</b>. We see that line <b>1316</b> in the code (where the packet is properly formed and ready for printing) matches (decimal) offset <b>175</b> from the beginning of the function. Now we can add that offset to the function's start address from <b>objdump</b> to find the address for the probe, <b>0x7b761f</b>.</p><p>The only remaining detail is how to print the content of the register <b>r10</b> as a zero-terminated string.
Just read the fine <a href="https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#10-reg-registers" target="_blank"><b>bpftrace</b> manual</a> for this, and you can come up with the following:<br /></p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/maria10.8$ <b>cd ~/git/bpftrace/</b><br />openxs@ao756:~/git/bpftrace$ <b>cd build/</b><br />openxs@ao756:~/git/bpftrace/build$ <b>sudo src/bpftrace -e 'uprobe:/home/openxs/dbs/maria10.8/bin/mariadbd:0x00000000007b761f { printf("Function: %s packet: %s\n", func, str(reg("r10"))); }'</b><br />[sudo] password for openxs:<br />Attaching 1 probe...<br />Function: do_command(THD*, bool) packet: select @@version_comment limit 1<br />Function: do_command(THD*, bool) packet: select 1 + 1<br />^C</span></span><br /></p></blockquote><p>This is what I got from the probe while, in another terminal, I connected and executed the following:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/maria10.8$ <b>bin/mysql</b><br />ERROR 2002 (HY000): Can't connect to local server through socket '/tmp/mysql.sock' (2)<br />openxs@ao756:~/dbs/maria10.8$ bin/mysql --socket=/tmp/mariadb.sock<br />Welcome to the MariaDB monitor. Commands end with ; or \g.<br />Your MariaDB connection id is 4<br />Server version: 10.8.1-MariaDB Source distribution<br /><br />Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.<br /><br />Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.<br /><br />MariaDB [(none)]> <b>select 1 + 1;</b><br />+-------+<br />| 1 + 1 |<br />+-------+<br />| 2 |<br />+-------+<br />1 row in set (0.000 sec)</span></span><br /></p></blockquote><p>So, as long as you know the address of an instruction inside the function's code, you can attach a probe to it.
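<p>The address arithmetic itself is trivial to script. A minimal shell sketch (the function start address below is back-computed from the probe address in this post, <b>0x7b761f</b> minus offset <b>175</b>, since the original <b>objdump</b> output is not shown in this excerpt):</p>

```shell
# Start address of do_command(THD*, bool) as objdump would report it;
# back-computed here (0x7b761f - 175) because the objdump output is not shown.
func_start=0x7b7570
offset=175              # decimal offset of line 1316, as reported by gdb
printf '0x%x\n' $(( func_start + offset ))
# prints 0x7b761f, the address to attach the uprobe to
```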
You can safely experiment: if the address is wrong,</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/git/bpftrace/build$ <b>sudo src/bpftrace -e 'uprobe:/home/openxs/dbs/maria10.8/bin/mariadbd:0x00000000007b761e { printf("Function: %s packet: %s\n", func, str(reg("r10"))); }'</b><br />Attaching 1 probe...<br />ERROR: Could not add uprobe into middle of instruction: /home/openxs/dbs/maria10.8/bin/mariadbd:_Z10do_commandP3THDb+174<br />openxs@ao756:~/git/bpftrace/build$ <b>sudo src/bpftrace -e 'uprobe:/home/openxs/dbs/maria10.8/bin/mariadbd:0x00000000008b761f { printf("Function: %s packet: %s\n", func, str(reg("r10"))); }'</b><br />Attaching 1 probe...<br />ERROR: Could not add uprobe into middle of instruction: /home/openxs/dbs/maria10.8/bin/mariadbd:_ZN14partition_info25set_up_default_partitionsEP3THDP7handlerP14HA_CREATE_INFOj+799<br /></span></span></p></blockquote><p>the tool will tell you and do no harm.<br /></p><p><br /></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgQ9jogv2Csaa5mRDllSUd1tfPaOSYpnc8caQhNmitEj1Z9krDK3ktxk62W1sBRG8esiOq8jggRo2e1dZ9nsogrjF3zC58RNhTS2hd2Q0Wc6EHLQUVqZRR5UcNyOd2q2zEgy1MjLnkgfA_eHDANT-bLBxaT_aBAgPelxMpuKChpxCrvVBfNXWqdOt-fCQ=s3264" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="2448" data-original-width="3264" height="480" src="https://blogger.googleusercontent.com/img/a/AVvXsEgQ9jogv2Csaa5mRDllSUd1tfPaOSYpnc8caQhNmitEj1Z9krDK3ktxk62W1sBRG8esiOq8jggRo2e1dZ9nsogrjF3zC58RNhTS2hd2Q0Wc6EHLQUVqZRR5UcNyOd2q2zEgy1MjLnkgfA_eHDANT-bLBxaT_aBAgPelxMpuKChpxCrvVBfNXWqdOt-fCQ=w640-h480" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Nice combination of the old Tower and new buildings in the City of London. 
Combining <b>bpftrace</b> with good old <b>gdb</b> is also nice.<br /></td></tr></tbody></table><p></p><p>To summarize:</p><ol style="text-align: left;"><li> It is possible to add a user probe to every assembly instruction inside a function with <b>bpftrace</b> (or <b>perf</b>, or <b>ftrace</b>, as long as you know what you are doing).</li><li>It is possible to access CPU registers in the <b>bpftrace</b> probes.</li><li>One can surely access, print, etc., local variables inside functions, as long as the address or register where they are stored is determined.</li><li>It is surely easier and faster to do all the above with some knowledge of <b>gdb </b>and assembly language for your CPU.</li><li>It is still fun to try to solve more complex problems with <b>bpftrace</b>, so more posts are expected. Stay tuned!<br /></li></ol><p><br /></p>Valerii Kravchukhttp://www.blogger.com/profile/13158916419325454260noreply@blogger.com1tag:blogger.com,1999:blog-3080615211468083537.post-71601444706661352932022-01-30T12:30:00.000+02:002022-01-30T12:30:32.839+02:00DTrace Basics and Tracing Every MariaDB Server Function Call<p>In my <a href="http://mysqlentomologist.blogspot.com/2022/01/first-steps-with-mariadb-server-and.html" target="_blank">previous post</a> I've explained how to build MariaDB Server 10.8 from GitHub source on macOS. This kind of build (totally unsupported by both MariaDB Corporation and Foundation) may be used to study the new and interesting testing, troubleshooting and performance problem solving options that <a href="https://docs.oracle.com/cd/E18752_01/html/819-5488/toc.html" target="_blank">DTrace</a> provides.</p><p>To start adding custom probes and doing something non-trivial one has to understand (or, better, master) some basics of the language used. 
Code in the sample <b>support-files/dtrace/query-time.d</b> program looked clear and familiar enough (at least for those who've seen <a href="http://mysqlentomologist.blogspot.com/search?q=bpftrace" target="_blank"><b>bpftrace</b></a> probes). It consists of several probes like this:<br /></p><blockquote><p><span style="font-size: x-small;"><span style="font-family: courier;"><span><b>mysql*:::query-start</b><br /><b>{<br /></b> self->query = copyinstr(<b>arg0</b>);<br /> self->connid = <b>arg1</b>;<br /> self->db = copyinstr(<b>arg2</b>);<br /> self->who = strjoin(copyinstr(<b>arg3</b>),strjoin("@",copyinstr(<b>arg4</b>)));<br /> self->querystart = <b>timestamp;</b><br /><b>}</b></span></span></span></p></blockquote><p>As we can read in <a href="https://docs.oracle.com/cd/E18752_01/html/819-5488/gcfar.html" target="_blank">the manual</a>, the format for the probe specification is <b>provider:module:function:name</b>.
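<p>The four components of a probe description are separated by colons, which is easy to see by splitting a sample description (purely illustrative shell; <b>dtrace</b> itself does this parsing internally):</p>

```shell
# Split a probe description into its provider:module:function:name parts;
# the empty module and function fields are the components left unspecified.
spec='mysql*:::query-start'
IFS=: read -r provider module function name <<< "$spec"
printf 'provider=%s module=%s function=%s name=%s\n' \
  "$provider" "$module" "$function" "$name"
```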
An empty component in a probe specification matches anything. You can also describe probes with a pattern matching syntax that is similar
to the syntax used by the shell. </p><p>In the case above the (USDT) provider name should start with <b>mysql</b>; we do not care about the module and function (actually, we may have USDT probes with the same name in different functions), and we refer to the probe by name. </p><p>If you want to see what other kinds of probes exist, you can check the <b>dtrace -l</b> output:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">Yuliyas-Air:~ Valerii$ <b>sudo dtrace -l | head -20</b><br />dtrace: system integrity protection is on, some features will not be available<br /><br /><b> ID PROVIDER MODULE FUNCTION NAME<br /> 1 dtrace BEGIN<br /></b> 2 dtrace END<br /> 3 dtrace ERROR<br />...<br /> <b> 42 profile tick-1000<br /></b> 43 profile tick-5000<br /> 44 mach_trap kern_invalid entry<br /> 45 mach_trap kern_invalid return<br /> 46 mach_trap _kernelrpc_mach_vm_allocate_trap entry<br />Yuliyas-Air:~ Valerii$ <b>sudo dtrace -l | tail -4</b><br />dtrace: system integrity protection is on, some features will not be available<br /><br /><b>254755 pid82428 mariadbd do_command(THD*, bool) 5e4<br />254917 pid82428 libsystem_malloc.dylib malloc return<br /></b>467101 security_exception639 csparser _ZN8Security7CFErrorC2Ev [Security::CFError::CFError()] throw-cf<br />467102
security_exception639 csparser _ZN8Security9UnixErrorC2Eib
[Security::UnixError::UnixError(int, bool)] throw-unix</span></span><br /></p></blockquote><p>We should study different providers that DTrace supports:</p><ul style="text-align: left;"><li><b>dtrace</b> - few special probes: <b>BEGIN</b>, <b>END</b> and <b>ERROR</b>.<br /></li><li><b>profile</b> - this provider includes probes that are
associated with an interrupt that fires at some regular, specified
time interval. <b>profile-N</b> probes basically fire <b>N</b> times per second on every CPU. <b>tick-N</b> probes fire on only one CPU per interval.<br /></li><li><a href="https://docs.oracle.com/cd/E18752_01/html/819-5488/gcgkk.html#gcgld" target="_blank"><b>syscall</b></a> - enables you to trace every system call entry and return.</li><li><a href="https://docs.oracle.com/cd/E18752_01/html/819-5488/gcgkk.html#gcgmc" target="_blank"><b>pid*</b></a> - enables tracing of any user
process, as specified by its <b>pid</b>. It basically allows you to trace entries (with the <b>entry</b> probe name) and returns (with the <b>return</b> probe name) of every user function, among other things.<br /></li><li>...<br /></li></ul><p>There are actually more providers, but those mentioned above are enough to start reading and writing DTrace scripts for MariaDB server tracing and profiling, so let's stop listing them for now.</p><p>As we can see from this output </p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">Yuliyas-Air:~ Valerii$ <b>dtrace</b><br />Usage: dtrace [-aACeFHlqSvVwZ] [-arch i386|x86_64] [-b bufsz] [-c cmd] [-D name[=def]]<br /> [-I path] [-L path] [-o output] [-p pid] [-s script] [-U name]<br /> [-x opt[=val]]<br /><br /> [-P provider [[ predicate ] action ]]<br /> [-m [ provider: ] module [[ predicate ] action ]]<br /> [-f [[ provider: ] module: ] func [[ predicate ] action ]]<br /> <b>[-n [[[ provider: ] module: ] func: ] name [[ predicate ] action ]]</b><br /> [-i probe-id <b>[[ predicate ] action ]] [ args ... ]</b><br /><br /><b> predicate -> '/' D-expression '/'<br /> action -> '{' D-statements '}'<br /></b>...</span></span><br /></p></blockquote><p>for each probe we can define an optional <i>predicate</i> (in slashes), an <i>expression</i> that must evaluate to true (a non-zero integer value) when the probe fires for the optional <i>action</i> (in curly brackets) to be executed. An action is a sequence of DTrace statements separated by semicolons. Note that DTrace does NOT support loops or complex conditional execution in the actions, for good reasons discussed elsewhere. We may also have optional <i>arguments</i> for the probe.<br /></p><p>In our <b>query-start</b> probe above there is no predicate, so the action is executed unconditionally when we hit it. We use some built-in functions (like <b>copyinstr()</b>) and variables, like <b>arg0</b>, <b>arg1</b> and <b>timestamp</b>.
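<p>The matching <b>query-done</b> probe of <b>query-time.d</b> (not shown in this excerpt) subtracts <b>self->querystart</b> from <b>timestamp</b> to get the query latency. That start/done pairing logic can be mimicked in plain shell over a made-up event stream, just to make the idea concrete (the event names and numbers here are invented for illustration):</p>

```shell
# Mimic the self->querystart bookkeeping: remember the start timestamp
# per connection id, print the elapsed time when the matching "done" arrives.
printf 'start 7 100\ndone 7 350\nstart 9 200\ndone 9 900\n' |
awk '$1 == "start" { t[$2] = $3 }
     $1 == "done"  { print "conn " $2 " took " $3 - t[$2] " ns" }'
```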
We'll discuss them <a href="https://docs.oracle.com/cd/E18752_01/html/819-5488/gcfpz.html" target="_blank">in detail</a> in some other post, but <b>arg0</b> ... <b>arg9</b> obviously refer to the first ten arguments of the probe and <b>timestamp</b> is the current value of a nanosecond timestamp counter.</p><p>In the action we see several assignments to <b>self->something</b>. These are <a href="https://docs.oracle.com/cd/E19253-01/817-6223/chp-variables-3/index.html" target="_blank"><i>thread-local</i> variables</a> that share a common name in your D
code but refer to separate data storage associated with each operating system thread.</p><p>Now we can try to add some probes. Given this information:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">Yuliyas-Air:~ Valerii$ <b>ps aux | grep mariadbd</b><br />Valerii 37341 0.0 0.0 4267752 700 s003 S+ 10:47AM 0:00.01 grep mariadbd<br />Valerii <b>82428</b> 0.0 0.1 4834244 5960 ?? S Wed09PM 116:47.57 /Users/Valerii/dbs/maria10.8/bin/mariadbd --no-defaults --basedir=/Users/Valerii/dbs/maria10.8 --datadir=/Users/Valerii/dbs/maria10.8/data --plugin-dir=/Users/Valerii/dbs/maria10.8/lib/plugin --log-error=/Users/Valerii/dbs/maria10.8/data/Yuliyas-Air.err --pid-file=Yuliyas-Air.pid</span></span><br /></p></blockquote><p>and knowing that the function name portion of the current probe's description is available via the <b>probefunc</b> <a href="https://docs.oracle.com/cd/E19253-01/817-6223/chp-variables-5/index.html" target="_blank">built-in variable</a>, we can try to trace the entry into every traceable function in this MariaDB server process:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">Yuliyas-Air:~ Valerii$ <b>sudo dtrace -n 'pid82428:::entry { printf ("\nEnter %s", probefunc); }' | more</b><br />dtrace: system integrity protection is on, some features will not be available<br /><br /><b>dtrace: description 'pid82428:::entry ' matched 240404 probes</b><br /><b>CPU ID FUNCTION:NAME<br /> 1 128445 cerror_nocancel:entry</b><br /><b>Enter cerror_nocancel</b><br /> 1 129499 _pthread_testcancel:entry<br />Enter _pthread_testcancel<br /> 1 128444 __error:entry<br />Enter __error<br /> 1 129505 _pthread_cond_updateval:entry<br />Enter _pthread_cond_updateval<br /> 1 129429 pthread_mutex_lock:entry<br />Enter pthread_mutex_lock<br /> 1 36852 my_hrtime:entry<br />Enter my_hrtime<br /> 1 122312 clock_gettime:entry<br />Enter clock_gettime<br /><b> 1 122657 gettimeofday:entry<br />Enter gettimeofday<br /></b> 
1 128412 __commpage_gettimeofday:entry<br />Enter __commpage_gettimeofday<br /> 1 128413 __commpage_gettimeofday_internal:entry<br />Enter __commpage_gettimeofday_internal<br /> 1 128343 mach_absolute_time:entry<br />Enter mach_absolute_time<br />:</span></span><br /></p></blockquote><p>I've used the familiar C-style <b>printf()</b> function to print a custom string including the name of the function when we hit the probe. Note that dtrace attached probes to <b>240404</b> different functions without any problem, in the default configuration. We immediately started to get output that shows function calls happening in background threads all the time. I've used <b>more</b> to be able to copy-paste a small portion of the output, and then stopped tracing with Ctrl+C after quitting from <b>more</b>.</p><p>Note that by default, for each hit of the probe, we get the probe identifier, the CPU it fired on, and the function and probe name reported. Add the <b>-q</b> option to suppress this and see only what we explicitly print in the probe:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">Yuliyas-Air:~ Valerii$ <b>sudo dtrace -q -n 'pid82428:::entry { printf ("\nEnter %s", probefunc); }' | more</b><br />dtrace: system integrity protection is on, some features will not be available<br /><br /><br />Enter cerror_nocancel<br />Enter _pthread_testcancel<br />Enter __error<br />Enter _pthread_cond_updateval<br />Enter pthread_mutex_lock<br />Enter my_hrtime<br />Enter clock_gettime<br />Enter gettimeofday<br />Enter __commpage_gettimeofday<br />Enter __commpage_gettimeofday_internal<br /></span></span>...<br /></p></blockquote><p>If we want to trace all processes with "<b>mariadbd</b>" as the name of the executable, we can add a predicate to the probe that does not specify <b>pid</b> at all:<br /></p><p></p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">Yuliyas-Air:~ Valerii$ <b>sudo dtrace -q -n ':::entry / execname == 
"mariadbd" / { printf ("\nEnter %s", probefunc); }' | more</b><br />dtrace: system integrity protection is on, some features will not be available<br /><br /><br />Enter cerror_nocancel<br />Enter _pthread_testcancel<br />Enter __error<br />Enter _pthread_cond_updateval<br />Enter pthread_mutex_lock<br />Enter my_hrtime<br />Enter clock_gettime<br />Enter gettimeofday<br />...</span></span><br /></p></blockquote><p>So, we can produce a raw trace (to be stored or processed by some user level code outside of DTrace). But what if we would like to do some more advanced processing in the probe itself and share only the results? For example, we run some MariaDB tests and would like to know how many times a specific function in the code was executed (if at all) in the process.</p><p>For this we have to use <a href="https://docs.oracle.com/cd/E18752_01/html/819-5488/gcggh.html" target="_blank"><i>aggregations</i></a> and rely on <i>aggregate functions</i> to do the job in kernel space. For now it's enough to know that <b>@name[key]</b> is the aggregation <b>name</b> indexed by the <b>key</b> (you can use several keys separated by commas) and the <b>count()</b> aggregation function simply counts the number of times it was called. </p><p>Another important detail to study to create useful DTrace scripts is <a href="https://docs.oracle.com/cd/E18752_01/html/819-5488/gcfqr.html#gcgke" target="_blank"><i>script argument substitution</i></a>. When called with the <b>-s</b> option, <b>dtrace</b> provides a set of built-in macro variables to the script, with names starting with <b>$</b>. Of them <b>$0</b> ... 
<b>$9</b> have the same meanings as in the shell, so <b>$1</b> is the first argument of the script.</p><p>With the above taken into account, we can use the following basic DTrace script to count function executions per name in any user process (the <b>quiet</b> option has the same effect as <b>-q</b>):</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">Yuliyas-Air:~ Valerii$ <b>cat /tmp/codecoverage.d</b><br />#!/usr/sbin/dtrace -s<br />#pragma D option quiet<br />pid<b>$1</b>::<b>$2</b>:<b>entry</b> {<br /> <b>@calls</b>[<b>probefunc</b>] = <b>count()</b>;<br />}</span></span><br /></p></blockquote><p>This script substitutes the first argument for the process id in the pid provider and the second argument for the function name in the probe, thus creating one or more probes. The aggregation's collected results are printed by default when the dtrace script ends, but there is also the printa() function to do this explicitly in the probes if needed. Let's try to apply this to MariaDB server 10.8 running as process <b>82428</b> and trace all functions (hence <b>'*'</b> as the explicit second argument):</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">Yuliyas-Air:~ Valerii$ <b>sudo /tmp/codecoverage.d 82428 '*' >/tmp/res.txt</b><br />dtrace: system integrity protection is on, some features will not be available<br /><br /><b>^C</b><br />Yuliyas-Air:~ Valerii$ <b>ls -l /tmp/res.txt</b><br />-rw-r--r-- 1 Valerii wheel 58772 Jan 28 20:53 /tmp/res.txt<br />Yuliyas-Air:~ Valerii$ <b>head -10 /tmp/res.txt</b><br /><br /><br /> ACL_USER* find_by_username_or_anon<ACL_USER>(ACL_USER*, unsigned long, char const*, char const*, char const*, char const*, char 1<br /> ACL_USER::copy(st_mem_root*) 1<br /> Apc_target::init(st_mysql_mutex*) 1<br /> Binary_string::copy(char const*, unsigned long) 1<br /> CONNECT::create_thd(THD*) 1<br /> Create_func_arg1::create_func(THD*, st_mysql_const_lex_string*, List<Item>*) 1<br /> 
Create_func_sleep::create_1_arg(THD*, Item*) 1<br /> Current_schema_tracker::update(THD*, set_var*) 1<br />Yuliyas-Air:~ Valerii$ <b>tail -10 /tmp/res.txt</b><br /> buf_page_get_low(page_id_t, unsigned long, unsigned long, buf_block_t*, unsigned long, mtr_t*, dberr_t*, bool) 33895<br /> pthread_self 36422<br /> mtr_t::memcpy_low(buf_block_t const&, unsigned short, void const*, unsigned long) 44143<br /> unsigned char* mtr_t::log_write<(unsigned char)48>(page_id_t, buf_page_t const*, unsigned long, bool, unsigned long) 44167<br /> mach_boottime_usec 45007<br /> mtr_t::modify(buf_block_t const&) 45609<br /> pthread_mutex_unlock 56530<br /> pthread_mutex_lock 57295<br /> pthread_cond_broadcast 64197<br /> _platform_memmove$VARIANT$Haswell 72767<br />Yuliyas-Air:~ Valerii$ <b>cat /tmp/res.txt | grep do_command</b><br /> do_command(THD*, bool) 4</span></span><br /></p></blockquote><p>I've executed just several statements (4 if you check <b>do_command()</b> executions) over a short enough period of time, but some functions like <b>pthread_mutex_lock</b> were executed thousands of times in the background. Now we have a lame code coverage testing method that works at MariaDB server scale (unlike <a href="http://mysqlentomologist.blogspot.com/2021/10/bpftrace-as-codefunction-coverage-tool.html" target="_blank">with <b>bpftrace</b> on Linux</a>). 
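<p>If all you have is a raw trace like the <b>-q</b> output above instead of the in-kernel aggregation, the same per-function counting can be approximated in user space. A rough shell equivalent of what <b>@calls[probefunc] = count()</b> collects (the sample input stands in for real trace output):</p>

```shell
# Count how many times each function appears in a raw "Enter <func>" trace;
# the printf input is a tiny stand-in for real dtrace -q output.
printf 'Enter do_command\nEnter pthread_mutex_lock\nEnter do_command\n' |
awk '/^Enter / { calls[$2]++ }
     END { for (f in calls) print calls[f], f }' |
sort -rn
```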
There are performance implications (huge impact when I tried to do the same while running a <b>sysbench</b> test with many concurrent threads) and limitations of internal buffer sizes that may cause some probes to be skipped, etc., but basically it works!<br /></p><p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjKOWUEEhz8HneVUJqe9Mz8G6k9PJUJTWuhPc0eCYYYn_NKZhwgmSXFQ3GnmuGnThYvb696Wf5_LQesc32KCXKfULkHSe6HKKz_n8d7K15bS_TE0heD3WiNqiWJaHSYIFpFS08ztqY_RYhku72jRWDX4wTwn7U0SgWB8qgvkk1S-zxYpmYr_YiWYq8N6w=s3264" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="2448" data-original-width="3264" height="480" src="https://blogger.googleusercontent.com/img/a/AVvXsEjKOWUEEhz8HneVUJqe9Mz8G6k9PJUJTWuhPc0eCYYYn_NKZhwgmSXFQ3GnmuGnThYvb696Wf5_LQesc32KCXKfULkHSe6HKKz_n8d7K15bS_TE0heD3WiNqiWJaHSYIFpFS08ztqY_RYhku72jRWDX4wTwn7U0SgWB8qgvkk1S-zxYpmYr_YiWYq8N6w=w640-h480" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">DTrace is almost as beautiful as <span>Côte d'Azur :)<br /></span></td></tr></tbody></table></p><p> To summarize:</p><ol style="text-align: left;"><li>We checked the basic building blocks of simple DTrace one-liners and scripts in some detail, including probe structure, main providers, some built-in variables, thread-local variables referenced via <b>self-></b> and aggregations. <br /></li><li>It is possible to add probes to each and every function in MariaDB server 10.8 and do both raw tracing and aggregation in the kernel context. Lame code coverage scripts do work.</li><li>It is possible to create parametrized, generic scripts with command line arguments and other macros substituted.<br /></li><li>The DTrace language is similar to what we see later implemented in <b>bpftrace</b>. 
After some checks of the manual one can easily switch from one tool to the other for basic tasks.</li><li>Built-in variables and functions may be named differently than in <b>bpftrace</b>.</li><li>There are many more details to cover and interesting scripts to use with MariaDB server, so stay tuned!<br /></li></ol>Valerii Kravchukhttp://www.blogger.com/profile/13158916419325454260noreply@blogger.com0tag:blogger.com,1999:blog-3080615211468083537.post-7677682118458140222022-01-27T23:27:00.000+02:002022-01-27T23:27:38.751+02:00First steps with MariaDB Server and DTrace on macOS<p><a href="https://fosdem.org/2022/" target="_blank">FOSDEM 2022</a> is going to happen on the next weekend and I am still missing blog posts supporting my unusual upcoming talk there, this time devoted to <a href="https://fosdem.org/2022/schedule/event/mariadb_macos/" target="_blank">building and using MariaDB server on macOS</a>. So tonight I am going to contribute something new to refer to during my talk or follow-up questions. I'll try to document my way to build MariaDB Server from <a href="https://github.com/MariaDB/server" target="_blank">GitHub source</a> on an old (early 2015) MacBook Air running macOS 10.13.6 <a href="https://en.wikipedia.org/wiki/MacOS_High_Sierra" target="_blank">High Sierra</a>. I am also going to show why one may want to use macOS there instead of installing Linux and getting a far more usual environment on the same decent hardware.<br /></p><p>I've inherited this Air from my daughter in September, 2021 after she got a new, M1-based one, and initially used it mostly for Zoom calls and web browsing. Soon I recalled that I had used a MacBook for more than 3 years in the past while working for Sun and Oracle, and it was my main working platform, not only for content consumption, emails and virtual machines: it was regularly used for bug verification, MySQL builds and tests of all kinds. 
It would be a waste of a capable OS and good hardware (formally still more powerful than all my other machines except the Fedora 33 desktop) NOT to try to use it properly.</p><p>That's why, after a few software upgrades to end up with a more recent 10.13 minor release:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">Yuliyas-Air:maria10.6 Valerii$ <b>uname -a</b><br />Darwin Yuliyas-Air 17.7.0 Darwin Kernel Version 17.7.0: Mon Aug 31 22:11:23 PDT 2020; root:xnu-4570.71.82.6~1/RELEASE_X86_64 x86_64 </span></span><br /></p></blockquote><p>I read <a href="https://mariadb.com/kb/en/Build_Environment_Setup_for_Mac/" target="_blank">the manual</a>, registered as a developer, <a href="https://developer.apple.com/xcode/downloads/" target="_blank">downloaded</a> and installed the <a href="https://xcodereleases.com/" target="_blank">proper version of Xcode</a> and decided to proceed with <a href="https://www.macports.org/install.php#installing" target="_blank">MacPorts</a>.</p><p>Then I updated the ports tree and proceeded per the manual, installing <b>git</b> and other surely needed tools and packages:</p><blockquote><p><span style="font-size: x-small;"><span style="font-family: courier;">sudo port install git cmake jemalloc judy openssl boost gnutls</span></span></p></blockquote><p>That was just the beginning, and eventually, with dependencies, port updates, problems and workarounds, I ended up like this (and that works for building the current code of 10.1 to 10.8 for sure, with some ports maybe not needed or used only for other builds, but who cares):</p><blockquote><p><span style="font-size: x-small;"><span style="font-family: courier;">Yuliyas-Air:maria10.8 Valerii$ <b>port installed</b><br />The following ports are currently installed:<br /> autoconf @2.71_1 (active)<br /> automake @1.16.5_0 (active)<br /> bison @3.8.2_0<br /> bison @3.8.2_2 (active)<br /> bison-runtime @3.8.2_0 (active)<br /> boehmgc @8.0.6_0 (active)<br /> boost @1.76_0 (active)<br /> 
boost171 @1.71.0_3+no_single+no_static+python39 (active)<br /> boost176 @1.76.0_2+no_single+no_static+python39<br /> boost176 @1.76.0_3+no_single+no_static+python39 (active)<br /> bzip2 @1.0.8_0 (active)<br /> cmake @3.21.4_0<br /><b> cmake @3.22.1_0 (active)<br /></b> curl @7.80.0_0+ssl (active)<br /> curl-ca-bundle @7.80.0_0<br /> curl-ca-bundle @7.80.0_1 (active)<br /> cyrus-sasl2 @2.1.27_5+kerberos (active)<br /> db48 @4.8.30_4 (active)<br /> expat @2.4.1_0<br /> expat @2.4.2_0<br /> expat @2.4.3_0 (active)<br /> gdb @11.1_0 (active)<br /> gdbm @1.22_0 (active)<br /> gettext @0.21_0 (active)<br /> gettext-runtime @0.21_0 (active)<br /> gettext-tools-libs @0.21_0 (active)<br /> git @2.34.1_1+credential_osxkeychain+diff_highlight+doc+pcre+perl5_28<br /> git @2.34.1_2+credential_osxkeychain+diff_highlight+doc+pcre+perl5_28 (active)<br /> gmp @6.2.1_0 (active)<br /> gnutls @3.6.16_1<br /> gnutls @3.6.16_2 (active)<br /> icu @67.1_4 (active)<br /> jemalloc @5.2.1_1 (active)<br /> judy @1.0.5_1 (active)<br /> kerberos5 @1.19.2_1 (active)<br /> libarchive @3.5.2_1 (active)<br /> libb2 @0.98.1_1 (active)<br /> libcomerr @1.45.6_0 (active)<br /> libcxx @5.0.1_4 (active)<br /> libedit @20210910-3.1_1 (active)<br /> libevent @2.1.12_1 (active)<br /> libffi @3.4.2_2 (active)<br /> libiconv @1.16_1 (active)<br /> libidn @1.38_0 (active)<br /> libidn2 @2.3.2_0<br /> libidn2 @2.3.2_1 (active)<br /> libpsl @0.21.1-20210726_2 (active)<br /> libtasn1 @4.18.0_0 (active)<br /> libtextstyle @0.21_0 (active)<br /> libtool @2.4.6_13 (active)<br /> libunistring @0.9.10_0<br /> libunistring @1.0_0 (active)<br /> libuv @1.42.0_1<br /> libuv @1.43.0_0 (active)<br /> libxml2 @2.9.12_1 (active)<br /> libxslt @1.1.34_6 (active)<br /> lmdb @0.9.29_0 (active)<br /> luajit @2.1.0-beta3_5 (active)<br /> lz4 @1.9.3_1 (active)<br /> lzma @4.65_1 (active)<br /> lzo2 @2.10_0 (active)<br /> m4 @1.4.19_1 (active)<br /> mariadb @5.5.68_0 (active)<br /> mariadb-server @5.5.68_0 (active)<br /> mysql57 
@5.7.36_1 (active)<br /> mysql_select @0.1.2_4 (active)<br /> ncurses @6.3_0 (active)<br /> nettle @3.7.3_0 (active)<br /> openssl @3_1<br /> openssl @3_2 (active)<br /> openssl3 @3.0.0_6+legacy<br /> openssl3 @3.0.1_0+legacy (active)<br /> <b>openssl11 @1.1.1l_5 (active)</b><br /> p5-dbd-mysql @4.50.0_0 (active)<br /> p5.28-authen-sasl @2.160.0_0 (active)<br /> p5.28-cgi @4.530.0_0 (active)<br /> p5.28-clone @0.450.0_0 (active)<br /> p5.28-dbd-mysql @4.50.0_0+mysql57 (active)<br /> p5.28-dbi @1.643.0_0 (active)<br /> p5.28-digest-hmac @1.40.0_0 (active)<br /> p5.28-digest-sha1 @2.130.0_4 (active)<br /> p5.28-encode @3.160.0_0 (active)<br /> p5.28-encode-locale @1.50.0_0 (active)<br /> p5.28-error @0.170.290_0 (active)<br /> p5.28-gssapi @0.280.0_3 (active)<br /> p5.28-html-parser @3.760.0_0 (active)<br /> p5.28-html-tagset @3.200.0_4 (active)<br /> p5.28-http-date @6.50.0_0 (active)<br /> p5.28-http-message @6.350.0_0 (active)<br /> p5.28-io-html @1.4.0_0 (active)<br /> p5.28-io-socket-ssl @2.72.0_0<br /> p5.28-io-socket-ssl @2.73.0_0 (active)<br /> p5.28-lwp-mediatypes @6.40.0_0 (active)<br /> p5.28-mozilla-ca @20211001_0 (active)<br /> p5.28-net-libidn @0.120.0_5 (active)<br /> p5.28-net-smtp-ssl @1.40.0_0 (active)<br /> p5.28-net-ssleay @1.900.0_4 (active)<br /> p5.28-term-readkey @2.380.0_0 (active)<br /> p5.28-time-local @1.300.0_0 (active)<br /> p5.28-timedate @2.330.0_0 (active)<br /> p5.28-uri @5.100.0_0 (active)<br /> p11-kit @0.24.0_1 (active)<br /> pcre2 @10.39_0 (active)<br /> perl5.28 @5.28.3_4 (active)<br /> perl5.30 @5.30.3_3 (active)<br /> pkgconfig @0.29.2_0 (active)<br /> popt @1.18_1 (active)<br /> python3_select @0.0_2 (active)<br /> python39 @3.9.9_0+lto+optimizations<br /> python39 @3.9.10_0+lto+optimizations (active)<br /> python_select @0.3_9 (active)<br /><b> readline @8.1.000_0 (active)<br /> readline-5 @5.2.014_2 (active)<br /></b> rsync @3.2.3_1 (active)<br /> sqlite3 @3.37.0_0<br /> sqlite3 @3.37.1_0<br /> sqlite3 @3.37.2_0 (active)<br 
/> sysbench @1.0.20_0 (active)<br /> tcp_wrappers @20_4 (active)<br /> texinfo @6.8_0 (active)<br /> umem @1.0.1_1 (active)<br /> xxhashlib @0.8.1_0<br /> xxhashlib @0.8.1_1 (active)<br /> xz @5.2.5_0 (active)<br /> zlib @1.2.11_0 (active)<br /> zstd @1.5.0_0<br /><b> zstd @1.5.1_0 (active)<br /></b>Yuliyas-Air:maria10.8 Valerii$</span></span><br /></p></blockquote><p>This is surely not the minimal needed set of ports. I've highlighted a couple (like <b>openssl11</b>) that were really needed to build and install 10.8 successfully eventually.<br /></p><p>Then I cloned the code with usual steps and ended up with this:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">Yuliyas-Air:server Valerii$ <b>git log -1</b><br />commit c1cef1afa9962544de4840c9a796ae0a9b5e92e6 (HEAD -> 10.8, origin/bb-10.8-wlad, origin/bb-10.8-release, origin/HEAD, origin/10.8)<br />Merge: db2013787d2 9d93b51effd<br />Author: Vladislav Vaintroub <wlad@mariadb.com><br />Date: Wed Jan 26 13:57:00 2022 +0100<br /><br /> Merge remote-tracking branch 'origin/bb-10.8-wlad' into 10.8<br />Yuliyas-Air:server Valerii$ <b>git submodule update --init --recursive</b><br />Submodule path 'libmariadb': checked out 'ddb031b6a1d8b6e26a0f10f454dc1453a48a6ca8'<br />Yuliyas-Air:server Valerii$ <b>cd buildtmp/</b><br />Yuliyas-Air:buildtmp Valerii$ <b>rm -rf *</b><br />Yuliyas-Air:buildtmp Valerii$ <b>cmake .. -DCMAKE_INSTALL_PREFIX=/Users/Valerii/dbs/maria10.8 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DBUILD_CONFIG=mysql_release -DFEATURE_SET=community -DWITH_EMBEDDED_SERVER=OFF -DPLUGIN_TOKUDB=NO -DWITH_SSL=/opt/local/libexec/openssl11 -DENABLE_DTRACE=1</b><br />...<br />-- The following OPTIONAL packages have not been found:<br /><br /> * PMEM<br /> * Snappy<br /><br />-- Configuring done<br />CMake Warning (dev):<br /> Policy CMP0042 is not set: MACOSX_RPATH is enabled by default. Run "cmake<br /> --help-policy CMP0042" for policy details. 
Use the cmake_policy command to<br /> set the policy and suppress this warning.<br /><br /> MACOSX_RPATH is not specified for the following targets:<br /><br /> libmariadb<br /><br />This warning is for project developers. Use -Wno-dev to suppress it.<br /><br />-- Generating done<br />-- Build files have been written to: /Users/Valerii/git/server/buildtmp<br />Yuliyas-Air:buildtmp Valerii$</span></span><br /></p></blockquote><p>Note that I've used a somewhat nontrivial <b>cmake</b> command line for an out-of-source build, and the output above was not from a clean state, but from the state after 10.1 through 10.7 were all checked out and built successfully, one by one, with problems found and resolved (more on that in my slides and during the talk).</p><p>Two key options in that command line are <b>-DWITH_SSL=/opt/local/libexec/openssl11</b> to use the supported OpenSSL version 1.1 (even though for 10.8 the default version 3 should already work too) and <b>-DENABLE_DTRACE=1</b> to enable the static DTrace probes that are still present in the MariaDB code. 
The rest are typical for my builds and blog posts.<br /></p><p>Then I proceeded with the usual <b>make</b>, only to hit the first problem:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">Yuliyas-Air:buildtmp Valerii$ <b>make</b><br />...<br />[ 46%] Built target heap<br />[ 46%] Building CXX object storage/innobase/CMakeFiles/innobase.dir/btr/btr0btr.cc.o<br />In file included from /Users/Valerii/git/server/storage/innobase/btr/btr0btr.cc:28:<br />In file included from /Users/Valerii/git/server/storage/innobase/include/btr0btr.h:31:<br />In file included from /Users/Valerii/git/server/storage/innobase/include/dict0dict.h:32:<br />In file included from /Users/Valerii/git/server/storage/innobase/include/dict0mem.h:45:<br />In file included from /Users/Valerii/git/server/storage/innobase/include/buf0buf.h:33:<br />/Users/Valerii/git/server/storage/innobase/include/fil0fil.h:1497:11: <b>error:<br /> 'asm goto' constructs are not supported yet<br /> __asm__ goto("lock btsl $31, %0\t\njnc %l1" : : "m" (n_pending)<br /></b> ^<br />1 error generated.<br />make[2]: *** [storage/innobase/CMakeFiles/innobase.dir/btr/btr0btr.cc.o] Error 1<br />make[1]: *** [storage/innobase/CMakeFiles/innobase.dir/all] Error 2<br />make: *** [all] Error 2</span></span><br /></p></blockquote><p>caused by the fact that clang 10.0.0 from Xcode: </p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">Yuliyas-Air:maria10.8 Valerii$ <b>clang --version</b><br />Apple LLVM version 10.0.0 (clang-1000.10.44.4)<br />Target: x86_64-apple-darwin17.7.0<br />Thread model: posix<br />InstalledDir: /Library/Developer/CommandLineTools/usr/bin </span></span><br /></p></blockquote><p>does NOT support the <b>asm goto</b> construct used in MariaDB Server code since 10.6 (even though it was supposed to). 
I've applied a lame fix (see <a href="https://jira.mariadb.org/browse/MDEV-27402" target="_blank"><b>MDEV-27402</b></a> for more details and final <b>diff</b> later) and proceeded:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">Yuliyas-Air:buildtmp Valerii$ <b>make</b><br />...<br /><br />[ 97%] Building CXX object client/CMakeFiles/mariadb.dir/mysql.cc.o<br />/Users/Valerii/git/server/client/mysql.cc:2853:59: error: expected expression<br /> rl_attempted_completion_function= (rl_completion_func_t*)&new_mysql_co...<br /> ^<br />/Users/Valerii/git/server/client/mysql.cc:2853:38: error: use of undeclared<br /> identifier 'rl_completion_func_t'; did you mean 'rl_completion_matches'?<br /> rl_attempted_completion_function= (rl_completion_func_t*)&new_mysql_co...<br /> ^~~~~~~~~~~~~~~~~~~~<br /> rl_completion_matches<br />/usr/include/editline/readline.h:202:16: note: 'rl_completion_matches' declared<br /> here<br />char **rl_completion_matches(const char *, rl_compentry_func_t *);<br /> ^<br />/Users/Valerii/git/server/client/mysql.cc:2854:33: error: assigning to<br /> 'Function *' (aka 'int (*)(const char *, int)') from incompatible type<br /> 'rl_compentry_func_t *' (aka 'char *(*)(const char *, int)'): different<br /> return type ('int' vs 'char *')<br /> rl_completion_entry_function= (rl_compentry_func_t*)&no_completion;<br /> ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~<br />/Users/Valerii/git/server/client/mysql.cc:2856:3: error: no matching function<br /> for call to 'rl_add_defun'<br /> rl_add_defun("magic-space", (rl_command_func_t *)&fake_magic_space, -1);<br /> ^~~~~~~~~~~~<br />/usr/include/editline/readline.h:195:7: note: candidate function not viable: no<br /> known conversion from 'rl_command_func_t *' (aka 'int (*)(int, int)') to<br /> 'Function *' (aka 'int (*)(const char *, int)') for 2nd argument<br />int rl_add_defun(const char *, Function *, int);<br /> ^<br />4 errors generated.<br />make[2]: *** 
[client/CMakeFiles/mariadb.dir/mysql.cc.o] Error 1<br />make[1]: *** [client/CMakeFiles/mariadb.dir/all] Error 2<br />make: *** [all] Error 2<br />Yuliyas-Air:buildtmp Valerii$</span></span><br /></p></blockquote><p>This <b>readline</b>-related problem was also reported (see <b><a href="https://jira.mariadb.org/browse/MDEV-27579" target="_blank">MDEV-27579</a></b>) and I fixed it with a lame patch, and eventually I was able to build successfully:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">...<br /><br />[100%] Linking C executable wsrep_check_version<br />[100%] Built target wsrep_check_version<br />Yuliyas-Air:buildtmp Valerii$ <b>make install && make clean</b><br />...<br />-- Installing: /Users/Valerii/dbs/maria10.8/share/aclocal/mysql.m4<br />-- Installing: /Users/Valerii/dbs/maria10.8/support-files/mysql.server<br />Yuliyas-Air:buildtmp Valerii$ <b>echo $?</b><br />0</span></span><br /></p></blockquote><p>The final <b>diff</b> is like this:<br /></p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">Yuliyas-Air:server Valerii$ <b>git diff -u</b><br /><b>diff --git a/client/mysql.cc b/client/mysql.cc</b><br />index 6612b273d17..902589f2e83 100644<br />--- a/client/mysql.cc<br />+++ b/client/mysql.cc<br />@@ -2849,7 +2849,7 @@ static void initialize_readline ()<br /> rl_terminal_name= getenv("TERM");<br /><br /> /* Tell the completer that we want a crack first. 
*/<br /><b>-#if defined(USE_NEW_READLINE_INTERFACE)<br />+#if defined(USE_NEW_READLINE_INTERFACE) && !defined(__APPLE_CC__)<br /></b> rl_attempted_completion_function= (rl_completion_func_t*)&new_mysql_completion;<br /> rl_completion_entry_function= (rl_compentry_func_t*)&no_completion;<br /><br />@@ -2859,7 +2859,9 @@ static void initialize_readline ()<br /> setlocale(LC_ALL,""); /* so as libedit use isprint */<br /> #endif<br /> rl_attempted_completion_function= (CPPFunction*)&new_mysql_completion;<br /><b>+#if !defined(__APPLE_CC__)<br /></b> rl_completion_entry_function= &no_completion;<br /><b>+#endif<br /></b> rl_add_defun("magic-space", (Function*)&fake_magic_space, -1);<br /> #else<br /> rl_attempted_completion_function= (CPPFunction*)&new_mysql_completion;<br /><b>diff --git a/storage/innobase/include/fil0fil.h b/storage/innobase/include/fil0fil.h<br /></b>index 34a53746b42..a795313116f 100644<br />--- a/storage/innobase/include/fil0fil.h<br />+++ b/storage/innobase/include/fil0fil.h<br />@@ -1489,7 +1489,7 @@ inline void fil_space_t::reacquire()<br /> inline bool fil_space_t::set_stopping_check()<br /> {<br /> mysql_mutex_assert_owner(&fil_system.mutex);<br /><b>-#if defined __clang_major__ && __clang_major__ < 10<br />+#if (defined __clang_major__ && __clang_major__ < 10) || defined __APPLE_CC__<br /></b> /* Only clang-10 introduced support for asm goto */<br /> return n_pending.fetch_or(STOPPING, std::memory_order_relaxed) & STOPPING;<br /> #elif defined __GNUC__ && (defined __i386__ || defined __x86_64__)<br /><b>diff --git a/storage/rocksdb/rocksdb b/storage/rocksdb/rocksdb<br /></b>--- a/storage/rocksdb/rocksdb<br />+++ b/storage/rocksdb/rocksdb<br />@@ -1 +1 @@<br /><b>-Subproject commit bba5e7bc21093d7cfa765e1280a7c4fdcd284288<br />+Subproject commit bba5e7bc21093d7cfa765e1280a7c4fdcd284288-dirty<br /></b>Yuliyas-Air:server Valerii$</span></span><br /></p></blockquote><p>The last diff in <b>rocksdb</b> submodule is related to the "missing zstd 
headers" problem I fixed with yet another lame patch while building 10.7, see <b><a href="https://jira.mariadb.org/browse/MDEV-27619" target="_blank">MDEV-27619</a></b>. </p><p>After the usual mysql_install_db and startup, I ended up with a shiny new MariaDB 10.8.0 up and running on macOS 10.13.6:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">Yuliyas-Air:maria10.8 Valerii$ <b>bin/mysql</b><br />Welcome to the MariaDB monitor. Commands end with ; or \g.<br />Your MariaDB connection id is 5<br />Server version: 10.8.0-MariaDB MariaDB Server<br /><br />Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.<br /><br />Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.<br /><br />MariaDB [(none)]> show variables like 'version%';<br />+-------------------------+------------------------------------------+<br />| Variable_name | Value |<br />+-------------------------+------------------------------------------+<br /><b>| version | 10.8.0-MariaDB |<br /></b>| version_comment | MariaDB Server |<br />| version_compile_machine | x86_64 |<br /><b>| version_compile_os | osx10.13 |<br /></b>| version_malloc_library | system |<br /><b>| version_source_revision | c1cef1afa9962544de4840c9a796ae0a9b5e92e6 |<br />| version_ssl_library | OpenSSL 1.1.1l 24 Aug 2021 |<br /></b>+-------------------------+------------------------------------------+<br />7 rows in set (0.001 sec)</span></span><br /></p></blockquote><p>I even checked a few MTR test suites, and many tests pass:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">Yuliyas-Air:mysql-test Valerii$ ./mtr --suite=rocksdb<br />Logging: ./mtr --suite=rocksdb<br />VS config:<br />vardir: /Users/Valerii/dbs/maria10.8/mysql-test/var<br />Checking leftover processes...<br /> - found old pid 89161 in 'mysqld.1.pid', killing it...<br /> process did not exist!<br />Removing old var directory...<br />Creating var directory 
'/Users/Valerii/dbs/maria10.8/mysql-test/var'...<br />Checking supported features...<br />MariaDB Version 10.8.0-MariaDB<br /> - SSL connections supported<br />Using suites: rocksdb<br />Collecting tests...<br />Installing system database...<br />...<br />rocksdb.index_merge_rocksdb 'write_committed' [ pass ] 911<br />rocksdb.shutdown 'write_prepared' [ pass ] 1895<br />rocksdb.index_merge_rocksdb 'write_prepared' [ pass ] 923<br />rocksdb.mariadb_misc_binlog 'write_committed' [ pass ] 43<br />rocksdb.mariadb_misc_binlog 'write_prepared' [ pass ] 43<br />...<br />rocksdb.issue495 'write_committed' [ pass ] 33<br />rocksdb.partition 'write_committed' [ fail ]<br /> Test ended at 2022-01-27 22:41:15<br /><br />CURRENT_TEST: rocksdb.partition<br />mysqltest: At line 67: query 'ALTER TABLE t1 REBUILD PARTITION p0, p1' failed: ER_METADATA_INCONSISTENCY (4064): Table 'test.t1#P#p0#TMP#' does not exist, but metadata information exists inside MyRocks. This is a sign of data inconsistency. Please check if './test/t1#P#p0#TMP#.frm' exists, and try to restore it if it does not exist.<br /><br />The result from queries just before the failure was:<br />< snip ><br />DROP TABLE IF EXISTS employees_hash_1;<br />DROP TABLE IF EXISTS t1_hash;<br />DROP TABLE IF EXISTS employees_linear_hash;<br />DROP TABLE IF EXISTS t1_linear_hash;<br />DROP TABLE IF EXISTS k1;<br />DROP TABLE IF EXISTS k2;<br />DROP TABLE IF EXISTS tm1;<br />DROP TABLE IF EXISTS tk;<br />DROP TABLE IF EXISTS ts;<br />DROP TABLE IF EXISTS ts_1;<br />DROP TABLE IF EXISTS ts_3;<br />DROP TABLE IF EXISTS ts_4;<br />DROP TABLE IF EXISTS ts_5;<br />DROP TABLE IF EXISTS trb3;<br />DROP TABLE IF EXISTS tr;<br />DROP TABLE IF EXISTS members_3;<br />DROP TABLE IF EXISTS clients;<br />DROP TABLE IF EXISTS clients_lk;<br />DROP TABLE IF EXISTS trb1;<br />CREATE TABLE t1 (i INT, j INT, k INT, PRIMARY KEY (i)) ENGINE = ROCKSDB PARTITION BY KEY(i) PARTITIONS 4;<br /><br />More results from queries before failure can be found in 
/Users/Valerii/dbs/maria10.8/mysql-test/var/log/partition.log<br /><br /> - saving '/Users/Valerii/dbs/maria10.8/mysql-test/var/log/rocksdb.partition-write_committed/' to '/Users/Valerii/dbs/maria10.8/mysql-test/var/log/rocksdb.partition-write_committed/'<br /><br />Only 148 of 555 completed.<br />--------------------------------------------------------------------------<br />The servers were restarted 23 times<br />Spent 154.726 of 206 seconds executing testcases<br /><br />Failure: Failed 1/27 tests, 96.30% were successful.<br /><br />Failing test(s): rocksdb.partition<br /><br />The log files in var/log may give you some hint of what went wrong.<br /><br />If you want to report this error, please read first the documentation<br />at http://dev.mysql.com/doc/mysql/en/mysql-test-suite.html<br /><br />69 tests were skipped, 2 by the test itself.<br /><br />mysql-test-run: *** ERROR: there were failing test cases</span></span><br /></p></blockquote><p></p><p>But the real reason to build MariaDB server on macOS (other than "because I can") was not even RocksDB testing, but this:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">Yuliyas-Air:mysql-test Valerii$ <b>dtrace</b><br />Usage: dtrace [-aACeFHlqSvVwZ] [-arch i386|x86_64] [-b bufsz] [-c cmd] [-D name[=def]]<br /> [-I path] [-L path] [-o output] [-p pid] [-s script] [-U name]<br /> [-x opt[=val]]<br /><br /> [-P provider [[ predicate ] action ]]<br /> [-m [ provider: ] module [[ predicate ] action ]]<br /> [-f [[ provider: ] module: ] func [[ predicate ] action ]]<br /> [-n [[[ provider: ] module: ] func: ] name [[ predicate ] action ]]<br /> [-i probe-id [[ predicate ] action ]] [ args ... 
]<br /><br /> predicate -> '/' D-expression '/'<br /> action -> '{' D-statements '}'<br /><br /> -arch Generate programs and Mach-O files for the specified architecture<br /><br /> -a claim anonymous tracing state<br /> -A generate plist(5) entries for anonymous tracing<br /> -b set trace buffer size<br /> -c run specified command and exit upon its completion<br /> -C run cpp(1) preprocessor on script files<br /> -D define symbol when invoking preprocessor<br /> -e exit after compiling request but prior to enabling probes<br /> -f enable or list probes matching the specified function name<br /> -F coalesce trace output by function<br /> -h generate a header file with definitions for static probes<br /> -H print included files when invoking preprocessor<br /> -i enable or list probes matching the specified probe id<br /> -I add include directory to preprocessor search path<br /> -l list probes matching specified criteria<br /> -L add library directory to library search path<br /> -m enable or list probes matching the specified module name<br /> -n enable or list probes matching the specified probe name<br /> -o set output file<br /> -p grab specified process-ID and cache its symbol tables<br /> -P enable or list probes matching the specified provider name<br /> -q set quiet mode (only output explicitly traced data)<br /> -s enable or list probes according to the specified D script<br /> -S print D compiler intermediate code<br /> -U undefine symbol when invoking preprocessor<br /> -v set verbose mode (report stability attributes, arguments)<br /> -V report DTrace API version<br /> -w permit destructive actions<br /> -W wait for specified process and exit upon its completion<br /> -x enable or modify compiler and tracing options<br /> -Z permit probe descriptions that match zero probes<br />Yuliyas-Air:mysql-test Valerii$</span></span><br /></p></blockquote><p>I remember how cool dtrace has been since my days working for Sun, and after the last re-install of Fedora I ended up with my 
<a href="https://illumos.org/" target="_blank">Illumos</a> VM gone and no recent FreeBSD VM anyway, so basically there was no <b>dtrace</b> at hand until I got this MacBook.<br /></p><p>So, unlike MySQL 8.0, MariaDB server still contains USDT (DTrace probes) and I've built my version with them enabled. Let's quickly check how they can be used. There are a few sample D files in the source:</p><blockquote><p style="text-align: left;"><span style="font-family: courier;"><span style="font-size: x-small;">Yuliyas-Air:server Valerii$ <b>ls support-files/dtrace/</b><br />locktime.d query-rowops.d<br />query-execandqc.d query-time.d<br />query-filesort-time.d statement-time.d<br />query-network-time.d statement-type-aggregate.d<br />query-parse-time.d<br />Yuliyas-Air:server Valerii$ <b>cat support-files/dtrace/query-time.d</b><br /><b>#!/usr/sbin/dtrace -s<br /></b>#<br /># Copyright (c) 2009 Sun Microsystems, Inc.<br /># Use is subject to license terms.<br />#<br /># This program is free software; you can redistribute it and/or modify<br /># it under the terms of the GNU General Public License as published by<br /># the Free Software Foundation; version 2 of the License.<br />#<br /># This program is distributed in the hope that it will be useful,<br /># but WITHOUT ANY WARRANTY; without even the implied warranty of<br /># MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the<br /># GNU General Public License for more details.<br />#<br /># You should have received a copy of the GNU General Public License<br /># along with this program; if not, write to the Free Software<br /># Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1335 USA<br />#<br /># Shows basic query execution time, who execute the query, and on what database<br /><br />#pragma D option quiet<br /><br />dtrace:::BEGIN<br />{<br /> printf("%-20s %-20s %-40s %-9s\n", "Who", "Database", "Query", "Time(ms)");<br />}<br /><br />mysql*:::query-start<br />{<br /> self->query = copyinstr(arg0);<br /> self->connid = arg1;<br /> self->db = copyinstr(arg2);<br /> self->who = strjoin(copyinstr(arg3),strjoin("@",copyinstr(arg4)));<br /> self->querystart = timestamp;<br />}<br /><br />mysql*:::query-done<br />{<br /> printf("%-20s %-20s %-40s %-9d\n",self->who,self->db,self->query,<br /> (timestamp - self->querystart) / 1000000);<br />}<br />Yuliyas-Air:server Valerii$<br /></span></span></p></blockquote><p>We can try to run it (with yet another hack to do that I leave to the curious reader who is really going to try this) and then execute some queries against MariaDB from another terminal:</p><blockquote><p><span style="font-size: xx-small;"><span style="font-family: courier;">Yuliyas-Air:server Valerii$ <b>sudo support-files/dtrace/query-time.d</b><br />dtrace: system integrity protection is on, some features will not be available<br /><br /><b>Who Database Query Time(ms)<br /></b>Valerii@localhost select @@version_comment limit 1 0<br />Valerii@localhost select 1+1 0<br /><b>Valerii@localhost select sleep(4) 4004<br /></b>Valerii@localhost select @@version_comment limit 1 0<br />Valerii@localhost create database sbtest 1<br /><b>Valerii@localhost sbtest CREATE TABLE sbtest1(<br /> id INTEGER NOT NULL AUTO_INCREMENT,<br /> k INTEGER DEFAULT '0' NOT NULL,<br /> c CHAR(120) DEFAULT '' NOT NULL,<br /> pad CHAR(60) DEFAULT '' NOT NULL,<br /> PRIMARY KEY (id)<br 
/>) /*! ENGINE = innodb */ 91<br /></b>Valerii@localhost sbtest INSERT INTO sbtest1(k, c, pad) VALUES(366941, '31451373586-15688153734-79729593694-96509299839-83724898275-86711833539-78981337422-35049690573-51724173961-87474696253', '98996621624-36689827414-04092488557-09587706818-65008859162'),(277750, '21472970079-7 181<br />Valerii@localhost sbtest INSERT INTO sbtest1(k, c, pad) VALUES(124021, '80697810288-90543941719-80227288793-55278810422-59841440561-49369413842-83550451066-12907725305-62036548401-86959403176', '65708342793-83311865079-53224065384-18645733125-16333693298'),(660496, '07381386584-5 31<br />Valerii@localhost sbtest INSERT INTO sbtest1(k, c, pad) VALUES(425080, '48883413333-17783399741-03981526516-97596354402-27141206678-83563692683-30244461835-25263435890-49140039573-28211133426', '81560227417-96691828090-72817141653-15106797886-43970285630'),(322421, '68618246702-7 33<br />...</span><br /></span></p></blockquote><p>to end up with a lightweight general query log with query execution time reported in milliseconds. 
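The Time(ms) column in that output comes from the D expression (timestamp - self->querystart) / 1000000, i.e. a plain nanoseconds-to-milliseconds conversion. The same arithmetic in a toy Python form (the helper name is mine, not from the script):

```python
def elapsed_ms(start_ns, end_ns):
    # mirrors the D expression (timestamp - self->querystart) / 1000000
    return (end_ns - start_ns) // 1_000_000

# a duration a touch over 4 seconds, like the "select sleep(4)" row above
print(elapsed_ms(0, 4_004_000_000))  # 4004
```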
Note some statements from, yes, <b>sysbench</b> test prepare :)<br /></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEi2y6eKlWD7qnj1rNNCvsVn6WCo5buj-SUE60wTUoZOlUFZo4T67nssRcR55L0z98qKbVLd5ygYiCPBZPc0W3LLLm0fjAczJsGaKrZEQVBQvpqNlWvS7Mhdc10uijNd9QZSs28ef2nqb5bVpv7X8uSaJluu2uwe9af7QbqgL-t8EHPXi5Stv6H3mUl6eA=s3264" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="2448" data-original-width="3264" height="480" src="https://blogger.googleusercontent.com/img/a/AVvXsEi2y6eKlWD7qnj1rNNCvsVn6WCo5buj-SUE60wTUoZOlUFZo4T67nssRcR55L0z98qKbVLd5ygYiCPBZPc0W3LLLm0fjAczJsGaKrZEQVBQvpqNlWvS7Mhdc10uijNd9QZSs28ef2nqb5bVpv7X8uSaJluu2uwe9af7QbqgL-t8EHPXi5Stv6H3mUl6eA=w640-h480" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Nice view. macOS is also a nice and useful platform to run MariaDB server for fun :)<br /></td></tr></tbody></table><p>That's probably enough for a single post. Now go and get that from your MySQL 8 on any OS! <a href="https://bugs.mysql.com/bug.php?id=105741" target="_blank">USDTs rule</a>!</p><p>To summarize:</p><ol style="text-align: left;"><li>It is surely possible to build even the latest and greatest, still alpha, MariaDB 10.8 on macOS as old as 10.13.6, and get a usable result.</li><li>One of the main benefits of macOS is DTrace support.</li><li>It is still possible to use, in MariaDB server, a <a href="https://dev.mysql.com/doc/refman/5.6/en/dba-dtrace-server.html" target="_blank">limited set of USDTs</a> that were once added to MySQL and are now gone from all its current versions. 
<br /></li><li>That <b>rocksdb.partition</b> test failure is<span style="font-size: small;"> to be studied and maybe reported as an MDEV</span></li><li><span style="font-size: small;">DTrace is surely more capable than what those examples in the source code show. More DTrace-related posts are coming this year. Stay tuned!</span></li></ol>Valerii Kravchukhttp://www.blogger.com/profile/13158916419325454260noreply@blogger.com0tag:blogger.com,1999:blog-3080615211468083537.post-60655226895578898822022-01-15T22:45:00.000+02:002022-01-15T22:45:43.286+02:00Accessing Complex Structures in MariaDB Server Code in bpftrace Probes - First Steps<p>I have already written many blog posts about <a href="http://mysqlentomologist.blogspot.com/search/label/bpftrace" target="_blank"><b>bpftrace</b></a>. All my public talks devoted to <b>bpftrace</b> included a slide about "problems", and one of them usually sounded like this:</p><blockquote><p><i><span style="font-family: inherit;"><span style="font-size: small;">...<span id="docs-internal-guid-d708211b-7fff-25eb-1ccb-b9cf9a620ce5" style="background-color: transparent; color: black; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">access to complex structures (</span><span style="background-color: transparent; color: black; font-variant: normal; font-weight: 700; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;">bpftrace</span><span style="background-color: transparent; color: black; font-variant: normal; font-weight: 400; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"> needs headers)...</span></span></span></i> </p></blockquote><p>So, today I decided to demonstrate how to resolve this minor problem. As an example I tried to reproduce the <b>gdb</b> debugging steps from <a href="http://mysqlentomologist.blogspot.com/2015/03/using-gdb-to-understand-what-locks-and_31.html" target="_blank">this older post</a>. 
I want to trace and report table and row level locks set during execution of various SQL statements against InnoDB tables in MariaDB Server 10.6.6 built from more or less current GitHub code. To reduce the performance impact I'll use a recent version of <b>bpftrace</b> built from GitHub source:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/git/bpftrace/build$ <b>src/bpftrace --version</b><br />bpftrace v0.14.0-50-g4228</span></span><br /></p></blockquote><p>The first thing to find out (given the same test case as in that older blog post) is what functions to add a <b>uprobe</b> on and what information is available. Let's start with <b>gdb</b> and set a couple of breakpoints on the <b>lock_table</b> and <b>lock_rec_lock</b> functions:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/maria10.6$ <b>sudo gdb -p `pidof mariadbd`</b><br />GNU gdb (Ubuntu 9.2-0ubuntu1~20.04) 9.2<br />Copyright (C) 2020 Free Software Foundation, Inc.<br />License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html><br />This is free software: you are free to change and redistribute it.<br />There is NO WARRANTY, to the extent permitted by law.<br />Type "show copying" and "show warranty" for details.<br />This GDB was configured as "x86_64-linux-gnu".<br />Type "show configuration" for configuration details.<br />For bug reporting instructions, please see:<br /><http://www.gnu.org/software/gdb/bugs/>.<br />Find the GDB manual and other documentation resources online at:<br /> <http://www.gnu.org/software/gdb/documentation/>.<br /><br />For help, type "help".<br />Type "apropos word" to search for commands related to "word".<br />Attaching to process 14850<br />[New LWP 14852]<br />[New LWP 14853]<br />[New LWP 14854]<br />[New LWP 14855]<br />[New LWP 14859]<br />[New LWP 14861]<br />[New LWP 14862]<br />[Thread debugging using libthread_db enabled]<br />Using host 
libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".<br />--Type <RET> for more, q to quit, c to continue without paging--<br />0x00007f126bc04aff in __GI___poll (fds=0x55ef5e36c838, nfds=3,<br /> timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29<br />29 ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.<br />(gdb) <b>b lock_table</b><br />Breakpoint 1 at 0x55ef5c44a840: lock_table. (4 locations)<br />(gdb) <b>b lock_rec_lock</b><br />Breakpoint 2 at 0x55ef5c1c4896: lock_rec_lock. (2 locations)<br />(gdb) <b>c</b><br />Continuing.<br />[New Thread 0x7f1245ffd700 (LWP 14899)]<br />[New Thread 0x7f1246fff700 (LWP 14900)]<br />[New Thread 0x7f1268518700 (LWP 14904)]<br />[Switching to Thread 0x7f1268518700 (LWP 14904)]<br /><br />Thread 11 "mariadbd" hit Breakpoint 1, lock_table (table=0x7f1218065b20,<br /> mode=LOCK_IS, thr=0x7f121806cc10)<br /> at /home/openxs/git/server/storage/innobase/lock/lock0lock.cc:3481<br />warning: Source file is more recent than executable.<br />3481 {<br />(gdb) <b>p table</b><br />$1 = (dict_table_t *) 0x7f1218065b20<br />(gdb) <b>p table->name</b><br />$2 = {m_name = 0x7f1218020908 "test/tt", static part_suffix = "#P#"}<br />(gdb) <b>c</b><br />Continuing.<br /><br />Thread 11 "mariadbd" hit Breakpoint 1, lock_table (table=0x7f1218065b20,<br /> mode=LOCK_IS, thr=0x7f121806cc10)<br /> at /home/openxs/git/server/storage/innobase/include/que0que.ic:37<br />warning: Source file is more recent than executable.<br />37 return(thr->graph->trx);<br />(gdb) <b>c</b><br />Continuing.<br /><br />Thread 11 "mariadbd" hit Breakpoint 2, lock_rec_lock (impl=false, mode=2,<br /> block=0x7f12480325a0, heap_no=2, index=0x7f1218066f90, thr=0x7f121806cc10)<br /> at /home/openxs/git/server/storage/innobase/include/que0que.ic:37<br />37 return(thr->graph->trx);<br />(gdb) <b>p index</b><br />$3 = (dict_index_t *) 0x7f1218066f90<br />(gdb) <b>p index->name</b><br />$4 = {m_name = 0x7f1218067120 "PRIMARY"}<br />(gdb) <b>p 
index->table->name</b><br />$5 = {m_name = 0x7f1218020908 "test/tt", static part_suffix = "#P#"}<br />(gdb) <b>p index->table</b><br />$6 = (dict_table_t *) 0x7f1218065b20<br />(gdb) <b>q</b><br />A debugging session is active.<br /><br /> Inferior 1 [process 14850] will be detached.<br /><br />Quit anyway? (y or n) <b>y</b><br />Detaching from program: /home/openxs/dbs/maria10.6/bin/mariadbd, process 14850<br />[Inferior 1 (process 14850) detached]<br />openxs@ao756:~/dbs/maria10.6$</span></span><br /></p></blockquote><p>From the above we already see that breakpoints are set in more than one place (so we expect more than one function name to match for the <b>uprobe</b>). We also see some structures, passed as function arguments, that we want to access, so we need a way to define them for tracing.</p><p>Now, if we try something lame to add a probe (this should be done as <b>root</b> or via <b>sudo</b>) for just <b>lock_table</b>:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/git/bpftrace/build$ <b>src/bpftrace -e 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table { printf("lock_table: %p, %d, %p\n", arg0, arg1, arg2); }'</b><br />ERROR: bpftrace currently only supports running as the root user.<br />openxs@ao756:~/git/bpftrace/build$ <b>sudo src/bpftrace -e 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table { printf("lock_table: %p, %d, %p\n", arg0, arg1, arg2); }'</b><br />[sudo] password for openxs:<br />Attaching 20 probes...<br /><b>^C</b></span></span><br /></p></blockquote><p>we can already suspect something bad, as we ended up with 20 probes. 
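Presumably the probe name is matched against every symbol in the binary that begins with it, which is why a bare lock_table also catches lock_tables, lock_table_names and the rest. A toy sketch of that kind of prefix matching (my own illustration, not bpftrace's actual code), using a few names from this very server binary:

```python
# Toy illustration of prefix matching over (demangled) symbol names:
# a bare "lock_table" pattern also hits lock_tables, lock_table_names, etc.
symbols = [
    "lock_table", "lock_tables", "lock_table_names", "lock_table_create",
    "lock_table_for_trx", "lock_table_has_locks", "lock_rec_lock",
]
matches = [s for s in symbols if s.startswith("lock_table")]
print(matches)  # every name above except lock_rec_lock
```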
We can even list them in a readable way:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/git/bpftrace/build$ <b>sudo src/bpftrace -l 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table' | c++filt</b><br />uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table(dict_table_t*, lock_mode, que_thr_t*)<br />uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int)<br />uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_names(THD*, DDL_options_st const&, TABLE_LIST*, TABLE_LIST*, unsigned long, unsigned int)<br />uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_names(THD*, DDL_options_st const&, TABLE_LIST*, TABLE_LIST*, unsigned long, unsigned int) [clone .cold]<br />uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_create(dict_table_t*, unsigned int, trx_t*, ib_lock_t*)<br />uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_for_trx(dict_table_t*, trx_t*, lock_mode)<br />uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_has_locks(dict_table_t*)<br />uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_resurrect(dict_table_t*, trx_t*, lock_mode)<br />uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_resurrect(dict_table_t*, trx_t*, lock_mode) [clone .cold]<br />uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_lock_list_init(ut_list_base<ib_lock_t, ut_list_node<ib_lock_t> lock_table_t::*>*)<br />uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_print(_IO_FILE*, ib_lock_t const*)<br />uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_print(_IO_FILE*, ib_lock_t const*) [clone .cold]<br />uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_wsrep(dict_table_t*, lock_mode, que_thr_t*, trx_t*)<br />uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_wsrep(dict_table_t*, lock_mode, que_thr_t*, trx_t*) [clone .cold]<br 
/>uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_dequeue(ib_lock_t*, bool)<br />uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_dequeue(ib_lock_t*, bool) [clone .cold]<br />uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_dequeue(ib_lock_t*, bool) [clone .constprop.0]<br />uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table_dequeue(ib_lock_t*, bool) [clone .constprop.0] [clone .cold]<br />uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_tables_precheck(THD*, TABLE_LIST*)<br />uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_tables_open_and_lock_tables(THD*, TABLE_LIST*)<br />openxs@ao756:~/git/bpftrace/build$ <b>sudo src/bpftrace -l 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table' | wc -l</b><br />20</span></span><br /></p></blockquote><p>So, as expected, our probe matches any function whose name starts with "lock_table", and using the first row of the output (the demangled function signature) does NOT help:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/git/bpftrace/build$ <b>sudo src/bpftrace -l 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table' | head -1 | c++filt</b><br />uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table(dict_table_t*, lock_mode, que_thr_t*)<br />openxs@ao756:~/git/bpftrace/build$ <b>sudo src/bpftrace -l 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table(dict_table_t*, lock_mode, que_thr_t*)'</b><br />stdin:1:1-59: ERROR: syntax error, unexpected (, expecting {<br />uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table(dict_table_t*, lock_mode, que_thr_t*)<br />~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~</span></span><br /></p></blockquote><p>We can NOT use the demangled signature, but we can use its mangled equivalent:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/git/bpftrace/build$ <b>sudo src/bpftrace -l 
'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_table' | head -1</b><br />uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:_Z10lock_tableP12dict_table_t9lock_modeP9que_thr_t<br /></span></span></p></blockquote><p>Like this:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/git/bpftrace/build$ <b>sudo src/bpftrace -e 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:_Z10lock_tableP12dict_table_t9lock_modeP9que_thr_t<br /> { printf("lock_table: %p, %d, %p\n", arg0, arg1, arg2); }'</b><br />Attaching 1 probe...<br />lock_table: 0x7f1218065b20, 0, 0x7f121806cc10<br /><b>lock_table: 0x7f121801d690, 4, 0x7f1218068d60<br />lock_table: 0x7f121801d690, 1, 0x7f1218068d60<br /></b>lock_table: 0x7f122400d350, 1, 0x7f121807a158<br />lock_table: 0x7f122400d350, 1, 0x7f121807a158<br />lock_table: 0x7f12240101e0, 1, 0x7f121807a900<br />lock_table: 0x7f12240101e0, 1, 0x7f121807a900<br />lock_table: 0x7f12240101e0, 1, 0x7f121807a900<br />lock_table: 0x7f12240101e0, 1, 0x7f121807a900<br />lock_table: 0x7f12240101e0, 1, 0x7f121807a900<br />lock_table: 0x7f12240101e0, 1, 0x7f121807a900<br /><b>^C</b></span></span><br /></p></blockquote><p>I've got the above output while executing this SQL statement:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">MariaDB [test]> <b>insert into t(val) select 100 from tt;</b><br />Query OK, 4 rows affected (0.038 sec)<br />Records: 4 Duplicates: 0 Warnings: 0</span></span><br /></p></blockquote><p>A few more table-level lock requests than one would expect, and from more than one thread. This can be explained, but the real problem is that we see the lock mode and the locking thread address, but NOT the actual name of the locked table (that is hidden inside a complex structure). From the <b>gdb</b> output we know that the first argument (<b>arg0</b>) of the function is a pointer to the <b>dict_table_t</b> structure, but how is it defined? 
</p><p>I do not know InnoDB source code by heart, so I have to search (with some assumptions in mind this is easier):</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/git/server$ <b>grep -rn 'struct dict_table_t' * | grep '\.h'</b><br />storage/innobase/include/row0import.h:34:struct dict_table_t;<br />storage/innobase/include/srv0start.h:33:struct dict_table_t;<br /><b>storage/innobase/include/dict0mem.h:1788:struct dict_table_t {</b><br />storage/innobase/include/dict0types.h:39:struct dict_table_t;</span></span><br /></p></blockquote><p>I've highlighted the line where the structure definition begins, so I can see what's there (only the important part, found much later in the definition, is quoted):</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">...<br />public:<br /> /** Id of the table. */<br /> table_id_t id;<br /> /** dict_sys.id_hash chain node */<br /> dict_table_t* id_hash;<br /> /** Table name in name_hash */<br /><b> table_name_t name;</b><br />...</span></span><br /></p></blockquote><p>Even the <b>name</b> is not a simple scalar data type, so we have to find its definition in the code (I did not care much to make sure this is really the 10.6 branch, assuming that some basic things do not change that often in MariaDB, an optimistic approach):</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/git/server$ <b>vi +102 storage/innobase/include/dict0types.h</b><br /><br />struct table_name_t<br />{<br /> /** The name in internal representation */<br /><b> char* m_name;<br /></b><br /> /** Default constructor */<br /> table_name_t() {}<br /> /** Constructor */<br /> table_name_t(char* name) : m_name(name) {}<br /><br /> /** @return the end of the schema name */<br /> const char* dbend() const<br /> {<br /> const char* sep = strchr(m_name, '/');<br /> ut_ad(sep);<br /> return sep;<br /> }<br /><br /> /** @return the length of the schema name, in bytes 
*/<br /> size_t dblen() const { return size_t(dbend() - m_name); }<br /><br /> /** Determine the filename-safe encoded table name.<br /> @return the filename-safe encoded table name */<br /> const char* basename() const { return dbend() + 1; }<br /><br /> /** The start of the table basename suffix for partitioned tables */<br /><b> static const char part_suffix[4];</b><br /><br /> /** Determine the partition or subpartition name suffix.<br /> @return the partition name<br /> @retval NULL if the table is not partitioned */<br /> const char* part() const { return strstr(basename(), part_suffix); }<br /><br /> /** @return whether this is a temporary or intermediate table name */<br /> inline bool is_temporary() const;<br />};</span></span><br /></p></blockquote><p>I do not care about function members, just the data stored in the structure (highlighted). Moreover, read <a href="https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#6-struct-struct-declaration" target="_blank">this part of the <b>bpftrace</b> reference</a> carefully:</p><blockquote><p><i>You can define your own structs when needed. In some cases, kernel structs are not declared in the kernel
headers package, and are declared manually in bpftrace tools (or partial structs are: enough to reach the
member to dereference).</i></p></blockquote><p>Thing is, while we theoretically can use <b>#include</b> in <b>bpftrace</b>, structures used in the MariaDB server are defined in so many headers and are so specific and sometimes deeply nested, that usually it's easier to use properly sized placeholders in explicit <b>struct</b> definitions. You can get proper sizes from <b>gdb</b>:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">(gdb) <b>p sizeof(table_id_t)</b><br />$1 = 8<br />(gdb) <b>p sizeof(dict_table_t *)</b><br />$2 = 8<br />(gdb) <b>p sizeof(long long)</b><br />$3 = 8</span></span><br /></p></blockquote><p>Taking the above into account, we can try to define structures like these:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">struct table_name_t<br />{<br /> char* m_name;<br /> char part_suffix[4];<br />}<br /><br />struct dict_table_t {<br /> long long id;<br /> struct dict_table_t* id_hash;<br /> struct table_name_t name;<br />}</span></span><br /></p></blockquote><p>in our <b>bpftrace</b> code to be able to properly dereference the table name as a null-terminated string:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/git/bpftrace/build$ <b>sudo src/bpftrace -e '</b><br />><br />> <b>struct table_name_t</b><br />> <b>{</b><br />> <b> char* m_name;</b><br />> <b> char part_suffix[4];</b><br />> <b>}<br /></b>><br />> <b>struct dict_table_t {</b><br />> <b> long long id;</b><br />> <b> struct dict_table_t* id_hash;</b><br />> <b> struct table_name_t name;</b><br />> <b>}</b><br />><br />> <b>uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:_Z10lock_tableP12dict_table_t9lock_modeP9que_thr_t</b><br />> <b>{ printf("lock_table: %s, %d, %p\n", str(((struct dict_table_t *)arg0)->name.m_name), arg1, arg2); }'</b><br />Attaching 1 probe...<br /><b>lock_table: test/tt, 0, 0x7f121806cc10<br />lock_table: test/t, 4, 0x7f1218068d60<br
/>lock_table: test/t, 1, 0x7f1218068d60<br /></b>lock_table: mysql/innodb_table_stats, 1, 0x7f122800e7d8<br />lock_table: mysql/innodb_table_stats, 1, 0x7f122800e7d8<br />lock_table: mysql/innodb_index_stats, 1, 0x7f122800ef80<br />lock_table: mysql/innodb_index_stats, 1, 0x7f122800ef80<br />lock_table: mysql/innodb_index_stats, 1, 0x7f122800ef80<br />lock_table: mysql/innodb_index_stats, 1, 0x7f122800ef80<br />lock_table: mysql/innodb_index_stats, 1, 0x7f122800ef80<br />lock_table: mysql/innodb_index_stats, 1, 0x7f122800ef80<br /><b>^C</b></span></span><br /></p></blockquote><p>Looks good enough as a proof of concept already. We see how to define simplified structures containing just enough information to find and dereference members of complex nested structures used all over the MariaDB server code.<br /></p><p>I'd surely want to add a probe to <b>lock_rec_lock</b> too, and for this I need some more <b>gdb</b> outputs from the breakpoint set:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">Thread 11 "mariadbd" hit Breakpoint 2, lock_rec_lock (impl=false, mode=2,<br /> block=0x7f12480325a0, heap_no=2, index=0x7f1218066f90, thr=<b>0x7f121806cc10</b>)<br /> at /home/openxs/git/server/storage/innobase/include/que0que.ic:37<br />37 return(thr->graph->trx);<br />(gdb) <b>p index</b><br />$3 = (dict_index_t *) 0x7f1218066f90<br />(gdb) <b>p index->name</b><br />$4 = {m_name = 0x7f1218067120 "PRIMARY"}<br />(gdb) <b>p index->table->name</b><br />$5 = {m_name = 0x7f1218020908 "test/tt", static part_suffix = "#P#"}<br />(gdb) <b>p index->table</b><br />$6 = (dict_table_t *) 0x7f1218065b20<br />(gdb) <b>p index->id</b><br />$7 = 48<br />(gdb) <b>p sizeof(index->id)</b><br />$8 = 8<br />(gdb) <b>p sizeof(index->heap)</b><br />$9 = 8<br />(gdb) <b>p sizeof(index->name)</b><br />$10 = 8</span></span><br /><span style="font-family: courier;"><span style="font-size: x-small;">(gdb)<b> p impl</b><br />$11 = false<br />(gdb) <b>p 
sizeof(impl)</b><br />$12 = 1<br />(gdb) <b>p sizeof(mode)</b><br />$13 = 4</span></span><br /></p></blockquote><p>I can surely find the definition of the <b>dict_index_t</b> structure and work based on it:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">struct dict_index_t {<br /> /** Maximum number of fields */<br /> static constexpr unsigned MAX_N_FIELDS= (1U << 10) - 1;<br /><br /> index_id_t id; /*!< id of the index */<br /> mem_heap_t* heap; /*!< memory heap */<br /><b> id_name_t name; /*!< index name */<br /> dict_table_t* table; /*!< back pointer to table */<br /></b>...</span></span><br /></p></blockquote><p>but the sizes above are actually enough to come up with proper code like this:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">sudo src/bpftrace -e '<br /><br />struct table_name_t<br />{<br /> char* m_name;<br /> char part_suffix[4];<br />}<br /><br />struct dict_table_t {<br /> long long id;<br /> struct dict_table_t* id_hash;<br /> struct table_name_t name;<br />}<br /><br />struct dict_index_t {<br /> long long id;<br /> long long heap;<br /> char* name;<br /> struct dict_table_t* table;<br />}<br /><br />uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:_Z10lock_tableP12dict_table_t9lock_modeP9que_thr_t<br /> { printf("lock_table: %s, mode: %d, thread: %p\n", <br /> str(((struct dict_table_t *)arg0)->name.m_name), <br /> arg1, <br /> arg2); }<br /><br />uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_rec_lock<br /> { printf("lock_rec_lock: impl (%d) mode %d index %s rec of %s, thread: %p\n", <br /> arg0, <br /> arg1, <br /> str(((struct dict_index_t *)arg4)->name), <br /> str(((struct dict_index_t *)arg4)->table->name.m_name),<br /> arg5); }</span></span><br /></p></blockquote><p>that produces the following: </p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/git/bpftrace/build$ <b>sudo src/bpftrace -e '</b><br />><br />> 
<b>struct table_name_t</b><br />> <b>{</b><br />> <b> char* m_name;</b><br />> <b> char part_suffix[4];</b><br />> <b>}<br /></b>><br />> <b>struct dict_table_t {</b><br />> <b> long long id;</b><br />> <b> struct dict_table_t* id_hash;</b><br />> <b> struct table_name_t name;</b><br />> <b>}</b><br />><br />> <b>struct dict_index_t {</b><br />> <b> long long id;</b><br />> <b> long long heap;</b><br />> <b> char* name;</b><br />> <b> struct dict_table_t* table;</b><br />> <b>}</b><br />><br />> <b>uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:_Z10lock_tableP12dict_table_t9lock_modeP9que_thr_t</b><br />> <b> { printf("lock_table: %s, mode: %d, thread: %p\n",</b><br />> <b> str(((struct dict_table_t *)arg0)->name.m_name),</b><br />> <b> arg1,</b><br />> <b> arg2); }</b><br />><br />> <b>uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:lock_rec_lock</b><br />> <b> { printf("lock_rec_lock: impl (%d) mode %d index %s rec of %s, thread: %p\n",</b><br />> <b> arg0,</b><br />> <b> arg1,</b><br />> <b> str(((struct dict_index_t *)arg4)->name),</b><br />> <b> str(((struct dict_index_t *)arg4)->table->name.m_name),</b><br />> <b> arg5); }</b><br />><br />> '<br />Attaching 3 probes...<br /><b>lock_table: test/tt, mode: 0, thread: 0x7f121806cc10<br />lock_rec_lock: impl (0) mode 2 index PRIMARY rec of test/tt, thread: 0x7f121806cc10<br /></b>lock_table: test/t, mode: 4, thread: 0x7f1218068d60<br />lock_table: test/t, mode: 1, thread: 0x7f1218068d60<br /><b>lock_rec_lock: impl (0) mode 2 index PRIMARY rec of test/tt, thread: 0x7f121806cc10<br />lock_rec_lock: impl (0) mode 2 index PRIMARY rec of test/tt, thread: 0x7f121806cc10<br />lock_rec_lock: impl (0) mode 2 index PRIMARY rec of test/tt, thread: 0x7f121806cc10<br />lock_rec_lock: impl (0) mode 2 index PRIMARY rec of test/tt, thread: 0x7f121806cc10<br /></b>lock_table: mysql/innodb_table_stats, mode: 1, thread: 0x7f123400ebc8<br />lock_rec_lock: impl (0) mode 3 index PRIMARY rec of mysql/innodb_table_stats, thread: 
0x7f123400ebc8<br />lock_rec_lock: impl (0) mode 3 index PRIMARY rec of mysql/innodb_table_stats, thread: 0x7f123400ebc8<br />lock_table: mysql/innodb_table_stats, mode: 1, thread: 0x7f123400ebc8<br />lock_rec_lock: impl (0) mode 1026 index PRIMARY rec of mysql/innodb_table_stats, thread: 0x7f123400ebc8<br />lock_table: mysql/innodb_index_stats, mode: 1, thread: 0x7f123400f370<br />lock_rec_lock: impl (0) mode 3 index PRIMARY rec of mysql/innodb_index_stats, thread: 0x7f123400f370<br />lock_rec_lock: impl (0) mode 3 index PRIMARY rec of mysql/innodb_index_stats, thread: 0x7f123400f370<br />lock_table: mysql/innodb_index_stats, mode: 1, thread: 0x7f123400f370<br />lock_rec_lock: impl (0) mode 1026 index PRIMARY rec of mysql/innodb_index_stats, thread: 0x7f123400f370<br />lock_table: mysql/innodb_index_stats, mode: 1, thread: 0x7f123400f370<br />lock_rec_lock: impl (0) mode 3 index PRIMARY rec of mysql/innodb_index_stats, thread: 0x7f123400f370<br />lock_rec_lock: impl (0) mode 3 index PRIMARY rec of mysql/innodb_index_stats, thread: 0x7f123400f370<br />lock_table: mysql/innodb_index_stats, mode: 1, thread: 0x7f123400f370<br />lock_rec_lock: impl (0) mode 1026 index PRIMARY rec of mysql/innodb_index_stats, thread: 0x7f123400f370<br />lock_table: mysql/innodb_index_stats, mode: 1, thread: 0x7f123400f370<br />lock_rec_lock: impl (0) mode 3 index PRIMARY rec of mysql/innodb_index_stats, thread: 0x7f123400f370<br />lock_rec_lock: impl (0) mode 3 index PRIMARY rec of mysql/innodb_index_stats, thread: 0x7f123400f370<br />lock_table: mysql/innodb_index_stats, mode: 1, thread: 0x7f123400f370<br />lock_rec_lock: impl (0) mode 1026 index PRIMARY rec of mysql/innodb_index_stats, thread: 0x7f123400f370<br /><b>^C</b></span></span><br /></p></blockquote><p>while this SQL is executed, for example:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">MariaDB [test]> <b>insert into t(val) select 100 from tt;</b><br />Query OK, 4 rows affected 
(0.080 sec)<br />Records: 4 Duplicates: 0 Warnings: 0</span></span><br /></p></blockquote><p>Please check that <a href="http://mysqlentomologist.blogspot.com/2015/03/using-gdb-to-understand-what-locks-and_31.html" target="_blank">older post</a> for proper interpretation of the output, including constants used to represent lock modes etc. The idea here was to show that <b>bpftrace</b> understands a subset of the C/C++ <b>struct</b> statement, and to show ways to create such simplified structures to be able to dereference and access deeply nested arguments if needed for tracing.</p><p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEicSvt0xa-Go6sKLBEe7yvJGjVWFfXMMxVXZL1WdAjZVpAHuBdk4J5enb7EmJq7UkCILoO5-0Antown69Jdihfc-IsQKZO_8rPe8IhZeXl6V8NxQqa0h7kyFHdh6KCR3a3O4v-HEsaYKQA-ziXMvoJW1iKnNuzJJiC16SYatkUGjuhYevuNfiaVvM3aIA=s665" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="395" data-original-width="665" height="380" src="https://blogger.googleusercontent.com/img/a/AVvXsEicSvt0xa-Go6sKLBEe7yvJGjVWFfXMMxVXZL1WdAjZVpAHuBdk4J5enb7EmJq7UkCILoO5-0Antown69Jdihfc-IsQKZO_8rPe8IhZeXl6V8NxQqa0h7kyFHdh6KCR3a3O4v-HEsaYKQA-ziXMvoJW1iKnNuzJJiC16SYatkUGjuhYevuNfiaVvM3aIA=w640-h380" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">A bit more advanced code may show more details about the individual rows locked...<br /></td></tr></tbody></table></p><p style="text-align: center;">* * *<br /></p><p>To summarize:</p><ol style="text-align: left;"><li>With some efforts like source code checks and <b>gdb</b> breakpoints/prints one can eventually figure out what structures to define to be able to access members of complex structures typical for MariaDB and MySQL server code in the <b>bpftrace</b> code.</li><li>Including existing 
headers is NOT possible and is hardly practical for complex software like MariaDB that uses C++ nowadays.</li><li><b>bpftrace</b>, with some efforts like those presented above, allows one to study what happens in complex server operations (like InnoDB locking) easily, safely and with minimal impact on a running system.</li><li><b>bpftrace</b> is cool, but you already know that!<br /></li></ol>Valerii Kravchukhttp://www.blogger.com/profile/13158916419325454260noreply@blogger.com0tag:blogger.com,1999:blog-3080615211468083537.post-42969749817323514572022-01-10T22:03:00.001+02:002022-01-10T22:09:29.182+02:00Differential Flame Graphs to Highlight Performance Schema Waits Impact<p>Yet another type of flame graph that I have not discussed yet is the <a href="https://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html" target="_blank"><i>differential flame graph</i></a> (again invented by <b>Brendan Gregg</b>). It shows the difference between two flame graphs in a clear way (assuming they are comparable - it's on you to make sure the comparison makes sense and to interpret the output properly). The flame graph is drawn using the "after" profile (such that the frame
widths show values for the second graph), and is then colorized by the
delta to show how we got there. If the metric for a given frame in the same stack is larger, it is shown in red. If it's smaller, the frame is blue (hence the name <i>red/blue</i> differential flame graphs). The saturation is relative to the delta, so dark red is for a frame that has a much bigger value in the second graph. White and very light frames can be ignored.</p><p>Let me apply this approach to the flame graphs showing waits reported by MySQL's performance schema (built as described in <a href="http://mysqlentomologist.blogspot.com/2022/01/visualizing-performance-schema-events.html" target="_blank">this blog post</a>). As a proof of concept I'll use an easy-to-interpret case where the same <b>oltp_write_only.lua</b> <b>sysbench</b> test is run with different values of <b>innodb_flush_log_at_trx_commit</b> and otherwise the same settings, like 16 concurrent threads, on my old 2-core Ubuntu 20.04 "home server" with a slow HDD.</p><p>So, here is the first run:</p><blockquote><p><span style="font-size: x-small;"><span style="font-family: courier;">sysbench --table-size=1000000 --threads=1 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_write_only.lua cleanup<br />sysbench --table-size=1000000 --threads=1 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_write_only.lua prepare<br />bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'truncate table events_waits_history_long'<br /><b>bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'set global innodb_flush_log_at_trx_commit=0'<br /></b>sysbench --table-size=1000000 --threads=16 --time=120 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_write_only.lua run<br />bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'select event_name, timer_wait from events_waits_history_long' 
>/tmp/waits64_0.txt<br />cat /tmp/waits64_0.txt | awk '{ printf("%s %d\n", $1, $2); }' | sed 's/\//;/g' > /tmp/w64_0.out<br /></span></span></p></blockquote><p></p><p>That ended up with these <b>sysbench</b> statistics:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">SQL statistics:<br /> queries performed:<br /> read: 0<br /> write: 26252<br /> other: 13126<br /> total: 39378<br /><b> transactions: 6563 (54.60 per sec.)<br /> queries: 39378 (327.59 per sec.)<br /></b> ignored errors: 0 (0.00 per sec.)<br /> reconnects: 0 (0.00 per sec.)<br /><br />General statistics:<br /> total time: 120.2017s<br /> total number of events: 6563<br /><br />Latency (ms):<br /> min: 66.37<br /> avg: 292.94<br /> max: 1222.92<br /><b> 95th percentile: 530.08</b><br /> sum: 1922587.72<br /><br />Threads fairness:<br /> events (avg/stddev): 410.1875/4.69<br /> execution time (avg/stddev): 120.1617/0.05</span></span><br /></p></blockquote><p>and then the second:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">sysbench --table-size=1000000 --threads=1 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_write_only.lua cleanup<br />sysbench --table-size=1000000 --threads=1 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_write_only.lua prepare<br />bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'truncate table events_waits_history_long'<br /><b>bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'set global innodb_flush_log_at_trx_commit=1'<br /></b>sysbench --table-size=1000000 --threads=16 --time=120 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_write_only.lua run<br />bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'select event_name, timer_wait from 
events_waits_history_long' >/tmp/waits64_1.txt<br />cat /tmp/waits64_1.txt | awk '{ printf("%s %d\n", $1, $2); }' | sed 's/\//;/g' > /tmp/w64_1.out</span></span><br /></p></blockquote><p>that produced the following:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">SQL statistics:<br /> queries performed:<br /> read: 0<br /> write: 16238<br /> other: 8119<br /> total: 24357<br /><b> transactions: 4059 (33.71 per sec.)<br /> queries: 24357 (202.27 per sec.)<br /></b> ignored errors: 1 (0.01 per sec.)<br /> reconnects: 0 (0.00 per sec.)<br /><br />General statistics:<br /> total time: 120.4139s<br /> total number of events: 4059<br /><br />Latency (ms):<br /> min: 151.87<br /> avg: 474.29<br /> max: 1316.65<br /><b> 95th percentile: 773.68</b><br /> sum: 1925150.61<br /><br />Threads fairness:<br /> events (avg/stddev): 253.6875/3.51<br /> execution time (avg/stddev): 120.3219/0.13</span></span><br /></p></blockquote><p>Now with resulting "collapsed stacks" in <b>/tmp/w64_?.out</b> files, we can build a simple differential flame graph </p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/8.0$ <b>~/git/FlameGraph/difffolded.pl /tmp/w64_0.out /tmp/w64_1.out | ~/git/FlameGraph/flamegraph.pl --count picoseconds --title Waits > /tmp/w64_01_diff.svg</b></span></span><br /></p></blockquote><p>that, when captured as a .png file for this blog post looks as follows:<br /></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEje7v5fKHHNe1wZRa4VEE7M7rTEW0YGCd1uR8MGgzK9Rbz5RmHuC6p7JReh1Gwj1ZBmhiPWHU6DXP7hb7SsIdqlqTcjW_wzzlGS-M3N3fzIaPZ3YBYAmz4k514dsZJMREAi3Xhz1_zlOj-jYQ_IJFDo25ILByKTMBEA8ws2AjpEaAToFhs6aWJONIS-eg=s1199" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="152" 
data-original-width="1199" height="82" src="https://blogger.googleusercontent.com/img/a/AVvXsEje7v5fKHHNe1wZRa4VEE7M7rTEW0YGCd1uR8MGgzK9Rbz5RmHuC6p7JReh1Gwj1ZBmhiPWHU6DXP7hb7SsIdqlqTcjW_wzzlGS-M3N3fzIaPZ3YBYAmz4k514dsZJMREAi3Xhz1_zlOj-jYQ_IJFDo25ILByKTMBEA8ws2AjpEaAToFhs6aWJONIS-eg=w640-h82" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Differential flame graph visualizing the impact of <b>innodb_flush_log_at_trx_commit</b> value (the difference 1 makes vs 0) on waits reported by <b>performance_schema</b> for the same <b>oltp_write_only.lua</b> test<br /></td></tr></tbody></table><p></p><p>We clearly see the negative impact on performance (54 TPS vs 33 TPS). We also see that time spent on InnoDB log I/O increased, and that idle time somewhat increased too. The highest relative increase was for the <b>/wait/synch/cond/sql/MYSQL_BIN_LOG::COND_done</b> condition variable (5+%), and that was probably related to the binlog group commit implementation, where we could not write to the binary log until the redo log is flushed. A wild guess, surely.</p><p>I deliberately used as simple flame graphs as possible to make interpretation of the difference based on just a screenshot easier. This lame test demonstrated that we really can see what we expected - somewhat increased redo log I/O waits, and they are highlighted as red. 
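For reference, the intermediate format here is very simple: <b>difffolded.pl</b> essentially joins the two "folded" files and emits every stack with both counts appended, and <b>flamegraph.pl</b> then colors each frame by the delta. A minimal sketch of that merge step, with made-up event names and values rather than the real measurements, could look like this:

```shell
# Two tiny "folded stacks" files: one ';'-separated stack per line, then a count.
cat > /tmp/w_before.folded <<'EOF'
wait;io;file;innodb;innodb_log_file 100
wait;synch;cond;sql;MYSQL_BIN_LOG::COND_done 10
EOF
cat > /tmp/w_after.folded <<'EOF'
wait;io;file;innodb;innodb_log_file 180
wait;synch;cond;sql;MYSQL_BIN_LOG::COND_done 25
EOF
# Emit "stack count_before count_after"; frames whose second count is
# bigger end up red on the differential flame graph, smaller ones blue.
awk 'NR == FNR { before[$1] = $2; next }
     { print $1, before[$1] + 0, $2 }' /tmp/w_before.folded /tmp/w_after.folded
```

The real script also handles stacks that appear in only one of the two files, but the output format is the same three columns.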
Good enough for the proof of concept, and way easier to speculate about than a usual Off-CPU differential flame graph like this:</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEh9N3ROOT9_YwHoj1ZlQ21SRoO4N7Xu9Uw9rpa8rQg7saZx0nngvOpbn_I9_j9-tHHHtQzFapEq8NSM8Dhu5CF1KWJMOMnhCmZC4AynWKzHpbPeMS10Mv_FipKxxaa6_lMtSUAGlixMCzJXzmeeNSaraAG_bbd3aZvEdFO9E_8lEEwp5-ZGNowiEMH_3A=s1199" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="541" data-original-width="1199" height="288" src="https://blogger.googleusercontent.com/img/a/AVvXsEh9N3ROOT9_YwHoj1ZlQ21SRoO4N7Xu9Uw9rpa8rQg7saZx0nngvOpbn_I9_j9-tHHHtQzFapEq8NSM8Dhu5CF1KWJMOMnhCmZC4AynWKzHpbPeMS10Mv_FipKxxaa6_lMtSUAGlixMCzJXzmeeNSaraAG_bbd3aZvEdFO9E_8lEEwp5-ZGNowiEMH_3A=w640-h288" width="640" /></a></div><p></p><p>that I built <a href="http://mysqlentomologist.blogspot.com/2021/05/off-cpu-analysis-attempt-to-find-reason.html" target="_blank">while attempting to reproduce one real-life MariaDB performance problem</a>...<br /></p>Valerii Kravchukhttp://www.blogger.com/profile/13158916419325454260noreply@blogger.com0tag:blogger.com,1999:blog-3080615211468083537.post-90407600852016207482022-01-08T22:20:00.000+02:002022-01-08T22:20:57.706+02:00Visualizing MySQL Plan Execution Time With Flame Graphs<p>In the <a href="http://mysqlentomologist.blogspot.com/2022/01/visualizing-performance-schema-events.html" target="_blank">previous post</a> I've already shown one non-classical but useful application of <i>flame graphs</i> as a visualization of something interesting for MySQL DBAs (besides the usual stack traces from profilers): time spent on waits, or even statement execution by stages with related waits, as reported by <b>performance_schema</b>. 
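As a reminder, turning those performance_schema waits into "folded stacks" for flamegraph.pl needed almost no processing, because event names are already hierarchical: replacing the '/' separators with ';' is enough. A toy example with a made-up timer value:

```shell
# One event_name plus its timer_wait, as selected from
# events_waits_history_long; each '/' level becomes a stack frame.
printf 'wait/io/file/innodb/innodb_log_file 123456\n' | sed 's/\//;/g'
# -> wait;io;file;innodb;innodb_log_file 123456
```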
Today I am going to abuse <a href="https://github.com/brendangregg/FlameGraph" target="_blank">the tools</a> even more and try to show the real impact of each step in the query execution as reported by the <a href="https://dev.mysql.com/doc/refman/8.0/en/explain.html#explain-analyze" target="_blank"><b>EXPLAIN ANALYZE</b></a> statement that is supported since MySQL 8.0.18.</p><p> The idea to use flame graphs for SQL plan visualization was (to the best of my knowledge) first suggested by <b>Tanel Poder</b> in <a href="https://tanelpoder.com/posts/visualizing-sql-plan-execution-time-with-flamegraphs/" target="_blank">this article</a>, in the context of Oracle RDBMS. There are several implementations of the idea in free software <a href="https://github.com/mgartner/pg_flame" target="_blank">here</a> and <a href="https://github.com/felixge/flame-explain" target="_blank">there</a>, for example. <br /></p><p>Unfortunately I have not found any similar posts for MySQL, and the tools mentioned above do not work with MySQL. It made me wonder why. </p><p>For quite some time we have had a way to get several metrics for each execution step of the query plan in MySQL. Let's start with some stupid query (used to prove the concept and test possible steps to implement it):</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/8.0$ <b>bin/mysql -uroot --socket=/tmp/mysql8.sock test</b><br />Reading table information for completion of table and column names<br />You can turn off this feature to get a quicker startup with -A<br /><br />Welcome to the MySQL monitor. Commands end with ; or \g.<br />Your MySQL connection id is 33<br />Server version: 8.0.27 Source distribution<br /><br />Copyright (c) 2000, 2021, Oracle and/or its affiliates.<br /><br />Oracle is a registered trademark of Oracle Corporation and/or its<br />affiliates. Other names may be trademarks of their respective<br />owners.<br /><br />Type 'help;' or '\h' for help. 
Type '\c' to clear the current input statement.<br /><br />mysql> <b>explain analyze select user, host from mysql.user u1 where u1.user not in (select distinct user from mysql.user) order by host desc\G</b><br />*************************** 1. row ***************************<br />EXPLAIN: <b>-> Nested loop antijoin</b> (cost=3.75 rows=25) (<b>actual time</b>=0.139..<b>0.139</b> rows=0 loops=1)<br /><span style="background-color: #b6d7a8;"> </span><b>-> Covering index scan on u1 using PRIMARY</b> (reverse) (cost=0.75 rows=5) (<b>actual time</b>=0.058..<b>0.064</b> rows=5 loops=1)<br /><span style="background-color: #b6d7a8;"> </span>-> Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`) (actual time=0.003..0.003 rows=1 loops=5)<br /><span style="background-color: #b6d7a8;"> </span>-> Materialize with deduplication (cost=1.25..1.25 rows=5) (actual time=0.071..0.071 rows=4 loops=1)<br /><span style="background-color: #b6d7a8;"> </span>-> Filter: (mysql.`user`.`User` is not null) (cost=0.75 rows=5) (actual time=0.022..0.035 rows=5 loops=1)<br /><span style="background-color: #b6d7a8;"> </span>-> Covering index scan on user using PRIMARY (cost=0.75 rows=5) (actual time=0.020..0.031 rows=5 loops=1)<br /><br />1 row in set (0,02 sec)</span></span><br /></p></blockquote><p>We see a hierarchical representation of query execution steps (it's Oracle's famous TREE format of <b>EXPLAIN</b> output, the only one supported for <b>EXPLAIN ANALYZE</b>). Each row represents some step, explains quite verbosely what the step is doing, and then provides several useful metrics like the cost and estimated number of rows, and the actual time to return the first and all rows for this step, etc. 
The actual time to return all rows at this step is what DBAs are usually interested in. The steps form a tree, but the way it's represented is a bit unusual - we do not see JSON or any other structured format <a href="https://bugs.mysql.com/bug.php?id=106083" target="_blank">as some other RDBMSes provide</a>. Instead each row has some level in the hierarchy represented by the number of spaces before the very informative "<b>-></b>" prompt. The first row with its additional <b>"EXPLAIN: "</b> prompt aside, each nesting level adds 4 spaces, as highlighted by the light green background above.<br /></p><p>Maybe this unusual representation (not in a table with plan steps like in Oracle and not as JSON) prevented people from quickly implementing a flame-graph-based query "profiler" for MySQL. But this definitely is not going to stop me from hacking something that may even work. I have not really been a developer for 16+ years already, so I will not use Node.js or Python or anything fancy - just old Unix text processing tools like <b>sed</b> and <b>awk</b>, and surely the power of SQL (that small part I managed to master). </p><p>So, with output like the above saved in a file:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/8.0$ <b>cat /tmp/explain.txt</b><br />*************************** 1. 
row ***************************<br />EXPLAIN: -> Nested loop antijoin (cost=3.75 rows=25) (actual time=0.115..0.115 rows=0 loops=1)<br /> -> Covering index scan on u1 using PRIMARY (reverse) (cost=0.75 rows=5) (actual time=0.045..0.049 rows=5 loops=1)<br /> -> Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`) (actual time=0.002..0.002 rows=1 loops=5)<br /> -> Materialize with deduplication (cost=1.25..1.25 rows=5) (actual time=0.063..0.063 rows=4 loops=1)<br /> -> Filter: (mysql.`user`.`User` is not null) (cost=0.75 rows=5) (actual time=0.025..0.034 rows=5 loops=1)<br /> -> Covering index scan on user using PRIMARY (cost=0.75 rows=5) (actual time=0.024..0.032 rows=5 loops=1)<br /><br />openxs@ao756:~/dbs/8.0$</span></span><br /></p></blockquote><p>The first stage is simple and represented by the following command line I came up with quite fast:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/8.0$ <b>cat /tmp/explain.txt | awk 'NR > 1' | sed 's/EXPLAIN: //' | sed 's/(cost[^)][^)]*)//' | sed 's/(actual time=//' | sed 's/\(..*\)\.\...*/\1/' | sed 's/ \([^ ][^ ]*\)$/; \1/'</b><br />-> Nested loop antijoin ; 0.115<br /> -> Covering index scan on u1 using PRIMARY (reverse) ; 0.045<br /> -> Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`); 0.002<br /> -> Materialize with deduplication ; 0.063<br /> -> Filter: (mysql.`user`.`User` is not null) ; 0.025<br /> -> Covering index scan on user using PRIMARY ; 0.024<br /><br />openxs@ao756:~/dbs/8.0$</span><br /></span></p></blockquote><p>At each step, sequentially, I removed the first line of the output, removed that stupid <b>"EXPLAIN: "</b> "prompt", removed the cost and number of rows estimates, and then extracted just the actual time to return all rows from the step (separated by ";" as <b>flamegraph.pl</b> wants), and kept the detailed step description with the initial spaces.</p><p>The next
stage of processing took me a lot of time and effort, as I had to convert the output above into a different format that is suitable for loading into the database:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/8.0$ <b>cat /tmp/plan.txt</b><br />1;0;0; Nested loop antijoin ; 0.115<br />2;1;1; Covering index scan on u1 using PRIMARY (reverse) ; 0.045<br />3;1;1; Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`); 0.002<br />4;3;2; Materialize with deduplication ; 0.063<br />5;4;3; Filter: (mysql.`user`.`User` is not null) ; 0.025<br />6;5;4; Covering index scan on user using PRIMARY ; 0.024</span></span><br /></p></blockquote><p>The format is simple: semicolon-separated row number (step in the plan), row number of the "parent" step in the hierarchy (with 0 used for the very first step, obviously), the level of the step in the hierarchy (just in case), the same detailed step description, and the value of the metric for the step (time to return all rows in milliseconds). 
This file is then loaded into the following MySQL table:</p><blockquote><p><span style="font-size: xx-small;"><span style="font-family: courier;">mysql> <b>desc plan;</b><br />+--------+---------------+------+-----+---------+-------+<br />| Field | Type | Null | Key | Default | Extra |<br />+--------+---------------+------+-----+---------+-------+<br />| seq | int | YES | | NULL | |<br />| parent | int | YES | | NULL | |<br />| level | int | YES | | NULL | |<br />| step | varchar(1024) | YES | | NULL | |<br />| val | decimal(10,3) | YES | | NULL | |<br />+--------+---------------+------+-----+---------+-------+<br />5 rows in set (0,00 sec)<br /><br />mysql> <b>truncate plan;</b><br />Query OK, 0 rows affected (1,77 sec)<br /><br />mysql> <b>load data infile '/tmp/plan.txt' into table plan fields terminated by ';';</b><br />Query OK, 6 rows affected (0,13 sec)<br />Records: 7 Deleted: 0 Skipped: 0 Warnings: 0<br /><br />mysql> <b>select * from plan;</b><br />+------+--------+-------+------------------------------------------------------------------------------------------+-------+<br />| seq | parent | level | step | val |<br />+------+--------+-------+------------------------------------------------------------------------------------------+-------+<br />| 1 | 0 | 0 | Nested loop antijoin | 0.115 |<br />| 2 | 1 | 1 | Covering index scan on u1 using PRIMARY (reverse) | 0.045 |<br />| 3 | 1 | 1 | Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`) | 0.002 |<br />| 4 | 3 | 2 | Materialize with deduplication | 0.063 |<br />| 5 | 4 | 3 | Filter: (mysql.`user`.`User` is not null) | 0.025 |<br />| 6 | 5 | 4 | Covering index scan on user using PRIMARY | 0.024 |<br />+------+--------+-------+------------------------------------------------------------------------------------------+-------+<br />6 rows in set (0,00 sec)</span></span><br /></p></blockquote><p>Now, how to end up with that format? 
I spent a lot of time with <b>awk</b> and created the following script: <br /></p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">awk '<br />function st_push(val) {<br />st_array[st_pos++] = val;<br />}<br /><br />function st_pop() {<br />return (st_size() > 0) ? st_array[--st_pos] : "ERROR";<br />}<br /><br />function st_size() {<br />return st_pos;<br />}<br /><br />BEGIN { p[0] = 0; level[0] = 0; parent[0] = 0; } <br /><br />NF > 0 { <br /> # print "> " NR ";" $1;<br /> split($0,a,"->"); <br /> lvl = length(a[1])/4;<br /><br /> level[NR] = lvl;<br /> parent[NR] = NR-1;<br /><br /> if (level[NR] > level[NR-1]) {<br /> st_push(parent[NR]);<br /> p[level[NR]] = NR-1;<br /> }<br /> else if (level[NR] < level[NR-1]) {<br /> for (i=0; i<=(level[NR-1]-level[NR]); i++) {<br /> parent[NR] = st_pop();<br /> # print "poped " parent[NR];<br /> }<br /> }<br /> else {<br /> parent[NR]=parent[NR-1]<br /> }<br /> print NR ";" parent[NR] ";" lvl ";" a[2]<br />}'</span></span><br /></p></blockquote><p>that starts with 3 functions implementing a stack and then uses 3 associative arrays: <b>p[]</b> to store the current parent row number for a specific nesting level, <b>level[]</b> to store the hierarchy level of a specific row, and <b>parent[]</b> to store the number of the row that is the parent of a given row in the plan.<br /></p><p>To find the level of the row I <b>split()</b> the input row, with at least one default field separator ';' inside, using "-<b>></b>" as a field separator (assuming it is not used in the step description, ever) and then end up with the initial spaces in <b>a[1]</b> and the rest (the step description and metrics value) in <b>a[2]</b>. I divide the length of <b>a[1]</b> by 4 to find the level. </p><p>This was the simple part. Then I try to determine the parent row number, using a stack to save the previous parent row number for the level. 
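As a side note, the same parent can be computed without an explicit stack: the parent of a row at nesting level L is simply the most recent row seen at level L-1, so an associative array indexed by level is enough. A minimal sketch of that idea (the input is canned here to match the stage-1 format, with step descriptions shortened):

```shell
# Hypothetical simpler variant: last[L] remembers the latest row number
# seen at nesting level L, and the parent of a row at level L is last[L-1].
cat <<'EOF' > /tmp/plan_input.txt
-> Nested loop antijoin ; 0.115
    -> Covering index scan on u1 using PRIMARY (reverse) ; 0.045
    -> Single-row index lookup on <subquery2> ; 0.002
        -> Materialize with deduplication ; 0.063
            -> Filter: (user is not null) ; 0.025
                -> Covering index scan on user using PRIMARY ; 0.024
EOF
awk '
NF > 0 {
    split($0, a, "->");            # a[1] holds the leading spaces
    lvl = length(a[1]) / 4;        # 4 spaces per nesting level
    last[lvl] = NR;                # latest row seen at this level
    parent = (lvl > 0) ? last[lvl - 1] : 0;
    print NR ";" parent ";" lvl ";" a[2];
}' /tmp/plan_input.txt > /tmp/plan_alt.txt
cat /tmp/plan_alt.txt
```

On this sample it reproduces the same seq/parent/level columns as in /tmp/plan.txt above (1;0;0, 2;1;1, 3;1;1, 4;3;2, 5;4;3, 6;5;4).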
It took some testing with a more primitive input format to come up with "tree traversal" code for such a weird tree representation format that seems to work on the few different nested structures tested (correct me if it fails for more complex plans). I am not a developer any more, so the code above may be wrong.</p><p>So, with the plan in the table, where each line has a seq number and the seq number of the parent line as parent, one needs a simple enough recursive CTE to build all possible paths and summarize the time spent in each path:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">mysql> <b>with recursive cte_plan as (select seq, parent, level, concat(seq, concat(' -', step)) as step, round(val*1000) as val from plan where level = 0 union all select p.seq, p.parent, p.level, concat(c.step, concat(';', concat(p.seq, concat(' -',p.step)))), round(p.val*1000) as val from plan p join cte_plan c on p.parent = c.seq) select step, val from cte_plan\G</b><br />*************************** 1. row ***************************<br />step: 1 - Nested loop antijoin<br /> val: 115<br />*************************** 2. row ***************************<br />step: 1 - Nested loop antijoin ;2 - Covering index scan on u1 using PRIMARY (reverse)<br /> val: 45<br />*************************** 3. row ***************************<br />step: 1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`)<br /> val: 2<br />*************************** 4. row ***************************<br />step: 1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`);4 - Materialize with deduplication<br /> val: 63<br />*************************** 5. 
row ***************************<br />step: 1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`);4 - Materialize with deduplication ;5 - Filter: (mysql.`user`.`User` is not null)<br /> val: 25<br />*************************** 6. row ***************************<br />step: 1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`);4 - Materialize with deduplication ;5 - Filter: (mysql.`user`.`User` is not null) ;6 - Covering index scan on user using PRIMARY<br /> val: 24<br />6 rows in set (0,01 sec)</span></span><br /></p></blockquote><p>Note that I've multiplied the metrics value by 1000, as the <b>flamegraph.pl</b> script expects metrics to be integer numbers. So they are actually in microseconds. I also added the step number to each step, as we may actually have the exact same step in different places in the hierarchy if we read from the same table more than once in the query.</p><p>The final code to produce the input for <b>flamegraph.pl</b> is like this:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/8.0$ <b>~/dbs/8.0/bin/mysql -uroot --socket=/tmp/mysql8.sock test -B -e"with recursive cte_plan as (select seq, parent, level, concat(seq, concat(' -', step)) as step, round(val*1000) as val from plan where level = 0 union all select p.seq, p.parent, p.level, concat(c.step, concat(';', concat(p.seq, concat(' -',p.step)))), round(p.val*1000) as val from plan p join cte_plan c on p.parent = c.seq) select step, val from cte_plan;" | awk ' NR > 1' > /tmp/processed_plan.txt</b><br />openxs@ao756:~/dbs/8.0$ <b>cat /tmp/processed_plan.txt</b><br />1 - Nested loop antijoin 115<br />1 - Nested loop antijoin ;2 - Covering index scan on u1 using PRIMARY (reverse) 45<br />1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`) 2<br />1 - 
Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`);4 - Materialize with deduplication 63<br />1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`);4 - Materialize with deduplication ;5 - Filter: (mysql.`user`.`User` is not null) 25<br />1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`);4 - Materialize with deduplication ;5 - Filter: (mysql.`user`.`User` is not null) ;6 - Covering index scan on user using PRIMARY 24</span></span><br /></p></blockquote><p>This is how I create the final flame graph representing the plan steps:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/8.0$ <b>cat /tmp/processed_plan.txt | ~/git/FlameGraph/flamegraph.pl --title "EXPLAIN Steps" --inverted --countname microseconds > /tmp/explain2.svg</b></span></span><br /></p></blockquote><p>Surely these steps can be simplified, put into a script etc. I checked intermediate stages one by one, left some debugging details and so on. 
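For what it's worth, the database round-trip could even be skipped entirely: since the folded path of a row at level L is just the path of the most recent row at level L-1 plus the row itself, a single awk pass over the stage-1 output can emit the flamegraph.pl input directly. A sketch (sample input inlined, step descriptions shortened, so the file names are assumptions):

```shell
# Sketch: build flamegraph.pl folded stacks straight from the stage-1 output.
# path[L] holds the folded stack of the most recent row at nesting level L.
cat <<'EOF' > /tmp/stage1.txt
-> Nested loop antijoin ; 0.115
    -> Covering index scan on u1 using PRIMARY (reverse) ; 0.045
    -> Single-row index lookup on <subquery2> ; 0.002
        -> Materialize with deduplication ; 0.063
            -> Filter: (user is not null) ; 0.025
                -> Covering index scan on user using PRIMARY ; 0.024
EOF
awk -F';' '
NF > 1 {
    split($1, a, "->");
    lvl = length(a[1]) / 4;              # 4 leading spaces per nesting level
    step = NR " -" a[2];                 # number the step to keep repeats distinct
    path[lvl] = (lvl > 0) ? path[lvl-1] ";" step : step;
    # milliseconds -> integer microseconds, rounded
    printf("%s %d\n", path[lvl], $2 * 1000 + 0.5);
}' /tmp/stage1.txt > /tmp/folded.txt
cat /tmp/folded.txt
```

On this sample the result has the same shape as /tmp/processed_plan.txt above, just without the table and the recursive CTE in between.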
You can do better, but the final result is this nice icicle graph:</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEjBOnBEsgRZ7_ExcAQ-QlszKOEeZiqpmrRL_x11SfOa3_VSIoFkFgRaoUElJadprHr2zte0g4kvA3_oIled_ru3Alk1HYZz0wOnF1mElzBphb5E1ont_3GH76qxbDWiDEpr36erdawX_YXO_giMOSRkUTyRGC3LWVkdQiHoSv-2Kr8Y4lDpyJNcxZnkhA=s1196" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="159" data-original-width="1196" height="85" src="https://blogger.googleusercontent.com/img/a/AVvXsEjBOnBEsgRZ7_ExcAQ-QlszKOEeZiqpmrRL_x11SfOa3_VSIoFkFgRaoUElJadprHr2zte0g4kvA3_oIled_ru3Alk1HYZz0wOnF1mElzBphb5E1ont_3GH76qxbDWiDEpr36erdawX_YXO_giMOSRkUTyRGC3LWVkdQiHoSv-2Kr8Y4lDpyJNcxZnkhA=w640-h85" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Time spent on each step of query execution represented as an icicle graph<br /></td></tr></tbody></table><p></p><p>where I highlighted the impact of the <b>"Covering index scan on u1 using PRIMARY (reverse)"</b> step. As usual with flame graphs, in the original <b>.svg</b> file you can do search, zoom in and out, etc. For complex queries it may be really useful.</p><p>As a DBA I'd ask for something better than these weird stages (even if scripted). I wish we had all execution steps in tables or views, like in Oracle's <b>v$sql_plan</b> and <b>v$sql_plan_statistics_all</b>, or at least in JSON format maybe (like <b>EXPLAIN FORMAT=JSON</b> produces for the plan itself). One day, maybe (I am going to file a feature request for this later). 
Yet another day I plan to use the output of <b>ANALYZE FORMAT=JSON</b> in MariaDB to get similar information :)<br /></p>Valerii Kravchukhttp://www.blogger.com/profile/13158916419325454260noreply@blogger.com0tag:blogger.com,1999:blog-3080615211468083537.post-32025070588629733762022-01-06T22:52:00.000+02:002022-01-06T22:52:15.894+02:00Visualizing Performance Schema Events with Flame Graphs<p>Happy New Year 2022, dear readers! It's not the first time I am writing something about, or illustrating some points with, a <a href="https://mysqlentomologist.blogspot.com/search/label/flame%20graph" target="_blank">flame graph</a>. No wonder. The <a href="http://www.brendangregg.com/flamegraphs.html" target="_blank"><i>flame graph</i></a> concept and <a href="https://github.com/brendangregg/FlameGraph" target="_blank">related tools</a> by <b>Brendan Gregg</b> provide a great way to visualize metrics in any nested hierarchy, with function call stacks being just the most popular and best known example of them.</p><p>In this new blog post that I am writing while preparing for my upcoming FOSDEM 2022 talk "<a href="https://fosdem.org/2022/schedule/event/mysql_flame/" target="_blank"><b>Flame Graphs for MySQL DBAs</b></a>" I am going to explore a different hierarchy that is obvious from these simple SQL queries and their outputs:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">mysql> <b>select event_name, timer_wait from events_waits_history_long order by 1 desc limit 5;</b><br />+----------------------------------------------+------------+<br />| event_name | timer_wait |<br />+----------------------------------------------+------------+<br />| wait/synch/sxlock/innodb/trx_purge_latch | 747273 |<br />| wait/synch/sxlock/innodb/index_tree_rw_lock | 767343 |<br />| wait/synch/sxlock/innodb/hash_table_locks | 637557 |<br />| wait/synch/sxlock/innodb/hash_table_locks | 280980 |<br />| wait/synch/sxlock/innodb/dict_operation_lock | 1731372 |<br
/>+----------------------------------------------+------------+<br />5 rows in set (0,02 sec)</span></span></p></blockquote><p>Another one, from the summary table:</p><blockquote><span style="font-family: courier;"><span style="font-size: x-small;">mysql> <b>select event_name, sum_timer_wait from performance_schema.events_waits_summary_global_by_event_name order by 2 desc limit 5;</b><br />+---------------------------------------------------------+-------------------+<br />| event_name | sum_timer_wait |<br />+---------------------------------------------------------+-------------------+<br />| wait/synch/cond/mysqlx/scheduler_dynamic_worker_pending | 20961067338483726 |<br />| idle | 1461111486462000 |<br />| wait/io/socket/sql/client_connection | 1373040224524194 |<br />| wait/synch/cond/sql/MYSQL_BIN_LOG::COND_done | 1238396604397386 |<br />| <b>wait/io/file/innodb/innodb_temp_file</b> | 330058399794447 |<br />+---------------------------------------------------------+-------------------+<br />5 rows in set (0,05 sec)</span></span><br /></blockquote><p>We clearly see a hierarchy of waits of different types. It takes just a few simple steps to convert this kind of output to the format expected by the <a href="https://github.com/brendangregg/FlameGraph/blob/810687f180f3c4929b5d965f54817a5218c9d89b/flamegraph.pl#L18" target="_blank"><b>flamegraph.pl</b></a> tool:</p><blockquote><p><span style="font-size: xx-small;"><span style="font-family: courier;"># The input is stack frames and sample counts formatted as single lines. Each<br /># frame in the stack is semicolon separated, with a space and count at the end<br /># of the line. These can be generated for Linux perf script output using<br /># stackcollapse-perf.pl, for DTrace using stackcollapse.pl, and for other tools<br /># using the other stackcollapse programs. 
Example input:<br />#<br /><b># swapper;start_kernel;rest_init;cpu_idle;default_idle;native_safe_halt 1</b><br />#<br /># An optional extra column of counts can be provided to generate a differential<br /># flame graph of the counts, colored red for more, and blue for less. This<br /># can be useful when using flame graphs for non-regression testing.<br /># See the header comment in the difffolded.pl program for instructions.</span></span><br /></p></blockquote><p>Let me show a really simple way. We can get the lines without extra decorations:</p><p></p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/8.0$ <b>bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'select event_name, timer_wait from events_waits_history_long' >/tmp/waits.txt</b><br />openxs@ao756:~/dbs/8.0$ <b>head -10 /tmp/waits.txt</b><br />event_name timer_wait<br />wait/synch/mutex/innodb/log_writer_mutex 933255<br />wait/synch/mutex/innodb/log_flush_notifier_mutex 937938<br />wait/synch/mutex/innodb/log_flusher_mutex 561960<br />wait/synch/mutex/innodb/trx_sys_serialisation_mutex 345873<br />wait/synch/mutex/innodb/log_write_notifier_mutex 441540<br />wait/synch/mutex/mysqlx/lock_list_access 908502<br />wait/synch/mutex/innodb/log_checkpointer_mutex 927903<br />wait/synch/mutex/innodb/flush_list_mutex 160560<br />wait/synch/mutex/innodb/log_limits_mutex 120420</span></span><br /></p></blockquote><p>and then, with the assumption that there are no spaces inside the <b>event_name</b>, we can apply a simple combination of <b>sed</b> and <b>awk</b> commands to replace '<b>/</b>' with the expected '<b>;</b>' and make sure there is just one space before the metric (time in picoseconds in this case). 
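If that no-spaces assumption ever breaks (and it does break for names like statement/com/Close stmt seen later), one hypothetical workaround is to split on the tab separator that mysql -B emits between columns instead of on any whitespace:

```shell
# Sketch: mysql -B separates columns with a tab, so splitting on the tab
# (instead of any whitespace) keeps event names with spaces intact.
# Canned sample lines stand in for real /tmp/waits.txt content here.
{
  printf 'wait/synch/mutex/innodb/log_writer_mutex\t933255\n'
  printf 'statement/com/Close stmt\t7371000\n'
} | awk -F'\t' '{ name = $1; gsub(/\//, ";", name); printf("%s %d\n", name, $2); }' \
  > /tmp/folded_waits.txt
cat /tmp/folded_waits.txt
```

flamegraph.pl itself copes with a space inside a frame name, as it splits the name and the count on the last space of the line.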
The rest is for the <b>flamegraph.pl</b> script to handle:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/8.0$<b> cat /tmp/waits.txt | awk '{ printf("%s %d\n", $1, $2); }' | sed 's/\//;/g' | ~/git/FlameGraph/flamegraph.pl --inverted --colors io --title "Waits" --countname picoseconds --width 1000 > /tmp/wait.svg</b><br />openxs@ao756:~/dbs/8.0$<br /></span></span></p></blockquote><p>I've used a few options above, including <b>--inverted</b> and <b>--colors io</b> to end up with "icicles". The resulting graph with "frames" named like "log" highlighted is below:<br /></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEgUNjVOCzrKpBqtraqMsibueR3LGbZ2S9VGjrxLkdevGogDmiwyl47ZKpgpieirSomZmnOAmSvc01iH0VwqG9aMaXD29nyRp_FAVQB8PGorNHhr3XyBSgoYSAAY-XHFQAaeeFoJgJZisFYesxQbwVAuQG5czyQvCZs7oW10Bj0tGBquKij3gypBiHP0jQ=s990" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="164" data-original-width="990" height="106" src="https://blogger.googleusercontent.com/img/a/AVvXsEgUNjVOCzrKpBqtraqMsibueR3LGbZ2S9VGjrxLkdevGogDmiwyl47ZKpgpieirSomZmnOAmSvc01iH0VwqG9aMaXD29nyRp_FAVQB8PGorNHhr3XyBSgoYSAAY-XHFQAaeeFoJgJZisFYesxQbwVAuQG5czyQvCZs7oW10Bj0tGBquKij3gypBiHP0jQ=w640-h106" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Icicle graph for waits, showing the impact of "idle" and InnoDB logging-related waits<br /></td></tr></tbody></table><p></p><p>The graph above is not very funny looking. That's because the hierarchy of event names in performance_schema is not very deep. You can surely make similar highlights with proper SQL filtering, but <b>.svg</b> file is interactive and you can get a lot of insights after running just one query. 
You can surely add more fun by adding more data and using more complex SQL statements and text processing.<br /></p><p>Since MySQL started to support recursive CTEs, I have wanted to use them to navigate through the more complex transactions/statements/stages/waits hierarchy that is present in the <b>performance_schema</b>. So, today I tried to use a "proof of concept" recursive CTE to build a somewhat funnier flame graph and get myself ready to summarize time spent per statement with the option to drill down to related waits if needed. I ended up with this kind of query to show statements by type, by stage, and by wait related to a stage. No comments to begin with, just some lame SQL (somewhat inspired by the <b>sys.ps_trace_thread()</b> cursor <a href="https://github.com/mysql/mysql-server/blob/3e90d07c3578e4da39dc1bce73559bbdf655c28c/scripts/sys_schema/procedures/ps_trace_thread.sql#L19" target="_blank">here</a>):</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;"><b>with recursive ssw</b> as (<br />with sw as (<br /> select event_id, event_name, nesting_event_id, timer_wait from performance_schema.events_statements_history_long union<br /> select event_id, event_name, nesting_event_id, timer_wait from performance_schema.events_stages_history_long union<br /> select event_id, event_name, nesting_event_id, timer_wait from performance_schema.events_waits_history_long<br />)<br />select 0 as level, sw.* from sw<br /> where sw.event_name not like 'wait%'<br /> and sw.event_name not like 'stage%'<br /> and nesting_event_id is null<br />union all<br /><b>select ssw.level + 1 as level, sw.event_id, concat(ssw.event_name,concat('/',sw.event_name)) as event_name, sw.nesting_event_id, sw.timer_wait <br /></b> from ssw<br /> inner join sw on sw.nesting_event_id = ssw.event_id<br />)<br />select event_name, timer_wait from ssw <br />order by event_id;</span></span><br /></p></blockquote><p>My idea is to start with statements that are at the top
level (<b>nesting_event_id is NULL</b>) and then concatenate statement events with stages and waits in the hierarchy, to get a longer stack trace for the flame graph. With the following steps (note that I needed a longer column for the deeper hierarchy):</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/8.0$ <b>bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e"with recursive ssw as (<br />with sw as (<br /> select event_id, cast(event_name as char(1024)) as event_name, nesting_event_id, timer_wait from performance_schema.events_statements_history_long union<br /> select event_id, event_name, nesting_event_id, timer_wait from performance_schema.events_stages_history_long union<br /> select event_id, event_name, nesting_event_id, timer_wait from performance_schema.events_waits_history_long<br />)<br />select 0 as level, sw.* from sw<br /> where sw.event_name not like 'wait%'<br /> and sw.event_name not like 'stage%'<br /> and nesting_event_id is null<br />union all<br />select ssw.level + 1 as level, sw.event_id, concat(ssw.event_name,concat('/',sw.event_name)) as event_name, sw.nesting_event_id, sw.timer_wait<br /> from ssw<br /> inner join sw on sw.nesting_event_id = ssw.event_id<br />)<br />select event_name, timer_wait from ssw<br />order by event_id<br />" > /tmp/sqlstages2.txt</b><br />openxs@ao756:~/dbs/8.0$ <b>ls -l /tmp/sqlstages2.txt</b><br />-rw-rw-r-- 1 openxs openxs 1372235 січ 6 22:36 /tmp/sqlstages2.txt<br />openxs@ao756:~/dbs/8.0$ <b>tail -10 /tmp/sqlstages2.txt</b><br />statement/com/Close stmt/stage/sql/cleaning up 439000<br />statement/com/Close stmt 7371000<br />statement/com/Close stmt/stage/sql/starting 6554000<br />statement/com/Close stmt/stage/sql/cleaning up 433000<br />statement/com/Close stmt 6733000<br />statement/com/Close stmt/stage/sql/starting 5979000<br />statement/com/Close stmt/stage/sql/cleaning up 433000<br />statement/com/Quit 5343000<br
/>statement/com/Quit/stage/sql/starting 4552000<br />statement/com/Quit/stage/sql/cleaning up 409000<br /></span></span></p></blockquote><p>and then:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">cat /tmp/sqlstages2.txt | awk '{ printf("%s %d\n", $1, $2); }' | sed 's/\//;/g' | ~/git/FlameGraph/flamegraph.pl --colors io --title "Waits" --countname picoseconds > /tmp/stages2.svg</span></span></p></blockquote><p>I ended up with a graph like this:</p><p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEih3Ep44L1GrxfoGkbENAYYoJprc6AA8hShfsvGqwUZUBg45f3crre3BBgaSz5oTJrRfQE3WscGSSV78-a2Zgk2oU25DOGZx2h4uQUkNXp7kJPczzKuvALrC1Qc-EHBsRExSmb8aKn9gLZ4-vx4iFmB7v9elatQEFoyOBaKZW8S8_g870xbFtokBkUAzg=s1194" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="272" data-original-width="1194" height="146" src="https://blogger.googleusercontent.com/img/a/AVvXsEih3Ep44L1GrxfoGkbENAYYoJprc6AA8hShfsvGqwUZUBg45f3crre3BBgaSz5oTJrRfQE3WscGSSV78-a2Zgk2oU25DOGZx2h4uQUkNXp7kJPczzKuvALrC1Qc-EHBsRExSmb8aKn9gLZ4-vx4iFmB7v9elatQEFoyOBaKZW8S8_g870xbFtokBkUAzg=w640-h146" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Flame graph of time spent per statement type/stage/wait while running <b>sysbench</b> test<br /></td></tr></tbody></table></p><p>after running <b>sysbench</b> test:</p><p></p><p></p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/8.0$ <b>sysbench --table-size=1000000 --threads=4 --time=20 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_read_write.lua run</b><br />sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)<br /><br />Running the test with following 
options:<br />Number of threads: 4<br />Report intermediate results every 5 second(s)<br />Initializing random number generator from current time<br /><br /><br />Initializing worker threads...<br /><br />Threads started!<br /><br />[ 5s ] thds: 4 tps: 4.00 qps: 85.72 (r/w/o: 60.74/13.99/10.99) lat (ms,95%): 1771.29 err/s: 0.00 reconn/s: 0.00<br />[ 10s ] thds: 4 tps: 8.00 qps: 166.02 (r/w/o: 116.61/27.40/22.00) lat (ms,95%): 733.00 err/s: 0.00 reconn/s: 0.00<br />[ 15s ] thds: 4 tps: 10.20 qps: 207.40 (r/w/o: 144.60/33.20/29.60) lat (ms,95%): 634.66 err/s: 0.00 reconn/s: 0.00<br />[ 20s ] thds: 4 tps: 10.20 qps: 204.00 (r/w/o: 142.80/33.20/28.00) lat (ms,95%): 520.62 err/s: 0.00 reconn/s: 0.00<br />SQL statistics:<br /> queries performed:<br /> read: 2324<br /> write: 539<br /> other: 457<br /> total: 3320<br /> transactions: 166 (8.20 per sec.)<br /> queries: 3320 (164.05 per sec.)<br /> ignored errors: 0 (0.00 per sec.)<br /> reconnects: 0 (0.00 per sec.)<br /><br />General statistics:<br /> total time: 20.2331s<br /> total number of events: 166<br /><br />Latency (ms):<br /> min: 177.42<br /> avg: 484.72<br /> max: 2039.24<br /> 95th percentile: 960.30<br /> sum: 80463.77<br /><br />Threads fairness:<br /> events (avg/stddev): 41.5000/0.87<br /> execution time (avg/stddev): 20.1159/0.07 </span></span><br /></p></blockquote><p>Note on the graph above that the stages' time does NOT sum up to the <b>select</b> or <b>Prepare</b> total time spent - it means a large part of the code (where the time IS spent) is NOT instrumented as stages of execution. 
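By the way, part of that gap may simply be instrumentation that is switched off: as far as I understand, on a default setup many of the stage/% instruments (and some of the stages consumers) are disabled. A sketch to check, and optionally change, that on your own server (defaults depend on the version and configuration):

```sql
-- See how many stage instruments are enabled and timed; many are off
-- by default, which partly explains the missing time.
SELECT enabled, timed, COUNT(*)
  FROM performance_schema.setup_instruments
 WHERE name LIKE 'stage/%'
 GROUP BY enabled, timed;

-- Hypothetical follow-up: switch all stage instruments on for a test run.
UPDATE performance_schema.setup_instruments
   SET enabled = 'YES', timed = 'YES'
 WHERE name LIKE 'stage/%';
```

Even with everything enabled, code paths that are not wrapped in any stage instrument at all would still not show up, of course.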
There is still a lot of work to do on the Performance Schema, it seems...<br /></p><p>I could include the exact SQL statement text into the graph as well (maybe with a separate stored procedure, not just a lame combination of a single SQL statement and a simple text-processing command line), but that would not help much unless I use the width of multiple screens.<br /></p><p>To summarize, flame graphs are great for a quick overview of the contribution of individual "stages" to a cumulative measure collected in a hierarchy. Recursive CTEs are also cool. One day I'll proceed with further steps along this way, but for tonight the proof of concept and "invention" of "P_S Flame Graphs" for MySQL DBAs is enough. Stay tuned!<br /></p>Valerii Kravchukhttp://www.blogger.com/profile/13158916419325454260noreply@blogger.com0tag:blogger.com,1999:blog-3080615211468083537.post-4565689364261939562021-10-03T21:29:00.000+03:002021-10-03T21:29:59.108+03:00bpftrace as a code/function coverage tool for MariaDB server<p>I am going to speak about <a href="https://github.com/iovisor/bpftrace" target="_blank"><b>bpftrace</b></a> again soon, this time at <a href="https://mariadb.org/fest2021/tracing/" target="_blank">MariaDB Server Fest 2021</a>. Among other useful applications of <b>bpftrace</b> I mention using it as a <i>code coverage</i> (or, <a href="https://en.wikipedia.org/wiki/Code_coverage" target="_blank">more precisely</a>, <i>function coverage</i>) tool to check if some test executes all/specific functions in the MariaDB server source code. 
In this blog post I am going to present some related tests in more detail.</p><p>For testing I use a new enough MariaDB server and <b>bpftrace</b>, both built from GitHub source code on my Ubuntu 20.04 "home server":<br /></p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/maria10.6$ <b>bin/mariadbd --version</b><br />bin/mariadbd Ver 10.6.5-MariaDB for Linux on x86_64 (MariaDB Server)<br />openxs@ao756:~/dbs/maria10.6$ <b>bpftrace -V</b><br />bpftrace v0.13.0-120-gc671<br />openxs@ao756:~/dbs/maria10.6$ <b>cat /etc/lsb-release</b><br />DISTRIB_ID=Ubuntu<br />DISTRIB_RELEASE=20.04<br />DISTRIB_CODENAME=focal<br />DISTRIB_DESCRIPTION="Ubuntu 20.04.3 LTS"<br /></span></span></p></blockquote><p>The idea I tried immediately as a lame function coverage approach was to attach a probe to print the function name for every function in the <b>mariadbd</b> binary:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/maria10.6$ <b>sudo bpftrace -e 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:* { printf("%s\n", func); }'</b><br />ERROR: Can't attach to <b>34765</b> probes because it exceeds the current limit of 512 probes.<br />You can increase the limit through the BPFTRACE_MAX_PROBES environment variable, but BE CAREFUL since a high number of probes attached can cause your system to crash.<br />openxs@ao756:~/dbs/maria10.6$ <b>bpftrace --help 2>&1 | grep MAX_PROBES</b><br /> BPFTRACE_MAX_PROBES [default: <b>512</b>] max number of probes</span></span> <br /></p></blockquote><p>This way I found out that <b>bpftrace</b> found 34765 different functions in this binary, but by default it can attach at most 512 probes per invocation.
So, let me increase <b>BPFTRACE_MAX_PROBES</b>, as suggested (I don't mind a system crash, it's for testing anyway):</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/maria10.6$ <b>sudo BPFTRACE_MAX_PROBES=35000 bpftrace -e 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:* { printf("%s\n", func); }'</b><br />Attaching 34765 probes...<br /><b>bpf: Failed to load program: Too many open files</b><br />processed 17 insns (limit 1000000) max_states_per_insn 0 total_states 1 peak_states 1 mark_read 0<br /><br />bpf: Failed to load program: Too many open files<br />processed 17 insns (limit 1000000) max_states_per_insn 0 total_states 1 peak_states 1 mark_read 0<br /><br />ERROR: Error loading program: uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:thd_client_ip (try -v)<br /><b>Segmentation fault<br /></b>openxs@ao756:~/dbs/maria10.6$ <b>ulimit -n</b><br />1024</span></span><br /></p></blockquote><p>This time I crashed <b>bpftrace</b> itself, and before that hit the limit on the number of open files, which is quite small by default. What if I try to increase that limit? Let me try:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/maria10.6$ <b>ulimit -n 40000</b><br />openxs@ao756:~/dbs/maria10.6$ <b>ulimit -n</b><br />40000<br />openxs@ao756:~/dbs/maria10.6$ <b>sudo BPFTRACE_MAX_PROBES=35000 bpftrace -e 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:* { printf("%s\n", func); }'</b><br />Attaching 34765 probes...<br />ERROR: Offset outside the function bounds ('register_tm_clones' size is 0)<br />Segmentation fault</span></span><br /></p></blockquote><p>Still a crash, but with a different error message before it. Looks like some functions are not traceable for some reason I have yet to understand. 
The segmentation fault may still be related to the number of probes to create, but we can get the same error with many fewer probes (and for a different function):</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/maria10.6$ <b>sudo BPFTRACE_MAX_PROBES=40000 bpftrace -e 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:*do* { printf("%s\n", func); }'<br /></b>Attaching 1076 probes...<br />ERROR: Offset outside the function bounds ('__do_global_dtors_aux' size is 0)<br />openxs@ao756:~/dbs/maria10.6$</span></span><br /></p></blockquote><p>The problematic function above is actually added by the compiler to call the destructors of static objects (if we trust <a href="https://stackoverflow.com/questions/6477494/do-global-dtors-aux-and-do-global-ctors-aux" target="_blank">this source</a>); it's not even from MariaDB server code itself. The <b>register_tm_clones</b> function we hit before is even more mysterious (read <a href="https://oneraynyday.github.io/dev/2020/05/03/Analyzing-The-Simplest-C++-Program/" target="_blank">this</a>). It is somehow related to the transactional memory model in C++. Looks like one day I'll have to just create a list of all traceable and important functions in the MariaDB server source code and add an explicit probe for each. 
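Such a list could probably be built from the binary itself with standard tools. Here is a sketch (the exclusion list and the usage of <b>nm</b> against my binary path are my assumptions, not a verified recipe):

```shell
# Keep only symbols worth probing: drop compiler-generated stubs that bpftrace
# rejects with "Offset outside the function bounds (... size is 0)".
filter_traceable() {
    grep -v -E '^(register_tm_clones|deregister_tm_clones|__do_global_dtors_aux|frame_dummy|_init|_fini)$'
}

# Intended usage against the server binary from this post (not run here):
#   nm --defined-only /home/openxs/dbs/maria10.6/bin/mariadbd \
#     | awk '$2 ~ /^[tT]$/ {print $3}' | filter_traceable > /tmp/probe_list.txt
```

The resulting list could then be turned into explicit probes, batch by batch.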
Lame attempts to trace everything do not work.</p><p>But we surely can trace a lot of known useful functions, as in the following example, where I add a probe for each function with a name matching 'ha_*' and try to count how many times each was called during some <b>sysbench</b> test:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/maria10.6$ <b>sudo BPFTRACE_MAX_PROBES=1000 bpftrace -e 'uprobe:/home/openxs/dbs/maria10.6/bin/mariadbd:ha_* { @cnt[func] += 1; }' > /tmp/funcs2.txt</b><br />[sudo] password for openxs:<br /><b>^C</b>openxs@ao756:~/dbs/maria10.6$ </span></span><br /></p></blockquote><p>The test that I executed in another shell was the following (the numbers do not really matter):</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/maria10.6$ <b>sysbench oltp_read_write --mysql-socket=/tmp/mariadb.sock --mysql-user=openxs --tables=2 --table-size=1000 --threads=2 run --time=1600 --report-interval=10</b><br />sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)<br /><br />Running the test with following options:<br />Number of threads: 2<br />Report intermediate results every 10 second(s)<br />Initializing random number generator from current time<br /><br /><br />Initializing worker threads...<br /><br />Threads started!<br /><br />[ 10s ] thds: 2 tps: 74.96 qps: 1502.59 (r/w/o: 1052.20/300.26/150.13) lat (ms,95%): 44.17 err/s: 0.00 reconn/s: 0.00<br />[ 20s ] thds: 2 tps: 70.31 qps: 1404.51 (r/w/o: 983.07/280.82/140.61) lat (ms,95%): 36.24 err/s: 0.00 reconn/s: 0.00<br />[ 30s ] thds: 2 tps: 82.50 qps: 1650.50 (r/w/o: 1155.30/330.20/165.00) lat (ms,95%): 40.37 err/s: 0.00 reconn/s: 0.00<br />[ 40s ] thds: 2 tps: 82.60 qps: 1652.60 (r/w/o: 1156.80/330.60/165.20) lat (ms,95%): 33.12 err/s: 0.00 reconn/s: 0.00<br />[ 50s ] thds: 2 tps: 50.20 qps: 1004.48 (r/w/o: 703.29/200.80/100.40) lat (ms,95%): 89.16 err/s: 0.00 reconn/s: 0.00<br />[ 60s ] thds: 2 tps: 74.80 qps: 
1494.61 (r/w/o: 1045.91/299.20/149.50) lat (ms,95%): 44.98 err/s: 0.00 reconn/s: 0.00<br />[ 70s ] thds: 2 tps: 82.70 qps: 1654.31 (r/w/o: 1158.21/330.60/165.50) lat (ms,95%): 33.72 err/s: 0.00 reconn/s: 0.00<br />[ 80s ] thds: 2 tps: 81.60 qps: 1632.61 (r/w/o: 1142.80/326.60/163.20) lat (ms,95%): 44.17 err/s: 0.00 reconn/s: 0.00<br /><b>^C</b><br />openxs@ao756:~/dbs/maria10.6$</span></span><br /></p></blockquote><p>I let it work for a minute or so, and now I can check what <b>ha_*</b> functions of MariaDB server were called, and how many times:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/maria10.6$ <b>cat /tmp/funcs2.txt</b><br />Attaching 953 probes...<br /><br /><br />@cnt[ha_rollback_trans(THD*, bool)]: 2<br />@cnt[ha_close_connection(THD*)]: 2<br />@cnt[ha_heap::open(char const*, int, unsigned int)]: 6296<br />@cnt[ha_heap::close()]: 6298<br />@cnt[ha_heap::update_key_stats()]: 6320<br />@cnt[ha_heap::drop_table(char const*)]: 6353<br />@cnt[ha_lock_engine(THD*, handlerton const*)]: 6524<br />@cnt[ha_innobase::delete_row(unsigned char const*)]: 6551<br />@cnt[ha_innobase::rnd_init(bool)]: 6602<br />@cnt[ha_innobase::innobase_lock_autoinc()]: 6604<br />@cnt[ha_innobase::write_row(unsigned char const*)]: 6604<br />@cnt[ha_heap::table_flags() const]: 11746<br />@cnt[ha_heap::info(unsigned int)]: 12606<br />@cnt[ha_check_if_updates_are_ignored(THD*, handlerton*, char const*)]: 13094<br />@cnt[ha_innobase::update_row(unsigned char const*, unsigned char const*)]: 13122<br />@cnt[ha_heap::~ha_heap()]: 18819<br />@cnt[ha_heap::rnd_init(bool)]: 18865<br />@cnt[ha_innobase::unlock_row()]: 26252<br />@cnt[ha_innobase::records_in_range(unsigned int, st_key_range const*, st_key_range const*, st_page_range*)]: 26320<br />@cnt[ha_innobase::try_semi_consistent_read(bool)]: 26413<br />@cnt[ha_innobase::referenced_by_foreign_key()]: 26432<br />@cnt[ha_innobase::was_semi_consistent_read()]: 33022<br 
/>@cnt[ha_innobase::multi_range_read_explain_info(unsigned int, char*, unsigned long)]: 46221<br />@cnt[ha_innobase::read_time(unsigned int, unsigned int, unsigned long long)]: 46228<br />@cnt[ha_innobase::multi_range_read_info_const(unsigned int, st_range_seq_if*, void*, unsigned int, unsigned int*, unsigned int*, Cost_estimate*)]: 46230<br />@cnt[ha_innobase::multi_range_read_init(st_range_seq_if*, void*, unsigned int, unsigned int, st_handler_buffer*)]: 46232<br />@cnt[ha_heap::extra(ha_extra_function)]: 50393<br />@cnt[ha_innobase::estimate_rows_upper_bound()]: 52828<br />@cnt[ha_innobase::scan_time()]: 72638<br />@cnt[ha_innobase::lock_count() const]: 107072<br />@cnt[ha_innobase::index_init(unsigned int, bool)]: 111372<br />@cnt[ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)]: 111432<br />@cnt[ha_innobase::info(unsigned int)]: 112253<br />@cnt[ha_innobase::info_low(unsigned int, bool)]: 112271<br />@cnt[ha_innobase::store_lock(THD*, st_thr_lock_data**, thr_lock_type)]: 118088<br />@cnt[ha_innobase::change_active_index(unsigned int)]: 118700<br />@cnt[ha_innobase::reset()]: 118934<br />@cnt[ha_commit_trans(THD*, bool)]: 123967<br />@cnt[ha_commit_one_phase(THD*, bool)]: 124096<br />@cnt[ha_check_and_coalesce_trx_read_only(THD*, Ha_trx_info*, bool)]: 124363<br />@cnt[ha_innobase::column_bitmaps_signal()]: 191279<br />@cnt[ha_innobase::table_flags() const]: 213982<br />@cnt[ha_innobase::build_template(bool)]: 230402<br />@cnt[ha_innobase::external_lock(THD*, int)]: 236446<br />@cnt[ha_innobase::rnd_end()]: 237608<br />@cnt[ha_innobase::extra(ha_extra_function)]: 415923<br />@cnt[ha_innobase::index_flags(unsigned int, unsigned int, bool) const]: 451710<br />@cnt[ha_heap::write_row(unsigned char const*)]: 628369<br />@cnt[ha_heap::position(unsigned char const*)]: 629184<br />@cnt[ha_heap::rnd_pos(unsigned char*, unsigned char*)]: 629341<br />@cnt[ha_heap::rnd_next(unsigned char*)]: 635061<br 
/>@cnt[ha_innobase::position(unsigned char const*)]: 660590<br />@cnt[ha_innobase::index_next(unsigned char*)]: 2621915<br />@cnt[ha_innobase::multi_range_read_next(void**)]: 2708026<br />openxs@ao756:~/dbs/maria10.6$</span></span><br /></p></blockquote><p>I think this is really awesome! I can trace 900+ function calls with a <b>bpftrace</b> one-liner and get the <b>@cnt</b> associative array printed at the end, with a (demangled!) function name as the key and the number of times the function was called while the test ran as the value, ordered by increasing value, automatically!</p><p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2tYap2dm2vB4Lre0x_tLM7R2v8sg4-2kMEHSGynnoRypJm09fKt7AXwOpMRuL7ys-bS0cfOl1rlKkt6zdwU1rJfnolQloY5liAIYFz2r0WiBX1SwoVPt3szjcLAw4KZCTLT9Fv024aJsy/s663/bpftrace_help.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="415" data-original-width="663" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj2tYap2dm2vB4Lre0x_tLM7R2v8sg4-2kMEHSGynnoRypJm09fKt7AXwOpMRuL7ys-bS0cfOl1rlKkt6zdwU1rJfnolQloY5liAIYFz2r0WiBX1SwoVPt3szjcLAw4KZCTLT9Fv024aJsy/w640-h400/bpftrace_help.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><b>bpftrace</b> may help MariaDB DBAs and developers in many cases!<br /></td></tr></tbody></table></p><p>This is already useful to find out whether any of the traced functions was called at all. I can run several <b>bpftrace</b> command lines concurrently, each tracing a batch of different function calls, and then summarize their outputs; I can measure time spent in each function, get stack traces if needed, and so on. 
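Summarizing the outputs of several such concurrent runs could be done with basic text processing. A sketch (the <b>@cnt[...]: N</b> format is as printed above; the file names are hypothetical):

```shell
# Merge "@cnt[<function>]: <count>" lines from several bpftrace output files
# into one per-function total, printed as "<total> <function>" sorted by
# descending call count.
merge_counts() {
    awk -F': ' '/^@cnt\[/ {
        name = $1
        sub(/^@cnt\[/, "", name)
        sub(/\]$/, "", name)
        total[name] += $2
    }
    END { for (f in total) printf "%d %s\n", total[f], f }' "$@" | sort -rn
}
# Example: merge_counts /tmp/funcs1.txt /tmp/funcs2.txt | head
```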
I hope you already agree that the idea of using <b>bpftrace</b> as a function coverage tool may work even for software as complex as MariaDB server.</p><p>I will demonstrate this during my upcoming talk. See you <a href="https://mariadb.org/fest2021/tracing/" target="_blank">there</a> on Wednesday, October 6!<br /></p>Valerii Kravchukhttp://www.blogger.com/profile/13158916419325454260noreply@blogger.com0tag:blogger.com,1999:blog-3080615211468083537.post-8396921964049260542021-05-12T14:51:00.000+03:002021-05-12T14:51:13.581+03:00Dynamic Tracing of Memory Allocations in MySQL With bcc Tools<p>Last year I started my lame attempts to apply different Linux dynamic tracing tools and approaches to frequent events like memory allocations. In <a href="http://mysqlentomologist.blogspot.com/2020/05/dynamic-tracing-of-memory-allocations.html" target="_blank">this blog post</a> I already described how to use <b>perf</b> to add a user probe to trace <b>malloc()</b> calls with the number of bytes requested. Unfortunately this approach is not practical for production use for more than several seconds.</p><p>Recently I've been playing with <b>bpftrace</b> a lot and so far have ended up with an easy way to trace calls and call stacks, and was on my way to tracing only outstanding allocations and caring only about memory areas not yet freed. 
If you are interested, the primitive approach may look like this:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc33 ~]$ <b>cat malloc.bt</b><br />#!/usr/bin/env bpftrace<br /><br />BEGIN<br />{<br /> printf("Tracing MariaDB's malloc() calls, Ctrl-C to stop\n");<br />}<br /><br />interval:s:$1 { exit(); }<br /><br />uprobe:/lib64/libc.so.6:malloc<br />/ comm == "mariadbd" /<br />{<br /> @size[tid] += arg0;<br /><b>/* printf("Allocating %d bytes in thread %u...\n", arg0, tid); */<br /></b>}<br /><br />uretprobe:/lib64/libc.so.6:malloc<br />/ comm == "mariadbd" && @size[tid] > 0 /<br />{<br /> @memory[tid,retval] = @size[tid];<br /> @stack[ustack(perf)] += @size[tid];<br /><br /> print(@stack);<br /> clear(@stack);<br /><br /> delete(@size[tid]);<br />}<br /><br />uprobe:/lib64/libc.so.6:free<br />/ comm == "mariadbd" /<br />{<br /> delete(@memory[tid, arg0]);<br /><b>/* printf("Freeing %p...\n", arg0); */<br /></b>}<br /><br />END<br />{<br /> clear(@size);<br /> clear(@memory);<br /> clear(@stack);<br />}<br />[openxs@fc33 ~]$</span></span><br /></p></blockquote><p>But while it works (both for tracing that is commented out above and for summarizing the non-freed allocations) and produced some outputs as expected:<br /></p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc33 ~]$ <b>time sudo ./malloc.bt 1 2>/dev/null >/tmp/malloc_raw.txt<br /></b><br /><b>real 8m47.963s<br /></b>user 2m53.513s<br /><b>sys 5m50.685s<br /></b><br />[openxs@fc33 maria10.5]$ <b>ls -l /tmp/malloc_raw.txt</b><br />-rw-r--r--. 
1 openxs openxs 461675 Apr 22 10:13 /tmp/malloc_raw.txt<br />[openxs@fc33 ~]$ <b>tail /tmp/malloc_raw.txt</b><br /> 558c68a27d12 row_purge_step(que_thr_t*)+770 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 558c689e8256 que_run_threads(que_thr_t*)+2166 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 558c68a47ab7 purge_worker_callback(void*)+375 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 558c68b88989 tpool::task_group::execute(tpool::task*)+137 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 558c68b87bdf tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 7f599ae455f4 0x7f599ae455f4 ([unknown])<br /> 558c69fc26b0 0x558c69fc26b0 ([unknown])<br /> 558c68b87cb0 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> dde907894810c083 0xdde907894810c083 ([unknown])<br />]: 33<br /><br />...<br /><br />@stack[<br /> 558c689bd25c mem_heap_create_block_func(mem_block_info_t*, unsigned long, unsigned long)+108 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 558c68a3f2ae row_vers_old_has_index_entry(bool, unsigned char const*, mtr_t*, dict_index_t*, dtuple_t const*, unsigned long, unsigned long)+126 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 558c68a24c60 row_purge_poss_sec(purge_node_t*, dict_index_t*, dtuple_t const*, btr_pcur_t*, mtr_t*, bool)+512 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 558c68a2627b row_purge_remove_sec_if_poss_leaf(purge_node_t*, dict_index_t*, dtuple_t const*)+971 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 558c68a272c3 row_purge_record_func(purge_node_t*, unsigned char*, que_thr_t const*, bool)+1459 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 558c68a27d12 row_purge_step(que_thr_t*)+770 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 558c689e8256 que_run_threads(que_thr_t*)+2166 
(/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 558c68a47ab7 purge_worker_callback(void*)+375 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 558c68b88989 tpool::task_group::execute(tpool::task*)+137 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 558c68b87bdf tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 7f599ae455f4 0x7f599ae455f4 ([unknown])<br /> 558c69fc26b0 0x558c69fc26b0 ([unknown])<br /> 558c68b87cb0 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> dde907894810c083 0xdde907894810c083 ([unknown])<br />]: 1152</span></span> <br /></p></blockquote><p>it took 8 minutes(!) to deal with data collected for over 1 second of tracing under high load, and caused notable load:</p><blockquote><p><span style="font-size: x-small;"><span style="font-family: courier;">top - 09:59:04 up 1:16, 3 users, load average: 1.60, <b>7.14, 5.31</b><br />Tasks: 228 total, 2 running, 226 sleeping, 0 stopped, 0 zombie<br />%Cpu(s): 10.8 us, <b>17.5 sy</b>, 0.0 ni, 71.6 id, 0.0 wa, 0.2 hi, 0.0 si, 0.0 st<br />MiB Mem : 7916.5 total, 1759.6 free, 2527.0 used, 3629.9 buff/cache<br />MiB Swap: 3958.0 total, 3958.0 free, 0.0 used. 
4910.8 avail Mem<br /><br /> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND<br /> 14813 root 20 0 307872 111852 75516 R <b>99.7</b> 1.4 7:57.17 bpftrace<br />...</span></span> <br /></p></blockquote><p>and a drop in performance for the system in the meantime:</p><blockquote><p><span style="font-size: x-small;"><span style="font-family: courier;">[openxs@fc33 maria10.5]$ <b>sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --threads=32 --report-interval=10 --time=300 run</b><br />sysbench 1.1.0-174f3aa (using bundled LuaJIT 2.1.0-beta2)<br /><br />Running the test with following options:<br />Number of threads: 32<br />Report intermediate results every 10 second(s)<br />Initializing random number generator from current time<br /><br /><br />Initializing worker threads...<br /><br />Threads started!<br /><br />[ 10s ] thds: 32 tps: 759.62 qps: 15246.37 (r/w/o: 10676.06/3047.88/1522.44) lat (ms,95%): 80.03 err/s: 0.00 reconn/s: 0.00<br />...<br /><b>[ 70s ] thds: 32 tps: 708.20 qps: 14174.96 (r/w/o: 9920.64/2837.91/1416.41) lat (ms,95%): 74.46 err/s: 0.00 reconn/s: 0.00<br />[ 80s ] thds: 32 tps: 354.60 qps: 7080.28 (r/w/o: 4964.46/1406.62/709.21) lat (ms,95%): 134.90 err/s: 0.00 reconn/s: 0.00<br />[ 90s ] thds: 32 tps: 332.91 qps: 6661.34 (r/w/o: 4657.50/1338.03/665.81) lat (ms,95%): 132.49 err/s: 0.00 reconn/s: 0.00</b><br />...<br /></span></span><br /></p></blockquote><p>So I present my lame tracing approach here so that no one tries to do the same - monitoring that causes a 2x drop in TPS for minutes is hardly acceptable. I obviously made some mistake that I have yet to identify. 
Probably resolving stack traces and summarizing them in kernel context was too much for the system, and I cannot do better in <b>bpftrace</b> itself, unless I use it only to produce raw traces.</p><p>The approach above is also too primitive, as I traced only<b> malloc()</b>, while theoretically <b>realloc()</b> and <b>calloc()</b> calls may be used as well. So, in the hope of seeing how this task is approached by really experienced people, I checked what <a href="https://github.com/iovisor/bcc" target="_blank">bcc tools</a> provide for tracing memory allocations.</p><p>The <a href="https://github.com/iovisor/bcc/blob/master/tools/memleak.py" target="_blank"><b>memleak.py</b></a> tool there is quite advanced. It allows both to trace individual calls and to output periodic summaries of outstanding allocations:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/git/BPF-tools$ <b>/usr/share/bcc/tools/memleak -h</b><br />usage: memleak [-h] [-p PID] [-t] [-a] [-o OLDER] [-c COMMAND]<br /> [--combined-only] [--wa-missing-free] [-s SAMPLE_RATE] [-T TOP]<br /> [-z MIN_SIZE] [-Z MAX_SIZE] [-O OBJ] [--percpu]<br /> [interval] [count]<br /><br />Trace outstanding memory allocations that weren't freed.<br />Supports both user-mode allocations made with libc functions and kernel-mode<br />allocations made with kmalloc/kmem_cache_alloc/get_free_pages and corresponding<br />memory release functions.<br /><br />positional arguments:<br /> interval interval in seconds to print outstanding allocations<br /> count number of times to print the report before exiting<br /><br />optional arguments:<br /> -h, --help show this help message and exit<br /> -p PID, --pid PID the PID to trace; if not specified, trace kernel<br /> allocs<br /> -t, --trace print trace messages for each alloc/free call<br /> -a, --show-allocs show allocation addresses and sizes as well as call<br /> stacks<br /> -o OLDER, --older OLDER<br /> prune allocations younger than this 
age in<br /> milliseconds<br /> -c COMMAND, --command COMMAND<br /> execute and trace the specified command<br /> --combined-only show combined allocation statistics only<br /> --wa-missing-free Workaround to alleviate misjudgments when free is<br /> missing<br /> -s SAMPLE_RATE, --sample-rate SAMPLE_RATE<br /> sample every N-th allocation to decrease the overhead<br /> -T TOP, --top TOP display only this many top allocating stacks (by size)<br /> -z MIN_SIZE, --min-size MIN_SIZE<br /> capture only allocations larger than this size<br /> -Z MAX_SIZE, --max-size MAX_SIZE<br /> capture only allocations smaller than this size<br /> -O OBJ, --obj OBJ attach to allocator functions in the specified object<br /> --percpu trace percpu allocations<br /><br />EXAMPLES:<br /><br />./memleak -p $(pidof allocs)<br /> Trace allocations and display a summary of "leaked" (outstanding)<br /> allocations every 5 seconds<br />./memleak -p $(pidof allocs) -t<br /> Trace allocations and display each individual allocator function call<br />./memleak -ap $(pidof allocs) 10<br /> Trace allocations and display allocated addresses, sizes, and stacks<br /> every 10 seconds for outstanding allocations<br />./memleak -c "./allocs"<br /> Run the specified command and trace its allocations<br />./memleak<br /> Trace allocations in kernel mode and display a summary of outstanding<br /> allocations every 5 seconds<br />./memleak -o 60000<br /> Trace allocations in kernel mode and display a summary of outstanding<br /> allocations that are at least one minute (60 seconds) old<br />./memleak -s 5<br /> Trace roughly every 5th allocation, to reduce overhead<br />openxs@ao756:~/git/BPF-tools$</span></span><br /></p></blockquote><p>I've applied it to MySQL 8.0.25 recently built on my Ubuntu 20.04 and running <b>sysbench</b> <b>oltp_read_write</b> load test:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/8.0$ <b>sysbench oltp_read_write 
--db-driver=mysql --tables=5 --table-size=100000 --mysql-user=root --mysql-socket=/tmp/mysql8.sock --mysql-db=sbtest --time=300 --report-interval=10 --threads=4 run</b><br />sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)<br /><br />Running the test with following options:<br />Number of threads: 4<br />Report intermediate results every 10 second(s)<br />Initializing random number generator from current time<br /><br /><br />Initializing worker threads...<br /><br />Threads started!<br /><br />[ 10s ] thds: 4 tps: 18.49 qps: 377.44 (r/w/o: 264.49/75.57/37.38) lat (ms,95%): 344.08 err/s: 0.00 reconn/s: 0.00<br />[ 20s ] thds: 4 tps: 14.80 qps: 296.01 (r/w/o: 207.21/59.20/29.60) lat (ms,95%): 530.08 err/s: 0.00 reconn/s: 0.00<br />[ 30s ] thds: 4 tps: 25.99 qps: 519.89 (r/w/o: 363.92/103.98/51.99) lat (ms,95%): 314.45 err/s: 0.00 reconn/s: 0.00<br />[ 40s ] thds: 4 tps: 25.30 qps: 506.04 (r/w/o: 354.23/101.21/50.60) lat (ms,95%): 292.60 err/s: 0.00 reconn/s: 0.00<br /><b>[ 50s ] thds: 4 tps: 21.90 qps: 437.92 (r/w/o: 306.54/87.58/43.79) lat (ms,95%): 356.70 err/s: 0.00 reconn/s: 0.00<br />[ 60s ] thds: 4 tps: 23.51 qps: 470.05 (r/w/o: 329.10/93.93/47.01) lat (ms,95%): 308.84 err/s: 0.00 reconn/s: 0.00<br />[ 70s ] thds: 4 tps: 20.29 qps: 405.99 (r/w/o: 284.12/81.28/40.59) lat (ms,95%): 450.77 err/s: 0.00 reconn/s: 0.00<br />[ 80s ] thds: 4 tps: 20.51 qps: 408.20 (r/w/o: 286.17/81.02/41.01) lat (ms,95%): 390.30 err/s: 0.00 reconn/s: 0.00</b><br />[ 90s ] thds: 4 tps: 22.80 qps: 457.80 (r/w/o: 320.03/92.18/45.59) lat (ms,95%): 383.33 err/s: 0.00 reconn/s: 0.00<br /><b>^C</b></span></span><br /></p></blockquote><p>Ignore the absolute numbers, but note that (unlike with my <b>bpftrace</b> program) there was no very significant drop in QPS during the 20+ seconds while I was collecting stacks for outstanding allocations in another shell:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/git/BPF-tools$ <b>time sudo 
/usr/share/bcc/tools/memleak -p $(pidof mysqld) --top 100 >/tmp/memleak.out</b><br />[sudo] password for openxs:<br />^C<br />real 0m21,142s<br />user 0m0,998s<br />sys 0m0,466s<br /></span></span></p></blockquote><p>Now, what was collected? Let's check the top 40 rows:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/git/BPF-tools$ <b>head -40 /tmp/memleak.out</b><br />Attaching to pid 3416, Ctrl+C to quit.<br />[13:51:26] Top 100 stacks with outstanding allocations:<br /><b> 1536 bytes in 2 allocations from stack</b><br /> mem_heap_create_block_func(mem_block_info_t*, unsigned long, unsigned long)+0xc8 [mysqld]<br /> mem_heap_add_block(mem_block_info_t*, unsigned long)+0x53 [mysqld]<br /> row_vers_build_for_consistent_read(unsigned char const*, mtr_t*, dict_index_t*, unsigned long**, ReadView*, mem_block_info_t**, mem_block_info_t*, unsigned char**, dtuple_t const**, lob::undo_vers_t*)+0x78c [mysqld]<br /> row_search_mvcc(unsigned char*, page_cur_mode_t, row_prebuilt_t*, unsigned long, unsigned long)+0x2b78 [mysqld]<br /> ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)+0x32e [mysqld]<br /> handler::ha_index_read_map(unsigned char*, unsigned char const*, unsigned long, ha_rkey_function)+0x389 [mysqld]<br /> handler::read_range_first(key_range const*, key_range const*, bool, bool)+0x6e [mysqld]<br /> ha_innobase::read_range_first(key_range const*, key_range const*, bool, bool)+0x27 [mysqld]<br /> handler::multi_range_read_next(char**)+0x135 [mysqld]<br /> handler::ha_multi_range_read_next(char**)+0x2c [mysqld]<br /> QUICK_RANGE_SELECT::get_next()+0x5a [mysqld]<br /> IndexRangeScanIterator::Read()+0x3f [mysqld]<br /> FilterIterator::Read()+0x18 [mysqld]<br /> MaterializeIterator::MaterializeQueryBlock(MaterializeIterator::QueryBlock const&, unsigned long long*)+0x133 [mysqld]<br /> MaterializeIterator::Init()+0x319 [mysqld]<br /> filesort(THD*, Filesort*, RowIterator*, 
unsigned long, unsigned long long, Filesort_info*, Sort_result*, unsigned long long*)+0x39d [mysqld]<br /> SortingIterator::DoSort()+0x72 [mysqld]<br /> SortingIterator::Init()+0x34 [mysqld]<br /> Query_expression::ExecuteIteratorQuery(THD*)+0x2ea [mysqld]<br /> Query_expression::execute(THD*)+0x33 [mysqld]<br /> Sql_cmd_dml::execute_inner(THD*)+0x30b [mysqld]<br /> Sql_cmd_dml::execute(THD*)+0x545 [mysqld]<br /> mysql_execute_command(THD*, bool)+0x9f0 [mysqld]<br /> Prepared_statement::execute(String*, bool)+0x8b0 [mysqld]<br /> Prepared_statement::execute_loop(String*, bool)+0x117 [mysqld]<br /> mysqld_stmt_execute(THD*, Prepared_statement*, bool, unsigned long, PS_PARAM*)+0x1b1 [mysqld]<br /> dispatch_command(THD*, COM_DATA const*, enum_server_command)+0x175d [mysqld]<br /> do_command(THD*)+0x1a4 [mysqld]<br /> handle_connection+0x258 [mysqld]<br /> pfs_spawn_thread+0x162 [mysqld]<br /> start_thread+0xd9 [libpthread-2.31.so]<br /><b> 2304 bytes in 3 allocations from stack<br /></b> mem_heap_create_block_func(mem_block_info_t*, unsigned long, unsigned long)+0xc8 [mysqld]<br /> mem_heap_add_block(mem_block_info_t*, unsigned long)+0x53 [mysqld]<br /> row_vers_build_for_consistent_read(unsigned char const*, mtr_t*, dict_index_t*, unsigned long**, ReadView*, mem_block_info_t**, mem_block_info_t*, unsigned char**, dtuple_t const**, lob::undo_vers_t*)+0x78c [mysqld]<br /> row_search_mvcc(unsigned char*, page_cur_mode_t, row_prebuilt_t*, unsigned long, unsigned long)+0x2b78 [mysqld]<br /> ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)+0x32e [mysqld]<br />openxs@ao756:~/git/BPF-tools$</span></span><br /></p></blockquote><p>This information may be really useful for further analysis. 
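For a quick ranking of the heaviest stacks, even trivial text processing of the report helps. A sketch (the "N bytes in M allocations from stack" header format is assumed from the listing above; the file name is just an example):

```shell
# Extract the per-stack summary headers from a memleak report and rank them
# by outstanding bytes, largest first.
top_allocs() {
    grep -E '^[[:space:]]*[0-9]+ bytes in [0-9]+ allocations' "$1" | sort -rn
}
# Example: top_allocs /tmp/memleak.out | head -5
```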
But it is not easy to collapse/summarize its output for further <a href="http://www.brendangregg.com/FlameGraphs/memoryflamegraphs.html" target="_blank">memory flame graphs</a> creation, while I'd really like to use them to quickly see where most of the memory is allocated.</p><p>One day I'll get back to <b>memleak</b> and create some script to fold its output. But for this blog post I was looking for quick and dirty ways and, according to the page linked above, for that I had either to use the general purpose <b><a href="https://github.com/iovisor/bcc/blob/master/tools/stackcount.py" target="_blank">stackcount.py</a></b> (that just counts the number of occurrences per unique stack), or <a href="http://www.brendangregg.com/overview.html" target="_blank"><b>Brendan Gregg</b></a>'s unsupported <b><a href="https://github.com/brendangregg/BPF-tools/tree/master/old/2017-12-23">mallocstacks</a></b>, which is similar to <b>stackcount</b> but sums the <b>size_t</b> argument to <b>malloc()</b> as the metric. I've used the latter and had to make a small fix to make it run on my Ubuntu 20.04 netbook:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/git/BPF-tools$ <b>git diff old/2017-12-23/mallocstacks.py</b><br />diff --git a/old/2017-12-23/mallocstacks.py b/old/2017-12-23/mallocstacks.py<br />index 8891e82..92271ed 100755<br />--- a/old/2017-12-23/mallocstacks.py<br />+++ b/old/2017-12-23/mallocstacks.py<br />@@ -96,7 +96,7 @@ struct key_t {<br /> char name[TASK_COMM_LEN];<br /> };<br /> BPF_HASH(bytes, struct key_t);<br />-BPF_STACK_TRACE(stack_traces, STACK_STORAGE_SIZE)<br /><b>+BPF_STACK_TRACE(stack_traces, STACK_STORAGE_SIZE);</b><br /><br /> int trace_malloc(struct pt_regs *ctx, size_t size) {<br /> u32 pid = bpf_get_current_pid_tgid();<br />openxs@ao756:~/git/BPF-tools$</span></span><br /></p></blockquote><p>Then I've used it as follows, running for some 10+ seconds against the same MySQL 8.0.25 under the same load (and with similar 
acceptable impact as with <b>memleak</b>):</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/git/BPF-tools/old/2017-12-23$ <b>sudo ~/git/BPF-tools/old/2017-12-23/mallocstacks.py -p $(pidof mysqld) -f >/tmp/alloc.out<br />^C<br /></b>openxs@ao756:~/git/BPF-tools/old/2017-12-23$ <b>ls -l /tmp/alloc.out</b><br />-rw-rw-r-- 1 openxs openxs 859059 тра 12 10:28 /tmp/alloc.out<br />openxs@ao756:~/git/BPF-tools$ <b>head -2 /tmp/alloc.out</b><br />mysqld;[unknown];std::thread::_State_impl<std::thread::_Invoker<std::tuple<Runnable, void (*)()> > >::~_State_impl();dict_stats_thread();[unknown];std::thread::_State_impl<std::thread::_Invoker<std::tuple<Runnable, void (*)()> > >::_M_run();dict_stats_thread();dict_stats_update(dict_table_t*, dict_stats_upd_option_t);dict_stats_save(dict_table_t*, index_id_t const*);dict_stats_exec_sql(pars_info_t*, char const*, trx_t*);que_eval_sql(pars_info_t*, char const*, unsigned long, trx_t*);que_run_threads(que_thr_t*);row_sel_step(que_thr_t*);row_sel(sel_node_t*, que_thr_t*);eval_cmp(func_node_t*);eval_node_alloc_val_buf(void*, unsigned long);__libc_malloc <b>33</b><br />mysqld;[unknown];std::thread::_State_impl<std::thread::_Invoker<std::tuple<Runnable, void (*)()> > >::~_State_impl();dict_stats_thread();[unknown];std::thread::_State_impl<std::thread::_Invoker<std::tuple<Runnable, void (*)()> > >::_M_run();dict_stats_thread();dict_stats_update(dict_table_t*, dict_stats_upd_option_t);dict_stats_save(dict_table_t*, index_id_t const*);dict_stats_exec_sql(pars_info_t*, char const*, trx_t*);trx_commit_for_mysql(trx_t*);trx_commit(trx_t*);trx_commit_low(trx_t*, mtr_t*);__libc_malloc <b>40</b><br />openxs@ao756:~/git/BPF-tools/old/2017-12-23$ <b>cd</b><br />openxs@ao756:~$ <b>cat /tmp/alloc.out | ~/git/FlameGraph/flamegraph.pl --color=mem --title="malloc() Flame Graph" --countname="bytes" >/tmp/mysql8_malloc.svg</b></span></span><br /></p></blockquote><p>So, we've got nice long folded stacks 
with the number of bytes allocated in each (whether later freed or not), from pure, as-low-impact-as-possible <b>malloc()</b> tracing, in a form immediately usable by flamegraph.pl, which produced the following output:</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiOp2K1erSPtJckXtO34-SDw4HOMDyKhgOYmFb-h1pJc2fFQ3fuPW3DIv5b82X6BUCe6OqYnTZ16f1s8_16FDaBZn_S-qYVlibHKGq66XVVzviQrWbn9-Y3wUpFQ5o4wDqY7OJRKIuV6zMQ/s1070/mysql8_malloc.svg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="539" data-original-width="1070" height="322" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiOp2K1erSPtJckXtO34-SDw4HOMDyKhgOYmFb-h1pJc2fFQ3fuPW3DIv5b82X6BUCe6OqYnTZ16f1s8_16FDaBZn_S-qYVlibHKGq66XVVzviQrWbn9-Y3wUpFQ5o4wDqY7OJRKIuV6zMQ/w640-h322/mysql8_malloc.svg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Flame graph showing where most of the memory was allocated from in MySQL 8.0.25 over 10+ seconds, while running the standard <b>sysbench oltp_read_write</b> test.<br /></td></tr></tbody></table><p></p><p>Primitive, but it worked and allowed me to see that most of the allocations were related to <b>filesort</b>.<br /></p><p></p><p style="text-align: center;">* * *<br /></p>To summarize:<ul style="text-align: left;"><li>Relatively low-impact tracing of memory allocations (ongoing and outstanding) is possible using bcc tools.</li><li>One may use the <b>memleak</b> tool to get quick insights on outstanding memory allocations, no matter where and how they were made, with periodic sampling to reduce the performance impact, or rely on some custom or general tracing tools to collect some metric per stack trace and represent the result as memory flame graphs.</li><li>It looks like it may make sense to do just primitive 
tracing with bpftrace and not try to overload the tool with collecting per-stack data in the maps, as stack resolution seems to take too many CPU resources.<br /></li></ul>Valerii Kravchukhttp://www.blogger.com/profile/13158916419325454260noreply@blogger.com0tag:blogger.com,1999:blog-3080615211468083537.post-14318795395237200812021-05-09T18:05:00.000+03:002021-05-09T18:05:08.952+03:00Off-CPU Analysis Attempt to Find the Reason of Performance Regression in MariaDB 10.4<p>I have not written new blog posts for more than 2 months already. Busy days... But now I am on vacation and my Percona Live Online 2021 <a href="https://perconaliveonline.sched.com/event/io6r/flame-graphs-for-mysql-dbas" target="_blank">talk on flame graphs</a> is coming soon, so I decided to renew my experience with <a href="https://github.com/iovisor/bcc/tree/master/tools" target="_blank">bcc tools</a> and try to get some insights for <a href="https://jira.mariadb.org/browse/MDEV-24272" target="_blank">one of the bugs</a> I've reported for MariaDB using <a href="http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html" target="_blank">off-CPU flame graphs</a>.</p><p>The idea was to check why the <b>sysbench oltp_read_write</b> test started to run notably slower in newer versions of MariaDB 10.4.x after 10.4.15. 
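Throughout this post I compare two builds by eyeballing the sysbench throughput lines, so it is handy to pull just those lines out of saved run logs. A trivial sketch (the helper name and the sample log are mine, for illustration only):

```shell
# extract_tps: print the transactions-per-second value from a saved
# sysbench run log (illustrative helper, not part of sysbench itself).
extract_tps() {
    awk '/transactions:/ { gsub(/[()]/, "", $3); print $3 }' "$1"
}

# A sample log fragment in the shape sysbench prints it:
cat > /tmp/sb_run.log <<'EOF'
    transactions:                        35630  (118.54 per sec.)
    queries:                             712600 (2370.87 per sec.)
EOF

extract_tps /tmp/sb_run.log    # prints 118.54
```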
On my good, old and slow Acer netbook with recently updated Ubuntu: </p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/git/server$ <b>cat /etc/lsb-release</b><br />DISTRIB_ID=Ubuntu<br />DISTRIB_RELEASE=20.04<br />DISTRIB_CODENAME=focal<br />DISTRIB_DESCRIPTION="Ubuntu 20.04.2 LTS"<br />openxs@ao756:~/git/server$ <b>uname -a</b><br />Linux ao756 5.8.0-50-generic #56~20.04.1-Ubuntu SMP Mon Apr 12 21:46:35 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux</span></span> <br /></p></blockquote><p>I've compiled current MariaDB 10.4 from GitHub source following my usual way:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/git/server$ <b>git checkout 10.4</b><br />...<br />openxs@ao756:~/git/server$ <b>git pull</b><br />...<br />openxs@ao756:~/git/server$ <b>git submodule update --init --recursive</b><br />Submodule path 'extra/wolfssl/wolfssl': checked out '9c87f979a7f1d3a6d786b260653d566c1d31a1c4'<br />Submodule path 'libmariadb': checked out '180c543704d627a50a52aaf60e24ca14e0ec4686'<br />Submodule path 'wsrep-lib': checked out 'f271ad0c6e3c647df83c1d5ec9cd26d77cef2337'<br />Submodule path 'wsrep-lib/wsrep-API/v26': checked out '76cf223c690845bbf561cb820a46e06a18ad80d1'<br />openxs@ao756:~/git/server$ <b>git branch</b><br /> 10.3<br />* 10.4<br /> 10.5<br /> 10.6<br />openxs@ao756:~/git/server$ <b>git log -1</b><br />commit 583b72ad0ddbc46a7aaeda1c1373b89d4bded9ea (HEAD -> 10.4, origin/bb-10.4-merge, origin/10.4)<br />Merge: 473e85e9316 a4139f8d68b<br />Author: Oleksandr Byelkin <sanja@mariadb.com><br />Date: Fri May 7 11:50:24 2021 +0200<br /><br /> Merge branch 'bb-10.4-release' into 10.4<br />openxs@ao756:~/git/server$ <b>rm CMakeCache.txt</b><br />openxs@ao756:~/git/server$ <b>cmake . 
-DCMAKE_INSTALL_PREFIX=/home/openxs/dbs/maria10.4 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DBUILD_CONFIG=mysql_release -DFEATURE_SET=community -DWITH_EMBEDDED_SERVER=OFF -DPLUGIN_TOKUDB=NO -DWITH_SSL=system</b><br />...<br />-- Generating done<br />-- Build files have been written to: /home/openxs/git/server<br />openxs@ao756:~/git/server$ <b>time make -j 3</b><br />...<br />[100%] Building C object extra/mariabackup/CMakeFiles/mariabackup.dir/__/__/libmysqld/libmysql.c.o<br />[100%] Linking CXX executable mariabackup<br />[100%] Built target mariabackup<br /><br />real 74m9,550s<br />user 134m10,837s<br />sys 7m0,387s<br /><br />openxs@ao756:~/git/server$ <b>make install && make clean</b><br />...<br />openxs@ao756:~/git/server$ <b>cd ~/dbs/maria10.4</b><br />openxs@ao756:~/dbs/maria10.4$ <b>bin/mysqld_safe --no-defaults --port=3309 --socket=/tmp/mariadb.sock --innodb_buffer_pool_size=1G --innodb_flush_log_at_trx_commit=2 &</b><br />[1] 27483<br />openxs@ao756:~/dbs/maria10.4$ 210507 19:15:37 mysqld_safe Logging to '/home/openxs/dbs/maria10.4/data/ao756.err'.<br />210507 19:15:37 mysqld_safe Starting mysqld daemon with databases from /home/openxs/dbs/maria10.4/data<br /><br />openxs@ao756:~/dbs/maria10.4$ <b>bin/mysql --socket=/tmp/mariadb.sock</b><br />Welcome to the MariaDB monitor. Commands end with ; or \g.<br />Your MariaDB connection id is 8<br />Server version: 10.4.19-MariaDB MariaDB Server<br /><br />Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.<br /><br />Type 'help;' or '\h' for help. 
Type '\c' to clear the current input statement.<br /><br />MariaDB [(none)]> <b>drop database if exists sbtest;</b><br />Query OK, 5 rows affected (1.297 sec)<br /><br />MariaDB [(none)]> <b>create database sbtest;</b><br />Query OK, 1 row affected (0.001 sec)<br /><br />MariaDB [(none)]> <b>exit</b><br />Bye</span></span><br /></p></blockquote><p>and compared to 10.4.15 with the following test:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/dbs/maria10.4$ <b>sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --threads=4 prepare</b><br />sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)<br /><br />Initializing worker threads...<br /><br />Creating table 'sbtest2'...<br />Creating table 'sbtest3'...<br />Creating table 'sbtest4'...<br />Creating table 'sbtest1'...<br />Inserting 100000 records into 'sbtest1'<br />Inserting 100000 records into 'sbtest4'<br />Inserting 100000 records into 'sbtest2'<br />Inserting 100000 records into 'sbtest3'<br />Creating a secondary index on 'sbtest1'...<br />Creating a secondary index on 'sbtest4'...<br />Creating table 'sbtest5'...<br />Inserting 100000 records into 'sbtest5'<br />Creating a secondary index on 'sbtest2'...<br />Creating a secondary index on 'sbtest3'...<br />Creating a secondary index on 'sbtest5'...<br /><br />openxs@ao756:~/dbs/maria10.4$ <b>sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --time=300 --report-interval=10 --threads=4 run</b><br />...<br /> transactions: 35630 (<b>118.54</b> per sec.)<br /> queries: 712600 (<b>2370.87</b> per sec.)<br /> ignored errors: 0 (0.00 per sec.)<br /> reconnects: 0 (0.00 per sec.)<br /><br />General statistics:<br /> total time: 300.5612s<br /> total number of events: 35630<br /><br />Latency (ms):<br /> min: 3.30<br /> avg: 
33.70<br /> max: 2200.17<br /> 95th percentile: <b>164.45</b><br />...<br /><br />openxs@ao756:~/dbs/maria10.4.15$ <b>sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --time=300 --report-interval=10 --threads=4 run</b><br />...<br /> transactions: 56785 (<b>189.25</b> per sec.)<br /> queries: 1135700 (<b>3784.99</b> per sec.)<br /> ignored errors: 0 (0.00 per sec.)<br /> reconnects: 0 (0.00 per sec.)<br /><br />General statistics:<br /> total time: 300.0501s<br /> total number of events: 56785<br /><br />Latency (ms):<br /> min: 3.15<br /> avg: 21.13<br /> max: 704.36<br /> 95th percentile: <b>108.68</b><br />...</span></span><br /></p></blockquote><p>So, basically, with the same test, all tables fitting into the buffer pool (1G), and all but a few settings at their defaults, current MariaDB 10.4.19 demonstrates up to a 60% drop in throughput and an increase in 95th percentile latency on this netbook (even more than the 15% or so reported previously on a faster quad-core Fedora desktop).<br /></p><p>If you read <a href="https://jira.mariadb.org/browse/MDEV-24272" target="_blank"><b>MDEV-24272</b></a> carefully, the regression was tracked down to a specific commit, but I tried to apply various tools to actually see where, specifically, more time is spent now. 
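For the record, the "up to 60%" figure follows directly from the two throughput numbers quoted above; a quick check (tps values hard-coded from the two runs):

```shell
# 10.4.15 did 189.25 tps, current 10.4.19 did 118.54 tps on the same test.
awk 'BEGIN {
    tps_old = 189.25;  # 10.4.15
    tps_new = 118.54;  # 10.4.19
    printf "10.4.15 is %.1f%% faster; 10.4.19 lost %.1f%% of throughput\n",
           (tps_old - tps_new) / tps_new * 100,
           (tps_old - tps_new) / tps_old * 100
}'
# prints: 10.4.15 is 59.7% faster; 10.4.19 lost 37.4% of throughput
```

So "60%" is the relative difference when taking the slower run as the base.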
Profiling with <b>perf</b> and creating on-CPU flame graphs did not give me any clear insight that would explain that increase in latency, so my next idea was to trace <a href="http://www.brendangregg.com/FlameGraphs/offcpuflamegraphs.html#Off-CPU" target="_blank">off-CPU time spent</a>, that is, try to find out how long the MariaDB server waits and where in the code that mostly happens.</p><p>For this I've used the <a href="https://github.com/iovisor/bcc/blob/master/tools/offcputime.py" target="_blank"><b>offcputime</b></a> tool:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~$ <b>/usr/share/bcc/tools/offcputime -h</b><br />usage: offcputime [-h] [-p PID | -t TID | -u | -k] [-U | -K] [-d] [-f]<br /> [--stack-storage-size STACK_STORAGE_SIZE]<br /> [-m MIN_BLOCK_TIME] [-M MAX_BLOCK_TIME] [--state STATE]<br /> [duration]<br /><br />Summarize off-CPU time by stack trace<br /><br />positional arguments:<br /> duration duration of trace, in seconds<br /><br />optional arguments:<br /> -h, --help show this help message and exit<br /><b> -p PID, --pid PID trace this PID only</b><br /> -t TID, --tid TID trace this TID only<br /> -u, --user-threads-only<br /> user threads only (no kernel threads)<br /> -k, --kernel-threads-only<br /> kernel threads only (no user threads)<br /><b> -U, --user-stacks-only<br /> show stacks from user space only (no kernel space<br /> stacks)<br /></b> -K, --kernel-stacks-only<br /> show stacks from kernel space only (no user space<br /> stacks)<br /> -d, --delimited insert delimiter between kernel/user stacks<br /><b> -f, --folded output folded format<br /></b> --stack-storage-size STACK_STORAGE_SIZE<br /> the number of unique stack traces that can be stored<br /> and displayed (default 1024)<br /> -m MIN_BLOCK_TIME, --min-block-time MIN_BLOCK_TIME<br /> the amount of time in microseconds over which we store<br /> traces (default 1)<br /> -M MAX_BLOCK_TIME, --max-block-time MAX_BLOCK_TIME<br 
/> the amount of time in microseconds under which we<br /> store traces (default U64_MAX)<br /> --state STATE filter on this thread state bitmask (eg, 2 ==<br /> TASK_UNINTERRUPTIBLE) see include/linux/sched.h<br /><br />examples:<br /> ./offcputime # trace off-CPU stack time until Ctrl-C<br /> ./offcputime 5 # trace for 5 seconds only<br /> ./offcputime -f 5 # 5 seconds, and output in folded format<br /> ./offcputime -m 1000 # trace only events that last more than 1000 usec<br /> ./offcputime -M 10000 # trace only events that last less than 10000 usec<br /> ./offcputime -p 185 # only trace threads for PID 185<br /> ./offcputime -t 188 # only trace thread 188<br /> ./offcputime -u # only trace user threads (no kernel)<br /> ./offcputime -k # only trace kernel threads (no user)<br /> ./offcputime -U # only show user space stacks (no kernel)<br /> ./offcputime -K # only show kernel space stacks (no user)</span></span><br /></p></blockquote><p></p><p></p><p>I've stored the outputs in <b>/dev/shm</b> to have less impact on disk I/O, which I suspected as one of the causes:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~$ <b>ls /dev/shm</b><br />openxs@ao756:~$ <b>mkdir /dev/shm/out</b><br />openxs@ao756:~$ <b>ls /dev/shm</b><br />out</span></span><br /></p></blockquote><p>Basically, the following commands were used to create folded (ready for building flame graphs) user-space-only stacks with the time spent off-CPU in them over <b>60</b> seconds of tracing, while the <b>sysbench</b> tests were running against a clean setup, first on MariaDB 10.4.15 and then on current 10.4.19, and to create flame graphs based on those stacks where the start_thread frame is present (to clean up irrelevant details):</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~$ <b>sudo /usr/share/bcc/tools/offcputime -f -p `pidof mysqld` -U 60 > /dev/shm/out/offcpu_10415.out</b><br />WARNING: 27 stack traces lost and 
could not be displayed.<br />openxs@ao756:~$ <b>cat /dev/shm/out/offcpu_10415.out | grep start_thread | ~/git/FlameGraph/flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us > /tmp/offcpu_10415.svg</b><br /><br />openxs@ao756:~$ <b>sudo /usr/share/bcc/tools/offcputime -f -p `pidof mysqld` -U 60 > /dev/shm/out/offcpu_10419.out</b><br />WARNING: 24 stack traces lost and could not be displayed.<br />openxs@ao756:~$ <b>cat /dev/shm/out/offcpu_10419.out | grep start_thread | ~/git/FlameGraph/flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us > /tmp/offcpu_10419.svg</b><br /></span></span></p></blockquote><p>As a result I've got the following graphs (<b>.png</b> screenshots from real <b>.svg</b> files below). On 10.4.15 we spent around 43 seconds (out of 60 we monitored) off-CPU, mostly in <b>do_command()</b> and waiting for network I/O, and the graph was the following:</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4pHZFfxj3fjBVRFTKyPhKHvHeVp-dJRWTs0-I6g7aogXLMqBsdeOvsQJUvlOuepISO1KEXC0lKKjoLhYKzmwDUzbK53HhVc4FcF58H94dEk5R_mKPahN_7calOmbEyPLVPOVXE2xmRSX2/s1199/offcpu_10415.svg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="526" data-original-width="1199" height="280" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4pHZFfxj3fjBVRFTKyPhKHvHeVp-dJRWTs0-I6g7aogXLMqBsdeOvsQJUvlOuepISO1KEXC0lKKjoLhYKzmwDUzbK53HhVc4FcF58H94dEk5R_mKPahN_7calOmbEyPLVPOVXE2xmRSX2/w640-h280/offcpu_10415.svg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Off-CPU time for MariaDB 10.4.15<br /></td></tr></tbody></table><p>In case of 10.4.19 the graph is very different and we seem to have spent 79 seconds off-CPU, mostly in background 
<b>io_handler_thread()</b>:<br /></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEik5gm8WCdIXMXIfBZIohRUsLq_VMe0OJV5UbbeAAyc_MaOtnE4jAih0maAI4A6EjZmoGIPtJvmD1iswr17W3FCbRhRY0zEhWr3b7jgqqs6DeP53mEnHYdKleikvAZYrVEbzjFYoa3z09eu/s1197/offcpu_10419.svg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="539" data-original-width="1197" height="288" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEik5gm8WCdIXMXIfBZIohRUsLq_VMe0OJV5UbbeAAyc_MaOtnE4jAih0maAI4A6EjZmoGIPtJvmD1iswr17W3FCbRhRY0zEhWr3b7jgqqs6DeP53mEnHYdKleikvAZYrVEbzjFYoa3z09eu/w640-h288/offcpu_10419.svg" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Off-CPU time for MariaDB 10.4.19<br /></td></tr></tbody></table><p>I was surprised to see more than 60 seconds spent off-CPU in this case. Maybe this is possible because I have 2 cores and MariaDB threads were waiting on both most of the time.</p><p>I've then tried to use a <a href="http://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html" target="_blank">differential flame graph</a> to highlight the call stacks responsible for the main difference. 
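Under the hood, difffolded.pl essentially joins two folded-stack files on the stack string and emits both counts per stack, so flamegraph.pl can color frames by the delta. A simplified awk sketch of just that merge step (illustrative only, not the real implementation, which also handles normalization and other options):

```shell
# Two tiny folded-stack files: "stack count" per line (illustrative data).
cat > /tmp/before.folded <<'EOF'
main;do_command;net_read 10
main;do_command;parse 5
EOF
cat > /tmp/after.folded <<'EOF'
main;do_command;net_read 40
main;do_command;parse 5
EOF

# Emit "stack count_before count_after", the format flamegraph.pl expects
# for differential coloring. Stacks missing from the first file get 0.
awk 'NR == FNR { before[$1] = $2; next }
     { printf "%s %d %d\n", $1, before[$1] + 0, $2 }' \
    /tmp/before.folded /tmp/after.folded
```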
I've created it from the existing folded stack traces with the following command:<br /></p><p></p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~$ <b>~/git/FlameGraph/difffolded.pl /dev/shm/out/offcpu_10415.out /dev/shm/out/offcpu_10419.out | grep start_thread | ~/git/FlameGraph/flamegraph.pl --color=io --title="Off-CPU Time Diff Flame Graph" --countname=us > /tmp/offcpu_diff.svg</b><br /></span></span></p></blockquote><p>The resulting graph is presented below:</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjy587_iBIld8DDn9POasYz6VCscY66oV1dvu72wnYgDYmvfsglpV3YpCf056FfPH2rzyH0WNRBqwtzDrSK-vnisiNytaM_zdKGTwmvQ4iM7qf4uRCtNfi63M6FxMPK0hXrbyFQQ_cXNvyq/s1199/offcpu_diff.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="541" data-original-width="1199" height="288" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjy587_iBIld8DDn9POasYz6VCscY66oV1dvu72wnYgDYmvfsglpV3YpCf056FfPH2rzyH0WNRBqwtzDrSK-vnisiNytaM_zdKGTwmvQ4iM7qf4uRCtNfi63M6FxMPK0hXrbyFQQ_cXNvyq/w640-h288/offcpu_diff.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The source of the increase is highlighted in red<br /></td></tr></tbody></table><p></p><p>Here we clearly see that the main increase in time spent waiting in 10.4.19 is related to <b>io_handler_thread()</b>, but the increase happened in almost all background threads.<br /></p><p></p><p style="text-align: center;">* * *</p><p style="text-align: left;">To summarize:</p><ul style="text-align: left;"><li>When some performance regression happens, you should check not only those code paths in the application where the software started to spend more time working, but also where it started to wait more.</li><li>In case of <a 
href="https://jira.mariadb.org/browse/MDEV-24272">https://jira.mariadb.org/browse/MDEV-24272</a> we clearly started to flush more to disk from the very beginning of the <b>sysbench oltp_read_write</b> test in newer versions of MariaDB 10.4.x, and on my slow encrypted HDD this matters a lot. The load that was supposed to be CPU-bound (as we have a large enough buffer pool) becomes disk-bound.</li><li>Flame graphs are cool for highlighting the difference, and in this post I've shown both a classical smart way to produce them without too much impact, and a way to highlight the difference in them with a differential flame graph produced by the <a href="http://difffolded.pl" target="_blank"><b>difffolded.pl</b></a> tool created by <b>Brendan Gregg</b>.</li><li>Other cases where Flame Graphs may help MySQL or MariaDB DBAs are discussed during my upcoming Percona Live 2021 talk on May 12. See you there!</li><li>I'll get back to this nice regression bug to study the test case in more detail with other tools, maybe more than once. Stay tuned!<br /></li></ul>Valerii Kravchukhttp://www.blogger.com/profile/13158916419325454260noreply@blogger.com0tag:blogger.com,1999:blog-3080615211468083537.post-63629166315986205772021-02-28T20:43:00.002+02:002021-02-28T20:43:52.112+02:005 Years of Working for MariaDB Corporation<p>March 1, 2016, was <a href="http://mysqlentomologist.blogspot.com/2016/04/building-mariadb-101x-and-galera-from.html" target="_blank">my first day</a> of working for MariaDB Corporation. It means that I worked for them a full 5 years in a row! It's the longest period I've ever spent working in the same company. I worked for more than 7 years in the MySQL Bugs Verification team of MySQL Support, but formally it was for 3 different companies over that period, MySQL AB, Sun and Oracle. 
So tonight, after 2 weekend on-call shifts in a row, I want to summarize what I lost over those 5 years and what I won.</p><p>I never planned to leave <a href="http://mysqlentomologist.blogspot.com/2016/02/mysql-support-people-percona-support.html" target="_blank">Percona</a> after just a bit more than 3 years, but I had to do something to either fix the direction of their services business development after April, 2015, or at least to make <a href="http://mysqlentomologist.blogspot.com/2016/01/im-winston-wolf-i-solve-problems.html" target="_blank">some</a> <a href="http://mysqlentomologist.blogspot.com/2016/02/mysql-support-people-directors-managers.html" target="_blank">points</a> that would be noticed and remembered. I made them by January 26, 2016, and then had to move on. </p><p>So, what I lost after leaving Percona:</p><ul style="text-align: left;"><li>In Percona I worked in <a href="http://mysqlentomologist.blogspot.com/2016/02/mysql-support-people-percona-support.html" target="_blank">the best Support team</a> in the industry at the moment!</li><li>In Percona I was proud to be working for a company that does the right things, both for business around open source software, Community and customers.</li><li>In Percona I was involved in decision making, I influenced the bug fixing process and priorities, and was a kind of authority on everything I cared to state or do.</li><li>I had a really good salary, regular bonuses, longer vacations or otherwise properly compensated extra working time, and all the opportunities to become a public person in the MySQL Community.</li><li>I spent a lot of time working with <b>Peter Zaitsev</b> and really brilliant engineers from Development, QA and Support.<br /></li><li>After a few initial months of getting used to a lot of work and the work style, then till April 14 or so, 2015, it was a really comfortable place for me to work at and do things I like and am good at.</li></ul><p>I lost most of the above when I left. 
No more decision making of any kind (it was my decision to avoid that while joining MariaDB, to begin with). No more bug processing or prioritizing. No more Percona Live conferences till 2019, when I finally managed to clarify the problems I had with my (cancelled) participation in Percona Live Europe 2015. Nobody ever asked me to blog about anything from 2016 until the beginning of 2020 (when the MariaDB Foundation got really interested in my public performances). Joining MariaDB Corporation made me a suspicious member of the MySQL Community and eventually forced me to leave <a href="https://planet.mysql.com/" target="_blank">Planet MySQL</a>, where my posts were not appreciated. </p><p>It takes just one short tweet sharing a MySQL bug number to have some of these bugs immediately hidden from the Community and made private. It looks like people suspect I have some secret agenda and mission from MariaDB Corporation, while I have none related to MySQL, neither to the software nor to bugs in it - I do it in my own time and based on my own views that are not influenced by anyone...</p><p>Now, what I gained from joining the MariaDB Support team:</p><ul style="text-align: left;"><li>I still work in the best Support team in the industry, with new (to me) brilliant people, <a href="http://mysqlentomologist.blogspot.com/2016/01/mysql-support-people-those-who-were.html" target="_blank">some of my former colleagues in MySQL</a> and some of those I worked with in Percona and managed to get back into my team now in MariaDB Corporation.</li><li>I work harder than ever, at least twice as much as I did in Percona (at least speaking about real-life customer issues). The issues are more interesting, complex and challenging in general, and cover everything from MySQL 4.0.x to NDB cluster, MySQL 8.0.x and all versions of MariaDB Server, MaxScale and Connectors, and Galera clusters, with everything MySQL-related that Percona releases in between! 
This work has been properly compensated in recent years.</li><li>Yes, I do work on Galera issues a lot, read the megabytes of logs and make sense of everything Galera. Who could have imagined I'd go that way back in 2015?</li><li>I work closely and directly with the best developers, from Codership's Galera team, MariaDB Foundation developers, to <b>Sergey Golubchik</b>, <span><b>Marko Mäkelä</b>, </span><span><b>Vladislav Vaintroub</b>, <b>Elena Stepanova</b> and the good old MySQL Optimizer team (now MariaDB's), and <b>Monty</b> himself, and more... We chat, we talk and discuss technical topics almost every day! I do some work they ask about, I build more things from sources than ever. It's really great, and it had almost never been the case before I joined MariaDB. I love this part of the corporate culture here.</span></li><li><span>This blog that you read now is way more popular than it ever was before 2016. At good times in 2017 I had more than 1000 reads per day, for weeks and months.</span></li><li><span>I am presenting at conferences way more often than I ever did in Percona, from FOSDEM to Percona Live and everything MariaDB.<br /></span></li><li><span>My influence and impact on the MySQL Community have increased. I was declared a MySQL Community Contributor of the Year 2019. As often happens, it's easier to make an impact when you are an outsider. They cannot ignore you even if that's only because you are considered an "asshole" and "enemy" with a "corporate agenda", for whatever reasons.<br /></span></li></ul><p>So far I do not regret that I made the decision in favor of MariaDB back in 2016, even though it forced me to keep up with or ignore many things I don't like at my current company. I am sorry that back in 2010 Monty and Kay were just 10 days too late to get me to the SkySQL of the time. 
I had signed the contract with Oracle, and in 2012 there I really was mostly wasting my time, unfortunately...<br /></p><p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYhI8BXASZ_s7f-WHY_sv3BfrGqie7678P2woGZPg1UsBOLsO56jQtzJmhUUFEk6oSVnDAYpmIVcz2bzEGaDmVbowlvkiP758WJQ7YM9XzTLPXCJ7JXcrGwWejiUwSqz2pCa0BgRi5TWdA/s640/092.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="640" data-original-width="480" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgYhI8BXASZ_s7f-WHY_sv3BfrGqie7678P2woGZPg1UsBOLsO56jQtzJmhUUFEk6oSVnDAYpmIVcz2bzEGaDmVbowlvkiP758WJQ7YM9XzTLPXCJ7JXcrGwWejiUwSqz2pCa0BgRi5TWdA/w480-h640/092.jpg" width="480" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">That switch formally happened on March 1, 2016. It was a good decision.</td><td class="tr-caption" style="text-align: center;"><br /></td></tr></tbody></table><br /></p><p>Just a few random notes at the end of a hard 7-day week of work. I hope you would not blame me too much for these. I also hope I'll still have my job tomorrow :)<br /></p>Valerii Kravchukhttp://www.blogger.com/profile/13158916419325454260noreply@blogger.com1tag:blogger.com,1999:blog-3080615211468083537.post-91440624505483333672021-02-14T16:59:00.001+02:002021-02-14T16:59:20.770+02:00What about the "Fun with Bugs" series on YouTube?<p>My "<a href="http://mysqlentomologist.blogspot.com/search/label/funwithbugs" target="_blank"><b>Fun with Bugs</b></a>" series of blog posts about interesting or badly handled MySQL bug reports <a href="http://mysqlentomologist.blogspot.com/2020/06/fun-with-bugs-100-on-mysql-bug-reports.html" target="_blank">ended</a> more than 7 months ago. The time had come for that. 
But I honestly miss that medium for bitching about something wrong in MySQL once in a while...</p><p>The year of the COVID-19 pandemic, with video conferences that replaced normal offline ones, forced me to start recording videos, and I came to like the process so much that I am working on <a href="https://www.youtube.com/channel/UCzBDIplzdOrSKqrR3hAZ7uw" target="_blank">my wife's channel content</a> as a hobby. I've created <a href="https://www.youtube.com/channel/UC8j1f4jMBcuAcg2s6QGVsDw/" target="_blank">my own channel</a> as well, for uploading some draft/bad/long/extended videos recorded in the process of work on my online talks:</p><p></p><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="BLOG_video_class" height="266" src="https://www.youtube.com/embed/giikJfsCK14" width="320" youtube-src-id="giikJfsCK14"></iframe></div><br />Being fluent enough with recording videos at home (using different software tools and cameras) and publishing them on YouTube, I now wonder if it makes sense to turn this activity into a regular and MySQL-focused one. My next talk(s) will be submitted to <a href="https://cfp.percona.com/" target="_blank">Percona Live 2021</a>, but it means they may go live only in May, and I'd like to be on screens earlier than that.<br /><p></p><p>So, I wonder: should I maybe have regular video recordings shared, let's say, once a week every Tuesday, up to 5 minutes long at most, and devoted to some MySQL-related topics? What topics would you like me to cover? Would you mind if it were a five-minute talk about a recent MySQL bug report or a few of them, either interesting in general or badly handled by my former Oracle colleagues? Something else to better spend megabytes of video on? Leave it to younger and more attractive and experienced speakers? 
Keep writing here or stop bitching about the bugs once and for all?<br /></p><p>I am waiting for your comments on this blog post and on the social media posts where I'll share it, until March 5, 2021. Then I'll decide on how to proceed with this regular YouTube video idea.<br /></p>Valerii Kravchukhttp://www.blogger.com/profile/13158916419325454260noreply@blogger.com4tag:blogger.com,1999:blog-3080615211468083537.post-65719381919570493362021-02-05T14:05:00.000+02:002021-02-05T14:05:27.883+02:00On Upcoming FOSDEM 2021<p><a href="https://fosdem.org/2021/" target="_blank">FOSDEM 2021</a> starts tomorrow. This time I have 3 talks to present. Here they are, in the order of appearance, with links to my related blog posts:</p><ol style="text-align: left;"><li>"<a href="https://fosdem.org/2021/schedule/event/mariadb_upgrade/" target="_blank"><b>Upgrading to a newer major version of MariaDB</b></a>" - it should start at 10:30 on Saturday in the <a href="https://fosdem.org/2021/schedule/track/mariadb/" target="_blank">MariaDB devroom</a> and is mostly devoted to <b>mysql_upgrade</b>. 
Related blog posts are:<br /><ul><li>"<a href="https://mysqlentomologist.blogspot.com/2020/04/what-mysqlupgrade-really-does-in.html" target="_blank">What mysql_upgrade really does in MariaDB, Part I </a>" - original study of internal workings of <b>mysql_upgrade</b> utility inspired by <a href="http://monty-says.blogspot.com/2020/04/upgrading-between-major-mariadb-versions.html" target="_blank">Monty's post</a>.</li><li>"<a href="https://mysqlentomologist.blogspot.com/2021/01/what-mysqlupgrade-really-does-in.html" target="_blank">What mysql_upgrade really does in MariaDB, Part II, Bugs and Missing Features</a>" - recent review of bugs and feature requests for <b>mysql_upgrade</b> reported since the previous blog post.<br /></li></ul></li><li>"<b><a href="https://fosdem.org/2021/schedule/event/mariadb_bpftrace/" target="_blank">Monitoring MariaDB Server with bpftrace on Linux</a></b>" - it should start at 12:40 on Sunday in the <a href="https://fosdem.org/2021/schedule/track/monitoring_and_observability/" target="_blank">Monitoring and Observability devroom</a> and is devoted to <b>bpftrace</b> basics. A lot more information is provided in recent blog posts:<br /><ul><li>"<a href="https://mysqlentomologist.blogspot.com/2021/01/playing-with-recent-bpftrace-and.html" target="_blank">Playing with recent bpftrace and MariaDB 10.5 on Fedora - Part I, Basic uprobes</a>" - few details on building recent version from source and basic probes to capture SQL queries and their execution time and to trace <b>pthread_mutex_lock</b> calls</li><li>"<a href="https://mysqlentomologist.blogspot.com/2021/01/playing-with-recent-bpftrace-and_24.html" target="_blank">Playing with recent bpftrace and MariaDB 10.5 on Fedora - Part II, Using the Existing Tools</a>" - review and tests of some <b>bpftrace</b>-based tools/small programs that come with it, mostly related to disk I/O monitoring. 
It's always great to study by example and have more tools for typical tasks.</li><li>"<a href="https://mysqlentomologist.blogspot.com/2021/01/playing-with-recent-bpftrace-and_27.html" target="_blank">Playing with recent bpftrace and MariaDB 10.5 on Fedora - Part III, Creating a New Tool for Tracing Mutexes</a>" - on my first lame attempt to get interesting stacks and the time spent waiting in them, for the <b>pthread_mutex_lock</b> and <b>pthread_mutex_unlock</b> pair of functions.</li><li>"<a href="https://mysqlentomologist.blogspot.com/2021/01/playing-with-recent-bpftrace-and_28.html" target="_blank">Playing with recent bpftrace and MariaDB 10.5 on Fedora - Part IV, Creating a New Tool for Tracing Time Spent in __lll_lock_wait</a>" - in that case I tried to measure time, per unique stack, spent inside the <b>__lll_lock_wait</b> function when MariaDB was under high concurrent load. Interesting results, but the performance drop is notable and it took too much time to get the outputs...</li><li>"<a href="https://mysqlentomologist.blogspot.com/2021/01/playing-with-recent-bpftrace-and_30.html" target="_blank">Playing with recent bpftrace and MariaDB 10.5 on Fedora - Part V, Proper Way To Summarize Time Spent per Stack</a>" - in this last post I finally managed to find a proper balance of spreading processing between <b>bpftrace</b> and user-space <b>awk</b> and other tools, and got fast and useful results. Some more <b>bpftrace</b> features were also used.<br /></li></ul></li><li>"<b><a href="https://fosdem.org/2021/schedule/event/linux_porc_mysql/" target="_blank">Linux /proc filesystem for MySQL DBAs</a></b>" - this talk should start at 15:00 on Sunday in the <a href="https://fosdem.org/2021/schedule/track/mysql/" target="_blank">MySQL devroom</a> and is devoted to a totally different (compared to the dynamic tracing I am such a big fan of) way to get insights about MySQL internal workings, waits, etc. - sampling files in the<b> /proc</b> filesystem.
I have written 3 blog posts on the topic:<br /><ul><li>"<a href="https://mysqlentomologist.blogspot.com/2021/01/linux-proc-filesystem-for-mysql-dbas.html" target="_blank">Linux /proc Filesystem for MySQL DBAs - Part I, Basics</a>" - mostly quotes from <b>man 5 proc</b> and a few tests with Percona Server 5.7.x running under load.</li><li>"<a href="https://mysqlentomologist.blogspot.com/2021/01/linux-proc-filesystem-for-mysql-dbas_7.html" target="_blank">Linux /proc Filesystem for MySQL DBAs - Part II, Threads of the mysqld Process</a>" - some basic ideas on how to match a MySQL "thread"/connection with the Linux thread that we can then sample via <b>/proc</b>. The approach works for MySQL 5.7+ and MariaDB 10.5+. For older versions one probably has to use <b>gdb</b> to identify threads for monitoring.</li><li>"<a href="https://mysqlentomologist.blogspot.com/2021/01/linux-proc-filesystem-for-mysql-dbas_8.html" target="_blank">Linux /proc Filesystem for MySQL DBAs - Part III, 0x.tools by Tanel Poder</a>" - using these great and simple tools for regular monitoring and... to create off-CPU flame graphs for a MySQL server that is I/O bound or otherwise spends time waiting on something in the kernel.<br /></li></ul></li></ol><p>Slides are uploaded to the talks pages and will be shared via <a href="https://www.slideshare.net/ValeriyKravchuk" target="_blank">SlideShare</a>.
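The /proc sampling idea behind the third talk can be sketched in a few lines of shell. This is only an illustration (the PID used here is the shell's own, purely so the snippet runs anywhere; with a real server you would substitute the mariadbd or mysqld PID):

```shell
# Minimal sketch of /proc sampling: read the state of every thread of a
# process a few times in a loop. PID is this very shell, just to keep the
# example self-contained; in practice use the mariadbd/mysqld process id.
PID=$$
for sample in 1 2 3; do
  for t in /proc/$PID/task/*; do
    tid=${t##*/}
    state=$(awk '/^State:/ {print $2}' "$t/status")
    echo "sample $sample tid $tid state $state"
  done
  sleep 0.1
done
```

Summing such samples per state (or per kernel stack from /proc/PID/task/TID/stack, readable by root) gives a poor man's profiler, which is essentially what 0x.tools automates.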
Draft, longer versions of the talks that I recorded, but that the FOSDEM system did not accept, will also be shared on my <a href="https://www.youtube.com/channel/UC8j1f4jMBcuAcg2s6QGVsDw/" target="_blank">YouTube channel</a> soon.</p><p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWz2Yjcp-4Ct8_AoS5a0pDYT7rB_OlJfPP9Zsdw0B1JHvuR00oBNbSggwxG6oe_cMqnX-aE66Y1PkISmWmNt5tCrTbftiM_lhEo2BxEenKqBCUKMGoAkze8TjUN5eKQqqlqZvYk8zWj0EU/s2048/iPhone4s+160.JPG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1536" data-original-width="2048" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhWz2Yjcp-4Ct8_AoS5a0pDYT7rB_OlJfPP9Zsdw0B1JHvuR00oBNbSggwxG6oe_cMqnX-aE66Y1PkISmWmNt5tCrTbftiM_lhEo2BxEenKqBCUKMGoAkze8TjUN5eKQqqlqZvYk8zWj0EU/w400-h300/iPhone4s+160.JPG" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Usual view on my way to the FOSDEM ULB site... I miss Brussels!<br /></td></tr></tbody></table></p><p>Other talks in these devrooms that I consider interesting (I will not be able to attend them all live, though):</p><ul style="text-align: left;"><li><a href="https://fosdem.org/2021/schedule/event/mariadb_arm/">Migrating MariaDB Cluster to ARM<br /></a></li><li><a href="https://fosdem.org/2021/schedule/event/mariadb_atomic_ddl/">Atomic DDL in MariaDB</a></li><li>
<a href="https://fosdem.org/2021/schedule/event/mariadb_buffer_pool_improvements/">Buffer pool performance improvements</a> </li><li><a href="https://fosdem.org/2021/schedule/event/performance_analysis_troubleshooting/">Performance Analysis and Troubleshooting Methodologies for Databases</a> </li><li>
<a href="https://fosdem.org/2021/schedule/event/vitess/">Open Source Database Infrastructure with Vitess</a> </li><li><a href="https://fosdem.org/2021/schedule/event/mysql_xa/">Making MySQL-8.0 XA transaction processing crash safe</a> </li><li>
<a href="https://fosdem.org/2021/schedule/event/rewrite_mysql/">Rewrite Your Complex MySQL Queries for Better Performance</a> <br /></li></ul><p>See you there! FOSDEM was a real driver of my more or less advanced performance studies this year.<br /></p>Valerii Kravchukhttp://www.blogger.com/profile/13158916419325454260noreply@blogger.com0tag:blogger.com,1999:blog-3080615211468083537.post-9567908340632362342021-01-30T19:49:00.000+02:002021-01-30T19:49:04.157+02:00Playing with recent bpftrace and MariaDB 10.5 on Fedora - Part V, Proper Way To Summarize Time Spent per Stack<p>In my previous post in this series I was tracing <b>__lll_lock_wait()</b> function and measuring time spent in it per unique MariaDB server's stack trace. I've got the results that had not looked like totally wrong or useless and even presented them as a flame graph, but it took a long time and during all thyat time, for minutes, <b>sysbench</b> test TPS and QPS results were notably smaller. So, it was a useful tracing, but surely not a low impact one.</p><p>I already stated before that the approach used is surely not optimal. There should be some better way to summarize the stack traces without such a performance impact. Today I continued to work on this problem. First of all, I tried to measure finally how much time exactly I had to wait for the results from <b>bpftrace</b> command:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>time sudo ./lll_lock_wait.bt 2>/dev/null >/tmp/lll_lock_perf_stacks1.txt</b><br /><br /><b>real 9m51.904s<br /></b>user 8m25.576s<br />sys 1m8.057s<br />[openxs@fc31 ~]$ <b>ls -l /tmp/lll_lock_perf_stacks1.txt</b><br />-rw-rw-r--. 1 openxs openxs 1564291 Jan 30 11:58 /tmp/lll_lock_perf_stacks1.txt</span></span><br /></p></blockquote><p>So, test was running for 300 seconds (5 minutes) and monitoring results (for about 20 seconds) came 5 minutes later! 
This is definitely NOT acceptable for any regular use.</p><p>I started to check things and test more. I noted in the <a href="https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md" target="_blank">Reference Guide</a> (which I've read entirely a couple of times yesterday and today) that the <b>ustack()</b> function of <b>bpftrace</b> can accept a second argument, the number of frames to show in the stack. Based on the previous flame graphs, the most interesting stacks were not long, so I decided to check if 8 or even just 5 frames might still be enough for useful results while reducing the performance impact. I changed the code here:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">uprobe:/lib64/libpthread.so.0:__lll_lock_wait<br />/comm == "mariadbd"/<br />{<br /> @start[tid] = nsecs;<br /><b> @tidstack[tid] = ustack(perf,5); </b>-- <-- I also tried 8<br />}</span></span><br /></p></blockquote><p>and got these run times for 8 and 5 frames:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>time sudo ./lll_lock_wait.bt 2>/dev/null >/tmp/lll_lock_perf_stacks1.txt</b><br /><br />real 9m39.195s<br />user 8m9.188s<br />sys 1m8.430s<br /><br />[openxs@fc31 ~]$ <b>ls -l /tmp/lll_lock_perf_stacks1.txt</b><br />-rw-rw-r--. 1 openxs openxs 1085019 Jan 30 12:12 /tmp/lll_lock_perf_stacks1.txt<br /><br />[openxs@fc31 ~]$ <b>time sudo ./lll_lock_wait.bt 2>/dev/null >/tmp/lll_lock_perf_stacks2.txt</b><br /><br />real 6m6.801s<br />user 5m5.537s<br />sys 0m42.226s<br /></span></span></p></blockquote><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>ls -l /tmp/lll_lock_perf_stacks2.txt</b><br />-rw-rw-r--.
1 openxs openxs 728072 Jan 30 12:19 /tmp/lll_lock_perf_stacks2.txt</span></span><br /></p></blockquote><p>So, reducing the stack to 5 frames made a notable impact, and the QPS drop was a bit smaller, but still, waiting for 5 minutes is not an option in general.</p><p>I had not found functions powerful enough to replace <b>awk</b> and collapse stacks inside the <b>bpftrace</b> program, so I decided to export them "raw", on every hit of the <b>uretprobe</b>, and then process them with Linux tools as usual. I'll skip the steps, tests and failures and just show the resulting new program for the task:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>cat lll_lock_wait2.bt</b><br />#!/usr/bin/env bpftrace<br /><br />BEGIN<br />{<br />/* printf("Tracing time from __lll_lock_wait, Ctrl-C to stop\n"); */<br />}<br /><br /><b>interval:s:$1 { exit(); }<br /></b><br />uprobe:/lib64/libpthread.so.0:__lll_lock_wait <br />/comm == "mariadbd"/ <br />{ <br /> @start[tid] = nsecs;<br /> @tidstack[tid] = ustack(perf); <br />}<br /><br />uretprobe:/lib64/libpthread.so.0:__lll_lock_wait <br />/comm == "mariadbd" && @start[tid] != 0/ <br />{<br /> $now = nsecs;<br /> $time = $now - @start[tid];<br /> @futexstack[@tidstack[tid]] += $time;<br /><b> print(@futexstack);<br /> delete(@futexstack[@tidstack[tid]]);<br /></b>/*<br /> printf("Thread: %u, time: %d\n", tid, $time);<br />*/<br /> delete(@start[tid]);<br /> delete(@tidstack[tid]);<br />}<br /><br />END<br />{<br /> clear(@start);<br /> clear(@tidstack);<br /><b> clear(@futexstack); <br /></b>}</span></span><br /></p></blockquote><p>Changes vs the previous version are highlighted. Basically, I print and then delete every collected stack, along with its time, as the function call ends, to sum them up externally later.
The function used is the following:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">print(@x[, top [, div]]) - Print the map, optionally the top entries only and with a divisor</span></span><br /></p></blockquote><p><span style="font-family: georgia;">I've also added the <b>interval</b> probe at the beginning, referring to the first program argument as the number of seconds to run. It just forces an exit after N seconds (0 by default means an immediate exit; a wrong format will be reported as an error).</span></p><p><span style="font-family: georgia;">Now, if I run it while the test is running, this way:</span></p><p></p><blockquote><span style="font-family: georgia;"><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>time sudo ./lll_lock_wait2.bt 10 2>/dev/null | awk '<br />BEGIN { s = ""; }<br />/^@futexstack\[\]/ { s = ""; }<br />/^@futexstack/ { s = ""; }<br />/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s ";" targ } else { s = targ } }<br />/^]/ { print $2, s }<br />' > /tmp/collapsed_lll_lock_v2_raw.txt</b><br /><br />real 0m10.646s<br />user 0m5.648s<br />sys 0m0.843s<br />[openxs@fc31 ~]$ <b>ls -l /tmp/collapsed_lll_lock_v2_raw.txt<br /></b>-rw-rw-r--.
1 openxs openxs 1566 Jan 30 13:12 /tmp/collapsed_lll_lock_v2_raw.txt<br />[openxs@fc31 ~]$ <b>cat /tmp/collapsed_lll_lock_v2_raw.txt</b><br />15531 __lll_lock_wait;<br />1233833 __lll_lock_wait;<br />10638 __lll_lock_wait;buf_read_page_background;btr_cur_prefetch_siblings;btr_cur_optimistic_delete_func;row_purge_remove_sec_if_poss_leaf;row_purge_record_func;row_purge_step;que_run_threads;purge_worker_callback;tpool::task_group::execute;tpool::thread_pool_generic::worker_main;;;;<br />7170 __lll_lock_wait;tpool::thread_pool_generic::worker_main;;;;<br />273330 __lll_lock_wait;tdc_acquire_share;open_table;open_tables;open_and_lock_tables;execute_sqlcom_select;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread<br />1193083 __lll_lock_wait;trx_undo_report_row_operation;btr_cur_update_in_place;btr_cur_optimistic_update;row_upd_clust_rec;row_upd_clust_step;row_upd_step;row_update_for_mysql;ha_innobase::update_row;handler::ha_update_row;mysql_update;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread<br />183353 __lll_lock_wait;<br />43231 __lll_lock_wait;MDL_lock::remove_ticket;MDL_context::release_lock;ha_commit_trans;trans_commit;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread</span></span><br /></span></blockquote><p></p><p><span style="font-family: georgia;">I get a very condensed output almost immediately! 
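To see what the awk script above does to each stack, here is a self-contained demo on a tiny hand-made sample of the print(@futexstack) output (the addresses and the MDL frame below are made up for illustration):

```shell
# Collapse a sample of bpftrace's perf-format stack output into the
# flamegraph.pl "folded" format: "frame1;frame2;... nanoseconds".
# The sample input is made up; real input comes from lll_lock_wait2.bt.
printf '@futexstack[\n\t7fadbd0bd5e0 __lll_lock_wait+0 (/usr/lib64/libpthread-2.30.so)\n\t55a71e6dd31c MDL_lock::remove_ticket(LF_PINS*)+60 (/path/to/mariadbd)\n]: 4958\n' > /tmp/sample_stacks.txt
awk '
BEGIN { s = ""; }
/^@futexstack/ { s = ""; }              # a new map entry starts: reset stack
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s ";" targ } else { s = targ } }
/^]/ { print $2, s }                    # "]: NNNN" line: print time and stack
' /tmp/sample_stacks.txt > /tmp/collapsed_sample.txt
cat /tmp/collapsed_sample.txt
# prints: 4958 __lll_lock_wait;MDL_lock::remove_ticket
```

Each frame line is trimmed at the first "(" or "+", so only the function name survives; the "]: 4958" line that closes a map entry carries the accumulated nanoseconds.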
I can afford to run it for 60 seconds without much trouble:</span></p><p></p><blockquote><span style="font-family: georgia;"><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>time sudo ./lll_lock_wait2.bt 60 2>/dev/null | awk '<br />BEGIN { s = ""; }<br />/^@futexstack\[\]/ { s = ""; }<br />/^@futexstack/ { s = ""; }<br />/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s ";" targ } else { s = targ } }<br />/^]/ { print $2, s }<br />' > /tmp/collapsed_lll_lock_v2_raw.txt</b><br /><br />real 1m0.959s<br />user 0m44.146s<br />sys 0m6.126s<br /><br /><br />[openxs@fc31 ~]$ <b>ls -l /tmp/collapsed_lll_lock_v2_raw.txt</b><br />-rw-rw-r--. 1 openxs openxs 12128 Jan 30 13:17 /tmp/collapsed_lll_lock_v2_raw.txt</span></span><br /></span></blockquote><p></p><p><span style="font-family: georgia;">The impact of these 60 seconds is visible in the <b>sysbench</b> output:</span><br /></p><p></p><blockquote><span style="font-family: georgia;"><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 maria10.5]$ <b>sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --threads=32 --report-interval=10 --time=300 run</b><br />sysbench 1.1.0-174f3aa (using bundled LuaJIT 2.1.0-beta2)<br /><br />Running the test with following options:<br />Number of threads: 32<br />Report intermediate results every 10 second(s)<br />Initializing random number generator from current time<br /><br /><br />Initializing worker threads...<br /><br />Threads started!<br /><br />[ 10s ] thds: 32 tps: 625.68 qps: 12567.66 (r/w/o: 8800.49/2512.81/1254.36) lat (ms,95%): 282.25 err/s: 0.00 reconn/s: 0.00<br />[ 20s ] thds: 32 tps: 891.51 qps: 17825.69 (r/w/o: 12479.31/3563.26/1783.13) lat (ms,95%): 142.39 err/s: 0.00 reconn/s: 0.00<br />[ 30s ] thds: 32 tps: 512.09
qps: 10230.44 (r/w/o: 7163.62/2042.55/1024.27) lat (ms,95%): 287.38 err/s: 0.00 reconn/s: 0.00<br /><b>[ 40s ] thds: 32 tps: 369.10 qps: 7390.48 (r/w/o: 5171.39/1480.90/738.20) lat (ms,95%): 350.33 err/s: 0.00 reconn/s: 0.00<br />[ 50s ] thds: 32 tps: 417.83 qps: 8347.73 (r/w/o: 5845.14/1667.03/835.56) lat (ms,95%): 390.30 err/s: 0.00 reconn/s: 0.00<br />[ 60s ] thds: 32 tps: 484.03 qps: 9687.55 (r/w/o: 6782.09/1937.31/968.16) lat (ms,95%): 292.60 err/s: 0.00 reconn/s: 0.00<br />[ 70s ] thds: 32 tps: 431.98 qps: 8640.35 (r/w/o: 6049.38/1727.01/863.95) lat (ms,95%): 344.08 err/s: 0.00 reconn/s: 0.00<br />[ 80s ] thds: 32 tps: 354.60 qps: 7097.39 (r/w/o: 4968.86/1419.32/709.21) lat (ms,95%): 419.45 err/s: 0.00 reconn/s: 0.00<br />[ 90s ] thds: 32 tps: 380.98 qps: 7600.82 (r/w/o: 5317.17/1521.80/761.85) lat (ms,95%): 520.62 err/s: 0.00 reconn/s: 0.00<br />[ 100s ] thds: 32 tps: 423.01 qps: 8467.99 (r/w/o: 5928.71/1693.16/846.13) lat (ms,95%): 397.39 err/s: 0.00 reconn/s: 0.00<br />[ 110s ] thds: 32 tps: 475.66 qps: 9525.07 (r/w/o: 6669.22/1904.53/951.32) lat (ms,95%): 344.08 err/s: 0.00 reconn/s: 0.00<br />[ 120s ] thds: 32 tps: 409.13 qps: 8171.48 (r/w/o: 5718.38/1634.94/818.17) lat (ms,95%): 458.96 err/s: 0.00 reconn/s: 0.00<br />[ 130s ] thds: 32 tps: 190.91 qps: 3826.72 (r/w/o: 2682.19/762.62/381.91) lat (ms,95%): 450.77 err/s: 0.00 reconn/s: 0.00<br /></b>[ 140s ] thds: 32 tps: 611.40 qps: 12229.99 (r/w/o: 8558.99/2448.20/1222.80) lat (ms,95%): 223.34 err/s: 0.00 reconn/s: 0.00<br />[ 150s ] thds: 32 tps: 581.99 qps: 11639.19 (r/w/o: 8148.53/2326.68/1163.99) lat (ms,95%): 287.38 err/s: 0.00 reconn/s: 0.00<br />[ 160s ] thds: 32 tps: 653.21 qps: 13058.51 (r/w/o: 9139.57/2612.52/1306.41) lat (ms,95%): 257.95 err/s: 0.00 reconn/s: 0.00<br />[ 170s ] thds: 32 tps: 561.87 qps: 11231.27 (r/w/o: 7860.53/2246.99/1123.75) lat (ms,95%): 320.17 err/s: 0.00 reconn/s: 0.00<br />[ 180s ] thds: 32 tps: 625.66 qps: 12526.32 (r/w/o: 8770.19/2504.82/1251.31) lat (ms,95%): 235.74 
err/s: 0.00 reconn/s: 0.00<br />[ 190s ] thds: 32 tps: 554.50 qps: 11088.01 (r/w/o: 7760.61/2218.40/1109.00) lat (ms,95%): 325.98 err/s: 0.00 reconn/s: 0.00<br />[ 200s ] thds: 32 tps: 607.10 qps: 12143.89 (r/w/o: 8501.49/2428.20/1214.20) lat (ms,95%): 344.08 err/s: 0.00 reconn/s: 0.00<br />[ 210s ] thds: 32 tps: 424.17 qps: 8467.63 (r/w/o: 5925.43/1693.87/848.33) lat (ms,95%): 397.39 err/s: 0.00 reconn/s: 0.00<br />[ 220s ] thds: 32 tps: 466.80 qps: 9335.03 (r/w/o: 6533.32/1868.31/933.40) lat (ms,95%): 397.39 err/s: 0.00 reconn/s: 0.00<br />[ 230s ] thds: 32 tps: 365.83 qps: 7330.56 (r/w/o: 5133.49/1465.21/731.86) lat (ms,95%): 475.79 err/s: 0.00 reconn/s: 0.00<br />[ 240s ] thds: 32 tps: 411.27 qps: 8218.45 (r/w/o: 5754.25/1641.87/822.34) lat (ms,95%): 467.30 err/s: 0.00 reconn/s: 0.00<br />[ 250s ] thds: 32 tps: 127.10 qps: 2534.71 (r/w/o: 1772.60/507.70/254.40) lat (ms,95%): 450.77 err/s: 0.00 reconn/s: 0.00<br />[ 260s ] thds: 32 tps: 642.35 qps: 12856.29 (r/w/o: 8999.09/2572.50/1284.70) lat (ms,95%): 282.25 err/s: 0.00 reconn/s: 0.00<br />[ 270s ] thds: 32 tps: 603.80 qps: 12078.79 (r/w/o: 8456.89/2414.30/1207.60) lat (ms,95%): 314.45 err/s: 0.00 reconn/s: 0.00<br />[ 280s ] thds: 32 tps: 642.70 qps: 12857.60 (r/w/o: 9001.40/2570.80/1285.40) lat (ms,95%): 257.95 err/s: 0.00 reconn/s: 0.00<br />[ 290s ] thds: 32 tps: 716.57 qps: 14325.60 (r/w/o: 10026.98/2865.48/1433.14) lat (ms,95%): 144.97 err/s: 0.00 reconn/s: 0.00<br />[ 300s ] thds: 32 tps: 611.16 qps: 12219.69 (r/w/o: 8551.80/2445.66/1222.23) lat (ms,95%): 292.60 err/s: 0.00 reconn/s: 0.00<br />SQL statistics:<br /> queries performed:<br /> read: 2124836<br /> write: 607096<br /> other: 303548<br /> total: 3035480<br /> <b>transactions: 151774 (505.83 per sec.)<br /> queries: 3035480 (10116.62 per sec.)</b><br /> ignored errors: 0 (0.00 per sec.)<br /> reconnects: 0 (0.00 per sec.)<br /><br />Throughput:<br /> events/s (eps): 505.8309<br /> time elapsed: 300.0489s<br /> total number of events: 151774<br 
/><br />Latency (ms):<br /> min: 2.17<br /> avg: 63.25<br /> max: 8775.26<br /> 95th percentile: 331.91<br /> sum: 9599021.34<br /><br />Threads fairness:<br /> events (avg/stddev): 4742.9375/42.76<br /> execution time (avg/stddev): 299.9694/0.03</span></span> <br /></span></blockquote><p></p><p><span style="font-family: georgia;">but it continues only for some time after the end of monitoring, and the overall QPS is not much affected.<br /></span></p><p><span style="font-family: georgia;">Now, what about the top 5 stacks by wait time? Here they are (well, I did NOT care to sum up several entries with the same stack, but you should know how to do this with <b>awk</b>, and <b>flamegraph.pl</b> will do it for us later):</span></p><p></p><blockquote><span style="font-family: georgia;"><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>cat /tmp/collapsed_lll_lock_v2_raw.txt | sort -r -n -k 1,1 | head -5</b><br /><b>340360731 __lll_lock_wait;btr_cur_optimistic_insert;row_ins_sec_index_entry_low;row_ins_sec_index_entry;row_ins_step;row_insert_for_mysql;ha_innobase::write_row;handler::ha_write_row;write_record;mysql_insert;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread<br /></b>335903154 __lll_lock_wait;btr_cur_optimistic_insert;row_ins_sec_index_entry_low;row_ins_sec_index_entry;row_upd_sec_index_entry;row_upd_step;row_update_for_mysql;ha_innobase::update_row;handler::ha_update_row;mysql_update;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread<br />287819974
__lll_lock_wait;btr_cur_optimistic_insert;row_ins_sec_index_entry_low;row_ins_sec_index_entry;row_ins_step;row_insert_for_mysql;ha_innobase::write_row;handler::ha_write_row;write_record;mysql_insert;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread<br />281374237 __lll_lock_wait;btr_cur_optimistic_insert;row_ins_sec_index_entry_low;row_ins_sec_index_entry;row_ins_step;row_insert_for_mysql;ha_innobase::write_row;handler::ha_write_row;write_record;mysql_insert;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread<br />268817930 __lll_lock_wait;buf_page_init_for_read;buf_read_page_background;btr_cur_prefetch_siblings;btr_cur_optimistic_delete_func;row_purge_remove_sec_if_poss_leaf;row_purge_record_func;row_purge_step;que_run_threads;purge_worker_callback;tpool::task_group::execute;tpool::thread_pool_generic::worker_main;;;;</span></span><br /></span></blockquote><p></p><p><span style="font-family: georgia;">A lot of contention happens on INSERTing rows. I'd say it's expected! 
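As a side note, summing up the repeated identical stacks mentioned earlier (before sorting) is itself a one-liner in awk; the folded input lines below are made up just to show the idea:

```shell
# Sum the nanoseconds per identical collapsed stack, so that each stack
# appears only once in the top-N list. The input lines are made up for
# illustration; the real input is /tmp/collapsed_lll_lock_v2_raw.txt.
printf '100 a;b;c\n200 a;b;c\n50 x;y\n' |
awk '{ sum[$2] += $1 } END { for (s in sum) print sum[s], s }' |
sort -r -n -k 1,1
# prints:
# 300 a;b;c
# 50 x;y
```

flamegraph.pl performs the same aggregation internally, so skipping this step only affects the hand-sorted top-N listing, not the flame graph.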
Now, to create the flame graph, we just have to re-order the output a bit:</span></p><p></p><blockquote><span style="font-family: georgia;"><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>cat /tmp/collapsed_lll_lock_v2_raw.txt | awk '{ if (length($2) > 0) {print $2, $1} }' | /mnt/home/openxs/git/FlameGraph/flamegraph.pl --title="Time spent in ___lll_lock_wait in MariaDB 10.5, all frames" --countname=nsecs > ~/Documents/lll_lock_v2_2.svg</b></span></span><br /></span></blockquote><p></p><p><span style="font-family: georgia;">Here it is:</span></p><p><span style="font-family: georgia;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjkEUdo3xbjBXmt4BvW6pEy0g9zE5T0ddqOCwzf_OpN9YuMGWCtORJ-_X5zQfzj7pyYELF4aowwuMcg82agzxjrOZnU2n2sR8CV9Cpr-_TGpM-vDXJFMc-ou9T0uGp86gfk202_BLwNAceh/s1198/lll_lock_v2_2.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="420" data-original-width="1198" height="224" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjkEUdo3xbjBXmt4BvW6pEy0g9zE5T0ddqOCwzf_OpN9YuMGWCtORJ-_X5zQfzj7pyYELF4aowwuMcg82agzxjrOZnU2n2sR8CV9Cpr-_TGpM-vDXJFMc-ou9T0uGp86gfk202_BLwNAceh/w640-h224/lll_lock_v2_2.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;"><span style="font-family: georgia;">Most mutex waits for <b>oltp_read_write</b> test happen on inserting rows...</span></td></tr></tbody></table><br />Monitoring over 60 seconds allowed me to see a more realistic picture of contention.
It mostly happens on inserting rows by 32 threads into just 5 tables on a system with only 4 cores.<br /></span></p><p><span style="font-family: georgia;">Now it seems we have a tool that can be used more than once in a lifetime, and it seems to provide useful information, fast.<br /></span></p><p style="text-align: center;">* * *</p><p>To summarize:</p><ol style="text-align: left;"><li>One should not even try to summarize all stacks in a single associative array for any function that is called too often in many different contexts <i>inside</i> the <b>bpftrace</b> program! There is a limit on the number of items in the map that you may hit, and the impact of storing and exporting all this is too high for monitoring more than a couple of seconds.</li><li>Cleaning up stack traces should be done externally, by the usual Linux text processing tools that produce a smaller summary output. The <b>bpftrace</b> program should NOT be designed to collect a lot of output for a long time and then print it all at the end. It looks like smaller chunks exported to user space regularly are a better approach.</li><li>We can use simple command-line arguments for the program that can be literally substituted as parts of probe definitions. The next step would be to make a more generic tool that takes the binary and function name to probe as command-line arguments.<br /></li><li><b>bpftrace</b> is cool!<br /></li></ol>Valerii Kravchukhttp://www.blogger.com/profile/13158916419325454260noreply@blogger.com2tag:blogger.com,1999:blog-3080615211468083537.post-92159888947216093982021-01-28T22:29:00.001+02:002021-01-30T18:00:26.247+02:00Playing with recent bpftrace and MariaDB 10.5 on Fedora - Part IV, Creating a New Tool for Tracing Time Spent in __lll_lock_wait<p>So far in this series I am just trying to find out what can be done with <b>bpftrace</b> and how to get interesting insights or implement something I was asked about.
I am not sharing "best practices" (mostly worst ones, probably, when they are my own inventions) or any final answers on what to trace given a specific MariaDB performance problem. This is still work in progress, and at a really early stage.<br /></p><p>In the comments for my <a href="https://mysqlentomologist.blogspot.com/2021/01/playing-with-recent-bpftrace-and_27.html" target="_blank">previous post</a> you can see that I probably measured the wrong thing if the goal is to find the mutex that causes most of the contention. <a href="https://mysqlentomologist.blogspot.com/2021/01/playing-with-recent-bpftrace-and_27.html?showComment=1611827063628#c4103326712659513483" target="_blank">One of the comments</a> suggested a different tracing idea:</p><blockquote><p><i>"</i><i>... indexed by the *callstack*, with the goal to find the place where waits happen. I think
<b>__lll_lock_wait</b> could be a better indicator for contention, measuring
uncontended pthread_mutex_locks could give some false alarms."</i> <br /></p></blockquote><p>Today I had an hour or so again near my Fedora box with <b>bpftrace</b>, so I tried to modify my previous tool to store the time spent in the <b>__lll_lock_wait</b> function, and to sum up (in the same inefficient way so far, as I've had no chance yet to test and find anything more suitable for production use) these times per unique stack that led to the <b>__lll_lock_wait</b> call.</p><p>To remind you, <b>__lll_lock_wait()</b> is a low-level wrapper around the Linux futex system call. The prototype for this function is:</p><p></p><blockquote><span style="font-size: x-small;"><span style="font-family: courier;">void __lll_lock_wait (int *futex, int private)</span></span></blockquote>It also comes from the <b>libpthread.so</b> library:<p></p><p></p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>strings /lib64/libpthread.so.0 | grep '__lll_lock_wait'</b><br />__lll_lock_wait<br />__lll_lock_wait_private</span></span><br /></p></blockquote><p>So, my code will not require many modifications.
Basic quick code to prove the concept is as simple as this:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>cat lll_lock_wait.bt</b><br />#!/usr/bin/env bpftrace<br /><br />BEGIN<br />{<br /> printf("Tracing time from __lll_lock_wait, Ctrl-C to stop\n");<br />}<br /><br /><b>uprobe:/lib64/libpthread.so.0:__lll_lock_wait<br />/comm == "mariadbd"/</b><br />{<br /> @start[tid] = nsecs;<br /> @tidstack[tid] = ustack(perf);<br />}<br /><br /><b>uretprobe:/lib64/libpthread.so.0:__lll_lock_wait<br />/comm == "mariadbd" && @start[tid] != 0/</b><br />{<br /> $now = nsecs;<br /> $time = $now - @start[tid];<br /> @futexstack[@tidstack[tid]] += $time;<br /><br /> printf("Thread: %u, time: %d\n", tid, $time);<br /><br /> delete(@start[tid]);<br /> delete(@tidstack[tid]);<br />}<br /><br />END<br />{<br /> clear(@start);<br /> clear(@tidstack);<br /> clear(@futexstack);<br />}</span></span><br /></p></blockquote><p>The main difference is that to measure the time spent inside a single function I need both a <b>uprobe</b> and a <b>uretprobe</b> on it. Function call arguments are NOT available in the <b>uretprobe</b>.</p><p>When I made this .bt file executable and tried to run it with the MariaDB 10.5 server started but not under any load, I did NOT get any output from the <b>uretprobe</b> - no wonder, there is no contention! With the <b>sysbench</b> test started the result was different, a flood of outputs:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>sudo ./lll_lock_wait.bt 2>/dev/null</b><br />...<br />Thread: 7488, time: 70520<br />Thread: 7494, time: 73351<br />Thread: 5790, time: 106635<br />Thread: 5790, time: 10008<br />Thread: 7484, time: 87016<br />Thread: 5790, time: 18723<br /><b>^C</b></span></span><br /></p></blockquote><p>So, the program works to some extent and reports some times per thread (but I have no mutex address in the <b>uretprobe</b>).
So, I modified it to remove prints and keep the stack associative array for the output in the <b>END</b> probe:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">#!/usr/bin/env bpftrace<br /><br />BEGIN<br />{<br />/* printf("Tracing time from __lll_lock_wait, Ctrl-C to stop\n"); */<br />}<br /><br />uprobe:/lib64/libpthread.so.0:__lll_lock_wait <br />/comm == "mariadbd"/ <br />{ <br /> @start[tid] = nsecs;<br /> @tidstack[tid] = ustack(perf); <br />}<br /><br />uretprobe:/lib64/libpthread.so.0:__lll_lock_wait <br />/comm == "mariadbd" && @start[tid] != 0/ <br />{<br /> $now = nsecs;<br /> $time = $now - @start[tid];<br /> @futexstack[@tidstack[tid]] += $time;<br />/*<br /> printf("Thread: %u, time: %d\n", tid, $time);<br />*/<br /> delete(@start[tid]);<br /> delete(@tidstack[tid]);<br />}<br /><br />END<br />{<br /> clear(@start);<br /> clear(@tidstack);<br />/* clear(@futexstack); */<br />}</span></span></p></blockquote><p>Then I started usual <b>sysbench</b> test for this series:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 maria10.5]$ <b>sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --threads=32 --report-interval=10 --time=300 run</b><br />sysbench 1.1.0-174f3aa (using bundled LuaJIT 2.1.0-beta2)<br /><br />Running the test with following options:<br />Number of threads: 32<br />Report intermediate results every 10 second(s)<br />Initializing random number generator from current time<br /><br /><br />Initializing worker threads...<br /><br />Threads started!<br /><br />[ 10s ] thds: 32 tps: 516.06 qps: 10362.55 (r/w/o: 7260.90/2066.43/1035.22) lat (ms,95%): 390.30 err/s: 0.00 reconn/s: 0.00<br />[ 20s ] thds: 32 tps: 550.82 qps: 11025.98 (r/w/o: 7718.97/2205.28/1101.74) lat (ms,95%): 320.17 err/s: 0.00 reconn/s: 0.00</span></span><br /></p></blockquote><p>and at this 
moment started my program and kept it running for more than 10 but less than 20 seconds:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ sudo ./lll_lock_wait.bt 2>/dev/null >/tmp/lll_lock_perf_stacks.txt<br />^C</span></span> <br /></p></blockquote><p>I got an immediate drop in QPS and it continued till the end of the 300-second test: <br /></p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;"><b>[ 30s ] thds: 32 tps: 397.67 qps: 7950.29 (r/w/o: 5565.37/1589.58/795.34) lat (ms,95%): 434.83 err/s: 0.00 reconn/s: 0.00<br />[ 40s ] thds: 32 tps: 439.87 qps: 8787.15 (r/w/o: 6149.45/1757.97/879.74) lat (ms,95%): 397.39 err/s: 0.00 reconn/s: 0.00<br /></b>...<br /><br /><b> transactions: 101900 (339.38 per sec.)<br /> queries: 2038016 (6787.74 per sec.)</b><br /> ignored errors: 1 (0.00 per sec.)<br /> reconnects: 0 (0.00 per sec.)<br />...</span></span><br /></p></blockquote><p>I had to wait some more and ended up with this big output:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 maria10.5]$ <b>ls -l /tmp/lll_lock_perf_stacks.txt</b><br />-rw-rw-r--. 
1 openxs openxs 1347279 Jan 28 14:29 /tmp/lll_lock_perf_stacks.txt<br /><br />[openxs@fc31 maria10.5]$ <b>more /tmp/lll_lock_perf_stacks.txt</b><br />Attaching 4 probes...<br /><br /><br />@futexstack[<br /> 7fadbd0bd5e0 __lll_lock_wait+0 (/usr/lib64/libpthread-2.30.so)<br /> 14e5f 0x14e5f ([unknown])<br />]: 4554<br /><br />...<br />@futexstack[<br /> 7fadbd0bd5e0 __lll_lock_wait+0 (/usr/lib64/libpthread-2.30.so)<br /> 55a71e6dd31c MDL_lock::remove_ticket(LF_PINS*, MDL_lock::Ticket_list MDL<br />_lock::*, MDL_ticket*)+60 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 55a71e6ddca5 MDL_context::release_lock(enum_mdl_duration, MDL_ticket*)+3<br />7 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 55a71ebc5ef4 row_purge_step(que_thr_t*)+388 (/home/openxs/dbs/maria10.5/<br />bin/mariadbd)<br /> 55a71eb874b8 que_run_threads(que_thr_t*)+2264 (/home/openxs/dbs/maria10.<br />5/bin/mariadbd)<br /> 55a71ebe5773 purge_worker_callback(void*)+355 (/home/openxs/dbs/maria10.<br />5/bin/mariadbd)<br /> 55a71ed2b99a tpool::task_group::execute(tpool::task*)+170 (/home/openxs/<br />dbs/maria10.5/bin/mariadbd)<br /> 55a71ed2a7cf tpool::thread_pool_generic::worker_main(tpool::worker_data*<br />)+79 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 7fadbcac43d4 0x7fadbcac43d4 ([unknown])<br /> 55a721f10f00 0x55a721f10f00 ([unknown])<br /> 55a71ed2ab60 std::thread::_State_impl<std::thread::_Invoker<std::tuple<v<br />oid (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_gen<br />eric*, tpool::worker_data*> > >::~_State_impl()+0 (/home/openxs/dbs/maria10.5/bi<br />n/mariadbd)<br /> 2de907894810c083 0x2de907894810c083 ([unknown])<br />]: 4958<br />...<br /><br />@futexstack[<br /> 7fadbd0bd5e0 __lll_lock_wait+0 (/usr/lib64/libpthread-2.30.so)<br /> 55a71eba3e20 row_ins_clust_index_entry_low(unsigned long, unsigned long,<br /> dict_index_t*, unsigned long, dtuple_t*, unsigned long, que_thr_t*)+4144 (/home<br />/openxs/dbs/maria10.5/bin/mariadbd)<br /> 55a71eba4a36 
row_ins_clust_index_entry(dict_index_t*, dtuple_t*, que_thr<br />_t*, unsigned long)+198 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 55a71eba52b4 row_ins_step(que_thr_t*)+1956 (/home/openxs/dbs/maria10.5/b<br />in/mariadbd)<br /> 55a71ebb5ea1 row_insert_for_mysql(unsigned char const*, row_prebuilt_t*,<br /> ins_mode_t)+865 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 55a71eafadc1 ha_innobase::write_row(unsigned char const*)+177 (/home/ope<br />nxs/dbs/maria10.5/bin/mariadbd)<br /> 55a71e7e5e10 handler::ha_write_row(unsigned char const*)+464 (/home/open<br />xs/dbs/maria10.5/bin/mariadbd)<br /> 55a71e598b8d write_record(THD*, TABLE*, st_copy_info*, select_result*)+4<br />77 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 55a71e59f6d7 mysql_insert(THD*, TABLE_LIST*, List<Item>&, List<List<Item<br />> >&, List<Item>&, List<Item>&, enum_duplicates, bool, select_result*)+2967 (/ho<br />me/openxs/dbs/maria10.5/bin/mariadbd)<br /> 55a71e5d894a mysql_execute_command(THD*)+7722 (/home/openxs/dbs/maria10.<br />5/bin/mariadbd)<br /> 55a71e5ec945 Prepared_statement::execute(String*, bool)+981 (/home/openx<br />s/dbs/maria10.5/bin/mariadbd)<br /> 55a71e5ecbd5 Prepared_statement::execute_loop(String*, bool, unsigned ch<br />ar*, unsigned char*)+133 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 55a71e5ed8f5 mysql_stmt_execute_common(THD*, unsigned long, unsigned cha<br />r*, unsigned char*, unsigned long, bool, bool)+549 (/home/openxs/dbs/maria10.5/b<br />in/mariadbd)<br /> 55a71e5edb3c mysqld_stmt_execute(THD*, char*, unsigned int)+44 (/home/op<br />enxs/dbs/maria10.5/bin/mariadbd)<br /> 55a71e5d4ba6 dispatch_command(enum_server_command, THD*, char*, unsigned<br /> int, bool, bool)+9302 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 55a71e5d5f12 do_command(THD*)+274 (/home/openxs/dbs/maria10.5/bin/mariad<br />bd)<br /> 55a71e6d3bf1 do_handle_one_connection(CONNECT*, bool)+1025 (/home/openxs<br />/dbs/maria10.5/bin/mariadbd)<br /> 55a71e6d406d handle_one_connection+93 
(/home/openxs/dbs/maria10.5/bin/ma<br />riadbd)<br /> 55a71ea4abf2 pfs_spawn_thread+322 (/home/openxs/dbs/maria10.5/bin/mariad<br />bd)<br /> 7fadbd0b34e2 start_thread+226 (/usr/lib64/libpthread-2.30.so)<br />]: 5427<br /><br />...</span></span><br /></p></blockquote><p>Some stacks surely look reasonable, so I continued with collapsing them into a simpler form:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 maria10.5]$ <b>cat /tmp/lll_lock_perf_stacks.txt | awk '<br />BEGIN { s = ""; }<br />/^@futexstack\[\]/ { s = ""; }<br />/^@futexstack/ { s = ""; }<br />/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s ";" targ } else { s = targ } }<br />/^]/ { print $2, s }<br />' | more<br /></b><br />...<br />5380 __lll_lock_wait;<br />5406 __lll_lock_wait;<br />5421 __lll_lock_wait;<br />5427 __lll_lock_wait;row_ins_clust_index_entry_low;row_ins_clust_index_entry;row<br />_ins_step;row_insert_for_mysql;ha_innobase::write_row;handler::ha_write_row;writ<br />e_record;mysql_insert;mysql_execute_command;Prepared_statement::execute;Prepared<br />_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_<br />command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thre<br />ad;start_thread<br />5436 __lll_lock_wait;row_upd_clust_step;row_upd_step;row_update_for_mysql;ha_inn<br />obase::delete_row;handler::ha_delete_row;mysql_delete;mysql_execute_command;Prep<br />ared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_comm<br />on;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;hand<br />le_one_connection;pfs_spawn_thread;start_thread<br />5443 __lll_lock_wait;<br />5445 __lll_lock_wait;purge_worker_callback;tpool::task_group::execute;tpool::thr<br />ead_pool_generic::worker_main;;;;<br />5502 __lll_lock_wait;<br />...</span></span><br
/></p></blockquote><p>The <b>awk</b> code is a bit different this time. I know that for flame graphs I need function calls separated by '<b>;</b>', so I am doing it immediately. Non-resolved stack traces are all removed etc. Now I have to sort this to find top N:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 maria10.5]$ <b>cat /tmp/lll_lock_perf_stacks.txt | awk '<br />BEGIN { s = ""; }<br />/^@futexstack\[\]/ { s = ""; }<br />/^@futexstack/ { s = ""; }<br />/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s ";" targ } else { s = targ } }<br />/^]/ { print $2, s }<br />' | sort -r -n -k 1,1 > /tmp/collapsed_llll_lock.txt</b></span></span><b><br /></b></p></blockquote><p>and then: <br /></p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 maria10.5]$ <b>cat /tmp/collapsed_llll_lock.txt | awk '{ if (length($2) > 0) {print} }' | head -5</b><br />28276454 __lll_lock_wait;timer_handler;start_thread<br />35842 __lll_lock_wait;MDL_map::find_or_insert;MDL_context::try_acquire_lock_impl;MDL_context::acquire_lock;open_table;open_tables;open_and_lock_tables;mysql_delete;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;;;;;;<br />35746 __lll_lock_wait;MDL_lock::remove_ticket;MDL_context::release_lock;MDL_context::release_locks_stored_before;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread<br />35675 
__lll_lock_wait;row_upd_clust_step;row_upd_step;row_update_for_mysql;ha_innobase::delete_row;handler::ha_delete_row;mysql_delete;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread<br />35563 __lll_lock_wait;row_upd_sec_index_entry;row_upd_step;row_update_for_mysql;ha_innobase::update_row;handler::ha_update_row;mysql_update;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread<br /></span></span></p></blockquote><p>So, the above are the top 5 fully resolved stack traces where the longest time was spent inside the <b>__lll_lock_wait</b> function. Now let's create the flame graph. We need this output, stack first and nanoseconds spent in it next, space separated, piped for processing by the same <b>flamegraph.pl</b> program as before: <br /></p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 maria10.5]$ <b>cat /tmp/collapsed_llll_lock.txt | awk '{ if (length($2) > 0) {print $2, $1} }' | head -3</b><br />__lll_lock_wait;timer_handler;start_thread 28276454<br />__lll_lock_wait;MDL_map::find_or_insert;MDL_context::try_acquire_lock_impl;MDL_context::acquire_lock;open_table;open_tables;open_and_lock_tables;mysql_delete;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;;;;;; 35842<br
/>__lll_lock_wait;MDL_lock::remove_ticket;MDL_context::release_lock;MDL_context::release_locks_stored_before;mysql_execute_command;Prepared_statement::execute;Prepared_statement::execute_loop;mysql_stmt_execute_common;mysqld_stmt_execute;dispatch_command;do_command;do_handle_one_connection;handle_one_connection;pfs_spawn_thread;start_thread 35746</span></span><br /></p></blockquote><p>Now the flame graph was created as follows:</p><blockquote><p><span style="font-size: x-small;"><span style="font-family: courier;">cat /tmp/collapsed_llll_lock.txt | awk '{ if (length($2) > 0) {print $2, $1} }' | /mnt/home/openxs/git/FlameGraph/flamegraph.pl --title="Time spent in ___lll_lock_wait in MariaDB 10.5" --countname=nsecs > ~/Documents/lll_lock.svg</span></span><br /></p></blockquote><p>The resulting flame graph is presented below:</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiU7dWs72Mcro8LCAOe9PyP9TD6nKArLzlEXiIhA6c4DAKHGQsDKf1EtKbVvADibKS-9MzaL8ePb2vv8F2TMP5wKRi_V9L5Gnapyg5XY8M9psaGVUl758iwdq8f9fPJq2Wgq15fHQ8nJLNN/s1191/lll_lock_wait.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="629" data-original-width="1191" height="338" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiU7dWs72Mcro8LCAOe9PyP9TD6nKArLzlEXiIhA6c4DAKHGQsDKf1EtKbVvADibKS-9MzaL8ePb2vv8F2TMP5wKRi_V9L5Gnapyg5XY8M9psaGVUl758iwdq8f9fPJq2Wgq15fHQ8nJLNN/w640-h338/lll_lock_wait.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Time related to MDL is highlighted, some 3% only<br /></td></tr></tbody></table><p></p><p>I highlighted the impact of MDL-related stack traces, some 3%.
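The transformation that feeds flamegraph.pl is just a column swap plus a filter, and it can be checked in isolation. A minimal sketch on two made-up collapsed lines (the second one has an unresolved, empty stack and should be dropped):

```shell
# flamegraph.pl wants "func1;func2;... count" per line, so swap the
# "count stack" columns produced earlier and skip lines whose stack
# part is empty (fully unresolved traces).
swapped=$(printf '28276454 __lll_lock_wait;timer_handler;start_thread\n5502 \n' |
awk '{ if (length($2) > 0) { print $2, $1 } }')
echo "$swapped"
```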
The shape is totally different from the one of <b>pthread_mutex_lock</b>, so this shows how a different point of view may lead to different conclusions about the contention.<br /></p><p></p><p style="text-align: center;">* * *</p><p>To summarize:</p><ol style="text-align: left;"><li>I am still not any good at finding a low-impact way (if any) to get stack traces with time summed up. It seems the associative array with them becomes too big if stacks are not preprocessed, and exporting it to user space at the end of the program takes minutes, literally, and impacts performance during all this time.</li><li>It's easy to measure time spent inside a function with a <b>uprobe</b> + <b>uretprobe</b> pair for that function.</li><li>I am not sure yet about the tools to run in production, but for checking the ideas and creating prototypes <b>bpftrace</b> is really easy to use and flexible.</li><li>It is not clear where the stack traces with all calls unresolved to symbols may come from.<br /></li></ol>Valerii Kravchukhttp://www.blogger.com/profile/13158916419325454260noreply@blogger.com0tag:blogger.com,1999:blog-3080615211468083537.post-72364361184180782312021-01-27T13:53:00.001+02:002021-01-30T17:47:01.638+02:00Playing with recent bpftrace and MariaDB 10.5 on Fedora - Part III, Creating a New Tool for Tracing Mutexes<p>I had some free time yesterday to do some more tests on Fedora, so I got back to the old request from one of the MariaDB developers that I first mentioned in the blog <a href="http://mysqlentomologist.blogspot.com/2020/09/dynamic-tracing-of-pthreadmutexlock-in.html" target="_blank">a few months ago</a>:</p><blockquote><p><i>"ideally, collect stack traces of <b>mariadbd</b>, and sort them in descending order by time spent between <b>pthread_mutex_lock</b> and next <b>pthread_mutex_unlock</b>."</i> <br /></p></blockquote><p>The original request above got no answer yet, and I was recently reminded about it.
What I did with <b>perf</b> and <a href="http://mysqlentomologist.blogspot.com/2021/01/playing-with-recent-bpftrace-and.html" target="_blank">recently with <b>bpftrace</b></a> was (lamely) counting the number of samples per unique stack, while I actually had to count time spent between acquiring and releasing specific mutexes.<br /></p><p>From the very beginning I was sure that <b>bpftrace</b> should allow me to get the answer easily, and after <a href="http://mysqlentomologist.blogspot.com/2021/01/playing-with-recent-bpftrace-and_24.html" target="_blank">reviewing the way existing tools are coded</a>, yesterday I decided to finally write a real, multi-line <b>bpftrace</b> program, with multiple probes, myself. I wanted to fulfill the request literally, no matter how much that would "cost" for now, with <b>bpftrace</b>. It turned out that a couple of hours of calm vacation time is more than enough to get a draft solution.</p><p>I've started with checking the <a href="https://linux.die.net/man/3/pthread_mutex_lock" target="_blank"><b>pthread_mutex_lock</b> manual</a> page. From it I've got the (primitive) idea of two functions used in the process, with a single argument, the mutex pointer/address:</p><blockquote><p><span style="font-family: courier;">int pthread_mutex_lock(pthread_mutex_t *mutex);<br />int pthread_mutex_unlock(pthread_mutex_t *mutex); </span><br /></p></blockquote><p>Multiple threads can try to lock the same mutex and those that find it locked will wait until an unlock eventually makes the mutex available for one of them to acquire (as decided by the scheduler). I've made the assumption (correct me if I am wrong) that the same thread that locked the mutex must unlock it eventually. 
Based on that I came up with the following initial lame version of a <b>bpftrace</b> program:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>cat pthread_mutex.bt</b><br />#!/usr/bin/env bpftrace<br /><br />BEGIN<br />{<br /> printf("Tracing time from pthread_mutex_lock to _unlock, Ctrl-C to stop\n");<br />}<br /><br /><b>uprobe:/lib64/libpthread.so.0:pthread_mutex_lock /comm == "mariadbd"/</b><br />{<br /> @start[arg0] = nsecs;<br /> @mutexid[arg0] = tid;<br /> @tidstack[tid] = ustack;<br />}<br /><br /><b>uprobe:/lib64/libpthread.so.0:pthread_mutex_unlock<br />/comm == "mariadbd" && @start[arg0] != 0 && @mutexid[arg0] == tid/<br /></b>{<br /> $now = nsecs;<br /> $time = $now - @start[arg0];<br /> @mutexstack[@tidstack[tid]] += $time;<br /> printf("Mutex: %u, time: %d\n", arg0, $time);<br /> delete(@start[arg0]);<br /> delete(@mutexid[arg0]);<br /> delete(@tidstack[tid]);<br />}<br /><br />END<br />{<br /> clear(@start);<br /> clear(@mutexid);<br /> clear(@tidstack);<br />}<br />/* the end */</span></span><br /></p></blockquote><p> Do you recognize the <b>biosnoop.bt</b> style? Yes, this is what I was inspired by... So, I've added two <b>uprobes</b> for the library providing the function, both checking that the call is done from <b>mariadbd</b>. The first one, for lock, stores the start time for the given mutex address, the id of the thread that locked it, and the stack trace of the thread at the moment of locking. The second one, for unlock, computes the time difference since the same mutex was locked last, but it fires only if the unlocking thread has the same id as the locking one. Then I add this time difference to the time spent "within this stack trace", by referring to the thread stack stored as index in the <b>@mutexstack[]</b> associative array. 
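The bookkeeping of these two probes can be simulated on a toy event log to check the logic. Below is a hypothetical awk rendition (the address, tids and timestamps are made up): a lock event records the start time and owner per mutex address, and an unlock accumulates the held time only when the same thread releases it:

```shell
# Mimics @start[arg0] / @mutexid[arg0]: the unlock by thread 4485 is
# ignored because thread 4476 took the lock; only the matching unlock
# contributes 6454 - 100 ns to the per-mutex total.
held=$(printf '652835712 4476 lock 100\n652835712 4485 unlock 900\n652835712 4476 unlock 6454\n' |
awk '
$3 == "lock" { start[$1] = $4; owner[$1] = $2 }
$3 == "unlock" && start[$1] != "" && owner[$1] == $2 {
  sum[$1] += $4 - start[$1]; delete start[$1]; delete owner[$1]
}
END { for (m in sum) print m, sum[m] }
')
echo "$held"
```

The same-tid guard is what the `@mutexid[arg0] == tid` filter in the unlock probe expresses, relying on the assumption above that lock and unlock happen on one thread.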
Then I print some debugging output to see what happens in the process of tracing and remove items from the associative arrays that were added to them by the first probe.</p><p>In the <b>END</b> probe I just clean up all associative arrays but <b>@mutexstack[]</b>, and, as we've seen before, its content is then just dumped to the output by <b>bpftrace</b>. This is what I am going to post process later, after a quick debugging session proves my idea gives some reasonable results.</p><p>So, with MariaDB 10.5 up and running, started like this (no real tuning for anything, no wonder QPS is not high in the tests below):</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">./bin/mysqld_safe --no-defaults --socket=/tmp/mariadb.sock --innodb_buffer_pool_size=1G --innodb_flush_log_at_trx_commit=2 --port=3309 &</span></span><br /></p></blockquote><p>and having zero user connections, I made <b>pthread_mutex.bt</b> executable and started my very first <b>bpftrace</b> program for the very first time (OK, honestly, a few previous runs showed some syntax errors that I corrected):</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>sudo ./pthread_mutex.bt</b><br />Attaching 4 probes...<br />Tracing time from pthread_mutex_lock to _unlock, Ctrl-C to stop<br />Mutex: 652747168, time: 6598<br />Mutex: 629905136, time: 46594<br />...<br />Mutex: 652835840, time: 26491<br />Mutex: 652835712, time: 4569<br /><b>^C</b><br /><br /><br />@mutexstack[<br /> __pthread_mutex_lock+0<br /> tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79<br /> 0x7f7e60fb53d4<br /> 0x5562258c9080<br /> std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0<br /> 0x2de907894810c083<br />]: 23055<br />@mutexstack[<br /> __pthread_mutex_lock+0<br /> 
tpool::thread_pool_generic::timer_generic::execute(void*)+188<br /> tpool::task::execute()+50<br /> tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79<br /> 0x7f7e60fb53d4<br /> 0x5562258c9080<br /> std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0<br /> 0x2de907894810c083<br />]: 23803<br />@mutexstack[<br /> __pthread_mutex_lock+0<br /> tpool::thread_pool_generic::timer_generic::execute(void*)+210<br /> tpool::task::execute()+50<br /> tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79<br /> 0x7f7e60fb53d4<br /> 0x5562258c9080<br /> std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0<br /> 0x2de907894810c083<br />]: 24555<br />@mutexstack[<br /> __pthread_mutex_lock+0<br /> tpool::thread_pool_generic::submit_task(tpool::task*)+88<br /> timer_handler+326<br /> start_thread+226<br />]: 31859<br />@mutexstack[<br /> __pthread_mutex_lock+0<br /> srv_monitor_task+130<br /> tpool::thread_pool_generic::timer_generic::execute(void*)+53<br /> tpool::task::execute()+50<br /> tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79<br /> 0x7f7e60fb53d4<br /> 0x5562258c9080<br /> std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0<br /> 0x2de907894810c083<br />]: 53282<br />@mutexstackERROR: failed to look up stack id 0 (pid 0): -1<br />[]: 322499</span></span><br /></p></blockquote><p>Not bad. I see mutexes are locked and (last ERROR aside) time is summed up per stack trace as planned. 
Moreover, stack traces look reasonable for 10.5 (generic thread pool is used inside InnoDB in this version, to run background tasks, are you aware of that?). Some symbols are not resolved, but what can I do about it? I'll just skip those addresses at some later stage, maybe.<br /></p><p>I just decided to check what threads are locking mutexes, and modified <b>print</b>:</p><blockquote><p> <span style="font-family: courier;"><span style="font-size: x-small;">printf("Mutex: %u, thread: %u, time: %d\n", arg0, tid, $time);</span></span><br /></p></blockquote><p>With that modification I also redirected errors to <b>/dev/null</b> and got this:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>sudo ./pthread_mutex.bt 2>/dev/null</b><br />Attaching 4 probes...<br />Tracing time from pthread_mutex_lock to _unlock, Ctrl-C to stop<br />Mutex: 652835712, thread: 4476, time: 6354<br />Mutex: 629905136, thread: 4476, time: 37053<br /><b>Mutex: 621289632, thread: 4485, time: 5254<br />Mutex: 621289632, thread: 4485, time: 4797<br /></b>Mutex: 652835840, thread: 4485, time: 31465<br /><b>Mutex: 652835712, thread: 4485, time: 4374<br />Mutex: 652835712, thread: 4476, time: 6048<br /></b>Mutex: 629905136, thread: 4476, time: 35703<br />Mutex: 621289632, thread: 4485, time: 4917<br />Mutex: 621289632, thread: 4485, time: 4779<br />Mutex: 652835840, thread: 4485, time: 30316<br />Mutex: 652835712, thread: 4485, time: 4389<br />Mutex: 652835712, thread: 4476, time: 6733<br />Mutex: 629905136, thread: 4476, time: 40936<br />Mutex: 621289632, thread: 4485, time: 4719<br />Mutex: 621289632, thread: 4485, time: 4725<br />Mutex: 652835840, thread: 4485, time: 30637<br />Mutex: 652835712, thread: 4485, time: 4441<br /><b>^C</b><br /><br /><br />@mutexstack[<br /> __pthread_mutex_lock+0<br /> tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79<br /> 0x7f7e60fb53d4<br /> 0x5562258c9080<br /> 
std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0<br /> 0x2de907894810c083<br />]: 13204<br />@mutexstack[<br /> __pthread_mutex_lock+0<br /> tpool::thread_pool_generic::timer_generic::execute(void*)+210<br /> tpool::task::execute()+50<br /> tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79<br /> 0x7f7e60fb53d4<br /> 0x5562258c9080<br /> std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0<br /> 0x2de907894810c083<br />]: 14301<br />@mutexstack[<br /> __pthread_mutex_lock+0<br /> tpool::thread_pool_generic::timer_generic::execute(void*)+188<br /> tpool::task::execute()+50<br /> tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79<br /> 0x7f7e60fb53d4<br /> 0x5562258c9080<br /> std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0<br /> 0x2de907894810c083<br />]: 14890<br />@mutexstack[<br /> __pthread_mutex_lock+0<br /> tpool::thread_pool_generic::submit_task(tpool::task*)+88<br /> timer_handler+326<br /> start_thread+226<br />]: 19135<br />@mutexstack[]: 206110</span></span><br /></p></blockquote><p>I see different threads locking same mutexes etc. 
One day, given more time, I'd try to figure out what those mutexes are and what the purpose of each thread was (it can be seen based on OS thread id <a href="http://mysqlentomologist.blogspot.com/2021/01/linux-proc-filesystem-for-mysql-dbas_7.html" target="_blank">in the <b>performance_schema.threads</b></a> in 10.5, fortunately, or inferred from the stacks at the moment).<br /></p><p>I've removed the debug <b>print</b> (no interactive output, just final summarized data), changed <b>ustack</b> to <b>ustack(perf)</b> (assuming I know better how to deal with that output format later - it was not really a good idea), and ended up with this final version of the tool:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">#!/usr/bin/env bpftrace<br /><br />BEGIN<br />{<br />/*<br /> printf("Tracing time from pthread_mutex_lock to _unlock, Ctrl-C to stop\n");<br />*/<br />}<br /><br />uprobe:/lib64/libpthread.so.0:pthread_mutex_lock /comm == "mariadbd"/ <br />{ <br /> @start[arg0] = nsecs;<br /> @mutexid[arg0] = tid;<br /> @tidstack[tid] = ustack(perf); <br />}<br /><br />uprobe:/lib64/libpthread.so.0:pthread_mutex_unlock <br />/comm == "mariadbd" && @start[arg0] != 0 && @mutexid[arg0] == tid/ <br />{<br /> $now = nsecs;<br /> $time = $now - @start[arg0];<br /> @mutexstack[@tidstack[tid]] += $time;<br />/*<br /> printf("Mutex: %u, thread: %u, time: %d\n", arg0, tid, $time);<br />*/<br /> delete(@start[arg0]);<br /> delete(@mutexid[arg0]);<br /> delete(@tidstack[tid]);<br />}<br /><br />END<br />{<br /> clear(@start);<br /> clear(@mutexid);<br /> clear(@tidstack);<br />}<br />/* the end */</span></span><br /></p></blockquote><p>I saved the stack traces to a file in <b>/tmp</b>, to work on the output further outside of <b>bpftrace</b>. 
Again probably it was not the best idea, but I am not yet fluent with strings processing in <b>bpftrace</b> anyway, I rely on <b>awk</b>, <b>sort</b> etc.:</p><blockquote><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>sudo ./pthread_mutex.bt 2>/dev/null >/tmp/pthread_mutex_perf_stacks.txt</b><br /><b>^C</b>[openxs@fc31 ~]<b>cat /tmp/pthread_mutex_perf_stacks.txt</b><br />Attaching 4 probes...<br /><br /><br /><br />@mutexstack[<br /> 7f7e615a6e70 __pthread_mutex_lock+0 (/usr/lib64/libpthread-2.30.so)<br /> 556223eb07cf tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 7f7e60fb53d4 0x7f7e60fb53d4 ([unknown])<br /> 5562258c9080 0x5562258c9080 ([unknown])<br /> 556223eb0b60 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 2de907894810c083 0x2de907894810c083 ([unknown])<br />]: 21352<br />@mutexstack[<br /> 7f7e615a6e70 __pthread_mutex_lock+0 (/usr/lib64/libpthread-2.30.so)<br /> 556223eb0d62 tpool::thread_pool_generic::timer_generic::execute(void*)+210 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 556223eb1c52 tpool::task::execute()+50 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 556223eb07cf tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 7f7e60fb53d4 0x7f7e60fb53d4 ([unknown])<br /> 5562258c9080 0x5562258c9080 ([unknown])<br /> 556223eb0b60 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0 (/home/openxs/dbs/maria10.5/bin/mariadbd)<br /> 2de907894810c083 0x2de907894810c083 ([unknown])<br />]: 22975<br />...<br /></span></span><p></p></blockquote><p>After checking what I did 
with such stack traces <a href="http://mysqlentomologist.blogspot.com/2020/01/using-bpftrace-on-fedora-29-more.html" target="_blank">previously</a> to collapse them into one line per stack <b>pt-pmp</b> style, and multiple clarification runs and changes, I ended up with the following <b>awk</b> code:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">awk '<br />BEGIN { s = ""; }<br /><b>/^@mutexstack\[\]/</b> { s = ""; }<br /><b>/^@mutexstack/ { s = ""; }</b><br /><b>/^\t/</b> { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s "," targ } else { s = targ } }<br /><b>/^]/</b> { print $2, s }<br />' </span></span><br /></p></blockquote><p>I process the lines that do not contain stacks, those around the stack block, resetting the stack <b>s</b> at the beginning and printing it at the end of the block. For the stack lines I take only the function name and ignore everything else to form <b>targ</b>, and concatenate it to the stack collected so far, with a comma (that was a wrong idea for future use) as a separator between function names. The original code that inspired all this came from <b>pt-pmp</b>, as far as I remember. 
I just adapted it to the format, better than in the previous posts.</p><p>Post-processing the output with this <b>awk</b> code gave me the following:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>cat /tmp/pthread_mutex_perf_stacks.txt | awk '</b><br />> <b>BEGIN { s = ""; }</b><br />> <b>/^@mutexstack\[\]/ { s = ""; }</b><br />> /^@mutexstack/ { s = ""; }<br />> <b>/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s "," targ } else { s = targ } }</b><br />> <b>/^]/ { print $2, s }</b><br />> <b>'</b><br />21352 __pthread_mutex_lock,tpool::thread_pool_generic::worker_main,,,,<br />22975 __pthread_mutex_lock,tpool::thread_pool_generic::timer_generic::execute,tpool::task::execute,tpool::thread_pool_generic::worker_main,,,,<br />24568 __pthread_mutex_lock,tpool::thread_pool_generic::timer_generic::execute,tpool::task::execute,tpool::thread_pool_generic::worker_main,,,,<br />33469 __pthread_mutex_lock,tpool::thread_pool_generic::submit_task,timer_handler,start_thread</span></span><br /></p></blockquote><p>Non-resolved addresses are removed, same as offsets from the function start. 
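That awk program can be sanity-checked in isolation by feeding it a hand-made ustack(perf)-style block (the addresses and the 31859 total are invented; frame lines start with a tab, as in the real bpftrace output):

```shell
# Collapse one fake @mutexstack[...] block into "time func1,func2,..."
# using the very same awk program as in the post.
collapsed=$(printf '@mutexstack[\n\t7f00 __pthread_mutex_lock+0 (/usr/lib64/libpthread-2.30.so)\n\t5500 timer_handler+326 (/home/openxs/dbs/maria10.5/bin/mariadbd)\n\t6600 start_thread+226 (/usr/lib64/libpthread-2.30.so)\n]: 31859\n' |
awk '
BEGIN { s = ""; }
/^@mutexstack\[\]/ { s = ""; }
/^@mutexstack/ { s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s "," targ } else { s = targ } }
/^]/ { print $2, s }
')
echo "$collapsed"
```

On the closing `]: 31859` line `$2` is the accumulated nanosecond total, which is why the time ends up in the first output column.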
Now sorting remains, in descending order, on the first column as a key:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ cat /tmp/pthread_mutex_perf_stacks.txt | awk '<br />BEGIN { s = ""; }<br />/^@mutexstack\[\]/ { s = ""; }<br />/^@mutexstack/ { s = ""; }<br />/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s "," targ } else { s = targ } }<br />/^]/ { print $2, s }<br />' | sort -r -n -k 1,1<br /><b>33469 __pthread_mutex_lock,tpool::thread_pool_generic::submit_task,timer_handler,start_thread<br />24568 __pthread_mutex_lock,tpool::thread_pool_generic::timer_generic::execute,tpool::task::execute,tpool::thread_pool_generic::worker_main,,,,<br />22975 __pthread_mutex_lock,tpool::thread_pool_generic::timer_generic::execute,tpool::task::execute,tpool::thread_pool_generic::worker_main,,,,<br />21352 __pthread_mutex_lock,tpool::thread_pool_generic::worker_main,,,,</b></span></span><br /></p></blockquote><p>That's what we have, for the server without user connections. 
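The descending order relies entirely on sort's reverse numeric mode keyed on the first field; a quick check on toy totals (function names shortened, numbers invented):

```shell
# -n compares the leading nanosecond totals numerically, -r reverses
# the order, -k 1,1 restricts the sort key to the first column only.
top=$(printf '21352 worker_main\n33469 submit_task\n24568 timer_generic\n' | sort -r -n -k 1,1 | head -1)
echo "$top"
```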
Now let me put it under the high concurrent <b>sysbench</b> test load (good idea, isn't it?):</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 maria10.5]$ <b>sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=openxs --mysql-socket=/tmp/mariadb.sock --mysql-db=sbtest --threads=32 --report-interval=10 --time=300 run</b><br />sysbench 1.1.0-174f3aa (using bundled LuaJIT 2.1.0-beta2)<br /><br />Running the test with following options:<br />Number of threads: 32<br />Report intermediate results every 10 second(s)<br />Initializing random number generator from current time<br /><br /><br />Initializing worker threads...<br /><br />Threads started!<br /><br />[ 10s ] thds: 32 tps: 653.07 qps: 13097.92 (r/w/o: 9174.69/2613.89/1309.34) lat (ms,95%): 240.02 err/s: 0.00 reconn/s: 0.00<br />[ 20s ] thds: 32 tps: 1025.71 qps: 20511.58 (r/w/o: 14358.12/4102.14/2051.32) lat (ms,95%): 71.83 err/s: 0.00 reconn/s: 0.00<br /><b>[ 30s ] thds: 32 tps: 588.21 qps: 11770.70 (r/w/o: 8238.74/2355.44/1176.52) lat (ms,95%): 235.74 err/s: 0.00 reconn/s: 0.00<br />[ 40s ] thds: 32 tps: 306.22 qps: 6135.54 (r/w/o: 4298.14/1224.97/612.43) lat (ms,95%): 369.77 err/s: 0.00 reconn/s: 0.00<br />[ 50s ] thds: 32 tps: 467.00 qps: 9339.64 (r/w/o: 6537.96/1867.69/933.99) lat (ms,95%): 308.84 err/s: 0.00 reconn/s: 0.00<br />[ 60s ] thds: 32 tps: 302.19 qps: 6044.31 (r/w/o: 4230.60/1209.34/604.37) lat (ms,95%): 520.62 err/s: 0.00 reconn/s: 0.00<br />[ 70s ] thds: 32 tps: 324.91 qps: 6496.60 (r/w/o: 4548.67/1298.12/649.81) lat (ms,95%): 467.30 err/s: 0.00 reconn/s: 0.00<br /></b>[ 80s ] thds: 32 tps: 303.58 qps: 6058.55 (r/w/o: 4238.05/1213.33/607.16) lat (ms,95%): 646.19 err/s: 0.00 reconn/s: 0.00<br />[ 90s ] thds: 32 tps: 258.39 qps: 5176.10 (r/w/o: 3625.73/1033.58/516.79) lat (ms,95%): 634.66 err/s: 0.00 reconn/s: 0.00<br />[ 100s ] thds: 32 tps: 213.72 qps: 4279.43 (r/w/o: 2995.93/856.07/427.43) lat (ms,95%): 
707.07 err/s: 0.00 reconn/s: 0.00<br />[ 110s ] thds: 32 tps: 208.29 qps: 4144.23 (r/w/o: 2896.58/831.07/416.58) lat (ms,95%): 623.33 err/s: 0.00 reconn/s: 0.00<br />[ 120s ] thds: 32 tps: 456.29 qps: 9135.45 (r/w/o: 6397.03/1826.05/912.38) lat (ms,95%): 363.18 err/s: 0.00 reconn/s: 0.00<br />[ 130s ] thds: 32 tps: 582.21 qps: 11641.73 (r/w/o: 8148.49/2328.63/1164.61) lat (ms,95%): 277.21 err/s: 0.00 reconn/s: 0.00<br />[ 140s ] thds: 32 tps: 560.39 qps: 11208.17 (r/w/o: 7845.84/2241.55/1120.78) lat (ms,95%): 257.95 err/s: 0.00 reconn/s: 0.00<br />[ 150s ] thds: 32 tps: 338.03 qps: 6768.93 (r/w/o: 4739.47/1353.41/676.05) lat (ms,95%): 442.73 err/s: 0.00 reconn/s: 0.00<br />[ 160s ] thds: 32 tps: 410.20 qps: 8210.38 (r/w/o: 5748.19/1641.80/820.40) lat (ms,95%): 411.96 err/s: 0.00 reconn/s: 0.00<br />[ 170s ] thds: 32 tps: 480.28 qps: 9599.94 (r/w/o: 6716.68/1922.81/960.45) lat (ms,95%): 325.98 err/s: 0.00 reconn/s: 0.00<br />[ 180s ] thds: 32 tps: 397.62 qps: 7952.16 (r/w/o: 5568.62/1588.19/795.35) lat (ms,95%): 411.96 err/s: 0.00 reconn/s: 0.00<br />[ 190s ] thds: 32 tps: 338.77 qps: 6769.31 (r/w/o: 4739.09/1352.78/677.44) lat (ms,95%): 475.79 err/s: 0.00 reconn/s: 0.00<br />[ 200s ] thds: 32 tps: 417.81 qps: 8372.59 (r/w/o: 5857.10/1679.76/835.73) lat (ms,95%): 331.91 err/s: 0.00 reconn/s: 0.00<br />[ 210s ] thds: 32 tps: 267.40 qps: 5340.01 (r/w/o: 3742.10/1063.10/534.80) lat (ms,95%): 634.66 err/s: 0.00 reconn/s: 0.00<br />[ 220s ] thds: 32 tps: 267.70 qps: 5355.78 (r/w/o: 3748.96/1071.42/535.41) lat (ms,95%): 590.56 err/s: 0.00 reconn/s: 0.00<br />[ 230s ] thds: 32 tps: 243.11 qps: 4859.74 (r/w/o: 3401.70/971.83/486.21) lat (ms,95%): 733.00 err/s: 0.00 reconn/s: 0.00<br />[ 240s ] thds: 32 tps: 173.99 qps: 3474.97 (r/w/o: 2430.94/696.05/347.98) lat (ms,95%): 1013.60 err/s: 0.00 reconn/s: 0.00<br />[ 250s ] thds: 32 tps: 169.71 qps: 3403.05 (r/w/o: 2384.37/679.25/339.42) lat (ms,95%): 877.61 err/s: 0.00 reconn/s: 0.00<br />[ 260s ] thds: 32 tps: 407.57 qps: 
8151.27 (r/w/o: 5704.23/1631.89/815.15) lat (ms,95%): 272.27 err/s: 0.00 reconn/s: 0.00<br />...<br />[ 300s ] thds: 32 tps: 382.41 qps: 7641.05 (r/w/o: 5348.01/1528.43/764.62) lat (ms,95%): 434.83 err/s: 0.00 reconn/s: 0.00<br />SQL statistics:<br /> queries performed:<br /> read: 1663592<br /> write: 475312<br /> other: 237656<br /> total: 2376560<br /><b> transactions: 118828 (396.04 per sec.)<br /> queries: 2376560 (7920.89 per sec.)<br /></b> ignored errors: 0 (0.00 per sec.)<br /> reconnects: 0 (0.00 per sec.)<br /><br />Throughput:<br /> events/s (eps): 396.0445<br /> time elapsed: 300.0370s<br /> total number of events: 118828<br /><br />Latency (ms):<br /> min: 2.17<br /> avg: 80.79<br /> max: 5012.10<br /> 95th percentile: 390.30<br /> sum: 9600000.68<br /><br />Threads fairness:<br /> events (avg/stddev): 3713.3750/40.53<br /> execution time (avg/stddev): 300.0000/0.01</span></span><br /></p></blockquote><p>Trust me that I started my <b>bpftrace</b> program after the initial 20 seconds of the test run, and let it work for at most 20 seconds. But the rest of the test, the next 280 seconds, was notably affected by a visible drop in QPS! I pressed Ctrl-C but got the command prompt back much later, not even after 300 seconds... I was watching the output growth in another shell:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>ls -l /tmp/pthread_mutex_perf_stacks.txt</b><br />-rw-rw-r--. 1 openxs openxs 264177 Jan 26 13:26 /tmp/pthread_mutex_perf_stacks.txt<br />...<br />[openxs@fc31 ~]$ <b>ls -l /tmp/pthread_mutex_perf_stacks.txt</b><br />-rw-rw-r--. 1 openxs openxs 281111 Jan 26 13:27 /tmp/pthread_mutex_perf_stacks.txt<br />...<br />[openxs@fc31 ~]$ <b>ls -l /tmp/pthread_mutex_perf_stacks.txt</b><br />-rw-rw-r--. 
1 openxs openxs <b>4116283</b> Jan 26 13:35 /tmp/pthread_mutex_perf_stacks.txt</span></span><br /></p></blockquote><p>So I ended up with 4M of text data exported to userland, for just 20 seconds of data collection, and with a performance drop that lasted many minutes for my 32-thread test on an old 4-core system. Not that impressive, so I should probably aggregate and process the data more in my <b>bpftrace</b> program, or maybe just dump raw stack-time entries as they are collected. I'll test and see how to improve this, as this way of collecting data is not suitable for production use on a loaded system :(</p><p>Anyway, I have to process what was collected with such an impact. To remind you, the data were collected this way:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>sudo ./pthread_mutex.bt 2>/dev/null >/tmp/pthread_mutex_perf_stacks.txt</b><br />[sudo] password for openxs:<br /><b>^C</b></span></span><br /></p></blockquote><p>and then I applied the same <b>awk</b> and <b>sort</b> command line as above to get collapsed stacks. This is what I've seen as a result:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>cat /tmp/collapsed_pthread_mutex.txt | more</b><br /><b>104251253 __pthread_mutex_lock,buf_flush_page,buf_flush_try_neighbors,buf_do_flu<br />sh_list_batch,buf_flush_lists,buf_flush_page_cleaner,start_thread<br /></b>78920938<br />74828263<br />74770599<br />74622438<br />72853129<br />67893142<br />66546439 <b>__pthread_mutex_lock,buf_do_flush_list_batch,buf_flush_lists,buf_flush_<br />page_cleaner,start_thread<br /></b>61669188<br />59330217<br />55480213<br />55045396<br />53531941<br />53216338<br />...</span></span><br /></p></blockquote><p>I have yet to find out where those non-resolved and entirely removed entries come from, and what to do with them so that they do not influence the analysis. For now I need to get rid of them as useless. 
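One obvious direction for reducing both the exported volume and the overhead would be to shrink the map keys themselves, for example by truncating the collected stacks to a few top frames. The following is only an untested sketch of that idea, not the program I actually ran; the libpthread path, probe attachment points and map names are all assumptions:

```
// Untested sketch: sum time spent in pthread_mutex_lock per *truncated*
// user stack (top 6 frames only), so far fewer unique map entries have
// to be dumped at detach. The libpthread path is an assumption here.
uprobe:/usr/lib64/libpthread.so.0:pthread_mutex_lock
{
    @start[tid] = nsecs;
}

uretprobe:/usr/lib64/libpthread.so.0:pthread_mutex_lock
/@start[tid]/
{
    @mutexstack[ustack(6)] = sum(nsecs - @start[tid]);
    delete(@start[tid]);
}
```

Whether ustack(6) keeps enough context to still tell the interesting stacks apart is something to verify on real data.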
This is how I did it to get "top 5" stacks with times (in nanoseconds) spent in them:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>cat /tmp/collapsed_pthread_mutex.txt | awk '{ if (length($2) > 0) {print} }' | head -5</b><br />104251253 __pthread_mutex_lock,buf_flush_page,buf_flush_try_neighbors,buf_do_flush_list_batch,buf_flush_lists,buf_flush_page_cleaner,start_thread<br />66546439 __pthread_mutex_lock,buf_do_flush_list_batch,buf_flush_lists,buf_flush_page_cleaner,start_thread<br />31431176 __pthread_mutex_lock,buf_flush_try_neighbors,buf_do_flush_list_batch,buf_flush_lists,buf_flush_page_cleaner,start_thread<br />27100601 __pthread_mutex_lock,tpool::aio_linux::getevent_thread_routine,,,,<br />11730055 __pthread_mutex_lock,buf_flush_lists,buf_flush_page_cleaner,start_thread<br /></span></span><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>cat /tmp/collapsed_pthread_mutex.txt | awk '{ if (length($2) > 0) {print} }' > /tmp/collapsed_clean_pthread_mutex.txt</b></span></span><br /><span style="font-family: courier;"></span></p></blockquote><p>I saved the output into the <b>/tmp/collapsed_clean_pthread_mutex.txt</b> file. The next step would be to represent the result in some nice graphical way, a flame graph! 
I have the software in place:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>ls /mnt/home/openxs/git/FlameGraph/</b><br />aix-perf.pl stackcollapse-gdb.pl<br />demos stackcollapse-go.pl<br />dev stackcollapse-instruments.pl<br />difffolded.pl stackcollapse-java-exceptions.pl<br />docs stackcollapse-jstack.pl<br />example-dtrace-stacks.txt stackcollapse-ljp.awk<br />example-dtrace.svg stackcollapse-perf.pl<br />example-perf-stacks.txt.gz stackcollapse-perf-sched.awk<br />example-perf.svg stackcollapse.pl<br />files.pl stackcollapse-pmc.pl<br /><b>flamegraph.pl</b> stackcollapse-recursive.pl<br />jmaps stackcollapse-sample.awk<br />pkgsplit-perf.pl stackcollapse-stap.pl<br />range-perf.pl stackcollapse-vsprof.pl<br />README.md stackcollapse-vtune.pl<br />record-test.sh stackcollapse-xdebug.php<br />stackcollapse-aix.pl test<br /><b>stackcollapse-bpftrace.pl</b> test.sh<br />stackcollapse-elfutils.pl</span></span><br /></p></blockquote><p>But I quickly recalled that <b>flamegraph.pl</b> expects this kind of input format, with ";" as a separator and the number in the second column, not the first: </p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">bash;entry_SYSCALL_64_fastpath;sys_read;vfs_read;...;schedule 8</span></span><br /></p></blockquote><p>There is also a tool to collapse raw <b>bpftrace</b> stacks, <b>stackcollapse-bpftrace.pl</b>, and I have to check how it works for my case one day... 
Yesterday I just wanted to complete testing as soon as possible, so I proceeded with a quick and dirty <b>awk</b> hack:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>cat /tmp/collapsed_clean_pthread_mutex.txt | awk ' { gsub(",",";",$2); print "mariadbd;"$2, $1 }' | head -5</b><br />mariadbd;__pthread_mutex_lock;buf_flush_page;buf_flush_try_neighbors;buf_do_flush_list_batch;buf_flush_lists;buf_flush_page_cleaner;start_thread 104251253<br />mariadbd;__pthread_mutex_lock;buf_do_flush_list_batch;buf_flush_lists;buf_flush_page_cleaner;start_thread 66546439<br />mariadbd;__pthread_mutex_lock;buf_flush_try_neighbors;buf_do_flush_list_batch;buf_flush_lists;buf_flush_page_cleaner;start_thread 31431176<br />mariadbd;__pthread_mutex_lock;tpool::aio_linux::getevent_thread_routine;;;; 27100601<br />mariadbd;__pthread_mutex_lock;buf_flush_lists;buf_flush_page_cleaner;start_thread 11730055</span></span><br /></p></blockquote><p>This format looks acceptable, so I've generated the flame graph with the same hack:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>cat /tmp/collapsed_clean_pthread_mutex.txt | awk ' { gsub(",",";",$2); print "mariadbd;"$2, $1 }' | /mnt/home/openxs/git/FlameGraph/flamegraph.pl --title="pthread_mutex_waits in MariaDB 10.5" --countname=nsecs > ~/Documents/mutex.svg</b></span></span><br /></p></blockquote><p>and here is the result, with a search for "<b>tpool</b>" highlighting how much time of the mutex waits is related to the thread pool of background InnoDB threads:<br /></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a 
href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjRB46GIZtgEBWY6TdP4PDvb7kbYJcWx-5-jl0xtiftkKH5FDknWQxbQl8o1JBh1qAWCKclewKSvJPPVym8AON-PSwaDE7G2wdf5xIy0y3aJfS_THzOaU1CADC7mQzJEUVjKT2BGl7bCcB/s1195/pthread_mutex_waits_tpool.png" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="617" data-original-width="1195" height="330" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjRB46GIZtgEBWY6TdP4PDvb7kbYJcWx-5-jl0xtiftkKH5FDknWQxbQl8o1JBh1qAWCKclewKSvJPPVym8AON-PSwaDE7G2wdf5xIy0y3aJfS_THzOaU1CADC7mQzJEUVjKT2BGl7bCcB/w640-h330/pthread_mutex_waits_tpool.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">One can surely create a flame graph based on stacks collected by the <b>bpftrace</b> program, one way or the other...<br /></td></tr></tbody></table><p>I'll stop at this stage and maybe continue testing later this week. Stay tuned!<br /></p><p style="text-align: center;">* * *</p><p>To summarize:</p><ul style="text-align: left;"><li>I am not yet sure if my logic in the programs above was correct. I have to think more about it.</li><li>I surely need to find another way to process the data, either by collapsing/processing stacks in my <b>bpftrace</b> program to make them smaller, or maybe by submitting raw stack/time data as they are collected to the user level. More tests to come...</li><li>It is easy to create custom <b>bpftrace</b> programs for collecting the data you need. I think memory allocations tracing is my next goal. Imagine a printout of memory allocated and not freed, per allocating thread... 
If only that had less impact on QPS than what my lame program above demonstrated :)<br /></li></ul>Valerii Kravchukhttp://www.blogger.com/profile/13158916419325454260noreply@blogger.com7tag:blogger.com,1999:blog-3080615211468083537.post-56544731577793697182021-01-26T23:37:00.001+02:002021-01-26T23:37:57.935+02:00What mysql_upgrade really does in MariaDB, Part II, Bugs and Missing Features<p>Both <a href="http://mysqlentomologist.blogspot.com/search/label/proc" target="_blank"><b>/proc</b></a> sampling and<b> <a href="http://mysqlentomologist.blogspot.com/search/label/bpftrace" target="_blank">bpftrace</a></b> are cool topics to write about, but I should not forget my third upcoming talk at FOSDEM 2021 in less than 2 weeks, <a href="https://fosdem.org/2021/schedule/event/mariadb_upgrade/" target="_blank"><b>"Upgrading to a newer major version of MariaDB"</b></a>, that is mostly devoted to <b>mysql_upgrade</b> internals and is based on <a href="http://mysqlentomologist.blogspot.com/2020/04/what-mysqlupgrade-really-does-in.html" target="_blank">this blog post</a>. Today I am going to provide some more background details and, probably for the first time, act as <i>MariaDB entomologist</i> and study some bug reports and feature requests related to <b>mysql_upgrade</b> (a.k.a. <b>mariadb-upgrade</b> in 10.5+) that were added over last 10 months or so.</p><p>I tried different ways to search for MariaDB <a href="https://jira.mariadb.org" target="_blank">JIRA issues</a> related to <b>mysql_upgrade</b>. 
For example, this is how I tried to find any bugs in<b> mysql_upgrade</b> closed over the last two months (I recall there were a few):</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">text ~ mysql_upgrade AND project = "MariaDB Server" AND status = Closed AND createdDate >= "2020-12-01" ORDER BY createdDate DESC</span></span><br /></p></blockquote><p>I've got a list of 7 reports, with 2 relevant bugs that are already fixed:</p><ul style="text-align: left;"><li><a href="https://jira.mariadb.org/browse/MDEV-24566" target="_blank">MDEV-24566</a> - "<b>mysql_upgrade failed with "('mariadb.sys'@'localhost') does not exist" and mariadb 10.4/10.5 on docke</b>r". This is fixed in current docker images at <a href="https://hub.docker.com/r/mariadb/server" target="_blank">https://hub.docker.com/r/mariadb/server</a>. The problem was related to the Docker image only, and <b>mysql_upgrade</b> was actually affected by the initial database content there.<br /></li><li><a href="https://jira.mariadb.org/browse/MDEV-24452" target="_blank">MDEV-24452</a> - "<b>ALTER TABLE event take infinite time which for example breaks mysql_upgrade</b>". Now this was a real blocker bug in recent 10.5. If any event managed to start before you executed <b>mysql_upgrade</b>, the utility and the proper upgrade process were blocked. Good to see this fixed in the upcoming 10.5.9.</li></ul><p>So, Monty really fixes related bugs when they are reported, as promised. Now let's consider the following query, for still-open issues reported since April 1, 2020:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">text ~ "mysql_upgrade" AND project = "MariaDB Server" AND status != Closed AND createdDate >= "2020-04-01" ORDER BY createdDate DESC</span></span><br /></p></blockquote><p>I checked them one by one (as they may be related to the upgrade process but not to <b>mysql_upgrade</b> specifically) and placed them into two lists, bugs and feature requests. 
Let me start with tasks (feature requests), so you know what was kind of missing by design or has to be done differently:</p><ul style="text-align: left;"><li><a href="https://jira.mariadb.org/browse/MDEV-24586" target="_blank">MDEV-24586</a> - "<b>remove scripts/mysql_to_mariadb.sql</b>". Actually proper logic is implemented in <b>scripts/mysql_system_tables_fix.sql</b> that forms part of <b>mysql_upgrade</b>, so separate script is no longer needed.</li><li><a href="https://jira.mariadb.org/browse/MDEV-24540" target="_blank">MDEV-24540</a> - "<b>Detect incompatible MySQL virtual columns and report in error log.</b>" MariaDB does not support migration from MySQL-generated physical
database tables containing virtual columns, and produces column mismatch
errors, failures in <b>mysql_upgrade</b> etc. It would be great to have proper error messages from <b>mysql_upgrade</b> in this case, explaining the real problem and possible solutions (dump, drop and reload or whatever).</li><li><a href="https://jira.mariadb.org/browse/MDEV-24453" target="_blank">MDEV-24453</a> - "<b>mysql_upgrade does not honor --verbose parameter</b>". It is not passed to other binaries called and this may make debugging upgrade issues more complex.</li><li><a href="https://jira.mariadb.org/browse/MDEV-24316" target="_blank">MDEV-24316</a> - "<b>cross upgrade from MySQL - have precheck tool</b>". According to the reporter, <a href="https://jira.mariadb.org/secure/ViewProfile.jspa?name=danblack" target="_blank"><b><span class="view-issue-field" id="reporter-val"><span class="user-hover" id="issue_summary_reporter_danblack" rel="danblack">Daniel Black</span></span></b></a>, the goal would be to check on a database/global scale looking at
tables, at features used, at settings, at character sets in table and
determine the "migratablilty" of a given MySQL instance. I voted for this feature!</li><li><a href="https://jira.mariadb.org/browse/MDEV-24093" target="_blank">MDEV-24093</a> - "<b>Detect during mysql_upgrade if type_mysql_json.so is needed and load it</b>". After <a class="issue-link" data-issue-key="MDEV-18323" href="https://jira.mariadb.org/browse/MDEV-18323" title="Convert MySQL JSON type to MariaDB TEXT in mysql_upgrade"><del>MDEV-18323</del></a>, <b>MYSQL_JSON</b> type is available as a dynamically loadable plugin.
To make <b>mysql_upgrade</b> run seamlessly we need to make sure it is loaded appropriately and unloaded when done with the upgrade. This is already in review, so it will be implemented really soon.</li><li><a href="https://jira.mariadb.org/browse/MDEV-23962" target="_blank">MDEV-23962</a> - "<b>Remove arc directory support</b>". I think only <a href="https://jira.mariadb.org/secure/ViewProfile.jspa?name=monty" target="_blank"><b>Monty</b></a> (the bug reporter) knows what this is about. I don't :)</li><li><a href="https://jira.mariadb.org/browse/MDEV-23008" target="_blank">MDEV-23008</a> - "<b>store mysql_upgrade version info in system table instead of local file</b>". One of the really important feature requests from my colleague since 2005, <a href="https://jira.mariadb.org/secure/ViewProfile.jspa?name=hholzgra" id="avatar-full-name-link" target="_blank" title="hholzgra"><b>Hartmut Holzgraefe</b></a>.</li><li><a href="https://jira.mariadb.org/browse/MDEV-22357" target="_blank">MDEV-22357</a> - "<b>Clearing InnoDB autoincrement counter when upgrading from MariaDB < 10.2.4</b>". CHECK TABLE ... FOR UPGRADE should work differently for InnoDB tables, for <b>mysql_upgrade</b> to work properly.</li><li><a href="https://jira.mariadb.org/browse/MDEV-22323" target="_blank">MDEV-22323</a> - "<b>Upgrading MariaDB</b>". This is the "umbrella" task to cover all problematic cases of MySQL to MariaDB, Percona Server to MariaDB and minor MariaDB server upgrades.<br /></li><li><a href="https://jira.mariadb.org/browse/MDEV-22322" target="_blank">MDEV-22322</a> - "<b>Percona Server -> Mariadb Upgrades</b>". Summary of all the related issues. 
See <a href="https://jira.mariadb.org/browse/MDEV-22679" target="_blank">MDEV-22679</a> etc.<br /></li></ul><p>Now back to more or less serious bugs that are still waiting for the fix:</p><ul style="text-align: left;"><li><a href="https://jira.mariadb.org/browse/MDEV-24579" target="_blank">MDEV-24579</a> - "<b>Error table->get_ref_count() after update to 10.5.8</b><span class="overlay-icon aui-icon aui-icon-small aui-iconfont-edit"></span>". It seems DDL executed on <b>mysql.*</b> tables with InnoDB persistent statistics (like those executed by <b>mysql_upgrade</b>!) may cause problems for concurrent queries (up to assertion failure in this case). So we either should remove those tables (<a href="http://mysqlentomologist.blogspot.com/2018/01/on-innodbs-persistent-optimizer.html" target="_blank">I wish!</a>) or do something with <b>mysql_upgrade</b>, or (IMHO even better) do not let users connect and execute queries while <b>mysql_upgrade</b> is running, like MySQL 8 does when the server is started for the first time on older <b>datadir</b> and performs upgrade. Take care in the meantime...</li><li><a href="https://jira.mariadb.org/browse/MDEV-23652" target="_blank">MDEV-23652</a> - "<b>Assertion failures upon reading InnoDB system table after normal upgrade from 10.2</b>". Now this is a real bug :) Assertion failure during <b>mysql_upgrade</b>, this is surely something to fix!</li><li><a href="https://jira.mariadb.org/browse/MDEV-23636" target="_blank">MDEV-23636</a> - "<b>mysql_upgrade [ERROR] InnoDB: Fetch of persistent statistics requested for table</b>". I am not sure what's going on here, and why.</li><li><a href="https://jira.mariadb.org/browse/MDEV-23392" target="_blank">MDEV-23392</a> - "<b>main.mysql_upgrade_view failed in buildbot with another wrong result</b><span class="overlay-icon aui-icon aui-icon-small aui-iconfont-edit"></span>". 
MTR test case failures are something to care about.</li><li><a href="https://jira.mariadb.org/browse/MDEV-22683" target="_blank">MDEV-22683</a> - "<b>mysql_upgrade misses some changes to mysql schema</b>". Over time and different versions, some structures in the <b>mysql</b> schema get changed, but not all the changes make it to the scripts executed by <b>mysql_upgrade</b>. As a result, a schema freshly created by <b>mysql_install_db</b> on version 10.x.y differs from a schema created on an earlier version and upgraded to 10.x.y by <b>mysql_upgrade</b>. The real diffs are listed, per version, in this bug report from <a href="https://jira.mariadb.org/secure/ViewProfile.jspa?name=elenst" id="avatar-full-name-link" title="elenst">Elena Stepanova</a>.</li><li><a href="https://jira.mariadb.org/browse/MDEV-22655" target="_blank">MDEV-22655</a> - "<b>CHECK TABLE ... FOR UPGRADE fails to report old datetime format</b>". That's my favorite, unfortunately. It makes running <b>mysql_upgrade</b> useless for some cases of upgrade from pre-5.6 MySQL versions and leads to problems for tables partitioned by datetime etc. columns. See also <a href="https://jira.mariadb.org/browse/MDEV-24499" target="_blank">MDEV-24499</a> - "<b>Server upgrade causes compound index and related query to fail.</b>".</li><li><a href="https://jira.mariadb.org/browse/MDEV-22645" target="_blank">MDEV-22645</a> - "<b>default_role gets removed when migrating from 10.1 to 10.4</b>". It may have something to do with the mydumper/myloader tools used, but it is still a problem.</li><li><a href="https://jira.mariadb.org/browse/MDEV-22482" target="_blank">MDEV-22482</a> - "<b>pam v2: mysql_upgrade doesn't fix the ownership/privileges of auth_pam_tool</b>". No comments.</li><li><a href="https://jira.mariadb.org/browse/MDEV-22477" target="_blank">MDEV-22477</a> - "<b>mysql_upgrade fails with sql_notes=OFF</b>". 
<b>mysql_upgrade</b>, or, more exactly, <b>mysql_system_tables.sql</b>, uses the <b>@@warning_count</b> variable in the upgrade logic. The variable, in turn, depends on the value of <b>sql_notes</b>. When it is <b>OFF</b>, <b>@@warning_count</b> is not incremented, and <b>mysql_upgrade</b> doesn't work as expected.</li></ul><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjg8dbGGetyeBZmAlxwoktVG117C2qUomQUIIPKbhDyuywXFp-wUrBLVQQolu0VYWLsyCSKiOlvuEX5QgvU1ulGvVL9ChKAvDtPHpPcIsBsnShFbyPuvsd02n5nj9aQ30brS1bbcaZoGB00/s640/Nokia+145.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="480" data-original-width="640" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjg8dbGGetyeBZmAlxwoktVG117C2qUomQUIIPKbhDyuywXFp-wUrBLVQQolu0VYWLsyCSKiOlvuEX5QgvU1ulGvVL9ChKAvDtPHpPcIsBsnShFbyPuvsd02n5nj9aQ30brS1bbcaZoGB00/w400-h300/Nokia+145.jpg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">After <a href="http://monty-says.blogspot.com/2020/04/upgrading-between-major-mariadb-versions.html" target="_blank">Monty's post</a> in April 2020 many new <b>mysql_upgrade</b> bugs were reported, and some were already fixed. 
So, we are on the way...<br /></td></tr></tbody></table><p></p><p style="text-align: center;">* * *<br /></p><p>To summarize:</p><ol style="text-align: left;"><li>There are still some bugs and missing features in <b>mysql_upgrade</b>.</li><li>MariaDB actively works on fixing them, as once promised by Monty.</li><li>Carefully check the lists in this blog post if you plan to upgrade to recent MariaDB 10.x.y versions.</li><li>Please report any problem with <b>mysql_upgrade</b> or upgrades in general to our JIRA.</li></ol>Valerii Kravchukhttp://www.blogger.com/profile/13158916419325454260noreply@blogger.com0tag:blogger.com,1999:blog-3080615211468083537.post-65823569288339557142021-01-25T21:30:00.000+02:002021-01-25T21:30:08.895+02:00Checking User Threads and Temporary Tables With gdb in MariaDB 10.4+, Step By Step<p>There were no posts about <b>gdb</b> tricks in this blog for a long time. This is surely unusual, but I had not done anything fancy with <b>gdb</b> for more than a year. Today I finally got a chance to find something new in the code and answer yet another question based on code review and some basic <b>gdb</b> commands.</p><p>The question was about the way to find out what temporary tables, if any, are created by some connection. There is no way to do this in MariaDB 10.4, see <a href="https://jira.mariadb.org/browse/MDEV-12459" target="_blank">MDEV-12459</a> for some related plans (and <a href="https://dev.mysql.com/doc/refman/5.7/en/innodb-information-schema-temp-table-info.html" target="_blank"><b>I_S.INNODB_TEMP_TABLE_INFO</b></a> of MySQL that appeared in MariaDB for a short time only). My immediate answer was that this is surely stored somewhere in the <b>THD</b> structure and I just have to find it (and a way to work with that information) using code review and/or <b>gdb</b>.</p><p>The first step was easy. 
I know that <b>THD</b> is defined in <a href="https://github.com/MariaDB/server/blob/10.4/sql/sql_class.h" target="_blank"><b>sql/sql_class.h</b></a>, and there I see:</p><blockquote><p>
<span style="font-size: x-small;"><span style="font-family: courier;">class THD: public THD_count, /* this must be first */<br /> public Statement,<br /> /*<br /> This is to track items changed during execution of a prepared<br /> statement/stored procedure. It's created by<br /> nocheck_register_item_tree_change() in memory root of THD,<br /> and freed in rollback_item_tree_changes().<br /> For conventional execution it's always empty.<br /> */<br /> public Item_change_list,<br /> public MDL_context_owner,<br /> <b>public Open_tables_state</b><br />...</span></span><br /></p></blockquote><p>Temporary tables surely must be somewhere in that <b>Open_tables_state</b>. In the same file we can <a href="class Open_tables_state" target="_blank">find the following</a>:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">class Open_tables_state<br />{<br />public:<br /> /**<br /> As part of class THD, this member is set during execution<br /> of a prepared statement. When it is set, it is used<br /> by the locking subsystem to report a change in table metadata.<br /> When Open_tables_state part of THD is reset to open<br /> a system or INFORMATION_SCHEMA table, the member is cleared<br /> to avoid spurious ER_NEED_REPREPARE errors -- system and<br /> INFORMATION_SCHEMA tables are not subject to metadata version<br /> tracking.<br /> @sa check_and_update_table_version()<br /> */<br /> Reprepare_observer *m_reprepare_observer;<br /><br /> /**<br /> List of regular tables in use by this thread. Contains temporary and<br /> base tables that were opened with @see open_tables().<br /> */<br /> TABLE *open_tables;<br /><br /><b> /**<br /> A list of temporary tables used by this thread. 
This includes<br /> user-level temporary tables, created with CREATE TEMPORARY TABLE,<br /> and internal temporary tables, created, e.g., to resolve a SELECT,<br /> or for an intermediate table used in ALTER.<br /> */<br /> All_tmp_tables_list *temporary_tables;</b><br />...</span></span><br /></p></blockquote><p>With this information I am ready to dive into <b>gdb</b> session. I have MariaDB 10.4.18 at hand and create a couple of temporary tables in connection with id 9:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">MariaDB [test]> <b>select version(), connection_id(), current_user();</b><br />+-----------------+-----------------+------------------+<br />| version() | connection_id() | current_user() |<br />+-----------------+-----------------+------------------+<br />| 10.4.18-MariaDB | <b>9</b> | openxs@localhost |<br />+-----------------+-----------------+------------------+<br />1 row in set (0,000 sec)<br /><br />MariaDB [test]> <b>create temporary table mytemp(c1 int, c2 varchar(100));</b><br />Query OK, 0 rows affected (0,034 sec)<br /><br />MariaDB [test]> <b>create temporary table mytemp2(id int, c2 int) engine=MyISAM;</b><br />Query OK, 0 rows affected (0,001 sec)<br /><br />MariaDB [test]> <b>show processlist;</b><br />+----+-------------+-----------+------+---------+------+--------------------------+------------------+----------+<br />| Id | User | Host | db | Command | Time | State | Info | Progress |<br />+----+-------------+-----------+------+---------+------+--------------------------+------------------+----------+<br />| 3 | system user | | NULL | Daemon | NULL | InnoDB purge worker | NULL | 0.000 |<br />| 4 | system user | | NULL | Daemon | NULL | InnoDB purge worker | NULL | 0.000 |<br />| 1 | system user | | NULL | Daemon | NULL | InnoDB purge worker | NULL | 0.000 |<br />| 2 | system user | | NULL | Daemon | NULL | InnoDB purge coordinator | NULL | 0.000 |<br />| 5 | system user | | NULL | Daemon | 
NULL | InnoDB shutdown handler | NULL | 0.000 |<br />| 9 | openxs | localhost | test | Query | 0 | Init | show processlist | 0.000 |<br />+----+-------------+-----------+------+---------+------+--------------------------+------------------+----------+<br />6 rows in set (0,000 sec)</span></span><br /></p></blockquote><p>Now I attach <b>gdb</b> and immediately try to check what's inside the <b>temporary_tables</b> field of the <b>do_command</b> frame where <b>thd</b> is present:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~$ <b>sudo gdb -p `pidof mysqld`</b><br />[sudo] password for openxs:<br />GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1<br />Copyright (C) 2016 Free Software Foundation, Inc.<br />License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html><br />This is free software: you are free to change and redistribute it.<br />There is NO WARRANTY, to the extent permitted by law. Type "show copying"<br />and "show warranty" for details.<br />This GDB was configured as "x86_64-linux-gnu".<br />Type "show configuration" for configuration details.<br />For bug reporting instructions, please see:<br /><http://www.gnu.org/software/gdb/bugs/>.<br />Find the GDB manual and other documentation resources online at:<br /><http://www.gnu.org/software/gdb/documentation/>.<br />For help, type "help".<br />Type "apropos word" to search for commands related to "word".<br />Attaching to process 26620<br />[New LWP 26621]<br /><b>... 
28 more LWPs were here</b><br />[New LWP 26658]<br />[Thread debugging using libthread_db enabled]<br />Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".<br />0x00007f59fb69080d in poll () at ../sysdeps/unix/syscall-template.S:84<br />84 ../sysdeps/unix/syscall-template.S: No such file or directory.<br />(gdb) <b>thread 31</b><br />[Switching to thread 31 (Thread 0x7f59dd6ca700 (LWP 26658))]<br />#0 0x00007f59fb69080d in poll () at ../sysdeps/unix/syscall-template.S:84<br />84 in ../sysdeps/unix/syscall-template.S<br />(gdb) <b>p do_command::thd->thread_id</b><br />$1 = <b>9</b><br />(gdb) <b>p do_command::thd->temporary_tables</b><br />$2 = (All_tmp_tables_list *) 0x5606b575cd28<br />(gdb) <b>p *do_command::thd->temporary_tables</b><br />$3 = {<I_P_List_null_counter> = {<No data fields>}, <I_P_List_no_push_back<TMP_TABLE_SHARE>> = {<No data fields>}, <b>m_first</b> = 0x5606b60dfa38}<br />(gdb) <b>p *do_command::thd->temporary_tables->m_first</b><br />$4 = {<TABLE_SHARE> = {table_category = TABLE_CATEGORY_TEMPORARY, name_hash = {<br /> key_offset = 0, key_length = 0, blength = 0, records = 0, flags = 0,<br /> array = {buffer = 0x0, elements = 0, max_element = 0,<br /> alloc_increment = 0, size_of_element = 0, malloc_flags = 0},<br /> get_key = 0x0, hash_function = 0x0, free = 0x0, charset = 0x0},<br /> mem_root = {free = 0x5606b60e3cf8, used = 0x5606b60e40e8, pre_alloc = 0x0,<br /> min_malloc = 32, block_size = 985, total_alloc = 2880, block_num = 6,<br /> first_block_usage = 0,<br /> error_handler = 0x5606b1e6bac0 <sql_alloc_error_handler()>,<br /> name = 0x5606b2538d63 "tmp_table_share"}, keynames = {count = 0,<br /> name = 0x0, type_names = 0x5606b60e3d78, type_lengths = 0x5606b60e3d94},<br /> fieldnames = {count = 2, name = 0x0, type_names = 0x5606b60e3d60,<br /> type_lengths = 0x5606b60e3d88}, intervals = 0x0, LOCK_ha_data = {<br /> m_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0,<br /> __kind = 0, __spins = 0, 
__elision = 0, __list = {__prev = 0x0,<br /> __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0},<br /> m_psi = 0x0}, LOCK_share = {m_mutex = {__data = {__lock = 0,<br /> __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0,<br /> __elision = 0, __list = {__prev = 0x0, __next = 0x0}},<br /> __size = '\000' <repeats 39 times>, __align = 0}, m_psi = 0x0},<br /> tdc = 0x0, tabledef_version = {<br /> str = 0x5606b60e3d10 "~A\370\275_4\021К°Ё\364\267\342\023=\275",<br /> length = 16}, option_list = 0x0, option_struct = 0x0,<br />---Type <return> to continue, or q <return> to quit---field = 0x5606b60e3dQuit</span></span><br /></p></blockquote><p>You may be wondering why I jumped to <b>Thread 31</b> immediately: how did I know that it corresponds to the connection with thread_id 9, as I verified with a later print? It was not pure luck, I knew I was the only user and just jumped to the last thread in order of creation. There is a better way for the general case, and it's navigating over a "list" of threads that must exist somewhere, as <b>SHOW PROCESSLIST</b> must have an easy way to get them all. We'll get back to that important task later in this post.</p><p>Now, in the <b>temporary_tables->m_first</b> field we have a table share, with a lot of details we may need. 
We can try to see some of them that were actually requested originally:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">(gdb) <b>p do_command::thd->temporary_tables->m_first->table_name</b><br />$5 = {str = 0x5606b60dfeef "mytemp2", length = 7}<br />(gdb) <b>p do_command::thd->temporary_tables->m_first->table_name.str</b><br />$6 = 0x5606b60dfeef <b>"mytemp2"</b><br />(gdb) <b>p do_command::thd->temporary_tables->m_first->path</b><br />$7 = {str = 0x5606b60dfed8 "/tmp/#sql67fc_9_3", length = 17}<br />(gdb) <b>p do_command::thd->temporary_tables->m_first->path.str</b><br />$8 = 0x5606b60dfed8 <b>"/tmp/#sql67fc_9_3"</b></span></span><br /></p></blockquote><p>So, I can get as many details as are presented in, or can be found from, the <b>TABLE_SHARE</b> structure. I see them immediately for the last temporary table I've created in that session. But what about the other one? There might be many of them. I expected some kind of a linked list or array, but the type information presented above gave me no real hint. Where is the next or previous item? The type name hints at a list, but that's all:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">(gdb) <b>p *do_command::thd->temporary_tables</b><br />$3
= {<I_P_List_null_counter> = {<No data fields>},
<b><I_P_List_no_push_back<TMP_TABLE_SHARE>></b> = {<No data
fields>}, <b>m_first</b> = 0x5606b60dfa38}</span></span> <br /></p></blockquote><p>The type, <b><I_P_List_no_push_back<TMP_TABLE_SHARE>></b>, looks like some template class instantiated with <b>TMP_TABLE_SHARE</b>, and I can find <a href="https://github.com/MariaDB/server/blob/b4fb15ccd4f2864483f8644c0236e63c814c8beb/sql/sql_plist.h#L19" target="_blank">the source code</a>:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">template <typename T> class I_P_List_no_push_back;<br /><br /><br />/**<br /> Intrusive parameterized list.<br /> Unlike I_List does not require its elements to be descendant of ilink<br /> class and therefore allows them to participate in several such lists<br /> simultaneously.<br /> Unlike List is doubly-linked list and thus supports efficient deletion<br /> of element without iterator.<br /> @param T Type of elements which will belong to list.<br /> @param B Class which via its methods specifies which members<br /> of T should be used for participating in this list.<br /> Here is typical layout of such class:<br /> struct B<br /> {<br /> static inline T **next_ptr(T *el)<br /> {<br /> return &el->next;<br /> }<br /> static inline T ***prev_ptr(T *el)<br /> {<br /> return &el->prev;<br /> }<br /> };<br /> @param C Policy class specifying how counting of elements in the list<br /> should be done. Instance of this class is also used as a place<br /> where information about number of list elements is stored.<br /> @sa I_P_List_null_counter, I_P_List_counter<br /> @param I Policy class specifying whether I_P_List should support<br /> efficient push_back() operation. 
Instance of this class<br /> is used as place where we store information to support<br /> this operation.<br /> @sa I_P_List_no_push_back, I_P_List_fast_push_back.<br />*/<br /><br />template <typename T, typename B,<br /> typename C = I_P_List_null_counter,<br /> typename I = I_P_List_no_push_back<T> ><br />class I_P_List : public C, public I<br />{<br /> T *<b>m_first</b>;<br />...</span></span><br /></p></blockquote><p>but I got lost in all this C++ stuff. Luckily I asked at the Engineering channel and got a hint that "I" in the name means "Intrusive" and that the base type <b>T</b> is supposed to include pointers to the next and previous item. Moreover, in the case of <b>TMP_TABLE_SHARE</b> they are named <b>tmp_next</b> and <b>tmp_prev</b>. I had to read the entire structure, as plain next and prev had not worked for me...</p><p>With this hint it was easy to proceed:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">(gdb) <b>p do_command::thd->temporary_tables->m_first->tmp_next</b><br />$12 = (TMP_TABLE_SHARE *) 0x5606b60df558<br />(gdb) <b>set $t = do_command::thd->temporary_tables->m_first</b><br />(gdb) <b>p $t</b><br />$13 = (TMP_TABLE_SHARE *) 0x5606b60dfa38<br />(gdb) <b>p $t->table_name.str</b><br />$14 = 0x5606b60dfeef "mytemp2"<br />(gdb) <b>set $t = $t->tmp_next</b><br />(gdb) <b>p $t</b><br />$15 = (TMP_TABLE_SHARE *) 0x5606b60df558<br />(gdb) <b>p $t->table_name.str</b><br />$16 = 0x5606b60dfa0f "mytemp"<br />(gdb) <b>set $t = $t->tmp_next</b><br />(gdb) p $t<br />$17 = (TMP_TABLE_SHARE *) <b>0x0</b></span></span><br /></p></blockquote><p>The idea is to iterate while <b>$t</b> is not zero, starting from <b>temporary_tables->m_first</b>. You can surely put it into a Python loop for automation. One day I'll do this too. For now I am happy to be able to list all temporary tables with all the details manually, with <b>gdb</b> commands.</p><p>The remaining question is: how to iterate over user threads in this MariaDB version? 
No more global <b>threads</b> variable:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">(gdb) <b>p threads</b><br />No symbol "threads" in current context.</span></span><br /></p></blockquote><p>No surprise, we had that changed <a href="http://mysqlentomologist.blogspot.com/2018/03/checking-user-threads-with-gdb-in-mysql.html" target="_blank">in MySQL 5.7+ too</a>. </p><p>Here I also used a hint from a way more experienced colleague, <a href="https://github.com/vuvova" target="_blank"><b>Sergei Golubchik</b></a>. That's what we have now:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">(gdb) <b>p server_threads<br /></b>$18 = {threads = {<base_ilist> = {first = 0x5606b60b6d28, last = {<br /> _vptr.ilink = 0x5606b2d03a38 <vtable for ilink+16>,<br /> prev = 0x7f59a80009b8, next = 0x0}}, <No data fields>}, lock = {<br /> m_rwlock = {__data = {__lock = 0, __nr_readers = 0, __readers_wakeup = 0,<br /> __writer_wakeup = 0, __nr_readers_queued = 0, __nr_writers_queued = 0,<br /> __writer = 0, __shared = 0, __rwelision = 0 '\000',<br /> __pad1 = "\000\000\000\000\000\000", __pad2 = 0, __flags = 0},<br /> __size = '\000' <repeats 55 times>, __align = 0}, m_psi = 0x0}}<br />(gdb) <b>ptype server_threads</b><br />type = class THD_list {<br /> private:<br /> <b>I_List<THD> threads;</b><br /> mysql_rwlock_t lock;<br /><br /> public:<br /> void init();<br /> void destroy();<br /> void insert(THD *);<br /> void erase(THD *);<br /> int iterate<std::vector<unsigned long long> >(my_bool (*)(THD *,<br /> std::vector<unsigned long long> *), std::vector<unsigned long long> *);<br />}<br />(gdb) <b>p server_threads.threads</b><br />$19 = {<base_ilist> = {<b>first</b> = 0x5606b60b6d28, last = {<br /> _vptr.ilink = 0x5606b2d03a38 <vtable for ilink+16>,<br /> prev = 0x7f59a80009b8, next = 0x0}}, <No data fields>}</span></span><br /></p></blockquote><p>From that I had to proceed myself. 
I already know what "I" means in these templates, so I expect to find the next pointer somewhere if I start from <b>first</b>:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">(gdb) <b>p server_threads.threads</b><br />$19 = {<base_ilist> = {first = 0x5606b60b6d28, last = {<br /> _vptr.ilink = 0x5606b2d03a38 <vtable for ilink+16>,<br /> prev = 0x7f59a80009b8, next = 0x0}}, <No data fields>}<br />(gdb) <b>p server_threads.threads.first</b><br />$20 = (ilink *) 0x5606b60b6d28<br />(gdb) <b>p *server_threads.threads.first</b><br />$21 = {_vptr.ilink = 0x5606b2d08f80 <vtable for THD+16>,<br /> prev = 0x5606b2ed1de0 <server_threads>, next = 0x7f59980009a8}<br />(gdb) <b>set $thd = (THD *)server_threads.threads.first</b><br />(gdb) <b>p $thd->thread_id</b><br />$22 = <b>9</b></span></span><br /></p></blockquote><p>This was the initialization part, now let's check some more and iterate:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">(gdb) <b>p $thd->proc_info</b><br />$23 = 0x5606b2521cd9 "Reset for next command"<br />(gdb) <b>set $thd = (THD *)$thd->next</b><br />(gdb) <b>p $thd->thread_id</b><br />$24 = 5<br />(gdb) <b>p $thd->proc_info</b><br />$25 = 0x5606b267a145 "InnoDB shutdown handler"<br />(gdb) <b>set $thd = (THD *)$thd->next</b><br />(gdb) <b>p $thd->thread_id</b><br />$26 = 2<br />(gdb) <b>p $thd->main_security_ctx.user</b><br />$27 = 0x0<br />(gdb) <b>p $thd->proc_info</b><br />$28 = 0x5606b26a3da9 "InnoDB purge coordinator"<br />(gdb) <b>set $thd = (THD *)$thd->next</b><br />(gdb) <b>p $thd->thread_id</b><br />$29 = 1<br />(gdb) <b>p $thd->proc_info</b><br />$30 = 0x5606b26a3e20 "InnoDB purge worker"<br />(gdb) <b>set $thd = (THD *)$thd->next</b><br />(gdb) <b>p $thd->thread_id</b><br />$31 = 4<br />(gdb) <b>p $thd->proc_info</b><br />$32 = 0x5606b26a3e20 "InnoDB purge worker"<br />(gdb) set $thd = (THD *)$thd->next<br />(gdb) <b>p $thd->thread_id</b><br />$33 = 3<br />(gdb) <b>p 
$thd->proc_info</b><br />$34 = 0x5606b26a3e20 "InnoDB purge worker"<br />(gdb) <b>set $thd = (THD *)$thd->next</b><br />(gdb) <b>p $thd->thread_id</b><br />$35 = 1095216660735<br />(gdb) <b>p $thd->proc_info</b><br />$36 = 0x0<br />(gdb) <b>set $thd = (THD *)$thd->next</b><br />(gdb) <b>p $thd</b><br />$37 = (THD *) 0x0</span></span><br /></p></blockquote><p style="text-align: left;">The idea of iteration is also clear: we move to <b>$thd->next</b> if it's not zero. What we see matches the <b>SHOW PROCESSLIST</b> output with the exception of the last thread, with zero <b>proc_info </b>too. It is some "sentinel" that is not present in the <b>PROCESSLIST</b>. One day I'll figure out why it is so and automate checking all threads based on Python code <a href="http://mysqlbugs.blogspot.com/2012/09/how-to-obtain-all-executing-queries.html" target="_blank">of this kind, suggested by <b>Shane Bester</b></a>. Tonight I am just happy to document what I recently found, as all details related to <b>gdb</b> usage do change with time and new versions released.</p><p style="text-align: left;"><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgp8Df2nq9vgP7EwCwZQvIIjocr7B1wHS5RrUGIrk6ZWi8Jmsq60a74wnYZIrCOSOIlfoKwU8gxoFydhG_8eYLHwhriQAtCjOAegHHARtt6-eAwyVGbpaHX3iGV2k3LY2-FVgAssKTO6nQA/s640/Nokia+030.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="480" data-original-width="640" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgp8Df2nq9vgP7EwCwZQvIIjocr7B1wHS5RrUGIrk6ZWi8Jmsq60a74wnYZIrCOSOIlfoKwU8gxoFydhG_8eYLHwhriQAtCjOAegHHARtt6-eAwyVGbpaHX3iGV2k3LY2-FVgAssKTO6nQA/w400-h300/Nokia+030.jpg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Free travels and digging into the code 
in <b>gdb</b> with a specific goal in mind - I miss these activities recently</td></tr></tbody></table></p><p style="text-align: center;">* * *<br /></p><p>To summarize:</p><ol style="text-align: left;"><li>It's relatively easy to find out all the details about every temporary table of any kind created in any MariaDB server user thread, in <b>gdb</b>. <br /></li><li>It's still fun to work on MariaDB, as you can promptly get help from developers no matter what crazy questions you may ask.</li><li>Changes towards more modern C++ may make it more difficult to debug in <b>gdb</b> initially for those unaware of the details of the classes' implementation and design.<br /></li></ol>Valerii Kravchuk 2021-01-24T15:00:00.000+02:00 Playing with recent bpftrace and MariaDB 10.5 on Fedora - Part II, Using the Existing Tools<p>In the <a href="https://mysqlentomologist.blogspot.com/2021/01/playing-with-recent-bpftrace-and.html" target="_blank">previous post</a> in this series I've presented a couple of my quick and dirty attempts to use <b>bpftrace</b> to add uprobes to the MariaDB server and the dynamic libraries it uses, to trace queries and their execution times and to collect stack traces related to mutex waits. It looks like for any non-trivial monitoring task we are going to end up with more than one probe, and we would need to do more processing to produce clean and useful results with minimal CPU and memory impact, both while collecting the data in kernel context and while processing them in user space. </p><p>I have a long way to go with my lame commands towards this goal, so in this post I decided to check several existing <b>bpftrace</b> programs not directly related to MariaDB, to see how they are structured, how they use built-in variables and functions etc. 
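Before looking at the real tools, here is a minimal sketch of my own (not one of the bundled tools) showing the typical shape such programs share: a BEGIN probe for the header, one or more traced events filling maps, and maps that are dumped automatically on exit unless clear()ed:

```bpftrace
#!/usr/bin/env bpftrace

BEGIN
{
	printf("Counting vfs_read() calls per command... Hit Ctrl-C to end.\n");
}

// Count calls per process name in the @reads map
kprobe:vfs_read
{
	@reads[comm] = count();
}
```

There is no END probe here on purpose: any map that is not clear()ed, like @reads above, is printed automatically when the program exits on Ctrl-C. It has to be run as root on a system with bpftrace installed.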
I'll also try to apply some of them to MariaDB 10.5 running a <b>sysbench</b> read-write test with high enough concurrency.</p><p>The most popular tools are <a href="https://github.com/iovisor/bpftrace/tree/master/tools" target="_blank">included in the <b>bpftrace</b> source code</a>, along with examples of their usage. We can find them in the <b>tools</b> subdirectory:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[root@fc31 tools]# <b>pwd</b><br />/mnt/home/openxs/git/bpftrace/tools<br />[root@fc31 tools]# <b>ls</b><br />bashreadline.bt loads_example.txt syscount_example.txt<br />bashreadline_example.txt mdflush.bt tcpaccept.bt<br /><b>biolatency.bt</b> mdflush_example.txt tcpaccept_example.txt<br />biolatency_example.txt naptime.bt tcpconnect.bt<br /><b>biosnoop.bt</b> naptime_example.txt tcpconnect_example.txt<br />biosnoop_example.txt oomkill.bt tcpdrop.bt<b><br />biostacks.bt</b> oomkill_example.txt tcpdrop_example.txt<br />biostacks_example.txt <b>opensnoop.bt</b> tcplife.bt<br /><b>bitesize.bt</b> opensnoop_example.txt tcplife_example.txt<br />bitesize_example.txt pidpersec.bt tcpretrans.bt<br />capable.bt pidpersec_example.txt tcpretrans_example.txt<br />capable_example.txt runqlat.bt tcpsynbl.bt<br />CMakeLists.txt runqlat_example.txt tcpsynbl_example.txt<br />cpuwalk.bt runqlen.bt <b>threadsnoop.bt</b><br />cpuwalk_example.txt runqlen_example.txt threadsnoop_example.txt<br />dcsnoop.bt setuids.bt <b>vfscount.bt</b><br />dcsnoop_example.txt setuids_example.txt vfscount_example.txt<br />execsnoop.bt <b>statsnoop.bt</b> <b>vfsstat.bt</b><br />execsnoop_example.txt statsnoop_example.txt vfsstat_example.txt<br />gethostlatency.bt swapin.bt <b>writeback.bt</b><br />gethostlatency_example.txt swapin_example.txt writeback_example.txt<br />killsnoop.bt <b>syncsnoop.bt</b> xfsdist.bt<br />killsnoop_example.txt syncsnoop_example.txt xfsdist_example.txt<br />loads.bt syscount.bt</span></span><br /></p></blockquote><p>I
highlighted the tools I am going to try. But to begin with, let's check the source code for one of them,<b> biosnoop.bt</b>, with its quite non-trivial, 48-line code. My comments are after each code fragment below:</p><blockquote><p><span style="font-size: x-small;"><span style="font-family: courier;"><b>#!/usr/bin/env bpftrace<br />#include <linux/blkdev.h></b><br />/*<br /> * biosnoop.bt Block I/O tracing tool, showing per I/O latency.<br /> * For Linux, uses bpftrace, eBPF.<br /> *<br /> * TODO: switch to block tracepoints. Add offset and size columns.<br /> *<br /> * This is a bpftrace version of the bcc tool of the same name.<br /> *<br /> * 15-Nov-2017 Brendan Gregg Created this.<br /> */<br /></span></span><br /></p></blockquote><p>Here we can see how to use a shebang first line to run the program with <b>bpftrace</b> if it's executable. The next line shows that in some cases <b>bpftrace</b> (like other eBPF tools) may need headers (in this case a kernel header) to be able to resolve references to complex structures passed as arguments.</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">BEGIN<br />{<br /> printf("%-12s %-7s %-16s %-6s %7s\n", "TIME(ms)", "DISK", "COMM", "PID", "LAT(ms)");<br />}</span></span><br /></p></blockquote><p>Next we see the <b>BEGIN</b> probe that, same as with <b>awk</b>, is executed once at the beginning of the program and in this case prints the formatted header for the further output.</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;"><b>kprobe:blk_account_io_start</b><br />{<br /> @start[arg0] = <b>nsecs</b>;<br /> @iopid[arg0] = <b>pid</b>;<br /> @iocomm[arg0] = <b>comm</b>;<br /> @disk[arg0] = <b>((struct request *)arg0)->rq_disk->disk_name</b>;<br />}<br /></span></span><br /></p></blockquote><p>Here we define a kernel probe for the <b>blk_account_io_start</b> function and store information in 4 <a 
href="https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#3--associative-arrays" target="_blank"><i>associative arrays</i></a> indexed by <b>arg0</b>, to store the start time of the call in nanoseconds (<a href="https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#6-nsecs-timestamps-and-time-deltas" target="_blank"><b>nsecs</b></a>), the <a href="https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#1-builtins" target="_blank"><b>pid</b></a> of the calling program, the calling command itself (<b>comm</b>) and the disk name that we get from the deeply nested structure of the first traced function call argument, <b>arg0</b>, via a type cast and pointers. That's why we needed kernel headers: to reference different structure members by name and eventually de-reference the proper offset/address.</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;"><b>kprobe:blk_account_io_done<br />/@start[arg0] != 0 && @iopid[arg0] != 0 && @iocomm[arg0] != ""/<br /></b><br />{<br /> $now = nsecs;<br /> printf("%-12u %-7s %-16s %-6d %7d\n",<br /> elapsed / 1e6, @disk[arg0], @iocomm[arg0], @iopid[arg0],<br /> ($now - @start[arg0]) / 1e6);<br /><br /> <b>delete</b>(@start[arg0]);<br /> delete(@iopid[arg0]);<br /> delete(@iocomm[arg0]);<br /> delete(@disk[arg0]);<br />}</span></span><br /></p></blockquote><p>In the probe above that we define for the <b>blk_account_io_done</b> function, we first make sure to do something only if <b>blk_account_io_start</b> was already traced for the same request (the same <b>arg0</b>); otherwise we would not match the times properly. The problem here is that the I/O is completed not when the start function returns, but when the kernel later calls the completion function, so this is not as simple as pairing a <b>kprobe</b>/<b>kretprobe</b> on one and the same function.</p><p>The action of the probe is simple. 
We calculate the time difference in milliseconds by comparing the stored start timestamp with the current <b>nsecs</b> value and output the details collected. Then, and this is essential, we delete the element from the associative array (a.k.a. <i>map</i>) with <a href="https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#1-builtins-2" target="_blank"><b>delete()</b></a>, so that repeated calls from the same process to the same disk are not mixed up together and we do not use more memory than really needed.</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">END<br />{<br /> clear(@start);<br /> clear(@iopid);<br /> clear(@iocomm);<br /> clear(@disk);<br />}</span></span><br /></p></blockquote><p>In the <b>END</b> probe we <b>clear()</b> all associative arrays that were used by the program, essentially deleting all the items in the maps. Otherwise, as we've seen in my lame examples previously, they are dumped at the end of the program, and for this tool that produces monitoring output while it runs, we definitely do not need that.<br /></p><p>Now that we better understand how "real" <b>bpftrace</b> programs are usually structured and designed, let's try to run <a href="https://github.com/iovisor/bpftrace/blob/master/tools/biosnoop_example.txt" target="_blank"><b>biosnoop.bt</b></a> that traces block I/O, and shows the issuing process (at least, the process that was on-CPU at the time of queue insert) and the latency of the I/O:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[root@fc31 tools]# ./biosnoop.bt<br />Attaching 4 probes...<br />TIME(ms) DISK COMM PID LAT(ms)<br />98 sda mariadbd 6147 0<br />98 sda mariadbd 6147 0<br />...<br /><b>195 sda mariadbd 6147 92<br /></b>196 sda mariadbd 6147 0<br />...<br />201 sda mariadbd 6147 0<br /><b>274 sda mariadbd 6147 73</b><br />275 sda mariadbd 6147 0<br />278 sda mariadbd 6147 2<br />293 sda mariadbd 6147 9<br />295 sda 
mariadbd 6147 0<br />295 sda mariadbd 6147 0<br />303 sda mariadbd 6147 7<br />303 sda mariadbd 6147 0<br />303 sda mariadbd 6147 0<br />303 sda mariadbd 6147 0<br />304 sda mariadbd 6147 0<br />304 sda mariadbd 6147 0<br />305 sda mariadbd 6147 0<br />305 sda mariadbd 6147 0<br />306 sda mariadbd 6147 0<br />335 sda mariadbd 6147 29<br />336 sda jbd2/dm-0-8 419 14<br />337 sda mariadbd 6147 1<br /><b>365 sda jbd2/dm-0-8 419 28<br />365 sda mariadbd 6147 28<br />365 sda kworker/2:4 24472 0</b><br />392 sda mariadbd 6147 26<br />...</span></span><br /></p></blockquote><p>Here we see block I/O requests for my only disk, <b>sda</b>, from <b>mariadbd</b> and a few other processes, with timestamps starting from the startup and the related latency (that is less than 1 millisecond in most cases, but sometimes approached 100 milliseconds on this HDD).<br /></p><p>The next tool to check is <a href="https://github.com/iovisor/bpftrace/blob/master/tools/biolatency_example.txt" target="_blank"><b>biolatency.bt</b></a> that traces block I/O and shows latency as a power-of-2 histogram using the <a href="https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#8-hist-log2-histogram" target="_blank"><b>hist()</b></a> function:</p><blockquote><p><span style="font-size: xx-small;"><span style="font-family: courier;">[root@fc31 tools]# <b>./biolatency.bt</b><br />Attaching 4 probes...<br />Tracing block device I/O... 
Hit Ctrl-C to end.<br /><b>^C</b><br /><br /><br />@usecs:<br />[128, 256) 31 |@@ |<br /><b>[256, 512) 421 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |<br />[512, 1K) 754 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|<br /></b>[1K, 2K) 94 |@@@@@@ |<br />[2K, 4K) 38 |@@ |<br />[4K, 8K) 36 |@@ |<br />[8K, 16K) 65 |@@@@ |<br />[16K, 32K) 109 |@@@@@@@ |<br />[32K, 64K) 80 |@@@@@ |<br />[64K, 128K) 25 |@ |<br />[128K, 256K) 16 |@ |</span></span><br /></p></blockquote><p>So, during the monitoring interval, until I hit Ctrl+C, the majority of block I/O calls had a latency of 256 to 1024 microseconds, less than 1 millisecond. There were a few longer-waiting calls too.</p><p>The next tool to check is <a href="https://github.com/iovisor/bpftrace/blob/master/tools/biostacks_example.txt" target="_blank"><b>biostacks.bt</b></a> that is supposed to show block I/O latency as a histogram, with the kernel stack trace that initiated the I/O (do you still remember about <b>/proc</b> kernel stacks sampling that can be used for similar purposes?). This can help explain disk I/O that is not directly requested by applications. I've got the following:</p><blockquote><p><span style="font-size: xx-small;"><span style="font-family: courier;">[root@fc31 tools]# <b>./biostacks.bt > /tmp/biostacks.txt</b><br />cannot attach kprobe, probe entry may not exist<br />WARNING: could not attach probe kprobe:blk_start_request, skipping.<br />Attaching 5 probes...<br />Tracing block I/O with init stacks. 
Hit Ctrl-C to end.<br /><br /><b>^C</b>[root@fc31 tools]# <b>more /tmp/biostacks.txt<br /></b><br />...<br /><br />@usecs[<br /> blk_account_io_start+1<br /> blk_mq_make_request+481<br /> generic_make_request+653<br /> submit_bio+75<br /> ext4_io_submit+73<br /> ext4_writepages+694<br /> do_writepages+51<br /> __filemap_fdatawrite_range+172<br /> file_write_and_wait_range+107<br /> ext4_sync_file+240<br /> do_fsync+56<br /> __x64_sys_fdatasync+19<br /> do_syscall_64+77<br /> entry_SYSCALL_64_after_hwframe+68<br />]:<br />[4K, 8K) 2 |@@ |<br />[8K, 16K) 2 |@@ |<br />[16K, 32K) 9 |@@@@@@@@@@@@ |<br />[32K, 64K) 0 | |<br />[64K, 128K) 1 |@ |<br />[128K, 256K) 38 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|<br />[256K, 512K) 19 |@@@@@@@@@@@@@@@@@@@@@@@@@@ |<br />[512K, 1M) 5 |@@@@@@ |<br />[1M, 2M) 0 | |<br />[2M, 4M) 4 |@@@@@ |<br />[4M, 8M) 9 |@@@@@@@@@@@@ |<br />[8M, 16M) 2 |@@ |<br />[16M, 32M) 1 |@ |<br />[32M, 64M) 1 |@ |<br />[64M, 128M) 1 |@ |<br /><br />@usecs[<br /> blk_account_io_start+1<br /> blk_mq_make_request+481<br /> generic_make_request+653<br /> submit_bio+75<br /> ext4_io_submit+73<br /> ext4_bio_write_page+609<br /> mpage_submit_page+97<br /> mpage_process_page_bufs+274<br /> mpage_prepare_extent_to_map+437<br /> ext4_writepages+668<br /> do_writepages+51<br /> __filemap_fdatawrite_range+172<br /> file_write_and_wait_range+107<br /> ext4_sync_file+240<br /> do_fsync+56<br /> __x64_sys_fdatasync+19<br /> do_syscall_64+77<br /> entry_SYSCALL_64_after_hwframe+68<br />]:<br />[16K, 32K) 8 |@ |<br />[32K, 64K) 31 |@@@@ |<br />[64K, 128K) 14 |@@ |<br /><b>[128K, 256K) 346 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|<br />[256K, 512K) 180 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ |<br />[512K, 1M) 22 |@@@ |</b><br />[1M, 2M) 2 | |<br />[2M, 4M) 3 | |<br />[4M, 8M) 3 | |<br />[8M, 16M) 2 | |<br />[16M, 32M) 2 | |<br />[32M, 64M) 1 | |<br />[64M, 128M) 1 | |</span></span><br /></p></blockquote><p>I left a couple of stacks with typical high enough 
latency related to fdatasync on ext4.</p><p><a href="https://github.com/iovisor/bpftrace/blob/master/tools/bitesize_example.txt" target="_blank"><b>bitesize.bt</b></a> shows the usual block I/O request sizes in bytes per program. This is what I've got for MariaDB:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: xx-small;">[root@fc31 tools]# <b>./bitesize.bt</b><br />Attaching 3 probes...<br />Tracing block device I/O... Hit Ctrl-C to end.<br /><b>^C</b><br />I/O size (bytes) histograms by process name:<br /><br />@[NetworkManager]:<br />[4K, 8K) 1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|<br /><br />@[jbd2/dm-0-8]:<br />[4K, 8K) 1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|<br />[8K, 16K) 0 | |<br />[16K, 32K) 0 | |<br />[32K, 64K) 0 | |<br />[64K, 128K) 0 | |<br />[128K, 256K) 1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|<br /><br />...<br /><b>@[mariadbd]:<br /></b>[0] 3 |@ |<br />[1] 0 | |<br />[2, 4) 0 | |<br />[4, 8) 0 | |<br />[8, 16) 0 | |<br />[16, 32) 0 | |<br />[32, 64) 0 | |<br />[64, 128) 0 | |<br />[128, 256) 0 | |<br />[256, 512) 0 | |<br />[512, 1K) 0 | |<br />[1K, 2K) 0 | |<br />[2K, 4K) 0 | |<br />[4K, 8K) 0 | |<br />[8K, 16K) 0 | |<br /><b>[16K, 32K) 94 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|<br /></b>[32K, 64K) 26 |@@@@@@@@@@@@@@ |<br />[64K, 128K) 16 |@@@@@@@@ |<br />[128K, 256K) 8 |@@@@ |<br />[256K, 512K) 1 | |<br />[512K, 1M) 2 |@ |<br />[1M, 2M) 7 |@@@ |<br />...</span></span><br /></p></blockquote><p>The majority of I/Os were in the 16K to 32K range, one or two InnoDB data pages.</p><p>The next well-known tool is <a href="https://github.com/iovisor/bpftrace/blob/master/tools/opensnoop_example.txt" target="_blank"><b>opensnoop.bt</b></a> that traces the <b>open()</b> syscall system-wide, and prints various details. 
This is what I've got while a <b>sysbench </b>test was starting:<br /></p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[root@fc31 tools]# <b>./opensnoop.bt</b><br />Attaching 6 probes...<br />Tracing open syscalls... Hit Ctrl-C to end.<br />PID COMM FD ERR PATH<br />...<br /><b>25558 sysbench 29 0 /usr/local/share/sysbench/oltp_read_write.lua</b><br />25558 sysbench 29 0 /usr/local/share/sysbench/oltp_read_write.lua<br /><b>6147 mariadbd 45 0 ./sbtest/db.opt<br />6147 mariadbd 31 0 ./sbtest/db.opt<br />6147 mariadbd 35 0 ./sbtest/db.opt<br />6147 mariadbd 29 0 ./sbtest/db.opt<br />6147 mariadbd 27 0 ./sbtest/db.opt</b><br />25558 sysbench -1 2 ./oltp_common.lua<br />25558 sysbench -1 2 ./oltp_common/init.lua<br />25558 sysbench -1 2 ./src/lua/oltp_common.lua<br />25558 sysbench -1 2 /home/openxs/.luarocks/share/lua/5.1/oltp_common.lua<br />25558 sysbench -1 2 /home/openxs/.luarocks/share/lua/5.1/oltp_common/init.lua<br />25558 sysbench -1 2 /home/openxs/.luarocks/share/lua/oltp_common.lua<br />25558 sysbench -1 2 /home/openxs/.luarocks/share/lua/oltp_common/init.lua<br /><b>25558 sysbench -1 2 /usr/local/share/lua/5.1/oltp_common.lua<br />25558 sysbench -1 2 /usr/share/lua/5.1/oltp_common.lua</b><br />25558 sysbench 32 0 /usr/local/share/sysbench/oltp_common.lua<br />25558 sysbench 32 0 /usr/local/share/sysbench/oltp_common.lua<br />6147 mariadbd 80 0 ./sbtest/db.opt<br />...</span></span><br /></p></blockquote><p>It's interesting to find out that <b>sysbench</b>, based on test names, tries to open <b>.lua</b> files in some predefined locations where they do not exist (as error 2, ENOENT, indicates).</p><p>There are also similar tools to trace <b>stat()</b> and <b>sync()</b> calls:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[root@fc31 tools]# <b>./statsnoop.bt</b><br />Attaching 10 probes...<br />Tracing stat syscalls... 
Hit Ctrl-C to end.<br />PID COMM ERR PATH<br />...<br />6147 mariadbd 0 /home/openxs/dbs/maria10.5/data/mysql/table_stats.MAI<br />6147 mariadbd 0 ./mysql/table_stats.MAI<br />6147 mariadbd 0 ./mysql/table_stats.MAD<br />6147 mariadbd 0 /home/openxs/dbs/maria10.5/data/mysql<br /><b>6147 mariadbd 0 /home/openxs/dbs/maria10.5/data/mysql/column_stats.MAI<br />6147 mariadbd 0 ./mysql/column_stats.MAI<br />6147 mariadbd 0 ./mysql/column_stats.MAD</b><br />...<br />6147 mariadbd 0 /home/openxs/dbs/maria10.5/data/mysql/table_stats.MAI<br />6147 mariadbd 0 ./mysql/table_stats.MAI<br />6147 mariadbd 0 /home/openxs/dbs/maria10.5/data/mysql<br />6147 mariadbd 0 /home/openxs/dbs/maria10.5/data/mysql/column_stats.MAI<br />6147 mariadbd 0 ./mysql/column_stats.MAI<br />6147 mariadbd 0 /home/openxs/dbs/maria10.5/data/mysql<br />6147 mariadbd 0 /home/openxs/dbs/maria10.5/data/mysql/table_stats.MAI<br />6147 mariadbd 0 ./mysql/table_stats.MAI<br /><b>6147 mariadbd 0 ./sbtest/sbtest3.frm<br />6147 mariadbd 0 ./sbtest/sbtest1.frm<br />6147 mariadbd 0 ./sbtest/sbtest2.frm</b><br />...<br /><br />[root@fc31 tools]# <b>./syncsnoop.bt</b><br />Attaching 7 probes...<br />Tracing sync syscalls... Hit Ctrl-C to end.<br />TIME PID COMM EVENT<br />12:10:36 621 auditd tracepoint:syscalls:sys_enter_fsync<br /><b>12:10:36 6147 mariadbd tracepoint:syscalls:sys_enter_fdatasync<br />12:10:36 6147 mariadbd tracepoint:syscalls:sys_enter_fdatasync</b><br />...</span></span><br /></p></blockquote><p>We see that the MariaDB server checks not only the <b>sbtest.*</b> tables used in the test, but also the tables with engine-independent statistics in the <b>mysql</b> database. 
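The essence of <b>syncsnoop.bt</b> is just a handful of sync-family tracepoints with a wall-clock timestamp printed before each event; a minimal sketch of the idea (the real tool covers more sync variants and prints a header) looks like this:

```shell
# Print a wall-clock timestamp, PID, command and probe name
# for every fsync()/fdatasync() call, system-wide.
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_fsync,
tracepoint:syscalls:sys_enter_fdatasync
{
    time("%H:%M:%S ");
    printf("%-6d %-16s %s\n", pid, comm, probe);
}'
```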
In the latter output we see the wall-clock timestamp of the call (getting one is <a href="https://mysqlentomologist.blogspot.com/2019/11/time-in-performance-schema.html" target="_blank">a problem with Performance Schema</a>, by the way) that is provided by the <a href="https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#3-time-time" target="_blank"><b>time()</b></a> builtin function of <b>bpftrace</b>. The printed timestamp is also asynchronous: it is the time at
which userspace has processed the queued-up event, <em>not</em> the time at which the
<b>bpftrace</b> probe calls <b>time()</b>.</p><p> Some tools may not work at all, for example:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[root@fc31 tools]# <a href="https://github.com/iovisor/bpftrace/blob/master/tools/threadsnoop_example.txt" target="_blank"><b>./threadsnoop.bt</b></a><br />./threadsnoop.bt:19-21: ERROR: uprobe target file '/lib/x86_64-linux-gnu/libpthread.so.0' does not exist or is not executable<br /><br />[openxs@fc31 ~]$ <b>ldd /home/openxs/dbs/maria10.5/bin/mariadbd | grep thread</b><br /> libpthread.so.0 => <b>/lib64/libpthread.so.0</b> (0x00007f3d957bf000)</span></span><br /></p></blockquote><p>That's because the library in my case is in a different directory, so the pathname hardcoded in the tool has to be adjusted.</p><p>One can trace <b>vfs_*</b> functions too. For example, <a href="https://github.com/iovisor/bpftrace/blob/master/tools/vfscount_example.txt" target="_blank"><b>vfscount.bt</b></a> just traces and counts all VFS calls:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[root@fc31 tools]# <b>./vfscount.bt</b><br />Attaching 65 probes...<br />Tracing VFS calls... 
Hit Ctrl-C to end.<br />^C<br /><br />@[vfs_test_lock]: 2<br />@[vfs_symlink]: 3<br />@[vfs_setxattr]: 3<br />@[vfs_getxattr]: 4<br />@[vfs_mkdir]: 9<br />@[vfs_rmdir]: 9<br />@[vfs_rename]: 19<br />@[vfs_readlink]: 105<br />@[vfs_unlink]: 113<br />@[vfs_fallocate]: 394<br />@[vfs_statfs]: 450<br />@[vfs_lock_file]: 1081<br />@[vfs_fsync_range]: 1752<br />@[vfs_statx]: 2789<br />@[vfs_statx_fd]: 2846<br />@[vfs_open]: 2925<br />@[vfs_getattr]: 5360<br />@[vfs_getattr_nosec]: 5490<br />@[vfs_writev]: 6340<br />@[vfs_readv]: 12482<br /><b>@[vfs_write]: 161284<br />@[vfs_read]: 307655</b></span></span><br /></p></blockquote><p>We can surely summarize calls per second as <b><a href="https://github.com/iovisor/bpftrace/blob/master/tools/vfsstat_example.txt" target="_blank">vfsstat.bt</a></b> does:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[root@fc31 tools]# <b>./vfsstat.bt</b><br />Attaching 11 probes...<br />Tracing key VFS calls... Hit Ctrl-C to end.<br />12:15:14<br />@[vfs_open]: 22<br />@[vfs_writev]: 64<br />@[vfs_readv]: 124<br />@[vfs_read]: 1631<br />@[vfs_write]: 2015<br /><br />12:15:15<br />@[vfs_readv]: 96<br />@[vfs_write]: 1006<br />@[vfs_read]: 1201<br />@[vfs_writev]: 3093<br /><br />12:15:16<br />@[vfs_open]: 54<br />@[vfs_readv]: 139<br />@[vfs_writev]: 153<br />@[vfs_read]: 2640<br />@[vfs_write]: 4003<br /><br />12:15:17<br />@[vfs_open]: 6<br />@[vfs_writev]: 89<br />@[vfs_readv]: 132<br />@[vfs_write]: 1709<br />@[vfs_read]: 3904<br /><br />12:15:18<br />@[vfs_readv]: 271<br />@[vfs_write]: 1689<br />@[vfs_writev]: 2709<br />@[vfs_read]: 4479<br /><br /><b>^C</b></span></span><br /></p></blockquote><p>The <a href="https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#2-interval-interval-output" target="_blank"><b>interval</b></a> probe is used for this in the code:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">interval:s:1<br />{<br /> 
time();<br /> print(@);<br /> clear(@);<br />}</span></span><br /></p></blockquote><p>Let me check the last tool for today, <b><a href="https://github.com/iovisor/bpftrace/blob/master/tools/writeback_example.txt" target="_blank">writeback.bt</a></b>, that traces when the kernel writeback procedure is writing dirtied pages to disk, and shows details such as the time, device numbers, reason for the write back, and the duration:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[root@fc31 tools]# <b>./writeback.bt<br /></b>Attaching 4 probes...<br />Tracing writeback... Hit Ctrl-C to end.<br />TIME DEVICE PAGES REASON ms<br />12:16:02 253:0 65486 background 0.000<br />12:16:02 253:0 43351 periodic 0.005<br />12:16:02 253:0 43351 periodic 0.005<br />12:16:02 253:0 43351 periodic 0.000<br />12:16:02 253:0 65534 background 0.045<br />12:16:02 253:0 65534 background 0.000<br />12:16:05 253:0 43508 periodic 0.006<br />12:16:06 253:0 43575 periodic 0.004<br />12:16:06 8:0 43575 periodic 0.004<br />12:16:06 253:0 43575 periodic 0.001<br /><b>12:16:07 253:0 65495 background 434.285</b><br />12:16:07 253:0 43947 periodic 0.005<br />12:16:07 253:0 43676 periodic 0.000<br />12:16:07 253:0 43549 periodic 22.272<br />12:16:07 253:0 43549 periodic 0.001<br /><b>12:16:11 253:0 43604 periodic 301.541<br />12:16:11 253:0 43528 periodic 147.890<br />12:16:11 253:0 43433 periodic 119.225</b><br />12:16:11 253:0 43433 periodic 0.004<br />12:16:11 253:0 43433 periodic 0.000<br /><b>^C</b></span></span><br /></p></blockquote><p>We clearly see notable time spent on background writeback at 12:16:07 and on periodic flushes at 12:16:11.<br /></p><p>Yet another source of tools to check is <a href="http://www.brendangregg.com/bpf-performance-tools-book.html" target="_blank"><b>Brendan Gregg</b>'s book</a> and related GitHub source code examples. 
You can get them from GitHub as follows:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 git]$ <b>git clone https://github.com/brendangregg/bpf-perf-tools-book.git</b><br />Cloning into 'bpf-perf-tools-book'...<br />remote: Enumerating objects: 600, done.<br />remote: Total 600 (delta 0), reused 0 (delta 0), pack-reused 600<br />Receiving objects: 100% (600/600), 991.41 KiB | 2.73 MiB/s, done.<br />Resolving deltas: 100% (394/394), done.<br />[openxs@fc31 git]$ <b>cd bpf-perf-tools-book/</b><br />[openxs@fc31 bpf-perf-tools-book]$ <b>ls</b><br />exercises images originals README.md updated</span></span><br /></p></blockquote><p>One day I'll check them in more detail and share some outputs. This blog post is already too long...<br /></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyfH5jhcm4R7SccKJA3UhCOx_8UIqw12omNl2-VvuhTNyHcF7BAWhpS7ej867bniiS6ebySlcKcxNI0zcsV16HBgs0Wk8e841OqWU4tL_A4f2i3gx7Mm4i4T8_W3hziq12J_pRmJtMFQUh/s640/Nokia+075.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="480" data-original-width="640" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyfH5jhcm4R7SccKJA3UhCOx_8UIqw12omNl2-VvuhTNyHcF7BAWhpS7ej867bniiS6ebySlcKcxNI0zcsV16HBgs0Wk8e841OqWU4tL_A4f2i3gx7Mm4i4T8_W3hziq12J_pRmJtMFQUh/w400-h300/Nokia+075.jpg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Tracing across the River Thames with minimal impact back in 2019 :)<br /></td></tr></tbody></table><p style="text-align: center;">* * *<br /></p><p>To summarize:</p><ol style="text-align: left;"><li>You should check all the <b>tools/*.bt</b> tools and related examples, both to know what is ready to use and to study how proper 
bpftrace programs are written.</li><li>Some tools may rely on kernel headers being available, or on specific pathnames of the probed libraries.</li><li>The tools presented in this post are good for studying disk I/O issues that may impact MariaDB performance. Some are useful alternatives to <a href="https://mysqlentomologist.blogspot.com/2017/12/using-strace-for-mysql-troubleshooting.html" target="_blank"><b>strace</b> everything</a>...<br /></li><li>See also my older post "<a href="http://mysqlentomologist.blogspot.com/2020/09/bcc-tools-for-disk-io-analysis-and-more.html" target="_blank"><b>BCC Tools for disk I/O Analysis and More</b></a>" where <b>bcc</b> tools were used to monitor disk I/O on older Ubuntu 16.04. Some of them are more advanced and may have no <b>bpftrace</b>-based alternatives, but we all know that <b>bpftrace</b> is the future :)<br /></li></ol>Valerii Kravchukhttp://www.blogger.com/profile/13158916419325454260noreply@blogger.com0tag:blogger.com,1999:blog-3080615211468083537.post-32803354951228028342021-01-23T18:03:00.000+02:002021-01-23T18:03:14.136+02:00Playing with recent bpftrace and MariaDB 10.5 on Fedora - Part I, Basic uprobes<p>There is still some non-zero probability that my talk called <b>"Monitoring MariaDB Server with bpftrace on Linux"</b> is accepted for the <a href="https://fosdem.org/2021/schedule/track/monitoring_and_observability/" target="_blank">FOSDEM 2021 Monitoring and Observability devroom</a>, so it's time to forget for a while about <a href="https://mysqlentomologist.blogspot.com/search/label/proc" target="_blank"><b>/proc</b> sampling</a> and revisit my old posts <a href="http://mysqlentomologist.blogspot.com/search/label/bpftrace" target="_blank">about <b>bpftrace</b></a>.</p><p>This time I am going to build a recent <b>bpftrace</b> version from GitHub source, together with recent <b>bcc</b> tools:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 bcc]$ <b>git log -1</b><br />commit 
97cded04a9d6370ac722c6ad8e73b72c4794e851 (HEAD -> master, origin/master, origin/HEAD)<br />Author: Chunmei Xu <xuchunmei@linux.alibaba.com><br />Date: Fri Jan 15 09:51:27 2021 +0800<br /><br /> test/test_histogram.py: fix test failed on kernel-5.10<br /><br /> kernel commit(cf25e24db61cc) rename tsk->real_start_time to<br /> start_boottime, so test_hostogram will get failed on kernel>=5.5<br /><br /> Signed-off-by: Chunmei Xu <xuchunmei@linux.alibaba.com><br /><br />...<br /><br />[openxs@fc31 bpftrace]$ <b>git log -1</b><br />commit 691c5e23259bfa82257016c65612fe9a3d6be7d4 (HEAD -> master, origin/master, origin/HEAD)<br />Author: Masanori Misono <m.misono760@gmail.com><br />Date: Wed Nov 25 05:51:08 2020 +0900<br /><br /> Update changelog and fuzzing.md<br />...</span></span><br /></p></blockquote><p>and check how it works on Fedora 31. I wanted to write "up to date Fedora 31", but surely it's "up to date" for 2 months already, as it's <a href="https://fedoraproject.org/wiki/End_of_life" target="_blank">EOL and no longer supported</a>... This is something to fix next week by upgrading to Fedora 33 while I am on vacation.<br /></p><p>The build process was not any different from the one described in <a href="http://mysqlentomologist.blogspot.com/2020/01/using-bpftrace-on-fedora-29-more.html" target="_blank">this post</a>. 
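For reference, building <b>bpftrace</b> from source boils down to the following (a sketch assuming the usual build dependencies, such as cmake, flex, bison and the LLVM/Clang development packages, are already installed):

```shell
git clone https://github.com/iovisor/bpftrace.git
cd bpftrace
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j $(nproc)
sudo make install        # installs to /usr/local/bin/bpftrace
```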
I've got some test failures for <b>bcc</b> tools:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">...<br /><br />84% tests passed, 7 tests failed out of 44<br /><br />Total Test time (real) = 908.76 sec<br /><br />The following tests FAILED:<br /> 2 - c_test_static (Failed)<br /> 3 - test_libbcc (Failed)<br /> 4 - py_test_stat1_b (Failed)<br /> 9 - py_test_trace1 (Failed)<br /> 18 - py_test_clang (Failed)<br /> 23 - py_test_stackid (Failed)<br /> 29 - py_test_disassembler (Failed)<br />Errors while running CTest<br />make: *** [Makefile:106: test] Error 8<br />...</span></span><br /></p></blockquote><p>but eventually ended up with this up to date version of <b>bpftrace</b> that basically works for my purposes:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 build]$ <b>/usr/local/bin/bpftrace --version</b><br />bpftrace v0.11.0-324-g691c5<br /><br />[root@fc31 tools]# <b>which bpftrace</b><br />/usr/local/bin/bpftrace</span></span> <br /><br /><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>bpftrace --help</b><br />USAGE:<br /> bpftrace [options] filename<br /> bpftrace [options] - <stdin input><br /> bpftrace [options] -e 'program'<br /><br />OPTIONS:<br /><b> -B MODE output buffering mode ('full', 'none')<br /> -f FORMAT output format ('text', 'json')<br /> -o file redirect bpftrace output to file<br /></b> -d debug info dry run<br /> -dd verbose debug info dry run<br /><b> -b force BTF (BPF type format) processing<br /></b> -e 'program' execute this program<br /> -h, --help show this help message<br /> -I DIR add the directory to the include search path<br /> --include FILE add an #include file before preprocessing<br /> -l [search] list probes<br /> -p PID enable USDT probes on PID<br /> -c 'CMD' run CMD and enable USDT probes on resulting process<br /><b> --usdt-file-activation<br /> activate usdt semaphores based on file 
path<br /></b> --unsafe allow unsafe builtin functions<br /> -q keep messages quiet<br /> -v verbose messages<br /><b> --info Print information about kernel BPF support<br /> -k emit a warning when a bpf helper returns an error (except read functions)<br /> -kk check all bpf helper functions<br /></b> -V, --version bpftrace version<br /><b> --no-warnings disable all warning messages<br /></b><br />ENVIRONMENT:<br /> BPFTRACE_STRLEN [default: 64] bytes on BPF stack per str()<br /> BPFTRACE_NO_CPP_DEMANGLE [default: 0] disable C++ symbol demangling<br /> BPFTRACE_MAP_KEYS_MAX [default: 4096] max keys in a map<br /> BPFTRACE_CAT_BYTES_MAX [default: 10k] maximum bytes read by cat builtin<br /> BPFTRACE_MAX_PROBES [default: 512] max number of probes<br /> BPFTRACE_LOG_SIZE [default: 1000000] log size in bytes<br /> BPFTRACE_PERF_RB_PAGES [default: 64] pages per CPU to allocate for ring buffer<br /> BPFTRACE_NO_USER_SYMBOLS [default: 0] disable user symbol resolution<br /> BPFTRACE_CACHE_USER_SYMBOLS [default: auto] enable user symbol cache<br /> BPFTRACE_VMLINUX [default: none] vmlinux path used for kernel symbol resolution<br /> BPFTRACE_BTF [default: none] BTF file<br /><br />EXAMPLES:<br />bpftrace -l '*sleep*'<br /> list probes containing "sleep"<br />bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...\n", pid); }'<br /> trace processes calling sleep<br />bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'<br /> count syscalls by process name</span></span><br /></p></blockquote><p>I've highlighted options that I consider "new" or changed compared to version 0.9 that I used <a href="http://mysqlentomologist.blogspot.com/2019/10/dynamic-tracing-of-mariadb-server-with.html" target="_blank">here</a>.<br /></p><p>As the first test I tried to find out if <a href="https://github.com/iovisor/bpftrace/pull/1116" target="_blank">this PR</a> mentioned in the comments to one of my posts really made it to the current code and if I can use user probe 
names in demangled C++ format. For this I tried to capture all queries with a probe on the <b>dispatch_command</b> function:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>ps aux | grep mariadb</b><br />openxs 3196 0.0 0.0 217048 3828 pts/0 S 08:45 0:00 /bin/sh ./bin/mysqld_safe --no-defaults --socket=/tmp/mariadb.sock --innodb_buffer_pool_size=1G --innodb_flush_log_at_trx_commit=2 --port=3309<br />openxs 3293 140 3.5 3633176 287132 pts/0 Sl 08:45 2:56 <b>/home/openxs/dbs/maria10.5/bin/mariadbd</b> --no-defaults --basedir=/home/openxs/dbs/maria10.5 --datadir=/home/openxs/dbs/maria10.5/data --plugin-dir=/home/openxs/dbs/maria10.5/lib/plugin --innodb_buffer_pool_size=1G --innodb_flush_log_at_trx_commit=2 --log-error=/home/openxs/dbs/maria10.5/data/fc31.err --pid-file=fc31.pid --socket=/tmp/mariadb.sock --port=3309<br />openxs 3494 0.0 0.0 215992 844 pts/1 S+ 08:47 0:00 grep --color=auto mariadb<br /><br />[openxs@fc31 ~]$ <b>sudo bpftrace -e 'uprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:dispatch_command { printf("%s\n", str(arg2)); }'</b><br />Attaching 3 probes...<br />select @@version_comment limit 1<br />select @@version_comment limit 1<br />select 1+1<br />select 1+1<br />show processlist<br />show processlist<br />^C</span></span><br /></p></blockquote><p>So, the demangled function name was accepted, but note the "3 probes" above and the duplicated SQL statements in the output: at least 2 of the 3 probes were executed. 
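Before debugging blindly, one can ask <b>bpftrace</b> itself which probe sites a pattern expands to; something like this (the binary path is from my setup) should list each matching location as a separate probe:

```shell
# List every uprobe location the pattern would attach to; C++ overloads
# and compiler-generated clones show up as separate entries.
sudo bpftrace -l 'uprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:*dispatch_command*'
```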
I tried to "debug" the problem with the <b>-d</b> option:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>sudo bpftrace -d -e 'uprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:dispatch_command { printf("%s\n", str(arg2)); }'</b><br /><br />AST<br />-------------------<br />Program<br /> uprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:dispatch_command<br /> call: printf<br /> string: %s\n<br /> call: str<br /> builtin: arg2<br /><br /><br />AST after semantic analysis<br />-------------------<br />Program<br /> uprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:dispatch_command<br /> call: printf :: type[none, ctx: 0]<br /> string: %s\n :: type[string[3], ctx: 0]<br /> call: str :: type[string[64], ctx: 0, AS(user)]<br /> builtin: arg2 :: type[unsigned int64, ctx: 0, AS(user)]<br /><br />; ModuleID = 'bpftrace'<br />source_filename = "bpftrace"<br />target datalayout = "e-m:e-p:64:64-i64:64-n32:64-S128"<br />target triple = "bpf-pc-linux"<br /><br />%printf_t = type { i64, [64 x i8] }<br /><br />; Function Attrs: nounwind<br />declare i64 @llvm.bpf.pseudo(i64, i64) #0<br /><br />define i64 @"uprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:dispatch_command"(i8*) local_unnamed_addr section "s_uprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:dispatch_command_1" {<br />entry:<br /> %str = alloca [64 x i8], align 1<br /> %printf_args = alloca %printf_t, align 8<br /> %1 = bitcast %printf_t* %printf_args to i8*<br /> call void @llvm.lifetime.start.p0i8(i64 -1, i8* nonnull %1)<br /> %2 = getelementptr inbounds [64 x i8], [64 x i8]* %str, i64 0, i64 0<br /> %3 = bitcast %printf_t* %printf_args to i8*<br /> call void @llvm.memset.p0i8.i64(i8* nonnull align 8 %3, i8 0, i64 72, i1 false)<br /> call void @llvm.lifetime.start.p0i8(i64 -1, i8* nonnull %2)<br /> call void @llvm.memset.p0i8.i64(i8* nonnull align 1 %2, i8 0, i64 64, i1 false)<br /> %4 = getelementptr i8, i8* %0, i64 96<br /> %5 = bitcast i8* %4 to i64*<br /> %arg2 = 
load volatile i64, i64* %5, align 8<br /> %probe_read_user_str = call i64 inttoptr (i64 114 to i64 ([64 x i8]*, i32, i64)*)([64 x i8]* nonnull %str, i32 64, i64 %arg2)<br /> %6 = getelementptr inbounds %printf_t, %printf_t* %printf_args, i64 0, i32 1, i64 0<br /> call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 %6, i8* nonnull align 1 %2, i64 64, i1 false)<br /> call void @llvm.lifetime.end.p0i8(i64 -1, i8* nonnull %2)<br /> %pseudo = call i64 @llvm.bpf.pseudo(i64 1, i64 1)<br /> %get_cpu_id = call i64 inttoptr (i64 8 to i64 ()*)()<br /> %perf_event_output = call i64 inttoptr (i64 25 to i64 (i8*, i64, i64, %printf_t*, i64)*)(i8* %0, i64 %pseudo, i64 %get_cpu_id, %printf_t* nonnull %printf_args, i64 72)<br /> call void @llvm.lifetime.end.p0i8(i64 -1, i8* nonnull %1)<br /> ret i64 0<br />}<br /><br />; Function Attrs: argmemonly nounwind<br />declare void @llvm.lifetime.start.p0i8(i64 immarg, i8* nocapture) #1<br /><br />; Function Attrs: argmemonly nounwind<br />declare void @llvm.memset.p0i8.i64(i8* nocapture writeonly, i8, i64, i1 immarg) #1<br /><br />; Function Attrs: argmemonly nounwind<br />declare void @llvm.lifetime.end.p0i8(i64 immarg, i8* nocapture) #1<br /><br />; Function Attrs: argmemonly nounwind<br />declare void @llvm.memcpy.p0i8.p0i8.i64(i8* nocapture writeonly, i8* nocapture readonly, i64, i1 immarg) #1<br /><br />attributes #0 = { nounwind }<br />attributes #1 = { argmemonly nounwind }</span></span><br /></p></blockquote><p>But the output does NOT list 3 probes and gives no hints. 
I should probably have tried harder and provided the full (demangled) function signature</p><pre><blockquote>dispatch_command(enum_server_command, THD*, char*, unsigned int, bool, bool)</blockquote></pre><p>or checked what probes are really added with</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">sudo cat /sys/kernel/tracing/uprobe_events</span></span><br /></p></blockquote><p>But being lazy, I ended up just double-checking what mangled name to use:<span style="font-family: courier;"><span style="font-size: x-small;"><br /></span></span></p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~$ <b>objdump -T /home/openxs/dbs/maria10.5/bin/mariadbd | grep dispatch_command</b><br />000000000070a170 g DF .text 000000000000289b Base _Z16dispatch_command19enum_server_commandP3THDPcjbb<br />...</span></span><br /></p></blockquote><p>and used the same familiar mangled name in further probes:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>sudo bpftrace -e 'uprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:_Z16dispatch_command19enum_server_commandP3THDPcjbb { printf("%s\n", str(arg2)); }'</b><br />Attaching 1 probe...<br />select @@version_comment limit 1<br />select user, host from mysql.user<br />SELECT DATABASE()<br />test<br />show databases<br />show tables<br />ts<br />select count(*) from t1<br />select count(*) from t<br />show tables<br /><br />^C</span></span><br /></p></blockquote><p>The next <b>bpftrace</b> "one-liner" to try was a more advanced attempt to capture not only the text of SQL statements, but also the time to execute them via a <b>uretprobe</b>, and to make it work in a multithreaded environment. 
I quickly found out that one of the examples in <a href="http://mysqlentomologist.blogspot.com/2019/10/dynamic-tracing-of-mariadb-server-with.html" target="_blank">the older post</a> has a bug, and that explained the "64" at the end of timestamps :) So, here is a more correct <b>bpftrace</b> program:<br /></p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>sudo bpftrace -e 'uprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:_Z16dispatch_command19enum_server_commandP3THDPcjbb { @sql[tid] = str(arg2); @start[tid] = nsecs; } uretprobe:/home/openxs/dbs/maria10.5/bin/mariadbd:_Z16dispatch_command19enum_server_commandP3THDPcjbb /@start[tid] != 0/ { printf("%s : %u %u ms\n", @sql[tid], tid, (nsecs - @start[tid])/1000000); } '</b><br />Attaching 2 probes...<br />select sleep(1) : 4029 1000 ms<br /> : 4029 0 ms<br />select sleep(2) : 4281 2000 ms<br /> : 4281 0 ms<br />select sleep(3) : 4283 3000 ms<br /> : 4283 0 ms<br />select sleep(4) : 4282 4000 ms<br /> : 4282 0 ms<br />^C<br /><br />@sql[4029]:<br />@sql[4281]:<br />@sql[4282]:<br />@sql[4283]:<br /><br />@start[4029]: 2609790546240<br />@start[4281]: 2610789764269<br />@start[4283]: 2611790224979<br />@start[4282]: 2612789761146</span></span><br /></p></blockquote><p>The output was taken while this shell script was running:</p><blockquote><p style="text-align: left;"><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 maria10.5]$ <b>for i in `seq 1 4`; do mysql --socket=/tmp/mariadb.sock -e"select sleep($i)" & done</b></span></span><br /></p></blockquote><p>Just to remind you, I've used two associative arrays, <b>@sql[]</b> for queries and <b>@start[]</b> for start times, both indexed by <b>tid</b>
- the built-in <b>bpftrace</b> variable for thread id. Note that <b>bpftrace</b> automatically outputs the content of all global associative arrays at the end,
unless we free them explicitly. So, reimplementing a slow query log in <b>bpftrace</b> properly is no longer a "one-liner" program: we have to take care of more details.<br /></p><p>As the next test, I tried to add a user probe to the library to trace <b>pthread_mutex_lock</b> calls only for the <b>mariadbd</b> binary (as I did <a href="http://mysqlentomologist.blogspot.com/2020/09/dynamic-tracing-of-pthreadmutexlock-in.html" target="_blank">in this post</a> with <b>perf</b>):<span style="font-family: courier;"><span style="font-size: x-small;"> </span></span></p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ <b>ldd /home/openxs/dbs/maria10.5/bin/mariadbd | grep thread</b><br /> libpthread.so.0 => <b>/lib64/libpthread.so.0</b> (0x00007f3d957bf000)<br />[openxs@fc31 ~]$ <b>sudo bpftrace -e 'uprobe:/lib64/libpthread.so.0:pthread_mutex_lock /comm == "mariadbd"/ { @[ustack] = count(); }' > /tmp/bpfmutex.txt</b><br />[sudo] password for openxs:<br />^C^C</span></span><br /></p></blockquote><p>Here I am collecting and counting unique user stacks at the moment of calling <b>pthread_mutex_lock()</b>, while a <b>sysbench</b> test is running:</p><blockquote><p>...<br />[ 10s ] thds: 32 tps: 658.05 qps: 13199.78 (r/w/o: 9246.09/2634.40/1319.30) lat (ms,95%): 227.40 err/s: 0.00 reconn/s: 0.00<br />[ 20s ] thds: 32 tps: 737.82 qps: 14752.19 (r/w/o: 10325.44/2951.30/1475.45) lat (ms,95%): 193.38 err/s: 0.00 reconn/s: 0.00<br /><b>[ 30s ] thds: 32 tps: 451.18 qps: 9023.16 (r/w/o: 6316.56/1804.03/902.57) lat (ms,95%): 320.17 err/s: 0.00 reconn/s: 0.00<br />[ 40s ] thds: 32 tps: 379.09 qps: 7585.24 (r/w/o: 5310.19/1516.87/758.18) lat (ms,95%): 390.30 err/s: 0.00 reconn/s: 0.00<br />[ 50s ] thds: 32 tps: 448.78 qps: 8985.48 (r/w/o: 6292.88/1795.14/897.47) lat (ms,95%): 350.33 err/s: 0.00 reconn/s: 0.00<br />[ 60s ] thds: 32 tps: 400.33 qps: 7997.32 (r/w/o: 5595.86/1600.70/800.75) lat (ms,95%): 411.96 err/s: 0.00 reconn/s: 0.00<br />[ 
70s ] thds: 32 tps: 392.96 qps: 7865.59 (r/w/o: 5506.30/1573.36/785.93) lat (ms,95%): 369.77 err/s: 0.00 reconn/s: 0.00<br /></b>[ 80s ] thds: 32 tps: 410.02 qps: 8197.77 (r/w/o: 5739.36/1638.47/819.94) lat (ms,95%): 411.96 err/s: 0.00 reconn/s: 0.00<br />[ 90s ] thds: 32 tps: 390.15 qps: 7803.48 (r/w/o: 5462.45/1560.62/780.41) lat (ms,95%): 427.07 err/s: 0.00 reconn/s: 0.00<br />[ 100s ] thds: 32 tps: 405.08 qps: 8111.76 (r/w/o: 5677.96/1623.63/810.17) lat (ms,95%): 411.96 err/s: 0.00 reconn/s: 0.00<br />^C<br /></p></blockquote><p>Note some drop in performance that is yet to be measured properly (collection time vs. exporting to the userland <b>/tmp/bpfmutex.txt</b> file). It is surely notable for such a frequent event to trace.<br /></p><p>In the results I see mostly unique stacks like these:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">[openxs@fc31 ~]$ head -100 /tmp/bpfmutex.txt<br />Attaching 1 probe...<br /><br /><br />@[<br /> __pthread_mutex_lock+0<br /> sync_array_wait_event(sync_array_t*, sync_cell_t*&)+167<br /> rw_lock_sx_lock_func(rw_lock_t*, unsigned long, char const*, unsigned int)+488<br /> pfs_rw_lock_sx_lock_func(rw_lock_t*, unsigned long, char const*, unsigned int) [clone .constprop.0]+140<br /> btr_cur_search_to_nth_level_func(dict_index_t*, unsigned long, dtuple_t const*, page_cur_mode_t, unsigned long, btr_cur_t*, rw_lock_t*, char const*, unsigned int, mtr_t*, unsigned long)+8555<br /> btr_pcur_open_low(dict_index_t*, unsigned long, dtuple_t const*, page_cur_mode_t, unsigned long, btr_pcur_t*, char const*, unsigned int, unsigned long, mtr_t*) [clone .constprop.0]+146<br /> row_search_index_entry(dict_index_t*, dtuple_t const*, unsigned long, btr_pcur_t*, mtr_t*)+47<br /> row_purge_remove_sec_if_poss_tree(purge_node_t*, dict_index_t*, dtuple_t const*)+497<br /> row_purge_record_func(purge_node_t*, unsigned char*, que_thr_t const*, bool)+1492<br /> row_purge_step(que_thr_t*)+738<br /> 
que_run_threads(que_thr_t*)+2264<br /> purge_worker_callback(void*)+355<br /> tpool::task_group::execute(tpool::task*)+170<br /> tpool::thread_pool_generic::worker_main(tpool::worker_data*)+79<br /> 0x7f00f7fc43d4<br /> 0x56302093bd80<br /> std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::~_State_impl()+0<br /> 0x2de907894810c083<br />]: 1<br />@[<br /> __pthread_mutex_lock+0<br /> mtr_t::commit()+2660<br /> row_ins_sec_index_entry_low(unsigned long, unsigned long, dict_index_t*, mem_block_info_t*, mem_block_info_t*, dtuple_t*, unsigned long, que_thr_t*)+563<br /> row_ins_sec_index_entry(dict_index_t*, dtuple_t*, que_thr_t*, bool)+246<br /> row_ins_step(que_thr_t*)+1305<br /> row_insert_for_mysql(unsigned char const*, row_prebuilt_t*, ins_mode_t)+865<br /> ha_innobase::write_row(unsigned char const*)+177<br /> handler::ha_write_row(unsigned char const*)+464<br /> write_record(THD*, TABLE*, st_copy_info*, select_result*)+477<br /> mysql_insert(THD*, TABLE_LIST*, List<Item>&, List<List<Item> >&, List<Item>&, List<Item>&, enum_duplicates, bool, select_result*)+2967<br /> mysql_execute_command(THD*)+7722<br /> Prepared_statement::execute(String*, bool)+981<br /> Prepared_statement::execute_loop(String*, bool, unsigned char*, unsigned char*)+133<br /> mysql_stmt_execute_common(THD*, unsigned long, unsigned char*, unsigned char*, unsigned long, bool, bool)+549<br /> mysqld_stmt_execute(THD*, char*, unsigned int)+44<br /> dispatch_command(enum_server_command, THD*, char*, unsigned int, bool, bool)+9302<br /> do_command(THD*)+274<br /> do_handle_one_connection(CONNECT*, bool)+1025<br /> handle_one_connection+93<br /> pfs_spawn_thread+322<br /> start_thread+226<br />]: 1<br />...</span></span><br /></p></blockquote><p>Looks like I have to generate them in perf format and then maybe aggregate somehow in <b>bpftrace</b> itself, in the <b>END</b> probe, 
similar to the way I did with <b>awk</b> postprocessing inspired by <b>pt-pmp</b> <a href="http://mysqlentomologist.blogspot.com/2020/01/using-bpftrace-on-fedora-29-more.html" target="_blank">in this post</a>. That should reduce the negative performance impact of the tracing, hopefully to a level that makes it practical to use in production. I'd like to build flame graphs one day directly based on <b>bpftrace</b> outputs.</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAPwu_vk_0lVJ9lTVrvfpw-zQIE3vqMI5LwBuBdeY37IkNw5b8TkmmzbukIidY6Y586n0eFj8hyphenhyphenWUnTl3bO_btUyWVYnVbj3o5z1pys_WC4lu2T0Kf_nG27zYYpe6INUIXzWwX1DaOrz3u/s640/Nokia+074.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="480" data-original-width="640" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiAPwu_vk_0lVJ9lTVrvfpw-zQIE3vqMI5LwBuBdeY37IkNw5b8TkmmzbukIidY6Y586n0eFj8hyphenhyphenWUnTl3bO_btUyWVYnVbj3o5z1pys_WC4lu2T0Kf_nG27zYYpe6INUIXzWwX1DaOrz3u/w400-h300/Nokia+074.jpg" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">The view is still not entirely clear, but I am getting there, to be as fluent with <b>bpftrace</b> as I am with <b>perf<br /></b></td></tr></tbody></table><p></p><p style="text-align: center;">* * *<br /></p><p>To summarize:</p><ol style="text-align: left;"><li><b>bpftrace</b> version 0.11 supports demangled C++ function signatures. 
You may still have problems making sure the proper function is instrumented, so I continue to use mangled names.</li><li>My plan is to find out how to trace/do with <b>bpftrace</b> anything I usually do with <b>perf</b>, because <b>bpftrace</b> is the future of ad hoc monitoring and tracing tools for Linux.</li><li>I am yet to start collecting larger MariaDB and MySQL-related <b>bpftrace</b> programs in some repository for reuse, but there are many generic OS level tools to check in the <a href="https://github.com/iovisor/bpftrace/tree/master/tools" target="_blank"><b>bpftrace/tools</b></a> subdirectory. They will be covered in my next blog post.</li><li>It's time for me to upgrade to Fedora 33 and retest <b>bcc</b> tools and <b>bpftrace</b> there.<br /></li></ol>Valerii Kravchukhttp://www.blogger.com/profile/13158916419325454260noreply@blogger.com6tag:blogger.com,1999:blog-3080615211468083537.post-52548911158088876382021-01-16T15:35:00.000+02:002021-01-16T15:35:30.560+02:00Linux /proc Filesystem for MySQL DBAs - Part IV, Creating Off-CPU Flame Graphs<p> My <a href="https://fosdem.org/2021/schedule/event/linux_porc_mysql/" target="_blank">upcoming FOSDEM 2021 MySQL Devroom talk</a> based on this series of blog posts is already prepared and recorded. But in the process I noted that some more details should be shared than one can cover in a 20-minute talk. 
So I decided to continue the series, and today I am going to show what one can do with kernel stack samples collected from <b>/proc</b>, for example, with the <b>psn</b> tool we discussed in the <a href="http://mysqlentomologist.blogspot.com/2021/01/linux-proc-filesystem-for-mysql-dbas_8.html" target="_blank">previous post</a>.</p><p>There we've seen that <b>psn</b> allows us to summarize kernel stack samples among other per-thread details; for example, we can get the following for Percona Server running under some <b>sysbench</b> load:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~$ <b>sudo psn -G kstack -p `pidof mysqld`</b><br /><br />Linux Process Snapper v0.18 by Tanel Poder [https://0x.tools]<br />Sampling /proc/stat, stack for 5 seconds... finished.<br /><br /><br />=== Active Threads =========================================================================================================================================================================================================================================================================================<br /><br /> samples | avg_threads | comm | state | kstack <br />------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------<br /><b> 99 | 0.99 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_fsync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->jbd2_complete_transaction()->jbd2_log_wait_commit() </b><br /> 81 | 0.81 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_fdatasync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->jbd2_complete_transaction()->jbd2_log_wait_commit() <br /> 70 | 0.70 | (mysqld) | Running (ON CPU) | 
entry_SYSCALL_64_fastpath()->SyS_poll()->do_sys_poll()->poll_schedule_timeout() <br />...<br /> 1 | 0.01 | (mysqld) | Running (ON CPU) | entry_SYSCALL_64_fastpath()->SyS_sendto()->SYSC_sendto()->sock_sendmsg()->unix_stream_sendmsg()->sock_alloc_send_pskb()->alloc_skb_with_frags()->__alloc_skb()->__kmalloc_reserve.isra.34()->__kmalloc_node_track_caller()<br /><br /><br />samples: 100 (expected: 100)<br />total processes: 1, threads: 44<br />runtime: 5.02, measure time: 1.89</span></span><br /></p></blockquote><p>So we have the number of samples in the first column for the threads of the process named in the third column, in the state shown in the fourth column, with the given kernel stack nicely formatted and presented in the fifth column. That is, we get the number of samples with the given kernel stack trace, for waiting threads (among others).</p><p>Now, where have I seen similarly structured information processed before? Here it is, "<a href="http://www.brendangregg.com/offcpuanalysis.html" target="_blank"><b>Off-CPU Analysis</b></a>" by <b>Brendan Gregg</b>! There he described different approaches, with <b>/proc</b> sampling among them, but mostly concentrated on his eBPF/<b>bcc</b> based tool, <a href="https://github.com/iovisor/bcc/blob/master/tools/offcputime.py" target="_blank"><b>offcputime</b></a>, which summarizes <span class="pl-c">off-CPU time by stack trace. 
I used it on Fedora for a couple of years already, but on this old Ubuntu 16.04 it is not supposed to work (kernel 4.8+ is needed):</span></p><p></p><blockquote><span class="pl-c"><span style="font-size: x-small;"><span style="font-family: courier;">openxs@ao756:~$ <b>uname -a</b><br />Linux ao756 4.4.0-198-generic #230-Ubuntu SMP Sat Nov 28 01:30:29 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux<br />openxs@ao756:~$ <b>sudo /usr/share/bcc/tools/offcputime -h</b><br />usage: offcputime [-h] [-p PID | -t TID | -u | -k] [-U | -K] [-d] [-f]<br /> [--stack-storage-size STACK_STORAGE_SIZE]<br /> [-m MIN_BLOCK_TIME] [-M MAX_BLOCK_TIME] [--state STATE]<br /> [duration]<br /><br />Summarize off-CPU time by stack trace<br /><br />positional arguments:<br /> duration duration of trace, in seconds<br /><br />optional arguments:<br /> -h, --help show this help message and exit<br /> -p PID, --pid PID trace this PID only<br /> -t TID, --tid TID trace this TID only<br /> -u, --user-threads-only<br /> user threads only (no kernel threads)<br /> -k, --kernel-threads-only<br /> kernel threads only (no user threads)<br /> -U, --user-stacks-only<br /> show stacks from user space only (no kernel space<br /> stacks)<br /> -K, --kernel-stacks-only<br /> show stacks from kernel space only (no user space<br /> stacks)<br /> -d, --delimited insert delimiter between kernel/user stacks<br /> -f, --folded output folded format<br /> --stack-storage-size STACK_STORAGE_SIZE<br /> the number of unique stack traces that can be stored<br /> and displayed (default 1024)<br /> -m MIN_BLOCK_TIME, --min-block-time MIN_BLOCK_TIME<br /> the amount of time in microseconds over which we store<br /> traces (default 1)<br /> -M MAX_BLOCK_TIME, --max-block-time MAX_BLOCK_TIME<br /> the amount of time in microseconds under which we<br /> store traces (default U64_MAX)<br /> --state STATE filter on this thread state bitmask (eg, 2 ==<br /> TASK_UNINTERRUPTIBLE) see include/linux/sched.h<br /><br />examples:<br /> 
./offcputime # trace off-CPU stack time until Ctrl-C<br /> ./offcputime 5 # trace for 5 seconds only<br /><b> ./offcputime -f 5 # 5 seconds, and output in folded format<br /></b> ./offcputime -m 1000 # trace only events that last more than 1000 usec<br /> ./offcputime -M 10000 # trace only events that last less than 10000 usec<br /> ./offcputime -p 185 # only trace threads for PID 185<br /> ./offcputime -t 188 # only trace thread 188<br /> ./offcputime -u # only trace user threads (no kernel)<br /> ./offcputime -k # only trace kernel threads (no user)<br /> ./offcputime -U # only show user space stacks (no kernel)<br /><b> ./offcputime -K # only show kernel space stacks (no user)<br /></b><br />openxs@ao756:~$ <b>sudo /usr/share/bcc/tools/offcputime -K -f 5 -p `pidof mysqld`</b><br />could not open bpf map: stack_traces, error: Invalid argument<br />Traceback (most recent call last):<br /> File "/usr/share/bcc/tools/offcputime", line 235, in <module><br /> b = BPF(text=bpf_text)<br /> File "/usr/lib/python2.7/dist-packages/bcc/__init__.py", line 325, in __init__<br /> raise Exception("Failed to compile BPF text")<br />Exception: Failed to compile BPF text</span></span><br /></span></blockquote><p></p><p><span class="pl-c">and, as you can see, it does not work other than with the <b>-h</b> option to get the usage details. That's unfortunate, as this slow netbook could really benefit from Off-CPU analysis, with its 2 slow cores and encrypted slow HDD :) The <b>offcputime</b> tool output was then used to create nice flame graphs that allow one to identify the most frequent kinds of waits:</span></p><p></p><blockquote><pre># <b>/usr/share/bcc/tools/offcputime -df -p `pgrep -nx mysqld` 30 > out.stacks</b>
...<br /># <b>git clone https://github.com/brendangregg/FlameGraph</b>
# <b>cd FlameGraph</b>
# <b>./flamegraph.pl --color=io --title="Off-CPU Time Flame Graph" --countname=us < out.stacks > out.svg</b></pre><span class="pl-c"></span></blockquote><p></p><p>A flame graph created like that <span class="pl-c">shows all off-CPU stack traces, with stack depth on the y-axis, and the width corresponding to the total time in each stack, along with application level stacks.</span></p><p><span class="pl-c">I could not get these on my Ubuntu with <b>offcputime</b>. But I tried to find out what kind of output the tool produces with the <b>-f</b> option we see used above. We can see this without running the tool from this file, <a href="https://github.com/iovisor/bcc/blob/master/tools/offcputime_example.txt" target="_blank"><b>https://github.com/iovisor/bcc/blob/master/tools/offcputime_example.txt</b></a>:<br /></span></p><p></p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">"A -f option will emit output using the "folded stacks" format, which can be<br />read directly by flamegraph.pl from the FlameGraph open source software<br />(https://github.com/brendangregg/FlameGraph). Eg:<br /><br /># ./offcputime -K -f 5<br /><b>bash;entry_SYSCALL_64_fastpath;sys_read;vfs_read;...;schedule 8</b><br />..."<br /></span></span></p></blockquote><p>So, the format is simple: the first column is the program name and kernel stack trace, separated by '<b>;</b>', and then the number of times we've noted this stack for the program. 
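The folding itself is trivial to reproduce in code. Here is a minimal Python sketch of my own (the sample stacks below are made up for illustration) that does the same job as a `sort | uniq -c` step in a shell pipe: collapse repeated "comm;frame;frame" strings and append the sample count.

```python
from collections import Counter

# Hypothetical sampled stacks, one "comm;frame;...;frame" string per sample.
samples = [
    "mysqld;entry_SYSCALL_64_fastpath;SyS_fsync;jbd2_log_wait_commit",
    "mysqld;entry_SYSCALL_64_fastpath;SyS_fsync;jbd2_log_wait_commit",
    "mysqld;entry_SYSCALL_64_fastpath;SyS_poll;poll_schedule_timeout",
]

# Collapse duplicates and append the count -- the folded format
# flamegraph.pl consumes.
for stack, count in Counter(samples).most_common():
    print("%s %d" % (stack, count))
```

Feeding such output to flamegraph.pl is all that is needed to get a graph.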
I can surely get the output of <b>psn</b> converted to such a simple format!</p><p>I've started a <b>sysbench</b> test like this:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~$ <b>sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=root --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-db=sbtest --threads=16 --time=1200 --report-interval=10 run</b><br />sysbench 1.1.0-faaff4f (using bundled LuaJIT 2.1.0-beta3)<br /><br />Running the test with following options:<br />Number of threads: 16<br />Report intermediate results every 10 second(s)<br />Initializing random number generator from current time<br /><br /><br />Initializing worker threads...<br /><br />Threads started!<br /><br />[ 10s ] thds: 16 tps: 69.66 qps: 1423.20 (r/w/o: 997.63/284.64/140.92) lat (ms,95%): 383.33 err/s: 0.00 reconn/s: 0.00<br />[ 20s ] thds: 16 tps: 79.81 qps: 1596.57 (r/w/o: 1117.32/319.63/159.62) lat (ms,95%): 320.17 err/s: 0.00 reconn/s: 0.00<br />...</span></span><br /></p></blockquote><p>and in another shell, after several attempts, filtered out and aggregated exactly what I needed (I'll skip explanations of my lame <b>awk</b> and <b>sed</b> usage; this transformation can surely be done in a more efficient way and with fewer commands in the pipe):</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~$ <b>sudo psn -d 60 -G kstack | grep -v Running | awk -F\| '{ print $3, $5, $1 }' | sed 's/->/;/g' | grep '^.(' | sed 's/(//g' | sed 's/)//g' | awk '{ print $1";"$2, $3 }' > /tmp/psnstacks.txt</b></span></span><br /></p></blockquote><p>Kernel stacks of all the processes were collected for 60 seconds, processed to match the format needed to build a flame graph, and saved into the file. 
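For readers who prefer Python to awk/sed, a rough equivalent of that conversion could look like the sketch below. This is my own illustration, not part of 0x.tools; it assumes the five pipe-separated columns of the psn -G kstack report shown earlier and skips header rows and on-CPU threads.

```python
def fold(line):
    """Convert one row of a `psn -G kstack` report into the folded format
    flamegraph.pl reads: "comm;frame1;frame2 <samples>". Returns None for
    headers, rulers, and threads that were on CPU."""
    parts = [p.strip() for p in line.split("|")]
    if len(parts) != 5 or not parts[0].isdigit() or "Running" in parts[3]:
        return None
    samples, comm, kstack = parts[0], parts[2].strip("()"), parts[4]
    # Strip the "()" call markers and turn "->" separators into ";".
    stack = kstack.replace("()", "").replace("->", ";")
    return "%s;%s %s" % (comm, stack, samples)

row = (" 99 | 0.99 | (mysqld) | Disk (Uninterruptible) | "
       "entry_SYSCALL_64_fastpath()->SyS_fsync()->do_fsync()")
print(fold(row))  # mysqld;entry_SYSCALL_64_fastpath;SyS_fsync;do_fsync 99
```

Applying such a function to every line of the psn report and writing the non-None results to a file gives the same kind of /tmp/psnstacks.txt as the shell pipe above.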
</p><p>The next step is to process it to create an <b>.svg</b> file with the graph:<br /></p><blockquote><p><span style="font-size: x-small;"><span style="font-family: courier;">openxs@ao756:~$ <b>git/FlameGraph/flamegraph.pl --color=io --title="Off-CPU Time Flame Graph based on proc sampling" --countname=hits < /tmp/psnstacks.txt > ~/Documents/psn1.svg</b><br />openxs@ao756:~$ <b>ls -l ~/Documents/psn1.svg</b><br />-rw-rw-r-- 1 openxs openxs 25999 Jan 16 15:02 /home/openxs/Documents/psn1.svg</span></span><br /></p></blockquote><p>The result (with a typo I made in the original command) was the following (this is a <b>.png</b> made from a screenshot of a browser used to work with the <b>psn1.svg</b>):<br /></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2SBIo42V_VFVTUPUt_C4CvXY8GeuLV0W-XP1ENvnR9NdAf8W-hw3rWFdwGw5NAtZERnqkUM7uG19wUzRxhXYDf6Gs_6n8AaUYaPhH-kfqC7S3sO5h4lE2Twm2yzQgOGLnwlqxykzQ2HQi/s1194/psn_offcpu_p57_read_write_fosdem.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="459" data-original-width="1194" height="246" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh2SBIo42V_VFVTUPUt_C4CvXY8GeuLV0W-XP1ENvnR9NdAf8W-hw3rWFdwGw5NAtZERnqkUM7uG19wUzRxhXYDf6Gs_6n8AaUYaPhH-kfqC7S3sO5h4lE2Twm2yzQgOGLnwlqxykzQ2HQi/w640-h246/psn_offcpu_p57_read_write_fosdem.png" width="640" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">You can create Off-CPU flame graphs like the above on any Linux 2.6.x by sampling <b>/proc</b>!<br /></td></tr></tbody></table><p><br />I see that basically most of the time when <b>mysqld</b> threads are not running is spent waiting in the <b>jbd2_log_wait_commit</b> kernel function. 
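A conclusion like that can also be checked numerically from the folded file, without drawing anything. A small sketch of my own (assuming the folded "comm;frames count" format produced above) that sums samples per deepest (leaf) kernel frame:

```python
from collections import Counter

def top_leaf_frames(folded_lines, n=3):
    """Sum sample counts by the leaf (deepest) frame of each folded stack."""
    leaves = Counter()
    for line in folded_lines:
        stack, _, count = line.rpartition(" ")
        leaves[stack.split(";")[-1]] += int(count)
    return leaves.most_common(n)

# Sample counts resembling the psn report shown earlier in this post:
lines = [
    "mysqld;SyS_fsync;jbd2_log_wait_commit 99",
    "mysqld;SyS_fdatasync;jbd2_log_wait_commit 81",
    "mysqld;SyS_poll;poll_schedule_timeout 70",
]
print(top_leaf_frames(lines))
# [('jbd2_log_wait_commit', 180), ('poll_schedule_timeout', 70)]
```

With the real /tmp/psnstacks.txt as input, this gives the same ranking of waits that the widest towers of the flame graph show visually.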
Now we can go to the <a href="https://github.com/torvalds/linux/blob/71c5f03154ac1cb27423b984743ccc2f5d11d14d/fs/jbd2/journal.c#L677" target="_blank">Linux kernel source code</a> and study what it is about:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;"><b>/*<br /> * Wait for a specified commit to complete.<br /> * The caller may not hold the journal lock.<br /> */<br /></b>int jbd2_log_wait_commit(journal_t *journal, tid_t tid)<br />{<br /> int err = 0;<br /><br /> read_lock(&journal->j_state_lock);<br />#ifdef CONFIG_PROVE_LOCKING<br /> /*<br /> * Some callers make sure transaction is already committing and in that<br /> * case we cannot block on open handles anymore. So don't warn in that<br /> * case.<br /> */<br /> if (tid_gt(tid, journal->j_commit_sequence) &&<br /> (!journal->j_committing_transaction ||<br /> journal->j_committing_transaction->t_tid != tid)) {<br /> read_unlock(&journal->j_state_lock);<br /> jbd2_might_wait_for_commit(journal);<br /> read_lock(&journal->j_state_lock);<br /> }<br />#endif<br />... <br /></span></span></p></blockquote><p> So we are waiting for a commit of a write to the journal of the <b>ext4</b> filesystem here. Fair enough.<br /></p><p style="text-align: center;">* * *</p><p>To summarize, the fact that you do not have <b>bcc</b> tools installed, or run a kernel too old for them to work at all, does not prevent you from doing off-CPU analysis based on the <b>/proc</b> sampling approach. With some simple scripting and command line text processing applied you can get folded stacks for further analysis based on flame graphs. This may help to resolve some obscure MySQL performance problems that go beyond bad execution plans or spinning and wasting CPU cycles...</p><p>Moreover, the impact of such sampling is notably lower than with <b>bcc</b> tools. 
This is a topic I am going to study in my next blog post in this series, based on testing on a Fedora 31 box where all the tools just work.<br /></p>Valerii Kravchukhttp://www.blogger.com/profile/13158916419325454260noreply@blogger.com0tag:blogger.com,1999:blog-3080615211468083537.post-56075496394381521112021-01-08T19:06:00.002+02:002021-01-08T20:54:51.176+02:00Linux /proc Filesystem for MySQL DBAs - Part III, 0x.tools by Tanel Poder<p>In this third post of my "<b>Linux /proc Filesystem for MySQL DBAs</b>" series (see also <a href="http://mysqlentomologist.blogspot.com/2021/01/linux-proc-filesystem-for-mysql-dbas.html" target="_blank">Part I</a> and <a href="http://mysqlentomologist.blogspot.com/2021/01/linux-proc-filesystem-for-mysql-dbas_7.html" target="_blank">Part II</a> for the context and details) I am going to present a useful set of programs to access, summarize and record <b>/proc</b> details, created and recently shared by the famous <a href="https://tanelpoder.com/" target="_blank"><b>Tanel Poder</b></a>: <a href="https://0x.tools/" target="_blank"><b>0x.tools</b></a>. I'll try to build them and apply them to Percona Server 5.7.32-35 running on an Ubuntu 16.04 netbook and fighting with some read-write <b>sysbench</b> load.</p><p>To build the tools I clone them from GitHub and just make as usual: </p><blockquote><p><span style="font-size: x-small;"><span style="font-family: courier;">openxs@ao756:~/git$ <b>git clone https://github.com/tanelpoder/0xtools</b><br />Cloning into '0xtools'...<br />remote: Enumerating objects: 103, done.<br />remote: Counting objects: 100% (103/103), done.<br />remote: Compressing objects: 100% (67/67), done.<br />remote: Total 186 (delta 53), reused 77 (delta 32), pack-reused 83<br />Receiving objects: 100% (186/186), 108.43 KiB | 0 bytes/s, done.<br />Resolving deltas: 100% (94/94), done.<br />Checking connectivity... 
done.<br />openxs@ao756:~/git$ <b>cd 0xtools/</b><br />openxs@ao756:~/git/0xtools$ <b>make</b><br />gcc -I include -Wall -o bin/xcapture src/xcapture.c<br />openxs@ao756:~/git/0xtools$ <b>sudo make install</b><br /># for now the temporary "install" method is with symlinks<br />ln -s `pwd`/bin/xcapture /usr/bin/xcapture<br />ln -s `pwd`/bin/psn /usr/bin/psn<br />ln -s `pwd`/bin/schedlat /usr/bin/schedlat<br />openxs@ao756:~/git/0xtools$ <b>file bin/xcapture</b><br />bin/xcapture: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=ad05c149b0379a29fef765ec4706e912843846ff, not stripped<br />openxs@ao756:~/git/0xtools$ <b>file bin/psn</b><br />bin/psn: Python script, ASCII text executable<br />openxs@ao756:~/git/0xtools$ <b>file bin/schedlat</b><br />bin/schedlat: Python script, ASCII text executable<br />openxs@ao756:~/git/0xtools$ <b>python --version</b><br />Python 2.7.12</span></span><br /></p></blockquote><p>Nothing more than that for now: an executable and two Python scripts. This installation process should work well on any Linux where you can build non-ancient MySQL from source and that has some Python installed, starting from RHEL5. 
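There is no magic behind these tools: they just read and parse small text files under /proc. For example, the per-task state that psn reports comes from /proc/&lt;pid&gt;/stat; here is my own few-line illustration of the parsing rule from the proc(5) manual page (not 0x.tools code):

```python
import os

def task_state(pid):
    """Return the state letter ('R', 'S', 'D', ...) from /proc/<pid>/stat.
    The comm field is enclosed in parentheses and may itself contain spaces
    or parentheses, so parse from the *last* closing paren, per proc(5)."""
    with open("/proc/%d/stat" % pid) as f:
        data = f.read()
    # Skip past ") " after comm; the state letter is the next field.
    return data[data.rindex(")") + 2:].split()[0]

# A process is always Running while reading its own stat file:
print(task_state(os.getpid()))  # R
```

Sampling such files in a loop and aggregating the results is essentially all that xcapture and psn do, which explains their low overhead.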
</p><p>We can get some details about the programs' usage as follows:</p><blockquote><p><span style="font-size: x-small;"><span style="font-family: courier;">openxs@ao756:~/git/0xtools$ <b>xcapture -h</b><br /><br />0x.Tools xcapture v1.0 by Tanel Poder [https://0x.tools]<br /><br />Usage:<br /> xcapture [options]<br /><br /> By default, sample all /proc tasks in states R, D every second and print to stdout<br /><br /> Options:<br /> -a capture tasks in additional states, even the ones Sleeping (S)<br /> -A capture tasks in All states, including Zombie (Z), Exiting (X), Idle (I)<br /> -c <c1,c2> print additional columns (for example: -c exe,cmdline,kstack)<br /> -d <N> seconds to sleep between samples (default: 1)<br /> -E <string> custom task state Exclusion filter (default: XZIS)<br /> -h display this help message<br /> -o <dirname> write wide output into hourly CSV files in this directory instead of stdout<br /><br />openxs@ao756:~/git/0xtools$ <b>psn -h</b><br />usage: psn [-h] [-d seconds] [-p [pid]] [-t [tid]] [-r] [-a]<br /> [--sample-hz SAMPLE_HZ] [--ps-hz PS_HZ] [-o filename] [-i filename]<br /> [-s csv-columns] [-g csv-columns] [-G csv-columns] [--list]<br /><br />optional arguments:<br /> -h, --help show this help message and exit<br /> -d seconds number of seconds to sample for<br /> -p [pid], --pid [pid]<br /> process id to sample (including all its threads), or<br /> process name regex, or omit for system-wide sampling<br /> -t [tid], --thread [tid]<br /> thread/task id to sample (not implemented yet)<br /> -r, --recursive also sample and report for descendant processes<br /> -a, --all-states display threads in all states, including idle ones<br /> --sample-hz SAMPLE_HZ<br /> sample rate in Hz (default: 20)<br /> --ps-hz PS_HZ sample rate of new processes in Hz (default: 2)<br /> -o filename, --output-sample-db filename<br /> path of sqlite3 database to persist samples to,<br /> defaults to in-memory/transient<br /> -i filename, --input-sample-db 
filename<br /> path of sqlite3 database to read samples from instead<br /> of actively sampling<br /> -s csv-columns, --select csv-columns<br /> additional columns to report<br /> -g csv-columns, --group-by csv-columns<br /> columns to aggregate by in reports<br /> -G csv-columns, --append-group-by csv-columns<br /> default + additional columns to aggregate by in<br /> reports<br /> --list list all available columns</span></span><br /></p></blockquote><p>The last one is also a Python script; it does not provide a <b>-h</b> option, but shows its usage when called without arguments:</p><blockquote><p><span style="font-size: x-small;"><span style="font-family: courier;">openxs@ao756:~/git/0xtools$ <b>bin/schedlat</b><br />usage: bin/schedlat PID</span></span></p></blockquote><p>There are also two shell scripts in <b>bin/</b>:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/git/0xtools$ <b>ls -F bin</b><br />psn* run_xcapture.sh* run_xcpu.sh* schedlat* xcapture*<br /></span></span></p></blockquote><p>The purposes of the tools are the following:</p><ul style="text-align: left;"><li><b>xcapture</b> - low-overhead thread state sampler based on reading <b>/proc</b> files</li><li><b>psn</b> - shows current top thread activity by sampling <b>/proc</b> files</li><li><b>schedlat</b> - shows CPU scheduling latency for the given <i><b>PID</b></i> as a % of its runtime</li><li><b>run_xcapture.sh</b> - a simple “daemon” script for keeping <b>xcapture</b> running</li><li><b>run_xcpu.sh</b> - low-frequency continuous stack sampling for threads on CPU (using <b>perf</b>)<br /></li></ul><p>The last script, <b>run_xcpu.sh</b>, is out of the scope of this series. We may get back to it in future <b>perf</b>-related posts. 
It shows proper usage of <b>perf</b> for low-frequency (1 Hz), low-impact on-CPU profiling.<br /></p><p>Let's see what we can get with <b>xcapture</b>:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/git/0xtools$ <b>mkdir /tmp/xcap</b><br />openxs@ao756:~/git/0xtools$ <b>xcapture -a -o /tmp/xcap</b><br /><br />0xTools xcapture v1.0 by Tanel Poder [https://0x.tools]<br /><br />Sampling /proc...<br /><br /><b>^C</b><br />openxs@ao756:~/git/0xtools$</span></span></p></blockquote><p>while the <b>sysbench</b> test is running against my Percona Server:</p><p></p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~$ sysbench oltp_read_write --db-driver=mysql --tables=5 --table-size=100000 --mysql-user=root --mysql-socket=/var/run/mysqld/mysqld.sock --mysql-db=sbtest --threads=16 --time=1200 --report-interval=10 run<br />...<br />[ 1200s ] thds: 16 tps: 70.99 qps: 1420.02 (r/w/o: 993.87/284.16/141.98) lat (ms,95%): 344.08 err/s: 0.00 reconn/s: 0.00<br />SQL statistics:<br /> queries performed:<br /> read: 1214556<br /> write: 347016<br /> other: 173508<br /> total: 1735080<br /><b> transactions: 86754 (72.28 per sec.)<br /> queries: 1735080 (1445.54 per sec.)<br /></b> ignored errors: 0 (0.00 per sec.)<br /> reconnects: 0 (0.00 per sec.)<br /><br />Throughput:<br /> events/s (eps): 72.2771<br /> time elapsed: 1200.2964s<br /> total number of events: 86754<br /><br />Latency (ms):<br /> min: 89.13<br /> avg: 221.33<br /> max: 855.31<br /><b> 95th percentile: 356.70<br /></b> sum: 19200998.85<br /><br />Threads fairness:<br /> events (avg/stddev): 5422.1250/12.10<br /> execution time (avg/stddev): 1200.0624/0.11<br /></span></span></p><p></p></blockquote><p>To estimate the impact of capturing the information, I've got the same test results without <b>xcapture</b> running previously:<br /></p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">...<br
/><br /><b> transactions: 87376 (72.80 per sec.)<br /> queries: 1747537 (1456.07 per sec.)<br /></b> ignored errors: 1 (0.00 per sec.)<br /> reconnects: 0 (0.00 per sec.)<br /><br />Throughput:<br /> events/s (eps): 72.8027<br /> time elapsed: 1200.1747s<br /> total number of events: 87376<br /><br />Latency (ms):<br /> min: 87.54<br /> avg: 219.75<br /> max: 843.19<br /><b> 95th percentile: 356.70</b><br />...</span></span><br /></p></blockquote><p>So we had 72.80 TPS and 1456 QPS over 1200 seconds without monitoring and 72.28 TPS and 1446 QPS with it. Do not mind the absolute numbers (no tuning was performed, all defaults, an old netbook with a slow HDD), but the difference is really negligible.</p><p>Now let's see what was captured (with <b>-a</b>, so all threads of all processes in the system, in all states, even <b>S</b> - sleeping):</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/git/0xtools$ <b>ls -l /tmp/xcap/</b><br />total 56720<br />-rw-rw-r-- 1 openxs openxs 36310058 Jan 8 16:59 2021-01-08.16.csv<br />-rw-rw-r-- 1 openxs openxs 21768834 Jan 8 17:07 2021-01-08.17.csv<br />openxs@ao756:~/git/0xtools$ <b>cat /tmp/xcap/2021-01-08.16.csv | head -30</b><br /><b>TS,PID,TID,USERNAME,ST,COMMAND,SYSCALL,WCHAN,EXE,KSTACK<br /></b>2021-01-08 16:47:24.414,1,1,root,S,(systemd),-,0,-,-<br />2021-01-08 16:47:24.414,2,2,root,S,(kthreadd),-,0,-,-<br />2021-01-08 16:47:24.414,3,3,root,S,(ksoftirqd/0),-,0,-,-<br />2021-01-08 16:47:24.414,5,5,root,S,(kworker/0:0H),-,0,-,-<br />2021-01-08 16:47:24.414,7,7,root,S,(rcu_sched),-,0,-,-<br />2021-01-08 16:47:24.414,8,8,root,S,(rcu_bh),-,0,-,-<br />2021-01-08 16:47:24.414,9,9,root,S,(migration/0),-,0,-,-<br />2021-01-08 16:47:24.414,10,10,root,S,(watchdog/0),-,0,-,-<br />2021-01-08 16:47:24.414,11,11,root,S,(watchdog/1),-,0,-,-<br />2021-01-08 16:47:24.414,12,12,root,S,(migration/1),-,0,-,-<br />2021-01-08 16:47:24.414,13,13,root,S,(ksoftirqd/1),-,0,-,-<br />2021-01-08 
16:47:24.414,15,15,root,S,(kworker/1:0H),-,0,-,-<br />2021-01-08 16:47:24.414,16,16,root,S,(kdevtmpfs),-,0,-,-<br />2021-01-08 16:47:24.414,17,17,root,S,(netns),-,0,-,-<br />2021-01-08 16:47:24.414,18,18,root,S,(perf),-,0,-,-<br />2021-01-08 16:47:24.414,19,19,root,S,(khungtaskd),-,0,-,-<br />2021-01-08 16:47:24.414,20,20,root,S,(writeback),-,0,-,-<br />2021-01-08 16:47:24.414,21,21,root,S,(ksmd),-,0,-,-<br />2021-01-08 16:47:24.414,22,22,root,S,(khugepaged),-,0,-,-<br />2021-01-08 16:47:24.414,23,23,root,S,(crypto),-,0,-,-<br />2021-01-08 16:47:24.414,24,24,root,S,(kintegrityd),-,0,-,-<br />2021-01-08 16:47:24.414,25,25,root,S,(bioset),-,0,-,-<br />2021-01-08 16:47:24.414,26,26,root,S,(kblockd),-,0,-,-<br />2021-01-08 16:47:24.414,27,27,root,S,(ata_sff),-,0,-,-<br />2021-01-08 16:47:24.414,28,28,root,S,(md),-,0,-,-<br />2021-01-08 16:47:24.414,29,29,root,S,(devfreq_wq),-,0,-,-<br />2021-01-08 16:47:24.414,34,34,root,S,(kswapd0),-,0,-,-<br />2021-01-08 16:47:24.414,35,35,root,S,(vmstat),-,0,-,-<br />2021-01-08 16:47:24.414,36,36,root,S,(fsnotify_mark),-,0,-,-<br />openxs@ao756:~/git/0xtools$ </span></span><br /></p></blockquote><p>Nothing that fancy in the first 30 lines. But you can then load the CSV files into a database for processing, or query them with standard Linux text processing tools. It’s like SQL but with different keywords: <b>grep</b> for filtering, <b>awk</b> or <b>sed</b> for column projection, <b>uniq</b> for group by, <b>sort</b> for ordering and <b>head</b>/<b>tail</b> for <b>LIMIT</b>. Let's count the number of threads per command, state and syscall, for example, and get the top 10:<br /></p><p></p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/git/0xtools$ <b>cat /tmp/xcap/*.csv | awk -F, '{
printf("%-20s %-20s %s\n",$6,$5,$7) }' | sort | uniq -c | sort -nbr |
head -10</b><br /> 71442 (console-kit-dae) S -<br /> 69174 (gmain) S read<br /> 69173 (gdbus) S read<br /> 48762 (dconf S read<br /> 47628 (dockerd) S -<br /><b> 46455 (mysqld) S -<br /> 19196 (sysbench) S read</b><br /> 15876 (gmain) S -<br /> 15876 (bioset) S -<br /> 14742 (containerd) S -</span></span><br /></p></blockquote><p>No wonder MySQL threads were mostly sleeping, with 16 of them and just 2 slow CPUs.</p><p>The next tool is <b>psn</b> (Linux Process Snapper), a Python script for troubleshooting
currently ongoing issues (no historical capture). It reports
more fields directly from <b>/proc</b> than <b>xcapture</b> captures (like filenames
accessed by I/O system calls). Let me try to apply <b>psn</b> to the <b>mysqld</b> process:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/git/0xtools$ <b>psn -p `pidof mysqld`</b><br /><br />Linux Process Snapper v0.18 by Tanel Poder [https://0x.tools]<br />Sampling /proc/stat for 5 seconds... finished.<br /><br /><br />=== Active Threads ========================================<br /><br /> samples | avg_threads | comm | state<br />-----------------------------------------------------------<br /> 205 | 2.05 | (mysqld) | Disk (Uninterruptible)<br /> 114 | 1.14 | (mysqld) | Running (ON CPU)<br /><br /><br />samples: 100 (expected: 100)<br />total processes: 1, threads: 44<br />runtime: 5.01, measure time: 1.29</span></span><br /></p></blockquote><p>or to the entire system while the <b>sysbench</b> test is running:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/git/0xtools$ <b>psn</b><br /><br />Linux Process Snapper v0.18 by Tanel Poder [https://0x.tools]<br />Sampling /proc/stat for 5 seconds... 
finished.<br /><br /><br />=== Active Threads ================================================<br /><br /> samples | avg_threads | comm | state<br />-------------------------------------------------------------------<br /> 94 | 2.24 | (mysqld) | Disk (Uninterruptible)<br /> 41 | 0.98 | (jbd*/dm-*-*) | Disk (Uninterruptible)<br /> 31 | 0.74 | (mysqld) | Running (ON CPU)<br /> 31 | 0.74 | (sysbench) | Running (ON CPU)<br /> 14 | 0.33 | (update-manager) | Running (ON CPU)<br /> 5 | 0.12 | (kworker/u*:*) | Running (ON CPU)<br /> 2 | 0.05 | (Xorg) | Running (ON CPU)<br /> 1 | 0.02 | (compiz) | Running (ON CPU)<br /> 1 | 0.02 | (dockerd) | Running (ON CPU)<br /> 1 | 0.02 | (rcu_sched) | Running (ON CPU)<br /><br /><br />samples: 42 (expected: 100)<br />total processes: 221, threads: 650<br />runtime: 5.06, measure time: 5.01</span></span><br /></p></blockquote><p>These are useful summaries of the numbers of threads per state per process over the monitoring period, but nothing really fancy. That said, we can add other columns from this long list, with items that should look familiar to those studying <b>/proc</b> content:</p><blockquote><p><span style="font-size: x-small;"><span style="font-family: courier;">openxs@ao756:~/git/0xtools$ psn --list<br /><br />stat<br />====================================<br />pid int<br />comm str<br />comm2 str<br />state_id str<br />state str<br />ppid int<br />pgrp int<br />session int<br />tty_nr int<br />tpgid int<br />minflt int<br />cminflt int<br />majflt int<br />cmajflt int<br />utime long<br />stime long<br />cutime long<br />cstime long<br />utime_sec long<br />stime_sec long<br />cutime_sec long<br />cstime_sec long<br />priority int<br />nice int<br />num_threads int<br />starttime long<br />vsize long<br />rss long<br />rsslim str<br />exit_signal int<br />processor int<br />rt_priority int<br />delayacct_blkio_ticks long<br />guest_time int<br />cgust_time int<br />exit_code int<br /><br />status<br
/>====================================<br />name str<br />umask str<br />state str<br />tgid int<br />ngid int<br />pid int<br />ppid int<br />tracerpid int<br />uid int<br />gid int<br />fdsize int<br />groups str<br />nstgid str<br />nspid str<br />nspgid str<br />nssid str<br />vmpeak_kb int<br />vmsize_kb int<br />vmlck_kb int<br />vmpin_kb int<br />vmhwm_kb int<br />vmrss_kb int<br />rssanon_kb int<br />rssfile_kb int<br />rssshmem_kb int<br />vmdata_kb int<br />vmstk_kb int<br />vmexe_kb int<br />vmlib_kb int<br />vmpte_kb int<br />vmpmd_kb int<br />vmswap_kb int<br />hugetlbpages_kb int<br />threads int<br />sigq str<br />sigpnd str<br />shdpnd str<br />sigblk str<br />sigign str<br />sigcgt str<br />capinh str<br />capprm str<br />capeff str<br />capbnd str<br />capamb str<br />seccomp int<br />cpus_allowed str<br />cpus_allowed_list str<br />mems_allowed str<br />mems_allowed_list str<br />voluntary_ctxt_switches int<br />nonvoluntary_ctxt_switches int<br /><br />syscall<br />====================================<br />syscall_id int<br />syscall str<br />arg0 str<br />arg1 str<br />arg2 str<br />arg3 str<br />arg4 str<br />arg5 str<br />filename str<br />filename2 str<br />filenamesum str<br />basename str<br />dirname str<br /><br />wchan<br />====================================<br />wchan str<br /><br />io<br />====================================<br />rchar int<br />wchar int<br />syscr int<br />syscw int<br />read_bytes int<br />write_bytes int<br />cancelled_write_bytes int<br /><br />smaps<br />====================================<br />address_range str<br />perms str<br />offset str<br />dev str<br />inode int<br />pathname str<br />size_kb int<br />rss_kb int<br />pss_kb int<br />shared_clean_kb int<br />shared_dirty_kb int<br />private_clean_kb int<br />private_dirty_kb int<br />referenced_kb int<br />anonymous_kb int<br />anonhugepages_kb int<br />shmempmdmapped_kb int<br />shared_hugetld_kb int<br />private_hugetld_kb int<br />swap_kb int<br 
/>swappss_kb int<br />kernelpagesize_kb int<br />mmupagesize_kb int<br />locked_kb int<br />vmflags str<br /><br />stack<br />====================================<br />kstack str<br /><br />cmdline<br />====================================<br />cmdline str<br />openxs@ao756:~/git/0xtools$</span></span><br /></p></blockquote><p>We can add <b>syscall</b> to find out what the threads were waiting on (not just the syscall number as exposed in <b>/proc</b>, but the decoded syscall name, which is what requires some tricks or a good tool):<br /></p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/git/0xtools$ <b>sudo psn -p `pidof mysqld` -G syscall</b><br /><br />Linux Process Snapper v0.18 by Tanel Poder [https://0x.tools]<br />Sampling /proc/stat, syscall for 5 seconds... finished.<br /><br /><br />=== Active Threads ====================================================<br /><br /> samples | avg_threads | comm | state | syscall<br />-----------------------------------------------------------------------<br /><b> 136 | 1.36 | (mysqld) | Disk (Uninterruptible) | fsync<br /> 97 | 0.97 | (mysqld) | Disk (Uninterruptible) | fdatasync<br /></b> 41 | 0.41 | (mysqld) | Running (ON CPU) | [running]<br /> 5 | 0.05 | (mysqld) | Running (ON CPU) | poll<br /> 4 | 0.04 | (mysqld) | Disk (Uninterruptible) | pwrite64<br /> 3 | 0.03 | (mysqld) | Running (ON CPU) | futex<br /><b> 2 | 0.02 | (mysqld) | Disk (Uninterruptible) | write<br /></b><br /><br />samples: 100 (expected: 100)<br />total processes: 1, threads: 44<br />runtime: 5.02, measure time: 1.67</span></span><br /></p></blockquote><p>We can further group threads by their kernel stacks:</p><blockquote><p><span style="font-size: x-small;"><span style="font-family: courier;">openxs@ao756:~/git/0xtools$ <b>sudo psn -p `pidof mysqld` -G kstack</b><br /><br />Linux Process Snapper v0.18 by Tanel Poder [https://0x.tools]<br />Sampling /proc/stat, stack for 5 seconds... 
finished.<br /><br /><br />=== Active Threads =========================================================================================================================================================================================================================================================================================<br /><br /> samples | avg_threads | comm | state | kstack <br />------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------<br /> 101 | 1.01 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_fsync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->jbd2_complete_transaction()->jbd2_log_wait_commit() <br /> 89 | 0.89 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_fdatasync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->jbd2_complete_transaction()->jbd2_log_wait_commit() <br /> 48 | 0.48 | (mysqld) | Running (ON CPU) | entry_SYSCALL_64_fastpath()->SyS_poll()->do_sys_poll()->poll_schedule_timeout() <br /> 15 | 0.15 | (mysqld) | Running (ON CPU) | int_ret_from_sys_call()->syscall_return_slowpath()->exit_to_usermode_loop() <br /><b> 11 | 0.11 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_fsync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->blkdev_issue_flush()->submit_bio_wait()</b> <br /> 11 | 0.11 | (mysqld) | Running (ON CPU) | - <br /> 9 | 0.09 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_fsync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->filemap_write_and_wait_range()->filemap_fdatawait_range()->__filemap_fdatawait_range()->wait_on_page_bit()<br /> 7 | 0.07 | (mysqld) | Running (ON CPU) | entry_SYSCALL_64_fastpath()->SyS_futex()->do_futex()->futex_wait()->futex_wait_queue_me() <br /> 4 | 0.04 | 
(mysqld) | Running (ON CPU) | retint_user()->prepare_exit_to_usermode()->exit_to_usermode_loop() <br /> 3 | 0.03 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_pwrite64()->vfs_write()->__vfs_write()->new_sync_write()->ext4_file_write_iter()->__generic_file_write_iter()->generic_file_direct_write()->ext4_direct_IO()->__blockdev_direct_IO()->do_blockdev_direct_IO()<br /> 2 | 0.02 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_fdatasync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->filemap_write_and_wait_range()->filemap_fdatawait_range()->__filemap_fdatawait_range()->wait_on_page_bit()<br /> 2 | 0.02 | (mysqld) | Disk (Uninterruptible) | entry_SYSCALL_64_fastpath()->SyS_pread64()->vfs_read()->__vfs_read()->new_sync_read()->generic_file_read_iter()->wait_on_page_bit_killable() <br /> 2 | 0.02 | (mysqld) | Running (ON CPU) | entry_SYSCALL_64_fastpath()->SyS_io_getevents()->read_events() <br /> 1 | 0.01 | (mysqld) | Running (ON CPU) | entry_SYSCALL_64_fastpath()->SyS_fsync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->filemap_write_and_wait_range()->filemap_fdatawait_range()->__filemap_fdatawait_range()->wait_on_page_bit()<br /> 1 | 0.01 | (mysqld) | Running (ON CPU) | entry_SYSCALL_64_fastpath()->SyS_fsync()->do_fsync()->vfs_fsync_range()->ext4_sync_file()->jbd2_complete_transaction()->jbd2_log_wait_commit() <br /><br /><br />samples: 100 (expected: 100)<br />total processes: 1, threads: 44<br />runtime: 5.01, measure time: 1.78</span></span><br /></p></blockquote><p>or, even better, we can see the wait channel and, when the thread waits on disk I/O, the file the wait is related to:</p><blockquote><p><span style="font-size: x-small;"><span style="font-family: courier;">openxs@ao756:~/git/0xtools$ <b>sudo psn -p `pidof mysqld` -G wchan,filename</b><br /><br />Linux Process Snapper v0.18 by Tanel Poder [https://0x.tools]<br />Sampling /proc/syscall, wchan, stat for 5 seconds... 
finished.<br /><br /><br />=== Active Threads =====================================================================================================<br /><br /> samples | avg_threads | comm | state | wchan | filename<br />------------------------------------------------------------------------------------------------------------------------<br /><b> 91 | 0.91 | (mysqld) | Disk (Uninterruptible) | jbd2_log_wait_commit | /var/lib/mysql/ao756-bin.000092<br /> 89 | 0.89 | (mysqld) | Disk (Uninterruptible) | jbd2_log_wait_commit | /var/lib/mysql/ib_logfile1</b><br /> 49 | 0.49 | (mysqld) | Running (ON CPU) | 0 |<br /> 9 | 0.09 | (mysqld) | Disk (Uninterruptible) | wait_on_page_bit | /var/lib/mysql/ib_logfile1<br /> 7 | 0.07 | (mysqld) | Disk (Uninterruptible) | jbd2_log_wait_commit | /var/lib/mysql/xb_doublewrite<br /> 4 | 0.04 | (mysqld) | Disk (Uninterruptible) | jbd2_log_wait_commit | /var/lib/mysql/sbtest/sbtest5.ibd<br /> 3 | 0.03 | (mysqld) | Disk (Uninterruptible) | jbd2_log_wait_commit | /var/lib/mysql/sbtest/sbtest4.ibd<br /> 3 | 0.03 | (mysqld) | Running (ON CPU) | poll_schedule_timeout |<br /> 2 | 0.02 | (mysqld) | Disk (Uninterruptible) | wait_on_page_bit | /var/lib/mysql/ao756-bin.000092<br /><b> 1 | 0.01 | (mysqld) | Disk (Uninterruptible) | do_blockdev_direct_IO | /var/lib/mysql/xb_doublewrite</b><br /> 1 | 0.01 | (mysqld) | Disk (Uninterruptible) | submit_bio_wait | /var/lib/mysql/sbtest/sbtest2.ibd<br /> 1 | 0.01 | (mysqld) | Disk (Uninterruptible) | submit_bio_wait | /var/lib/mysql/sbtest/sbtest3.ibd<br /> 1 | 0.01 | (mysqld) | Disk (Uninterruptible) | wait_on_page_bit |<br /> 1 | 0.01 | (mysqld) | Disk (Uninterruptible) | wait_on_page_bit | /var/lib/mysql/sbtest/sbtest4.ibd<br /> 1 | 0.01 | (mysqld) | Disk (Uninterruptible) | wait_on_page_bit | /var/lib/mysql/sbtest/sbtest5.ibd<br /> 1 | 0.01 | (mysqld) | Running (ON CPU) | 0 | /var/lib/mysql/ib_logfile1<br /><br /><br />samples: 100 (expected: 100)<br />total processes: 1, threads: 44<br />runtime: 
5.03, measure time: 2.06</span></span><br /></p></blockquote><p>This is really cool and easy to use! This way I was reminded that I actually have binary logging enabled somewhere, and it surely slows things down :)<br /></p><p>The last tool to check today is <b>schedlat</b>. It did not provide me with any insights:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~/git/0xtools$ <b>schedlat `pidof mysqld`</b><br />SchedLat by Tanel Poder [https://0x.tools]<br /><br />PID=19951 COMM=mysqld<br /><br />TIMESTAMP %CPU %LAT %SLP<br />2021-01-08 18:52:38 0.0 0.0 100.0<br />2021-01-08 18:52:39 0.0 0.0 100.0<br />2021-01-08 18:52:40 0.0 0.0 100.0<br />2021-01-08 18:52:41 0.0 0.0 100.0<br /><b>2021-01-08 18:52:42 0.0 0.4 99.6</b><br />2021-01-08 18:52:43 0.0 0.0 100.0<br />2021-01-08 18:52:44 0.0 0.0 100.0<br />2021-01-08 18:52:45 0.0 0.0 100.0<br />2021-01-08 18:52:46 0.0 0.0 100.0<br />2021-01-08 18:52:47 0.0 0.0 100.0<br />2021-01-08 18:52:48 0.0 0.0 100.0<br />2021-01-08 18:52:49 0.0 0.0 100.0<br />2021-01-08 18:52:50 0.0 0.0 100.0<br />2021-01-08 18:52:51 0.0 0.0 100.0<br />2021-01-08 18:52:52 0.0 0.0 100.0<br />2021-01-08 18:52:53 0.0 0.0 100.0<br />2021-01-08 18:52:54 0.0 0.0 100.0<br />2021-01-08 18:52:55 0.0 0.0 100.0<br />2021-01-08 18:52:56 0.0 0.0 100.0<br />2021-01-08 18:52:57 0.0 0.0 100.0<br />2021-01-08 18:52:58 0.0 0.0 100.0<br />2021-01-08 18:52:59 0.0 0.0 100.0<br />2021-01-08 18:53:00 0.0 0.0 100.0<br />2021-01-08 18:53:01 0.0 0.0 100.0<br />2021-01-08 18:53:02 0.0 0.0 100.0<br />2021-01-08 18:53:03 0.0 0.0 100.0<br />2021-01-08 18:53:04 0.0 0.0 100.0<br />2021-01-08 18:53:05 0.0 0.0 100.0<br />2021-01-08 18:53:06 0.0 0.0 100.0<br />2021-01-08 18:53:07 0.0 0.0 100.0<br />^CTraceback (most recent call last):<br /> File "/usr/bin/schedlat", line 36, in <module><br /> time.sleep(1)<br />KeyboardInterrupt</span></span><br /></p></blockquote><p>The server was not CPU-bound; the process was mostly sleeping. 
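For the curious, the core of what <b>schedlat</b> reports can be approximated with a few lines of shell reading <b>/proc/[pid]/schedstat</b> twice. This is only a rough sketch for understanding, not a replacement for the tool: it assumes (per the kernel's sched-stats documentation) that the first two fields are cumulative on-CPU time and run-queue wait time in nanoseconds, it looks only at the task-level entry for the given pid, and the `pidof mysqld` call in the comment is illustrative:

```shell
# Rough one-interval version of the schedlat idea: sample the cumulative
# on-CPU (c) and run-queue wait (w) nanoseconds from /proc/<pid>/schedstat
# twice, one second apart, and print the deltas as percentages of the
# interval. %SLP is simply whatever time is left over.
schedlat_once() {
  local pid=$1 c1 w1 c2 w2 _
  read -r c1 w1 _ < "/proc/$pid/schedstat" || return 1
  sleep 1
  read -r c2 w2 _ < "/proc/$pid/schedstat" || return 1
  # over a 1 s (1e9 ns) interval, percent = delta_ns / 1e7
  awk -v c="$((c2 - c1))" -v w="$((w2 - w1))" 'BEGIN {
    printf "%%CPU %.1f %%LAT %.1f %%SLP %.1f\n",
           c / 1e7, w / 1e7, 100 - (c + w) / 1e7
  }'
}
# schedlat_once "$(pidof mysqld)"
```

The 1-second interval is hardcoded for simplicity; the real tool loops and prints a timestamped line per interval.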
</p><p>I suggest you get inspired by the thread states chart below and the article it was copied from, and read other articles by Tanel Poder <a href="https://tanelpoder.com/" target="_blank">here</a>. This is what I am going to do myself before trying to apply the tools in real life. The idea of this blog post was to show what is possible for a creative person while sampling thread states from <b>/proc</b>.<br /></p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgsoJuzmlxOtJaVXNVm0NzUNEJEPc8kYNnQdNNoCii9QY1C8whfZYUGTvGDvcy39po6JQ76kPQPi5IrBs6u0dm_vK9GWFrWzpqAjkIYLzvnxTpDQNIWwuLaC4yCMXm2pPmmvgsybvkTmlh_/s1400/thread_states.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="886" data-original-width="1400" height="254" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgsoJuzmlxOtJaVXNVm0NzUNEJEPc8kYNnQdNNoCii9QY1C8whfZYUGTvGDvcy39po6JQ76kPQPi5IrBs6u0dm_vK9GWFrWzpqAjkIYLzvnxTpDQNIWwuLaC4yCMXm2pPmmvgsybvkTmlh_/w400-h254/thread_states.png" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Nice illustration of the generic thread states we care about for off-CPU sampling, provided by <b>Brendan Gregg</b> at <a href="http://www.brendangregg.com/offcpuanalysis.html" target="_blank">http://www.brendangregg.com/offcpuanalysis.html</a><br /></td></tr></tbody></table><br /><p style="text-align: center;">* * *</p><p style="text-align: left;">To summarize, here is why anyone may prefer or need to use <b>/proc</b> sampling tools instead of other approaches:</p><ol style="text-align: left;"><li>Unlike <a href="http://mysqlentomologist.blogspot.com/search/label/perf" target="_blank"><b>perf</b></a> or <a href="http://mysqlentomologist.blogspot.com/search/label/eBPF" target="_blank">eBPF</a> 
tools, <b>/proc</b> is always there. It is present on old Linux versions, and poor man's sampling with shell scripts does not require installing anything else. </li><li>Some information about the processes is visible to all users, so there may be no need for <b>root</b>/<b>sudo</b> privileges that MySQL production DBAs often lack.<br /></li><li>It is an easy way to do low-overhead off-CPU profiling. While it is possible to enable tracing for <a href="http://www.brendangregg.com/offcpuanalysis.html" target="_blank">off-cpu events</a> in <b>perf</b>, it comes with a higher tracing overhead (and then the overhead of post-processing these high-frequency events). eBPF tools can be used to reduce both, but it is still extra overhead on top of what is already there in <b>/proc</b>.</li></ol><p>So I decided to add these tools and ad hoc <b>/proc</b> sampling scripts (and checking <b>sar</b> outputs, if any) to my emergency toolset while studying some complex MySQL performance problems. They fit well into the overall system of application-level tracing based on <b>performance_schema</b>, on-CPU sampling with <b>perf</b> and some <a href="http://mysqlentomologist.blogspot.com/search/label/bpftrace" target="_blank"><b>bpftrace</b></a> quick scripts. 
Use the right tools for the job!</p><p>In the next, probably final post in this series, I'll try to present a good example of a MySQL performance problem where /proc-based sampling tools really help to get useful insights.<br /></p>Valerii Kravchukhttp://www.blogger.com/profile/13158916419325454260noreply@blogger.com0tag:blogger.com,1999:blog-3080615211468083537.post-78147427620104022592021-01-07T21:16:00.001+02:002021-01-07T21:16:15.797+02:00Linux /proc Filesystem for MySQL DBAs - Part II, Threads of the mysqld Process<p>It's common <a href="https://dev.mysql.com/doc/internals/en/threads.html" target="_blank">knowledge</a> that the MySQL server (<b>mysqld</b> process) is <a href="https://mysqlserverteam.com/mysql-connection-handling-and-scaling/" target="_blank">multi-threaded</a>. Let me quote:</p><blockquote><p><span style="font-size: x-small;">"The MySQL Server (mysqld) executes as a single OS <i>process</i>, with multiple <i>threads</i>
executing concurrent activities. MySQL does not have its own thread
implementation, but relies on the thread implementation of the
underlying OS."</span> <br /></p></blockquote><p>In the <a href="http://mysqlentomologist.blogspot.com/2021/01/linux-proc-filesystem-for-mysql-dbas.html" target="_blank">previous post</a> we've seen that on Linux the <b>/proc/[<i>pid</i>]/task/</b> subdirectory contains subdirectories of the form <b>[<i>tid</i>]/</b>, one for each thread of the process (<i><b>pid</b></i>). This is what I have for the Percona Server 5.7.32-35 I use on Ubuntu 16.04 as an example in this series of blog posts:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">mysql> <b>select connection_id(), version(), @@pid_file;</b><br />+-----------------+---------------+----------------------------+<br />| connection_id() | version() | @@pid_file |<br />+-----------------+---------------+----------------------------+<br />| 167 | 5.7.32-35-log | /var/run/mysqld/mysqld.pid |<br />+-----------------+---------------+----------------------------+<br />1 row in set (0.00 sec)<br /><br />mysql> <b>\! sudo cat /var/run/mysqld/mysqld.pid</b><br />30580<br />mysql> <b>\! ls -F /proc/30580/task</b><br />2389/ 2394/ 30582/ 30586/ 30590/ 30594/ 30600/ 30604/ 30608/ 3620/<br />2390/ 2488/ 30583/ 30587/ 30591/ 30597/ 30601/ 30605/ 30609/<br />2391/ 30580/ 30584/ 30588/ 30592/ 30598/ 30602/ 30606/ 31000/<br />2393/ 30581/ 30585/ 30589/ 30593/ 30599/ 30603/ 30607/ 31002/<br />mysql></span></span><br /></p></blockquote><p>I've also shown a poor man's monitoring shell script idea to study some <b>/proc</b> information for each thread. 
Let me present it nicely structured for the case of a <b>wchan</b> check to see where each thread waits (I started a <b>sysbench</b> test with 16 threads to put some load on the server and see more different wait channels):</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~$ <b>for dir in `ls /proc/30580/task`</b><br />> <b>do</b><br />> <b>echo -n $dir': '</b><br />> <b>2>/dev/null sudo strings /proc/$dir/wchan</b><br />> <b>done</b><br />2389: futex_wait_queue_me<br />2390: futex_wait_queue_me<br />2391: futex_wait_queue_me<br /><b>2393: 2394: futex_wait_queue_me</b><br />2488: futex_wait_queue_me<br />30580: poll_schedule_timeout<br />30581: do_sigtimedwait<br />30582: read_events<br />...<br />30591: read_events<br />30592: futex_wait_queue_me<br /><b>30593: hrtimer_nanosleep</b><br />30594: futex_wait_queue_me<br />30597: futex_wait_queue_me<br />30598: futex_wait_queue_me<br /><b>30599: hrtimer_nanosleep</b><br />30600: futex_wait_queue_me<br />...<br />30607: futex_wait_queue_me<br />30608: do_sigtimedwait<br />30609: futex_wait_queue_me<br /><b>31000: jbd2_log_wait_commit</b><br />31002: futex_wait_queue_me<br />3620: futex_wait_queue_me<br />...<br />6514: futex_wait_queue_me<br />openxs@ao756:~$</span></span><br /></p></blockquote><p>Note that <b>wchan</b> was empty for the thread with <i><b>tid</b></i> <b>2393</b> based on the output, so it was running at the moment. </p><p>But as the content of the <b>comm</b> and <b>cmdline</b> files is the same for all MySQL threads, how do we know what each thread is doing, or what kind of thread it is? Probably if we attach <b>gdb</b>... but the topic of this series is different. 
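As a side note, the per-thread <b>wchan</b> loop above is easy to turn into a summary that counts how many threads wait on each channel, which scales much better when there are hundreds of threads. A minimal sketch (the `pidof mysqld` call in the comment is illustrative; any pid works):

```shell
# Aggregate wait channels across all threads of a process instead of
# listing them one per line. Each wchan file holds a single symbol name
# with no trailing newline, so we add one before the usual poor man's
# GROUP BY with sort | uniq -c.
wchan_summary() {
  local pid=$1
  for t in /proc/"$pid"/task/*/wchan; do
    cat "$t" 2>/dev/null
    echo
  done | sort | uniq -c | sort -rn
}
# wchan_summary "$(pidof mysqld)"
```

On the server above this would print something like "37 futex_wait_queue_me" at the top, immediately showing the dominant wait.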
Luckily, in MySQL 5.7 we have a nice insight via the <a href="https://dev.mysql.com/doc/refman/5.7/en/performance-schema-threads-table.html" target="_blank"><b>performance_schema.threads</b></a> table: its <b>thread_os_id</b> column contains exactly what we need:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">mysql> <b>select thread_id, thread_os_id, name from performance_schema.threads where type = 'BACKGROUND';</b><br />+-----------+--------------+----------------------------------------+<br />| thread_id | thread_os_id | name |<br />+-----------+--------------+----------------------------------------+<br />| 1 | 30580 | thread/sql/main |<br />| 2 | 30581 | thread/sql/thread_timer_notifier |<br />| 3 | 30582 | thread/innodb/io_ibuf_thread |<br />| 4 | 30584 | thread/innodb/io_read_thread |<br />| 5 | 30583 | thread/innodb/io_log_thread |<br />| 6 | 30585 | thread/innodb/io_read_thread |<br />| 7 | 30586 | thread/innodb/io_read_thread |<br />| 8 | 30587 | thread/innodb/io_read_thread |<br />| 9 | 30588 | thread/innodb/io_write_thread |<br />| 10 | 30589 | thread/innodb/io_write_thread |<br />| 11 | 30590 | thread/innodb/io_write_thread |<br />| 12 | 30591 | thread/innodb/io_write_thread |<br /><b>| 13 | 30592 | thread/innodb/page_cleaner_thread |</b><br /><b>| 14 | 30593 | thread/innodb/buf_lru_manager_thread |</b><br />| 15 | 30594 | thread/innodb/srv_monitor_thread |<br />| 17 | 30603 | thread/innodb/srv_worker_thread |<br />| 18 | 30602 | thread/innodb/srv_worker_thread |<br />| 19 | 30601 | thread/innodb/srv_worker_thread |<br />| 20 | 30600 | thread/innodb/srv_purge_thread |<br /><b>| 21 | 30599 | thread/innodb/srv_master_thread |</b><br />| 22 | 30598 | thread/innodb/srv_error_monitor_thread |<br />| 23 | 30597 | thread/innodb/srv_lock_timeout_thread |<br />| 24 | 30605 | thread/innodb/dict_stats_thread |<br />| 25 | 30604 | thread/innodb/buf_dump_thread |<br />| 26 | 30608 | thread/sql/signal_handler |<br
/>+-----------+--------------+----------------------------------------+<br />25 rows in set (0.05 sec)</span></span><br /></p></blockquote><p>This is how I found out, for the previous post, that <b>30592</b> was a page cleaner thread. Here we can see that the threads waiting on a timer are InnoDB's master thread and LRU manager thread, which are woken up periodically... And what was that thread <b>31000</b> that was waiting on something commit-related at the filesystem level? It was a user connection thread, obviously:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">mysql> <b>select thread_id, thread_os_id, type, name from performance_schema.threads where thread_os_id = 31000;</b><br />+-----------+--------------+------------+---------------------------+<br />| thread_id | thread_os_id | type | name |<br />+-----------+--------------+------------+---------------------------+<br />| 217 | 31000 | FOREGROUND | thread/sql/one_connection |<br />+-----------+--------------+------------+---------------------------+<br />1 row in set (0.00 sec)</span></span><br /></p></blockquote><p>But are we really forced to run scripts and access <b>/proc</b> directly? No, and the <a href="https://dev.mysql.com/doc/refman/5.7/en/performance-schema-threads-table.html" target="_blank">fine MySQL manual</a> gives some hints, for example, about the <b>ps -L</b> option:<br /></p><blockquote><p><span style="font-size: x-small;">"For Linux, <code class="literal">THREAD_OS_ID</code> corresponds to
the value of the <code class="literal">gettid()</code> function.
This value is exposed, for example, using the
<span class="command"><b>perf</b></span> or <span class="command"><b>ps -L</b></span>
commands, or in the <code class="literal">proc</code> file system
(<code class="filename">/proc/<i class="replaceable"><code>[pid]</code></i>/task/<i class="replaceable"><code>[tid]</code></i></code>).
For more information, see the
<code class="literal">perf-stat(1)</code>, <code class="literal">ps(1)</code>,
and <code class="literal">proc(5)</code> man pages.
"<br /></span></p></blockquote><p>So, with <b>ps -L</b> we can see the threads of the <b>mysqld</b> process:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~$ <b>ps -L aux | grep -e PID -e `pidof mysqld` | more</b><br /><b>USER PID LWP %CPU NLWP %MEM VSZ RSS TTY STAT START TIME COMMAND</b><br />...<br />mysql 30580 30580 0.0 38 7.7 746308 299448 ? Sl Jan02 0:00 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid<br />mysql 30580 30581 0.0 38 7.7 746308 299448 ? Sl Jan02 0:00 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid<br />mysql 30580 30582 0.0 38 7.7 746308 299448 ? Sl Jan02 0:06 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid<br />mysql 30580 30583 0.0 38 7.7 746308 299448 ? Sl Jan02 0:07 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid<br />mysql 30580 30584 0.0 38 7.7 746308 299448 ? Sl Jan02 0:06 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid<br />mysql 30580 30585 0.0 38 7.7 746308 299448 ? Sl Jan02 0:06 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid<br />...</span></span><br /></p></blockquote><p> We can clearly see <i><b>tid</b></i> in the column titled <b>LWP</b>. <b>grep</b> allows us to apply conditions to the rows we see in the command line output. It is similar to the <b>WHERE</b> clause of <b>SELECT</b> statement. But as the rest of the information in the output looks the same for all threads we need something better in <b>ps</b> command line options to show the list of columns with useful values. Note that <b>grep</b> over all threads is actually not mandatory, we can use <b>-p</b> option to show threads only for the given <b>pid</b>:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~$ ps -L -p `pidof mysqld` | more<br /> PID LWP TTY TIME CMD<br />30580 30580 ? 00:00:00 mysqld<br />30580 30581 ? 00:00:00 mysqld<br />30580 30582 ? 
00:00:06 mysqld<br />30580 30583 ? 00:00:07 mysqld<br />30580 30584 ? 00:00:06 mysqld<br />30580 30585 ? 00:00:06 mysqld<br />30580 30586 ? 00:00:07 mysqld<br />30580 30587 ? 00:00:06 mysqld<br />...<br /></span></span></p></blockquote><p>The real problem is the list of columns - the default one we see above is even less useful. Time to read <a href="https://man7.org/linux/man-pages/man1/ps.1.html" target="_blank"><b>man ps</b></a> where we can find the <b>-o</b> option:</p><blockquote><p> <span style="font-size: x-small;"><b>-o </b><i>format</i>
</span></p><pre><span style="font-size: x-small;"> User-defined format. <i>format</i> is a single argument in the
form of a blank-separated or comma-separated list, which
offers a way to specify individual output columns. The
recognized keywords are described in the <b>STANDARD FORMAT</b>
<b>SPECIFIERS </b>section below. ...</span></pre></blockquote><p>So, after reading the details we are ready to proceed to listing all columns we need explicitly, like this:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~$ <b>ps -o lwp,pcpu,state,wchan -L -p `pidof mysqld`</b><br /> LWP %CPU S WCHAN<br />30580 0.0 S -<br />30581 0.0 S -<br />...<br />30592 0.0 S -<br />...<br />30609 0.0 S -<br /> 7192 0.6 D -<br /> 7194 0.6 S -<br /> 7195 0.6 S -<br /> 7377 0.6 S -<br /> 7379 0.6 S -<br /> 9112 0.7 S -<br /> 9115 0.7 S -<br /> 9119 0.7 S -<br /> 9120 0.7 D -<br /> 9436 1.5 S -<br />... </span></span><br /></p></blockquote><p>We can also sort the output by different criteria either with <b>ps</b> options or with external <b>sort</b> as a kind of ORDER BY, do GROUP BY with <b>sort</b> + <b>uniq</b> etc. For example, we can summarize thread states in the process like this:</p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~$ <b>ps -o state -L -p `pidof mysqld` | sort | uniq -c | sort -nbr</b><br /> 37 S<br /> 7 R<br /> 1 D</span></span><br /></p></blockquote><p>The idea is clear, hopefully. 
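Taking the GROUP BY idea one step further, we can repeat the one-shot summary over several samples to get a crude analog of what <b>psn</b> does with its 5-second sampling window. A sketch, with the pid and sample count as placeholders to adjust for your server:

```shell
# Poor man's thread-state profiler: take n one-second snapshots of the
# per-thread state column (state= suppresses the header line) and
# aggregate all samples into counts with sort | uniq -c.
sample_states() {
  local pid=$1 n=${2:-5}
  for i in $(seq 1 "$n"); do
    ps -o state= -L -p "$pid"
    sleep 1
  done | sort | uniq -c | sort -rn
}
# sample_states "$(pidof mysqld)" 10
```

Over 10 samples of the 45-thread server above, a count of 370 next to S would mean 37 threads were sleeping on average, the same kind of number psn derives for avg_threads.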
For quick checks and summarizing it may be enough to use <b>ps</b> with options to show threads (<b>-L</b>) and specific fields (<b>-o</b>), and to process the output with standard Linux command line utilities.<br /></p><p>As a side note, you can monitor threads of the <b>mysqld</b> process with the <a href="https://man7.org/linux/man-pages/man1/top.1.html" target="_blank"><b>top</b></a> <b>-H</b> option as well:</p><blockquote><p><span style="font-family: courier;"><span><span style="font-size: x-small;">openxs@ao756:~$ <b>top -version</b></span></span></span><span style="font-family: courier;"><span style="font-size: x-small;"><br /> procps-ng version 3.3.10</span></span><span style="font-family: courier;"><span style="font-size: x-small;"><br />Usage:</span></span><br /><span style="font-family: courier;"></span><span style="font-family: courier;"><span style="font-size: x-small;"> top -hv | -bcHiOSs -d secs -n max -u|U user -p pid(s) -o field -w [cols]</span></span><br /><span style="font-family: courier;"></span></p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~$ <b>top -H -b -n 1 -o +%CPU -p `pidof mysqld`</b><br />top - 19:40:20 up 30 days, 7:47, 4 users, load average: 4.08, 2.87, 2.06<br />Threads: 44 total, 9 running, 35 sleeping, 0 stopped, 0 zombie<br />%Cpu(s): 25.3 us, 6.3 sy, 5.2 ni, 59.7 id, 3.5 wa, 0.0 hi, 0.1 si, 0.0 st<br />KiB Mem : 3859384 total, 188328 free, 1052156 used, 2618900 buff/cache<br />KiB Swap: 4001788 total, 2464336 free, 1537452 used. 
2008620 avail Mem<br /><br /> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND<br />10002 mysql 20 0 746308 307432 8872 D 6.2 8.0 0:15.70 mysqld<br />10977 mysql 20 0 746308 307432 8872 S 6.2 8.0 0:01.97 mysqld<br /><b>10978 mysql 20 0 746308 307432 8872 R 6.2 8.0 0:01.95 mysqld<br />10981 mysql 20 0 746308 307432 8872 R 6.2 8.0 0:01.92 mysqld</b><br />30580 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:00.35 mysqld<br />30581 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:00.00 mysqld<br />30582 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:07.42 mysqld<br />30583 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:07.77 mysqld<br />30584 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:07.46 mysqld<br />30585 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:07.52 mysqld<br />30586 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:07.79 mysqld<br />30587 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:07.50 mysqld<br />30588 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:07.72 mysqld<br />30589 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:08.08 mysqld<br />30590 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:07.70 mysqld<br />30591 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:07.80 mysqld<br />30592 mysql 20 0 746308 307432 8872 S 0.0 8.0 3:09.57 mysqld<br />30593 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:04.68 mysqld<br />30594 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:00.91 mysqld<br />30597 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:04.62 mysqld<br />30598 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:07.12 mysqld<br />30599 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:08.67 mysqld<br />30600 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:41.00 mysqld<br />30601 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:11.73 mysqld<br />30602 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:12.69 mysqld<br />30603 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:14.40 mysqld<br />30604 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:00.00 mysqld<br />30605 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:01.74 mysqld<br />30606 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:00.97 
mysqld<br />30607 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:00.00 mysqld<br />30608 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:00.00 mysqld<br />30609 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:00.00 mysqld<br /> 7195 mysql 20 0 746308 307432 8872 S 0.0 8.0 1:34.78 mysqld<br />10106 mysql 20 0 746308 307432 8872 R 0.0 8.0 0:11.63 mysqld<br />10108 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:11.58 mysqld<br />10854 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:06.62 mysqld<br />10855 mysql 20 0 746308 307432 8872 R 0.0 8.0 0:06.69 mysqld<br />10856 mysql 20 0 746308 307432 8872 R 0.0 8.0 0:06.64 mysqld<br />10858 mysql 20 0 746308 307432 8872 R 0.0 8.0 0:06.64 mysqld<br />10859 mysql 20 0 746308 307432 8872 S 0.0 8.0 0:06.70 mysqld<br />10975 mysql 20 0 746308 307432 8872 R 0.0 8.0 0:01.94 mysqld<br />10976 mysql 20 0 746308 307432 8872 R 0.0 8.0 0:01.96 mysqld<br />10979 mysql 20 0 746308 307432 8872 R 0.0 8.0 0:01.91 mysqld<br />10980 mysql 20 0 746308 307432 8872 D 0.0 8.0 0:01.94 mysqld<br />openxs@ao756:~$</span></span><br /><p></p></blockquote><p>In the command above I asked to run <b>top</b> in the batch mode and capture the output once, sorted by highest CPU usage first. Configuring columns in the output (from a relatively small subset of what is visible via <b>/proc</b>) is possible in the interactive mode, and the final set can be saved in the <b>~/.toprc </b>file. 
Usually I do not care to do that.</p><p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0EtAfmLcRn_av0B6kbwlPPdN3om91Z0NYhdNGhd1V7HIrvcqtP8V8-E2jbbpwiCCWbIZnIvyPDsbAXYWQU-ktwZfnthfNHuvt1AW1C5-PPlomNFA-CJqYmWwuRoz7VQfW6gOoKZLElB4w/s2048/London+266.JPG" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1365" data-original-width="2048" height="266" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi0EtAfmLcRn_av0B6kbwlPPdN3om91Z0NYhdNGhd1V7HIrvcqtP8V8-E2jbbpwiCCWbIZnIvyPDsbAXYWQU-ktwZfnthfNHuvt1AW1C5-PPlomNFA-CJqYmWwuRoz7VQfW6gOoKZLElB4w/w400-h266/London+266.JPG" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">Multi-threaded <span>mooring of boats somewhere at Regent's Canal<br /></span></td></tr></tbody></table><br /></p><p style="text-align: center;">* * * <br /></p><p>So, for quick checks and summarizing different details about MySQL server threads from <b>/proc</b> you can use not only some poor man's monitoring shell scripts, but also the <b>ps -L</b> and <b>top -H</b> commands with proper options, and then process the output with the usual Linux command-line utilities like <b>grep</b>, <b>sort</b> and <b>uniq</b>. One can surely write more advanced tools working directly with <b>/proc</b> content, in C, Python etc., if not just shell, and in the next post in this series we'll see some of them from <b><a href="https://0x.tools/" target="_blank">https://0x.tools/</a></b> applied to MySQL.
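As a minimal sketch of that "working directly with /proc content" approach, the loop below counts the threads of a process per state without running ps or top at all. The PID used ($$, i.e. the current shell) is just a placeholder so the sketch runs as-is; substitute the mysqld PID in practice.

```shell
# Count threads of a process per scheduler state (R, S, D, ...) from /proc.
# $$ (the current shell) is a stand-in for a real mysqld PID.
pid=$$
for f in /proc/"$pid"/task/*/stat; do
  # The state letter is the first field after the ")" that closes the comm
  # field, so strip everything up to ") " first (comm itself may contain
  # spaces; a comm containing ")" would still confuse this simple parsing).
  sed 's/^[^)]*) //; s/ .*//' "$f"
done | sort | uniq -c | sort -rn
```

The state letters printed are the same R/S/D codes that ps and top show in their state columns.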
</p><p>Also, the <b>performance_schema.threads</b> table in MySQL 5.7+ (and MariaDB 10.5+) makes it easy to find out the role of each OS thread in the MySQL server and to link the OS-level and internal metrics for it.<br /></p>Valerii Kravchukhttp://www.blogger.com/profile/13158916419325454260noreply@blogger.com0tag:blogger.com,1999:blog-3080615211468083537.post-92133309363361237052021-01-06T21:07:00.001+02:002021-01-06T21:07:49.270+02:00Linux /proc Filesystem for MySQL DBAs - Part I, Basics<p>Happy New Year 2021, dear readers of my blog!</p><table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto;"><tbody><tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8-mWaQRbQDhP_vvE95GeKwylMHwbA2HYAwagFiX04jRnTD3WYiWqJ-IKLhm2HOayRVKodIGZVcECgdgENTO3RBAelXXElpE_Ty10Q8-RUXwZbEVb1QzqSkhAjunRaQ0ONjWehS9QB5ncf/s2048/023.JPG" style="margin-left: auto; margin-right: auto;"><img border="0" data-original-height="1536" data-original-width="2048" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg8-mWaQRbQDhP_vvE95GeKwylMHwbA2HYAwagFiX04jRnTD3WYiWqJ-IKLhm2HOayRVKodIGZVcECgdgENTO3RBAelXXElpE_Ty10Q8-RUXwZbEVb1QzqSkhAjunRaQ0ONjWehS9QB5ncf/w400-h300/023.JPG" width="400" /></a></td></tr><tr><td class="tr-caption" style="text-align: center;">We used to have real winters at this time of the year. Not any more...<br /></td></tr></tbody></table><p>Among other good things that happened on December 31, 2020, I was informed that my talk "<a href="https://fosdem.org/2021/schedule/event/linux_porc_mysql/" target="_blank"><b>Linux /proc filesystem for MySQL DBAs</b></a>" was accepted for the FOSDEM 2021 MySQL devroom.
So, it's time to get back to blogging, which I abandoned for a while in favor of this <a href="https://www.youtube.com/channel/UCzBDIplzdOrSKqrR3hAZ7uw" target="_blank">YouTube channel</a>, and share some details to refer to on my slides.</p><p>This is not my first talk or blog post in the "Something for MySQL DBAs" series. There were some in the past, like these:</p><ul style="text-align: left;"><li>"<a href="https://www.slideshare.net/valeriikravchuk1/fosdem2015-gdb-tips-and-tricks-for-my-sql-db-as" target="_blank">gdb tips and tricks for MySQL DBAs</a>"</li><li>"<a href="http://mysqlentomologist.blogspot.com/2017/12/using-strace-for-mysql-troubleshooting.html" target="_blank">Using strace for MySQL Troubleshooting</a>"</li><li>"<a href="http://mysqlentomologist.blogspot.com/2017/11/how-lsof-utility-may-help-mysql-dbas.html" target="_blank">How lsof Utility May Help MySQL DBAs</a>"</li><li>"<a href="http://mysqlentomologist.blogspot.com/2018/03/windows-tools-for-mysql-dbas-basic.html" target="_blank">Windows Tools for MySQL DBAs: Basic Minidump Analysis</a>"</li><li>... and so on<br /></li></ul><p>My colleague once said that I would eventually end up with something like a "<b>/dev/null for MySQL DBAs</b>" kind of talk. I am not there yet, but I still think that the more OS-level tools DBAs know, the better they can do their job.</p><p>The idea of talking about the <b>/proc</b> filesystem was inspired by the fact that I use some files there on a regular basis while doing my job, and by this <a href="https://minervadb.xyz/minervadb-athena-2020-profiling-linux-operations-for-performance-and-troubleshooting-by-tanel-poder/" target="_blank">great talk</a> by <a href="https://tanelpoder.com/" target="_blank"><b>Tanel Poder</b></a> and his <a href="https://0x.tools/" target="_blank"><b>0x.tools</b></a>, a small set of open-source utilities for analyzing application performance on Linux, based mostly on <b>/proc</b> sampling.
</p><p>I am planning to cover most of the upcoming talk in a mini-series of three blog posts. In this post I am going to present some basic information about the <a href="https://man7.org/linux/man-pages/man5/procfs.5.html" target="_blank"><b>/proc</b> filesystem</a> and some files and directories there that are surely useful for MySQL DBAs, as well as (who could imagine) some related MySQL bug reports. In another one I'll discuss how threads in the process are represented in the <b>/proc</b> filesystem (as well as in <b>ps</b> and <b>top</b> outputs). Finally, in the third post I am going to show some use cases for the <a href="https://0x.tools/" target="_blank"><b>0x.tools</b></a> and summarize the benefits and use cases for <b>/proc</b> sampling approaches.<b></b></p><p style="text-align: center;"><b>* * *<br /></b></p><p>Basically, the <b>proc</b> filesystem is a pseudo-filesystem which provides an interface to kernel data structures. It is commonly mounted at <b>/proc</b>: </p><blockquote><p><span style="font-size: x-small;"><span style="font-family: courier;">openxs@ao756:~$ <b>cat /etc/issue</b><br />Ubuntu 16.04.7 LTS \n \l<br /><br />openxs@ao756:~$ <b>mount | grep '/proc'</b><br /><b>proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)</b><br />...</span></span><br /></p></blockquote><p>Most of it is read-only, but some files allow changing kernel variables. For the purpose of this discussion I skip the mount options and most of the <b>/proc/*</b> files and proceed to the <b>/proc</b> subdirectories, one per running process, each named after its PID. I have the following <b>mysqld</b> process (of Percona Server 5.7.x) running:</p><p></p><blockquote><p><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~$ <b>ps aux | grep mysqld</b><br />...<br />mysql
<b>30580</b> 0.7 8.1 746308 313984 ? Sl Jan02 9:55
/usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mysqld.pid</span></span><br /></p><p></p></blockquote><p></p><p>So there will be the <b>/proc/30580</b> directory with the following content:</p><blockquote><p><span style="font-size: x-small;"><span style="font-family: courier;">openxs@ao756:~$ <b>ls -F /proc/30580</b><br />ls: cannot read symbolic link '/proc/30580/cwd': Permission denied<br />ls: cannot read symbolic link '/proc/30580/root': Permission denied<br />ls: cannot read symbolic link '/proc/30580/exe': Permission denied<br />attr/ cpuset <b>limits</b> net/ projid_map <b>stat</b><br />autogroup cwd@ loginuid ns/ root@ <b>statm</b><br />auxv <b>environ</b> map_files/ <b>numa_maps</b> <b>sched</b> <b>status</b><br />cgroup exe@ <b>maps</b> <b>oom_adj</b> schedstat <b>syscall</b><br />clear_refs <b>fd/</b> mem <b>oom_score</b> sessionid <b>task/</b><br /><b>cmdline</b> <b>fdinfo/</b> mountinfo <b>oom_score_adj</b> setgroups timers<br /><b>comm</b> gid_map mounts pagemap <b>smaps</b> uid_map<br /><b>coredump_filter</b> <b>io</b> mountstats personality <b>stack</b> <b>wchan</b></span></span><br /></p></blockquote><p>I highlighted the files and directories I consider most useful. Note also "Permission denied" messages above that you may get while accessing some files in <b>/proc</b>, even related to the processes you own. You may still need <b>root</b>/<b>sudo</b> access (or belong to some dedicated group) to read them.</p><p>Now let me give short descriptions or hints about the content of the highlighted files and directories (see <a href="https://man7.org/linux/man-pages/man5/procfs.5.html" target="_blank"><b>man 5 proc</b></a> to get more details):</p><ul style="text-align: left;"><li><b>task/[<i>tid</i>]</b> - task subdirectory contains subdirectories of the form <b>task/[<i>tid</i>]</b>, which contain corresponding information about each of the threads in the process, where <i><b>tid</b></i> is the kernel thread ID of the thread. 
In my case:<br /><span style="font-family: courier;"><span style="font-size: x-small;"><br />openxs@ao756:~$ <b>ls -F /proc/30580/task/</b><br />2488/ 30580/ 30584/ 30588/ 30592/ 30598/ 30602/ 30606/ 31972/ 3622/<br />2493/ 30581/ 30585/ 30589/ 30593/ 30599/ 30603/ 30607/ 3618/<br />2800/ 30582/ 30586/ 30590/ 30594/ 30600/ 30604/ 30608/ 3620/<br />2805/ 30583/ 30587/ 30591/ 30597/ 30601/ 30605/ 30609/ 3621/<br /></span></span><br />Each of them in turn has files and subdirectories similar to those <b>/proc/<i>pid</i></b> ones that I discuss below:<br /><span style="font-family: courier;"><span style="font-size: x-small;"><br />openxs@ao756:~$ <b>sudo ls -F /proc/30580/task/30592<br /></b>attr/ cpuset io net/ personality smaps wchan<br />auxv cwd@ limits ns/ projid_map stack<br />cgroup environ loginuid numa_maps root@ stat<br />children exe@ maps oom_adj sched statm<br />clear_refs fd/ mem oom_score schedstat status<br />cmdline fdinfo/ mountinfo oom_score_adj sessionid syscall<br />comm gid_map mounts pagemap setgroups uid_map</span></span><b><br /><br /></b></li><li><b>cmdline</b> - this read-only file contains the command line (<b>argv</b>) that the process wants you to see, as strings terminated by null bytes ('\0'). This is how you can check the content to see each string separately:<br /><span style="font-family: courier;"><span style="font-size: x-small;"><br />openxs@ao756:~$ <b>strings /proc/30580/cmdline</b><br />/usr/sbin/mysqld<br />--daemonize<br />--pid-file=/var/run/mysqld/mysqld.pid<br /><br /></span></span></li><li><b>comm</b> - command name (up to 16 characters including the terminating null byte, longer values truncated) associated with the process. In my case:<br /><span style="font-size: x-small;"><span style="font-family: courier;"><br />openxs@ao756:~$ <b>cat /proc/30580/comm<br /></b>mysqld<br /><br /></span></span>Individual threads can set different <b>comm</b> values. 
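For servers that do set per-thread names, a sketch like the following lists each thread's tid together with its comm value ($$, the current shell, is used instead of a real mysqld PID only to keep the example self-contained and runnable as-is):

```shell
# Print tid and comm (thread name) for every thread of a process.
# $$ is a placeholder; substitute the mysqld PID to inspect a real server.
pid=$$
for t in /proc/"$pid"/task/*; do
  printf '%s\t%s\n' "${t##*/}" "$(cat "$t"/comm)"
done
```

On a server that never calls the thread-naming API all lines will show the same name (e.g. mysqld), which is exactly the situation the feature request below is about.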
There is a useful feature request for MySQL threads to be named according to their role. It is <a href="https://bugs.mysql.com/bug.php?id=70858" target="_blank">Bug #70858</a> - "<b>Set thread name</b>" by <b><a href="https://bugs.mysql.com/search.php?cmd=display&status=All&severity=all&reporter=9242646">Daniël van Eeden</a></b>.<br /><br /></li><li><b>coredump_filter</b> - this file can be used to control which memory segments are written to the core dump file in the event that a core dump is performed for the process with the corresponding PID. The value in the file is a bit mask of memory mapping types:<pre><span style="font-size: x-small;"><span style="font-family: courier;">bit 0 Dump anonymous private mappings.
bit 1 Dump anonymous shared mappings.
bit 2 Dump file-backed private mappings.
bit 3 Dump file-backed shared mappings.
bit 4 Dump ELF headers.
bit 5 Dump private huge pages.<br />...</span></span><br /></pre>By default bits 0, 1, 4 and 5 are set, so we see hex value 33 in the file:<br /><br /><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~$ cat /proc/30580/coredump_filter<br />00000033<br /></span></span><br />We can control core dump size to some extent by writing to this file. Other, way more important files in <b>/proc</b> that are related to core dumps are presented in <a href="https://man7.org/linux/man-pages/man5/core.5.html" target="_blank"><b>man 5 core</b></a> and in <a href="https://fromdual.com/hunting-the-core" target="_blank">other sources</a>. Read them to decode, for example, this output that is typical for <b>systemd</b>-based systems (and do not look for the <b>core.*</b> files related to the <b>mysqld</b> process in the <b>datadir</b> desperately after that):<br /><span style="font-size: x-small;"><span style="font-family: courier;"><br />openxs@ao756:~$ <b>cat /proc/sys/kernel/core_pattern</b><br />|/usr/share/apport/apport %p %s %c %d %P %E</span></span><br /><br /></li><li><b>environ</b> - this file contains the initial environment that was set when the currently executing program was started. The entries are separated by null bytes, so you can check them like this:<br /><span style="font-family: courier;"><span style="font-size: x-small;"><br />openxs@ao756:~$ <b>sudo strings /proc/30580/environ</b><br />LANG=en_US.UTF-8<br />...<br />LC_TIME=uk_UA.UTF-8<br />PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin<br />HOME=/nonexistent<br />LOGNAME=mysql<br />USER=mysql<br />SHELL=/bin/false<br /><b>STARTTIMEOUT=120<br />STOPTIMEOUT=600<br />LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1<br /></b></span></span><br />Note the last 3 variables above. 
They show <b>systemd</b> timeouts for the unit and <b>jemalloc</b> preloaded.<br /><br /></li><li><b>fd/</b> - this subdirectory contains one entry for each file which the process has open. The name is its file descriptor and it is a symbolic link to the actual file:<br /><span style="font-size: x-small;"><span style="font-family: courier;"><br />openxs@ao756:~$ <b>sudo ls -l /proc/30580/fd/</b><br />total 0<br />lr-x------ 1 mysql mysql 64 Jan 5 19:09 0 -> /dev/null<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 <b>1 -> socket:[822912]</b><br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 10 -> /var/lib/mysql/ib_logfile1<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 11 -> /var/lib/mysql/ibdata1<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 12 -> /var/lib/mysql/xb_doublewrite<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 13 -> /tmp/ibGEsYuO (deleted)<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 14 -> /var/lib/mysql/ibtmp1<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 15 -> /var/lib/mysql/sbtest/sbtest3.ibd<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 16 -> /var/lib/mysql/mysql/plugin.ibd<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 17 -> /var/lib/mysql/mysql/gtid_executed.ibd<br />l-wx------ 1 mysql mysql 64 Jan 5 19:09 18 -> /var/lib/mysql/ao756-bin.000089<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 19 -> socket:[822935]<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 <b>2 -> socket:[822912]</b><br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 20 -> socket:[822936]<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 21 -> /var/lib/mysql/mysql/server_cost.ibd<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 22 -> /var/lib/mysql/mysql/engine_cost.ibd<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 23 -> /var/lib/mysql/mysql/db.MYI<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 24 -> /var/lib/mysql/mysql/db.MYD<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 25 -> /var/lib/mysql/mysql/user.MYI<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 26 -> /var/lib/mysql/mysql/user.MYD<br />lrwx------ 1 mysql mysql 
64 Jan 5 19:09 27 -> /var/lib/mysql/mysql/event.MYI<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 28 -> /var/lib/mysql/mysql/event.MYD<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 29 -> /var/lib/mysql/mysql/time_zone_leap_second.ibd<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 3 -> /var/lib/mysql/ao756-bin.index<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 30 -> /var/lib/mysql/mysql/time_zone_name.ibd<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 31 -> /var/lib/mysql/mysql/time_zone.ibd<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 32 -> /var/lib/mysql/mysql/time_zone_transition_type.ibd<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 33 -> /var/lib/mysql/mysql/time_zone_transition.ibd<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 34 -> /var/lib/mysql/mysql/innodb_table_stats.ibd<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 35 -> /var/lib/mysql/sbtest/sbtest2.ibd<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 36 -> /var/lib/mysql/sbtest/sbtest4.ibd<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 39 -> /var/lib/mysql/sbtest/sbtest1.ibd<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 4 -> /var/lib/mysql/sbtest/sbtest5.ibd<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 40 -> /var/lib/mysql/mysql/servers.ibd<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 41 -> /var/lib/mysql/mysql/slave_master_info.ibd<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 42 -> /var/lib/mysql/mysql/slave_relay_log_info.ibd<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 43 -> /var/lib/mysql/mysql/slave_worker_info.ibd<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 45 -> /var/lib/mysql/mysql/innodb_index_stats.ibd<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 5 -> /var/lib/mysql/ib_logfile0<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 6 -> /tmp/ibmKNTQP (deleted)<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 7 -> /tmp/ib2GbIR1 (deleted)<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 8 -> /tmp/ibME5wSd (deleted)<br />lrwx------ 1 mysql mysql 64 Jan 5 19:09 9 -> /tmp/ibyiPjXB (deleted)<br /></span></span><br />In the above 
we see that <b>stderr</b> (descriptor 2) points to <b>socket:[822912]</b>. We can get some more information about this inode in <b>/proc/net</b> and in other sources:<br /><span style="font-family: courier;"><span style="font-size: x-small;"><br />openxs@ao756:~$ <b>strings /proc/net/unix | grep 822912</b><br />0000000000000000: 00000003 00000000 00000000 0001 03 822912<br />openxs@ao756:~$ <b>sudo netstat -n --program | grep 822912</b><br />unix 3 [ ] STREAM CONNECTED 822912 30580/mysqld<br /></span></span><br />The information in <b>fd/</b> is similar to what you may get from <b>lsof</b>, but it is "always there" while <b>lsof</b> must be installed separately<span style="font-family: courier;"><span style="font-size: x-small;"><span style="font-family: inherit;">.<br /><br /></span></span></span></li><li><b>fdinfo/</b> - files in this subdirectory provide more information about the corresponding file descriptor. Let's check <b>3 -> /var/lib/mysql/ao756-bin.index</b>:<br /><span style="font-family: courier;"><span style="font-size: x-small;"><br />openxs@ao756:~$ <b>sudo cat /proc/30580/fdinfo/3</b><br />pos: 1691<br />flags: 0100002<br />mnt_id: 24<br /></span><span style="font-size: x-small;">openxs@ao756:~$ <b>sudo ls -l /var/lib/mysql/ao756-bin.index</b><br />-rw-r----- 1 mysql mysql 1691 Jan 2 17:35 /var/lib/mysql/ao756-bin.index<br /></span></span><br />and then some InnoDB file, for example, <b>5 -> /var/lib/mysql/ib_logfile0</b>:<br /><br /><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~$ <b>sudo cat /proc/30580/fdinfo/5</b><br />pos: 0<br />flags: 0100002<br />mnt_id: 24<br />lock: 1: POSIX ADVISORY WRITE 30580 fc:01:11143169 0 EOF<br />openxs@ao756:~$ <b>cat /proc/30580/mountinfo | grep ^'24 '</b><br />24 0 252:1 / / rw,relatime shared:1 - ext4 /dev/mapper/ubuntu--vg-root rw,errors=remount-ro,data=ordered<br /></span></span><br />As we can see, the details include the file offset (<b>pos</b>), the octal number that
displays the file access mode and file status flags (<b>flags</b>), the id of the mountpoint for the file, which we were able to find in <b>/proc/$PID/mountinfo</b> (the <b>/</b> file system in this case), and some other details depending on the file type. For the redo log we see a <b>lock</b> record with the details about the write lock set on this file by the <b>mysqld</b> process. You can see it among all the file locks in <b>/proc/locks</b> too:<br /><span style="font-family: courier;"><span style="font-size: x-small;"><br />openxs@ao756:~$ <b>cat /proc/locks | grep fc:01:11143169</b><br />30: POSIX ADVISORY WRITE 30580 fc:01:11143169 0 EOF<br /></span></span><br />See <a href="https://gavv.github.io/articles/file-locks/" target="_blank">"<b>File locking in Linux</b>"</a> by <b>Victor Gaydov</b>
for many more details.<br /><br /></li><li><b>io</b> - this file contains I/O statistics for the process, for example:<br /><span style="font-family: courier;"><span style="font-size: x-small;"><br />openxs@ao756:~$ sudo cat /proc/30580/io<br />rchar: 198026951<br />wchar: 2365236669<br />syscr: 8014<br />syscw: 92558<br />read_bytes: 49184768<br />write_bytes: 3394945024<br />cancelled_write_bytes: 171876352<br /><br /></span></span>Here we can see the number of characters read and written (maybe just to the pagecache, without physical I/O), the number of read and write system calls, the number of bytes read from and written to the storage and, in <b>cancelled_write_bytes</b>, the number of bytes this process caused not to be written, by truncating the pagecache (when deleting a file it created, for example).<br /><br />It's important to note that the same information is properly tracked for the individual threads. For example, this is the page cleaner thread of our Percona Server (more on how I know that while <b>comm</b> is not set will be in the next post):<br /><br /><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~$ <b>sudo cat /proc/30580/task/30592/io</b><br />rchar: 0<br />wchar: 912518825<br />syscr: 0<br />syscw: 2092<br />read_bytes: 0<br />write_bytes: 1813786624<br />cancelled_write_bytes: 0<br /><br /></span></span>and we can see that it did a notable share of all writes.<br /><br /></li><li><b>limits</b> - this file provides the details about the actual resource limits for the process. This is personally my most often used file in the <b>/proc</b> filesystem, at least while working on support issues with customers.<br /><span style="font-family: courier;"><span style="font-size: x-small;"><br />openxs@ao756:~$ <b>cat /proc/30580/limits</b><br />Limit Soft Limit Hard Limit Units<br />Max cpu time unlimited unlimited seconds<br />Max file size unlimited unlimited bytes<br />Max data size unlimited unlimited bytes<br />Max
stack size 8388608 unlimited bytes<br />Max core file size 0 unlimited bytes<br />Max resident set unlimited unlimited bytes<br /><b>Max processes 14893 14893 processes<br />Max open files 1024 4096 files<br /></b>Max locked memory 65536 65536 bytes<br />Max address space unlimited unlimited bytes<br />Max file locks unlimited unlimited locks<br />Max pending signals 14893 14893 signals<br />Max msgqueue size 819200 819200 bytes<br />Max nice priority 0 0<br />Max realtime priority 0 0<br />Max realtime timeout unlimited unlimited us<br /></span></span><br />I highlighted the rows most important in the context of MySQL server - the limit on the number of processes (and thus threads and connections, thread pool aside) and the limit on the number of open files. Users often assume values very different from those actually used, for various reasons... Note that on modern kernels one does NOT need <b>root</b>/<b>sudo</b> to check the limits of a process that belongs to a different user.<br /><br /></li><li><b>maps</b> - this file contains currently mapped memory regions and their access permissions. It is also very important to check in the case of MySQL server.
Consider the following example:<br /><span style="font-family: courier;"><span style="font-size: x-small;"><br />openxs@ao756:~$ <b>sudo cat /proc/30580/maps | more</b><br /><b>00400000-019d6000 r-xp 00000000 fc:01 12066751 /usr/sbin/mysqld</b><br />01bd6000-01cb7000 r--p 015d6000 fc:01 12066751 /usr/sbin/mysqld<br />01cb7000-01d67000 rw-p 016b7000 fc:01 12066751 /usr/sbin/mysqld<br />01d67000-01e48000 rw-p 00000000 00:00 0<br />7fe19d400000-7fe1a2000000 rw-p 00000000 00:00 0<br />...<br /><b>7fe1caa17000-7fe1caa4a000 r-xp 00000000 fc:01 12064860 /usr/lib/x86_64-linux-gnu/libjemalloc.so.1</b><br /><b>7fe1caa4a000-7fe1cac4a000 ---p 00033000 fc:01 12064860 /usr/lib/x86_64-linux-gnu/libjemalloc.so.1</b><br />7fe1cac4a000-7fe1cac4c000 r--p 00033000 fc:01 12064860 /usr/lib/x86_64-linux-gnu/libjemalloc.so.1<br />7fe1cac4c000-7fe1cac4d000 rw-p 00035000 fc:01 12064860 /usr/lib/x86_64-linux-gnu/libjemalloc.so.1<br />...<br />7fe1cae33000-7fe1cae38000 rw-s 00000000 00:0d 822928 /[aio]<br />(deleted)<br />7fe1cae38000-7fe1cae42000 rw-p 00000000 00:00 0<br />7fe1cae42000-7fe1cae44000 rw-s 00000000 00:0d 822929 /[aio]<br />(deleted)<br />...<br />7fe1cae73000-7fe1cae74000 r--p 00025000 fc:01 1311186 /lib/x86_64-linux-gnu/ld-2.23.so<br />7fe1cae74000-7fe1cae75000 rw-p 00026000 fc:01 1311186 /lib/x86_64-linux-gnu/ld-2.23.so<br />7fe1cae75000-7fe1cae76000 rw-p 00000000 00:00 0<br />7ffe2f559000-7ffe2f57d000 rw-p 00000000 00:00 0 [stack]<br />7ffe2f5c6000-7ffe2f5c8000 r--p 00000000 00:00 0 [vvar]<br />7ffe2f5c8000-7ffe2f5ca000 r-xp 00000000 00:00 0 [vdso]<br />ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]<br /></span></span><br />The format of the output lines is simple: the range of <i>addresses</i> in the address space of the process, <i>permissions</i>, <i>offset</i>, <i>device</i> (major:minor), <i>inode</i> and <i>pathname</i>. Permissions <b>r</b>/<b>w</b>/<b>x</b> have the usual meanings, while <b>s</b> means shared and <b>p</b> means private.
Offset represents the offset in the related file, if any. Device shows where the inode is located, and inode corresponds to the given pathname (inode 0 means the region is not backed by a file). The <i>pathname</i> field will usually be the file that is backing the mapping. This is the huge memory-mapped file for the <a href="https://galeracluster.com/library/kb/gcache-during-state-transfers.html" target="_blank"><i>Galera cache</i></a>, for example: <br /><br /><span style="font-family: courier;"><span style="font-size: x-small;"><b>7ed567fff000-7fd568000000 rw-s 00000000 fd:0f 2147676724 /home/mariadb-data/galera.cache</b><br /></span></span><br />See also map_files/* for memory-mapped files for the process:<br /><span style="font-family: courier;"><span style="font-size: x-small;"><br />openxs@ao756:~$ <b>sudo ls -l /proc/30580/map_files/ | more</b><br />total 0<br />lr-------- 1 mysql mysql 64 Jan 6 16:42 1bd6000-1cb7000 -> /usr/sbin/mysqld<br />lr-------- 1 mysql mysql 64 Jan 6 16:42 1cb7000-1d67000 -> /usr/sbin/mysqld<br /><b>lr-------- 1 mysql mysql 64 Jan 6 16:42 400000-19d6000 -> /usr/sbin/mysqld</b><br />lr-------- 1 mysql mysql 64 Jan 6 16:42 7fe1c01ed000-7fe1c01f8000 -> /lib/x86_64-linux-gnu/libnss_files-2.23.so<br />lr-------- 1 mysql mysql 64 Jan 6 16:42 7fe1c01f8000-7fe1c03f7000 -> /lib/x86_64-linux-gnu/libnss_files-2.23.so<br />...</span></span><br /><br /></li><li><b>numa_maps</b> - this file displays information about a process's NUMA memory policy and allocation. Each line contains information about a memory range used by the process, displaying - among other information - the effective memory policy for that memory range and on which nodes the pages have been allocated.
Consider the following example (single node only, sorry):<br /><br /><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~$ <b>sudo cat /proc/30580/numa_maps | more</b><br /><b>00400000 default file=/usr/sbin/mysqld mapped=1551 active=1061 N0=1551 kernelpagesize_kB=4</b><br />01bd6000 default file=/usr/sbin/mysqld anon=37 dirty=1 mapped=99 swapcache=36 active=0 N0=99 kernelpagesize_kB=4<br />01cb7000 default file=/usr/sbin/mysqld anon=46 dirty=45 mapped=71 swapcache=1 active=0 N0=71 kernelpagesize_kB=4<br />01d67000 default anon=115 dirty=93 swapcache=22 active=2 N0=115 kernelpagesize_kB=4<br />...<br />7fe1caa17000 default file=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 mapped=14 mapmax=2 N0=14 kernelpagesize_kB=4<br />7fe1caa4a000 default file=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1<br />7fe1cac4a000 default file=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 anon=2 dirty=1 swapcache=1 active=0 N0=2 kernelpagesize_kB=4<br />7fe1cac4c000 default file=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 anon=1 dirty=1 active=0 N0=1 kernelpagesize_kB=4<br />...<br />7fe1cae71000 default file=/[aio]\040(deleted)<br />7fe1cae72000 default anon=1 swapcache=1 active=0 N0=1 kernelpagesize_kB=4<br />7fe1cae73000 default file=/lib/x86_64-linux-gnu/ld-2.23.so anon=1 swapcache=1 active=0 N0=1 kernelpagesize_kB=4<br />7fe1cae74000 default file=/lib/x86_64-linux-gnu/ld-2.23.so anon=1 dirty=1 active=0 N0=1 kernelpagesize_kB=4<br />7fe1cae75000 default anon=1 dirty=1 active=0 N0=1 kernelpagesize_kB=4<br />7ffe2f559000 default stack anon=32 <b>dirty=30 swapcache=2</b> active=3 N0=32 kernelpagesize_kB=4<br />7ffe2f5c6000 default<br />7ffe2f5c8000 default<br /></span></span><br />In the above we see the starting address of the range, the policy (all <b>default</b> in my case), the NUMA node (<b>N0</b> in my case) for the allocation and the number of pages allocated (see <b>N0=1551</b>), dirty pages (<b>dirty=30</b>), and the number of pages that have an associated
entry on a swap device (<b>swapcache=2</b>) etc. See also <a href="https://man7.org/linux/man-pages/man7/numa.7.html" target="_blank"><b>man 7 numa</b></a> for many more details.<br /><br /></li><li><b>oom_score</b> - this file displays the current score that the kernel gives to this process for the purpose of selecting a process for the OOM-killer. A higher score means that the process is more likely to be selected by the OOM-killer. The range is basically 0-1000, where 1000 means the process uses all the memory. A process with the value 0 in this file will never be killed. My server has a low enough value:<br /><span style="font-family: courier;"><span style="font-size: x-small;"><br />openxs@ao756:~$ <b>sudo cat /proc/30580/oom_score</b><br />40<br /></span></span><br /></li><li><b>oom_score_adj</b> - this file can be used to adjust the OOM score used to select which process gets killed in out-of-memory conditions. Each candidate task is assigned a value ranging from 0 (never kill) to 1000 (always kill) to determine which process is targeted. The value of <b>oom_score_adj</b> is added to the OOM score before it is used to determine which task to kill. Acceptable values range from -1000 (<b>OOM_SCORE_ADJ_MIN</b>) to +1000 (<b>OOM_SCORE_ADJ_MAX</b>).
This allows user space to control the preference for OOM-killing, ranging from always preferring a certain task to completely excluding it from OOM killing.<br /><br />In practice you just write a large negative value into the file:<br /><span style="font-family: courier;"><span style="font-size: x-small;"><br />openxs@ao756:~$ <b>sudo cat /proc/30580/oom_score</b><br />40<br />openxs@ao756:~$ <b>sudo cat /proc/30580/oom_score_adj</b><br />0<br />openxs@ao756:~$ <b>sudo echo -100 > /proc/30580/oom_score_adj</b><br />-bash: /proc/30580/oom_score_adj: Permission denied<br />openxs@ao756:~$ <b>sudo su -</b><br />root@ao756:~# <b>echo -100 > /proc/30580/oom_score_adj</b><br />root@ao756:~# <b>exit</b><br />logout<br />openxs@ao756:~$ <b>sudo cat /proc/30580/oom_score_adj</b><br />-100<br />openxs@ao756:~$ <b>sudo cat /proc/30580/oom_score</b><br />0</span></span><br /><br />In my case I had to become <b>root</b> to be able to adjust the score: the <b>sudo echo</b> attempt failed because the output redirection is performed by my non-privileged shell, not by the command run under <b>sudo</b>. Read this nice blog post, "<b><a href="https://www.psce.com/en/blog/mysql-oom-killer-and-everything-related/">MySQL, OOM Killer, and everything related</a></b>", for more details.<br /><br /></li><li><b>sched</b> - scheduler-related statistics for the process.
Not that it is much documented (see some hints <a href="https://lwn.net/Articles/242900/" target="_blank">here</a> and <a href="http://oliveryang.net/2015/07/linux-scheduler-profiling-2/" target="_blank">there</a>), but the output is more or less clear:<br /><br /><span style="font-size: x-small;"><span style="font-family: courier;">openxs@ao756:~$ <b>sudo cat /proc/30580/sched</b><br />mysqld (30580, #threads: 37)<br />-------------------------------------------------------------------<br />se.exec_start : 259734254.219078<br />se.vruntime : 102780.674970<br />se.sum_exec_runtime : 354.092211<br />se.statistics.sum_sleep_runtime : 38046805.925355<br />se.statistics.wait_start : 0.000000<br />se.statistics.sleep_start : 259734254.219078<br />se.statistics.block_start : 0.000000<br />se.statistics.sleep_max : 11587559.290106<br />se.statistics.block_max : 2466.143746<br />se.statistics.exec_max : 4.005223<br />se.statistics.slice_max : 2.790630<br />se.statistics.wait_max : 22.567269<br /><b>se.statistics.wait_sum : 85.791532<br />se.statistics.wait_count : 748<br />se.statistics.iowait_sum : 5117.788260<br />se.statistics.iowait_count : 524</b><br />se.nr_migrations : 50<br />se.statistics.nr_migrations_cold : 0<br />se.statistics.nr_failed_migrations_affine : 0<br />se.statistics.nr_failed_migrations_running : 10<br />se.statistics.nr_failed_migrations_hot : 5<br />se.statistics.nr_forced_migrations : 0<br />se.statistics.nr_wakeups : 658<br />se.statistics.nr_wakeups_sync : 104<br />se.statistics.nr_wakeups_migrate : 42<br />se.statistics.nr_wakeups_local : 468<br />se.statistics.nr_wakeups_remote : 190<br />se.statistics.nr_wakeups_affine : 22<br />se.statistics.nr_wakeups_affine_attempts : 55<br />se.statistics.nr_wakeups_passive : 0<br />se.statistics.nr_wakeups_idle : 0<br />avg_atom : 0.477857<br />avg_per_cpu : 7.081844<br />nr_switches : 741<br />nr_voluntary_switches : 659<br />nr_involuntary_switches : 82<br />se.load.weight : 1024<br />se.avg.load_sum : 
6000742<br />se.avg.util_sum : 22528<br />se.avg.load_avg : 125<br />se.avg.util_avg : 0<br />se.avg.last_update_time : 259734254219078<br />policy : 0<br />prio : 120<br />clock-delta : 60<br />mm->numa_scan_seq : 0<br />numa_pages_migrated : 0<br />numa_preferred_nid : -1<br />total_numa_faults : 0<br />current_node=0, numa_group_id=0<br />numa_faults node=0 task_private=0 task_shared=0 group_private=0 group_shared=0<br /></span></span><br />and you can reset most counters by writing 0 to the file:<br /><br /><span style="font-size: x-small;"><span style="font-family: courier;">openxs@ao756:~$ sudo su -<br />root@ao756:~# <b>echo 0 > /proc/30580/sched</b><br />root@ao756:~# <b>exit</b><br />logout<br />openxs@ao756:~$ <b>sudo cat /proc/30580/sched</b><br />mysqld (30580, #threads: 37)<br />-------------------------------------------------------------------<br />se.exec_start : 259734254.219078<br />se.vruntime : 102780.674970<br />se.sum_exec_runtime : 354.092211<br />se.statistics.sum_sleep_runtime : 0.000000<br />se.statistics.wait_start : 0.000000<br />se.statistics.sleep_start : 0.000000<br />se.statistics.block_start : 0.000000<br />se.statistics.sleep_max : 0.000000<br />se.statistics.block_max : 0.000000<br />se.statistics.exec_max : 0.000000<br />se.statistics.slice_max : 0.000000<br />se.statistics.wait_max : 0.000000<br />se.statistics.wait_sum : 0.000000<br />se.statistics.wait_count : 0<br />se.statistics.iowait_sum : 0.000000<br />se.statistics.iowait_count : 0<br />se.nr_migrations : 50<br />se.statistics.nr_migrations_cold : 0<br />se.statistics.nr_failed_migrations_affine : 0<br />se.statistics.nr_failed_migrations_running : 0<br />se.statistics.nr_failed_migrations_hot : 0<br />se.statistics.nr_forced_migrations : 0<br />se.statistics.nr_wakeups : 0<br />se.statistics.nr_wakeups_sync : 0<br />se.statistics.nr_wakeups_migrate : 0<br />se.statistics.nr_wakeups_local : 0<br />se.statistics.nr_wakeups_remote : 0<br />se.statistics.nr_wakeups_affine : 
0<br />se.statistics.nr_wakeups_affine_attempts : 0<br />se.statistics.nr_wakeups_passive : 0<br />se.statistics.nr_wakeups_idle : 0<br />avg_atom : 0.477857<br />avg_per_cpu : 7.081844<br />nr_switches : 741<br />nr_voluntary_switches : 659<br />nr_involuntary_switches : 82<br />se.load.weight : 1024<br />se.avg.load_sum : 6000742<br />se.avg.util_sum : 22528<br />se.avg.load_avg : 125<br />se.avg.util_avg : 0<br />se.avg.last_update_time : 259734254219078<br />policy : 0<br />prio : 120<br />clock-delta : 77<br />mm->numa_scan_seq : 0<br />numa_pages_migrated : 0<br />numa_preferred_nid : -1<br />total_numa_faults : 0<br />current_node=0, numa_group_id=0<br />numa_faults node=0 task_private=0 task_shared=0 group_private=0 group_shared=0<br /></span></span><br />then run some load and check the impact. The values are displayed in milliseconds; they’re tracked in nanoseconds, and scaled by one million.<br /><br /></li><li><b>smaps</b> - this file shows memory consumption for each of the process's mappings (see <b>maps</b> above). For each mapping there is a series of lines:<br /><span style="font-family: courier;"><span style="font-size: x-small;"><br />openxs@ao756:~$ <b>sudo cat /proc/30580/smaps | more</b><br /><b>00400000-019d6000 r-xp 00000000 fc:01 12066751 /usr/sbin/mysqld</b><br />Size: 22360 kB<br />Rss: 6204 kB<br />Pss: 6204 kB<br />Shared_Clean: 0 kB<br />Shared_Dirty: 0 kB<br />Private_Clean: 6204 kB<br />Private_Dirty: 0 kB<br />Referenced: 5704 kB<br />Anonymous: 0 kB<br />AnonHugePages: 0 kB<br />Shared_Hugetlb: 0 kB<br />Private_Hugetlb: 0 kB<br />Swap: 0 kB<br />SwapPss: 0 kB<br />KernelPageSize: 4 kB<br />MMUPageSize: 4 kB<br />Locked: 0 kB<br /><b>VmFlags: rd ex mr mw me dw sd</b><br />01bd6000-01cb7000 r--p 015d6000 fc:01 12066751 /usr/sb<br />in/mysqld<br />Size: 900 kB<br />...<br /></span></span><br />These lines show the size of the mapping, the amount
of the mapping that is currently resident in RAM (<b>"Rss"</b>),
the process's proportional share of this mapping (<b>"Pss"</b>),
the number of clean and dirty shared pages in the mapping,
and the number of clean and dirty private pages in the
mapping. <b>"Referenced"</b> indicates the amount of memory
currently marked as referenced or accessed. <b>"Anonymous"</b>
shows the amount of memory that does not belong to any
file. <b>"Swap"</b> shows how much would-be-anonymous memory is
also used, but out on swap. "<b>VmFlags</b>" represents the kernel flags associated with the virtual memory area as a set of two-character flags, and so on.<br /><br /></li><li><b>stack</b> - this file provides a symbolic trace of the function calls in this process's kernel stack. Here is what I have for the process and one of its threads (actually the MySQL server's main thread and the page cleaner thread, as we'll see in the next post in this series):<br /><span style="font-family: courier;"><span style="font-size: x-small;"><br />openxs@ao756:~$ <b>sudo cat /proc/30580/stack</b><br />[<ffffffff81233b94>] <b>poll_schedule_timeout</b>+0x44/0x70<br />[<ffffffff81235223>] do_sys_poll+0x4b3/0x570<br />[<ffffffff81235329>] do_restart_poll+0x49/0x80<br />[<ffffffff81097155>] sys_restart_syscall+0x25/0x30<br />[<ffffffff81869c5b>] entry_SYSCALL_64_fastpath+0x22/0xd0<br />[<ffffffffffffffff>] 0xffffffffffffffff<br />openxs@ao756:~$ <b>sudo cat /proc/30580/task/30592/stack</b><br />[<ffffffff811097c0>] <b>futex_wait_queue_me</b>+0xc0/0x120<br />[<ffffffff8110a4f6>] futex_wait+0x116/0x270<br />[<ffffffff8110ca60>] do_futex+0x120/0x5a0<br />[<ffffffff8110cf61>] SyS_futex+0x81/0x180<br />[<ffffffff81869c5b>] entry_SYSCALL_64_fastpath+0x22/0xd0<br />[<ffffffffffffffff>] 0xffffffffffffffff<br /></span></span><br />For other threads we'll see more interesting stacks when they are waiting for disk I/O etc. Stay tuned! Note that if you check the <b>wchan</b> file for the process:<br /><span style="font-family: courier;"><span style="font-size: x-small;"><br />openxs@ao756:~$ <b>sudo strings /proc/30580/wchan</b><br />poll_schedule_timeout</span></span><br /><br /> we'll see the top line of the output above (hex values aside) as a null-terminated string. This is a "<i>wait channel</i>", the symbolic name corresponding to the location in the kernel where the process is sleeping.<br /><br /></li><li><b>stat</b> - status information about the process, used by <b>ps</b>. 
This file is a single set of fields separated by spaces, so it is easy to parse with a script or to load into a database. <b><a href="https://bugs.mysql.com/search.php?cmd=display&status=All&severity=all&reporter=4602820">Domas Mituzas</a></b> once reported a bug, <a href="https://bugs.mysql.com/bug.php?id=72027" target="_blank">Bug #72027</a> - "<b>LOAD_FILE() does not work on dynamic files</b>" (fixed only in MySQL 8.0.15+), based on failed attempts to apply <b>LOAD_FILE()</b> to the <b>stat</b> file in <b>/proc</b>.<br /><br />The output starts with the <b>PID</b> and <b>comm</b> fields, followed by the process <b>state</b> and other details. For example:<br /><span style="font-family: courier;"><span style="font-size: x-small;"><br />openxs@ao756:~$ <b>cat /proc/30580/stat</b><br /><b>30580</b> (mysqld) <b>S</b> 1 30579 30579 0 -1 4194368 1007759 0 360 0 68598 13573 0 0 20 0 <b>37</b> 0 <b>218060678 764219392</b> 76109 18446744073709551615 1 1 0 0 0 0 540679 12288 1768 0 0 0 17 1 0 0 513 0 0 0 0 0 0 0 0 0 0<br /></span></span><br />The value of <b>state</b> is a single character:<br /><ul><li><b>R</b> Running</li><li><b>S</b> Sleeping in an interruptible wait</li><li><b>D</b> Waiting in uninterruptible disk sleep</li><li><b>Z</b> Zombie</li><li><b>T</b> Stopped (on a signal)</li><li><b>t</b> Tracing stop</li><li>... some other values were also used in the past<br /></li></ul>Then we see identifiers of the parent process and process group, followed by the session id and controlling terminal (<b>-1</b>), and so on. Field 20 is the number of threads in the process, and fields 23 and 24 are the virtual memory size (in bytes) and the resident set size (in pages). Just read the manual if you plan to interpret these lines somewhere. 
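Since it is a single space-separated line, <b>stat</b> is trivial to slice with <b>awk</b>. Here is a minimal sketch (field numbers per <b>man 5 proc</b>; note that naive whitespace splitting is only safe while the <b>comm</b> field contains no spaces, which holds for <b>mysqld</b>):

```shell
#!/bin/sh
# Sketch: print a few /proc/PID/stat fields by position.
# Per man 5 proc: $3 = state, $20 = number of threads,
# $23 = vsize in bytes, $24 = rss in pages. Plain whitespace
# splitting assumes comm ($2) contains no spaces.
pid=${1:-self}    # with no argument, report on the awk process itself
awk '{printf "pid=%s state=%s threads=%s vsize=%s rss_pages=%s\n",
             $1, $3, $20, $23, $24}' "/proc/$pid/stat"
```

Saved as, say, <b>statfields.sh</b> (a name I just made up) and run against the server above, it would report state <b>S</b>, 37 threads, and the same vsize and rss values highlighted in bold in the raw output.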
Note that <b>sudo</b>/<b>root</b> is not needed to see these details about another user's process.<br /><br /></li><li><b>statm</b> - this file provides information about memory usage, measured in (4K) pages:<br /><span style="font-family: courier;"><span style="font-size: x-small;"><br />openxs@ao756:~$ <b>cat /proc/30580/statm</b><br /><b>186577</b> 76109 2515 5590 0 169800 0<br /></span></span><br />Again we see just space-separated fields, easy to parse. The fields are total size, resident, shared, text (code) size, the next one (former lib) is always 0, then data + stack size and dirty pages. Some of these values may be inaccurate because of internal kernel optimizations.<br /><br /></li><li><b>status</b> - this file provides much of the information from <b>stat</b> and <b>statm</b> in a format that's easier for humans to read. For example:<br /><span style="font-family: courier;"><span style="font-size: x-small;">openxs@ao756:~$ <b>cat /proc/30580/status</b><br />Name: mysqld<br />State: S (sleeping)<br />Tgid: 30580<br />Ngid: 0<br />Pid: 30580<br />PPid: 1<br />TracerPid: 0<br />Uid: 116 116 116 116<br />Gid: 125 125 125 125<br />FDSize: 128<br />Groups: 125<br />NStgid: 30580<br />NSpid: 30580<br />NSpgid: 30579<br />NSsid: 30579<br />VmPeak: 746308 kB<br /><b>VmSize: 746308 kB</b><br />VmLck: 0 kB<br />VmPin: 0 kB<br />VmHWM: 328248 kB<br /><b>VmRSS: 304436 kB</b><br />VmData: 679056 kB<br />VmStk: 144 kB<br />VmExe: 22360 kB<br />VmLib: 7784 kB<br />VmPTE: 1036 kB<br />VmPMD: 16 kB<br />VmSwap: 13256 kB<br />HugetlbPages: 0 kB<br /><b>Threads: 37</b><br />SigQ: 0/14893<br />SigPnd: 0000000000000000<br />ShdPnd: 0000000000000000<br />SigBlk: 0000000000084007<br />SigIgn: 0000000000003000<br />SigCgt: 00000001800006e8<br />CapInh: 0000000000000000<br />CapPrm: 0000000000000000<br />CapEff: 0000000000000000<br />CapBnd: 0000003fffffffff<br />CapAmb: 0000000000000000<br />Seccomp: 0<br />Speculation_Store_Bypass: thread vulnerable<br 
/>Cpus_allowed: ff<br />Cpus_allowed_list: 0-7<br />Mems_allowed: 00000000,00000001<br />Mems_allowed_list: 0<br />voluntary_ctxt_switches: 687<br />nonvoluntary_ctxt_switches: 82<br /></span></span><br />The values should be clear to the reader; if not, just check <a href="https://man7.org/linux/man-pages/man5/procfs.5.html" target="_blank"><b>man 5 proc</b></a>.<br /><br /></li><li><b>syscall</b> - this file exposes the system call number and argument registers for the system call currently being executed by the process, followed by the values of the stack pointer and program counter registers. The values of all six argument registers are exposed, although most system calls use fewer registers. If the process is blocked, but not in a system call, then the system call number is <b>-1</b>, followed by just the values of the stack pointer and program counter. If the process is not blocked, then the file contains just the string "<b>running</b>". For example, here is my ad hoc poor man's thread monitoring one-liner applied to the <b>mysqld</b> process executing some <b>sysbench</b> load:<br /><span style="font-family: courier;"><span style="font-size: x-small;"><br />openxs@ao756:~$ <b>for dir in `ls /proc/30580/task`; do echo -n $dir': '; 2>/dev/null sudo cat /proc/$dir/syscall; done | more</b><br />2389: 202 0x1db141c 0x80 0x834e8 0x0 0x1db1400 0x41a70 0x7fe1cace9fb0 0x7fe1ca807360<br />2390: 202 0x1db141c 0x80 0x834e5 0x0 0x1db1400 0x41a70 0x7fe1c00e4fb0 0x7fe1ca807360<br />2391: 202 0x1db141c 0x80 0x83505 0x0 0x1db1400 0x41a7d 0x7fe1c00a3fb0 0x7fe1ca807360<br />2392: 202 0x1db141c 0x80 0x834fe 0x0 0x1db1400 0x41a7d 0x7fe1b6a39fb0 0x7fe1ca807360<br />2393: 202 0x1db141c 0x80 0x83504 0x0 0x1db1400 0x41a7d 0x7fe1b6a7afb0 0x7fe1ca807360<br />2394: 202 0x1db141c 0x80 0x83501 0x0 0x1db1400 0x41a7d 0x7fe1c0125fb0 0x7fe1ca807360<br /><b>2395: running</b><br />2488: 202 0x1db141c 0x80 0x8351e 0x0 0x1db1400 0x41a8a 0x7fe1cadedfb0 0x7fe1ca807360<br />2493: 202 
0x1db141c 0x80 0x83518 0x0 0x1db1400 0x41a8a 0x7fe1cad2afb0 0x7fe1ca807360<br />2800: 75 0x12 0x1 0x0 0x0 0x0 0x1db0870 0x7fe1c01e8f50 0x7fe1c876a8dd<br /><b>30580: 7 0x7fe1c13b9e98 0x2 0xffffffff 0x7fe1c84539d0 0x0 0x7fe1c8453700 0x7ffe2<br />f57a590 0x7fe1c876880d<br /></b>30581: 128 0x7fe1c0bfed70 0x7fe1c0bfedf0 0x0 0x8 0x0 0x7ffe2f57a5d0 0x7fe1c0bfec<br />c0 0x7fe1c86a3b36<br />30582: 208 0x7fe1cae58000 0x1 0x100 0x7fe1c7c76000 0x7fe1b3bfe750 0x7fe1b3bff9c0<br />...</span></span><br /><br />Here you can see a <b>running</b> thread and various system calls for other threads. If only I could map the numbers to something useful... Smart people can, as you'll find out in the next posts.<br /></li></ul><p>Let me stop checking <b>/proc</b> at this stage, otherwise I'll end up with yet another MySQL-illustrated manual page for <b>/proc</b> the size of a large book chapter.<br /></p><p>In the next post I am going to describe MySQL threads monitoring with <b>/proc</b>, <b>ps</b> and <b>top</b> sampling, based on the information presented here.<br /></p>Valerii Kravchukhttp://www.blogger.com/profile/13158916419325454260noreply@blogger.com0