Sunday, September 20, 2020

Dynamic Tracing of pthread_mutex_lock in MariaDB - First Steps

 I've got several comments and questions during my session on dynamic tracing at MariaDB Server Fest 2020. One of them was from a MariaDB developer at Slack and sounded as follows:

"Can perf profile contention on pthread_mutex_lock?"

I surely stated that it can (it can profile anything), but tried to clarify what exactly is needed whenever we hit the probe. The replies where the following:

"ideally, collect stack traces of mariadbd, and sort them in descending order by time spent between pthread_mutex_lock and next pthread_mutex_unlock."

and:

"less ideally, just collect stacktraces and sort in descending order by frequency"

I am not sure about the former (I am yet to try to trace more than one probe at a time with perf and find out how to aggregate the result properly). It is surely doable (in a more efficient way) with bcc tools if one writes proper program, and probably even with bpftrace. I am going to show how to do this in the next posts one day... But the question was about perf.

Fortunately I can easily show how to do the latter. For the primitive test I've started MariaDB 10.5 built from GitHub source on my Ubuntu 16.04 netbook with the following options:

openxs@ao756:~/dbs/maria10.5$ bin/mysqld_safe --no-defaults --port=3311 --socket=/tmp/mariadb105.sock --performance_schema=1 --performance-schema-instrument='memory/%=ON' --thread-handling='pool-of-threads' &

 and used the following sysbench test as a load that I expected to cause some mutex waits:

openxs@ao756:~/dbs/maria10.5$ sysbench --table-size=1000000 --threads=34 --time=300 --report-interval=5 --mysql-socket=/tmp/mariadb105.sock --mysql-user=openxs --mysql-db=sbtest /usr/share/sysbench/oltp_read_only.lua run

First step is to find out what exact pthread library is used:

openxs@ao756:~/dbs/maria10.5$ ldd bin/mariadbd
        linux-vdso.so.1 =>  (0x00007ffc03437000)
        libcrypt.so.1 => /lib/x86_64-linux-gnu/libcrypt.so.1 (0x00007f585d0c6000)
        liblz4.so.1 => /usr/lib/x86_64-linux-gnu/liblz4.so.1 (0x00007f585ceae000)
        liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x00007f585cc8c000)
        libbz2.so.1.0 => /lib/x86_64-linux-gnu/libbz2.so.1.0 (0x00007f585ca7c000)
        libsnappy.so.1 => /usr/lib/x86_64-linux-gnu/libsnappy.so.1 (0x00007f585c874000)
        libaio.so.1 => /lib/x86_64-linux-gnu/libaio.so.1 (0x00007f585c672000)
        libnuma.so.1 => /usr/lib/x86_64-linux-gnu/libnuma.so.1 (0x00007f585c467000)
        libssl.so.1.0.0 => /lib/x86_64-linux-gnu/libssl.so.1.0.0 (0x00007f585c1ff000)
        libcrypto.so.1.0.0 => /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 (0x00007f585bdba000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f585bb9d000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f585b999000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f585b617000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f585b30e000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f585b0f8000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f585ad2e000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f585f4e6000)

Adding the probe after that is trivial:

openxs@ao756:~/dbs/maria10.5$ sudo perf probe -x  /lib/x86_64-linux-gnu/libpthread.so.0 pthread_mutex_lock
Added new event:
  probe_libpthread:pthread_mutex_lock (on pthread_mutex_lock in /lib/x86_64-linux-gnu/libpthread-2.23.so)

You can now use it in all perf tools, such as:

        perf record -e probe_libpthread:pthread_mutex_lock -aR sleep 1

I had not checked yet what are the arguments and not tried to record mutex address or anything besides the fact of the call to this function. Then I started sysbench and while it worked:

openxs@ao756:~/dbs/maria10.5$ sysbench --table-size=1000000 --threads=34 --time=300 --report-interval=5 --mysql-socket=/tmp/mariadb105.sock --mysql-user=openxs --mysql-db=sbtest /usr/share/sysbench/oltp_read_only.lua run
sysbench 1.1.0-faaff4f (using bundled LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 34
Report intermediate results every 5 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 5s ] thds: 34 tps: 564.25 qps: 9079.14 (r/w/o: 7944.45/0.00/1134.69) lat (ms,95%): 99.33 err/s: 0.00 reconn/s: 0.00
[ 10s ] thds: 34 tps: 570.68 qps: 9136.92 (r/w/o: 7995.56/0.00/1141.37) lat (ms,95%): 104.84 err/s: 0.00 reconn/s: 0.00
[ 15s ] thds: 34 tps: 585.57 qps: 9345.87 (r/w/o: 8175.33/0.00/1170.53) lat (ms,95%): 80.03 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 34 tps: 588.03 qps: 9409.23 (r/w/o: 8232.98/0.00/1176.25) lat (ms,95%): 97.55 err/s: 0.00 reconn/s: 0.00
[ 25s ] thds: 34 tps: 587.39 qps: 9399.31 (r/w/o: 8224.92/0.00/1174.39) lat (ms,95%): 97.55 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 34 tps: 584.80 qps: 9360.32 (r/w/o: 8190.33/0.00/1169.99) lat (ms,95%): 77.19 err/s: 0.00 reconn/s: 0.00
[ 35s ] thds: 34 tps: 573.02 qps: 9173.87 (r/w/o: 8028.24/0.00/1145.63) lat (ms,95%): 84.47 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 34 tps: 572.60 qps: 9160.00 (r/w/o: 8014.00/0.00/1146.00) lat (ms,95%): 84.47 err/s: 0.00 reconn/s: 0.00
[ 45s ] thds: 34 tps: 529.54 qps: 8468.10 (r/w/o: 7409.02/0.00/1059.09) lat (ms,95%): 101.13 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 34 tps: 393.04 qps: 6276.27 (r/w/o: 5491.38/0.00/784.88) lat (ms,95%): 121.08 err/s: 0.00 reconn/s: 0.00
[ 55s ] thds: 34 tps: 382.60 qps: 6134.80 (r/w/o: 5369.20/0.00/765.60) lat (ms,95%): 110.66 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 34 tps: 386.36 qps: 6210.71 (r/w/o: 5436.79/0.00/773.91) lat (ms,95%): 134.90 err/s: 0.00 reconn/s: 0.00
[ 65s ] thds: 34 tps: 397.01 qps: 6320.60 (r/w/o: 5528.37/0.00/792.23) lat (ms,95%): 153.02 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 34 tps: 419.01 qps: 6699.60 (r/w/o: 5861.98/0.00/837.63) lat (ms,95%): 248.83 err/s: 0.00 reconn/s: 0.00
...

I've tried to run perf record for the mariadbd process as follows, with -g option:

openxs@ao756:~/dbs/maria10.5$ sudo perf record -e probe_libpthread:pthread_mutex_lock -g -p`pidof mariadbd` sleep 30
^C[ perf record: Woken up 735 times to write data ]
[ perf record: Captured and wrote 183.516 MB perf.data (1150509 samples) ]

 I had noty set the frequency and with the default one we see notable drop of QPS from sysbench and huge amount of data collected in less than 30 seconds (highlighted). So we can not speak about really small impact with such an approach. In production case I'd have to play with sampling frequency and time to run for sure.

Raw results can be checked with perf script:

openxs@ao756:~/dbs/maria10.5$ sudo perf script | more
mariadbd  9621 [000] 30901.858515: probe_libpthread:pthread_mutex_lock: (7f28dd88fd40)
                    9d40 pthread_mutex_lock (/lib/x86_64-linux-gnu/libpthread-2.23.so)
                  6a9233 close_thread_tables (/home/openxs/dbs/maria10.5/bin/mariadbd)
                  70b87c mysql_execute_command (/home/openxs/dbs/maria10.5/bin/mariadbd)
                  72a0ae Prepared_statement::execute (/home/openxs/dbs/maria10.5/bin/mariadbd)
                  72a546 Prepared_statement::execute_loop (/home/openxs/dbs/maria10.5/bin/mariadbd)
                  72af58 mysql_stmt_execute_common (/home/openxs/dbs/maria10.5/bin/mariadbd)
                  72b0b5 mysqld_stmt_execute (/home/openxs/dbs/maria10.5/bin/mariadbd)
                  7090d8 dispatch_command (/home/openxs/dbs/maria10.5/bin/mariadbd)
                  70777c do_command (/home/openxs/dbs/maria10.5/bin/mariadbd)
                  887c40 tp_callback (/home/openxs/dbs/maria10.5/bin/mariadbd)
                  a76650 worker_main (/home/openxs/dbs/maria10.5/bin/mariadbd)
                  b31911 pfs_spawn_thread (/home/openxs/dbs/maria10.5/bin/mariadbd)
                    76ba start_thread (/lib/x86_64-linux-gnu/libpthread-2.23.so)
--More--

Now we can process the results collected. The easiest way is to use perf report:

openxs@ao756:~/dbs/maria10.5$ sudo perf report -g >/tmp/perf_mutex.txt

In the resulting file we can see the following:

openxs@ao756:~/dbs/maria10.5$ cat /tmp/perf_mutex.txt | head -100
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 1M of event 'probe_libpthread:pthread_mutex_lock'
# Event count (approx.): 1150509
#
# Children      Self  Command   Shared Object        Symbol                     
# ........  ........  ........  ...................  ......................................................
#
   100.00%   100.00%  mariadbd  libpthread-2.23.so   [.] pthread_mutex_lock     
            |
            |--99.98%-- start_thread
            |          |
            |          |--99.97%-- pfs_spawn_thread
            |          |          |
            |          |          |--99.95%-- worker_main
            |          |          |          |
            |          |          |          |--86.17%-- tp_callback
            |          |          |          |          do_command
            |          |          |          |          dispatch_command
            |          |          |          |          |
            |          |          |          |          |--46.56%-- mysqld_stmt_execute
            |          |          |          |          |          mysql_stmt_execute_common
            |          |          |          |          |          Prepared_statement::execute_loop
            |          |          |          |          |          Prepared_statement::execute
            |          |          |          |          |          |
            |          |          |          |          |          |--27.70%-- mysql_execute_command
            |          |          |          |          |          |          |
            |          |          |          |          |          |          |--17.60%-- close_thread_tables
            |          |          |          |          |          |          |          |
            |          |          |          |          |          |          |          |--8.80%-- close_thread_table
            |          |          |          |          |          |          |          |          pthread_mutex_lock
            |          |          |          |          |          |          |          |
            |          |          |          |          |          |          |           --8.80%-- pthread_mutex_lock
            |          |          |          |          |          |          |
            |          |          |          |          |          |          |--9.47%-- execute_sqlcom_select
            |          |          |          |          |          |          |          |
            |          |          |          |          |          |          |          |--9.46%-- open_and_lock_tables
            |          |          |          |          |          |          |          |          open_tables
            |          |          |          |          |          |          |          |          open_table
            |          |          |          |          |          |          |          |          |
            |          |          |          |          |          |          |          |          |--8.80%-- pthread_mutex_lock
            |          |          |          |          |          |          |          |          |
            |          |          |          |          |          |          |          |          |--0.63%-- open_table_get_mdl_lock
            |          |          |          |          |          |          |          |          |          MDL_context::acquire_lock
            |          |          |          |          |          |          |          |          |          MDL_context::try_acquire_lock_impl
            |          |          |          |          |          |          |          |          |          MDL_map::find_or_insert
            |          |          |          |          |          |          |          |          |          pthread_mutex_lock
            |          |          |          |          |          |          |          |          |
            |          |          |          |          |          |          |          |           --0.03%-- tdc_acquire_share
            |          |          |          |          |          |          |          |                     pthread_mutex_lock
            |          |          |          |          |          |          |          |
            |          |          |          |          |          |          |           --0.01%-- handle_select
            |          |          |          |          |          |          |                     mysql_select
            |          |          |          |          |          |          |                     JOIN::optimize
            |          |          |          |          |          |          |                     JOIN::optimize_inner
            |          |          |          |          |          |          |                     |
            |          |          |          |          |          |          |                     |--0.01%-- join_read_const_table
            |          |          |          |          |          |          |                     |          join_read_const
            |          |          |          |          |          |          |                     |          handler::ha_index_read_idx_map
            |          |          |          |          |          |          |                     |          handler::index_read_idx_map
            |          |          |          |          |          |          |                     |          ha_innobase::index_read
            |          |          |          |          |          |          |                     |          row_search_mvcc
            |          |          |          |          |          |          |                     |          |
            |          |          |          |          |          |          |                     |          |--0.01%-- btr_cur_search_to_nth_level_func
            |          |          |          |          |          |          |                     |          |          buf_page_get_gen
            |          |          |          |          |          |          |                     |          |          buf_page_get_low
            |          |          |          |          |          |          |                     |          |          buf_read_page
            |          |          |          |          |          |          |                     |          |          buf_read_page_low
            |          |          |          |          |          |          |                     |          |          pthread_mutex_lock
            |          |          |          |          |          |          |                     |          |
            |          |          |          |          |          |          |                     |           --0.00%-- btr_pcur_move_to_next_page
            |          |          |          |          |          |          |                     |                     buf_page_get_gen
            |          |          |          |          |          |          |                     |                     buf_page_get_low
            |          |          |          |          |          |          |                     |                     buf_read_page
            |          |          |          |          |          |          |                     |                     buf_read_page_low
            |          |          |          |          |          |          |                     |                     pthread_mutex_lock
            |          |          |          |          |          |          |                     |
            |          |          |          |          |          |          |                      --0.01%-- SQL_SELECT::test_quick_select
            |          |          |          |          |          |          |                                get_key_scans_params
            |          |          |          |          |          |          |                                DsMrr_impl::dsmrr_info_const
            |          |          |          |          |          |          |                                handler::multi_range_read_info_const
            |          |          |          |          |          |          |                                ha_innobase::records_in_range
            |          |          |          |          |          |          |                                btr_estimate_n_rows_in_range_low
            |          |          |          |          |          |          |                                |
            |          |          |          |          |          |          |                                |--0.01%-- btr_cur_search_to_nth_level_func
            |          |          |          |          |          |          |                                |          buf_page_get_gen
            |          |          |          |          |          |          |                                |          buf_page_get_low
            |          |          |          |          |          |          |                                |          buf_read_page
            |          |          |          |          |          |          |                                |          buf_read_page_low
            |          |          |          |          |          |          |                                |          pthread_mutex_lock
            |          |          |          |          |          |          |                                |
            |          |          |          |          |          |          |                                 --0.00%-- buf_page_get_gen
            |          |          |          |          |          |          |                                           buf_page_get_low
            |          |          |          |          |          |          |                                           buf_read_page
            |          |          |          |          |          |          |                                           buf_read_page_low
            |          |          |          |          |          |          |                                           pthread_mutex_lock
            |          |          |          |          |          |          |
...

but the result is so large that you can not clearly see the full picture even with the smallest font set. One would need another way to visualize the data, for example, as a flame graph:

openxs@ao756:~/dbs/maria10.5$ sudo perf script | ~/git/FlameGraph/stackcollapse-perf.pl > /tmp/perf-folded.txt
openxs@ao756:~/dbs/maria10.5$ ~/git/FlameGraph/flamegraph.pl /tmp/perf-folded.txt >/tmp/mutex.svg

The resulting flame graph looks as follows:

Flame Graph for the pthread_mutex_lock calls in MariaDB server.

It makes it easier to see were most of waits originate from.

We can also check the /tmp/perf-folded.txt file created in the process:

openxs@ao756:~/dbs/maria10.5$ head -10 /tmp/perf-folded.txt                    
mariadbd;[libstdc++.so.6.0.21];tpool::thread_pool_generic::worker_main;pthread_mutex_lock 82
mariadbd;[libstdc++.so.6.0.21];tpool::thread_pool_generic::worker_main;tpool::task::execute;pthread_mutex_lock 41
mariadbd;[libstdc++.so.6.0.21];tpool::thread_pool_generic::worker_main;tpool::task::execute;tpool::thread_pool_generic::timer_generic::execute;pthread_mutex_lock 82
mariadbd;[libstdc++.so.6.0.21];tpool::thread_pool_generic::worker_main;tpool::task::execute;tpool::thread_pool_generic::timer_generic::execute;srv_error_monitor_task;pthread_mutex_lock 19
mariadbd;[libstdc++.so.6.0.21];tpool::thread_pool_generic::worker_main;tpool::task::execute;tpool::thread_pool_generic::timer_generic::execute;srv_master_callback;pthread_mutex_lock 18
mariadbd;start_thread;buf_flush_page_cleaner;buf_flush_lists;buf_flush_do_batch;pthread_mutex_lock 46
mariadbd;start_thread;buf_flush_page_cleaner;pthread_mutex_lock 64
mariadbd;start_thread;pfs_spawn_thread;ma_checkpoint_background;pthread_mutex_lock 1
mariadbd;start_thread;pfs_spawn_thread;ma_checkpoint_background;translog_get_horizon;pthread_mutex_lock 1
mariadbd;start_thread;pfs_spawn_thread;pthread_mutex_lock 112
openxs@ao756:~/dbs/maria10.5$ cat /tmp/perf-folded.txt | sort -r -n -k 1,1 | more
mariadbd;start_thread;pfs_spawn_thread;worker_main;tp_callback;do_command;dispat
ch_command;THD::reset_killed;pthread_mutex_lock 231467
mariadbd;start_thread;pfs_spawn_thread;worker_main;tp_callback;do_command;dispat
ch_command;pthread_mutex_lock 115734
mariadbd;start_thread;pfs_spawn_thread;worker_main;tp_callback;do_command;dispat
ch_command;mysqld_stmt_execute;mysql_stmt_execute_common;Prepared_statement::exe
cute_loop;Prepared_statement::execute;THD::set_statement;pthread_mutex_lock 2169
99
mariadbd;start_thread;pfs_spawn_thread;worker_main;tp_callback;do_command;dispat
ch_command;mysqld_stmt_execute;mysql_stmt_execute_common;Prepared_statement::exe
cute_loop;Prepared_statement::execute;mysql_execute_command;MDL_context::release
_locks_stored_before;MDL_context::release_lock;MDL_lock::remove_ticket;pthread_m
utex_lock 7232
mariadbd;start_thread;pfs_spawn_thread;worker_main;tp_callback;do_command;dispat
ch_command;mysqld_stmt_execute;mysql_stmt_execute_common;Prepared_statement::exe
cute_loop;Prepared_statement::execute;mysql_execute_command;execute_sqlcom_selec
t;open_and_lock_tables;open_tables;open_table;tdc_acquire_share;pthread_mutex_lo
ck 301
mariadbd;start_thread;pfs_spawn_thread;worker_main;tp_callback;do_command;dispat
ch_command;mysqld_stmt_execute;mysql_stmt_execute_common;Prepared_statement::exe
cute_loop;Prepared_statement::execute;mysql_execute_command;execute_sqlcom_selec
t;open_and_lock_tables;open_tables;open_table;pthread_mutex_lock 101268

mariadbd;start_thread;pfs_spawn_thread;worker_main;tp_callback;do_command;dispat

to find out that it contains unique stack traces as the first field and number of such stack traces as the second. So, and answer the real question from a developer I have sort the result properly:

openxs@ao756:~/dbs/maria10.5$ cat /tmp/perf-folded.txt | sort -r -n -k 2 | head -5
mariadbd;start_thread;pfs_spawn_thread;worker_main;tp_callback;do_command;dispatch_command;THD::reset_killed;pthread_mutex_lock 231467
mariadbd;start_thread;pfs_spawn_thread;worker_main;tp_callback;do_command;dispatch_command;mysqld_stmt_execute;mysql_stmt_execute_common;Prepared_statement::execute_loop;Prepared_statement::execute;THD::set_statement;pthread_mutex_lock 216999
mariadbd;start_thread;pfs_spawn_thread;worker_main;pthread_mutex_lock 158530
mariadbd;start_thread;pfs_spawn_thread;worker_main;tp_callback;do_command;dispatch_command;pthread_mutex_lock 115734
mariadbd;start_thread;pfs_spawn_thread;worker_main;tp_callback;do_command;dispatch_command;delete_explain_query;pthread_mutex_lock 101269
openxs@ao756:~/dbs/maria10.5$

To summarize, perf (augmented with some post processing tools) does allow to collect stacktraces leading to some probe and sort in descending order by frequency.

Finally, this is my presentation that inspired the question:


You can watch the session recorded if you missed it..

1 comment:

  1. this is cool. and the flame graph rendering allows you to maybe easily navigate the captured stacks. for instance maybe the locks held the longest are necessary but some others can be acquired/released to reduce time held. flame graph gives a simpler way to navigate this possibility!

    ReplyDelete