Sunday, January 26, 2020

Using bpftrace on Fedora 29 - More Advanced Examples

It happened so that I did not have much time to devote to further eBPF and bpftrace-based tracing and profiling since October 2019, when I posted a simple example of how it can be used on Fedora 29 to "emulate" slow query log (to some extent, by recording queries and time it took to execute them) with dynamic probes. Recently I started preparing for my talk about ftrace, eBPF and bpftrace planned for FOSDEM MySQL, MariaDB and Friends Devroom, but rejected there and later accepted for "MariaDB Day Brussels 0202 2020" (that I suggest everyone to attend, at least if you do not have better plans for the second day of FOSDEM 2020).

So, I decided to check for any ways to use bpftrace as a quick profiler, maybe "embed" it into pt-pmp one day and, surely, get what MariaDB developers ask me most often while working on performance problems or hangs - stack traces. This blog post describes my "achievements" so far on this way.

I had the same version of bpftrace coming from Fedora RPM (that had not updated in the meantime):
[openxs@fc29 server]$ rpm -q -a | grep bpftrace
bpftrace-0.9-0.fc29.x86_64
So, after checking this fine reference guide quickly, I came up with a simple one-liner:
[openxs@fc29 ~]$ sudo bpftrace -e 'profile:hz:99 { @[ustack] = count(); }'
Error creating map: '@'
Error creating stack id map
This might have happened because kernel.perf_event_max_stackis smaller than 127. Try to tweak this value with sysctl kernel.perf_event_max_stack=<new value>
Error creating perf event map (-1)
Attaching 1 probe...
Segmentation fault
Ups! I tried to do what error message suggested:
[openxs@fc29 ~]$ sudo sysctl -w kernel.perf_event_max_stack=1000
kernel.perf_event_max_stack = 1000
But it had not helped. This is how I hit my first problem with bpftrace. Fedora had not cared to update it when it stopped matching "something" in a new kernel, for whatever reason. This is actually a real problem for eBPF programs - what they assume may not be "there" in a kernel, so they have to be checked carefully. There is a separate "BPF Type Format (BTF)" for encoding the debug info related to BPF program (or map) that may help to find out if the program is "compatible" without trying it. In my case incompatibility was obvious and I needed some fast way to proceed with bpftrace, in a hope to come up with some useful examples for my talk.

So, i decided to build new version of bpftrace from recent GitHub source (against bcc tools I had installed and supposedly up to date). This ended up with errors no matter what I tried:
[openxs@fc29 build]$ make
[  3%] Built target parser
[  4%] Built target arch
[  5%] Building CXX object src/ast/CMakeFiles/ast.dir/irbuilderbpf.cpp.o
/mnt/home/openxs/git/bpftrace/src/ast/irbuilderbpf.cpp: In member function 'void bpftrace::ast::IRBuilderBPF::CreateSignal(llvm::Value*)':
/mnt/home/openxs/git/bpftrace/src/ast/irbuilderbpf.cpp:669:16: error: 'BPF_FUNC_send_signal' was not declared in this scope
       getInt64(BPF_FUNC_send_signal),
                ^~~~~~~~~~~~~~~~~~~~
...
Quick search led me to the idea that it's actually a know and recently reported problem, see Issue #1014. As a solution I decided to rebuild bcc tools from GitHub source as well (the alternative to to change bpftrace source) and removed Fedora packages. To make a long story short, this is what I did eventually (from fc -l output):
1051     rm -rf bcc/
1052     rm -rf bpftrace/
1053     sudo dnf remove bcc
1054     git clone https://github.com/iovisor/bcc.git
1055     cd bcc/
1056     git submodule update --init --recursive
1057     sudo dnf install -y bison cmake ethtool flex git iperf libstdc++-static   python-netaddr python-pip gcc gcc-c++ make zlib-devel   elfutils-libelf-devel
1058     sudo dnf install -y luajit luajit-devel
1059     sudo dnf install -y   http://repo.iovisor.org/yum/extra/mageia/cauldron/x86_64/netperf-2.7.0-1.mga6.x86_64.rpm
1060     sudo pip install pyroute2
1061     sudo dnf install -y clang clang-devel llvm llvm-devel llvm-static ncurses-devel
1062     mkdir build
1063     cd build/
1064     cmake .. -DCMAKE_INSTALL_PREFIX=/usr
1065     make
1066     make test
1067     sudo make install
1068     cd ../..
1069     git clone https://github.com/iovisor/bpftrace.git
1070     cd bpftrace/
1071     git submodule update --init --recursive
1072     mkdir build
1073     cd build/
1074     cmake -DCMAKE_BUILD_TYPE=Release ../
1075     make
1076     ./tests/bpftrace_test
1077     fc -l
Basically I did it all as documented in related projects for my version of Fedora. Step highlighted (1056 above) is essential though, as it was missed here in any obvious way as a mandatory (while it is in recent versions). Step 1076 was successful:
...
[----------] Global test environment tear-down
[==========] 350 tests from 9 test cases ran. (2590 ms total)
[  PASSED  ] 350 tests.

[openxs@fc29 build]$ make testRunning tests...
Test project /mnt/home/openxs/git/bpftrace/build
    Start 1: bpftrace_test
1/3 Test #1: bpftrace_test ....................   Passed    2.42 sec
    Start 2: runtime_test
2/3 Test #2: runtime_test .....................***Failed   21.10 sec
    Start 3: tools-parsing-test
3/3 Test #3: tools-parsing-test ...............***Failed    5.88 sec

33% tests passed, 2 tests failed out of 3

Total Test time (real) =  29.41 sec

The following tests FAILED:
          2 - runtime_test (Failed)
          3 - tools-parsing-test (Failed)
Errors while running CTest
make: *** [Makefile:128: test] Error 8
while make test shown some failures (same as make test for bcc tools), but form previous experience this had not looked like a big deal. I continued basic testing that the resulting bpftrace now works (I tried to add kernel probe, guess what software runs on this workstation):
[openxs@fc29 build]$ sudo ./src/bpftrace -e 'kprobe:do_nanosleep { printf("sleep by %s\n", comm); }'
Attaching 1 probe...
sleep by percona-qan-age
sleep by percona-qan-age
sleep by skypeforlinux
sleep by mysqld
...
^C
So, with some confidence built up, I've installed my new and very recent bpftrace tool I built myself:
[openxs@fc29 build]$ sudo make install
[  3%] Built target parser
...
-- Installing: /usr/local/bin/bpftrace
...
-- Installing: /usr/local/man/man8/xfsdist.8.gz

[openxs@fc29 build]$ which bpftrace
/usr/local/bin/bpftrace

[openxs@fc29 build]$ cd ..
[openxs@fc29 bpftrace]$ git log -1
commit afafbf561522dd33fa316be3e33375bc662399ac (HEAD -> master, origin/master, origin/HEAD)
Author: bas smit <bas@baslab.org>
Date:   Fri Jan 17 12:33:43 2020 +0100

    Add `--info` flag

    This prints the supported kernel features
Quick test involved starting MariaDB 10.3 with thread pool enabled (more on "why thread pool?" in some next blog post):
[openxs@fc29 maria10.3]$ ./bin/mysqld_safe --no-defaults --thread_handling=pool-of-threads --thread_pool_idle_timeout=10 &
[1] 32643
and running sysbench:
[openxs@fc29 maria10.3]$ sysbench /usr/local/share/sysbench/oltp_update_index.lua --mysql-host=127.0.0.1 --mysql-user=root --mysql-port=3306 --threads=10 --tables=4 --table-size=1000000 --time=600 --report-interval=5 run
sysbench 1.1.0-174f3aa (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 10
Report intermediate results every 5 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 5s ] thds: 10 tps: 54.37 qps: 54.37 (r/w/o: 0.00/54.37/0.00) lat (ms,95%): 467.30 err/s: 0.00 reconn/s: 0.00
[ 10s ] thds: 10 tps: 40.60 qps: 40.60 (r/w/o: 0.00/40.60/0.00) lat (ms,95%): 530.08 err/s: 0.00 reconn/s: 0.00
[ 15s ] thds: 10 tps: 40.00 qps: 40.00 (r/w/o: 0.00/40.00/0.00) lat (ms,95%): 520.62 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 10 tps: 38.00 qps: 38.00 (r/w/o: 0.00/38.00/0.00) lat (ms,95%): 623.33 err/s: 0.00 reconn/s: 0.00
[ 25s ] thds: 10 tps: 34.00 qps: 34.00 (r/w/o: 0.00/34.00/0.00) lat (ms,95%): 746.32 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 10 tps: 36.40 qps: 36.40 (r/w/o: 0.00/36.40/0.00) lat (ms,95%): 549.52 err/s: 0.00 reconn/s: 0.00
[ 35s ] thds: 10 tps: 37.80 qps: 37.80 (r/w/o: 0.00/37.80/0.00) lat (ms,95%): 601.29 err/s: 0.00 reconn/s: 0.00
...
and adding the probe like this in another shell:
[openxs@fc29 ~]$ sudo bpftrace -e 'profile:hz:99 { @[ustack] = count(); }' > /tmp/bpf.stack
^C
When I stopped collecting with Ctrl-C, bpftrace printed the content of associative arrays collected (with unique user space stack 9for all processes!) as index and count of time this stack was as a value). No segmentation fault, unlike with the version from package! In the resulting file we can see (among other things):
[openxs@fc29 bpftrace]$ cat /tmp/bpf.stack | more
Attaching 1 probe...

...
@[
    rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned short*, bool, unsigned long, mem_block_info_t**)+31
    page_cur_search_with_match_bytes(buf_block_t const*, dict_index_t const*, dtuple_t const*, page_cur_mode_t, unsigned long*, unsigned long*, unsigned long*, unsigned long*, page_cur_t*)+1773
    btr_cur_search_to_nth_level_func(dict_index_t*, unsigned long, dtuple_t const*, page_cur_mode_t, unsigned long, btr_cur_t*, rw_lock_t*, char const*, unsigned int, mtr_t*, unsigned long)+8316
    btr_pcur_open_low(dict_index_t*, unsigned long, dtuple_t const*, page_cur_mode_t, unsigned long, btr_pcur_t*, char const*, unsigned int, unsigned long, mtr_t*) [clone .constprop.33]+146
    row_search_on_row_ref(btr_pcur_t*, unsigned long, dict_table_t const*, dtuple_t const*, mtr_t*)+85
    row_purge_poss_sec(purge_node_t*, dict_index_t*, dtuple_t const*, btr_pcur_t*, mtr_t*, bool)+503
    row_purge_remove_sec_if_poss_leaf(purge_node_t*, dict_index_t*, dtuple_t const*)+1293
    row_purge_record_func(purge_node_t*, unsigned char*, que_thr_t const*, bool)+2271
    row_purge_step(que_thr_t*)+1057
    que_run_threads(que_thr_t*)+2280
    srv_worker_thread+956
    start_thread+234
]: 1

...

@[
    rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned short*, bool, unsigned long, mem_block_info_t**)+31
    btr_search_build_page_hash_index(dict_index_t*, buf_block_t*, rw_lock_t*, unsigned long, unsigned long, unsigned long)+2781
    btr_search_info_update_slow(btr_search_t*, btr_cur_t*)+622
    btr_cur_search_to_nth_level_func(dict_index_t*, unsigned long, dtuple_t const*, page_cur_mode_t, unsigned long, btr_cur_t*, rw_lock_t*, char const*, unsigned int, mtr_t*, unsigned long)+10130
    row_search_mvcc(unsigned char*, page_cur_mode_t, row_prebuilt_t*, unsigned long, unsigned long)+4823
    ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)+338
    handler::ha_index_read_map(unsigned char*, unsigned char const*, unsigned long, ha_rkey_function)+280
    handler::read_range_first(st_key_range const*, st_key_range const*, bool, bool)+98
    handler::multi_range_read_next(void**)+183
    Mrr_simple_index_reader::get_next(void**)+74
    DsMrr_impl::dsmrr_next(void**)+66
    QUICK_RANGE_SELECT::get_next()+39
    rr_quick(READ_RECORD*)+25
    mysql_update(THD*, TABLE_LIST*, List<Item>&, List<Item>&, Item*, unsigned int, st_order*, unsigned long long, bool, unsigned long long*, unsigned long long*)+3560
    mysql_execute_command(THD*)+8446
    Prepared_statement::execute(String*, bool)+979
    Prepared_statement::execute_loop(String*, bool, unsigned char*, unsigned char*)+154
    mysql_stmt_execute_common(THD*, unsigned long, unsigned char*, unsigned char*, unsigned long, bool, bool)+349
    mysqld_stmt_execute(THD*, char*, unsigned int)+37
    dispatch_command(enum_server_command, THD*, char*, unsigned int, bool, bool)+5123
    do_command(THD*)+368
    tp_callback(TP_connection*)+314
    worker_main(void*)+160
    start_thread+234
]: 1

...

@[
    trx_purge(unsigned long, bool)+1054
    srv_purge_coordinator_thread+1872
    start_thread+234
]: 2
@[
    ut_crc32_sw(unsigned char const*, unsigned long)+1611
    buf_calc_page_crc32(unsigned char const*)+61
    buf_flush_init_for_writing(buf_block_t const*, unsigned char*, void*, unsigned long)+872
    buf_flush_write_block_low(buf_page_t*, buf_flush_t, bool)+149
    buf_flush_page(buf_pool_t*, buf_page_t*, buf_flush_t, bool)+607
    buf_flush_page_and_try_neighbors(buf_page_t*, buf_flush_t, unsigned long, unsigned long*)+1627
    buf_do_flush_list_batch(buf_pool_t*, unsigned long, unsigned long)+392
    pc_flush_slot()+2617
    buf_flush_page_cleaner_coordinator.cold.151+1235
    start_thread+234
]: 2

...
@[]: 6474
So, this is the default format to output stacks (one line per call, no address, frame number or anything, symbols properly resolved for my MariaDB built from GitHub source etc). But I had something like this mind eventually (from pt-pmp):
[openxs@fc29 bpftrace]$ pt-pmp
Tue Jan 21 13:52:12 EET 2020

...
     10 syscall(libc.so.6),__io_getevents_0_4(libaio.so.1),LinuxAIOHandler::collect(os0file.cc:1932),LinuxAIOHandler::poll(os0file.cc:2093),os_aio_linux_handler(stl_vector.h:805),os_aio_handler(stl_vector.h:805),fil_aio_wait(fil0fil.cc:4435),io_handler_thread(srv0start.cc:326),start_thread(libpthread.so.0),clone(libc.so.6)
      6 pthread_cond_wait,os_event::wait(sync0types.h:476),os_event::wait_low(sync0types.h:476),log_write_up_to(log0log.cc:929),trx_flush_log_if_needed_low(trx0trx.cc:1218),trx_flush_log_if_needed(trx0trx.cc:1218),trx_commit_complete_for_mysql(trx0trx.cc:1218),innobase_commit(ha_innodb.cc:4637),commit_one_phase_2(handler.cc:1642),ha_commit_one_phase(handler.cc:1622),ha_commit_trans(handler.cc:1484),trans_commit_stmt(transaction.cc:505),mysql_execute_command(sql_parse.cc:6106),Prepared_statement::execute(sql_prepare.cc:4807),Prepared_statement::execute_loop(sql_prepare.cc:4235),Prepared_statement::execute_loop(sql_prepare.cc:4235),mysql_stmt_execute_common(sql_prepare.cc:3235),mysqld_stmt_execute(sql_prepare.cc:3132),dispatch_command(sql_parse.cc:1798),do_command(sql_parse.cc:1401),threadpool_process_request(threadpool_common.cc:359),tp_callback(threadpool_common.cc:359),worker_main(threadpool_generic.cc:1606),start_thread(libpthread.so.0),clone(libc.so.6)
      3 pthread_cond_wait,os_event::wait(sync0types.h:476),os_event::wait_low(sync0types.h:476),srv_resume_thread(srv0srv.cc:2533),srv_worker_thread(srv0srv.cc:2533),start_thread(libpthread.so.0),clone(libc.so.6)
      3 pthread_cond_timedwait,inline_mysql_cond_timedwait(mysql_thread.h:1215),get_event(mysql_thread.h:1215),worker_main(threadpool_generic.cc:1602),start_thread(libpthread.so.0),clone(libc.so.6)
      ...
That is, number of times collapsed stack was noted and then the stack itself, without function arguments if possible. This kind of stack is generated from gdb outputs that looks as follows:
Thread 1 (Thread 0x7f29eee66900 (LWP 32728)):
#0  0x00007f29ef25f431 in poll () from /lib64/libc.so.6
#1  0x000055cc4ece656f in poll (__timeout=-1, __nfds=2, __fds=0x7ffe174108e0) at /usr/include/bits/poll2.h:41
#2  handle_connections_sockets () at /mnt/home/openxs/git/server/sql/mysqld.cc:6753
#3  0x000055cc4ece7a7c in mysqld_main (argc=<optimized out>, argv=<optimized out>) at /mnt/home/openxs/git/server/sql/mysqld.cc:6222
#4  0x00007f29ef190413 in __libc_start_main () from /lib64/libc.so.6
#5  0x000055cc4ecd9ece in _start () at /mnt/home/openxs/git/server/sql/sql_list.h:158
So, each tread stack starts from 'Thread..." line, each frame is numbered and function call is actually the fourth word in the output (in awk terms).
 
Based on that I tried add probe to produce stack in perf format for mysqld process, with some decoration to have a clear marker of when new capture starts:
[openxs@fc29 bpftrace]$ sudo bpftrace -e 'profile:hz:99 /comm == "mysqld"/ {printf("# %s\n", ustack(perf));}' > /tmp/ustack.txt
If you wonder about the details of command format, functions used etc, check the Reference Guide. The result looks as follows in hexdump -C output:
00000000  41 74 74 61 63 68 69 6e  67 20 31 20 70 72 6f 62  |Attaching 1 prob|
00000010  65 2e 2e 2e 0a 23 20 0a  09 37 66 32 39 65 66 62  |e....# ..7f29efb|
00000020  38 34 33 30 30 20 6e 61  6e 6f 73 6c 65 65 70 2b  |84300 nanosleep+|
00000030  36 34 20 28 2f 75 73 72  2f 6c 69 62 36 34 2f 6c  |64 (/usr/lib64/l|
00000040  69 62 70 74 68 72 65 61  64 2d 32 2e 32 38 2e 73  |ibpthread-2.28.s|
00000050  6f 29 0a 0a 23 20 0a 09  35 35 63 63 34 66 32 30  |o)..# ..55cc4f20|
...
This is how you can see where are the new lines, how many etc, for proper parsing. Next step was to check pt-pmp source code and http://poormansprofiler.org/, review and adapt some awk magic used there to collapse stacks, get rid of function arguments etc, and I've ended up with this draft version:
[openxs@fc29 maria10.3]$ cat /tmp/ustack.txt | awk '
BEGIN { s = ""; }
/^\#/ { print s; s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s "," targ } else { s = targ } }
END { print s }' | sort | uniq -c | sort -r -n -k 1,1 | more

    199 __sched_yield,srv_purge_coordinator_thread,start_thread
     16 srv_get_task_queue_length,trx_purge,srv_purge_coordinator_thread,start_thread
     13 trx_purge,srv_purge_coordinator_thread,start_thread
      4 srv_purge_coordinator_thread,start_thread
      4 srv_get_task_queue_length,srv_purge_coordinator_thread,start_thread
      2 ut_crc32_sw,buf_calc_page_crc32,buf_page_is_corrupted,buf_page_io_complete,buf_read_page,buf_page_get_gen,btr_cur_search_to_nth_level_func,row_search_mvcc,ha_innobase::index_read,handler::ha_index_read_map,handler::read_range_first,handler::multi_range_read_next,Mrr_simple_index_reader::get_next,DsMrr_impl::dsmrr_next,QUICK_RANGE_SELECT::get_next,rr_quick,mysql_update,mysql_execute_command,Prepared_statement::execute,Prepared_statement::execute_loop,mysql_stmt_execute_common,mysqld_stmt_execute,dispatch_command,do_command,tp_callback,worker_main,start_thread
      2 syscall,
      2 __pwrite,
      2 mtr_t::commit,row_ins_sec_index_entry_low,row_ins_sec_index_entry,row_upd_sec_index_entry,row_upd_step,row_update_for_mysql,ha_innobase::update_row,handler::ha_update_row,mysql_update,mysql_execute_command,Prepared_statement::execute,Prepared_statement::execute_loop,mysql_stmt_execute_common,mysqld_stmt_execute,dispatch_command,do_command,tp_callback,worker_main,start_thread
...
The above looks reasonable (even if not yet a direct match to what pt-pmp produces, I am too lazy to do much awk these days).

What's more interesting is that with bpftrace we can also get kernel stack for thread while profiling. I took a quick look at what we may get as a result of combining both:
[openxs@fc29 bpftrace]$ sudo bpftrace -e 'profile:hz:99 /comm == "mysqld"/ {printf("# %s\n%s\n", kstack(perf), ustack(perf));}' > /tmp/kstack.txt
[sudo] password for openxs:
^C
I just pring one after the other without any separator in perf format, so I can apply the same trick as above to the result:
openxs@fc29 maria10.3]$ cat /tmp/kstack.txt | awk '                   BEGIN { s = ""; }
/^\#/ { print s; s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s "," targ } else { s = targ } }
END { print s }' | sort | uniq -c | sort -r -n -k 1,1 | more

     93 __sched_text_start,schedule,__x64_sys_sched_yield,do_syscall_64,entry_SYSCALL_64_after_hwframe,__sched_yield,srv_purge_coordinator_thread,start_thread
     49 __sched_yield,srv_purge_coordinator_thread,start_thread
     42 do_syscall_64,entry_SYSCALL_64_after_hwframe,__sched_yield,srv_purge_coordinator_thread,start_thread
     20 srv_get_task_queue_length,trx_purge,srv_purge_coordinator_thread,start_thread
     11 trx_purge,srv_purge_coordinator_thread,start_thread
      2 srv_get_task_queue_length,srv_purge_coordinator_thread,start_thread
      1 __x86_indirect_thunk_rax,do_syscall_64,entry_SYSCALL_64_after_hwframe,__sched_yield,srv_purge_coordinator_thread,start_thread
      1 ut_crc32_sw,log_block_calc_checksum_crc32,log_write_up_to,trx_commit_complete_for_mysql,innobase_commit,commit_one_phase_2,ha_commit_trans,trans_commit_stmt,mysql_execute_command,Prepared_statement::execute,Prepared_statement::execute_loop,mysql_stmt_execute_common,mysqld_stmt_execute,dispatch_command,do_command,tp_callback,worker_main,start_thread
...
      1 _raw_spin_unlock_irqrestore,ata_scsi_queuecmd,scsi_queue_rq,blk_mq_dispatch_rq_list,blk_mq_do_dispatch_sched,blk_mq_sched_dispatch_requests,__blk_mq_run_hw_queue,__blk_mq_delay_run_hw_queue,blk_mq_run_hw_queue,blk_mq_sched_insert_requests,blk_mq_flush_plug_list,blk_flush_plug_list,blk_finish_plug,ext4_writepages,do_writepages,__filemap_fdatawrite_range,file_write_and_wait_range,ext4_sync_file,do_fsync,__x64_sys_fsync,do_syscall_64,entry_SYSCALL_64_after_hwframe,fsync,fil_flush_low,fil_flush_file_spaces,buf_dblwr_update,buf_page_io_complete,fil_aio_wait,io_handler_thread,start_thread
      1 page_simple_validate_new,buf_dblwr_flush_buffered_writes,pc_flush_slot,buf_flush_page_cleaner_coordinator.cold.151,start_thread
      1 page_cur_insert_rec_write_log,page_cur_insert_rec_low,page_copy_rec_list_end_no_locks,btr_page_reorganize_low,btr_can_merge_with_page,btr_compress,btr_cur_compress_if_useful,btr_cur_pessimistic_delete,row_purge_remove_sec_if_poss_tree,row_purge_record_func,row_purge_step,que_run_threads,srv_worker_thread,start_thread
      1 os_thread_yield,srv_purge_coordinator_thread,start_thread
      1 os_thread_get_curr_id,fseg_free_page_func,btr_page_free,btr_compress,btr_cur_compress_if_useful,btr_cur_pessimistic_delete,row_purge_remove_sec_if_poss_tree,row_purge_record_func,row_purge_step,que_run_threads,srv_worker_thread,start_thread
...
Looks cool to me, we see where we spend time in kernel while on-CPU, not only in user space code.

It's sometimes scary and not that easy to use bpftrace, but you'll get there eventually, same as we got to North Greenwich via that Emirates Air Line back in 2019. The only limit is the sky!
* * *
To summarize:
  1. bpftrace and bcc tools you get from vendor packages may not work as expected, so be ready to get nightly builds etc or, as I prefer, build from source. If this is a problem better use ftrace :) 
  2. It seems getting stack traces with bpftrace for pt-pmp is an easy next stap to make. I may be integrated as one of tracers supported here.
  3. Now we can also add kernel stack to the picure.
  4. "JUST DO BPF"!
I have many more details to share about dynamic tracing for MySQL and MariaDB DBAs. Stay tuned!

Saturday, January 25, 2020

Dynamic Tracing of MariaDB Server With ftrace - Basic Example

Today I'd like to continue my series of blog posts about dynamic tracing of MySQL server (or Percona Server for MySQL, or MariaDB, or basically any other software, whatever you prefer) on Linux. This post ideally had to start it, as the approach discussed here is truly dynamic and suitable production servers under high load (unlike gdb tricks, most of the time), works more or less the same way starting from Linux 2.6.27, is "always there" at your fingertips (if you are root, but this is a requirement in a general case for all kinds of dynamic tracing), and does not even strictly require -debuginfo packages with symbolic information for basic usage. I mean ftrace.

To prove that this interface that appeared in 2008 is still usable, let me add the same kind of probe for tracing dispatch_command() function and printing SQL statement that somebody tries to execute (so we can avoid enabling general query log etc) with pure ftrace interface, without much explanations or background (that is presented below). In this case I use recent MariaDB Server 10.3.22 built from source on my Ubuntu 16.04 netbook (because I do not care about being shared in the feed of Planet MySQL anymore in any case).

You just have to do the following:
  1. Check that dynamic tracing with ftrace is possible:
    openxs@ao756:~$ mount | grep tracefs
    tracefs on /sys/kernel/debug/tracing type tracefs (rw,relatime)
    If you do not see tracefs mounted, mount it with the following command (as root):
    mount -t tracefs nodev /sys/kernel/tracing
  2. Find the address of (mangled) dispatch_command() function in the mysqld binary:
    openxs@ao756:~$ objdump -T /home/openxs/dbs/maria10.3/bin/mysqld | grep dispatch_command
    0000000000587b90
    g    DF .text  000000000000236e  Base        _Z16dispatch_command19enum_server_commandP3THDPcjbb
    The address you need (in many cases) is the first word in the output, in hex. So, 0x0000000000587b90
  3. If you want to print values of parameters passed, you need to know how they are passed by the compiler used on the architecture used. Let me share the quote that applies to x86_64:
    "First six arguments are in rdi, rsi, rdx, rcx, r8d, r9d; remaining arguments are on the stack."
    In my case (see this post or details below, or check the source code yourself if in doubts) we need to print third argument, so it must be in the rdx (64-bit) register. For ftrace we have to use just %dx, trust me for now, I tried ;)
  4. Check probe syntax (see the picture below for quick reference) and come up with something like this to add a probe named "dc". Note that on Ubuntu 16.04 this had not worked via sudo directly, I had to become root via "sudo su -" (weird, something to check one day):
    root@ao756:~# echo 'p:dc /home/openxs/dbs/maria10.3/bin/mysqld:0x0000000000587b90 query=+0(%dx):string' > /sys/kernel/debug/tracing/uprobe_events
    All these happens while MariaDB server is running and procerssing queries, in a truly dynamic manner. There is no impact at all so far. Creators of MariaDB never added anything ot the code (or to the build process) to make this work. It's enough to have a binary that runs of Linux kernel that supports ftrace.
  5. Enable tracing for the probe added (this is when magic happens) and tracing in general:
    root@ao756:~# echo 1 > /sys/kernel/debug/tracing/events/uprobes/dc/enable
    root@ao756:~# echo 1 > /sys/kernel/debug/tracing/tracing_on
  6. Now assuming you run the following:
    root@ao756:~# mysql -uroot --socket=/tmp/mariadb.sock
    Welcome to the MySQL monitor.  Commands end with ; or \g.
    Your MySQL connection id is 92
    Server version: 5.5.5-10.3.22-MariaDB Source distribution

    Copyright (c) 2009-2019 Percona LLC and/or its affiliates
    Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.

    Oracle is a registered trademark of Oracle Corporation and/or its
    affiliates. Other names may be trademarks of their respective
    owners.

    Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

    mysql> select 1+1;
    +-----+
    | 1+1 |
    +-----+
    |   2 |
    +-----+
    1 row in set (0.00 sec)

    mysql> select version();
    +-----------------+
    | version()       |
    +-----------------+
    | 10.3.22-MariaDB |
    +-----------------+
    1 row in set (0.00 sec)
    you may get the trace as follows, for example:
    root@ao756:~# cat /sys/kernel/debug/tracing/trace
    # tracer: nop
    #
    # entries-in-buffer/entries-written: 9851/9851   #P:2
    #
    #                              _-----=> irqs-off
    #                             / _----=> need-resched
    #                            | / _---=> hardirq/softirq
    #                            || / _--=> preempt-depth
    #                            ||| /     delay
    #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
    #              | |       |   ||||       |         |
              python-21369 [001] d... 209044.826580: : sys_wr: 0000000000000008
    ...
    (I had show the top of the output to give hint about format), or:
    root@ao756:~# cat /sys/kernel/debug/tracing/trace_pipe
              mysqld-1082  [000] d... 273258.971401: dc: (0x560d8a20fb90) query="select @@version_comment limit 1"
              mysqld-1082  [000] d... 273263.187839: dc: (0x560d8a20fb90) query="select 1+1"
              mysqld-1082  [001] d... 273269.128542: dc: (0x560d8a20fb90) query="select version()"
    ...
  7. If you want to stop this basic tracing, run these:
    root@ao756:~# echo 0 > /sys/kernel/debug/tracing/events/uprobes/dc/enable
    root@ao756:~# echo 0 > /sys/kernel/debug/tracing/tracing_on
So, with 7 simple steps we can enable dynamic tracing and see SQL statements executed and time when this happened.
Check https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/trace/uprobetracer.rst
The magic behind ftrace and steps that I really had to perform when I tried to apply this approach for the very first time will be presented in the next posts. Stay tuned! 

Monday, January 20, 2020

Dynamic Tracing of MariaDB Server With bcc trace - Basic Example

This is a yet another blog post in my series about dynamic tracing of MySQL server (and friends) on Linux. Logically it had to appear after this one about perf and another one about bpftrace. For older Linux systems or when you are in a hurry with customer and have no time to upgrade, build from source etc perf just works and is really flexible (but it comes with some cost of writing many samples to disk and then processing them). For happy users of Linux with kernels 4.9+ (the newer the better), like recent Ubuntu, RHEL 8, Debian 9+ or Fedora there entire world of new efficient tracing with bpftrace is open and extending with every new kernel release.

For those in between, like me with this Ubuntu 16.04:
openxs@ao756:~/git/bcc/build$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.6 LTS"
openxs@ao756:~/git/bcc/build$ uname -a
Linux ao756 4.4.0-171-generic #200-Ubuntu SMP Tue Dec 3 11:04:55 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
the fancy world of eBPF and more efficient dynamic tracing is still mostly open, as we can try to use BCC tools. BCC is a toolkit for creating efficient kernel tracing and manipulation programs that includes several potentially useful tools and examples for MySQL DBAs. It makes use of extended BPF (Berkeley Packet Filters), formally known as eBPF.

I have a draft of this blog post hanging around since October 2019, but every time I tried to complete it I was not happy with the content. I wanted to get back, test more, try to present more tools, find out how to be able to access structure members in probes as easy as I can do it with gdb or perf, but then I hit some problem and put the draft aside...

When I started again some time later I often hit some new problem, so today I just decided to finally write down what I already know for sure, and provide at least a very basic example of dynamic tracing along the lines of those used in earlier posts (capturing queries executed by different threads using dynamic probes).

The first problem in case of Ubuntu 16.04 is to get the binaries of BCC tools. One of the ways is to build from GitHub source. The INSTALL.md document is clear enough when describing build dependencies and steps:
git clone https://github.com/iovisor/bcc.git
mkdir bcc/build; cd bcc/build
cmake .. -DCMAKE_INSTALL_PREFIX=/usr
make
sudo make install
but still there is something to note. In recent versions you surely have to update the libbpf submodule, or you'll end up with compilation errors at early stage. My steps today were the following:
openxs@ao756:~/dbs/maria10.3$ cd ~/git/bcc/
openxs@ao756:~/git/bcc$ git pull
Already up-to-date.
openxs@ao756:~/git/bcc$ git log -1
commit dce8e9daf59f44dec4e3500d39a82a8ce59e43ba
Author: Yonghong Song <yhs@fb.com>
Date:   Fri Jan 17 22:06:52 2020 -0800

    sync with latest libbpf repo

    sync libbpf submodule upto the following commit:
        commit 033ad7ee78e8f266fdd27ee2675090ccf4402f3f
        Author: Andrii Nakryiko <andriin@fb.com>
        Date:   Fri Jan 17 16:22:23 2020 -0800

            sync: latest libbpf changes from kernel

    Signed-off-by: Yonghong Song <yhs@fb.com>

openxs@ao756:~/git/bcc$ git submodule init
openxs@ao756:~/git/bcc$ git submodule update
Submodule path 'src/cc/libbpf': checked out '033ad7ee78e8f266fdd27ee2675090ccf4402f3f'
Now I can proceed to build subdirectory and complete the build:
openxs@ao756:~/git/bcc/build$ cmake .. -DCMAKE_INSTALL_PREFIX=/usr...
openxs@ao756:~/git/bcc/build$ make
...
[ 99%] Building CXX object tests/cc/CMakeFiles/test_libbcc.dir/test_usdt_probes.cc.o
[100%] Building CXX object tests/cc/CMakeFiles/test_libbcc.dir/utils.cc.o
[100%] Linking CXX executable test_libbcc
[100%] Built target test_libbcc
It's always interesting to check if tests pass:
openxs@ao756:~/git/bcc/build$ make test
Running tests...
Test project /home/openxs/git/bcc/build
      Start  1: style-check
 1/40 Test  #1: style-check ......................   Passed    0.01 sec
      Start  2: c_test_static
 2/40 Test  #2: c_test_static ....................   Passed    0.30 sec
...
40/40 Test #40: lua_test_standalone ..............***Failed    0.06 sec

75% tests passed, 10 tests failed out of 40

Total Test time (real) = 450.78 sec

The following tests FAILED:
          3 - test_libbcc (Failed)
          4 - py_test_stat1_b (Failed)
          5 - py_test_bpf_log (Failed)
          6 - py_test_stat1_c (Failed)
          7 - py_test_xlate1_c (Failed)
          8 - py_test_call1 (Failed)
         16 - py_test_brb (Failed)
         17 - py_test_brb2 (Failed)
         18 - py_test_clang (Failed)
         40 - lua_test_standalone (Failed)
Errors while running CTest
Makefile:105: recipe for target 'test' failed
make: *** [test] Error 8
I've always had some tests failed, and one day I probably have to report the issue for the project, but for the purpose of this post (based on previous experience with older code) I'll get at least trace tool working as expected. So, I decided to process with installation:
openxs@ao756:~/git/bcc/build$ sudo make install
...
-- Up-to-date: /usr/share/bcc/tools/old/stackcount
-- Up-to-date: /usr/share/bcc/tools/old/oomkill
The tools are installed by default to /usr/share/bcc/tools.

For adding dynamic probes I'll use trace tool  that probes functions you specify and displays trace messages if a particular condition is met. You can control the message format to display function
arguments and return values.

Brendan Gregg explains the usage of this and other tools here with a lot of details. I'll just add a nice chart from that page here:
There is a separate tutorial with examples. You may want to check section for trace tool there.

For the purpose of this blog post I think it's enough to quickly check help output:
openxs@ao756:~/git/bcc/build$ sudo /usr/share/bcc/tools/trace
usage: trace [-h] [-b BUFFER_PAGES] [-p PID] [-L TID] [-v] [-Z STRING_SIZE]
             [-S] [-M MAX_EVENTS] [-t] [-u] [-T] [-C] [-c CGROUP_PATH]
             [-n NAME] [-f MSG_FILTER] [-B] [-s SYM_FILE_LIST] [-K] [-U] [-a]
             [-I header]
             probe [probe ...]
trace: error: too few arguments
and note the following basic syntax is used to define probes (see man /usr/share/bcc/man/man8/trace.8.gz after buuilding the tools from source as described above):
PROBE SYNTAX
       The general probe syntax is as follows:

       [{p,r}]:[library]:function[(signature)]      [(predicate)]     ["format
       string"[, arguments]]

       {t:category:event,u:library:probe}  [(predicate)]  ["format   string"[,
       arguments]]

       {[{p,r}],t,u}
              Probe  type  -  "p" for function entry, "r" for function return,
              "t" for kernel tracepoint, "u" for USDT probe. The default probe
              type is "p".
...
Fir simplicity here we do not consider conditional probes, so predicate is skipped. At the moment we are not interested in kernel or or user defined static tracepoints (the are not defined in default recent builds of MySQL or MariaDB server anway and require -DENABLE_DTRACE=ON to be explictly added to cmake command line used). For user defined dynamic probes in the mysqld process we need p (for probe at function entry) and maybe r (for function return).

We need to refer to library, and in our case this is a full path name to the mysqld binary (or just mysqld if it's in PATH). We also need to refer to some function by names. Quick test will show you that by default trace does NOT accept plain function names in MySQL or MariaDB code (as perf does), and require mangled ones to be used (same as bpftrace). We can find the names with nm command:
openxs@ao756:~/git/bcc/build$ nm -na /home/openxs/dbs/maria10.3/bin/mysqld | grep dispatch_command
00000000004a1eef t _Z16dispatch_command19enum_server_commandP3THDPcjbb.cold.344
00000000005c5180 T _Z16dispatch_command19enum_server_commandP3THDPcjbb
00000000005c5180 t _Z16dispatch_command19enum_server_commandP3THDPcjbb.localalias.256
In the example above I was specifically looking for dispatch_command() function of MariaDB server version 10.3.x that I assume (see previous post) has a string with SQL statement as the third argument, packet. So, I can refer to this function in probe as "_Z16dispatch_command19enum_server_commandP3THDPcjbb".

The "format string" that define how to output arguments of probe  is a printf-style format string.  You  can  use the following format specifiers: %s, %d%u, ...  with the same semantics as printf's. In our case for zero-terminating string we'll use "%s".

Arguments of the function traced are named arg1, arg2, ... argN (unless we provide a signature for function), and are numbered starting from 1. So, in out case we can add a probe to print third argument of the dispatch_command() function upon entry as follows:
openxs@ao756:~/dbs/maria10.3$ sudo /usr/share/bcc/tools/trace -T 'p:/home/openxs/dbs/maria10.3/bin/mysqld:_Z16dispatch_command19enum_server_commandP3THDPcjbb "%s" arg3'
[sudo] password for openxs:
TIME     PID     TID     COMM            FUNC             -
16:16:53 26585   29133   mysqld          _Z16dispatch_command19enum_server_commandP3THDPcjbb select @@version_comment limit 1
16:17:02 26585   29133   mysqld          _Z16dispatch_command19enum_server_commandP3THDPcjbb select 1
16:17:05 26585   29133   mysqld          _Z16dispatch_command19enum_server_commandP3THDPcjbb select 2
16:17:07 26585   29133   mysqld          _Z16dispatch_command19enum_server_commandP3THDPcjbb
^C
I've got the output above for this sample session:
openxs@ao756:~/dbs/maria10.3$ bin/mysql -uroot --socket=/tmp/mariadb.sock
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 16
Server version: 10.3.22-MariaDB Source distribution

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> select 1;
+---+
| 1 |
+---+
| 1 |
+---+
1 row in set (0,000 sec)

MariaDB [(none)]> select 2;
+---+
| 2 |
+---+
| 2 |
+---+
1 row in set (0,001 sec)

MariaDB [(none)]> exit
Bye
Note that I've added -T option to the command line to output timestamp.
  
It's a bit more complex with recent versions of Percona Server or MySQL (or any function parameters that are complex structures). It's also more complex if we want to process prepared statements and process packet content depending on the first argument (the above is correct only for COM_QUERY) and so on. But these are basic steps to get a log of SQL queries with timestamps by adding a dynamic probe with BCC trace tool. Enjoy!

Saturday, January 18, 2020

Fun with Bugs #92 - On MySQL Bug Reports I am Subscribed to, Part XXVI

I'd like to continue reviewing MySQL bug reports from Community users that I considered interesting and subscribed to. Unlike in the previous post in this series, I am not going to check test cases on any competitor product, but will use only recently released MySQL 5.7.29 and 8.0.19 for checks, if any. This time I'll concentrate on bugs reported in November 2019.

As usual, I mostly care about optimizer, InnoDB and replication related bugs. Here is the list:
  • Bug #97476 - "Range optimizer skips rows". This bug reported by Ilya Raudsepp looks like a clear regression in MySQL 8.0.x comparing to MySQL 5.7.x at least. I get the following correct results with 5.7.29:
    mysql> SELECT t.id
        -> FROM Test t
        -> JOIN (
        ->     SELECT item_id, MAX(created_at) AS created_at
        ->     FROM Test t
        ->     WHERE (platform_id = 2) AND (item_id IN (3,2,111)) AND (type = 'Default')
        ->     GROUP BY item_id
        -> ) t2 ON t.item_id = t2.item_id
        ->   t.item_id = t2.item_id
        ->   AND t.created_at = t2.created_at
        ->   AND t.type = 'Default'
        -> WHERE t.platform_id = 2;
    +----+
    | id |
    +----+
    |  6 |
    |  3 |
    |  5 |
    +----+
    3 rows in set (0,03 sec)

    mysql> select version();
    +-----------+
    | version() |
    +-----------+
    | 5.7.29    |
    +-----------+
    1 row in set (0,02 sec)
  • Bug #97531 - "5.7 replication breakage with syntax error with GRANT management". This tricky bug reported by Simon Mudd applies also to MySQL 8.0.x. It is closed as fixed, but the fix had not made it to recent 5.7.29 and 8.0.19 releases, so you'll have to wait for few more months.
  • Bug #97552 - "Regression: LEFT JOIN with Impossible ON condition performs slowly". Yet another optimizer regression in MySQL 8 (comparing to 5.7.x) that is fixed only in MySQL 8.0.20+. The bug was reported by Fredric Johansson.
  • Bug #97648 - "Bug in order by clause in union clause". Yet another regression (at least from user's point of view) in recent MySQL 5.7.x and 8.0.x comparing to 5.6.x. This time without a "regression" tag. The bug was reported by Andrei Mart.
  • Bug #97662 - "MySQL v8.0.18 FIPS mode is no longer supported". According to Ryan L, MySQL 8.0.18+ is no longer supporting ssl_fips_mode=STRICT, as OpenSSL 1.1.1 is not FIPS-compatible and MySQL Server must be compiled using OpenSSL 1.1.1 or higher. That's interesting. Check also this link.
  • Bug #97682 - "Handler fails to trigger on Error 1049 or SQLSTATE 42000 or plain sqlexception". This regression (comparing to MySQL 5.7) was reported by Jericho Rivera. It is fixed in MySQL 8.0.20. The patch was provided by Kamil Holubicki.
  • Bug #97692 - "Querying information_schema.TABLES issue". I do not see any documented attempt to check on MySQL 8.0, so I had to add a comment to the bug report. From what I see, in MySQL 8.0.19 we still get different (empty) result from the second query, but at least now we have a warning:
    mysql> SELECT ts.TABLE_SCHEMA
        -> FROM information_schema.TABLES ts
        -> WHERE ts.TABLE_TYPE ='VIEW'
        -> AND ts.TABLE_SCHEMA NOT IN ('sys')
        -> AND ts.TABLE_COMMENT LIKE '%invalid%';
    +--------------+
    | TABLE_SCHEMA |
    +--------------+
    | test         |
    +--------------+
    1 row in set, 1 warning (0,00 sec)

    mysql> show warnings\G
    *************************** 1. row ***************************
      Level: Warning
       Code: 1356
    Message: View 'test.v' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them
    1 row in set (0,00 sec)

    mysql> select version();
    +-----------+
    | version() |
    +-----------+
    | 8.0.19    |
    +-----------+
    1 row in set (0,00 sec)
    The bug was reported by Vinicius Malvestio Grippa.
  • Bug #97693 - "ALTER USER user IDENTIFIED BY 'password' broken by invalid authentication_string". The bug was reported by Nikolai Ikhalainen. MySQL 8.0.19 is still affected.
  • Bug #97694 - "MySQL 8.0.18 fails on STOP SLAVE/START SLAVE stress test". For some reason I do not see any documented attempt to verify this on MySQL 5.7 also. The bug was reported by PrzemysÅ‚aw SkibiÅ„ski, who also suggested a fix.
  • Bug #97734 - "Document the correct method to stop slaving with MTS without a warning or error". I can only agree with this request from Buchan Milne. Please. do :)
  • Bug #97735 - "ALTER USER IF EXISTS ... WITH_MAX_USER_CONNECTIONS 9999 not applied correctly". yet another bug report by Simon Mudd in this list. For some reason, again, I do not see any documented attempt to verify the bug on MySQL 8.0.x, while there is no clear reason to think it is not affected.
  • Bug #97742 - "bad item ref from correlated subquery to outer distinct table". This bug was reported by Song Zhibai, who also had contributed a patch. Based on further comments from  Øystein Grøvlen and these results:
    mysql> EXPLAIN SELECT f3 FROM t1 HAVING (SELECT 1 FROM t2 HAVING f2 LIMIT 1);
    +----+--------------------+-------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
    | id | select_type        | table | partitions | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra       |
    +----+--------------------+-------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
    |  1 | PRIMARY            | t1    | NULL       | ALL   | NULL          | NULL    | NULL    | NULL |    3 |   100.00 | NULL        |
    |  2 | DEPENDENT SUBQUERY | t2    | NULL       | index | NULL          | PRIMARY | 4       | NULL |    1 |   100.00 | Using index |
    +----+--------------------+-------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
    2 rows in set, 2 warnings (0,00 sec)

    mysql> show warnings\G
    *************************** 1. row ***************************
      Level: Note
       Code: 1276
    Message: Field or reference 'f2' of SELECT #2 was resolved in SELECT #1
    *************************** 2. row ***************************
      Level: Note
       Code: 1003
    Message: /* select#1 */ select `test`.`t1`.`f3` AS `f3` from `test`.`t1` having (/* select#2 */ select 1 from `test`.`t2` having `test`.`t1`.`f2` limit 1)
    2 rows in set (0,00 sec)

    mysql> select version();
    +-----------+
    | version() |
    +-----------+
    | 5.7.29    |
    +-----------+
    1 row in set (0,00 sec)
    I'd say that MySQL 5.7.x is also affected, but for some reason nobody documented any attempt to verify it there. So, I've added a comment.
  • Bug #97777 - "separate global variables (from hot variables) using linker script (ELF)". Beautiful bug report from Daniel Black.With a lot of details, perf and readelf outputs and patch contributed. See also his Bug #97822 - "buf_page_get_gen buf_pool->stat.n_page_gets++ is a cpu waste", with perf analysis up to a single assembler instruction level and fix suggested.
  • Bug #97825 - "dd_mdl_acquire in dd_table_open with dict_sys->mutex hold may cause deadlock". Here I am really puzzled by no visible attempt to check the arguments of bug reporter, Dave Do, who tried to perform lock order analysis by code review. All we see as a result is this:
    "Lock order could be different, but it is irrelevant, since these are locks on totally different levels and can't, in themselves, cause any deadlock."
    What a great argument! Not a bug, surely... We trust you.
"What bugs are you talking about? I have no bugs, neither does MySQL 8!"
To summarize:
  1. MySQL 8 introduces some optimizer (and some other) regressions. They seem to be fixed fast enough, but I wonder why only Community users were able to find them not Oracle's QA...
  2. MySQL 8.0.19 is surely great, but I see many serious bugs fixed o0nly in 8.0.20+.
  3. Percona, Booking and Facebook engineers still continue contributing high quality bug reports, comments/verification details and patches. Oracle is lucky to have such nice partners in making MySQL better.
  4. I still see problems with following proper verification procedures and documenting the results. Too often the bug reported for 8.0.x is NOT checked on 5.7.x as well, regression tag is not set, and so on. Sometimes reports are closed as "Not a bug" without any attempt to follow the analysis provided or prove the point. This is sad and wrong.

Sunday, January 12, 2020

Fun with Bugs #91 - On MySQL Bug Reports I am Subscribed to, Part XXV

Not sure if it's still interesting to anybody else, but MySQL users keep finding and reporting new problems that may be caused by genuine bugs in the code. I keep checking these reports and subscribing to those I consider interesting. Let me start blogging in the New Year of 2020 with a review of some replication, InnoDB and (many!) optimizer bugs reported in September and October, 2019.

As usual, I start from the oldest and care to mention bug reporters by names and add links to their other bug reports, if any. So, here is the new list:
  • Bug #96827 - "mysqlbinlog needs options to abort if invalid events are found on in-use binlogs". I had never checked myself, but I see no reasons not to trust Yoshinori Matsunobu in this case, based on code fragments shared. All current MySQL versions, from 5.6.x to 8.0.x, are affected. From what I see here, MariaDB is also affected.
  • Bug #96853 - "Inconsistent super_read_only status when changing the variable is blocked". Nice bug report by Przemyslaw Malkowski from Percona. For some reason I do not see clear statement if MySQL 8 is affected.
  • Bug #96874 - "The write_notifier_mutex in log_sys is useless". This bug was reported by Chen Zongzhi on MySQL 8.0.17 (see also hist another similar Bug #97358 - "The log.flush_notifier_mutex in log_sys is useless"), but "Version" filed is empty even though the bug is "Verified". This is NOT acceptable.
  • Bug #96946 - "Outer reference in join condition isn't allowed". This bug (that affects all MySQL versions) was reported by Laurents Meyer. See also older Bug #35242 (still "Verified" and affects MariaDB 10.3.x as well).
  • Bug #96950 - "CONCAT() can generate corrupted output". I wish we'd see the exact test case, but at least based on code review this bug (reported by Jay Edgar) was verified for MySQL 5.6 and 5.7. I see the same code in MariaDB, unfortunately.
  • Bug #97001 - "Dangerous optimization reconsidering_access_paths_for_index_ordering". The problems is with queries like this:
    SELECT ... WHERE [secondary key conditions] ORDER BY `id` ASC LIMIT n
    and bug reporter, Jeremy Cole, listed a lot of potentially related older bug reports. he had also suggested a patch. I'd be happy to see the fix in MySQL soon.
  • Bug #97113 - "BIT column serialized incorrectly in CASE expression". This bug report was created by Bradley Grainger. It is stated there that MySQL 5.7 (not only 8.0) is affected, but "Version:" field of a verified bug does NOT list 5.7.x. Absolutely wrong way of bugs processing. MariaDB also seems to be inconsistent, even though the result for one of the queries is different:
    MariaDB [test]> SELECT CASE WHEN name IS NOT NULL THEN value ELSE NULL END FROM query_bit;
    Field   1:  `CASE WHEN name IS NOT NULL THEN value ELSE NULL END`
    Catalog:    `def`
    Database:   ``
    Table:      ``
    Org_table:  ``
    Type:       NEWDECIMAL
    Collation:  binary (63)
    Length:     2
    Max_length: 1
    Decimals:   0
    Flags:      BINARY NUM


    +-----------------------------------------------------+
    | CASE WHEN name IS NOT NULL THEN value ELSE NULL END |
    +-----------------------------------------------------+
    | 1                                                   |
    +-----------------------------------------------------+
    1 row in set (0.021 sec)
  • Bug #97150 - "rwlock: refine lock->recursive with C11 atomics". Patch for MySQL 8.0.x was contributed by Cai Yibo. See also his another contribution, Bug #97228 - "rwlock: refine lock->lock_word with C11 atomics".
  • Bug #97299 - "Improve the explain informations for Hash Joins". Simple EXPLAIN (unlike the one with format=tree) does not give a hint that new MySQL 8.0.18+ features, hash join, was used. Simple and useful feature request from Tibor Korocz.
  • Bug #97345 - "IO Thread not detecting failed master with relay_log_space_limit." Nice bug report from Jean-François Gagné, but no documented attempt to check if MySQL 5.6.x and 8.0.x are also affected.
  • Bug #97347 - "In some cases queries with ST_CONTAINS do not return any results". Simple and easy to check bug report from Christian Koinig. Note that based on quick test MariaDB is NOT affected:
    MariaDB [test]> select version(), count(*) FROM test
        -> WHERE ST_CONTAINS(
        ->  geo_footprint,
        ->  ST_GeomFromGeoJSON('{"type":"Polygon","coordinates":[[[15.11333480819996
    6,48.1337532388],[15.113329984100005,48.1337371609],[15.113411697200036,48.13371
    66354],[15.113673777399981,48.1336819199],[15.114544787600039,48.1335464618],[15
    .115336574000025,48.1334415189],[15.116374992200008,48.1332937084],[15.117266346
    799966,48.1330924824],[15.11769786879995,48.1329803459],[15.118129375299986,48.1
    32868199],[15.118515258099933,48.1327388086],[15.118597296700045,48.1327141533],
    [15.118635348899943,48.132702717],[15.11867907729993,48.1327796282],[15.11876276
    890007,48.13290987],[15.118805112699988,48.1330357889],[15.118850101500016,48.13
    33486685],[15.118823191700017,48.1334777297],[15.118820984299987,48.1334784295],
    [15.118821076099948,48.1334779691],[15.113334808199966,48.1337532388],[15.113334
    808199966,48.1337532388]]]}'));
    +--------------------+----------+
    | version()          | count(*) |
    +--------------------+----------+
    | 10.3.7-MariaDB-log |        1 |
    +--------------------+----------+
    1 row in set (0.003 sec)
  • Bug #97372 - "Constructor Query_event must check enough space". Contribution to 5.7 and 8.0 by Pengbo Shi. Waiting for the OCI signed by the contributor...
  • Bug #97418 - "MySQL chooses different execution plan in 5.7". Interesting bug report from Vinodh Krish. I am again not sure if versions affected match the results of tests presented here.
  • Bug #97421 - "Replace into affected row count not match with events in binlog". Not sure if MySQL 8 was checked, but MariaDB 10.3.7 also uses single Update_rows event in the binary log. Thanks to Ke Lu for noticing and reporting this!
Also, on a separate note, this claim of MySQL 8.0 performance regression from Mark CallaghanBug #86215, is still being analyzed it seems. No further comments for 2.5 years already!

Autumn of 2019 was fruitful. A lot of interesting MySQL bug reports also, not just grapes on my balcony...
To summarize:
  1. For some reason I often do not see explicit documented attempts by Oracle MySQL engineers from the bugs verification team to check bug on different MySQL versions. Sometimes obviously affected version (like MySQL 8.0.x) is not listed in the field. So "Version" field becomes useless This is absolutely wrong. Maybe I should submit yet another talk to some conference on how to process bugs properly?
  2. Some regression bugs are still not marked with "regression" tag when verified.
  3. MySQL optimizer still requires a lot of work to become decent.
  4. I see a lot of new interesting new bug reports both from well known old community members and users I had never noticed before by name. This is great and proves that MySQL is still alive and use all kinds of contributions from Community.
Next time I'll review interesting bugs reported in November and December, 2019. Stay tuned!