Saturday, February 22, 2020

Fun with Bugs #94 - On MySQL Bug Reports I am Subscribed to, Part XXVIII

I may get a chance to speak about proper bugs processing for open source projects later this year, so I have to keep reviewing recent MySQL bugs to be ready for that. In my previous post in this series I listed some interesting MySQL bug reports created in December, 2019. Time to move on to January, 2020! Belated Happy New Year of cool MySQL Bugs!

As usual I mostly care about InnoDB, replication and optimizer bugs and explicitly mention bug reporter by name and give link to his other active reports (if any). I also pick up examples of proper (or improper) reporter and Oracle engineers attitudes. Here is the list:
  • Bug #98103 - "unexpected behavior while logging an aborted query in the slow query log".  Query that was killed while waiting for the table metadata lock is not only get logged, but also lock wait time is saved as query execution time. I'd like to highlight how bug reporter, Pranay Motupalli, used gdb to study what really happens in the code in this case. Perfect bug report!
  • Bug #98113 - "Crash possible when load & unload a connection handler". The (quite obvious) bug was verified based on code review, but only after some efforts were spent by Oracle engineer on denial to accept the problem and its importance. This bug was reported by Fangxin Flou.
  • Bug #98132 - "Analyze table leads to empty statistics during online rebuild DDL ". Nice addition to my collections! This bug with a nice and clear test case was reported by Albert Hu, who also suggested a fix.
  • Bug #98139 - "Committing a XA transaction causes a wrong sequence of events in binlog". This bug reported by Dehao Wang was verified as a "documentation" one, but I doubt documenting current behavior properly is an acceptable fix. Bug reporter suggested to commit in the binary log first, for example. Current implementation that allows users to commit/rollback a XA transaction by using another connection if the former connection is closed or killed, is risky. A lot of arguing happened in comments in the process, and my comment asking for a clear quote from the manual:
    Would you be so kind to share some text from this page you mentioned:

    https://dev.mysql.com/doc/refman/8.0/en/xa.html

    or any other fine MySQL 8 manual page stating that XA COMMIT is NOT supported when executed from session/connection/thread other than those prepared the XA transaction? I am doing something wrong probably, but I can not find such text anywhere.
    was hidden. Let's see what happens to this bug report next.
  • Bug #98211 - "Auto increment value didn't reset correctly.". Not sure what this bug reported by Zhao Jianwei has to do with "Data Types", IMHO it's more about DDL or data dictionary. Again, some sarcastic comments from Community users were needed to put work on this bug back on track...
  • Bug #98220 - "with log_slow_extra=on Errno: info not getting updated correctly for error". This bug was reported by lalit Choudhary from Percona.
  • Bug #98227 - "innodb_stats_method='nulls_ignored' and persistent stats get wrong cardinalities". I think category is wrong for this bug. It's a but in InnoDB's persistent statistics implementation, one of many. The bug was reported by Agustín G from Percona.
  • Bug #98231 - "show index from a partition table gets a wrong cardinality value". Yet another by report by Albert Hu. that ended up as a "documentation" bug for now, even though older MySQL versions provided better cardinality estimations than MySQL 8.0 in this case (so this is a regression of a kind). I hope the bug will be re-classified and properly processed later.
  • Bug #98238 - "I_S.KEY_COLUMN_USAGE is very slow". I am surprised to see such a bug in MySQL 8. According to the bug reporter, Manuel Mausz, this is also a kind of regression comparing to older MySQL version, where these queries used to run faster. Surely, no "regression" tag in this case was added.
  • Bug #98284 - "Low sysbench score in the case of a large number of connections". This notable performance regression of MySQL 8 vs 5.7 was reported by zanye zjy. perf profiling pointed out towards ppoll() where a lot of time is spent. There is a fix suggested by Fangxin Flou (to use poll() instead), but the bug is still "Open".
  • Bug #98287 - "Explanation of hash joins is inconsistent across EXPLAIN formats". This bug was reported by Saverio M and ended up marked as a duplicate of Bug #97299 fixed in upcoming 8.0.20. Use EXPLAIN FORMAT=TREE in the meantime to see proper information about hash joins usage in the plan.
  • Bug #98288 - "xa commit crash lead mysql replication error". This bug report from Phoenix Zhang (who also suggested a patch) was declared a duplicate of Bug #76233 - "XA prepare is logged ahead of engine prepare" (that I've already discussed among other XA transactions bugs here).
  • Bug #98324 - "Deadlocks more frequent since version 5.7.26". Nice regression bug report by Przemyslaw Malkowski from Percona, with additional test provided later by Stephen Wei . Interestingly enough, test results shared by Umesh Shastry show that MySQL 8.0.19 is affected in the same way as 5.7.26+, but 8.0.19 is NOT listed as one of versions affected. This is a mistake to fix, along with missing regression tag.
  • Bug #98427 - "InnoDB FullText AUX Tables are broken in 8.0". Yet another regression in MySQL 8 was found by Satya Bodapati. Change in default collation for utf8mb4 character set caused this it seems. InnoDB FULLTEXT search was far from perfect anyway...
The are clouds in the sky of MySQL bugs processing.
To summarize:
  1.  Still too much time and efforts are sometimes spent on arguing with bug reporter instead of accepting and processing bugs properly. This is unfortunate.
  2. Sometimes bugs are wrongly classified when verified (documentation vs code bug, wrong category, wrong severity, not all affected versions are listed, ignoring regression etc). This is also unfortunate.
  3. Percona engineers still help to make MySQL better.
  4. There are some fixes in upcoming MySQL 8.0.20 that I am waiting for :)
  5. XA transactions in MySQL are badly broken (they are not atomic in storage engine + binary log) and hardly safe to use in reality.

Saturday, February 8, 2020

Fun with Bugs #93 - On MySQL Bug Reports I am Subscribed to, Part XXVII

No matter what I write and present about dynamic tracing, blog posts about MySQL bugs are more popular based on statistics. So, to make more readers happy, I'd like to continue my review of interesting bugs reported in November with this post on bugs reported during December, 2019.

As usual, I'll try to concentrate on bug reports related to InnoDB, replication and optimizer, but some other categories also got my attention:
  • Bug #97911 - "[FATAL] Semaphore wait has lasted > 600 seconds. We intentionally crash the serv...". This bug got marked as a duplicate of other, older long semaphore wait bug (in "Can't repeat" status!) without much analysis. I think all Oracle engineers who added comments to that bug missed one interesting point:
    ... [ERROR] [MY-012872] [InnoDB] Semaphore wait has lasted > 39052544 seconds.
    even though bug reporter highlighted it in a comment. Reported wait time is a problem and is surely a bug, no matter how to reproduce the long wait itself and what is its root cause!
  • Bug #97913 - "Undo logs growing during partitioning ALTER queries". This bug (affecting only MySQL 5.7.x) was reported by Przemyslaw Malkowski from Percona, who also presented useful examples of monitoring queries to the information_schema.innodb_metrics and performance_schema. Check also comments that may explain why 8.0 is not affected in a similar way.
  • Bug #97935 - "Memory leak in client connection using information_schema". It took some efforts (starting from but not limited to Valgrind Massif profiling of heap memory usage) and time for Daniel Nichter to prove the point and get this bug "Verified". It is also not clear if MySQL 8 is also affected.
  • Bug #97950 - "buf_read_page_handle_error can trigger assert failure". Bug reporter, Shu Lin, tried his best to make the point. It's clear enough how to repeat this, and one could use one of documented test synchonisation methods if gdb is too much for bug verification. I do not think this bug was handled properly or got the level of attention it truly deserved.
  • Bug #97966 - "XA COMMIT in another session will not write binlog event". This bug was reported by Lujun Wang and immediately verified, but again with no documented check if MySQL 8 is affected. This happens too often, unfortunately.
  • Bug #97971 - "Roles not handling column level privileges correctly; Can SELECT, but not UPDATE". Clear and simple bug report with a workaround from Travis Bement. It was immediately verified.
  • Bug #98014 - "Lossy implicit conversion in conditional breaks ONLY_FULL_GROUP_BY". Yet another case of (IMHO) improper bug processing. The argument presented (from the manual):
    "MySQL 5.7.5 and later also permits a nonaggregate column not named in a GROUP BY clause when ONLY_FULL_GROUP_BY SQL mode is enabled, provided that this column is limited to a single value"
    does not apply, as "single value" for = 0 is NOT selected, we have multiple Host values matching it due to conversion. This is how proper version (guess what it is) works:
    mysql> SELECT User, Host, COUNT(*) FROM mysql.user WHERE Host = 0 GROUP BY 1;
    ERROR 1055 (42000): 'mysql.user.Host' isn't in GROUP BY
    mysql> select @@sql_mode;
    +--------------------+
    | @@sql_mode         |
    +--------------------+
    | ONLY_FULL_GROUP_BY |
    +--------------------+
    1 row in set (0.001 sec)

    mysql> set session sql_mode='';
    Query OK, 0 rows affected (0.029 sec)

    mysql> SELECT User, Host, COUNT(*) FROM mysql.user WHERE Host = 0 GROUP BY 1;+---------------+-----------+----------+
    | User          | Host      | COUNT(*) |
    +---------------+-----------+----------+
    | data_engineer |           |        1 |
    | en            | localhost |        1 |
    | ro1           |           |        1 |
    | ro2           |           |        1 |
    | role-1        |           |        1 |
    | root          | ::1       |        3 |
    ...
    | user1         | %         |        1 |
    +---------------+-----------+----------+
    13 rows in set, 17 warnings (0.003 sec)
    I think this bug reported by Joshua Varner must be verified.
  • Bug #98046 - "Inconsistent behavior while logging a killed query in the slow query log". Bug reporter, Pranay Motupalli, provided a clear test case and a detailed analysis, including the gdb debugging session that proves the point. Nice bug report.
  • Bug #98055 - "MySQL Optimizer Bug not picking right index". Both the bug reporter (Sudheer Gadipathi) and engineer who verified the bug stated that MySQL 8.0.x is similarly affected (UNIQUE key is preferred for the partitioned table, even though there is a better non-unique index). But 8.0.x is NOT listed in the "Version:" filed. Weird.
  • Bug #98068 - "SELECT FOR UPDATE not-exist table in PERFORMANCE SCHEMA reports confusing error". This is a funny (but still a regression) bug report by William ZHANG. Proper versions work like this:
    mysql> select database();+--------------------+
    | database()         |
    +--------------------+
    | performance_schema |
    +--------------------+
    1 row in set (0.001 sec)

    mysql> select * from not_exist_table;
    ERROR 1146 (42S02): Table 'performance_schema.not_exist_table' doesn't exist
    mysql> select * from not_exist_table for update;
    ERROR 1146 (42S02): Table 'performance_schema.not_exist_table' doesn't exist
  • Bug #98072 - "innochecksum summary shows blob pages as other type of page for 8.0 tables". The bug was reported by SERGEY KUZMICHEV. This time "regression" tag is missing, even though it's clearly stated that MySQL 5.7 worked differently. This is from the proper version:
    ...
    File::..\data\test\blob_test.ibd
    ================PAGE TYPE SUMMARY==============
    #PAGE_COUNT     PAGE_TYPE
    ===============================================
           1        Index page
           0        Undo log page
           1        Inode page
           0        Insert buffer free list page
         508        Freshly allocated page
           1        Insert buffer bitmap
           0        System page
           0        Transaction system page
           1        File Space Header
           0        Extent descriptor page
          64        BLOB page
           0        Compressed BLOB page
           0        Page compressed page
           0        Page compressed encrypted page
           0        Other type of page

    ...
  • Bug #98083 - "Restarting the computer when deleting the database will cause directory residues". One would expect that MySQL 8 with a data dictionary should have some means to figure out the remaining database directory for a dropped database upon startup (as it stores information about databases elsewhere) and do proper cleanup. I think this bug reported by Jinming Liao must be verified and fixed. There is no "... manual creation or deletion of tables or databases..." involved in this case.
  • Bug #98091 - "InnoDB does not initialize raw disk partitions". As simple as that and both 5.7.29 and 8.0.219 are surely affected. It was not always the case, I've used raw devices myself with older MySQL versions, so this bug reported by Saverio M is a regression. Still, "regression" tag is missing.
That's all bugs reported in December, 2019 that I cared to subscribe to  and mention here. Next time I'll check bugs reported in January, 2020. There are at least 16 in my list already, so stay tuned.

Follow the links in this post to get more details about profiling and creating off-CPU FlameGraphs for MySQL. This post is devoted to bugs, though :)


To summarize:
  1. I am happy to see bug reports from people whom I never noticed before. MySQL Community is alive.
  2. Some flexibility in following common sense based bugs verification procedures is still visible. Bugs reported for 5.7 are not checked on 8.0 (or the results of this check are not documented in public), nobody cares to read what bug reporter says carefully or go extra mile, "regression" tag not added, and so on. 
  3. Probably at this stage my writings are mostly ignored by Oracle's decision makers. But I keep watching them all anyway.

Sunday, January 26, 2020

Using bpftrace on Fedora 29 - More Advanced Examples

It happened so that I did not have much time to devote to further eBPF and bpftrace-based tracing and profiling since October 2019, when I posted a simple example of how it can be used on Fedora 29 to "emulate" slow query log (to some extent, by recording queries and time it took to execute them) with dynamic probes. Recently I started preparing for my talk about ftrace, eBPF and bpftrace planned for FOSDEM MySQL, MariaDB and Friends Devroom, but rejected there and later accepted for "MariaDB Day Brussels 0202 2020" (that I suggest everyone to attend, at least if you do not have better plans for the second day of FOSDEM 2020).

So, I decided to check for any ways to use bpftrace as a quick profiler, maybe "embed" it into pt-pmp one day and, surely, get what MariaDB developers ask me most often while working on performance problems or hangs - stack traces. This blog post describes my "achievements" so far on this way.

I had the same version of bpftrace coming from Fedora RPM (that had not updated in the meantime):
[openxs@fc29 server]$ rpm -q -a | grep bpftrace
bpftrace-0.9-0.fc29.x86_64
So, after checking this fine reference guide quickly, I came up with a simple one-liner:
[openxs@fc29 ~]$ sudo bpftrace -e 'profile:hz:99 { @[ustack] = count(); }'
Error creating map: '@'
Error creating stack id map
This might have happened because kernel.perf_event_max_stackis smaller than 127. Try to tweak this value with sysctl kernel.perf_event_max_stack=<new value>
Error creating perf event map (-1)
Attaching 1 probe...
Segmentation fault
Ups! I tried to do what error message suggested:
[openxs@fc29 ~]$ sudo sysctl -w kernel.perf_event_max_stack=1000
kernel.perf_event_max_stack = 1000
But it had not helped. This is how I hit my first problem with bpftrace. Fedora had not cared to update it when it stopped matching "something" in a new kernel, for whatever reason. This is actually a real problem for eBPF programs - what they assume may not be "there" in a kernel, so they have to be checked carefully. There is a separate "BPF Type Format (BTF)" for encoding the debug info related to BPF program (or map) that may help to find out if the program is "compatible" without trying it. In my case incompatibility was obvious and I needed some fast way to proceed with bpftrace, in a hope to come up with some useful examples for my talk.

So, i decided to build new version of bpftrace from recent GitHub source (against bcc tools I had installed and supposedly up to date). This ended up with errors no matter what I tried:
[openxs@fc29 build]$ make
[  3%] Built target parser
[  4%] Built target arch
[  5%] Building CXX object src/ast/CMakeFiles/ast.dir/irbuilderbpf.cpp.o
/mnt/home/openxs/git/bpftrace/src/ast/irbuilderbpf.cpp: In member function 'void bpftrace::ast::IRBuilderBPF::CreateSignal(llvm::Value*)':
/mnt/home/openxs/git/bpftrace/src/ast/irbuilderbpf.cpp:669:16: error: 'BPF_FUNC_send_signal' was not declared in this scope
       getInt64(BPF_FUNC_send_signal),
                ^~~~~~~~~~~~~~~~~~~~
...
Quick search led me to the idea that it's actually a know and recently reported problem, see Issue #1014. As a solution I decided to rebuild bcc tools from GitHub source as well (the alternative to to change bpftrace source) and removed Fedora packages. To make a long story short, this is what I did eventually (from fc -l output):
1051     rm -rf bcc/
1052     rm -rf bpftrace/
1053     sudo dnf remove bcc
1054     git clone https://github.com/iovisor/bcc.git
1055     cd bcc/
1056     git submodule update --init --recursive
1057     sudo dnf install -y bison cmake ethtool flex git iperf libstdc++-static   python-netaddr python-pip gcc gcc-c++ make zlib-devel   elfutils-libelf-devel
1058     sudo dnf install -y luajit luajit-devel
1059     sudo dnf install -y   http://repo.iovisor.org/yum/extra/mageia/cauldron/x86_64/netperf-2.7.0-1.mga6.x86_64.rpm
1060     sudo pip install pyroute2
1061     sudo dnf install -y clang clang-devel llvm llvm-devel llvm-static ncurses-devel
1062     mkdir build
1063     cd build/
1064     cmake .. -DCMAKE_INSTALL_PREFIX=/usr
1065     make
1066     make test
1067     sudo make install
1068     cd ../..
1069     git clone https://github.com/iovisor/bpftrace.git
1070     cd bpftrace/
1071     git submodule update --init --recursive
1072     mkdir build
1073     cd build/
1074     cmake -DCMAKE_BUILD_TYPE=Release ../
1075     make
1076     ./tests/bpftrace_test
1077     fc -l
Basically I did it all as documented in related projects for my version of Fedora. Step highlighted (1056 above) is essential though, as it was missed here in any obvious way as a mandatory (while it is in recent versions). Step 1076 was successful:
...
[----------] Global test environment tear-down
[==========] 350 tests from 9 test cases ran. (2590 ms total)
[  PASSED  ] 350 tests.

[openxs@fc29 build]$ make testRunning tests...
Test project /mnt/home/openxs/git/bpftrace/build
    Start 1: bpftrace_test
1/3 Test #1: bpftrace_test ....................   Passed    2.42 sec
    Start 2: runtime_test
2/3 Test #2: runtime_test .....................***Failed   21.10 sec
    Start 3: tools-parsing-test
3/3 Test #3: tools-parsing-test ...............***Failed    5.88 sec

33% tests passed, 2 tests failed out of 3

Total Test time (real) =  29.41 sec

The following tests FAILED:
          2 - runtime_test (Failed)
          3 - tools-parsing-test (Failed)
Errors while running CTest
make: *** [Makefile:128: test] Error 8
while make test shown some failures (same as make test for bcc tools), but form previous experience this had not looked like a big deal. I continued basic testing that the resulting bpftrace now works (I tried to add kernel probe, guess what software runs on this workstation):
[openxs@fc29 build]$ sudo ./src/bpftrace -e 'kprobe:do_nanosleep { printf("sleep by %s\n", comm); }'
Attaching 1 probe...
sleep by percona-qan-age
sleep by percona-qan-age
sleep by skypeforlinux
sleep by mysqld
...
^C
So, with some confidence built up, I've installed my new and very recent bpftrace tool I built myself:
[openxs@fc29 build]$ sudo make install
[  3%] Built target parser
...
-- Installing: /usr/local/bin/bpftrace
...
-- Installing: /usr/local/man/man8/xfsdist.8.gz

[openxs@fc29 build]$ which bpftrace
/usr/local/bin/bpftrace

[openxs@fc29 build]$ cd ..
[openxs@fc29 bpftrace]$ git log -1
commit afafbf561522dd33fa316be3e33375bc662399ac (HEAD -> master, origin/master, origin/HEAD)
Author: bas smit <bas@baslab.org>
Date:   Fri Jan 17 12:33:43 2020 +0100

    Add `--info` flag

    This prints the supported kernel features
Quick test involved starting MariaDB 10.3 with thread pool enabled (more on "why thread pool?" in some next blog post):
[openxs@fc29 maria10.3]$ ./bin/mysqld_safe --no-defaults --thread_handling=pool-of-threads --thread_pool_idle_timeout=10 &
[1] 32643
and running sysbench:
[openxs@fc29 maria10.3]$ sysbench /usr/local/share/sysbench/oltp_update_index.lua --mysql-host=127.0.0.1 --mysql-user=root --mysql-port=3306 --threads=10 --tables=4 --table-size=1000000 --time=600 --report-interval=5 run
sysbench 1.1.0-174f3aa (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 10
Report intermediate results every 5 second(s)
Initializing random number generator from current time


Initializing worker threads...

Threads started!

[ 5s ] thds: 10 tps: 54.37 qps: 54.37 (r/w/o: 0.00/54.37/0.00) lat (ms,95%): 467.30 err/s: 0.00 reconn/s: 0.00
[ 10s ] thds: 10 tps: 40.60 qps: 40.60 (r/w/o: 0.00/40.60/0.00) lat (ms,95%): 530.08 err/s: 0.00 reconn/s: 0.00
[ 15s ] thds: 10 tps: 40.00 qps: 40.00 (r/w/o: 0.00/40.00/0.00) lat (ms,95%): 520.62 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 10 tps: 38.00 qps: 38.00 (r/w/o: 0.00/38.00/0.00) lat (ms,95%): 623.33 err/s: 0.00 reconn/s: 0.00
[ 25s ] thds: 10 tps: 34.00 qps: 34.00 (r/w/o: 0.00/34.00/0.00) lat (ms,95%): 746.32 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 10 tps: 36.40 qps: 36.40 (r/w/o: 0.00/36.40/0.00) lat (ms,95%): 549.52 err/s: 0.00 reconn/s: 0.00
[ 35s ] thds: 10 tps: 37.80 qps: 37.80 (r/w/o: 0.00/37.80/0.00) lat (ms,95%): 601.29 err/s: 0.00 reconn/s: 0.00
...
and adding the probe like this in another shell:
[openxs@fc29 ~]$ sudo bpftrace -e 'profile:hz:99 { @[ustack] = count(); }' > /tmp/bpf.stack
^C
When I stopped collecting with Ctrl-C, bpftrace printed the content of associative arrays collected (with unique user space stack 9for all processes!) as index and count of time this stack was as a value). No segmentation fault, unlike with the version from package! In the resulting file we can see (among other things):
[openxs@fc29 bpftrace]$ cat /tmp/bpf.stack | more
Attaching 1 probe...

...
@[
    rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned short*, bool, unsigned long, mem_block_info_t**)+31
    page_cur_search_with_match_bytes(buf_block_t const*, dict_index_t const*, dtuple_t const*, page_cur_mode_t, unsigned long*, unsigned long*, unsigned long*, unsigned long*, page_cur_t*)+1773
    btr_cur_search_to_nth_level_func(dict_index_t*, unsigned long, dtuple_t const*, page_cur_mode_t, unsigned long, btr_cur_t*, rw_lock_t*, char const*, unsigned int, mtr_t*, unsigned long)+8316
    btr_pcur_open_low(dict_index_t*, unsigned long, dtuple_t const*, page_cur_mode_t, unsigned long, btr_pcur_t*, char const*, unsigned int, unsigned long, mtr_t*) [clone .constprop.33]+146
    row_search_on_row_ref(btr_pcur_t*, unsigned long, dict_table_t const*, dtuple_t const*, mtr_t*)+85
    row_purge_poss_sec(purge_node_t*, dict_index_t*, dtuple_t const*, btr_pcur_t*, mtr_t*, bool)+503
    row_purge_remove_sec_if_poss_leaf(purge_node_t*, dict_index_t*, dtuple_t const*)+1293
    row_purge_record_func(purge_node_t*, unsigned char*, que_thr_t const*, bool)+2271
    row_purge_step(que_thr_t*)+1057
    que_run_threads(que_thr_t*)+2280
    srv_worker_thread+956
    start_thread+234
]: 1

...

@[
    rec_get_offsets_func(unsigned char const*, dict_index_t const*, unsigned short*, bool, unsigned long, mem_block_info_t**)+31
    btr_search_build_page_hash_index(dict_index_t*, buf_block_t*, rw_lock_t*, unsigned long, unsigned long, unsigned long)+2781
    btr_search_info_update_slow(btr_search_t*, btr_cur_t*)+622
    btr_cur_search_to_nth_level_func(dict_index_t*, unsigned long, dtuple_t const*, page_cur_mode_t, unsigned long, btr_cur_t*, rw_lock_t*, char const*, unsigned int, mtr_t*, unsigned long)+10130
    row_search_mvcc(unsigned char*, page_cur_mode_t, row_prebuilt_t*, unsigned long, unsigned long)+4823
    ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)+338
    handler::ha_index_read_map(unsigned char*, unsigned char const*, unsigned long, ha_rkey_function)+280
    handler::read_range_first(st_key_range const*, st_key_range const*, bool, bool)+98
    handler::multi_range_read_next(void**)+183
    Mrr_simple_index_reader::get_next(void**)+74
    DsMrr_impl::dsmrr_next(void**)+66
    QUICK_RANGE_SELECT::get_next()+39
    rr_quick(READ_RECORD*)+25
    mysql_update(THD*, TABLE_LIST*, List<Item>&, List<Item>&, Item*, unsigned int, st_order*, unsigned long long, bool, unsigned long long*, unsigned long long*)+3560
    mysql_execute_command(THD*)+8446
    Prepared_statement::execute(String*, bool)+979
    Prepared_statement::execute_loop(String*, bool, unsigned char*, unsigned char*)+154
    mysql_stmt_execute_common(THD*, unsigned long, unsigned char*, unsigned char*, unsigned long, bool, bool)+349
    mysqld_stmt_execute(THD*, char*, unsigned int)+37
    dispatch_command(enum_server_command, THD*, char*, unsigned int, bool, bool)+5123
    do_command(THD*)+368
    tp_callback(TP_connection*)+314
    worker_main(void*)+160
    start_thread+234
]: 1

...

@[
    trx_purge(unsigned long, bool)+1054
    srv_purge_coordinator_thread+1872
    start_thread+234
]: 2
@[
    ut_crc32_sw(unsigned char const*, unsigned long)+1611
    buf_calc_page_crc32(unsigned char const*)+61
    buf_flush_init_for_writing(buf_block_t const*, unsigned char*, void*, unsigned long)+872
    buf_flush_write_block_low(buf_page_t*, buf_flush_t, bool)+149
    buf_flush_page(buf_pool_t*, buf_page_t*, buf_flush_t, bool)+607
    buf_flush_page_and_try_neighbors(buf_page_t*, buf_flush_t, unsigned long, unsigned long*)+1627
    buf_do_flush_list_batch(buf_pool_t*, unsigned long, unsigned long)+392
    pc_flush_slot()+2617
    buf_flush_page_cleaner_coordinator.cold.151+1235
    start_thread+234
]: 2

...
@[]: 6474
So, this is the default format to output stacks (one line per call, no address, frame number or anything, symbols properly resolved for my MariaDB built from GitHub source etc). But I had something like this mind eventually (from pt-pmp):
[openxs@fc29 bpftrace]$ pt-pmp
Tue Jan 21 13:52:12 EET 2020

...
     10 syscall(libc.so.6),__io_getevents_0_4(libaio.so.1),LinuxAIOHandler::collect(os0file.cc:1932),LinuxAIOHandler::poll(os0file.cc:2093),os_aio_linux_handler(stl_vector.h:805),os_aio_handler(stl_vector.h:805),fil_aio_wait(fil0fil.cc:4435),io_handler_thread(srv0start.cc:326),start_thread(libpthread.so.0),clone(libc.so.6)
      6 pthread_cond_wait,os_event::wait(sync0types.h:476),os_event::wait_low(sync0types.h:476),log_write_up_to(log0log.cc:929),trx_flush_log_if_needed_low(trx0trx.cc:1218),trx_flush_log_if_needed(trx0trx.cc:1218),trx_commit_complete_for_mysql(trx0trx.cc:1218),innobase_commit(ha_innodb.cc:4637),commit_one_phase_2(handler.cc:1642),ha_commit_one_phase(handler.cc:1622),ha_commit_trans(handler.cc:1484),trans_commit_stmt(transaction.cc:505),mysql_execute_command(sql_parse.cc:6106),Prepared_statement::execute(sql_prepare.cc:4807),Prepared_statement::execute_loop(sql_prepare.cc:4235),Prepared_statement::execute_loop(sql_prepare.cc:4235),mysql_stmt_execute_common(sql_prepare.cc:3235),mysqld_stmt_execute(sql_prepare.cc:3132),dispatch_command(sql_parse.cc:1798),do_command(sql_parse.cc:1401),threadpool_process_request(threadpool_common.cc:359),tp_callback(threadpool_common.cc:359),worker_main(threadpool_generic.cc:1606),start_thread(libpthread.so.0),clone(libc.so.6)
      3 pthread_cond_wait,os_event::wait(sync0types.h:476),os_event::wait_low(sync0types.h:476),srv_resume_thread(srv0srv.cc:2533),srv_worker_thread(srv0srv.cc:2533),start_thread(libpthread.so.0),clone(libc.so.6)
      3 pthread_cond_timedwait,inline_mysql_cond_timedwait(mysql_thread.h:1215),get_event(mysql_thread.h:1215),worker_main(threadpool_generic.cc:1602),start_thread(libpthread.so.0),clone(libc.so.6)
      ...
That is, number of times collapsed stack was noted and then the stack itself, without function arguments if possible. This kind of stack is generated from gdb outputs that looks as follows:
Thread 1 (Thread 0x7f29eee66900 (LWP 32728)):
#0  0x00007f29ef25f431 in poll () from /lib64/libc.so.6
#1  0x000055cc4ece656f in poll (__timeout=-1, __nfds=2, __fds=0x7ffe174108e0) at /usr/include/bits/poll2.h:41
#2  handle_connections_sockets () at /mnt/home/openxs/git/server/sql/mysqld.cc:6753
#3  0x000055cc4ece7a7c in mysqld_main (argc=<optimized out>, argv=<optimized out>) at /mnt/home/openxs/git/server/sql/mysqld.cc:6222
#4  0x00007f29ef190413 in __libc_start_main () from /lib64/libc.so.6
#5  0x000055cc4ecd9ece in _start () at /mnt/home/openxs/git/server/sql/sql_list.h:158
So, each tread stack starts from 'Thread..." line, each frame is numbered and function call is actually the fourth word in the output (in awk terms).
 
Based on that I tried add probe to produce stack in perf format for mysqld process, with some decoration to have a clear marker of when new capture starts:
[openxs@fc29 bpftrace]$ sudo bpftrace -e 'profile:hz:99 /comm == "mysqld"/ {printf("# %s\n", ustack(perf));}' > /tmp/ustack.txt
If you wonder about the details of command format, functions used etc, check the Reference Guide. The result looks as follows in hexdump -C output:
00000000  41 74 74 61 63 68 69 6e  67 20 31 20 70 72 6f 62  |Attaching 1 prob|
00000010  65 2e 2e 2e 0a 23 20 0a  09 37 66 32 39 65 66 62  |e....# ..7f29efb|
00000020  38 34 33 30 30 20 6e 61  6e 6f 73 6c 65 65 70 2b  |84300 nanosleep+|
00000030  36 34 20 28 2f 75 73 72  2f 6c 69 62 36 34 2f 6c  |64 (/usr/lib64/l|
00000040  69 62 70 74 68 72 65 61  64 2d 32 2e 32 38 2e 73  |ibpthread-2.28.s|
00000050  6f 29 0a 0a 23 20 0a 09  35 35 63 63 34 66 32 30  |o)..# ..55cc4f20|
...
This is how you can see where are the new lines, how many etc, for proper parsing. Next step was to check pt-pmp source code and http://poormansprofiler.org/, review and adapt some awk magic used there to collapse stacks, get rid of function arguments etc, and I've ended up with this draft version:
[openxs@fc29 maria10.3]$ cat /tmp/ustack.txt | awk '
BEGIN { s = ""; }
/^\#/ { print s; s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s "," targ } else { s = targ } }
END { print s }' | sort | uniq -c | sort -r -n -k 1,1 | more

    199 __sched_yield,srv_purge_coordinator_thread,start_thread
     16 srv_get_task_queue_length,trx_purge,srv_purge_coordinator_thread,start_thread
     13 trx_purge,srv_purge_coordinator_thread,start_thread
      4 srv_purge_coordinator_thread,start_thread
      4 srv_get_task_queue_length,srv_purge_coordinator_thread,start_thread
      2 ut_crc32_sw,buf_calc_page_crc32,buf_page_is_corrupted,buf_page_io_complete,buf_read_page,buf_page_get_gen,btr_cur_search_to_nth_level_func,row_search_mvcc,ha_innobase::index_read,handler::ha_index_read_map,handler::read_range_first,handler::multi_range_read_next,Mrr_simple_index_reader::get_next,DsMrr_impl::dsmrr_next,QUICK_RANGE_SELECT::get_next,rr_quick,mysql_update,mysql_execute_command,Prepared_statement::execute,Prepared_statement::execute_loop,mysql_stmt_execute_common,mysqld_stmt_execute,dispatch_command,do_command,tp_callback,worker_main,start_thread
      2 syscall,
      2 __pwrite,
      2 mtr_t::commit,row_ins_sec_index_entry_low,row_ins_sec_index_entry,row_upd_sec_index_entry,row_upd_step,row_update_for_mysql,ha_innobase::update_row,handler::ha_update_row,mysql_update,mysql_execute_command,Prepared_statement::execute,Prepared_statement::execute_loop,mysql_stmt_execute_common,mysqld_stmt_execute,dispatch_command,do_command,tp_callback,worker_main,start_thread
...
The above looks reasonable (even if not yet a direct match to what pt-pmp produces, I am too lazy to do much awk these days).

What's more interesting is that with bpftrace we can also get kernel stack for thread while profiling. I took a quick look at what we may get as a result of combining both:
[openxs@fc29 bpftrace]$ sudo bpftrace -e 'profile:hz:99 /comm == "mysqld"/ {printf("# %s\n%s\n", kstack(perf), ustack(perf));}' > /tmp/kstack.txt
[sudo] password for openxs:
^C
I just pring one after the other without any separator in perf format, so I can apply the same trick as above to the result:
openxs@fc29 maria10.3]$ cat /tmp/kstack.txt | awk '                   BEGIN { s = ""; }
/^\#/ { print s; s = ""; }
/^\t/ { if (index($2, "(") > 0) {targ = substr($2, 1, index($2, "(") - 1)} else {targ = substr($2, 1, index($2, "+") - 1)} ; if (s != "") { s = s "," targ } else { s = targ } }
END { print s }' | sort | uniq -c | sort -r -n -k 1,1 | more

     93 __sched_text_start,schedule,__x64_sys_sched_yield,do_syscall_64,entry_SYSCALL_64_after_hwframe,__sched_yield,srv_purge_coordinator_thread,start_thread
     49 __sched_yield,srv_purge_coordinator_thread,start_thread
     42 do_syscall_64,entry_SYSCALL_64_after_hwframe,__sched_yield,srv_purge_coordinator_thread,start_thread
     20 srv_get_task_queue_length,trx_purge,srv_purge_coordinator_thread,start_thread
     11 trx_purge,srv_purge_coordinator_thread,start_thread
      2 srv_get_task_queue_length,srv_purge_coordinator_thread,start_thread
      1 __x86_indirect_thunk_rax,do_syscall_64,entry_SYSCALL_64_after_hwframe,__sched_yield,srv_purge_coordinator_thread,start_thread
      1 ut_crc32_sw,log_block_calc_checksum_crc32,log_write_up_to,trx_commit_complete_for_mysql,innobase_commit,commit_one_phase_2,ha_commit_trans,trans_commit_stmt,mysql_execute_command,Prepared_statement::execute,Prepared_statement::execute_loop,mysql_stmt_execute_common,mysqld_stmt_execute,dispatch_command,do_command,tp_callback,worker_main,start_thread
...
      1 _raw_spin_unlock_irqrestore,ata_scsi_queuecmd,scsi_queue_rq,blk_mq_dispatch_rq_list,blk_mq_do_dispatch_sched,blk_mq_sched_dispatch_requests,__blk_mq_run_hw_queue,__blk_mq_delay_run_hw_queue,blk_mq_run_hw_queue,blk_mq_sched_insert_requests,blk_mq_flush_plug_list,blk_flush_plug_list,blk_finish_plug,ext4_writepages,do_writepages,__filemap_fdatawrite_range,file_write_and_wait_range,ext4_sync_file,do_fsync,__x64_sys_fsync,do_syscall_64,entry_SYSCALL_64_after_hwframe,fsync,fil_flush_low,fil_flush_file_spaces,buf_dblwr_update,buf_page_io_complete,fil_aio_wait,io_handler_thread,start_thread
      1 page_simple_validate_new,buf_dblwr_flush_buffered_writes,pc_flush_slot,buf_flush_page_cleaner_coordinator.cold.151,start_thread
      1 page_cur_insert_rec_write_log,page_cur_insert_rec_low,page_copy_rec_list_end_no_locks,btr_page_reorganize_low,btr_can_merge_with_page,btr_compress,btr_cur_compress_if_useful,btr_cur_pessimistic_delete,row_purge_remove_sec_if_poss_tree,row_purge_record_func,row_purge_step,que_run_threads,srv_worker_thread,start_thread
      1 os_thread_yield,srv_purge_coordinator_thread,start_thread
      1 os_thread_get_curr_id,fseg_free_page_func,btr_page_free,btr_compress,btr_cur_compress_if_useful,btr_cur_pessimistic_delete,row_purge_remove_sec_if_poss_tree,row_purge_record_func,row_purge_step,que_run_threads,srv_worker_thread,start_thread
...
Looks cool to me, we see where we spend time in kernel while on-CPU, not only in user space code.

It's sometimes scary and not that easy to use bpftrace, but you'll get there eventually, same as we got to North Greenwich via that Emirates Air Line back in 2019. The only limit is the sky!
* * *
To summarize:
  1. bpftrace and bcc tools you get from vendor packages may not work as expected, so be ready to get nightly builds etc or, as I prefer, build from source. If this is a problem better use ftrace :) 
  2. It seems getting stack traces with bpftrace for pt-pmp is an easy next stap to make. I may be integrated as one of tracers supported here.
  3. Now we can also add kernel stack to the picure.
  4. "JUST DO BPF"!
I have many more details to share about dynamic tracing for MySQL and MariaDB DBAs. Stay tuned!

Saturday, January 25, 2020

Dynamic Tracing of MariaDB Server With ftrace - Basic Example

Today I'd like to continue my series of blog posts about dynamic tracing of MySQL server (or Percona Server for MySQL, or MariaDB, or basically any other software, whatever you prefer) on Linux. This post ideally had to start it, as the approach discussed here is truly dynamic and suitable production servers under high load (unlike gdb tricks, most of the time), works more or less the same way starting from Linux 2.6.27, is "always there" at your fingertips (if you are root, but this is a requirement in a general case for all kinds of dynamic tracing), and does not even strictly require -debuginfo packages with symbolic information for basic usage. I mean ftrace.

To prove that this interface that appeared in 2008 is still usable, let me add the same kind of probe for tracing dispatch_command() function and printing SQL statement that somebody tries to execute (so we can avoid enabling general query log etc) with pure ftrace interface, without much explanations or background (that is presented below). In this case I use recent MariaDB Server 10.3.22 built from source on my Ubuntu 16.04 netbook (because I do not care about being shared in the feed of Planet MySQL anymore in any case).

You just have to do the following:
  1. Check that dynamic tracing with ftrace is possible:
    openxs@ao756:~$ mount | grep tracefs
    tracefs on /sys/kernel/debug/tracing type tracefs (rw,relatime)
    If you do not see tracefs mounted, mount it with the following command (as root):
    mount -t tracefs nodev /sys/kernel/tracing
  2. Find the address of (mangled) dispatch_command() function in the mysqld binary:
    openxs@ao756:~$ objdump -T /home/openxs/dbs/maria10.3/bin/mysqld | grep dispatch_command
    0000000000587b90
    g    DF .text  000000000000236e  Base        _Z16dispatch_command19enum_server_commandP3THDPcjbb
    The address you need (in many cases) is the first word in the output, in hex. So, 0x0000000000587b90
  3. If you want to print values of parameters passed, you need to know how they are passed by the compiler used on the architecture used. Let me share the quote that applies to x86_64:
    "First six arguments are in rdi, rsi, rdx, rcx, r8d, r9d; remaining arguments are on the stack."
    In my case (see this post or details below, or check the source code yourself if in doubts) we need to print third argument, so it must be in the rdx (64-bit) register. For ftrace we have to use just %dx, trust me for now, I tried ;)
  4. Check probe syntax (see the picture below for quick reference) and come up with something like this to add a probe named "dc". Note that on Ubuntu 16.04 this had not worked via sudo directly, I had to become root via "sudo su -" (weird, something to check one day):
    root@ao756:~# echo 'p:dc /home/openxs/dbs/maria10.3/bin/mysqld:0x0000000000587b90 query=+0(%dx):string' > /sys/kernel/debug/tracing/uprobe_events
    All these happens while MariaDB server is running and procerssing queries, in a truly dynamic manner. There is no impact at all so far. Creators of MariaDB never added anything ot the code (or to the build process) to make this work. It's enough to have a binary that runs of Linux kernel that supports ftrace.
  5. Enable tracing for the probe added (this is when magic happens) and tracing in general:
    root@ao756:~# echo 1 > /sys/kernel/debug/tracing/events/uprobes/dc/enable
    root@ao756:~# echo 1 > /sys/kernel/debug/tracing/tracing_on
  6. Now assuming you run the following:
    root@ao756:~# mysql -uroot --socket=/tmp/mariadb.sock
    Welcome to the MySQL monitor.  Commands end with ; or \g.
    Your MySQL connection id is 92
    Server version: 5.5.5-10.3.22-MariaDB Source distribution

    Copyright (c) 2009-2019 Percona LLC and/or its affiliates
    Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.

    Oracle is a registered trademark of Oracle Corporation and/or its
    affiliates. Other names may be trademarks of their respective
    owners.

    Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

    mysql> select 1+1;
    +-----+
    | 1+1 |
    +-----+
    |   2 |
    +-----+
    1 row in set (0.00 sec)

    mysql> select version();
    +-----------------+
    | version()       |
    +-----------------+
    | 10.3.22-MariaDB |
    +-----------------+
    1 row in set (0.00 sec)
    you may get the trace as follows, for example:
    root@ao756:~# cat /sys/kernel/debug/tracing/trace
    # tracer: nop
    #
    # entries-in-buffer/entries-written: 9851/9851   #P:2
    #
    #                              _-----=> irqs-off
    #                             / _----=> need-resched
    #                            | / _---=> hardirq/softirq
    #                            || / _--=> preempt-depth
    #                            ||| /     delay
    #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
    #              | |       |   ||||       |         |
              python-21369 [001] d... 209044.826580: : sys_wr: 0000000000000008
    ...
    (I had show the top of the output to give hint about format), or:
    root@ao756:~# cat /sys/kernel/debug/tracing/trace_pipe
              mysqld-1082  [000] d... 273258.971401: dc: (0x560d8a20fb90) query="select @@version_comment limit 1"
              mysqld-1082  [000] d... 273263.187839: dc: (0x560d8a20fb90) query="select 1+1"
              mysqld-1082  [001] d... 273269.128542: dc: (0x560d8a20fb90) query="select version()"
    ...
  7. If you want to stop this basic tracing, run these:
    root@ao756:~# echo 0 > /sys/kernel/debug/tracing/events/uprobes/dc/enable
    root@ao756:~# echo 0 > /sys/kernel/debug/tracing/tracing_on
So, with 7 simple steps we can enable dynamic tracing and see SQL statements executed and time when this happened.
Check https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/trace/uprobetracer.rst
The magic behind ftrace and steps that I really had to perform when I tried to apply this approach for the very first time will be presented in the next posts. Stay tuned! 

Monday, January 20, 2020

Dynamic Tracing of MariaDB Server With bcc trace - Basic Example

This is a yet another blog post in my series about dynamic tracing of MySQL server (and friends) on Linux. Logically it had to appear after this one about perf and another one about bpftrace. For older Linux systems or when you are in a hurry with customer and have no time to upgrade, build from source etc perf just works and is really flexible (but it comes with some cost of writing many samples to disk and then processing them). For happy users of Linux with kernels 4.9+ (the newer the better), like recent Ubuntu, RHEL 8, Debian 9+ or Fedora there entire world of new efficient tracing with bpftrace is open and extending with every new kernel release.

For those in between, like me with this Ubuntu 16.04:
openxs@ao756:~/git/bcc/build$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.6 LTS"
openxs@ao756:~/git/bcc/build$ uname -a
Linux ao756 4.4.0-171-generic #200-Ubuntu SMP Tue Dec 3 11:04:55 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
the fancy world of eBPF and more efficient dynamic tracing is still mostly open, as we can try to use BCC tools. BCC is a toolkit for creating efficient kernel tracing and manipulation programs that includes several potentially useful tools and examples for MySQL DBAs. It makes use of extended BPF (Berkeley Packet Filters), formally known as eBPF.

I have a draft of this blog post hanging around since October 2019, but every time I tried to complete it I was not happy with the content. I wanted to get back, test more, try to present more tools, find out how to be able to access structure members in probes as easy as I can do it with gdb or perf, but then I hit some problem and put the draft aside...

When I started again some time later I often hit some new problem, so today I just decided to finally write down what I already know for sure, and provide at least a very basic example of dynamic tracing along the lines of those used in earlier posts (capturing queries executed by different threads using dynamic probes).

The first problem in case of Ubuntu 16.04 is to get the binaries of BCC tools. One of the ways is to build from GitHub source. The INSTALL.md document is clear enough when describing build dependencies and steps:
git clone https://github.com/iovisor/bcc.git
mkdir bcc/build; cd bcc/build
cmake .. -DCMAKE_INSTALL_PREFIX=/usr
make
sudo make install
but still there is something to note. In recent versions you surely have to update the libbpf submodule, or you'll end up with compilation errors at early stage. My steps today were the following:
openxs@ao756:~/dbs/maria10.3$ cd ~/git/bcc/
openxs@ao756:~/git/bcc$ git pull
Already up-to-date.
openxs@ao756:~/git/bcc$ git log -1
commit dce8e9daf59f44dec4e3500d39a82a8ce59e43ba
Author: Yonghong Song <yhs@fb.com>
Date:   Fri Jan 17 22:06:52 2020 -0800

    sync with latest libbpf repo

    sync libbpf submodule upto the following commit:
        commit 033ad7ee78e8f266fdd27ee2675090ccf4402f3f
        Author: Andrii Nakryiko <andriin@fb.com>
        Date:   Fri Jan 17 16:22:23 2020 -0800

            sync: latest libbpf changes from kernel

    Signed-off-by: Yonghong Song <yhs@fb.com>

openxs@ao756:~/git/bcc$ git submodule init
openxs@ao756:~/git/bcc$ git submodule update
Submodule path 'src/cc/libbpf': checked out '033ad7ee78e8f266fdd27ee2675090ccf4402f3f'
Now I can proceed to build subdirectory and complete the build:
openxs@ao756:~/git/bcc/build$ cmake .. -DCMAKE_INSTALL_PREFIX=/usr...
openxs@ao756:~/git/bcc/build$ make
...
[ 99%] Building CXX object tests/cc/CMakeFiles/test_libbcc.dir/test_usdt_probes.cc.o
[100%] Building CXX object tests/cc/CMakeFiles/test_libbcc.dir/utils.cc.o
[100%] Linking CXX executable test_libbcc
[100%] Built target test_libbcc
It's always interesting to check if tests pass:
openxs@ao756:~/git/bcc/build$ make test
Running tests...
Test project /home/openxs/git/bcc/build
      Start  1: style-check
 1/40 Test  #1: style-check ......................   Passed    0.01 sec
      Start  2: c_test_static
 2/40 Test  #2: c_test_static ....................   Passed    0.30 sec
...
40/40 Test #40: lua_test_standalone ..............***Failed    0.06 sec

75% tests passed, 10 tests failed out of 40

Total Test time (real) = 450.78 sec

The following tests FAILED:
          3 - test_libbcc (Failed)
          4 - py_test_stat1_b (Failed)
          5 - py_test_bpf_log (Failed)
          6 - py_test_stat1_c (Failed)
          7 - py_test_xlate1_c (Failed)
          8 - py_test_call1 (Failed)
         16 - py_test_brb (Failed)
         17 - py_test_brb2 (Failed)
         18 - py_test_clang (Failed)
         40 - lua_test_standalone (Failed)
Errors while running CTest
Makefile:105: recipe for target 'test' failed
make: *** [test] Error 8
I've always had some tests failed, and one day I probably have to report the issue for the project, but for the purpose of this post (based on previous experience with older code) I'll get at least trace tool working as expected. So, I decided to process with installation:
openxs@ao756:~/git/bcc/build$ sudo make install
...
-- Up-to-date: /usr/share/bcc/tools/old/stackcount
-- Up-to-date: /usr/share/bcc/tools/old/oomkill
The tools are installed by default to /usr/share/bcc/tools.

For adding dynamic probes I'll use trace tool  that probes functions you specify and displays trace messages if a particular condition is met. You can control the message format to display function
arguments and return values.

Brendan Gregg explains the usage of this and other tools here with a lot of details. I'll just add a nice chart from that page here:
There is a separate tutorial with examples. You may want to check section for trace tool there.

For the purpose of this blog post I think it's enough to quickly check help output:
openxs@ao756:~/git/bcc/build$ sudo /usr/share/bcc/tools/trace
usage: trace [-h] [-b BUFFER_PAGES] [-p PID] [-L TID] [-v] [-Z STRING_SIZE]
             [-S] [-M MAX_EVENTS] [-t] [-u] [-T] [-C] [-c CGROUP_PATH]
             [-n NAME] [-f MSG_FILTER] [-B] [-s SYM_FILE_LIST] [-K] [-U] [-a]
             [-I header]
             probe [probe ...]
trace: error: too few arguments
and note the following basic syntax is used to define probes (see man /usr/share/bcc/man/man8/trace.8.gz after buuilding the tools from source as described above):
PROBE SYNTAX
       The general probe syntax is as follows:

       [{p,r}]:[library]:function[(signature)]      [(predicate)]     ["format
       string"[, arguments]]

       {t:category:event,u:library:probe}  [(predicate)]  ["format   string"[,
       arguments]]

       {[{p,r}],t,u}
              Probe  type  -  "p" for function entry, "r" for function return,
              "t" for kernel tracepoint, "u" for USDT probe. The default probe
              type is "p".
...
Fir simplicity here we do not consider conditional probes, so predicate is skipped. At the moment we are not interested in kernel or or user defined static tracepoints (the are not defined in default recent builds of MySQL or MariaDB server anway and require -DENABLE_DTRACE=ON to be explictly added to cmake command line used). For user defined dynamic probes in the mysqld process we need p (for probe at function entry) and maybe r (for function return).

We need to refer to library, and in our case this is a full path name to the mysqld binary (or just mysqld if it's in PATH). We also need to refer to some function by names. Quick test will show you that by default trace does NOT accept plain function names in MySQL or MariaDB code (as perf does), and require mangled ones to be used (same as bpftrace). We can find the names with nm command:
openxs@ao756:~/git/bcc/build$ nm -na /home/openxs/dbs/maria10.3/bin/mysqld | grep dispatch_command
00000000004a1eef t _Z16dispatch_command19enum_server_commandP3THDPcjbb.cold.344
00000000005c5180 T _Z16dispatch_command19enum_server_commandP3THDPcjbb
00000000005c5180 t _Z16dispatch_command19enum_server_commandP3THDPcjbb.localalias.256
In the example above I was specifically looking for dispatch_command() function of MariaDB server version 10.3.x that I assume (see previous post) has a string with SQL statement as the third argument, packet. So, I can refer to this function in probe as "_Z16dispatch_command19enum_server_commandP3THDPcjbb".

The "format string" that define how to output arguments of probe  is a printf-style format string.  You  can  use the following format specifiers: %s, %d%u, ...  with the same semantics as printf's. In our case for zero-terminating string we'll use "%s".

Arguments of the function traced are named arg1, arg2, ... argN (unless we provide a signature for function), and are numbered starting from 1. So, in out case we can add a probe to print third argument of the dispatch_command() function upon entry as follows:
openxs@ao756:~/dbs/maria10.3$ sudo /usr/share/bcc/tools/trace -T 'p:/home/openxs/dbs/maria10.3/bin/mysqld:_Z16dispatch_command19enum_server_commandP3THDPcjbb "%s" arg3'
[sudo] password for openxs:
TIME     PID     TID     COMM            FUNC             -
16:16:53 26585   29133   mysqld          _Z16dispatch_command19enum_server_commandP3THDPcjbb select @@version_comment limit 1
16:17:02 26585   29133   mysqld          _Z16dispatch_command19enum_server_commandP3THDPcjbb select 1
16:17:05 26585   29133   mysqld          _Z16dispatch_command19enum_server_commandP3THDPcjbb select 2
16:17:07 26585   29133   mysqld          _Z16dispatch_command19enum_server_commandP3THDPcjbb
^C
I've got the output above for this sample session:
openxs@ao756:~/dbs/maria10.3$ bin/mysql -uroot --socket=/tmp/mariadb.sock
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 16
Server version: 10.3.22-MariaDB Source distribution

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> select 1;
+---+
| 1 |
+---+
| 1 |
+---+
1 row in set (0,000 sec)

MariaDB [(none)]> select 2;
+---+
| 2 |
+---+
| 2 |
+---+
1 row in set (0,001 sec)

MariaDB [(none)]> exit
Bye
Note that I've added -T option to the command line to output timestamp.
  
It's a bit more complex with recent versions of Percona Server or MySQL (or any function parameters that are complex structures). It's also more complex if we want to process prepared statements and process packet content depending on the first argument (the above is correct only for COM_QUERY) and so on. But these are basic steps to get a log of SQL queries with timestamps by adding a dynamic probe with BCC trace tool. Enjoy!

Saturday, January 18, 2020

Fun with Bugs #92 - On MySQL Bug Reports I am Subscribed to, Part XXVI

I'd like to continue reviewing MySQL bug reports from Community users that I considered interesting and subscribed to. Unlike in the previous post in this series, I am not going to check test cases on any competitor product, but will use only recently released MySQL 5.7.29 and 8.0.19 for checks, if any. This time I'll concentrate on bugs reported in November 2019.

As usual, I mostly care about optimizer, InnoDB and replication related bugs. Here is the list:
  • Bug #97476 - "Range optimizer skips rows". This bug reported by Ilya Raudsepp looks like a clear regression in MySQL 8.0.x comparing to MySQL 5.7.x at least. I get the following correct results with 5.7.29:
    mysql> SELECT t.id
        -> FROM Test t
        -> JOIN (
        ->     SELECT item_id, MAX(created_at) AS created_at
        ->     FROM Test t
        ->     WHERE (platform_id = 2) AND (item_id IN (3,2,111)) AND (type = 'Default')
        ->     GROUP BY item_id
        -> ) t2 ON t.item_id = t2.item_id
        ->   t.item_id = t2.item_id
        ->   AND t.created_at = t2.created_at
        ->   AND t.type = 'Default'
        -> WHERE t.platform_id = 2;
    +----+
    | id |
    +----+
    |  6 |
    |  3 |
    |  5 |
    +----+
    3 rows in set (0,03 sec)

    mysql> select version();
    +-----------+
    | version() |
    +-----------+
    | 5.7.29    |
    +-----------+
    1 row in set (0,02 sec)
  • Bug #97531 - "5.7 replication breakage with syntax error with GRANT management". This tricky bug reported by Simon Mudd applies also to MySQL 8.0.x. It is closed as fixed, but the fix had not made it to recent 5.7.29 and 8.0.19 releases, so you'll have to wait for few more months.
  • Bug #97552 - "Regression: LEFT JOIN with Impossible ON condition performs slowly". Yet another optimizer regression in MySQL 8 (comparing to 5.7.x) that is fixed only in MySQL 8.0.20+. The bug was reported by Fredric Johansson.
  • Bug #97648 - "Bug in order by clause in union clause". Yet another regression (at least from user's point of view) in recent MySQL 5.7.x and 8.0.x comparing to 5.6.x. This time without a "regression" tag. The bug was reported by Andrei Mart.
  • Bug #97662 - "MySQL v8.0.18 FIPS mode is no longer supported". According to Ryan L, MySQL 8.0.18+ is no longer supporting ssl_fips_mode=STRICT, as OpenSSL 1.1.1 is not FIPS-compatible and MySQL Server must be compiled using OpenSSL 1.1.1 or higher. That's interesting. Check also this link.
  • Bug #97682 - "Handler fails to trigger on Error 1049 or SQLSTATE 42000 or plain sqlexception". This regression (comparing to MySQL 5.7) was reported by Jericho Rivera. It is fixed in MySQL 8.0.20. The patch was provided by Kamil Holubicki.
  • Bug #97692 - "Querying information_schema.TABLES issue". I do not see any documented attempt to check on MySQL 8.0, so I had to add a comment to the bug report. From what I see, in MySQL 8.0.19 we still get different (empty) result from the second query, but at least now we have a warning:
    mysql> SELECT ts.TABLE_SCHEMA
        -> FROM information_schema.TABLES ts
        -> WHERE ts.TABLE_TYPE ='VIEW'
        -> AND ts.TABLE_SCHEMA NOT IN ('sys')
        -> AND ts.TABLE_COMMENT LIKE '%invalid%';
    +--------------+
    | TABLE_SCHEMA |
    +--------------+
    | test         |
    +--------------+
    1 row in set, 1 warning (0,00 sec)

    mysql> show warnings\G
    *************************** 1. row ***************************
      Level: Warning
       Code: 1356
    Message: View 'test.v' references invalid table(s) or column(s) or function(s) or definer/invoker of view lack rights to use them
    1 row in set (0,00 sec)

    mysql> select version();
    +-----------+
    | version() |
    +-----------+
    | 8.0.19    |
    +-----------+
    1 row in set (0,00 sec)
    The bug was reported by Vinicius Malvestio Grippa.
  • Bug #97693 - "ALTER USER user IDENTIFIED BY 'password' broken by invalid authentication_string". The bug was reported by Nikolai Ikhalainen. MySQL 8.0.19 is still affected.
  • Bug #97694 - "MySQL 8.0.18 fails on STOP SLAVE/START SLAVE stress test". For some reason I do not see any documented attempt to verify this on MySQL 5.7 also. The bug was reported by Przemysław Skibiński, who also suggested a fix.
  • Bug #97734 - "Document the correct method to stop slaving with MTS without a warning or error". I can only agree with this request from Buchan Milne. Please. do :)
  • Bug #97735 - "ALTER USER IF EXISTS ... WITH_MAX_USER_CONNECTIONS 9999 not applied correctly". yet another bug report by Simon Mudd in this list. For some reason, again, I do not see any documented attempt to verify the bug on MySQL 8.0.x, while there is no clear reason to think it is not affected.
  • Bug #97742 - "bad item ref from correlated subquery to outer distinct table". This bug was reported by Song Zhibai, who also had contributed a patch. Based on further comments from  Øystein Grøvlen and these results:
    mysql> EXPLAIN SELECT f3 FROM t1 HAVING (SELECT 1 FROM t2 HAVING f2 LIMIT 1);
    +----+--------------------+-------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
    | id | select_type        | table | partitions | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra       |
    +----+--------------------+-------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
    |  1 | PRIMARY            | t1    | NULL       | ALL   | NULL          | NULL    | NULL    | NULL |    3 |   100.00 | NULL        |
    |  2 | DEPENDENT SUBQUERY | t2    | NULL       | index | NULL          | PRIMARY | 4       | NULL |    1 |   100.00 | Using index |
    +----+--------------------+-------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
    2 rows in set, 2 warnings (0,00 sec)

    mysql> show warnings\G
    *************************** 1. row ***************************
      Level: Note
       Code: 1276
    Message: Field or reference 'f2' of SELECT #2 was resolved in SELECT #1
    *************************** 2. row ***************************
      Level: Note
       Code: 1003
    Message: /* select#1 */ select `test`.`t1`.`f3` AS `f3` from `test`.`t1` having (/* select#2 */ select 1 from `test`.`t2` having `test`.`t1`.`f2` limit 1)
    2 rows in set (0,00 sec)

    mysql> select version();
    +-----------+
    | version() |
    +-----------+
    | 5.7.29    |
    +-----------+
    1 row in set (0,00 sec)
    I'd say that MySQL 5.7.x is also affected, but for some reason nobody documented any attempt to verify it there. So, I've added a comment.
  • Bug #97777 - "separate global variables (from hot variables) using linker script (ELF)". Beautiful bug report from Daniel Black.With a lot of details, perf and readelf outputs and patch contributed. See also his Bug #97822 - "buf_page_get_gen buf_pool->stat.n_page_gets++ is a cpu waste", with perf analysis up to a single assembler instruction level and fix suggested.
  • Bug #97825 - "dd_mdl_acquire in dd_table_open with dict_sys->mutex hold may cause deadlock". Here I am really puzzled by no visible attempt to check the arguments of bug reporter, Dave Do, who tried to perform lock order analysis by code review. All we see as a result is this:
    "Lock order could be different, but it is irrelevant, since these are locks on totally different levels and can't, in themselves, cause any deadlock."
    What a great argument! Not a bug, surely... We trust you.
"What bugs are you talking about? I have no bugs, neither does MySQL 8!"
To summarize:
  1. MySQL 8 introduces some optimizer (and some other) regressions. They seem to be fixed fast enough, but I wonder why only Community users were able to find them not Oracle's QA...
  2. MySQL 8.0.19 is surely great, but I see many serious bugs fixed o0nly in 8.0.20+.
  3. Percona, Booking and Facebook engineers still continue contributing high quality bug reports, comments/verification details and patches. Oracle is lucky to have such nice partners in making MySQL better.
  4. I still see problems with following proper verification procedures and documenting the results. Too often the bug reported for 8.0.x is NOT checked on 5.7.x as well, regression tag is not set, and so on. Sometimes reports are closed as "Not a bug" without any attempt to follow the analysis provided or prove the point. This is sad and wrong.