Saturday, November 30, 2019

Fun with Bugs #90 - On MySQL Bug Reports I am Subscribed to, Part XXIV

Previous post in this series was published 3 months ago and the last Bug #96340 from it is already closed as fixed in upcoming MySQL 8.0.19. I've picked up 50+ more bugs to follow since that time, so I think I should send quick status update about interesting public MySQL bug reports that are still active.

As usual I concentrate mostly on InnoDB, replication and optimizer bugs. Here is the list, starting from the oldest:
  • Bug #96374  - "binlog rotation deadlock when innodb concurrency limit setted". This bug was reported by Jia Liu, who used gdb to show threads deadlock details. I admit that recently more bug reporters use gdb and sysbench with custom(ized) Lua scripts to prove the point, and I am happy to see this happening!
  • Bug #96378 - "Subquery with parameter is exponentially slower than hard-coded value". In my primitive test with user variables replaced by constants (on MariaDB 10.3.7) I get the same plan for the query, so I am not 100% sure that the analysis by my dear friend Sinisa Milivojevic was right and it's about optimization (and not comparing values with different collations, for example). But anyway, this problem reported by Jeff Johnson ended up as a verified feature request. Let's see what may happen to it next.
  • Bug #96379 - "First query successful, second - ERROR 1270 (HY000): Illegal mix of collations ". This really funny bug was reported by Владислав Сокол.
  • Bug #96400 - "MTS STOP SLAVE takes over a minute when master crashed during event logging". Nice bug report by Przemyslaw Malkowski from Percona, who used sysbench and dbdeployer to demonstrate the problem. Later Przemysław Skibiński (also from Percona) provided a patch to resolve the problem.
  • Bug #96412 - "Mess usages of latch meta data for InnoDB latches (mutex and rw_lock)". Fungo Wang had to make a detailed code analysis to get this bug verified. I am not sure why it ended up with severity S6 (Debug Builds) though.
  • Bug #96414 - "CREATE TABLE events in wrong order in a binary log.". This bug was reported by Iwo P. His test case to demonstarte the problem included small source code modification, but (unlike with some other bug reports) this had NOT prevented accepting it as a true, verified bug. The bug not affect MySQL 8.0.3+ thanks to WL#6049 "Meta-data locking for FOREIGN KEY tables" implemented there.
  • Bug #96472 - "Memory leak after 'innodb.alter_crash'". Yet another bug affecting only MySQL 4.7 and not MySQL 8.0. It was reported by Yura Sorokin from Percona.
  • Bug #96475 - "ALTER TABLE t IMPORT TABLESPACE blocks SELECT on I_S.tables.".  Clear and simple "How to repeat" instructions (using dbdeployer) by Jean-François Gagné. See also his related Bug #96477 - "FLUSH TABLE t FOR EXPORT or ALTER TABLE t2 IMPORT TABLESPACE broken in 8.0.17" for MySQL 8. The latter is a regression bug (without a regression tag), and I just do not get how the GA releases with such new bugs introduced may happen.
  • Bug #96504 - "Refine atomics and barriers for weak memory order platform". Detailed analysis, with links to code etc from Cai Yibo.
  • Bug #96525 - "Huge malloc when open file limit is high". Looks more like a systemd problem (in versions < 240) to me. Anyway, useful report from Andreas Hasenack.
  • Bug #96615 - "mysql server cannot handle write operations after set system time to the past". A lot of arguments were needed to get this verified, but Shangshang Yu was not going to give up. First time I see gstack used in the bug report to get a stack trace quickly. It's a part of gdb RPM on CentOS 6+. I have to try it vs gdb and pstack one day and decide what is the easiest and most efficient way to get backtraces of all threads in production...
  • Bug #96637 - "Clone fails on just upgraded server from 5.7". I had not used MySQL 8 famous clone plugin yet in practice, but I already know that it has bugs. This bug was reported by Satya Bodapati, who also suggested a patch.
  • Bug #96644 - "Set read_only on a master waiting for semi-sync ACK blocked on global read lock". Yet another problem (documented limitation) report from Przemyslaw Malkowski. Not sure why it was not verified on MySQL 8.0. Without a workaround to set master to read only it is unsafe to use long rpl_semi_sync_master_timeout values, as we may end up with that long downtime.
  • Bug #96677 - ""SELECT ... INTO var_name FOR UPDATE" not working in MySQL 8". This regression bug was reported by Vinodh Krish. Some analysis and patch were later suggested by Zsolt Parragi.
  • Bug #96690 - "sql_require_primary_key should not apply to temporary tables". This bug was also reported by Przemyslaw Malkowski from Percona. It ended up as a verified feature request, but not everyone in community is happy with this. Let me quote:
    "[30 Aug 8:08] Jean-François Gagné
    Could we know what was the original severity of this bug as reported by Przemyslaw ? This is now hidden as it has been reclassified as S4 (Feature Request).

    From my point of view, this is actually a bug, not a feature request and it should be classified as S2. A perfectly working application would break for no reason when a temporary table does not have a Primary Key, so this is actually a big hurdle for using sql_require_primary_key, hence serious bug in the implementation of this otherwise very nice and useful feature.
    "
That's all about bugs I've subscribed to in summer.
Winter is coming, so why not to remember nice warm sunny days and interesting MySQL bugs reported back then.
To summarize:
  1. We still see some strange "games" played during bugs processing and trend to decrease severity of reports. I think this is a waste of time for both Oracle engineers and community bug reporters.
  2. I am still not sure if Oracle's QA does not use ASan or just ignore problems reported for MTR test cases. Anyway, Percona engineers do this for them, and report related bugs :)
  3. dbdeployer and sysbench are really popular among MySQL bug reporters recently!
  4. Importing of InnoDB tablespaces is broken in MySQL 8.0.17+ at least.
  5. There are many interesting MySQL bugs reported during last 3 months, so I epxect more posts in this series soon.

Sunday, November 17, 2019

Dynamic Tracing of MySQL Server with Timestamps Using gdb

Some time ago I wanted a customer to trace some MariaDB function execution and make sure that when it is executed I get both timestamp of execution and some of the arguments printed into some log file. Our InnoDB guru in MariaDB, Marko Mäkelä, suggested to use gdb, set breakpoint on the function and use its command ... end syntax to print whatever we needed, and log the output.

Adding a timestamp to each breakpoint "command" execution was the next step, and for this I've suggested to use plain "shell date" command. Redirecting output to the same file gdb uses to log everything indeed did the trick. Let me document it here for readers and myself. I'll consider the example of logging SQL queries I've used in the previous blog post on dynamic tracing with perf and use recent Percona Server 5.7.28-31 on Ubuntu 16.04 for tests.

Skipping the details already explained in that previous post, let's assume I want to set a breakpoint at the dispatch_command() function and print the value of com_data->com_query.query every time it hit it. This is what I did with gdb after attaching it to the proper mysqld process:
...
[New LWP 6562]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fc563d8774d in poll () at ../sysdeps/unix/syscall-template.S:84
84      ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) set height 0
(gdb) set log on
Copying output to gdb.txt.
(gdb) b dispatch_command
Breakpoint 1 at 0xbe9660: file /mnt/workspace/percona-server-5.7-debian-binary-rocks-new/label_exp/min-xenial-x64/test/percona-server-5.7-5.7.28-31/sql/sql_parse.cc, line 1254.
(gdb) command 1
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>shell date >> ./gdb.txt
>p com_data->com_query.query
>continue
>end
(gdb) continue
Continuing.

Key detail here is to use command and refer to the breakpoint by number, 1, and then append the output of the date command to the gdb.txt file in the current directory that is used by default to log the gdb output.

Then in another shell I executed the following:
openxs@ao756:~$ mysql -uroot test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.7.28-31-log Percona Server (GPL), Release '31', Revision 'd14ef86'

Copyright (c) 2009-2019 Percona LLC and/or its affiliates
Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> select 1;
+---+
| 1 |
+---+
| 1 |
+---+
1 row in set (0.04 sec)

mysql> select 2;
+---+
| 2 |
+---+
| 2 |
+---+
1 row in set (0.04 sec)

mysql> shutdown;
Query OK, 0 rows affected (0.04 sec)

mysql> quit
Bye

I had the output in the gdb window, but what's more important is that I've got it in the gdb.txt file as well. The content looked as follows:
openxs@ao756:~$ ls -l gdb.txt
-rw-r--r-- 1 root root 7123 лис 17 19:21 gdb.txt
openxs@ao756:~$ cat gdb.txt
Breakpoint 1 at 0xbe9660: file /mnt/workspace/percona-server-5.7-debian-binary-rocks-new/label_exp/min-xenial-x64/test/percona-server-5.7-5.7.28-31/sql/sql_parse.cc, line 1254.
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
Continuing.
[New Thread 0x7fc566411700 (LWP 6611)]
[Switching to Thread 0x7fc566411700 (LWP 6611)]

Thread 29 "mysqld" hit Breakpoint 1, dispatch_command (thd=thd@entry=
    0x7fc548631000, com_data=com_data@entry=0x7fc566410da0, command=COM_QUERY)
    at /mnt/workspace/percona-server-5.7-debian-binary-rocks-new/label_exp/min-xenial-x64/test/percona-server-5.7-5.7.28-31/sql/sql_parse.cc:1254
1254    /mnt/workspace/percona-server-5.7-debian-binary-rocks-new/label_exp/min-xenial-x64/test/percona-server-5.7-5.7.28-31/sql/sql_parse.cc: No such file or directory.
неділя, 17 листопада 2019 19:20:47 +0200
$1 = 0x7fc54865b021 "show databases"
Thread 29 "mysqld" hit Breakpoint 1, dispatch_command (
    thd=thd@entry=0x7fc548631000, com_data=com_data@entry=0x7fc566410da0,
    command=COM_QUERY)
    at /mnt/workspace/percona-server-5.7-debian-binary-rocks-new/label_exp/min-xenial-x64/test/percona-server-5.7-5.7.28-31/sql/sql_parse.cc:1254
1254    in /mnt/workspace/percona-server-5.7-debian-binary-rocks-new/label_exp/min-xenial-x64/test/percona-server-5.7-5.7.28-31/sql/sql_parse.cc
неділя, 17 листопада 2019 19:20:47 +0200
$2 = 0x7fc54865b021 "show tables"
...
Thread 29 "mysqld" hit Breakpoint 1, dispatch_command (
    thd=thd@entry=0x7fc548631000, com_data=com_data@entry=0x7fc566410da0,
    command=COM_QUERY)
    at /mnt/workspace/percona-server-5.7-debian-binary-rocks-new/label_exp/min-xenial-x64/test/percona-server-5.7-5.7.28-31/sql/sql_parse.cc:1254
1254    in /mnt/workspace/percona-server-5.7-debian-binary-rocks-new/label_exp/min-xenial-x64/test/percona-server-5.7-5.7.28-31/sql/sql_parse.cc
неділя, 17 листопада 2019 19:20:47 +0200
$9 = 0x7fc54865b021 "select @@version_comment limit 1"

Thread 29 "mysqld" hit Breakpoint 1, dispatch_command (
    thd=thd@entry=0x7fc548631000, com_data=com_data@entry=0x7fc566410da0,
    command=COM_QUERY)
    at /mnt/workspace/percona-server-5.7-debian-binary-rocks-new/label_exp/min-xenial-x64/test/percona-server-5.7-5.7.28-31/sql/sql_parse.cc:1254
1254    in /mnt/workspace/percona-server-5.7-debian-binary-rocks-new/label_exp/min-xenial-x64/test/percona-server-5.7-5.7.28-31/sql/sql_parse.cc
неділя, 17 листопада 2019 19:21:01 +0200
$10 = 0x7fc54865b021 "select 1"

Thread 29 "mysqld" hit Breakpoint 1, dispatch_command (
    thd=thd@entry=0x7fc548631000, com_data=com_data@entry=0x7fc566410da0,
    command=COM_QUERY)
    at /mnt/workspace/percona-server-5.7-debian-binary-rocks-new/label_exp/min-xenial-x64/test/percona-server-5.7-5.7.28-31/sql/sql_parse.cc:1254
1254    in /mnt/workspace/percona-server-5.7-debian-binary-rocks-new/label_exp/min-xenial-x64/test/percona-server-5.7-5.7.28-31/sql/sql_parse.cc
неділя, 17 листопада 2019 19:21:08 +0200
$11 = 0x7fc54865b021 "select 2"

Thread 29 "mysqld" hit Breakpoint 1, dispatch_command (
    thd=thd@entry=0x7fc548631000, com_data=com_data@entry=0x7fc566410da0,
    command=COM_QUERY)
    at /mnt/workspace/percona-server-5.7-debian-binary-rocks-new/label_exp/min-xenial-x64/test/percona-server-5.7-5.7.28-31/sql/sql_parse.cc:1254
1254    in /mnt/workspace/percona-server-5.7-debian-binary-rocks-new/label_exp/min-xenial-x64/test/percona-server-5.7-5.7.28-31/sql/sql_parse.cc
неділя, 17 листопада 2019 19:21:12 +0200
$12 = 0x7fc54865b021 "shutdown"
[Thread 0x7fc566411700 (LWP 6611) exited]

Thread 1 "mysqld" received signal SIGUSR1, User defined signal 1.
[Switching to Thread 0x7fc56645b780 (LWP 6525)]
0x00007fc563d8774d in poll () at ../sysdeps/unix/syscall-template.S:84
84      ../sysdeps/unix/syscall-template.S: No such file or directory.
Detaching from program: /usr/sbin/mysqld, process 6525
openxs@ao756:~$
That's it. For basic tracing of any function calls in MySQL or MariaDB server (or any binary with debuginfo available) including timestamps and any arguments, variables or expressions printed you do not strictly need anything but gdb. With different format options for date command you can format timestamp any way you need, for example, you can remove day and month names etc and get even nanosecond precision if you prefer:
openxs@ao756:~$ for i in `seq 1 20`; do sleep 0.1; date +'%H:%m:%S.%N'; done
20:11:41.126205172
20:11:41.230087681
...
20:11:42.996521734
20:11:43.100229347
openxs@ao756:~$
* * * 
Sometimes event basic tools allow to get useful results. This photo was made with my dumb Nokia phone while running. Just a stop for a moment. Nice and useful result, IMHO, same as with gdb trick discussed here.
To summarize, for basic dynamic tracing with timestamps and arbitrary information printed you can use a command line debugger, like gdb. In some cases this simple approach is quite useful.

This kind of tracing comes with a cost, both in terms of performance impact (awful) and additional steps to parse the output (we do not really care about all that breakpoint details, we justneed a timestamp and some values printed). eBPF and related dynamic tracing tools (bcc trace and bpftrace) may help with both problems, as I am going to demonstrate in upcoming blog posts. Stay tuned!

Sunday, November 10, 2019

Time in Performance Schema

I've seen questions like this:
"Is there a way to know when (date and time) the last statement captured in ... was actually ran?"
more than once in some discussions (and customer issues) related to Performance Schema. I've seen answers provided by my colleagues and me after some limited testing. I've also noticed statements that it may not be possible.

Indeed, examples of wall clock date and time in the output of queries from the performance_schema are rare (and usually come from tables in the information_schemasys.format_time() function converts time to a nice, human readable format, but it still remains relative - it is not a date and time when something recorded in performance_schema happened.

In this post I'd like to document the answer I've seen and have in mind (and steps to get it) here, to save time for readers and myself faced with similar questions in the future. I'll also show the problem with this answer that I've noticed after testing for more than few minutes.

Let's start with simple setup of testing environment. In my case it is good old MariaDB 10.3.7 running on this netbook under Windows. First, let's check if Performance Schema is enabled:
 MariaDB [test]> select version(), @@performance_schema;
+--------------------+----------------------+
| version()          | @@performance_schema |
+--------------------+----------------------+
| 10.3.7-MariaDB-log |                    1 |
+--------------------+----------------------+
1 row in set (0.233 sec)
Then let's enable recording of time for everything and enable all consumers:
MariaDB [test]> update performance_schema.setup_instruments set enabled='yes', timed='yes';
Query OK, 459 rows affected (0.440 sec)
Rows matched: 707  Changed: 459  Warnings: 0

MariaDB [test]> update performance_schema.setup_consumers set enabled='yes';
Query OK, 8 rows affected (0.027 sec)
Rows matched: 12  Changed: 8  Warnings: 0
Now we can expect recently executed statements to be recorded, like this:
MariaDB [test]> select now(), event_id, timer_start, timer_end, sql_text from performance_schema.events_statements_current\G*************************** 1. row ***************************
      now(): 2019-11-03 17:42:51
   event_id: 46
timer_start: 22468653162059216
  timer_end: 22468697203533224
   sql_text: select now(), event_id, timer_start, timer_end, sql_text from performance_schema.events_statements_current
1 row in set (0.045 sec)
Good, but how we can get a real time when the statement was executed (like now() reports)? We all know from the fine manual that timer_start and timer_end values are in "picoseconds". So we can easily convert them into seconds (or whatever units we prefer):
MariaDB [test]> select now(), event_id, timer_start/1000000000000, sql_text from performance_schema.events_statements_current\G
*************************** 1. row ***************************
                    now(): 2019-11-03 17:54:02
                 event_id: 69
timer_start/1000000000000: 23138.8159
                 sql_text: select now(), event_id, timer_start/1000000000000, sql_text from performance_schema.events_statements_current
1 row in set (0.049 sec)
This value is related startup time one might assume, and we indeed can expect that timer in Performance Schema is initialized at some very early stage of startup. But how to get date and time of server startup with SQL statement?

This also seems to be easy, as we have a global status variable called Uptime measured in seconds. Depending on fork and version used we can get the value of Uptime either from the Performance Schema (in MySQL 5.7+) or from the Information Schema (in MariaDB and older MySQL versions). For example:
MariaDB [test]> select variable_value from information_schema.global_status where variable_name = 'Uptime';
+----------------+
| variable_value |
+----------------+
| 23801          |
+----------------+
1 row in set (0.006 sec)
So, server startup time is easy to get with a date_sub() function:
MariaDB [test]> select @start := date_sub(now(), interval (select variable_value from information_schema.global_status where variable_name = 'Uptime') second) as start;
+----------------------------+
| start                      |
+----------------------------+
| 2019-11-03 11:28:18.000000 |
+----------------------------+
1 row in set (0.007 sec)
In the error log of MariaDB server I see:
2019-11-03 11:28:18 0 [Note] mysqld (mysqld 10.3.7-MariaDB-log) starting as process 5636 ...
So, I am sure the result is correct. Now, if we use date_add() to add timer value converted to seconds, for example to the server startup time, we can get the desired answer, date and time when the statement recorded in performance_schema was really executed:
MariaDB [test]> select event_id, @ts := date_add(@start, interval timer_start/1000000000000 second) as ts, sql_text, now(), timediff(now(), @ts) from performance_schema.events_statements_current\G
*************************** 1. row ***************************
            event_id: 657
                  ts: 2019-11-03 18:24:00.501654
            sql_text: select event_id, @ts := date_add(@start, interval timer_start/1000000000000 second) as ts, sql_text, now(), timediff(now(), @ts) from performance_schema.events_statements_current
               now(): 2019-11-03 18:24:05
timediff(now(), @ts): 00:00:04.498346
1 row in set (0.002 sec)
I was almost ready to publish this blog post a week ago, before paying more attention to the result (that used to be perfectly correct in earlier simple tests) and executing a variation of statement presented above. The problem I noticed is that when Uptime of the server is not just few minutes (as it often happens in quick test environments), but hours or days, timestamp that we get for a recent event from the performance_schema using the approach suggested may differ from current timestamp notably (we see 4.5+ seconds difference highlighted above). Moreover, this difference seem to fluctuate:
MariaDB [test]> select event_id, @ts := date_add(@start, interval timer_start/1000000000000 second) as ts, sql_text, now(), timediff(now(), @ts) from performance_schema.events_statements_current\G
*************************** 1. row ***************************
            event_id: 682
                  ts: 2019-11-03 18:24:01.877763
            sql_text: select event_id, @ts := date_add(@start, interval timer_start/1000000000000 second) as ts, sql_text, now(), timediff(now(), @ts) from performance_schema.events_statements_current
               now(): 2019-11-03 18:24:07
timediff(now(), @ts): 00:00:05.122237
1 row in set (0.002 sec)
and tend to grow with Uptime. This make the entire idea of converting timer_start and timer_end Performance Schema "counters" in "picoseconds" questionable and unreliable for the precise real timestamp matching and comparing with other timestamp information sources in production.
Same as with this photo of sunset at Brighton taken with my Nokia dumb phone back in June, I do not see a clear picture of time measurement in Performance Schema...

After spending some more time thinking about this I decided to involve MySQL team somehow and created the feature request, Bug #97558 - "Add function (like sys.format_time) to convert TIMER_START to timestamp", that ended up quickly "Verified" (so I have small hope that I had not missed anything really obvious - correct me if I am wrong). I'd be happy to see further comments there and, surely, the function I asked about implemented. But I feel there is some internal problem with this and some new feature at server side may be needed to take the "drift" of time in Performance Schema into account.

There is also a list of currently open questions that I may try to answer in followup posts:
  1. Is the problem of time drift I noticed a MariaDB 10.3.7-specific, or recent MySQL 5..x and 8.0.x are also affected?
  2. Is this difference in time growing monotonically with time or really fluctuating?
  3. When exactly Performance Schema time "counter" starts, where is it in the code?
  4. Are there any other, better or at least more precise and reliable ways to get timestamps of some specific events that happen during MySQL server work? I truly suspect that gdb and especially dynamic tracing on Linux with tools like bpftrace may give us more reliable results...

Monday, October 7, 2019

Dynamic Tracing of MariaDB Server With bpftrace - Basic Example

Unlike the previous post, this one is not just a comment to some slides from the "Tracing and Profiling MySQL" talk at Percona Live Europe 2019. I am going to add the details that were missing there (as I was in a hurry and had forgotten to copy/paste proper outputs while testing). I am going to show how to add dynamic probe with "latest and greatest" bpftrace tool.

The goal is the same as before - try to add dynamic probe(s) to trace query execution. More specifically, to capture text of the queries executed by clients of MySQL server. As bpftrace requires new kernel and just does not work on Ubuntu 16.04, for the demonstration I use my Fedora 29 box with kernel 5.2.x and, for a change, get queries from Fedora's own MariaDB 10.3.17 installed from rpm packages there.

I have both bcc and bpftrace installed also from packages:
[openxs@fc29 ~]$ rpm -qa | grep bcc
python3-bcc-0.8.0-1.fc29.x86_64
bcc-0.8.0-1.fc29.x86_64
bcc-tools-0.8.0-1.fc29.x86_64
[openxs@fc29 ~]$ rpm -qa | grep bpf
bpftrace-0.9-0.fc29.x86_64
[openxs@fc29 ~]$
You can check fine manual for the details, but even basic -h option provides enough to start, as long as you already know some terms and probes syntax:
[openxs@fc29 ~]$ bpftrace -h
USAGE:
    bpftrace [options] filename
    bpftrace [options] -e 'program'

OPTIONS:
    -B MODE        output buffering mode ('line', 'full', or 'none')
    -d             debug info dry run
    -dd            verbose debug info dry run
    -e 'program'   execute this program
    -h             show this help message
    -l [search]    list probes
    -p PID         enable USDT probes on PID
    -c 'CMD'       run CMD and enable USDT probes on resulting process
    -v             verbose messages
    --version      bpftrace version

ENVIRONMENT:
    BPFTRACE_STRLEN           [default: 64] bytes on BPF stack per str()
    BPFTRACE_NO_CPP_DEMANGLE  [default: 0] disable C++ symbol demangling

EXAMPLES:
bpftrace -l '*sleep*'
    list probes containing "sleep"
bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...\n", pid); }'
    trace processes calling sleep
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
    count syscalls by process name
In our case we need to define uprobe for proper mysqld binary and trace the dispatch_command() function. Before we start, note that parameters of dispatch_command() in MariaDB 10.3 are not the same as in Percona Server 5.7 I've used in the previous post. Basically, this function starts as follows in sql/sql_parse.cc:
   1570 bool dispatch_command(enum enum_server_command command, THD *thd,
   1571                       char* packet, uint packet_length, bool is_com_mult        i,
   1572                       bool is_next_command)
Note the third argument, packet. If the first argument, command. is SQL_QUERY, then packet contains the query text (as a zero-terminated string) for sure (it's also true for many other commands, but let me skip the details for now). That's why we'll use third argument in our uprobe to capture the SQL text.

Now, let's start the service and check the exact full path name for the mysql binary:
[openxs@fc29 ~]$ sudo service mariadb start
[sudo] password for openxs:
Redirecting to /bin/systemctl start mariadb.service
[openxs@fc29 ~]$ ps aux | grep mysqldmysql     9109  6.2  1.2 1699252 104108 ?      Ssl  09:30   0:00 /usr/libexec/mysqld --basedir=/usr
openxs    9175  0.0  0.0 215744   892 pts/0    S+   09:30   0:00 grep --color=auto mysqld
The first naive attempt to add the probe after cursory reading the documentation and checking few examples may look like this:
 [openxs@fc29 ~]$ sudo bpftrace -e 'uprobe:/usr/libexec/mysqld:dispatch_command { printf("%s\n", str(arg2)); }'
Attaching 1 probe...
Could not resolve symbol: /usr/libexec/mysqld:dispatch_command
It seems in my case, unlike perf, bpftrace is not "aware" of C++ names or symbolic information in a separate -debuginfo package. So, I need mangled name:
[openxs@fc29 ~]$ nm -na /usr/libexec/mysqld | grep dispatch_command
nm: /usr/libexec/mysqld: no symbols
[openxs@fc29 ~]$ nm -na /home/openxs/dbs/maria10.3/bin/mysqld | grep dispatch_command
00000000004a1eef t _Z16dispatch_command19enum_server_commandP3THDPcjbb.cold.344
00000000005c5190 T _Z16dispatch_command19enum_server_commandP3THDPcjbb
00000000005c5190 t _Z16dispatch_command19enum_server_commandP3THDPcjbb.localalias.256
Surely there is no symbols in the binary from Fedora package, so I checked the binary (of the same version) that I've built myself (as usual) and assumed that neither parameters nor mangling approach could be different. So, the next attempt to add dynamic probe would look as follows:
[openxs@fc29 ~]$ sudo bpftrace -e 'uprobe:/usr/libexec/mysqld:_Z16dispatch_command19enum_server_commandP3THDPcjbb { printf("%s\n", str(arg2)); }'
Attaching 1 probe...
show databases
show tables
t1
select @@version_comment limit 1
select user, host from mysql.user
^C
It worked and you see above the output I've got for the following session:
[openxs@fc29 ~]$ mysql -uroot testReading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 19
Server version: 10.3.17-MariaDB MariaDB Server

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [test]> select user, host from mysql.user;
+------+-----------+
| user | host      |
+------+-----------+
| root | 127.0.0.1 |
| root | ::1       |
| root | fc29      |
| root | localhost |
+------+-----------+
4 rows in set (0.000 sec)
You can see some SQL statements generated when mysql command line connects, as well as packet value in some other packet ("t1") than COM_QUERY, probably. My probe had not even tried to check other parameters besides the supposed query text.

Now, the probe is defined on uprobe:/usr/libexec/mysqld:_Z16dispatch_command19enum_server_commandP3THDPcjbb - I've just used long, mangled version of the function name and full path name to the binary, and defined a dynamic probe (uprobe). There is no filter and the action for the probe is defined as { printf("%s\n", str(arg2)); } - that is, I print third argument (they are numbered starting from zero, arg0, arg1, arg2, ...) as a zero-terminated string. Without str() built in function I'd get just a pointer that could be printed as (unsigned) long integer, u64.

Basically, that's all. We have a quick and dirty way to capture all queries. No timing or anything, but it all depends on probe action that can use numerous built in variables and functions.

More "advanced" use of bpftrace, a lame attempt to capture time to execute query, may look like this:
[openxs@fc29 ~]$ sudo bpftrace -e 'uprobe:/usr/libexec/mysqld:_Z16dispatch_command19enum_server_commandP3THDPcjbb { @sql = str(arg2); @start[@sql] = nsecs; }
uretprobe:/usr/libexec/mysqld:_Z16dispatch_command19enum_server_commandP3THDPcjbb /@start[@sql] != 0/ { printf("%s : %u64 ms\n", @sql, (nsecs - @start[@sql])/1000000); } '

Attaching 2 probes...
select sleep(3) : 300064 ms
select sleep(1) : 100064 ms
^C

@sql: select sleep(1)


@start[select sleep(3)]: 10666558704666
@start[select sleep(1)]: 10685614895675


[openxs@fc29 ~]$
In this case I try to store time since probe start into the associative array with query text as an "index" and start time (in nanoseconds) as a value.Then I calculate the difference from current nsecs value upon function return, in a separate uretprobe. I've used global variables, @sql for the query text, and @start[] for the array. It even seems to work well for a single threaded load based on the above. But as soon as I try to use multiple concurrent threads:
[openxs@fc29 ~]$ for i in `seq 1 4`; do mysql -uroot test -e"select sleep($i)" & done
it becomes clear that global variables are really global and my outputs are all wrong.

A bit better version may look like this:
[openxs@fc29 ~]$ sudo bpftrace -e 'uprobe:/usr/libexec/mysqld:_Z16dispatch_command19enum_server_commandP3THDPcjbb { @sql[tid] = str(arg2); @start[tid] = nsecs; }                                                                               uretprobe:/usr/libexec/mysqld:_Z16dispatch_command19enum_server_commandP3THDPcjbb /@start[tid] != 0/ { printf("%s : %u64 %u64 ms\n", @sql[tid], tid, (nsecs - @start[tid])/1000000); } '
Attaching 2 probes...
select sleep(1) : 1120764 100064 ms
 : 1120764 064 ms
select sleep(2) : 949064 200064 ms
 : 949064 064 ms
select sleep(3) : 1120864 300064 ms
 : 1120864 064 ms
select sleep(4) : 1120664 400064 ms
 : 1120664 064 ms
select sleep(1) : 1120764 100064 ms
 : 1120764 064 ms
select sleep(2) : 949064 200064 ms
 : 949064 064 ms
select sleep(3) : 1120664 300064 ms
 : 1120664 064 ms
select sleep(4) : 1120864 400064 ms
 : 1120864 064 ms
^C

@sql[9490]:
@sql[11206]:
@sql[11207]:
@sql[11208]:


@start[11207]: 13609305005933
@start[9490]: 13610305621499
@start[11206]: 13611305753596
@start[11208]: 13612305235313


[openxs@fc29 ~]$
The output is for both sequential and concurrent execution of queries. I've used two associative arrays, @sql[] for queries and @start[] for start times, both indexed by tid - built in variable for thread id, that should not change, at least until we use pool-of-threads... You can also see that the tool by default outputs the content of all global associate arrays at the end, unless we free them explicitly.

* * *
This image of bpftrace internals is taken from the Brendan Gregg's post
bpftrace commands may be much more complicated and it may make sense to to store them in a separate file. The tool is near ideal of "quick and dirty" tests, and one day I'll write a way more complete posts with much better examples. But having a way to capture, filter and summarize queries in kernel space and send only the relevant results to a user space, all that in a safe manner, is really cool!

Sunday, October 6, 2019

Dynamic Tracing of MySQL Server With perf probe - Basic Example

I am going to write a series of blog posts based on my talks and experiences at Percona Live Europe 2019. The first one would be a kind of extended comment for a couple of slides from the "Tracing and Profiling MySQL" talk.

We can surely wait until Performance Schema instruments every other line of code or at least every important stage and wait in every storage engine we care about, but there is no real need for that. If you run any version of MySQL under Linux with more or less recent kernel (anything newer than 4.1 is good enough, in general), you can easily use dynamic tracing for any application (at least if there is symbolic information for the binaries), any time. As Brendan Gregg put it here:
"One benefit of dynamic tracing is that it can be enabled on a live system without restarting anything. You can take an already-running kernel or application and then begin dynamic instrumentation, which (safely) patches instructions in memory to add instrumentation. That means there is zero overhead or tax for this feature until you begin using it. One moment your binary is running unmodified and at full speed, and the next, it's running some extra instrumentation instructions that you dynamically added. Those instructions should eventually be removed once you've finished using your session of dynamic tracing."
One of the ways to use dynamic tracing (that is supported for a long time) is a perf profiler and its probe command. In the simplest case that I am going to illustrate here, probe is defined for a function defined in the code and refers to it by name. You can refer to the name of local variable, function parameter, local data structure member in the probe etc, and record the values of them with other probe data.

For a simple example let me consider recent Percona Server 5.7.x running on recent Ubuntu 16.04 with kernel 4.4.x. Let's assume I want to trace all calls to the dispatch_command() function and record every query every connection processes that way.

Skipping the details for now, let's assume I've found out (with gdb in my case, but it can be code review as well) that when this function is called I can see the query user wants to execute in the com_data structure passed via a pointer to the function:
(gdb) p com_data->com_query.query
$4 = 0x7fb0dba8d021 "select 2"
Based on this information and having -dbg package also installed for Percona Server I can add a probe dynamically any time using the following simple command (--add option is assumed by default):
openxs@ao756:~$ sudo perf probe -x /usr/sbin/mysqld 'dispatch_command com_data->com_query.query:string'
Added new event:
  probe_mysqld:dispatch_command (on dispatch_command in /usr/sbin/mysqld with query=com_data->com_query.query:string)

You can now use it in all perf tools, such as:

        perf record -e probe_mysqld:dispatch_command -aR sleep 1
In this probe I refer to the specific binary with -x option and full path name, and the function in that binary by name, and I say that I'd like to record the value of com_data->com_query.query as a zero-terminated string. Now I can use any variation of perf record command (with -F option to define sampling frequency, -g option to capture stack traces etc, see more here) and my probe will be one of the events captured.

For this simple example of tracing I'll use -e option to capture only the events related to the probe I defined. Probe name for this simple case by default consists of the binary name, colon (':') separator and function name. I'll use -R option to collect raw sample records and . I've also added -a option to collect samples on all CPUs. You can see the hint for possible command in the output above. 

So, I can record related events with default frequency as follows:
openxs@ao756:~$ sudo perf record -e 'probe_mysqld:dispatch_command*' -aR
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.676 MB perf.data (3 samples) ]
I let it work for some time in the foreground and then pressed Ctrl-C to stop collecting. Now I can check raw sample records with perf script command:
openxs@ao756:~$ sudo perf script >/tmp/queries.txt
openxs@ao756:~$ cat /tmp/queries.txt
          mysqld 31340 [001]  3888.849079: probe_mysqld:dispatch_command: (be9250) query="select 100"
          mysqld 31340 [001]  3891.648739: probe_mysqld:dispatch_command: (be9250) query="select user, host from mysql.user"
          mysqld 31340 [001]  3895.890141: probe_mysqld:dispatch_command: (be9250) query="select 2"
This is the detailed trace, with additional information (exact text of the query executed) added as requested. Output also included PID of the binary, CPU the sample was taken from and a timestamp.

When I am done with tracing, I can delete the probe with --del option referring it by name:
openxs@ao756:~$ sudo perf probe --del dispatch_commandRemoved event: probe_mysqld:dispatch_command
The (small, more on that later) overhead for tracing was added dynamically, only for the exact information I needed and only for the period of tracing. After the dynamic probe is removed we have exactly the same binary as originally started running with zero extra overhead. Now do this with Performance Schema :)

* * *
Slides are available at https://www.slideshare.net/valeriikravchuk1/


More details on the way other tools mentioned during the talk can be used by MySQL DBAs are coming soon in this blog. Stay tuned!

Saturday, August 31, 2019

Fun with Bugs #89 - On MySQL Bug Reports I am Subscribed to, Part XXIII

I have to celebrate the anniversary of my last day in Oracle (that was 7 years ago!) somehow, and I think writing yet another blog post about Oracle MySQL bugs is a good way to do this. I am actually surprised (and happy) that public bugs database is still alive, maintained and considered important in Oracle, and I know who in Oracle was working hard all these years for this to happen!

In my previous post in this series I've stopped on Bug #95954 and had not completed review of interesting MySQL bug reports that I've subscribed to in June 2019. So, below I start with the next bug in my list, complete review for June and cover some bugs reported in July. There were many.
  • Bug #95957 - "IN operator issue when comparing signed column and the column cast to unsigned". This bug was reported by Manuel Rigger. As far as I can see, MariaDB 10.3 is not affected:
    MariaDB [test]> CREATE TABLE t0(c0 INT);
    Query OK, 0 rows affected (0.518 sec)

    MariaDB [test]> INSERT INTO t0(c0) VALUES(-1);
    Query OK, 1 row affected (0.196 sec)

    MariaDB [test]> SELECT t0.c0 IN (1, CAST(t0.c0 AS UNSIGNED)) from t0;
    +---------------------------------------+
    | t0.c0 IN (1, CAST(t0.c0 AS UNSIGNED)) |
    +---------------------------------------+
    |                                     0 |
    +---------------------------------------+
    1 row in set, 1 warning (0.207 sec)

    MariaDB [test]> show warnings\G
    *************************** 1. row ***************************
      Level: Note
       Code: 1105
    Message: Cast to unsigned converted negative integer to it's positive complement

    1 row in set (0.013 sec)
  • Bug #96001 - "No warning when creating foreign key in MyISAM tables". I am really surprised that this "documented feature" was accepted as a S3 bug when reported by Przemyslaw Malkowski from Percona. But this happened. I also prefer NOT to have unsupported syntax just ignored silently and would like to see a warning (or error in strict mode).
  • Bug #96002 - "'variable log_bin_trust_function_creators' -variable is "hidden"." Consistency is important, as well as correct documentation. So I was happy to see this bug report from Peter Laursen. 75(!) of his bug reports are still active, by the way, some were reported more than 10 years ago. I wonder if anyone is going to check (if not fix) them any time soon.
  • Bug #96079 - "large_tests.innodb_innochecksum_3gb test failing with debug build." Yet another MTR test failure reported by Lalit Choudhary from Percona.
  • Bug #96100 - "Generated column cause a heap-use-after-free error". Probably ASan builds are not tested as carefully in Oracle as they are by some community members, like Zkong Kong who reported this bug. Otherwise they would mark this bug report as a duplicate of some known internal bug.
  • Bug #96108 - "To run mtr "innodb.log_flush_order" MySql Server would be always core down". I've listed the bug as an example that even though code modification is needed to reproduce the crash (reported by Juncai Meng) literally, it was accepted and "Verified". In other reports this sometimes not happen, and the point is that it is surely not a rule carved in stone in Oracle to NOT accept bugs if test case involves code modification. Remember that and fight for your reports if needed.
  • Bug #96128 - "Doc: documentation is inaccurate when InnoDB starts with innodb_read_only". Correct manual matters a lot, so nice to have it corrected in this case by Calvin Sun.
  • Bug #96134 - "Please provide control functions for the IO Thread." I'd be also happy to see a way to control replication I/O thread progress and read logs only up to some position or GTID, per channel. Thanks Jean-François Gagné for this nice feature request.
  • Bug #96142 - "Inconsistent error on slave for Update event on table with non-exists partition". Yet another by report from by Lalit Choudhary. Good to see multiple versions check.
  • Bug #96148 - "using Invisible Index when slave apply EVENT". This bug was reported by Songlei Wang. Consistency matters, so if the index is invisible it should remain invisible for replication SQL thread as well. See also his another Bug #96150 - "'show slave status' show the Inaccurate Last_IO_Error message".
  • Bug #96167 - "Many header files now missing from devel package". As noted by Manuel Ung, now it is impossible to build plugins and UDFs unless users download the source tree, and then copy the headers to the appropriate places. Packaging in hard.
  • Bug #96178 - "mysqldump leaks memory when selected tables are dumped with --order-by-primary". Abhinav Sharma proposed a simple MTR test case to run on ASan build, and suggested a fix. Very nice bug report. Unfortunately I do not see any statements about the results of checking MySQL 8.0.x.
  • Bug #96192 - "Possible race condition with binlog-transaction-dependency-tracking". Bug reporter, Herman Lee, complained about one place in the code where race condition may happen even after the fix for one MySQL bug. Does it really matter for bug verification if he found more places? I fail to see a reason to keep he bug in "Need Feedback" status, when code review is enough to confirm there is a problem in that one part of the code clearly identified.
  • Bug #96196 - "performance_schema_accounts_size and p_s_hosts_size limited by 16384", Nice bug that can be confirmed by code review or just opening many connections was reported by Nikolai Ikhalainen from Percona. Autoscaling is broken/limited in this case.
  • Bug #96340 - "Slow startup for mysql 8.0 with many tables due to the tablespace files scan". This is actually a regression comparing to 5.7, but it's visible on slow disks. It's nice to see useful discussion, explanations and patch suggested (by Sunny Bains) in this bug report created by Lalit Choudhary.
I started this summer in a beautiful Barcelona. This blog posts ends it for me. One of the good changes leaving Oracle seven years ago introduced in my life is a real freedom to work from anywhere and travel as often as I really want, both for work and for fun.
To summarize:
  1. I am happy to see MySQL public bugs database still up and widely used, even though I do not work on it directly for 7 years already. It's a key service and media for cooperation for the entire MySQL Community! Just check how it happens in Bug #96340.
  2. Consistency matters.
  3. I still see cases when the time is wasted at bugs verification stage.
  4. MySQL fine manual still have details explained incorrectly, and this is unfortunate.
  5. Sometime I wonder why Percona engineers and other MySQL Community users manage to find even MTR test failures faster than anyone in Oracle cares to report and fix them. They all know magic (like actually running all tests on debug builds and checking the results, maybe). Or, maybe, they care?
* * *
Shameless self-promotion at the end. First time since 2015 I am going to attend Percona Live conference and speak there. Ticket prices increase on September 1, so using code CMESPEAK-VALERII you’ll get the best deal right now.

Saturday, August 10, 2019

Fun with Bugs #88 - On MySQL Bug Reports I am Subscribed to, Part XXII

It's Saturday night. I have a week of vacation ahead that I am going to spend at home, working on things I usually do not have time for. I already did something useful (created a couple of test cases for MariaDB bugs here and there, among other things), so why not to get some fun and continue my previous review of recent interesting MySQL bug report...

So, here is the list of 15 more MySQL community bug reports that I've subscribed to back in May and June 2019:
  • Bug #95491 - "The unused wake up in simulator AIO of function reserve_slot". I am impressed by the fact that these days many people just study InnoDB code, find problematic parts, improve them, test and prove the improvement, and submit patches to Oracle. Chen Zongzhi, who reported this bug, is one of such great people.
  • Bug #95496 - "Replication of memory tables". As pointed out by Iwo P, MySQL manual describes what happens to MEMORY tables replicated from master to slave in details, including the case of master restart. But one detail is missing - when slave with MEMORY tables restarts, even with super_read_only set, it also generates even to delete data from the table into its own binary log, with GTID, and this leads to problems in case of failover.
  • Bug #95507 - "innodb_stats_method is not honored when innodb_stats_persistent=ON". This bug was reported by my colleague Sergei Petrunia. See also related MDEV-19574  (still "Open").
  • Bug #95547 - "MySQL restarts automatically after ABORT_SERVER due to disk full". I am not sure this is a replication bug, more a problem of too primitive mysqld_safe or systemd startup scripts that do not check free space on disk or reason why mysqld process committed suicide. Anyway, this report by Przemyslaw Malkowski from Percona was accepted as a replication bug.
  • Bug #95582 - "DDL using bulk load is very slow under long flush_list". As found and proved by Fungo Wang, due to optimized away redo log writing for bulk load, to make sure such DDL is crash safe, dirtied pages must be flushed to disk synchronously. Currently this is done inefficiently, as shown in this bug report. See also a year old (and still "Open") Bug #92099 - "provide an option to record redo log for bulk loading" from the same reporter, where he asks for a way to get redo logging back (but still benefit from other "online" ALTER features). MariaDB has that option after the fix of MDEV-16809.
  • Bug #95612 - "replication breakage due to Can not lock user management caches for processing". Simon Mudd noted that in some cases simple DROP USER IF EXISTS ... breaks parallel replication in MySQL 8.0.16. Good to know that START SLAVE allows to continue :)
  • Bug #95698 - "JSON columns are returned with binary charset". This bug was reported by Daniel Book. Note that according to the original issue, DBD::MariaDB (I had no idea this exists!) works properly with MariaDB 10.3 or 10.4. I had not checked.
  • Bug #95734 - "Failing SET statement in a stored procedure changes variables". This really weird and unexpected bug was reported by Przemysław Skibiński.
  • Bug #95863 - "MTS ordered commit may cause deadlock ". Great finding by Lalit Choudhary from Percona!
  • Bug #95895 - "Shutdown takes long time when table uses AUTOINC and FOREIGN KEY constraints". It was reported by Satya Bodapati from Percona, who later contributed as patch. See also his related Bug #95898 - "InnoDB releases memory of table only on shutdown if table has FK constraint". I truly hope to see these fixed in next minor releases of MySQL. Note that originally bug (PS-5639) was reported by Sveta Smirnova.
  • Bug #95928 - "InnoDB crashes on compressed and encrypted table when changing FS block size". Make sure to check and write original filesystem block size somewhere... This bug was reported by Sergei Glushchenko from Percona.
  • Bug #95929 - "Duplicate entry for key 'PRIMARY' when querying information_schema.TABLES". This nice feature of new data dictionary of MySQL 8.0.16 under concurrent load was found by Manuel Rigger.
  • Bug #95934 - "innodb_flush_log_at_trx_commit = 0 get performance regression on 8 core machine". If I had access to 8 cores box, I'd try to check this sysbench-based complete test case immediately. I should do this on my quadcore box at least, next week. But here we see some 12 days just wasted and then bug report from Chen Zongzh ended up "Analyzing", till today, with no further feedback.
  • Bug #95954 - "CAST of negative function return value to UNSIGNED malfunctions with BIGINT". Yet another weird bug found by Manuel Rigger. There are patches (separate for 5.6, 5.7 and 8.0) contributed by Oleksandr Peresypkin.
Montjuïc is a place to enjoy nice view of Barcelona and forget about MySQL bugs. So, I could miss some interesting ones at the end of May, 2019...
To summarize:
  1. Percona engineers still contribute a lot of useful and interesting MySQL bug reports, often - with patches.
  2. I am impressed by quality of MySQL bug reports in 2019. Often we see detailed analysis, proper complete test cases and patches.
  3. There is a recent fork of DBD::mysql for MariaDB (and MySQL), DBD::MariaDB!
  4. I badly need some new hardware for running benchmarks and tests from recent bug reports...
  5. I should spend more time in May in Barcelona :)