Sunday, June 28, 2020

Fun with Bugs #100 - On MySQL Bug Reports I am Subscribed to, Part XXXIV

I delayed this post #100 in the "Fun with Bugs" series for few weeks - the previous one was published 4 weeks ago. The idea was to make it the last one, and for this I needed something to celebrate. Two days ago proper event happened, we have MySQL Bug #100000 reported! Here it is:
  • Bug #100000 - "Provide an index hint that only affects the choice of index for NL join". This nice feature request was added by former optimizer developer in MySQL, Øystein Grøvlen. Hundreds of other feature requests are waiting for the attention both from the MySQL Verification Team and from developers, so good to see a feature request getting the number that nobody ever forget!
Actually Øystein Grøvlen created several interesting bug report during that day:
  • Bug #99994 - "Index range scan is chosen where table scan takes 40% less time". Clear and simple bug report that relies on the world sample database.
  • Bug #99995 - "Histogram is not used for filtering estimate when index is disabled".
  • Bug #99996 - "Prefer histogram over index statistics when eq_range_index_dive_limit is exceeded". This was verified as a feature request.
  • Bug #99997 - "Range estimates are usually off by a factor of 2 for large ranges". It was declared a duplicate of older bug report I am also subscribed to, Bug #73386 - "For ranges, innodb doubles estimates, or caps estimates to half the table". See also this MariaDB bug report, MDEV-19424 - "InnoDB's records_in_range estimates are capped at about 50%", and links from it for a lot of related discussions. Let's wait and see what vendor resolves this faster...
    The other report, Bug #99998 - "For large ranges, the range estimate will never exceed 50%", is probably also a duplicate of the same old bug.
  • Bug #99999 - "EXPLAIN FORMAT=TREE does not show cost/rows for semijoin materialization". Yet another nice and clear bug report.
So, hardly anyone else had a chance to get that #100000 filed. As far as I can see, all these were reported during a very short period of 2 minutes, from "26 Jun 7:57" till "26 Jun 7:58"! Not sure how to do this without some automation or at least all the details ready for quick copy/pasting!

Now back to some older bugs I've subscribed to over last 4 weeks:
  • Bug #99791 - "MySQL 8 orphaned table due to unchecked non-existent row format check." As reported by Marc Reilly, tables created in versions < MySQL 8 which use row_format COMPRESSED or REDUNDANT, where row_format is not set explicitly in the Table DDL allow users to create un-prefixed indexes on fields which exceed the maximum column size of 767 bytes. Upgrading to MySQL 8 do nothing with these tables, but as soon as new index is added and reboot happens, such a table becomes inaccessible. What a surprise!
  • Bug #99794 - "MySQL 57 client is inefficient at bulkloads/binlog replay". In this bug report  Marc Reilly basically asks to back port the fix from MySQL 8.0.13.
  • Bug #99800 - "ps_truncate_all_tables() does not work in super_read_only mode". This regression bug was reported by Lalit Choudhary.
  • Bug #99805 - "mysql async client is incomplete". There is no way to determine file descriptor state (should it block on read or write), so it is impossible to use it in asynchronous contexts without busy looping. This bug report by Domas Mituzas was used in one discussion as an argument that MySQL bugs database still gets proper attention from MySQL engineers. It's truly so.
  • Bug #99892 - "initialize with innodb_page_size=4096 gets "Specified key was too long" errors". This is a regression vs 5.7 (without a tag). As Mark Callaghan found out, one can not initialize MySQL 8 instance without errors with such a small innodb_page_size.
  • Bug #99924 - "The record per key value from InnoDB is not suitable when n_diff is zero". As reported by Ze Yang, due to lack of locking when server reads the innodb_rec_per_key, the n_diff value may be 0 (not set) while the table->stat_n_rows is > 0. As a result (see great comment by Øystein Grøvlen), if a table object is opened during the recalculation of statistics, the rec_per_key for a column/index may be quite misleading. It will be interpreted as all rows have the same value, and the index will probably not be chosen for any non-covering scans. There is a patch suggested (to set rec_per_key to 1 or 10 in such case), as well as other suggestion to set the value REC_PER_KEY_UNKNOWN. Useful reading!
  • Bug #99933  - "In-memory hash join will only use two-thirds of join buffer". Yet another bug report related to hash joins from Øystein Grøvlen, with a fix suggested. See also his Bug #99934 - "Hash join adds columns to the hash table that is not needed." There is a lot of work ahead to improve the implementation of this new feature in MySQL 8.
  • Bug #99935 - "innodb_doublewrite_files is not correct when innodb_buffer_pool_size > 1G". Just 2 files are created instead of 16 according to the manual. This bug was reported by Satya Bodapati.
  • Bug #99943 - "Hash join does not work for Semijoin and Antijoin". This bug report from Tibor Korocz was "Verified", but later comments suggest that it's more like wrong expectations/interpretation of cases when the feature has to be used (it is supposed to be used instead of BNL, but not instead of semijoin materialization and subquery materialization). Let's wait and see how it ends up...
  • Bug #99966 - "Switching to use NUMA-SMART Counter for statistical counters". Great bug report from Krunal Bauskar, with a patch suggested. I hope to get a NUMA system one day myself to understand the challenges and performance problems there better.  
So, that's it, my very last post in the "Fun with Bugs" series that started more than 7 years ago. The series where I listed most of the interesting bug reports I keep an eye on, since Bug #2. It was a long way with a lot of fun and a lot of (rarely appreciated) work in the process, but now my watch has ended. I am not going to try to micro manage MySQL bugs processing any more and finally let the MySQL Verification Team do their job without my regular attention. Good luck!



Percona had recently started to blog about bugs, so I am sure they will keep an eye and share lists of important bugs on a regular basis. They should really care more than I do these days.

Sunday, June 21, 2020

The Magic Behind the innodb_dedicated_server=ON

Two days ago I noted this tweet by Mark Callaghan:
"This month I learned that innodb_dedicated_server is great. Don't need to set many options beyond it. I have been doing it wrong."
and decided to check what this great setting (that I also never cared to use with MySQL 8) is really doing. I could, surely, just re-read the fine manual (and I did that), but what can be better than just trying it?

Yesterday I had some free time in the morning for testing random things and was sitting near my good old Fedora (yes, I know, time to use 32 if not 33...) test box, with 8G of RAM and 4 cores:
[openxs@fc29 8.0]$ pt-summary
# Percona Toolkit System Summary Report ######################
        Date | 2020-06-20 05:27:35 UTC (local TZ: EEST +0300)
    Hostname | fc29
      Uptime | 39 min,  1 user,  load average: 0.29, 0.37, 0.39
    Platform | Linux
     Release | Fedora release 29 (Twenty Nine)
      Kernel | 5.3.11-100.fc29.x86_64
Architecture | CPU = 64-bit, OS = 64-bit
   Threading | NPTL 2.28
    Compiler | GNU CC version 8.3.1 20190223 (Red Hat 8.3.1-2).
     SELinux | Enforcing
 Virtualized | No virtualization detected
# Processor ##################################################
  Processors | physical = 1, cores = 4, virtual = 4, hyperthreading = no
      Speeds | 1x1999.777, 1x1999.816, 1x2001.216, 1x2030.647
      Models | 4xIntel(R) Core(TM)2 Quad CPU Q8300 @ 2.50GHz
      Caches | 4x2048 KB
# Memory #####################################################
       Total | 7.7G
        Free | 247.4M
        Used | physical = 2.1G, swap allocated = 0.0, swap used = 0.0, virtual = 2.1G
      Shared | 198.6M
     Buffers | 5.4G
      Caches | 5.1G
       Dirty | 352 kB
     UsedRSS | 4.3G
  Swappiness | 60
 DirtyPolicy | 20, 10
 DirtyStatus | 0, 0
...
and it had MySQL 8.0.18 there (also time to rebuild from recent GitHub source, I know...). So, I tried to compare settings with --no-defaults:
[openxs@fc29 8.0]$ ./bin/mysqld_safe --no-defaults --basedir=/home/openxs/dbs/8.0 --datadir=/home/openxs/dbs/8.0/data &
[1] 3764
[openxs@fc29 8.0]$ 2020-06-20T05:21:24.289203Z mysqld_safe Logging to '/home/openxs/dbs/8.0/data/fc29.err'.
2020-06-20T05:21:24.378630Z mysqld_safe Starting mysqld daemon with databases from /home/openxs/dbs/8.0/data
[openxs@fc29 8.0]$ ./bin/mysql -uroot -e'select version()';
+-----------+
| version() |
+-----------+
| 8.0.18    |
+-----------+
[openxs@fc29 8.0]$ ./bin/mysql -uroot -e'show global variables' >/tmp/default.txt
[openxs@fc29 8.0]$ ./bin/mysql -uroot -e'shutdown';
[openxs@fc29 8.0]$ 2020-06-20T05:22:35.220091Z mysqld_safe mysqld from pid file /home/openxs/dbs/8.0/data/fc29.pid ended
[1]+  Done                    ./bin/mysqld_safe --no-defaults --basedir=/home/openxs/dbs/8.0 --datadir=/home/openxs/dbs/8.0/data
[openxs@fc29 8.0]$ cat /tmp/default.txt | grep innodb_dedicated
innodb_dedicated_server OFF
Then I've added the magic option and repeated the same steps:
[openxs@fc29 8.0]$ ./bin/mysqld_safe --no-defaults --basedir=/home/openxs/dbs/8.0 --datadir=/home/openxs/dbs/8.0/data --innodb_dedicated_server=ON &
[1] 3993
[openxs@fc29 8.0]$ 2020-06-20T05:23:10.824899Z mysqld_safe Logging to '/home/openxs/dbs/8.0/data/fc29.err'.
2020-06-20T05:23:10.870749Z mysqld_safe Starting mysqld daemon with databases from /home/openxs/dbs/8.0/data
[openxs@fc29 8.0]$ tail -f data/fc29.err 
2020-06-20T05:23:10.870749Z mysqld_safe Starting mysqld daemon with databases from /home/openxs/dbs/8.0/data
2020-06-20T05:23:10.888077Z 0 [Warning] [MY-010139] [Server] Changed limits: max_open_files: 1024 (requested 8161)
2020-06-20T05:23:10.888097Z 0 [Warning] [MY-010142] [Server] Changed limits: table_open_cache: 431 (requested 4000)
2020-06-20T05:23:11.168384Z 0 [System] [MY-010116] [Server] /home/openxs/dbs/8.0/bin/mysqld (mysqld 8.0.18) starting as process 4117
 100 200 300 400 500
 100 200 300 400 500
 100 200 300 400 500
 100 200 300 400 500
 100 200 300 400 500
 100 200 300 400 500
2020-06-20T05:23:59.904782Z 0 [Warning] [MY-010068] [Server] CA certificate ca.pem is self signed.
2020-06-20T05:24:00.145885Z 0 [System] [MY-010931] [Server] /home/openxs/dbs/8.0/bin/mysqld: ready for connections. Version: '8.0.18'  socket: '/tmp/mysql.sock'  port: 3306  Source distribution.
2020-06-20T05:24:00.218482Z 0 [System] [MY-011323] [Server] X Plugin ready for connections. Socket: '/tmp/mysqlx.sock' bind-address: '::' port: 33060
^C
[openxs@fc29 8.0]$ ./bin/mysql -uroot -e'show global variables' >/tmp/dedicated.txt
So, what was the real difference? Here it is, thanks diff:
[openxs@fc29 8.0]$ diff -u /tmp/default.txt /tmp/dedicated.txt 
--- /tmp/default.txt    2020-06-20 08:22:25.674368791 +0300
+++ /tmp/dedicated.txt  2020-06-20 08:24:40.938561950 +0300
@@ -130,11 +130,11 @@
 innodb_buffer_pool_dump_pct    25
 innodb_buffer_pool_filename    ib_buffer_pool
 innodb_buffer_pool_in_core_file    ON
-innodb_buffer_pool_instances   1
+innodb_buffer_pool_instances   8
 innodb_buffer_pool_load_abort  OFF
 innodb_buffer_pool_load_at_startup ON
 innodb_buffer_pool_load_now    OFF
-innodb_buffer_pool_size    134217728
+innodb_buffer_pool_size    6442450944
 innodb_change_buffer_max_size  25
 innodb_change_buffering    all
 innodb_checksum_algorithm  crc32
@@ -147,7 +147,7 @@
 innodb_data_file_path  ibdata1:12M:autoextend
 innodb_data_home_dir   
 innodb_deadlock_detect ON
-innodb_dedicated_server    OFF
+innodb_dedicated_server    ON
 innodb_default_row_format  dynamic
 innodb_directories 
 innodb_disable_sort_file_cache OFF
@@ -157,7 +157,7 @@
 innodb_fill_factor 100
 innodb_flush_log_at_timeout    1
 innodb_flush_log_at_trx_commit 1
-innodb_flush_method    fsync
+innodb_flush_method    O_DIRECT_NO_FSYNC
 innodb_flush_neighbors 0
 innodb_flush_sync  ON
 innodb_flushing_avg_loops  30
@@ -183,8 +183,8 @@
 innodb_log_buffer_size 16777216
 innodb_log_checksums   ON
 innodb_log_compressed_pages    ON
-innodb_log_file_size   50331648
-innodb_log_files_in_group  2
+innodb_log_file_size   536870912
+innodb_log_files_in_group  6
 innodb_log_group_home_dir  ./
 innodb_log_spin_cpu_abs_lwm    80
 innodb_log_spin_cpu_pct_hwm    50
@@ -205,7 +205,7 @@
 innodb_online_alter_log_max_size   134217728
 innodb_open_files  431
 innodb_optimize_fulltext_only  OFF
-innodb_page_cleaners   1
+innodb_page_cleaners   4
 innodb_page_size   16384
 innodb_parallel_read_threads   4
 innodb_print_all_deadlocks OFF
[openxs@fc29 8.0]$ 
Actually, fine manual page that Mark Callaghan pointed to, explain most of these changes:
"When innodb_dedicated_server is enabled, InnoDB automatically configures the following variables:
Only consider enabling innodb_dedicated_server if the MySQL instance resides on a dedicated server where it can use all available system resources."
It described the settings in details. For systems with >4G of RAM (like mine) 75% of RAM is allocated for innodb_buffer_poll_size. Then if innodb_buffer_pool_size <= 8G, innodb_log_file_size is set to 512M. Number of redo log files in group is explained properly:
"The number of log files is configured according to the automatically configured buffer pool size (in gigabytes)."
That's why we see 6G for innodb_buffer_pool_size (75% of almost 8G) and 6 for innodb_log_files_in_group. I am not sure if that many (or more than 3) log files make any practical sense or help with anything, but probably it's a way to make sure total capacity of redo log is "large enough", while each log is not larger than 2048M.

As for setting innodb_flush_method, the value of O_DIRECT_NO_FSYNC is questionable, because of a known bug, Bug #94912 - "O_DIRECT_NO_FSYNC possible write hole". The manual is clear about this case:

"Data loss is possible if redo log files and data files reside on different storage devices, and a crash occurs before data file writes are flushed from a device cache that is not battery-backed. If you use or intend to use different storage devices for redo log files and data files, and your data files reside on a device with a cache that is not battery-backed, use O_DIRECT instead."
But I am not yet sure if the implementation is smart enough NOT to make this setting when redo logs are stored on different filesystem. Something to check later.

You may ask why setting innodb_page_cleaners to 4 is  NOT documented? Well, it is the default setting and the reason why it was adjusted to 1 in my initial configuration is explained in the manual here:
"If the number of page cleaner threads exceeds the number of buffer pool instances, innodb_page_cleaners is automatically set to the same value as innodb_buffer_pool_instances. "
I had one buffer pool instance by default, as default innodb_buffer_pool_size was < 1G. As a result of this magic tuning it's 6G, so 8 instances are used (and we can use 4 innodb_page_cleaners). Again, a questionable (but default) setting.

The default value of innodb_page_cleaners, 4, is not the best fit for all. As recent Percona blog stated:
"With 4 cleaner threads, InnoDB is able to flush at a very high rate. Actually, unless you are using Percona Server for MySQL with the parallel doublewrite buffers feature, very likely the doublewrite buffer will bottleneck before the cleaner threads."
After reading the manual pages related to the feature, I found it well documented. The only formal problem was reported as Bug #99945 - "Missing details in the description of innodb_dedicated_server".

Sometimes we have to clarify the details of some MySQL features, to find out there is no real magic behind them.
To summarize:
  • I would not say that innodb_dedicated_server setting replaces setting of really many options. Just 4 by the manual and 6 in my corner case that compared to --no-defaults settings.
  • Some settings are obvious and good (use 75% of RAM for InnoDB buffer pool if RAM is large enough, use reasonably large redo log space, others are somewhat questionable. The company I work for moves to the opposite direction with them for version 10.5.
  • The feature is well documented.
  • Personally I do not think that doing one setting instead of 4-6, for key InnoDB variables, is such a big deal for any use cases besides some artificial benchmarks.

Sunday, May 31, 2020

Dynamic Tracing of Memory Allocations in MySQL: First Steps with perf

I often have to deal with customers claiming there is a memory leak in MySQL affecting their production servers. To prove the leak is real and show where it happens usually running with tcmalloc and relying to it's heap profiler, or even more intrusive, running under Valgrind Massif is needed. This is normal when the test case demonstrating the leak is isolated and we prove a leak for some bug report. But hardly anyone in production in a normal situation would agree to do this, as performance impact is very serious.

So, recently I decided to check if dynamic tracing (along the lines of this page by Brendan Gregg, with Flame Graphs to show suspicious code path) is a more reasonable alternative. As I have to work with different Linux versions, often as old as CentOS 6, I can not rely on bcc tools for this. I mentioned more than once that in internal discussion that adding dynamic probe for malloc() (and other memory allocating functions, if you prefer) with perf may already be a good start and give some data to study. So, time to try this approach myself (so that I am ready to suggest it to customers with all the details at hand) and continue my series of posts on dynamic tracing. It can not be all the time about bugs, readers needs solutions as well :)

For this post I decided to work with latest and greatest MySQL 8.0.20 that I have built from source on my good old almost irrelevant Ubuntu 16.04 netbook:
openxs@ao756:~$ uname -a
Linux ao756 4.4.0-178-generic #208-Ubuntu SMP Sun Apr 5 23:45:10 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
openxs@ao756:~$ cd dbs/8.0
openxs@ao756:~/dbs/8.0$ bin/mysql -uroot --socket=/tmp/mysql8.sock -e "show variables like '%version%'"
+--------------------------+-----------------------+
| Variable_name            | Value                 |
+--------------------------+-----------------------+
| immediate_server_version | 999999                |
| innodb_version           | 8.0.20                |
| original_server_version  | 999999                |
| protocol_version         | 10                    |
| slave_type_conversions   |                       |
| tls_version              | TLSv1,TLSv1.1,TLSv1.2 |
| version                  | 8.0.20                |
| version_comment          | Source distribution   |
| version_compile_machine  | x86_64                |
| version_compile_os       | Linux                 |
| version_compile_zlib     | 1.2.11                |
+--------------------------+-----------------------+
openxs@ao756:~/dbs/8.0$
The reason is simple. Looks like I have more readers while writing about MySQL, not some fork of it :) I hope this tend will change soon.

First of all, I have to check what memory allocation library is used (the details of adding probe and even some features available to probe may be different for jemalloc, for example):
openxs@ao756:~/dbs/8.0$ ldd bin/mysqld
        linux-vdso.so.1 =>  (0x00007ffe8afb6000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fc9bdfe2000)
        libprotobuf-lite.so.3.6.1 => /home/openxs/dbs/8.0/bin/../lib/private/libprotobuf-lite.so.3.6.1 (0x00007fc9bdd8d000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fc9bdb85000)
        libssl.so.1.0.0 => /lib/x86_64-linux-gnu/libssl.so.1.0.0 (0x00007fc9bd91c000)
        libcrypto.so.1.0.0 => /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 (0x00007fc9bd4d7000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fc9bd2d3000)
        libaio.so.1 => /lib/x86_64-linux-gnu/libaio.so.1 (0x00007fc9bd0d1000)
        libnuma.so.1 => /usr/lib/x86_64-linux-gnu/libnuma.so.1 (0x00007fc9bcec6000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fc9bcb44000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fc9bc83b000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fc9bc625000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fc9bc25b000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fc9be1ff000)
openxs@ao756:~/dbs/8.0$
I do not see tcmalloc or jemalloc in the output above, but libc is linked dynamically. So, malloc() implementation would likely come from it (/lib/x86_64-linux-gnu/libc.so.6 binary, to be specific). Let me add the probe (see this older post for more details, some details about probes syntax presented here are also useful):
openxs@ao756:~$ sudo perf probe -x /lib/x86_64-linux-gnu/libc.so.6 'malloc size=%di:s64'
[sudo] password for openxs:
Added new events:
  probe_libc:malloc    (on malloc in /lib/x86_64-linux-gnu/libc-2.23.so with size=%di:s64)
  probe_libc:malloc_1  (on malloc in /lib/x86_64-linux-gnu/libc-2.23.so with size=%di:s64)

You can now use it in all perf tools, such as:

        perf record -e probe_libc:malloc_1 -aR sleep 1

One of the first things I want to be able to find out is how many bytes are allocated by malloc(). In case of libc malloc() on this Ubuntu I could just use the fist and only argument name (you'd be surprised what it really is, try to find out!) or just $params (see man perf-probe for details).  But in a general case (like with jemalloc on the same system) I should better refer to it based on the way it is passed - via %DI (not short form, NOT %EAI) register on x86 systems. This is why you see size=%di:s64 in the probe definition. Note also that to get real value as integer I had to use s64 type (for signed 64-bit integer, even though I;d expect the argument to be unsigned). By default and as unsigned we get hex value, and one day I'd like to be able to count also bytes allocated, not just calls.

I wanted to add return probe for the same function, in a hope to store (one day) the pointer and then maybe check (with other probe) if it is ever explicitly freed. But in my case it is not possible to get the return value:
openxs@ao756:~$ sudo perf probe -x  /lib/x86_64-linux-gnu/libc.so.6 malloc_ret='malloc%return $retval'
Failed to find "__libc_malloc%return",
 because __libc_malloc is an inlined function and has no return point.
Added new event:
  probe_libc:malloc_ret (on malloc%return in /lib/x86_64-linux-gnu/libc-2.23.so with $retval)

You can now use it in all perf tools, such as:

        perf record -e probe_libc:malloc_ret -aR sleep 1
I am yet to check the reason, but according to the above the function is inlined and thus does not have a return point. Sounds weird. Note that for jemalloc I was able to add similar probe successfully and get the pointer value traced.

So, what do I get when I try to trace this probe in perf? Let me show. If I run some stupid test surely allocating memory in one shell:
openxs@ao756:~/dbs/8.0$ bin/mysqlslap -uroot --socket=/tmp/mysql8.sock --concurrency=12 --create-schema=test --no-drop --number-of_queries=300000 --iterations=2 --query='select @a := repeat("a", 1000);'
While in the other I'll try to record some samples:
openxs@ao756:~$ sudo perf record -e 'probe_libc:malloc' -aRg sleep 10
[sudo] password for openxs:
Lowering default frequency rate to 3250.
Please consider tweaking /proc/sys/kernel/perf_event_max_sample_rate.
[ perf record: Woken up 710 times to write data ]
[ perf record: Captured and wrote 179.217 MB perf.data (1017163 samples) ]
Then I'll get the output like this from perf script:
openxs@ao756:~$ sudo perf script | grep mysqld | more
...
mysqld   632 [001] 218853.089892: probe_libc:malloc: (7fc7312b7130) size=120
                  a18de2 std::__detail::_Hashtable_alloc<Malloc_allocator<std::_
_detail::_Hash_node<TABLE_LIST*, true> > >::_M_allocate_buckets (/home/openxs/db
s/8.0/bin/mysqld)
                  a19061 lock_table_names (/home/openxs/dbs/8.0/bin/mysqld)
                  a1adc8 open_tables (/home/openxs/dbs/8.0/bin/mysqld)
                  a1b576 open_tables_for_query (/home/openxs/dbs/8.0/bin/mysqld)
                  ae7313 Sql_cmd_dml::prepare (/home/openxs/dbs/8.0/bin/mysqld)
                  aef86f Sql_cmd_dml::execute (/home/openxs/dbs/8.0/bin/mysqld)
                  a92b0a mysql_execute_command (/home/openxs/dbs/8.0/bin/mysqld)
                  a97824 mysql_parse (/home/openxs/dbs/8.0/bin/mysqld)
                  a99605 dispatch_command (/home/openxs/dbs/8.0/bin/mysqld)
                  a9abd0 do_command (/home/openxs/dbs/8.0/bin/mysqld)
                  bba6b0 handle_connection (/home/openxs/dbs/8.0/bin/mysqld)
                 20d0355 pfs_spawn_thread (/home/openxs/dbs/8.0/bin/mysqld)
...
mysqld  2795 [000] 218853.091016: probe_libc:malloc: (7fc7312b7130) size=1040                  bb7ed3 String::real_alloc (/home/openxs/dbs/8.0/bin/mysqld)
                  dda12f Item_func_repeat::val_str (/home/openxs/dbs/8.0/bin/mys
qld)
                  da5272 Item_func_set_user_var::check (/home/openxs/dbs/8.0/bin
/mysqld)
                  da5874 Item_func_set_user_var::val_str (/home/openxs/dbs/8.0/b
in/mysqld)
                  d1d1cf Item::send (/home/openxs/dbs/8.0/bin/mysqld)
                  a23d29 THD::send_result_set_row (/home/openxs/dbs/8.0/bin/mysq
ld)
                  eb86c1 Query_result_send::send_data (/home/openxs/dbs/8.0/bin/
mysqld)
                  b55524 SELECT_LEX_UNIT::ExecuteIteratorQuery (/home/openxs/dbs
/8.0/bin/mysqld)
                  b5566e SELECT_LEX_UNIT::execute (/home/openxs/dbs/8.0/bin/mysq
ld)
                  ae8f52 Sql_cmd_dml::execute_inner (/home/openxs/dbs/8.0/bin/my
sqld)
                  aefc37 Sql_cmd_dml::execute (/home/openxs/dbs/8.0/bin/mysqld)
                  a92b0a mysql_execute_command (/home/openxs/dbs/8.0/bin/mysqld)
                  a97824 mysql_parse (/home/openxs/dbs/8.0/bin/mysqld)
                  a99605 dispatch_command (/home/openxs/dbs/8.0/bin/mysqld)
...
Note the following:
  • I tried to trace all calls to malloc() from glibc, for all processes, with stack traces. As a result I've got 179 MB of raw data in perf.data over just 10 seconds, with default frequency of sampling! This is too much for any pracrtical purpose, so make sure to set smaller frequency (like -F99) or performance impact may be too seriois. With newer kernels and eBPF this impact may be notably reduced.
  • I've got both allocation size reported (and may get a good use for it one day), and stack traces, showing where the allocation is coming from! This is exactly what developers need to check if it is really a possible leak.
Now what I can I do with the above? I can use perf report (sudo perf report > /tmp/report.txt) and get the call graphs in text:
openxs@ao756:~$ cat /tmp/report.txt | more                                     
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 556  of event 'probe_libc:malloc'
# Event count (approx.): 4935574
#
# Children      Self  Command          Shared Object        Symbol

# ........  ........  ...............  ...................  ....................
................................................................................
................................................................................
..........
#
...
   87.56%    87.56%  mysqlslap        libc-2.23.so         [.] malloc
                  |
                  ---start_thread
                     run_task
                     |
                     |--86.77%-- mysql_real_query
                     |          cli_read_query_result
                     |          read_com_query_metadata
                     |          cli_read_metadata_ex
                     |          MEM_ROOT::AllocSlow
                     |          MEM_ROOT::AllocBlock
                     |          malloc
                     |
                      --0.79%-- mysql_store_result
                                cli_read_rows
                                MEM_ROOT::AllocSlow
                                MEM_ROOT::AllocBlock
                                malloc
...
    10.18%     0.00%  mysqld           mysqld               [.] do_command
                     |
                     ---do_command
                        dispatch_command
                        |
                        |--9.00%-- mysql_parse
                        |          |
                        |          |--7.88%-- mysql_execute_command
                        |          |          Sql_cmd_dml::execute
                        |          |          |
                        |          |          |--6.64%-- Sql_cmd_dml::prepare
                        |          |          |          open_tables_for_query
                        |          |          |          open_tables
                        |          |          |          lock_table_names
                        |          |          |          |
                        |          |          |          |--6.04%-- std::__detai
l::_Hashtable_alloc<Malloc_allocator<std::__detail::_Hash_node<TABLE_LIST*, true
> > >::_M_allocate_buckets
                        |          |          |          |          malloc
                        |          |          |          |
                        |          |          |           --0.59%-- get_and_lock
_tablespace_names
                        |          |          |                     std::__detai
l::_Hashtable_alloc<Malloc_allocator<std::__detail::_Hash_node<std::__cxx11::bas
ic_string<char, std::char_traits<char>, std::allocator<char> >, true> > >::_M_al
locate_buckets
                        |          |          |                     malloc
                        |          |          |
                        |          |           --1.24%-- Sql_cmd_dml::execute_inner
                        |          |                     SELECT_LEX_UNIT::execute
                        |          |                     SELECT_LEX_UNIT::ExecuteIteratorQuery
                        |          |                     Query_result_send::send_data
                        |          |                     THD::send_result_set_row
                        |          |                     Item::send
                        |          |                     |
                        |          |                     |--0.66%-- Item_func_set_user_var::val_str
                        |          |                     |          Item_func_set_user_var::check
                        |          |                     |          Item_func_repeat::val_str
                        |          |                     |          String::real_alloc
                        |          |                     |          malloc
                        |          |                     |
                        |          |                      --0.58%-- user_var_entry::val_str
                        |          |                                String::copy
                        |          |                                String::real_alloc
                        |          |                                malloc
                        |          |
                        |           --1.12%-- parse_sql
                        |                     |
                        |                     |--0.66%-- THD::sql_parser
                        |                     |          MYSQLparse
                        |                     |          THD::raise_condition
                        |                     |          Diagnostics_area::push_warning
                        |                     |          MEM_ROOT::AllocSlow
                        |                     |          MEM_ROOT::AllocBlock
                        |                     |          malloc
                        |                     |
                        |                      --0.46%-- Diagnostics_area::copy_sql_conditions_from_da
                        |                                Diagnostics_area::push_warning
                        |                                Diagnostics_area::push_warning
                        |                                MEM_ROOT::AllocSlow
                        |                                MEM_ROOT::AllocBlock
                        |                                malloc
                        |
                         --1.18%-- THD::swap_rewritten_query
                                   String::mem_realloc
                                   malloc
...
I can also create a flame graph of number of malloc() calls per function/sequence of calls. I had to modify steps from the original source a big, to come up with the following:
openxs@ao756:~$ sudo perf script > out.stack
openxs@ao756:~$ git/FlameGraph/stackcollapse-perf.pl < out.stack | git/FlameGraph/flamegraph.pl --color=mem --title='malloc( Fale Graph' --countname="calls" -- > malloc.svg
For this test I've included allocations for all the processes and had to use stackcollapse-perf.pl script, as stack traces are coming in perf format.

The result (actually a part of it, just for the mysqld process, copy/pasted as .png static file) may look as follows:

Sample malloc() calls flame graph build from one of tests.
Now you can see where most malloc() calls happen, and this is the first step in checking where leaks may come from if we really suspect them. Quite a lot of fun for a simple probe added with perf, on any Linux system younger than 10 years!

Next time I am going to apply this approach to some real life use case that looks like a leak and see if anything useful, not just fun, can be derived from it. For now I know it's doable and I am sure about the steps (documented here and more in my background notes) to trace malloc() with perf dynamic probe.

Saturday, May 30, 2020

Fun with Bugs #99 - On MySQL Bug Reports I am Subscribed to, Part XXXIII

In my previous post in this series I've commented on some interesting MySQL bug reports that were added during the second half of April. Time to move on to bugs reported in May, 2020, as we are quickly approaching MySQL Bug #100000 soon and I want to write a separate post for this great achievement :)

Here is the list:
  • Bug #99432 - "Improving memory barrier during rseg allocation". Nice contribution by my former colleague in Percona, Krunal Bauskar, who now works on making MySQL better for ARM processors. According to his findings, the use of a relaxed memory model improves performance on ARM by up to 2%. See also yet another bug report with a contribution that matters for ARM, Bug #99556 - "Avoid sequentially consistent atomics for atomic counters" (contributed by Sergey Glushchenko from Percona).
  • Bug #99444 - "New HASH JOIN order problem". One should not expect and rely on any specific order unless explicit ORDER BY is used, so formally this report by Gabor Aron is "Not a Bug". I put it into this list as several other community members helped him a lot in understanding why results with HASH_JOIN optimization in newer versions are still valid and what are the ways to get the results with the desired ordering efficiently. Guilhem Bichot, for instance, suggested two different ways, using window function and lateral table. Useful reading in any case!
  • Bug #99458 - "i_s_fts_index_cache_fill_one_index() is not protect by the lock". Looks like even crashes are possible as a result, based on comments. Nice finding by Haiqing Sun.
  • Bug #99459 - "SQL run with GROUP_MIN_MAX may infinite loop and never return". After some discussion around the validity and severity of bug reports where test case involved adding DEBUG_SYNC() to show the problem in a predictable way, this great bug report by Ze Yang was verified. All MySQL GA versions are affected, including 8.0.20! As a side note, I'd prefer NOT to read such discussions any more. They are wasting time of all parties involved.
  • Bug #99499 - "Incorrect result when constant equailty expression is used in LEFT JOIN condition". This bug that affects MySQL 5.7.x only (it was fixed in MySQL 8.0.17+ and in 5.6 code was different) was reported by Marcos Albe from Percona. 
  • Bug #99504 - "Generated column incorrect on INSERT when based on column w/ expression DEFAULT". Several problems are highlighted in the complex enough test case submitted by Brad Lanier.
  • Bug #99582 - "Reduce logging of new doublewrite buffer initialization which is confusing". 180 lines or so are added when --log-error-verbosity is set to 3. As a workaround one can add:
    log-error-suppression-list="MY-011950"
    to the [mysqld] section of the .cnf file. This problem was reported by Simon Mudd. Make sure to read all comments.
  • Bug #99591 - "Option --tc-heuristic-recover documentation wrong, missing details". In reality it does not work with more than one XA-capable engine installed. I wish fine manual documents the reality, not the good intentions of the past. This documentation request was added by Sami Ahlroos.
  • Bug #99593 - "Performance issues in 8.0.20". It seems to be yet another TempTable engine problem that caused regression comparing to MySQL 5.7. At least this:

    • SET GLOBAL internal_tmp_mem_storage_engine=MEMORY;

    is a workaround. The bug (a duplicate of internal Bug #30562964) was reported by billy noah and is fixed in upcoming MySQL 8.0.21.
  • Bug #99601 - "Broken Performance using EXIST function, increasing execution time each loop". This regression bug (without tag, but who cares...) in MySQL 8.0 was reported by Ronny Görner and minimal test case demonstrating that the problem is actually with function call was contributed by Shane Bester.
  • Bug #99643 - "innobase_commit_by_xid/innobase_rollback_by_xid is not thread safe". This bug was reported by Zhai Weixiang, who had also suggested the fix in the code.
  • Bug #99717 - "Performance regression of parallel count". Great bug report with code analysis and ready to use MTR test case from Ze Yang. Sunny Bains already confirmed that the problematic code is going to be removed.

To summarize:
  1. I am happy to see Oracle engineers explaining community bug reporters the reasons and possible solutions for the problems they hit that are not actually caused by any bug in MySQL. I tried to do this as well, whenever possible, while working on MySQL bugs...
  2. We can still find speculations that if the bug is repeatable only by adding DEBUG_SYNC() or similar debug lines, then it can not be verified or gets lower severity... IMHO this is nonsense, as there are many high severity verified real bug reports where this method is used to demonstrate the problem clearly. Just stop it!