Thursday, December 31, 2015

New Year Wishes for Providers of MySQL Support Services

Three years ago I shared my wishes for customers of Oracle's MySQL Support Services. There I basically asked them to report any problem that they suspect to be caused by the bug in MySQL software at http://bugs.mysql.com.This year I want to share wishes mostly for myself (and other providers of MySQL Support services).

I have a job of MySQL Support Engineer for almost 10.5 years. I did it in MySQL AB, Sun, Oracle and Percona. I had enough opportunities to see all kinds of approaches, types, kinds and qualities of services. But I still have some dreams in this area that I'd like to see fulfilled for both myself as a provider of service and for customers of such a service:
  1. I wish to see MySQL Support mostly done in an asynchronous way, via emails and (when absolutely needed and possible) remote login sessions.

    In most cases it's enough for customer to know that she will get a detailed, best possible answer to any her initial question (or problem statement) or any followups question or request in a predictable, well defined time. There is no need for engineer and customer to always work in sync, by talking on phone, chatting or doing a shared screen sessions.

    Support should work the same way UNIX operating system does: by sharing all available resources (engineers) among all tasks (support requests) at hand, allocating resources for the task for some small amount of time and then forcing the resource to switch to other task, either when time unit allocated is ended or immediately when we have to wait for something to complete. Surely this mode is beneficial for support providers (because of ability to work for more customers concurrently than they have engineers online), but customers also get clear benefits. They can move on and work on something else until they get email back (or time to get a reply passes), and they may get a reply based on concurrent (but asynchronous) work of several engineers ("fan-out").

  2. At the same time, I wish each support provider to have a well defined SLA (time of getting a guaranteed useful technical reply, either a solution, suggestion or further question) not only for the initial reply (as we can see here and, honestly, almost everywhere), but also for the followups, for each and every customer email.

    Ideally both sides should be able to negotiate the date(time) of the next reply (even if it's different from formal official SLA), and then make sure to meet this deadline in 100% of cases. Some steps towards this goal are visible here, but so far no well know Support provider is perfect with followups in time, based on my knowledge.

  3. I wish Support engineers to never be involved in phone conferences with customers without a clearly defined agenda related to MySQL and limited time to be spent on phone (see item 1 above for the reasons).

    Sometimes somebody from "services" side should be "there", in case of questions during some long discussion. I think this is a job for customer's TAM (technical assistance manager), Sales Engineer (if the topic is related to purchasing some service or software) or anyone who is paid per hour (like Consultant).

  4. I wish Support engineers, no matter what Support provider they work for, to always report upstream MySQL bugs at http://bugs.mysql.com/ and fork-specific bugs at their public bug trackers, as openly available (public) to all MySQL users.

    Some bugs may be repeatable only with customer-specific and confidential data, and some bugs may have security implications. Ideally, Support engineers should always work on a repeatable test case or otherwise well grounded bug report NOT containing customer data. As for security problems, there is always a way to explain in public important details of the possible security attack vector and list versions affected, without giving enough details for "script kiddies" to just blindly copy-paste the test case to get unauthorized access or crash well-managed public MySQL server.

  5. I wish Support engineers to present their work and share their experience in public.

    We all should try to share knowledge we have and get while working with customers, not only internally to our colleagues in services or via internal knowledge bases, but also in our own blogs, articles, on public MySQL forums and on MySQL-related conferences.

    MySQL Support providers should encourage support engineers to make the results of their work public whenever possible. Not only bugs, but problem solving approaches, code written (if any), experience gained should be shared with MySQL community. This will give us all customers who known more about MySQL and will help us not to re-invent the wheel.
 
To summarize, I wish our customers in the New Year of 2016 to get a simple, but well-defined, responsible, and reliable 24x7 Support service provided by the engineers who are well known to the Community based on their public work on MySQL (via blog posts, bug reports and conference presentations). I wish all MySQL Support Service providers to deliver what they promise (or more) in 100% of cases. I wish myself to work for MySQL Support Provider that cares about my wishes and tries to help me to see my dreams expressed here coming true.

Happy New Year, MySQL Community!

Saturday, December 19, 2015

Tricking the Optimizer, or How Checking Bug Reports Help to Solve Real Problems

I've got several useful habits over the years of work in MySQL Support. One of them is to start working on every problem with search for known MySQL bugs related to the problem at hand. I'd like to share one recent case where this habit helped me to get a solution for customer almost instantly.

It was one of rare cases when customer opened a support request with a very clear question and even a test case. The problem was described very precisely, more or less as follows (with table and column names, and data changed for this blog post, surely).

Let's assume we have two tables created like these:

mysql> create table t1(id int auto_increment primary key, c1 varchar(2), c2 varchar(100));Query OK, 0 rows affected (0.27 sec)

mysql> create table t2(id int auto_increment primary key, t1_id int, ctime datetime, cvalue decimal(10,2), key(t1_id, ctime));
Query OK, 0 rows affected (0.15 sec)


So, we have a couple of tables with assumed (even if not formally declared) relation, of a "master-details" kind. We have few rows in the "master" table:

mysql>  insert into t1(c1,c2) values ('UA','Val'),('US','Tom'),('UK','Jerry');
Query OK, 3 rows affected (0.06 sec)
Records: 3  Duplicates: 0  Warnings: 0


and usually several (or very many) rows for each row from "master" in the "details" table:

mysql>  insert into t2(t1_id, ctime, cvalue) values(1,NOW(),10),(1,NOW(),20),(1,NOW(),30),(2,NOW(),10),(2,NOW(),50),(3,NOW(),100);
Query OK, 6 rows affected (0.07 sec)
Records: 6  Duplicates: 0  Warnings: 0


Just to double check, this is what we have to begin with in our tables:

mysql> select * from t1;
+----+------+-------+
| id | c1   | c2    |
+----+------+-------+
|  1 | UA   | Val   |
|  2 | US   | Tom   |
|  3 | UK   | Jerry |
+----+------+-------+
3 rows in set (0.01 sec)

mysql> select * from t2;
+----+-------+---------------------+--------+
| id | t1_id | ctime               | cvalue |
+----+-------+---------------------+--------+
|  1 |     1 | 2015-12-19 17:30:59 |  10.00 |
|  2 |     1 | 2015-12-19 17:30:59 |  20.00 |
|  3 |     1 | 2015-12-19 17:30:59 |  30.00 |
|  4 |     2 | 2015-12-19 17:30:59 |  10.00 |
|  5 |     2 | 2015-12-19 17:30:59 |  50.00 |
|  6 |     3 | 2015-12-19 17:30:59 | 100.00 |
+----+-------+---------------------+--------+
6 rows in set (0.00 sec)


Now, let's try to get many more rows to the "details" table (do not mind the warnings, I did that on the server with binary log enabled and binlog_format = STATEMENT):

mysql>  insert into t2(t1_id, ctime, cvalue) select t1_id, NOW(), cvalue from t2;
Query OK, 6 rows affected, 1 warning (0.04 sec)
Records: 6  Duplicates: 0  Warnings: 1


...

mysql>  insert into t2(t1_id, ctime, cvalue) select t1_id, NOW(), cvalue from t2;
Query OK, 196608 rows affected, 1 warning (25.63 sec)
Records: 196608  Duplicates: 0  Warnings: 1


Now, we have index on both (t1_id, ctime) columns in our t2 table, so queries like the following are executed in a fast and efficient way:

mysql> explain select * from t1 join t2 on t1.id = t2.t1_id where t1.id = 3 and ctime between '2015-12-19 17:30:59' and '2015-12-19 17:31:10'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
         type: const
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: const
         rows: 1
        Extra: NULL
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: t2
         type: range
possible_keys: t1_id
          key: t1_id
      key_len: 11
          ref: NULL
         rows: 1
        Extra: Using index condition
2 rows in set (0.01 sec)


mysql> select sql_no_cache * from t1 join t2 on t1.id = t2.t1_id where t1.id = 3 and ctime between '2015-12-19 17:30:59' and '2015-12-19 17:31:10';
+----+------+-------+----+-------+---------------------+--------+
| id | c1   | c2    | id | t1_id | ctime               | cvalue |
+----+------+-------+----+-------+---------------------+--------+
|  3 | UK   | Jerry |  6 |     3 | 2015-12-19 17:30:59 | 100.00 |
+----+------+-------+----+-------+---------------------+--------+
1 row in set (0.00 sec)


Note range access and both columns used for the t1_id index of the t2 table. But what if we specify rows from the "master" (t1) table indirectly? Like this:

mysql> explain select * from t1 join t2 on t1.id = t2.t1_id where t1.c1 = 'UK' and ctime between '2015-12-19 17:30:59' and '2015-12-19 17:31:10'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
         type: ALL
possible_keys: PRIMARY
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 4
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: t2
         type: ref
possible_keys: t1_id
          key: t1_id
      key_len: 5
          ref: test.t1.id
         rows: 1
        Extra: Using index condition
2 rows in set (0.00 sec)

mysql> select sql_no_cache * from t1 join t2 on t1.id = t2.t1_id where t1.c1 = 'UK' and ctime between '2015-12-19 17:30:59' and '2015-12-19 17:31:10';
+----+------+-------+----+-------+---------------------+--------+
| id | c1   | c2    | id | t1_id | ctime               | cvalue |
+----+------+-------+----+-------+---------------------+--------+
|  3 | UK   | Jerry |  6 |     3 | 2015-12-19 17:30:59 | 100.00 |
+----+------+-------+----+-------+---------------------+--------+
1 row in set (0.22 sec)


We see that all rows from the "master" table are checked (this is expected, and we have only 3 there anyway), but then we see "ref" access via t1_id index of the "details" (t2) table, and, what's more important, we get the result notably slower than before.

Now the question: Why we are only using the t1_id element of the compound key? This becomes an issue when the leading element of the key is not highly selective.

Another question: Is there any way we can prompt the optimizer to make better use of the index so that when it has evaluated the t1_id values it scans the t2 table using the full depth of the index?

So, this was the customer request I've got. Based on my good habit, I started to search for known optimizer bugs, using the following search string in Google:

site:bugs.mysql.com join range compound index

On the very first page of results I've got a link to the following bug (that was reported by my former colleague Justin Swanhart, probably for some other customer issue, and maybe we even discussed it in the past with him, but still, I just googled for the string above, without any specific details in mind):
  • Bug #70002 - "very low performance join - eq_ref and ref access broken for compound indexes"
The bug was declared a "Duplicate" of an older Bug #52030, "WHERE clause with != primary key check slows a query from 0.06 sec to 4 min", and their similarity was far from clear to me. That older bug was reported for 5.1.x and still is "Verified". But there was no need to understand the relation, as the last comment (by Jørgen Løland) in the bug I found stated:
"Thanks for the bug report. This looks like a duplicate of BUG#52030 but with a much less obscure reproducible. I'll link the two bugs together and request that the implementor verifies your test case as well when the issue has been fixed.

A short summary of what happens: MySQL can only do '[eq_]ref' access if the comparison operator is =. That happens to be the case for the first keypart. If MySQL is to make use of the BETWEEN predicate when looking up rows in the index, 'dynamic range' access has to be used. This is the kind that says "Range checked for each record (index 0x...)" in EXPLAIN. 'Dynamic range' is more costly to use than 'ref' because parts of the optimizer has to be invoked for every row in the referred-to table (cust_car in this case). Because of this higher cost, 'dynamic range' will not be chosen by MySQL if 'ref' access is possible. Admittedly, this heuristic sometimes fail like in this bug report."
This gives the answer to the first customer question: we are using only the first column of the compound index there is a heuristic inside optimizer that prefers ref access, and this can be considered a bug/known limitation of current MySQL's optimizer.

Moreover, I've immediately tried to trick optimizer to use dynamic range access path:

mysql> explain select * from t1 join t2 on (t1.id >= t2.t1_id and t1.id <= t2.t1_id) where t1.c1 = 'UK' and ctime between '2015-12-19 17:30:59' and '2015-12-19 17:31:10'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
         type: ALL
possible_keys: PRIMARY
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 4
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: t2
         type: ALL
possible_keys: t1_id
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 392850
        Extra: Range checked for each record (index map: 0x2)
2 rows in set (0.00 sec)

mysql> select * from t1 join t2 on (t1.id >= t2.t1_id and t1.id <= t2.t1_id) where t1.c1 = 'UK' and ctime between '2015-12-19 17:30:59' and '2015-12-19 17:31:10';
+----+------+-------+----+-------+---------------------+--------+
| id | c1   | c2    | id | t1_id | ctime               | cvalue |
+----+------+-------+----+-------+---------------------+--------+
|  3 | UK   | Jerry |  6 |     3 | 2015-12-19 17:30:59 | 100.00 |
+----+------+-------+----+-------+---------------------+--------+
1 row in set (0.01 sec)

and, as you can see, got both the desired plan (on MySQL 5.6.27) and fast execution.

So, I've immediately suggested to customer to use the trick of replacing the join condition with the one that prevents "ref" access, and this workaround helped to get fast execution time for the real case as well! This was the answer to the second question: replace '=' in WHERE clause with logically equal condition that bypasses the heuristics and let the entire index to be used for each row selected for the outer table.

Problem was solved for customer in 20 minutes or so after he had opened the support request! This is the power of good habit and the result of using the Community and Oracle engineer's work on maintaining public bugs database! We do not know if the bug will ever be fixed, but we are able to use the results of its discussion to solve problems.

Side note: the same trick applied to older version, without dynamic range optimization, does not give any real benefit. The resulting query on 5.5.x, for example, is notably slower. So, another good habit to mention is: check the suggested solution on the exact version customer used before recommending it!

Sunday, December 13, 2015

Fun with Bugs #40 - Bugs fixed in MySQL 5.7.10

MySQL 5.7.10, the first release after MySQL 5.7 GA announcement, appeared almost a week ago. It's interesting to check what bugs reported by MySQL Community were fixed in this version, hence this post (that had to wait during the weekdays busy with usual support work). As usual, I'll concentrate mostly on InnoDB, replication and optimizer-related bug fixes and highlight names of those who reported and verified each bug.

Let's start with InnoDB. The following bugs reported by MySQL Community users were fixed:
  • Bug #78623 - "Small tablespaces with BLOBs take up to 80 times more space in 5.7 than in 5.6". This serious regression was reported by my colleague Laurynas Biveinis and verified by Umesh well before 5.7.9 release, but probably too late to get fixed there. This is not the only example of bugs that could be fixed in the first 5.7 GA release theoretically. Either MySQL release process is too long in Oracle now, or they were just in a hurry to announce 5.7GA "in time"...
  • Bug #78494 - "InnoDB: Failing assertion: !(&buf_pool->mutex)->is_owned() buf0buf.cc line 3388", was reported by  Roel Van de Paar and verified by Umesh. As well as some other bugs I am going to mention today, it shows that when proper QA efforts are applied community users may still find cases that are missed by Oracle MySQL QA.
  • Bug #77321 - "Import tablespace may fail with invalid corruption error". This bug was originally found in MySQL 5.6 by Santosh Praneeth Banda, who had also suggested a way to fix, and verified by Shaohua Wang (probably).  
  • Bug #78410 is still private, so we can not get the details. I do not understand why this bug is still hidden, as according to the release notes:
    "After a crash on Windows, copying the data directory to a non-Windows platform to perform the restore caused a crash recovery failure on startup. The code did not convert file path separators from \ to / in the redo log."
  • Bug #78336 - "handle_fatal_signal (sig=11) in dict_table_t::get_ref_count". It was reported for debug build by  Roel Van de Paar and verified by Shane Bester, who had provided an even simpler test case and clarified the impact. I just do not get it while, with the fix ready back on September 22 (according to the last comment), it was NOT included in 5.7.9...
  • Bug #78197 - "COMPRESSION column in innodb_sys_tablespaces is not correct", was reported by Zhai Weixiang and verified by Umesh.
  • Bug #77572 - "The bogus duplicate key error in online ddl with incorrect key name". It also affects 5.6 and was reported (with both MTR test case and patch suggested) by Zhang Yingqiang. Surely, it was easy for Sinisa Milivojevic to verify it.
  • Bug #77128 - "Regression in dealing with filesystem limits", was reported for 5.6 and 5.7 by my colleagues Przemyslaw Malkowski and verified by Umesh.
  • Bug #73225 - "If tablespace exists, can't CREATE table, but can ALTER ENGINE=InnoDB". It was reported for 5.6 by Federico Razzoli and verified by my colleague Sveta Smirnova while she was still working in Oracle... 
  • Bug #78728 - "InnoDB: Failing assertion: 0 in file ha_innopart.cc line 3526 ". This debug-binaries only assertion failure was reported by Roel Van de Paar and verified by Umesh.
There are many more bugs fixed in InnoDB, but they were internal-only, probably found by Oracle QA.

I see only two replication-related bugs from Community users fixed in 5.7.10:
  • Bug #76795 - "2pc succeeds even though binlog flush/sync fails". It was originally reported for MySQL 5.6 by  Santosh Praneeth Banda, who had suggested the patch, and verified by Umesh. See also Bug #68953 reported long time ago as a regression bug in 5.6 by Laurynas Biveinis.
As for optimizer, the list is also really short:
  • Bug #78260  - "handle_fatal_signal (sig=11) in Partition_helper::ph_read_range_first on SELECT". It was reported by Roel Van de Paar and verified by Umesh. The bug is related to partition pruning, so we can blame optimizer as well.
  • Bug #78665 - "GCOLS: GCOL VALUE CHANGES WHEN SESSION CHANGES SQL_MODE". Guilhem Bichot had reported this for MySQL Community to note, and probably he had fixed the bug himself later.The bug was verified by Miguel Solorzano.
  • Bug #78641 - "cast of too big HEX literal to BIGINT UNSIGNED: bad result and no warning". It was reported by Guilhem Bichot and verified by Umesh. MySQL 5.6 was also affected, and the fix is in 5.6.28.
  • Bug #77480 - "Wrong results for truncated column and aggregation". It was reported by Daniël van Eeden and verified by Umesh. Shane Bester had added a very simple test case later that also demonstrates the problem in MySQL 5.5 and 5.6 (where it is NOT fixed).
To summarize, we see a lot of fixes in InnoDB and 5.7.10 finally fixed some problems known for a long time (now that it's the latest GA release we'll probably get most fixes in 5.7 and only some of them will be back ported to 5.6), but there is still a long way to go. If you are interested in bugs I considered important in 5.7.9, check this post. Only 6 bugs listed there are closed.