Sunday, October 27, 2013

Recalculating InnoDB Persistent Statistics - a Story of the Bug Report

One of the first posts in this blog was about reporting MySQL bugs "properly", in a way that maximizes chances for it to be processed really soon. I had written the following there:
"Ideally, you should provide a complete test case and/or instructions that any reader can use to reproduce your problem"
Indeed, if one can just copy/paste something to mysql command line client or run some file attached to see the problem, chances are high for the bug to be processed really soon. We all like to get low hanging fruits from time to time, and Oracle engineers who work on bugs are not exceptions. But does this mean that bug without clear test case has no value and is going to be ignored?

It should NOT be the case. Let's review Bug #70617 reported by my colleague some time ago based on the problem we helped our customer to solve. There bug reporter just copied somewhat edited email with problem description we've got, and it was referring to a very complex SQL statement (joining 17 tables or so) that worked slow when executed from some PHP application on MySQL 5.6, but always worked fast when similar SQL was executed in mysql command line client and mostly worked fast in good old MySQL 5.5 in both cases.

The bug report contained this kind of problem description and listed workarounds we found: replacing all INNER JOINs with STRAIGHT_JOIN (trick that is useful more often than I'd like it to be) let us get not ideal, but consistent performance, while recreating the tables with persistent statistics disabled just let us get the same good performance as in 5.5 (take the fact that persistent statistics for InnoDB tables is enabled by default into account every time you see query from older versions performing poorly in MySQL 5.6, by the way).

I let you review that my old post (with the content that was once offered as a talk to MySQL User Conference back in 2006 and recently was submitted for Percona Live London conference, but again was not accepted) and decide for yourself is the bug report good enough to get proper attention. It seems some of my former colleagues in Oracle decided that it is not any good to report something like that, without a simple test case, and they were also offended enough by some statements made there and the fact that bug reporter explicitly does NOT want to spend time on creating small test case not related to customer data... So, bug report was ignored, while I've got few chances to try to defend bug reporter in private chats.

Assuming that without a test case this bug report is going to be ignored longer than I'd like to, I've spent some 30 minutes testing and sending "good" reports:
  • Bug #70629 - "innodb_stats_auto_recalc problem for InnoDB tables with persistent statistics"
  • Bug #70630 - "Why one can access persistent statistics data while they are changing?"
this time with a simple test case to copy/paste that demonstrated both the reason of the problem (persistent statistics is NOT recalculated immediately) and the way this badly influences join order selection by the optimizer even with just 2 tables. I had not even tried to do anything based on customer data - 2 simple tables and standard enough actions allowed to see the problem immediately.

Was that because I am smart or experienced in bugs processing? Not at all, my actions where really simple. Nice and smart test cases for the original  Bug #70617 were created and added by famous Shane Bester next day, probably as soon as he noted the bug (it was hard NOT to note it after my Facebook posts), along with some insights based on code review. As a result, original bug report was verified even before my detailed one with a simple test case.

To complete this story I've just added this documentation request, Bug #70741. Let's hope the manual will soon clearly describe how the background thread that re-estimates statistics really works and explain why one should run ANALYZE TABLE after any big change in data if he plans to run queries against the table immediately, no matter what the setting of innodb_stats_auto_recalc is.

Looks like I've spent more time arguing about the original bug report (both externally and internally) than any good engineer (Shane or me) needed to understand the problem based on description and create a test case showing something that is a problem and may be related to the original report. Even with time to report 3 more bugs taken into account, arguing and "Facebook escalations" took more...

What is the summary?

First of all, engineers who work on bugs should spend few minutes carefully reading and thinking about the problem described before ignoring it or blaming reporter for wrong attitudes just because he had no time to create a simple test case (or even did not wanted to do this, for whatever reasons). Bug report with clear problem statement and solutions found also presented is already good enough to think about it. Why would one assume that others are wasting time on bug reporting without really good reasons?

Bug reporters, on the other side, are surely able to get their bugs processed faster if, instead of explaining why they can not afford spending any more time on bug report and making various statements not directly related to the problem at hand, they spend some time creating a test case to copy/paste. Chances are high that original bug reporter understands the problem way better already than anybody else in the world, so why not to try to make the world better by contributing some more lines of text?


  1. I look at it this way: I am not paid to create Oracle test cases. Any test cases I create are CONTRIBUTIONS. The bug is ALREADY a contribution itself. If a bug report is logical and describes behavior and workarounds, then the people getting PAID TO WRITE TEST CASES can write a test case. IMHO.

  2. I have to agree with you, after spending some time on the "other side"...

    Every bug report is a favor and contribution, even if it is not so easy to process. In this case it was really easy to understand what can be the reason, and this was enough to start creating a test case highlighting the problem in a very clear and easily repeatable way.