Sunday, November 18, 2018

Fun with Bugs #72 - On MySQL Bug Reports I am Subscribed to, Part IX

I've subscribed to more than 60 new bug reports since my previous post in this series. It means that I'd need 4-5 posts to cover all new subscriptions and reasons behind them. I still plan to write about most of the bug reports I was interested in recently, but for this post I decided to pick up only MySQL 8.0 regression bugs and pay special attention to those that could be handled better or faster by Oracle engineers, as well as those handled perfectly.

The initial reason for this special attention originally was Bug #93085 - "Stall when concurrently execute create/alter user with flush privilege", that caused a lot of interesting Twitter discussions. It took some time, comments (in the bug report and in social media) and pressure from MySQL Community (including yours truly) to get it accepted as a real (regression!) bug to work on, and got "Verified". Unfortunately too often recently I see more time spent on arguing that something is not a bug or can not be reproduced, or is an example of improper usage of some MySQL feature etc instead of simply checking how things worked before MySQL 8.0 and how this changed, to worse.

Another example of "interesting" approach to bugs in MySQL 8.0 is Bug #93102 - "I_S queries don't return column names in the same case as requested.". It's indeed a duplicate of old and well known Bug #84456 - "column names in metadata appearing as uppercase when selecting from I_S" reported at early 8.0 development stage by Shane Bester from Oracle and community user (see Bug #85947). Still, it was decided NOT to fix it and tell users to rely on workaround, while this breaks application compatibility and is a regression.

Take a look at Bug #92998 - "After finishing the backup with mysqldump, mysql crashes and restarts" also. It ended up in "Unsupported" status, with some statements that "Dumping and restoring data between different 8.0 releases is not supported". This can be classified as a regression by itself. What I miss is a link to the manual saying it's not supported (was not able to find it in 5 minutes) any explanation of crash and restart - supported or not, running mysqldump should NOT cause server restarts in a general case. I think this bug report could end up in many statuses, but of them all "Unsupported" is hardly correct.

This my photo is far from ideal and can be criticized from different points of view, but there is no point to argue with the fact that it shows clouds in the sky. I wish the fact that MySQL 8.0 GA releases still have regression bugs is accepted with less arguing and more attention.

Now let me continue with a list of recently reported regression bugs in MySQL 8.0 that were handled mostly properly:

  • Bug #93215 - "Case statement use in conditional with sub_part index regression in 8.0". MySQL of versions < 8.0 (and MariaDB 10.3.7) work as expected also. The bug was verified fast, but it still misses explicit "regression" tag.
  • Bug #93214 - "LIMIT is applied before HAVING when you have a subquery". The bug was "Verified" quickly, but I still miss the exact 8.0.x version(s) affected and the results of checking with older versions. I strongly suspect it's a regression, as MariaDB 10.3.7 provides expected result:
MariaDB [test]> CREATE TABLE test (id INT PRIMARY KEY, value INT);
Query OK, 0 rows affected (0.510 sec)
MariaDB [test]> INSERT INTO test VALUES (1, 99), (2,98), (3, 97);
Query OK, 3 rows affected (0.057 sec)
Records: 3  Duplicates: 0  Warnings: 0
MariaDB [test]> SELECT, (SELECT t2.value FROM test t2 WHERE = AS sub_value FROM test t1 HAVING sub_value = 99 ORDER BY value LIMIT 1;
| id | sub_value |
|  1 |        99 |
1 row in set (0.116 sec)
            • Bug #93170 - "undo truncation in 8.0.13 is not crash safe". The bug was quickly verified (after all, it's a failure of existing innodb_undo.truncate_recover MTR test case), but had not got "regression" tag. I am still not sure how it was missed during regular testing and ended up in the MySQL 8.0.13 release.
            • Bug #93147 - "Upgrade to 8.0.13 from 8.0.11 fails". In pre-8.0 releases there was no strict need to update to every intermediate minor version, so it's also a regression of a kind for any production DBA.
            • Bug #92979 - "MySQL 8.0 performance degradation on INSERT with foreign_key_checks=0". This is a verified performance regression comparing to MySQL 5.7, but "regression" tag is still missing. 

            To summarize, there are some regressions noted by community users recently in MySQL 8.0 GA releases. Some of them were demonstrated with simple test cases, so it's strange they were not noted by Oracle's QA. What's worse, it seems some of Oracle engineers are not ready to accept the fact that the best ever MySQL 8.0 GA release they worked on may get some things done incorrectly and worse than before, so they seem to waste time on useless discussions that everything is OK, work as expected and nothing can be done differently.  I also see some processed and verified bug reports without detailed check for regressions presented to users or even with "regression" tag NOT added when needed.

            I hope this is not going to become a new trend. I wish all community bug reports and features of MySQL get as much attention and detailed study from Oracle employees as (far from perfect) JSON support in MariaDB...

            Sunday, November 4, 2018

            On New Severity Levels for MySQL Bugs

            Four weeks ago while working on a blog post about half baked XA transactions feature of MySQL server I've noted that there are new severity levels added by Oracle for MySQL bug reports. Previously we had 5 levels:

            • S1 (Critical) - mostly for all kinds of crashes, DoS attack vectors, data corruptions etc
            • S2 (Serious) - mostly for wrong results bugs, broken replication etc
            • S3 (Non-critical) - all kinds of minor but annoying bugs, from unexpected results in some corner cases to misleading or wrong error messages, inefficient or unclear code etc
            • S4 (Feature requests) - anything that should work or be implemented based on common sense, but is not documented in the manual and was not required by the original specification or implementation of some feature.
            • S5 (Performance) - everything works as expected and documented, but the resulting performance is bad or less than expected. Something does not scale well, doesn't return results fast enough in some cases, or could be made faster or some specific platform using some different code or library. This severity level was also probably added at Oracle times, at least it was not there in 2005 when I started to work on MySQL bugs.

            Informal descriptions above are mine and may be incorrect or different from definitions Oracle engineers currently use. I tried to search for Oracle definitions that apply to MySQL, but was not able to find anything immediately useful (any help with public URL is appreciated). 

            In general, severity is defined as the degree of impact a bug has on the operation or use of some software, so less severity assumes less impact on common MySQL operations. One may also expect that bugs with higher severity are fixed first (have higher internal priority). It may not be that simple (and was not during my days in MySQL, when many more inputs were taken into account while setting priority for the bug fix), but it's a valid assumption for any community member.

            By default when searching for bugs you got all bugs of severity levels S1, S2, S3 and S5. You had to specifically care to get feature requests in search results while using bugs database search interface.

            If you try to search bugs today, you'll see two more severity levels added, S6 (Debug Builds) and S7 (Test Cases):

            Now we have 7 Severity levels for MySQL bug reports
            S6 severity level seems to be used for assertion failures and other bugs that affect only debug builds and can not be reproduced literally with non-debug binaries. S7 severity level is probably used for bug reports about failing MTR test cases, assuming that failure does NOT show a regression in MySQL software, but rather some non-determinism, platform dependencies, timing assumptions or other defects of the test case itself.

            By default bug reports with these severity levels are NOT included in search (they are not considered "Production Bugs"). So, one has to care to see them. This, as well as normal common sense based assumption that lower severity eventually means to lower priority for the fix, caused some concerns. It would be great for somebody from Oracle to explain the intended use and reasons for introduction of these severity levels with some more text than a single tweet, to clarify possible FUD people may have. If applied formally, these new severity values may lead to low priority for quite important problems. Most debug assertions are in the code for really good reason, as many weird things (up to crashes and data corruption) may happen in non-debug binaries somewhere later in cases when debug-only assertion fails.

            I was surprised to find out that at the moment we have 67 active S6 bug reports, and 32 active S7 bug reports. The latter list obviously includes reports that should not be S7, like Bug #92274 - "Question of mysql 8.0 sysbench oltp_point_select test" that is obviously about a performance regression noted in MySQL 8 (vs MySQL 5.7) by the bug reporter.

            Any comments from Oracle colleagues on the reasons to introduce new severity levels, their formal definitions and impact on community bug reports processing are greatly appreciated.

            Saturday, October 27, 2018

            Fun with Bugs #71 - On Some Public Bugs Fixed in MySQL 5.7.24

            Oracle released many new MySQL versions back on Monday, but I had no time during this very busy week to check anything related (besides the fact that MySQL 8.0.13 can be complied from source on my Fedora 27 box). I am sure you've read a lot about MySQL 8.0.13 elsewhere already, even patches contributed by Community are already presented in a separate post by Jesper Krogh.

            I am still mostly interested in MySQL 5.7. So, here is my typical quick review of some selected bugs reported in public by MySQL Community users and fixed in MySQL 5.7.24.

            My wife noticed this nice spider in the garden and reported it to me via this photo. Spider is formally not a bug, while in this post I discuss pure bugs...
            Let me start with fixes in Performance Schema (that is supposed to be mostly bugs free):
            • Bug #90264 - "Some file operations in mf_iocache2.c are not instrumented". This bug reported by Yura Sorokin from Percona, who also contributed patches, is fixed in all recent Oracle releases, from 5.5.62 to 8.0.13.
            • Bug #77519 - "Reported location of Innodb Merge Temp File is wrong". This bug was reported by Daniël van Eeden back in 2015. Let's hope files are properly reported in @@tmpdir now.
            • Bug #80777 - "order by on LAST_SEEN_TRANSACTION results in empty set". Yet another bug report from Daniël van Eeden got fixed.
            Let's continue with InnoDB bugs:
            • Bug #91032 - "InnoDB 5.7 Primary key scan lack data". Really weird bug was reported by Raolh Rao back in May.
            • Bug #95045 - Release notes are referring to public bug that does not exist! So, we have a bug in them. Related text:
              "It was possible to perform FLUSH TABLES FOR EXPORT on a partitioned table created with innodb_file_per_table=1 after discarding its tablespace. Attempting to do so now raises ER_TABLESPACE_DISCARDED. (Bug #95045, Bug #27903881)"
              and refer to Bug #80669 - "Failing assert: fil_space_get(table->space) != __null in line 724", reported by Ramesh Sivaraman from Percona. In the comment from Roel there we see that actual bug was Bug #90545 that is, surprise, still private!
              Recently I found out (here) that some community members think that keeping crashing bugs private after the fixed version is released is still better than publish test cases for them before all affected versions are fixed... I am not so sure.
            What about replication (group replication aside, I have enough Galera problems to deal with in my life to even think about it)? There are some interesting bug fixes:
            • Bug #90551 - "[MySQL 8.0 GA Debug Build] Assertion `!thd->has_gtid_consistency_violation'". Good to know that Oracle engineers still pay attention to debug assertions, as in this report (with nice simple test case involving XA transactions)  reported by Roel Van de Paar from Percona.
            • Bug #89370 - "semi-sync replication doesn't work for minutes after restart replication". This bug was reported by Yan Huang, who had contributed a patch for it.
            • Bug #89143 - "Commit order deadlock + retry logic is not considering trx error cases". Nice bug report from Jean-François Gagné.
            • Bug #83232 - "replication breaks after bug #74145 happens in master". FLUSH SLOW LOGS that failed on master (because of file permission problem, for example) was still written to the binary log. Nice finding by Jericho Rivera from Percona.
            There are interesting bugs fixed in other categories as well. For example:
            • Bug #91914 - "Mysql 5.7.23 cmake fail with 'Unknown CMake command "ADD_COMPILE_FLAGS".'" Now thanks to this report by Tomasz Kłoczko one can build MySQL 5.7 with gcc 8.
            • Bug #91080 - "sql_safe_updates behaves inconsistently between delete and select". The fix is described as follows:
              "For DELETE and UPDATE that produced an error due to sql_safe_updates being enabled, the error message was insufficiently informative. The message now indicates that data truncation occurred or the range_optimizer_max_mem_size value was exceeded.

              Additionally: (1) Using EXPLAIN for such statements does not produce an error, enabling users to see from EXPLAIN output why an index is not used; (2) For multiple-table deletes and updates, an error is produced with safe updates enabled only if the target table or tables use a table scan."
              I am NOT sure this is the fix that bug reporter, Nakoa Mccullough, was expecting. He asked to be consistent with SELECT (that works). The bug is still closed :(
            • Bug #90624 - "Restore dump created with 5.7.22 on 8.0.11". It seems Emmanuel CARVIN asked for the working way to upgrade from 5.7.x to 8.0.x. Last comment seems to state that upgrade from 5.7.24 to 8.0.13 is still not possible. I had not checked this.
            • Bug #90505 is private. Release notes say:
              "If flushing the error log failed due to a file permission error, the flush operation did not complete. (Bug #27891472, Bug #90505) References: This issue is a regression of: Bug #26447825"
              OK, we have a private regression bug, fixed. Nice.
            • Bug #90266 - "No warning when truncating a string with data loss". It was when making BLOB/TEXT columns smaller. Nice finding by Carlos Tutte.
            • Bug #89537 - "Regression in FEDERATED storage engine after GCC 7 fixes". Yet another by report with patch contributed by Yura Sorokin.
            • Bug #88670 - "Subquery incorrectly shows duplicate values on subqueries.". Simple from results bug in optimizer affecting all versions starting from 5.6. Fixed now thanks to Mark El-Wakil.
            That's all bugs I wanted to mention today. To summarize my feelings after reading the release notes:
            1. I'd surely consider upgrade to 5.7.24 in any environment where replication is used. Some InnoDB fixes also matter.
            2. We still see not only private bugs (with questionable security impact) mentioned in the release notes, but this time also a typo in bug number that makes it harder to find out what was really fixed and why.
            3. I think it would be fair for Oracle to mention Percona as a major contributor to MySQL 5.7, in a same way as Facebook is mentioned in many places with regards to 8.0.13.
            4. It's good to know that some debug assertions related bugs are still fixed. More on this later...

            Sunday, October 21, 2018

            On Some Recent MySQL Optimizer Bugs

            Yet another topic I missed in my blog post on problematic MySQL features back in July is MySQL optimizer. Unlike with XA transactions, it was done in purpose, as known bugs, limitations and deficiencies of MySQL optimizer is a topic for a series of blog posts if not a separate blog. At the moment the list of known active bug reports in optimize category consists of 380(!) items (mostly "Verified"), aside from feature requests and bugs not considered "production" ones by current "rules" of MySQL public bugs database. I try to check optimizer bugs often in my posts, I reported many of them, but I am still not ready to cover this topic entirely.

            What I can do in frames of one blog post is a quick review of some "Verified" optimizer bugs reported over last year. I'll present them one by one in a list, with some comments (mostly related to my checks of the same test case with MariaDB 10.3.7 that I have at hand) and, hopefully, some conclusions about current state of MySQL optimizer.

            I'll try to shed some light on current state of MySQL optimizer, but it's huge and dark area, with many details hidden...
            So, here is the list, starting from most recently reported bugs:
            • Bug #92654 - "GROUP BY fails for queries when a temporary table is involved". This bug affects recent MySQL 8.0.12 and 5.7.23, but does not affect MariaDB 10.3, for example, from what I see:
              MariaDB [test]> insert into domain_tree values (1), (2), (3);
              Query OK, 3 rows affected (0.080 sec)
              Records: 3  Duplicates: 0  Warnings: 0

              MariaDB [test]> insert into host_connection_info values (1), (3);
              Query OK, 2 rows affected (0.054 sec)
              Records: 2  Duplicates: 0  Warnings: 0

              MariaDB [test]> SELECT
                  ->   COUNT(1),
                  ->   host_connection_status.connection_time
                  -> FROM
                  ->   (SELECT id
                  ->    FROM domain_tree) AS hosts_with_status
                  ->   LEFT OUTER JOIN
                  ->   (SELECT
                  ->      domain_id,
                  ->      'recent' AS connection_time
                  ->    FROM
                  ->      host_connection_info) AS host_connection_status
                  ->     ON = host_connection_status.domain_id
                  -> GROUP BY host_connection_status.connection_time;
              | COUNT(1) | connection_time |
              |        1 | NULL            |
              |        2 | recent          |
              2 rows in set (0.003 sec)
            • Bug #92524 - "Left join with datetime join condition produces wrong results". The bug was reported by Wei Zhao, who contributed a patch. Again, MariaDB 10.3 is not affected:
              MariaDB [test]> select B.* from h1 left join g B on h1.a=B.a where B.d=str_to_date('99991231',"%Y%m%d") and h1.a=1;
              | a | d                   |
              | 1 | 9999-12-31 00:00:00 |
              1 row in set (0.151 sec)

              MariaDB [test]> select B.* from h1 left join g B on h1.a=B.a and B.d=str_to_date
              ('99991231',"%Y%m%d") where h1.a=1;
              | a | d                   |
              | 1 | 9999-12-31 00:00:00 |
              1 row in set (0.002 sec)
            • Bug #92466 - "Case function error on randomly generated values". See also related older Bug #86624 - "Subquery's RAND() column re-evaluated at every reference". These are either regressions comparing to MySQL 5.6 (and MariaDB), or unclear and weird change in behavior that can be workarounded with some tricks (suggested by Oracle developers) to force materialization of derived table. Essentially, result depends on execution plan - what else could we dream about?
            • Bug #92421 - "Queries with views and operations over local variables don't use indexes". Yet another case when MySQL 5.6 worked differently. As Roy Lyseng explained in comments:
              "... this is due to a deliberate choice that was taken when rewriting derived tables and views in 5.7: When a user variable was assigned a value in a query block, merging of derived tables was disabled.
              In 8.0, you can override this with a merge hint: /*+ merge(v_test) */, but this is unfortunately not implemented in 5.7.
            • Bug #92209 - "AVG(YEAR(datetime_field)) makes an error result because of overflow". All recent MySQL versions and MariaDB 10.3.7 are affected.
            • Bug #92020 - "Introduce new SQL mode rejecting queries with results depending on query plan". Great feature request by Sveta Smirnova that shows current state of optimizer development properly. We need a feature for MySQL to stop accepting queries that may return different results depending on the execution plan. So, current MySQL considers different results when different execution plans are used normal! Sveta refers to her Bug #91878 - "Wrong results with optimizer_switch='derived_merge=ON';" as an example. MariaDB 10.3 is NOT affected by that bug.
            • Bug #91418 - "derived_merge causing incorrect results with distinct subquery and uuid()". From what I see in my tests, MariaDB 10.3.7 produce wrong results with derived_merge both ON and OFF, unfortunately.
            • Bug #91139 - "use index dives less often". In MySQL 5.7+ the default value of eq_range_index_dive_limit increased from to 10 to 200, and this may negatively affect performance. As Mark Callaghan noted, when there is only one possible index exists optimizer doesn't need to evaluate the query to figure out how to evaluate the query.
            • Bug #90847 - "Query returns wrong data if order by is present". This is definitely a corner case, but still. MariaDB 10.3 returns correct result in my tests.
            • Bug #90398 - "Duplicate entry for key '<group_key>' error". I can not reproduce the last public test case on MariaDB 10.3.
            • Bug #89419 - "Incorrect use of std::max". It was reported based on code analysis by Zsolt Parragi. See also Bug #90853 - "warning: taking the max of a value and unsigned zero is always equal to the other value [-Wmax-unsigned-zero]". Proper compiler detects this.
            • Bug #89410 - "delete from ...where not exists with table alias creates an ERROR 1064 (42000)". MariaDB 10.3 is also affected. Both Oracle and PostrgeSQL accepts the syntax, while in MySQL and MariaDB we can use multi-table delete syntax-based workaround as suggested by Roy Lyseng.
            • Bug #89367 - "Storing result in a variable(UDV) causes query on a view to use derived tables", was reported by Jaime Sicam. This is a kind of regression in MySQL 5.7. MariaDB 10.3 and MySQL 8.0 are not affected. Let me quote a comment by Roy Lyseng:
              "In 5.7, we added a heuristic so that queries that assign user variables are by default materialized and not merged. However, we should have let the ALGORITHM=MERGE override this decision. This is a bug."
            • Bug #89182 - "Optimizer unable to ignore index part entering less optimal query plan". Nice report from Przemyslaw Malkowski. One of many case when "ref" vs "range" decision seems to be wrong based on costs. Looks like optimizer still have parts that are heuristics/rules based and/or do not take costs into account properly.
            • Bug #89149 - "SELECT DISTINCT on multiple TEXT columns is slow". Yet another regression in MySQL 5.7+.
            That's all optimizer bugs reported in 2018 and still "Verified" that I wanted to discuss.

            From the list above I can conclude the following:
            1. There are many simple enough cases when queries provide wrong results or get not optimal execution plans in MySQL. For many of them MariaDB's optimizer does a better job.
            2. Behavior of optimizer for some popular use cases changed after MySQL 5.6, so take extra care to check queries and their results after upgrade to MySQL 5.7+.
            3. derived_merge optimization seems to cause a lot of problems for users in MySQL 5.7 and 8.0.
            4. It seems optimizer developers care enough to comment on bugs, suggest workarounds and explain decisions made.

            Saturday, October 6, 2018

            On MySQL XA Transactions

            One of the features I missed in my blog post on problematic MySQL features back in July is XA transactions. Probably I was too much in a hurry, as this feature is known to be somewhat buggy, limited and not widely used outside of Java applications. My first related feature request, Bug #40445 - "Provide C functions library implementing X/OPEN XA interface for Bea Tuxedo", was created almost 10 years ago, based on the issue from one of MySQL/Sun customers of that time. I remember some internal debates on how much time and efforts the implementation may require, but the decision was not made anyway, and one still can not directly integrate MySQL with Tuxedo transaction manager (that the idea of XA transactions originally came from). It's even more funny to see that feature request still just "Verified" when taking into account the fact that BEA Tuxedo software is Oracle's software since... 2008.

            XA Transactions support is a useful MySQL feature, but I wonder if one day it may just become abandoned as that West Pier in Brighton, or overwhelmed with many small bugs in a same way as this stairs to the beach in Hove...

            But maybe XA transactions are not widely used and nobody cares much about them?

            Let me try to do a quick review of related active bug reports and feature requests before making any conclusions:
            • Bug #91702 - "Feature Request: JOIN support for XA START command". This feature request was added less than 3 months ago and is still "Open". It means there are users interested in this feature, but Oracle engineers do not care much even to verify related requests, even less - to give them some priority. 
              See also Bug #78498 - "XA issue or limitation with 5.6.19 engine", reported 3 years ago, that is essentially about the same limitation. As bug reporter explained:
              "... it prevents us to use MySQL with Weblogic on 2 phase commit scenarii..."
            • Yet another example of a request ignored for a long time is Bug #90659 - "implicit commit and unlocking with xa start", that is about the inconsistency of current implementation. Even less (as we already know) they care about XA support outside of Java as one can conclude from the fact that Connector/Net related request, Bug #70587 - "Dot Net Distributed transaction not supported in MySql Server 5.6", had not got any attention since July, 2015...
            • Bug #91646 - "xa command still operation when super_read_only is true". This bug was reported in July by Zhenghu Wen. It seems nobody cares much about XA transactions integration when new features are added to MySQL server.
            • Bug #89860 - "XA may lost prepared transaction and cause different between master and slave". This bug reported by Michael Yang (See also Bug #88534) sounds really serious and was previously reported by Andrei Elkin (who works for MariaDB now) as Bug #76233 - "XA prepare is logged ahead of engine prepare". See also Bug #87560 - "XA PREPARE log order error in replication and binlog recovery" by Wei Zhao, who also contributed a patch. See also Bug #83983 - "innodb fail to recover the prepared xa transaction" (the bug reported by Dennis Gao is still "Open", while it's clearly related to or is a duplicate of "Verified" bugs mentioned above).
              So many related/duplicate problem reports, but no fix so far!
            • Bug #88748 - "InnoDB: Failing assertion: trx->conc_state == 1". This assertion failure was reported by Roel Van de Paar back in December, 2017. See also his Bug #84754 - "oid String::chop(): Assertion `strlen(m_ptr) == m_length' failed."
              I noted that Oracle recently invented new "low" severity levels, and this bug is S6 (Debug Builds). I do not really agree that assertions in debug builds are of so low severity - they are in the code for a reason, to prevent crashes in non-debug builds and all kinds of inconsistencies.
            • Bug #87526 - "The output of 'XA recover convert xid' is not useful". This bug reported by Sveta Smirnova caused a lot of troubles to poor users with prepared transactions hanging around for weeks after crash, as it prevented any easy way to get rid of them (and related locks) in some cases. The bug is still "Verified" in MySQL and "On hold" in Percona Server, while MariaDB fixed it in 10.3, see MDEV-14593.
            • Bug #87130 - "XA COMMIT not taken as transaction boundary". Yet another bug report with a patch from Wei Zhao.
            • Bug #75205 - "Master should write a LOST_EVENTS entry on xa commit after recovery." Daniël van Eeden reported this at early 5.7 pre-GA stage, and manual explains now that:
              "In MySQL 5.7.7 and later, there is a change in behavior and an XA transaction is written to the binary log in two parts. When XA PREPARE is issued, the first part of the transaction up to XA PREPARE is written using an initial GTID. A XA_prepare_log_event is used to identify such transactions in the binary log. When XA COMMIT or XA ROLLBACK is issued, a second part of the transaction containing only the XA COMMIT or XA ROLLBACK statement is written using a second GTID. Note that the initial part of the transaction, identified by XA_prepare_log_event, is not necessarily followed by its XA COMMIT or XA ROLLBACK, which can cause interleaved binary logging of any two XA transactions. The two parts of the XA transaction can even appear in different binary log files. This means that an XA transaction in PREPARED state is now persistent until an explicit XA COMMIT or XA ROLLBACK statement is issued, ensuring that XA transactions are compatible with replication."
              but the bug report is still "Verified".
              By the way, the need to deal with such prepared transactions recovered from the binary log caused problems like those listed above (with XA RECOVER CONVERT and order of preparing in the binary log vs engines that support XA...
            • Bug #71351 - "select hit query cache after xa commit, no result return". This bug probably affects only MySQL 5.5, so no wonder it's ignored now. Nobody tried to fix it while MySQL 5.5 was still supported, though.
            There are some more bugs originally filed in other categories, but still related to XA:
            • Bug #72036 - "XA isSameRM() shouldn't take database into account". This Connecotr/J bug was reported in 2014 by Jess Balint.
            • Bug #78050 - "Crash on when XA functions activated by a storage engine". It happens when binary log not enabled. This bug was reported by Zhenye Xie, who also contributed a patch later. Still this crashing bug remains "Verified".
            • Bug #87385 - "Partial external XA transactions are not rolled back correctly". Yet another bug report with a patch from Wei Zhao. See also his Bug #87389 - "Replication position not persisted correctly for XA transactions".
            • Bug #91633 - "Replication failure (errno 1399) on update in XA tx after deadlock". This bug reported by Lukas Sydorowski got recent comment from other community member yesterday. So, the feature is used these days, still.
            Now time for conclusions:
            1. Take extra care while using XA transactions in replication environments or with point in time recovery - you may easily end up with slaves out of sync with master and data lost.
            2. Feature requests related to XA transactions are mostly ignored, sometimes for a decade... 
            3. Patches contributed do not seem to speed up XA bug fixing.
            4. I'd say that Oracle does not care much about XA Transactions since MySQL 5.7 GA release in 2015.
            5. MySQL Community still use XA transactions with MySQL (and they will be used even more as corporate users migrate from Oracle RDBMS), find bugs and even try to fix them. But probably will have to use forks rather than MySQL itself if current attitude towards XA bugs processing and fixing remains.

            Sunday, September 23, 2018

            On MySQL Replication Bugs

            While writing about problematic Oracle MySQL features previously I concentrated mostly on InnoDB problems that I have to fight with really often and deliberately skipped replication from even the preliminary list of features to study in details for that blog post. First of all, I mostly work with MariaDB users now, and implementation of many replication features in MariaDB is notably different already (and has its own list of known problems). But this happened also because (asynchronous) replication plays a key role in most MySQL environments and deserves a detailed study in a separate post.

            It took me some time to prepare lists of recent bugs reported for MySQL replication (row-based and statement-based, GTIDs, multi-threaded and crash safe, hopefully) and then try to summarize my findings from them. Looks like I have something related to share today, finally.

            Gondolas in Venice often look very similar to each other. They all are probably replicas of few different original designs. But small differences exist. This should not happen to your data in a replication setup...
            Let me start with a summary of problems I found and then proceed with the lists of bug reports I based my conclusions on:
            1. Both minor problems with mysqlbinlog outputs, small regressions, missing useful features and real bugs (that may lead to data differences on slaves or break replication), hangs (including MTS deadlocks), crashes and performance problems with replication that can affect production setups do not seem to get high priority over last 2 years. Even those reported in public bugs database by famous MySQL Community members or current and former Oracle MySQL engineers.
            2. GTIDs and replication crash safety features were supposed to make it easier to use replication in production. In reality there are some GTID-specific ways to break replication, and it is still not as crash safe as expected in many use cases.
            3. Multi-threaded slave (a.k.a MTS or parallel replication) bugs, even serious and verified, are ignored for way more than last 2 years! The feature is not completed and may not help as much as one may expect.
            4. Manual is still wrong or missing when describing some details of current replication implementation.
            5. Nobody cares much about debug assertions found by Percona QA. We may have some bombs in the replication code that may explode in a weird way any time. 
            6. It seems group replication bugs often get proper attention fast, but nobody in Oracle has enough time to deal with asynchronous replication bugs properly. Some of them hang around without even proper verification for months...
            Now let's get into details. Here is the list of some recent row-based replication specific bugs and feature requests:
            • Bug #91822 - "incorrect datatype in RBR event when column is NULL and not explicit in query". The column missing in the INSERT statement is given the datatype of the column immediately preceding it. This bug was reported by Ernie Souhrada back in July.
            • Bug #89339 - "RBR slave index choosing logic is too simple". As Fungo Wang pointed out based on source code review, it is based on rules, not considered data distribution at all. Slave uses primary key if it exists, otherwise tries unique key that does not allow NULLs (if any), then tries other keys and finally falls back to full table scan if there are no indexes. When there is no primary or unique key on columns not allowing NULL values, no statistics is taken into account when picking up indexes vs full table scan, and this easily leads to bad plans. Surely, one has to have primary keys in all tables, especially those involved in replication, but it would be great if slave code was a bit smarter in corner cases (as MariaDB's one).
              By the way, setting slave_rows_search_algorithms to HASH_SCAN may not help, as it may be too slow in practice. See Bug #84887 by Dennis Gao for more details and an example of perf and gdb usage for the analysis.
            • Bug #89297 - "please allow number to varchar slave_type_conversions if target is large enough". Nice feature request by Shane Bester.
            • Bug #88791 - "Binary log for generated column contains new value as WHERE clause, not old value". This bug was reported by MC Brown. Surely, mysqlbinlog output for row-based events is not supposed to be 100% valid SQL for copy-paste, but it should not be misleading.
            • Bug #88595 - "Row-based master-master replication broken by add column or drop column". This is more a documentation issue. It's strange to see this limitation not documented here.
            • Bug #88223 - "Replication with no tmpdir space and InnoDB as tmp_storage_engine can break". It's really weird to see this bug reported by Sveta Smirnova hanging in "Analyzing" status for almost a year!
            • Bug #88057 - "Intermediary slave does not log master changes with binlog_rows_query_log_events". Yet another bug report from Sveta, this time "Verified". Row query log events for UPDATEs and DELETEs are NOT recorded in the binary log by the intermediate masters...
            • Bug #85292 - "unreachable statement in function process_event with event type QUERY_EVENT". Shane Bester found a part of the code that probably should be rewritten. No further public attention to this report for 1.5 years...
            • Bug #84287 - "row inserts, statement updates, persistent stats lead to table scans+ lag slaves". Yet another great finding by Shane Bester. As often happens, another problematic feature I truly hate, InnoDB's persistent statistics, is also involved in this case, leading to bad performance on slave.
            • Bug #83431 - "RBR slave's "BEGIN" binlog query event doesn't carry a session database name". It was reported almost 2 years ago by Teng Li.
            • Bug #83418 - "When binlog_row_image is MINIMAL, some updates queries logs redundant columns". This was reported by Abhinav Sharma also almost two years ago.
            • Bug #83090 - "AUTO_INCREMENT not updated when using INSERT ON DUPLICATE KEY UPDATE". This bug was reported by Monty Solomon two years ago. See also his related Bug #83030 - "AUTO_INCREMENT not updated when using REPLACE INTO".
            • Bug #80821 - "Replication breaks if multi-table DELETE is used in conjunction with Foreign Key". I do not understand why this perfect (regression?) bug report by Philip Stoev, with a clear MTR test case was immediately verified and then just hangs around for 2.5 years...
            • Bug #80770 - "Compression of row-based event". Great feature contributed by Davi Arnaut.
            • Bug #79831- "Unexpected error message on crash-safe slave with max_relay_log_size set". Let me just conclude this with a 5.6-spcific (probably) bug report by yours truly...
            I have to stop at this point, as other RBR-related bug reports are known for way more time and probably are just forgotten for good already by MySQL development. They hardly have a chance for the fix any time soon.

            Let's continue with replication bugs that are not closely related to row-based binlog format:
            • Bug #92398 - "point in time recovery using mysqlbinlog broken with temporary table -> errors". Shane Bester found yet another case when statement-based binlog format may lead to problems for point in time recovery.
            • Bug #92252 - "Mysql generates gtid gaps because of option slave-skip-errors". This bug was reported by Pin Lin. We get gaps in GTID values on slave when errors are skipped. If binlog files on master include transactions which were applied on slave and hit errors in slave-skip-errors were purged, slave can't reconnect to master and reports error 1236. Features that were supposed to make replication more robust lead to more breaks in replication.
            • Bug #92132 - "secure-file-priv breaks LOAD DATA INFILE replication in statement mode on 5.7.23". This regression bug was reported by Nicolai Plum. Recent MySQL 5.7 and 8.0 versions are affected.
            • Bug #92109 - "Please make replication crash safe with GTID and less durable setting (bis)." Jean-François Gagné keeps looking for options to make replication really crash safe in all practically important cases.
            • Bug #92073 - "int change_master(): Assertion `mi->rli->inited' failed." Roel Van de Paar continues his QA efforts in a hope to make MySQL better one day and reports all debug assertions in public. In this case we seem to have a kind of regression in 8.0 vs 5.7. Check also these his bug reports: Bug #91980 - "Assertion `head->variables.gtid_next.type == ANONYMOUS_GTID' failed." and Bug #90946 - "Assertion `get_state() >= 0' failed on RESET MASTER while tx_read_only". Debug assertions are in the code for a reason usually, so in a latter bug report Roel recently asked for some update...
            • Bug #91766 - "Log of replication position when running CHANGE MASTER TO is wrong.". Jean-François Gagné found yet another annoying detail - one would expect the error log to have data about the SQL Thread position, but it has data about the IO Thread position.
            • Bug #91744 - "START SLAVE UNTIL going further than it should." Now this bug reported by  Jean-François Gagné is really serious. Still it hangs around "Open" since July 20...
            • Bug #91633 - "Replication failure (errno 1399) on update in XA tx after deadlock". XA is yet another MySQL feature that is doomed to stay half-baked it seems... More on this you should expect in a separate blog post one day. The bug that causes replication break was reported by Lukas Sydorowski.
            • The last but not the least bug in this list (that is actually long if we try to check all active bugs reported over last 2 years) that I want to include is Bug #90448 - "INSERT IGNORE not ignored on SQL thread", by Simon Mudd. Restarting slave helps and it seems MySQL 8.0.x is not affected. See also his Bug #89876 - "mysqladmin flush-hosts is not safe in GTID mode". 
            I have to stop at some point. In conclusion let me review a list of (less) recent bugs in a multi-threaded slave (MTS) implementation specifically:
            • Bug #91477 - "Slave writes less when connected intermediary master with blackhole engine". Just 2 threads are used in this case as found by Sveta Smirnova.
            • Bug #90342 - "MySQL parallel replication automatically restarts". Nobody cares about this crashing bug for 5+ months. It's still "Open".
            • Bug #89247 - "Deadlock with MTS when slave_preserve_commit_order = ON.Jean-François Gagné reported this 9 months ago, the bug was verifvied and based on last comments and pt-pmp outputs the hang really happen for many users. Still nobody cares in any obvious way.
            • Bug #89229 - "FLUSH PRIVILEGES may cause MTS deadlock". This bug was reported by Libing Song 9 months ago.
            • Bug #87796 - "Commit_order_manager can't terminate MTS worker properly when deadlock happens". This bug was reported by Seunguck Lee a year ago!
            • Bug #85747 - "Documentation on MTS statistics". Yet another case of missing manual reported by Juan Arruti:
              "I didn't find information in MySQL reference manual about MTS statistics, I had to look into source code in order to get more information about them."
            • Bug #85142 - "reducing MTS checkpointing causes high IO load". The bug reported by Trey Raymond is "Verified" for 1.5 years. Check also his Bug #81386 - "MTS hangs on file related *_EVENT forced checkpoint" and Bug #81385 - "relay log info not updated after using sql_after_mts_gaps" that are "Open" since May 2016!
            • Bug #73066 - "Replication stall with multi-threaded replication". Ovais Tariq reported this bug more than 4 years ago, when both still worked in Percona. Nobody cares to fix it.
            • Bug #71495 - "Logging of MTS breakage is insufficient." One of Oracle customers who use MTS in production, Simon Mudd had reported this practical problem 4.5+ years ago. No fix, no comments from Oracle engineers from more than 3 years.
            I have to stop at this stage, as checking more/older replication related bugs makes me even more sad than I planned or want to be. We see fixes in every release, but the backlog seems huge. Dear Oracle MySQL engineers, please, review public MySQL replication bug reports collected over last 4-5 years and spend some more time and resources on processing and fixing them! Otherwise this Acqua alta of bugs will make this great legacy feature obsolete one sad day.

            I had not planned to study semi-sync replication bugs in this post (maybe, some day). If you are interested in more details about this feature, crash safety of replication, parallel replication and replication in general, I suggest you to check J-F Gagné's MySQL Blog.

            Saturday, August 18, 2018

            On Fine MySQL Manual

            Today I am going to provide some details on the last item in my list of problems with Oracle's way of MySQL server development, maintenance of MySQL Manual. I stated that:
            "MySQL Manual still have many details missing and is not fixed fast enough.
            Moreover, it is not open source...
            Let me explain the above:
            1. MySQL Reference Manual is not open source. It used to be built from DocBook XML sources. Probably that's still the case. But you can not find the source code in open repositories (please, correct me if I am wrong, I tried to search...) That's because it is NOT open source. It says this clearly in Preface:
              "This documentation is NOT distributed under a GPL license. Use of this documentation is subject to the following terms:
              Any other use, such as any dissemination of printed copies or use of this documentation, in whole or in part, in another publication, requires the prior written consent from an authorized representative of Oracle. Oracle and/or its affiliates reserve any and all rights to this documentation not expressly granted above.
              It was NOT Oracle who closed the source (as far as I remember, the manual was not GPL even in 2004, when I started to care about MySQL in general). But Oracle had a chance to change the license and set up some better contribution process for MySQL Community users, with many good writers among them. They decided not do this, so creators of forks and derived software have to duplicate efforts and rewrite everything themselves, and the only real way to help make the manual better is to report bugs.
            2. Quality of new documentation is not improved much. We, MySQL Community users, have to report bugs in the manual, as it still has some details missing or documented wrongly. Let me illustrate this with recent 10 or so documentation requests users made (I skipped reports for features I do not care about for now, like group replication):
              • Bug #91955 - "8.0 broke plugin API for C". According to the comment from Oracle developer in this bug reported by Laurynas Biveinis, writing plugins in C is not supported any more... But this change is NOT properly documented.
              • Bug #91781 - "Add version matrix in "Chapter 2 Keywords and Reserved Words". Good idea in the manual is not consistently implemented.
              • Bug #91743 - "performance_schema.data_locks not working correctly". Nice feature was added to Performance Schema (now we can check and study InnoDB locks with SQL!), but it is still not completely and properly documented.
              • Bug #91654 - "Handling an Unexpected Halt of a Replication Slave" Documentation is uncertain". This is also an example of improper bug handling - when I worked in Oracle newer bugs were usually declared duplicates of older ones. Here the opposite decision was made, even though both reports were from the same user, Uday Varagani, who explicitly asked to change the decision. Obviously, documentation requests do not get any extra care comparing to other kinds of bugs, quite to the contrary...
              • Bug #91648 - "Numeric literals must be a number between 0 and 9999". Surely ports with numbers larger than 9999 can be used.
              • Bug #91642 - "Document how protocol version 4.0 and 4.1 differ on how strings are terminated". As noted by Rene' Cannao', comments in the code are still sometimes more useful than documentation.
              • Bug #91549 - "MDL lock queue order seems to depend on table names". Make sure to read last comment in this nice request by Shane Bester. Dmitry Lenev provides some details on priorities of MDL locks in it. There are still cases when bugs and documentation requests document some details better than fine MySQL Manual!
              • Bug #90997 - "Handling an Unexpected Halt of a Replication Slave" manual page is wrong". Sveta Smirnova highlighted a totally misleading statement in the manual.
              • Bug #90935 - "Modify relay_log_recovery Documentation". Simple documentation request stays "Open" for 3 months. Definitely processing documentation requests is not a highest priority for Oracle engineers.
              • Bug #90680 - "dragnet logging: document how to use / find error_symbol codes". Even if request comes from a customer or otherwise well known bug reporter, like Simon Mudd, and it's related to some totally new cool feature of MySQL 8, it can wait for the fix for months...
              You can make your own conclusions from the above. But I do not see any good trends in the way new documentation is created or documentation requests are processed recently. Same problems as 4 years ago (see more on that in a side note below).
            3. Older documentation requests get even less attention than recent ones sometimes, even though they may highlight problems with software itself, not the MySQL Manual. Let me illustrate this with a few bugs I reported:
              • Bug #83640 - "Locks set by DELETE statement on already deleted record". I explicitly asked to
                "...document in the manual what locks this kind of DELETE sets when it encountered a record already marked as deleted, and why"
                This still had NOT happened. Moreover, eventually both MySQL and MariaDB engineers decided that current implementation of locking for this case is wrong and can be improved, so this report ended up as InnoDB bug. Check related  MDEV-11215 for more details.
              • Bug #73413 - "Manual does not explain MTS implementation in details". This is one of my favorite documentation requests. I've got a suggestion to explain what I want to see documented, in details. I tried my best, but if you want to get more details, read this blog.
              • Bug #72368 - "Empty/zero statistics for imported tablespace until explicit ANALYZE TABLE". This is a bug in the way persistent statistics (the feature I truly hate) in InnoDB is re-estimated "automatically". But until the bug is fixed, I asked to document current implementation (and problems with it). So, where current implementation is properly documented? If only in the bug reports...
              • Bug #71294 - "Manual page for P_S.FILE_INSTANCES table does not explain '~' in FILE_NAME".  They pretend they do not get the point:
                "Per Marc Alff, the Performance Schema is simply reporting what it gets from runtime. If runtime is reporting nonexistent filenames, that's a server issue.

                Recategorizing this as a server bug.
              • Bug #71293 - "Manual page for P_S.FILE_INSTANCES table does not explain EVENT_NAME values". Now they are waiting for the new DOCUMENTATION column in the setup_instruments table to be filled in with something by server developers... The code is the documentation? OK, bus as we know from the experience (see slides 44-46 here) chances to get a bug in Performance Schema fixed fast are even less than to see it properly and completely documented...
            There are more problems with MySQL documentation (not only reference manual), but at the moment I consider 3 highlighted and illustrated above the most serious.

            Regent's Canal is nice. If I only knew how to operate the locks there... MySQL Manual also misses some information about locks.
            As a side note, it's not the first time I write about MySQL Manual. You can find some useful details in the following posts:
            • "What's wrong with MySQL Manual". In 2014, after spending few weeks reporting up to 5 documentation bugs per day, I thought that, essentially, there is nothing much wrong with it - it's fine, well indexed by Google and has meaningful human-readable URLs. Few problems listed were lack of careful readers (I tried my best to fix that), limited attention to old documentation requests, some pages with not so much useful content and lack of "How to" documents. The later I also tried to fix to some small extent in this blog, see howto tagged posts. The real fix came mostly from Percona's blog, though.
            • I have a special "missing manual" tag for blog posts that mention at least one bug in the manual.
            • I tried to use "missing_manual" tag consistently for my own documentation requests. Last year I shared a detailed enough review of the bugs with that tag that were still active.
            As a yet another side note, I tried once to create a kind of community driven "missing manual" project, but failed. Writing manual from scratch is a (hard) full time job, while my full time job is providing support to users of MySQL, MariaDB and Percona software...

            That said, I also wanted to join MySQL's documentation team in the past, but this was not possible at least because I am not a native speaker. If anything changed in this regard, I am still open to a job offer of this kind. My conditions for an acceptable offer from Oracle are known to all interested parties and they include (but are not limited to) at least 4 times the salary I had before I quit (it was almost 6 years ago) and working for Oracle's office in Paris (because in Oracle office you ere employed by formally matters a lot for your happiness and success as employee).

            In case of no offer in upcoming week, I'll just continue to write my annoying but hopefully somewhat useful blog posts (until MySQL Reference Manual becomes true open source project ;)