Blog of (former?) MySQL Entomologist: On MySQL Replication Bugs

While writing about problematic Oracle MySQL features previously, I concentrated mostly on the InnoDB problems that I have to fight with really often, and I deliberately skipped replication from even the preliminary list of features to study in detail for that blog post. First of all, I mostly work with MariaDB users now, and the implementation of many replication features in MariaDB is already notably different (and has its own list of known problems). But this also happened because (asynchronous) replication plays a key role in most MySQL environments and deserves a detailed study in a separate post.

It took me some time to prepare lists of recent bugs reported for MySQL replication (row-based and statement-based, GTIDs, multi-threaded and crash safe, hopefully) and then try to summarize my findings from them. Looks like I have something related to share today, finally.

Gondolas in Venice often look very similar to each other. They are all probably replicas of a few different original designs. But small differences exist. This should not happen to your data in a replication setup...

Let me start with a summary of problems I found and then proceed with the lists of bug reports I based my conclusions on:

Minor problems with mysqlbinlog output, small regressions and missing useful features, as well as real bugs (that may lead to data differences on slaves or break replication), hangs (including MTS deadlocks), crashes and performance problems with replication that can affect production setups, do not seem to have been given high priority over the last 2 years. This applies even to those reported in the public bugs database by famous MySQL Community members or current and former Oracle MySQL engineers.
GTIDs and replication crash safety features were supposed to make it easier to use replication in production. In reality there are some GTID-specific ways to break replication, and it is still not as crash safe as expected in many use cases.
Multi-threaded slave (a.k.a. MTS or parallel replication) bugs, even serious and verified ones, have been ignored for way more than the last 2 years! The feature is not complete and may not help as much as one may expect.
The manual is still wrong about, or does not cover at all, some details of the current replication implementation.
Nobody cares much about debug assertions found by Percona QA. We may have some bombs in the replication code that may explode in a weird way any time.
It seems group replication bugs often get proper attention fast, but nobody in Oracle has enough time to deal with asynchronous replication bugs properly. Some of them hang around without even proper verification for months...

Now let's get into details. Here is the list of some recent row-based replication specific bugs and feature requests:

Bug #91822 - "incorrect datatype in RBR event when column is NULL and not explicit in query". The column missing in the INSERT statement is given the datatype of the column immediately preceding it. This bug was reported by Ernie Souhrada back in July.
Bug #89339 - "RBR slave index choosing logic is too simple". As Fungo Wang pointed out based on source code review, it is based on rules and does not consider data distribution at all. The slave uses the primary key if it exists, otherwise tries a unique key that does not allow NULLs (if any), then tries other keys and finally falls back to a full table scan if there are no indexes. When there is no primary or unique key on columns not allowing NULL values, no statistics are taken into account when choosing between indexes and a full table scan, and this easily leads to bad plans. Surely, one has to have primary keys in all tables, especially those involved in replication, but it would be great if the slave code was a bit smarter in corner cases (as MariaDB's is).
By the way, setting slave_rows_search_algorithms to HASH_SCAN may not help, as it may be too slow in practice. See Bug #84887 by Dennis Gao for more details and an example of perf and gdb usage for the analysis.
Bug #89297 - "please allow number to varchar slave_type_conversions if target is large enough". Nice feature request by Shane Bester.
Bug #88791 - "Binary log for generated column contains new value as WHERE clause, not old value". This bug was reported by MC Brown. Surely, mysqlbinlog output for row-based events is not supposed to be 100% valid SQL for copy-paste, but it should not be misleading.
Bug #88595 - "Row-based master-master replication broken by add column or drop column". This is more a documentation issue. It's strange to see this limitation not documented here.
Bug #88223 - "Replication with no tmpdir space and InnoDB as tmp_storage_engine can break". It's really weird to see this bug reported by Sveta Smirnova hanging in "Analyzing" status for almost a year!
Bug #88057 - "Intermediary slave does not log master changes with binlog_rows_query_log_events". Yet another bug report from Sveta, this time "Verified". Row query log events for UPDATEs and DELETEs are NOT recorded in the binary log by the intermediate masters...
Bug #85292 - "unreachable statement in function process_event with event type QUERY_EVENT". Shane Bester found a part of the code that probably should be rewritten. No further public attention to this report for 1.5 years...
Bug #84287 - "row inserts, statement updates, persistent stats lead to table scans+ lag slaves". Yet another great finding by Shane Bester. As often happens, another problematic feature I truly hate, InnoDB's persistent statistics, is also involved in this case, leading to bad performance on slave.
Bug #83431 - "RBR slave's "BEGIN" binlog query event doesn't carry a session database name". It was reported almost 2 years ago by Teng Li.
Bug #83418 - "When binlog_row_image is MINIMAL, some updates queries logs redundant columns". This was reported by Abhinav Sharma also almost two years ago.
Bug #83090 - "AUTO_INCREMENT not updated when using INSERT ON DUPLICATE KEY UPDATE". This bug was reported by Monty Solomon two years ago. See also his related Bug #83030 - "AUTO_INCREMENT not updated when using REPLACE INTO".
Bug #80821 - "Replication breaks if multi-table DELETE is used in conjunction with Foreign Key". I do not understand why this perfect (regression?) bug report by Philip Stoev, with a clear MTR test case, was immediately verified and has then just hung around for 2.5 years...
Bug #80770 - "Compression of row-based event". Great feature contributed by Davi Arnaut.
Bug #79831 - "Unexpected error message on crash-safe slave with max_relay_log_size set". Let me just conclude this with a 5.6-specific (probably) bug report by yours truly...
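
The rule-based choice described in Bug #89339 above can be illustrated with a minimal Python sketch. This is not the server code: the Index class, its fields and the choose_rbr_lookup function are hypothetical names used only for illustration, and taking the first of the "other keys" is an assumption, since the summary above only says that other keys are tried. The point is that nothing in these rules looks at statistics or data distribution.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Index:
    """Hypothetical description of one index on the slave's table."""
    name: str
    primary: bool = False
    unique: bool = False
    has_nullable_column: bool = False  # True if any key column allows NULL


def choose_rbr_lookup(indexes: List[Index]) -> Optional[str]:
    """Return the name of the index chosen by the rules, or None for a full table scan."""
    # Rule 1: the primary key wins whenever it exists.
    for idx in indexes:
        if idx.primary:
            return idx.name
    # Rule 2: otherwise, a unique key whose columns do not allow NULL.
    for idx in indexes:
        if idx.unique and not idx.has_nullable_column:
            return idx.name
    # Rule 3: otherwise, some other key (ASSUMPTION: the first one listed).
    # No statistics are consulted here, which is the essence of the bug report.
    if indexes:
        return indexes[0].name
    # Rule 4: no indexes at all, so only a full table scan is left.
    return None


if __name__ == "__main__":
    table_indexes = [
        Index("idx_status"),  # low-selectivity secondary key
        Index("uk_email", unique=True, has_nullable_column=True),
    ]
    print(choose_rbr_lookup(table_indexes))
```

With the example table at the bottom, the nullable unique key is skipped by rule 2 and the low-selectivity secondary key is picked by rule 3, even if a full table scan would be cheaper.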

I have to stop at this point, as other RBR-related bug reports have been known for way longer and have probably been forgotten for good by MySQL development already. They hardly have a chance of being fixed any time soon.

Let's continue with replication bugs that are not closely related to row-based binlog format:

Bug #92398 - "point in time recovery using mysqlbinlog broken with temporary table -> errors". Shane Bester found yet another case when statement-based binlog format may lead to problems for point in time recovery.
Bug #92252 - "Mysql generates gtid gaps because of option slave-skip-errors". This bug was reported by Pin Lin. We get gaps in GTID values on the slave when errors are skipped. If the master's binlog files containing the transactions that hit errors listed in slave-skip-errors on the slave have been purged, the slave can't reconnect to the master and reports error 1236. Features that were supposed to make replication more robust lead to more breaks in replication.
Bug #92132 - "secure-file-priv breaks LOAD DATA INFILE replication in statement mode on 5.7.23". This regression bug was reported by Nicolai Plum. Recent MySQL 5.7 and 8.0 versions are affected.
Bug #92109 - "Please make replication crash safe with GTID and less durable setting (bis)." Jean-François Gagné keeps looking for options to make replication really crash safe in all practically important cases.
Bug #92073 - "int change_master(): Assertion `mi->rli->inited' failed." Roel Van de Paar continues his QA efforts in the hope of making MySQL better one day and reports all debug assertions in public. In this case we seem to have a kind of regression in 8.0 vs 5.7. Check also these bug reports of his: Bug #91980 - "Assertion `head->variables.gtid_next.type == ANONYMOUS_GTID' failed." and Bug #90946 - "Assertion `get_state() >= 0' failed on RESET MASTER while tx_read_only". Debug assertions are usually in the code for a reason, so in the latter bug report Roel recently asked for an update...
Bug #91766 - "Log of replication position when running CHANGE MASTER TO is wrong.". Jean-François Gagné found yet another annoying detail - one would expect the error log to have data about the SQL Thread position, but it has data about the IO Thread position.
Bug #91744 - "START SLAVE UNTIL going further than it should." Now this bug reported by Jean-François Gagné is really serious. Still it hangs around "Open" since July 20...
Bug #91633 - "Replication failure (errno 1399) on update in XA tx after deadlock". XA is yet another MySQL feature that is doomed to stay half-baked it seems... More on this you should expect in a separate blog post one day. The bug that causes replication break was reported by Lukas Sydorowski.
Last but not least in this list (which would be really long if we tried to check all active bugs reported over the last 2 years), I want to include Bug #90448 - "INSERT IGNORE not ignored on SQL thread", by Simon Mudd. Restarting the slave helps, and it seems MySQL 8.0.x is not affected. See also his Bug #89876 - "mysqladmin flush-hosts is not safe in GTID mode".
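
To check whether a slave already has the kind of GTID gaps described in Bug #92252 above, something like the following Python sketch may help. It assumes the executed GTID set is available as a string in the usual uuid:interval[:interval] format (for example, the value of @@GLOBAL.gtid_executed copied from the slave). The function names are hypothetical, and only gaps between intervals of the same source UUID are reported.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]


def parse_gtid_spec(spec: str) -> Tuple[str, List[Interval]]:
    """Split 'uuid:1-5:7-10' into the uuid and sorted (first, last) intervals."""
    uuid, *parts = spec.strip().split(":")
    intervals = []
    for part in parts:
        first, _, last = part.partition("-")
        # A single transaction such as '7' is the interval (7, 7).
        intervals.append((int(first), int(last or first)))
    return uuid, sorted(intervals)


def find_gtid_gaps(gtid_set: str) -> Dict[str, List[Interval]]:
    """Return, per source uuid, the missing ranges between executed intervals."""
    gaps: Dict[str, List[Interval]] = {}
    # The server may print the set across several lines, one uuid per line.
    for spec in gtid_set.replace("\n", "").split(","):
        if not spec.strip():
            continue
        uuid, intervals = parse_gtid_spec(spec)
        missing = []
        for (_, prev_last), (next_first, _) in zip(intervals, intervals[1:]):
            if next_first > prev_last + 1:
                missing.append((prev_last + 1, next_first - 1))
        if missing:
            gaps[uuid] = missing
    return gaps


if __name__ == "__main__":
    # Made-up example value, not taken from the bug report.
    executed = "3e11fa47-71ca-11e1-9e33-c80aa9429562:1-5:7-10"
    print(find_gtid_gaps(executed))
```

Any range reported this way is a transaction the slave would have to request from the master on reconnect, which fails once the corresponding binlog files are purged.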

I have to stop at some point. In conclusion let me review a list of (less) recent bugs in a multi-threaded slave (MTS) implementation specifically:

Bug #91477 - "Slave writes less when connected intermediary master with blackhole engine". Just 2 threads are used in this case as found by Sveta Smirnova.
Bug #90342 - "MySQL parallel replication automatically restarts". Nobody cares about this crashing bug for 5+ months. It's still "Open".
Bug #89247 - "Deadlock with MTS when slave_preserve_commit_order = ON." Jean-François Gagné reported this 9 months ago, the bug was verified, and based on the last comments and pt-pmp outputs the hang really happens for many users. Still nobody cares in any obvious way.
Bug #89229 - "FLUSH PRIVILEGES may cause MTS deadlock". This bug was reported by Libing Song 9 months ago.
Bug #87796 - "Commit_order_manager can't terminate MTS worker properly when deadlock happens". This bug was reported by Seunguck Lee a year ago!
Bug #85747 - "Documentation on MTS statistics". Yet another case of missing manual reported by Juan Arruti:
"I didn't find information in MySQL reference manual about MTS statistics, I had to look into source code in order to get more information about them."
Bug #85142 - "reducing MTS checkpointing causes high IO load". The bug reported by Trey Raymond is "Verified" for 1.5 years. Check also his Bug #81386 - "MTS hangs on file related *_EVENT forced checkpoint" and Bug #81385 - "relay log info not updated after using sql_after_mts_gaps" that are "Open" since May 2016!
Bug #73066 - "Replication stall with multi-threaded replication". Ovais Tariq reported this bug more than 4 years ago, when we both still worked at Percona. Nobody cares to fix it.
Bug #71495 - "Logging of MTS breakage is insufficient." One of the Oracle customers who use MTS in production, Simon Mudd, reported this practical problem 4.5+ years ago. No fix, and no comments from Oracle engineers for more than 3 years.

I have to stop at this stage, as checking more/older replication related bugs makes me even more sad than I planned or want to be. We see fixes in every release, but the backlog seems huge. Dear Oracle MySQL engineers, please, review public MySQL replication bug reports collected over last 4-5 years and spend some more time and resources on processing and fixing them! Otherwise this Acqua alta of bugs will make this great legacy feature obsolete one sad day.

I did not plan to study semi-sync replication bugs in this post (maybe some day). If you are interested in more details about this feature, crash safety of replication, parallel replication and replication in general, I suggest you check J-F Gagné's MySQL Blog.

5 comments:

Jean-François Gagné, September 24, 2018 at 12:15 AM
I also feel that some of the replication features should get more attention. I will bring a few in different comments to be able to have different discussions on each. The first one is a big documentation error in Bug#92093 about replication crash-safety with GTIDs that is badly documented, and this documentation bug is unfixed for a month.
Jean-François Gagné, September 24, 2018 at 12:19 AM
Another series of bugs is the following about improving relay-log recovery:
* Bug#74321: Execute relay-log-recovery only when needed
* Bug#74323: Avoid overloading the master NIC on relay-log-recovery of a lagging slave
Replication crash safety was introduced in MySQL 5.6, and I can understand that its first implementation was "simplistic". However, I would have liked improvements in 5.7 and 8.0. Also, such improvements to relay-log recovery could solve Bug#81840 (Automatic Replication Recovery Does Not Handle Lost Relay Log Events).
Jean-François Gagné, September 24, 2018 at 12:21 AM
Another series of feature requests about usability, all of which should not be very hard to implement:
* Bug#71565: Please report full GTID state when stopping / start a slave
* Bug#79517: if gtid-mode=on , set gtid-enforce-consistency=on by default if not specified
* Bug#91741: Improve Start binlog_dump log on master
Jean-François Gagné, September 24, 2018 at 12:22 AM
Finally, shameless self-promotion: I will be speaking about some of those bugs during my talk at Percona Live Frankfurt at the beginning of November: Demystifying MySQL Replication Crash Safety
https://www.percona.com/live/e18/sessions/demystifying-mysql-replication-crash-safety

Sunday, September 23, 2018
