Saturday, October 29, 2016

MySQL Support Engineer's Chronicles, Issue #2

It's time to continue the new series I started two weeks ago. I'd like to start with a reminder that it's time to submit your talks for the "MySQL and Friends Devroom" at FOSDEM 2017 - the only MySQL-related event next year that I plan to attend in any case. It seems we have one more week to submit, but I've already filled in all the details for my "main" talk, "Understanding MyRocks locks and deadlocks". I'd like to apply my usual approach of source code reading and gdb breakpoints in case the official documentation still misses important details by the end of January 2017. The official MySQL manual is still in the process of improving its explanation of InnoDB locks, and I am happy to note that this week yet another of my related documentation requests, Bug #71735, "Manual does not explain locks set by SELECT ... FOR UPDATE properly", was closed.
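Speaking of locks set by SELECT ... FOR UPDATE, here is a minimal sketch of the kind of test case I use to study them (the table, column and value names are made up just for illustration):

-- Hypothetical table; any small InnoDB table with a secondary index will do:
CREATE TABLE t (id INT PRIMARY KEY, val INT, KEY(val)) ENGINE=InnoDB;
INSERT INTO t VALUES (1,10), (2,20), (3,30);

-- Session 1: lock the rows matching a secondary index condition.
START TRANSACTION;
SELECT * FROM t WHERE val = 20 FOR UPDATE;

-- Session 2: this one blocks, as FOR UPDATE sets exclusive locks on the
-- matching secondary index records and the corresponding clustered index records.
UPDATE t SET val = 21 WHERE id = 2;

-- In any session: check what locks are actually held and waited for.
SHOW ENGINE INNODB STATUS\G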

I am really excited by the community and developer activity around MyRocks these days, so it was easy to decide to spend more time on MyRocks while preparing for FOSDEM. Just take a look at recent issues reported by Justin Swanhart, for example: Issue #365 and Issue #369. How often do you see such active, almost real-time discussion in bug reports about anything else MySQL-related these days?

I had to spend a lot of time recently trying to understand and prove the root cause of a performance problem that happens once in a while on a node of a Galera cluster, and for this I've used Performance_Schema, maybe for the first time (since reporting Bug #68079) on a real-life case. It turned out that properly sizing the tables there, so that under high load and concurrency we still get relevant data a few seconds after the problem has happened, is not a trivial task. Moreover, it seems the Galera code is not instrumented in enough detail, so it is hard to measure the impact it may have. While trying to find a proper trade-off between the amount of data collected, memory usage and performance impact, I decided to use profilers, from PMP (with the idea of replacing gdb there with quickstack) to perf (which I've used to make some really good observations). I'd like to summarize some recent (and upcoming) profiling experience in a talk that I am still trying to submit for FOSDEM, titled "Applying profilers to MySQL". In the meantime, check the comments on this Facebook post of mine for many useful details.
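To check whether Performance_Schema tables are sized well enough for this kind of analysis, I usually start with something like the following (a sketch only; the exact list of options worth tuning depends on which tables you query):

-- Non-zero and growing "lost" counters mean some instrumentation data
-- was dropped and the related sizing option should be increased:
SHOW GLOBAL STATUS LIKE 'Performance_schema%lost';

-- History table sizes that matter when you read events a few seconds
-- after the problem happened (adjust them in my.cnf, they are not dynamic):
SHOW GLOBAL VARIABLES LIKE 'performance_schema_events_%_history_long_size';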

I'd also like to submit a talk on how a single bug report may influence MySQL development and the development of the bug reporter's skills. It's just an idea for now, and I plan to concentrate on a story around Bug #68079, which will be 4 years old by the time of FOSDEM 2017. I have interesting new results to share that I got while trying to check how the fix helped and how the same use case performs and scales in MyRocks now. I am not sure if this is a good idea and whether anything but a blog post may come out of it. Some of my previous attempts to build a talk around this bug were not really well received...

This week I had to explain how to get rid of huge ibdata* file(s) while moving each InnoDB table to a separate .ibd file. I've found a great summary of options in this old blog post by Shlomi Noach. It is very useful in case you cannot just dump everything and reload it into a new instance, for whatever reason.
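A minimal sketch of the per-table rebuild part (assuming a recent enough MySQL version and a made-up table name) looks like this:

-- Make sure rebuilt tables get their own .ibd files:
SET GLOBAL innodb_file_per_table = ON;

-- Rebuilding the table moves its data out of the shared tablespace
-- into its own big_table.ibd file:
ALTER TABLE mydb.big_table ENGINE=InnoDB;

-- Note that the ibdata* files themselves never shrink: to actually reclaim
-- the disk space you still have to dump everything, re-initialize the
-- instance and reload the dump.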

Another problem I had to work on was related to a bad query plan on a slave where the table was partitioned. We tried to fight bad cardinality estimates (see Bug #67351 etc.), but even after getting them closer to reality the optimizer still uses a wrong index sometimes. My next hope is the engine-independent table statistics of MariaDB.
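Collecting and using these statistics in MariaDB 10.x boils down to something like the following sketch (the table name is made up; see the MariaDB KB for all the use_stat_tables modes):

-- Make the optimizer prefer engine-independent statistics:
SET SESSION use_stat_tables = 'preferably';

-- Collect persistent statistics for all columns and indexes of the table:
ANALYZE TABLE mydb.part_table PERSISTENT FOR ALL;

-- The results are stored in ordinary tables and can be reviewed
-- (or even edited manually as a last resort):
SELECT * FROM mysql.index_stats WHERE table_name = 'part_table';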

I've noted that customers have recently been actively trying to use PAM and LDAP authentication with MySQL. I consider this blog post by my colleague Geoff Montee very useful for them (and for myself).
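In the MariaDB case the basic setup is short (a sketch only; the user name and the PAM service name below are placeholders, and the PAM service itself still has to be configured under /etc/pam.d/ to authenticate against LDAP):

-- Load the PAM authentication plugin:
INSTALL SONAME 'auth_pam';

-- Map a database user to a PAM service and grant it some privileges:
CREATE USER 'alice'@'%' IDENTIFIED VIA pam USING 'mariadb';
GRANT SELECT ON mydb.* TO 'alice'@'%';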

Finally, I've found this great blog post by Jervin Real very useful while working with colleagues on continuous InnoDB crashes caused by a corrupted partition .ibd file. The idea was to restore the partition data from a backup on a separate instance, and also to recover the data for all non-corrupted partitions of a huge table there, while the problematic instance was started in forced recovery mode to drop the problematic table and check for any further problems.
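In SQL terms the recovery side of that plan may look as simple as this sketch (the database, table and partition names are placeholders):

-- On the separate instance restored from backup: dump data from the
-- healthy partitions only, using explicit partition selection (5.6+):
SELECT * FROM mydb.huge_table PARTITION (p2016_08, p2016_09)
INTO OUTFILE '/tmp/huge_table_ok_partitions.txt';

-- On the problematic instance, started with innodb_force_recovery set
-- in my.cnf: drop the table with the corrupted partition and re-create
-- it later from the recovered data:
DROP TABLE mydb.huge_table;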

As I've already mentioned several bugs in this post, I'd like to share a link to the list of bugs recently reported by my colleague from the MariaDB Foundation, Sergey Vojtovich. Dear Oracle MySQL engineers, please pay proper attention to them, especially those with "bs" in the "Summary"!
