Saturday, December 31, 2016

Fun with Bugs #46 - On Some Bugs I've Reported During the Year of 2016

It's time to summarize the year of 2016. As a somewhat weird kind of summary, in this post I'd like to share a list of MySQL bug reports I created in 2016 that still remain "Verified" today:

  • Bug #79831 - "Unexpected error message on crash-safe slave with max_relay_log_size set". According to Umesh this is not repeatable with 5.7. The fact that I reported the bug on January 4 probably means I was working at that time. I should not repeat this mistake again next year.
  • Bug #80067 - "Index on BIT column is NOT used when column name only is used in WHERE clause". People say the same problem happens with INT and, what may be even less expected, BOOLEAN columns (see the sketch right after this list).
  • Bug #80424 - "EXPLAIN output depends on binlog_format setting". Who could expect that?
  • Bug #80619 - "Allow slave to filter replication events based on GTID". In this feature request I suggested implementing filtering by GTID pattern, so that we can skip all events originating from a specific master on some slave in a complex replication chain.
  • Bug #82127 - "Deadlock with 3 concurrent DELETEs by UNIQUE key". It's clear that the manual is not even close to explaining how the locks are really set "by design" in this weird case. See comments in MDEV-10962 for some explanations. Nobody from Oracle even tried to really explain how things are designed to work.
  • Bug #82212 - "mysqlbinlog can produce events larger than max_allowed_packet for mysql". This happens for encoded row-based events. There should be some way to take this overhead into account while creating the binary log, IMHO.
  • Bug #83024 - "Internals manual does not explain COM_SLEEP in details". One day you'll see Sleep for some 17 seconds logged into the slow query log, and may start to wonder why...
  • Bug #83248 - "Partition pruning is not working with LEFT JOIN". You may find some interesting related ideas in MDEV-10946.
  • Bug #83640 - "Locks set by DELETE statement on already deleted record". This case shows that the design of locking in InnoDB does produce really weird outcomes sometimes. This is not about a "missing manual", this is about an extra lock that is absolutely NOT needed (a gap X lock on a record in the secondary unique index is set when the same transaction already has the next-key lock on it). As a side note, I have been finding, explaining and reporting weird or undocumented details in InnoDB locking for years, yet my talk about InnoDB locks was once again not accepted by Oracle for OOW in 2016. What do I know about the subject and who even cares about those locks...
  • Bug #83708 - "uint expression is used for the value that is passed as my_off_t for DDL log". I was really shocked by this finding. I assumed that all improper uint vs unsigned long long casts had already been found. It seems I was mistaken.
  • Bug #83912 - "Time spent sleeping before entering InnoDB is not measured/reported separately". The use case that led me to reporting this bug is way more interesting than the fact that some wait is not instrumented in performance_schema. You may see more related bug reports from me next year.
  • Bug #83950 - "LOAD DATA INFILE fails with an escape character followed by a multi-byte one". This single bug (and related bugs and stories) was the original topic for issue #46 of my "Fun With Bugs" series. I was not able to write everything I wanted properly over the last 3 weeks, but trust me: it's a great story, of "Let's Make America Great Again" style. With the goal for LOAD DATA to behave exactly as INSERT when wrong utf8 data are inserted, Oracle changed the way LOAD DATA works back and forth, with the latest change (back) happening in 5.7.17:
     "Incompatible Change: A change made in MySQL 5.7.8 for handling of multibyte character sets by LOAD DATA was reverted due to the replication incompatibility (Bug #24487120, Bug #82641)"
    I just cannot keep up with all the related fun people have in replication environments thanks to these ongoing changes... It's incredible.
  • Bug #84004 - "Manual misses details on MDL locks set and released for online ALTER TABLE". Nothing new: locks in MySQL are not properly/completely documented, metadata locks included. Yes, they are documented better now, after 11+ years of my continuous efforts (of a kind), but we are "not there yet". I am still waiting for a job offer to join the MySQL Documentation Team, by the way :)
  • Bug #84173 - "mysqld_safe --no-defaults & silently does NOT work any more". The recent MySQL 5.7.17 release not only gave us the new Group Replication plugin and introduced incompatible changes. In a hope to fix security issues it also comes with a pure regression - for the first time in the last 11 years mysqld_safe --no-defaults stopped working for me! By the way, mysqld_safe is still NOT safe in the sense that 5.7.17 tried to enforce, and one day (really soon) you will find out why.
  • Bug #84185 - "Not all "Statements writing to a table with an auto-increment..." are unsafe". If you do something like DELETE FROM `table` WHERE some_col IN (SELECT some_id FROM `other_table`) where `table` has auto_increment column, why should anyone care about it? We do not generate the value, we delete rows...
    This bug report was actually created by Hartmut Holzgraefe and the test case comes from Elena Stepanova (see MDEV-10170). I want to take this opportunity to thank them and other colleagues from MariaDB for their hard work and cooperation during the year of 2016. Thanks to Umesh (who processed most of my bug reports), Sinisa Milivojevic and Miguel Solorzano for their verifications of my bug reports this year.
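
To illustrate the kind of surprise behind Bug #80067 mentioned above, here is a minimal sketch of a test case (the table and column names are my assumptions, not the exact ones from the report):

CREATE TABLE tbit (id INT PRIMARY KEY, flag BIT(1), KEY k_flag (flag)) ENGINE=InnoDB;
INSERT INTO tbit VALUES (1, 0), (2, 1), (3, 1);

EXPLAIN SELECT * FROM tbit WHERE flag;       -- column name only: the index is reported as not used
EXPLAIN SELECT * FROM tbit WHERE flag = 1;   -- explicit comparison: the index can be used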

In conclusion I should say that, no matter how pointless you may consider this activity, I still suggest you report each and every problem that you have with MySQL and cannot understand after reading the manual as a public MySQL bug. Now, re-read my 4-year-old post on this topic and have a Happy and Fruitful New Year 2017!

Saturday, December 3, 2016

MySQL Support Engineer's Chronicles, Issue #4

This week I had to deal with some unusual problems. But let me start with Percona's xtrabackup, software that I consider a key component of many current production MySQL setups and use regularly. Recently new minor versions of XtraBackup were released; check the details on 2.4.5, for example. It made a step towards supporting MariaDB 10.2, but there is still a long way to go, see this pull request #200.

My main problem with xtrabackup, though, is not the lack of support for MariaDB 10.2-specific features. Why should they care, after all... The problem is that old, well-known bugs and problems are not resolved, those that may affect all MySQL versions, forks and environments. Check lp:1272329, "innobackupex dies if a directory is not readable. i.e.: lost+found", for example. Why not read and take into account the ignore_db_dir option (as the server does) and let those poor souls who used a mount point as a datadir make backups? Check an even older problem, passwords that are not hidden in the command line, see lp:907280, "innobackupex script shows the password in the ps output, when its passed as a command line argument". My colleague Hartmut even suggested a fix recently, see pull request #289.

Because of these old, known problems (some of them low-hanging fruit) that are not getting fixed, users suffer while using xtrabackup way more often than they would like to. One day, as a result, they'll have to switch to some other online backup tools or approaches. One may dream about an extended myrocks_hotbackup covering InnoDB one day (when MyRocks and InnoDB work together in one instance), or just use Percona TokuBackup (after adding a script to do SST for Galera with it, maybe), or try something totally different. Anyway, I feel that if more bugs (including low-hanging fruit) in xtrabackup do not get fixed and pull requests are not actively accepted, the tool may become much less relevant and used soon.

I had to deal with MaxScale-related issues this week, so I'd like to remind those who use Docker for testing about https://github.com/asosso/maxscale-docker/blob/master/Dockerfile. Personally I prefer to build from source. In any case, I'd like us all to remember that in older versions one may have to set the strip_db_esc option explicitly for a service to deal with database names containing an underscore (_). Recent 2.0.x versions have it enabled by default (see MXS-801).

I also had to explain how online ALTER TABLE works, specifically when it sets exclusive metadata locks in the process. I still do not see this topic properly explained in the manual, so I had to report Bug #84004, "Manual misses details on MDL locks set and released for online ALTER TABLE".
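
To show the kind of scenario I had to explain, here is a minimal sketch (the table name is just an assumption): even an "online" ALTER needs a short exclusive metadata lock at the beginning and at the end, so it still waits for any long transaction that holds a metadata lock on the table:

-- Session 1: a long transaction that merely read from the table keeps
-- a shared metadata lock on it until COMMIT:
BEGIN;
SELECT * FROM big_table LIMIT 1;

-- Session 2: the "online" in-place ALTER still needs an exclusive MDL,
-- so it hangs in "Waiting for table metadata lock" until session 1 commits:
ALTER TABLE big_table ADD COLUMN c2 INT, ALGORITHM=INPLACE, LOCK=NONE;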


I have not been a developer for 11 years already, and one should expect Java code from me even less. Anyway, I had to explain how to replace Oracle's ref_cursors (a.k.a. cursor variables) in MySQL, both in stored procedures and in the Java code that calls them. If you are wondering what this is about, check this fine manual. Note that this feature is missing in MySQL, even though it was suggested to implement it here. In general, MySQL just allows you to run SELECTs in stored procedures, and then in Java you can process each of the returned result sets any way you want (see the sketch below). Things may get more complicated when more than one result set is produced, and they are even more complicated in Oracle with nested cursor expressions. So, I plan to devote a separate blog post to this topic one day. Stay tuned.
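
For illustration, here is a minimal sketch of the approach (the procedure, table and column names are my assumptions): the stored procedure simply runs SELECTs, and the Java client walks through the returned result sets:

-- The procedure just runs SELECTs instead of returning a ref_cursor;
-- the caller then processes each result set in turn.
DELIMITER //
CREATE PROCEDURE get_orders(IN p_customer INT)
BEGIN
  SELECT id, created_at FROM orders WHERE customer_id = p_customer;
  SELECT COUNT(*) AS total FROM orders WHERE customer_id = p_customer;
END//
DELIMITER ;

CALL get_orders(1);
-- In JDBC: execute the CALL, read the first ResultSet, then call
-- Statement.getMoreResults() to move to the next one.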

I answer questions coming not only from customers. Old friends, community users out of nowhere and, even more so, colleagues are welcome to discuss whatever MySQL- or MariaDB-related problem they may have. If I know how to help, I'll do so, otherwise I'll quickly explain that I am of no good use. This is how I ended up quickly testing MariaDB's CONNECT storage engine as a replacement for the Federated engine, that is, to link a table to a remote MySQL table. Basic instructions on how to set it up and use the MySQL table type looked simple, but when I tried to test it on Fedora 23 I hit a problem of missing libodbc.so.1:

MariaDB [(none)]> INSTALL SONAME 'ha_connect';
ERROR 1126 (HY000): Can't open shared library '/home/openxs/dbs/mariadb10.1/lib/plugin/ha_connect.so'
  (errno: 2, libodbc.so.1: cannot open shared object file: No such file or directory)
the solution was not really straightforward. First of all I had to install the unixODBC.x86_64 2.3.4-1.fc23 RPM, but it does not provide libodbc.so.1 either:

[openxs@fc23 node2]$ find / -name libodbc.* 2>/dev/null
/usr/lib64/libodbc.so.2
/usr/lib64/libodbc.so
/usr/lib64/libodbc.so.2.0.0
So, I had to apply a quick and dirty hack:

[openxs@fc23 node2]$ sudo ln -s /usr/lib64/libodbc.so.2.0.0  /usr/lib64/libodbc.so.1
As a result the CONNECT engine worked as expected, as long as a proper account and IP address were used:
MariaDB [test]> INSTALL SONAME 'ha_connect';
Query OK, 0 rows affected (0.27 sec)


MariaDB [test]> create table r_t2(id int primary key, c1 int) engine=connect table_type=mysql connection='mysql://msandbox:msandbox@127.0.0.1:23532/test/t';
Query OK, 0 rows affected (0.04 sec)

MariaDB [test]> select * from r_t2;                     
+----+------+
| id | c1   |
+----+------+
|  1 |    2 |
|  2 |    3 |
+----+------+
2 rows in set (0.00 sec)
From configuring MaxScale to work with a database having an underscore in its name, to re-writing Java code that used to work with Oracle RDBMS so that it works with MySQL, with many missing details in the manuals and software bugs identified or reported in between, and all that alongside ongoing studies of performance problems and lost quorums, rolling upgrades and failed SSTs in Galera clusters - this is what support engineers here at MariaDB have to deal with during a typical working week.

Sunday, November 27, 2016

Upstart Basics for Automatic Restarts of MaxScale After Crash

Historically I have not cared much about MariaDB's MaxScale, at least since I learned how to build it from source when needed. But, as a support engineer who works at MariaDB, sometimes I have to deal with problems related to MaxScale, and this week it so happened that I had to explain what to do to implement automatic restarts of the MaxScale "daemon" in case of crashes on RHEL 6.x.

In the process I found out that two of my most often used Linux distributions, CentOS 6.x and Ubuntu 14.04, actually use Upstart, so good old System V init tricks and scripts work there only partially and only because somebody cared to integrate them into this "new" approach to starting tasks and services during boot, stopping them during shutdown and supervising them while the system is running. I should have done this years ago, but a customer's question finally forced me to check and study some details of how this system actually works.

So, unfortunately, there is no script like mysqld_safe to start and restart MaxScale after installing the official RPM from MariaDB (in this case it was maxscale-2.0.1-2.x86_64). My first idea was to write one, but then I asked myself why it is not there yet, decided to check what's already provided, and took a closer look at the /etc/init.d/maxscale I have on my CentOS 6.8 VM (the closest to what the customer used). It's a typical classic shell script to start a service, and it starts MaxScale like this:

start() {
    echo -n $"Starting MaxScale: "
...
    ulimit -HSn 65535
    daemon --pidfile $MAXSCALE_PIDFILE /usr/bin/maxscale --user=maxscale $MAXSCALE_OPTIONS >& /dev/null

    RETVAL=$?
    [ $RETVAL -eq 0 ] && touch /var/lock/subsys/$servicename
...
    # Return right code
    if [ $RETVAL -ne 0 ]; then
       failure
       RETVAL=$_RETVAL_NOT_RUNNING
    fi

    echo

    return $RETVAL
}
Basically, it runs /usr/bin/maxscale --user=maxscale; the rest are details (like the location of the PID file the rest of the script then relies on). There is no advanced script to monitor the status of the process with this PID (like mysqld_safe) or anything to take care of automatic restarts. It is assumed that I just execute chkconfig maxscale on and then the service starts when the system enters the proper runlevel.

A simple test shows that when the maxscale process is killed, it's gone and is not restarted. In the good old times I'd just add something like this:

mx:2345:respawn:/usr/bin/maxscale --user=maxscale

to /etc/inittab (as nice articles like this one suggest), but a quick check and then further reading proved that it's not going to work on CentOS 6.8, as it uses Upstart.

So, either I had to write something similar to mysqld_safe for MaxScale, or (bearing in mind the size of the code and the number of bugs we had in the past in that script) I'd better find out what the supposed way to respawn processes in Upstart is. The basic ideas are, again, simple. One has to create an /etc/init/service_name.conf file and put something like this there (a real code quote from one of the Ubuntu files for MySQL):
start on runlevel [2345]
stop on starting rc RUNLEVEL=[016]

respawn
respawn limit 2 5
...
pre-start script
...
end script

exec /usr/sbin/mysqld

post-start script
...
end script

The file is easy to understand even without reading the manual. One has to set when the service starts and stops, add a respawn clause if we want to restart it in case of unexpected crashes or kills of the process, optionally limit the number of restarts and the interval between them, and, optionally, do something before and after the start.

I quickly created /etc/init/maxscale.conf based on the above and it did a great job of starting MaxScale automatically upon system startup. I basically just used exec /usr/bin/maxscale --user=maxscale and decided to deal with options and other details later if needed. But what was strange from the very beginning is that in /var/log/messages I saw what looked like repeated, failed attempts to start the maxscale process:

Nov 26 17:44:45 centos maxscale[20229]: MaxScale started with 1 server threads.
Nov 26 17:44:45 centos init: maxscale main process ended, respawning
Nov 26 17:44:45 centos maxscale[20229]: Started MaxScale log flusher.
Nov 26 17:44:45 centos maxscale[20235]: Working directory: /var/log/maxscale
Nov 26 17:44:45 centos maxscale[20235]: MariaDB MaxScale 2.0.1 started
Nov 26 17:44:45 centos maxscale[20235]: MaxScale is running in process 20235
...
Nov 26 17:44:45 centos maxscale[20235]: Loaded module qc_sqlite: V1.0.0 from /usr/lib64/maxscale/libqc_sqlite.so
Nov 26 17:44:45 centos maxscale[20235]: MaxScale is already running. Process id: 20229. Use another location for the PID file to run multiple instances of MaxScale on the same machine.
Nov 26 17:44:45 centos init: maxscale main process (20234) terminated with status 4
Nov 26 17:44:45 centos init: maxscale main process ended, respawning
Moreover, when the proper maxscale process was killed, it was NOT respawned as expected.

It was just the proper time to read the manual more carefully, specifically this part about the expect stanza (which I had noticed in some of the official Upstart scripts):

"To allow Upstart to determine the final process ID for a job, it needs to know how many times that process will call fork(2). Upstart itself cannot know the answer to this question since once a daemon is running, it could then fork a number of "worker" processes which could themselves fork any number of times. Upstart cannot be expected to know which PID is the "master" in this case, considering it does not know if worker processes will be created at all, let alone how many times, or how many times the process will fork initially. As such, it is necessary to tell Upstart which PID is the "master" or parent PID. This is achieved using the expect stanza.
The syntax is simple, but you do need to know how many times your service forks."
Let's quickly check how many times fork() is called in maxscale (I'd know it better if I had ever cared to study the source code in detail, but I have not checked most of it yet). A test based on that cookbook gives unexpected results:

[root@centos ~]# strace -o /tmp/strace.log -fFv maxscale &
[1] 2369
[root@centos ~]# sleep 10
[root@centos ~]# ps aux | grep strace
root      2369  2.6  0.0   4476   888 pts/0    S    20:28   0:00 strace -o /tmp/strace.log -fFv maxscale
root      2382  0.0  0.0 103312   868 pts/0    S+   20:28   0:00 grep strace
[root@centos ~]# pkill -9 strace
[1]+  Killed                  strace -o /tmp/strace.log -fFv maxscale
[root@centos ~]# ps aux | grep maxscale
root      2375  1.3  0.2 276168  3896 ?        Ssl  20:28   0:00 maxscale
root      2385  0.0  0.0 103312   868 pts/0    S+   20:28   0:00 grep maxscale
[root@centos ~]# egrep "\<(fork|clone)\>\(" /tmp/strace.log | wc | awk '{print $1}'
5
How come we have 5 fork calls? Here they are:
[root@centos ~]# egrep "\<(fork|clone)\>\(" /tmp/strace.log
2374  clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fb024c08ab0) = 2375
2375  clone(child_stack=0x7fb01f819f10, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fb01f81a9d0, tls=0x7fb01f81a700, child_tidptr=0x7fb01f81a9d0) = 2376
2375  clone(child_stack=0x7fb01e118f10, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fb01e1199d0, tls=0x7fb01e119700, child_tidptr=0x7fb01e1199d0) = 2377
2375  clone(child_stack=0x7fb01d10af10, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fb01d10b9d0, tls=0x7fb01d10b700, child_tidptr=0x7fb01d10b9d0) = 2378
2375  clone(child_stack=0x7fb017ffef10, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fb017fff9d0, tls=0x7fb017fff700, child_tidptr=0x7fb017fff9d0) = 2379
It seems the process with PID 2374 actually forked just once, to produce PID 2375, and then several threads were started by that process. We can see them:
[root@centos ~]# ps -T -p `pidof maxscale`
  PID  SPID TTY          TIME CMD
 2375  2375 ?        00:00:00 maxscale
 2375  2376 ?        00:00:00 maxscale
 2375  2377 ?        00:00:00 maxscale
 2375  2378 ?        00:00:00 maxscale
 2375  2379 ?        00:00:00 maxscale
[root@centos ~]#
So, it was really one fork() (and I could have noticed that even from studying /var/log/messages before), and I have to add the expect fork stanza to my Upstart configuration file:

[root@centos ~]# cat /etc/init/maxscale.conf
# MaxScale service

description "MaxScale"

start on stopped rc RUNLEVEL=[2345]
stop on starting rc runlevel [!2345]

respawn
respawn limit 2 5

expect fork

exec /usr/bin/maxscale --user=maxscale
This way it works as expected, as one may easily check:

[root@centos ~]# initctl status maxscale
maxscale start/running, process 6600
[root@centos ~]# maxadmin
MaxScale> show servers
Server 0x19e47a0 (server1)
        Server:                              127.0.0.1
        Status:                              Master, Running
        Protocol:                            MySQLBackend
        Port:                                3306
        Server Version:                      5.7.15-9-log
        Node Id:                             1
        Master Id:                           -1
        Slave Ids:
        Repl Depth:                          0
        Number of connections:               0
        Current no. of conns:                0
        Current no. of operations:           0
MaxScale> quit
[root@centos ~]# kill -9 6600
[root@centos ~]# ps aux | grep maxscale
maxscale  6627  2.0  0.2 276168  3884 ?        Ssl  20:41   0:00 /usr/bin/maxscale --user=maxscale
root      6633  0.0  0.0 103312   872 pts/0    S+   20:41   0:00 grep maxscale
[root@centos ~]# initctl status maxscale
maxscale start/running, process 6627

In /var/log/messages we clearly see that the process is respawned by init:

...
Nov 26 20:38:15 centos maxscale[6600]: Started MaxScale log flusher.
Nov 26 20:40:33 centos maxscale[6600]: Loaded module MaxAdminAuth: V2.0.0 from /usr/lib64/maxscale/libMaxAdminAuth.so
Nov 26 20:41:52 centos init: maxscale main process (6600) killed by KILL signal
Nov 26 20:41:52 centos init: maxscale main process ended, respawning
Nov 26 20:41:52 centos maxscale[6627]: Working directory: /var/log/maxscale
Nov 26 20:42:07 centos maxscale[6627]: Loaded module MaxAdminAuth: V2.0.0 from /usr/lib64/maxscale/libMaxAdminAuth.so
...
After a few more checks I asked for this to be implemented officially in the packages for the Linux distributions that use Upstart, see MXS-1027.

To summarize, I wish I had cared to find out how Upstart works a long time ago. Now it's probably time to study systemd :) Anyway, after some reading and testing one can use Upstart efficiently to provide automated service starts and restarts for MySQL server and the services used with it.

Saturday, November 12, 2016

MySQL Support Engineer's Chronicles, Issue #3

The original idea of this series was to publish one post per week, but it seems every other week I have some special topic that well deserves a dedicated post. Last week I had no time to complete my writing because of the long (and, I hope, useful) Howto post on replacing a corrupted partition using a non-corrupted one from another server in a replication setup. But I had links and notes collected in a draft that I am going to complete now.

First of all, during the previous week I had time to submit two more talks for the "MySQL and Friends Devroom" at FOSDEM 2017, "Applying profilers to MySQL" and "Background Story of a MySQL Bug". The call for papers is still open, as far as I understand, so I may come up with a few more ideas on what to talk about.

Strange to admit this, but sometimes I can be excited by something MySQL-related. Two weeks ago I added a note to myself about the great, detailed changelogs that MariaDB publishes, with proper links to GitHub commits. Check this one for 10.0.28, for example. I wish Oracle provided the same level of detail for MySQL releases as a part of their official release notes.

Still, sometimes important changes happen in upstream MySQL, get merged, and details about the inherited incompatible change (and its real impact) are still missing in any release notes of any kind. Let's consider a recent example. Historically MySQL treated incorrect utf8 byte sequences differently for INSERT than for LOAD DATA: INSERT failed, while LOAD DATA just had the value truncated at the first incorrect character and continued. Eventually (in MySQL 5.6.32) this was fixed by upstream MySQL (it was also fixed in a wider context in MariaDB 10.2 as part of MDEV-9823). The MySQL 5.6.32 release notes say about the incompatible change:
  • "Incompatible Change: For multibyte character sets, LOAD DATA could fail to allocate space correctly and ignore input rows as a result. (Bug #76237, Bug #20683959, Bug #23080148)
    References: This issue is a regression of: Bug #14653594."
But they surely say nothing about the impact on replication, or about another bug, Bug #78758, that is fixed in 5.6.32 (by making LOAD DATA fail with an error). It cost me some time to figure out all the details. Read MDEV-11217 for the historical details, a nice test case and different views on the decisions made. Note also that the error message about a bad utf8 character from LOAD DATA now looks (IMHO) weird enough, as it actually contains only the valid part of the string. See my MDEV-11216 about this.
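
If you want to see the difference for yourself, here is a sketch of the kind of comparison involved (the table name, file name and byte sequence are my assumptions, and the exact outcome depends on the server version and sql_mode):

CREATE TABLE t_utf8 (c VARCHAR(10)) DEFAULT CHARSET=utf8;
SET SESSION sql_mode='STRICT_ALL_TABLES';

-- 0xFF can never appear in a valid utf8 sequence, so this INSERT fails
-- with "Incorrect string value" in strict mode:
INSERT INTO t_utf8 VALUES (UNHEX('61FF62'));

-- Put the same bytes into a file and load it: before 5.6.32 the value was
-- silently truncated at the first invalid byte, while 5.6.32+ makes
-- LOAD DATA fail as well:
LOAD DATA INFILE '/tmp/bad_utf8.txt' INTO TABLE t_utf8 CHARACTER SET utf8;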

I am still having fun with InnoDB locks. This week I checked what locks are set when the same DELETE (for a table with a primary key and a unique secondary index) is executed twice in the same transaction. Check Bug #83640 and tell me how this sequence of locks set in one transaction may make any sense:
---TRANSACTION 636201, ACTIVE 202 sec
5 lock struct(s), heap size 1184, 4 row lock(s), undo log entries 1
MySQL thread id 1, OS thread handle 0x7f9e513a7700, query id 92 localhost root init
show engine innodb status
TABLE LOCK table `test`.`tu` trx id 636201 lock mode IX
RECORD LOCKS space id 11 page no 4 n bits 72 index `u` of table `test`.`tu` trx id 636201 lock_mode X locks rec but not gap
RECORD LOCKS space id 11 page no 3 n bits 72 index `PRIMARY` of table `test`.`tu` trx id 636201 lock_mode X locks rec but not gap
RECORD LOCKS space id 11 page no 4 n bits 72 index `u` of table `test`.`tu` trx id 636201 lock_mode X
RECORD LOCKS space id 11 page no 4 n bits 72 index `u` of table `test`.`tu` trx id 636201 lock_mode X locks gap before rec
My colleague Jan Lindström was also surprised, so we have MDEV-11215 as well, and a chance to see this studied and maybe changed by MariaDB engineers. Related problems were discussed in the past, see Bug #19762 and Bug #55717.
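
For reference, a transaction along these lines is enough to see this kind of locking (the schema is my guess at a minimal one, the index names just match the output above):

CREATE TABLE tu (id INT PRIMARY KEY, u INT, UNIQUE KEY u (u)) ENGINE=InnoDB;
INSERT INTO tu VALUES (1, 1), (2, 2), (3, 3);

BEGIN;
-- the first DELETE by the unique secondary key sets "rec but not gap" X locks:
DELETE FROM tu WHERE u = 2;
-- the same DELETE repeated in the same transaction adds the extra next-key
-- and gap locks on index u quoted above:
DELETE FROM tu WHERE u = 2;
-- SHOW ENGINE INNODB STATUS then shows the lock structures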

Some days I keep wondering what is happening to XtraBackup these days at Percona. As far as I remember, I was not able to reproduce lp:1461833 while working there, but the fact that the bug is still open and has got no comments since I re-opened it gives me no additional confidence.

I report bugs and missing details in MySQL way too often, even for my own liking. But the MySQL manual really misses many details needed to explain the results that users see in production. This week I'd like to remind you about one of my bug reports on missing details in the MySQL documentation, Bug #77390, and my request there:
"Please, explain all metadata and InnoDB locks set by online ALTER, with examples and details enough to explain non-trivial cases..."
Honestly, until this month I never noticed that a DDL log exists in MySQL. This is the ddl_log.log binary file that can be "dumped" into a somewhat readable form using a script by Mattias Jonsson from Bug #47343:
[openxs@fc23 5.7]$ perl ~/ddl_log_dump.pl data/ddl_log.log
Header: Num entries: 4 Name length: 512 Block length 4096
Entry 1 type i action s next 0
  name ./test/trange2
not 'e' entry (i)
Entry 2 type l action d next 0
  name ./test/#sql-trange2
not 'e' entry (l)
Entry 3 type l action d next 2
  name ./test/trange2#P#pmax
not 'e' entry (l)

This file may grow until the MySQL server restart completes, but what's worse, when it grows over 4GB in size it becomes unusable and effectively blocks any further concurrent DDL until we get rid of it. I had a lot of fun reading the code and reporting the related Bug #83708. Unlucky users who do a lot of partitioning-related DDL may find the situation less funny when they hit this bug.


I plan to describe what I had to work on this week soon, while I still remember all the relevant details and feelings. So, stay tuned!

Saturday, November 5, 2016

How to Recover Corrupted InnoDB Partition Tablespace in Replication Setup

This week I've got a question that sounded basically like this:
"Is it possible to just copy the entire partition from the replicated server?"
Let me share some background story. As happens sometimes, a user had a huge table with many partitions, let's say hundreds of gigabytes in size each, and one of them unfortunately got corrupted. It happened in a replication setup on the master, but luckily they had used innodb_file_per_table=1 and they had a slave that was more or less in sync with the master. This allowed them to reconfigure replication and continue working, but the task remained to eventually put the master back in use and get correct data into the corrupted partition. Let's assume that dumping and reloading data from one of the instances in the replication setup is not a desired option, as it would take too much time compared to just copying the partition tablespace file. Hence the question above...
Side note: Let's assume for simplicity that the corrupted partition does not receive changes at the moment, but even if it does we could theoretically replay them (by carefully processing the binary logs, for example).
My quick and simple answer was that surely it is possible with InnoDB from MySQL 5.6, and there is a great blog post that provides a lot of relevant details and steps, this one by my former colleague from Percona, Jervin Real. I remember I had to deal with orphaned or corrupted partitions more than once in the past, while working for Percona. It is NOT possible to import them directly in MySQL 5.6, and I've even reported this as Bug #70196, but in 5.7.4+, after the fix for Bug #52422, it is possible. Anyway, the nice partition exchange feature, as Jervin explained, helps to overcome this limitation easily in MySQL 5.6 as well.

The user decided to try this, but surely dealing with corrupted InnoDB tables is not as easy as dealing with non-corrupted ones. There are many things that could go not as expected and require additional steps. For example, hardly any partition exchange is possible for a table with a corrupted partition - I suggested doing this for a clean table without any corruption.

I still felt that things might go wrong, so I decided to make a test and show step by step how it works and what may go wrong in the process. To do this I used one of the replication sandboxes at hand, with MySQL 5.6.28, and, after making sure replication works on both slaves, created a simple partitioned table on the master with the following statements:
create table tp(id int, val int, ts datetime, primary key(id, ts)) partition by range (year(ts)) (partition p0 values less than (2006), partition p1 values less than (2015), partition px values less than maxvalue);
insert into tp values(1, 10, '2005-10-01');
insert into tp values(2, 10, '2014-10-02');
insert into tp values(3, 30, now());
This is what I've got on master:
master [localhost] {msandbox} (test) > select * from tp;
+----+------+---------------------+
| id | val  | ts                  |
+----+------+---------------------+
|  1 |   10 | 2005-10-01 00:00:00 |
|  2 |   10 | 2014-10-02 00:00:00 |
|  3 |   30 | 2016-11-05 17:34:30 |
+----+------+---------------------+
3 rows in set (0,00 sec)

master [localhost] {msandbox} (test) > select * from tp partition(px);
+----+------+---------------------+
| id | val  | ts                  |
+----+------+---------------------+
|  3 |   30 | 2016-11-05 17:34:30 |
+----+------+---------------------+
1 row in set (0,00 sec)

master [localhost] {msandbox} (test) > select * from tp partition(p0);
+----+------+---------------------+
| id | val  | ts                  |
+----+------+---------------------+
|  1 |   10 | 2005-10-01 00:00:00 |
+----+------+---------------------+
1 row in set (0,00 sec)

master [localhost] {msandbox} (test) > select * from tp partition(p1);
+----+------+---------------------+
| id | val  | ts                  |
+----+------+---------------------+
|  2 |   10 | 2014-10-02 00:00:00 |
+----+------+---------------------+
1 row in set (0,00 sec)

Then I checked that the data are replicated on the slave that I am going to corrupt soon:
slave1 [localhost] {msandbox} (test) > select * from tp partition(p1);
+----+------+---------------------+
| id | val  | ts                  |
+----+------+---------------------+
|  2 |   10 | 2014-10-02 00:00:00 |
+----+------+---------------------+
1 row in set (0,00 sec)
It's time to stop MySQL server and corrupt the data in partition p1:
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ ./stop
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ ls -l data/test
total 152012
...

-rw-rw---- 1 openxs openxs     8610 лис  5 17:34 tp.frm
-rw-rw---- 1 openxs openxs       32 лис  5 17:34 tp.par
-rw-rw---- 1 openxs openxs    98304 лис  5 17:34 tp#P#p0.ibd
-rw-rw---- 1 openxs openxs    98304 лис  5 17:34 tp#P#p1.ibd
-rw-rw---- 1 openxs openxs    98304 лис  5 17:34 tp#P#px.ibd

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ hexdump -C data/test/tp#P#p1.ibd | more
00000000  22 e2 b8 a3 00 00 00 00  00 00 00 00 00 00 00 00  |"...............|
00000010  00 00 00 00 0f f4 ae 86  00 08 00 00 00 00 00 00  |................|
00000020  00 00 00 00 00 0c 00 00  00 0c 00 00 00 00 00 00  |................|
00000030  00 06 00 00 00 40 00 00  00 00 00 00 00 04 00 00  |.....@..........|
00000040  00 00 ff ff ff ff 00 00  ff ff ff ff 00 00 00 00  |................|
00000050  00 01 00 00 00 00 00 9e  00 00 00 00 00 9e 00 00  |................|
...

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ dd if=/dev/zero of=data/test/tp#P#p1.ibd bs=1 count=98304
98304+0 records in
98304+0 records out
98304 bytes (98 kB) copied, 0,253177 s, 388 kB/s
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ hexdump -C data/test/tp#P#p1.ibd | more
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00018000
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ ls -l data/test/tp*       
-rw-rw---- 1 openxs openxs  8610 лис  5 17:34 data/test/tp.frm
-rw-rw---- 1 openxs openxs    32 лис  5 17:34 data/test/tp.par
-rw-rw---- 1 openxs openxs 98304 лис  5 17:34 data/test/tp#P#p0.ibd
-rw-rw---- 1 openxs openxs 98304 лис  5 17:41 data/test/tp#P#p1.ibd
-rw-rw---- 1 openxs openxs 98304 лис  5 17:34 data/test/tp#P#px.ibd
I just filled the entire partition file with zeroes and verified its content before and after with hexdump. I think it's a bad enough kind of corruption - there is no way to recover the data, they are gone entirely, and the system areas of the .ibd file will surely not match what the InnoDB data dictionary may think about the tablespace.

Now I can try to start the slave back up and access the data in the table:
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ ./start
. sandbox server started
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ tail data/msandbox.err
2016-11-05 18:15:11 8092 [Note] Server hostname (bind-address): '127.0.0.1'; port: 22294
2016-11-05 18:15:11 8092 [Note]   - '127.0.0.1' resolves to '127.0.0.1';
2016-11-05 18:15:11 8092 [Note] Server socket created on IP: '127.0.0.1'.
2016-11-05 18:15:11 8092 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a slave and has his hostname changed!! Please use '--relay-log=mysql_sandbox22294-relay-bin' to avoid this problem.
2016-11-05 18:15:11 8092 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2016-11-05 18:15:11 8092 [Note] Slave I/O thread: connected to master 'rsandbox@127.0.0.1:22293',replication started in log 'mysql-bin.000005' at position 1015
2016-11-05 18:15:11 8092 [Note] Event Scheduler: Loaded 0 events
2016-11-05 18:15:11 8092 [Note] /home/openxs/5.6.28/bin/mysqld: ready for connections.
Version: '5.6.28-log'  socket: '/tmp/mysql_sandbox22294.sock'  port: 22294  MySQL Community Server (GPL)
2016-11-05 18:15:12 8092 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.000005' at position 1015, relay log './mysql_sandbox22294-relay-bin.000236' position: 1178
The table was not accessed during the recovery stage (as the slave was stopped cleanly), so we have no hints about the corruption in the error log. But as soon as we try to access the table:

slave1 [localhost] {msandbox} (test) > select * from tp;
ERROR 2013 (HY000): Lost connection to MySQL server during query
In the error log I see:

InnoDB: Error: tablespace id is 12 in the data dictionary
InnoDB: but in file ./test/tp#P#p1.ibd it is 0!
2016-11-05 18:17:03 7f0e9c257700  InnoDB: Assertion failure in thread 139700725970688 in file fil0fil.cc line 796
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
16:17:03 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=1
max_threads=151
thread_count=3
connection_count=1
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 68108 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x276e860
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7f0e9c256e50 thread_stack 0x40000
/home/openxs/5.6.28/bin/mysqld(my_print_stacktrace+0x35)[0x90f695]
/home/openxs/5.6.28/bin/mysqld(handle_fatal_signal+0x3d8)[0x674fc8]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10330)[0x7f0eb0d99330]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f0eaf993c37]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f0eaf997028]
/home/openxs/5.6.28/bin/mysqld[0xae994e]
/home/openxs/5.6.28/bin/mysqld[0xae9ade]
/home/openxs/5.6.28/bin/mysqld[0xaf0419]
/home/openxs/5.6.28/bin/mysqld[0xabb7fb]
/home/openxs/5.6.28/bin/mysqld[0xabbe6b]
/home/openxs/5.6.28/bin/mysqld[0xaa996a]
/home/openxs/5.6.28/bin/mysqld[0xa9720f]
/home/openxs/5.6.28/bin/mysqld[0xa40b2e]
/home/openxs/5.6.28/bin/mysqld[0x99fca5]
/home/openxs/5.6.28/bin/mysqld[0x997d89]
/home/openxs/5.6.28/bin/mysqld[0x99f9d9]
/home/openxs/5.6.28/bin/mysqld(_ZN7handler11ha_rnd_nextEPh+0x9c)[0x58c52c]/home/openxs/5.6.28/bin/mysqld(_ZN12ha_partition8rnd_nextEPh+0x41)[0xb48441]
/home/openxs/5.6.28/bin/mysqld(_ZN7handler11ha_rnd_nextEPh+0x64)[0x58c4f4]
/home/openxs/5.6.28/bin/mysqld(_Z13rr_sequentialP11READ_RECORD+0x37)[0x8400a7]
/home/openxs/5.6.28/bin/mysqld(_Z10sub_selectP4JOINP13st_join_tableb+0x181)[0x6d3891]
/home/openxs/5.6.28/bin/mysqld(_ZN4JOIN4execEv+0x391)[0x6d16b1]
/home/openxs/5.6.28/bin/mysqld[0x718349]
/home/openxs/5.6.28/bin/mysqld(_Z12mysql_selectP3THDP10TABLE_LISTjR4ListI4ItemEPS4_P10SQL_I_ListI8st_orderESB_S7_yP13select_resultP18st_select_lex_unitP13st_select_lex+0xbc)[0x718e0c]
/home/openxs/5.6.28/bin/mysqld(_Z13handle_selectP3THDP13select_resultm+0x175)[0x719015]
/home/openxs/5.6.28/bin/mysqld[0x6f3769]
/home/openxs/5.6.28/bin/mysqld(_Z21mysql_execute_commandP3THD+0x3575)[0x6f7f25]
/home/openxs/5.6.28/bin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_state+0x338)[0x6fba58]
/home/openxs/5.6.28/bin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0xc60)[0x6fce30]
/home/openxs/5.6.28/bin/mysqld(_Z10do_commandP3THD+0xd7)[0x6fec27]
/home/openxs/5.6.28/bin/mysqld(_Z24do_handle_one_connectionP3THD+0x116)[0x6c5da6]
/home/openxs/5.6.28/bin/mysqld(handle_one_connection+0x45)[0x6c5e85]
/home/openxs/5.6.28/bin/mysqld(pfs_spawn_thread+0x126)[0x989be6]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8184)[0x7f0eb0d91184]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f0eafa5737d]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7f0e74005050): select * from tp
Connection ID (thread ID): 3
Status: NOT_KILLED

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
161105 18:17:04 mysqld_safe Number of processes running now: 0
161105 18:17:04 mysqld_safe mysqld restarted
So, the assertion failure happened because the tablespace id of the deliberately corrupted partition does not match what InnoDB expects from the data dictionary. The server was restarted by mysqld_safe, but this time the startup was not successful:

...
2016-11-05 18:17:04 8146 [Note] InnoDB: Database was not shutdown normally!
2016-11-05 18:17:04 8146 [Note] InnoDB: Starting crash recovery.
2016-11-05 18:17:04 8146 [Note] InnoDB: Reading tablespace information from the .ibd files...
2016-11-05 18:17:04 8146 [ERROR] InnoDB: space header page consists of zero bytes in tablespace ./test/tp#P#p1.ibd (table test/tp#P#p1)
2016-11-05 18:17:04 8146 [Note] InnoDB: Page size:1024 Pages to analyze:64
2016-11-05 18:17:04 8146 [Note] InnoDB: Page size: 1024, Possible space_id count:0
2016-11-05 18:17:04 8146 [Note] InnoDB: Page size:2048 Pages to analyze:48
2016-11-05 18:17:04 8146 [Note] InnoDB: Page size: 2048, Possible space_id count:0
2016-11-05 18:17:04 8146 [Note] InnoDB: Page size:4096 Pages to analyze:24
2016-11-05 18:17:04 8146 [Note] InnoDB: Page size: 4096, Possible space_id count:0
2016-11-05 18:17:04 8146 [Note] InnoDB: Page size:8192 Pages to analyze:12
2016-11-05 18:17:04 8146 [Note] InnoDB: Page size: 8192, Possible space_id count:0
2016-11-05 18:17:04 8146 [Note] InnoDB: Page size:16384 Pages to analyze:6
2016-11-05 18:17:04 8146 [Note] InnoDB: Page size: 16384, Possible space_id count:0
2016-11-05 18:17:04 7f0f0f745780  InnoDB: Operating system error number 2 in a file operation.
InnoDB: The error means the system cannot find the path specified.
InnoDB: If you are installing InnoDB, remember that you must create
InnoDB: directories yourself, InnoDB does not create them.
InnoDB: Error: could not open single-table tablespace file ./test/tp#P#p1.ibd
InnoDB: We do not continue the crash recovery, because the table may become
InnoDB: corrupt if we cannot apply the log records in the InnoDB log to it.
InnoDB: To fix the problem and start mysqld:
InnoDB: 1) If there is a permission problem in the file and mysqld cannot
InnoDB: open the file, you should modify the permissions.
InnoDB: 2) If the table is not needed, or you can restore it from a backup,
InnoDB: then you can remove the .ibd file, and InnoDB will do a normal
InnoDB: crash recovery and ignore that table.
InnoDB: 3) If the file system or the disk is broken, and you cannot remove
InnoDB: the .ibd file, you can set innodb_force_recovery > 0 in my.cnf
InnoDB: and force InnoDB to continue crash recovery here.
161105 18:17:04 mysqld_safe mysqld from pid file /home/openxs/sandboxes/rsandbox_mysql-5_6_28/node1/data/mysql_sandbox22294.pid ended
Now, what shall we do? There is no reason to try to start the server again and again. We should check that the partition .ibd file is there (it is), has proper permissions (it does) and looks like a proper tablespace. We can check the latter if we have a proper tablespace file for the same partition elsewhere, by searching for known strings of data in the file, etc. In this case hexdump has already shown the data corruption, without any chance to restore anything. Time to ask the question that started this post...

With the answer in mind (yes, we can get the data from a partition .ibd file of a non-corrupted table), how do we proceed in practice? First of all, we need a MySQL server that is started and works, so that we can run SQL statements. This one cannot be started normally and, if we plan to restore data on it (for whatever reason), we have to start it first. To do so we have to try innodb_force_recovery settings starting from 1. Depending on how and when the corruption happened, 1 may not be enough. In general I would not recommend proceeding to any value greater than 4. It also makes sense to copy all files related to the corrupted table and, ideally, the shared tablespace to a safe place. Changes done with forced recovery may corrupt data even more, so it's good to have a way to roll back and start again. I copied all .ibd files for the table to the /tmp directory, but in reality make sure you just rename/move them inside the same physical filesystem, if you care about the huge size and the time to copy:
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ cp data/test/tp*.ibd /tmp/
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ ls -l /tmp/tp*.ibd
-rw-rw---- 1 openxs openxs 98304 лис  5 18:49 /tmp/tp#P#p0.ibd
-rw-rw---- 1 openxs openxs 98304 лис  5 18:49 /tmp/tp#P#p1.ibd
-rw-rw---- 1 openxs openxs 98304 лис  5 18:49 /tmp/tp#P#px.ibd
Then I set innodb_force_recovery to 1 in the [mysqld] section of my configuration file and started the server:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ grep innodb_force my.sandbox.cnf
innodb_force_recovery=1
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ ./start
. sandbox server started
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ tail data/msandbox.err    
InnoDB: innodb_force_recovery is on: we do not allow
InnoDB: database modifications by the user. Shut down
InnoDB: mysqld and edit my.cnf so that
InnoDB: innodb_force_... is removed.
2016-11-05 18:25:25 8491 [ERROR] Error writing relay log configuration.
2016-11-05 18:25:25 8491 [ERROR] Error reading relay log configuration.
2016-11-05 18:25:25 8491 [ERROR] Failed to initialize the master info structure
2016-11-05 18:25:25 8491 [Note] Check error log for additional messages. You will not be able to start replication until the issue is resolved and the server restarted.
2016-11-05 18:25:25 8491 [Note] Event Scheduler: Loaded 0 events
2016-11-05 18:25:25 8491 [Note] /home/openxs/5.6.28/bin/mysqld: ready for connections.
Version: '5.6.28-log'  socket: '/tmp/mysql_sandbox22294.sock'  port: 22294  MySQL Community Server (GPL)
It seems the server started, but when we try to access the table a strange thing happens:
slave1 [localhost] {msandbox} (test) > select * from tp;
ERROR 1146 (42S02): Table 'test.tp' doesn't exist
slave1 [localhost] {msandbox} (test) > show tables;
+----------------+
| Tables_in_test |
+----------------+
| t              |
| t1             |
| t2             |
| t_local        |
| tp             |
+----------------+
5 rows in set (0,00 sec)

slave1 [localhost] {msandbox} (test) > show create table tp\G
ERROR 1146 (42S02): Table 'test.tp' doesn't exist
The table is NOT there and is there at the same time. This indicates some problem with the InnoDB data dictionary. In the error log we see:
2016-11-05 18:26:14 8491 [ERROR] InnoDB: Failed to find tablespace for table '"test"."tp" /* Partition "p1" */' in the cache. Attempting to load the tablespace with space id 12.
2016-11-05 18:26:14 8491 [ERROR] InnoDB: In file './test/tp#P#p1.ibd', tablespace id and flags are 0 and 0, but in the InnoDB data dictionary they are 12 and 0.
Have you moved InnoDB .ibd files around without using the commands DISCARD TABLESPACE and IMPORT TABLESPACE? Please refer to http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting-datadict.html for how to resolve the issue.
2016-11-05 18:26:14 7f434af03700  InnoDB: Operating system error number 2 in a file operation.
InnoDB: The error means the system cannot find the path specified.
2016-11-05 18:26:14 8491 [ERROR] InnoDB: Could not find a valid tablespace file for 'test/tp#P#p1'. See http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting-datadict.html for how to resolve the issue.
2016-11-05 18:26:14 7f434af03700 InnoDB: cannot calculate statistics for table "test"."tp" /* Partition "p1" */ because the .ibd file is missing. For help, please refer to http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting.html
2016-11-05 18:26:18 7f434af03700 InnoDB: cannot calculate statistics for table "test"."tp" /* Partition "p1" */ because the .ibd file is missing. For help, please refer to http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting.html
2016-11-05 18:26:51 7f434af03700 InnoDB: cannot calculate statistics for table "test"."tp" /* Partition "p1" */ because the .ibd file is missing. For help, please refer to http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting.html
Forced recovery mode is designed to be used for one of two main actions with the problematic table: either to select some remaining data from it (we tried and cannot do this, and we don't care, as we have the data on another server in the replication setup), or to drop it (we also copied the .ibd files elsewhere, if you remember). Nothing works with the table, but DROP does work:
slave1 [localhost] {msandbox} (test) > drop table tp;
Query OK, 0 rows affected (0,32 sec)
Now we can try to restart without forced recovery, recreate the table and link the orphaned .ibd files back:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ ./stop
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ vi my.sandbox.cnf
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ ./start
. sandbox server started
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ grep innodb_force my.sandbox.cnf
#innodb_force_recovery=1
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ tail data/msandbox.err   
2016-11-05 18:51:50 8861 [Note] Server hostname (bind-address): '127.0.0.1'; port: 22294
2016-11-05 18:51:50 8861 [Note]   - '127.0.0.1' resolves to '127.0.0.1';
2016-11-05 18:51:50 8861 [Note] Server socket created on IP: '127.0.0.1'.
2016-11-05 18:51:50 8861 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MySQL server acts as a slave and has his hostname changed!! Please use '--relay-log=mysql_sandbox22294-relay-bin' to avoid this problem.
2016-11-05 18:51:50 8861 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2016-11-05 18:51:50 8861 [Note] Slave I/O thread: connected to master 'rsandbox@127.0.0.1:22293',replication started in log 'mysql-bin.000005' at position 1015
2016-11-05 18:51:50 8861 [Note] Event Scheduler: Loaded 0 events
2016-11-05 18:51:50 8861 [Note] /home/openxs/5.6.28/bin/mysqld: ready for connections.
Version: '5.6.28-log'  socket: '/tmp/mysql_sandbox22294.sock'  port: 22294  MySQL Community Server (GPL)
2016-11-05 18:51:51 8861 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.000005' at position 1015, relay log './mysql_sandbox22294-relay-bin.000237' position: 4
It looks promising so far. We can get the DDL for the table from another server in the replication setup (or any other source, like an older dump) and try to create the table in a hurry:
master [localhost] {msandbox} (test) > show create table tp\G
*************************** 1. row ***************************
       Table: tp
Create Table: CREATE TABLE `tp` (
  `id` int(11) NOT NULL DEFAULT '0',
  `val` int(11) DEFAULT NULL,
  `ts` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  PRIMARY KEY (`id`,`ts`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (year(ts))
(PARTITION p0 VALUES LESS THAN (2006) ENGINE = InnoDB,
 PARTITION p1 VALUES LESS THAN (2015) ENGINE = InnoDB,
 PARTITION px VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */
1 row in set (0,00 sec)

slave1 [localhost] {msandbox} (test) > show tables;
+----------------+
| Tables_in_test |
+----------------+
| t              |
| t1             |
| t2             |
| t_local        |
+----------------+
4 rows in set (0,00 sec)

slave1 [localhost] {msandbox} (test) > CREATE TABLE `tp` (   `id` int(11) NOT NULL DEFAULT '0',   `val` int(11) DEFAULT NULL,   `ts` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',   PRIMARY KEY (`id`,`ts`) ) ENGINE=InnoDB DEFAULT CHARSET=latin1 /*!50100 PARTITION BY RANGE (year(ts)) (PARTITION p0 VALUES LESS THAN (2006) ENGINE = InnoDB,  PARTITION p1 VALUES LESS THAN (2015) ENGINE = InnoDB,  PARTITION px VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */;
ERROR 1813 (HY000): Tablespace for table '`test`.`tp` /* Partition `p1` */' exists. Please DISCARD the tablespace before IMPORT.
So, it looks like DROP TABLE worked but the corrupted partition file remained? Yes, it remained:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ ls -l data/test/tp*.ibd 
-rw-rw---- 1 openxs openxs 98304 лис  5 17:41 data/test/tp#P#p1.ibd
We have to remove it and then try again:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ rm data/test/tp*.ibd

slave1 [localhost] {msandbox} (test) > CREATE TABLE `tp` (   `id` int(11) NOT NULL DEFAULT '0',   `val` int(11) DEFAULT NULL,   `ts` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',   PRIMARY KEY (`id`,`ts`) ) ENGINE=InnoDB DEFAULT CHARSET=latin1 /*!50100 PARTITION BY RANGE (year(ts)) (PARTITION p0 VALUES LESS THAN (2006) ENGINE = InnoDB,  PARTITION p1 VALUES LESS THAN (2015) ENGINE = InnoDB,  PARTITION px VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */;
Query OK, 0 rows affected (0,75 sec)
Now we can mostly follow Jervin's blog post and create a non-partitioned table of the same structure with the following DDL statements:

create table tp_tmp like tp;
alter table tp_tmp remove partitioning;
alter table tp_tmp discard tablespace;
At this moment we can re-import the partitions one by one. In reality, instead of copying, we may use rename or move for those partitions that were not corrupted.
slave1 [localhost] {msandbox} (test) > \! cp /tmp/tp#P#p0.ibd data/test/tp_tmp.ibd
slave1 [localhost] {msandbox} (test) > \! ls -l data/test/tp_tmp.*             
-rw-rw---- 1 openxs openxs  8610 лис  5 19:05 data/test/tp_tmp.frm
-rw-rw---- 1 openxs openxs 98304 лис  5 19:07 data/test/tp_tmp.ibd
slave1 [localhost] {msandbox} (test) > alter table tp_tmp import tablespace;
Query OK, 0 rows affected, 1 warning (0,34 sec)

slave1 [localhost] {msandbox} (test) > show warnings\G
*************************** 1. row ***************************
  Level: Warning
   Code: 1810
Message: InnoDB: IO Read error: (2, No such file or directory) Error opening './test/tp_tmp.cfg', will attempt to import without schema verification
1 row in set (0,00 sec)

slave1 [localhost] {msandbox} (test) > select * from tp_tmp;
+----+------+---------------------+
| id | val  | ts                  |
+----+------+---------------------+
|  1 |   10 | 2005-10-01 00:00:00 |
+----+------+---------------------+
1 row in set (0,00 sec)
We get the warning, but it can be safely ignored - we know for sure that the table structure matches the imported .ibd file. Now we are ready to exchange the empty partition with the proper partition data in the table:

slave1 [localhost] {msandbox} (test) > alter table tp exchange partition p0 with table tp_tmp;
Query OK, 0 rows affected (0,51 sec)

slave1 [localhost] {msandbox} (test) > select * from tp_tmp;
Empty set (0,00 sec)

slave1 [localhost] {msandbox} (test) > select * from tp;
+----+------+---------------------+
| id | val  | ts                  |
+----+------+---------------------+
|  1 |   10 | 2005-10-01 00:00:00 |
+----+------+---------------------+
1 row in set (0,01 sec)
Then we can do the same for the other partition(s) stored on the same server, all but the corrupted one:
slave1 [localhost] {msandbox} (test) > alter table tp_tmp discard tablespace;
Query OK, 0 rows affected (0,05 sec)

slave1 [localhost] {msandbox} (test) > \! cp /tmp/tp#P#px.ibd data/test/tp_tmp.ibd
slave1 [localhost] {msandbox} (test) > alter table tp_tmp import tablespace;
Query OK, 0 rows affected, 1 warning (0,35 sec)

slave1 [localhost] {msandbox} (test) > select * from tp_tmp;
+----+------+---------------------+
| id | val  | ts                  |
+----+------+---------------------+
|  3 |   30 | 2016-11-05 17:34:30 |
+----+------+---------------------+
1 row in set (0,00 sec)

slave1 [localhost] {msandbox} (test) > alter table tp exchange partition px with table tp_tmp;
Query OK, 0 rows affected (0,53 sec)

slave1 [localhost] {msandbox} (test) > select * from tp_tmp;                   
Empty set (0,00 sec)

slave1 [localhost] {msandbox} (test) > select * from tp;
+----+------+---------------------+
| id | val  | ts                  |
+----+------+---------------------+
|  1 |   10 | 2005-10-01 00:00:00 |
|  3 |   30 | 2016-11-05 17:34:30 |
+----+------+---------------------+
2 rows in set (0,00 sec)
The file for the corrupted partition has to be copied from elsewhere; in my case it's the datadir of the master sandbox instance:
slave1 [localhost] {msandbox} (test) > alter table tp_tmp discard tablespace;  
Query OK, 0 rows affected (0,04 sec)

slave1 [localhost] {msandbox} (test) > \! cp ../master/data/test/tp#P#p1.ibd data/test/tp_tmp.ibd
slave1 [localhost] {msandbox} (test) > alter table tp_tmp import tablespace;   
Query OK, 0 rows affected, 1 warning (0,36 sec)

slave1 [localhost] {msandbox} (test) > select * from tp_tmp;
+----+------+---------------------+
| id | val  | ts                  |
+----+------+---------------------+
|  2 |   10 | 2014-10-02 00:00:00 |
+----+------+---------------------+
1 row in set (0,00 sec)

slave1 [localhost] {msandbox} (test) > alter table tp exchange partition px with table tp_tmp;
ERROR 1737 (HY000): Found a row that does not match the partition
slave1 [localhost] {msandbox} (test) > alter table tp exchange partition p1 with table tp_tmp;
Query OK, 0 rows affected (0,47 sec)

slave1 [localhost] {msandbox} (test) > select * from tp_tmp;
Empty set (0,00 sec)

slave1 [localhost] {msandbox} (test) > select * from tp partition(p1);
+----+------+---------------------+
| id | val  | ts                  |
+----+------+---------------------+
|  2 |   10 | 2014-10-02 00:00:00 |
+----+------+---------------------+
1 row in set (0,01 sec)
As you can see from my small mistake above, if the imported data do not match the range partition condition, they are not accepted into the partition and you get an error. We are done with restoring the data, and now we can check if replication works. Let me change all the data on the master:
master [localhost] {msandbox} (test) > update tp set val = val + 1;
Query OK, 3 rows affected (0,03 sec)
Rows matched: 3  Changed: 3  Warnings: 0

master [localhost] {msandbox} (test) > select * from tp;
+----+------+---------------------+
| id | val  | ts                  |
+----+------+---------------------+
|  1 |   11 | 2005-10-01 00:00:00 |
|  2 |   11 | 2014-10-02 00:00:00 |
|  3 |   31 | 2016-11-05 17:34:30 |
+----+------+---------------------+
3 rows in set (0,00 sec)
and make sure they are replicated to the slave:
slave1 [localhost] {msandbox} (test) > select * from tp;
+----+------+---------------------+
| id | val  | ts                  |
+----+------+---------------------+
|  1 |   11 | 2005-10-01 00:00:00 |
|  2 |   11 | 2014-10-02 00:00:00 |
|  3 |   31 | 2016-11-05 17:34:30 |
+----+------+---------------------+
3 rows in set (0,00 sec)

In the error log we'll see the earlier complaint about the orphan partition .ibd file, and after that only messages related to importing tablespaces, with no further errors of any kind:
...
2016-11-05 19:01:02 9201 [ERROR] InnoDB: The file './test/tp#P#p1.ibd' already exists though the corresponding table did not exist in the InnoDB data dictionary. Have you moved InnoDB .ibd files around without using the SQL commands DISCARD TABLESPACE and IMPORT TABLESPACE, or did mysqld crash in the middle of CREATE TABLE? You can resolve the problem by removing the file './test/tp#P#p1.ibd' under the 'datadir' of MySQL.
2016-11-05 19:07:58 9201 [Note] InnoDB: Sync to disk
2016-11-05 19:07:58 9201 [Note] InnoDB: Sync to disk - done!
2016-11-05 19:07:58 9201 [Note] InnoDB: Phase I - Update all pages
2016-11-05 19:07:58 9201 [Note] InnoDB: Sync to disk
2016-11-05 19:07:58 9201 [Note] InnoDB: Sync to disk - done!
2016-11-05 19:07:58 9201 [Note] InnoDB: Phase III - Flush changes to disk
2016-11-05 19:07:58 9201 [Note] InnoDB: Phase IV - Flush complete
2016-11-05 19:10:48 9201 [Note] InnoDB: Sync to disk
2016-11-05 19:10:48 9201 [Note] InnoDB: Sync to disk - done!
2016-11-05 19:10:48 9201 [Note] InnoDB: Phase I - Update all pages
2016-11-05 19:10:48 9201 [Note] InnoDB: Sync to disk
2016-11-05 19:10:48 9201 [Note] InnoDB: Sync to disk - done!
2016-11-05 19:10:48 9201 [Note] InnoDB: Phase III - Flush changes to disk
2016-11-05 19:10:49 9201 [Note] InnoDB: Phase IV - Flush complete
2016-11-05 19:12:42 9201 [Note] InnoDB: Sync to disk
2016-11-05 19:12:42 9201 [Note] InnoDB: Sync to disk - done!
2016-11-05 19:12:42 9201 [Note] InnoDB: Phase I - Update all pages
2016-11-05 19:12:42 9201 [Note] InnoDB: Sync to disk
2016-11-05 19:12:42 9201 [Note] InnoDB: Sync to disk - done!
2016-11-05 19:12:42 9201 [Note] InnoDB: Phase III - Flush changes to disk
2016-11-05 19:12:42 9201 [Note] InnoDB: Phase IV - Flush complete
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$
I am too tired to write a nice summary right now, so I'll write it in some other post. Now I (and you, my readers) have a detailed enough reference for all the boring steps involved in recovering a corrupted partition from an .ibd file taken from another server in the replication setup.

Saturday, October 29, 2016

MySQL Support Engineer's Chronicles, Issue #2

It's time to continue my new series that I started 2 weeks ago. I'd like to start with a reminder that it's time to send your talks for the "MySQL and Friends Devroom" at FOSDEM 2017 - the only MySQL-related event next year that I plan to attend in any case. It seems we have one more week to submit, but I've already filled in all the details for the "main" talk, "Understanding MyRocks locks and deadlocks". I'd like to apply my usual source code reading and gdb breakpoints approach in case the official documentation still misses important details by the end of January, 2017. The official MySQL manual is still in the process of improving with regards to explaining InnoDB locks, and I am happy to admit that this week yet another of my related documentation requests, Bug #71735, "Manual does not explain locks set by SELECT ... FOR UPDATE properly", was closed.

I am really excited by the community and developer activity around MyRocks these days, so it was easy to decide to spend more time on MyRocks while preparing for FOSDEM. Just take a look at recent issues reported by Justin Swanhart, for example: Issue #365 and Issue #369. How often do you see such active, almost real-time discussion in bug reports about anything else MySQL-related these days?

I had to spend a lot of time recently trying to understand and prove the root cause of one performance problem that happens once in a while on a node of a Galera cluster, and for this I've used Performance_Schema, maybe for the first time (since reporting Bug #68079) on a real life case. It turned out that properly sizing the tables there, so that under high load and concurrency we still get relevant data a few seconds after the problem has happened, is not a trivial task. Moreover, it seems the Galera code is not instrumented in enough detail, so it is hard to measure the impact it may have. While trying to find a proper trade-off between the amount of data collected, memory usage and performance impact, I decided to use profilers, from PMP and the idea of replacing gdb there with quickstack, to perf (which I've used to make some really good observations). I'd like to summarize some recent (and upcoming) profiling experience in a talk that I am still trying to submit for FOSDEM, titled "Applying profilers to MySQL". In the meantime, check the comments on this Facebook post of mine for many useful details.

I'd also like to submit a talk on how a single bug report may influence MySQL development and the development of the bug reporter's skills. It's just an idea for now, and I plan to concentrate on a story around Bug #68079, which will be 4 years old by the time of FOSDEM 2017. I have new interesting results to share that I got while trying to check how the fix helped and how this same use case performs and scales in MyRocks now. I'm not sure if this is a good idea and whether anything but a blog post may come out of it. Some of my previous attempts to build a talk around this bug were not really well accepted...

This week I had to explain how to get rid of huge ibdata* file(s) while moving each InnoDB table to a separate .ibd file. I've found a great summary of options in this old blog post by Shlomi Noach. It is very useful in case you cannot just dump everything and reload it into a new instance, for whatever reason.
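Just to give a flavour of what is involved (a sketch only, with made-up schema and table names, not the full procedure from that post): with innodb_file_per_table enabled, rebuilding a table moves its data into its own .ibd file, but note that the shared ibdata1 file itself never shrinks this way, which is exactly why the other options discussed there exist.

-- sketch only: names are made up; repeat the ALTER for every table you want moved out of ibdata1
SET GLOBAL innodb_file_per_table = ON;
ALTER TABLE mydb.mytable ENGINE=InnoDB;  -- rebuilds the table into its own .ibd file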

Another problem I had to work on was related to a bad query plan on a slave where the table was partitioned. We tried to fight bad cardinality estimations (see Bug #67351 etc.), but even after getting them closer to reality the optimizer still picks a wrong index sometimes. My next hope is the engine-independent table statistics of MariaDB.
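For reference, "getting them closer to reality" in practice usually means something like the sketch below (made-up table name; assumes persistent InnoDB statistics, the default in 5.6; not the exact commands used in that case):

-- sample more index pages for persistent statistics, then re-collect them for the table
SET GLOBAL innodb_stats_persistent_sample_pages = 256;
ANALYZE TABLE mydb.partitioned_table;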

I've noted that customers have recently been actively trying to use PAM and LDAP with MySQL. I consider this blog post by my colleague Geoff Montee very useful for them (and for myself).

Finally, I've found this great blog post by Jervin Real very useful while working with colleagues on continuous InnoDB crashes caused by a corrupted partition .ibd file. The idea was to restore the partition data from a backup on a separate instance and also recover the data for all non-corrupted partitions of a huge table there, while the problematic instance was started up in forced recovery mode to drop the problematic table and check for any further problems.

As I already mentioned several bugs in this post, I'd like to share a link to the list of bugs recently reported by my colleague from MariaDB Foundation, Sergey Vojtovich. Dear Oracle MySQL engineers, please, pay proper attention to them, especially those with "bs" in the "Summary"!

Monday, October 24, 2016

Fixing Broken MySQL Replication: Basic Usage of pt-table-checksum, pt-slave-restart and pt-table-sync

For the second day in a row I have been trying to concentrate on writing the second issue of my new series of blog posts, but I cannot complete it. It seems half of it is going to be devoted to three tools from Percona Toolkit that I had to remember and explain this week: pt-table-checksum, pt-table-sync and pt-slave-restart. So, I decided to write a separate post on this topic.

The context was simple: in the middle of the (weekend) night a customer comes with a broken replication issue, without a fresh backup from the master, or with a few gigabytes or even terabytes of data there. The customer tries to skip the error on some of the affected slave(s), but keeps getting more errors and the slave does not work. The question is simple: what else can be done besides making a new backup and restoring it on the slaves?

The customer assumes (or wants to believe) that only a few rows in a table or two are missing, and asks if there is a way to "resolve" only the conflicted tables and start replication once again. The situation may be complicated by the fact that the master keeps getting changes to the data, and with every minute spent on decision making the slave(s) get more out of sync with the master, while the data on the slave may soon be badly needed and so must be in sync...

It's very tempting to provide very basic advice: "you can try to use pt-table-sync", with a link to the manual for the poor customer to read and apply, and then forget about the issue for a couple of days (as the customer may need about that much time to read, try and then finally wait for any results).

While this is easy and may even work for many customers, I think it is not fair to stop at this stage. The manual for the tool, while great, is not easy to follow even for somebody like me who has read it and explained it many times, and it is even harder to follow under the stress of a production failure. Unfortunately, even the best of the existing blog posts about this topic (like this one), while adding many real life use cases and details (like what exact privileges are needed for the MySQL user running the tool etc.), still miss some steps for the real life case when a serious decision has to be made fast. So, I decided to play a bit with the tools, show the exact outputs one may encounter when things go wrong, and explain how to overcome some of the problems one may hit.

First, let's start with a short list of related best practices (shame on me for writing about them, not my style at all, but they do exist):
  1. If you suppose to do anything else in case of broken replication besides restoring slave from the backup or relying on some automated failover solution, please, install Percona Toolkit (or at least pt-table-checksum, pt-table-sync and pt-slave-restart tools) on all hosts in your replication setup. You never know which of them will become master or remain online, so better get prepared to restore replication from any of them, even if there is no Internet connection at the moment to download extra tools.
  2. Consider running pt-table-checksum on a regular basis even if you do not see any replication errors. This way you'll make sure the tool works and whatever data differences (or problems to run the tool) may appear, you'll find them out early, during some well planned maintenance window. So, you'll have enough time to figure out what's going on before things get really badly broken at some unpredictable moment.
  3. Better read manual pages for the tools at least once and try to use the tools in testing environment before relying on them in production. You may hit bugs, find tables without proper indexes, waste time on finding proper options and Googling, or break things even more if you try really hard and, say, use pt-table-sync in a wrong way...
  4. Make sure all your tables have PRIMARY or UNIQUE keys, ideally single column ones. pt-table-sync relies on primary keys heavily, as it needs a way to identify the row to change or delete if needed. A single column UNIQUE or PRIMARY key will also help both pt-table-sync and pt-table-checksum to split data into chunks properly, without complaining about skipped chunks and other confusing messages (see the query sketch right after this list for one way to find tables without such keys).
  5. Note that pt-table-checksum and pt-table-sync can NOT help to fix DDL differences, like some table dropped on slave by mistake or having different data type of some column.
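As promised in item 4 above, here is one way to find InnoDB tables that have neither a PRIMARY nor a UNIQUE key, using information_schema (my own sketch, not anything from the Percona manual; adjust the schema filter to your needs):

-- list InnoDB tables without any PRIMARY KEY or UNIQUE constraint
SELECT t.table_schema, t.table_name
  FROM information_schema.tables t
  LEFT JOIN information_schema.table_constraints tc
    ON tc.table_schema = t.table_schema
   AND tc.table_name = t.table_name
   AND tc.constraint_type IN ('PRIMARY KEY', 'UNIQUE')
 WHERE t.engine = 'InnoDB'
   AND t.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema')
   AND tc.constraint_name IS NULL;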
I plan to show commands and outputs, and for this I need a replication setup. I am lazy these days and prefer not to copy-paste (even reviewed and edited) real outputs from customers, so I'll just use one of the replication sandboxes I have on my netbook. It adds a few extra complexities, like non-default ports used by the MySQL instances involved and "wrong" values for the report_host options. Non-default (not 3306) ports are used for MySQL servers way more often than one may expect. So, I use MySQL 5.6.28 and my master is the following:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/master$ ./my sql test -e"select @@port"
+--------+
| @@port |
+--------+
|  22293 |
+--------+


First of all, in case of broken replication it would be useful to find out how large the difference is: if every other table has 90% of its rows different on some slave (which may well happen if you do not follow the best practices above), it's faster to rebuild the slave from a backup. To measure the difference you should use pt-table-checksum. In the example below I tried to apply it to a single database, test, in my replication sandbox (with two slaves):

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/master$ pt-table-checksum --databases test -umsandbox -pmsandbox --host=127.0.0.1 --port=22293
Cannot connect to P=22294,h=SBslave1,p=...,u=msandbox
Cannot connect to P=22295,h=SBslave2,p=...,u=msandbox
Diffs cannot be detected because no slaves were found.  Please read the --recursion-method documentation for information.
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
10-23T19:30:10      0      0   262146       7       0   2.769 test.t
10-23T19:30:13      0      0    65536       1       0   2.882 test.t1
10-23T19:30:16      0      0    32768       1       0   2.258 test.t2


So, I connect to a specific host (127.0.0.1 in this case) and port (22293 in my case). I had provided the username (and password) of a user having proper privileges (now go find them in the manual - you WILL NOT. Then go read this blueprint and Bug #916168, or that Muhammad's post. I literally had to search for all SQL statements executed in the source code and then summarize all the related minimal privileges once; I wish I had that list at hand when it was needed... Sounds complicated already, doesn't it?)
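For the record, a commonly used starting point for that user looks roughly like the following. This is my own summary, NOT an officially documented minimal set, so double check it against your version of the tool; the account name and password below are placeholders:

-- grant global privileges needed to inspect replication and read all tables,
-- plus full access to the percona database where checksums are stored
GRANT PROCESS, SUPER, REPLICATION SLAVE, REPLICATION CLIENT, SELECT
  ON *.* TO 'checksums'@'%' IDENTIFIED BY '...';
GRANT ALL PRIVILEGES ON percona.* TO 'checksums'@'%';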

You can also see the problem in the output above - my command was not able to connect to my slaves. In this case it was because report_host was explicitly set on my slaves to a value that is NOT resolvable from the host I work on:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/master$ ping SBslave1
ping: unknown host SBslave1
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/master$ cd ../node1

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ ./my sql -e"select @@report_host"
+---------------+
| @@report_host |
+---------------+
| SBslave1      |
+---------------+

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ grep report my.sandbox.cnf
report-host=SBslave1
report-port=22294


This is what master "thinks" about its slaves:

master [localhost] {msandbox} (test) > show slave hosts;
+-----------+----------+-------+-----------+--------------------------------------+
| Server_id | Host     | Port  | Master_id | Slave_UUID                           |
+-----------+----------+-------+-----------+--------------------------------------+
|       101 | SBslave1 | 22294 |         1 | 3c3e7a24-b2c3-11e5-9b34-f4b7e2133dbd |
|       102 | SBslave2 | 22295 |         1 | 3e226dff-b2c3-11e5-9b35-f4b7e2133dbd |
+-----------+----------+-------+-----------+--------------------------------------+
2 rows in set (0,00 sec)


So, when your master is confused and/or pt-table-checksum does not see your slaves (that exist and are working), you can try to resolve the problem by creating a separate table and providing so-called "DSN"s (data source names) for them there:

master [localhost] {msandbox} (test) > show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| percona            |
| performance_schema |
| test               |
+--------------------+
5 rows in set (0,00 sec)

master [localhost] {msandbox} (test) > use percona
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
master [localhost] {msandbox} (percona) > CREATE TABLE `dsns` (
    ->   `id` int(11) NOT NULL AUTO_INCREMENT,
    ->   `parent_id` int(11) DEFAULT NULL,
    ->   `dsn` varchar(255) NOT NULL,
    ->   PRIMARY KEY (`id`)
    -> );

Query OK, 0 rows affected (0,27 sec)

master [localhost] {msandbox} (percona) > insert into dsns(dsn) values('h=127.0.0.1,u=msandbox,p=msandbox,P=22294');
Query OK, 1 row affected (0,06 sec)

master [localhost] {msandbox} (percona) > insert into dsns(dsn) values('h=127.0.0.1,u=msandbox,p=msandbox,P=22295');
Query OK, 1 row affected (0,06 sec)


Note that the percona database is created by default if it does not exist the first time you run pt-table-checksum. The structure of the table to create there, and the details of the "DSN" syntax, were taken from the manual, which you should use as a reference for this case. Ideally you should make sure that the database and the table are created on the master and all slaves well before you hit your first replication breakage ;) Hence one of the "best practices" listed above.

With the details on how to access the slaves written into the percona.dsns table, you can now tell the pt-table-checksum tool how to find them, using the --recursion-method=dsn=h=127.0.0.1,D=percona,t=dsns,P=22293 option to point it to that table:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/master$ pt-table-checksum --databases test -umsandbox -pmsandbox --host=127.0.0.1 --port=22293 --recursion-method=dsn=h=127.0.0.1,D=percona,t=dsns,P=22293
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
10-23T19:38:26      0      0   262146       8       0   2.705 test.t
10-23T19:38:28      0      0    65536       1       0   1.309 test.t1
10-23T19:38:28      0      0    32768       1       0   0.657 test.t2


This is what you should have at hand, ideally, well before your replication breaks for the first time - details on how much time it takes to check each table, how many chunks there are, the fact that none of them were skipped, and definite information that at a specific moment in time the data were the same on all slaves involved.

Now, it's time to break replication. I check some data in the test.t table and then deliberately make them different on slave2:

slave2 [localhost] {msandbox} (test) > select * from t limit 10;
+----+---------------------+
| id | ts                  |
+----+---------------------+
|  1 | 2016-03-18 17:47:39 |
|  2 | 2016-03-18 17:47:40 |
|  3 | 2016-03-18 17:47:40 |
|  4 | 2016-03-18 17:47:40 |
|  5 | 2016-03-18 17:47:40 |
|  6 | 2016-03-18 17:47:40 |
|  7 | 2016-03-18 17:47:40 |
|  8 | 2016-03-18 17:47:40 |
|  9 | 2016-03-18 17:47:40 |
| 10 | 2016-03-18 17:47:40 |
+----+---------------------+
10 rows in set (0,00 sec)

slave2 [localhost] {msandbox} (test) > delete from t where id = 9;
Query OK, 1 row affected (0,05 sec)

slave2 [localhost] {msandbox} (test) > delete from t where id = 100;
Query OK, 1 row affected (0,04 sec)

slave2 [localhost] {msandbox} (test) > insert into t values (10000000, now());
Query OK, 1 row affected (0,05 sec)


Now, I continue to make changes on the master:

master [localhost] {msandbox} (test) > update t set ts = now() where id between 1 and 10;
Query OK, 10 rows affected (0,07 sec)
Rows matched: 10  Changed: 10  Warnings: 0


Then, eventually, I find out that replication on slave2 is broken:

slave2 [localhost] {msandbox} (test) > show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 22293
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000004
          Read_Master_Log_Pos: 24026
               Relay_Log_File: mysql_sandbox22295-relay-bin.000009
                Relay_Log_Pos: 23716
        Relay_Master_Log_File: mysql-bin.000004
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1032
                   Last_Error: Could not execute Update_rows event on table test.t; Can't find record in 't', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000004, end_log_pos 23995
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 23553
              Relay_Log_Space: 3711091
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1032
               Last_SQL_Error: Could not execute Update_rows event on table test.t; Can't find record in 't', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000004, end_log_pos 23995
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
                  Master_UUID: 2b1577b7-b2c3-11e5-9b34-f4b7e2133dbd
             Master_Info_File: /home/openxs/sandboxes/rsandbox_mysql-5_6_28/node2/data/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State:
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp: 161023 19:41:37
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
1 row in set (0,00 sec)


You may note that some slave has a problem when running pt-table-checksum the next time:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node2$ cd ../master/
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/master$ pt-table-checksum --databases test -umsandbox -pmsandbox --host=127.0.0.1 --port=22293 --recursion-method=dsn=h=127.0.0.1,D=percona,t=dsns,P=22293
Replica ao756 is stopped.  Waiting.
Replica ao756 is stopped.  Waiting.

...





We see that pt-table-checksum waits for some slave that it knows and cares about (and that cannot be properly identified by hostname in my case, as all 3 instances run on the same host) to start. This is expected, as the tool works by running statements on the master (relying on statement-based logging for its connection to the master) that are expected to be replicated to the slaves and produce different results (chunk checksums) there if the data are different on a specific slave. So, if pt-table-checksum produces messages like those above and nothing else for some time, then at least one of the slaves is NOT running and replication is broken. We can stop the tool:

^C# Caught SIGINT.
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
10-23T19:43:43      0      0     1000       1       0  37.123 test.t

But surely the checksums were not calculated completely.
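To make the mechanism described above a bit more concrete: per chunk, the tool issues something conceptually like the statement below on the master (a heavily simplified sketch of my own - the statement pt-table-checksum actually generates is more elaborate). When it is replicated, the slave re-computes this_cnt and this_crc from its own copy of the data, and the tool later compares them to the master's values:

-- conceptual sketch only, not the exact statement generated by the tool
REPLACE INTO percona.checksums (db, tbl, chunk, this_cnt, this_crc)
SELECT 'test', 't', 1, COUNT(*),
       COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', id, ts)) AS UNSIGNED)), 10, 16)), 0)
  FROM test.t FORCE INDEX (`PRIMARY`)
 WHERE id BETWEEN 1 AND 1000;  -- chunk boundaries are chosen by the tool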

To proceed further with checking and fixing (if needed), we have to identify the stopped slave and start it again. In my case I already know that the slave is not just stopped by chance, but got an error and is not able to find the row to update. I can skip this error manually (see the sketch just below), or, if I assume or know there are many errors, I can rely on the pt-slave-restart tool to make sure replication is working again.
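The manual way, for a non-GTID setup like this sandbox, is roughly the following, executed on the broken slave (a sketch; with GTIDs one would inject an empty transaction for the offending position instead):

STOP SLAVE SQL_THREAD;
SET GLOBAL sql_slave_skip_counter = 1;  -- skip one event group
START SLAVE SQL_THREAD;

Here is pt-slave-restart doing the same automatically: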

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/master$ pt-slave-restart -umsandbox -pmsandbox --host=127.0.0.1 --port=22295
2016-10-23T19:45:05 P=22295,h=127.0.0.1,p=...,u=msandbox mysql_sandbox22295-relay-bin.000009       23716 1032


Note that pt-slave-restart by default reports every restart and sleeps for some time between checks. I pointed it at the slave that I knew was stopped, and I see that, after a single restart, the tool prints nothing more. I can check if the slave is in sync:

slave2 [localhost] {msandbox} (test) > show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 22293
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000004
          Read_Master_Log_Pos: 26314
               Relay_Log_File: mysql_sandbox22295-relay-bin.000009
                Relay_Log_Pos: 26477
        Relay_Master_Log_File: mysql-bin.000004
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 26314
              Relay_Log_Space: 3713379
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
                  Master_UUID: 2b1577b7-b2c3-11e5-9b34-f4b7e2133dbd
             Master_Info_File: /home/openxs/sandboxes/rsandbox_mysql-5_6_28/node2/data/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
1 row in set (0,00 sec)

From the data highlighted above it is clear that it is in sync. I can leave pt-slave-restart running just in case, or I can stop it now with Ctrl+C to get my ssh session back:

...
^CExiting on SIGINT.
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/master$


Now, since we restarted the slave after its SQL thread was stopped with an error at least once, we are 100% sure that the data on the slave are NOT in sync with its master. We can still check how much they differ with pt-table-checksum:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/master$ pt-table-checksum --databases test -umsandbox -pmsandbox --host=127.0.0.1 --port=22293 --recursion-method=dsn=h=127.0.0.1,D=percona,t=dsns,P=22293
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
10-23T19:49:55      0      3   262146       7       0   2.453 test.t
10-23T19:49:56      0      0    65536       1       0   1.307 test.t1
10-23T19:49:58      0      0    32768       1       0   1.361 test.t2


So, 3 chunks with differences were found. We can try to find out more about the differences in the chunks for this table using the --replicate-check-only option:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ pt-table-checksum --databases test -umsandbox -pmsandbox --host=127.0.0.1 --port=22293 --replicate-check-only --recursion-method=dsn=h=127.0.0.1,D=percona,t=dsns,P=22293
Differences on ao756
TABLE CHUNK CNT_DIFF CRC_DIFF CHUNK_INDEX LOWER_BOUNDARY UPPER_BOUNDARY
test.t 1 -3 1 PRIMARY 1,1,2016-10-23 19:41:37 1000,1000,2016-03-18 17:47:40
test.t 6 1 0 PRIMARY  1,1,2016-10-23 19:41:37
test.t 7 1 0 PRIMARY 262146,262146,2016-03-18 17:47:40

We see that the difference in row counts (CNT_DIFF) is not high, so we can expect that with just a few statements executed we can get the data back in sync, and proceed with the pt-table-sync tool. We can also try to find out how big the difference is by checking the percona.checksums table on the slave that is out of sync:

slave2 [localhost] {msandbox} (percona) > SELECT db, tbl, SUM(this_cnt) AS total_rows, COUNT(*) AS chunks FROM percona.checksums WHERE (  master_cnt <> this_cnt  OR master_crc <> this_crc  OR ISNULL(master_crc) <> ISNULL(this_crc)) GROUP BY db, tbl;
+------+-----+------------+--------+
| db   | tbl | total_rows | chunks |
+------+-----+------------+--------+
| test | t   |        999 |      3 |
+------+-----+------------+--------+
1 row in set (0,00 sec)


If the difference is in many thousands or millions of rows, at this stage it's not yet too late to decide to restore the slave (or just this specific table) from a backup instead of wasting more time on syncing it row by row.

In my case the difference seems to be reasonably small, so I'll use pt-table-sync, point it at the broken slave, ask it to --sync-to-master this slave, and use the --print option to double check what it is going to change:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node2$ pt-table-sync --print --sync-to-master u=msandbox,p=msandbox,h=127.0.0.1,P=22295,D=test,t=t
DELETE FROM `test`.`t` WHERE `id`='1' AND `ts`='2016-03-18 17:47:39' LIMIT 1 /*percona-toolkit src_db:test src_tbl:t src_dsn:D=test,P=22293,h=127.0.0.1,p=...,t=t,u=msandbox dst_db:test dst_tbl:t dst_dsn:D=test,P=22295,h=127.0.0.1,p=...,t=t,u=msandbox lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:31126 user:openxs host:ao756*/;
DELETE FROM `test`.`t` WHERE `id`='2' AND `ts`='2016-03-18 17:47:40' LIMIT 1 /*percona-toolkit src_db:test src_tbl:t src_dsn:D=test,P=22293,h=127.0.0.1,p=...,t=t,u=msandbox dst_db:test dst_tbl:t dst_dsn:D=test,P=22295,h=127.0.0.1,p=...,t=t,u=msandbox lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:31126 user:openxs host:ao756*/;

...

REPLACE INTO `test`.`t`(`id`, `ts`) VALUES ('10', '2016-10-23 19:41:37') /*percona-toolkit src_db:test src_tbl:t src_dsn:D=test,P=22293,h=127.0.0.1,p=...,t=t,u=msandbox dst_db:test dst_tbl:t dst_dsn:D=test,P=22295,h=127.0.0.1,p=...,t=t,u=msandbox lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:31126 user:openxs host:ao756*/;
REPLACE INTO `test`.`t`(`id`, `ts`) VALUES ('100', '2016-03-18 17:47:40') /*percona-toolkit src_db:test src_tbl:t src_dsn:D=test,P=22293,h=127.0.0.1,p=...,t=t,u=msandbox dst_db:test dst_tbl:t dst_dsn:D=test,P=22295,h=127.0.0.1,p=...,t=t,u=msandbox lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:31126 user:openxs host:ao756*/;
DELETE FROM `test`.`t` WHERE `id`='10000000' AND `ts`='2016-10-23 19:40:49' LIMIT 1 /*percona-toolkit src_db:test src_tbl:t src_dsn:D=test,P=22293,h=127.0.0.1,p=...,t=t,u=msandbox dst_db:test dst_tbl:t dst_dsn:D=test,P=22295,h=127.0.0.1,p=...,t=t,u=msandbox lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:31126 user:openxs host:ao756*/;
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node2$


The tool adds a comment with the details on which data sources it is syncing and where the change was initiated from. You can see that it tries to produce statements that are as safe as possible, by providing both the PRIMARY key value and additional conditions. These statements are to be executed on the MASTER, and because of these additional conditions they will not change the data there; once replicated to the out-of-sync slave, however, they remove the extra rows and put back the missing or stale ones.

One more important detail in the command line used:

pt-table-sync --print --sync-to-master u=msandbox,p=msandbox,h=127.0.0.1,P=22295,D=test,t=t

pt-table-sync expects some option defining the action (--print or --execute, which I use later), some option defining what exactly to sync to (in this case I sync the slave to its master), and one or two DSNs with details about the data sources to work with. The tool may also sync just two different tables on two different hosts not related by any replication (this is useful sometimes; see the sketch a couple of paragraphs below). For fixing broken MySQL async replication it is usually used the way I did above, and one has to specify a single DSN for a slave. pt-table-sync does NOT accept the user name, password, hostname or port as separate command line options, so you have to specify all the details on how to access the slave in the DSN. In my case the slave I want to fix runs on port 22295 of the host 127.0.0.1. The same user, msandbox, will be used to run changes on the master, so make sure you specify a user with proper privileges granted and that this user (with the same password) exists on both the slave and the master. Note also that in this DSN I specified that the tool has to sync the test.t table and nothing else. I know it is different on the slave and I do not want the tool to touch anything else. When I was not that specific, by the way, I got the following message about a different table (one that seems to be in sync, judging from the previous outputs):

Can't make changes on the master because no unique index exists at /usr/bin/pt-table-sync line 10697.  while doing test.t2 on 127.0.0.1

I had not followed best practices, obviously. If the test.t2 table ever gets out of sync, I'll be in trouble.
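As an aside, the "two arbitrary tables" mode mentioned above takes two DSNs and no --sync-to-master, roughly like this (the hostnames here are made up; the second table is made to match the first):

pt-table-sync --print h=host1,D=test,t=t h=host2,D=test,t=t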

So, I checked that the planned changes are in line with what I expect (I can do this as I know what I actually changed on slave2 to break replication there). After that I feel safer and can ask the tool to make the changes automatically:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node2$ pt-table-sync --execute --sync-to-master u=msandbox,p=msandbox,h=127.0.0.1,P=22295,D=test,t=t
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node2$



I surely trust the tool that the problem is now fixed (there were no suspicious messages), but it never hurts to double check with pt-table-checksum:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node2$ pt-table-checksum --databases test -umsandbox -pmsandbox --host=127.0.0.1 --port=22293 --recursion-method=dsn=h=127.0.0.1,D=percona,t=dsns,P=22293
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
10-23T20:06:44      0      0   262146       7       0   2.640 test.t
10-23T20:06:46      0      0    65536       1       0   1.207 test.t1
10-23T20:06:47      0      0    32768       1       0   1.115 test.t2

and by checking the outputs of SHOW SLAVE STATUS on all slaves.

To summarize, there are several important details, notions and steps involved in fixing broken replication using Percona Toolkit. I plan to use this blog post as a quick reference for myself and as (yet another) set of instructions for those desperate customers approaching Support in the middle of the night, in deep trouble...