Monday, October 24, 2016

Fixing Broken MySQL Replication: Basic Usage of pt-table-checksum, pt-slave-restart and pt-table-sync

For the second day in a row I have been trying to concentrate on writing the second issue of my new series of blog posts, but I cannot complete it. It seems half of it is going to be devoted to three tools from Percona Toolkit that I had to remember and explain this week: pt-table-checksum, pt-table-sync and pt-slave-restart. So, I decided to write a separate post on this topic.

The context was simple: in the middle of the (weekend) night a customer comes with a broken replication issue, without a fresh backup from the master, or with a few gigabytes or even terabytes of data there. The customer tries to skip the error on the affected slave(s), but keeps getting more errors and the slave does not work. The question is simple: what else can be done besides making a new backup and restoring it on the slaves?

The customer assumes (or wants to believe) that only a few rows in a table or two are missing and asks if there is a way to "resolve" only the conflicting tables and start replication once again. The situation may be complicated by the fact that the master keeps getting changes to the data, and with every minute spent on decision making the slave(s) get further out of sync with the master, while the data on the slave may soon be badly needed and so must be in sync...

It's very tempting to give very basic advice: "you can try to use pt-table-sync", provide a link to the manual for the poor customer to read and apply, and then forget about the issue for a couple of days (as the customer may need about that much time to read, try and then wait for any results).

While this is easy and may even work for many customers, I think it is not fair to stop at this stage. The manual for the tool, while great, is not easy to follow even for somebody like me who has read it and explained it many times, and it is even harder to follow under the stress of a production failure. Unfortunately, even the best of the existing blog posts on this topic (like this one), while adding many real-life use cases and details (like what exact privileges are needed for the MySQL user running the tool, etc.), still miss some steps of the real-life scenario where a serious decision has to be made fast. So, I decided to play a bit with the tools, show the exact outputs one may encounter when things go wrong, and explain how to overcome some of the problems one may hit.

First, let's start with a short list of related best practices (shame on me for writing about them, not my style at all, but they do exist):
  1. If you plan to do anything in case of broken replication besides restoring the slave from a backup or relying on some automated failover solution, please install Percona Toolkit (or at least the pt-table-checksum, pt-table-sync and pt-slave-restart tools) on all hosts in your replication setup. You never know which of them will become a master or remain online, so better be prepared to restore replication from any of them, even if there is no Internet connection at the moment to download extra tools.
  2. Consider running pt-table-checksum on a regular basis even if you do not see any replication errors. This way you'll make sure the tool works, and you'll find any data differences (or problems with running the tool) early, during some well-planned maintenance window. So, you'll have enough time to figure out what's going on before things get really badly broken at some unpredictable moment.
  3. Better read the manual pages for the tools at least once and try them in a testing environment before relying on them in production. You may hit bugs, find tables without proper indexes, waste time looking for proper options and Googling, or break things even more if you try really hard and, say, use pt-table-sync in a wrong way...
  4. Make sure all your tables have PRIMARY or UNIQUE keys, ideally single-column ones. pt-table-sync relies on primary keys heavily, as it needs a way to identify the row to change or delete. A single-column UNIQUE or PRIMARY key will also help both pt-table-sync and pt-table-checksum to split data into chunks properly, without complaining about skipped chunks and other confusing messages (see the query right after this list).
  5. Note that pt-table-checksum and pt-table-sync can NOT help to fix DDL differences, like a table dropped on the slave by mistake or a column having a different data type.
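
A quick way to check item 4 in advance is to ask information_schema which tables have no PRIMARY or UNIQUE key at all. This is just a minimal sketch (adjust the schema filter to your needs):

SELECT t.table_schema, t.table_name
  FROM information_schema.tables t
  LEFT JOIN information_schema.table_constraints tc
    ON tc.table_schema = t.table_schema
   AND tc.table_name = t.table_name
   AND tc.constraint_type IN ('PRIMARY KEY', 'UNIQUE')
 WHERE t.table_type = 'BASE TABLE'
   AND t.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema')
   AND tc.constraint_name IS NULL;
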
I plan to show commands and outputs, and for this I need a replication setup. I am lazy these days and prefer not to copy-paste (even reviewed and edited) real outputs from customers, so I'll just use one of the replication sandboxes I have on my netbook. It will add a few extra complexities, like non-default ports used by the MySQL instances involved and "wrong" values for the report_host option. Non-default (not 3306) ports are used for MySQL servers way more often than one may expect. So, I use MySQL 5.6.28 and my master is the following:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/master$ ./my sql test -e"select @@port"
+--------+
| @@port |
+--------+
|  22293 |
+--------+


First of all, in case of broken replication it would be useful to find out how large the difference is. If every other table has 90% of its rows different on some slave (which may happen if you do not follow the best practices above), it's faster to rebuild it from a backup. For this you should use pt-table-checksum. In the example below I tried to apply it to a single database, test, in my replication sandbox (with two slaves):

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/master$ pt-table-checksum --databases test -umsandbox -pmsandbox --host=127.0.0.1 --port=22293
Cannot connect to P=22294,h=SBslave1,p=...,u=msandbox
Cannot connect to P=22295,h=SBslave2,p=...,u=msandbox
Diffs cannot be detected because no slaves were found.  Please read the --recursion-method documentation for information.
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
10-23T19:30:10      0      0   262146       7       0   2.769 test.t
10-23T19:30:13      0      0    65536       1       0   2.882 test.t1
10-23T19:30:16      0      0    32768       1       0   2.258 test.t2


So, I connect to a specific host (127.0.0.1 in this case) and port (22293 in my case). I provided the username (and password) of a user with proper privileges (now go find them in the manual, you WILL NOT. Then go read this blueprint and Bug #916168, or that Muhammad's post. I literally had to search the source code for all SQL statements executed and then summarize the related minimal privileges once; I wish I had that list at hand when it was needed... Sounds complicated already, doesn't it?)
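
For reference, something like the following usually works as a starting point for the checksum user; treat it as an assumption to verify against the manual and source of your Percona Toolkit version, not as an exhaustive or strictly minimal list (the user name and password below are placeholders):

-- SUPER is needed so the tool can switch its session to statement-based binary logging
GRANT SELECT, PROCESS, SUPER, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'checksum_user'@'%' IDENTIFIED BY 'choose_a_password';
-- the tool needs to create and maintain its tables in the percona database
GRANT ALL ON percona.* TO 'checksum_user'@'%';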

You can also see the problem in the output above - my command was not able to connect to my slaves. In this case it happened because report_host was explicitly set on my slaves to a value that is NOT resolvable from the host I work on:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/master$ ping SBslave1
ping: unknown host SBslave1
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/master$ cd ../node1

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ ./my sql -e"select @@report_host"
+---------------+
| @@report_host |
+---------------+
| SBslave1      |
+---------------+

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ grep report my.sandbox.cnf
report-host=SBslave1
report-port=22294


This is what the master "thinks" about its slaves:

master [localhost] {msandbox} (test) > show slave hosts;
+-----------+----------+-------+-----------+--------------------------------------+
| Server_id | Host     | Port  | Master_id | Slave_UUID                           |
+-----------+----------+-------+-----------+--------------------------------------+
|       101 | SBslave1 | 22294 |         1 | 3c3e7a24-b2c3-11e5-9b34-f4b7e2133dbd |
|       102 | SBslave2 | 22295 |         1 | 3e226dff-b2c3-11e5-9b35-f4b7e2133dbd |
+-----------+----------+-------+-----------+--------------------------------------+
2 rows in set (0,00 sec)


So, when your master is confused and/or pt-table-checksum does not see your slaves (that exist and are working), you can try to resolve the problem by creating a separate table and providing so-called "DSN"s (data source names) for the slaves there:

master [localhost] {msandbox} (test) > show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| percona            |
| performance_schema |
| test               |
+--------------------+
5 rows in set (0,00 sec)

master [localhost] {msandbox} (test) > use percona
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
master [localhost] {msandbox} (percona) > CREATE TABLE `dsns` (
    ->   `id` int(11) NOT NULL AUTO_INCREMENT,
    ->   `parent_id` int(11) DEFAULT NULL,
    ->   `dsn` varchar(255) NOT NULL,
    ->   PRIMARY KEY (`id`)
    -> );

Query OK, 0 rows affected (0,27 sec)

master [localhost] {msandbox} (percona) > insert into dsns(dsn) values('h=127.0.0.1,u=msandbox,p=msandbox,P=22294');
Query OK, 1 row affected (0,06 sec)

master [localhost] {msandbox} (percona) > insert into dsns(dsn) values('h=127.0.0.1,u=msandbox,p=msandbox,P=22295');
Query OK, 1 row affected (0,06 sec)


Note that the percona database is created by default (if it does not exist) the first time you run pt-table-checksum. The table structure to create there and the details of the "DSN" syntax were taken from the manual, which you should use as a reference in this case. Ideally you should make sure that the database and the table are created on the master and all slaves well before you expect your first replication breakage ;) Hence one of the "best practices" listed above.

With the details on how to access the slaves written into the percona.dsns table, you can now tell pt-table-checksum how to find them, using the --recursion-method=dsn=h=127.0.0.1,D=percona,t=dsns,P=22293 option to point the tool to the table that stores those details:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/master$ pt-table-checksum --databases test -umsandbox -pmsandbox --host=127.0.0.1 --port=22293 --recursion-method=dsn=h=127.0.0.1,D=percona,t=dsns,P=22293
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
10-23T19:38:26      0      0   262146       8       0   2.705 test.t
10-23T19:38:28      0      0    65536       1       0   1.309 test.t1
10-23T19:38:28      0      0    32768       1       0   0.657 test.t2


This is what you should have at hand, ideally, well before your replication breaks for the first time: details on how much time it takes to check each table, how many chunks there are, the fact that none of them was skipped, and definite information that at a specific moment in time the data were the same on all slaves involved.

Now, it's time to break replication. I check some data in the test.t table and then deliberately make them different on slave2:

slave2 [localhost] {msandbox} (test) > select * from t limit 10;
+----+---------------------+
| id | ts                  |
+----+---------------------+
|  1 | 2016-03-18 17:47:39 |
|  2 | 2016-03-18 17:47:40 |
|  3 | 2016-03-18 17:47:40 |
|  4 | 2016-03-18 17:47:40 |
|  5 | 2016-03-18 17:47:40 |
|  6 | 2016-03-18 17:47:40 |
|  7 | 2016-03-18 17:47:40 |
|  8 | 2016-03-18 17:47:40 |
|  9 | 2016-03-18 17:47:40 |
| 10 | 2016-03-18 17:47:40 |
+----+---------------------+
10 rows in set (0,00 sec)

slave2 [localhost] {msandbox} (test) > delete from t where id = 9;
Query OK, 1 row affected (0,05 sec)

slave2 [localhost] {msandbox} (test) > delete from t where id = 100;
Query OK, 1 row affected (0,04 sec)

slave2 [localhost] {msandbox} (test) > insert into t values (10000000, now());
Query OK, 1 row affected (0,05 sec)


Now, I continue to make changes on the master:

master [localhost] {msandbox} (test) > update t set ts = now() where id between 1 and 10;
Query OK, 10 rows affected (0,07 sec)
Rows matched: 10  Changed: 10  Warnings: 0


Then, eventually, I find out that replication on slave2 is broken:

slave2 [localhost] {msandbox} (test) > show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 22293
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000004
          Read_Master_Log_Pos: 24026
               Relay_Log_File: mysql_sandbox22295-relay-bin.000009
                Relay_Log_Pos: 23716
        Relay_Master_Log_File: mysql-bin.000004
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1032
                   Last_Error: Could not execute Update_rows event on table test.t; Can't find record in 't', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000004, end_log_pos 23995
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 23553
              Relay_Log_Space: 3711091
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1032
               Last_SQL_Error: Could not execute Update_rows event on table test.t; Can't find record in 't', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000004, end_log_pos 23995
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
                  Master_UUID: 2b1577b7-b2c3-11e5-9b34-f4b7e2133dbd
             Master_Info_File: /home/openxs/sandboxes/rsandbox_mysql-5_6_28/node2/data/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State:
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp: 161023 19:41:37
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
1 row in set (0,00 sec)


You can notice that some slave has a problem when running pt-table-checksum the next time:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node2$ cd ../master/
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/master$ pt-table-checksum --databases test -umsandbox -pmsandbox --host=127.0.0.1 --port=22293 --recursion-method=dsn=h=127.0.0.1,D=percona,t=dsns,P=22293
Replica ao756 is stopped.  Waiting.
Replica ao756 is stopped.  Waiting.

...





We see that pt-table-checksum waits for some slave that it knows and cares about (one that cannot be properly identified by hostname in my case, as all 3 instances run on the same host) to start. This is expected, as the tool works by running statements on the master (relying on statement-based logging for its connection to the master) that are expected to replicate to the slaves and produce different results (chunk checksums) there if the data differ on a specific slave (a simplified illustration of such a statement is shown a bit below). So, if pt-table-checksum produces messages like those above and nothing else for some time, then at least one of the slaves is NOT running and replication is broken. We can stop the tool:

^C# Caught SIGINT.
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
10-23T19:43:43      0      0     1000       1       0  37.123 test.t

But surely the checksums were not calculated completely.
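
As an aside, here is a simplified sketch of the kind of statement pt-table-checksum replicates; the real statements generated by the tool are more elaborate (proper chunk boundaries, a checksum over all columns and so on), so treat this only as an illustration of the idea:

REPLACE INTO percona.checksums (db, tbl, chunk, this_cnt, this_crc)
SELECT 'test', 't', 1, COUNT(*),
       COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', id, ts)) AS UNSIGNED)), 10, 16)), 0)
  FROM test.t WHERE id BETWEEN 1 AND 1000;
-- replicated via statement-based logging, the SELECT above is re-executed on
-- each slave against the slave's copy of test.t, so this_cnt/this_crc end up
-- holding per-server values that can later be compared to the master's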

To proceed further with checking and fixing (if needed), we have to identify the stopped slave and start it again. In my case I already know that the slave is not just stopped by chance, but got an error and is not able to find the row to update. I can skip this error manually (the manual way is illustrated right after the next output) or, if I assume or know there are many errors, I can rely on the pt-slave-restart tool to make sure replication is working again:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/master$ pt-slave-restart -umsandbox -pmsandbox --host=127.0.0.1 --port=22295
2016-10-23T19:45:05 P=22295,h=127.0.0.1,p=...,u=msandbox mysql_sandbox22295-relay-bin.000009       23716 1032
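
For reference, the "manual" way mentioned above boils down to the following on the broken slave (for traditional, non-GTID replication, as in this sandbox):

-- the SQL thread is already stopped by the error, so just skip one event
SET GLOBAL sql_slave_skip_counter = 1;
START SLAVE SQL_THREAD;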


Note that pt-slave-restart by default reports every restart and sleeps for some time between checks. I pointed it at the slave that I knew was stopped, and I see that, after a single restart, the tool prints nothing. I can check if the slave has caught up:

slave2 [localhost] {msandbox} (test) > show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 22293
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000004
          Read_Master_Log_Pos: 26314
               Relay_Log_File: mysql_sandbox22295-relay-bin.000009
                Relay_Log_Pos: 26477
        Relay_Master_Log_File: mysql-bin.000004
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 26314
              Relay_Log_Space: 3713379
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
                  Master_UUID: 2b1577b7-b2c3-11e5-9b34-f4b7e2133dbd
             Master_Info_File: /home/openxs/sandboxes/rsandbox_mysql-5_6_28/node2/data/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
1 row in set (0,00 sec)

From the data highlighted above it is clear that the slave has caught up, replication-wise. I can leave pt-slave-restart running just in case, or I can stop it now with Ctrl+C to get my ssh session back:

...
^CExiting on SIGINT.
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/master$
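
By the way, if you do plan to leave pt-slave-restart running unattended, it is safer to limit which errors it is allowed to skip and to make it log its work. A rough sketch (check the exact option names against the manual of your Percona Toolkit version):

pt-slave-restart -umsandbox -pmsandbox --host=127.0.0.1 --port=22295 --error-numbers=1032,1062 --daemonize --log=/tmp/pt-slave-restart.log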


Now, as long as we restarted the slave after the SQL thread was stopped with an error at least once, we are 100% sure that the data on the slave are NOT in sync with its master. We can still check how different they are with pt-table-checksum:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/master$ pt-table-checksum --databases test -umsandbox -pmsandbox --host=127.0.0.1 --port=22293 --recursion-method=dsn=h=127.0.0.1,D=percona,t=dsns,P=22293
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
10-23T19:49:55      0      3   262146       7       0   2.453 test.t
10-23T19:49:56      0      0    65536       1       0   1.307 test.t1
10-23T19:49:58      0      0    32768       1       0   1.361 test.t2


So, 3 chunks with differences were found. We can try to find out more about the differing chunks of this table using the --replicate-check-only option:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ pt-table-checksum --databases test -umsandbox -pmsandbox --host=127.0.0.1 --port=22293 --replicate-check-only --recursion-method=dsn=h=127.0.0.1,D=percona,t=dsns,P=22293
Differences on ao756
TABLE CHUNK CNT_DIFF CRC_DIFF CHUNK_INDEX LOWER_BOUNDARY UPPER_BOUNDARY
test.t 1 -3 1 PRIMARY 1,1,2016-10-23 19:41:37 1000,1000,2016-03-18 17:47:40
test.t 6 1 0 PRIMARY  1,1,2016-10-23 19:41:37
test.t 7 1 0 PRIMARY 262146,262146,2016-03-18 17:47:40

We see that the difference in row counts (CNT_DIFF) is not high, so we can expect that with just a few statements executed we can get the data back in sync, and proceed with the pt-table-sync tool. We can also try to find out how big the difference is by checking the percona.checksums table on the slave that is out of sync:

slave2 [localhost] {msandbox} (percona) > SELECT db, tbl, SUM(this_cnt) AS total_rows, COUNT(*) AS chunks FROM percona.checksums WHERE (  master_cnt <> this_cnt  OR master_crc <> this_crc  OR ISNULL(master_crc) <> ISNULL(this_crc)) GROUP BY db, tbl;
+------+-----+------------+--------+
| db   | tbl | total_rows | chunks |
+------+-----+------------+--------+
| test | t   |        999 |      3 |
+------+-----+------------+--------+
1 row in set (0,00 sec)


If the difference is in many thousands or millions of rows, at this stage it's not too late to decide on restoring the slave (or just the specific table) from a backup instead of wasting more time on syncing it row by row.

In my case the difference seems to be reasonably small, so I'll use pt-table-sync, point it to the broken slave, ask to --sync-to-master this slave and use the --print option to double check what it is going to change:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node2$ pt-table-sync --print --sync-to-master u=msandbox,p=msandbox,h=127.0.0.1,P=22295,D=test,t=t
DELETE FROM `test`.`t` WHERE `id`='1' AND `ts`='2016-03-18 17:47:39' LIMIT 1 /*percona-toolkit src_db:test src_tbl:t src_dsn:D=test,P=22293,h=127.0.0.1,p=...,t=t,u=msandbox dst_db:test dst_tbl:t dst_dsn:D=test,P=22295,h=127.0.0.1,p=...,t=t,u=msandbox lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:31126 user:openxs host:ao756*/;
DELETE FROM `test`.`t` WHERE `id`='2' AND `ts`='2016-03-18 17:47:40' LIMIT 1 /*percona-toolkit src_db:test src_tbl:t src_dsn:D=test,P=22293,h=127.0.0.1,p=...,t=t,u=msandbox dst_db:test dst_tbl:t dst_dsn:D=test,P=22295,h=127.0.0.1,p=...,t=t,u=msandbox lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:31126 user:openxs host:ao756*/;

...

REPLACE INTO `test`.`t`(`id`, `ts`) VALUES ('10', '2016-10-23 19:41:37') /*percona-toolkit src_db:test src_tbl:t src_dsn:D=test,P=22293,h=127.0.0.1,p=...,t=t,u=msandbox dst_db:test dst_tbl:t dst_dsn:D=test,P=22295,h=127.0.0.1,p=...,t=t,u=msandbox lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:31126 user:openxs host:ao756*/;
REPLACE INTO `test`.`t`(`id`, `ts`) VALUES ('100', '2016-03-18 17:47:40') /*percona-toolkit src_db:test src_tbl:t src_dsn:D=test,P=22293,h=127.0.0.1,p=...,t=t,u=msandbox dst_db:test dst_tbl:t dst_dsn:D=test,P=22295,h=127.0.0.1,p=...,t=t,u=msandbox lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:31126 user:openxs host:ao756*/;
DELETE FROM `test`.`t` WHERE `id`='10000000' AND `ts`='2016-10-23 19:40:49' LIMIT 1 /*percona-toolkit src_db:test src_tbl:t src_dsn:D=test,P=22293,h=127.0.0.1,p=...,t=t,u=msandbox dst_db:test dst_tbl:t dst_dsn:D=test,P=22295,h=127.0.0.1,p=...,t=t,u=msandbox lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:31126 user:openxs host:ao756*/;
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node2$


The tool adds a comment with the details on what data sources it is syncing and where the change was initiated from. You can see that it tries to produce statements that are as safe as possible, by providing both the PRIMARY key value and additional conditions. These statements are to be executed on the MASTER, so that they replicate to the slave and fix the data there; they will not really change anything on the master itself, because the rows targeted by the DELETEs (with those exact old values) do not exist there, and the REPLACE statements write the values the master already has.

One more important detail in the command line used:

pt-table-sync --print --sync-to-master u=msandbox,p=msandbox,h=127.0.0.1,P=22295,D=test,t=t

pt-table-sync expects an option defining the action (--print, or --execute that I use later), an option defining what exactly to sync to (in this case I sync the slave to its master), and one or two DSNs with the details about the data sources to work with. The tool may also sync just two different tables on two different hosts, not related by any replication (this is useful sometimes). For fixing broken MySQL async replication it is usually used the way I did above, and one has to specify a single DSN for a slave. pt-table-sync does NOT accept a user name, password, hostname or port as separate command line options, so you have to specify all the details on how to access the slave in the DSN. In my case the slave I want to fix runs on port 22295 of the host 127.0.0.1. The same user, msandbox, will be used to run the changes on the master, so make sure you specify a user with proper privileges granted and that this user (with the same password) exists on both the slave and the master. Note also that in this DSN I specified that the tool has to sync the test.t table and nothing else. I know it is different on the slave and I do not want the tool to touch anything else. When I was not that specific, by the way, I got the following message about a different table (that seems to be in sync based on the previous outputs):

Can't make changes on the master because no unique index exists at /usr/bin/pt-table-sync line 10697.  while doing test.t2 on 127.0.0.1

I had not followed the best practices, obviously. If the test.t2 table gets out of sync, I'll be in trouble.
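
The fix for that, well before any breakage, is to give such a table some unique row identifier, for example a surrogate auto-increment primary key. The column name below is just an illustration, and it is better to do this while the data are still known to be consistent, as an AUTO_INCREMENT column added through replication may get different values on the master and the slaves if the physical row order differs:

ALTER TABLE test.t2 ADD COLUMN id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;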

So, I checked that the planned changes are in line with what I expect (I can do this as I know what I actually changed on slave2 to break replication there). After that I feel safer and can ask the tool to apply the changes automatically:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node2$ pt-table-sync --execute --sync-to-master u=msandbox,p=msandbox,h=127.0.0.1,P=22295,D=test,t=t
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node2$



I surely trust the tool that the problem is now fixed (there were no suspicious messages), but it never hurts to double check with pt-table-checksum:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node2$ pt-table-checksum --databases test -umsandbox -pmsandbox --host=127.0.0.1 --port=22293 --recursion-method=dsn=h=127.0.0.1,D=percona,t=dsns,P=22293
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
10-23T20:06:44      0      0   262146       7       0   2.640 test.t
10-23T20:06:46      0      0    65536       1       0   1.207 test.t1
10-23T20:06:47      0      0    32768       1       0   1.115 test.t2

and by checking the outputs of SHOW SLAVE STATUS on all slaves.
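
In a sandbox like this one, a quick loop over the nodes does the job. A rough sketch, assuming the standard MySQL Sandbox layout and its per-node use wrapper script:

cd ~/sandboxes/rsandbox_mysql-5_6_28
for node in node1 node2; do
  echo "=== $node ==="
  ./$node/use -e "SHOW SLAVE STATUS\G" | grep -E 'Running:|Last_SQL_Err|Seconds_Behind'
done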

To summarize, there are several important details, notions and steps involved in fixing broken replication using Percona Toolkit. I plan to use this blog post as a quick reference for myself and (yet another) set of instructions for those desperate customers approaching Support in the middle of the night in deep trouble...
