Saturday, October 29, 2016

MySQL Support Engineer's Chronicles, Issue #2

It's time to continue the new series I started 2 weeks ago. I'd like to start with a reminder that it's time to submit your talks for the "MySQL and Friends Devroom" at FOSDEM 2017 - the only MySQL-related event next year that I plan to attend in any case. It seems we have one more week to submit, but I've already filled in all the details for my "main" talk, "Understanding MyRocks locks and deadlocks". I'd like to apply my usual approach of reading the source code and setting gdb breakpoints in case the official documentation still misses important details by the end of January 2017. The official MySQL manual is still in the process of improving its explanation of InnoDB locks, and I am happy to note that this week yet another of my related documentation requests, Bug #71735, "Manual does not explain locks set by SELECT ... FOR UPDATE properly", was closed.

I am really excited by the community and developer activity around MyRocks these days, so it was easy to decide to spend more time on MyRocks while preparing for FOSDEM. Just take a look at recent issues reported by Justin Swanhart, for example Issue #365 and Issue #369. How often do you see such active, almost real-time discussion in bug reports about anything else MySQL-related these days?

I had to spend a lot of time recently trying to understand and prove the root cause of one performance problem that happens once in a while on a node of a Galera cluster, and for this I've used Performance Schema, maybe for the first time (since reporting Bug #68079) on a real-life case. It turned out that properly sizing its tables, so that under high load and concurrency we still get relevant data a few seconds after the problem has happened, is not a trivial task. Moreover, it seems the Galera code is not instrumented in enough detail, so it is hard to measure the impact it may have. While trying to find a proper trade-off between the amount of data collected, memory usage and performance impact, I decided to use profilers, from PMP, to the idea of replacing gdb there with quickstack, to perf (which I've used to make some really good observations). I'd like to summarize some recent (and upcoming) profiling experience in a talk that I am still trying to submit for FOSDEM, titled "Applying profilers to MySQL". In the meantime, check the comments on this Facebook post of mine for many useful details.
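
Most of that sizing is done with startup variables in my.cnf, by the way. Just to give an idea of the kind of settings involved (the variable names below are real, but the values are arbitrary examples, not a recommendation for any specific load):

[mysqld]
performance_schema = ON
# keep a longer history of finished statements and waits, so that the data
# are still there a few seconds after the incident
performance_schema_events_statements_history_long_size = 100000
performance_schema_events_waits_history_long_size = 100000
performance_schema_digests_size = 20000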

I'd also like to submit a talk on how a single bug report may influence MySQL development, and the development of the bug reporter's skills. It's just an idea for now, and I plan to concentrate on a story around Bug #68079, which will be 4 years old by the time of FOSDEM 2017. I have new interesting results to share that I've got while trying to check how the fix helped and how this same use case performs and scales in MyRocks now. I am not sure if this is a good idea and if anything but a blog post may come out of it. Some of my previous attempts to build a talk around this bug were not really well accepted...

This week I had to explain how to get rid of huge ibdata* file(s) while moving each InnoDB table to a separate .ibd file. I've found a great summary of options in this old blog post by Shlomi Noach. It is very useful in case you cannot just dump everything and reload it into a new instance, for whatever reason.
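
The per-table part of the move is simple enough; a minimal sketch for MySQL 5.6 (the database and table names are just examples, and note that rebuilding tables this way does NOT shrink the shared ibdata1 file itself - for that you still need the dump-and-reload or other tricks from the post mentioned above):

SET GLOBAL innodb_file_per_table = 1;
ALTER TABLE mydb.mytable ENGINE=InnoDB;  -- rebuilds the table into its own .ibd file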

Another problem I had to work on was related to a bad query plan on a slave where the table was partitioned. We tried to fight bad cardinality estimates (see Bug #67351 etc), but even after getting them closer to reality the optimizer still uses the wrong index sometimes. My next hope is the engine-independent table statistics of MariaDB.
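
In case you want to try the same, this is roughly what I plan to do on MariaDB 10.x (the table name is an example; existing sessions may need to reconnect to pick up the new global setting):

SET GLOBAL use_stat_tables = 'preferably';
-- collect engine-independent statistics into mysql.table_stats,
-- mysql.column_stats and mysql.index_stats:
ANALYZE TABLE mydb.partitioned_table PERSISTENT FOR ALL;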

I've noticed that customers recently have been actively trying to use PAM and LDAP with MySQL. I consider this blog post by my colleague Geoff Montee very useful for them (and for myself).

Finally, I've found this great blog post by Jervin Real very useful while working with colleagues on continuous InnoDB crashes caused by a corrupted partition .ibd file. The idea was to restore the partition data from a backup on a separate instance, and also to recover the data for all non-corrupted partitions of a huge table there, while the problematic instance was starting up in forced recovery mode to drop the problematic table and check for any further problems.
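
The forced-recovery side of that plan looks roughly like this (a sketch only - the exact innodb_force_recovery level depends on the damage, so read the manual first and keep the value as low as possible; in recent 5.6 versions levels 4 and above make InnoDB read-only, so DROP TABLE needs 3 or less; the table name is an example):

[mysqld]
innodb_force_recovery = 1

-- once the instance manages to start, drop the corrupted table:
DROP TABLE problem_db.huge_partitioned_table;
-- then remove innodb_force_recovery from my.cnf and restart normally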

As I already mentioned several bugs in this post, I'd like to share a link to the list of bugs recently reported by my colleague from the MariaDB Foundation, Sergey Vojtovich. Dear Oracle MySQL engineers, please pay proper attention to them, especially those with "bs" in the "Summary"!

Monday, October 24, 2016

Fixing Broken MySQL Replication: Basic Usage of pt-table-checksum, pt-slave-restart and pt-table-sync

For the second day in a row I have been trying to concentrate on writing the second issue of my new series of blog posts, but I cannot complete it. It seems half of it is going to be devoted to three tools from Percona Toolkit that I had to recall and explain this week: pt-table-checksum, pt-table-sync and pt-slave-restart. So, I decided to write a separate post on this topic.

The context was simple: in the middle of the (weekend) night a customer comes with a broken replication issue, without a fresh backup from the master, or with a few gigabytes or even terabytes of data there. The customer tries to skip the error on some of the affected slave(s), but keeps getting more errors and the slave does not work. The question is simple: what else can be done besides making a new backup and restoring it on the slaves?

The customer assumes (or wants to believe) that only a few rows in a table or two are missing, and asks if there is a way to "resolve" only the conflicting tables and start replication once again. The situation may be complicated by the fact that the master keeps getting changes to the data, and with every minute spent on decision making the slave(s) get more out of sync with the master, while the data on the slave may soon be badly needed, and so they must be in sync...

It's very tempting to provide very basic advice, "you can try to use pt-table-sync", give the poor customer a link to the manual to read and apply, and then forget about the issue for a couple of days (as the customer may need about that much time to read, try, and then finally wait for any results).

While this is easy and may even work for many customers, I think it is not fair to stop at this stage. The manual for the tool, while great, is not easy to follow even for somebody like me who has read it and explained it many times, and it is even harder to follow under the stress of a production failure. Unfortunately, even the best existing blog posts on this topic (like this one), while adding many real-life use cases and details (like what exact privileges are needed for the MySQL user running the tool, etc.), still miss some steps of the real-life use case, when a serious decision has to be made fast. So, I decided to play a bit with the tools, show the exact outputs one may encounter when things go wrong, and explain how to overcome some of the problems one may hit.

First, let's start with a short list of related best practices (shame on me for writing about them, not my style at all, but they do exist):
  1. If you plan to do anything else in case of broken replication besides restoring the slave from a backup or relying on some automated failover solution, please install Percona Toolkit (or at least the pt-table-checksum, pt-table-sync and pt-slave-restart tools) on all hosts in your replication setup. You never know which of them will become a master or remain online, so better be prepared to restore replication from any of them, even if there is no Internet connection at the moment to download extra tools.
  2. Consider running pt-table-checksum on a regular basis even if you do not see any replication errors. This way you'll make sure the tool works, and whatever data differences (or problems running the tool) may appear, you'll find out about them early, during some well-planned maintenance window. So you'll have enough time to figure out what's going on before things get really badly broken at some unpredictable moment.
  3. Better read the manual pages for the tools at least once and try them in a testing environment before relying on them in production. You may hit bugs, find tables without proper indexes, waste time on finding proper options and Googling, or break things even more if you try really hard and, say, use pt-table-sync in a wrong way...
  4. Make sure all your tables have PRIMARY or UNIQUE keys, ideally single-column ones. pt-table-sync relies on primary keys heavily, as it needs a way to identify the row to change or delete if needed. A single-column UNIQUE or PRIMARY key will also help both pt-table-sync and pt-table-checksum to split data into chunks properly, without complaining about skipped chunks and other confusing messages (a query to find the offending tables is sketched right after this list).
  5. Note that pt-table-checksum and pt-table-sync can NOT help to fix DDL differences, like a table dropped on a slave by mistake or a column having a different data type.
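
Here is the kind of query I use to find tables without any PRIMARY or UNIQUE key (a sketch; adjust the list of ignored schemas to your setup):

SELECT t.table_schema, t.table_name
  FROM information_schema.tables t
  LEFT JOIN information_schema.table_constraints tc
    ON tc.table_schema = t.table_schema
   AND tc.table_name = t.table_name
   AND tc.constraint_type IN ('PRIMARY KEY', 'UNIQUE')
 WHERE t.table_type = 'BASE TABLE'
   AND t.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema')
   AND tc.constraint_type IS NULL;
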
I plan to show commands and outputs, and for this I need a replication setup. I am lazy these days and prefer not to copy-paste (even reviewed and edited) real outputs from customers, so I'll just use one of the replication sandboxes I have on my netbook. It will add a few extra complexities, like non-default ports used by the MySQL instances involved and "wrong" values for the report_host options. Non-default (not 3306) ports are used for MySQL servers way more often than one may expect. So, I use MySQL 5.6.28 and my master is the following:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/master$ ./my sql test -e"select @@port"
+--------+
| @@port |
+--------+
|  22293 |
+--------+


First of all, in case of broken replication it would be useful to find out how large the difference is. If every other table has 90% of its rows different on some slave, it's faster to rebuild it from a backup. For this (assuming you have not been running regular checks as suggested in the best practices above) you should use pt-table-checksum. In the example below I tried to apply it to a single database, test, in my replication sandbox (with two slaves):

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/master$ pt-table-checksum --databases test -umsandbox -pmsandbox --host=127.0.0.1 --port=22293
Cannot connect to P=22294,h=SBslave1,p=...,u=msandbox
Cannot connect to P=22295,h=SBslave2,p=...,u=msandbox
Diffs cannot be detected because no slaves were found.  Please read the --recursion-method documentation for information.
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
10-23T19:30:10      0      0   262146       7       0   2.769 test.t
10-23T19:30:13      0      0    65536       1       0   2.882 test.t1
10-23T19:30:16      0      0    32768       1       0   2.258 test.t2


So, I connect to a specific host (127.0.0.1 in this case) and port (22293 in my case). I provided the username (and password) of a user having the proper privileges (now go find them in the manual - you WILL NOT. Then go read this blueprint and Bug #916168, or that post by Muhammad. I literally had to search for all SQL statements executed in the source code and then summarize all related minimal privileges once; I wish I had that list at hand when it was needed... Sounds complicated already, doesn't it?)
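
For the record, this is a rough approximation of the grants I usually end up with (the exact minimal set is what that blueprint and bug are about, so treat this as an example only; the user name, host and password are placeholders):

CREATE USER 'checksum'@'%' IDENTIFIED BY '...';
GRANT SELECT, PROCESS, SUPER, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'checksum'@'%';
GRANT ALL ON percona.* TO 'checksum'@'%';  -- for the checksums and dsns tables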

You can also see the problem in the output above - my command was not able to connect to my slaves. In this case it happened because report_host was explicitly set on my slaves to a value that is NOT resolvable from the host I work on:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/master$ ping SBslave1
ping: unknown host SBslave1
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/master$ cd ../node1

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ ./my sql -e"select @@report_host"
+---------------+
| @@report_host |
+---------------+
| SBslave1      |
+---------------+

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ grep report my.sandbox.cnf
report-host=SBslave1
report-port=22294


This is what the master "thinks" about its slaves:

master [localhost] {msandbox} (test) > show slave hosts;
+-----------+----------+-------+-----------+--------------------------------------+
| Server_id | Host     | Port  | Master_id | Slave_UUID                           |
+-----------+----------+-------+-----------+--------------------------------------+
|       101 | SBslave1 | 22294 |         1 | 3c3e7a24-b2c3-11e5-9b34-f4b7e2133dbd |
|       102 | SBslave2 | 22295 |         1 | 3e226dff-b2c3-11e5-9b35-f4b7e2133dbd |
+-----------+----------+-------+-----------+--------------------------------------+
2 rows in set (0,00 sec)


So, when your master is confused and/or pt-table-checksum does not see your slaves (which exist and are working), you can try to resolve the problem by creating a separate table and providing so-called "DSN"s (data source names) for them there:

master [localhost] {msandbox} (test) > show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| mysql              |
| percona            |
| performance_schema |
| test               |
+--------------------+
5 rows in set (0,00 sec)

master [localhost] {msandbox} (test) > use percona
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
master [localhost] {msandbox} (percona) > CREATE TABLE `dsns` (
    ->   `id` int(11) NOT NULL AUTO_INCREMENT,
    ->   `parent_id` int(11) DEFAULT NULL,
    ->   `dsn` varchar(255) NOT NULL,
    ->   PRIMARY KEY (`id`)
    -> );

Query OK, 0 rows affected (0,27 sec)

master [localhost] {msandbox} (percona) > insert into dsns(dsn) values('h=127.0.0.1,u=msandbox,p=msandbox,P=22294');
Query OK, 1 row affected (0,06 sec)

master [localhost] {msandbox} (percona) > insert into dsns(dsn) values('h=127.0.0.1,u=msandbox,p=msandbox,P=22295');
Query OK, 1 row affected (0,06 sec)


Note that the percona database is created by default if it does not exist the first time you run pt-table-checksum. The table structure to create there and the details of the "DSN" syntax were taken from the manual, which you should use as a reference for this case. Ideally you should make sure that the database and the table are created on the master and all slaves well before you expect your first replication breakage ;) Hence one of the "best practices" listed above.

With the details on how to access the slaves written into the percona.dsns table, you can now tell pt-table-checksum how to find them, using the --recursion-method=dsn=h=127.0.0.1,D=percona,t=dsns,P=22293 option to point it at that table:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/master$ pt-table-checksum --databases test -umsandbox -pmsandbox --host=127.0.0.1 --port=22293 --recursion-method=dsn=h=127.0.0.1,D=percona,t=dsns,P=22293
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
10-23T19:38:26      0      0   262146       8       0   2.705 test.t
10-23T19:38:28      0      0    65536       1       0   1.309 test.t1
10-23T19:38:28      0      0    32768       1       0   0.657 test.t2


This is what you should have at hand, ideally, well before your replication breaks for the first time - details on how much time it takes to check each table, how many chunks there are, the fact that none of them was skipped, and definite information that at a specific moment in time the data were the same on all slaves involved.

Now, it's time to break replication. I check some data in the test.t table and then deliberately make them different on slave2:

slave2 [localhost] {msandbox} (test) > select * from t limit 10;
+----+---------------------+
| id | ts                  |
+----+---------------------+
|  1 | 2016-03-18 17:47:39 |
|  2 | 2016-03-18 17:47:40 |
|  3 | 2016-03-18 17:47:40 |
|  4 | 2016-03-18 17:47:40 |
|  5 | 2016-03-18 17:47:40 |
|  6 | 2016-03-18 17:47:40 |
|  7 | 2016-03-18 17:47:40 |
|  8 | 2016-03-18 17:47:40 |
|  9 | 2016-03-18 17:47:40 |
| 10 | 2016-03-18 17:47:40 |
+----+---------------------+
10 rows in set (0,00 sec)

slave2 [localhost] {msandbox} (test) > delete from t where id = 9;
Query OK, 1 row affected (0,05 sec)

slave2 [localhost] {msandbox} (test) > delete from t where id = 100;
Query OK, 1 row affected (0,04 sec)

slave2 [localhost] {msandbox} (test) > insert into t values (10000000, now());
Query OK, 1 row affected (0,05 sec)


Now, I continue to make changes on the master:

master [localhost] {msandbox} (test) > update t set ts = now() where id between 1 and 10;
Query OK, 10 rows affected (0,07 sec)
Rows matched: 10  Changed: 10  Warnings: 0


Then, eventually, I find out that replication on slave2 is broken:

slave2 [localhost] {msandbox} (test) > show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 22293
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000004
          Read_Master_Log_Pos: 24026
               Relay_Log_File: mysql_sandbox22295-relay-bin.000009
                Relay_Log_Pos: 23716
        Relay_Master_Log_File: mysql-bin.000004
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1032
                   Last_Error: Could not execute Update_rows event on table test.t; Can't find record in 't', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000004, end_log_pos 23995
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 23553
              Relay_Log_Space: 3711091
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1032
               Last_SQL_Error: Could not execute Update_rows event on table test.t; Can't find record in 't', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log mysql-bin.000004, end_log_pos 23995
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
                  Master_UUID: 2b1577b7-b2c3-11e5-9b34-f4b7e2133dbd
             Master_Info_File: /home/openxs/sandboxes/rsandbox_mysql-5_6_28/node2/data/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State:
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp: 161023 19:41:37
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
1 row in set (0,00 sec)


You may note that some slave has a problem while running pt-table-checksum next time:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node2$ cd ../master/
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/master$ pt-table-checksum --databases test -umsandbox -pmsandbox --host=127.0.0.1 --port=22293 --recursion-method=dsn=h=127.0.0.1,D=percona,t=dsns,P=22293
Replica ao756 is stopped.  Waiting.
Replica ao756 is stopped.  Waiting.

...

We see that pt-table-checksum waits for some slave that it knows and cares about (one that cannot be properly identified by hostname in my case, as all 3 instances run on the same host) to start. This is expected, as the tool works by running statements on the master (relying on statement-based logging for its connection to the master) that are replicated to the slaves and produce different results (chunk checksums) there if the data on a specific slave are different. So, if pt-table-checksum produces messages like those above and nothing else for some time, then at least one of the slaves is NOT running and replication is broken. We can stop the tool:

^C# Caught SIGINT.
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
10-23T19:43:43      0      0     1000       1       0  37.123 test.t

But surely the checksums were not calculated completely.
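
To illustrate the mechanism described above, this is roughly the kind of statement the tool runs on the master for each chunk (heavily simplified - the real statement it generates has more columns and careful boundary handling):

REPLACE INTO percona.checksums (db, tbl, chunk, chunk_index, lower_boundary, upper_boundary, this_cnt, this_crc)
SELECT 'test', 't', 1, 'PRIMARY', '1', '1000', COUNT(*),
       COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', `id`, `ts`)) AS UNSIGNED)), 10, 16)), 0)
  FROM test.t FORCE INDEX(`PRIMARY`)
 WHERE id BETWEEN 1 AND 1000;
-- the tool then copies the master's count and crc into master_cnt/master_crc with a
-- literal UPDATE that also replicates, so on each slave the this_* columns can be
-- compared to the master_* ones (that is what the query against percona.checksums
-- shown later in this post does)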

To proceed further with checking and fixing (if needed), we have to identify the stopped slave and start it again. In my case I already know that the slave is not just stopped by chance, but got an error and is not able to find the row to update. I can skip this error manually or, if I assume or know that there are many errors, I can rely on the pt-slave-restart tool to make sure replication is working again:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/master$ pt-slave-restart -umsandbox -pmsandbox --host=127.0.0.1 --port=22295
2016-10-23T19:45:05 P=22295,h=127.0.0.1,p=...,u=msandbox mysql_sandbox22295-relay-bin.000009       23716 1032
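
For reference, what pt-slave-restart roughly does for each such error is what one would otherwise type manually on the broken slave (this works for the file/position based replication used in this sandbox, without GTIDs):

STOP SLAVE;
SET GLOBAL sql_slave_skip_counter = 1;  -- skip exactly one event group
START SLAVE;
SHOW SLAVE STATUS\G                     -- check Last_SQL_Errno again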


Note that pt-slave-restart by default reports every restart and sleeps for some time between checks. I pointed it at the slave that I knew was stopped, and I see that, after a single restart, the tool prints nothing more. I can check if the slave is in sync:

slave2 [localhost] {msandbox} (test) > show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 22293
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000004
          Read_Master_Log_Pos: 26314
               Relay_Log_File: mysql_sandbox22295-relay-bin.000009
                Relay_Log_Pos: 26477
        Relay_Master_Log_File: mysql-bin.000004
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 26314
              Relay_Log_Space: 3713379
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
                  Master_UUID: 2b1577b7-b2c3-11e5-9b34-f4b7e2133dbd
             Master_Info_File: /home/openxs/sandboxes/rsandbox_mysql-5_6_28/node2/data/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
1 row in set (0,00 sec)

From the data highlighted above it is clear that it is in sync. I can leave pt-slave-restart running just in case, or I can stop it now with Ctrl+C to get my ssh session back:

...
^CExiting on SIGINT.
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/master$


Now, as long as we restarted the slave after its SQL thread was stopped with an error at least once, we are 100% sure that the data on the slave are NOT in sync with its master. We can still check how different they are with pt-table-checksum:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/master$ pt-table-checksum --databases test -umsandbox -pmsandbox --host=127.0.0.1 --port=22293 --recursion-method=dsn=h=127.0.0.1,D=percona,t=dsns,P=22293
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
10-23T19:49:55      0      3   262146       7       0   2.453 test.t
10-23T19:49:56      0      0    65536       1       0   1.307 test.t1
10-23T19:49:58      0      0    32768       1       0   1.361 test.t2


So, 3 chunks with differences were found. We can try to find out more about the differences in the chunks of this table using the --replicate-check-only option:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node1$ pt-table-checksum --databases test -umsandbox -pmsandbox --host=127.0.0.1 --port=22293 --replicate-check-only --recursion-method=dsn=h=127.0.0.1,D=percona,t=dsns,P=22293
Differences on ao756
TABLE CHUNK CNT_DIFF CRC_DIFF CHUNK_INDEX LOWER_BOUNDARY UPPER_BOUNDARY
test.t 1 -3 1 PRIMARY 1,1,2016-10-23 19:41:37 1000,1000,2016-03-18 17:47:40
test.t 6 1 0 PRIMARY  1,1,2016-10-23 19:41:37
test.t 7 1 0 PRIMARY 262146,262146,2016-03-18 17:47:40

We see that the difference in row counts (CNT_DIFF) is not high, so we can expect that with just a few statements executed we can get the data back in sync, and proceed with the pt-table-sync tool. We can also try to find out how big the difference is by checking the percona.checksums table on the slave that is out of sync:

slave2 [localhost] {msandbox} (percona) > SELECT db, tbl, SUM(this_cnt) AS total_rows, COUNT(*) AS chunks FROM percona.checksums WHERE (  master_cnt <> this_cnt  OR master_crc <> this_crc  OR ISNULL(master_crc) <> ISNULL(this_crc)) GROUP BY db, tbl;
+------+-----+------------+--------+
| db   | tbl | total_rows | chunks |
+------+-----+------------+--------+
| test | t   |        999 |      3 |
+------+-----+------------+--------+
1 row in set (0,00 sec)


If the difference is in many thousands or millions of rows, at this stage it's not yet too late to decide to restore the slave (or just this specific table) from a backup instead of wasting more time on syncing it row by row.

In my case the difference seems to be reasonably small, so I'll use pt-table-sync, point it at the broken slave, ask it to --sync-to-master this slave and use the --print option to double check what it is going to change:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node2$ pt-table-sync --print --sync-to-master u=msandbox,p=msandbox,h=127.0.0.1,P=22295,D=test,t=t
DELETE FROM `test`.`t` WHERE `id`='1' AND `ts`='2016-03-18 17:47:39' LIMIT 1 /*percona-toolkit src_db:test src_tbl:t src_dsn:D=test,P=22293,h=127.0.0.1,p=...,t=t,u=msandbox dst_db:test dst_tbl:t dst_dsn:D=test,P=22295,h=127.0.0.1,p=...,t=t,u=msandbox lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:31126 user:openxs host:ao756*/;
DELETE FROM `test`.`t` WHERE `id`='2' AND `ts`='2016-03-18 17:47:40' LIMIT 1 /*percona-toolkit src_db:test src_tbl:t src_dsn:D=test,P=22293,h=127.0.0.1,p=...,t=t,u=msandbox dst_db:test dst_tbl:t dst_dsn:D=test,P=22295,h=127.0.0.1,p=...,t=t,u=msandbox lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:31126 user:openxs host:ao756*/;

...

REPLACE INTO `test`.`t`(`id`, `ts`) VALUES ('10', '2016-10-23 19:41:37') /*percona-toolkit src_db:test src_tbl:t src_dsn:D=test,P=22293,h=127.0.0.1,p=...,t=t,u=msandbox dst_db:test dst_tbl:t dst_dsn:D=test,P=22295,h=127.0.0.1,p=...,t=t,u=msandbox lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:31126 user:openxs host:ao756*/;
REPLACE INTO `test`.`t`(`id`, `ts`) VALUES ('100', '2016-03-18 17:47:40') /*percona-toolkit src_db:test src_tbl:t src_dsn:D=test,P=22293,h=127.0.0.1,p=...,t=t,u=msandbox dst_db:test dst_tbl:t dst_dsn:D=test,P=22295,h=127.0.0.1,p=...,t=t,u=msandbox lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:31126 user:openxs host:ao756*/;
DELETE FROM `test`.`t` WHERE `id`='10000000' AND `ts`='2016-10-23 19:40:49' LIMIT 1 /*percona-toolkit src_db:test src_tbl:t src_dsn:D=test,P=22293,h=127.0.0.1,p=...,t=t,u=msandbox dst_db:test dst_tbl:t dst_dsn:D=test,P=22295,h=127.0.0.1,p=...,t=t,u=msandbox lock:1 transaction:1 changing_src:1 replicate:0 bidirectional:0 pid:31126 user:openxs host:ao756*/;
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node2$


The tool adds a comment with the details on what data sources it is syncing and where the change was initiated from. You can see that it tries to produce statements that are as safe as possible, by providing both the PRIMARY key value and additional conditions. These statements are to be executed on the MASTER, and they will not change or delete anything there because of these additional conditions.

One more important detail in the command line used:

pt-table-sync --print --sync-to-master u=msandbox,p=msandbox,h=127.0.0.1,P=22295,D=test,t=t

pt-table-sync expects some option defining the action (--print, or --execute that I use later), some option defining what exactly to sync to (in this case I sync a slave to its master) and one or two DSNs, the details about the data sources to work with. The tool may also sync just two different tables on two different hosts not related by any replication (this is useful sometimes). For fixing broken MySQL asynchronous replication it is usually used the way I did above, and one has to specify a single DSN for a slave. pt-table-sync does NOT accept a user name, password, hostname or port as separate command line options, so you have to specify all the details on how to access the slave in the DSN. In my case the slave I want to fix runs on port 22295 of the host 127.0.0.1. The same user, msandbox, will be used to run the changes on the master, so make sure you specify a user with the proper privileges granted, and that this user (with the same password) exists on both slave and master. Note also that in this DSN I specified that the tool has to sync the test.t table and nothing else. I know it is different on the slave and I do not want the tool to touch anything else. When I was not that specific, by the way, I got the following message about a different table (that seems to be in sync based on previous outputs):

Can't make changes on the master because no unique index exists at /usr/bin/pt-table-sync line 10697.  while doing test.t2 on 127.0.0.1

I had not followed the best practices, obviously. If the test.t2 table gets out of sync, I'll be in trouble.

So, I checked that the planned changes are in line with what I expect (I can do this as I know what I actually changed on slave2 to break replication there). After that I feel safer and can ask the tool to make the changes automatically:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node2$ pt-table-sync --execute --sync-to-master u=msandbox,p=msandbox,h=127.0.0.1,P=22295,D=test,t=t
openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node2$



I surely trust the tool that the problem is now fixed (there were no suspicious messages), but it never hurts to double check with pt-table-checksum:

openxs@ao756:~/sandboxes/rsandbox_mysql-5_6_28/node2$ pt-table-checksum --databases test -umsandbox -pmsandbox --host=127.0.0.1 --port=22293 --recursion-method=dsn=h=127.0.0.1,D=percona,t=dsns,P=22293
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
10-23T20:06:44      0      0   262146       7       0   2.640 test.t
10-23T20:06:46      0      0    65536       1       0   1.207 test.t1
10-23T20:06:47      0      0    32768       1       0   1.115 test.t2

and by checking the outputs of SHOW SLAVE STATUS on all slaves.

To summarize, there are several important details, notions and steps involved in fixing broken replication using Percona Toolkit. I plan to use this blog post as a quick reference for myself and as (yet another) set of instructions for those desperate customers approaching Support in the middle of the night, deep in trouble...

Sunday, October 16, 2016

MySQL Support Engineer's Chronicles, Issue #1

I have a habit of keeping a log of everything I work on (or even just read about while working) every day. It's just a plain text file, with date labels, where most lines are URLs of issues and bugs I contributed to somehow, and my comments on what I did and why. But there are also code fragments, texts, copy-pasted outputs and links to interesting sites, articles and documentation.

Some of the lines from this log I post on Facebook separately as I add them; these are mostly related to MySQL bugs and other things that annoy me. I know that some of my Facebook friends actually do not like this kind of post at all... When I used LiveJournal actively, I posted digests of my log from time to time (you can check what was there if you read Russian). But those digests contained everything, from funny videos of owls to tips for setting up Informix on Mac OS X, with few MySQL-related things in between. For this blog I decided to resurrect those digests, limit them to things related to MySQL and related forks and technologies, and try to make them regular. The idea is to post what I considered interesting and important at the end of every week. So, this is the first issue of what I am going to call "MySQL Support Engineer's Chronicles". Let's see how it works long term...

This week started with more news related to MyRocks. First of all, it was announced that MyRocks will be included in MariaDB 10.2. A new forum was also set up in Google Groups for the related discussions, myrocks-dev. I see only 3 topics there so far, but I still joined the group. I've also spent some time this week testing how well MyRocks scales for SELECTs like those in the famous Bug #68079. The good news is that (while InnoDB after that bug fix still seems to work faster) for that use case MyRocks now runs something like 20% faster than the last time I checked (end of April, 2016). Both InnoDB in 5.7.15 and MyRocks scale to 32+ threads on my old QuadCore box, that is, they run the same number of queries faster with more threads doing the job. I'll share more details on this test case later, when I get a chance to spend more time running tests at higher concurrency and checking for the bottlenecks.

Other good news was about the MySQL & Friends Devroom at FOSDEM 2017. The call for papers is open! FOSDEM is my favorite conference and the only one I am going to participate in by all means every year (as I did in 2015 and 2016). I am going to submit several talks and, besides something with "gdb" in the title, I am thinking about a talk on some special bugs (like Bug #68079, Bug #28404, or Bug #82968) and their influence on MySQL users and further development (maybe even a separate talk per bug), some new MariaDB 10.2 features (like recursive CTEs), or locking in MyRocks.

I've spent some time this week checking different ways to compress data in InnoDB tables (and I expect more work on this next week). Originally the idea was to show that MariaDB's compression really works. I am still thinking about writing a separate blog post showing step by step how it works, and why it does not or why somebody may assume it does not (hints: check the du and ls -ls commands, don't use just ls -l to draw conclusions). For now just check this blog post by my colleague Jan Lindstrom, and do not hesitate to try it on an HDD with, say, the ext4 filesystem. If you care enough to try different compression algorithms (you have to compile MariaDB from source for this), check also this great study by Percona.
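
To show what I mean by that hint (the paths are examples; with page compression the .ibd file is sparse, so the apparent size differs from the blocks actually allocated):

ls -l  /path/to/datadir/test/t1.ibd   # apparent file size only
ls -ls /path/to/datadir/test/t1.ibd   # first column shows blocks actually allocated
du -h  /path/to/datadir/test/t1.ibd   # real space used on disk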

Unfortunately these days I cannot avoid Galera clusters and related issues, so I spent some time reading and trying to find proper documentation on what may happen when a poor user just decides to shut down all nodes in the cluster (for whatever reason) and then start the cluster again. SST may happen, and this is rarely good. I ended up checking this old post by my former colleague Przemysław Malkowski, some hints by Severalnines, and Galera bugs (like Issue #283 and Issue #344). I was reading and thinking about step-by-step checks for some blog post, but my colleague Geoff Montee had already summarized the current state of documentation in this Galera Issue #431, "Document that the GCache is invalidated when a node restarts". Take extra care when you decide to shut down the last remaining Galera cluster node!

While checking for known bugs for one of the problems I discussed, I noted this MySQL bug, Bug #80580, "count(*) much slower on 5.7 than 5.6". Unfortunately there was no public test case, even though the bug was "Verified". I do not like to see bug reports like that, so I've added my own test case as a comment. Enjoy, and keep thinking that MySQL 5.7 is the best of them all (besides the already available 8.0, which added indexes to the tables in Performance Schema). It just has some performance problems with queries that do SELECT count(*) on InnoDB tables, also in the case of fulltext indexes, see this recent Bug #83398 by Daniël van Eeden. But nobody cares much about a few minor bugs for corner cases (or about those who process these bugs, but that's a different topic)...

By the way, Justin Swanhart has already added support for MySQL 8 to his ps_history schema/tool. Read his blog post for more details. Probably this is one of the first open source third-party tools to support MySQL 8 explicitly. Correct me if I am wrong, but the only other one I know about is MySQL Sandbox by Giuseppe Maxia.

During some relatively non-busy time on Friday I decided to upgrade one of my Ubuntu VMs (the one running 15.04) to 16.04, and did that successfully (first to 15.10, then to 16.04). It's nice to know that Ubuntu has allowed seamless in-place upgrades since 12.04. I had Percona Server installed there from their .deb packages, historically, so I made sure it was also upgraded to the latest version for 16.04. In the process I hit a small problem with their repository that Percona tried to warn us all about in this blog post. Unfortunately, as you can see from the comments (mine and another user's), the suggested way to avoid the warning does not work as expected. I've got no reply from Percona about this.

Also on Friday Alibaba finally released their AliSQL as open source on GitHub. I cloned and built it immediately on my Ubuntu 14.04 netbook, and it works! It identifies itself as just MySQL 5.6.32, but read here about some of the great features now available. No bugs found so far, but I had no time to try hard.
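
Nothing special was needed to build it, by the way; roughly the usual routine for any MySQL 5.6 based fork (the repository URL and cmake options below are from memory and just an example - adjust paths and options to your environment):

git clone https://github.com/alibaba/AliSQL.git
cd AliSQL
cmake . -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_INSTALL_PREFIX=$HOME/dbs/alisql -DWITH_SSL=system
make -j 4
make install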

So, these were some of the things I read about, was excited about, complained about and worked on during this week as a support engineer at MariaDB. Do you consider this kind of digest useful?