This is not only because I am going to talk about applying profilers to all kinds and forks of MySQL at the FOSDEM 2017 MySQL & Friends Devroom this week. It seems profilers are the only way to study or troubleshoot MyRocks performance problems (if any) at the moment (when the provided status variables, SHOW ENGINE ROCKSDB STATUS output etc. are not enough), as Facebook's MySQL does NOT have Performance Schema compiled in by default:
mysql> show engines;
+------------+---------+----------------------------------------------------------------+--------------+------+------------+
| Engine | Support | Comment | Transactions | XA | Savepoints |
+------------+---------+----------------------------------------------------------------+--------------+------+------------+
| ROCKSDB | DEFAULT | RocksDB storage engine | YES | YES | YES |
| MRG_MYISAM | YES | Collection of identical MyISAM tables | NO | NO | NO |
| MyISAM | YES | MyISAM storage engine | NO | NO | NO |
| BLACKHOLE | YES | /dev/null storage engine (anything you write to it disappears) | NO | NO | NO |
| CSV | YES | CSV storage engine | NO | NO | NO |
| MEMORY | YES | Hash based, stored in memory, useful for temporary tables | NO | NO | NO |
| ARCHIVE | YES | Archive storage engine | NO | NO | NO |
| FEDERATED | NO | Federated MySQL storage engine | NULL | NULL | NULL |
| InnoDB | NO | Supports transactions, row-level locking, and foreign keys | NULL | NULL | NULL |
+------------+---------+----------------------------------------------------------------+--------------+------+------------+
9 rows in set (0.00 sec)
This may change when MyRocks builds from Percona or MariaDB appear ready for common use, but for now perf is the way to go, and I like it!
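By the way, to double-check that Performance Schema is really not available in such a build, one can simply look for its system variables. With it compiled out I'd expect something like the following (a hedged sketch; the exact output may differ in your build):
mysql> show variables like 'performance_schema';
Empty set (0.00 sec)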
I do not remember how exactly I came up with the idea to compare MyRocks scalability to InnoDB scalability for the famous use case presented in Bug #68079. Scalability problems identified there caused a lot of work by Oracle engineers and a lot of changes in MySQL 5.6.x and 5.7.x (so notable that MariaDB decided to use only those from 5.7 in the upcoming MariaDB 10.2, see MDEV-10476). I've already mentioned this bug in probably every public talk about MySQL I've given since 2013, so one day I decided to check how the fix helped in recent InnoDB, and then compare to MyRocks performance with all default settings on the same hardware for different numbers of concurrent threads. The tests were run on my old QuadXeon box (now running Fedora 25):
[openxs@fc23 mysql-server]$ pt-summary
# Percona Toolkit System Summary Report ######################
Date | 2017-01-30 11:32:53 UTC (local TZ: EET +0200)
Hostname | fc23
Uptime | 4:02, 3 users, load average: 3.47, 2.12, 1.27
Platform | Linux
Release | Fedora release 25 (Twenty Five)
Kernel | 4.9.5-200.fc25.x86_64
Architecture | CPU = 64-bit, OS = 64-bit
Threading | NPTL 2.24
SELinux | Enforcing
Virtualized | No virtualization detected
# Processor ##################################################
Processors | physical = 1, cores = 4, virtual = 4, hyperthreading = no
Speeds | 4x2499.000
Models | 4xIntel(R) Core(TM)2 Quad CPU Q8300 @ 2.50GHz
Caches | 4x2048 KB
# Memory #####################################################
Total | 7.8G
Free | 2.9G
Used | physical = 1.0G, swap allocated = 0.0, swap used = 0.0, virtual = 1.0G
Shared | 176.0M
Buffers | 3.9G
Caches | 6.2G
Dirty | 208 kB
UsedRSS | 2.5G
Swappiness | 60
DirtyPolicy | 20, 10
DirtyStatus | 0, 0
...
So, one day in April 2016 I took a dump of tables involved from the bug, replaced InnoDB with ROCKSDB there, added explicit CHARSET=utf8 COLLATE=utf8_bin clauses, and loaded the dump into Facebook's MySQL 5.6 built from source. I've ended up with the following:
mysql> show create table task\G
*************************** 1. row ***************************
Table: task
Create Table: CREATE TABLE `task` (
`sys_id` char(32) COLLATE utf8_bin NOT NULL DEFAULT '',
`u_root_cause` char(32) COLLATE utf8_bin DEFAULT NULL,
`u_business_impact_description` mediumtext COLLATE utf8_bin,
`u_business_impact_category` mediumtext COLLATE utf8_bin,
PRIMARY KEY (`sys_id`)
) ENGINE=ROCKSDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin
1 row in set (0.00 sec)
mysql> show create table incident\G
*************************** 1. row ***************************
Table: incident
Create Table: CREATE TABLE `incident` (
`sys_id` char(32) COLLATE utf8_bin NOT NULL DEFAULT '',
`category` varchar(40) COLLATE utf8_bin DEFAULT NULL,
PRIMARY KEY (`sys_id`),
KEY `incident_category` (`category`)
) ENGINE=ROCKSDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin
1 row in set (0.00 sec)
mysql> show table status like 'task'\G
*************************** 1. row ***************************
Name: task
Engine: ROCKSDB
Version: 10
Row_format: Dynamic
Rows: 8292
Avg_row_length: 69
Data_length: 573991
...
1 row in set (0.00 sec)
mysql> show table status like 'incident'\G
*************************** 1. row ***************************
Name: incident
Engine: ROCKSDB
Version: 10
Row_format: Dynamic
Rows: 8192
Avg_row_length: 87
Data_length: 719091
Max_data_length: 0
Index_length: 956624
...
1 row in set (0.00 sec)
mysql> explain select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category;
+----+-------------+----------+--------+---------------------------+-------------------+---------+----------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+--------+---------------------------+-------------------+---------+----------------------+------+-------------+
| 1 | SIMPLE | incident | index | PRIMARY,incident_category | incident_category | 123 | NULL | 8192 | Using index |
| 1 | SIMPLE | task | eq_ref | PRIMARY | PRIMARY | 96 | test.incident.sys_id | 1 | Using index |
+----+-------------+----------+--------+---------------------------+-------------------+---------+----------------------+------+-------------+
2 rows in set (0.00 sec)
The tables are simple and small. Then I tried to study how well this query scales as the number of threads running it increases. For this I've used mysqlslap and commands like the following:
[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=1 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category'
with different --concurrency values: 1, 2, 4, 8, 10, 16, 32 (on a box with 4 cores). Before the fix InnoDB had scalability problems even with concurrency N, where N is the number of cores. After the fix it scaled way better in this test.
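If you'd like to repeat this kind of run, a trivial shell loop along the following lines would collect the average execution times for all concurrency levels in one go. This is just a sketch under my assumptions (path to the mysqlslap binary, -uroot access and the exact list of concurrency values), adjust as needed:
# run the same test at each concurrency level and keep only the average time line
for c in 1 2 4 8 10 16 32; do
  echo "concurrency=$c"
  bin/mysqlslap -uroot --iterations=10 --concurrency=$c --create-schema=test \
    --no-drop --number-of-queries=1000 \
    --query='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category' \
    | grep 'Average number of seconds'
done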
You can find the summary of results in this Google spreadsheet, but here is a chart from it:
The X axis has log scale and shows number of concurrent threads running. The Y axis is the average number of problematic queries executed per second. "inner" and "straight" refer to INNER JOIN vs STRAIGHT_JOIN queries presented above.
"MyRocks" on this chart means ROCKSDB in MySQL from Facebook built from source corresponding to the following commit:
[openxs@fc23 mysql-5.6]$ git log -1while "InnoDB 5.7" means InnoDB from Oracle MySQL 5.7.17, the following commit specifically:
commit 6b4ba98698182868800f31e0d303f1e16026efdf...
Date: Sat Jan 28 13:51:47 2017 -0800
[openxs@fc23 mysql-server]$ git log -1For this post I wonder why the difference in favor of "straight" query is so big starting from 4 concurrent threads in case of MyRocks? I've profiled both queries running via mysqlslap as follows:
commit 23032807537d8dd8ee4ec1c4d40f0633cd4e12f9
...
Date: Mon Nov 28 16:48:20 2016 +0530
[openxs@fc23 ~]$ perf --versionProfiler was started, then mysqlslap command executed and Ctrl-C was pressed when it produced the results.
perf version 4.9.6.200.fc25.x86_64.g51a0
[openxs@fc23 ~]$ sudo perf record -a
[sudo] password for openxs:
^C[ perf record: Woken up 335 times to write data ]
[ perf record: Captured and wrote 84.913 MB perf.data (1825548 samples) ]
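Note that I profiled the whole system with -a above. If the box runs more than just MySQL, one could limit the profile to the server process instead, something along these lines (just a sketch, assuming a single mysqld instance is running; adjust the duration to cover the mysqlslap run):
[openxs@fc23 ~]$ sudo perf record -p `pidof mysqld` -- sleep 60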
The result for STRAIGHT_JOIN queries was the following (up to 1% of "overhead"):
[openxs@fc23 ~]$ sudo perf report --stdio | more
...
# Total Lost Samples: 0
#
# Samples: 1M of event 'cycles:p'
# Event count (approx.): 62748127726390
#
# Overhead Command Shared Object Symbol...
5.32% my-oneconnectio libc-2.24.so [.] __memcmp_sse4_1
4.68% my-oneconnectio libc-2.24.so [.] __memcpy_ssse3
4.04% my-oneconnectio mysqld [.] rocksdb::BlockIter::Seek
3.33% my-oneconnectio mysqld [.] my_strnxfrm_unicode
3.11% my-oneconnectio mysqld [.] rocksdb::BlockBasedTable::Get
2.96% my-oneconnectio mysqld [.] myrocks::Rdb_pk_comparator::Compare
2.54% my-oneconnectio mysqld [.] rocksdb::BlockIter::BinarySeek
2.14% my-oneconnectio mysqld [.] rocksdb::InternalKeyComparator::Compare
2.00% my-oneconnectio mysqld [.] rocksdb::StatisticsImpl::recordTick
1.99% my-oneconnectio mysqld [.] rocksdb::MergingIterator::Next
1.95% my-oneconnectio mysqld [.] rocksdb::HistogramStat::Add
1.51% my-oneconnectio mysqld [.] join_read_key
1.51% my-oneconnectio mysqld [.] myrocks::rdb_unpack_utf8_str
1.41% my-oneconnectio mysqld [.] myrocks::Rdb_key_def::unpack_record
1.36% my-oneconnectio mysqld [.] sub_select
1.32% my-oneconnectio mysqld [.] my_uni_utf8
1.29% my-oneconnectio mysqld [.] rocksdb::Version::Get
1.22% my-oneconnectio libc-2.24.so [.] _int_malloc
1.22% my-oneconnectio mysqld [.] rocksdb::TableCache::Get
1.22% my-oneconnectio mysqld [.] rocksdb::BlockIter::Next
1.21% my-oneconnectio mysqld [.] rocksdb::ThreadLocalPtr::Get
1.18% my-oneconnectio mysqld [.] rocksdb::DBIter::FindNextUserEntryInternal
1.18% my-oneconnectio mysqld [.] rocksdb::(anonymous namespace)::FilePicker::GetNextFile
1.12% my-oneconnectio libc-2.24.so [.] malloc
1.07% my-oneconnectio mysqld [.] evaluate_join_record
1.07% my-oneconnectio mysqld [.] myrocks::Rdb_key_def::pack_index_tuple
1.01% my-oneconnectio mysqld [.] key_restore
The result for INNER JOIN queries was the following (up to 1% of "overhead"):
[openxs@fc23 ~]$ sudo perf report --stdio | more
...
# Total Lost Samples: 0
#
# Samples: 1M of event 'cycles:p'
# Event count (approx.): 162704428520300
#
# Overhead Command Shared Object Symbol...
#
5.90% my-oneconnectio libc-2.24.so [.] __memcpy_ssse3
4.38% my-oneconnectio libpthread-2.24.so [.] pthread_mutex_lock
3.69% my-oneconnectio libc-2.24.so [.] __memcmp_sse4_1
3.58% my-oneconnectio mysqld [.] rocksdb::BlockIter::Seek
2.53% my-oneconnectio libpthread-2.24.so [.] pthread_mutex_unlock
2.47% my-oneconnectio mysqld [.] rocksdb::StatisticsImpl::recordTick
2.35% my-oneconnectio mysqld [.] rocksdb::BlockBasedTable::Get
2.28% my-oneconnectio mysqld [.] my_strnxfrm_unicode
2.08% my-oneconnectio mysqld [.] rocksdb::BlockIter::BinarySeek
1.84% my-oneconnectio mysqld [.] rocksdb::HistogramStat::Add
1.84% my-oneconnectio mysqld [.] rocksdb::Version::Get
1.71% my-oneconnectio mysqld [.] myrocks::rdb_unpack_binary_or_utf8_varchar_space_pad
1.69% my-oneconnectio mysqld [.] rocksdb::LRUCacheShard::Lookup
1.67% my-oneconnectio mysqld [.] my_uni_utf8
1.61% my-oneconnectio mysqld [.] rocksdb::InternalKeyComparator::Compare
1.52% my-oneconnectio mysqld [.] myrocks::Rdb_pk_comparator::Compare
1.47% my-oneconnectio mysqld [.] rocksdb::(anonymous namespace)::FilePicker::GetNextFile
1.23% my-oneconnectio mysqld [.] rocksdb::BlockIter::Next
1.15% my-oneconnectio mysqld [.] rocksdb::MergingIterator::Next
1.10% my-oneconnectio mysqld [.] join_read_key
1.09% my-oneconnectio mysqld [.] rocksdb::Block::NewIterator
1.04% my-oneconnectio mysqld [.] rocksdb::TableCache::Get
I am not an expert in MyRocks at all, so I will not even try to interpret these results (maybe in some later post, one day). I've just highlighted what I consider a real difference in the second (INNER JOIN) case. Note the pthread_mutex_lock/pthread_mutex_unlock calls, for example, that are NOT present in the "more than 1%" output for the STRAIGHT_JOIN case.
I'd be happy to get any comments on the above. I also plan to share more/raw data from perf some day later, when I decide where to put them for public access (we are speaking about text outputs megabytes in size).
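By the way, to quickly check how much mutex-related overhead shows up in a profile without scrolling through the entire report, simple filtering works, for example:
[openxs@fc23 ~]$ sudo perf report --stdio | grep -i pthread_mutex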
I've captured profiles with callgraphs as well:
[openxs@fc23 ~]$ sudo perf record -ag
^C[ perf record: Woken up 565 times to write data ]
[ perf record: Captured and wrote 142.410 MB perf.data (644176 samples) ]
Again, full trees are huge and require a real expert to analyze them properly. The new perf report option (-g flat, available since kernel 4.4) gave nice backtraces like the one below, which shows how data are read in MyRocks. You can get similar backtraces with pt-pmp as well, but you'll see only how many times a specific backtrace was noted, not the related overhead. Also, the performance impact from pt-pmp would be very notable in this case:
40.20%
handle_one_connection
do_handle_one_connection
dispatch_command
mysql_parse
mysql_execute_command
execute_sqlcom_select
handle_select
mysql_select
JOIN::exec
sub_select
evaluate_join_record
sub_select
join_read_key
myrocks::ha_rocksdb::index_read_map_impl
myrocks::ha_rocksdb::get_row_by_rowid
myrocks::Rdb_transaction_impl::get
rocksdb::TransactionBaseImpl::Get
rocksdb::WriteBatchWithIndex::GetFromBatchAndDB
rocksdb::DBImpl::Get
rocksdb::DBImpl::GetImpl
rocksdb::Version::Get
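For reference, backtraces like the one above come from a report invocation along these lines (just a sketch; the output is huge, so pipe it to more or redirect it to a file):
[openxs@fc23 ~]$ sudo perf report --stdio -g flat | more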
It seems it's time to stop; the post is already long. One day I'll continue, as there are many phenomena highlighted by this very simple, but famous, use case. Stay tuned!
Can you share both query plans? With a MyRocks build from November 21, 2016, I get the same query plan for inner join and straight_join.
# bin/mysql test -e 'explain select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category\G'
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: task
type: index
possible_keys: PRIMARY
key: PRIMARY
key_len: 96
ref: NULL
rows: 8092
Extra: Using index; Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: incident
type: eq_ref
possible_keys: PRIMARY,incident_category
key: PRIMARY
key_len: 96
ref: test.task.sys_id
rows: 1
Extra: NULL
# bin/mysql test -e 'explain select count(*), category from task straight_join incident on task.sys_id=incident.sys_id group by incident.category\G'
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: task
type: index
possible_keys: PRIMARY
key: PRIMARY
key_len: 96
ref: NULL
rows: 8092
Extra: Using index; Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: incident
type: eq_ref
possible_keys: PRIMARY,incident_category
key: PRIMARY
key_len: 96
ref: test.task.sys_id
rows: 1
Extra: NULL
And...
# bin/mysql test -e 'show table status\G'
*************************** 1. row ***************************
Name: incident
Engine: ROCKSDB
Version: 10
Row_format: Dynamic
Rows: 12710
Avg_row_length: 100
Data_length: 1271085
Max_data_length: 0
Index_length: 0
Data_free: 0
Auto_increment: NULL
Create_time: NULL
Update_time: NULL
Check_time: NULL
Collation: utf8_bin
Checksum: NULL
Create_options:
Comment:
*************************** 2. row ***************************
Name: task
Engine: ROCKSDB
Version: 10
Row_format: Dynamic
Rows: 8092
Avg_row_length: 100
Data_length: 809212
Max_data_length: 0
Index_length: 0
Data_free: 0
Auto_increment: NULL
Create_time: NULL
Update_time: NULL
Check_time: NULL
Collation: utf8_bin
Checksum: NULL
Create_options:
Comment:
For inner join I used MyRocks from the November 21 build, InnoDB from MySQL 5.6.26 and InnoDB from MySQL 5.7.10, with the database cache large enough to cache the data. I used read-committed for MyRocks and repeatable-read for InnoDB, and the tuned my.cnf that I use for all of my MyRocks vs InnoDB tests.
MyRocks did much better than InnoDB. Numbers below are max time, grep "Maximum number" from the output.
myrocks inno-56 inno-57
1.568 1.797 1.932
1.573 2.620 2.322
1.623 3.429 2.723
1.646 5.631 4.068
1.648 8.216 6.318
1.715 11.009 8.501
1.937 13.854 10.818
1.912 16.422 12.988
2.261 17.872 14.502
2.408 19.734 16.172
2.686 21.004 17.563
2.864 22.422 18.856
2.972 23.595 19.935
3.456 25.397 21.410
Why do you report max time and not average time or QPS?
I used average time. Given this output:
[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=4 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task straight_join incident on task.sys_id=incident.sys_id group by incident.category'
Benchmark
Average number of seconds to run all queries: 18.684 seconds
Minimum number of seconds to run all queries: 17.953 seconds
Maximum number of seconds to run all queries: 19.222 seconds
Number of clients running queries: 4
Average number of queries per client: 250
I took time to run 1000 queries from this line:
Average number of seconds to run all queries: 18.684 seconds
and then, to produce charts, I divided 1000 by this value to get QPS.
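For example, for the 4-thread STRAIGHT_JOIN run quoted above that gives 1000 / 18.684, or about 53.5 queries per second.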
I report max time because I want to spot outliers and not hide the slowdown that happens from too much mutex contention or running with more concurrency than CPU cores.
At max concurrency I get the best QPS and the worst response time. Would be nice to make both clear.
When the tests finish I will try with a "modern" build of MyRocks. I used the Nov21 build for long-running perf tests.
Switched the join order via "incident straight_join task" and the results are the same as above.
But you are running many more concurrent queries than CPU cores and the workload is CPU bound. For my tests above, the number of concurrent queries was <= the virtual core count. Let me try with concurrency > cores.
In my case the plans were the following:
[openxs@fc23 fb56]$ bin/mysql -uroot test -e 'explain select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category'
+----+-------------+----------+--------+---------------------------+-------------------+---------+----------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+--------+---------------------------+-------------------+---------+----------------------+------+-------------+
| 1 | SIMPLE | incident | index | PRIMARY,incident_category | incident_category | 123 | NULL | 8192 | Using index |
| 1 | SIMPLE | task | eq_ref | PRIMARY | PRIMARY | 96 | test.incident.sys_id | 1 | Using index |
+----+-------------+----------+--------+---------------------------+-------------------+---------+----------------------+------+-------------+
Now a different plan:
[openxs@fc23 fb56]$ bin/mysql -uroot test -e 'explain select count(*), category from task straight_join incident on task.sys_id=incident.sys_id group by incident.category'
+----+-------------+----------+--------+---------------------------+---------+---------+------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+--------+---------------------------+---------+---------+------------------+------+----------------------------------------------+
| 1 | SIMPLE | task | index | PRIMARY | PRIMARY | 96 | NULL | 8292 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | incident | eq_ref | PRIMARY,incident_category | PRIMARY | 96 | test.task.sys_id | 1 | NULL |
+----+-------------+----------+--------+---------------------------+---------+---------+------------------+------+----------------------------------------------+
I don't remember why I decided to force a different join order and compare, but back in April 2016 I was already surprised that the plan with filesort etc. for the STRAIGHT_JOIN case leads to better results with concurrency equal to or larger than the number of cores.
Note that for InnoDB and a single thread, in my case the INNER JOIN leads to slightly faster execution.
I deliberately tuned nothing at all for the InnoDB case (the 128M buffer pool it has by default should be just fine, and this is a read-only test). I wish I knew what to tune for MyRocks, so I've added just the settings needed to make it work. By no means are these results intended to say anything about InnoDB vs MyRocks performance, although in all tests so far MyRocks was behind. It has notably improved since April 2016, but it still needs more time to run the test than InnoDB from MySQL 5.7.x.
I tested with up to 64 threads on 4 real cores, but even with 4 threads the difference was already noticeable.
For "task inner join incident" your plan uses the secondary index from the incident table. Mine did not. Let me try to get your plan.
Neglected to list concurrency levels in results above. They were: 1 2 4 8 12 16 20 24 28 32 36 40 44 48.
Went from 1 to 192 clients on a server with 48 cores with HT enabled. Used "inner join" and MyRocks does much better than InnoDB.
ReplyDeleteData is max time for each of the engines. Smaller is better.
Columns are:
#clients MyRocks InnoDB-56 InnoDB-57
1 1.568 1.797 1.932
2 1.573 2.620 2.322
4 1.623 3.429 2.723
8 1.646 5.631 4.068
12 1.648 8.216 6.318
16 1.715 11.009 8.501
20 1.937 13.854 10.818
24 1.912 16.422 12.988
28 2.261 17.872 14.502
32 2.408 19.734 16.172
36 2.686 21.004 17.563
40 2.864 22.422 18.856
44 2.972 23.595 19.935
48 3.436 25.458 21.419
64 4.635 33.736 28.486
80 5.665 41.474 34.968
96 6.557 49.743 42.035
112 7.633 57.949 48.860
128 8.649 66.100 55.909
144 9.700 74.247 62.810
160 10.628 82.216 69.476
176 11.823 91.016 76.655
192 12.785 99.076 83.587
I'd be happy to see nice graphs showing that with proper tuning and on better hardware MyRocks really rocks, one way or the other.
More data with better formatting. MyRocks is still much faster than InnoDB. I didn't expect this and wonder what I am doing wrong.
ReplyDeletehttps://gist.github.com/mdcallag/2d981f60af765ffff289867e0da929c5
https://gist.github.com/mdcallag/18c24c9e43684f253a4df6c174148721
And notes from my performance debugging...
https://gist.github.com/mdcallag/9fee03b1952946a8fb6617740ab26de2