
Monday, January 10, 2022

Differential Flame Graphs to Highlight Performance Schema Waits Impact

Yet another type of flame graph that I have not discussed yet is the differential flame graph (again invented by Brendan Gregg). It shows the difference between two flame graphs in a clear way (assuming they are comparable - it's on you to make sure the comparison makes sense and to interpret the output properly). The flame graph is drawn using the "after" profile (so that the frame widths show values for the second graph) and then colorized by the delta to show how we got there. If the metric for a specific frame in the same stack is larger, it is shown in red; if it's smaller, the frame is blue (hence the name red/blue differential flame graphs). The saturation is relative to the delta, so dark red marks a frame with a much bigger value in the second graph, while white and very light frames can be ignored.
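The red/blue logic can be illustrated with a toy sketch (hypothetical stacks and counts; the real merging is done by Brendan Gregg's difffolded.pl): join two "folded" files by stack and print the before and after counts side by side. A frame whose second count is bigger would be colored red, a smaller one blue:

```shell
# Two tiny "folded stacks" profiles (stack count), before and after a change
cat > /tmp/before.folded <<'EOF'
main;io_write 100
main;log_flush 50
EOF
cat > /tmp/after.folded <<'EOF'
main;io_write 100
main;log_flush 90
EOF

# First pass (NR==FNR) remembers the "before" count per stack,
# second pass prints: stack before_count after_count
awk 'NR==FNR { before[$1] = $2; next }
     { printf "%s %d %d\n", $1, before[$1], $2 }' /tmp/before.folded /tmp/after.folded
# main;io_write 100 100
# main;log_flush 50 90
```

Here main;log_flush grew from 50 to 90, so it would be drawn in dark red, while main;io_write did not change and would stay white.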

Let me apply this approach to the flame graphs showing waits reported by MySQL's performance schema (built as described in this blog post). As a proof of concept I'll use an easy-to-interpret case where the same oltp_write_only.lua sysbench test is run with different values of innodb_flush_log_at_trx_commit and otherwise the same settings, like 16 concurrent threads, on my old 2-core Ubuntu 20.04 "home server" with a slow HDD.

So, here is the first run:

sysbench --table-size=1000000 --threads=1 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_write_only.lua cleanup
sysbench --table-size=1000000 --threads=1 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_write_only.lua prepare
bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'truncate table events_waits_history_long'
bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'set global innodb_flush_log_at_trx_commit=0'
sysbench --table-size=1000000 --threads=16 --time=120 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_write_only.lua run
bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'select event_name, timer_wait from events_waits_history_long' >/tmp/waits64_0.txt
cat /tmp/waits64_0.txt | awk '{ printf("%s %d\n", $1, $2); }' | sed 's/\//;/g' > /tmp/w64_0.out
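The last awk+sed pipeline just turns each (event_name, timer_wait) row into the "folded stack" format flamegraph.pl expects: the hierarchical event name with '/' replaced by ';' as the frame separator, followed by the wait value. For example (a made-up timer value):

```shell
# '/' separators in the performance_schema event name become ';' frame
# separators, so each wait event looks like a collapsed stack trace
echo 'wait/io/file/innodb/innodb_log_file 269830920' \
  | awk '{ printf("%s %d\n", $1, $2); }' \
  | sed 's/\//;/g'
# wait;io;file;innodb;innodb_log_file 269830920
```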

That run ended up with these sysbench statistics:

SQL statistics:
    queries performed:
        read:                            0
        write:                           26252
        other:                           13126
        total:                           39378
    transactions:                        6563   (54.60 per sec.)
    queries:                             39378  (327.59 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          120.2017s
    total number of events:              6563

Latency (ms):
         min:                                   66.37
         avg:                                  292.94
         max:                                 1222.92
         95th percentile:                      530.08
         sum:                              1922587.72

Threads fairness:
    events (avg/stddev):           410.1875/4.69
    execution time (avg/stddev):   120.1617/0.05

and then the second:

sysbench --table-size=1000000 --threads=1 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_write_only.lua cleanup
sysbench --table-size=1000000 --threads=1 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_write_only.lua prepare
bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'truncate table events_waits_history_long'
bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'set global innodb_flush_log_at_trx_commit=1'
sysbench --table-size=1000000 --threads=16 --time=120 --report-interval=5 --mysql-socket=/tmp/mysql8.sock --mysql-user=root --mysql-db=sbtest /usr/share/sysbench/oltp_write_only.lua run
bin/mysql -uroot --socket=/tmp/mysql8.sock performance_schema -B -e'select event_name, timer_wait from events_waits_history_long' >/tmp/waits64_1.txt
cat /tmp/waits64_1.txt | awk '{ printf("%s %d\n", $1, $2); }' | sed 's/\//;/g' > /tmp/w64_1.out

that produced the following:

SQL statistics:
    queries performed:
        read:                            0
        write:                           16238
        other:                           8119
        total:                           24357
    transactions:                        4059   (33.71 per sec.)
    queries:                             24357  (202.27 per sec.)
    ignored errors:                      1      (0.01 per sec.)
    reconnects:                          0      (0.00 per sec.)

General statistics:
    total time:                          120.4139s
    total number of events:              4059

Latency (ms):
         min:                                  151.87
         avg:                                  474.29
         max:                                 1316.65
         95th percentile:                      773.68
         sum:                              1925150.61

Threads fairness:
    events (avg/stddev):           253.6875/3.51
    execution time (avg/stddev):   120.3219/0.13

Now, with the resulting "collapsed stacks" in the /tmp/w64_?.out files, we can build a simple differential flame graph:

openxs@ao756:~/dbs/8.0$ ~/git/FlameGraph/difffolded.pl /tmp/w64_0.out /tmp/w64_1.out | ~/git/FlameGraph/flamegraph.pl --count picoseconds --title Waits > /tmp/w64_01_diff.svg

that, when captured as a .png file for this blog post, looks as follows:

Differential flame graph visualizing the impact of innodb_flush_log_at_trx_commit value (the difference 1 makes vs 0) on waits reported by performance_schema for the same oltp_write_only.lua test

We clearly see the negative impact on performance (54 TPS vs 33 TPS). We clearly see that time spent on InnoDB log I/O increased, and idle time somewhat increased as well. The highest relative increase was for the /wait/synch/cond/sql/MYSQL_BIN_LOG::COND_done condition variable (5+%), and that was probably related to the binlog group commit implementation, where we cannot write to the binary log until the redo log is flushed. A wild guess, surely.

I deliberately used flame graphs as simple as possible to make interpreting the difference from just a screenshot easier. This lame test demonstrated that we really can see what we expected - somewhat increased redo log I/O waits, highlighted in red. Good enough for a proof of concept, and way easier to speculate about than a typical Off-CPU differential flame graph like this:

that I built while attempting to reproduce one real-life MariaDB performance problem...

Saturday, January 8, 2022

Visualizing MySQL Plan Execution Time With Flame Graphs

In the previous post I already showed one non-classical but useful application of flame graphs as a visualization of something interesting for MySQL DBAs (besides the usual stack traces from profilers): time spent on waits, or even statement execution by stages with related waits, as reported by performance_schema. Today I am going to abuse the tools even more and try to show the real impact of each step of query execution as reported by the EXPLAIN ANALYZE statement, supported since MySQL 8.0.18.

The idea to use flame graphs for SQL plan visualization was (to the best of my knowledge) first suggested by Tanel Poder in this article, in the context of Oracle RDBMS. There are several implementations of the idea in free software, here and there, for example.

Unfortunately, I had not found any similar posts for MySQL, and the tools mentioned above do not work with MySQL. It made me wonder: why?

For quite some time we have had a way to get several metrics for each execution step of a query plan in MySQL. Let's start with some stupid query (used to prove the concept and test possible implementation steps):

openxs@ao756:~/dbs/8.0$ bin/mysql -uroot --socket=/tmp/mysql8.sock test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 33
Server version: 8.0.27 Source distribution

Copyright (c) 2000, 2021, Oracle and/or its affiliates.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> explain analyze select user, host from mysql.user u1 where u1.user not in (select distinct user from mysql.user) order by host desc\G
*************************** 1. row ***************************
EXPLAIN: -> Nested loop antijoin  (cost=3.75 rows=25) (actual time=0.139..0.139 rows=0 loops=1)
    -> Covering index scan on u1 using PRIMARY (reverse)  (cost=0.75 rows=5) (actual time=0.058..0.064 rows=5 loops=1)
    -> Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`)  (actual time=0.003..0.003 rows=1 loops=5)
        -> Materialize with deduplication  (cost=1.25..1.25 rows=5) (actual time=0.071..0.071 rows=4 loops=1)
            -> Filter: (mysql.`user`.`User` is not null)  (cost=0.75 rows=5) (actual time=0.022..0.035 rows=5 loops=1)
                -> Covering index scan on user using PRIMARY  (cost=0.75 rows=5) (actual time=0.020..0.031 rows=5 loops=1)

1 row in set (0,02 sec)

We see a hierarchical representation of the query execution steps (it's Oracle's famous TREE format of EXPLAIN output, the only one supported for EXPLAIN ANALYZE). Each row represents a step, explains, quite verbosely, what the step was doing, and then provides several useful metrics like the cost, the estimated number of rows, and the actual time to return the first and all rows for this step, etc. The actual time to return all rows at a step is what DBAs are usually interested in. The steps form a tree, but the way it's represented is a bit unusual - we do not see JSON or any other structured format as some other RDBMSes provide. Instead, each row has its level in the hierarchy represented by the number of spaces before the very informative "->" prompt. The first row with its additional "EXPLAIN: " prompt aside, each nesting level adds 4 spaces, as highlighted by the light green background above.

Maybe this unusual representation (not in a table of plan steps like in Oracle, and not as JSON) prevented people from quickly implementing a flame graph-based query "profiler" for MySQL. But this definitely is not going to stop me from hacking something that may even work. I have not really been a developer for 16+ years already, so I will not use Node.js or Python or anything fancy - just old Unix text processing tools like sed and awk, and surely the power of SQL (that small part of it I managed to master).

So, with the output like the above saved in a file:

openxs@ao756:~/dbs/8.0$ cat /tmp/explain.txt
*************************** 1. row ***************************
EXPLAIN: -> Nested loop antijoin  (cost=3.75 rows=25) (actual time=0.115..0.115 rows=0 loops=1)
    -> Covering index scan on u1 using PRIMARY (reverse)  (cost=0.75 rows=5) (actual time=0.045..0.049 rows=5 loops=1)
    -> Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`)  (actual time=0.002..0.002 rows=1 loops=5)
        -> Materialize with deduplication  (cost=1.25..1.25 rows=5) (actual time=0.063..0.063 rows=4 loops=1)
            -> Filter: (mysql.`user`.`User` is not null)  (cost=0.75 rows=5) (actual time=0.025..0.034 rows=5 loops=1)
                -> Covering index scan on user using PRIMARY  (cost=0.75 rows=5) (actual time=0.024..0.032 rows=5 loops=1)

openxs@ao756:~/dbs/8.0$

The first stage is simple and is represented by the following command line that I came up with quite fast:

openxs@ao756:~/dbs/8.0$  cat /tmp/explain.txt | awk 'NR > 1' | sed 's/EXPLAIN: //' | sed 's/(cost[^)][^)]*)//' | sed 's/(actual time=//' | sed 's/\(..*\)\.\...*/\1/' | sed 's/  \([^ ][^ ]*\)$/; \1/'
-> Nested loop antijoin ; 0.115
    -> Covering index scan on u1 using PRIMARY (reverse) ; 0.045
    -> Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`); 0.002
        -> Materialize with deduplication ; 0.063
            -> Filter: (mysql.`user`.`User` is not null) ; 0.025
                -> Covering index scan on user using PRIMARY ; 0.024

openxs@ao756:~/dbs/8.0$

At each step, sequentially, I removed the first line of the output, removed that stupid "EXPLAIN: " "prompt", removed the cost and number of rows estimates, and then extracted just the actual time value (note that the last regular expression, as written, keeps the first number of the time=first..all pair), separated by ";" as flamegraph.pl wants, keeping the detailed step description with its initial spaces.
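For readability, here is the same pipeline split one stage per line with comments (same commands as above; the sample input is recreated inline so the snippet is self-contained):

```shell
# Recreate a one-step sample of the EXPLAIN ANALYZE output,
# including the "1. row" header that the first stage drops
cat > /tmp/explain_sample.txt <<'EOF'
*************************** 1. row ***************************
EXPLAIN: -> Nested loop antijoin  (cost=3.75 rows=25) (actual time=0.115..0.115 rows=0 loops=1)
EOF

cat /tmp/explain_sample.txt |
  awk 'NR > 1' |                  # drop the "1. row" header line
  sed 's/EXPLAIN: //' |           # strip the "EXPLAIN: " prefix
  sed 's/(cost[^)][^)]*)//' |     # remove the cost/rows estimates
  sed 's/(actual time=//' |       # unwrap the actual time pair
  sed 's/\(..*\)\.\...*/\1/' |    # cut everything from ".." onward
  sed 's/  \([^ ][^ ]*\)$/; \1/'  # separate the value with "; "
# -> Nested loop antijoin ; 0.115
```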

The next stage of processing took me a lot of time and effort, as I had to convert the output above into a different format, suitable for loading into the database:

openxs@ao756:~/dbs/8.0$ cat /tmp/plan.txt
1;0;0; Nested loop antijoin ; 0.115
2;1;1; Covering index scan on u1 using PRIMARY (reverse) ; 0.045
3;1;1; Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`); 0.002
4;3;2; Materialize with deduplication ; 0.063
5;4;3; Filter: (mysql.`user`.`User` is not null) ; 0.025
6;5;4; Covering index scan on user using PRIMARY ; 0.024

The format is simple: semicolon-separated row number (step of the plan), row number of the "parent" step in the hierarchy (with 0 used for the very first step, obviously), the level of the step in the hierarchy (just in case), the same detailed step description, and the value of the metric for the step (the time in milliseconds). This file is then loaded into the following MySQL table:

mysql> desc plan;
+--------+---------------+------+-----+---------+-------+
| Field  | Type          | Null | Key | Default | Extra |
+--------+---------------+------+-----+---------+-------+
| seq    | int           | YES  |     | NULL    |       |
| parent | int           | YES  |     | NULL    |       |
| level  | int           | YES  |     | NULL    |       |
| step   | varchar(1024) | YES  |     | NULL    |       |
| val    | decimal(10,3) | YES  |     | NULL    |       |
+--------+---------------+------+-----+---------+-------+
5 rows in set (0,00 sec)

mysql> truncate plan;
Query OK, 0 rows affected (1,77 sec)

mysql> load data infile '/tmp/plan.txt' into table plan fields terminated by ';';
Query OK, 6 rows affected (0,13 sec)
Records: 7  Deleted: 0  Skipped: 0  Warnings: 0

mysql> select * from plan;
+------+--------+-------+------------------------------------------------------------------------------------------+-------+
| seq  | parent | level | step                                                                                     | val   |
+------+--------+-------+------------------------------------------------------------------------------------------+-------+
|    1 |      0 |     0 |  Nested loop antijoin                                                                    | 0.115 |
|    2 |      1 |     1 |  Covering index scan on u1 using PRIMARY (reverse)                                       | 0.045 |
|    3 |      1 |     1 |  Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`) | 0.002 |
|    4 |      3 |     2 |  Materialize with deduplication                                                          | 0.063 |
|    5 |      4 |     3 |  Filter: (mysql.`user`.`User` is not null)                                               | 0.025 |
|    6 |      5 |     4 |  Covering index scan on user using PRIMARY                                               | 0.024 |
+------+--------+-------+------------------------------------------------------------------------------------------+-------+
6 rows in set (0,00 sec)

Now, how to end up with that format? I spent a lot of time with awk and created the following script:

awk '
function st_push(val) {
    st_array[st_pos++] = val;
}

function st_pop() {
    return (st_size() > 0) ? st_array[--st_pos] : "ERROR";
}

function st_size() {
    return st_pos;
}

BEGIN { p[0] = 0; level[0] = 0; parent[0] = 0; }

NF > 0 {
    # print "> " NR ";" $1;
    split($0, a, "->");
    lvl = length(a[1])/4;

    level[NR] = lvl;
    parent[NR] = NR-1;

    if (level[NR] > level[NR-1]) {
        st_push(parent[NR]);
        p[level[NR]] = NR-1;
    }
    else if (level[NR] < level[NR-1]) {
        for (i = 0; i <= (level[NR-1]-level[NR]); i++) {
            parent[NR] = st_pop();
            # print "popped " parent[NR];
        }
    }
    else {
        parent[NR] = parent[NR-1];
    }
    print NR ";" parent[NR] ";" lvl ";" a[2]
}'

The script starts with 3 functions implementing a stack, and uses 3 associative arrays: p[] to store the current parent row number for a specific nesting level, level[] to store the hierarchy level of a specific row, and parent[] to store the number of the row that is the parent of a given row in the plan.

To find the level of a row, I split() the input row (which has at least one default field separator ";" inside) using "->" as the field separator (assuming it is never used in a step description), ending up with the initial spaces in a[1] and the rest (the step description and the metric value) in a[2]. I divide the length of a[1] by 4 to find the level.
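The level computation alone can be checked in isolation (a toy one-liner; 8 leading spaces before "->" mean level 2):

```shell
# a[1] holds the leading spaces before "->"; 4 spaces per nesting level
echo '        -> Materialize with deduplication ; 0.063' \
  | awk '{ split($0, a, "->"); print length(a[1])/4 }'
# 2
```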

This was the simple part. Then I build the parent row value, using a stack to save the previous parent row number for each level. It took some testing with a more primitive input format to come up with "tree traversal" code for such a weird tree representation format that seems to work on the few different nested structures I tested (correct me if it fails for more complex plans). I am not a developer any more, so the code above may be wrong.

So, with the plan in the table, where each line has its seq number and the seq number of its parent line as parent, one needs a simple enough recursive CTE to build all possible paths, carrying the time spent at each step:

mysql> with recursive cte_plan as (select seq, parent, level, concat(seq, concat(' -', step)) as step, round(val*1000) as val from plan where level = 0 union all select p.seq, p.parent, p.level, concat(c.step, concat(';', concat(p.seq, concat(' -',p.step)))), round(p.val*1000) as val from plan p join cte_plan c on p.parent = c.seq) select step, val from cte_plan\G
*************************** 1. row ***************************
step: 1 - Nested loop antijoin
 val: 115
*************************** 2. row ***************************
step: 1 - Nested loop antijoin ;2 - Covering index scan on u1 using PRIMARY (reverse)
 val: 45
*************************** 3. row ***************************
step: 1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`)
 val: 2
*************************** 4. row ***************************
step: 1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`);4 - Materialize with deduplication
 val: 63
*************************** 5. row ***************************
step: 1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`);4 - Materialize with deduplication ;5 - Filter: (mysql.`user`.`User` is not null)
 val: 25
*************************** 6. row ***************************
step: 1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`);4 - Materialize with deduplication ;5 - Filter: (mysql.`user`.`User` is not null) ;6 - Covering index scan on user using PRIMARY
 val: 24
6 rows in set (0,01 sec)

Note that I've multiplied the metric value by 1000, as the flamegraph.pl script expects metrics to be integer numbers. So they are actually in microseconds. I also added the step number to each step, as we may have the exact same step in different places of the hierarchy if we read from the same table more than once in the query.

The final code to produce the input for flamegraph.pl is like this:

openxs@ao756:~/dbs/8.0$ ~/dbs/8.0/bin/mysql -uroot --socket=/tmp/mysql8.sock test -B -e"with recursive cte_plan as (select seq, parent, level, concat(seq, concat(' -', step)) as step, round(val*1000) as val from plan where level = 0 union all select p.seq, p.parent, p.level, concat(c.step, concat(';', concat(p.seq, concat(' -',p.step)))), round(p.val*1000) as val from plan p join cte_plan c on p.parent = c.seq) select step, val from cte_plan;" | awk ' NR > 1' > /tmp/processed_plan.txt
openxs@ao756:~/dbs/8.0$ cat /tmp/processed_plan.txt
1 - Nested loop antijoin        115
1 - Nested loop antijoin ;2 - Covering index scan on u1 using PRIMARY (reverse) 45
1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`)   2
1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`);4 - Materialize with deduplication       63
1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`);4 - Materialize with deduplication ;5 - Filter: (mysql.`user`.`User` is not null)         25
1 - Nested loop antijoin ;3 - Single-row index lookup on <subquery2> using <auto_distinct_key> (user=mysql.u1.`User`);4 - Materialize with deduplication ;5 - Filter: (mysql.`user`.`User` is not null) ;6 - Covering index scan on user using PRIMARY  24

This is how I create the final flame graph representing the plan steps:

openxs@ao756:~/dbs/8.0$ cat /tmp/processed_plan.txt | ~/git/FlameGraph/flamegraph.pl --title "EXPLAIN Steps" --inverted --countname microseconds > /tmp/explain2.svg

Surely these steps can be simplified, put into a script, etc. I checked the intermediate stages one by one, left some debugging details in, and so on. You can do better, but the final result is this nice icicle graph:

Time spent on each step of query execution represented as an icicle graph

where I highlighted the impact of the "Covering index scan on u1 using PRIMARY (reverse)" step. As usual with flame graphs, in the original .svg file you can search, zoom in and out, etc. For complex queries it may be really useful.

As a DBA I'd ask for something better than these weird stages (even if scripted). I wish we had all execution steps in tables or views, like Oracle's v$sql_plan and v$sql_plan_statistics_all, or at least in JSON format (like EXPLAIN FORMAT=JSON produces for the plan itself). One day, maybe (I am going to file a feature request for this later). Yet another day I plan to use the output of ANALYZE FORMAT=JSON in MariaDB to get similar information :)

Sunday, February 28, 2021

5 Years of Working for MariaDB Corporation

March 1, 2016, was my first day working for MariaDB Corporation. It means that I have worked for them a full 5 years in a row! It's the longest period I've ever spent working at the same company. I worked for more than 7 years in the MySQL Bugs Verification team of MySQL Support, but formally that was for 3 different companies over the period: MySQL AB, Sun and Oracle. So tonight, after 2 weekend on-call shifts in a row, I want to summarize what I lost over these 5 years and what I gained.

I never planned to leave Percona after just a bit more than 3 years, but I had to do something to either fix the direction of their services business development after April 2015, or at least make some points that would be noticed and remembered. I made them by January 26, 2016, and then had to move on.

So, what I lost after leaving Percona:

  • In Percona I worked in the best Support team in the industry at that moment!
  • In Percona I was proud to be working for a company that does the right things, both for the business around open source software, the Community, and customers.
  • In Percona I was involved in decision making; I influenced the bug fixing process and priorities, and was a kind of authority on everything I cared to state or do.
  • I had a really good salary, regular bonuses, longer vacations or otherwise properly compensated extra working time, and all the opportunities to become a public person in the MySQL Community.
  • I spent a lot of time working with Peter Zaitsev and really brilliant engineers from Development, QA and Support alike.
  • After a few initial months of getting used to a lot of work and the work style, and then until April 14 or so, 2015, it was a really comfortable place for me to work at and do the things I like and am good at.

I lost most of the above when I left. No more decision making of any kind (it was my own decision to avoid that when joining MariaDB, to begin with). No more bugs processing or prioritizing. No more Percona Live conferences until 2019, when I finally managed to clarify the problems I had with my (cancelled) participation in Percona Live Europe 2015. Nobody ever asked me to blog about anything from 2016 until the beginning of 2020 (when the MariaDB Foundation got really interested in my public performances). Joining MariaDB Corporation made me a suspicious member of the MySQL Community and eventually forced me to leave Planet MySQL, where my posts were not appreciated.

It takes just one short tweet from me sharing a MySQL bug number to have some of these bugs immediately hidden from the Community and made private. It looks like people suspect I have some secret agenda and mission from MariaDB Corporation, while I have none related to MySQL, neither to the software nor to bugs in it - I do this on my own time and based on my own views, which are not influenced by anyone...

Now, what I gained from joining the MariaDB Support team:

  • I still work in the best Support team in the industry, with new (to me) brilliant people, some of my former colleagues from MySQL, and some of those I worked with in Percona and managed to get back into my team, now in MariaDB Corporation.
  • I work harder than ever, at least twice as much as I did in Percona (at least speaking about real-life customer issues). The issues are more interesting, complex and challenging in general, and cover everything from MySQL 4.0.x to NDB Cluster, MySQL 8.0.x and all versions of MariaDB Server, MaxScale and Connectors, and Galera clusters, with everything MySQL-related that Percona releases in between! This work has been properly compensated in recent years.
  • Yes, I work on Galera issues a lot, read megabytes of logs and make sense of everything Galera. Who could have imagined I'd go that way back in 2015?
  • I work closely and directly with the best developers, from Codership's Galera team and MariaDB Foundation developers to Sergey Golubchik, Marko Mäkelä, Vladislav Vaintroub, Elena Stepanova, the good old MySQL Optimizer team (now MariaDB's), and Monty himself, and more... We chat, talk and discuss technical topics almost every day! I do some of the work they ask about, and I build more things from sources than ever. It's really great and had almost never been the case before I joined MariaDB. I love this part of the corporate culture here.
  • My blog that you are reading now is way more popular than it ever was before 2016. At good times in 2017 I had more than 1000 reads per day, for weeks and months.
  • I present at conferences way more often than I ever did in Percona, from FOSDEM to Percona Live and everything MariaDB.
  • My influence and impact on the MySQL Community have increased. I was declared MySQL Community Contributor of the Year 2019. As often happens, it's easier to make an impact when you are an outsider. They cannot ignore you, even if that's only because you are considered an "asshole" and an "enemy" with a "corporate agenda", for whatever reasons.

So far I do not regret the decision I made in favor of MariaDB back in 2016, even though it forced me to keep up with, or ignore, many things I don't like at my current company. I am sorry that back in 2010 Monty and Kay were just 10 days too late to get me into the SkySQL of those times. I had already signed the contract with Oracle, and in 2012 there I really was mostly wasting my time, unfortunately...

That switch formally happened on March 1, 2016. It was a good decision.

Just a few random notes at the end of a hard 7-day working week. I hope you will not blame me too much for these. I also hope I'll still have my job tomorrow :)

Tuesday, January 26, 2021

What mysql_upgrade really does in MariaDB, Part II, Bugs and Missing Features

Both /proc sampling and bpftrace are cool topics to write about, but I should not forget my third upcoming talk at FOSDEM 2021, in less than 2 weeks: "Upgrading to a newer major version of MariaDB", which is mostly devoted to mysql_upgrade internals and is based on this blog post. Today I am going to provide some more background details and, probably for the first time, act as a MariaDB entomologist and study some bug reports and feature requests related to mysql_upgrade (a.k.a. mariadb-upgrade in 10.5+) that were added over the last 10 months or so.

I tried different ways to search for MariaDB JIRA issues related to mysql_upgrade. For example, this is how I tried to find any bugs in mysql_upgrade closed over the last two months (I recall there were a few):

text ~ mysql_upgrade AND project = "MariaDB Server" AND status = Closed AND createdDate >= "2020-12-01" ORDER BY createdDate DESC

I've got a list of 7 reports, with 2 relevant bugs that are already fixed:

  • MDEV-24566 - "mysql_upgrade failed with "('mariadb.sys'@'localhost') does not exist" and mariadb 10.4/10.5 on docker". This is fixed in current docker images at https://hub.docker.com/r/mariadb/server. The problem was related to the Docker image only, and mysql_upgrade was actually affected by the initial database content there.
  • MDEV-24452 - "ALTER TABLE event take infinite time which for example breaks mysql_upgrade". Now this was a real blocker bug in recent 10.5. If any event managed to start before you executed mysql_upgrade, the utility and the proper upgrade process were blocked. Good to see this fixed in the upcoming 10.5.9.

So, Monty really fixes related bugs when they are reported, as promised. Let's now consider the following query, for still-not-closed issues reported since April 1, 2020:

text ~ "mysql_upgrade" AND project = "MariaDB Server" AND status != Closed AND createdDate >= "2020-04-01" ORDER BY createdDate DESC

I checked them one by one (as they may be related to the upgrade process but not to mysql_upgrade specifically) and placed them into two lists, bugs and feature requests. Let me start with tasks (feature requests), so you know what was kind of missing by design or has to be done differently:

  • MDEV-24586 - "remove scripts/mysql_to_mariadb.sql". Actually, the proper logic is implemented in scripts/mysql_system_tables_fix.sql, which forms part of mysql_upgrade, so a separate script is no longer needed.
  • MDEV-24540 - "Detect incompatible MySQL virtual columns and report in error log." MariaDB does not support migration from MySQL-generated physical database tables containing virtual columns, and produces column mismatch errors, failures in mysql_upgrade etc. It would be great to have proper error messages from mysql_upgrade in this case, explaining the real problem and possible solutions (dump, drop and reload or whatever).
  • MDEV-24453 - "mysql_upgrade does not honor --verbose parameter". It is not passed to other binaries called and this may make debugging upgrade issues more complex.
  • MDEV-24316 - "cross upgrade from MySQL - have precheck tool". According to the reporter, Daniel Black, the goal would be to check on a database/global scale looking at tables, at features used, at settings, at character sets in table and determine the "migratablilty" of a given MySQL instance. I voted for this feature!
  • MDEV-24093 - "Detect during mysql_upgrade if type_mysql_json.so is needed and load it". After MDEV-18323, MYSQL_JSON type is available as a dynamically loadable plugin. To make mysql_upgrade run seamlessly we need to make sure it is loaded appropriately (and unloaded when done with the upgrade). This is already in review, so it will be implemented really soon.
  • MDEV-23962 - "Remove arc directory support". I think only Monty (the bug reporter) knows what this is about. I don't :)
  • MDEV-23008 - "store mysql_upgrade version info in system table instead of local file". One of the really important feature requests from my colleague since 2005, Hartmut Holzgraefe.
  • MDEV-22357 - "Clearing InnoDB autoincrement counter when upgrading from MariaDB < 10.2.4". CHECK TABLE ... FOR UPGRADE should work differently for InnoDB tables, for mysql_upgrade to work properly.
  • MDEV-22323 - "Upgrading MariaDB". This is the "umbrella" task to cover all problematic cases of MySQL to MariaDB, Percona Server to MariaDB, and minor MariaDB server upgrades.
  • MDEV-22322 - "Percona Server -> Mariadb Upgrades". Summary of all the related issues. See MDEV-22679 etc.

Now back to more or less serious bugs that are still waiting for a fix:

  • MDEV-24579 - "Error table->get_ref_count() after update to 10.5.8". It seems DDL executed on mysql.* tables with InnoDB persistent statistics (like those executed by mysql_upgrade!) may cause problems for concurrent queries (up to an assertion failure in this case). So we either should remove those tables (I wish!), or do something with mysql_upgrade, or (IMHO even better) not let users connect and execute queries while mysql_upgrade is running, like MySQL 8 does when the server is started for the first time on an older datadir and performs the upgrade. Take care in the meantime...
  • MDEV-23652 - "Assertion failures upon reading InnoDB system table after normal upgrade from 10.2". Now this is a real bug :) An assertion failure during mysql_upgrade is surely something to fix!
  • MDEV-23636 - "mysql_upgrade [ERROR] InnoDB: Fetch of persistent statistics requested for table". I am not sure what's going on here, and why.
  • MDEV-23392 - "main.mysql_upgrade_view failed in buildbot with another wrong result". MTR test case failures are something to care about.
  • MDEV-22683 - "mysql_upgrade misses some changes to mysql schema". Over time and across versions, some structures in the mysql schema get changed, but not all the changes make it into the scripts executed by mysql_upgrade. As a result a schema freshly created by mysql_install_db on version 10.x.y differs from a schema created on an earlier version and upgraded to 10.x.y by mysql_upgrade. The real diffs are listed, per version, in this bug report from Elena Stepanova.
  • MDEV-22655 - "CHECK TABLE ... FOR UPGRADE fails to report old datetime format". That's my favorite, unfortunately. It makes running mysql_upgrade useless for some cases of upgrade from pre-5.6 MySQL versions and leads to problems for tables partitioned by datetime etc. columns. See also MDEV-24499 - "Server upgrade causes compound index and related query to fail.".
  • MDEV-22645 - "default_role gets removed when migrating from 10.1 to 10.4". May have something to do with mydumper/myloader used, but still a problem.
  • MDEV-22482 - "pam v2: mysql_upgrade doesn't fix the ownership/privileges of auth_pam_tool". No comments.
  • MDEV-22477 - "mysql_upgrade fails with sql_notes=OFF". mysql_upgrade, or, more exactly, mysql_system_tables.sql, uses the @@warning_count variable in the upgrade logic. The variable, in turn, depends on the value of sql_notes. When it is OFF, @@warning_count is not incremented, and mysql_upgrade doesn't work as expected.
After Monty's post in April 2020 many new mysql_upgrade bugs were reported, and some were already fixed. So, we are on the way...

* * *

To summarize:

  1. There are still some bugs and missing features in mysql_upgrade.
  2. MariaDB actively works on fixing them, as once promised by Monty.
  3. Check the lists in this blog post carefully if you plan to upgrade to recent MariaDB 10.x.y versions.
  4. Please report any problems with mysql_upgrade or upgrades in general to our JIRA.

Monday, January 25, 2021

Checking User Threads and Temporary Tables With gdb in MariaDB 10.4+, Step By Step

There have been no posts about gdb tricks in this blog for a long time. This is surely unusual, but I had not done anything fancy with gdb for more than a year. Today I finally got a chance to find something new in the code and answer yet another question based on code review and some basic gdb commands.

The question was about a way to find what temporary tables, if any, are created by some connection. There is no way to do this in MariaDB 10.4, see MDEV-12459 for some related plans (and I_S.INNODB_TEMP_TABLE_INFO of MySQL that appeared in MariaDB for a short time only). My immediate answer was that this is surely stored somewhere in the THD structure and I just have to find it (and a way to work with that information) using code review and/or gdb.

The first step was easy. I know that THD is defined in sql/sql_class.h, and there I see:

class THD: public THD_count, /* this must be first */
           public Statement,
           /*
             This is to track items changed during execution of a prepared
             statement/stored procedure. It's created by
             nocheck_register_item_tree_change() in memory root of THD,
             and freed in rollback_item_tree_changes().
             For conventional execution it's always empty.
           */
           public Item_change_list,
           public MDL_context_owner,
           public Open_tables_state
...

Temporary tables surely must be somewhere in that Open_tables_state. In the same file we can find the following:

class Open_tables_state
{
public:
  /**
    As part of class THD, this member is set during execution
    of a prepared statement. When it is set, it is used
    by the locking subsystem to report a change in table metadata.
    When Open_tables_state part of THD is reset to open
    a system or INFORMATION_SCHEMA table, the member is cleared
    to avoid spurious ER_NEED_REPREPARE errors -- system and
    INFORMATION_SCHEMA tables are not subject to metadata version
    tracking.
    @sa check_and_update_table_version()
  */
  Reprepare_observer *m_reprepare_observer;

  /**
    List of regular tables in use by this thread. Contains temporary and
    base tables that were opened with @see open_tables().
  */
  TABLE *open_tables;

  /**
    A list of temporary tables used by this thread. This includes
    user-level temporary tables, created with CREATE TEMPORARY TABLE,
    and internal temporary tables, created, e.g., to resolve a SELECT,
    or for an intermediate table used in ALTER.
  */
  All_tmp_tables_list *temporary_tables;

...

With this information I am ready to dive into gdb session. I have MariaDB 10.4.18 at hand and create a couple of temporary tables in connection with id 9:

MariaDB [test]> select version(), connection_id(), current_user();
+-----------------+-----------------+------------------+
| version()       | connection_id() | current_user()   |
+-----------------+-----------------+------------------+
| 10.4.18-MariaDB |               9 | openxs@localhost |
+-----------------+-----------------+------------------+
1 row in set (0,000 sec)

MariaDB [test]> create temporary table mytemp(c1 int, c2 varchar(100));
Query OK, 0 rows affected (0,034 sec)

MariaDB [test]> create temporary table mytemp2(id int, c2 int) engine=MyISAM;
Query OK, 0 rows affected (0,001 sec)

MariaDB [test]> show processlist;
+----+-------------+-----------+------+---------+------+--------------------------+------------------+----------+
| Id | User        | Host      | db   | Command | Time | State                    | Info             | Progress |
+----+-------------+-----------+------+---------+------+--------------------------+------------------+----------+
|  3 | system user |           | NULL | Daemon  | NULL | InnoDB purge worker      | NULL             |    0.000 |
|  4 | system user |           | NULL | Daemon  | NULL | InnoDB purge worker      | NULL             |    0.000 |
|  1 | system user |           | NULL | Daemon  | NULL | InnoDB purge worker      | NULL             |    0.000 |
|  2 | system user |           | NULL | Daemon  | NULL | InnoDB purge coordinator | NULL             |    0.000 |
|  5 | system user |           | NULL | Daemon  | NULL | InnoDB shutdown handler  | NULL             |    0.000 |
|  9 | openxs      | localhost | test | Query   |    0 | Init                     | show processlist |    0.000 |
+----+-------------+-----------+------+---------+------+--------------------------+------------------+----------+
6 rows in set (0,000 sec)

Now I attach gdb and immediately try to check what's inside the temporary_tables field of the do_command frame where thd is present:

openxs@ao756:~$ sudo gdb -p `pidof mysqld`
[sudo] password for openxs:
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 26620
[New LWP 26621]
... 28 more LWPs were here
[New LWP 26658]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f59fb69080d in poll () at ../sysdeps/unix/syscall-template.S:84
84      ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) thread 31
[Switching to thread 31 (Thread 0x7f59dd6ca700 (LWP 26658))]
#0  0x00007f59fb69080d in poll () at ../sysdeps/unix/syscall-template.S:84
84      in ../sysdeps/unix/syscall-template.S
(gdb) p do_command::thd->thread_id
$1 = 9
(gdb) p do_command::thd->temporary_tables
$2 = (All_tmp_tables_list *) 0x5606b575cd28
(gdb) p *do_command::thd->temporary_tables
$3 = {<I_P_List_null_counter> = {<No data fields>}, <I_P_List_no_push_back<TMP_TABLE_SHARE>> = {<No data fields>}, m_first = 0x5606b60dfa38}
(gdb) p *do_command::thd->temporary_tables->m_first
$4 = {<TABLE_SHARE> = {table_category = TABLE_CATEGORY_TEMPORARY, name_hash = {
      key_offset = 0, key_length = 0, blength = 0, records = 0, flags = 0,
      array = {buffer = 0x0, elements = 0, max_element = 0,
        alloc_increment = 0, size_of_element = 0, malloc_flags = 0},
      get_key = 0x0, hash_function = 0x0, free = 0x0, charset = 0x0},
    mem_root = {free = 0x5606b60e3cf8, used = 0x5606b60e40e8, pre_alloc = 0x0,
      min_malloc = 32, block_size = 985, total_alloc = 2880, block_num = 6,
      first_block_usage = 0,
      error_handler = 0x5606b1e6bac0 <sql_alloc_error_handler()>,
      name = 0x5606b2538d63 "tmp_table_share"}, keynames = {count = 0,
      name = 0x0, type_names = 0x5606b60e3d78, type_lengths = 0x5606b60e3d94},
    fieldnames = {count = 2, name = 0x0, type_names = 0x5606b60e3d60,
      type_lengths = 0x5606b60e3d88}, intervals = 0x0, LOCK_ha_data = {
      m_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0,
          __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0,
            __next = 0x0}}, __size = '\000' <repeats 39 times>, __align = 0},
      m_psi = 0x0}, LOCK_share = {m_mutex = {__data = {__lock = 0,
          __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins = 0,
          __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
        __size = '\000' <repeats 39 times>, __align = 0}, m_psi = 0x0},
    tdc = 0x0, tabledef_version = {
      str = 0x5606b60e3d10 "~A\370\275_4\021К°Ё\364\267\342\023=\275",
      length = 16}, option_list = 0x0, option_struct = 0x0,
---Type <return> to continue, or q <return> to quit---field = 0x5606b60e3dQuit

You may be wondering why I jumped to Thread 31 immediately: how did I know that it corresponds to the connection with thread_id 9, as the later print verified? It was not pure luck: I knew I was the only user and just jumped to the last thread in order of creation. For the general case there is a better way: navigating over a "list" of threads that must exist somewhere, as SHOW PROCESSLIST must have an easy way to get them all. We'll get back to that important task later in this post.

Now, in the temporary_tables->m_first field we have a table share, with a lot of details we may need. We can try to see some of them that were actually requested originally:

(gdb) p do_command::thd->temporary_tables->m_first->table_name
$5 = {str = 0x5606b60dfeef "mytemp2", length = 7}
(gdb) p do_command::thd->temporary_tables->m_first->table_name.str
$6 = 0x5606b60dfeef "mytemp2"
(gdb) p do_command::thd->temporary_tables->m_first->path
$7 = {str = 0x5606b60dfed8 "/tmp/#sql67fc_9_3", length = 17}
(gdb) p do_command::thd->temporary_tables->m_first->path.str
$8 = 0x5606b60dfed8 "/tmp/#sql67fc_9_3"

So, I can get as many details as are presented or can be found from the TABLE_SHARE structure. I see them immediately for the last temporary table I've created in that session. But what about the others? There might be many of them. I expected some kind of linked list or array, but the type information presented above gave me no real hint. Where is the next or previous item? The type name hints at a list, but that's all:

(gdb) p *do_command::thd->temporary_tables
$3 = {<I_P_List_null_counter> = {<No data fields>}, <I_P_List_no_push_back<TMP_TABLE_SHARE>> = {<No data fields>}, m_first = 0x5606b60dfa38}

The type, <I_P_List_no_push_back<TMP_TABLE_SHARE>>, looks like some template class instantiated with TMP_TABLE_SHARE, and I can find the source code:

template <typename T> class I_P_List_no_push_back;


/**
   Intrusive parameterized list.
   Unlike I_List does not require its elements to be descendant of ilink
   class and therefore allows them to participate in several such lists
   simultaneously.
   Unlike List is doubly-linked list and thus supports efficient deletion
   of element without iterator.
   @param T  Type of elements which will belong to list.
   @param B  Class which via its methods specifies which members
             of T should be used for participating in this list.
             Here is typical layout of such class:
             struct B
             {
               static inline T **next_ptr(T *el)
               {
                 return &el->next;
               }
               static inline T ***prev_ptr(T *el)
               {
                 return &el->prev;
               }
             };
   @param C  Policy class specifying how counting of elements in the list
             should be done. Instance of this class is also used as a place
             where information about number of list elements is stored.
             @sa I_P_List_null_counter, I_P_List_counter
   @param I  Policy class specifying whether I_P_List should support
             efficient push_back() operation. Instance of this class
             is used as place where we store information to support
             this operation.
             @sa I_P_List_no_push_back, I_P_List_fast_push_back.
*/

template <typename T, typename B,
          typename C = I_P_List_null_counter,
          typename I = I_P_List_no_push_back<T> >
class I_P_List : public C, public I
{
  T *m_first;
...

but I got lost in all this C++ stuff. Luckily I asked in the Engineering channel and got a hint that "I" in the name means "Intrusive" and that the base type T is supposed to include pointers to the next and previous items. Moreover, in the case of TMP_TABLE_SHARE they are named tmp_next and tmp_prev. I had to read the entire structure, as next and prev had not worked for me...

With this hint it was easy to proceed:

(gdb) p do_command::thd->temporary_tables->m_first->tmp_next
$12 = (TMP_TABLE_SHARE *) 0x5606b60df558
(gdb) set $t = do_command::thd->temporary_tables->m_first
(gdb) p $t
$13 = (TMP_TABLE_SHARE *) 0x5606b60dfa38
(gdb) p $t->table_name.str
$14 = 0x5606b60dfeef "mytemp2"
(gdb) set $t = $t->tmp_next
(gdb) p $t
$15 = (TMP_TABLE_SHARE *) 0x5606b60df558
(gdb) p $t->table_name.str
$16 = 0x5606b60dfa0f "mytemp"
(gdb) set $t = $t->tmp_next
(gdb) p $t
$17 = (TMP_TABLE_SHARE *) 0x0

The idea is to iterate while $t is not zero, starting from temporary_tables->m_first. You can surely put it into a Python loop for automation. One day I'll do this too. For now I am happy to be able to list all temporary tables with all the details manually, with gdb commands.
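Such a loop could look roughly like the sketch below. The traversal helper is plain Python so it can be exercised anywhere; inside a real gdb session you would feed it gdb.Value nodes obtained via gdb.parse_and_eval() (the exact gdb calls in the comment are my untested assumption, not something from this session):

```python
# Sketch of automating the manual loop above: follow an intrusive
# "next" pointer until it is NULL, collecting every node on the way.
def walk_intrusive_list(first, get_next, is_null=lambda n: not n):
    nodes = []
    node = first
    while not is_null(node):
        nodes.append(node)
        node = get_next(node)
    return nodes

# Inside gdb (hypothetical, untested) it could be used like:
#   t = gdb.parse_and_eval("do_command::thd->temporary_tables->m_first")
#   for share in walk_intrusive_list(t, lambda n: n["tmp_next"],
#                                    is_null=lambda n: int(n) == 0):
#       print(share["table_name"]["str"].string())

# Stand-alone check with mock nodes instead of gdb.Value objects:
class Node:
    def __init__(self, name):
        self.name, self.tmp_next = name, None

a, b = Node("mytemp2"), Node("mytemp")
a.tmp_next = b
names = [n.name for n in walk_intrusive_list(a, lambda n: n.tmp_next)]
print(names)  # ['mytemp2', 'mytemp'] - same order as the gdb session above
```

The same walker would work for any other intrusive list in the server, as only the name of the "next" field changes.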

The remaining question is: how to iterate over user threads in this MariaDB version? No more global threads variable:

(gdb) p threads
No symbol "threads" in current context.

No surprise, we had that changed in MySQL 5.7+ too.

Here I also used a hint from a way more experienced colleague, Sergei Golubchik. That's what we have now:

(gdb) p server_threads
$18 = {threads = {<base_ilist> = {first = 0x5606b60b6d28, last = {
        _vptr.ilink = 0x5606b2d03a38 <vtable for ilink+16>,
        prev = 0x7f59a80009b8, next = 0x0}}, <No data fields>}, lock = {
    m_rwlock = {__data = {__lock = 0, __nr_readers = 0, __readers_wakeup = 0,
        __writer_wakeup = 0, __nr_readers_queued = 0, __nr_writers_queued = 0,
        __writer = 0, __shared = 0, __rwelision = 0 '\000',
        __pad1 = "\000\000\000\000\000\000", __pad2 = 0, __flags = 0},
      __size = '\000' <repeats 55 times>, __align = 0}, m_psi = 0x0}}
(gdb) ptype server_threads
type = class THD_list {
  private:
    I_List<THD> threads;
    mysql_rwlock_t lock;

  public:
    void init();
    void destroy();
    void insert(THD *);
    void erase(THD *);
    int iterate<std::vector<unsigned long long> >(my_bool (*)(THD *,
    std::vector<unsigned long long> *), std::vector<unsigned long long> *);
}
(gdb) p server_threads.threads
$19 = {<base_ilist> = {first = 0x5606b60b6d28, last = {
      _vptr.ilink = 0x5606b2d03a38 <vtable for ilink+16>,
      prev = 0x7f59a80009b8, next = 0x0}}, <No data fields>}

From that I had to proceed myself. I already know what "I" means in these templates, so I expect to find the next pointer somewhere if I start from first:

(gdb) p server_threads.threads.first
$20 = (ilink *) 0x5606b60b6d28
(gdb) p *server_threads.threads.first
$21 = {_vptr.ilink = 0x5606b2d08f80 <vtable for THD+16>,
  prev = 0x5606b2ed1de0 <server_threads>, next = 0x7f59980009a8}
(gdb) set $thd = (THD *)server_threads.threads.first
(gdb) p $thd->thread_id
$22 = 9

This was the initialization part, now let's check some more and iterate:

(gdb) p $thd->proc_info
$23 = 0x5606b2521cd9 "Reset for next command"
(gdb) set $thd = (THD *)$thd->next
(gdb) p $thd->thread_id
$24 = 5
(gdb) p $thd->proc_info
$25 = 0x5606b267a145 "InnoDB shutdown handler"
(gdb) set $thd = (THD *)$thd->next
(gdb) p $thd->thread_id
$26 = 2
(gdb) p $thd->main_security_ctx.user
$27 = 0x0
(gdb) p $thd->proc_info
$28 = 0x5606b26a3da9 "InnoDB purge coordinator"
(gdb) set $thd = (THD *)$thd->next
(gdb) p $thd->thread_id
$29 = 1
(gdb) p $thd->proc_info
$30 = 0x5606b26a3e20 "InnoDB purge worker"
(gdb) set $thd = (THD *)$thd->next
(gdb) p $thd->thread_id
$31 = 4
(gdb) p $thd->proc_info
$32 = 0x5606b26a3e20 "InnoDB purge worker"
(gdb) set $thd = (THD *)$thd->next
(gdb) p $thd->thread_id
$33 = 3
(gdb) p $thd->proc_info
$34 = 0x5606b26a3e20 "InnoDB purge worker"
(gdb) set $thd = (THD *)$thd->next
(gdb) p $thd->thread_id
$35 = 1095216660735
(gdb) p $thd->proc_info
$36 = 0x0
(gdb) set $thd = (THD *)$thd->next
(gdb) p $thd
$37 = (THD *) 0x0

The idea of iteration is also clear: we move to $thd->next if it's not zero. What we see matches the SHOW PROCESSLIST output with the exception of the last thread, which has a bogus thread_id and zero proc_info. It is some "sentinel" that is not present in the PROCESSLIST, apparently the last ilink embedded in the list head itself rather than a real THD, so it should not be dereferenced as one. One day I'll figure out the details and automate checking all threads based on Python code of this kind, suggested by Shane Bester. Tonight I am just happy to document what I recently found, as all details related to gdb usage do change with time and new versions released.
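To make the sentinel behaviour concrete, here is a tiny Python model (entirely my own illustration, not MariaDB code): real elements are chained via next, and the chain is terminated by a sentinel node embedded in the list head whose own next is NULL.

```python
# Model of an I_List: real elements are chained via `next`, and the
# chain ends with a sentinel embedded in the list head. Reading the
# sentinel as a THD gives garbage, which is why a safe loop checks
# `next` before treating the current node as a real element.
class Link:
    def __init__(self, payload=None):
        self.payload = payload   # None marks the sentinel
        self.next = None

def iter_threads(first):
    node = first
    while node is not None and node.next is not None:
        yield node.payload       # a real element; the sentinel has next == None
        node = node.next

sentinel = Link()                # plays the role of base_ilist's embedded `last`
t9, t5 = Link("thd 9"), Link("thd 5")
t9.next, t5.next = t5, sentinel
threads = list(iter_threads(t9))
print(threads)  # ['thd 9', 'thd 5'] - the sentinel is skipped
```

In gdb terms this means stopping when $thd->next is 0, instead of printing fields of the node that has it.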

Free travels and digging into the code in gdb with a specific goal in mind - I've missed these activities recently.

* * *

To summarize:

  1. It's relatively easy to find out all the details about every temporary table of any kind created in any MariaDB server user thread with gdb.
  2. It's still fun to work on MariaDB, as you can promptly get help from developers no matter what crazy questions you may ask.
  3. Changes towards more modern C++ may make it more difficult to debug in gdb initially for those unaware of the details of class implementation and design.

Sunday, January 24, 2021

Playing with recent bpftrace and MariaDB 10.5 on Fedora - Part II, Using the Existing Tools

In the previous post in this series I presented a couple of my quick and dirty attempts to use bpftrace to add uprobes to the MariaDB server and the dynamic libraries it uses, to trace queries and their execution times and to collect stack traces related to mutex waits. It looks like for any non-trivial monitoring task we are going to end up with more than one probe and will need to do more processing to produce clean and useful results with minimal CPU and memory impact, both while collecting the data in kernel context and while processing them in user space.

I have a long way to go with my lame commands towards this goal, so in this post I decided to check several existing bpftrace programs not directly related to MariaDB, to see how they are structured, how they use built-in variables and functions etc. I'll also try to apply some of them to MariaDB 10.5 running a sysbench read-write test with high enough concurrency.

The most popular tools are included in the bpftrace source code, along with examples of their usage. We can find them in the tools subdirectory:

[root@fc31 tools]# pwd
/mnt/home/openxs/git/bpftrace/tools
[root@fc31 tools]# ls
bashreadline.bt             loads_example.txt      syscount_example.txt
bashreadline_example.txt    mdflush.bt             tcpaccept.bt
biolatency.bt               mdflush_example.txt    tcpaccept_example.txt
biolatency_example.txt      naptime.bt             tcpconnect.bt
biosnoop.bt                 naptime_example.txt    tcpconnect_example.txt
biosnoop_example.txt        oomkill.bt             tcpdrop.bt
biostacks.bt                oomkill_example.txt    tcpdrop_example.txt
biostacks_example.txt       opensnoop.bt           tcplife.bt
bitesize.bt                 opensnoop_example.txt  tcplife_example.txt
bitesize_example.txt        pidpersec.bt           tcpretrans.bt
capable.bt                  pidpersec_example.txt  tcpretrans_example.txt
capable_example.txt         runqlat.bt             tcpsynbl.bt
CMakeLists.txt              runqlat_example.txt    tcpsynbl_example.txt
cpuwalk.bt                  runqlen.bt             threadsnoop.bt
cpuwalk_example.txt         runqlen_example.txt    threadsnoop_example.txt
dcsnoop.bt                  setuids.bt             vfscount.bt
dcsnoop_example.txt         setuids_example.txt    vfscount_example.txt
execsnoop.bt                statsnoop.bt           vfsstat.bt
execsnoop_example.txt       statsnoop_example.txt  vfsstat_example.txt
gethostlatency.bt           swapin.bt              writeback.bt
gethostlatency_example.txt  swapin_example.txt     writeback_example.txt
killsnoop.bt                syncsnoop.bt           xfsdist.bt
killsnoop_example.txt       syncsnoop_example.txt  xfsdist_example.txt
loads.bt                    syscount.bt

I highlighted the tools I am going to try. But to begin with, let's check the source code of one of them, biosnoop.bt, with quite non-trivial, 48-line-long code. My comments follow each code fragment below:

#!/usr/bin/env bpftrace
#include <linux/blkdev.h>

/*
 * biosnoop.bt   Block I/O tracing tool, showing per I/O latency.
 *               For Linux, uses bpftrace, eBPF.
 *
 * TODO: switch to block tracepoints. Add offset and size columns.
 *
 * This is a bpftrace version of the bcc tool of the same name.
 *
 * 15-Nov-2017    Brendan Gregg    Created this.
 */

Here we can see how to use a shebang first line to run the program with bpftrace if it's executable. The next line shows that in some cases bpftrace (like other eBPF tools) may need headers (in this case a kernel header) to be able to resolve references to complex structures passed as arguments.

BEGIN
{
    printf("%-12s %-7s %-16s %-6s %7s\n", "TIME(ms)", "DISK", "COMM", "PID", "LAT(ms)");
}

Next we see the BEGIN probe that, same as with awk, is executed once at the beginning of the program and in this case prints the formatted header for the further output.

kprobe:blk_account_io_start
{
    @start[arg0] = nsecs;
    @iopid[arg0] = pid;
    @iocomm[arg0] = comm;
    @disk[arg0] = ((struct request *)arg0)->rq_disk->disk_name;
}

Here we define a kernel probe for the blk_account_io_start function and store information in 4 associative arrays indexed by arg0: the start time of the call in nanoseconds (the nsecs builtin), the pid of the calling program, the calling command itself (comm), and the disk name that we get from the deeply nested structure of the first traced function call argument, arg0, via a type cast and pointers. That's why we needed kernel headers: to reference different structure members by name and eventually de-reference to the proper offset/address.

kprobe:blk_account_io_done
/@start[arg0] != 0 && @iopid[arg0] != 0 && @iocomm[arg0] != ""/

{
    $now = nsecs;
    printf("%-12u %-7s %-16s %-6d %7d\n",
        elapsed / 1e6, @disk[arg0], @iocomm[arg0], @iopid[arg0],
        ($now - @start[arg0]) / 1e6);

    delete(@start[arg0]);
    delete(@iopid[arg0]);
    delete(@iocomm[arg0]);
    delete(@disk[arg0]);
}

In the probe above, defined for the blk_account_io_done function, we first make sure to do something only if blk_account_io_start was already called with the same argument, otherwise we would not match the times properly. The complication here is that the I/O is completed not when the start function returns, but when the kernel later informs the caller that we are done, so it's not as easy as a kprobe/kretprobe pair on the same function.

The action of the probe is simple. We calculate the time difference in milliseconds by comparing the stored start timestamp with the current nsecs value and output the details collected. Then, and this is essential, we delete the element from the associative array (a.k.a. map) with delete(), so that repeated calls from the same process to the same disk are not mixed up together and we do not use more memory than really needed.

END
{
    clear(@start);
    clear(@iopid);
    clear(@iocomm);
    clear(@disk);
}

In the END probe we clear() all associative arrays that were used by the program, essentially deleting all the items in the maps. Otherwise, as we've seen in my lame examples previously, they are dumped at the end of the program, and for this tool that produces monitoring output while it runs, we definitely do not need that.
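The start/done pairing logic built on maps keyed by arg0 can be modelled in a few lines of plain Python (names and numbers below are mine, purely for illustration):

```python
# Model of biosnoop's map logic: key each in-flight request by its
# request pointer, record the start timestamp, and on completion pop
# the entry so repeated requests reusing the same key never mix and
# the map does not grow without bound.
start_ns = {}

def io_start(req, now_ns):
    start_ns[req] = now_ns

def io_done(req, now_ns):
    t0 = start_ns.pop(req, None)   # pop plays the role of bpftrace's delete()
    if t0 is None:
        return None                # a "done" without a matching "start" is ignored
    return (now_ns - t0) / 1e6     # latency in milliseconds

io_start(0x1000, 5_000_000)
lat = io_done(0x1000, 9_000_000)
print(lat)                  # 4.0 (ms)
print(io_done(0x2000, 1))   # None: unmatched completion
```

This is exactly why the guard `/@start[arg0] != 0 .../` and the delete() calls matter: without them the map would keep stale entries and mismatch latencies.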

Now that we better understand how "real" bpftrace programs are usually structured and designed, let's try to run biosnoop.bt that traces block I/O, and shows the issuing process (at least, the process that was on-CPU at the time of queue insert) and the latency of the I/O:

[root@fc31 tools]# ./biosnoop.bt
Attaching 4 probes...
TIME(ms)     DISK    COMM             PID    LAT(ms)
98           sda     mariadbd         6147         0
98           sda     mariadbd         6147         0
...
195          sda     mariadbd         6147        92
196          sda     mariadbd         6147         0
...
201          sda     mariadbd         6147         0
274          sda     mariadbd         6147        73
275          sda     mariadbd         6147         0
278          sda     mariadbd         6147         2
293          sda     mariadbd         6147         9
295          sda     mariadbd         6147         0
295          sda     mariadbd         6147         0
303          sda     mariadbd         6147         7
303          sda     mariadbd         6147         0
303          sda     mariadbd         6147         0
303          sda     mariadbd         6147         0
304          sda     mariadbd         6147         0
304          sda     mariadbd         6147         0
305          sda     mariadbd         6147         0
305          sda     mariadbd         6147         0
306          sda     mariadbd         6147         0
335          sda     mariadbd         6147        29
336          sda     jbd2/dm-0-8      419         14
337          sda     mariadbd         6147         1
365          sda     jbd2/dm-0-8      419         28
365          sda     mariadbd         6147        28
365          sda     kworker/2:4      24472        0

392          sda     mariadbd         6147        26
...

Here we see block I/O requests for my only disk, sda, from the mariadbd and a few other processes, with timestamps since startup and the related latency (less than 1 millisecond in most cases, but sometimes approaching 100 milliseconds on this HDD).

The next tool to check is biolatency.bt that traces block I/O and shows latency as a power-of-2 histogram using the hist() function:

[root@fc31 tools]# ./biolatency.bt
Attaching 4 probes...
Tracing block device I/O... Hit Ctrl-C to end.
^C


@usecs:
[128, 256)            31 |@@                                                  |
[256, 512)           421 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                       |
[512, 1K)            754 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1K, 2K)              94 |@@@@@@                                              |
[2K, 4K)              38 |@@                                                  |
[4K, 8K)              36 |@@                                                  |
[8K, 16K)             65 |@@@@                                                |
[16K, 32K)           109 |@@@@@@@                                             |
[32K, 64K)            80 |@@@@@                                               |
[64K, 128K)           25 |@                                                   |
[128K, 256K)          16 |@                                                   |

So, during the monitoring interval, until I hit Ctrl+C, the majority of block I/O calls had a latency of 256 to 1024 microseconds, less than 1 millisecond. There were a few longer calls too.
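The power-of-2 binning that hist() performs is easy to reproduce; here is a rough Python sketch of the idea (my own illustration, not bpftrace code):

```python
# Count latency values into [2^k, 2^(k+1)) buckets, the same kind of
# binning bpftrace's hist() does before drawing its bar chart.
from collections import Counter

def log2_bucket(us):
    """Return the (low, high) power-of-two bucket for a value >= 1."""
    k = max(0, us.bit_length() - 1)      # floor(log2(us)) for us >= 1
    return (1 << k, 1 << (k + 1))

hist = Counter(log2_bucket(v) for v in [300, 400, 700, 900, 1500])
print(sorted(hist.items()))
# [((256, 512), 2), ((512, 1024), 2), ((1024, 2048), 1)]
```

Keeping only a counter per bucket is what makes hist() so cheap in kernel context: no raw samples are stored, just one integer per power-of-two range.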

The next tool to check is biostacks.bt that is supposed to show block I/O latency as a histogram, with the kernel stack trace that initiated the I/O (do you still remember /proc kernel stacks sampling that can be used for similar purposes?). This can help explain disk I/O that is not directly requested by applications. I got the following:

[root@fc31 tools]# ./biostacks.bt > /tmp/biostacks.txt
cannot attach kprobe, probe entry may not exist
WARNING: could not attach probe kprobe:blk_start_request, skipping.
Attaching 5 probes...
Tracing block I/O with init stacks. Hit Ctrl-C to end.

^C[root@fc31 tools]# more /tmp/biostacks.txt

...

@usecs[
    blk_account_io_start+1
    blk_mq_make_request+481
    generic_make_request+653
    submit_bio+75
    ext4_io_submit+73
    ext4_writepages+694
    do_writepages+51
    __filemap_fdatawrite_range+172
    file_write_and_wait_range+107
    ext4_sync_file+240
    do_fsync+56
    __x64_sys_fdatasync+19
    do_syscall_64+77
    entry_SYSCALL_64_after_hwframe+68
]:
[4K, 8K)               2 |@@                                                  |
[8K, 16K)              2 |@@                                                  |
[16K, 32K)             9 |@@@@@@@@@@@@                                        |
[32K, 64K)             0 |                                                    |
[64K, 128K)            1 |@                                                   |
[128K, 256K)          38 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256K, 512K)          19 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[512K, 1M)             5 |@@@@@@                                              |
[1M, 2M)               0 |                                                    |
[2M, 4M)               4 |@@@@@                                               |
[4M, 8M)               9 |@@@@@@@@@@@@                                        |
[8M, 16M)              2 |@@                                                  |
[16M, 32M)             1 |@                                                   |
[32M, 64M)             1 |@                                                   |
[64M, 128M)            1 |@                                                   |

@usecs[
    blk_account_io_start+1
    blk_mq_make_request+481
    generic_make_request+653
    submit_bio+75
    ext4_io_submit+73
    ext4_bio_write_page+609
    mpage_submit_page+97
    mpage_process_page_bufs+274
    mpage_prepare_extent_to_map+437
    ext4_writepages+668
    do_writepages+51
    __filemap_fdatawrite_range+172
    file_write_and_wait_range+107
    ext4_sync_file+240
    do_fsync+56
    __x64_sys_fdatasync+19
    do_syscall_64+77
    entry_SYSCALL_64_after_hwframe+68
]:
[16K, 32K)             8 |@                                                   |
[32K, 64K)            31 |@@@@                                                |
[64K, 128K)           14 |@@                                                  |
[128K, 256K)         346 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256K, 512K)         180 |@@@@@@@@@@@@@@@@@@@@@@@@@@@                         |
[512K, 1M)            22 |@@@                                                 |
[1M, 2M)               2 |                                                    |
[2M, 4M)               3 |                                                    |
[4M, 8M)               3 |                                                    |
[8M, 16M)              2 |                                                    |
[16M, 32M)             2 |                                                    |
[32M, 64M)             1 |                                                    |
[64M, 128M)            1 |                                                    |

I kept only a couple of typical stacks with high enough latency, both related to fdatasync on ext4.

bitesize.bt shows the usual block I/O request sizes, in bytes, per program. This is what I got for MariaDB:

[root@fc31 tools]# ./bitesize.bt
Attaching 3 probes...
Tracing block device I/O... Hit Ctrl-C to end.
^C
I/O size (bytes) histograms by process name:

@[NetworkManager]:
[4K, 8K)               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@[jbd2/dm-0-8]:
[4K, 8K)               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8K, 16K)              0 |                                                    |
[16K, 32K)             0 |                                                    |
[32K, 64K)             0 |                                                    |
[64K, 128K)            0 |                                                    |
[128K, 256K)           1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

...
@[mariadbd]:
[0]                    3 |@                                                   |
[1]                    0 |                                                    |
[2, 4)                 0 |                                                    |
[4, 8)                 0 |                                                    |
[8, 16)                0 |                                                    |
[16, 32)               0 |                                                    |
[32, 64)               0 |                                                    |
[64, 128)              0 |                                                    |
[128, 256)             0 |                                                    |
[256, 512)             0 |                                                    |
[512, 1K)              0 |                                                    |
[1K, 2K)               0 |                                                    |
[2K, 4K)               0 |                                                    |
[4K, 8K)               0 |                                                    |
[8K, 16K)              0 |                                                    |
[16K, 32K)            94 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K)            26 |@@@@@@@@@@@@@@                                      |
[64K, 128K)           16 |@@@@@@@@                                            |
[128K, 256K)           8 |@@@@                                                |
[256K, 512K)           1 |                                                    |
[512K, 1M)             2 |@                                                   |
[1M, 2M)               7 |@@@                                                 |
...

The majority of I/Os were in the 16K to 32K range, one or two InnoDB data pages.
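Under the hood bitesize.bt is a tiny bpftrace program; its core is essentially a single probe on the block request tracepoint (simplified fragment, shown here only to illustrate the idea — check the actual tool source for details):

```
tracepoint:block:block_rq_issue
{
    @[args->comm] = hist(args->bytes);
}
```

That is, one wildcard-free probe and one map keyed by process name are enough to get the per-program histograms above.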

The next well-known tool is opensnoop.bt, which traces the open() syscall system-wide and prints various details. This is what I got while the sysbench test was starting:

[root@fc31 tools]# ./opensnoop.bt
Attaching 6 probes...
Tracing open syscalls... Hit Ctrl-C to end.
PID    COMM               FD ERR PATH
...
25558  sysbench           29   0 /usr/local/share/sysbench/oltp_read_write.lua
25558  sysbench           29   0 /usr/local/share/sysbench/oltp_read_write.lua
6147   mariadbd           45   0 ./sbtest/db.opt
6147   mariadbd           31   0 ./sbtest/db.opt
6147   mariadbd           35   0 ./sbtest/db.opt
6147   mariadbd           29   0 ./sbtest/db.opt
6147   mariadbd           27   0 ./sbtest/db.opt

25558  sysbench           -1   2 ./oltp_common.lua
25558  sysbench           -1   2 ./oltp_common/init.lua
25558  sysbench           -1   2 ./src/lua/oltp_common.lua
25558  sysbench           -1   2 /home/openxs/.luarocks/share/lua/5.1/oltp_common.lua
25558  sysbench           -1   2 /home/openxs/.luarocks/share/lua/5.1/oltp_common/init.lua
25558  sysbench           -1   2 /home/openxs/.luarocks/share/lua/oltp_common.lua
25558  sysbench           -1   2 /home/openxs/.luarocks/share/lua/oltp_common/init.lua
25558  sysbench           -1   2 /usr/local/share/lua/5.1/oltp_common.lua
25558  sysbench           -1   2 /usr/share/lua/5.1/oltp_common.lua

25558  sysbench           32   0 /usr/local/share/sysbench/oltp_common.lua
25558  sysbench           32   0 /usr/local/share/sysbench/oltp_common.lua
6147   mariadbd           80   0 ./sbtest/db.opt
...

It's interesting to find out that sysbench, based on the test name, tries to open .lua files in some predefined locations where they do not exist (error 2 is ENOENT, "No such file or directory").
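The pattern is easy to model: a module loader walks a list of candidate locations and stops at the first successful open(), so every miss is a visible open() returning ENOENT, exactly what opensnoop shows as ERR 2. A minimal Python sketch (the paths are hypothetical, just for illustration):

```python
import errno
import os

def find_module(name, search_paths):
    """Try candidate paths in order, the way a Lua-style module loader does;
    every failed attempt shows up in opensnoop as open() with ERR 2."""
    attempts = []
    for template in search_paths:
        path = template.replace('?', name)
        try:
            fd = os.open(path, os.O_RDONLY)   # the open() call opensnoop traces
            os.close(fd)
            return path, attempts             # first successful open wins
        except OSError as e:
            attempts.append((path, e.errno))  # errno 2 == ENOENT
    return None, attempts
```

So the run of ERR 2 lines above is not a problem, just the loader probing its search path until it finds oltp_common.lua in /usr/local/share/sysbench.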

There are also similar tools to trace stat() and sync() calls:

[root@fc31 tools]# ./statsnoop.bt
Attaching 10 probes...
Tracing stat syscalls... Hit Ctrl-C to end.
PID    COMM             ERR PATH
...
6147   mariadbd           0 /home/openxs/dbs/maria10.5/data/mysql/table_stats.MAI
6147   mariadbd           0 ./mysql/table_stats.MAI
6147   mariadbd           0 ./mysql/table_stats.MAD
6147   mariadbd           0 /home/openxs/dbs/maria10.5/data/mysql
6147   mariadbd           0 /home/openxs/dbs/maria10.5/data/mysql/column_stats.MAI
6147   mariadbd           0 ./mysql/column_stats.MAI
6147   mariadbd           0 ./mysql/column_stats.MAD

...
6147   mariadbd           0 /home/openxs/dbs/maria10.5/data/mysql/table_stats.MAI
6147   mariadbd           0 ./mysql/table_stats.MAI
6147   mariadbd           0 /home/openxs/dbs/maria10.5/data/mysql
6147   mariadbd           0 /home/openxs/dbs/maria10.5/data/mysql/column_stats.MAI
6147   mariadbd           0 ./mysql/column_stats.MAI
6147   mariadbd           0 /home/openxs/dbs/maria10.5/data/mysql
6147   mariadbd           0 /home/openxs/dbs/maria10.5/data/mysql/table_stats.MAI
6147   mariadbd           0 ./mysql/table_stats.MAI
6147   mariadbd           0 ./sbtest/sbtest3.frm
6147   mariadbd           0 ./sbtest/sbtest1.frm
6147   mariadbd           0 ./sbtest/sbtest2.frm

...

[root@fc31 tools]# ./syncsnoop.bt
Attaching 7 probes...
Tracing sync syscalls... Hit Ctrl-C to end.
TIME      PID    COMM             EVENT
12:10:36  621    auditd           tracepoint:syscalls:sys_enter_fsync
12:10:36  6147   mariadbd         tracepoint:syscalls:sys_enter_fdatasync
12:10:36  6147   mariadbd         tracepoint:syscalls:sys_enter_fdatasync

...

We see that the MariaDB server checks not only the sbtest.* tables used in the test, but also the tables with engine-independent statistics in the mysql database. In the latter output we also see the wall-clock timestamp of each call (something the Performance Schema does not provide, by the way), produced by the time() builtin function of bpftrace. Note that the printed timestamp is asynchronous: it is the time at which userspace processed the queued-up event, not the time at which the bpftrace probe called time().
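For reference, the per-event lines above come from probes along these lines (a simplified fragment in the spirit of syncsnoop.bt, with time() producing that userspace-side timestamp):

```
tracepoint:syscalls:sys_enter_fsync,
tracepoint:syscalls:sys_enter_fdatasync
{
    time("%H:%M:%S  ");
    printf("%-6d %-16s %s\n", pid, comm, probe);
}
```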

Some tools may not work at all. For example:

[root@fc31 tools]# ./threadsnoop.bt
./threadsnoop.bt:19-21: ERROR: uprobe target file '/lib/x86_64-linux-gnu/libpthread.so.0' does not exist or is not executable

[openxs@fc31 ~]$ ldd /home/openxs/dbs/maria10.5/bin/mariadbd | grep thread
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f3d957bf000)

That's because the library in my case is in a different directory, while the tool has its path hardcoded in the uprobe definition.
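A quick workaround (assuming you don't mind editing the tool in place) is to rewrite the hardcoded library path to the one ldd reported:

```shell
# Point the uprobes in threadsnoop.bt at the actual libpthread location
# found via ldd above; the original path is Ubuntu/Debian-specific.
sed -i 's|/lib/x86_64-linux-gnu/libpthread.so.0|/lib64/libpthread.so.0|g' ./threadsnoop.bt
```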

One can trace vfs_* functions too. For example, vfscount.bt just traces and counts all VFS calls:

[root@fc31 tools]# ./vfscount.bt
Attaching 65 probes...
Tracing VFS calls... Hit Ctrl-C to end.
^C

@[vfs_test_lock]: 2
@[vfs_symlink]: 3
@[vfs_setxattr]: 3
@[vfs_getxattr]: 4
@[vfs_mkdir]: 9
@[vfs_rmdir]: 9
@[vfs_rename]: 19
@[vfs_readlink]: 105
@[vfs_unlink]: 113
@[vfs_fallocate]: 394
@[vfs_statfs]: 450
@[vfs_lock_file]: 1081
@[vfs_fsync_range]: 1752
@[vfs_statx]: 2789
@[vfs_statx_fd]: 2846
@[vfs_open]: 2925
@[vfs_getattr]: 5360
@[vfs_getattr_nosec]: 5490
@[vfs_writev]: 6340
@[vfs_readv]: 12482
@[vfs_write]: 161284
@[vfs_read]: 307655
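The whole tool is remarkably short: a wildcard kprobe attaches to every vfs_* function (hence the 65 probes reported above) and counts calls per function name:

```
kprobe:vfs_*
{
    @[func] = count();
}
```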

We can also summarize calls per second, as vfsstat.bt does:

[root@fc31 tools]# ./vfsstat.bt
Attaching 11 probes...
Tracing key VFS calls... Hit Ctrl-C to end.
12:15:14
@[vfs_open]: 22
@[vfs_writev]: 64
@[vfs_readv]: 124
@[vfs_read]: 1631
@[vfs_write]: 2015

12:15:15
@[vfs_readv]: 96
@[vfs_write]: 1006
@[vfs_read]: 1201
@[vfs_writev]: 3093

12:15:16
@[vfs_open]: 54
@[vfs_readv]: 139
@[vfs_writev]: 153
@[vfs_read]: 2640
@[vfs_write]: 4003

12:15:17
@[vfs_open]: 6
@[vfs_writev]: 89
@[vfs_readv]: 132
@[vfs_write]: 1709
@[vfs_read]: 3904

12:15:18
@[vfs_readv]: 271
@[vfs_write]: 1689
@[vfs_writev]: 2709
@[vfs_read]: 4479

^C

The interval probe is used for this in the code:

interval:s:1
{
    time();
    print(@);
    clear(@);
}

Let me check the last tool for today, writeback.bt, which traces when the kernel writeback procedure writes dirty pages to disk, and shows details such as the time, device numbers, the reason for the writeback, and the duration:

[root@fc31 tools]# ./writeback.bt
Attaching 4 probes...
Tracing writeback... Hit Ctrl-C to end.
TIME      DEVICE   PAGES    REASON           ms
12:16:02  253:0    65486    background       0.000
12:16:02  253:0    43351    periodic         0.005
12:16:02  253:0    43351    periodic         0.005
12:16:02  253:0    43351    periodic         0.000
12:16:02  253:0    65534    background       0.045
12:16:02  253:0    65534    background       0.000
12:16:05  253:0    43508    periodic         0.006
12:16:06  253:0    43575    periodic         0.004
12:16:06  8:0      43575    periodic         0.004
12:16:06  253:0    43575    periodic         0.001
12:16:07  253:0    65495    background       434.285
12:16:07  253:0    43947    periodic         0.005
12:16:07  253:0    43676    periodic         0.000
12:16:07  253:0    43549    periodic         22.272
12:16:07  253:0    43549    periodic         0.001
12:16:11  253:0    43604    periodic         301.541
12:16:11  253:0    43528    periodic         147.890
12:16:11  253:0    43433    periodic         119.225
12:16:11  253:0    43433    periodic         0.004
12:16:11  253:0    43433    periodic         0.000
^C

We clearly see notable time spent on background writeback at 12:16:07 and on periodic flushes at 12:16:11.

Yet another source of tools to check is Brendan Gregg's book and the related source code examples. You can get them from GitHub as follows:

[openxs@fc31 git]$ git clone https://github.com/brendangregg/bpf-perf-tools-book.git
Cloning into 'bpf-perf-tools-book'...
remote: Enumerating objects: 600, done.
remote: Total 600 (delta 0), reused 0 (delta 0), pack-reused 600
Receiving objects: 100% (600/600), 991.41 KiB | 2.73 MiB/s, done.
Resolving deltas: 100% (394/394), done.
[openxs@fc31 git]$ cd bpf-perf-tools-book/
[openxs@fc31 bpf-perf-tools-book]$ ls
exercises  images  originals  README.md  updated

One day I'll check them in more detail and share some outputs. This blog post is already too long...

Tracing across the River Thames with minimal impact back in 2019 :)

* * *

To summarize:

  1. You should check all the tools/*.bt tools and related examples, both to know what is ready to use and to study how proper bpftrace programs are written.
  2. Some tools may rely on kernel headers being available, or on specific pathnames of the libraries they probe.
  3. The tools presented in this post are good for studying disk I/O issues that may impact MariaDB performance. Some of them are useful alternatives to running strace against everything...
  4. See also my older post "BCC Tools for disk I/O Analysis and More", where bcc tools were used to monitor disk I/O on older Ubuntu 16.04. Some of them are more advanced and may have no bpftrace-based alternatives, but we all know that bpftrace is the future :)