Blog of (former?) MySQL Entomologist: January 2013

Thursday, January 31, 2013

On InnoDB I/O threads states

I was asked today what different state values for InnoDB I/O threads really mean, these ones:

--------
FILE I/O
--------
I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
I/O thread 1 state: waiting for completed aio requests (log thread)
I/O thread 2 state: waiting for completed aio requests (read thread)
...

I tried to search the manual and Web in general and found no useful explanation (these verbose values should be self explanatory by design it seems). As question was asked, probably it's time to try to answer it...

These states for I/O threads are set by srv_set_io_thread_op_info() function. Quick search with grep on MySQL 5.5.29 source code tree gives us some hints, the rest we can try to find out via code review based on these hits:

[openxs@chief mysql-5.5]$ grep -rn srv_set_io_thread_op_info * 2>/dev/null
storage/innobase/srv/srv0srv.c:772:srv_set_io_thread_op_info(
storage/innobase/os/os0file.c:3494:             srv_set_io_thread_op_info(i, "not started yet");
storage/innobase/os/os0file.c:4370:             srv_set_io_thread_op_info(orig_seg, "wait Windows aio");
storage/innobase/os/os0file.c:4394:             srv_set_io_thread_op_info(orig_seg,
storage/innobase/os/os0file.c:4681:             srv_set_io_thread_op_info(global_seg,
storage/innobase/os/os0file.c:4691:     srv_set_io_thread_op_info(global_seg,
storage/innobase/os/os0file.c:4787:     srv_set_io_thread_op_info(global_segment,
storage/innobase/os/os0file.c:4805:     srv_set_io_thread_op_info(global_segment,
storage/innobase/os/os0file.c:4946:     srv_set_io_thread_op_info(global_segment, "consecutive i/o requests");
storage/innobase/os/os0file.c:4989:     srv_set_io_thread_op_info(global_segment, "doing file i/o");
storage/innobase/os/os0file.c:5010:     srv_set_io_thread_op_info(global_segment, "file i/o done");
storage/innobase/os/os0file.c:5062:     srv_set_io_thread_op_info(global_segment, "resetting wait event");
storage/innobase/os/os0file.c:5072:     srv_set_io_thread_op_info(global_segment, "waiting for i/o request");
storage/innobase/include/srv0srv.h:499:srv_set_io_thread_op_info(
storage/innobase/include/srv0srv.h:766:# define srv_set_io_thread_op_info(t,info)       ((void) 0)
storage/innobase/fil/fil0fil.c:4581:            srv_set_io_thread_op_info(segment, "native aio handle");
storage/innobase/fil/fil0fil.c:4593:            srv_set_io_thread_op_info(segment, "simulated aio handle");
storage/innobase/fil/fil0fil.c:4605:    srv_set_io_thread_op_info(segment, "complete io for fil node");
storage/innobase/fil/fil0fil.c:4622:            srv_set_io_thread_op_info(segment, "complete io for buf page");
storage/innobase/fil/fil0fil.c:4625:            srv_set_io_thread_op_info(segment, "complete io for log");

Those who are too lazy to pull source code can just use this beautiful online tool to review few files listed above and make their own conclusions.

Unfortunately code formatting is different for different calls, so we still miss some states above as strings. Let's review the files then. It all starts with os_aio_init() function on line of 3494 of os0file.c:

3493 for (i = 0; i < n_segments; i++) {

3494 srv_set_io_thread_op_info(i, "not started yet");

3495 }

It is clear what's going on here, you hardly will be able to catch this status in the output as it is set even before I/O threads are actually started.

Our next stop is line 4370, in os_aio_windows_handle() that, obviously, works only on Windows:

     /* NOTE! We only access constant fields in os_aio_array. Therefore

     we do not have to acquire the protecting mutex yet */

     ut_ad(os_aio_validate_skip());

     ut_ad(segment < array->n_segments);

     n = array->n_slots / array->n_segments;

     if (array == os_aio_sync_array) {

         WaitForSingleObject(

             os_aio_array_get_nth_slot(array, pos)->handle,

             INFINITE);

         i = pos;

     } else {

         srv_set_io_thread_op_info(orig_seg, "wait Windows aio");

         i = WaitForMultipleObjects((DWORD) n,

                        array->handles + segment * n,

                        FALSE,

                        INFINITE);

     }

So, "wait" here means that we do wait for completion. Later in the same function we are checking result:

4393 if (orig_seg != ULINT_UNDEFINED) {

4394 srv_set_io_thread_op_info(orig_seg,

4395 "get windows aio return value");

4396 }

4397

4398 ret = GetOverlappedResult(slot->file, &(slot->control), &len, TRUE);

That's all states that may be set in os_aio_windows_handle(). Next stop is line 4681 in an infinite loop in the os_aio_linux_handle() function:

     /* Loop until we have found a completed request. */

     for (;;) {

         ibool   any_reserved = FALSE;

         os_mutex_enter(array->mutex);

         for (i = 0; i < n; ++i) {

             slot = os_aio_array_get_nth_slot(

                 array, i + segment * n);

             if (!slot->reserved) {

                 continue;

             } else if (slot->io_already_done) {

                 /* Something for us to work on. */

                 goto found;

             } else {

                 any_reserved = TRUE;

             }

         }

...

         /* Wait for some request. Note that we return

         from wait iff we have found a request. */

         srv_set_io_thread_op_info(global_seg,

             "waiting for completed aio requests");

         os_aio_linux_collect(array, segment, n);
     }

 found:

     /* Note that it may be that there are more then one completed

     IO requests. We process them one at a time. We may have a case

     here to improve the performance slightly by dealing with all

     requests in one sweep. */

     srv_set_io_thread_op_info(global_seg,

                 "processing completed aio requests");

So state "waiting for completed aio requests" means that there is no pending request and no completed request, so we have nothing to do. When at least some AIO request is completed., state will be "processing completed aio requests". All these for real AIO on Linux.

In case of simulated AIO we have  os_aio_simulated_handle() function, where state is set on line 4787:

 restart:

     /* NOTE! We only access constant fields in os_aio_array. Therefore

     we do not have to acquire the protecting mutex yet */

     srv_set_io_thread_op_info(global_segment,

                   "looking for i/o requests (a)");

Later in the same function:

     /* Look through n slots after the segment * n'th slot */

     if (array == os_aio_read_array

         && os_aio_recommend_sleep_for_read_threads) {

         /* Give other threads chance to add several i/os to the array

         at once. */

         goto recommended_sleep;

     }

     srv_set_io_thread_op_info(global_segment,

                   "looking for i/o requests (b)");

So, here we have stages (a) and (b), before and after sleep, and here is the recommended_sleep label in the same function (before we do sleep state "waiting for i/o request" is set):  

 wait_for_io:

     srv_set_io_thread_op_info(global_segment, "resetting wait event");

     /* We wait here until there again can be i/os in the segment

     of this thread */

     os_event_reset(os_aio_segment_wait_events[global_segment]);

     os_mutex_exit(array->mutex);

 recommended_sleep:

     srv_set_io_thread_op_info(global_segment, "waiting for i/o request");

     os_event_wait(os_aio_segment_wait_events[global_segment]);

     if (os_aio_print_debug) {

         fprintf(stderr,

             "InnoDB: i/o handler thread for i/o"

             " segment %lu wakes up\n",

             (ulong) global_segment);

     }

     goto restart;

By the way, we end up at that wait_for_io label when there is no I/O to do by simulated AIO, so "resetting wait event" state is set for a very short period. 

Let's continue with os_aio_simulated_handle() back from line 4946:

     srv_set_io_thread_op_info(global_segment, "consecutive i/o requests");

     /* We have now collected n_consecutive i/o requests in the array;

     allocate a single buffer which can hold all data, and perform the

     i/o */

This state is set if we picked up some request and found there are consecutive requests that can be processed as an array. This state is set only while we are allocating the buffer, for real I/O there is another state, "doing file i/o":

     srv_set_io_thread_op_info(global_segment, "doing file i/o");

     if (os_aio_print_debug) {

         fprintf(stderr,

             "InnoDB: doing i/o of type %lu at offset %lu %lu,"

             " length %lu\n",

             (ulong) slot->type, (ulong) slot->offset_high,

             (ulong) slot->offset, (ulong) total_len);

     }

     /* Do the i/o with ordinary, synchronous i/o functions: */

     if (slot->type == OS_FILE_WRITE) {

         ret = os_file_write(slot->name, slot->file, combined_buf,

                     slot->offset, slot->offset_high,

                     total_len);

     } else {

         ret = os_file_read(slot->file, combined_buf,

                    slot->offset, slot->offset_high, total_len);

     }

     ut_a(ret);

     srv_set_io_thread_op_info(global_segment, "file i/o done");

When sync I/O is completed, state "file i/o done" is set, as you can see above.After we've done I/O and before next sleep and restart we copy data back to original several buffers, set/reset everything properly and call to os_aio_simulated_handle() is completed:

     /* We return the messages for the first slot now, and if there were

     several slots, the messages will be returned with subsequent calls

     of this function */

 slot_io_done:

     ut_a(slot->reserved);

     *message1 = slot->message1;

     *message2 = slot->message2;

     *type = slot->type;

     os_mutex_exit(array->mutex);

     os_aio_array_free_slot(array, slot);

     return(ret);

That's all, so it is either Windows native AIO (a couple of states only) or Linux native AIO (a couple of states only), or simulated AIO (and in this case many states are possible). Probably that's why nobody cared to explain details - in real production situations with native AIO there are few states to consider anyway.

Now, there are some lines to check in fil0fil.c also,  fil_aio_wait() function:

     if (srv_use_native_aio) {

         srv_set_io_thread_op_info(segment, "native aio handle");

 #ifdef WIN_ASYNC_IO

         ret = os_aio_windows_handle(segment, 0, &fil_node,

                         &message, &type);

 #elif defined(LINUX_NATIVE_AIO)

         ret = os_aio_linux_handle(segment, &fil_node,

                       &message, &type);

 #else

         ut_error;

         ret = 0; /* Eliminate compiler warning */

 #endif

     } else {

         srv_set_io_thread_op_info(segment, "simulated aio handle");

         ret = os_aio_simulated_handle(segment, &fil_node,

                           &message, &type);

     }

So, here decision is made (at compile time) on what functions to call from three described above and, for some short time, state reflects type of AIO that is going to be used.

When whatever AIO function called completes, we have this code running:

     srv_set_io_thread_op_info(segment, "complete io for fil node");

     mutex_enter(&fil_system->mutex);

     fil_node_complete_io(fil_node, fil_system, type);

     mutex_exit(&fil_system->mutex);

     ut_ad(fil_validate_skip());

     /* Do the i/o handling */

     /* IMPORTANT: since i/o handling for reads will read also the insert

     buffer in tablespace 0, you have to be very careful not to introduce

     deadlocks in the i/o system. We keep tablespace 0 data files always

     open, and use a special i/o thread to serve insert buffer requests. */

     if (fil_node->space->purpose == FIL_TABLESPACE) {

         srv_set_io_thread_op_info(segment, "complete io for buf page");

         buf_page_io_complete(message);

     } else {

         srv_set_io_thread_op_info(segment, "complete io for log");

         log_io_complete(message);

     }

Here state reflects type of I/O completed, data page or redo log.

That's all. I have to think what is the better way to summarize the states and possible transitions for each state. Maybe just a table, or something like a state machine represented somehow. I should think about this. So, to be continued...

Sunday, January 27, 2013

Fun with Bugs, Issue #3, January 2013

This week in my posts on Facebook I paid attention mostly to bugs in MySQL 5.6.x, as I really expected to see 5.6 GA announcement soon. Maybe some of the bugs were the reason to postpone release a bit. Anyway, let's start with serious bug reports and some trends I've noted.

I've noted that some of "Verified" bug reports somehow silently miss comments about their status in MySQL 5.6. Check Bug #68137, for example. It is repeatable on 5.5.31 (yet to be released, but it means 5.5.30 is cloned off and will be released soon) and 5.7.1. But 5.6.x is not mentioned anywhere. Does it mean 5.6.x is not affected? Surely, no. I hope it does NOT mean any attempt by Oracle to pretend that 5.6 is bugs free. Looks like Sveta had just forgotten to check or report result on this version when following her great bugs verification procedure.

I would say that the above was just a one time mistake, but then look at Bug #68145 that was verified by Sveta in less than 30 minutes after my last comment there (and "verification", at the level I can do it as community member busy with main job actually). It was verified on many versions again, 5.0.97, 5.1.69, 5.5.31, 5.7.1, but NOT on 5.6.x. This looks like a trend now, and I do not like it. There are bugs in 5.6.9, 5.6.10 and even 5.6.11, there is nothing bad about this if they are noted, processed and fixed fast. There is not (good) reason to skip checking on mentioning this version. I am monitoring carefully anyway, so nothing in public database will pass unnoticed and unknown to MySQL community.

Serious bug in InnoDB was reported (and verified on both 5.5.27, 5.6.9-rc and recent 5.6 code) this week, Bug #68148. After executing SET FOREIGN_KEY_CHECKS=0; you can easily drop index unsed by foreign key and then the tables becomes unusable (you can not even drop it with SQL statement) after server restart. As it is not a recent regression, I am not sure it would be easy to fix, so better just avoid DDL with the setting above.

When 5.6 is approaching GA it is really strange to see bugs in new/key features of MySQL 5.6 with unclear status. Check Bug #66916 (replication). We can only guess is it fixed in 5.6.10 or not. Check Bug #66865 (performance regression in 5.6 that is all about performance) reported by my colleague Vadim more than 4 months ago. It is still just "Open". So, is there a regression or not? Or it is just forgotten?

When bugs are forgotten and test cases are not added to the regression test suite (now we, in Community, can not check if they are added), we are doomed to see the same problems again and again. If you are Oracle customer, ask them what Bug #68178 is about (in a support request probably), what versions are affected and had it been previously reported and fixed. If you care what is it about (Oracle will probably just tell you to upgrade to the version with the fix eventually), it is about an easy way to crash MySQL server by sending many smart SELECT convert_tz(...) in a row. I hope 5.6 will NOT be released as GA before this bug is fixed.

Forgotten bug reports had always been a problem, but I tried to at least review all "Open" bug reports once in while. Now we can still see bug reports with clear, repeatable test case finally got from reporter (like Bug #67517), or bug that is just easy to verify by code review (like Bug #67386) open and having no visible attention for months. Some bugs just hang in other intermediate states for ages as well (after verification), like Bug #56267. Some are verified, but they look serious and still remain just in "Verified" status, like Bug #67638 (optimizer).

Chances are high for a casual reporter to never report again when he gets no reply for months. This bad trend continues, but with less bugs open reports there is chance to overcome this trend eventually.

Hardly any of new key features of 5.6 is bug free. For example, PERFORMANCE_SCHEMA, as it was recently found out (see Bug #68180) does not cover all execution paths. Comparing to bugs listed at the beginning of this post it may be not that serious, but I am a bit tired of getting new features in GA that are NOT implemented completely (as it happened to 5.0, 5.1 and 5.5 GA).

The last (but not the least) potential problem with 5.6 is related to the documentation. There are missing details and misleading statements there (see Bug #68097), sections like http://dev.mysql.com/doc/refman/5.5/en/innodb-restrictions.html#innodb-5-5 that are no longer relevand after recent funny bug fixes (see funny fix to Bug #63435 about innodb_version server variable) and misprints in release notes (like Bug #68181). Some sections are still work in progress it seems, like http://dev.mysql.com/doc/refman/5.6/en/online-ddl-partitioning.html. Some fixes (like Bug #67759) are not documented it seems. Maybe manual being not yet ready is the reason why MySQL 5.6 had not been declared GA yet?

Now about good trends. This week I've noted more MySQL engineers working on bugs processing. Long waiting Bug #68023 was finally verified by Matt Lord, who rarely worked on bugs not related to customer issues in the past. Look at his detailed research in Bug #67517 that I've noted (and "escalated") just today. Erlend Dahl had noted and verified awful Bug #68127 soon after my "escalation" and it was closed in 2 days (with the fix in 5.6.10, one of the reasons I though that it will be GA and announced really soon during the week)! Way to go, Oracle, really!

As a result, this week Oracle finally managed to decrease number of "Open" bug reports to the level smaller than during my last day in the company, August 31, 2012. Since December 2012 number of open bug reports is decreased by more than 170! New bug reports are processed fast, see Bug #68144 as one of recent examples. Verified in less that 3 hours (as well as two other bugs from Hartmut reported that day). Looks like it is not a problem now for Oracle to keep up with inflow of at least server bug reports (I'll write about Cluster and Workbench bugs in details in some of the following issues).

Another (IMHO good) trend - Oracle customers, at least those who did that before, do not hesitate to report bugs in public database (as I've suggested to do last year). See Bug #68165 as one of recent examples related to 5.6 and upgrade to it.

Finally, some fun. Make sure you check the oldest and still not closed MySQL bugs, Bug #2. 10 years passed and the problem is still far from solved! This is awful, really.

MariaDB also contributed some fun for us all with their version 10.0. Read Bug #68187 and then try to guess if Oracle is going to fix it any time soon. More fun in the next issue, I hope...

Sunday, January 20, 2013

Fun with Bugs, Issue #2, January 2013

Looks like Issue #1 was popular enough based on number of reviews, so let me continue and tell you what kinds of fun I had with MySQL bug reports between January 12 and today. Once again, this is a kind of digest for my bugs related posts on Facebook during this period.

I'd like to start with the latest open bug at the moment, Bug #68127. Looks like one can easily crash MySQL 5.6.9 by running few concurrent SELECT * from threads queries against PERFORMANCE_SCHEMA. As PERFORMANCE_SCHEMA is enabled by default in 5.6.9 (now even such an ignorant person as me knows this, thanks to readers of my previous posts), this is a serious problem IMHO. Comparing to this, some missing or unclear information in the manual, like those Bug #68097 or Bug #68089 I've reported are so minor...

Now, let me scroll Timeline back and check how the period started. Looks like it was mostly devoted to MySQL 5.6, and this is natural - it's approaching GA and should be the best release ever. That's why it's so sad so see regression bugs or incomplete features in it. Immediately before my previous "Fun with Bugs" post my former colleague Sveta Smirnova had published her review of 18 most important new troubleshooting features of MySQL 5.6. I've taken a quick look and ended up with a list of related bugs:

Bug #67830 - EXPLAIN for DELETE and UPDATE does not work as expected, still just "Verified"
Bug #57514 - delayed replication may not work as expected, still just "Verified"
Bug #67236 - problems with new host_cache table. This one is already "Closed".

Then there was a regression bug in optimizer, again in 5.6, Bug #68046. It was just in time for one internal discussion we had at Percona on long IN lists optimization. Make sure you check it if you are going to upgrade to 5.6 now or soon after GA and use long IN lists based on some old habits or previous experience with it as a solution for some query performance problems.

MySQL 5.6 will be under increasing attention, so I'd really appreciate for even less severe bugs like Bug #68118 to get proper attention from Oracle engineers and to be formally verified fast. Especially in cases like above, when old utilities with familiar names are rewritten (in Perl?) and now works differently than in current GA versions.

Optimizer had been a topic of my interest for years, so it was hard to skip the bug that actually does NOT affect 5.6 it seems, only 5.1.46+ and 5.5. I mean this one, Bug #68072. With proper indexes in place you can get wrong results for some queries when timediff() function is involved. What's funny is that without index the result will be correct. It's always great to choose performance vs correct results.

I have to admit that Oracle engineers did a great job in bugs processing during this period. Number of open bugs is now almost as low as during my last day in Oracle, and they really help to pinpoint cases when experienced engineers make mistakes when studying some problems and report bugs that are "false positives". It may take time to argue in such cases, but public discussion and double checking benefit all sides involved. See Bug #68075 as a great example of this. Even if bug still remains "Open", like Bug #68077, it's great to be able to read the analysis or opinion of Oracle engineers about the problem and potential solutions.

If you care about InnoDB scalability and want to see a case when MySQL 5.6 is better than 5.5, and understand why is it so and is MySQL 5.6 really that good and scalable in a real life, take a look at Bug #68079, especially at comments from Dimitri Kravtchuk there. I expect detailed study (if not a solution) from him soon, published at his great blog. Thanks to my former colleague Arnaud Adant the bug was not only formally verified soon, but he also made sure that Oracle's leading performance tuning expert is aware about it.

This was yet another great example of Oracle's care about public bugs database and problems reported there. They can do a lot. We, community users, should just remind them about bugs they miss for whatever reason. Short post about Bug #67124 - and now it is "Verified" and get a chance to have resources allocated and priority set so that more improvements will be done sooner.

I know Oracle does not care about other storage engines as much as about InnoDB, but still it would be nice to have some cases, like Bug #68086, at least clarified in the manual if not fixed. Lower level of care is demonstrated for bugs related to tools or connectors. And while I can consider these bugs funny (like in case of Bug #67994) they may not be easy to fix. Even more important in this case to have them formally verified, especially if verification itself looks easy.

Since my Issue #1 in this series I've got several notes from my new readers about problematic bug report. Bug #66819 was one of them (surely it is about InnoDB, it's all about InnoDB these days). My colleague Alexey Kopytov had studied root causes of infamous Bug #61104 in details there, and the bug was closed as fixed in versions 5.1.67, 5.5.29, 5.6.8, 5.7.0. Weixiang Zhai questioned that based on code review and then it was proved that not ALL cases are fixed. Oracle will work on the remaining in frames of new, internal bug report it seems, but in the meantime you should know that change buffer is still not yet entirely safe in 5.5.29 and it makes sense to still apply the same workaround as suggested in Bug #61104.

As you know, some bugs affect only debug builds. Does this mean they are not important to fix? No, having them prevents testing for other bugs properly, so they also have to be escalated. Check Bug #68116 from my colleague Laurynas for example. I hope it will not end up "Unsupported" only because his stack trace is from Percona Server...

By the way, MySQL 5.7 is already work in progress. We, in MySQL community, do not have access to source codes yet (as there were no official release), but I see 5.7 mentioned in bug reports more often now. Check this case, Bug #63178. I think that it is NOT acceptable to just close the bug with the fix only in 5.7.0 without any comments on why the problem was not fixed in earlier versions or access to source code of the fix. But who am I to suggest anything... On the positive side, the fact that somebody cared to close the bug and write 5.7.0 there probably means that there will be a formal release of 5.7.x soon. Let's hope...

I'd like to finish this post in a way similar to the previous one: if you have a valid support contract with Oracle, ask them what Bug #65664 is about and is it really fixed in MySQL 5.1.66. I have reasons to think it is not, unfortunately. Find out for yourself - this is your benefit as a customer after all, to be able to demand a definite reply. I can only assume or guess (or check the code) at the moment.

To be continued...