Thursday, January 31, 2013

On InnoDB I/O threads states

I was asked today what different state values for InnoDB I/O threads really mean, these ones:

I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
I/O thread 1 state: waiting for completed aio requests (log thread)
I/O thread 2 state: waiting for completed aio requests (read thread)

I tried to search the manual and Web in general and found no useful explanation (these verbose values should be self explanatory by design it seems). As question was asked, probably it's time to try to answer it...

These states for I/O threads are set by srv_set_io_thread_op_info() function. Quick search with grep on MySQL 5.5.29 source code tree gives us some hints, the rest we can try to find out via code review based on these hits:

[openxs@chief mysql-5.5]$ grep -rn  srv_set_io_thread_op_info * 2>/dev/null
storage/innobase/os/os0file.c:3494:             srv_set_io_thread_op_info(i, "not started yet");
storage/innobase/os/os0file.c:4370:             srv_set_io_thread_op_info(orig_seg, "wait Windows aio");
storage/innobase/os/os0file.c:4394:             srv_set_io_thread_op_info(orig_seg,
storage/innobase/os/os0file.c:4681:             srv_set_io_thread_op_info(global_seg,
storage/innobase/os/os0file.c:4691:     srv_set_io_thread_op_info(global_seg,
storage/innobase/os/os0file.c:4787:     srv_set_io_thread_op_info(global_segment,
storage/innobase/os/os0file.c:4805:     srv_set_io_thread_op_info(global_segment,
storage/innobase/os/os0file.c:4946:     srv_set_io_thread_op_info(global_segment, "consecutive i/o requests");
storage/innobase/os/os0file.c:4989:     srv_set_io_thread_op_info(global_segment, "doing file i/o");
storage/innobase/os/os0file.c:5010:     srv_set_io_thread_op_info(global_segment, "file i/o done");
storage/innobase/os/os0file.c:5062:     srv_set_io_thread_op_info(global_segment, "resetting wait event");
storage/innobase/os/os0file.c:5072:     srv_set_io_thread_op_info(global_segment, "waiting for i/o request");
storage/innobase/include/srv0srv.h:766:# define srv_set_io_thread_op_info(t,info)       ((void) 0)
storage/innobase/fil/fil0fil.c:4581:            srv_set_io_thread_op_info(segment, "native aio handle");
storage/innobase/fil/fil0fil.c:4593:            srv_set_io_thread_op_info(segment, "simulated aio handle");
storage/innobase/fil/fil0fil.c:4605:    srv_set_io_thread_op_info(segment, "complete io for fil node");
storage/innobase/fil/fil0fil.c:4622:            srv_set_io_thread_op_info(segment, "complete io for buf page");
storage/innobase/fil/fil0fil.c:4625:            srv_set_io_thread_op_info(segment, "complete io for log");

Those who are too lazy to pull source code can just use this beautiful online tool to review few files listed above and make their own conclusions.

Unfortunately code formatting is different for different calls, so we still miss some states above as strings. Let's review the files then. It all starts with os_aio_init() function on line of 3494 of os0file.c:

3493  for (i = 0; i < n_segments; i++) {
3494  srv_set_io_thread_op_info(i, "not started yet");
3495  }
It is clear what's going on here, you hardly will be able to catch this status in the output as it is set even before I/O threads are actually started. 
Our next stop is line 4370, in os_aio_windows_handle() that, obviously, works only on Windows:

4356  /* NOTE! We only access constant fields in os_aio_array. Therefore
4357  we do not have to acquire the protecting mutex yet */
4359  ut_ad(os_aio_validate_skip());
4360  ut_ad(segment < array->n_segments);
4362  n = array->n_slots / array->n_segments;
4364  if (array == os_aio_sync_array) {
4365  WaitForSingleObject(
4366  os_aio_array_get_nth_slot(array, pos)->handle,
4367  INFINITE);
4368  i = pos;
4369  } else {
4370  srv_set_io_thread_op_info(orig_seg, "wait Windows aio");
4371  i = WaitForMultipleObjects((DWORD) n,
4372  array->handles + segment * n,
4373  FALSE,
4374  INFINITE);
4375  }
So, "wait" here means that we do wait for completion. Later in the same function we are checking result:
4393  if (orig_seg != ULINT_UNDEFINED) {
4394  srv_set_io_thread_op_info(orig_seg,
4395  "get windows aio return value");
4396  }
4398  ret = GetOverlappedResult(slot->file, &(slot->control), &len, TRUE);
That's all states that may be set in  os_aio_windows_handle(). Next stop is line 4681 in an infinite loop in the os_aio_linux_handle() function:

4648  /* Loop until we have found a completed request. */
4649  for (;;) {
4650  ibool any_reserved = FALSE;
4651  os_mutex_enter(array->mutex);
4652  for (i = 0; i < n; ++i) {
4653  slot = os_aio_array_get_nth_slot(
4654  array, i + segment * n);
4655  if (!slot->reserved) {
4656  continue;
4657  } else if (slot->io_already_done) {
4658  /* Something for us to work on. */
4659  goto found;
4660  } else {
4661  any_reserved = TRUE;
4662  }
4663  }
4678  /* Wait for some request. Note that we return
4679  from wait iff we have found a request. */
4681  srv_set_io_thread_op_info(global_seg,
4682  "waiting for completed aio requests");
4683  os_aio_linux_collect(array, segment, n);
4684  }
4686 found:
4687  /* Note that it may be that there are more then one completed
4688  IO requests. We process them one at a time. We may have a case
4689  here to improve the performance slightly by dealing with all
4690  requests in one sweep. */
4691  srv_set_io_thread_op_info(global_seg,
4692  "processing completed aio requests");
So state "waiting for completed aio requests" means that there is no pending request and no completed request, so we have nothing to do. When at least some AIO request is completed., state will be "processing completed aio requests". All these for real AIO on Linux.

In case of simulated AIO we have  os_aio_simulated_handle() function, where state is set on line 4787:

4783 restart:
4784  /* NOTE! We only access constant fields in os_aio_array. Therefore
4785  we do not have to acquire the protecting mutex yet */
4787  srv_set_io_thread_op_info(global_segment,
4788  "looking for i/o requests (a)");
Later in the same function:
4794  /* Look through n slots after the segment * n'th slot */
4796  if (array == os_aio_read_array
4797  && os_aio_recommend_sleep_for_read_threads) {
4799  /* Give other threads chance to add several i/os to the array
4800  at once. */
4802  goto recommended_sleep;
4803  }
4805  srv_set_io_thread_op_info(global_segment,
4806  "looking for i/o requests (b)");
So, here we have stages (a) and (b), before and after sleep, and here is the recommended_sleep label in the same function (before we do sleep state "waiting for i/o request" is set): 

5061 wait_for_io:
5062  srv_set_io_thread_op_info(global_segment, "resetting wait event");
5064  /* We wait here until there again can be i/os in the segment
5065  of this thread */
5067  os_event_reset(os_aio_segment_wait_events[global_segment]);
5069  os_mutex_exit(array->mutex);
5071 recommended_sleep:
5072  srv_set_io_thread_op_info(global_segment, "waiting for i/o request");
5074  os_event_wait(os_aio_segment_wait_events[global_segment]);
5076  if (os_aio_print_debug) {
5077  fprintf(stderr,
5078  "InnoDB: i/o handler thread for i/o"
5079  " segment %lu wakes up\n",
5080  (ulong) global_segment);
5081  }
5083  goto restart;

By the way, we end up at that wait_for_io label when there is no I/O to do by simulated AIO, so "resetting wait event" state is set for a very short period.
Let's continue with os_aio_simulated_handle() back from line 4946:

4946  srv_set_io_thread_op_info(global_segment, "consecutive i/o requests");
4948  /* We have now collected n_consecutive i/o requests in the array;
4949  allocate a single buffer which can hold all data, and perform the
4950  i/o */

This state is set if we picked up some request and found there are consecutive requests that can be processed as an array. This state is set only while we are allocating the buffer, for real I/O there is another state, "doing file i/o":

4989  srv_set_io_thread_op_info(global_segment, "doing file i/o");
4991  if (os_aio_print_debug) {
4992  fprintf(stderr,
4993  "InnoDB: doing i/o of type %lu at offset %lu %lu,"
4994  " length %lu\n",
4995  (ulong) slot->type, (ulong) slot->offset_high,
4996  (ulong) slot->offset, (ulong) total_len);
4997  }
4999  /* Do the i/o with ordinary, synchronous i/o functions: */
5000  if (slot->type == OS_FILE_WRITE) {
5001  ret = os_file_write(slot->name, slot->file, combined_buf,
5002  slot->offset, slot->offset_high,
5003  total_len);
5004  } else {
5005  ret = os_file_read(slot->file, combined_buf,
5006  slot->offset, slot->offset_high, total_len);
5007  }
5009  ut_a(ret);
5010  srv_set_io_thread_op_info(global_segment, "file i/o done");
When sync I/O is completed, state "file i/o done" is set, as you can see above.After we've done I/O and before next sleep and restart we copy data back to original several buffers, set/reset everything properly and call to os_aio_simulated_handle() is completed:

5042  /* We return the messages for the first slot now, and if there were
5043  several slots, the messages will be returned with subsequent calls
5044  of this function */
5046 slot_io_done:
5048  ut_a(slot->reserved);
5050  *message1 = slot->message1;
5051  *message2 = slot->message2;
5053  *type = slot->type;
5055  os_mutex_exit(array->mutex);
5057  os_aio_array_free_slot(array, slot);
5059  return(ret);

That's all, so it is either Windows native AIO (a couple of states only) or Linux native AIO (a couple of states only), or simulated AIO (and in this case many states are possible). Probably that's why nobody cared to explain details - in real production situations with native AIO there are few states to consider anyway.

Now, there are some lines to check in fil0fil.c also,  fil_aio_wait() function:
4580  if (srv_use_native_aio) {
4581  srv_set_io_thread_op_info(segment, "native aio handle");
4582 #ifdef WIN_ASYNC_IO
4583  ret = os_aio_windows_handle(segment, 0, &fil_node,
4584  &message, &type);
4585 #elif defined(LINUX_NATIVE_AIO)
4586  ret = os_aio_linux_handle(segment, &fil_node,
4587  &message, &type);
4588 #else
4589  ut_error;
4590  ret = 0; /* Eliminate compiler warning */
4591 #endif
4592  } else {
4593  srv_set_io_thread_op_info(segment, "simulated aio handle");
4595  ret = os_aio_simulated_handle(segment, &fil_node,
4596  &message, &type);
4597  }
So, here decision is made (at compile time) on what functions to call from three described above and, for some short time, state reflects type of AIO that is going to be used.

When whatever AIO function called completes, we have this code running:

4605  srv_set_io_thread_op_info(segment, "complete io for fil node");
4607  mutex_enter(&fil_system->mutex);
4609  fil_node_complete_io(fil_node, fil_system, type);
4611  mutex_exit(&fil_system->mutex);
4613  ut_ad(fil_validate_skip());
4615  /* Do the i/o handling */
4616  /* IMPORTANT: since i/o handling for reads will read also the insert
4617  buffer in tablespace 0, you have to be very careful not to introduce
4618  deadlocks in the i/o system. We keep tablespace 0 data files always
4619  open, and use a special i/o thread to serve insert buffer requests. */
4621  if (fil_node->space->purpose == FIL_TABLESPACE) {
4622  srv_set_io_thread_op_info(segment, "complete io for buf page");
4623  buf_page_io_complete(message);
4624  } else {
4625  srv_set_io_thread_op_info(segment, "complete io for log");
4626  log_io_complete(message);
4627  }

Here state reflects type of I/O completed, data page or redo log.

That's all. I have to think what is the better way to summarize the states and possible transitions for each state. Maybe just a table, or something like a state machine represented somehow. I should think about this. So, to be continued...

Sunday, January 27, 2013

Fun with Bugs, Issue #3, January 2013

This week in my posts on Facebook I paid attention mostly to bugs in MySQL 5.6.x, as I really expected to see 5.6 GA announcement soon. Maybe some of the bugs were the reason to postpone release a bit. Anyway, let's start with serious bug reports and some trends I've noted.

I've noted that some of "Verified" bug reports somehow silently miss comments about their status in MySQL 5.6. Check Bug #68137, for example. It is repeatable on 5.5.31 (yet to be released, but it means 5.5.30 is cloned off and will be released soon) and 5.7.1. But 5.6.x is not mentioned anywhere. Does it mean 5.6.x is not affected? Surely, no. I hope it does NOT mean any attempt by Oracle to pretend that 5.6 is bugs free. Looks like Sveta had just forgotten to check or report result on this version when following her great bugs verification procedure.

I would say that the above was just a one time mistake, but then look at Bug #68145 that was verified by Sveta in less than 30 minutes after my last comment there (and "verification", at the level I can do it as community member busy with main job actually). It was verified on many versions again, 5.0.97, 5.1.69, 5.5.31, 5.7.1, but NOT on 5.6.x. This looks like a trend now, and I do not like it. There are bugs in 5.6.9, 5.6.10 and even 5.6.11, there is nothing bad about this if they are noted, processed and fixed fast. There is not (good) reason to skip checking on mentioning this version. I am monitoring carefully anyway, so nothing in public database will pass unnoticed and unknown to MySQL community.

Serious bug in InnoDB was reported (and verified on both 5.5.27, 5.6.9-rc and recent 5.6 code) this week, Bug #68148. After executing SET FOREIGN_KEY_CHECKS=0; you can easily drop index unsed by foreign key and then the tables becomes unusable (you can not even drop it with SQL statement) after server restart. As it is not a recent regression, I am not sure it would be easy to fix, so better just avoid DDL with the setting above.

When 5.6 is approaching GA it is really strange to see bugs in new/key features of MySQL 5.6 with unclear status. Check Bug #66916 (replication). We can only guess is it fixed in 5.6.10 or not. Check Bug #66865 (performance regression in 5.6 that is all about performance) reported by my colleague Vadim more than 4 months ago. It is still just "Open". So, is there a regression or not? Or it is just forgotten?

When bugs are forgotten and test cases are not added to the regression test suite (now we, in Community, can not check if they are added), we are doomed to see the same problems again and again. If you are Oracle customer, ask them what Bug #68178 is about (in a support request probably), what versions are affected and had it been previously reported and fixed. If you care what is it about (Oracle will probably just tell you to upgrade to the version with the fix eventually), it is about an easy way to crash MySQL server by sending many smart SELECT convert_tz(...) in a row. I hope 5.6 will NOT be released as GA before this bug is fixed.

Forgotten bug reports had always been a problem, but I tried to at least review all "Open" bug reports once in while. Now we can still see bug reports with clear, repeatable test case finally got from reporter (like Bug #67517), or bug that is just easy to verify by code review (like Bug #67386) open and having no visible attention for months. Some bugs just hang in other intermediate states for ages as well (after verification), like Bug #56267. Some are verified, but they look serious and still remain just in "Verified" status, like Bug #67638 (optimizer).

Chances are high for a casual reporter to never report again when he gets no reply for months. This bad trend continues, but with less bugs open reports there is chance to overcome this trend eventually.

Hardly any of new key features of 5.6 is bug free. For example, PERFORMANCE_SCHEMA, as it was recently found out (see Bug #68180) does not cover all execution paths. Comparing to bugs listed at the beginning of this post it may be not that serious, but I am a bit tired of getting new features in GA that are NOT implemented completely (as it happened to 5.0, 5.1 and 5.5 GA).

The last (but not the least) potential problem with 5.6 is related to the documentation. There are missing details and misleading statements there (see Bug #68097), sections like that are no longer relevand after recent funny bug fixes (see funny fix to Bug #63435 about innodb_version server variable) and misprints in release notes (like Bug #68181). Some sections are still work in progress it seems, like Some fixes (like Bug #67759) are not documented it seems. Maybe manual being not yet ready is the reason why MySQL 5.6 had not been declared GA yet?

Now about good trends. This week I've noted more MySQL engineers working on bugs processing. Long waiting Bug #68023 was finally verified by Matt Lord, who rarely worked on bugs not related to customer issues in the past. Look at his detailed research in Bug #67517 that I've noted (and "escalated") just today. Erlend Dahl had noted and verified awful Bug #68127 soon after my "escalation" and it was closed in 2 days (with the fix in 5.6.10, one of the reasons I though that it will be GA and announced really soon during the week)! Way to go, Oracle, really!

As a result, this week Oracle finally managed to decrease number of "Open" bug reports to the level smaller than during my last day in the company, August 31, 2012. Since December 2012 number of open bug reports is decreased by more than 170! New bug reports are processed fast, see Bug #68144 as one of recent examples. Verified in less that 3 hours (as well as two other bugs from Hartmut reported that day). Looks like it is not a problem now for Oracle to keep up with inflow of at least server bug reports (I'll write about Cluster and Workbench bugs in details in some of the following issues).

Another (IMHO good) trend - Oracle customers, at least those who did that before, do not hesitate to report bugs in public database (as I've suggested to do last year). See Bug #68165 as one of recent examples related to 5.6 and upgrade to it.

Finally, some fun. Make sure you check the oldest and still not closed MySQL bugs, Bug #2. 10 years passed and the problem is still far from solved! This is awful, really.

MariaDB also contributed some fun for us all with their version 10.0. Read Bug #68187 and then try to guess if Oracle is going to fix it any time soon. More fun in the next issue, I hope...

Sunday, January 20, 2013

Fun with Bugs, Issue #2, January 2013

Looks like Issue #1 was popular enough based on number of reviews, so let me continue and tell you what kinds of fun I had with MySQL bug reports between January 12 and today. Once again, this is a kind of digest for my bugs related posts on Facebook during this period.

I'd like to start with the latest open bug at the moment, Bug #68127. Looks like one can easily crash MySQL 5.6.9 by running few concurrent  SELECT * from threads queries against PERFORMANCE_SCHEMA. As PERFORMANCE_SCHEMA is enabled by default in 5.6.9 (now even such an ignorant person as me knows this, thanks to readers of my previous posts), this is a serious problem IMHO. Comparing to this, some missing or unclear information in the manual, like those Bug #68097 or Bug #68089 I've reported are so minor...

Now, let me scroll Timeline back and check how the period started. Looks like it was mostly devoted to MySQL 5.6, and this is natural - it's approaching GA and should be the best release ever. That's why it's so sad so see regression bugs or incomplete features in it. Immediately before my previous "Fun with Bugs" post my former colleague Sveta Smirnova had published her review of 18 most important new troubleshooting features of MySQL 5.6. I've taken a quick look and ended up with a list of related bugs:
  • Bug #67830 - EXPLAIN for DELETE and UPDATE does not work as expected, still just "Verified"
  • Bug #57514 - delayed replication may not work as expected, still just "Verified"
  • Bug #67236 - problems with new host_cache table. This one is already "Closed".
Then there was a regression bug in optimizer, again in 5.6, Bug #68046. It was just in time for one internal discussion we had at Percona on long IN lists optimization. Make sure you check it if you are going to upgrade to 5.6 now or soon after GA and use long IN lists based on some old habits or previous experience with it as a solution for some query performance problems.

MySQL 5.6 will be under increasing attention, so I'd really appreciate for even less severe bugs like Bug #68118 to get proper attention from Oracle engineers and to be formally verified fast. Especially in cases like above, when old utilities with familiar names are rewritten (in Perl?) and now works differently than in current GA versions.

Optimizer had been a topic of my interest for years, so it was hard to skip the bug that actually does NOT affect 5.6 it seems, only 5.1.46+ and 5.5. I mean this one, Bug #68072.  With proper indexes in place you can get wrong results for some queries when timediff() function is involved. What's funny is that without index the result will be correct. It's always great to choose performance vs correct results.

I have to admit that Oracle engineers did a great job in bugs processing during this period. Number of open bugs is now almost as low as during my last day in Oracle, and they really help to pinpoint cases when experienced engineers make mistakes when studying some problems and report bugs that are "false positives". It may take time to argue in such cases, but public discussion and double checking benefit all sides involved. See Bug #68075 as a great example of this. Even if bug still remains "Open", like Bug #68077, it's great to be able to read the analysis or opinion of Oracle engineers about the problem and potential solutions.

If you care about InnoDB scalability and want to see a case when MySQL 5.6 is better than 5.5, and understand why is it so and is MySQL 5.6 really that good and scalable in a real life, take a look at Bug #68079, especially at comments from Dimitri Kravtchuk there. I expect detailed study (if not a solution) from him soon, published at his great blog. Thanks to my former colleague Arnaud Adant the bug was not only formally verified soon, but he also made sure that Oracle's leading performance tuning expert is aware about it.

This was yet another great example of Oracle's care about public bugs database and problems reported there. They can do a lot. We, community users, should just remind them about bugs they miss for whatever reason. Short post about Bug #67124 - and now it is "Verified" and get a chance to have resources allocated and priority set so that more improvements will be done sooner.

I know Oracle does not care about other storage engines as much as about InnoDB, but still it would be nice to have some cases, like Bug #68086, at least clarified in the manual if not fixed. Lower level of care is demonstrated for bugs related to tools or connectors. And while I can consider these bugs funny (like in case of Bug #67994) they may not be easy to fix. Even more important in this case to have them formally verified, especially if verification itself looks easy.

Since my Issue #1 in this series I've got several notes from my new readers about problematic bug report. Bug #66819 was one of them (surely it is about InnoDB, it's all about InnoDB these days). My colleague Alexey Kopytov had studied root causes of infamous Bug #61104 in details there, and the bug was closed as fixed in versions 5.1.67, 5.5.29, 5.6.8, 5.7.0. Weixiang Zhai questioned that based on code review and then it was proved that not ALL cases are fixed. Oracle will work on the remaining in frames of new, internal bug report it seems, but in the meantime you should know that change buffer is still not yet entirely safe in 5.5.29 and it makes sense to still apply the same workaround as suggested in Bug #61104.

As you know, some bugs affect only debug builds. Does this mean they are not important to fix? No, having them prevents testing for other bugs properly, so they also have to be escalated. Check Bug #68116 from my colleague Laurynas for example. I hope it will not end up "Unsupported" only because his stack trace is from Percona Server...

By the way, MySQL 5.7 is already work in progress. We, in MySQL community, do not have access to source codes yet (as there were no official release), but I see 5.7 mentioned in bug reports more often now. Check this case, Bug #63178. I think that it is NOT acceptable to just close the bug with the fix only in 5.7.0 without any comments on why the problem was not fixed in earlier versions or access to source code of the fix. But who am I to suggest anything... On the positive side, the fact that somebody cared to close the bug and write 5.7.0 there probably means that there will be a formal release of 5.7.x soon. Let's hope...
I'd like to finish this post in a way similar to the previous one: if you have a valid support contract with Oracle, ask them what Bug #65664 is about and is it really fixed in MySQL 5.1.66. I have reasons to think it is not, unfortunately. Find out for yourself - this is your benefit as a customer after all, to be able to demand a definite reply. I can only assume or guess (or check the code) at the moment.

To be continued...