Thursday, January 31, 2013

On InnoDB I/O threads states

I was asked today what different state values for InnoDB I/O threads really mean, these ones:

I/O thread 0 state: waiting for completed aio requests (insert buffer thread)
I/O thread 1 state: waiting for completed aio requests (log thread)
I/O thread 2 state: waiting for completed aio requests (read thread)

I tried to search the manual and Web in general and found no useful explanation (these verbose values should be self explanatory by design it seems). As question was asked, probably it's time to try to answer it...

These states for I/O threads are set by srv_set_io_thread_op_info() function. Quick search with grep on MySQL 5.5.29 source code tree gives us some hints, the rest we can try to find out via code review based on these hits:

[openxs@chief mysql-5.5]$ grep -rn  srv_set_io_thread_op_info * 2>/dev/null
storage/innobase/os/os0file.c:3494:             srv_set_io_thread_op_info(i, "not started yet");
storage/innobase/os/os0file.c:4370:             srv_set_io_thread_op_info(orig_seg, "wait Windows aio");
storage/innobase/os/os0file.c:4394:             srv_set_io_thread_op_info(orig_seg,
storage/innobase/os/os0file.c:4681:             srv_set_io_thread_op_info(global_seg,
storage/innobase/os/os0file.c:4691:     srv_set_io_thread_op_info(global_seg,
storage/innobase/os/os0file.c:4787:     srv_set_io_thread_op_info(global_segment,
storage/innobase/os/os0file.c:4805:     srv_set_io_thread_op_info(global_segment,
storage/innobase/os/os0file.c:4946:     srv_set_io_thread_op_info(global_segment, "consecutive i/o requests");
storage/innobase/os/os0file.c:4989:     srv_set_io_thread_op_info(global_segment, "doing file i/o");
storage/innobase/os/os0file.c:5010:     srv_set_io_thread_op_info(global_segment, "file i/o done");
storage/innobase/os/os0file.c:5062:     srv_set_io_thread_op_info(global_segment, "resetting wait event");
storage/innobase/os/os0file.c:5072:     srv_set_io_thread_op_info(global_segment, "waiting for i/o request");
storage/innobase/include/srv0srv.h:766:# define srv_set_io_thread_op_info(t,info)       ((void) 0)
storage/innobase/fil/fil0fil.c:4581:            srv_set_io_thread_op_info(segment, "native aio handle");
storage/innobase/fil/fil0fil.c:4593:            srv_set_io_thread_op_info(segment, "simulated aio handle");
storage/innobase/fil/fil0fil.c:4605:    srv_set_io_thread_op_info(segment, "complete io for fil node");
storage/innobase/fil/fil0fil.c:4622:            srv_set_io_thread_op_info(segment, "complete io for buf page");
storage/innobase/fil/fil0fil.c:4625:            srv_set_io_thread_op_info(segment, "complete io for log");

Those who are too lazy to pull source code can just use this beautiful online tool to review few files listed above and make their own conclusions.

Unfortunately code formatting is different for different calls, so we still miss some states above as strings. Let's review the files then. It all starts with os_aio_init() function on line of 3494 of os0file.c:

3493  for (i = 0; i < n_segments; i++) {
3494  srv_set_io_thread_op_info(i, "not started yet");
3495  }
It is clear what's going on here, you hardly will be able to catch this status in the output as it is set even before I/O threads are actually started. 
Our next stop is line 4370, in os_aio_windows_handle() that, obviously, works only on Windows:

4356  /* NOTE! We only access constant fields in os_aio_array. Therefore
4357  we do not have to acquire the protecting mutex yet */
4359  ut_ad(os_aio_validate_skip());
4360  ut_ad(segment < array->n_segments);
4362  n = array->n_slots / array->n_segments;
4364  if (array == os_aio_sync_array) {
4365  WaitForSingleObject(
4366  os_aio_array_get_nth_slot(array, pos)->handle,
4367  INFINITE);
4368  i = pos;
4369  } else {
4370  srv_set_io_thread_op_info(orig_seg, "wait Windows aio");
4371  i = WaitForMultipleObjects((DWORD) n,
4372  array->handles + segment * n,
4373  FALSE,
4374  INFINITE);
4375  }
So, "wait" here means that we do wait for completion. Later in the same function we are checking result:
4393  if (orig_seg != ULINT_UNDEFINED) {
4394  srv_set_io_thread_op_info(orig_seg,
4395  "get windows aio return value");
4396  }
4398  ret = GetOverlappedResult(slot->file, &(slot->control), &len, TRUE);
That's all states that may be set in  os_aio_windows_handle(). Next stop is line 4681 in an infinite loop in the os_aio_linux_handle() function:

4648  /* Loop until we have found a completed request. */
4649  for (;;) {
4650  ibool any_reserved = FALSE;
4651  os_mutex_enter(array->mutex);
4652  for (i = 0; i < n; ++i) {
4653  slot = os_aio_array_get_nth_slot(
4654  array, i + segment * n);
4655  if (!slot->reserved) {
4656  continue;
4657  } else if (slot->io_already_done) {
4658  /* Something for us to work on. */
4659  goto found;
4660  } else {
4661  any_reserved = TRUE;
4662  }
4663  }
4678  /* Wait for some request. Note that we return
4679  from wait iff we have found a request. */
4681  srv_set_io_thread_op_info(global_seg,
4682  "waiting for completed aio requests");
4683  os_aio_linux_collect(array, segment, n);
4684  }
4686 found:
4687  /* Note that it may be that there are more then one completed
4688  IO requests. We process them one at a time. We may have a case
4689  here to improve the performance slightly by dealing with all
4690  requests in one sweep. */
4691  srv_set_io_thread_op_info(global_seg,
4692  "processing completed aio requests");
So state "waiting for completed aio requests" means that there is no pending request and no completed request, so we have nothing to do. When at least some AIO request is completed., state will be "processing completed aio requests". All these for real AIO on Linux.

In case of simulated AIO we have  os_aio_simulated_handle() function, where state is set on line 4787:

4783 restart:
4784  /* NOTE! We only access constant fields in os_aio_array. Therefore
4785  we do not have to acquire the protecting mutex yet */
4787  srv_set_io_thread_op_info(global_segment,
4788  "looking for i/o requests (a)");
Later in the same function:
4794  /* Look through n slots after the segment * n'th slot */
4796  if (array == os_aio_read_array
4797  && os_aio_recommend_sleep_for_read_threads) {
4799  /* Give other threads chance to add several i/os to the array
4800  at once. */
4802  goto recommended_sleep;
4803  }
4805  srv_set_io_thread_op_info(global_segment,
4806  "looking for i/o requests (b)");
So, here we have stages (a) and (b), before and after sleep, and here is the recommended_sleep label in the same function (before we do sleep state "waiting for i/o request" is set): 

5061 wait_for_io:
5062  srv_set_io_thread_op_info(global_segment, "resetting wait event");
5064  /* We wait here until there again can be i/os in the segment
5065  of this thread */
5067  os_event_reset(os_aio_segment_wait_events[global_segment]);
5069  os_mutex_exit(array->mutex);
5071 recommended_sleep:
5072  srv_set_io_thread_op_info(global_segment, "waiting for i/o request");
5074  os_event_wait(os_aio_segment_wait_events[global_segment]);
5076  if (os_aio_print_debug) {
5077  fprintf(stderr,
5078  "InnoDB: i/o handler thread for i/o"
5079  " segment %lu wakes up\n",
5080  (ulong) global_segment);
5081  }
5083  goto restart;

By the way, we end up at that wait_for_io label when there is no I/O to do by simulated AIO, so "resetting wait event" state is set for a very short period.
Let's continue with os_aio_simulated_handle() back from line 4946:

4946  srv_set_io_thread_op_info(global_segment, "consecutive i/o requests");
4948  /* We have now collected n_consecutive i/o requests in the array;
4949  allocate a single buffer which can hold all data, and perform the
4950  i/o */

This state is set if we picked up some request and found there are consecutive requests that can be processed as an array. This state is set only while we are allocating the buffer, for real I/O there is another state, "doing file i/o":

4989  srv_set_io_thread_op_info(global_segment, "doing file i/o");
4991  if (os_aio_print_debug) {
4992  fprintf(stderr,
4993  "InnoDB: doing i/o of type %lu at offset %lu %lu,"
4994  " length %lu\n",
4995  (ulong) slot->type, (ulong) slot->offset_high,
4996  (ulong) slot->offset, (ulong) total_len);
4997  }
4999  /* Do the i/o with ordinary, synchronous i/o functions: */
5000  if (slot->type == OS_FILE_WRITE) {
5001  ret = os_file_write(slot->name, slot->file, combined_buf,
5002  slot->offset, slot->offset_high,
5003  total_len);
5004  } else {
5005  ret = os_file_read(slot->file, combined_buf,
5006  slot->offset, slot->offset_high, total_len);
5007  }
5009  ut_a(ret);
5010  srv_set_io_thread_op_info(global_segment, "file i/o done");
When sync I/O is completed, state "file i/o done" is set, as you can see above.After we've done I/O and before next sleep and restart we copy data back to original several buffers, set/reset everything properly and call to os_aio_simulated_handle() is completed:

5042  /* We return the messages for the first slot now, and if there were
5043  several slots, the messages will be returned with subsequent calls
5044  of this function */
5046 slot_io_done:
5048  ut_a(slot->reserved);
5050  *message1 = slot->message1;
5051  *message2 = slot->message2;
5053  *type = slot->type;
5055  os_mutex_exit(array->mutex);
5057  os_aio_array_free_slot(array, slot);
5059  return(ret);

That's all, so it is either Windows native AIO (a couple of states only) or Linux native AIO (a couple of states only), or simulated AIO (and in this case many states are possible). Probably that's why nobody cared to explain details - in real production situations with native AIO there are few states to consider anyway.

Now, there are some lines to check in fil0fil.c also,  fil_aio_wait() function:
4580  if (srv_use_native_aio) {
4581  srv_set_io_thread_op_info(segment, "native aio handle");
4582 #ifdef WIN_ASYNC_IO
4583  ret = os_aio_windows_handle(segment, 0, &fil_node,
4584  &message, &type);
4585 #elif defined(LINUX_NATIVE_AIO)
4586  ret = os_aio_linux_handle(segment, &fil_node,
4587  &message, &type);
4588 #else
4589  ut_error;
4590  ret = 0; /* Eliminate compiler warning */
4591 #endif
4592  } else {
4593  srv_set_io_thread_op_info(segment, "simulated aio handle");
4595  ret = os_aio_simulated_handle(segment, &fil_node,
4596  &message, &type);
4597  }
So, here decision is made (at compile time) on what functions to call from three described above and, for some short time, state reflects type of AIO that is going to be used.

When whatever AIO function called completes, we have this code running:

4605  srv_set_io_thread_op_info(segment, "complete io for fil node");
4607  mutex_enter(&fil_system->mutex);
4609  fil_node_complete_io(fil_node, fil_system, type);
4611  mutex_exit(&fil_system->mutex);
4613  ut_ad(fil_validate_skip());
4615  /* Do the i/o handling */
4616  /* IMPORTANT: since i/o handling for reads will read also the insert
4617  buffer in tablespace 0, you have to be very careful not to introduce
4618  deadlocks in the i/o system. We keep tablespace 0 data files always
4619  open, and use a special i/o thread to serve insert buffer requests. */
4621  if (fil_node->space->purpose == FIL_TABLESPACE) {
4622  srv_set_io_thread_op_info(segment, "complete io for buf page");
4623  buf_page_io_complete(message);
4624  } else {
4625  srv_set_io_thread_op_info(segment, "complete io for log");
4626  log_io_complete(message);
4627  }

Here state reflects type of I/O completed, data page or redo log.

That's all. I have to think what is the better way to summarize the states and possible transitions for each state. Maybe just a table, or something like a state machine represented somehow. I should think about this. So, to be continued...

1 comment:

  1. It would be great to have a table in I_S about how much time is spend in each state.