Tuesday, January 31, 2017

Profiling MyRocks with perf: Good Old Bug #68079 Use Case

Almost a year ago I've got really interested in MyRocks and built MySQL from Facebook that provides it from source. Since that time I build it from fresh sources few times per week (as I've described in that post) and once in a while try to work with it and study some details or use cases. Today I'd like to discuss one of them that I've recently studied with perf profiler.

This is not only because I am going to talk about applying profilers to all kinds and forks of MySQL at FOSDEM 2017 MySQL & Friends Devroom this week. It seems profilers is the only way to study or troubleshoot MyRocks performance problems (if any) at the moment (when provided status variables and SHOW ENGINE ROCKSDB STATUS etc are not enough), as Facebook's MySQL does NOT have Performance Schema compiled in by default:

mysql> show engines;
+------------+---------+----------------------------------------------------------------+--------------+------+------------+
| Engine     | Support | Comment                                                        | Transactions | XA   | Savepoints |
+------------+---------+----------------------------------------------------------------+--------------+------+------------+
| ROCKSDB    | DEFAULT | RocksDB storage engine                                         | YES          | YES  | YES        || MRG_MYISAM | YES     | Collection of identical MyISAM tables                          | NO           | NO   | NO         |
| MyISAM     | YES     | MyISAM storage engine                                          | NO           | NO   | NO         |
| BLACKHOLE  | YES     | /dev/null storage engine (anything you write to it disappears) | NO           | NO   | NO         |
| CSV        | YES     | CSV storage engine                                             | NO           | NO   | NO         |
| MEMORY     | YES     | Hash based, stored in memory, useful for temporary tables      | NO           | NO   | NO         |
| ARCHIVE    | YES     | Archive storage engine                                         | NO           | NO   | NO         |
| FEDERATED  | NO      | Federated MySQL storage engine                                 | NULL         | NULL | NULL       |
| InnoDB     | NO      | Supports transactions, row-level locking, and foreign keys     | NULL         | NULL | NULL       |
+------------+---------+----------------------------------------------------------------+--------------+------+------------+
9 rows in set (0.00 sec)

This may change when MyRocks builds from Percona or MariaDB will appear as ready for common use, but for now perf is a way to go, and I like it!

I do not remember how exactly I've come up with the idea to compare MyRocks scalability to InnoDB scalability on my old QuadXeon box (now running Fedora 25):
[openxs@fc23 mysql-server]$ pt-summary
# Percona Toolkit System Summary Report ######################
        Date | 2017-01-30 11:32:53 UTC (local TZ: EET +0200)
    Hostname | fc23
      Uptime |  4:02,  3 users,  load average: 3.47, 2.12, 1.27
    Platform | Linux
     Release | Fedora release 25 (Twenty Five)
      Kernel | 4.9.5-200.fc25.x86_64

Architecture | CPU = 64-bit, OS = 64-bit
   Threading | NPTL 2.24
     SELinux | Enforcing
 Virtualized | No virtualization detected
# Processor ##################################################
  Processors | physical = 1, cores = 4, virtual = 4, hyperthreading = no
      Speeds | 4x2499.000
      Models | 4xIntel(R) Core(TM)2 Quad CPU Q8300 @ 2.50GHz
      Caches | 4x2048 KB

# Memory #####################################################
       Total | 7.8G
        Free | 2.9G
        Used | physical = 1.0G, swap allocated = 0.0, swap used = 0.0, virtual = 1.0G
      Shared | 176.0M
     Buffers | 3.9G
      Caches | 6.2G
       Dirty | 208 kB
     UsedRSS | 2.5G
  Swappiness | 60
 DirtyPolicy | 20, 10
 DirtyStatus | 0, 0
...
for the famous use case presented in Bug #68079. Scalability problems identified there cause a lot of work by Oracle engineers, and a lot of changes in MySQL 5.6.x and 5.7.x (so natable that MariaDB decided to use only those from 5.7 in the upcoming MariaDB 10.2, see MDEV-10476). I've already mentioned this bug probably in every my public talk about MySQL since 2013, so one day I decided to check how the fix helped for recent InnoDB and then compare to MyRocks performance with all default settings on the same hardware for different number of concurrent threads.

So, one day in April 2016 I took a dump of tables involved from the bug, replaced InnoDB with ROCKSDB there, added explicit CHARSET=utf8 COLLATE=utf8_bin clauses, and loaded the dump into Facebook's MySQL 5.6 built from source. I've ended up with the following:
mysql> show create table task\G
*************************** 1. row ***************************
       Table: task
Create Table: CREATE TABLE `task` (
  `sys_id` char(32) COLLATE utf8_bin NOT NULL DEFAULT '',
  `u_root_cause` char(32) COLLATE utf8_bin DEFAULT NULL,
  `u_business_impact_description` mediumtext COLLATE utf8_bin,
  `u_business_impact_category` mediumtext COLLATE utf8_bin,
  PRIMARY KEY (`sys_id`)
) ENGINE=ROCKSDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin
1 row in set (0.00 sec)

mysql> show create table incident\G
*************************** 1. row ***************************
       Table: incident
Create Table: CREATE TABLE `incident` (
  `sys_id` char(32) COLLATE utf8_bin NOT NULL DEFAULT '',
  `category` varchar(40) COLLATE utf8_bin DEFAULT NULL,
  PRIMARY KEY (`sys_id`),
  KEY `incident_category` (`category`)
) ENGINE=ROCKSDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin
1 row in set (0.00 sec)

mysql> show table status like 'task'\G
*************************** 1. row ***************************
           Name: task
         Engine: ROCKSDB
        Version: 10
     Row_format: Dynamic
           Rows: 8292
 Avg_row_length: 69
    Data_length: 573991
...
1 row in set (0.00 sec)

mysql> show table status like 'incident'\G
*************************** 1. row ***************************
           Name: incident
         Engine: ROCKSDB
        Version: 10
     Row_format: Dynamic
           Rows: 8192
 Avg_row_length: 87
    Data_length: 719091
Max_data_length: 0
   Index_length: 956624
...
1 row in set (0.00 sec)
The tables are simple and small. Then I tried to study how well this query scales when number of threads it is running from is increasing:
mysql> explain select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category;
+----+-------------+----------+--------+---------------------------+-------------------+---------+----------------------+------+-------------+
| id | select_type | table    | type   | possible_keys             | key               | key_len | ref                  | rows | Extra       |
+----+-------------+----------+--------+---------------------------+-------------------+---------+----------------------+------+-------------+
|  1 | SIMPLE      | incident | index  | PRIMARY,incident_category | incident_category | 123     | NULL                 | 8192 | Using index |
|  1 | SIMPLE      | task     | eq_ref | PRIMARY                   | PRIMARY           | 96      | test.incident.sys_id |    1 | Using index |
+----+-------------+----------+--------+---------------------------+-------------------+---------+----------------------+------+-------------+
2 rows in set (0.00 sec)
For this I've used mysqlslap and commands like the following:
[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=1 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category'
with different --concurrency values: 1, 2, 4, 8, 10, 16, 32 (on a box with 4 cores). Before the fix InnoDB had scalability problems even with concurrency N, where N is the number of cores. After the fix it scaled way better in this test.

You can find the summary of results in this Google spreadsheet, but here is a chart from it:


The X axis has log scale and shows number of concurrent threads running. The Y axis is the average number of problematic queries executed per second. "inner" and "straight" refer to INNER JOIN vs STRAIGHT_JOIN queries presented above.

"MyRocks" on this chart means ROCKSDB in MySQL from Facebook built from source corresponding to the following commit:
[openxs@fc23 mysql-5.6]$ git log -1
commit 6b4ba98698182868800f31e0d303f1e16026efdf
...
Date:   Sat Jan 28 13:51:47 2017 -0800
while "InnoDB 5.7" means InnoDB from Oracle MySQL 5.7.17, the following commit specifically:
[openxs@fc23 mysql-server]$ git log -1
commit 23032807537d8dd8ee4ec1c4d40f0633cd4e12f9
...
Date:   Mon Nov 28 16:48:20 2016 +0530
For this post I wonder why the difference in favor of "straight" query is so big starting from 4 concurrent threads in case of MyRocks? I've profiled both queries running via mysqlslap as follows:
[openxs@fc23 ~]$ perf --version
perf version 4.9.6.200.fc25.x86_64.g51a0
[openxs@fc23 ~]$ sudo perf record -a
[sudo] password for openxs:
^C[ perf record: Woken up 335 times to write data ]
[ perf record: Captured and wrote 84.913 MB perf.data (1825548 samples) ]
Profiler was started, then mysqlslap command executed and Ctrl-C was pressed when it produced the results.

The result for STRAIGHT_JOIN queries was the following (up to 1% of "overhead"):
[openxs@fc23 ~]$ sudo perf report --stdio | more...
# Total Lost Samples: 0
#
# Samples: 1M of event 'cycles:p'
# Event count (approx.): 62748127726390
#
# Overhead  Command          Shared Object                     Symbol...
     5.32%  my-oneconnectio  libc-2.24.so                      [.] __memcmp_sse4_1
     4.68%  my-oneconnectio  libc-2.24.so                      [.] __memcpy_ssse3
     4.04%  my-oneconnectio  mysqld                            [.] rocksdb::BlockIter::Seek
     3.33%  my-oneconnectio  mysqld                            [.] my_strnxfrm_unicode
     3.11%  my-oneconnectio  mysqld                            [.] rocksdb::BlockBasedTable::Get
     2.96%  my-oneconnectio  mysqld                            [.] myrocks::Rdb_pk_comparator::Compare
     2.54%  my-oneconnectio  mysqld                            [.] rocksdb::BlockIter::BinarySeek
     2.14%  my-oneconnectio  mysqld                            [.] rocksdb::InternalKeyComparator::Compare
     2.00%  my-oneconnectio  mysqld                            [.] rocksdb::StatisticsImpl::recordTick
     1.99%  my-oneconnectio  mysqld                            [.] rocksdb::MergingIterator::Next
     1.95%  my-oneconnectio  mysqld                            [.] rocksdb::HistogramStat::Add
     1.51%  my-oneconnectio  mysqld                            [.] join_read_key
     1.51%  my-oneconnectio  mysqld                            [.] myrocks::rdb_unpack_utf8_str
     1.41%  my-oneconnectio  mysqld                            [.] myrocks::Rdb_key_def::unpack_record
     1.36%  my-oneconnectio  mysqld                            [.] sub_select
     1.32%  my-oneconnectio  mysqld                            [.] my_uni_utf8
     1.29%  my-oneconnectio  mysqld                            [.] rocksdb::Version::Get
     1.22%  my-oneconnectio  libc-2.24.so                      [.] _int_malloc
     1.22%  my-oneconnectio  mysqld                            [.] rocksdb::TableCache::Get
     1.22%  my-oneconnectio  mysqld                            [.] rocksdb::BlockIter::Next
     1.21%  my-oneconnectio  mysqld                            [.] rocksdb::ThreadLocalPtr::Get
     1.18%  my-oneconnectio  mysqld                            [.] rocksdb::DBIter::FindNextUserEntryInternal
     1.18%  my-oneconnectio  mysqld                            [.] rocksdb::(anonymous namespace)::FilePicker::GetNextFile
     1.12%  my-oneconnectio  libc-2.24.so                      [.] malloc
     1.07%  my-oneconnectio  mysqld                            [.] evaluate_join_record
     1.07%  my-oneconnectio  mysqld                            [.] myrocks::Rdb_key_def::pack_index_tuple
     1.01%  my-oneconnectio  mysqld                            [.] key_restore
The result for INNER JOIN queries was the following (up to 1% of "overhead"):
[openxs@fc23 ~]$ sudo perf report --stdio | more
...
# Total Lost Samples: 0
#
# Samples: 1M of event 'cycles:p'
# Event count (approx.): 162704428520300
#
# Overhead  Command          Shared Object                  Symbol...
#
     5.90%  my-oneconnectio  libc-2.24.so                   [.] __memcpy_ssse3
     4.38%  my-oneconnectio  libpthread-2.24.so             [.] pthread_mutex_lock     3.69%  my-oneconnectio  libc-2.24.so                   [.] __memcmp_sse4_1
     3.58%  my-oneconnectio  mysqld                         [.] rocksdb::BlockIter::Seek
     2.53%  my-oneconnectio  libpthread-2.24.so             [.] pthread_mutex_unlock     2.47%  my-oneconnectio  mysqld                         [.] rocksdb::StatisticsImpl::recordTick
     2.35%  my-oneconnectio  mysqld                         [.] rocksdb::BlockBasedTable::Get
     2.28%  my-oneconnectio  mysqld                         [.] my_strnxfrm_unicode
     2.08%  my-oneconnectio  mysqld                         [.] rocksdb::BlockIter::BinarySeek
     1.84%  my-oneconnectio  mysqld                         [.] rocksdb::HistogramStat::Add
     1.84%  my-oneconnectio  mysqld                         [.] rocksdb::Version::Get
     1.71%  my-oneconnectio  mysqld                         [.] myrocks::rdb_unpack_binary_or_utf8_varchar_space_pad     1.69%  my-oneconnectio  mysqld                         [.] rocksdb::LRUCacheShard::Lookup
     1.67%  my-oneconnectio  mysqld                         [.] my_uni_utf8
     1.61%  my-oneconnectio  mysqld                         [.] rocksdb::InternalKeyComparator::Compare
     1.52%  my-oneconnectio  mysqld                         [.] myrocks::Rdb_pk_comparator::Compare
     1.47%  my-oneconnectio  mysqld                         [.] rocksdb::(anonymous namespace)::FilePicker::GetNextFile
     1.23%  my-oneconnectio  mysqld                         [.] rocksdb::BlockIter::Next
     1.15%  my-oneconnectio  mysqld                         [.] rocksdb::MergingIterator::Next
     1.10%  my-oneconnectio  mysqld                         [.] join_read_key
     1.09%  my-oneconnectio  mysqld                         [.] rocksdb::Block::NewIterator
     1.04%  my-oneconnectio  mysqld                         [.] rocksdb::TableCache::Get
I am not an expert in MyRocks at all, so I will not even try to interpret these results (maybe in some later post, one day. I've just highlighted what I consider a real difference in the second (INNER JOIN) case. Note pthread_mutex_lock/pthread_mutex_unlock calls, for example that are NOT present in "more than 1%" output for STRAIGHT_JOIN case.

I'd be happy to get any comments on the above. I also plan to share more/raw data from perf some day later, when I decide where to put them for public access (we speak about text outputs megabytes in size).

I've captured profiles with callgraphs as well:
[openxs@fc23 ~]$ sudo perf record -ag
^C[ perf record: Woken up 565 times to write data ]
[ perf record: Captured and wrote 142.410 MB perf.data (644176 samples) ]
Again, full trees are huge and require real expert to be analyzed properly.

New perf report option (-g flat, available since kernels 4.4+) gave nice backtraces like this:
           40.20%
                handle_one_connection
                do_handle_one_connection
                dispatch_command
                mysql_parse
                mysql_execute_command
                execute_sqlcom_select
                handle_select
                mysql_select
                JOIN::exec
                sub_select
                evaluate_join_record
                sub_select
                join_read_key
                myrocks::ha_rocksdb::index_read_map_impl
                myrocks::ha_rocksdb::get_row_by_rowid
                myrocks::Rdb_transaction_impl::get
                rocksdb::TransactionBaseImpl::Get
                rocksdb::WriteBatchWithIndex::GetFromBatchAndDB
                rocksdb::DBImpl::Get
                rocksdb::DBImpl::GetImpl
                rocksdb::Version::Get
that show how data are read in MyRocks. You can get similar backtraces with pt-pmp as well, but you'll see only how many times specific backtrace was noted, not what was the related overhead. Also, performance impact from pt-pmp would be very notable in this case.

Time to stop it seems, the post is already long. One day I'll continue, as there are many phenomena highlighted by this very simple, but famous, use case. Stay tuned!

Sunday, January 29, 2017

perf Basics for MySQL Profiling

Oprofile was widely used for MySQL profiling on Linux in the past. But since 2010 and 2.6.31 Linux kernels another profiler, perf, gets increasing popularity. It uses performance counters (CPU hardware registers that count hardware events such as instructions executed) subsystem in Linux. perf is capable of lightweight profiling. It is included in the Linux kernel, under tools/perf (so features available depends on kernel version), and is frequently updated and enhanced.

So, probably perf is the future of profiling on Linux and it makes sense to discuss its basic usage for profiling MySQL servers. For detailed discussions of features provided, numerous examples (not related to MySQL) and links I suggest to read great posts by Brendan Gregg, starting from this one.

I am going to use Ubuntu 14.04 for this basic Howto, with the following perf version provided by the linux-tools-common package:

openxs@ao756:~/dbs/maria10.1$ perf --version
perf version 3.13.11-ckt39
openxs@ao756:~/dbs/maria10.1$ dpkg -S `which perf`
linux-tools-common: /usr/bin/perf
On RPM-based systems package name is just perf. I plan to discuss and show details of (other versions of) perf usage on CentOS 6.8 and Fedora 25 later.

I'll work with a bit outdated MariaDB 10.1.x (built from source) I have at hand up and running. In case of standard vendor's packages, make sure that you have debug symbols installed (perf will inform you if some are missing anyway), to get useful results:
openxs@ao756:~$ dpkg -l | grep -i percona | grep dbg
ii  percona-server-5.7-dbg                                5.7.16-10-1.trusty                                  amd64        Debugging package for Percona Server
Also, ideally binaries that you are profiling should be built with  -fno-omit-frame-pointer option.

As in the previous post about basic oprofile usage, I'll try to start with a primitive test case based on  Bug #39630, for simplicity and to establish basic trust in the tool's outputs, but this time I also want to try profiling some concurrent load more close to a real life, along the lines of Bug #83912.

Basic two(!) steps to use perf for studying some problematic period of MySQL server are the following:
  1. Assuming the we want to get a profile of mysqld process during some problematic period, let's say in some script, we start recording samples for, let's say, 30 seconds, as a root user or via sudo:
    openxs@ao756:~/dbs/maria10.1$ sudo perf record -p `pidof mysqld` sleep 30
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.052 MB perf.data (~2257 samples) ]
  2. Samples are collected in the perf.data file in the current directory by default:
    openxs@ao756:~/dbs/maria10.1$ ls -l perf.data-rw------- 1 root root 56768 січ 29 13:53 perf.data
    When collection completes we can get profiling information using perf report command. If you want to save the result into some file or just review it, add --stdio option. otherwise you end up in a text-based "interactive" user interface of perf, that, while can be useful, may be misleading and weird for beginners. For the 'select benchmark(500000000, 2*2) case I've got:
    openxs@ao756:~/dbs/maria10.1$ sudo perf report --stdio
    # ========
    # captured on: Sun Jan 29 14:01:24 2017
    # hostname : ao756
    # os release : 3.13.0-100-generic
    # perf version : 3.13.11-ckt39
    # arch : x86_64
    # nrcpus online : 2
    # nrcpus avail : 2

    # cpudesc : Intel(R) Pentium(R) CPU 987 @ 1.50GHz
    # cpuid : GenuineIntel,6,42,7
    # total memory : 3861452 kB
    # cmdline : /usr/lib/linux-tools-3.13.0-100/perf record -p 4270 sleep 30
    # event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 = 0x0, e
    # HEADER_CPU_TOPOLOGY info available, use -I to display
    # HEADER_NUMA_TOPOLOGY info available, use -I to display
    # pmu mappings: cpu = 4, software = 1, tracepoint = 2, uncore_cbox_0 = 6, uncore
    # ========
    #
    # Samples: 248  of event 'cycles'
    # Event count (approx.): 24811349
    #
    # Overhead  Command       Shared Object
    # ........  .......  ..................  .......................................
    #
        11.60%   mysqld  [kernel.kallsyms]   [k] update_blocked_averages
         8.12%   mysqld  [kernel.kallsyms]   [k] update_cfs_rq_blocked_load
         6.98%   mysqld  [kernel.kallsyms]   [k] find_busiest_group
         5.76%   mysqld  [kernel.kallsyms]   [k] sys_io_getevents
         4.40%   mysqld  libc-2.19.so        [.] memset
         3.21%   mysqld  [kernel.kallsyms]   [k] update_curr
         3.03%   mysqld  [kernel.kallsyms]   [k] __calc_delta
         2.96%   mysqld  [kernel.kallsyms]   [k] put_prev_task_fair
         2.46%   mysqld  [kernel.kallsyms]   [k] timerqueue_add
         2.12%   mysqld  [kernel.kallsyms]   [k] cpumask_next_and
         2.11%   mysqld  libpthread-2.19.so  [.] __pthread_mutex_cond_lock     2.10%   mysqld  mysqld              [.] os_aio_linux_handle(unsigned long,
    ...
         1.09%   mysqld  mysqld              [.] buf_flush_lru_manager_thread
    ...
Now, the result above looks similar to and as suspicious as the result I've got with operf (that is also based on perf_events internally) on the same system and MariaDB version in the previous post. Everyone wonders where is the time spent on multiplication etc in the above? For operf (see my comment) I had found out that system-wide profiling gives expected results. It seems perf record -p pid also collects samples for single CPU (the one it runs on, or the one running that process actively, or just some one), and this is a problem for multi-threaded programs like mysqld if they are executed on multiple-core hardware (almost all hardware are like that these days, maybe some virtual machines aside). The solution is simple, add -a option for perf record:
       -a, --all-cpus
           System-wide collection from all CPUs.
and do not refer to mysqld, you can filter out related results later (for example, with  sudo perf report --stdio | grep mysqld | head -10, or interactively when --stdio option is not used). Like this:
openxs@ao756:~/dbs/maria10.1$ sudo perf record -a sleep 30
[ perf record: Woken up 15 times to write data ]
[ perf record: Captured and wrote 4.411 MB perf.data (~192717 samples) ]
openxs@ao756:~/dbs/maria10.1$ sudo perf report --stdio
# ========
# captured on: Sun Jan 29 14:27:31 2017
# hostname : ao756
# os release : 3.13.0-100-generic
# perf version : 3.13.11-ckt39
# arch : x86_64
# nrcpus online : 2
...
# Samples: 83K of event 'cycles'
# Event count (approx.): 30621323147
#
# Overhead          Command                  Shared Object
# ........  ...............  .............................  ............................................................................................................
#
    31.22%           mysqld  mysqld                         [.] Item_func_mul::int_op()    16.57%           mysqld  mysqld                         [.] Item_func_hybrid_field_type::val_int()
    16.51%           mysqld  mysqld                         [.] Item_hybrid_func::result_type() const
    16.11%           mysqld  mysqld                         [.] Item_func_benchmark::val_int()     9.92%           mysqld  mysqld                         [.] Item_int::val_int()
     3.27%           mysqld  mysqld                         [.] Type_handler_int_result::cmp_type() const
     3.25%           mysqld  mysqld                         [.] Type_handler_int_result::result_type() const
     0.27%      kworker/0:2  [kernel.kallsyms]              [k] aes_encrypt
     0.18%           compiz  i965_dri.so                    [.] 0x00000000000def10
     0.14%          swapper  [kernel.kallsyms]              [k] intel_idle
...
So, system-wide profile provides the information we can trust, even on multi-core systems for multi-threaded programs. If you know some other way, less intrusive way to get profiles that do not miss the details from threads running on other cores, please, tell me. I was NOT able to find it quickly.

By the way, nice interactive perf top command shows expected results as well when this primitive test runs (it is still a system-wide profile):

Samples: 17K of event 'cycles', Event count (approx.): 5061557268
 28,29%  mysqld                         [.] Item_func_mul::int_op()
 15,05%  mysqld                         [.] Item_func_benchmark::val_int()
 15,01%  mysqld                         [.] Item_hybrid_func::result_type() cons
 15,01%  mysqld                         [.] Item_func_hybrid_field_type::val_int
  9,12%  mysqld                         [.] Item_int::val_int()
  3,34%  perf                           [.] 0x00000000000564db
  3,18%  mysqld                         [.] Type_handler_int_result::cmp_type()
  2,93%  mysqld                         [.] Type_handler_int_result::result_type
  0,96%  libc-2.19.so                   [.] __strstr_sse2
  0,61%  libc-2.19.so                   [.] __GI___strcmp_ssse3
  0,24%  [kernel]                       [k] kallsyms_expand_symbol.constprop.1
  0,21%  [kernel]                       [k] module_get_kallsym
  0,20%  libc-2.19.so                   [.] _int_malloc
  0,20%  libharfbuzz.so.0.927.0         [.] 0x0000000000028051
  0,19%  i965_dri.so                    [.] 0x000000000010a9a0
  0,18%  [kernel]                       [k] format_decode
  0,16%  [kernel]                       [k] vsnprintf
  0,15%  libc-2.19.so                   [.] memchr
  0,15%  libc-2.19.so                   [.] __strchr_sse2
  0,14%  [kernel]                       [k] number.isra.1
  0,13%  [kernel]                       [k] memcpy
  0,09%  libc-2.19.so                   [.] strlen
Press '?' for help on key bindings
Press 'q' to get out of that screen.

perf record may be also used the same way as operf, run in foreground and be interrupted with Ctrl-C (examples are from my last night's session on CentOS):
[root@centos ~]# perf record -ag
^C[ perf record: Woken up 63 times to write data ]
[ perf record: Captured and wrote 15.702 MB perf.data (226643 samples) ]
or it can be started in the background and then interrupted with -SIGINT signal sent to the process:
[root@centos ~]# perf record -ag &[1] 2353
[root@centos ~]# mysql -uroot -e'select benchmark(500000000,2*2);' test
+--------------------------+
| benchmark(500000000,2*2) |
+--------------------------+
|                        0 |
+--------------------------+
[root@centos ~]# kill -sigint  2353[ perf record: Woken up 90 times to write data ]
[root@centos ~]# [ perf record: Captured and wrote 22.366 MB perf.data (322362 samples) ]

[1]+  Interrupt               perf record -ag
[root@centos ~]# ps aux | grep perf
root      2358  0.0  0.0 103312   864 pts/0    S+   22:16   0:00 grep perf
You've probably noted additional -g option in the examples above (explanation pasted from perf help record/man perf-record output):
       -g
           Enables call-graph (stack chain/backtrace) recording.
Another option often used is -F (like -F99 or -F1000):
       -F, --freq=
           Profile at this frequency.
For select benchmark(500000000, 2*2) case -g does not show anything really unexpected or enlightening:

openxs@ao756:~/dbs/maria10.1$ sudo perf record -ag sleep 30
[sudo] password for openxs:
[ perf record: Woken up 64 times to write data ]
[ perf record: Captured and wrote 16.496 MB perf.data (~720730 samples) ]
openxs@ao756:~/dbs/maria10.1$ sudo perf report --stdio
...

# Overhead          Command                  Shared Object
# ........  ...............  .............................  ............................................................................................................
#
    31.02%           mysqld  mysqld                         [.] Item_func_mul::int_op()
                     |
                     --- Item_func_mul::int_op()
                        |
                        |--94.56%-- Item_func_hybrid_field_type::val_int()
                        |          Item_func_benchmark::val_int()
                        |          Item::send(Protocol*, String*)
                        |          Protocol::send_result_set_row(List<Item>*)
                        |          select_send::send_data(List<Item>&)
                        |          JOIN::exec_inner()
                        |          JOIN::exec()
                        |          mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsi
                        |          handle_select(THD*, LEX*, select_result*, unsigned long)
                        |          execute_sqlcom_select(THD*, TABLE_LIST*)
                        |          mysql_execute_command(THD*)
                        |          mysql_parse(THD*, char*, unsigned int, Parser_state*)
                        |          dispatch_command(enum_server_command, THD*, char*, unsigned int)
                        |          do_command(THD*)
                        |          do_handle_one_connection(THD*)
                        |          handle_one_connection
                        |          start_thread
                        |
                         --5.44%-- Item_func_benchmark::val_int()
                                   Item::send(Protocol*, String*)
                                   Protocol::send_result_set_row(List<Item>*)
                                   select_send::send_data(List<Item>&)
                                   JOIN::exec_inner()
                                   JOIN::exec()
                                   mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsi
                                   handle_select(THD*, LEX*, select_result*, unsigned long)
                                   execute_sqlcom_select(THD*, TABLE_LIST*)
                                   mysql_execute_command(THD*)
                                   mysql_parse(THD*, char*, unsigned int, Parser_state*)
                                   dispatch_command(enum_server_command, THD*, char*, unsigned int)
                                   do_command(THD*)
                                   do_handle_one_connection(THD*)
                                   handle_one_connection
                                   start_thread
...


But we clearly see backtraces/callgraphs.

Note that this tree starts with the on-CPU functions and works back through the ancestry. This approach is called a "callee based call graph". This can be flipped by using -G for an "inverted call graph", or by using the caller option to -g/--call-graph, instead of the callee default. The hottest (most frequent) stack trace in this perf call graph occurred in 29.33% of samples, which is the product of the overhead percentage and top stack leaf (31.02%*94.56%, which are relative rates). perf report can also be run with "-g graph" to show absolute overhead rates

I'd like to conclude this blog post with profiling system running test load presented in Bug #83912 (that, in turn, is just one of the observations based on a very complex real life performance problem I am still working on). There we run N+2 long running UPDATEs on N+2 different tables on a system with N cores. While UPDATEs run, sometimes SELECT asking for a single record by PRIMARY KEY for a different table may sometimes run slow, for fractions of second to many seconds.

So, with table to select from, t0, and other 4 (N+2, N=2 on my netbook used for testing) tables to update, t1 ... t4, created:
openxs@ao756:~/dbs/maria10.1$ bin/mysql -uroot test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 21
Server version: 10.1.18-MariaDB MariaDB Server

Copyright (c) 2000, 2016, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [test]> select count(*) from t0;
+----------+
| count(*) |
+----------+
|  2097152 |
+----------+
1 row in set (9.33 sec)

MariaDB [test]> select * from t0 where id = 13;
+----+------+----------------------+
| id | c1   | c2                   |
+----+------+----------------------+
| 13 |  269 | 0.008545639932663777 |
+----+------+----------------------+
1 row in set (0.00 sec)

MariaDB [test]> show table status like 't1'\G
*************************** 1. row ***************************
           Name: t1
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 2079632
 Avg_row_length: 139
    Data_length: 289177600
Max_data_length: 0
   Index_length: 0
      Data_free: 6291456
 Auto_increment: 2424797
    Create_time: 2017-01-29 12:31:57
    Update_time: NULL
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options: stats_persistent=1
        Comment:
1 row in set (0.11 sec)

MariaDB [test]> explain select * from t0 where id = 13;
+------+-------------+-------+-------+---------------+---------+---------+-------+------+-------+
| id   | select_type | table | type  | possible_keys | key     | key_len | ref   | rows | Extra |
+------+-------------+-------+-------+---------------+---------+---------+-------+------+-------+
|    1 | SIMPLE      | t0    | const | PRIMARY       | PRIMARY | 4       | const |    1 |       |
+------+-------------+-------+-------+---------------+---------+---------+-------+------+-------+
1 row in set (0.00 sec)

MariaDB [test]> explain update t1 set c2 = rand() where id between 2000 and 300000;
+------+-------------+-------+-------+---------------+---------+---------+------+--------+-------------+
| id   | select_type | table | type  | possible_keys | key     | key_len | ref  | rows   | Extra       |
+------+-------------+-------+-------+---------------+---------+---------+------+--------+-------------+
|    1 | SIMPLE      | t1    | range | PRIMARY       | PRIMARY | 4       | NULL | 339246 | Using where |
+------+-------------+-------+-------+---------------+---------+---------+------+--------+-------------+
1 row in set (0.00 sec)
I've started 4 (N+2) threads each updating large part of one of the tables:
openxs@ao756:~/dbs/maria10.1$ for i in `seq 1 4`; do bin/mysqlslap -uroot --no-drop --create-schema=test --number-of-queries=1000 iternations=10 --concurrency=1 --query="update t$i set c2 = rand() where id between 2000 and 300000;" & done
[1] 15513
[2] 15514
[3] 15515
[4] 15516

In the SHOW PROCESSLIST we see:
...
| 26 | root | localhost | test | Query   |   10 | updating | update t2 set c2 = rand() where id between 2000 and 300000 |    0.000 |
| 27 | root | localhost | test | Query   |   32 | updating | update t1 set c2 = rand() where id between 2000 and 300000 |    0.000 |
| 28 | root | localhost | test | Query   |    5 | updating | update t4 set c2 = rand() where id between 2000 and 300000 |    0.000 |
| 29 | root | localhost | test | Query   |   11 | updating | update t3 set c2 = rand() where id between 2000 and 300000 |    0.000 |
...

Let's see with perf what these threads are doing, so they spend 29 seconds in updating status. First without -g:
openxs@ao756:~/dbs/maria10.1$ sudo perf record -a sleep 30[sudo] password for openxs:
[ perf record: Woken up 15 times to write data ]
[ perf record: Captured and wrote 4.260 MB perf.data (~186128 samples) ]
openxs@ao756:~/dbs/maria10.1$ sudo perf report --stdio
...
# Overhead          Command                  Shared Object
# ........  ...............  .............................  ............................................................................................................
#
    12.50%      kworker/0:1  [kernel.kallsyms]              [k] aes_encrypt    10.54%      kworker/1:1  [kernel.kallsyms]              [k] aes_encrypt
     9.77%      kworker/1:2  [kernel.kallsyms]              [k] aes_encrypt
     4.81%      kworker/0:0  [kernel.kallsyms]              [k] aes_encrypt
     4.52%           mysqld  mysqld                         [.] buf_calc_page_new_checksum(unsigned char const*)     3.64%           mysqld  mysqld                         [.] log_block_calc_checksum_innodb(unsigned char const*)
     2.22%           mysqld  mysqld                         [.] dtoa.constprop.6
     2.15%           mysqld  mysqld                         [.] trx_undo_report_row_operation(unsigned long, unsigned long, que_thr_t*, dict_index_t*, dtuple_t const*,
     2.10%           mysqld  mysqld                         [.] quorem
     1.78%           mysqld  mysqld                         [.] mtr_commit(mtr_t*)
     1.41%           mysqld  mysqld                         [.] ha_innobase::update_row(unsigned char const*, unsigned char*)
     1.15%           mysqld  mysqld                         [.] buf_page_optimistic_get(unsigned long, buf_block_t*, unsigned long, char const*, unsigned long, mtr_t*)

     1.15%           mysqld  mysqld                         [.] buf_page_get_gen(unsigned long, unsigned long, unsigned long, unsigned long, buf_block_t*, unsigned long
     1.11%           mysqld  mysqld                         [.] row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long)
     1.10%           mysqld  mysqld                         [.] multadd
...
I am not surprised to see aes_encrypt in kernel on top of this system-wide profile, as encrupted HDD is used on this netbook (one of Percona security requirements for any hardware ever used for work, even personal) and upodates are surely I/O bound with mysqld --no-defaults and 128M of buffer pool.

We may want to study some of mysqld-related entries further with callgraphs generated (-g):

openxs@ao756:~/dbs/maria10.1$ sudo perf record -ag sleep 20
[ perf record: Woken up 31 times to write data ]
[ perf record: Captured and wrote 8.313 MB perf.data (~363186 samples) ]
openxs@ao756:~/dbs/maria10.1$ sudo perf report --stdio

...
     4.29%           mysqld  mysqld                         [.] buf_calc_page_new_checksum(unsigned char const*)
                     |
                     --- buf_calc_page_new_checksum(unsigned char const*)
                        |
                        |--67.15%-- buf_flush_init_for_writing(unsigned char*, void*, unsigned long)
                        |          buf_flush_page(buf_pool_t*, buf_page_t*, buf_flush_t, bool)
                        |          buf_flush_try_neighbors(unsigned long, unsigned long, buf_flush_t, unsigned long, unsigned long)
                        |          |
                        |          |--52.15%-- buf_do_flush_list_batch(buf_pool_t*, unsigned long, unsigned long)
                        |          |          buf_flush_list(unsigned long, unsigned long, unsigned long*)
                        |          |          buf_flush_page_cleaner_thread
                        |          |          start_thread
                        |          |
                        |           --47.85%-- buf_flush_LRU_list_batch(buf_pool_t*, unsigned long, bool, flush_counters_t*)
                        |                     buf_flush_batch(buf_pool_t*, buf_flush_t, unsigned long, unsigned long, bool, flush_counters_t*)
                        |                     buf_flush_LRU_tail()
                        |                     buf_flush_lru_manager_thread
                        |                     start_thread
                        |
                         --32.85%-- buf_page_is_corrupted(bool, unsigned char const*, unsigned long)
                                   buf_page_io_complete(buf_page_t*)
                                   |
                                   |--82.17%-- fil_aio_wait(unsigned long)
                                   |          io_handler_thread
                                   |          start_thread
                                   |
                                    --17.83%-- buf_read_page(unsigned long, unsigned long, unsigned long, trx_t*, buf_page_t**)
                                              buf_page_get_gen(unsigned long, unsigned long, unsigned long, unsigned long, buf_block_t*, unsigned long, char const*, uns
                                              |
                                              |--92.31%-- btr_pcur_move_to_next_page(btr_pcur_t*, mtr_t*)
                                              |          btr_pcur_move_to_next(btr_pcur_t*, mtr_t*)
                                              |          row_search_for_mysql(unsigned char*, unsigned long, row_prebuilt_t*, unsigned long, unsigned long)
                                              |          ha_innobase::index_next(unsigned char*)
                                              |          handler::ha_index_next(unsigned char*)
                                              |          handler::read_range_next()
                                              |          handler::multi_range_read_next(void**)
                                              |          Mrr_simple_index_reader::get_next(void**)
                                              |          DsMrr_impl::dsmrr_next(void**)
                                              |          QUICK_RANGE_SELECT::get_next()
                                              |          rr_quick(READ_RECORD*)
                                              |          mysql_update(THD*, TABLE_LIST*, List<Item>&, List<Item>&, Item*, unsigned int, st_order*, unsigned long long, e
                                              |          mysql_execute_command(THD*)
                                              |          mysql_parse(THD*, char*, unsigned int, Parser_state*)
                                              |          dispatch_command(enum_server_command, THD*, char*, unsigned int)
                                              |          do_command(THD*)
                                              |          do_handle_one_connection(THD*)
                                              |          handle_one_connection
                                              |          start_thread
                                              |
                                              |--5.37%-- btr_estimate_n_rows_in_range(dict_index_t*, dtuple_t const*, unsigned long, dtuple_t const*, unsigned long, trx
                                              |          ha_innobase::records_in_range(unsigned int, st_key_range*, st_key_range*)
                                              |          handler::multi_range_read_info_const(unsigned int, st_range_seq_if*, void*, unsigned int, unsigned int*, unsign
                                              |          DsMrr_impl::dsmrr_info_const(unsigned int, st_range_seq_if*, void*, unsigned int, unsigned int*, unsigned int*,
                                              |          check_quick_select(PARAM*, unsigned int, bool, SEL_ARG*, bool, unsigned int*, unsigned int*, Cost_estimate*)
                                              |          get_key_scans_params(PARAM*, SEL_TREE*, bool, bool, double)
                                              |          SQL_SELECT::test_quick_select(THD*, Bitmap<64u>, unsigned long long, unsigned long long, bool, bool, bool)
                                              |          mysql_update(THD*, TABLE_LIST*, List<Item>&, List<Item>&, Item*, unsigned int, st_order*, unsigned long long, e
                                              |          mysql_execute_command(THD*)
                                              |          mysql_parse(THD*, char*, unsigned int, Parser_state*)
                                              |          dispatch_command(enum_server_command, THD*, char*, unsigned int)
                                              |          do_command(THD*)
                                              |          do_handle_one_connection(THD*)
                                              |          handle_one_connection
                                              |          start_thread
...

     1.28%           mysqld  mysqld                         [.] ha_innobase::update_row(unsigned char const*, unsigned char*)
                     |
                     --- ha_innobase::update_row(unsigned char const*, unsigned char*)
                        |
                        |--97.71%-- handler::ha_update_row(unsigned char const*, unsigned char*)
                        |          mysql_update(THD*, TABLE_LIST*, List<Item>&, List<Item>&, Item*, unsigned int, st_order*, unsigned long long, enum_duplicates, bool,
                        |          mysql_execute_command(THD*)
                        |          mysql_parse(THD*, char*, unsigned int, Parser_state*)
                        |          dispatch_command(enum_server_command, THD*, char*, unsigned int)
                        |          do_command(THD*)
                        |          do_handle_one_connection(THD*)
                        |          handle_one_connection
                        |          start_thread
                        |
                         --2.29%-- mysql_update(THD*, TABLE_LIST*, List<Item>&, List<Item>&, Item*, unsigned int, st_order*, unsigned long long, enum_duplicates, bool,
                                   mysql_execute_command(THD*)
                                   mysql_parse(THD*, char*, unsigned int, Parser_state*)
                                   dispatch_command(enum_server_command, THD*, char*, unsigned int)
                                   do_command(THD*)
                                   do_handle_one_connection(THD*)
                                   handle_one_connection
                                   start_thread
...


Now to the interesting part (slow UPDATEs when I/O bound are expected). Usually (but not always, in original real life case!) simple SELECT runs fast, but if we limit innodb_thread_concurrency to, say, 4, it may start to run slow:
MariaDB [test]> select * from t0 where id = 13;
+----+------+----------------------+
| id | c1   | c2                   |
+----+------+----------------------+
| 13 |  269 | 0.008545639932663777 |
+----+------+----------------------+
1 row in set (0.00 sec)

MariaDB [test]> select @@innodb_thread_concurrency;
+-----------------------------+
| @@innodb_thread_concurrency |
+-----------------------------+
|                           0 |
+-----------------------------+
1 row in set (0.05 sec)

MariaDB [test]> set global innodb_thread_concurrency=4;
Query OK, 0 rows affected (0.25 sec)

MariaDB [test]> select * from t0 where id = 13;
+----+------+----------------------+
| id | c1   | c2                   |
+----+------+----------------------+
| 13 |  269 | 0.008545639932663777 |
+----+------+----------------------+
1 row in set (2.92 sec)

MariaDB [test]> select * from t0 where id = 13;
+----+------+----------------------+
| id | c1   | c2                   |
+----+------+----------------------+
| 13 |  269 | 0.008545639932663777 |
+----+------+----------------------+
1 row in set (12.16 sec)

Profile with callgraph may help to find out why this happens:
openxs@ao756:~/dbs/maria10.1$ sudo perf record -ag sleep 20
[ perf record: Woken up 32 times to write data ]
[ perf record: Captured and wrote 8.503 MB perf.data (~371488 samples) ]
openxs@ao756:~/dbs/maria10.1$ sudo perf report --stdio
...
                        |--1.40%-- sys_select
                        |          system_call_fastpath
                        |          __select
                        |          |
                        |          |--71.70%-- srv_conc_enter_innodb(trx_t*)
                        |          |          ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function)
                        |          |          handler::index_read_idx_map(unsigned char*, unsigned int, unsigned char const*, unsigned long, ha_rkey_function)
                        |          |          handler::ha_index_read_idx_map(unsigned char*, unsigned int, unsigned char const*, unsigned long, ha_rkey_function)
                        |          |          join_read_const(st_join_table*)
                        |          |          join_read_const_table(THD*, st_join_table*, st_position*) [clone .isra.301]
                        |          |          make_join_statistics(JOIN*, List<TABLE_LIST>&, st_dynamic_array*)
                        |          |          JOIN::optimize_inner()
                        |          |          JOIN::optimize()
                        |          |          mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_o
                        |          |          handle_select(THD*, LEX*, select_result*, unsigned long)
                        |          |          execute_sqlcom_select(THD*, TABLE_LIST*)
                        |          |          mysql_execute_command(THD*)
                        |          |          mysql_parse(THD*, char*, unsigned int, Parser_state*)
                        |          |          dispatch_command(enum_server_command, THD*, char*, unsigned int)
                        |          |          do_command(THD*)
                        |          |          do_handle_one_connection(THD*)
                        |          |          handle_one_connection
                        |          |          start_thread
...
It spends 71.7% of time waiting to join InnoDB queue (in this limited InnoDB concurrency case) while diving into the index while optimizing the query... This time is spent on "statistics" stage, for single row SELECT by PRIMARY KEY. Now, go conclude this without a backtrace of calls or some profiler.