9.3 BPF Tools
This section covers the BPF tools you can use for disk performance analysis and troubleshooting. They are shown in Figure 9-3.
FIGURE 9-3 BPF tools for disk analysis
These tools are either from the BCC and bpftrace repositories covered in Chapters 4 and 5, or were created for this book. Some tools appear in both BCC and bpftrace. Table 9-2 lists the origins of the tools covered in this section (BT is short for bpftrace).
Table 9-2 Disk-Related Tools
Tool          Source    Target      Description
biolatency    BCC/BT    Block I/O   Summarize block I/O latency as a histogram
biosnoop      BCC/BT    Block I/O   Trace block I/O with PID and latency
biotop        BCC       Block I/O   Top for disks: summarize block I/O by process
bitesize      BCC/BT    Block I/O   Show disk I/O size histogram by process
seeksize      Book      Block I/O   Show requested I/O seek distances
biopattern    Book      Block I/O   Identify random/sequential disk access patterns
biostacks     Book      Block I/O   Show disk I/O with initialization stacks
bioerr        Book      Block I/O   Trace disk errors
mdflush       BCC/BT    MD          Trace md flush requests
iosched       Book      I/O sched   Summarize I/O scheduler latency
scsilatency   Book      SCSI        Show SCSI command latency distributions
scsiresult    Book      SCSI        Show SCSI command result codes
nvmelatency   Book      NVME        Summarize NVME driver command latency
For the tools from BCC and bpftrace, see their repositories for full and updated lists of tool options and capabilities. A selection of their most important capabilities is summarized here. See Chapter 8 for file system tools.
9.3.1 biolatency
biolatency(8)2 is a BCC and bpftrace tool to show block I/O device latency as a histogram. The term device latency refers to the time from issuing a request to the device, to when it completes, including time spent queued in the operating system.
2 Origin: I created this as iolatency.d for the 2011 DTrace book [Gregg 11], following the same name as my other iosnoop and iotop tools. This led to confusion since “io” is ambiguous, so for BPF I’ve added the “b” to these tools to signify block I/O. I created biolatency for BCC on 20-Sep-2015 and bpftrace on 13-Sep-2018.
The following shows biolatency(8) from BCC, on a production Hadoop instance, tracing block I/O for 10 seconds:
# biolatency 10 1
Tracing block device I/O... Hit Ctrl-C to end.

     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 15       |                                        |
       128 -> 255        : 4475     |************                            |
       256 -> 511        : 14222    |****************************************|
       512 -> 1023       : 12303    |**********************************      |
      1024 -> 2047       : 5649     |***************                         |
      2048 -> 4095       : 995      |**                                      |
      4096 -> 8191       : 1980     |*****                                   |
      8192 -> 16383      : 3681     |**********                              |
     16384 -> 32767      : 1895     |*****                                   |
     32768 -> 65535      : 721      |**                                      |
     65536 -> 131071     : 394      |*                                       |
    131072 -> 262143     : 65       |                                        |
    262144 -> 524287     : 17       |                                        |
This output shows a bi-modal distribution, with one mode between 128 and 2047 microseconds and the other between about 4 and 32 milliseconds. Now that I know that the device latency is bi-modal, understanding why may lead to tuning that moves more I/O to the faster mode. For example, the slower I/O could be random I/O, or larger-size I/O (which can be determined using other BPF tools). The slowest I/O in this output reached the 262- to 524-millisecond range: this sounds like deep queueing on the device.
biolatency(8) and the later biosnoop(8) tool have been used to solve many production issues. They can be especially useful for the analysis of multi-tenant drives in cloud environments, which can be noisy and break latency SLOs. When running on small cloud instances, Netflix’s Cloud Database team was able to use biolatency(8) and biosnoop(8) to isolate machines with unacceptably bi-modal or latent drives, and evict them from both distributed caching tiers and distributed database tiers. Upon further analysis, the team decided to change their deployment strategy based on these findings, and now deploy clusters to fewer nodes, choosing those large enough to have dedicated drives. This small change effectively eliminated the latency outliers with no additional infrastructure cost.
The biolatency(8) tool currently works by tracing various block I/O kernel functions using kprobes: it was written before tracepoint support was available in BCC. The overhead of this tool should be negligible on most systems, where the disk IOPS rate is low (<1000).
Queued Time
BCC biolatency(8) has a -Q option to include the OS queued time:
# biolatency -Q 10 1
Tracing block device I/O... Hit Ctrl-C to end.

     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 1        |                                        |
       128 -> 255        : 2780     |**********                              |
       256 -> 511        : 10386    |****************************************|
       512 -> 1023       : 8399     |********************************        |
      1024 -> 2047       : 4154     |***************                         |
      2048 -> 4095       : 1074     |****                                    |
      4096 -> 8191       : 2078     |********                                |
      8192 -> 16383      : 7688     |*****************************           |
     16384 -> 32767      : 4111     |***************                         |
     32768 -> 65535      : 818      |***                                     |
     65536 -> 131071     : 220      |                                        |
    131072 -> 262143     : 103      |                                        |
    262144 -> 524287     : 48       |                                        |
    524288 -> 1048575    : 6        |                                        |
The output is not much different: this time there’s some more I/O in the slower mode. iostat(1) confirms that the queue lengths are small (avgqu-sz < 1).
Disks
Systems can have mixed storage devices: disks for the OS, disks for storage pools, and drives for removable media. The -D option in biolatency(8) shows histograms for disks separately, helping you see how each type performs. For example:
# biolatency -D
Tracing block device I/O... Hit Ctrl-C to end.
^C
[...]

disk = 'sdb'
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 1        |                                        |
       256 -> 511        : 25       |**                                      |
       512 -> 1023       : 43       |****                                    |
      1024 -> 2047       : 206      |*********************                   |
      2048 -> 4095       : 8        |                                        |
      4096 -> 8191       : 8        |                                        |
      8192 -> 16383      : 392      |****************************************|

disk = 'nvme0n1'
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 12       |                                        |
        16 -> 31         : 72       |                                        |
        32 -> 63         : 5980     |****************************************|
        64 -> 127        : 1240     |********                                |
       128 -> 255        : 74       |                                        |
       256 -> 511        : 13       |                                        |
       512 -> 1023       : 4        |                                        |
      1024 -> 2047       : 23       |                                        |
      2048 -> 4095       : 10       |                                        |
      4096 -> 8191       : 63       |                                        |
This output shows two very different disk devices: nvme0n1, a flash-memory based disk, with I/O latency often between 32 and 127 microseconds; and sdb, an external USB storage device, with a bimodal I/O latency distribution in the milliseconds.
Flags
BCC biolatency(8) also has a -F option to print each set of I/O flags separately. For example, with -m for millisecond histograms:
# biolatency -Fm
Tracing block device I/O... Hit Ctrl-C to end.
^C
[...]

flags = Read
     msecs               : count     distribution
         0 -> 1          : 180      |*************                           |
         2 -> 3          : 519      |****************************************|
         4 -> 7          : 60       |****                                    |
         8 -> 15         : 123      |*********                               |
        16 -> 31         : 68       |*****                                   |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 2        |                                        |
       128 -> 255        : 12       |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 1        |                                        |

flags = Sync-Write
     msecs               : count     distribution
         0 -> 1          : 8        |***                                     |
         2 -> 3          : 26       |***********                             |
         4 -> 7          : 37       |***************                         |
         8 -> 15         : 65       |***************************             |
        16 -> 31         : 93       |****************************************|
        32 -> 63         : 20       |********                                |
        64 -> 127        : 6        |**                                      |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 4        |*                                       |
       512 -> 1023       : 17       |*******                                 |

flags = Flush
     msecs               : count     distribution
         0 -> 1          : 2        |****************************************|

flags = Metadata-Read
     msecs               : count     distribution
         0 -> 1          : 3        |****************************************|
         2 -> 3          : 2        |**************************              |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 1        |*************                           |
        16 -> 31         : 1        |*************                           |
These flags may be handled differently by the storage device; separating them allows us to study them in isolation. The above output shows that synchronous writes are bi-modal, with a slower mode in the 512- to 1023-millisecond range.
These flags are also visible in the block tracepoints via the rwbs field and one-letter encodings: see the “rwbs” section, earlier in this chapter, for an explanation of this field.
BCC
Command line usage:
biolatency [options] [interval [count]]
Options include:
-m: Print output in milliseconds (default is microseconds)
-Q: Include OS queued time
-D: Show each disk separately
-F: Show each set of I/O flags separately
-T: Include a timestamp on the output
Using an interval of one will print per-second histograms. This information can be visualized as a latency heat map, with a full second as columns, latency ranges as rows, and a color saturation to show the number of I/O in that time range [Gregg 10]. See Chapter 17 for an example using Vector.
bpftrace
The following is the code for the bpftrace version, which summarizes its core functionality. This version does not support options.
#!/usr/local/bin/bpftrace

BEGIN
{
	printf("Tracing block device I/O... Hit Ctrl-C to end.\n");
}

kprobe:blk_account_io_start
{
	@start[arg0] = nsecs;
}

kprobe:blk_account_io_done
/@start[arg0]/
{
	@usecs = hist((nsecs - @start[arg0]) / 1000);
	delete(@start[arg0]);
}

END
{
	clear(@start);
}
This tool needs to store a timestamp at the start of each I/O to record its duration (latency). However, multiple I/O can be in flight concurrently. A single global timestamp variable would not work: a timestamp must be associated with each I/O. In many other BPF tools, this is solved by storing timestamps in a hash with the thread ID as a key. This does not work with disk I/O, since disk I/O can initiate on one thread and complete on another, in which case the thread ID changes. The solution used here is to take arg0 of these functions, which is the address of the struct request for the I/O, and use that memory address as the hash key. So long as the kernel does not change the memory address between issue and completion, it is suitable as the unique ID.
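The BCC version prints a histogram per interval, whereas this bpftrace version only prints a summary on exit. As a sketch of one way per-second output could be added (my addition, not part of the listed tool), an interval probe can print and clear the histogram map each second:

interval:s:1
{
	// emit a per-second histogram, suitable for heat map columns
	time("%H:%M:%S\n");
	print(@usecs);
	clear(@usecs);
}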
Tracepoints
The BCC and bpftrace versions of biolatency(8) should use the block tracepoints where possible, but there is a challenge: the struct request pointer is not currently available in the tracepoint arguments, so another key must be used to uniquely identify the I/O. One approach is to use the device ID and sector number. The core of the program can be changed to the following (biolatency-tp.bt):
[...]
tracepoint:block:block_rq_issue
{
	@start[args->dev, args->sector] = nsecs;
}

tracepoint:block:block_rq_complete
/@start[args->dev, args->sector]/
{
	@usecs = hist((nsecs - @start[args->dev, args->sector]) / 1000);
	delete(@start[args->dev, args->sector]);
}
[...]
This assumes that there are not multiple concurrent I/O in flight to the same device and sector. It measures the device time, not including the OS queued time.
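To also include the OS queued time, similar to the BCC -Q option, the start timestamp can be recorded at the earlier block_rq_insert tracepoint instead. This is a sketch of that variation (one possible approach, with the caveat that requests issued directly without queueing would be missed); the block_rq_complete probe remains unchanged:

tracepoint:block:block_rq_insert
{
	// record the enqueue time rather than the issue time
	@start[args->dev, args->sector] = nsecs;
}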
9.3.2 biosnoop
biosnoop(8)3 is a BCC and bpftrace tool that prints a one-line summary for each disk I/O. The following shows biosnoop(8) from BCC, running on a Hadoop production instance:
3 Origin: While I was a sysadmin at the University of Newcastle, Australia, in 2000, a shared server was suffering slow disk performance, which was suspected to be caused by a researcher running a batch job. They refused to move their workload unless I could prove that they were causing the heavy disk I/O, but no tool could do this. A workaround concocted either by me or the senior admin, Doug Scott, was to SIGSTOP their process while watching iostat(1), then SIGCONT it a few seconds later: the dramatic drop in disk I/O proved that they were responsible. Wanting a less invasive method, I saw the Sun TNF/prex tracing utility in Adrian Cockcroft’s Sun Performance and Tuning book [Cockcroft 98], and on 3-Dec-2003 I created psio(1M), a utility to print disk I/O by process [185], which also had a mode to trace per-event disk I/O. DTrace was made available in beta in the same month, and I eventually rewrote my disk I/O tracer as iosnoop(1M) on 12-Mar-2004, initially before there was an io provider. I was quoted in The Register’s DTrace announcement talking about this work [Vance 04]. I created the BCC version as biosnoop(8) on 16-Sep-2015, and the bpftrace version on 15-Nov-2017.
# biosnoop
TIME(s)     COMM         PID    DISK  T  SECTOR      BYTES   LAT(ms)
0.000000    java         5136   xvdq  R  980043184   45056      0.35
0.000060    java         5136   xvdq  R  980043272   45056      0.40
0.000083    java         5136   xvdq  R  980043360    4096      0.42
[...]
0.143724    java         5136   xvdy  R  5153784     45056      1.08
0.143755    java         5136   xvdy  R  5153872     40960      1.10
0.185374    java         5136   xvdm  R  2007186664  45056      0.34
0.189267    java         5136   xvdy  R  979232832   45056     14.00
0.190330    java         5136   xvdy  R  979232920   45056     15.05
0.190376    java         5136   xvdy  R  979233008   45056     15.09
0.190403    java         5136   xvdy  R  979233096   45056     15.12
0.190409    java         5136   xvdy  R  979233184   45056     15.12
0.190441    java         5136   xvdy  R  979233272   36864     15.15
0.190176    java         5136   xvdm  R  2007186752  45056      5.13
0.190231    java         5136   xvdm  R  2007186840  45056      5.18
[...]
This output shows Java with PID 5136 doing reads to different disks. There were six reads with latency of around 15 milliseconds. If you look closely at the TIME(s) column, which shows the I/O completion time, these all finished within a fraction of a millisecond and were to the same disk (xvdy). You can conclude that these were queued together: the latency creeping up from 14.00 to 15.15 milliseconds is another clue that queued I/O was completed in turn. The sector offsets are also contiguous: each 45056-byte read spans 88 512-byte sectors, and each read begins 88 sectors after the previous one (for example, 980043184 + 88 = 980043272).
As an example of production use: teams at Netflix that run stateful services routinely use biosnoop(8) to isolate issues with read-ahead degrading the performance of I/O-intensive workloads. Linux tries to intelligently read ahead data into the OS page cache, but this can cause severe performance issues for data stores running on fast solid-state drives, especially with the default read ahead settings. After identifying aggressive read-ahead, these teams then perform targeted refactors by analyzing histograms of I/O size and latency organized by thread, and then improve performance by using an appropriate madvise option, direct I/O, or changing the default read-ahead to smaller values such as 16 Kbytes. For histograms of I/O sizes, see vfssize(8) from Chapter 8 and bitesize(8) from this chapter; also see the readahead(8) tool in Chapter 8, which was created more recently for the analysis of this issue.
The biosnoop(8) columns are:
TIME(s): I/O completion time in seconds
COMM: Process name, if cached
PID: Process ID, if cached
DISK: Storage device name
T: Type: R == reads, W == writes
SECTOR: Address on disk in units of 512-byte sectors
BYTES: Size of the I/O
LAT(ms): Duration of the I/O from device issue to device completion
This works in the same way as biolatency(8): tracing kernel block I/O functions. A future version should switch to the block tracepoints. The overhead of this tool is a little higher than biolatency(8) as it is printing per-event output.
OS Queued Time
A -Q option to BCC biosnoop(8) can be used to show the time spent between the creation of the I/O and the issue to the device: this time is mostly spent on OS queues, but could also include memory allocation and lock acquisition. For example:
# biosnoop -Q
TIME(s)     COMM         PID    DISK  T  SECTOR    BYTES    QUE(ms)  LAT(ms)
19.925329   cksum        20405  sdb   R  249631     16384     17.17     1.63
19.933890   cksum        20405  sdb   R  249663    122880     17.81     8.51
19.942442   cksum        20405  sdb   R  249903    122880     26.35     8.51
19.944161   cksum        20405  sdb   R  250143     16384     34.91     1.66
19.952853   cksum        20405  sdb   R  250175    122880     15.53     8.59
[...]
The queued time is shown in the QUE(ms) column. This example of high queue times for reads was from a USB flash drive using the CFQ I/O scheduler. Write I/O queues even more:
# biosnoop -Q
TIME(s)     COMM           PID    DISK     T  SECTOR    BYTES    QUE(ms)  LAT(ms)
[...]
2.338149    ?              0               W  0           8192     0.00      2.72
2.354710    ?              0               W  0         122880     0.00     16.17
2.371236    kworker/u16:1  18754  sdb      W  486703    122880  2070.06     16.51
2.387687    cp             20631  nvme0n1  R  73365192  262144     0.01      3.23
2.389213    kworker/u16:1  18754  sdb      W  486943    122880  2086.60     17.92
2.404042    kworker/u16:1  18754  sdb      W  487183    122880  2104.53     14.81
2.421539    kworker/u16:1  18754  sdb      W  487423    122880  2119.40     17.43
[...]
The queue time for writes exceeds two seconds. Note that earlier I/O lacked most of the column details: they were enqueued before tracing began, and so biosnoop(8) missed caching those details and only shows the device latency.
BCC
Command line usage:
biosnoop [options]
Options include -Q for OS queued time.
bpftrace
The following is the code for the bpftrace version, which traces the full duration of the I/O, including queued time:
#!/usr/local/bin/bpftrace

BEGIN
{
	printf("%-12s %-16s %-6s %7s\n", "TIME(ms)", "COMM", "PID", "LAT(ms)");
}

kprobe:blk_account_io_start
{
	@start[arg0] = nsecs;
	@iopid[arg0] = pid;
	@iocomm[arg0] = comm;
}

kprobe:blk_account_io_done
/@start[arg0] != 0 && @iopid[arg0] != 0 && @iocomm[arg0] != ""/
{
	$now = nsecs;
	printf("%-12u %-16s %-6d %7d\n", elapsed / 1000000, @iocomm[arg0],
	    @iopid[arg0], ($now - @start[arg0]) / 1000000);

	delete(@start[arg0]);
	delete(@iopid[arg0]);
	delete(@iocomm[arg0]);
}

END
{
	clear(@start);
	clear(@iopid);
	clear(@iocomm);
}
The blk_account_io_start() function often fires in process context and occurs when the I/O is queued. Later events, such as issuing the I/O to the device and I/O completion, may or may not happen in process context, so you cannot rely on the value of the pid and comm builtins at those later times. The solution is to store them in BPF maps during blk_account_io_start(), keyed by the request ID, so that they can be retrieved later.
As with biolatency(8), this tool can be rewritten to use the block tracepoints (see Section 9.5).
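As a rough sketch of how such a rewrite could look (an approximation of one approach, not the version covered in Section 9.5), the issue and complete tracepoints can be keyed by device and sector, with the process name cached from the issue tracepoint. Note that args->comm at issue time may show a kernel worker thread rather than the process that created the I/O:

tracepoint:block:block_rq_issue
{
	// key by device and sector, as the struct request is not available
	@start[args->dev, args->sector] = nsecs;
	@iocomm[args->dev, args->sector] = args->comm;
}

tracepoint:block:block_rq_complete
/@start[args->dev, args->sector]/
{
	printf("%-12u %-16s %7d\n", elapsed / 1000000,
	    @iocomm[args->dev, args->sector],
	    (nsecs - @start[args->dev, args->sector]) / 1000000);
	delete(@start[args->dev, args->sector]);
	delete(@iocomm[args->dev, args->sector]);
}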
9.3.3 biotop
biotop(8)4 is a BCC tool that is top(1) for disks. The following shows it running on a production Hadoop instance, with -C to not clear the screen between updates:
4 Origin: I created the first iotop using DTrace on 15-Jul-2005, and wrote this BCC version 6-Feb-2016. These were inspired by top(1) by William LeFebvre.
# biotop -C
Tracing... Output every 1 secs. Hit Ctrl-C to end

06:09:47 loadavg: 28.40 29.00 28.96 44/3812 124008

PID     COMM             D  MAJ  MIN   DISK  I/O   Kbytes  AVGms
123693  kworker/u258:0   W  202  4096  xvdq  1979  86148   0.93
55024   kworker/u257:8   W  202  4608  xvds  1480  64068   0.73
123693  kworker/u258:0   W  202  5376  xvdv  143   5700    0.52
5381    java             R  202  176   xvdl  81    3456    3.01
43297   kworker/u257:0   W  202  80    xvdf  48    1996    0.56
5383    java             R  202  112   xvdh  27    1152    16.05
5383    java             R  202  5632  xvdw  27    1152    3.45
5383    java             R  202  224   xvdo  27    1152    6.79
5383    java             R  202  96    xvdg  24    1024    0.52
5383    java             R  202  192   xvdm  24    1024    39.45
5383    java             R  202  5888  xvdx  24    1024    0.64
5383    java             R  202  5376  xvdv  24    1024    4.74
5383    java             R  202  4096  xvdq  24    1024    3.07
5383    java             R  202  48    xvdd  24    1024    0.62
5383    java             R  202  5120  xvdu  24    1024    4.20
5383    java             R  202  208   xvdn  24    1024    2.54
5383    java             R  202  80    xvdf  24    1024    0.66
5383    java             R  202  64    xvde  24    1024    8.08
5383    java             R  202  32    xvdc  24    1024    0.63
5383    java             R  202  160   xvdk  24    1024    1.42
[...]
This shows that a Java process is reading from many different disks. Top of the list are kworker threads initiating writes: this is background write flushing, and the real process that dirtied the pages is not known at this point (it can be identified using the file system tools from Chapter 8).
This works using the same events as biolatency(8), with similar overhead expectations.
Command line usage:
biotop [options] [interval [count]]
Options include:
-C: Don’t clear the screen
-r ROWS: Number of rows to print
The output is truncated to 20 rows by default, which can be tuned with -r.
9.3.4 bitesize
bitesize(8)5 is a BCC and bpftrace tool to show the size of disk I/O. The following shows the BCC version running on a production Hadoop instance:
5 Origin: I first created this as bitesize.d using DTrace on 31-Mar-2004, before the io provider was available. Allan McAleavy created the BCC version on 5-Feb-2016, and I created the bpftrace one on 7-Sep-2018.
# bitesize
Tracing... Hit Ctrl-C to end.
^C
[...]

Process Name = kworker/u257:10
     Kbytes              : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 17       |                                        |
         8 -> 15         : 12       |                                        |
        16 -> 31         : 79       |*                                       |
        32 -> 63         : 3140     |****************************************|

Process Name = java
     Kbytes              : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 3        |                                        |
         4 -> 7          : 60       |                                        |
         8 -> 15         : 68       |                                        |
        16 -> 31         : 220      |**                                      |
        32 -> 63         : 3996     |****************************************|
This output shows that both the kworker thread and java are issuing I/O mostly in the 32- to 63-Kbyte range. Checking the I/O size can lead to optimizations:
Sequential workloads should try the largest possible I/O size for peak performance. Larger sizes sometimes encounter slightly worse performance; there may be a sweet spot (e.g., 128 Kbytes) based on memory allocators and device logic.
Random workloads should try to match the I/O size with the application record size. Larger I/O sizes pollute the page cache with data that isn’t needed; smaller I/O sizes result in more I/O overhead than needed.
This works by instrumenting the block:block_rq_issue tracepoint.
BCC
bitesize(8) currently does not support options.
bpftrace
The following is the code for the bpftrace version:
#!/usr/local/bin/bpftrace

BEGIN
{
	printf("Tracing block device I/O... Hit Ctrl-C to end.\n");
}

tracepoint:block:block_rq_issue
{
	@[args->comm] = hist(args->bytes);
}

END
{
	printf("\nI/O size (bytes) histograms by process name:");
}
The tracepoint provides the process name as args->comm, and the size as args->bytes. This tracepoint fires when the request is issued to the device. Later tracepoints, such as completion, do not provide args->comm, nor can the comm builtin be used, as they fire asynchronously to the process (e.g., on device completion interrupt).
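The histogram key can also be changed if a different breakdown is needed. For example, here is a sketch of a per-device variant (one possible modification), decoding major and minor numbers from args->dev in the same way that bioerr(8) does later in this section:

tracepoint:block:block_rq_issue
{
	// key the size histogram by major, minor device numbers
	@[args->dev >> 20, args->dev & ((1 << 20) - 1)] = hist(args->bytes);
}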
9.3.5 seeksize
seeksize(8)6 is a bpftrace tool to show how many sectors processes are requesting the disks to seek over. This is only a problem for rotational magnetic media,7 where the drive heads must physically move from one sector offset to another, causing latency. Example output:
6 Origin: I first created it as seeksize.d using DTrace on 11-Sep-2004, as seek issues on rotational disks were common at the time. I created the bpftrace version for a blog post on 18-Oct-2018 and revised it for this book on 20-Mar-2019.
7 Almost. Flash drives have their flash-translation-layer logic, and I’ve noticed a tiny slowdown (less than 1%) when seeking across large ranges vs small: perhaps it’s busting the flash equivalent of a TLB.
# seeksize.bt
Attaching 3 probes...
Tracing block I/O requested seeks... Hit Ctrl-C to end.
^C
[...]

@sectors[tar]:
[0]                 8220 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1]                    0 |                                                    |
[2, 4)                 0 |                                                    |
[4, 8)                 0 |                                                    |
[8, 16)              882 |@@@@@                                               |
[16, 32)            1897 |@@@@@@@@@@@@                                        |
[32, 64)            1588 |@@@@@@@@@@                                          |
[64, 128)           1502 |@@@@@@@@@                                           |
[128, 256)          1105 |@@@@@@                                              |
[256, 512)           734 |@@@@                                                |
[512, 1K)            501 |@@@                                                 |
[1K, 2K)             302 |@                                                   |
[2K, 4K)             194 |@                                                   |
[4K, 8K)              82 |                                                    |
[8K, 16K)              0 |                                                    |
[16K, 32K)             0 |                                                    |
[32K, 64K)             6 |                                                    |
[64K, 128K)          191 |@                                                   |
[128K, 256K)           0 |                                                    |
[256K, 512K)           0 |                                                    |
[512K, 1M)             0 |                                                    |
[1M, 2M)               1 |                                                    |
[2M, 4M)             840 |@@@@@                                               |
[4M, 8M)             887 |@@@@@                                               |
[8M, 16M)            441 |@@                                                  |
[16M, 32M)           124 |                                                    |
[32M, 64M)           220 |@                                                   |
[64M, 128M)          207 |@                                                   |
[128M, 256M)         205 |@                                                   |
[256M, 512M)           3 |                                                    |
[512M, 1G)           286 |@                                                   |

@sectors[dd]:
[0]                29908 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1]                    0 |                                                    |
[...]
[32M, 64M)             0 |                                                    |
[64M, 128M)            1 |                                                    |
This output shows that processes named "dd" usually did not request any seeking: a seek distance of 0 was requested 29,908 times while tracing. This is expected, as I was running a dd(1) sequential workload. I also ran a tar(1) file system backup, which generated a mixed workload: some sequential, some random.
The source to seeksize(8) is:
#!/usr/local/bin/bpftrace

BEGIN
{
	printf("Tracing block I/O requested seeks... Hit Ctrl-C to end.\n");
}

tracepoint:block:block_rq_issue
{
	if (@last[args->dev]) {
		// calculate requested seek distance
		$last = @last[args->dev];
		$dist = (args->sector - $last) > 0 ?
		    args->sector - $last : $last - args->sector;

		// store details
		@sectors[args->comm] = hist($dist);
	}

	// save last requested position of disk head
	@last[args->dev] = args->sector + args->nr_sector;
}

END
{
	clear(@last);
}
This works by looking at the requested sector offset for each device I/O and comparing it to a recorded previous location. If the script is changed to use the block_rq_complete tracepoint, it will show the actual seeks encountered by the disk. But instead it uses the block_rq_issue tracepoint to answer a different question: how random is the workload the application is requesting? This randomness may change after the I/O is processed by the Linux I/O scheduler and by the on-disk scheduler. I first wrote this to prove which applications were causing random workloads, so I chose to measure the workload on requests.
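As a sketch of that completion-based variation (one possible modification): since the completion tracepoint does not provide reliable process context, the histogram is keyed by device rather than by process name:

tracepoint:block:block_rq_complete
{
	if (@last[args->dev]) {
		// calculate the seek distance the disk actually encountered
		$last = @last[args->dev];
		$dist = (args->sector - $last) > 0 ?
		    args->sector - $last : $last - args->sector;
		@sectors[args->dev] = hist($dist);
	}
	@last[args->dev] = args->sector + args->nr_sector;
}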
The following tool, biopattern(8), measures randomness on I/O completion instead.
9.3.6 biopattern
biopattern(8)8 is a bpftrace tool to identify the pattern of I/O: random or sequential. For example:
8 Origin: I created the first version as iopattern using DTrace on 25-Jul-2005, based on a mockup that Ryan Matteson had sent me (which also had more columns). I created this bpftrace version for this book on 19-Mar-2019.
# biopattern.bt
Attaching 4 probes...
TIME      %RND  %SEQ     COUNT     KBYTES
00:05:54    83    16      2960      13312
00:05:55    82    17      3881      15524
00:05:56    78    21      3059      12232
00:05:57    73    26      2770      14204
00:05:58     0   100         1          0
00:05:59     0     0         0          0
00:06:00     0    99      1536     196360
00:06:01     0   100     13444    1720704
00:06:02     0    99     13864    1771876
00:06:03     0   100     13129    1680640
00:06:04     0    99     13532    1731484
[...]
This example begins with a file system backup workload, which caused mostly random I/O. At 00:06:00 I switched to a sequential disk read, which was 99 or 100% sequential, and delivered a much higher throughput (KBYTES).
The source to biopattern(8) is:
#!/usr/local/bin/bpftrace

BEGIN
{
	printf("%-8s %5s %5s %8s %10s\n", "TIME", "%RND", "%SEQ", "COUNT",
	    "KBYTES");
}

tracepoint:block:block_rq_complete
{
	if (@lastsector[args->dev] == args->sector) {
		@sequential++;
	} else {
		@random++;
	}
	@bytes = @bytes + args->nr_sector * 512;
	@lastsector[args->dev] = args->sector + args->nr_sector;
}

interval:s:1
{
	$count = @random + @sequential;
	$div = $count;
	if ($div == 0) {
		$div = 1;
	}
	time("%H:%M:%S ");
	printf("%5d %5d %8d %10d\n", @random * 100 / $div,
	    @sequential * 100 / $div, $count, @bytes / 1024);
	clear(@random);
	clear(@sequential);
	clear(@bytes);
}

END
{
	clear(@lastsector);
	clear(@random);
	clear(@sequential);
	clear(@bytes);
}
This works by instrumenting block I/O completion and remembering the last sector (disk address) used for each device, so that it can be compared with the following I/O to see if it carried on from the previous address (sequential) or did not (random).9
9 Prior to the tracing era, I would identify random/sequential workloads by interpreting iostat(1) output and looking for high service times with small I/O sizes (random) or low service times with high I/O sizes (sequential).
This tool can be changed to instrument tracepoint:block:block_rq_insert, which will show the randomness of the workload applied (similar to seeksize(8)).
9.3.7 biostacks
biostacks(8)10 is a bpftrace tool that traces full I/O latency (from OS enqueue to device completion) with the I/O initialization stack trace. For example:
10 Origin: I created it for this book on 19-Mar-2019. I had constructed a similar tool live during an internal Facebook talk in 2018, and for the first time saw initialization stacks associated with I/O completion times.
# biostacks.bt
Attaching 5 probes...
Tracing block I/O with init stacks. Hit Ctrl-C to end.
^C
[...]

@usecs[
    blk_account_io_start+1
    blk_mq_make_request+1069
    generic_make_request+292
    submit_bio+115
    swap_readpage+310
    read_swap_cache_async+64
    swapin_readahead+614
    do_swap_page+1086
    handle_pte_fault+725
    __handle_mm_fault+1144
    handle_mm_fault+177
    __do_page_fault+592
    do_page_fault+46
    page_fault+69
]:
[16K, 32K)             1 |                                                    |
[32K, 64K)            32 |                                                    |
[64K, 128K)         3362 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[128K, 256K)          38 |                                                    |
[256K, 512K)           0 |                                                    |
[512K, 1M)             0 |                                                    |
[1M, 2M)               1 |                                                    |
[2M, 4M)               1 |                                                    |
[4M, 8M)               1 |                                                    |

@usecs[
    blk_account_io_start+1
    blk_mq_make_request+1069
    generic_make_request+292
    submit_bio+115
    submit_bh_wbc+384
    ll_rw_block+173
    __breadahead+68
    __ext4_get_inode_loc+914
    ext4_iget+146
    ext4_iget_normal+48
    ext4_lookup+240
    lookup_slow+171
    walk_component+451
    path_lookupat+132
    filename_lookup+182
    user_path_at_empty+54
    vfs_statx+118
    SYSC_newfstatat+53
    sys_newfstatat+14
    do_syscall_64+115
    entry_SYSCALL_64_after_hwframe+61
]:
[8K, 16K)             18 |@@@@@@@@@@@                                         |
[16K, 32K)            20 |@@@@@@@@@@@@                                        |
[32K, 64K)            10 |@@@@@@                                              |
[64K, 128K)           56 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 |
[128K, 256K)          81 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256K, 512K)           7 |@@@@                                                |
I have seen cases where there was mysterious disk I/O without any application causing it. The reason turned out to be background file system tasks. (In one case it was ZFS’s background scrubber, which periodically verifies checksums.) biostacks(8) can identify the real reason for disk I/O by showing the kernel stack trace.
The above output has two interesting stacks. The first was triggered by a page fault that became a swap in: this is swapping.11 The second was a newfstatat() syscall that became a readahead.
11 Linux terminology, where this means switching pages with the swap device. Swapping for other kernels can mean moving entire processes.
The source to biostacks(8) is:
#!/usr/local/bin/bpftrace

BEGIN
{
	printf("Tracing block I/O with init stacks. Hit Ctrl-C to end.\n");
}

kprobe:blk_account_io_start
{
	@reqstack[arg0] = kstack;
	@reqts[arg0] = nsecs;
}

kprobe:blk_start_request,
kprobe:blk_mq_start_request
/@reqts[arg0]/
{
	@usecs[@reqstack[arg0]] = hist(nsecs - @reqts[arg0]);
	delete(@reqstack[arg0]);
	delete(@reqts[arg0]);
}

END
{
	clear(@reqstack);
	clear(@reqts);
}
This works by saving the kernel stack and a timestamp when the I/O was initiated and retrieving that saved stack and timestamp when the I/O completed. These are saved in a map keyed by the struct request pointer, which is arg0 to the traced kernel functions. The kernel stack trace is recorded using the kstack builtin. You can change this to ustack to record the user-level stack trace or add them both.
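For example, this sketch (one possible variation) records both and uses them together as the histogram key. The user-level stack is captured in blk_account_io_start(), which usually fires in process context; if it fires asynchronously, the user stack may be missing or unrelated:

kprobe:blk_account_io_start
{
	// save both kernel and user stacks, plus a timestamp
	@reqstack[arg0] = kstack;
	@requstack[arg0] = ustack;
	@reqts[arg0] = nsecs;
}

kprobe:blk_start_request,
kprobe:blk_mq_start_request
/@reqts[arg0]/
{
	@usecs[@reqstack[arg0], @requstack[arg0]] = hist(nsecs - @reqts[arg0]);
	delete(@reqstack[arg0]);
	delete(@requstack[arg0]);
	delete(@reqts[arg0]);
}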
With the Linux 5.0 switch to multi-queue only, the blk_start_request() function was removed from the kernel. On that and later kernels, this tool prints a warning:
Warning: could not attach probe kprobe:blk_start_request, skipping.
This can be ignored, or that kprobe can be deleted from the tool. The tool could also be rewritten to use tracepoints. See the “Tracepoints” subsection of Section 9.3.1.
9.3.8 bioerr
bioerr(8)12 traces block I/O errors and prints details. For example, running bioerr(8) on my laptop:
12 Origin: I created it for this book on 19-Mar-2019.
# bioerr.bt
Attaching 2 probes...
Tracing block I/O errors. Hit Ctrl-C to end.
00:31:52 device: 0,0, sector: -1, bytes: 0, flags: N, error: -5
00:31:54 device: 0,0, sector: -1, bytes: 0, flags: N, error: -5
00:31:56 device: 0,0, sector: -1, bytes: 0, flags: N, error: -5
00:31:58 device: 0,0, sector: -1, bytes: 0, flags: N, error: -5
00:32:00 device: 0,0, sector: -1, bytes: 0, flags: N, error: -5
[...]
This output is far more interesting than I was expecting. (I wasn’t expecting any errors, but ran it just in case.) Every two seconds there is a zero-byte request to device 0,0, which seems bogus, and which returns with a -5 error (EIO).
The previous tool, biostacks(8), was created to investigate this kind of issue. In this case I don't need to see the latency, and I only want to see stacks for the device 0,0 I/O. While I could tweak biostacks(8) to do this, it can also be done as a bpftrace one-liner. (In this case, I'll check that the stack trace is still meaningful by the time this tracepoint is hit; if it were not, I'd need to switch back to a kprobe of blk_account_io_start() to really catch the initialization of this I/O.)
# bpftrace -e 't:block:block_rq_issue /args->dev == 0/ { @[kstack]++ }'
Attaching 1 probe...
^C

@[
    blk_peek_request+590
    scsi_request_fn+51
    __blk_run_queue+67
    blk_execute_rq_nowait+168
    blk_execute_rq+80
    scsi_execute+227
    scsi_test_unit_ready+96
    sd_check_events+248
    disk_check_events+101
    disk_events_workfn+22
    process_one_work+478
    worker_thread+50
    kthread+289
    ret_from_fork+53
]: 3
This shows that device 0 I/O was created from scsi_test_unit_ready(). A little more digging into the parent functions shows that it was checking for USB removable media. As an experiment, I traced scsi_test_unit_ready() while inserting a USB flash drive, which changed its return value. This was my laptop detecting USB drives.
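If needed, the return value of scsi_test_unit_ready() can be frequency counted while inserting and removing media, as a quick way to watch it change. A one-liner sketch:

# bpftrace -e 'kretprobe:scsi_test_unit_ready { @[retval] = count(); }'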
The source to bioerr(8) is:
#!/usr/local/bin/bpftrace

BEGIN
{
	printf("Tracing block I/O errors. Hit Ctrl-C to end.\n");
}

tracepoint:block:block_rq_complete
/args->error != 0/
{
	time("%H:%M:%S ");
	printf("device: %d,%d, sector: %d, bytes: %d, flags: %s, error: %d\n",
	    args->dev >> 20, args->dev & ((1 << 20) - 1), args->sector,
	    args->nr_sector * 512, args->rwbs, args->error);
}
The logic for mapping the device identifier (args->dev) to the major and minor numbers comes from the format file for this tracepoint:
# cat /sys/kernel/debug/tracing/events/block/block_rq_complete/format
name: block_rq_complete
[...]
print fmt: "%d,%d %s (%s) %llu + %u [%d]",
    ((unsigned int) ((REC->dev) >> 20)),
    ((unsigned int) ((REC->dev) & ((1U << 20) - 1))),
    REC->rwbs, __get_str(cmd), (unsigned long long)REC->sector,
    REC->nr_sector, REC->error
While bioerr(8) is a handy tool, note that perf(1) can be used for similar functionality by filtering on error. The output includes the format string as defined by the /sys format file. For example:
# perf record -e block:block_rq_complete --filter 'error != 0'
# perf script
ksoftirqd/2  22 [002] 2289450.691041: block:block_rq_complete: 0,0 N () 18446744073709551615 + 0 [-5]
[...]
The BPF tool can be customized to include more information, going beyond the standard capabilities of perf(1).
For example, the error returned, in this case -5 for EIO, has been mapped from a block error code. It may be interesting to see the original block error code, which can be traced from functions that handle it, for example:
# bpftrace -e 'kprobe:blk_status_to_errno /arg0/ { @[arg0]++ }'
Attaching 1 probe...
^C

@[10]: 2
It’s really block I/O status 10, which is BLK_STS_IOERR. These are defined in linux/blk_types.h:
#define BLK_STS_OK 0
#define BLK_STS_NOTSUPP		((__force blk_status_t)1)
#define BLK_STS_TIMEOUT		((__force blk_status_t)2)
#define BLK_STS_NOSPC		((__force blk_status_t)3)
#define BLK_STS_TRANSPORT	((__force blk_status_t)4)
#define BLK_STS_TARGET		((__force blk_status_t)5)
#define BLK_STS_NEXUS		((__force blk_status_t)6)
#define BLK_STS_MEDIUM		((__force blk_status_t)7)
#define BLK_STS_PROTECTION	((__force blk_status_t)8)
#define BLK_STS_RESOURCE	((__force blk_status_t)9)
#define BLK_STS_IOERR		((__force blk_status_t)10)
bioerr(8) could be enhanced to print these BLK_STS code names instead of the error numbers. These are actually mapped from SCSI result codes, which can be traced from the scsi events. I’ll demonstrate SCSI tracing in sections 9.3.11 and 9.3.12.
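As a sketch of how those names could be used (one possible approach, building on the blk_status_to_errno() one-liner above rather than modifying bioerr(8) itself), the BLK_STS names can be loaded into a map and used when counting:

#!/usr/local/bin/bpftrace

BEGIN
{
	// block status names, from linux/blk_types.h:
	@blkstatus[1] = "BLK_STS_NOTSUPP";
	@blkstatus[2] = "BLK_STS_TIMEOUT";
	@blkstatus[3] = "BLK_STS_NOSPC";
	@blkstatus[4] = "BLK_STS_TRANSPORT";
	@blkstatus[5] = "BLK_STS_TARGET";
	@blkstatus[6] = "BLK_STS_NEXUS";
	@blkstatus[7] = "BLK_STS_MEDIUM";
	@blkstatus[8] = "BLK_STS_PROTECTION";
	@blkstatus[9] = "BLK_STS_RESOURCE";
	@blkstatus[10] = "BLK_STS_IOERR";
}

kprobe:blk_status_to_errno
/arg0/
{
	@[@blkstatus[arg0]] = count();
}

END
{
	clear(@blkstatus);
}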
9.3.9 mdflush
mdflush(8)13 is a BCC and bpftrace tool for tracing flush events from md, the multiple devices driver that is used on some systems to implement software RAID. For example, running the BCC version on a production server using md:
13 Origin: I created it for BCC on 13-Feb-2015 and for bpftrace on 8-Sep-2018.
# mdflush
Tracing md flush requests... Hit Ctrl-C to end.
TIME     PID    COMM             DEVICE
23:43:37 333    kworker/0:1H     md0
23:43:37 4038   xfsaild/md0      md0
23:43:38 8751   filebeat         md0
23:43:43 5575   filebeat         md0
23:43:48 5824   filebeat         md0
23:43:53 5575   filebeat         md0
23:43:58 5824   filebeat         md0
[...]
md flush events are usually infrequent and cause bursts of disk writes, perturbing system performance. Knowing exactly when they occurred can be useful for correlation with monitoring dashboards, to see if they align with latency spikes or other problems.
This output shows a process called filebeat doing md flushes every five seconds (I just discovered this). filebeat is a service that sends log files to Logstash or directly to Elasticsearch.
This works by tracing the md_flush_request() function using a kprobe. Since the event frequency is low, the overhead should be negligible.
BCC
mdflush(8) currently does not support any options.
bpftrace
The following is the code for the bpftrace version:
#!/usr/local/bin/bpftrace

#include <linux/genhd.h>
#include <linux/bio.h>

BEGIN
{
	printf("Tracing md flush events... Hit Ctrl-C to end.\n");
	printf("%-8s %-6s %-16s %s\n", "TIME", "PID", "COMM", "DEVICE");
}

kprobe:md_flush_request
{
	time("%H:%M:%S ");
	printf("%-6d %-16s %s\n", pid, comm,
	    ((struct bio *)arg1)->bi_disk->disk_name);
}
The program digs out the disk name via the struct bio argument.
9.3.10 iosched
iosched(8)14 traces the time that requests were queued in the I/O scheduler, and groups this by scheduler name. For example:
14 Origin: I created it for this book on 20-Mar-2019.
# iosched.bt
Attaching 5 probes...
Tracing block I/O schedulers. Hit Ctrl-C to end.
^C

@usecs[cfq]:
[2, 4)                 1 |                                                    |
[4, 8)                 3 |@                                                   |
[8, 16)               18 |@@@@@@@                                             |
[16, 32)               6 |@@                                                  |
[32, 64)               0 |                                                    |
[64, 128)              0 |                                                    |
[128, 256)             0 |                                                    |
[256, 512)             0 |                                                    |
[512, 1K)              6 |@@                                                  |
[1K, 2K)               8 |@@@                                                 |
[2K, 4K)               0 |                                                    |
[4K, 8K)               0 |                                                    |
[8K, 16K)             28 |@@@@@@@@@@@                                         |
[16K, 32K)           131 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K)            68 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
This shows the CFQ scheduler in use, with queueing times usually between eight and 64 milliseconds.
The source to iosched(8) is:
#!/usr/local/bin/bpftrace

#include <linux/blkdev.h>

BEGIN
{
	printf("Tracing block I/O schedulers. Hit Ctrl-C to end.\n");
}

kprobe:__elv_add_request
{
	@start[arg1] = nsecs;
}

kprobe:blk_start_request,
kprobe:blk_mq_start_request
/@start[arg0]/
{
	$r = (struct request *)arg0;
	@usecs[$r->q->elevator->type->elevator_name] =
	    hist((nsecs - @start[arg0]) / 1000);
	delete(@start[arg0]);
}

END
{
	clear(@start);
}
This works by recording a timestamp when requests are added to an I/O scheduler via an elevator function, __elv_add_request(), and then calculating the queued time when the I/O is issued. This restricts tracing to I/O that passes through an I/O scheduler, and measures only the time spent queued there. The scheduler (elevator) name is fetched from the struct request.
With the Linux 5.0 switch to multi-queue only, the blk_start_request() function was removed from the kernel. On that and later kernels this tool will print a warning about skipping the blk_start_request() kprobe, which can be ignored, or that kprobe can be removed from this program.
9.3.11 scsilatency
scsilatency(8)15 is a tool to trace SCSI commands with latency distributions. For example:
15 Origin: I created it for this book on 21-Mar-2019, inspired by similar tools I created for the 2011 DTrace book [Gregg 11].
# scsilatency.bt
Attaching 4 probes...
Tracing scsi latency. Hit Ctrl-C to end.
^C

@usecs[0, TEST_UNIT_READY]:
[128K, 256K)           2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
[256K, 512K)           2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
[512K, 1M)             0 |                                                    |
[1M, 2M)               1 |@@@@@@@@@@@@@@@@@                                   |
[2M, 4M)               2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
[4M, 8M)               3 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[8M, 16M)              1 |@@@@@@@@@@@@@@@@@                                   |

@usecs[42, WRITE_10]:
[2K, 4K)               2 |@                                                   |
[4K, 8K)               0 |                                                    |
[8K, 16K)              2 |@                                                   |
[16K, 32K)            50 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
[32K, 64K)            57 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@usecs[40, READ_10]:
[4K, 8K)              15 |@                                                   |
[8K, 16K)            676 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K)           447 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
[32K, 64K)             2 |                                                    |
[...]
This has a latency histogram for each SCSI command type, showing the opcode and command name (if available).
The source to scsilatency(8) is:
#!/usr/local/bin/bpftrace

#include <scsi/scsi_cmnd.h>

BEGIN
{
	printf("Tracing scsi latency. Hit Ctrl-C to end.\n");
	// SCSI opcodes from scsi/scsi_proto.h; add more mappings if desired:
	@opcode[0x00] = "TEST_UNIT_READY";
	@opcode[0x03] = "REQUEST_SENSE";
	@opcode[0x08] = "READ_6";
	@opcode[0x0a] = "WRITE_6";
	@opcode[0x0b] = "SEEK_6";
	@opcode[0x12] = "INQUIRY";
	@opcode[0x18] = "ERASE";
	@opcode[0x28] = "READ_10";
	@opcode[0x2a] = "WRITE_10";
	@opcode[0x2b] = "SEEK_10";
	@opcode[0x35] = "SYNCHRONIZE_CACHE";
}

kprobe:scsi_init_io
{
	@start[arg0] = nsecs;
}

kprobe:scsi_done,
kprobe:scsi_mq_done
/@start[arg0]/
{
	$cmnd = (struct scsi_cmnd *)arg0;
	$opcode = *$cmnd->req.cmd & 0xff;
	@usecs[$opcode, @opcode[$opcode]] = hist((nsecs - @start[arg0]) / 1000);
}

END
{
	clear(@start);
	clear(@opcode);
}
There are many possible SCSI commands; this tool only translates a handful into the opcode names. Since the opcode number is printed with the output, if a translation is missing it can still be determined by referring to scsi/scsi_proto.h, and this tool can be enhanced to include it.
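For example, to add the VERIFY(10) command, one more mapping can be appended to the BEGIN probe (a sketch, assuming opcode 0x2f as defined in scsi/scsi_proto.h):

	@opcode[0x2f] = "VERIFY";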
There are scsi tracepoints, and one is used in the next tool, but these lack a unique identifier, which would be needed as a BPF map key to store a timestamp.
With the Linux 5.0 switch to multi-queue only, the scsi_done() function was removed from the kernel. On that and later kernels this tool will print a warning about skipping the scsi_done() kprobe, which can be ignored, or that kprobe can be removed from this program.
9.3.12 scsiresult
scsiresult(8)16 summarizes SCSI command results: the host and status codes. For example:
16 Origin: I created it for this book on 21-Mar-2019, inspired by similar tools I created for the 2011 DTrace book [Gregg 11].
# scsiresult.bt
Attaching 3 probes...
Tracing scsi command results. Hit Ctrl-C to end.
^C

@[DID_BAD_TARGET, SAM_STAT_GOOD]: 1
@[DID_OK, SAM_STAT_CHECK_CONDITION]: 10
@[DID_OK, SAM_STAT_GOOD]: 2202
This shows 2202 results with the codes DID_OK and SAM_STAT_GOOD and one with DID_BAD_TARGET and SAM_STAT_GOOD. These codes are defined in the kernel source, for example, from include/scsi/scsi.h:
#define DID_OK          0x00	/* NO error                                */
#define DID_NO_CONNECT  0x01	/* Couldn't connect before timeout period  */
#define DID_BUS_BUSY    0x02	/* BUS stayed busy through time out period */
#define DID_TIME_OUT    0x03	/* TIMED OUT for other reason              */
#define DID_BAD_TARGET  0x04	/* BAD target.                             */
[...]
This tool can be used to identify anomalous results from SCSI devices.
The source to scsiresult(8) is:
#!/usr/local/bin/bpftrace

BEGIN
{
	printf("Tracing scsi command results. Hit Ctrl-C to end.\n");

	// host byte codes, from include/scsi/scsi.h:
	@host[0x00] = "DID_OK";
	@host[0x01] = "DID_NO_CONNECT";
	@host[0x02] = "DID_BUS_BUSY";
	@host[0x03] = "DID_TIME_OUT";
	@host[0x04] = "DID_BAD_TARGET";
	@host[0x05] = "DID_ABORT";
	@host[0x06] = "DID_PARITY";
	@host[0x07] = "DID_ERROR";
	@host[0x08] = "DID_RESET";
	@host[0x09] = "DID_BAD_INTR";
	@host[0x0a] = "DID_PASSTHROUGH";
	@host[0x0b] = "DID_SOFT_ERROR";
	@host[0x0c] = "DID_IMM_RETRY";
	@host[0x0d] = "DID_REQUEUE";
	@host[0x0e] = "DID_TRANSPORT_DISRUPTED";
	@host[0x0f] = "DID_TRANSPORT_FAILFAST";
	@host[0x10] = "DID_TARGET_FAILURE";
	@host[0x11] = "DID_NEXUS_FAILURE";
	@host[0x12] = "DID_ALLOC_FAILURE";
	@host[0x13] = "DID_MEDIUM_ERROR";

	// status byte codes, from include/scsi/scsi_proto.h:
	@status[0x00] = "SAM_STAT_GOOD";
	@status[0x02] = "SAM_STAT_CHECK_CONDITION";
	@status[0x04] = "SAM_STAT_CONDITION_MET";
	@status[0x08] = "SAM_STAT_BUSY";
	@status[0x10] = "SAM_STAT_INTERMEDIATE";
	@status[0x14] = "SAM_STAT_INTERMEDIATE_CONDITION_MET";
	@status[0x18] = "SAM_STAT_RESERVATION_CONFLICT";
	@status[0x22] = "SAM_STAT_COMMAND_TERMINATED";
	@status[0x28] = "SAM_STAT_TASK_SET_FULL";
	@status[0x30] = "SAM_STAT_ACA_ACTIVE";
	@status[0x40] = "SAM_STAT_TASK_ABORTED";
}

tracepoint:scsi:scsi_dispatch_cmd_done
{
	@[@host[(args->result >> 16) & 0xff], @status[args->result & 0xff]] =
	    count();
}

END
{
	clear(@status);
	clear(@host);
}
This works by tracing the scsi:scsi_dispatch_cmd_done tracepoint and fetching the host and status bytes from the result, and then mapping them to kernel names. The kernel has similar lookup tables in include/trace/events/scsi.h for the tracepoint format string.
The result also has driver and message bytes, not shown by this tool. It is of the format:
driver_byte << 24 | host_byte << 16 | msg_byte << 8 | status_byte
This tool can be enhanced to add these bytes and other details to the map as additional keys. Other details are readily available in that tracepoint:
# bpftrace -lv t:scsi:scsi_dispatch_cmd_done
tracepoint:scsi:scsi_dispatch_cmd_done
    unsigned int host_no;
    unsigned int channel;
    unsigned int id;
    unsigned int lun;
    int result;
    unsigned int opcode;
    unsigned int cmd_len;
    unsigned int data_sglen;
    unsigned int prot_sglen;
    unsigned char prot_op;
    __data_loc unsigned char[] cmnd;
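As a sketch of adding the driver and message bytes as extra map keys (one possible variation; they are left as raw numbers, since this tool only defines name maps for the host and status bytes):

tracepoint:scsi:scsi_dispatch_cmd_done
{
	// keys: driver byte, host name, message byte, status name
	@[(args->result >> 24) & 0xff, @host[(args->result >> 16) & 0xff],
	    (args->result >> 8) & 0xff, @status[args->result & 0xff]] = count();
}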
Even more details are available via kprobes of scsi functions, although without the interface stability.
9.3.13 nvmelatency
nvmelatency(8)17 traces the nvme storage driver and shows command latencies by disk and nvme command opcode. This can be useful for isolating device latency from the latency measured higher in the stack at the block I/O layer. For example:
17 Origin: I created it for this book on 21-Mar-2019, inspired by similar storage driver tools that I created for the 2011 DTrace book [Gregg 11].
# nvmelatency.bt
Attaching 4 probes...
Tracing nvme command latency. Hit Ctrl-C to end.
^C

@usecs[nvme0n1, nvme_cmd_flush]:
[8, 16)                2 |@@@@@@@@@                                           |
[16, 32)               7 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                   |
[32, 64)               6 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[64, 128)             11 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[128, 256)             0 |                                                    |
[256, 512)             0 |                                                    |
[512, 1K)              3 |@@@@@@@@@@@@@@                                      |
[1K, 2K)               8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
[2K, 4K)               1 |@@@@                                                |
[4K, 8K)               4 |@@@@@@@@@@@@@@@@@@                                  |

@usecs[nvme0n1, nvme_cmd_write]:
[8, 16)                3 |@@@@                                                |
[16, 32)              37 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32, 64)              20 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[64, 128)              6 |@@@@@@@@                                            |
[128, 256)             0 |                                                    |
[256, 512)             0 |                                                    |
[512, 1K)              0 |                                                    |
[1K, 2K)               0 |                                                    |
[2K, 4K)               0 |                                                    |
[4K, 8K)               7 |@@@@@@@@@                                           |

@usecs[nvme0n1, nvme_cmd_read]:
[32, 64)            7653 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128)            568 |@@@                                                 |
[128, 256)            45 |                                                    |
[256, 512)             4 |                                                    |
[512, 1K)              0 |                                                    |
[1K, 2K)               0 |                                                    |
[2K, 4K)               0 |                                                    |
[4K, 8K)               1 |                                                    |
This output shows that only one disk was in use, nvme0n1, along with the latency distributions for three nvme command types.
Tracepoints for nvme were recently added to Linux, but I wrote this tool on a system that did not have them, to show what can be accomplished with kprobes and storage drivers. I began by frequency counting which nvme functions were in use during different I/O workloads:
# bpftrace -e 'kprobe:nvme* { @[func] = count(); }'
Attaching 184 probes...
^C

@[nvme_pci_complete_rq]: 5998
@[nvme_free_iod]: 6047
@[nvme_setup_cmd]: 6048
@[nvme_queue_rq]: 6071
@[nvme_complete_rq]: 6171
@[nvme_irq]: 6304
@[nvme_process_cq]: 12327
Browsing the source for these functions showed that latency could be traced as the time from nvme_setup_cmd() to nvme_complete_rq().
The existence of tracepoints can aid in tool development, even if you are on a system that lacks them. By inspecting how the nvme tracepoints worked [187], I was able to develop this tool more quickly, because the tracepoint source showed how to correctly interpret nvme opcodes.
The source to nvmelatency(8) is:
#!/usr/local/bin/bpftrace

#include <linux/blkdev.h>
#include <linux/nvme.h>

BEGIN
{
	printf("Tracing nvme command latency. Hit Ctrl-C to end.\n");

	// from linux/nvme.h:
	@ioopcode[0x00] = "nvme_cmd_flush";
	@ioopcode[0x01] = "nvme_cmd_write";
	@ioopcode[0x02] = "nvme_cmd_read";
	@ioopcode[0x04] = "nvme_cmd_write_uncor";
	@ioopcode[0x05] = "nvme_cmd_compare";
	@ioopcode[0x08] = "nvme_cmd_write_zeroes";
	@ioopcode[0x09] = "nvme_cmd_dsm";
	@ioopcode[0x0d] = "nvme_cmd_resv_register";
	@ioopcode[0x0e] = "nvme_cmd_resv_report";
	@ioopcode[0x11] = "nvme_cmd_resv_acquire";
	@ioopcode[0x15] = "nvme_cmd_resv_release";
}

kprobe:nvme_setup_cmd
{
	$req = (struct request *)arg1;
	if ($req->rq_disk) {
		@start[arg1] = nsecs;
		@cmd[arg1] = arg2;
	} else {
		@admin_commands = count();
	}
}

kprobe:nvme_complete_rq
/@start[arg0]/
{
	$req = (struct request *)arg0;
	$cmd = (struct nvme_command *)@cmd[arg0];
	$disk = $req->rq_disk;
	$opcode = $cmd->common.opcode & 0xff;
	@usecs[$disk->disk_name, @ioopcode[$opcode]] =
	    hist((nsecs - @start[arg0]) / 1000);
	// delete by the request pointer, the key used when stored
	delete(@start[arg0]);
	delete(@cmd[arg0]);
}

END
{
	clear(@ioopcode);
	clear(@start);
	clear(@cmd);
}
If a request is created without a disk, it is an admin command. The script can be enhanced to decode and time the admin commands (see nvme_admin_opcode in include/linux/nvme.h). To keep this tool short, I simply counted admin commands so that if any are present they will be noted in the output.
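As a sketch of that enhancement (one way it could look), the else branch can decode the admin opcode from the same arg2 command pointer, leaving the opcode as a number rather than mapping it to an nvme_admin_opcode name:

kprobe:nvme_setup_cmd
{
	$req = (struct request *)arg1;
	if ($req->rq_disk) {
		@start[arg1] = nsecs;
		@cmd[arg1] = arg2;
	} else {
		// admin command: frequency count its opcode
		$cmd = (struct nvme_command *)arg2;
		@admin_opcode[$cmd->common.opcode & 0xff] = count();
	}
}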