- 9.1 Background
- 9.2 Traditional Tools
- 9.3 BPF Tools
- 9.4 BPF One-Liners
- 9.5 Optional Exercises
- 9.6 Summary
9.2 Traditional Tools
This section covers iostat(1) for disk activity summaries, perf(1) for block I/O tracing, blktrace(8), and the SCSI log.
9.2.1 iostat
iostat(1) summarizes per-disk I/O statistics, providing metrics for IOPS, throughput, I/O request times, and utilization. It can be executed by any user, and is typically the first command used to investigate disk I/O issues at the command line. The statistics it sources are maintained by the kernel by default, so the overhead of this tool is considered negligible.
iostat(1) provides many options for customizing the output. A useful combination is -dxz 1, to show disk utilization only (-d), extended columns (-x), skipping devices with zero metrics (-z), and per-second output (1). The output is so wide that I’ll show a left portion and then the right portion; this is from a production issue I helped debug:
# iostat -dxz 1
Linux 4.4.0-1072-aws (...)     12/18/2018     _x86_64_    (16 CPU)

Device:   rrqm/s  wrqm/s     r/s    w/s    rkB/s    wkB/s \ ...
xvda        0.00    0.29    0.21   0.17     6.29     3.09 / ...
xvdb        0.00    0.08   44.39   9.98  5507.39  1110.55 \ ...
                                                           / ...
Device:   rrqm/s  wrqm/s     r/s    w/s    rkB/s    wkB/s \ ...
xvdb        0.00    0.00  745.00   0.00 91656.00     0.00 / ...
                                                           \ ...
Device:   rrqm/s  wrqm/s     r/s    w/s    rkB/s    wkB/s / ...
xvdb        0.00    0.00  739.00   0.00 92152.00     0.00 \ ...
These columns summarize the workload applied, and are useful for workload characterization. The first two provide insight into disk merges: this is where a new I/O is found to be reading or writing to a disk location adjacent (front or back) to another queued I/O, so they are merged for efficiency.
The columns are:
rrqm/s: Read requests queued and merged per second
wrqm/s: Write requests queued and merged per second
r/s: Read requests completed per second (after merges)
w/s: Write requests completed per second (after merges)
rkB/s: Kbytes read from the disk device per second
wkB/s: Kbytes written to the disk device per second
The first group of output (showing both xvda and xvdb devices) is the summary since boot, and can be used for comparison with the subsequent one-second summaries. This output shows that xvdb normally has a read throughput of 5,507 Kbytes/sec, but the current one-second summaries show over 90,000 read Kbytes/sec. The system has a heavier-than-normal read workload.
Some math can be applied to these columns to figure out the average read and write size. Dividing the rkB/s column by the r/s column shows the average read size is about 124 Kbytes. A newer version of iostat(1) includes average sizes as the rareq-sz (read average request size) and wareq-sz columns.
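For example, using the summary-since-boot line for xvdb above: 5,507.39 rkB/s ÷ 44.39 r/s ≈ 124 Kbytes per read. The one-second intervals work out similarly: 91,656 ÷ 745 ≈ 123 Kbytes.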
The right columns show:
... \  avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
... /     49.32     0.00   12.74    6.96   19.87   3.96   0.15
... \    243.43     2.28   41.96   41.75   42.88   1.52   8.25
... /
... \  avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
... /    246.06    25.32   33.84   33.84    0.00   1.35 100.40
... \
... /  avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
... \    249.40    24.75   33.49   33.49    0.00   1.35 100.00
These show the resulting performance by the device. The columns are:
avgrq-sz: Average request size in sectors (512 bytes).
avgqu-sz: Average number of requests both waiting in the driver request queue and active on the device.
await: Average I/O request time (aka response time), including time waiting in the driver request queue and the I/O response time of the device (ms).
r_await: Same as await, but for reads only (ms).
w_await: Same as await, but for writes only (ms).
svctm: Average (inferred) I/O response time for the disk device (ms).
%util: Percentage of time device was busy processing I/O requests (utilization).
The most important metric for delivered performance is await. If the application and file system use a technique to mitigate write latency (e.g., write-back caching), then w_await may not matter as much, and you can focus on r_await instead.
For resource usage and capacity planning, %util is important, but keep in mind that it is only a measure of busy-ness (non-idle time), and may mean little for virtual devices backed by multiple disks. Those devices may be better understood by the load applied: IOPS (r/s + w/s) and throughput (rkB/s + wkB/s).
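For example, from the one-second intervals above, the applied load on xvdb was around 745 read IOPS and 91,656 read Kbytes/sec (roughly 90 Mbytes/sec), with negligible writes.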
This example output shows the disk hitting 100% utilization, and an average read I/O time of 33 milliseconds. For the workload applied and the disk device, this turned out to be expected performance. The real issue was that the files being read had become so large they could no longer be cached in the page cache, and were read from disk instead.
9.2.2 perf
perf(1) was introduced in Chapter 6 for PMC analysis and timed stack sampling. Its tracing capabilities can also be used for disk analysis, especially using the block tracepoints.
For example, tracing the queuing of requests (block_rq_insert), their issue to a storage device (block_rq_issue), and their completions (block_rq_complete):
# perf record -e block:block_rq_insert,block:block_rq_issue,block:block_rq_complete -a
^C[ perf record: Woken up 7 times to write data ]
[ perf record: Captured and wrote 6.415 MB perf.data (20434 samples) ]
# perf script
  kworker/u16:3 25003 [004] 543348.164811: block:block_rq_insert: 259,0 RM 4096 () 2564656 + 8 [kworker/u16:3]
   kworker/4:1H   533 [004] 543348.164815: block:block_rq_issue: 259,0 RM 4096 () 2564656 + 8 [kworker/4:1H]
        swapper     0 [004] 543348.164887: block:block_rq_complete: 259,0 RM () 2564656 + 8 [0]
  kworker/u17:0 23867 [005] 543348.164960: block:block_rq_complete: 259,0 R () 3190760 + 256 [0]
             dd 25337 [001] 543348.165046: block:block_rq_insert: 259,0 R 131072 () 3191272 + 256 [dd]
             dd 25337 [001] 543348.165050: block:block_rq_issue: 259,0 R 131072 () 3191272 + 256 [dd]
             dd 25337 [001] 543348.165111: block:block_rq_complete: 259,0 R () 3191272 + 256 [0]
[...]
The output contains many details, beginning with the process that was on-CPU when the event occurred, which may or may not be the process responsible for the event. Other details include a timestamp, disk major and minor numbers, a string encoding the type of I/O (rwbs, described earlier), and other details about the I/O.
I have in the past built tools that post-process these events for calculating latency histograms, and visualizing access patterns.1 However, for busy systems this means dumping all block events to user space for post-processing. BPF can do this processing in the kernel more efficiently, and then emit only the desired output. See the later biosnoop(8) tool as an example.
1 See iolatency(8) in perf-tools [78]: this uses Ftrace to access the same per-event tracepoint data from the trace buffer, which avoids the overhead of creating and writing a perf.data file.
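To illustrate this kind of post-processing, the following is a minimal sketch (not one of those original tools) that reads perf script output on stdin and prints a log2 histogram of device latency, measured from block_rq_issue to block_rq_complete and keyed by device and sector. It assumes the default perf script field layout shown above, which can vary between perf versions, and it does not handle merged requests:

#!/usr/bin/env python3
# Sketch: post-process "perf script" block tracepoint output into a log2
# histogram of device latency (block_rq_issue -> block_rq_complete).
# block_rq_insert events are ignored: this measures device time, not queued time.
import re
import sys
from collections import defaultdict

# e.g. "dd 25337 [001] 543348.165050: block:block_rq_issue: 259,0 R 131072 () 3191272 + 256 [dd]"
pattern = re.compile(r'\s(\d+\.\d+):\s+block:block_rq_(issue|complete):\s+'
                     r'(\d+,\d+)\s+\S+.*?(\d+)\s+\+\s+\d+')

issued = {}               # (dev, sector) -> issue timestamp (seconds)
hist = defaultdict(int)   # log2 slot of latency in microseconds -> count

for line in sys.stdin:
    m = pattern.search(line)
    if not m:
        continue
    ts, action, dev, sector = float(m.group(1)), m.group(2), m.group(3), m.group(4)
    key = (dev, sector)
    if action == "issue":
        issued[key] = ts
    elif key in issued:
        usecs = (ts - issued.pop(key)) * 1e6
        hist[int(usecs).bit_length()] += 1

for slot in sorted(hist):
    low = 0 if slot == 0 else 2 ** (slot - 1)
    print("%8d -> %8d us: %d" % (low, 2 ** slot, hist[slot]))

Usage would be something like perf script | python3 blklat.py, where blklat.py is a made-up name for this sketch. The same caveat applies: every event is still dumped to user space first.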
9.2.3 blktrace
blktrace(8) is a specialized utility for tracing block I/O events. Using its btrace(8) front-end to trace all events:
# btrace /dev/nvme2n1
259,0    2        1     0.000000000   430  Q  WS 2163864 + 8 [jbd2/nvme2n1-8]
259,0    2        2     0.000009556   430  G  WS 2163864 + 8 [jbd2/nvme2n1-8]
259,0    2        3     0.000011109   430  P   N [jbd2/nvme2n1-8]
259,0    2        4     0.000013256   430  Q  WS 2163872 + 8 [jbd2/nvme2n1-8]
259,0    2        5     0.000015740   430  M  WS 2163872 + 8 [jbd2/nvme2n1-8]
[...]
259,0    2       15     0.000026963   430  I  WS 2163864 + 48 [jbd2/nvme2n1-8]
259,0    2       16     0.000046155   430  D  WS 2163864 + 48 [jbd2/nvme2n1-8]
259,0    2       17     0.000699822   430  Q  WS 2163912 + 8 [jbd2/nvme2n1-8]
259,0    2       18     0.000701539   430  G  WS 2163912 + 8 [jbd2/nvme2n1-8]
259,0    2       19     0.000702820   430  I  WS 2163912 + 8 [jbd2/nvme2n1-8]
259,0    2       20     0.000704649   430  D  WS 2163912 + 8 [jbd2/nvme2n1-8]
259,0   11        1     0.000664811     0  C  WS 2163864 + 48 [0]
259,0   11        2     0.001098435     0  C  WS 2163912 + 8 [0]
[...]
Multiple event lines are printed for each I/O. The columns are:
Device major, minor number
CPU ID
Sequence number
Action time, in seconds
Process ID
Action identifier (see blkparse(1)): Q == queued, G == get request, P == plug, M == merge, D == issued, C == completed, etc.
RWBS description (see the “rwbs” section earlier in this chapter): W == write, S == synchronous, etc.
Address + size [device]
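As an example of reading these columns: the first line of the output above shows that on CPU 2, sequence number 1, at time 0.000000000 seconds, PID 430 (jbd2/nvme2n1-8) queued (Q) a synchronous write (WS) of 8 sectors at address 2163864. Later lines show it being merged with adjacent writes and issued (D) as a 48-sector request, which completes (C) at time 0.000664811 seconds.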
The output can be post-processed and visualized using Chris Mason’s seekwatcher [91].
As with perf(1) per-event dumping, the overhead of blktrace(8) can be a problem for busy disk I/O workloads. In-kernel summaries using BPF can greatly reduce this overhead.
9.2.4 SCSI Logging
Linux also has a built-in facility for SCSI event logging. It can be enabled via sysctl(8) or /proc. For example, both of these commands set the logging level to 3 for all event types (warning: depending on your disk workload, this may flood your system log):
# sysctl -w dev.scsi.logging_level=0x1b6db6db
# echo 0x1b6db6db > /proc/sys/dev/scsi/logging_level
The format of the number is a bitfield that sets the logging level from 1 to 7 for 10 different event types. It is defined in drivers/scsi/scsi_logging.h. The sg3-utils package provides a scsi_logging_level(8) tool for setting these. For example:
# scsi_logging_level -s --all 3
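As a sketch of how this bitfield packs, assuming the 3-bits-per-type layout from scsi_logging.h (the event type names in the comment are taken from that header):

#!/usr/bin/env python3
# Sketch: build a dev.scsi.logging_level value, assuming 10 SCSI event types
# with 3 bits each (levels 0-7), per the SCSI_LOG_*_SHIFT/_BITS definitions
# in drivers/scsi/scsi_logging.h.

NUM_EVENT_TYPES = 10   # error, timeout, scan, mlqueue, mlcomplete, llqueue,
                       # llcomplete, hlqueue, hlcomplete, ioctl
BITS_PER_TYPE = 3

def pack_all(level):
    """Return the logging_level value with every event type set to 'level'."""
    value = 0
    for i in range(NUM_EVENT_TYPES):
        value |= (level & 0x7) << (i * BITS_PER_TYPE)
    return value

print(hex(pack_all(3)))   # 0x1b6db6db: the value used in the sysctl example above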
Example events:
# dmesg
[...]
[542136.259412] sd 0:0:0:0: tag#0 Send: scmd 0x0000000001fb89dc
[542136.259422] sd 0:0:0:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
[542136.261103] sd 0:0:0:0: tag#0 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[542136.261110] sd 0:0:0:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
[542136.261115] sd 0:0:0:0: tag#0 Sense Key : Not Ready [current]
[542136.261121] sd 0:0:0:0: tag#0 Add. Sense: Medium not present
[542136.261127] sd 0:0:0:0: tag#0 0 sectors total, 0 bytes done.
[...]
This can be used to help debug errors and timeouts. While timestamps are provided (the first column), using them to calculate I/O latency is difficult without unique identifying details.
BPF tracing can be used to produce custom SCSI-level and other I/O stack-level logs, with more I/O details including latency calculated in the kernel.