12.3. Methodology
This section describes methodologies and exercises for performing benchmarking, whether micro-benchmarks, simulations, or replays. The topics are summarized in Table 12.2.
Table 12.2 Benchmark Analysis Methodologies

| Methodology | Type |
| --- | --- |
| Passive benchmarking | experimental analysis |
| Active benchmarking | observational analysis |
| CPU profiling | observational analysis |
| USE method | observational analysis |
| Workload characterization | observational analysis |
| Custom benchmarks | software development |
| Ramping load | experimental analysis |
| Sanity check | observational analysis |
| Statistical analysis | statistical analysis |
12.3.1. Passive Benchmarking
This is the fire-and-forget strategy of benchmarking—where the benchmark is executed and then ignored until it has completed. The main objective is the collection of benchmark data. This is how benchmarks are commonly executed and is described as its own methodology for comparison with active benchmarking.
These are some example passive benchmarking steps:
- Pick a benchmark tool.
- Run it with a variety of options.
- Make a slide deck of the results.
- Hand the slides to management.
Problems with this approach have been discussed previously. In summary, the results may be
- Invalid due to benchmark software bugs
- Limited by the benchmark software (e.g., single-threaded)
- Limited by a component that is unrelated to the benchmark target (e.g., a congested network)
- Limited by configuration (performance features not enabled, not a maximum configuration)
- Subject to perturbations (and not repeatable)
- Benchmarking the wrong thing entirely
Passive benchmarking is easy to perform but prone to errors. When performed by the vendor, it can create false alarms that waste engineering resources or cause lost sales. When performed by the customer, it can result in poor product choices that haunt the company later on.
12.3.2. Active Benchmarking
With active benchmarking, you analyze performance while the benchmark is running—not just after it’s done—using other tools. You can confirm that the benchmark tests what it says it tests, and that you understand what that is. Active benchmarking can also identify the true limiters of the system under test, or of the benchmark itself. It can be very helpful to include specific details of the limit encountered when sharing the benchmark results.
As a bonus, this can be a good time to develop your skills with performance observability tools. In theory, you are examining a known load and can see how it appears from these tools.
Ideally, the benchmark can be configured and left running in steady state, so that analysis can be performed over a period of hours or days.
Example
As an example, let’s look at the first test of the Bonnie++ micro-benchmark tool. It is described on its home page [7]:
- Bonnie++ is a benchmark suite that is aimed at performing a number of simple tests of hard drive and file system performance.
The first test, “Sequential Output”/“Per Chr”, was executed on two different operating systems for comparison.
Fedora/Linux (under KVM virtualization):
```
# bonnie++
[...]
Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec  %CP  /sec %CP
9d219ce8-cf52-40 2G 52384  23 47334   3 31938   3 74866  67 1009669  61 +++++ +++
[...]
```
SmartOS/illumos (under OS virtualization):
```
# bonnie++
Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec  %CP  /sec %CP
smartos1.local   2G 162464 99 72027  86 65222  99 251249 99 2426619  99 +++++ +++
[...]
```
So SmartOS is 3.1x faster. If we were to stop right here, that would be passive benchmarking.
Given that Bonnie++ is a “hard drive and file system performance” benchmark, we can begin by checking the workload that was performed.
Running iostat(1M) on SmartOS to check disk I/O:
```
$ iostat -xnz 1
[...]
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  668.9    0.0 82964.3  0.0  6.0    0.0    8.9   1  60 c0t1d0
                    extended device statistics
    r/s    w/s   kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  419.0    0.0 53514.5  0.0 10.0    0.0   23.8   0 100 c0t1d0
[...]
```
The disks begin idle, then show variable write throughput during the benchmark (kw/s), at a rate much lower than what Bonnie++ reported as its K/sec result.
Running vfsstat(1M) on SmartOS to check file system I/O (VFS-level):
```
$ vfsstat 1
  r/s    w/s   kr/s     kw/s ractv wactv read_t writ_t  %r  %w    d/s  del_t zone
[...]
 45.3 1514.7    4.5 193877.3   0.0   0.1    0.0    0.0   0   6  412.4    5.5 b8b2464c
 45.3 1343.6    4.5 171979.4   0.0   0.1    0.0    0.1   0   7 1343.6   14.0 b8b2464c
 45.3 1224.8    4.5 156776.9   0.0   0.1    0.0    0.1   0   6 1157.9   12.2 b8b2464c
 45.3 1224.8    4.5 156776.9   0.0   0.1    0.0    0.1   0   6 1157.9   12.2 b8b2464c
```
Now the throughput is consistent with the Bonnie++ result. The IOPS, however, are not: vfsstat(1M) shows the writes are about 128 Kbytes each (kw/s / w/s, e.g., 193877.3 / 1514.7 ≈ 128), and not “Per Chr.”
Using truss(1) on SmartOS to investigate the writes to the file system (ignoring the overhead of truss(1) for the moment):
```
write(4, "\001020304050607\b\t\n\v".., 131072)          = 131072
write(4, "\001020304050607\b\t\n\v".., 131072)          = 131072
write(4, "\001020304050607\b\t\n\v".., 131072)          = 131072
```
This confirms that Bonnie++ is performing 128 Kbyte file system writes.
Using strace(1) on Fedora for comparison:
```
write(3, "\0\1\2\3\4\5\6\7\10\t\n\v\f\r\16\17\20\21\22\23\24"..., 4096) = 4096
write(3, "\0\1\2\3\4\5\6\7\10\t\n\v\f\r\16\17\20\21\22\23\24"..., 4096) = 4096
write(3, "\0\1\2\3\4\5\6\7\10\t\n\v\f\r\16\17\20\21\22\23\24"..., 4096) = 4096
```
This shows that Fedora is performing 4 Kbyte file system writes, whereas SmartOS was performing 128 Kbyte writes.
With more analysis (using DTrace), this was seen to be buffering of putc() in the system library, with each operating system defaulting to a different buffering size. As an experiment, Bonnie++ on Fedora was adjusted to use a 128 Kbyte buffer (using setbuffer()), which improved its performance by 18%.
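As a rough sketch of that experiment (illustrative code only, not Bonnie++’s source; the output file name is hypothetical), setbuffer() can be used to assign a 128 Kbyte stdio buffer before any I/O, so that per-character putc() calls are flushed to the file system in 128 Kbyte writes:

```c
/* Illustrative sketch only (not Bonnie++ source): enlarge the stdio buffer
 * so that per-character putc() output is flushed as 128 Kbyte writes.
 * On glibc, setbuffer() may require _DEFAULT_SOURCE. */
#include <stdio.h>
#include <stdlib.h>

#define BUFSZ (128 * 1024)

int main(void) {
    FILE *fp = fopen("testfile", "w");      /* hypothetical output file */
    if (fp == NULL)
        return 1;

    char *buf = malloc(BUFSZ);
    setbuffer(fp, buf, BUFSZ);              /* must be set before the first I/O */

    for (long i = 0; i < 16 * 1024 * 1024; i++)
        putc((int)(i & 0xff), fp);          /* sequential per-character output */

    fclose(fp);
    free(buf);
    return 0;
}
```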
Active performance analysis determined various other characteristics of how this test was performed, providing a better understanding of the result [8]. The conclusion was that it was ultimately limited by single-threaded CPU speed and spent 85% of its CPU time in user mode.
Bonnie++ is not an unusually bad benchmark tool; it has served people well on many occasions. I picked it for this example (and also chose the most suspicious of its tests to study) because it’s well known, I’ve studied it before, and findings like this are not uncommon. But it is just one example.
It should be noted that a newer experimental version of Bonnie++ has changed the “Per Chr” test to actually perform 1-byte file system I/O. Comparing results between different Bonnie++ versions, for this test, will show significant differences. For more about Bonnie++ performance analysis, see the article by Roch Bourbonnais on “Decoding Bonnie++” [9].
12.3.3. CPU Profiling
CPU profiling of both the benchmark target and the benchmark software is worth singling out as a methodology, because it can result in some quick discoveries. It is often performed as part of an active benchmarking investigation.
The intent is to quickly check what all the software is doing, to see if anything interesting shows up. This can also narrow your study to the software components that matter the most: those in play for the benchmark.
Both user- and kernel-level stacks can be profiled. User-level CPU profiling was introduced in Chapter 5, Applications. Both were covered in Chapter 6, CPUs, with examples in Section 6.6, Analysis, including flame graphs.
Example
A disk micro-benchmark was performed on a proposed new system with some disappointing results: disk throughput was worse than on the old system. I was asked to find out what was wrong, with the expectation that either the disks or the disk controller was inferior and should be upgraded.
I began with the USE method (Chapter 2, Methodology) and found that the disks were not very busy, despite that being the point of the benchmark test. There was some CPU usage, in system-time (the kernel).
For a disk benchmark, you might not expect the CPUs to be an interesting target for analysis. Given some CPU usage in the kernel, I thought it was worth a quick check to see if anything interesting showed up, even though I didn’t expect it to. I profiled and generated the flame graph shown in Figure 12.2.
Figure 12.2 Flame graph profiling of kernel-time
Browsing the stack frames showed that 62.17% of CPU samples included a function called zfs_zone_io_throttle(). I didn’t need to read the code for this function, as its name was enough of a clue: a resource control, ZFS I/O throttling, was active and artificially throttling the benchmark! This was a default setting on the new system (but not the older system) and had been overlooked when the benchmark was performed.
12.3.4. USE Method
The USE method was introduced in Chapter 2, Methodology, and is described in chapters for the resources it studies. Applying the USE method during benchmarking can ensure that a limit is found. Either some component, hardware or software, has reached 100% utilization, or you are not driving the system to its limit.
An example of using the USE method was described in Section 12.3.2, Active Benchmarking, where it helped discover that a disk benchmark was not working as intended.
12.3.5. Workload Characterization
Workload characterization was also introduced in Chapter 2, Methodology, and discussed in later chapters. This methodology can be used to determine how well a given benchmark relates to a current production environment by characterizing the production workload for comparison.
12.3.6. Custom Benchmarks
For simple benchmarks, it may be desirable to code the software yourself. Try to keep the program as short as possible, to avoid complexity that hinders analysis.
The C programming language is usually a good choice, as it maps closely to what is executed—although think carefully about how compiler optimizations will affect your code: the compiler may elide simple benchmark routines if it thinks the output is unused and therefore unnecessary to calculate. It may be worth disassembling the compiled binary to see what will actually be executed.
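As a contrived sketch of the problem (not an example from this chapter’s benchmarks), the timed loop below computes a sum that is otherwise unused, so an optimizing compiler may remove it entirely; consuming the result via a volatile variable is one way to keep the work:

```c
/* Contrived sketch: a timed loop that an optimizing compiler may elide,
 * since its result would otherwise be unused. */
#include <stdio.h>
#include <time.h>

volatile long sink;                         /* consuming the result discourages elision */

int main(void) {
    struct timespec t0, t1;
    long sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < 100000000L; i++)
        sum += i;                           /* the "work" being timed */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    sink = sum;                             /* without this, the loop may vanish at -O2 */

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("elapsed: %.3f s\n", secs);
    return 0;
}
```

Even then, some compilers can replace such a loop with a closed-form calculation, which is why checking the disassembly remains the definitive test.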
Languages that involve virtual machines, asynchronous garbage collection, and dynamic runtime compilation can be much more difficult to debug and control with reliable precision. You may need to use such languages anyway, if it is necessary to simulate client software written in them.
Writing custom benchmarks can also reveal subtle details about the target that can prove useful later on. For example, when developing a database benchmark, you may discover that the API supports various options for improving performance that are not currently in use in the production environment, which was developed before the options existed.
Your software may simply generate load (a load generator) and leave the measurements for other tools. One way to perform this is to ramp load.
12.3.7. Ramping Load
This is a simple method for determining the maximum throughput a system can handle. It involves adding load in small increments and measuring the delivered throughput until a limit is reached. The results can be graphed, showing a scalability profile. This profile can be studied visually or by using scalability models (see Chapter 2, Methodology).
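A minimal sketch of the approach is shown below (this is not the tool used for Figure 12.3); it adds one worker thread per interval and prints the delivered throughput at each step, with do_work() as a hypothetical placeholder for one benchmark operation (for example, one 8 Kbyte cached read):

```c
/* Sketch of ramping load: add one worker per step and report throughput.
 * do_work() is a placeholder for one benchmark operation. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_THREADS 64
#define STEP_SECS   10

static atomic_long ops;                     /* operations completed by all workers */

static void do_work(void) { /* placeholder: perform one operation here */ }

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        do_work();
        atomic_fetch_add(&ops, 1);
    }
    return NULL;
}

int main(void) {
    pthread_t tid;
    for (int n = 1; n <= MAX_THREADS; n++) {
        pthread_create(&tid, NULL, worker, NULL);   /* ramp: one more worker */
        atomic_store(&ops, 0);
        sleep(STEP_SECS);                           /* measure this step */
        printf("%d threads: %ld ops/s\n", n, atomic_load(&ops) / STEP_SECS);
    }
    return 0;
}
```

The same ramping can be driven by adding processes or client machines instead of threads, as in the example that follows.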
As an example, Figure 12.3 shows how a file system and system scale with threads. Each thread performs 8 Kbyte random reads on a cached file, and these were added one by one.
Figure 12.3 Ramping file system load
This system peaked at almost half a million reads per second. The results were checked using VFS-level statistics, which confirmed that the I/O size was 8 Kbytes, and that at peak over 3.5 Gbytes/s was transferred.
The load generator for this test was written in Perl and is short enough to include entirely as an example:
```perl
#!/usr/bin/perl -w
#
# randread.pl - randomly read over specified file.

use strict;

my $IOSIZE = 8192;                      # size of I/O, bytes
my $QUANTA = $IOSIZE;                   # seek granularity, bytes

die "USAGE: randread.pl filename\n" if @ARGV != 1 or not -e $ARGV[0];

my $file = $ARGV[0];
my $span = -s $file;                    # span to randomly read, bytes
my $junk;

open FILE, "$file" or die "ERROR: reading $file: $!\n";

while (1) {
        seek(FILE, int(rand($span / $QUANTA)) * $QUANTA, 0);
        sysread(FILE, $junk, $IOSIZE);
}

close FILE;
```
This uses sysread() to call the read() syscall directly and avoid buffering.
This was written to micro-benchmark an NFS server and was executed in parallel from a farm of clients, each performing random reads on an NFS-mounted file. The results of the micro-benchmark (reads per second) were measured on the NFS server, using nfsstat(1M) and other tools.
The number of files used and their combined size were controlled (this forms the working set size), so that some tests could return entirely from cache, and others from disk. (See Design Example in Section 12.2.1, Micro-Benchmarking.)
The number of instances executing on the client farm was incremented one by one, to ramp up the load until a limit was reached. This was also graphed to study the scalability profile, along with resource utilization (USE method), confirming that a resource had been exhausted. In this case it was CPU resources, which initiated another investigation to improve performance further.
I used this program and this approach to find the limits of the Oracle ZFS Storage Appliance (formerly the Sun ZFS Storage Appliance [10]). These limits were used as the official results—which, to the best of our knowledge, set world records. I also had a similar set of software written in C, but it wasn’t needed in this case: I had an abundance of client CPUs, and while the switch to C reduced their utilization, it didn’t make a difference for the result, as the same bottleneck was reached on the target. Other, more sophisticated benchmarks were also tried, as well as other languages, but they could not improve upon these results.
When following this approach, measure latency as well as the throughput, especially the latency distribution. Once the system approaches its limit, queueing delays may become significant, causing latency to increase. If you push load too high, latency may become so high that it is no longer reasonable to consider the result as valid. Ask yourself if the delivered latency would be acceptable to a customer.
For example: You use a large array of clients to drive a target system to 990,000 IOPS, which responds with an average I/O latency of 5 ms. You’d really like it to break 1 million IOPS, but the system is already reaching saturation. By adding more and more clients, you manage to scrape past 1 million IOPS; however, all operations are now heavily queued, with average latency of over 50 ms (which is not acceptable)! Which result do you give marketing? (Answer: 990,000 IOPS.)
12.3.8. Sanity Check
This is an exercise for checking a benchmark result by investigating whether any characteristic doesn’t make sense. It includes checking whether the result would have required some component to exceed its known limits, such as network bandwidth, controller bandwidth, interconnect bandwidth, or disk IOPS. If any limit has been exceeded, it is worth investigating in more detail. In most cases, this exercise ultimately discovers that the benchmark result is bogus.
Here’s an example: An NFS server is benchmarked with 8 Kbyte reads and is reported to deliver 50,000 IOPS. It is connected to the network using a single 1 Gbit/s Ethernet port. The network throughput required to drive this result is 50,000 IOPS x 8 Kbytes = 400,000 Kbytes/s, plus protocol headers. This is over 3.2 Gbits/s—well in excess of the 1 Gbit/s known limit. Something is wrong!
Results like this usually mean the benchmark has tested client caching and not driven the entire workload to the NFS server.
I’ve used this calculation to identify numerous bogus benchmarks, which have included the following throughputs over a 1 Gbit/s interface [11]:
- 120 Mbytes/s
- 200 Mbytes/s
- 350 Mbytes/s
- 800 Mbytes/s
- 1.15 Gbytes/s
These are all throughputs in a single direction. The 120 Mbyte/s result may be fine—a 1 Gbit/s interface should reach around 119 Mbytes/s. The 200 Mbyte/s result is possible only if there was heavy traffic in both directions and this was summed; however, these are single-direction results. The 350 Mbyte/s and beyond results are bogus.
When you’re given a benchmark result to check, look for what simple sums you can perform on the provided numbers to discover such limits.
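As a minimal sketch of such a sum (values hard-coded from the NFS example above; protocol header overhead is ignored):

```c
/* Sanity check sketch: does the claimed IOPS fit within the interface bandwidth?
 * Values are from the NFS example above; header overhead is ignored. */
#include <stdio.h>

int main(void) {
    double iops      = 50000;               /* claimed result */
    double io_bytes  = 8 * 1024;            /* 8 Kbyte reads */
    double link_gbps = 1.0;                 /* 1 Gbit/s interface */

    double needed_gbps = iops * io_bytes * 8 / 1e9;
    printf("needed %.2f Gbit/s, link %.2f Gbit/s: %s\n", needed_gbps, link_gbps,
        needed_gbps > link_gbps ? "exceeds known limit (bogus)" : "plausible");
    return 0;
}
```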
If you have access to the system, it may be possible to further test results by constructing new observations or experiments. This can follow the scientific method: the question you’re testing now is whether the benchmark result is valid. From this, hypotheses and predictions may be drawn and then tested for verification.
12.3.9. Statistical Analysis
Statistical analysis is a process for the collection and study of benchmark data. It follows three phases:
- Selection of the benchmark tool, its configuration, and system performance metrics to capture
- Execution of the benchmark, collecting a large dataset of results and metrics
- Interpretation of the data with statistical analysis, producing a report
Unlike active benchmarking, which focuses on analysis of the system while the benchmark is running, statistical analysis focuses on analyzing the results. It is also different from passive benchmarking, in which no analysis is performed at all.
This approach is used in environments where access to a large-scale system may be both time-limited and expensive. For example, there may be only one “max config” system available, but many teams want access to run tests at the same time, including
- Sales: during proof of concepts, to run a simulated customer load to show what the max config system can deliver
- Marketing: to get the best numbers for a marketing campaign
- Support: to investigate pathologies that arise only on the max config system, under serious load
- Engineering: to test the performance of new features and code changes
- Quality: to perform non-regression testing and certifications
Each team may have only a limited time to run its benchmarks on the system, but much more time to analyze the results afterward.
As the collection of metrics is expensive, make an extra effort to ensure that they are reliable and trustworthy, to avoid having to redo them later if a problem is found. Apart from checking how they are generated technically, you can also collect more statistical properties so that problems can be found sooner. These may include statistics for variation, full distributions, error margins, and others (see Section 2.8, Statistics, in Chapter 2, Methodology). When benchmarking for code changes or non-regression testing, it is crucial to understand the variation and error margins, in order to make sense of a pair of results.
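For example, a minimal sketch of summarizing variation across repeated runs (the run results here are made-up values):

```c
/* Sketch: summarize repeated benchmark runs as mean, standard deviation,
 * and coefficient of variation. The run results are made-up values. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double runs[] = { 182.1, 180.3, 179.8, 184.6, 181.2 };     /* e.g., Mbytes/s */
    int n = sizeof(runs) / sizeof(runs[0]);

    double sum = 0.0, var = 0.0;
    for (int i = 0; i < n; i++)
        sum += runs[i];
    double mean = sum / n;

    for (int i = 0; i < n; i++)
        var += (runs[i] - mean) * (runs[i] - mean);
    var /= (n - 1);                                             /* sample variance */

    double sd = sqrt(var);
    printf("mean %.1f, stddev %.2f, CoV %.1f%%\n", mean, sd, 100.0 * sd / mean);
    return 0;
}
```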
Also collect as much performance data as possible from the running system (without harming the result due to the collection overhead), so that forensic analysis can be performed afterward on this data. Data collection may include the use of tools such as sar(1), third-party products, and custom tools that dump all statistics available.
For example, on Linux, a custom shell script may copy the contents of the /proc statistic files before and after the run. Everything possible can be included, in case it is needed. Such a script may also be executed at intervals during the benchmark, provided the performance overhead is acceptable. Other statistical tools may also be used to create logs.
On Solaris-based systems, kstat -p can be used to dump all kernel statistics, which can be recorded before and after the run and also at intervals. This output is easy to parse and can be imported into a database for advanced analysis.
Statistical analysis of results and metrics can include scalability analysis and queueing theory to model the system as a network of queues. These topics were introduced in Chapter 2, Methodology, and are the subject of separate texts ([Jain 91], [Gunther 97], [Gunther 07]).