12.2. Benchmarking Types
A spectrum of benchmark types is pictured in Figure 12.1, based on the workload they test; the production workload itself is included in the spectrum for comparison.
Figure 12.1 Benchmark types
The following sections describe the three benchmarking types: micro-benchmarks, simulations, and trace/replay. Industry-standard benchmarks are also discussed.
12.2.1. Micro-Benchmarking
Micro-benchmarking uses artificial workloads that test a particular type of operation, for example, performing a single type of file system I/O, database query, CPU instruction, or system call. The advantage is the simplicity: narrowing the number of components and code paths involved results in an easier target to study and allows performance differences to be root-caused quickly. Tests are also usually repeatable, because variation from other components is factored out as much as possible. Micro-benchmarks are also usually quick to test on different systems. And because they are deliberately artificial, micro-benchmarks are not easily confused with real workload simulations.
For micro-benchmark results to be consumed, they need to be mapped to the target workload. A micro-benchmark may test several dimensions, but only one or two may be relevant. Performance analysis or modeling of the target system can help determine which micro-benchmark results are appropriate, and to what degree.
Example micro-benchmark tools mentioned in previous chapters include, by resource type,
- CPU: UnixBench, SysBench
- Memory I/O: lmbench (in Chapter 6, CPUs)
- File system: Bonnie, Bonnie++, SysBench, fio
- Disk: hdparm
- Network: iperf
There are many, many more benchmark tools available. However, remember the warning from [Traeger 08]: “Most popular benchmarks are flawed.”
You can also develop your own. Aim to keep them as simple as possible, identifying attributes of the workload that can be tested individually. (See Section 12.3.6, Custom Benchmarks, for more about this.)
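For example, the following is a minimal sketch of such a custom micro-benchmark in Python (a hypothetical example, not a tool from this book): it tests a single attribute in isolation, sequential reads of a fixed I/O size from an existing file, and reports IOPS and throughput. The file path, I/O size, and duration are parameters chosen for illustration.

```python
# seqread.py: minimal custom micro-benchmark sketch (hypothetical example).
# Tests one attribute in isolation: sequential reads of a fixed I/O size.
import os
import sys
import time

def seq_read_benchmark(path, io_size=128 * 1024, duration=10):
    """Sequentially read io_size-byte blocks from path for duration seconds,
    then report IOPS and throughput."""
    fd = os.open(path, os.O_RDONLY)
    ops = 0
    start = time.monotonic()
    while time.monotonic() - start < duration:
        if not os.read(fd, io_size):      # EOF: rewind and keep reading
            os.lseek(fd, 0, os.SEEK_SET)
            continue
        ops += 1
    elapsed = time.monotonic() - start
    os.close(fd)
    print(f"{ops / elapsed:.0f} IOPS, "
          f"{ops * io_size / elapsed / (1024 * 1024):.1f} Mbytes/s")

if __name__ == "__main__":
    # usage (hypothetical): python3 seqread.py /var/tmp/testfile 131072
    seq_read_benchmark(sys.argv[1], int(sys.argv[2]))
```

Unless the file is opened with O_DIRECT or the working set is much larger than main memory, this measures reads returned from the file system cache rather than from disk, a distinction that matters in the design example that follows.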
Design Example
Consider designing a file system micro-benchmark to test the following attributes: sequential or random I/O, I/O size, and direction (read or write). Table 12.1 shows five sample tests to investigate these dimensions, along with the reason for each test.
Table 12.1 Sample File System Micro-Benchmark Tests
| # | Test | Intent |
|---|------|--------|
| 1 | sequential 512-byte reads | to test maximum (realistic) IOPS |
| 2 | sequential 128-Kbyte reads | to test maximum read throughput |
| 3 | sequential 128-Kbyte writes | to test maximum write throughput |
| 4 | random 512-byte reads | to test the effect of random I/O |
| 5 | random 512-byte writes | to test the effect of rewrites |
More tests can be added as desired. All of these tests are multiplied by two additional factors:
- Working set size: the size of the data being accessed (e.g., total file size):
  - Much smaller than main memory: so that the data caches entirely in the file system cache, and the performance of the file system software can be investigated
  - Much larger than main memory: to minimize the effect of the file system cache and drive the benchmark toward testing disk I/O
- Thread count: assuming a small working set size:
  - Single-threaded, to test file system performance based on the current CPU clock speed
  - Multithreaded, with enough threads to saturate all CPUs, to test the maximum performance of the system: file system and CPUs
These can quickly multiply to form a large matrix of tests. Statistical analysis techniques can be used to reduce the required set of tests.
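As an illustration of how quickly the matrix grows, the following sketch enumerates the five tests from Table 12.1 crossed with the two additional factors; run_test() is a hypothetical stub where the actual benchmark invocation would go.

```python
# Hypothetical harness enumerating the benchmark test matrix; run_test() is a stub.
import itertools

TESTS = [                      # (pattern, I/O size in bytes, direction), from Table 12.1
    ("sequential", 512, "read"),
    ("sequential", 128 * 1024, "read"),
    ("sequential", 128 * 1024, "write"),
    ("random", 512, "read"),
    ("random", 512, "write"),
]
WORKING_SET_SIZES = ["0.1x main memory", "10x main memory"]
THREAD_COUNTS = [1, 32]        # single-threaded, and enough threads to saturate all CPUs

def run_test(pattern, io_size, direction, working_set, threads):
    # Stub: invoke the actual micro-benchmark here.
    print(f"{pattern} {io_size}-byte {direction}s, "
          f"working set {working_set}, {threads} thread(s)")

# 5 tests x 2 working set sizes x 2 thread counts = 20 benchmark runs
for (pattern, size, direction), ws, threads in itertools.product(
        TESTS, WORKING_SET_SIZES, THREAD_COUNTS):
    run_test(pattern, size, direction, ws, threads)
```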
Creating benchmarks that focus on top speeds has been called sunny day performance testing. So that issues are not overlooked, you also want to consider cloudy day performance testing, which involves testing nonideal situations, including contention, perturbations, and workload variance.
12.2.2. Simulation
Many benchmarks simulate customer application workloads (and are sometimes called macro-benchmarks). These may be based on workload characterization of the production environment (see Chapter 2, Methodology) to determine the characteristics to simulate. For example, it may be found that a production NFS workload is composed of the following operation types and probabilities: reads, 40%; writes, 7%; getattr, 19%; readdir, 1%; and so on. Other characteristics can also be measured and simulated.
Simulations can produce results that resemble how clients will perform with the real-world workload, if not closely, at least close enough to be useful. They can encompass many factors that would be time-consuming to investigate using micro-benchmarking. Simulations can also include the effects of complex system interactions that may be missed altogether when using micro-benchmarks.
The CPU benchmarks Whetstone and Dhrystone, introduced in Chapter 6, CPUs, are examples of simulations. Whetstone was developed in 1972 to simulate scientific workloads of the time. Dhrystone, from 1984, simulates integer-based workloads of the time. The SPEC SFS benchmark, mentioned earlier, is another workload simulation.
A workload simulation may be stateless, where each server request is unrelated to the previous request. For example, the NFS server workload described previously may be simulated by requesting a series of operations, with each operation type chosen randomly based on the measured probability.
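A minimal sketch of such a stateless generator, using the operation probabilities quoted earlier (with the remainder lumped into an "other" category, and issue_op() left as a stub), might look like this:

```python
# Stateless workload generator sketch: each operation is chosen independently,
# weighted by measured probabilities (illustrative numbers from the text).
import random

OP_PROBABILITIES = {
    "read": 0.40,
    "write": 0.07,
    "getattr": 0.19,
    "readdir": 0.01,
    "other": 0.33,       # remainder, so the probabilities sum to 1
}

def next_op():
    """Pick the next operation type, independent of any previous request."""
    ops = list(OP_PROBABILITIES)
    weights = list(OP_PROBABILITIES.values())
    return random.choices(ops, weights=weights)[0]

def issue_op(op):
    # Stub: send the request to the server under test.
    print(op)

for _ in range(10):
    issue_op(next_op())
```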
A simulation may also be stateful, where each request is dependent on client state, at minimum the previous request. It may be found that NFS reads and writes tend to arrive in groups, such that the probability of a write when the previous operation was a write is much higher than if it were a read. Such a workload can be better simulated using a Markov model, by representing requests as states and measuring the probability of state transitions [Jain 91].
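A minimal sketch of such a stateful generator: represent the previous operation as the state and pick the next operation from a table of measured state-transition probabilities. The transition values below are made up for illustration; note that the write row gives another write a much higher probability, reproducing the grouping behavior just described.

```python
# Stateful (Markov) workload generator sketch: the next operation depends on the
# previous one via state-transition probabilities (values made up for illustration).
import random

TRANSITIONS = {
    # previous op: {next op: probability}
    "read":    {"read": 0.7, "write": 0.1, "getattr": 0.2},
    "write":   {"read": 0.1, "write": 0.8, "getattr": 0.1},   # writes arrive in groups
    "getattr": {"read": 0.5, "write": 0.1, "getattr": 0.4},
}

def next_op(previous):
    """Choose the next operation based on the previous one."""
    candidates = TRANSITIONS[previous]
    return random.choices(list(candidates), weights=list(candidates.values()))[0]

op = "read"
for _ in range(10):
    op = next_op(op)
    print(op)
```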
A problem with simulations is that they can ignore variance, as described in Section 12.1.3, Benchmarking Sins. Customer usage patterns can also change over time, requiring these simulations to be updated and adjusted to stay relevant. There may be resistance to this, however, if there are already published results based on the older benchmark version, which would no longer be usable for comparisons with the new version.
12.2.3. Replay
A third type of benchmarking involves attempting to replay a trace log to the target, testing its performance with the actual captured client operations. This sounds ideal—as good as testing in production, right? It is, however, problematic: when characteristics and delivered latency change on the server, the captured client workload is unlikely to respond naturally to these differences, which may prove no better than a simulated customer workload. When too much faith is placed in it, it can be worse.
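As a sketch of the approach, the following replays a captured trace while preserving the original inter-arrival times; the trace format (a timestamp in seconds followed by the operation) and the issue_op() stub are hypothetical. It is open loop: requests are issued on the captured schedule regardless of how quickly the target responds, which is one way a replayed workload fails to react naturally.

```python
# Trace replay sketch: re-issues captured operations, preserving the original
# inter-arrival times. The trace format (timestamp, operation) is hypothetical.
import time

def issue_op(op):
    # Stub: deliver the captured operation to the system under test.
    print(op)

def replay(trace_path):
    prev_ts = None
    with open(trace_path) as trace:
        for line in trace:
            if not line.strip():
                continue
            ts_str, op = line.split(maxsplit=1)
            ts = float(ts_str)
            if prev_ts is not None:
                time.sleep(max(0.0, ts - prev_ts))   # preserve inter-arrival gap
            prev_ts = ts
            issue_op(op.strip())

# replay("/var/tmp/capture.trace")    # hypothetical trace file
```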
Consider this hypothetical situation: A customer is considering upgrading storage infrastructure. The current production workload is traced and replayed on the new hardware. Unfortunately, performance is worse, and the sale is lost. The problem: the trace/replay operated at the disk I/O level. The old system housed 10 K rpm disks, and the new system houses slower 7,200 rpm disks. However, the new system provides 16 times the amount of file system cache and faster processors. The actual production workload would have improved, as it would have returned largely from cache—which was not simulated by replaying disk events.
While this is a case of testing the wrong thing, other subtle timing effects can mess things up, even with the correct level of trace/replay. As with all benchmarks, it is crucial to analyze and understand what’s going on.
12.2.4. Industry Standards
Industry-standard benchmarks are available from independent organizations, which aim to create fair and relevant benchmarks. These are usually a collection of different micro-benchmarks and workload simulations that are well defined and documented, and they must be executed under certain guidelines so that the results are as intended. Vendors may participate (usually for a fee), which provides them with the software to execute the benchmark. Their results usually require full disclosure of the configured environment, which may be audited.
For the customer, these benchmarks can save a lot of time, as benchmark results may already be available for a variety of vendors and products. The task for you, then, is to find the benchmark that most closely resembles your future or current production workload. For current workloads, this may be determined by workload characterization.
The need for industry-standard benchmarks was made clear by a 1985 paper titled “A Measure of Transaction Processing Power” by Jim Gray and others [Anon 85]. It described the need to measure price/performance ratio and detailed three benchmarks that vendors could execute, called Sort, Scan, and DebitCredit. It also suggested an industry-standard measure of transactions per second (TPS), based on DebitCredit, which could be used much like miles per gallon for cars. Jim Gray and his work later encouraged the creation of the TPC [DeWitt 08].
Apart from the TPS measure, others that have been used for the same role include
- MIPS: millions of instructions per second. While this is a measure of performance, the work that is performed depends on the type of instruction, which may be difficult to compare between different processor architectures.
- FLOPS: floating-point operations per second—a similar role to MIPS, but for workloads that make heavy use of floating-point calculations.
Industry benchmarks typically measure a custom metric based on the benchmark, which serves only for comparisons with itself.
TPC
The TPC creates and administers various industry benchmarks, with a focus on database performance. These include
- TPC-C: a simulation of a complete computing environment where a population of users executes transactions against a database.
- TPC-DS: a simulation of a decision support system, including queries and data maintenance.
- TPC-E: an online transaction processing (OLTP) workload, modeling a brokerage firm database with customers who generate transactions related to trades, account inquiries, and market research.
- TPC-H: a decision support benchmark, simulating ad hoc queries and concurrent data modifications.
- TPC-VMS: The TPC Virtual Measurement Single System allows other benchmarks to be gathered for virtualized databases.
TPC results are shared online [5] and include price/performance.
SPEC
The Standard Performance Evaluation Corporation (SPEC) develops and publishes a standardized set of industry benchmarks, including
- SPEC CPU2006: a measure of compute-intensive workloads. This includes CINT2006 for integer performance, and CFP2006 for floating-point performance.
- SPECjEnterprise2010: a measure of full-system performance for Java Enterprise Edition (Java EE) 5 or later application servers, databases, and supporting infrastructure.
- SPECsfs2008: a simulation of a client file access workload for NFS and Common Internet File System (CIFS) servers (see [2]).
- SPECvirt_sc2010: for virtualized environments, this measures the performance of the virtualized hardware, the platform, and the guest operating system and application software.
SPEC’s results are shared online [6] and include details of how systems were tuned and a list of components, but not usually price.