Taming Your Emu to Improve Application Performance
- Catching Your Emu With trapstat
- Programming Your Emu
- Things to Watch Out For
- Commands and API Quick Reference
- About the Authors
- Ordering Sun Documents
- Accessing Sun Documentation Online
One of my favorite features of the Solaris™ 9 Operating System (Solaris OS) is multiple page size support (MPSS). Why? Because it's one of the easiest ways to achieve a significant performance gain for a wide range of applications.
Memory-intensive applications that have a large working set often perform suboptimally on the Solaris OS without a little tuning. This is because they make inefficient use of the microprocessor's translation lookaside buffer (TLB). MPSS allows you to use larger page sizes with the microprocessor's memory management unit (MMU, or M-Emu), which makes more efficient use of the TLB and ultimately improves application performance.
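MPSS can only use page sizes that the underlying hardware actually supports. As a quick check, the pagesize command with the -a option lists the page sizes available on the running system. The output below is illustrative for an UltraSPARC III class system; the sizes on your system may differ.
sol9# pagesize -a
8192
65536
524288
4194304
These four sizes (8 kilobytes, 64 kilobytes, 512 kilobytes, and 4 megabytes) correspond to the page-size rows reported by trapstat -T later in this article.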
Applications most likely to benefit from MPSS are memory intensive and typically have working sets greater than a few hundred megabytes. Because the TLB can hold only a few hundred translations at a time, these applications typically overflow the microprocessor's TLB. The Solaris kernel services overflows from the UltraSPARC™ TLB, which can result in a significant amount of system-software time.
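To put the TLB's capacity in perspective, consider an illustrative example (the exact number of TLB entries varies by processor): a TLB with 512 entries mapping 8-kilobyte pages can cover only 512 x 8 kilobytes = 4 megabytes of the address space at any instant, so an application with a working set of several hundred megabytes misses in the TLB constantly. With 4-megabyte pages, the same 512 entries cover 512 x 4 megabytes = 2 gigabytes, enough to map such a working set entirely.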
There is a catch, however. Regular performance tools such as mpstat, sar, and vmstat do not report the time spent processing TLB overflows (which we refer to as TLB misses) as system time. Instead, they report an application's TLB misses as user time. This can be quite misleading, because the CPU appears to be spending all of its time running the application when, in fact, it is spending a large fraction of its time in the kernel.
This Sun BluePrints™ OnLine article explains how to enable the MPSS feature on the Solaris OS and how to analyze its effect on performance. It briefly explains the hardware feature being exploited, how to measure use of that feature with standard Solaris OS tools, and the ways in which users and programmers can invoke it. The article does not explain the underlying theory in great detail, but it provides working examples and references to help you locate additional information on the subject.
Catching Your Emu With trapstat
To help you determine how frequently an application overflows the TLB, the Solaris 9 OS introduces a new tool called trapstat. This tool provides an easy way to measure the time spent in the kernel servicing TLB misses. Using the -t option, trapstat reports how many TLB misses occur and what percentage of the total CPU time is spent processing TLB misses.
The -t option provides first-level summary statistics. Time spent servicing TLB misses is summarized in the lower right corner of the report. As shown in the following example, 46.2 percent of the total execution time is spent servicing TLB misses. The TLB miss detail is broken down into misses incurred in the data portion of the address space (dTLB) and misses incurred in the instruction portion of the address space (iTLB). Counts are also provided separately for user-mode misses (u) and kernel-mode misses (k). We are primarily interested in the user-mode misses, because the application runs in user mode.
sol9# trapstat -t 1 111
cpu m| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
-----+-------------------------------+-------------------------------+----
  0 u|         1  0.0         0  0.0 |   2171237 45.7         0  0.0 |45.7
  0 k|         2  0.0         0  0.0 |      3751  0.1         7  0.0 | 0.1
=====+===============================+===============================+====
 ttl |         3  0.0         0  0.0 |   2192238 46.2         7  0.0 |46.2
For additional detail, you can use the -T option to obtain a per-page-size breakdown. In our example, trapstat -T shows that almost all of the misses occurred on 8-kilobyte pages.
sol9# trapstat -T 1
cpu m size| itlb-miss %tim itsb-miss %tim | dtlb-miss %tim dtsb-miss %tim |%tim
----------+-------------------------------+-------------------------------+----
  0 u   8k|        30  0.0         0  0.0 |   2170236 46.1         0  0.0 |46.1
  0 u  64k|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  0 u 512k|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  0 u   4m|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
- - - - - + - - - - - - - - - - - - - - - + - - - - - - - - - - - - - - - + - -
  0 k   8k|         1  0.0         0  0.0 |      4174  0.1        10  0.0 | 0.1
  0 k  64k|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  0 k 512k|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
  0 k   4m|         0  0.0         0  0.0 |         0  0.0         0  0.0 | 0.0
==========+===============================+===============================+====
      ttl |        31  0.0         0  0.0 |   2174410 46.2        10  0.0 |46.2
We can conclude from this output that the application could potentially run almost twice as fast if we could eliminate the majority of the TLB misses. Our objective in using the mechanisms discussed in the following paragraphs is to minimize the user-mode data TLB misses (dTLB) by instructing the application to use larger pages for its data segments. Typically, data misses are incurred in the program's heap or stack segments. We can use the Solaris 9 OS MPSS commands to direct the application to use 4-megabyte pages for its heap, stack, or anonymous memory mappings.
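As a brief preview of the commands described in the sections that follow, the ppgsz command can launch a program with a larger preferred page size for its heap and stack. In the sketch below, the binary name ./myapp is a placeholder for your application, and the 4-megabyte size assumes the hardware supports that page size.
sol9# ppgsz -o heap=4M,stack=4M ./myapp
Alternatively, an unmodified binary can be run under the mpss.so.1 preload library, which reads the requested page sizes from the MPSSHEAP and MPSSSTACK environment variables.
sol9# LD_PRELOAD=mpss.so.1 MPSSHEAP=4M MPSSSTACK=4M ./myapp
Either approach can then be verified by rerunning trapstat -T and confirming that the dTLB misses have shifted from the 8k row to the 4m row, with a corresponding drop in %tim.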