Configuring for Multiple Page Sizes
After determining that our application warrants the use of large pages, we need a strategy for deciding which parts of the application should use them. For example, we should consider whether to enable large pages for the target process's heap and stack. The trapstat utility provides some information about the types of address space that incur TLB misses.
Instruction TLB (iTLB) misses most likely result from the process's text and library text because instructions typically reside in these mappings. It is possible, however, for a program to execute code from other mappings; for example, the Java virtual machine compiles instructions on-the-fly into its heap and then executes from there. For the vast majority of applications, though, our first guess can be that iTLB misses result from the text or library mappings.
Data TLB misses are likely to result from the program's data mappings: its heap, stack, and data mapping, plus the read-only data within the text mapping.
The default page size for the Solaris OS is 8 kilobytes on UltraSPARC and 4 kilobytes on Intel x86 microprocessors. Larger pages of 4 megabytes are used by the Solaris kernel for its instruction and data sections; however, user applications requiring larger pages must explicitly request them.
The use of larger page sizes in the Solaris 2.6 OS through the Solaris 8 OS is only available through a special form of System V shared memory. To optimize database performance, we can use a form of shared memory called intimate shared memory (ISM). ISM is requested by the shmat(2) system call with the SHM_SHARE_MMU flag and is allocated as 4 megabyte pages, if possible. Databases such as Oracle, Informix, and Sybase request shared memory by using this flag and typically perform as much as 10 percent to 20 percent better as a result of a reduced TLB miss rate.
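As a rough illustration of the ISM request described above, the following C sketch creates a System V segment and attaches it with SHM_SHARE_MMU. The helper names ism_round_up() and ism_attach() are hypothetical. Rounding the size up to a 4-megabyte multiple is an assumption made here so that the segment is a whole number of large pages, not a documented requirement. On systems that do not define SHM_SHARE_MMU, the sketch falls back to a plain attach so that it still compiles.

```c
#include <stddef.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* SHM_SHARE_MMU is Solaris-specific; elsewhere fall back to a plain attach. */
#ifndef SHM_SHARE_MMU
#define SHM_SHARE_MMU 0
#endif

#define ISM_PAGE ((size_t)4 * 1024 * 1024)

/* Round a request up to a whole number of 4-megabyte pages (assumption). */
size_t ism_round_up(size_t bytes)
{
    return (bytes + ISM_PAGE - 1) / ISM_PAGE * ISM_PAGE;
}

/*
 * Create a private segment and attach it as ISM.
 * Returns the attach address, or NULL on failure.
 */
void *ism_attach(size_t bytes, int *shmid_out)
{
    int shmid = shmget(IPC_PRIVATE, ism_round_up(bytes), IPC_CREAT | 0600);
    if (shmid == -1)
        return NULL;

    void *addr = shmat(shmid, NULL, SHM_SHARE_MMU);
    if (addr == (void *)-1) {
        /* Do not leak the segment if the attach fails. */
        shmctl(shmid, IPC_RMID, NULL);
        return NULL;
    }
    *shmid_out = shmid;
    return addr;
}
```

A caller would typically check the result with pmap -sx to see whether 4-megabyte pages were actually allocated, because the request is granted only if possible.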
The Solaris 9 OS introduces a generic framework for allowing user applications to request larger page sizes. At the same time, ISM was also enhanced to take advantage of the other supported large page sizes, for example, 64 kilobytes and 512 kilobytes. Unmodified applications can be directed to use larger page sizes by means of the ppgsz(1) command and the mpss.so.1 library. Applications can also be customized to request larger page sizes by means of the memcntl(2) system call.
The Solaris 9 OS large-page infrastructure allows larger pages to be requested for the mappings of /dev/zero, that is, for the heap, stack, and other anonymous mappings.
Enabling Large Pages in the Solaris 9 OS
The new framework provided in the Solaris 9 OS, multiple page size support (MPSS), allows larger page sizes to be requested for user processes. The memcntl() system call specifies page-size advice for a given address range. A wrapper program, ppgsz, and an interposition library, mpss.so.1, call memcntl() on behalf of the target process so that unmodified binaries can make use of larger page sizes.
Advising Page-Size Preferences With ppgsz(1)
The ppgsz command is a wrapper that advises a preferred page size for the heap or stack of a target process. These page-size preferences are inherited across fork() but not across exec(). Thus, if the target program spawns another program (forks and then execs), the page sizes are not inherited. If inheritance of page sizes is required, the mpss.so.1 library should be used instead.
For example, to start a target process with 4 megabyte pages for its heap, we could use the ppgsz wrapper.
    sol9# ppgsz -o heap=4M ./testprog &
    sol9# pmap -sx `pgrep testprog`
    2953:   ./testprog
     Address  Kbytes     RSS    Anon  Locked Pgsz Mode   Mapped File
    00010000       8       8       -       -   8K r-x--  dev:277,83 ino:114875
    00020000       8       8       8       -   8K rwx--  dev:277,83 ino:114875
    00022000    3960    3960    3960       -   8K rwx--    [ heap ]
    00400000  131072  131072  131072       -   4M rwx--    [ heap ]
    FF280000     120     120       -       -   8K r-x--  libc.so.1
    FF29E000     136     128       -       -    - r-x--  libc.so.1
    FF2C0000      72      72       -       -   8K r-x--  libc.so.1
    FF2D2000     192     192       -       -    - r-x--  libc.so.1
    FF302000     112     112       -       -   8K r-x--  libc.so.1
    FF31E000      48      32       -       -    - r-x--  libc.so.1
    FF33A000      24      24      24       -   8K rwx--  libc.so.1
    FF340000       8       8       8       -   8K rwx--  libc.so.1
    FF390000       8       8       -       -   8K r-x--  libc_psr.so.1
    FF3A0000       8       8       -       -   8K r-x--  libdl.so.1
    FF3B0000       8       8       8       -   8K rwx--    [ anon ]
    FF3C0000     152     152       -       -   8K r-x--  ld.so.1
    FF3F6000       8       8       8       -   8K rwx--  ld.so.1
    FFBFA000      24      24      24       -   8K rwx--    [ stack ]
    -------- ------- ------- ------- -------
    total Kb  135968  135944  135112       -
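The two heap rows in this output can be checked with a little arithmetic. The heap begins at 0x00022000, which is not 4-megabyte aligned, so the part below the next 4-megabyte boundary (0x00400000) stays on 8-kilobyte pages. The following C sketch reproduces the 3960-kilobyte figure and compares the number of translations needed for the 131072-kilobyte region at each page size. The helper names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Round addr up to the next multiple of align (align must be a power of two). */
uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
    return (addr + align - 1) & ~(align - 1);
}

/* Kilobytes between addr and the next align boundary. */
uintptr_t gap_kbytes(uintptr_t addr, uintptr_t align)
{
    return (align_up(addr, align) - addr) / 1024;
}

/* Number of pages of page_kb kilobytes needed to map kbytes kilobytes. */
unsigned long pages_needed(unsigned long kbytes, unsigned long page_kb)
{
    return (kbytes + page_kb - 1) / page_kb;
}
```

With these helpers, gap_kbytes(0x22000, 0x400000) evaluates to 3960, matching the first heap row. pages_needed(131072, 4096) is 32, whereas pages_needed(131072, 8) is 16384, which shows how many fewer translations the large-page region requires.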
Interposing Shared Libraries With mpss.so.1
The mpss.so.1 shared object in /usr/lib provides a means by which the preferred stack or heap page size can be selectively configured for launched processes and their descendants. The library has the advantage over the wrapper that page sizes are inherited across exec(). To enable mpss.so.1, ensure that the following string is present in the environment (see ld.so.1(1)) along with one or more MPSS environment variables.
sol9# LD_PRELOAD=$LD_PRELOAD:mpss.so.1
Once preloaded, the mpss.so.1 shared object reads the following environment variables to determine preferred page size requirements and processes for which these requirements are specified.
    MPSSHEAP=size
    MPSSSTACK=size
        MPSSHEAP and MPSSSTACK specify the preferred page sizes
        for the heap and stack, respectively. The specified page
        size(s) are applied to all created processes.

    MPSSCFGFILE=config-file
        config-file is a text file which contains one or more
        mpss configuration entries of the form:

            exec-name:heap-size:stack-size
For example, the following commands enable 4-megabyte pages for the heap of all subsequently started processes.
    sol9# export LD_PRELOAD=$LD_PRELOAD:mpss.so.1
    sol9# export MPSSHEAP=4M
    sol9# ./testprog
See mpss.so.1(1) for all available configuration options.
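The same environment can also be prepared by a small launcher program instead of the shell. The following C sketch is one possible way to do that, assuming a libc that provides setenv(). The function names mpss_prepare_env() and mpss_exec() are hypothetical. Only the LD_PRELOAD and MPSSHEAP variables come from the text above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Append mpss.so.1 to LD_PRELOAD and set MPSSHEAP to heap_size
 * (for example "4M"). Returns 0 on success, -1 on failure.
 */
int mpss_prepare_env(const char *heap_size)
{
    static const char lib[] = "mpss.so.1";
    const char *old = getenv("LD_PRELOAD");
    char buf[1024];

    if (old != NULL && old[0] != '\0') {
        /* Keep existing preloads and append ours, as the shell example does. */
        if (strlen(old) + 1 + sizeof lib > sizeof buf)
            return -1;
        snprintf(buf, sizeof buf, "%s:%s", old, lib);
    } else {
        snprintf(buf, sizeof buf, "%s", lib);
    }
    if (setenv("LD_PRELOAD", buf, 1) != 0)
        return -1;
    return setenv("MPSSHEAP", heap_size, 1);
}

/* Replace the current process with argv[0], run under the MPSS settings. */
int mpss_exec(char *const argv[], const char *heap_size)
{
    if (mpss_prepare_env(heap_size) != 0)
        return -1;
    execvp(argv[0], argv);
    return -1;  /* reached only if execvp() fails */
}
```

Because the variables are set in the launcher's own environment before the exec, the target and any programs it later execs see the same settings, which is the inheritance behavior the library is meant to provide.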
Compiling Your Application to Request Larger Page Sizes
The Sun Forte™ 8 compilers provide options to cause the target application to request specific page sizes. The following options are supported for the compiler.
Set Stack and Heap Page Size With -xpagesize=n
(SPARC) Sets the preferred page size for the stack and the heap. The n value must be one of the following: 8K, 64K, 512K, 4M, 32M, 256M, 2G, 16G, or default.
You must specify a valid page size for the Solaris OS on the target platform, as returned by getpagesizes(3C). If you do not specify a valid page size, the request is silently ignored at run time. The Solaris OS offers no guarantee that the page size request will be honored. You can use pmap(1) or meminfo(2) to determine the page size in use on the target platform.
The -xpagesize option has no effect unless you use it at compile time and at link time.
NOTE
This feature is not available on the Solaris 7 OS and the Solaris 8 OS. A program compiled with this option will not link on the Solaris 7 OS or the Solaris 8 OS.
If you specify -xpagesize=default, the Solaris OS sets the page size. Specifying -xpagesize without an argument is equivalent to -xpagesize=default.
Compiling with this option has the same effect as setting the LD_PRELOAD environment variable to mpss.so.1 with the equivalent options, or running the ppgsz(1) command in the Solaris 9 software with the equivalent options before running the program. See the man pages for the Solaris 9 OS for details.
This option is a macro for -xpagesize_heap and -xpagesize_stack. These two options accept the same arguments as -xpagesize: 8K, 64K, 512K, 4M, 32M, 256M, 2G, 16G, or default. You can set them both with the same value by specifying -xpagesize or you can specify them individually with different values.
Set Heap Page Size in Memory With -xpagesize_heap=n
(SPARC) Sets the page size in memory for the heap. The n value can be 8K, 64K, 512K, 4M, 32M, 256M, 2G, 16G, or default. You must specify a valid page size for the Solaris OS on the target platform, as returned by getpagesizes(3C). If you do not specify a valid page size, the request is silently ignored at run time.
You can use pmap(1) or meminfo(2) to determine the page size in use on the target platform. If you specify -xpagesize_heap=default, the Solaris OS sets the page size. Specifying -xpagesize_heap without an argument is equivalent to -xpagesize_heap=default.
Compiling with this option has the same effect as setting the LD_PRELOAD environment variable to mpss.so.1 with the equivalent options, or running the ppgsz(1) command in the Solaris 9 software with the equivalent options before running the program. See the man pages for the Solaris 9 OS for details.
NOTE
This feature is not available on the Solaris 7 OS and the Solaris 8 OS. A program compiled with this option will not link on the Solaris 7 OS or the Solaris 8 OS.
Set Stack Page Size in Memory With -xpagesize_stack=n
(SPARC) Sets the page size in memory for the stack. The n value can be 8K, 64K, 512K, 4M, 32M, 256M, 2G, 16G, or default. You must specify a valid page size for the Solaris OS on the target platform, as returned by getpagesizes(3C). If you do not specify a valid page size, the request is silently ignored at run time. You can use pmap(1) or meminfo(2) to determine the page size in use on the target platform.
If you specify -xpagesize_stack=default, the Solaris OS sets the page size. Specifying -xpagesize_stack without an argument is equivalent to -xpagesize_stack=default.
Compiling with this option has the same effect as setting the LD_PRELOAD environment variable to mpss.so.1 with the equivalent options, or running the Solaris 9 command ppgsz(1) with the equivalent options before running the program. See the man pages for the Solaris 9 OS for details.
NOTE
This feature is not available on the Solaris 7 OS and the Solaris 8 OS. A program compiled with this option will not link on the Solaris 7 OS or the Solaris 8 OS.
Enhancing an Application to Request Larger Page Sizes
The memcntl(2) interface has been enhanced to allow page size requests to be made on behalf of a process. Thus, an application can automatically request larger page sizes when appropriate. An application that wants a larger page size should request it through the existing memcntl() interface.
int memcntl(caddr_t addr, size_t len, int cmd, caddr_t arg, int attr, int mask);
With the cmd argument, we can now specify a new control operation, MC_HAT_ADVISE, for page-size operations. When the cmd argument is set to MC_HAT_ADVISE, the arg argument is interpreted as a pointer to a new structure, as shown below. Currently, only three commands are supported; each command sets a preferred page size. mha_flags must always be set to zero. It is reserved for future use. Only one command can be specified at a time.
    struct memcntl_mha {
        uint_t mha_cmd;       /* command(s) */
        uint_t mha_flags;     /* flags */
        size_t mha_pagesize;  /* preferred page size */
    };
If mha_cmd is set to MHA_MAPSIZE_VA, the preferred page size is set for the address range [addr, addr + len). mha_pagesize must be a supported page size, as returned by getpagesizes(3C), or zero to let the system select the page size. The address and size of the range must be aligned to the new preferred page size. The access protections within new page-size regions contained in the range must be the same or the operation will fail. If there are holes in the address range or if the mapping is mapped with MAP_NORESERVE, the operation will fail. The address range can be contained inside a larger mapping or can span many mappings of varying sizes.
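The alignment rule for MHA_MAPSIZE_VA is easy to get wrong, so a caller might check it before issuing the request. The following C sketch does that and then calls memcntl() with MC_HAT_ADVISE. The function name advise_va_pagesize() is hypothetical. The power-of-two check is an assumption that holds for the page sizes listed in this article. On systems whose headers do not define MC_HAT_ADVISE, the sketch reports ENOTSUP instead of calling memcntl().

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/mman.h>

/* Nonzero if x is a power of two. */
static int is_pow2(size_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/*
 * Advise a preferred page size for [addr, addr + len).
 * pagesize == 0 lets the system choose.
 * Returns 0 on success, -1 with errno set on failure.
 */
int advise_va_pagesize(void *addr, size_t len, size_t pagesize)
{
    /* Base address and length must both be multiples of the new page size. */
    if (pagesize != 0 &&
        (!is_pow2(pagesize) ||
         ((uintptr_t)addr & (pagesize - 1)) != 0 ||
         (len & (pagesize - 1)) != 0)) {
        errno = EINVAL;
        return -1;
    }

#ifdef MC_HAT_ADVISE
    struct memcntl_mha mha;

    mha.mha_cmd = MHA_MAPSIZE_VA;
    mha.mha_flags = 0;              /* reserved, must be zero */
    mha.mha_pagesize = pagesize;
    return memcntl((caddr_t)addr, len, MC_HAT_ADVISE, (caddr_t)&mha, 0, 0);
#else
    /* No MPSS support in these headers. */
    errno = ENOTSUP;
    return -1;
#endif
}
```

The local check only covers alignment. The other conditions described above (uniform protections, no holes, no MAP_NORESERVE) are still enforced by the system call itself.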
The memcntl() interface promotes or demotes the preferred page sizes for any MAP_PRIVATE /dev/zero mappings, provided that the constraints mentioned above are met. Two special objects in the user address space require special handling: the process's heap and the primary thread stack (not the stack for additional threads).
The heap consists of the last .bss adjacent to the brk area and the brk area itself. The following figure illustrates the mapping procedure.
FIGURE 3 Process Address Space Mappings
For these two cases we have separate commands.
    MHA_MAPSIZE_STACK    /* token for the process's main stack */
    MHA_MAPSIZE_BSSBRK   /* token for the heap */
When MHA_MAPSIZE_STACK and MHA_MAPSIZE_BSSBRK are used, mha_pagesize must be a supported page size, as returned by getpagesizes(3C), or zero to let the system select the page size. The operation is then applied to the entire existing stack or heap mappings. The advice is then used for future page allocations. These commands for changing the preferred page size for stack or heap may first adjust the existing range in accordance with the new page size. This could involve creating new segments to pad out the base and length of the existing range to the new, preferred, page-size alignment.
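Because mha_pagesize must be one of the sizes reported by getpagesizes(3C), an application may want to query that list before asking for a particular size. The following C sketch wraps the call. The function names are hypothetical. On non-Solaris systems it falls back to reporting only the base page size from sysconf(), which is an assumption made so that the sketch compiles elsewhere.

```c
#include <stddef.h>
#include <unistd.h>
#if defined(__sun)
#include <sys/mman.h>   /* getpagesizes() */
#endif

/*
 * Fill sizes[] with up to max supported page sizes.
 * Returns the number of sizes stored, or -1 on error.
 */
int supported_page_sizes(size_t sizes[], int max)
{
#if defined(__sun)
    return getpagesizes(sizes, max);
#else
    /* Fallback (assumption): report only the base page size. */
    long base = sysconf(_SC_PAGESIZE);

    if (base <= 0 || max < 1)
        return -1;
    sizes[0] = (size_t)base;
    return 1;
#endif
}

/* Return 1 if want is a supported page size, 0 otherwise. */
int page_size_supported(size_t want)
{
    size_t sizes[16];
    int n = supported_page_sizes(sizes, 16);

    for (int i = 0; i < n; i++) {
        if (sizes[i] == want)
            return 1;
    }
    return 0;
}
```

An application could call page_size_supported((size_t)4 * 1024 * 1024) before using MHA_MAPSIZE_STACK or MHA_MAPSIZE_BSSBRK, and pass zero as mha_pagesize if the preferred size is not available.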
Applications need to know which alignment to use for their memory requests, both to attain maximum performance (for example, when using mmap() to create new mappings) and to avoid misaligned mprotect(), munmap(), and mmap() requests that could result in page demotion, in which larger pages are broken up into smaller pages.
Most applications that use mmap() pass in NULL for its addr argument to let the OS manage the address space. Applications that also want to use large pages with memcntl() should tell the OS the minimum alignment they need by means of a new flag, MAP_ALIGN. When MAP_ALIGN is specified, mmap() interprets the addr argument only as the required minimum alignment and is free to find a hole in the user address space that satisfies it. The alignment must be a power-of-two multiple of PAGESIZE, or zero to let the system choose the alignment. If MAP_ALIGN is specified along with MAP_FIXED, the request fails. If the alignment request cannot be satisfied, mmap() also fails.
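A minimal C sketch of an aligned /dev/zero mapping follows. The function name map_aligned() is hypothetical. Where MAP_ALIGN is defined, the alignment is passed in the addr argument as described above. Where it is not defined, the sketch substitutes a different technique, over-allocating and trimming the excess, purely so that it can be compiled and tried on other systems.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map len bytes of private /dev/zero memory aligned to align.
 * align must be a power-of-two multiple of the base page size and
 * len a multiple of the base page size.
 * Returns the mapping, or NULL on failure.
 */
void *map_aligned(size_t len, size_t align)
{
    int fd = open("/dev/zero", O_RDWR);

    if (fd == -1)
        return NULL;

#ifdef MAP_ALIGN
    /* With MAP_ALIGN, the addr argument carries the minimum alignment. */
    void *p = mmap((caddr_t)align, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ALIGN, fd, 0);
    close(fd);
    return p == MAP_FAILED ? NULL : p;
#else
    /* Fallback without MAP_ALIGN: over-allocate, then trim both ends. */
    char *raw = mmap(NULL, len + align, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE, fd, 0);
    close(fd);
    if (raw == (char *)MAP_FAILED)
        return NULL;

    uintptr_t start = ((uintptr_t)raw + align - 1) & ~((uintptr_t)align - 1);
    size_t head = start - (uintptr_t)raw;   /* unused bytes before start */
    size_t tail = align - head;             /* unused bytes after the range */

    if (head != 0)
        munmap(raw, head);
    if (tail != 0)
        munmap((char *)start + len, tail);
    return (void *)start;
#endif
}
```

A mapping obtained this way starts on a large-page boundary, so a later memcntl() request with MHA_MAPSIZE_VA over the same range meets the alignment requirement.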
For reference, we provide the following example. This code fragment sets the page size for the program's heap to 4 megabytes. Note the use of memalign to align the request on a 4-megabyte boundary. Because the heap starts on a boundary that is not 4-megabyte aligned, the first few megabytes of the heap can reside on 8-kilobyte pages. If the performance-sensitive data structures reside within this area, the program might not realize the full benefits of a larger page size. By allocating a 4-megabyte aligned area, we increase the chance that the subsequent virtual addresses allocated will land on a large page.
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MEGABYTE      ((size_t)(1024 * 1024))
    #define FOUR_MEGABYTE ((size_t)4 * MEGABYTE)

    int
    main(int argc, char *argv[])
    {
        struct memcntl_mha mha;
        char *my_memory;

        /* Set the preferred page size to 4MB for the heap */
        mha.mha_cmd = MHA_MAPSIZE_BSSBRK;
        mha.mha_flags = 0;
        mha.mha_pagesize = FOUR_MEGABYTE;
        if (memcntl(NULL, 0, MC_HAT_ADVISE, (caddr_t)&mha, 0, 0) == -1)
            perror("memcntl");  /* advice only; continue on base pages */

        /* Ensure user memory starts on the first large page */
        my_memory = (char *)memalign(FOUR_MEGABYTE, (size_t)100 * MEGABYTE);
        if (my_memory == NULL) {
            perror("memalign");
            return (1);
        }

        /* ... use my_memory ... */

        free(my_memory);
        return (0);
    }
Determining Whether Your UltraSPARC CPU Model Works Well With Large Pages
The TLB configurations are quite different across versions of UltraSPARC processors, but they share a few items in common. UltraSPARC I through IV processors support four page sizes: 8 kilobytes, 64 kilobytes, 512 kilobytes, and 4 megabytes. In addition, there are separate TLBs for the instruction and data paths.
UltraSPARC I and II
The UltraSPARC I and II microprocessors (143 megahertz to 480 megahertz) have two TLBs, one for the instruction path and one for the data path. Each TLB is a 64-entry, fully associative TLB that supports all four page sizes. User applications can use any of the four page sizes.
750 Megahertz UltraSPARC III
The 750 megahertz UltraSPARC III microprocessor has four TLBs: two for instruction and two for data. The instruction TLBs are implemented as a 16-entry, fully associative TLB that supports all four page sizes and a larger 128-entry TLB that supports only 8 kilobyte entries. The data TLBs are implemented as a 16-entry, fully associative TLB that supports all four page sizes and a larger 512-entry, two-way set associative TLB that supports only 8 kilobyte entries.
The 16-entry dTLB has nine locked entries, which are locked by software for the Solaris kernel, leaving only seven slots for large page sizes. Thus, use of large pages is typically not beneficial on 750 megahertz UltraSPARC III systems.
900 Megahertz+ UltraSPARC III
UltraSPARC III microprocessors running at 900 megahertz and above have five TLBs: two for instruction and three for data. The instruction TLBs are configured as a 16-entry, fully associative TLB that supports all four page sizes and a larger 128-entry TLB that supports only 8 kilobyte entries. The data TLBs are configured as a 16-entry, fully associative TLB that supports all four page sizes and two larger 512-entry, two-way set associative TLBs that support one page size per process. The increased size of the data TLBs on 900 megahertz UltraSPARC III provides a large TLB spread (2 gigabytes when 4 megabyte pages are used) and typically increases performance significantly for large memory applications.
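The TLB spread quoted here is simply the number of entries multiplied by the page size. The short C sketch below makes the arithmetic explicit. The function name tlb_reach() and the KB and MB macros are hypothetical names chosen for this illustration.

```c
#include <stdint.h>

#define KB ((uint64_t)1024)
#define MB (1024 * KB)

/* Address range covered by a TLB whose entries all map pages of page_bytes. */
uint64_t tlb_reach(unsigned entries, uint64_t page_bytes)
{
    return (uint64_t)entries * page_bytes;
}
```

A 512-entry TLB holding 4-megabyte pages covers tlb_reach(512, 4 * MB), which is 2 gigabytes, whereas the same TLB holding 8-kilobyte pages covers only 4 megabytes. By the same arithmetic, the 64-entry TLB of UltraSPARC I and II covers 256 megabytes with 4-megabyte pages, and the seven unlocked slots of the 16-entry dTLB on the 750 megahertz UltraSPARC III cover 28 megabytes.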
The large data TLBs are configured automatically in accordance with the most common page sizes in a process's address space. A process using one large page size in addition to the base page size (8 kilobytes) will have one of its large TLBs automatically programmed to enable the large page size when eight or more pages are using the larger page size within the process. It is assumed that the smaller TLB is available if there are fewer than eight pages.
Because the large TLBs support all four page sizes, large pages can be used effectively on UltraSPARC III. However, because the large TLBs can be configured for only one page size at a time per process, only two page sizes should be used concurrently. One of those page sizes should be the system's base page size (8 kilobytes) for mappings not using large pages, for example, the program text or libraries. The other larger page size is available for the remainder of the mappings. The most common selections for page sizes are 8 kilobytes and 4 megabytes, providing the greatest TLB spread for the large TLB.