- Introduction
- Caches
- Virtual Memory Issues
- Memory
- Input/Output Devices
- I/O Performance Tips for Application Writers
- Summary
- References
3.3 Virtual Memory Issues
3.3.1 Overview
Most of today’s computers are virtual memory machines. Such machines translate logical memory addresses in a user’s program to physical memory addresses. There are several advantages that virtual memory machines have over those machines that do not have virtual memory. They allow programs that have logical address space requirements larger than physical memory to execute, albeit slowly. Virtual memory machines are capable of executing multiple processes which, when combined, occupy several times more memory than is actually in the machine.
Even though a process’ address space appears to be sequential on a virtual memory, it is likely not to be in the physical address space. This is because virtual memory systems break up a process’ memory into blocks, usually referred to as pages. Page sizes are typically four KB or larger in size and do not always have to be uniform, i.e., a process may have several pages that are four KB in size and yet several more pages which are one MB or larger. By breaking virtual memory into pages, the operating system can place some of the pages on magnetic disk. In this way, the operating system can keep only those pages currently being accessed by the process in physical memory, with the remainder residing on disk. When memory is accessed in a page that is not currently in physical memory, the operating system copies the page in question from disk and places it in physical memory. This may cause another page to be displaced from physical memory, which forces the operating system to place it on disk. The location on magnetic disk that is used for virtual memory pages is called swap space since it is used to swap pages in and out of physical memory.
The operating system needs a map to translate virtual memory addresses to physical memory addresses. This is done by mapping pages in virtual memory to those in physical memory. A common way to accomplish this is with a lookup table, generally referred to as a page table. Processes typically have multiple page tables, including page tables for text, data areas, etc.
3.3.2 The Translation Lookaside Buffer
So, accessing a particular location in memory involves identifying which page table to use and then performing the virtual to physical translation, using that table to finally retrieve the data. Many applications access memory sequentially, so translations for the same page are often repeated many times. To take advantage of this situation, most virtual memory machines have a special cache referred to as a Translation Lookaside Buffer (TLB), which contains physical to virtual memory translations. Typically, the TLB translations (or lookups) and instruction execution occur simultaneously so that memory operations occur very quickly. In the event that a process accesses memory which does not have its translation in the TLB, it is said to encounter a TLB miss. When a TLB miss occurs, the page identification and subsequent translation need to be constructed and the result placed in the TLB.
Most TLBs are located on chip and, as a result, are small in size. This fact, in combination with inadvertent programming mistakes, can result in a high number of TLB misses. There are two primary solutions for this problem—reducing the number of pages that need to be translated or programming to avoid causing a high number of TLB misses.
Probably the most common cause of TLB misses is due to memory being accessed with large, often constant, distances between them. The distance between memory accesses is usually referred to as stride. Unit stride is used to describe memory accesses that are sequential, i.e., the next element is exactly one element away. A stride of seven refers to a memory access pattern where every seventh element in a list is accessed, skipping over six elements at a time. Multidimensional arrays have inherently long strides, depending on how they are accessed. For example, an array declared as follows,
REAL*8 X(1000,2000)
has a stride of 1000 between columns, meaning that x(N,1) and x(N,2) are 1000 elements apart, regardless of the value of N. Consider the following sequence of code:
REAL*8 X(1000,1000), Y(1000,1000) ... DO I=1, M DO J=1, N X(I,J) = Y(I,J) END DO END DO
Assume that both X and Y are declared as above. Then, regardless of the values of M or N, the execution of this sequence of code results in both X and Y being accessed with strides of 1000. Note that this means that X(I,1) and X(I,2) are 8000 bytes apart. Many computers use a page size of only four KB. If this code were executed on such a machine, then each iteration of the inner loop forces two different TLB entries (one for X() and one for Y()).
To illustrate the severity of TLB misses and the benefit of being able to use larger page sizes, the following sequence of code was executed on a HP N-4000 computer:
for( j = 0; j < jmax; j++ ) { for( i = 0; i < imax; i++ ) x[i*stride + j] = 1.1 + x[i*stride + j]; }
This is clearly a deliberate attempt to cause TLB misses, because the array x is a double precision array and hence each element is eight bytes in size. The value of stride is set to 516, which translates to a byte stride of 4128; the value of jmax is fixed at 256 while imax is varied from 256 to 2048. The result is a test which accessed from 0.5 MB to 4 MB of data. The HP N-4000 has a 1 MB, four-way set associative cache which, when combined with a stride of 516, keeps the cache misses to a minimum. The results are astounding, as illustrated in Table 3-4. Note the time per access for the 0.5 MB problem size drops sharply after the page size is increased from 4 KB to 16 KB. This implies that a TLB miss takes roughly 160 ns since the data (0.5 MB) resides entirely in cache. More importantly, it’s interesting that TLB misses can cause your code to run over 30 times slower!
Table 3-4. TLB Misses in Nanoseconds as a Function of Page Size and Overall Problem Size.
Page Size |
Problem Size |
|||
---|---|---|---|---|
512 KB |
1 MB |
2 MB |
4 MB |
|
4 KB |
163.4 |
170.3 |
216.1 |
224.5 |
16 KB |
4.9 |
6.3 |
77.9 |
81.9 |
64 KB |
4.9 |
6.1 |
22.0 |
23.8 |
256 KB |
4.9 |
6.2 |
22.0 |
23.8 |
1 MB |
4.9 |
6.1 |
22.4 |
24.1 |
4 MB |
4.9 |
6.2 |
22.2 |
23.9 |