3.4 Memory
To the casual observer, memory systems appear to be the most elementary part of the computer: simply a large number of identical chips that store data. This is, of course, far from reality.
The memory system, or main memory as it is typically called, is the crossroads of data movement in the computer. Data moving to and from the caches passes through the memory system. Input and output devices such as magnetic disks, network devices, and magnetic tape all have the main memory system as the target of their output and the source of their input.
Computer hardware vendors have, by increasing address space capabilities, themselves caused memory capacity demands to increase. During the 1970s, almost all computer vendors moved from 16-bit addresses to 32-bit addresses. As we move into the next millennium, most computers will have 64-bit address spaces. As a result, memory systems that used to be measured in kilobytes (KB) are now measured in gigabytes (GB).
With the economies of scale in processors, multiprocessor computers have evolved from simply having two processors in the 1980s to configurations with literally thousands of processors. At the same time, memory systems are expected to be capable of "feeding" all of these data-consuming processors.
So, while complexity in processor design has increased at a dramatic rate, memory systems have been driven to keep up with them—albeit unsuccessfully. Added to this pressure is the tremendous increase in multiprocessor systems that require not just faster memory systems, but larger capacity systems that allow overall performance to scale with the number of processors.
3.4.1 Basics of Memory Technology
Random access memory is by far the most popular memory chip technology today. It is described as random because it allows you to address any particular memory location without having to step through memory in sequential fashion until you arrive at the desired destination. Most memory systems today (including caches) consist of dynamic random access memory (DRAM) or static random access memory (SRAM).
DRAM is less expensive because it typically requires only a single transistor per bit. The downside is that, due to this simplicity, the electrical charge in each cell must be periodically refreshed. Hence, DRAM can be periodically unavailable because it is being refreshed. SRAM doesn't require periodic refreshes, which allows it to be faster, but it is more complex, generally requiring five to six transistors per bit. The result is a more expensive piece of hardware.
There are two times that are important in measuring memory performance: access time and cycle time. Access time is the total time from when a read or write is requested until the data actually arrives at its destination. Cycle time is the minimum time between successive requests to memory. Note that since SRAM does not need to be refreshed, there is no difference between its access time and cycle time.
3.4.2 Interleaving
Simple memory system organizations using DRAM (which is most likely since it is less expensive than SRAM) result in each memory transaction requiring the sum of the access time and the cycle time. One way to improve this is to construct the memory system from multiple banks of memory, organized so that sequential words of memory are located in different banks. Addresses can be sent to multiple banks simultaneously and multiple words can then be retrieved simultaneously. This improves performance substantially. Having multiple, independent memory banks benefits single processor performance as well as multiprocessor performance because, with enough banks, different processors can be accessing different sets of banks simultaneously. This practice of having multiple memory banks with sequential words distributed across them in a round-robin fashion is referred to as interleaving. Interleaving reduces the effective cycle time by enabling multiple memory requests to be performed simultaneously.
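As a rough sketch of this idea (not taken from the text), the following C fragment models how sequential words map to banks when words are distributed round-robin; the word size and bank count are assumed values chosen only for illustration. Sequential words cycle through all the banks, so consecutive requests can be serviced by different banks and overlap in time.

#include <stdio.h>

#define WORD_SIZE 8    /* bytes per word (assumed)         */
#define NUM_BANKS 16   /* number of memory banks (assumed) */

/* With word-level interleaving, sequential words fall in successive banks. */
unsigned bank_of(unsigned long byte_address)
{
    return (byte_address / WORD_SIZE) % NUM_BANKS;
}

int main(void)
{
    /* Words 0..19 cycle through banks 0..15 and then wrap around. */
    for (unsigned long w = 0; w < 20; w++)
        printf("word %2lu -> bank %2u\n", w, bank_of(w * WORD_SIZE));
    return 0;
}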
The benefits of interleaving can be defeated when the memory access pattern is such that the same banks are accessed repeatedly. Let us assume that we have a memory system with 16 banks and that the computer uses cache lines which contain 4 words, with each word being 8 bytes in size. If the following loop is executed on this computer, then the same set of banks is being accessed repeatedly (since stride = 64 words = 16 * 4 words):
double *x;
...
stride = 64;
sum = 0.0;
for( j = 0; j < jmax; j++ )
{
    for( i = 0; i < imax; i++ )
        sum += x[i*stride + j];
}
As a result, each successive memory access has to wait until the previous one completes (the sum of the access time and cycle time). This causes the processor (and hence the user’s program) to stall on each memory access. This predicament is referred to as a bank stall or bank contention.
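To see the contention concretely, the sketch below reuses the same kind of bank-mapping function as the earlier sketch, but at cache-line granularity to match the 4-word lines in this example. The parameters (16 banks, 4-word cache lines, lines distributed round-robin across banks) come from the example above; the mapping function itself is an assumption for illustration. With a stride of 64 words, every reference lands in the same bank.

#include <stdio.h>

#define WORDS_PER_LINE 4   /* 4 words per cache line (from the example) */
#define NUM_BANKS      16  /* 16 memory banks (from the example)        */

/* Assume cache lines are distributed round-robin across the banks. */
unsigned bank_of_word(unsigned long word_index)
{
    return (word_index / WORDS_PER_LINE) % NUM_BANKS;
}

int main(void)
{
    unsigned long stride = 64;  /* words, as in the loop above */
    /* Every access with this stride hits the same bank (bank 0 when j = 0). */
    for (unsigned long i = 0; i < 8; i++)
        printf("x[%3lu] -> bank %u\n", i * stride, bank_of_word(i * stride));
    return 0;
}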
Let’s revisit the sequence of code used in the previous TLB discussion:
for( j = 0; j < jmax; j++ )
{
    for( i = 0; i < imax; i++ )
        x[i*stride + j] = 1.1 + x[i*stride + j];
}
First, fix imax at 16384 and jmax at 512 so that the problem size is 64 MB. In Figure 3-4, we show the average access time for various hardware platforms using several values of stride. Two sets of data are charted for the HP N-4000, one using 4 KB pages and another using 1 MB pages. Note the tremendous difference in performance for the N-4000 caused by TLB misses, as illustrated by the divergence of the two curves after a stride of 16. The data was generated using 8 KB pages on the SUN UE3500 and 16 KB pages on the SGI Origin 2000. Note that, even with 1 MB pages, the N-4000's performance decreases after a stride of 8, indicating that memory bank contention is hindering performance.
Figure 3-4. Memory access time as a function of word stride for various computers.
Larger page sizes do not benefit only applications that use a lot of data; they also benefit applications that have a large text (i.e., instruction) segment. Electronic design simulations, relational database engines, and operating systems are all examples of applications whose performance is sensitive to text size.
How does one alter the page size for an executable? HP-UX provides a mechanism to change the attributes of an executable file through the system utility chatr. For the tests discussed here, the executable was modified to request 4 KB data pages with the following command:
chatr +pd 4K ./a.out
Similarly,
chatr +pd 1M ./a.out
modified the executable to request 1 MB data pages.
Does this imply that all applications should use huge pages, say 1 MB, to improve performance? Not necessarily. Consider the processes that execute constantly on a computer (e.g., daemons); usually there are a dozen or more. Daemons typically occupy only about 100 to 200 KB of memory each, so with 4 KB pages a couple dozen of them would use less than 5 MB of memory. However, if each of them were to use 1 MB pages, then each daemon would occupy at least 1 MB of memory, for a total of roughly 24 MB. Since these processes run constantly, the computer would be using 19 MB (24 - 5) of memory unnecessarily! So it's not always a good idea to use huge pages, but for applications that use a large amount of memory, they can give large boosts in performance.
3.4.3 Hiding Latency
With processor speeds improving at a faster rate than memory speeds, there have been multiple approaches to overcome delays due to memory accesses. One solution we’ve discussed is the creation of caches. Another successful approach enables processors to issue memory requests and continue execution until the data requested is used by another instruction in the program. This is referred to as stall-on-use. Another strategy is to enable prefetching of data either automatically (in hardware) or through the use of special prefetch instructions.
Processors that implement stall-on-use and can issue multiple outstanding memory transactions can do a good job of hiding or reducing the effective memory latency. For example, suppose such a processor is capable of issuing up to 10 memory transactions and the realizable memory latency is 200 clocks. Then the effective memory latency can be as little as 200 / 10 = 20 clocks!
Prefetching can also dramatically reduce effective memory latency. Note that if a prefetch instruction can be issued for data more than 200 clocks before the data is accessed by a regular memory instruction, then the memory latency will be reduced to that of the cache latency. This is because the delay of 200 clocks or more allows the data to be moved into the cache (i.e., prefetched).
Data prefetching can be accomplished with software or hardware. Some hardware platforms actually detect when a processor is accessing memory sequentially and will begin prefetching the data from memory into the processor’s cache without any other intervention. Most computers that support prefetching do so with software prefetching. This is typically done through the use of special instructions or instruction arguments. For example, PA-RISC 2.0 processors perform software prefetching by interpreting load instructions into a special register as prefetches.
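As a concrete illustration of the software approach, the minimal sketch below uses GCC's __builtin_prefetch intrinsic (a compiler feature not mentioned in the text, and unrelated to the PA-RISC mechanism described above) to request data a fixed distance ahead of its use. The prefetch distance of 64 elements is an assumed value; on a real machine it would need to be tuned so that the prefetch is issued far enough ahead to cover the memory latency.

/* Sketch of software prefetching with a compiler intrinsic.
 * PF_DIST is an assumed, tunable prefetch distance in elements. */
#define PF_DIST 64

double sum_with_prefetch(const double *x, long n)
{
    double sum = 0.0;
    for (long i = 0; i < n; i++) {
        /* Request x[i + PF_DIST] now so it is (ideally) in cache when used. */
        if (i + PF_DIST < n)
            __builtin_prefetch(&x[i + PF_DIST], 0, 0);  /* read, streaming data */
        sum += x[i];
    }
    return sum;
}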