Memory Hierarchy in Cache-Based Systems
This article helps the reader understand the memory architecture of modern microprocessors. It introduces and explains the most common terminology and addresses some of the performance-related aspects.
This is an introductory article on caches. After reading it, you should understand how caches in modern microprocessors work and how cache design impacts performance.
This article is written for programmers and people who have a general interest in microprocessors.
Despite improvements in technology, microprocessors are still much faster than main memory. Memory access time is increasingly the bottleneck in overall application performance. As a result, an application might spend a considerable amount of time waiting for data. This not only negatively impacts the overall performance, but the application cannot benefit much from a processor clock-speed upgrade either.
One way to overcome this problem is to insert a small high-speed buffer memory between the processor and main memory. Such a buffer is generally referred to as cache memory, or cache for short.
The application can take advantage of this enhancement by fetching data from the cache instead of main memory. Thanks to the shorter access time to the cache, application performance is improved. Of course, there is still traffic between memory and the cache, but it is minimal. This relatively simple concept works out well in practice. The vast majority of applications benefit from caches.
This article describes how the basic idea of caches is implemented, what sort of caches are found in most modern systems, and their impact on performance.
Because this article is written to be accessible to a relatively broad group of readers, many important details are omitted.
Cache Hierarchy
As FIGURE 1 shows, the cache [Handy] is placed between the CPU and the main memory.
FIGURE 1 Example of a Cache-Based Memory System.
The system first copies the data needed by the CPU from memory into the cache, and then from the cache into a register in the CPU. Storage of results is in the opposite direction. First the system copies the data into the cache. Depending on the cache architecture details, the data is then immediately copied back to memory (write-through), or deferred (write-back). If an application needs the same data again, data access time is reduced significantly if the data is still in the cache.
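The difference between the two write policies can be illustrated with a small model. The following sketch (in C, using a hypothetical one-line cache; all names and sizes are chosen for illustration only and are not taken from the article) shows that under write-through, main memory is updated on every store, whereas under write-back, the cached copy is merely marked as modified (dirty) and memory is updated only when the line is evicted.

/* Hypothetical single-line cache model, illustrating when main memory is
 * updated under write-through versus write-back policies. The one-line
 * cache and all names are assumptions made for illustration only. */
#include <stdbool.h>
#include <stdio.h>

#define LINE_WORDS 8                 /* assumed line size: 8 words */

enum policy { WRITE_THROUGH, WRITE_BACK };

struct cache_line {
    bool valid;                      /* line holds a copy of memory?   */
    bool dirty;                      /* modified but not written back? */
    long tag;                        /* which memory block is cached   */
    int  data[LINE_WORDS];
};

static int  memory[1024];            /* toy "main memory"              */
static struct cache_line line;       /* toy one-line cache             */

/* Write one word; update memory immediately (write-through) or only
 * when the line is eventually evicted (write-back). */
static void cache_write(long addr, int value, enum policy p)
{
    long tag = addr / LINE_WORDS;

    if (!line.valid || line.tag != tag) {        /* miss: (re)fill the line    */
        if (line.valid && line.dirty)            /* write-back: flush old line */
            for (int i = 0; i < LINE_WORDS; i++)
                memory[line.tag * LINE_WORDS + i] = line.data[i];
        for (int i = 0; i < LINE_WORDS; i++)     /* load the new line          */
            line.data[i] = memory[tag * LINE_WORDS + i];
        line.tag = tag;
        line.valid = true;
        line.dirty = false;
    }

    line.data[addr % LINE_WORDS] = value;        /* update the cached copy     */

    if (p == WRITE_THROUGH)
        memory[addr] = value;                    /* memory updated at once     */
    else
        line.dirty = true;                       /* memory updated later       */
}

int main(void)
{
    cache_write(3, 42, WRITE_BACK);
    printf("after write-back store:    memory[3] = %d\n", memory[3]);  /* still 0 */
    cache_write(3, 42, WRITE_THROUGH);
    printf("after write-through store: memory[3] = %d\n", memory[3]);  /* now 42  */
    return 0;
}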
To amortize the cost of the memory transfer, more than one element is loaded into the cache. The unit of transfer is called a cache block or cache line.1 Access to a single data element brings an entire line into the cache. The line is guaranteed to contain the element requested.
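The effect of the line being the unit of transfer can be observed with a simple experiment. In the sketch below (a C example; the 64-byte line size, the array size, and the timing method are assumptions), summing every element and summing only every 16th element of a large array touch the same number of cache lines, so their run times are often much closer than the 16-fold difference in arithmetic work would suggest.

/* Sketch: both loops touch every cache line of the array once, so their
 * run times are often closer than the 16x difference in work suggests.
 * A 64-byte line (16 ints) is an assumption; adjust for your system. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)              /* 16 Mi ints = 64 MB, larger than the caches */
#define STRIDE 16                /* assumed: 64-byte line / 4-byte int         */

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    int *a = malloc((size_t)N * sizeof *a);
    long i, sum = 0;
    double t;

    if (a == NULL) return 1;

    for (i = 0; i < N; i++) a[i] = 1;            /* touch all pages once        */

    t = seconds();
    for (i = 0; i < N; i++) sum += a[i];         /* unit stride: 16 adds/line   */
    printf("stride 1:  %.3f s\n", seconds() - t);

    t = seconds();
    for (i = 0; i < N; i += STRIDE) sum += a[i]; /* one add per line fetched    */
    printf("stride 16: %.3f s (sum = %ld)\n", seconds() - t, sum);

    free(a);
    return 0;
}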
Related to this is the concept of sub-blocking. With sub-blocking, a cache allocates a block whose length is a multiple of the cache line. The slots within this larger block are then filled with the individual cache lines (or sub-blocks). This design works well if lines are accessed consecutively, but it is less efficient for irregular access patterns, because not all slots within a block may be filled.
So far, we have only applied caches to data transfer. There is, however, no reason why you could not use caches for other purposes, to fetch instructions, for example. Cache Functionality and Organization explores these other purposes in more detail.
Thanks to advances in chip process technology, it is possible to implement multiple levels of cache memory. Some of these levels will be a part of the microprocessor (they are said to be on-chip), whereas other levels may be external to the chip.
To distinguish between these caches, a level notation is used. The higher the level, the farther away the cache is from the CPU. FIGURE 2 shows an example. The level 1 (L1) cache is on-chip, whereas the level 2 (L2) cache is external to the microprocessor.
Note that in FIGURE 2, and in the remainder of this article, we distinguish between the CPU and microprocessor. CPU refers to the execution part of the processor, whereas microprocessor refers to the entire chip, which includes more than the CPU.
FIGURE 2 Multiple Levels of Cache Memory
In FIGURE 2, the size of the cache increases from left to right, but the speed decreases. In other words, the capacity increases, but it takes longer to move the data in and out.
In some designs, there are three levels of cache. To complicate matters even further, caches at a certain level can also be shared between processors. This topic, however, is beyond the scope of this article.
Latency and Bandwidth
Latency and bandwidth are two metrics associated with caches and memory. Neither of them is uniform; each is specific to a particular component of the memory hierarchy.
The latency is often expressed in processor cycles or in nanoseconds, whereas bandwidth is usually given in megabytes per second or gigabytes per second.
Although not entirely correct, in practice the latency of a memory component is measured as the time it takes to fetch one unit of transfer (typically a cache line). As the speed of a component depends on its relative location in the hierarchy, the latency is not uniform. As a rule of thumb, it is safe to say that latency increases when moving from left to right in FIGURE 2.
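A common way to measure these latencies is a pointer-chasing loop in which every load depends on the previous one, repeated for working sets of increasing size. The sketch below is one possible implementation in C; the 64-byte line size, the working-set range, and the use of a random permutation (to defeat hardware prefetching) are assumptions, not part of the original text.

/* Sketch of a pointer-chasing latency measurement. Each load depends on
 * the previous one, so the time per iteration approximates the load
 * latency of whichever level the working set fits in. Compile with
 * optimization; printing the final pointer keeps the loop alive. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINE 64                                  /* assumed cache line size */
#define ITERS 20000000L                          /* loads per measurement   */

struct node { struct node *next; char pad[LINE - sizeof(struct node *)]; };

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    /* Sweep working sets from 16 KB to 64 MB. */
    for (size_t bytes = 16 * 1024; bytes <= (64UL << 20); bytes *= 4) {
        size_t n = bytes / sizeof(struct node);
        struct node *pool = malloc(n * sizeof *pool);
        size_t *order = malloc(n * sizeof *order);
        size_t i;

        if (pool == NULL || order == NULL) return 1;

        for (i = 0; i < n; i++) order[i] = i;
        for (i = n - 1; i > 0; i--) {            /* shuffle the visit order   */
            size_t j = (size_t)rand() % (i + 1); /* so prefetchers cannot     */
            size_t tmp = order[i];               /* guess the next address    */
            order[i] = order[j];
            order[j] = tmp;
        }
        for (i = 0; i < n; i++)                  /* link nodes in that order  */
            pool[order[i]].next = &pool[order[(i + 1) % n]];

        struct node *p = &pool[order[0]];
        double t = seconds();
        for (long k = 0; k < ITERS; k++)
            p = p->next;                         /* each load must wait for   */
        t = seconds() - t;                       /* the previous one          */

        printf("%8zu KB: %6.2f ns per load (%p)\n",
               bytes / 1024, t / ITERS * 1e9, (void *)p);
        free(pool);
        free(order);
    }
    return 0;
}

While the working set fits in the L1 cache, the time per load approximates the L1 latency; as the working set grows past each cache level, the measured time steps up toward the main memory latency.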
Some of the memory components, the L1 cache for example, may be physically located on the microprocessor. The advantage is that their speed will scale with the processor clock. It is, therefore, meaningful to express the latency of such components in processor clock cycles, instead of nanoseconds.
On some microprocessors, the integrated (on-chip) caches do not run at the full speed of the processor. Instead, they operate at a clock rate that is an integer fraction (1/2, 1/3, and so forth) of the processor clock.
Cache components external to the processor do not usually, or only partially2, benefit from a processor clock upgrade. Their latencies are often given in nanoseconds. Main memory latency is almost always expressed in nanoseconds.
Bandwidth is a measure of the asymptotic speed of a memory component. This number reflects how fast large bulks of data can be moved in and out. Just as with latency, the bandwidth is not uniform. Typically, bandwidth decreases the further one moves away from the CPU.
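Bandwidth can be estimated in a similar spirit to the well-known STREAM benchmark, by streaming through arrays that are much larger than the caches. The sketch below (the array sizes and timing method are assumptions) copies one large array into another and reports the rate in megabytes per second.

/* Sketch of a bandwidth estimate: copy a large array and report MB/s.
 * The array size (256 MB total for the two arrays) is an assumption and
 * should be well beyond the largest cache. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (16 * 1024 * 1024)                     /* 16 Mi doubles = 128 MB each */

static double seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    double *a = malloc((size_t)N * sizeof *a);
    double *b = malloc((size_t)N * sizeof *b);
    long i;

    if (a == NULL || b == NULL) return 1;

    for (i = 0; i < N; i++) { a[i] = 1.0; b[i] = 0.0; }  /* touch all pages */

    double t = seconds();
    for (i = 0; i < N; i++) b[i] = a[i];                 /* read a, write b */
    t = seconds() - t;

    /* 2 * N * sizeof(double) bytes moved: N reads plus N writes. */
    printf("copy: %.1f MB/s (b[0] = %g)\n",
           2.0 * N * sizeof(double) / t / 1e6, b[0]);

    free(a);
    free(b);
    return 0;
}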
Virtual Memory
Although not considered in detail in this article, virtual memory is mentioned here for completeness and to introduce the TLB cache. For more details, refer to [CocPet] and [MauMcDl]. The latter covers virtual memory in the Solaris™ operating environment (Solaris OE) in great detail.
On a virtual memory system, memory extends to disk, so the address space in use need not fit in physical memory. Certain portions of the data and instructions can be temporarily stored on disk, in the swap space. The latter is disk space set aside by the Solaris OE and used as an extension of physical memory. The system administrator decides on the size of the swap space. The Solaris OE manages both the physical and virtual memory.
The unit of transfer between virtual memory and physical memory is called a page. The size of a page is system dependent3.
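On the Solaris OE and other POSIX systems, the page size can be queried at run time with sysconf(3C), as in the short sketch below.

/* Query the page size at run time; sysconf(_SC_PAGESIZE) is available on
 * the Solaris OE and other POSIX systems. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);   /* typically 4 KB or 8 KB */
    printf("page size: %ld bytes\n", pagesize);
    return 0;
}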
If physical memory is completely used up, but another process needs to run, or a running process needs more data, the Solaris OE frees up space by moving a page out of memory to the swap space to make room for the new page. The Solaris OE controls which page is moved out; various page replacement policies are possible, but they are beyond the scope of this article.
Certain components in the system (the CPU for example) use virtual addresses. These addresses must be mapped into the physical RAM memory. This mapping between a virtual and physical address is relatively expensive. Therefore, these translated addresses (plus some other data structures) are stored in an entry in the so-called Translation Lookaside Buffer (TLB). The TLB is a cache and behaves like a cache. For example, to amortize the cost of setting up an entry, you would like to re-use it as often as possible.
The unit of virtual memory management is a page; one entry in the TLB corresponds to one page.
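To make this caching behaviour concrete, the sketch below models a hypothetical direct-mapped TLB in C: translations are looked up by virtual page number, and only on a miss is the expensive translation (represented here by a placeholder page-table walk) performed and its result cached for re-use. The page size, the number of entries, and all names are illustrative assumptions, not a description of any particular processor.

/* Hypothetical sketch of a direct-mapped TLB: a small table of recent
 * virtual-page-to-physical-frame translations, indexed by the low bits
 * of the virtual page number. All names and sizes are illustrative. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 8192u                    /* assumed 8 KB pages           */
#define TLB_ENTRIES 64u                    /* assumed 64-entry TLB         */

struct tlb_entry {
    bool     valid;
    uint64_t vpn;                          /* virtual page number (tag)    */
    uint64_t pfn;                          /* physical frame number        */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Placeholder for the expensive translation (the page-table walk). */
static uint64_t walk_page_table(uint64_t vpn) { return vpn + 1000; }

static uint64_t translate(uint64_t vaddr)
{
    uint64_t vpn    = vaddr / PAGE_SIZE;
    uint64_t offset = vaddr % PAGE_SIZE;
    unsigned slot   = (unsigned)(vpn % TLB_ENTRIES);    /* one candidate slot */

    if (!tlb[slot].valid || tlb[slot].vpn != vpn) {     /* TLB miss           */
        tlb[slot].valid = true;                         /* cache the new      */
        tlb[slot].vpn   = vpn;                          /* translation for    */
        tlb[slot].pfn   = walk_page_table(vpn);         /* later re-use       */
    }
    return tlb[slot].pfn * PAGE_SIZE + offset;          /* physical address   */
}

int main(void)
{
    /* Two accesses to the same page: the second one hits in the TLB. */
    printf("0x%llx\n", (unsigned long long)translate(0x12345));
    printf("0x%llx\n", (unsigned long long)translate(0x12a00));
    return 0;
}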