3.2 Caches
Registers are the closest user-accessible data storage areas to the functional units. Typically, CISC and RISC processors have fewer than 100 registers, so these hold very little data. Memory is very far from the processor in terms of processor clocks, so intermediate storage areas are desirable. Cache is a type of memory located between the processor and main memory.
The largest data caches on high performance processors are usually a few megabytes in size. Cache may be on-chip (physically part of the processor chip) or off-chip.
3.2.1 Cache Line
A cache line is the smallest unit of data that can be transferred to or from memory. A cache line may contain more than one datum. These are the blocks of data that are brought into the cache from memory. Cache lines are usually between 32 and 128 bytes long. Since memory is many times larger than the cache, multiple memory addresses must map to the same cache line; otherwise, only a small portion of memory could utilize the cache.
There are advantages and disadvantages to having longer (i.e., larger) cache lines. Measured in clocks, the overhead in identifying which cache line to move is likely to be the same regardless of line size. So, if we assume that the transfer time per byte is the same, a longer line amortizes that fixed overhead over more bytes and the resulting bandwidth (bytes/second) is higher. Hence, processors with longer cache lines typically have higher memory bandwidth performance. Long cache lines are good for algorithms that access memory with unit stride, since every element of the cache line is used. On the other hand, if an algorithm accesses memory in a random manner (e.g., indirect accesses), then computers with short cache lines may perform better, since the algorithm may access only a single datum in any given cache line.
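To make the unit stride point concrete, the short program below (an illustrative sketch, not taken from the text; the array name, size, and stride are assumptions) sums a REAL*8 array twice. The first loop uses unit stride and touches all four eight-byte elements of each 32-byte cache line it brings in; the second uses a stride of four elements, so it touches only one element per line and wastes three quarters of every line transferred.

      PROGRAM STRIDE
      REAL*8 A(1048576), S
      INTEGER I
C     Fill the array so both loops read identical data.
      DO I = 1, 1048576
         A(I) = 1.0D0
      ENDDO
      S = 0.0D0
C     Unit stride: all four REAL*8 elements of each line are used.
      DO I = 1, 1048576
         S = S + A(I)
      ENDDO
C     Stride of four elements (32 bytes): only one element per cache
C     line is used, so three quarters of each line moved is wasted.
      DO I = 1, 1048576, 4
         S = S + A(I)
      ENDDO
      PRINT *, S
      END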
3.2.2 Cache Organization
Generally speaking, caches are organized so that a cache line can be placed in a certain set of locations in the cache. If there are multiple sets, then the cache is referred to as being set associative. Usually the cache is organized so that any cache line from memory can map to one cache line in any of n sets. This is referred to as an n-way set associative cache. Fully associative caches are those where a cache line can be placed anywhere in the cache.
The simplest (and cheapest) type of cache is direct mapped, which is actually a one-way set associative cache. If the cache size is one MB, then any two locations in memory that are one MB apart map to the same cache line in the cache. For example, data located at hexadecimal addresses 0x00100000, 0x00200000, 0x00300000, ..., 0xFFF00000 are multiples of one MB apart and all must use the same location in a direct mapped cache. An improvement on the direct mapped cache is a two-way set associative cache. When a cache line is brought into cache, it is placed in one, but not both, of the two sets. If the cache line size and total cache size are held constant, a two-way set associative cache will experience less cache thrashing than a direct mapped cache. Increasing the associativity usually continues to increase performance, but the additional benefit decreases beyond a certain degree of associativity.
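As an illustrative sketch (not from the text above), the line chosen by a direct mapped cache can be computed as the byte address divided by the line size, modulo the number of lines in the cache. With a one MB cache and 32-byte lines, the addresses 0x00100000 and 0x00200000 both yield line index zero, so they displace one another:

      PROGRAM DMAP
      INTEGER*8 ADDR1, ADDR2, LINESZ, NLINES
C     A 1 MB direct mapped cache with 32-byte lines has 32768 lines.
      PARAMETER (LINESZ = 32, NLINES = 1048576 / LINESZ)
      ADDR1 = 1048576
      ADDR2 = 2097152
C     Both addresses are a multiple of the cache size apart, so both
C     map to line 0 and displace one another.
      PRINT *, 'line for 0x00100000:', MOD(ADDR1 / LINESZ, NLINES)
      PRINT *, 'line for 0x00200000:', MOD(ADDR2 / LINESZ, NLINES)
      END

Both PRINT statements report line zero, which is why these addresses must share a single location in a direct mapped cache.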
A fully associative cache is the most advanced (and expensive) type of cache. Due to the complexities involved and resulting expense, these caches tend to be small in size.
3.2.3 Cache Mechanisms
Cache line replacement strategy and write policies are two important performance issues relating to caches. Strategies for cache line replacement come into play in n-way set associative caches for n > 1. Users don’t have to worry about this functionality with direct mapped caches because the decision is already made for them! Some processors update the cache line in the set that was Least Recently Used (LRU), while others randomly choose which one to update or use round-robin replacement. There are other replacement strategies, but these are the two most common and they represent both ends of the performance spectrum. LRU usually performs better than random replacement, but it is more difficult to implement.
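As a sketch of how an LRU victim might be chosen (illustrative only; the function name and the idea of keeping a last-use timestamp per way are assumptions, not a description of any particular processor), the way whose line was referenced longest ago is selected for replacement, whereas random replacement would simply pick a way without consulting any history:

      PROGRAM LRUTEST
C     Sketch of LRU replacement for one set of a 4-way set
C     associative cache.  LASTUSE(K) records the time way K was last
C     referenced; the victim is the way with the smallest value.
      INTEGER LASTUSE(4), LRUWAY
      DATA LASTUSE / 40, 10, 30, 20 /
      PRINT *, 'replace way:', LRUWAY(LASTUSE, 4)
      END

      INTEGER FUNCTION LRUWAY(LASTUSE, NWAYS)
      INTEGER NWAYS, LASTUSE(NWAYS), K
      LRUWAY = 1
      DO K = 2, NWAYS
         IF (LASTUSE(K) .LT. LASTUSE(LRUWAY)) LRUWAY = K
      ENDDO
      END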
When a processor performs a store instruction, it typically writes data into the cache-resident line containing the address. In general, there are two policies for dealing with the cache line subsequent to its having data written into it. In the first policy, the data is written to both the cache line in the cache and to main memory. This is referred to as write through since the write is made to main memory through the cache. The second policy, which is much more common, is referred to as write back. In this case, when a processor executes a store instruction, the data is written only to the cache line in the cache. This modified cache line is written to memory only when necessary (usually when it is replaced in the cache by another cache line). Note that this write back may occur much later than the actual store instruction’s execution. As a result, write back caches usually make fewer memory transactions, which is a good thing as we shall see later. All caches discussed in this chapter are assumed to be write back caches.
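The difference between the two policies can be sketched with a toy count of memory transactions (illustrative only, not from the text): a thousand stores to the same cache-resident line cost a thousand memory writes under write through, but only a single write back of the dirty line when it is eventually evicted.

      PROGRAM WPOLICY
C     Count the memory writes generated by 1000 stores to the same
C     cache line.  Write through pays a memory write per store; write
C     back marks the line dirty and pays one write at eviction.
      INTEGER I, NWT, NWB
      LOGICAL DIRTY
      NWT = 0
      NWB = 0
      DIRTY = .FALSE.
      DO I = 1, 1000
         NWT = NWT + 1
         DIRTY = .TRUE.
      ENDDO
      IF (DIRTY) NWB = NWB + 1
      PRINT *, 'write through memory writes:', NWT
      PRINT *, 'write back memory writes:   ', NWB
      END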
3.2.4 Cache Thrashing
The main problem with direct mapped caches is that severe cache thrashing can occur. To illustrate what is meant by cache thrashing, consider the following:
      REAL*8 X(*), Y(*)
      ...
      DO I = 1, N
         Y(I) = X(I) + Y(I)
      ENDDO
This requires loading X(I), loading Y(I), adding X(I) and Y(I), and storing Y(I) for each iteration (value of I).
Suppose we are executing this loop on a machine with a single, direct mapped, 1 MB cache and that a cache line is 32 bytes in length. If X and Y are arrays of eight-byte data (e.g., data declared as REAL*8 in Fortran), then X and Y could be allocated as illustrated in Figure 3-2. That is, X and Y are a multiple of the cache size apart in memory. On the first iteration elements X(1) through X(4) are loaded into cache and X(1) is loaded into a register. Note that operations in Figure 3-2 that force cache lines to be moved between the cache and memory are highlighted in bold. Then Y(1) through Y(4) are loaded into cache from memory and Y(1) is loaded into a register. Note that the cache line containing Y(1) displaces the cache line containing X(1) through X(4). Completing the iteration, X(1) and Y(1) are added and Y(1) is stored. On the second iteration, X(2) must be loaded. This requires that elements X(1) through X(4) be loaded into cache again. But this displaces the cache line containing Y(1) and hence forces that line to be stored to memory (because Y(1) has been modified) before the line containing X(1) through X(4) can be loaded into the cache. Each load of X and each load of Y requires that the data be moved to and from memory. In this case, the cache is nearly useless. This process of repeatedly displacing and loading cache lines is referred to as cache thrashing.
Figure 3-2. Example of cache thrashing.
If the cache happened to be a two-way set associative cache, then most, if not all, of this thrashing would be eliminated. This is because the cache line for X would likely reside in one set of the cache and the cache line for Y in the other. Both round-robin and least-recently-used replacement strategies would address this situation very well. A random replacement approach might take a few iterations before getting it right. That is, since the set number to be replaced is generated randomly, the result could easily be that the first set is replaced multiple times before selecting the second set (or vice-versa).
The previous example was derived from an example where the arrays X and Y were declared as follows:
REAL*8 X(7340032), Y(7340032)
Note that 7340032 is 0x700000 in hexadecimal. Since each element occupies eight bytes, the array X occupies 56 MB, so X and Y are an integral multiple of the cache size apart in memory.
If the first array had been “padded” so that X and Y were not a multiple of the cache size apart, then this cache thrashing could have been avoided. Suppose X and Y were declared as follows:
REAL*8 X(7340036), Y(7340036)
Then many memory transactions are eliminated. Note that the padding is four elements. Since each element is eight bytes, this effectively pads the array by one cache line (32 bytes = 4 × 8 bytes). With the arrays declared as above, we then have the sequence of operations outlined in Figure 3-3. Not only does the second iteration not require any memory transactions, neither do iterations three and four! At the beginning of iteration five, the cache line containing Y(1) through Y(4) has to be stored to memory, so we have roughly three memory transactions for every four iterations. Compare this to the previous example, where we had three memory transactions for every single iteration.
Figure 3-3. Cache thrashing eliminated by padding arrays.
To demonstrate the difference in performance, the example above was executed on an HP V-Class computer with a two MB direct-mapped cache. With N set to 7340032 in both cases, the resulting times to execute the loop are shown in Table 3-1.
Table 3-1. Performance of Y(I) = X(I) + Y(I) for Loop with Fixed Length.
| Array Dimension | Elapsed Time (sec.) |
|---|---|
| 7340032 | 7.21 |
| 7340036 | 3.52 |
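A small driver along the following lines could be used to reproduce this kind of measurement. It is only a sketch: the timing uses the Fortran 90 intrinsic SYSTEM_CLOCK, the arrays are assumed to be allocated contiguously as in Figure 3-2, and the timings in Table 3-1 were, of course, obtained on the HP V-Class rather than with this program.

      PROGRAM PADTEST
C     Time the loop Y(I) = X(I) + Y(I) from the example above.  With
C     N = 7340032 the arrays are a multiple of the cache size apart
C     (assuming X and Y are allocated contiguously); changing N to
C     7340036 pads X by one 32-byte line and removes the thrashing.
      INTEGER N
      PARAMETER (N = 7340032)
      REAL*8 X(N), Y(N)
      INTEGER I, IC1, IC2, IRATE
      DO I = 1, N
         X(I) = 1.0D0
         Y(I) = 2.0D0
      ENDDO
      CALL SYSTEM_CLOCK(IC1, IRATE)
      DO I = 1, N
         Y(I) = X(I) + Y(I)
      ENDDO
      CALL SYSTEM_CLOCK(IC2, IRATE)
      PRINT *, 'elapsed seconds:', DBLE(IC2 - IC1) / DBLE(IRATE)
      END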
3.2.5 Caching Instructions Versus Data
One compelling reason for using direct mapped caches is that they are easier to implement than associative caches. One benefit of this simplicity is that processors can access direct mapped caches faster than associative caches. This can be extremely important for data as well as instructions. It may not matter how much data you have in registers if the processor continuously stalls waiting on instructions.
The area of memory reserved for the executable instructions of a program is referred to as the text area. Nearly all workloads require more megabytes of data than they do of text. Some architectures use a single cache for both instructions and data; such caches are generally referred to as unified caches. One can easily imagine a situation where a series of instructions that is repeatedly executed (e.g., a loop) just happens to map to the same cache line(s) as the data. This sort of cache thrashing is very difficult to identify, not to mention remedy. Modern computer architectures have alleviated this problem in two different ways. One is to have associative unified caches and the other is to have the instruction cache separate from the data cache. The latter solution is literally to have two independent caches. Unified associative caches still run the risk of text displacing data (and vice versa), but having separate instruction and data caches eliminates this problem. So, what is becoming the most common implementation in the industry? A combination of the two, of course! Most new architectures feature separate instruction (I) and data (D) caches that are small and very close to the processor, with another, larger unified cache a little farther away. This brings us to the next topic.
3.2.6 Multiple Levels of Caches
It’s already been established that faster memory is more expensive than slower memory. In particular, cache is much faster and more expensive than main memory. There is so much truth in this that one sometimes wonders if caches should be spelled c-a-s-h! Given the advantages of cache to today’s processor performance, it is easy to see how an intermediate level of cache might be beneficial and economically feasible. Consider the Compaq AlphaServer DS20 machine, which has a 500 MHz Alpha 21264 processor with a 64 KB instruction cache and a 64 KB data cache on chip and an intermediate (or secondary) cache that is four MB in size. The primary benefit of such caches is, of course, to reduce the average time per access to memory.
Suppose we have a workload which we have studied and we are able to produce a rough estimate of how often memory accesses occur as a function of the data storage available. Table 3-2 is a representative listing of this information.
Table 3-2. Memory Accesses Relative to Data Storage Available for a Synthetic Workload.
| Size Range | Cumulative Percentage of Memory Accesses | Percentage of Accesses in This Range |
|---|---|---|
| 0 - 4 KB | 72% | 72% |
| 4 - 16 KB | 83% | 11% |
| 16 - 64 KB | 87% | 4% |
| 64 - 256 KB | 90% | 3% |
| 256 KB - 1 MB | 92% | 2% |
| 1 - 4 MB | 95% | 3% |
| 4 - 16 MB | 99% | 4% |
| 16 - 64 MB | 100% | 1% |
Now consider three different cache architectures with the same processor and memory system. Suppose they have cache structure and attributes as shown in Table 3-3. Note that only Architecture 3 has two caches.
Table 3-3. Data Storage Sizes and Associated Access Times for Hypothetical Architectures.
| Architecture | L1 Cache Size | L1 Access Time (Cycles) | L2 Cache Size | L2 Access Time (Cycles) |
|---|---|---|---|---|
| 1 | 1 MB | 3 | None | N/A |
| 2 | 4 MB | 5 | None | N/A |
| 3 | 16 KB | 1 | 4 MB | 6 |
If we assume it is 100 clocks from the processor to main memory for all three architectures, then, using the data in Table 3-2 and Table 3-3, we can compute the expected number of clock cycles for an average data access on each architecture.
For architecture 1, 92% of the accesses are in one MB of data storage or less. So:
Expected latency for architecture 1 = 92% × 3 cycles + 8% × 100 cycles = 10.76 cycles.
Now, architecture 2 has a four MB cache and so 95% of the accesses are estimated to occur within it. Thus:
Expected latency for architecture 2 = 95% × 5 cycles + 5% × 100 cycles = 9.75 cycles.
So, all else being equal, architecture 1 is likely to be faster for this application than architecture 2 even though it has a smaller cache! The remaining cache architecture has the following:
Expected latency for architecture 3 = 83% × 1 cycle + 12% × 6 cycles + 5% × 100 cycles = 6.55 cycles.
What an improvement! Even though the primary cache is very small and the secondary cache is the same size as, and slower than, that of architecture 2, architecture 3 is much faster on average than either of the other cache architectures.
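The same arithmetic can be written as a short program (a sketch; the fractions and cycle counts are simply the values from Table 3-2 and Table 3-3 hard-coded, with memory taken to be 100 cycles away):

      PROGRAM LATENCY
C     Expected cycles per access: the fraction of accesses satisfied
C     at each level times that level's access time, summed over the
C     hierarchy (Tables 3-2 and 3-3; memory is 100 cycles away).
      REAL*8 E1, E2, E3
C     Architecture 1: 1 MB cache holds 92% of accesses.
      E1 = 0.92D0 * 3.0D0 + 0.08D0 * 100.0D0
C     Architecture 2: 4 MB cache holds 95% of accesses.
      E2 = 0.95D0 * 5.0D0 + 0.05D0 * 100.0D0
C     Architecture 3: 16 KB L1 (83%), 4 MB L2 (next 12%), memory (5%).
      E3 = 0.83D0 * 1.0D0 + 0.12D0 * 6.0D0 + 0.05D0 * 100.0D0
      PRINT *, 'architecture 1:', E1, ' cycles'
      PRINT *, 'architecture 2:', E2, ' cycles'
      PRINT *, 'architecture 3:', E3, ' cycles'
      END

Running it prints 10.76, 9.75, and 6.55 cycles, matching the figures above.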