- Cache Hierarchy
- Cache Functionality and Organization
- Hiding Latency With Prefetch
- Cache Organization and Replacement Policies
- UltraSPARC III Cu Memory Hierarchy
- References
Cache Organization and Replacement Policies
Caches have a certain organization and a replacement policy. The organization describes in what way the lines are organized within the cache. The replacement policy dictates which line will be removed (evicted) from the cache in case an incoming line must be placed in the cache.
Direct Mapped
Direct mapped is a simple and efficient organization. The (virtual or physical) memory address of the incoming cache line controls which cache location is going to be used.
Implementing this organization is straightforward, and it is relatively easy to make it scale with the processor clock.
In a direct mapped organization, the replacement policy is built-in because cache line replacement is controlled by the (virtual or physical) memory address.
In many cases this design works well, but, because the candidate location is controlled by the memory address and not the usage, this policy has the potential downside of replacing a cache line that still contains information needed shortly afterwards.
Any line with the same address modulo the cache size will map onto the same cache location. As long as the program accesses a single stream of data consecutively (unit stride), all is well. If the program skips elements or accesses multiple data streams simultaneously, additional cache refills may be generated.
Consider a simple example: a 4-kilobyte cache with a line size of 32 bytes, direct mapped on virtual addresses. Thus each load/store to cache moves 32 bytes. If one variable of type float takes 4 bytes on our system, each cache line will hold eight (32/4 = 8) such variables.
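To make the mapping concrete, the following sketch (not part of the original text; it simply assumes the cache parameters given above, and the helper name cache_index is made up) computes the cache location a direct mapped cache would use for a given address. Two addresses that are exactly one cache size (4 kilobytes) apart map onto the same location.

    #include <stdio.h>
    #include <stdint.h>

    #define CACHE_SIZE 4096                     /* 4-kilobyte direct mapped cache */
    #define LINE_SIZE  32                       /* 32-byte cache lines            */
    #define NUM_LINES  (CACHE_SIZE / LINE_SIZE)

    /* The cache location used for an address is the memory line number
       modulo the number of lines in the cache.                           */
    static unsigned cache_index(uintptr_t addr)
    {
        return (unsigned)((addr / LINE_SIZE) % NUM_LINES);
    }

    int main(void)
    {
        uintptr_t addr = 0x10000;
        /* the same cache location is reused every CACHE_SIZE bytes */
        printf("%u %u\n", cache_index(addr), cache_index(addr + CACHE_SIZE));
        return 0;
    }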
The following loop calculates the inner product of the two arrays a and b. Each array element is assumed to be 4 bytes long; the data has not been cached yet.
    float a[1024], b[1024];

    for (i=0; i<1024; i++)
        sum += a[i]*b[i];
The generic system executes this loop as follows:
    i   Operation     Status   In cache   Comment
    0   load a[0]     miss     a[0..7]    assume a[] was not cached yet
        load b[0]     miss     b[0..7]    assume b[] was not cached yet
        t=a[0]*b[0]
        sum += t
    1   load a[1]     hit      a[0..7]    previous load brought it in
        load b[1]     hit      b[0..7]    previous load brought it in
        t=a[1]*b[1]
        sum += t
    ..... etc .....
    7   load a[7]     hit      a[0..7]    previous load brought it in
        load b[7]     hit      b[0..7]    previous load brought it in
        t=a[7]*b[7]
        sum += t
    8   load a[8]     miss     a[8..15]   this line was not cached yet
        load b[8]     miss     b[8..15]   this line was not cached yet
        t=a[8]*b[8]
        sum += t
    9   load a[9]     hit      a[8..15]   previous load brought it in
        load b[9]     hit      b[8..15]   previous load brought it in
        t=a[9]*b[9]
        sum += t
    .......
In this example a[0..7] denotes the elements a[0],...,a[7]; a similar notation is used for vector b and for other sequences of array elements.
The cache hit rate is 7/8, which equals 87.5 percent. However, this is the best case scenario.
Assume that the two arrays a[1024] and b[1024] are stored consecutively in memory. That is, a[i+1] follows a[i] (i = 0,...,n-2, with n = 1024) in memory, b[0] follows a[n-1], b[1] follows b[0], and so forth. This loop will no longer perform as nicely as indicated previously.
In this case, the following occurs:
    i   Operation     Status   In cache   Comment
    0   load a[0]     miss     a[0..7]    assume a[] was not cached yet
        load b[0]     miss     b[0..7]    b[0] is 4 KByte away from a[0] in memory and
                                          will wipe out a[0..7] from the cache
        t=a[0]*b[0]
        sum += t
    1   load a[1]     miss     a[0..7]    previous load wiped out a[0..7]
        load b[1]     miss     b[0..7]    previous load wiped out b[0..7]
        t=a[1]*b[1]
        sum += t
    2   load a[2]     miss     a[0..7]    previous load wiped out a[0..7]
        load b[2]     miss     b[0..7]    previous load wiped out b[0..7]
    .......
Because of the direct-mapped architecture of the cache and the way the data is organized, every array reference results in a cache miss. This degrades performance noticeably. More specifically, you get eight times as many cache misses as in the favorable scenario. This is called cache thrashing.
Several software-based solutions to this thrashing problem are available. At the source code level, you might consider unrolling the loop. In the following example this has been done to a depth of two.
    for (i=0; i<1024; i+=2){
        ta0 = a[i];
        ta1 = a[i+1];
        tb0 = b[i];
        tb1 = b[i+1];
        sum += ta0*tb0 + ta1*tb1;
    }
The advantage of this approach is that a[i+1] and b[i+1] are now used before the cache line they are part of is evicted from the cache. Note that the optimal unroll depth is eight, so that all elements brought in by a cache line are used immediately, before they are evicted by a subsequent load; a sketch of that version is shown below.
The downside of loop unrolling is the need for more registers. For example, the original loop needs three floating point registers to store a[i], b[i] and sum. The unrolled version needs five floating point registers.
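For illustration only (this sketch is not part of the original text and reuses a, b, sum, and i from the loop above), the same loop unrolled to a depth of eight reads the full cache line of a[] into temporaries before touching b[], so every element is consumed before its line can be evicted. A compiler is free to reorder these source-level loads, so this only sketches the intent.

    for (i = 0; i < 1024; i += 8){
        /* read the whole cache line of a[] first ...                 */
        float ta0 = a[i],   ta1 = a[i+1], ta2 = a[i+2], ta3 = a[i+3],
              ta4 = a[i+4], ta5 = a[i+5], ta6 = a[i+6], ta7 = a[i+7];
        /* ... then the line of b[], which may evict the line of a[]  */
        float tb0 = b[i],   tb1 = b[i+1], tb2 = b[i+2], tb3 = b[i+3],
              tb4 = b[i+4], tb5 = b[i+5], tb6 = b[i+6], tb7 = b[i+7];
        sum += ta0*tb0 + ta1*tb1 + ta2*tb2 + ta3*tb3
             + ta4*tb4 + ta5*tb5 + ta6*tb6 + ta7*tb7;
    }

This version needs seventeen floating point registers (sixteen temporaries plus sum), which illustrates the register pressure mentioned above.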
On Sun systems with direct mapped caches, several solutions are available. At a higher level, you might get help from the Solaris OE and/or compiler.
Because the mapping is a function of the memory address, the Solaris OE might change the addresses such that different lines no longer map onto the same cache location. This technique is often referred to as page coloring or cache coloring (MauMcDl).
The compiler typically unrolls loops and supports padding of data to avoid collisions.
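As a sketch of the padding idea (hand written here; whether and how a compiler pads data, and whether variables end up in declaration order, is implementation dependent), one cache line of padding between the two arrays from the earlier example would shift b[] so that a[i] and b[i] no longer map onto the same cache location:

    float a[1024];
    float pad[8];    /* 8 floats = 32 bytes = one cache line of padding     */
    float b[1024];   /* now 4 KByte + 32 bytes beyond a[], so a[i] and b[i] */
                     /* map onto different locations in the 4-KByte cache   */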
Fully Associative
The fully associative cache design solves the potential problem of thrashing with a direct-mapped cache. The replacement policy is no longer a function of the memory address, but considers usage instead.
With this design, typically the oldest cache line is evicted from the cache. This policy is called least recently used (LRU).
In the previous example, LRU prevents the cache lines of a and b from being moved out prematurely.
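The following minimal model (an illustration only, not how hardware implements it; the names and the 4-kilobyte/32-byte parameters are assumptions carried over from the earlier example) shows the bookkeeping a fully associative LRU cache implies: on every access all lines are searched, and on a miss the line with the oldest usage stamp is replaced.

    #include <stdint.h>

    #define LINE_SIZE 32
    #define NLINES    128                     /* 4-KByte cache / 32-byte lines */

    struct line {
        uintptr_t     tag;                    /* which memory line is stored here */
        unsigned long last_used;              /* "time" of the most recent access */
        int           valid;
    };

    static struct line  cache[NLINES];
    static unsigned long now;

    /* Returns 1 on a hit; on a miss the least recently used line is
       replaced and 0 is returned.                                     */
    static int access_line(uintptr_t addr)
    {
        uintptr_t tag = addr / LINE_SIZE;
        int i, victim = 0;

        now++;
        for (i = 0; i < NLINES; i++) {        /* search every line */
            if (cache[i].valid && cache[i].tag == tag) {
                cache[i].last_used = now;     /* hit: mark as most recently used */
                return 1;
            }
        }
        for (i = 1; i < NLINES; i++)          /* miss: find the LRU line */
            if (cache[i].last_used < cache[victim].last_used)
                victim = i;
        cache[victim].tag       = tag;
        cache[victim].valid     = 1;
        cache[victim].last_used = now;
        return 0;
    }

Note that every access has to compare the address against all lines in the cache.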
The downside of a fully associative design is cost. Additional logic is required to track usage of lines. The larger the cache, the higher the cost. Therefore, it is difficult to scale this technology to very large (data) caches. Luckily, a good alternative exists.
Set Associative
A set-associative cache design uses several direct-mapped caches. Each cache is often referred to as a set. On an incoming request, the cache controller decides which set the line will go into. Within the set, a direct-mapped scheme is used to allocate a slot in the cache.
The name reflects the number of direct-mapped caches. For example, in a 2-way set associative design two direct mapped caches are used.
Another design parameter is the algorithm that selects the set. This could be random, LRU, or any other selection scheme.
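A sketch of this organization, following the description above (not part of the original text; the two 2-kilobyte sets, the random set selection, and the function name are all assumptions made for illustration):

    #include <stdlib.h>
    #include <stdint.h>

    #define NSETS     2                       /* two-way set associative              */
    #define SET_SIZE  2048                    /* each set is a 2-KByte direct mapped  */
    #define LINE_SIZE 32                      /* cache with 32-byte lines             */
    #define NLINES    (SET_SIZE / LINE_SIZE)

    static uintptr_t tags[NSETS][NLINES];     /* memory line cached in each slot */
    static int       valid[NSETS][NLINES];

    static void fill_line(uintptr_t addr)
    {
        unsigned index = (unsigned)((addr / LINE_SIZE) % NLINES);
        unsigned set;

        /* the line may already be cached in either set */
        for (set = 0; set < NSETS; set++)
            if (valid[set][index] && tags[set][index] == addr / LINE_SIZE)
                return;

        /* miss: the controller selects a set (random here, for simplicity);
           within the set the usual direct mapped index decides the slot.    */
        set = (unsigned)rand() % NSETS;
        tags[set][index]  = addr / LINE_SIZE;
        valid[set][index] = 1;
    }

Random selection is only one option; as noted above, LRU or another scheme could be used instead.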
FIGURE 5 shows a four-way set associative cache.
FIGURE 5 Four-Way Set Associative Design.
Note that a set-associative cache tends to reduce the amount of thrashing. Thrashing can still occur, however, not only within one set but also between sets.
Thrashing between sets is a function of the algorithm that selects the set; whereas thrashing within one set is related to the (virtual) memory address of the data.
Usually, the size of a set is 2^n kilobytes (n = 1, 2, ...). If the (virtual) addresses of incoming lines in the same set are 2^m kilobytes apart (m >= n), thrashing occurs.
For example, in a two-way set associative design, an update of this type might cause thrashing:
    float x[4096][4096];

    for (i=1; i<n-1; i++)
        for (j=1; j<n-1; j++)
            x[i][j] = x[i][j-1] + x[i][j+1] + x[i][j]
                    + x[i-1][j] + x[i+1][j];
FIGURE 6 shows the computational grid on which these computations are performed. Array element x[i][j] is located at the intersection of the vertical line at i and the horizontal line at j.
FIGURE 6 Computational Grid
For fixed values of i and j, this loop updates one cache line (containing x[i][j]) and references two other lines (containing x[i-1][j] and x[i+1][j]).
Assume that the two lines containing x[i][j] and x[i-1][j] are in different sets (to avoid a collision). The question is where the line with x[i+1][j] will go.
As there are only two sets, the cache controller has no choice but to select a set that already contains one of the other two cache lines. In virtual memory, these three lines are 4 * 4096 bytes = 16 kilobytes apart. Therefore, cache thrashing within one of the sets will occur if a set is 16 kilobytes or less in size.
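To see the collision numerically, this small sketch (not part of the original text; it assumes a 16-kilobyte set and an arbitrary interior grid point) computes the direct mapped location within a set for the three lines involved. All three come out the same, because consecutive rows of x are exactly one set size apart.

    #include <stdio.h>
    #include <stdint.h>

    #define LINE_SIZE 32
    #define SET_SIZE  (16 * 1024)             /* assumed size of one set */

    static float x[4096][4096];               /* rows are 16 KByte apart */

    static unsigned set_index(uintptr_t addr)
    {
        return (unsigned)((addr / LINE_SIZE) % (SET_SIZE / LINE_SIZE));
    }

    int main(void)
    {
        int i = 100, j = 200;                 /* arbitrary interior point */

        /* prints three identical locations within a set */
        printf("%u %u %u\n",
               set_index((uintptr_t)&x[i-1][j]),
               set_index((uintptr_t)&x[i][j]),
               set_index((uintptr_t)&x[i+1][j]));
        return 0;
    }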