- Cache Hierarchy
- Cache Functionality and Organization
- Hiding Latency With Prefetch
- Cache Organization and Replacement Policies
- UltraSPARC III Cu Memory Hierarchy
- References
UltraSPARC III Cu Memory Hierarchy
TABLE 1 presents the memory hierarchy of the UltraSPARC™ III Cu microprocessor. For more details, refer to [ArchManual].
TABLE 1 Cache Characteristics of the UltraSPARC III Cu Microprocessor
Cache   | Function     | Size         | Line size    | Subblock | Organization
--------|--------------|--------------|--------------|----------|----------------------
I-cache | Instructions | 32 Kbytes    | 32 bytes     | none     | 4-way set associative
D-cache | Data         | 64 Kbytes    | 32 bytes     | none     | 4-way set associative
I-TLB   | Address      | 16 entries   |              |          | Fully associative
I-TLB   | Address      | 128 entries  |              |          | 2-way set associative
D-TLB   | Address      | 16 entries   |              |          | Fully associative
D-TLB   | Address      | 512 entries  |              |          | 2-way set associative
D-TLB   | Address      | 512 entries  |              |          | 2-way set associative
P-cache | Prefetch     | 2 Kbytes     | 64 bytes     | 32 bytes | 4-way set associative
W-cache | Stores       | 2 Kbytes     | 64 bytes     | 32 bytes | 4-way set associative
E-cache | Unified      | 8 Mbytes     | 64-512 bytes | 64 bytes | 2-way set associative
As TABLE 1 shows, the UltraSPARC III Cu processor has two caches that have not yet been mentioned.
The P-cache is a prefetch cache, used to store data that has been brought in as a result of a prefetch operation (either initiated by a prefetch instruction or, if supported, by hardware). Only floating-point loads can get data from the P-cache.
The W-cache acts as a holding station for stored data. This cache reduces bandwidth to the E-cache by coalescing and bursting stores to the E-cache.
FIGURE 7 is a block diagram with the main components of the memory hierarchy on the UltraSPARC III Cu microprocessor. The arrows indicate the various prefetch possibilities. For more details, refer to [ArchManual].
FIGURE 7 Main Components of the Memory Hierarchy
As mentioned earlier, the UltraSPARC processor has separate TLBs for instructions and data (the I-TLB and D-TLB, respectively). In FIGURE 7, these TLBs are not shown as separate blocks, but are simply labeled TLBs.
Performance Aspects
This section uses a simple example to demonstrate the way the memory hierarchy should be used and how performance can degrade if data is not accessed in the intended way.
Consider the following loop that sums up all elements in a two-dimensional array.
    float x[m][n];

    for (i=0; i<m; i++)
       for (j=0; j<n; j++)
          sum += x[i][j];                               (1)
To simplify the discussion, assume the following:
- Matrix x, or a subset of it, is not yet present in any of the caches in the system
- The first element, x[0][0], is at a data cache line boundary and is the first element in a virtual memory page
- The dimensions of x are a multiple of four, such that four rows span one virtual memory page
- The system has only one data cache
- A data cache line contains four elements
None of these assumptions are crucial in the remainder of the discussion. They simply make the explanation a little easier.
FIGURE 8 uses shading to indicate in what way the matrix is built up, as seen from a memory access and storage perspective. In this figure, the matrix fits in two pages. Each row consists of four data cache lines.
FIGURE 8 Matrix x
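The storage order shown in FIGURE 8 follows directly from the C rule that two-dimensional arrays are stored row by row. The small program below is a sketch added here for illustration (it is not part of the original example); it makes the row-major address arithmetic explicit:

    #include <stdio.h>
    #include <stddef.h>

    /* In C, x[i][j] is stored at offset (i*n + j) * sizeof(float) from
     * &x[0][0]: stepping through j walks through memory contiguously,
     * while stepping through i jumps a whole row (n * 4 bytes) at a time. */
    int main(void)
    {
        enum { m = 8, n = 16 };     /* small illustrative dimensions only */
        float x[m][n];

        ptrdiff_t col_step = (char *)&x[0][1] - (char *)&x[0][0];
        ptrdiff_t row_step = (char *)&x[1][0] - (char *)&x[0][0];

        printf("next column element (j+1): %td bytes away\n", col_step);  /* 4     */
        printf("next row element    (i+1): %td bytes away\n", row_step);  /* n * 4 */
        return 0;
    }

For example, with n = 1024 (the value used later in this section), the row step is 4 kilobytes, so consecutive values of i touch a different cache line, and potentially a different page, on every iteration.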
The nested loop in (1) is ordered such that element x[0][0] is the first one to be read. Because (by assumption) none of the data is cached anywhere, this memory access results in both a data cache miss and a TLB miss. The latter is because the page that this element is part of has not been mapped in the TLB cache yet. Therefore, this is a very expensive reference.
The next reference is to x[0][1]. This element is included in the cache line that was brought in as a result of the reference to x[0][0]. It is also part of the page that was just mapped into the TLB. As a result, this is a very fast memory access. Elements x[0][2] and x[0][3] will have similar fast memory access.
However, the reference to x[0][4] will be slower, because a new cache line must be brought in. Reading x[0][5] will be fast again, because it is part of the second cache line that was just loaded into the data cache.
This pattern repeats itself until all the elements in the first page are exhausted. In the example, the last element is x[3][n-1].
The reference to x[4][0] results in both a data cache and a TLB miss again. The TLB miss is because this element is part of the second page and no data in this page has been accessed yet.
It is easy to see that this kind of memory access pattern produces (m*n)/4 data cache misses and m/4 TLB misses.
Other than potentially reusing previously cached elements, little can be done to avoid these transfers (referred to as the natural cache traffic).
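As a quick sanity check on these counts, the following sketch (not part of the original text) simply evaluates the two formulas for one choice of m and n, under the assumptions stated above (a cache line holds four elements, and four rows fill one 4-Kbyte page, so n*4*4 = 4096 bytes):

    #include <stdio.h>

    int main(void)
    {
        long m = 4096, n = 256;     /* arbitrary example dimensions; 4 rows = 1 page */

        /* One miss per 4-element cache line, one TLB miss per page (4 rows). */
        long cache_misses = (m * n) / 4;
        long tlb_misses   = m / 4;

        printf("natural data cache misses: %ld\n", cache_misses);  /* 262144 */
        printf("natural TLB misses:        %ld\n", tlb_misses);    /* 1024   */
        return 0;
    }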
The preceding example demonstrates what happens if memory access follows the storage order. The following paragraphs discuss what happens if such a favorable memory access pattern does not exist. The same nested loop is discussed, but with the order of the loops interchanged:
    float x[m][n];

    for (j=0; j<n; j++)
       for (i=0; i<m; i++)
          sum += x[i][j];                               (2)
Similar to the previous example, the first reference (to x[0][0]) results in a data cache and TLB miss.
The next reference is to x[1][0]. This element is still part of the same page (and hence no TLB miss occurs), but not part of the cache line just brought in. Therefore, this reference results in a data cache miss. Likewise, the references to x[2][0] and x[3][0] each result in a data cache miss.
On the next reference, to x[4][0], a TLB miss occurs because a page boundary is being crossed. The references to x[5][0], x[6][0] and x[7][0] cause a data cache line miss again.
While progressing through the first column of the matrix, the TLB and data cache are filled up. What happens next depends on the value of m.
If m is sufficiently small, the cache lines and TLB entries are still present when proceeding to the next column (j=1 in the loop above). If this is the case, all is well: no new TLB entries need to be set up, and the values of x[0..m-1][1] are already in the data cache. If a cache line spans four elements, references to x[0..m-1][4] again result in cache misses, but these misses are part of the natural traffic between main memory and the cache subsystem.
Performance problems arise if m and n are large. To make this more specific, an imaginary system with the following memory characteristics is introduced:
- A data cache with a capacity of 32 kilobytes, LRU replacement, and a line size of 16 bytes
- A fully associative TLB with 256 entries
- A virtual memory page size of 4 kilobytes
One cache line contains 16/4 = 4 elements. The system can store 32 Kbytes/16 bytes = 2048 cache lines in the data cache and map 256 * 4 kilobytes = 1 megabyte of data through the TLB. If the data requirement exceeds one or both of these thresholds, performance degrades.
Assume that m = 4096 and n = 1024.
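The thresholds of this imaginary system, and the footprint of the matrix just chosen, can be written down explicitly. The fragment below is a sketch added here for illustration; the numbers match those derived in the text:

    #include <stdio.h>

    int main(void)
    {
        long line_bytes  = 16;
        long cache_bytes = 32 * 1024;
        long page_bytes  = 4  * 1024;
        long tlb_entries = 256;
        long m = 4096, n = 1024;

        long lines_in_cache = cache_bytes / line_bytes;   /* 2048 lines           */
        long tlb_reach      = tlb_entries * page_bytes;   /* 1 megabyte mapped    */
        long footprint      = m * n * 4;                  /* 16 megabytes of data */

        printf("cache capacity  : %ld lines\n", lines_in_cache);
        printf("TLB reach       : %ld bytes\n", tlb_reach);
        printf("matrix footprint: %ld bytes\n", footprint);
        return 0;
    }

The matrix is sixteen times larger than the TLB reach and far larger than the data cache, so both resources will thrash under the column-oriented access pattern.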
By the time 2048 elements of the first column of matrix x have been read, the data cache is full. It contains the top half of the first four columns of the matrix.
The next reference (i=2048) causes one of the lines to be removed from the cache (evicted). Because of the LRU replacement, the very first cache line loaded (x[0][0]...x[0][3]) is evicted from the cache.
For i = 2049, the second cache line will be replaced, and so forth.
After the last iteration of the inner loop (i=4095) has executed, the top half of the first four columns of matrix x has been replaced by the bottom half of those columns.
TABLE 2 shows a snapshot of the data cache after it has finished processing the iteration for the indicated values of j and i.
TABLE 2 Three Snapshots of the Data Cache
j=0, i=2047               | j=0, i=2048               | j=0, i=4095
--------------------------|---------------------------|--------------------------
x[0][0] ... x[0][3]       | x[2048][0] ... x[2048][3] | x[2048][0] ... x[2048][3]
x[1][0] ... x[1][3]       | x[1][0] ... x[1][3]       | x[2049][0] ... x[2049][3]
x[2][0] ... x[2][3]       | x[2][0] ... x[2][3]       | x[2050][0] ... x[2050][3]
x[3][0] ... x[3][3]       | x[3][0] ... x[3][3]       | x[2051][0] ... x[2051][3]
...                       | ...                       | ...
x[2046][0] ... x[2046][3] | x[2046][0] ... x[2046][3] | x[4094][0] ... x[4094][3]
x[2047][0] ... x[2047][3] | x[2047][0] ... x[2047][3] | x[4095][0] ... x[4095][3]
When the program processes the next column of the matrix, j=1, all cache lines must be reloaded again. For example, the first element needed is x[0][1]. Initially this was brought into the cache for j = 0 and i = 0, but it was evicted when loading subsequent elements of the matrix.
Therefore, all references to the second column of the matrix result in a cache miss again. There is no spatial locality (that is, not all of the elements in a cache line are used). In a similar manner, all references to the subsequent columns of the matrix are cache misses as well.
In other words, all references to x[i][j] result in a cache miss. As a result, performance is very poor.
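The eviction behavior described above can be reproduced with a small, scaled-down cache model. The sketch below is not part of the original text; it models a fully associative cache with LRU replacement (the real D-cache is set associative, which only makes the strided case worse). The capacity and matrix dimensions are scaled down so the simulation runs instantly, but the ratio between them mimics the situation in the text:

    #include <stdio.h>

    #define LINES     64     /* cache capacity in lines (scaled down from 2048) */
    #define ELTS_LINE 4      /* four 4-byte floats per 16-byte line             */
    #define M         128    /* rows    (scaled down from 4096)                 */
    #define N         32     /* columns (scaled down from 1024)                 */

    static long tag[LINES];    /* memory line held by each cache slot */
    static long stamp[LINES];  /* last-use time, for LRU replacement  */
    static long now, misses;

    /* Reference one matrix element, given its row-major element index. */
    static void cache_ref(long element)
    {
        long line = element / ELTS_LINE;
        long lru = 0;
        now++;
        for (long s = 0; s < LINES; s++) {
            if (tag[s] == line) { stamp[s] = now; return; }   /* hit             */
            if (stamp[s] < stamp[lru]) lru = s;               /* track oldest    */
        }
        misses++;                                             /* miss: evict LRU */
        tag[lru] = line;
        stamp[lru] = now;
    }

    static void reset(void)
    {
        for (long s = 0; s < LINES; s++) { tag[s] = -1; stamp[s] = 0; }
        now = 0;
        misses = 0;
    }

    int main(void)
    {
        reset();
        for (long i = 0; i < M; i++)        /* row-oriented: follows storage order   */
            for (long j = 0; j < N; j++)
                cache_ref(i * (long)N + j);
        printf("row order:    %ld misses\n", misses);   /* M*N/4 = 1024              */

        reset();
        for (long j = 0; j < N; j++)        /* column-oriented: stride of N elements */
            for (long i = 0; i < M; i++)
                cache_ref(i * (long)N + j);
        printf("column order: %ld misses\n", misses);   /* M*N = 4096: every reference misses */
        return 0;
    }

Replacing the line granularity with a 4-Kbyte page granularity and the capacity with 256 entries turns the same model into the TLB, which is why the TLB exhibits the same thrashing pattern discussed next.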
But things are even worse than this. Due to the poor memory access pattern, no spatial or temporal locality (that is, re-use of cache lines) is obtained in the TLB cache either. Every reference to a matrix element results in a TLB miss, too.
The explanation for this is similar to that seen for the data cache.
The TLB cache has a capacity of 256 entries. One row of the matrix is 1024 * 4 = 4 kilobytes, which is exactly the size of one virtual memory page.
The program starts reading the first column of the matrix (j = 0). A TLB entry must be set up for the page that contains x[0][0]. The next reference is to x[1][0]. This element has not yet been mapped into the TLB either, giving rise to another TLB miss. This pattern is repeated for the next 254 iterations: they all cause a TLB entry to be set up.
When i = 256, the TLB cache is full and an older entry must be replaced instead of re-using the cached entries. Similar to what was seen with the data cache, subsequent loop iterations evict older entries from the TLB. By i = 511, the first 256 entries have all been flushed out of the TLB.
By the time the program has processed all elements in the first column of the matrix, the TLB entries map the last 256 rows of the matrix.
Unfortunately, the program does not access these rows next, but starts at the top of the matrix again. None of the pages corresponding to those elements are in the TLB, so a TLB miss occurs again on every matrix reference.
This pattern repeats itself over and over again, resulting in a TLB miss on every reference to matrix x.
It is clear that this loop performs very poorly under these circumstances.
For larger matrices, the behavior is similar. If the values of m and/or n are reduced, some or all of these effects become less pronounced. For small matrices, they disappear entirely.
Performance Results
To illustrate the preceding data, this example was run on an UltraSPARC III Cu processor at 900 MHz in a Sun Fire™ 6800 system.
In the row-oriented version, the inner loop is over the rows of the matrix:
    for (i=0; i<m; i++)
       for (j=0; j<n; j++)
          sum += x[i][j];
In the column-oriented version, the inner loop is over the columns of the matrix:
    for (j=0; j<n; j++)
       for (i=0; i<m; i++)
          sum += x[i][j];
Without special precautions, the C compiler from the Sun ONE™ Studio 7 Compiler Collection suite transforms the column-oriented version, with its bad memory access pattern, into the row-oriented version. This is, of course, the right thing to do, but it defeats the purpose of our experiment. To prevent this optimization, we crafted the example to keep the compiler from applying the loop interchange.
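The text does not show how the interchange was suppressed. One simple way to achieve the same effect (a sketch for illustration, not necessarily what was used for the measurements) is to make the accumulator volatile, which forces the additions to occur in program order and therefore makes the interchange illegal for the compiler:

    /* Column-oriented summation with loop interchange suppressed.
     * The volatile qualifier is a hypothetical device used here for
     * illustration; it also keeps sum out of a register, which adds
     * a small overhead of its own. */
    double column_sum(int m, int n, float x[m][n])
    {
        volatile double sum = 0.0;
        for (int j = 0; j < n; j++)        /* cache-unfriendly order on purpose */
            for (int i = 0; i < m; i++)
                sum += x[i][j];
        return sum;
    }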
FIGURE 9 shows the performance (in Mflop/s) as a function of the memory footprint (in kilobytes).
FIGURE 9 Performance of the Matrix Summation Operation
Only square n×n matrices are used. The footprint is then given by n * n * 4 / 1024 kilobytes. Note the log-log scale in the graph.
To filter out timing anomalies, each performance experiment was repeated three times. The average over these three results is the value that was plotted.
The shape of these curves is not surprising. With both versions, the performance is highest for a footprint of about 64 kilobytes, the size of the L1 D-cache on the UltraSPARC III Cu processor.
Because the entire matrix fits in this cache, it will also fit in the D-TLB cache and the memory access pattern is irrelevant when it comes to performance.
However, there is a slight difference between the two versions. The column-oriented version is slightly slower. Additional cache conflicts cause this difference. The column version has a non-unit stride access pattern, increasing the chance of premature cache line evictions.
In general, the curve for the row version is smoother than for the column version. This is because the non-unit stride access pattern in the latter gives rise to more cache line collisions.
Once the matrix no longer fits in the D-cache but still fits in the L2 E-cache, performance degrades, but the performance curves are still similar. This is because all of the matrix can be mapped into the E-cache and D-TLB.
As soon as the matrix no longer fits in the E-cache, performance degrades again, but the curves are no longer identical. The row version outperforms the column version in a rather dramatic way. It is about four to five times faster.
Software Prefetch Effect
Studying the effect of software prefetch is interesting. The results shown so far were obtained with software prefetch disabled (the -xprefetch=no compiler option).
With the -xprefetch=yes option, the compiler applies heuristics to decide whether to insert prefetch instructions and, if so, which ones.
On small problems, prefetching data is not meaningful, as the matrix will be in one of the caches anyway. There is not much latency to hide and the additional cost of the prefetch instruction only slows down the program.
A prefetch instruction on data that is not mapped into the D-TLB will be dropped. The poor memory access pattern in the column version generates many D-TLB misses on large matrices. Therefore, there is no practical reason to insert prefetch instructions in the column version.
This is different for the row version. Thanks to the favorable memory access pattern, this algorithm generates only a modest number of D-TLB misses compared to the total number of cache references. When accessing a large matrix, prefetch should therefore help hide the memory latency.
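For readers who want to experiment with this by hand, the fragment below shows how an explicit prefetch could be added to the row-oriented loop. This is only a sketch: the results in this section rely on the compiler's automatic prefetch insertion via -xprefetch, not on manual prefetching, and __builtin_prefetch is the GCC/Clang builtin rather than anything specific to the Sun compiler. The prefetch distance of 16 elements (one 64-byte line) ahead is an arbitrary illustrative choice:

    double row_sum_prefetch(int m, int n, float x[m][n])
    {
        double sum = 0.0;
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++) {
                /* Hint that the data 16 elements ahead will be read soon.
                 * Arguments: address, 0 = read, 1 = low temporal locality. */
                __builtin_prefetch(&x[i][j] + 16, 0, 1);
                sum += x[i][j];
            }
        return sum;
    }

In practice, one prefetch per cache line (once every 16 inner-loop iterations) is enough; issuing one on every iteration, as done here for brevity, wastes some instruction bandwidth.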
FIGURE 10 is a plot of the performance for the row version, with and without software prefetch. The speed-up (performance ratio between the two versions) is given in FIGURE 11.
Comparing the performances shown in FIGURE 9 and FIGURE 10, it is seen that prefetch not only improves performance for large matrices that do not fit in the level 2 E-cache, but also helps for data sets that are E-cache resident. For a memory footprint between 64 kilobytes and 8 megabytes, the row version with prefetch does not degrade in performance (FIGURE 10), whereas the original version without prefetch does (FIGURE 9).
FIGURE 10 Effect of Software Prefetch on the Row Oriented Version.
FIGURE 11 Performance Ratio for the Row Version With and Without Prefetch.
For matrices that fit in the D-cache, prefetch slows down the operation, but the degradation is limited to approximately 20 percent.
For larger matrices, prefetch has a significant effect on performance. The speed-up for E-cache resident matrices is close to 4. For larger matrices, the speed-up is almost a factor of 5.