4.6 Memory Coherency
A given memory location is said to be coherent if all CPUs and I/O devices in a machine observe one and the same value in that location. Modern machines make aggressive use of memory caches and thereby introduce many potential sources for incoherence. For example, to maximize performance, caches usually delay stores by the CPU as long as possible and write them back to main memory (or the next level in the cache hierarchy) only when absolutely necessary. As a result, the same memory location may have different values in the various caches or in main memory. Modern machines also often use separate caches for instructions and data, thereby introducing the risk of the same location being cached both in the instruction and data cache, but with potentially different values.
A particularly insidious source of incoherence arises from virtually-tagged caches that tag the cache contents not with the memory location (physical address) but with the virtual address with which the location was accessed. Consider the scenario where the same location is accessed multiple times with different virtual addresses. This is called virtual aliasing because different virtual addresses refer to the same memory location. Linux frequently does this and with virtually-tagged caches, the memory location may end up being cached multiple times! Moreover, updating the memory location through one virtual address would not update the cache copies created by the aliases, and we would again have a situation where a memory location is incoherent.
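To make the aliasing problem concrete, the following toy model may help. It is only a sketch under stated assumptions: a tiny direct-mapped, write-back cache that is both indexed and tagged by virtual address, with a made-up translation in which virtual addresses that are equal modulo NWORDS refer to the same physical word. All names (translate, cached_read, read_through_alias, and so on) are hypothetical and do not correspond to any kernel interface.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NLINES 8  /* toy direct-mapped cache: 8 one-word lines */
#define NWORDS 4  /* toy physical memory: 4 words */

struct line { int valid, dirty; uint32_t vtag, data; };

static uint32_t phys_mem[NWORDS];
static struct line cache[NLINES];

/* Hypothetical translation: virtual addresses equal modulo NWORDS alias. */
uint32_t translate(uint32_t va) { return va % NWORDS; }

/* Write a line back to memory if it is dirty, then invalidate it. */
void evict(struct line *l)
{
    if (l->valid && l->dirty)
        phys_mem[translate(l->vtag)] = l->data;
    l->valid = l->dirty = 0;
}

/* Find the line for va; on a miss, evict the old occupant and refill. */
struct line *lookup(uint32_t va)
{
    struct line *l = &cache[va % NLINES];  /* indexed by virtual address */

    if (!l->valid || l->vtag != va) {      /* tagged by virtual address */
        evict(l);
        l->valid = 1;
        l->vtag = va;
        l->data = phys_mem[translate(va)];
    }
    return l;
}

uint32_t cached_read(uint32_t va) { return lookup(va)->data; }

void cached_write(uint32_t va, uint32_t value)
{
    struct line *l = lookup(va);

    l->data = value;
    l->dirty = 1;
}

/* Flush (write back and invalidate) the line holding va, if any. */
void flush_line(uint32_t va)
{
    struct line *l = &cache[va % NLINES];

    if (l->valid && l->vtag == va)
        evict(l);
}

/* Write through virtual address 1, then read through its alias 5. */
uint32_t read_through_alias(int flush_first)
{
    memset(cache, 0, sizeof cache);
    memset(phys_mem, 0, sizeof phys_mem);
    assert(translate(1) == translate(5));  /* both name the same word */

    phys_mem[1] = 10;
    (void)cached_read(5);   /* the alias gets its own cache line */
    cached_write(1, 99);    /* updates only the line tagged with 1 */
    if (flush_first) {
        flush_line(1);      /* write the new value back to memory */
        flush_line(5);      /* discard the stale copy */
    }
    return cached_read(5);
}
```

In this model, read_through_alias(0) returns the stale value 10, whereas read_through_alias(1) returns 99 because both lines are flushed before the alias is read again.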
I/O devices are yet another source of incoherence: a device may write a memory location through DMA (see Chapter 7, Device I/O) and this new value may or may not be observed by the memory caches.
We would like to emphasize that, by itself, an incoherent memory location is not an issue. A problem arises only if an incoherent value is observed, e.g., by a CPU or an I/O device. When this happens, the result is usually catastrophic. For example, suppose we are dealing with a platform that uses separate instruction and data caches. If the operating system reads a page of code from the text section of an executable file and copies it to the user space of a process, the data cache will be updated but the instruction cache will remain unchanged. Thus, when the process attempts to execute the newly loaded code, it may end up fetching stale instructions and the process may crash. To avoid such problems, memory locations must be made coherent before a stale value can be observed. Depending on the platform architecture, maintaining coherency may be the responsibility of hardware or software. In practice, the two often share the responsibility, with the hardware taking care of certain sources of incoherence and the software taking care of the rest.
4.6.1 Maintenance of coherency in the Linux kernel
To accommodate the wide variety of possible memory coherence schemes, Linux defines the interface shown in Figure 4.38. Every platform must provide a suitable implementation of this interface. The interface is designed to handle all coherence issues except DMA coherence. DMA is handled separately, as we see in Chapter 7, Device I/O.
Figure 4.38. Kernel interface to maintain memory coherency.
The first routine in this interface is flush_cache_all(). It must ensure that for data accessed through the kernel address space, all memory locations are coherent. Linux calls this routine just before changing or removing a mapping in the page-table-mapped kernel segment or the kmap segment. On platforms with virtually-tagged caches, the routine is usually implemented by flushing all data caches. On other platforms, this routine normally performs no operation.
The second routine, flush_icache_range(), ensures that a specific range of memory locations is coherent with respect to instruction fetches and data accesses. The routine takes two arguments, start and end. The address range that must be made coherent extends from start up to and including end - 1. All addresses in this range must be valid kernel or mapped user-space virtual addresses. Linux calls this routine after writing instructions to memory. For example, when loading a kernel module, Linux first allocates memory in the page-table-mapped kernel segment, copies the executable image to this memory, and then calls flush_icache_range() on the text segment to ensure that the d-caches and i-caches are coherent before attempting to execute any code in the kernel module. On platforms that do not use separate instruction and data caches or that maintain coherency in hardware, this routine normally performs no operation. On other platforms, it is usually implemented by flushing the instruction cache for the given address range.
The next three routines are all used by Linux to inform platform-specific code that the virtual-to-physical translation of a section of a (user) address space is about to be changed. On platforms with physically indexed caches, these routines normally perform no operation. However, in the presence of virtually indexed caches, these routines must ensure that coherency is maintained. In practice, this usually means that all cache lines associated with any of the affected virtual addresses must be flushed from the caches. The three routines differ in the size of the section they affect: flush_cache_mm() affects the entire address space identified by mm structure pointer mm; flush_cache_range() affects a range of addresses. The range extends from start up to and including end - 1, where start and end are both user-level addresses and are given as the second and third arguments to this routine. The third routine, flush_cache_page(), affects a single page. The vm-area that covers this page is identified by the vma argument and the user-space address of the affected page is given by argument addr. These routines are closely related to the TLB coherency routines flush_tlb_mm(), flush_tlb_range(), and flush_tlb_page() in the sense that they are used in pairs. For example, Linux changes the page table of an entire address space mm by using code that follows the pattern shown below:
    flush_cache_mm(mm);
    ... change page table of mm ...
    flush_tlb_mm(mm);
That is, before changing a virtual-to-physical translation, Linux calls one of the flush_cache routines to ensure that the affected memory locations are coherent. In the second step, it changes the page table (virtual-to-physical translations), and in the third step, it calls one of the flush_tlb routines to ensure that the TLB is coherent. The order in which these steps occur is critical: The memory locations need to be made coherent before the translations are changed because otherwise the flush routine might fault. Conversely, the TLB must be made coherent after the translations are changed because otherwise another CPU might pick up a stale translation after the TLB has been flushed but before the page table has been fully updated.
The three routines just discussed establish coherence for memory locations that may have been written by a user process. The next three routines complement this by providing the means to establish coherence for memory locations written by the kernel.
The first routine is flush_dcache_page(). The Linux kernel calls it to notify platform-specific code that it just dirtied (wrote) a page that is present in the page cache. The page is identified by the routine's only argument, pg, which is a pointer to the page frame descriptor of the dirtied page. The routine must ensure that the content of the page is coherent as far as any user-space accesses are concerned. Because the routine affects coherency only with regard to user-space accesses, the kernel does not call it for page cache pages that cannot possibly be mapped into user space. For example, the content of a symbolic link is never mapped into user space. Thus, there is no need to call flush_dcache_page() after writing the content of a symbolic link. Also, note that this routine is used only for pages that are present in the page cache. This includes all non-anonymous pages and old anonymous pages that have been moved to the swap cache.
Newly created anonymous pages are not entered into the page cache and must be handled separately. This is the purpose of clear_user_page() and copy_user_page(). The former creates an anonymous page, which is cleared to 0. The latter copies a page that is mapped into user space (e.g., as a result of a copy-on-write operation). Both routines take an argument called pg as their last argument. This is a pointer to the page frame descriptor of the page that is being written. Apart from this, clear_user_page() takes two other arguments: to and uaddr. The former is the kernel-space address at which the page resides, and uaddr is the user-space address at which the page will be mapped (because anonymous pages are process-private, there can be only one such address). Similarly, copy_user_page() takes three other arguments: from, to, and uaddr. The first two are the kernel-space addresses of the source and the destination page, and uaddr is again the user-space address at which the new page will be mapped. Why do these two routines have such a complicated interface? The reason is that on platforms with virtually indexed caches, it is possible to write the new page and make it coherent with the page at uaddr without requiring any explicit cache flushing. The basic idea is for the platform-specific code to write the destination page not through the page at address to, but through a kernel-space address that maps to the same cache lines as uaddr. Because anonymous pages are created frequently, this clever trick can achieve a significant performance boost on platforms with virtually indexed caches. On the other hand, on platforms with physically indexed caches, these routines normally do nothing other than clear or copy the page, so despite having rather complicated interfaces, the two routines can be implemented optimally on all platforms.
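One way to picture "a kernel-space address that maps to the same cache lines as uaddr" is the page-coloring arithmetic sketched below. This is an illustration only, not the method used by any particular platform: the page size, the amount of cache covered by the virtual index (CACHE_WAY_SIZE), the alias window address ALIAS_BASE, and the function names are all assumptions chosen for the example.

```c
#include <assert.h>

#define PAGE_SIZE      4096UL         /* assumed page size */
#define CACHE_WAY_SIZE (16UL * 1024)  /* assumed span of the virtual index */
#define ALIAS_BASE     0x40000000UL   /* hypothetical kernel alias window */

/* Index bits above the page offset: the "color" of a virtual page. */
unsigned long cache_color(unsigned long vaddr)
{
    return (vaddr & (CACHE_WAY_SIZE - 1)) / PAGE_SIZE;
}

/* Kernel-space address in the alias window with the same color as uaddr.
 * Writing the new page through this address fills the same cache lines
 * that a later user-space access through uaddr will look up. */
unsigned long alias_address(unsigned long uaddr)
{
    /* The window must start on a color-0 boundary for this to work. */
    assert(ALIAS_BASE % CACHE_WAY_SIZE == 0);
    return ALIAS_BASE + cache_color(uaddr) * PAGE_SIZE;
}
```

With these assumed sizes there are four colors, and alias_address() always returns an address whose color equals that of uaddr. A real implementation would also have to install a temporary kernel mapping of the destination page at that address, which this sketch leaves out.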
4.6.2 IA-64 implementation
The IA-64 architecture guarantees that virtual aliases are supported in hardware but leaves open the possibility that on certain CPUs there may be a performance penalty if two virtual addresses map to the same memory location and the addresses do not differ by a value that is an integer multiple of 1 Mbyte. This implies that Linux can treat IA-64 as if all caches were physically indexed, and hence flush_cache_all(), flush_cache_mm(), flush_cache_range(), and flush_cache_page() do not have to perform any operation at all. To avoid the potential performance penalty, Linux/ia64 maps shared memory segments at 1-Mbyte-aligned addresses whenever possible.
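The 1-Mbyte rule amounts to simple address arithmetic, sketched below. The helper names are hypothetical and are not part of Linux/ia64; the sketch only shows that two mappings placed at 1-Mbyte-aligned addresses automatically differ by a multiple of 1 Mbyte.

```c
#include <assert.h>
#include <limits.h>

#define ALIAS_GRANULE (1UL << 20)  /* 1 Mbyte */

/* Nonzero if two virtual aliases differ by an integer multiple of 1 Mbyte,
 * which is the case that is free of the potential performance penalty. */
int alias_is_penalty_free(unsigned long va1, unsigned long va2)
{
    unsigned long diff = va1 > va2 ? va1 - va2 : va2 - va1;

    return diff % ALIAS_GRANULE == 0;
}

/* Round addr up to the next 1-Mbyte boundary. */
unsigned long align_up_1mb(unsigned long addr)
{
    assert(addr <= ULONG_MAX - (ALIAS_GRANULE - 1));  /* no wraparound */
    return (addr + ALIAS_GRANULE - 1) & ~(ALIAS_GRANULE - 1);
}
```

For example, mappings at 0x100000 and 0x500000 satisfy the rule, whereas mappings at 0x100000 and 0x104000 do not.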
While IA-64 generally requires that coherence is maintained in hardware, there is one important exception: When a CPU writes to a memory location, the hardware does not have to maintain coherence with the instruction caches. For this reason, the IA-64 version of flush_icache_range() must establish coherence by using the flush cache (fc) instruction. This instruction takes one address operand, which identifies the cache line that is written back (if it is dirty) and then evicted from all levels of the cache hierarchy. This instruction is broadcast to all CPUs in an MP machine, so it is sufficient to execute it on one CPU to establish coherence across the entire machine. The architecture guarantees that cache lines are at least 32 bytes in size, so the routine can be implemented by executing fc once for every 32-byte block in the address range from start to end - 1. Care must be taken not to execute fc on an address that is outside this range; doing so could trigger a fault and crash the kernel.
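The address arithmetic of such a loop can be sketched in portable C as follows. This is not the actual Linux/ia64 code: the fc instruction is replaced by a caller-supplied callback so that the sketch can run anywhere, and the names flush_range_sketch, fc_fn, and record_fc are hypothetical. The sketch issues fc at start and then at every following 32-byte boundary below end, so every 32-byte block that overlaps the range is covered and no address outside the range is ever used.

```c
#include <assert.h>

#define FC_STRIDE 32UL  /* minimum cache-line size guaranteed by IA-64 */

/* Stand-in for the fc instruction; real kernel code would execute fc. */
typedef void (*fc_fn)(unsigned long addr);

/* Call fc once for every 32-byte block overlapping [start, end - 1].
 * Every address passed to fc lies inside that range.
 * Returns the number of fc invocations. */
unsigned long flush_range_sketch(unsigned long start, unsigned long end,
                                 fc_fn fc)
{
    unsigned long addr, count = 0;

    assert(start <= end);
    if (start == end)
        return 0;       /* empty range: nothing to flush */

    fc(start);          /* first block, possibly entered mid-line */
    count++;
    for (addr = (start & ~(FC_STRIDE - 1)) + FC_STRIDE; addr < end;
         addr += FC_STRIDE) {
        fc(addr);       /* each later block, at its 32-byte boundary */
        count++;
    }
    return count;
}

/* Test helper: remembers the last address that was "flushed". */
static unsigned long last_fc_addr;

void record_fc(unsigned long addr)
{
    last_fc_addr = addr;
}
```

For instance, flush_range_sketch(0x1000, 0x1040, record_fc) performs two calls, at 0x1000 and 0x1020. An unaligned range such as 0x101f to 0x1021 also needs two calls, because it straddles a 32-byte boundary.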
Is implementing flush_icache_range() sufficient to guarantee coherence between the data and instruction caches? Unfortunately, the answer is no. To see this, consider that Linux calls this routine only when it positively knows that the data it wrote to memory will be executed eventually. It does not know this when writing a page that later on gets mapped into user space. For this reason, flush_dcache_page(), clear_user_page(), and copy_user_page() logically also must call flush_icache_range() on the target page. Given that there are usually many more data pages than code pages, this naive implementation would be prohibitively slow. Instead, Linux/ia64 attempts to delay flushing the cache with the following trick: The Linux kernel reserves a 1-bit field called PG_arch_1 in every page frame descriptor for platform-specific purposes. On IA-64, this bit indicates whether or not the page is coherent in the instruction and data caches. A value of 0 signifies that the page may not be coherent and a value of 1 signifies that the page is definitely coherent. With this setup, flush_dcache_page(), clear_user_page(), and copy_user_page() can be implemented so that they simply clear the PG_arch_1 bit of the dirtied page. Of course, before mapping an executable page into user space, the kernel still needs to flush the cache if the PG_arch_1 bit is off. Fortunately, we can use the platform-specific update_mmu_cache() for this purpose. Recall from Section 4.4.2 that this routine is called whenever a translation is inserted or updated in the page table. On IA-64, we can use this as an opportunity to check whether the page being mapped has the execute permission (X) bit enabled. If not, there is no need to establish coherency with the instruction cache and the PG_arch_1 bit is ignored. On the other hand, if the X bit is enabled and the PG_arch_1 bit is 0, coherency must be established by a call to flush_icache_range(). After this call returns, the PG_arch_1 bit can be turned on, because it is now known that the page is coherent. This approach ensures that if the same page is mapped into other processes, the cache flush does not have to be repeated (assuming it has not been dirtied in the meantime).
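The delayed flush scheme is essentially a one-bit state machine per page, which the following user-space model tries to capture. It is a sketch only: struct page_sketch, the sketch_ prefixed functions, and the flush counter are hypothetical stand-ins, and the real flush_icache_range() call is replaced by incrementing a counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page frame descriptor. */
struct page_sketch {
    bool pg_arch_1;               /* true: i-cache and d-cache coherent */
    unsigned int icache_flushes;  /* instrumentation for this model only */
};

/* Models flush_dcache_page(), clear_user_page(), and copy_user_page():
 * the page was written, so it can no longer be assumed coherent. */
void sketch_flush_dcache_page(struct page_sketch *pg)
{
    assert(pg != 0);
    pg->pg_arch_1 = false;
}

/* Models update_mmu_cache(): called when a translation is installed. */
void sketch_update_mmu_cache(struct page_sketch *pg, bool x_bit)
{
    assert(pg != 0);
    if (!x_bit)
        return;            /* not executable: PG_arch_1 is ignored */
    if (pg->pg_arch_1)
        return;            /* already known to be coherent */
    pg->icache_flushes++;  /* stands in for flush_icache_range() */
    pg->pg_arch_1 = true;  /* remember that the page is now coherent */
}

/* Scenario: dirty a page once, then map it nmaps times. */
unsigned int flushes_for_mappings(unsigned int nmaps, bool x_bit)
{
    struct page_sketch pg = { false, 0 };
    unsigned int i;

    sketch_flush_dcache_page(&pg);
    for (i = 0; i < nmaps; i++)
        sketch_update_mmu_cache(&pg, x_bit);
    return pg.icache_flushes;
}
```

In this model, mapping a freshly dirtied page as executable three times costs a single flush, and mapping it without the X bit costs none.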
There is just one small problem: Linux traditionally maps the memory stack and memory allocated by the brk() system call with execute permission turned on. This has a devastating effect on the delayed i-cache flush scheme because it causes all anonymous pages to be flushed from the cache even though usually none of them are ever executed. To fix this problem, Linux/ia64 is lazy about turning on the X bit in PTEs. This works as follows: When installing the PTE for a vm-area that is both writable and executable, Linux/ia64 does not turn on the X bit. If a process attempts to execute such a page, a protection violation fault is raised by the CPU and the Linux page fault handler is invoked. Because the vm-area permits execute accesses, the page fault handler simply turns on the X bit in the PTE and then updates the page table. The process can then resume execution as if the X bit had been turned on all along. In other words, this scheme ensures that for vm-areas that are mapped both executable and writable, the X bit in the PTEs will be enabled only if the page experiences an execute access. This technique has the desired effect of avoiding practically all cache flushing on anonymous pages.
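The lazy execute bit can be modeled in a few lines, as sketched below. The structures and function names (vma_sketch, pte_sketch, install_pte, exec_fault, simulate_page) are hypothetical simplifications and do not mirror the real Linux/ia64 data structures or fault handler.

```c
#include <assert.h>
#include <stdbool.h>

struct vma_sketch { bool may_write, may_exec; };  /* vm-area permissions */
struct pte_sketch { bool present, x; };           /* simplified PTE */

/* Install a PTE. For a vm-area that is both writable and executable,
 * the X bit is withheld until an execute access actually occurs. */
void install_pte(struct pte_sketch *pte, const struct vma_sketch *vma)
{
    assert(pte != 0 && vma != 0);
    pte->present = true;
    pte->x = vma->may_exec && !vma->may_write;
}

/* Handle an execute-protection fault.
 * Returns true if the faulting access may be retried. */
bool exec_fault(struct pte_sketch *pte, const struct vma_sketch *vma)
{
    assert(pte != 0 && vma != 0);
    if (!vma->may_exec)
        return false;  /* genuine protection violation */
    pte->x = true;     /* first execute access: grant X now */
    return true;
}

/* Simulate one page: install its PTE, then optionally execute from it.
 * Returns the final X bit, or -1 if executing is a real violation. */
int simulate_page(struct vma_sketch vma, bool executes)
{
    struct pte_sketch pte;

    install_pte(&pte, &vma);
    if (executes && !pte.x && !exec_fault(&pte, &vma))
        return -1;
    return pte.x;
}
```

In this model, a page in a writable and executable vm-area ends up with the X bit set only if it is actually executed, so a stack or brk() page that is never executed never reaches the point where an i-cache flush would be required.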
The beauty of combining the delayed i-cache flush scheme with the lazy execute bit is that together they are able to not just delay but completely avoid flushing the cache for pages that are never executed. As we see in Chapter 7, Device I/O, the effectiveness of this pair is enhanced even further by the Linux DMA interface because it can be used to avoid the need to flush even the executable pages. The net effect is that on IA-64 it is virtually never necessary to explicitly flush the cache, even though the hardware does not maintain i-cache coherence for writes by the CPU.