4.4 Translation Lookaside Buffer (TLB)
Every time the CPU accesses virtual memory, a virtual address must be translated to the corresponding physical address. Conceptually, this translation requires a page-table walk, and with a three-level page table, three memory accesses would be required. In other words, every virtual access would result in four physical memory accesses. Clearly, if a virtual memory access were four times slower than a physical access, virtual memory would not be very popular! Fortunately, a clever trick removes most of this performance penalty: modern CPUs use a small associative memory to cache the PTEs of recently accessed virtual pages. This memory is called the translation lookaside buffer (TLB).
The TLB works as follows. On a virtual memory access, the CPU searches the TLB for the virtual page number of the page that is being accessed, an operation known as TLB lookup. If a TLB entry is found with a matching virtual page number, a TLB hit occurred and the CPU can go ahead and use the PTE stored in the TLB entry to calculate the target physical address. Now, the reason the TLB makes virtual memory practical is that because it is small (typically on the order of a few dozen entries), it can be built directly into the CPU and it runs at full CPU speed. This means that as long as a translation can be found in the TLB, a virtual access executes just as fast as a physical access. Indeed, modern CPUs often execute faster in virtual memory because the TLB entries indicate whether it is safe to access memory speculatively (e.g., to prefetch instructions).
But what happens if there is no TLB entry with a matching virtual page number? This event is termed a TLB miss and, depending on the CPU architecture, is handled in one of two ways:
Hardware TLB miss handling: In this case, the CPU goes ahead and walks the page table to find the right PTE. If the PTE can be found and is marked present, then the CPU installs the new translation in the TLB. Otherwise, the CPU raises a page fault and hands over control to the operating system.
Software TLB miss handling: In this case, the CPU simply raises a TLB miss fault. The fault is intercepted by the operating system, which invokes the TLB miss handler in response. The miss handler then walks the page table in software and, if a matching PTE that is marked present is found, the new translation is inserted in the TLB. If the PTE is not found, control is handed over to the page fault handler.
Whether a TLB miss is handled in hardware or in software, the bottom line is that miss handling results in a page-table walk and, if a PTE that is marked present can be found, the TLB is updated with the new translation. Most CISC architectures (such as IA-32) perform TLB miss handling in hardware, and most RISC architectures (such as Alpha) use a software approach. A hardware solution is often faster, but is less flexible. Indeed, the performance advantage may be lost if the hardware poorly matches the needs of the operating system. As we see later, IA-64 provides a hybrid solution that retains much of the flexibility of the software approach without sacrificing the speed of the hardware approach.
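To make the software-handled case concrete, the sketch below shows the overall shape such a miss handler might take. It is purely illustrative: the helper names (lookup_pte(), tlb_insert(), handle_page_fault()) are hypothetical stand-ins, not functions of any particular kernel or architecture.

/* Illustrative sketch of a software TLB miss handler; all helpers are
 * hypothetical stand-ins. */
void tlb_miss_handler(struct mm_struct *mm, unsigned long vaddr)
{
        pte_t *pte = lookup_pte(mm, vaddr);     /* walk the page table in software */

        if (pte != NULL && pte_present(*pte))
                tlb_insert(vaddr, *pte);        /* install the new translation */
        else
                handle_page_fault(mm, vaddr);   /* defer to the page fault handler */
}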
TLB replacement policy
Let us now consider what should happen when the TLB is full (all entries are in use) and the CPU or the TLB miss handler needs to insert a new translation. The question now is, Which existing entry should be evicted (overwritten) to make space for the new entry? This choice is governed by the TLB replacement policy. Usually, some form or approximation of LRU is used for this purpose. With LRU, the TLB entry that has not been used the longest is the one that is evicted from the TLB. The exact choice of replacement policy often depends on whether the policy is implemented in hardware or in software. Hardware solutions tend to use simpler policies, such as not recently used (NRU), whereas software solutions can implement full LRU or even more sophisticated schemes without much difficulty.
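To illustrate, a software miss handler implementing true LRU might select its victim as sketched below; the tlb_entry structure and its last_used timestamp are hypothetical, introduced only to show the idea.

/* Hypothetical software-managed TLB: evict the entry unused the longest. */
struct tlb_entry {
        unsigned long vpn;          /* virtual page number */
        unsigned long pfn;          /* page frame number */
        unsigned long last_used;    /* updated on every TLB hit */
        int           valid;
};

int choose_lru_victim(struct tlb_entry *tlb, int nentries)
{
        int i, victim = 0;

        for (i = 0; i < nentries; i++) {
                if (!tlb[i].valid)
                        return i;                               /* free slot, no eviction needed */
                if (tlb[i].last_used < tlb[victim].last_used)
                        victim = i;                             /* older entry found */
        }
        return victim;
}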
Note that if TLB miss handling is implemented in hardware, the replacement policy obviously also must be implemented in hardware. However, with a software TLB miss handler, the replacement policy can be implemented either in hardware or in software. Some architectures (e.g., MIPS) employ software replacement, but many newer architectures, including IA-64, offer a hardware replacement policy.
Removing old entries from the TLB
A final challenge in using a TLB is how to keep it synchronized (or coherent) with the underlying page table. Just as with any other cache, care must be taken to avoid cases where the TLB contains stale entries that are no longer valid. Stale entries can result from a number of scenarios. For example, when a virtual page is paged out to disk, the PTE in the page table is marked not present. If that page still has a TLB entry, it is now stale (because we assumed that the TLB contains only present PTEs). Similarly, a process might map a file into memory, access a few pages in the mapped area, and then unmap the file. At this point, the TLB may still contain entries that were inserted when the mapped area was accessed, but because the mapping no longer exists, those entries are now stale. The event that by far creates the most stale entries occurs when execution switches from one process to another. Because each process has its own address space, the entire TLB becomes stale on a context switch!
Given the number and complexity of the scenarios that can lead to stale entries, it is up to the operating system to ensure that they are flushed from the TLB before they can cause any harm. Depending on the CPU architecture, different kinds of TLB flush instructions are provided. Typical instructions flush the entire TLB, the entry for a specific virtual page, or all TLB entries that fall in a specific address range.
Note that a context switch normally requires that the entire TLB be flushed. However, because this is such a common operation and because TLB fault handling is relatively slow, CPU architects over the years have come up with various schemes to avoid this problem. These schemes go by various names, such as address-space numbers, context numbers, or region IDs, but they all share the basic idea: The tag used for matching a TLB entry is expanded to contain not just the virtual page number but also an address-space number that uniquely identifies the process (address space) to which the translation belongs. The CPU is also extended to contain a new register, asn, that identifies the address-space number of the currently executing process. Now, when the TLB is searched, the CPU ignores entries whose unique number does not match the value in the asn register. With this setup, a context switch simply requires updating the asn register; no flushing is needed anymore. Effectively, this scheme makes it possible to share the TLB across multiple processes.
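Conceptually, the extended tag match works as in the sketch below; the structure layout and the asn register are schematic and not tied to any specific architecture.

/* Schematic ASN-tagged TLB lookup: an entry hits only if both the virtual
 * page number and the address-space number match the current process. */
struct tlb_tag {
        unsigned long vpn;     /* virtual page number */
        unsigned int  asn;     /* address-space number of the owning process */
};

int tlb_tag_matches(const struct tlb_tag *tag, unsigned long vpn,
                    unsigned int cur_asn /* contents of the asn register */)
{
        return tag->vpn == vpn && tag->asn == cur_asn;
}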
4.4.1 The IA-64 TLB architecture
The IA-64 architecture uses an interesting approach to speed up virtual-to-physical address translation. Apart from the basic TLB, there are three other hardware structures, two of which, the region registers and the protection key registers, are designed to increase the effectiveness of the TLB. The third, the virtual hash page table walker (VHPT walker) is designed to reduce the penalty of a TLB miss.
Figure 4.29 illustrates how an IA-64 CPU translates a virtual address to a physical address. Let us start in the upper-right corner of the figure. There, we find a virtual address that has been divided into three fields: the virtual region number vrn, the virtual page number vpn, and the page offset field offset. As usual, the page offset does not participate in the translation and is copied straight to the offset field of the physical address at the lower-right corner of the figure. In contrast, the 3-bit region number vrn is first sent to the region registers in the upper-left corner. Here, the region register indexed by vrn is read, and the resulting region ID value is sent to the TLB. At the TLB, the region ID is combined with the virtual page number vpn and the resulting region ID/vpn key is used to search the TLB. If an entry matches the search key, the remaining fields of the entry provide the information necessary to complete the address translation. Specifically, the pfn field provides the page frame number associated with the virtual page number. This field, too, can be copied down to the corresponding field in the physical address. The memory attribute field ma determines whether or not the memory access can be cached. If it can, the uc field (bit 63) of the physical address is cleared; otherwise, it is set. The final two fields, +rights and key, are used to check whether the memory access is permitted. The +rights field provides a set of positive rights that control what kind of accesses (read, write, or execute) are permitted at what privilege level (user and/or kernel). The key field is sent to the protection key registers. There, the register with a matching key value is read, and its -rights field supplies the negative rights needed to complete the permission check. Specifically, any kind of access specified in the -rights field is prohibited, even if it would otherwise be permitted by the +rights field. If there is no register matching the key value, a KEY MISS FAULT is raised. The operating system can intercept this fault and decide whether to install the missing key or take some other action (such as terminating the offending process). At this point the CPU has both the physical address and the information necessary to check whether the memory access is permitted, so the translation is complete.
Figure 4.29. IA-64 virtual address translation hardware.
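The pseudo-C below summarizes the data flow of Figure 4.29. It is a schematic model of what the hardware does, not a programmable interface; tlb_search(), pkr_lookup(), and check_access() merely stand for the corresponding hardware steps.

/* Schematic model of IA-64 address translation (cf. Figure 4.29). */
unsigned long translate(unsigned long va)
{
        unsigned long vrn    = va >> 61;                   /* top 3 bits select the region */
        unsigned long vpn    = (va & ~(7UL << 61)) >> PAGE_SHIFT;
        unsigned long offset = va & (PAGE_SIZE - 1);
        unsigned long rid    = region_registers[vrn];      /* region ID from region register */

        struct tlb_entry *e = tlb_search(rid, vpn);        /* search key: region ID/vpn */
        if (e == NULL)
                return handle_tlb_miss(va);                /* VHPT walker or TLB miss fault */

        /* +rights come from the TLB entry; -rights come from the protection key
         * register matching e->key (a KEY MISS FAULT is raised if none matches). */
        check_access(e->plus_rights, pkr_lookup(e->key));

        return ((unsigned long) e->pfn << PAGE_SHIFT) | offset;  /* uc bit handling omitted */
}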
A somewhat unusual aspect of IA-64 is that the present bit is also part of the TLB entry. The Linux kernel never inserts a translation for a page that is not present, but the VHPT walker may do so.
TLB structure and management policies
As Figure 4.30 illustrates, the IA-64 TLB is divided into four logically separate units. On the left side is the instruction TLB (ITLB), which translates instruction addresses; on the right side is the data TLB (DTLB), which translates data addresses. Both the ITLB and the DTLB are further subdivided into translation registers (ITR and DTR) and translation caches (ITC and DTC). The difference between the two lies in where the replacement policy is implemented: for the translation caches, the hardware (CPU) implements the replacement policy, whereas for translation registers, the replacement policy is implemented in software. In other words, when a TLB entry is inserted into a translation register, both the TLB entry and the translation register name (e.g., itr1) have to be specified. In contrast, insertion into a translation cache requires specification of only the TLB entry; the hardware then picks an existing entry and replaces it with the new one.
Figure 4.30. IA-64 TLB structure.
The architecture guarantees that the ITC and DTC have a size of at least one entry. Of course, a realistic CPU typically supports dozens of entries in each cache. For example, Itanium implements 96 ITC entries and 128 DTC entries. Even so, to guarantee forward progress the operating system must never assume that more than one entry can be held in the cache at any given time. Otherwise, when inserting two entries back to back, the hardware replacement policy may end up replacing the first entry when inserting the second one. Thus, the operating system must be written in a way that ensures forward progress even if only the second entry survives.
Both ITR and DTR are guaranteed to support at least eight translation registers. However, the IA-64 architecture leaves hardware designers the option to implement translation registers in the form of translation cache entries that are marked so that they are never considered for replacement by the hardware. With this option, the more translation registers used, the fewer entries available in the translation cache. For this reason, it is generally preferable to allocate only as many translation registers as are really needed and to allocate them in order of increasing register index.
Linux/ia64 uses translation registers to pin certain critical code sections and data structures. For example, one ITR entry is used to pin the TLB fault handlers of the kernel, and another is used to map firmware code that cannot risk taking a TLB miss fault. Similarly, the kernel uses a DTR entry to pin the kernel stack of the currently running process.
The VHPT walker and the virtually-mapped linear page table
One question we have glossed over so far is what happens when there is no matching TLB entry for a given region ID/vpn pair, i.e., when a TLB miss occurs. On IA-64, this event can be handled in one of two ways: if enabled, the VHPT walker becomes active and attempts to fill in the missing TLB entry. If the VHPT walker is disabled, the CPU signals a TLB miss fault, which is intercepted by the Linux kernel. The details of how a TLB miss is handled in software are described in Section 4.5. For now, let us focus on the case where the TLB miss is handled by the VHPT walker.
First, let us note that the use of the VHPT walker is completely optional. If an operating system decides not to use it, IA-64 leaves the operating system complete control over the page-table structure and the PTE format. In order to use the VHPT walker, the operating system may need to limit its choices somewhat. Specifically, the VHPT walker can support one of two modes: hashed mode or linear-page-table mode. In hashed mode, the operating system should use a hash table as its page table and the PTEs have a format that is known as the long format. With the long format, each PTE is 32 bytes in size. In linear-page-table mode, the operating system needs to be able to support a virtually-mapped linear page table and the PTEs have a format known as the short format. The short format is the one Linux uses. As shown in Figure 4.25, this type of PTE is 8 bytes in size.
The VHPT walker configuration is determined by the page table address control register pta. This register is illustrated in Figure 4.31. Bit ve controls whether or not the VHPT walker is enabled. The operation of the VHPT walker is further qualified by a control bit in each region register. Only when a TLB miss occurs in a region for which this control bit and pta.ve are both 1 does the VHPT walker become active. The second control bit in pta is vf; it determines whether the walker operates in hashed (long-format) or virtually-mapped linear page table (short-format) mode. A value of 1 indicates hashed mode. Let us assume that ve is 1 (enabled) and that vf is 0 (virtually-mapped linear page table mode). With this configuration, the base and size fields define the address range that the linear page table occupies in each region. The base field contains the 49 most significant bits of the region-relative offset at which the table starts, and the size field contains the number of address bits that the table spans (i.e., the table is 2^pta.size bytes long).
Figure 4.31. Format of IA-64 page table address register pta.
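As a concrete illustration, the pta fields could be extracted with macros such as the ones below. The bit positions (ve in bit 0, size in bits 2-7, vf in bit 8, base in bits 15-63) are an assumption based on the field widths described above and should be checked against Figure 4.31 and the architecture manual.

/* Sketch of the pta register fields (bit positions assumed, see text). */
#define PTA_VE(pta)    ((pta) & 0x1UL)            /* walker enable */
#define PTA_SIZE(pta)  (((pta) >> 2) & 0x3fUL)    /* table spans 2^size bytes */
#define PTA_VF(pta)    (((pta) >> 8) & 0x1UL)     /* 1 = hashed (long-format) mode */
#define PTA_BASE(pta)  ((pta) >> 15)              /* region-relative table offset, in units of 2^15 bytes */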
Note that while the VHPT walker can be disabled separately in each region (by a bit in the region registers), for those regions in which it is enabled, the linear page table is mapped at the same relative address range in each region. Given this constraint, the location at which the linear page table is placed needs to be chosen carefully.
Figure 4.32 illustrates the solution that Linux/ia64 uses. Two factors influence the placement of the linear page table. First, so that virtual address space is not wasted, it is preferable for the linear page table not to overlap with the normal page-table-mapped space of the region. The latter space is illustrated in the figure by the rectangle at the bottom of the region. As usual, the addresses listed next to it are valid for the case in which a three-level page table is used with a page size of 8 Kbytes. Second, the page table also may not overlap with the address-space hole in the middle of the region that exists if the CPU does not implement all address bits (as determined by the IMPL_VA_MSB parameter). This hole is illustrated in the figure by the dark-shaded rectangle in the middle of the region. The addresses listed next to it are valid for the case where IMPL_VA_MSB = 50. With these two factors in mind, Linux/ia64 sets up the pta register such that the linear page table is mapped at the top end of the region, as illustrated by the lightly shaded rectangle. Note that for certain combinations of page size and IMPL_VA_MSB, the normal page-table-mapped space might either cross over into the unimplemented space or overlap with the virtually-mapped linear page table (when IMPL_VA_MSB = 60). The Linux kernel checks for this at boot time, and if it detects an overlap condition, it prints an error message and halts execution.
Figure 4.32. Virtually-mapped linear page table inside a Linux/ia64 region.
Now, let us take a look at how the VHPT walker operates. When a TLB miss occurs for virtual address va, the walker calculates the virtual address va′ of the PTE that maps va. Using the virtually-mapped linear page table, this address is given by:
va′ = ⌊va/2^61⌋ · 2^61 + pta.base · 2^15 + 8 · (⌊va/PAGE_SIZE⌋ mod 2^(pta.size-3))
That is, va′ is the sum of the region's base address, the region offset of the linear page table, and the offset of the PTE within the linear page table. In the last summand, the factor of 8 is used because each PTE has a size of 8 bytes, and the modulo operation truncates away the most significant address bits that are not mapped by the page table. The VHPT walker then attempts to read the PTE stored at this address. Because this is again a virtual address, the CPU goes through the normal virtual-to-physical address translation. If a TLB entry exists for va′, the translation succeeds and the walker can read the PTE from physical memory and install the PTE for va. However, if the TLB entry for va′ is also missing, the walker gives up and requests assistance by raising a VHPT TRANSLATION FAULT.
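In C, the walker's address calculation corresponds to something like the function below. It is a direct transcription of the formula, assuming PAGE_SIZE = 2^PAGE_SHIFT and that pta_base and pta_size hold the raw pta.base and pta.size field values.

/* Compute the address va' of the PTE that maps va in the virtually-mapped
 * linear page table (transcription of the formula above). */
unsigned long pte_vaddr(unsigned long va, unsigned long pta_base,
                        unsigned int pta_size)
{
        unsigned long region_base = (va >> 61) << 61;              /* region's base address */
        unsigned long table_off   = pta_base << 15;                /* region offset of the table */
        unsigned long pte_index   = (va >> PAGE_SHIFT) &
                                    ((1UL << (pta_size - 3)) - 1); /* truncate unmapped vpn bits */

        return region_base + table_off + 8 * pte_index;            /* 8 bytes per PTE */
}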
Let us emphasize that the VHPT walker never walks the Linux page table. It could not possibly do so because it has no knowledge of the page-table structure used by Linux. For example, it does not know how many levels the page-table tree has or how big each directory is. But why use the VHPT walker at all given that it can handle a TLB miss only if the TLB entry for the linear page table is already present? The reason is spatial locality of reference. Consider that the TLB entry that maps a particular PTE actually maps an entire page of PTEs. Thus, after a TLB entry for a page-table page is installed, all TLB misses that access PTEs in the same page can be handled entirely by the VHPT walker, avoiding costly TLB miss faults. For example, with a page size of 8 Kbytes, each page-table TLB entry maps 8 Kbytes/8 = 1024 PTEs and hence 1024 · 8 Kbytes = 8 Mbytes of memory. In other words, when accessing memory sequentially, the VHPT walker reduces TLB miss faults from one per page to only one per 1024 pages! Given the high cost of fielding a fault on modern CPUs, the VHPT walker clearly has the potential to dramatically increase performance.
On the other hand, if memory is accessed in an extremely sparse pattern, the linear page table can be disadvantageous because the TLB entries for the page table take up space without being of much benefit. For example, again assuming a page size of 8 Kbytes, the most extreme case would occur when one byte is accessed every 8 Mbytes. In this case, each memory access would require two TLB entries: one for the page being accessed and one for the corresponding page-table page. Thus, the effective size of the TLB would be reduced by a factor of two! Fortunately, few applications exhibit such extreme access patterns for prolonged periods of time, so this is usually not an issue.
On a final note, it is worth pointing out that the virtually-mapped linear page table used by Linux/ia64 is not a self-mapped virtual page table (see Section 4.3.2). The two are very similar in nature, but the IA-64 page table does not have a self-mapped entry in the global directory. The reason is that none is needed: The self-mapped entry really is needed only if the virtual page table is used to access global- and middle-directory entries. Since Linux/ia64 does not do that, it needs no self-mapping entry. Another way to look at this situation is to think of the virtual page table as existing in the TLB only: If a page in the virtual page table happens to be mapped in the TLB, it can be used to access the PTE directory, but if it is not mapped, an ordinary page-table walk is required.
Linux/ia64 and the region and protection key registers
Let us now return to Figure 4.29 on page 177 and take a closer look at the workings of the region and protection key registers and how Linux uses them. Both register files are under complete control of the operating system; the IA-64 architecture does not dictate a particular way of using them. However, they clearly were designed with a particular use in mind. Specifically, the region registers provide a means to share the TLB across multiple processes (address spaces). For example, if a unique region ID is assigned to each address space, the TLB entries for the same virtual page number vpn of different address spaces can reside in the TLB at the same time because they remain distinguishable, thanks to the region ID. With this use of region IDs, a context switch no longer requires flushing of the entire TLB. Instead, it is simply necessary to load the region ID of the new process into the appropriate region registers. This reduction in TLB flushing can dramatically improve performance for certain applications. Also, because each region has its own region register, it is even possible to have portions of different address spaces active at the same time. The Linux kernel takes advantage of this by permanently installing the region ID of the kernel in rr5 through rr7 and installing the region ID of the currently running process in rr0 through rr4. With this setup, kernel TLB entries and the TLB entries of various user-level processes can coexist without any difficulty and without wasting any TLB entries.
Whereas region registers make it possible to share the entire TLB across processes, protection key registers enable the sharing of individual TLB entries across processes, even if the processes must have distinct access rights for the page mapped by the TLB entry. To see how this works, suppose that a particular TLB entry maps a page of a shared object. The TLB entry for this page would be installed with the access rights (+rights) set according to the needs of the owner of the object. The key field would be set to a value that uniquely identifies the shared object. The operating system could then grant a process restricted access to this object by using one of the protection key registers to map the object's unique ID to an appropriate set of negative rights (-rights). This kind of fine-grained sharing of TLB entries has the potential to greatly improve TLB utilization, e.g., for shared libraries. However, the short-format mode of the VHPT walker cannot take advantage of the protection key registers and, for this reason, Linux disables them by clearing the pk bit in the processor status register (see Chapter 2, IA-64 Architecture).
4.4.2 Maintenance of TLB coherency
For proper operation of the Linux kernel to be guaranteed, the TLB must be kept coherent with the page tables. For this purpose, Linux defines an interface that abstracts the platform differences of how entries are flushed from the TLB. The interface used for this purpose, shown in Figure 4.33, is called the TLB flush interface (file include/asm/pgalloc.h).
Figure 4.33. Kernel interface to maintain TLB coherency.
The first routine in this interface is flush_tlb_page(). It flushes the TLB entry of a particular page. The routine takes two arguments, a vm-area pointer vma and a virtual address addr. The latter is the address of the page whose TLB entry is being flushed, and the former points to the vm-area structure that covers this page. Because the vm-area structure contains a link back to the mm structure to which it belongs, the vma argument indirectly also identifies the address space for which the TLB entry is being flushed.
The second routine, flush_tlb_range(), flushes the TLB entries that map virtual pages inside an arbitrary address range. It takes three arguments: an mm structure pointer mm, a start address start, and an end address end. The mm argument identifies the address space for which the TLB entries should be flushed, and start and end identify the first and the last virtual page whose TLB entries should be flushed.
The third routine, flush_tlb_pgtables(), flushes TLB entries that map the virtually-mapped linear page table. Platforms that do not use a virtual page table do not have to do anything here. For the other platforms, argument mm identifies the address space for which the TLB entries should be flushed, and arguments start and end specify the virtual address range for which the virtual-page-table TLB entries should be flushed.
The fourth routine, flush_tlb_mm(), flushes all TLB entries for a particular address space. The address space is identified by argument mm, which is a pointer to an mm structure. Depending on the capabilities of the platform, this routine can either truly flush the relevant TLB entries or simply assign a new address-space number to mm.
Note that the four routines discussed so far may flush more than just the requested TLB entries. For example, if a particular platform does not have an instruction to flush a specific TLB entry, it is safe to implement flush_tlb_page() such that it flushes the entire TLB.
The next routine, flush_tlb_all(), flushes the entire TLB. This is a fallback routine in the sense that it can be used when none of the previous, more fine-grained routines are suitable. By definition, this routine flushes even TLB entries that map kernel pages. Because this is the only routine that does this, any address-translation-related changes to the page-table-mapped kernel segment or the kmap segment must be followed by a call to this routine. For this reason, calls to vmalloc() and vfree() are relatively expensive.
The last routine in this interface is update_mmu_cache(). Instead of flushing a TLB entry, it can be used to proactively install a new translation. The routine takes three arguments: a vm-area pointer vma, a virtual address addr, and a page-table entry pte. The Linux kernel calls this routine to notify platform-specific code that the virtual page identified by addr now maps to the page frame identified by pte. The vma argument identifies the vm-area structure that covers the virtual page. This routine gives platform-specific code a hint when the page table changes. Because it gives only a hint, a platform is not required to do anything. The platform could use this routine either for platform-specific purposes or to aggressively update the TLB even before the new translation is used for the first time. However, it is important to keep in mind that installation of a new translation generally displaces an existing entry from the TLB, so whether or not this is a good idea depends both on the applications in use and on the performance characteristics of the platform.
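Putting the descriptions above together, the declarations behind Figure 4.33 look roughly like the sketch below. The argument lists follow the text; the exact types and their placement in header files vary with kernel version, so treat this as an illustration rather than the authoritative interface.

/* TLB flush interface (cf. Figure 4.33); signatures sketched from the text. */
void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr);
void flush_tlb_range(struct mm_struct *mm, unsigned long start, unsigned long end);
void flush_tlb_pgtables(struct mm_struct *mm, unsigned long start, unsigned long end);
void flush_tlb_mm(struct mm_struct *mm);
void flush_tlb_all(void);
void update_mmu_cache(struct vm_area_struct *vma, unsigned long addr, pte_t pte);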
IA-64 implementation
On Linux/ia64, flush_tlb_mm() is implemented such that it forces the allocation of a new address-space number (region ID) for the address space identified by argument mm. This is logically equivalent to purging all TLB entries for that address space, but it has the advantage of not requiring execution of any TLB purge instructions.
The flush_tlb_all() implementation is based on the ptc.e (purge translation cache entry) instruction, which purges a large section of the TLB. Exactly how large a section of the TLB is purged depends on the CPU model. The architected sequence to flush the entire TLB is as follows:
long flags, i, j, addr = BASE_ADDR;

local_irq_save(flags);          /* disable interrupts */
for (i = 0; i < COUNT0; ++i, addr += STRIDE0) {
        for (j = 0; j < COUNT1; ++j, addr += STRIDE1)
                ptc_e(addr);
}
local_irq_restore(flags);       /* reenable interrupts */
Here, BASE_ADDR, COUNT0, STRIDE0, COUNT1, and STRIDE1 are CPU-model-specific values that can be obtained from PAL firmware (see Chapter 10, Booting). The advantage of using such an architected loop instead of an instruction that is guaranteed to flush the entire TLB is that ptc.e is easier to implement on CPUs with multiple levels and different types of TLBs. For example, a particular CPU model might have two levels of separate instruction and data TLBs, making it difficult to clear all TLBs atomically with a single instruction. However, in the particular case of Itanium, COUNT0 and COUNT1 both have a value of 1, meaning that a single ptc.e instruction flushes the entire TLB.
All other flush routines are implemented with either the ptc.l (purge local translation cache) or the ptc.ga (purge global translation cache and ALAT) instruction. The former is used on UP machines, and the latter is used on MP machines. Both instructions take two operands, a start address and a size, which define the virtual address range from which TLB entries should be purged. The ptc.l instruction affects only the local TLB, so it is generally faster to execute than ptc.ga, which affects the entire machine (see the architecture manual for the exact definition [26]). Only one CPU can execute ptc.ga at any given time. To enforce this, the Linux/ia64 kernel uses a spinlock to serialize execution of this instruction.
Linux/ia64 uses update_mmu_cache() to implement a cache-flushing operation that we discuss in more detail in Section 4.6. The routine could also be used to proactively install a new translation in the TLB. However, Linux gives no indication whether the translation is needed for instruction execution or for a data access, so it would be unclear whether the translation should be installed in the instruction or data TLB (or both). Because of this uncertainty and because installing a translation always entails the risk of evicting another, perhaps more useful, TLB entry, it is better to avoid proactive installation of TLB entries.
4.4.3 Lazy TLB flushing
To avoid the need to flush the TLB on every context switch, Linux defines an interface that abstracts differences in how address-space numbers (ASNs) work on a particular platform. This interface is called the ASN interface (file include/asm/mmu_context.h), shown in Figure 4.34. Support for this interface is optional in the sense that platforms with no ASN support simply define empty functions for the routines in this interface and instead define flush_tlb_mm() in such a way that it flushes the entire TLB.
Figure 4.34. Kernel interface to manage address-space numbers.
Each mm structure contains a component called the mm context, which has a platform-specific type called the mm context type (mm_context_t in file include/asm/mmu.h). Often, this type is a single word that stores the ASN of the address space. However, some platforms allocate ASNs in a CPU-local fashion. For those, this type is typically an array of words such that the ith entry stores the ASN that the address space has on CPU i.
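To make this concrete, the two typical shapes of the mm context type are sketched below; the exact definition lives in the platform's include/asm/mmu.h, and NR_CPUS is the usual kernel constant for the maximum number of CPUs. The second typedef name is invented purely for illustration.

/* Typical shapes of mm_context_t (illustrative only). */
typedef unsigned long mm_context_t;        /* single ASN shared by all CPUs */

/* Alternative for platforms that allocate ASNs per CPU: */
typedef struct {
        unsigned long asn[NR_CPUS];        /* entry i holds the ASN on CPU i */
} mm_context_percpu_t;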
When we take a look at Figure 4.34, we see that it defines four routines. The first routine, init_new_context(), initializes the mm context of a newly created address space. It takes two arguments: a task pointer task and an mm structure pointer mm. The latter is a pointer to the new address space, and the former points to the task that created it. Normally, this routine simply clears the mm context to a special value (such as 0) that indicates that no ASN has been allocated yet. On success, the routine should return a value of 0.
The remaining routines all take one argument, an mm structure pointer mm that identifies the address space and, therefore, the ASN that is to be manipulated. The get_mmu_context() routine ensures that the mm context contains a valid ASN. If the mm context is already valid, nothing needs to be done. Otherwise, a free (unused) ASN needs to be allocated and stored in the mm context. It would be tempting to use the process ID (pid) as the ASN. However, this does not work because the execve() system call creates a new address space without changing the process ID.
Routine reload_context() is responsible for activating on the current CPU the ASN represented by the mm context. Logically, this activation entails writing the ASN to the CPU's asn register, but the exact details of how this is done are, of course, platform specific. When this routine is called, the mm context is guaranteed to contain a valid ASN.
Finally, when an ASN is no longer needed, Linux frees it by calling destroy_context(). This call marks the ASN represented by the mm context as available for reuse by another address space and should free any memory that may have been allocated in get_mmu_context(). Even though the ASN is available for reuse after this call, the TLB may still contain old translations with this ASN. For correct operation, it is essential that platform-specific code purges these old translations before activating a reused ASN. This is usually achieved by allocating ASNs in a round-robin fashion and flushing the entire TLB before wrapping around to the first available ASN.
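For reference, the four routines of Figure 4.34 correspond to declarations roughly like the following; the argument types are taken from the descriptions above and may differ in detail across kernel versions.

/* ASN interface (cf. Figure 4.34); a sketch based on the text. */
int  init_new_context(struct task_struct *task, struct mm_struct *mm);
void get_mmu_context(struct mm_struct *mm);
void reload_context(struct mm_struct *mm);
void destroy_context(struct mm_struct *mm);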
IA-64 implementation
Linux/ia64 uses region IDs to implement the ASN interface. The mm context in the mm structure consists of a single word that holds the region ID of the address space. A value of 0 means that no ASN has been allocated yet. The init_new_context() routine can therefore simply clear the mm context to 0.
The IA-64 architecture defines region IDs to be 24 bits wide, but, depending on CPU model, as few as 18 bits can be supported. For example, Itanium supports just the architectural minimum of 18 bits. In Linux/ia64, region ID 0 is reserved for the kernel, and the remaining IDs are handed out by get_mmu_context() in a round-robin fashion. After the last available region ID has been handed out, the entire TLB is flushed and a new range of available region IDs is calculated such that all region IDs currently in use are outside this range. Once this range has been found, get_mmu_context() continues to hand out region IDs in a round-robin fashion until the space is exhausted again, at which point the steps of flushing the TLB and finding an available range of region IDs are repeated.
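A heavily simplified sketch of this round-robin allocation is shown below. The variable names and the find_free_rid_range() helper are hypothetical, and the locking needed on MP machines is omitted; the sketch only illustrates the overall flow.

/* Simplified sketch of round-robin region ID allocation (not the real code). */
static unsigned long next_rid = 1;      /* region ID 0 is reserved for the kernel */
static unsigned long rid_limit;         /* end of the currently available range */

void get_mmu_context(struct mm_struct *mm)
{
        if (mm->context != 0)
                return;                                      /* already has a valid region ID */

        if (next_rid >= rid_limit) {
                flush_tlb_all();                             /* region ID space exhausted */
                find_free_rid_range(&next_rid, &rid_limit);  /* hypothetical helper */
        }
        mm->context = next_rid++;
}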
Region IDs are 18 to 24 bits wide but only 15 to 21 bits are effectively available for Linux. The reason is that IA-64 requires the TLB to match the region ID and the virtual page number (see Figure 4.29 on page 177) but not necessarily the virtual region number (vrn). Thus, a TLB may not be able to distinguish, e.g., address 0x2000000000000000 from address 0x4000000000000000 unless the region IDs in rr1 and rr2 are different. To ensure this, Linux/ia64 encodes vrn in the three least significant bits of the region ID.
Note that the region IDs returned by get_mmu_context() are shared across all CPUs. Such a global region ID allocation policy is ideal for UP and small to moderately large MP machines. A global scheme is advantageous for MP machines because it makes possible the use of the ptc.ga instruction to purge translations from all TLBs in the machine. On the downside, global region ID allocation is a potential point of contention and, perhaps worse, causes the region ID space to be exhausted faster the more CPUs there are in the machine. To see this, assume there are eight distinct region IDs and that a CPU on average creates a new address space once a second. With a single CPU, the TLB would have to be flushed once every eight seconds because the region ID space has been exhausted. In a machine with eight CPUs, each creating an address space once a second, a global allocation scheme would require that the TLB be flushed once every second. In contrast, a local scheme could get by with as little TLB flushing as the UP case, i.e., one flush every eight seconds. But it would have to use an interprocessor interrupt (IPI, see Chapter 8, Symmetric Multiprocessing) instead of ptc.ga to perform a global TLB flush (because each CPU uses its own set of region IDs). In other words, deciding between the local and the global scheme involves a classic tradeoff between smaller fixed overheads (ptc.ga versus IPI) and better scalability (the region ID space is exhausted at a rate proportional to the total versus the per-CPU rate at which new address spaces are created).
On IA-64, reload_context() has the effect of loading region registers rr0 to rr4 according to the value in the mm context. As explained earlier, the value actually stored in the region registers is formed by shifting the mm context value left by three bits and encoding the region's vrn value in the least significant bits.
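Under this encoding, the region registers could be reloaded as sketched below. The region register format also contains a preferred page size and a VHPT-enable bit, which are omitted here, and ia64_set_rr() is assumed to write the region register selected by the top three bits of its first argument.

/* Sketch: load rr0-rr4 with the region ID of the current address space,
 * with vrn encoded in the three least significant bits of the region ID.
 * Page-size and enable bits of the region register format are omitted. */
void reload_context(struct mm_struct *mm)
{
        unsigned long rid = mm->context;        /* allocated by get_mmu_context() */
        unsigned long vrn;

        for (vrn = 0; vrn <= 4; vrn++)          /* regions 5-7 belong to the kernel */
                ia64_set_rr(vrn << 61, (rid << 3) | vrn);
}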
The IA-64 version of destroy_context() does not need to do anything: get_mmu_context() does not allocate any memory, so no memory needs to be freed here. Similarly, the range of available region IDs is recalculated only after the existing range has been exhausted, so there is no need to flush old translations from the TLB here.