4.3 Page Tables
Linux maintains the page table of each process in physical memory and accesses a page table through the identity-mapped kernel segment. Because they are stored in physical memory, page tables themselves cannot be swapped out to disk. This means that a process with a huge virtual address space could run out of memory simply because the page table alone uses up all available memory. Similarly, if thousands of processes are running, the page tables could take up most or even all of the available memory, making it impossible for the processes to run efficiently. However, on modern machines, main memory is usually large enough to make these issues either theoretical or at most second-order. On the positive side, keeping the page tables in physical memory greatly simplifies kernel design because there is no possibility that handling a page fault would cause another (nested) page fault.
Each page table is represented as a multiway tree. Logically, the tree has three levels, as illustrated in Figure 4.16. As customary in this chapter, this figure shows the tree growing from left to right: at the first level (leftmost part of the figure), we find the global directory (pgd); at the second level (to the right of the global directory), we find the middle directories (pmd); and at the third level we find the PTE directories. Typically, each directory (node in the tree) occupies one page frame and contains a fixed number of entries. Entries in the global and middle directories are either not present or they point to a directory in the next level of the tree. The PTE directories form the leaves of the tree, and their entries consist of page-table entries (PTEs).
Figure 4.16. Linux page-table tree.
The primary benefit of implementing a page table with a multiway tree instead of a linear array is that the former takes up space that is proportional only to the virtual address space actually in use, instead of being proportional to the maximum size of the virtual address space. To see this, consider that with a 1-Gbyte virtual address space and 8-Kbyte pages, a linear page table would need storage for more than 131,000 PTEs, even if not a single virtual page was actually in use. In contrast, with a multiway tree, an empty address space requires only a global directory whose entries are all marked not present.
Another benefit of multiway trees is that each node (directory) in the tree has a fixed size (usually one page). This makes it unnecessary to reserve large, physically contiguous regions of memory, as would be required for a linear page table. Also, because physical memory is managed as a set of page frames anyway, the fixed node size makes it easy to build a multiway tree incrementally, as is normally done with demand paging.
Going back to Figure 4.16, we see that the size and structure of the directories is controlled by platform-specific constants. The number of entries in the global, middle, and PTE directories are given by constants PTRS_PER_PGD, PTRS_PER_PMD, and PTRS_PER_PTE, respectively. The global directory is special because often only part of it can be used to map user space (the rest is either reserved or used for kernel purposes). Two additional parameters define the portion of the global directory that is available for user space. Specifically, FIRST_USER_PGD_NR is the index of the first entry, and USER_PTRS_PER_PGD is the total number of global-directory entries available to map user space. For the middle and PTE directories, all entries are assumed to map user space.
So how can we use the page table to translate a virtual address to the corresponding physical address? Figure 4.17 illustrates this. At the top, we see that a virtual address, for the purpose of a page-table lookup, is broken up into multiple fields. The fields used to look up the page table are pgd index, pmd index, and pte index. As their names suggest, these fields are used to index the global, middle, and PTE directories, respectively. A page-table lookup starts with the page-table pointer stored in the mm structure of a process. In the figure, this is illustrated by the arrow labeled page-table pointer. It points to the global directory, i.e., the root of the page-table tree. Given the global directory, the pgd index tells us which entry contains the address of the middle directory. With the address of the middle directory, we can use the pmd index to tell us which entry contains the address of the PTE directory. With the address of the PTE directory, we can use the pte index to tell us which entry contains the PTE of the page that the virtual address maps to. With the PTE, we can calculate the address of the physical page frame that backs the virtual page. To finish up the physical address calculation, we just need to add the value in the offset field of the virtual address to the page frame address.
Figure 4.17. Translating a virtual address with the page table.
Note that the width of the pgd index, pmd index, and pte index fields is dictated by the number of pointers that can be stored in a global, middle, and PTE directory, respectively. Similarly, the width of the offset field is dictated by the page size. Because these fields by definition consist of an integral number of bits, the page size and the number of entries stored in each directory all must be integer powers of 2. If the sum of the widths of these fields is less than the width of a virtual address, some of the address bits remain unused. The figure illustrates this with the field labeled unused. On 32-bit platforms, there are usually no unused bits. However, on 64-bit platforms, the theoretically available virtual address space is so big that it is not unusual for some bits to remain unused. We discuss this in more detail when discussing the IA-64 implementation of the virtual memory system.
4.3.1 Collapsing page-table levels
At the beginning of this section, we said that the Linux page tables logically contain three levels. We say logically because each platform is free to implement fewer than three levels. This is possible because the interfaces that Linux uses to access and manipulate page tables have been carefully structured to allow collapsing one or even two levels of the tree into the global directory. This is an elegant solution because it allows platform-independent code to be written as if each platform implemented three levels, yet on platforms that implement fewer levels, the code accessing the unimplemented levels is optimized away completely by the C compiler. That is, no extra overhead results from the extraneous logical page-table levels.
The basic idea behind collapsing a page-table level is to treat a single directory entry as if it were the entire directory for the next level. For example, we can collapse the middle directory into the global directory by treating each global-directory entry as if it were a middle directory with just a single entry (i.e., PTRS_PER_PMD = 1). Later in this chapter, we see some concrete examples of how this works when a page table is accessed.
4.3.2 Virtually-mapped linear page tables
As discussed in the previous section, linear page tables are not very practical when implemented in physical memory. However, under certain circumstances it is possible to map a multiway tree into virtual space and make it appear as if it were a linear page table. The trick that makes this possible is to place a self-mapping entry in the global directory. The self-mapping entry, instead of pointing to a middle directory, points to the page frame that contains the global directory. This is illustrated in Figure 4.18 for the case where the global-directory entry with index 7 is used as the self-mapped entry (labeled SELF). Note that the remainder of the global directory continues to be used in the normal fashion, with entries that are either not mapped or that point to a middle directory.
Figure 4.18. Self-mapping entry in the global directory.
To make it easier to understand how this self-mapping works, let us consider a specific example: Assume that page-table entries are 8 bytes in size, pages are 8 Kbytes in size, and that the directories in the page tree all contain 1024 entries. These assumptions imply that the page offset is 13 bits wide and that the pgd, pmd, and pte indices are all 10 bits wide. The virtual address is thus 43 bits wide. A final assumption we need to make is that the format of the entries in the pgd and pmd directories is identical to the format used in the PTE directories.
Assuming the self-mapping entry has been installed in the global-directory entry with index SELF, we claim that the equations below for L3(va), L2(va), and L1(va) are the virtual addresses at which we can find, respectively, the PTE-directory entry, middle-directory entry, and global-directory entry that correspond to virtual address va:
L3(va) = SELF · 2^33 + 8 · ⌊va/2^13⌋
L2(va) = SELF · 2^33 + SELF · 2^23 + 8 · ⌊va/2^23⌋
L1(va) = SELF · 2^33 + SELF · 2^23 + SELF · 2^13 + 8 · ⌊va/2^33⌋
For example, if we assume that SELF = 1023 = 0x3ff, we could access the page-table entry for virtual address va = 0x80af3 at virtual address L3(va) = 0x7fe00000200.
The effect of the self-mapping entry can be observed most readily when considering how the above equations affect the virtual address. Figure 4.19 illustrates this effect. The first line shows the virtual address va broken up into the three directory indices va_pgd_idx, va_pmd_idx, va_pte_idx and the page offset va_off. Now, if we consider the effect of equation L3(va), we see that the page offset was replaced by 8·va_pte_idx (the factor of 8 comes from the fact that each page-table entry is assumed to be 8 bytes in size). Similarly, the pte index has been replaced with va_pmd_idx, and the pmd index has been replaced with va_pgd_idx. Finally, the pgd index has been replaced with SELF, the index of the self-mapping entry. When we look at the effects of L2(va) and L1(va), we see a pattern emerge: At each level, the previous address is shifted down by 10 bits (the width of an index) and the top 10 bits are filled in with the value of SELF.
Figure 4.19. Effect of self-mapping entry on virtual address.
Now, let us take a look at the operational effect of L1(va) when used in a normal three-level page-table lookup. We start at the global directory and access the entry at index SELF. This lookup returns a pointer to a middle directory. However, because this is the self-mapping entry, we really get a pointer to the global directory. In other words, the global directory now serves as a fake middle directory. So we use the global directory again to look up the pmd index, which happens to be SELF again. As before, this lookup returns a pointer back to the global directory, but this time the directory serves as a fake PTE directory. To complete the three-level lookup, we again use the global directory to look up the pte index, which again contains SELF. The entry at index SELF is now interpreted as a PTE. Because we are dealing with the self-mapping entry, it once again points to the global directory, which now serves as a fake page frame. The physical address that L1(va) corresponds to is thus equal to the address of the global directory plus the page offset of the virtual address, which is 8 · va_pgd_idx. Of course, this is exactly the physical address of the global-directory entry that corresponds to address va, just as we claimed!
For equations L2(va) and L3(va) the same logic applies: Every time SELF appears as a directory index, the final memory access goes one directory level higher than it normally would. That is, with SELF appearing once, we "remove" one directory level and thus access the PTE instead of the word addressed by va, and so on.
Another way to visualize the effect of the self-mapping entry is to look at the resulting virtual address space. Unfortunately, it would be impossible to draw the space to scale with the 1024-way tree we assumed so far. Thus, Figure 4.20 illustrates the address space that would result for a machine where each of the three indices is only 2 bits wide and the page offset is 4 bits wide. In other words, the figure illustrates the address space for a three-level 4-way tree with a 16-byte page size and page-table entries that are 4 bytes in size. The value of SELF is assumed to be 2; i.e., the second-last entry in the global directory is used for the self-mapping entry.
Figure 4.20. Effect of self-mapping entry on virtual address space.
The figure illustrates how the virtually-mapped linear page table occupies the address space from 0x200 to 0x2ff. Not surprisingly, this is the quarter of the address space that corresponds to the global-directory entry occupied by SELF. Most of the linear page table is occupied by page-table entries. However, its third quadrant (0x280 to 0x2bf) is occupied primarily by middle-directory entries. The global-directory entries can be found in the tiny section covering the address range from 0x2a0 to 0x2af. Note how the pmd (and pgd) entries create a hole in the linear page table that corresponds exactly to the hole that the page table creates in the virtual address space. This is necessarily so, because otherwise there would be virtual memory addresses for which there would be no PTE entries in the linear page table!
Note that with a more realistic 1024-way tree, the virtual address space occupied by the virtually-mapped linear page table is less than 0.1 percent of the entire address space. In other words, the amount of space occupied by the linear page table is minuscule.
Applications of virtually-mapped linear page tables
A virtually-mapped linear page table could be used as the primary means to manipulate page tables. Once the self-mapping entry has been created in the global directory, all other directory entries can be accessed through the virtual table. For example, to install the page frame of the middle directory for virtual address va, we could simply write the new global-directory entry to address L1(va). While this would work correctly, it is usually more efficient to walk the page table in physical space. The reason is that a virtual access involves a full page-table walk (three memory accesses), whereas a page-directory access in physical space takes just one memory access.
The true value of a linear virtual page table derives from the fact that because it is a linear table, special hardware that accelerates page-table lookups can easily be built for it. We see an example of this in Section 4.4.1.
4.3.3 Structure of Linux/ia64 page tables
On IA-64, the Linux kernel uses a three-level page-table tree. Each directory occupies one page frame and entries are 8 bytes in size. With a page size of 8 Kbytes, this implies that the page table forms a 1024-way tree. The global- and middle-directory entries contain the physical address of the next directory or 0 if the entry is not mapped. In contrast, the PTE-directory entries follow the PTE format that we describe in Section 4.3.4.
Figure 4.21 illustrates the user-space virtual address format. Corresponding to a page size of 8 Kbytes, the page offset occupies the lowest 13 bits. Similarly, because each directory level has 1024 ways, each index is 10 bits wide. As the figure shows, the PTE-directory index (pte index) occupies bits 13 to 22, and the middle-directory index (pmd index) occupies bits 23 to 32. What is unusual is that the global-directory index is split into a 7-bit low part (pgdl) and a 3-bit high part (pgdh) that covers bits 33 to 39 and 61 to 63, respectively. Bits 40 to 60 are unused and must be 0.
Figure 4.21. Format of user-space virtual address (with 8-Kbyte pages).
The global-directory index is split into two pieces to make it possible to map a portion of the virtual space of each IA-64 region. This provides room for future growth. For example, an application could place its data segment in region 3 and the stacks in region 4. With this arrangement, the starting addresses of the data segment and the stacks would be separated by 2^61 bytes, and the application would be virtually guaranteed that, no matter how much data it has to process in the future, its data segment will never collide with the stacks. Of course, the size of the data segment would still be limited by the amount of space that can be mapped within a region, but by using separate regions the application isolates itself from having to know this implementation detail. In other words, as future machines support more memory, future Linux kernels can implement a larger virtual address space per region, and the application will be able to take advantage of this larger space without requiring any changes.
It should be noted that the Linux kernel normally expects directory indices to take up consecutive bits in a virtual address. The exception to this rule is that discontiguities are permitted in the global-directory index provided that no vm-area structure maps an address range that would overlap with any of the unimplemented user-space areas. On IA-64, this means that all vm-areas must be fully contained within the first 2^40 bytes of a region (assuming 8-Kbyte pages, as usual). Linux/ia64 checks for this constraint and rejects any attempts to violate it.
Recall from Figure 4.11 on page 149 that regions 5 through 7 are reserved for kernel use. A process therefore is not permitted to map any addresses into user space for which pgdh would be greater than or equal to 5. The Linux/ia64 kernel enforces this by setting parameters FIRST_USER_PGD_NR to 0 and USER_PTRS_PER_PGD to 5 · 2^7 = 640. This implies that the top three octants of each global directory remain unused. Apart from wasting a small amount of memory, this does not cause any problems.
Table 4.1. Summary of Linux/ia64 virtual memory and page-table parameters.
| Parameter                | 4 Kbytes   | 8 Kbytes | 16 Kbytes | 64 Kbytes |
|--------------------------|------------|----------|-----------|-----------|
| total user address space | 320 Gbytes | 5 Tbytes | 80 Tbytes | 20 Pbytes |
| address space per region | 64 Gbytes  | 1 Tbyte  | 16 Tbytes | 4 Pbytes  |
| width of directory index | 9 bits     | 10 bits  | 11 bits   | 13 bits   |
| PTRS_PER_PGD             | 512        | 1024     | 2048      | 8192      |
| PTRS_PER_PMD             | 512        | 1024     | 2048      | 8192      |
| PTRS_PER_PTE             | 512        | 1024     | 2048      | 8192      |
| FIRST_USER_PGD_NR        | 0          | 0        | 0         | 0         |
| USER_PTRS_PER_PGD        | 320        | 640      | 1280      | 5120      |
Kernel page table
The kernel uses a separate page table to manage the page-table-mapped kernel segment. In contrast to user space, where there is one page table per process, this page table belongs to the kernel and is in effect independent of which process is running.
On IA-64, this segment is implemented in region 5; the structure of virtual addresses in this region is illustrated in Figure 4.22. We can see that it is identical to Figure 4.21, except that the global-directory index is no longer split into two pieces. Instead, it occupies bits 33 to 42. As required for region 5 addresses, the region number bits 61 to 63 have a fixed value of 5 or, in binary notation, 101.
Figure 4.22. Format of kernel virtual address for region 5 (with 8-Kbyte pages).
Summary
Table 4.1 summarizes key virtual-memory and page-table parameters as a function of the Linux/ia64 page size. With a page size of 8 Kbytes, the total mappable user address space is 5 Tbytes (five regions of 1 Tbyte each). Larger page sizes yield correspondingly larger spaces. With the smallest page size of 4 Kbytes, "only" 320 Gbytes of user address space are available. The other extreme is a page size of 64 Kbytes, which yields a mappable user address space of 20 Pbytes (20 · 2^50 bytes!). By almost any standard, this can only be described as huge.
Because Linux/ia64 uses the same index width in each of the three levels of the page table, the three parameters PTRS_PER_PGD, PTRS_PER_PMD, and PTRS_PER_PTE have the same value. With a page size of 8 Kbytes, they are all equal to 1024, and with a page size of 64 Kbytes, they increase to 8192. The value for USER_PTRS_PER_PGD simply reflects the fact that the last three-eighths of the address space is occupied by kernel space.
4.3.4 Page-table entries (PTEs)
Linux assumes that page-table entries (PTEs) have one of two possible logical formats, depending on whether or not a virtual page is present in physical memory. If it is present, the PTE has the logical format shown in Figure 4.23 (a). If a virtual page is not present, the PTE has the format shown in Figure 4.23 (b). Comparing the two, we see that the page present bit is common, and, as the name suggests, it determines which format is in effect. A value of 1 indicates that the PTE has the present format, and a value of 0 indicates the PTE has the not present format.
Figure 4.23. Logical format of a page-table entry (PTE).
Let us first take a closer look at Figure 4.23 (a). As it shows, the bit next to the present bit is the dirty (D) bit. It indicates whether there has been a write access to the virtual page since the bit was cleared last. The next bit is called the accessed (A) bit. It indicates whether the virtual page has been accessed at all (read, written, or executed) since the bit was cleared last. As explained earlier, the A and D bits are related in that they both play a role in implementing the page replacement and writeback policies. The next three bits are the permission bits; they control whether read (R), write (W), or execute (X) accesses to the page are permitted. A value of 1 indicates that accesses of that type are permitted, and a value of 0 prohibits such accesses. The permission bits are followed by a related bit, the user (U) bit. This bit controls whether the page is accessible from the user level. If it is 0, only the kernel can access the page. If it is 1, both the kernel and the user level can access the page in accordance with the permission bits. The last field in the Linux PTE contains the page frame number (PFN). The width of this field is platform specific, since it depends on the maximum amount of physical memory that the platform can support.
If the present bit is off, the PTE has the much simpler format shown in Figure 4.23 (b). Apart from the present bit, there are just two fields: TYPE and OFFSET. These fields record the disk location where a virtual page has been stored. Specifically, TYPE specifies the swap device (or swap file) and OFFSET specifies the swap block number. The TYPE field must be 8 bits wide, which implies that up to 256 distinct swap devices or files can be supported. There is no minimum width for the OFFSET field; Linux uses whatever number of bits are available on a particular platform. However, since its width determines the maximum block number, OFFSET determines the maximum size of a swap device or file, so it is generally a good idea to make this field as wide as possible.
The PTE manipulation interface
It is important to realize that the two formats depicted in Figure 4.23 are only logical formats. Linux never directly manipulates fields in a PTE. Instead, it manipulates them indirectly through the PTE manipulation interface (file include/asm/pgtable.h) shown in Figure 4.24. This interface gives each platform the flexibility to map the logical PTE format to a real format that best suits the requirements of the platform.
Figure 4.24. Kernel interface to manipulate page-table entries (PTEs).
The interface defines three types, called pte_t, swp_entry_t, and pgprot_t. The first one is an opaque type that represents a PTE in the platform-specific format. The second type, swp_entry_t, represents PTE values for which the present bit is off. The reason Linux requires a separate type for this case is that limitations in the page-out code require that such PTEs fit into a variable of type unsigned long. No such limitation exists for PTEs that have the present bit on, so pte_t could be a much larger type. However, the two types are related in the sense that it must be possible to convert a value of type swp_entry_t to a value of type pte_t without loss of information. The reverse also must be true for PTE values that have the present bit off.
The third type, pgprot_t, is also a platform-specific opaque type and represents the PTE bits other than the PFN field. Its name derives from the fact that values of this type contain the page permission bits. However, a platform is free to represent any other PTE bits in this type, so it generally contains more than just the permission bits. Associated with pgprot_t are several manifest constants that define the fundamental permission-bit settings that the Linux kernel uses. Table 4.2 describes the available constants. The first column gives the name of the constant, the second column provides the logical permission bit settings (R, W, and X) as well as the value of the U bit that the constant encodes (see Figure 4.23), and the third column describes how the constant is used by the kernel. Constants whose names begin with PAGE_ represent protection values that are commonly used throughout the kernel. In contrast, the _Pxwr and _Sxwr constants translate the protection and sharing attributes of the mmap() system call to suitable protection values. Specifically, if the MAP_PRIVATE flag is specified in the system call, modifications to the mapped area should remain private to the process and Linux uses one of the _Pxwr constants. If MAP_PRIVATE is not specified, the kernel uses one of the _Sxwr constants instead. The specific constant used depends on the protection attributes specified in the system call. In the name of the constant, r corresponds to PROT_READ, w to PROT_WRITE, and x to PROT_EXEC. For example, for an mmap() system call with MAP_PRIVATE, PROT_READ, and PROT_WRITE specified, the kernel would use _P011. As the table illustrates, the major difference between the _Pxwr and the _Sxwr constants is that the former always have the W bit turned off. With write permission turned off, Linux can use the copy-on-write scheme to copy only those privately mapped pages that really are modified by the process.
Comparing PTEs
The first routine in the interface is pte_same(), which compares two PTEs for equality. The routine expects two arguments pte1 and pte2 of type pte_t and returns a non-zero value if the two PTEs are identical.
Table 4.2. Page protection constants.
Constant |
UXWR |
Description |
PAGE_NONE |
10 0 0 |
Protection value to use for a page with no access rights. |
PAGE_READONLY |
10 0 1 |
Protection value to use for a read-only page. |
PAGE_SHARED |
10 1 1 |
Protection value to use for a page where modifications are to be shared by all processes that map this page. |
PAGE_COPY |
11 0 1 |
Protection value to use for a copy-on-write page. |
PAGE_KERNEL |
01 1 1 |
Protection value to use for a page that is accessible by the kernel only. |
_Pxwr |
1 x 0 r |
Protection value to use on a page with access rights xwr when modifications to the page should remain private to the process. There are eight such constants, one for each possible combination of the execute (x), write (w), and read (r) bits. For example, constant _P001 defines the value to use for a page that is only readable. |
_Sxwr |
1xw r |
Protection value to use on a page with access rights xwr when modifications to the page should be shared with other processes. There are eight such constants, one for each possible combination of the xwr access bits. |
Converting between page frame descriptors and PTEs
The second set of routines in the interface provides the ability to convert page frame pointers to PTEs, and vice versa. Routine mk_pte() expects two arguments: a pointer page to a page frame descriptor and a protection value prot that specifies the page protection as a platform-specific pgprot_t value. The routine calculates the page frame number of page and combines it with prot to form a PTE that is returned in the form of a value of type pte_t. Routine mk_pte_phys() provides the same functionality, except that the first argument is a physical address instead of a page frame descriptor. This routine is used to map a page frame for which there is no page frame descriptor, a case that usually arises for memory-mapped I/O pages. The pte_page() routine provides the reverse operation. It expects the single argument pte, which must be a PTE, and returns a pointer to the page frame descriptor that corresponds to pte. The operation of this routine is undefined if pte maps a physical address for which no page frame descriptor exists.
Manipulating the accessed/dirty bits
The third set of routines provides the ability to manipulate and test the accessed (A) and dirty (D) bits in the PTE. They all expect the single argument pte, which must be a PTE of type pte_t. The routines that modify the PTE also return a value of that type. The routines that test a bit in the PTE simply return the current value of the bit. The pte_mkold() routine clears the A bit in the PTE, and pte_mkyoung() sets it. Similarly, pte_mkclean() clears the D bit in the PTE, and pte_mkdirty() sets it. The pte_young() and pte_dirty() routines can be used to query the current settings of the A and D bits, respectively.
Manipulating protection bits
The fourth set of routines provides the ability to manipulate the protection bits in a PTE. The first argument to these routines is always a PTE of type pte_t. Those routines that modify the PTE also return a value of that type, whereas the routines that test a particular permission bit return a non-zero value if the bit is set and 0 if the bit is cleared. The pte_modify() routine can be used to replace the protection bits with the value passed in the second argument, prot. This argument must be of type pgprot_t and contain a protection value in the platform-specific format. The pte_wrprotect() routine can be used to clear the W bit in a PTE, and pte_mkwrite() can be used to set it. Similarly, pte_mkexec() can be used to set the X bit. No routines directly manipulate the R bit. The current setting of the R, W, and X permission bits can be queried with pte_read(), pte_write(), and pte_exec(), respectively.
Modifying PTEs atomically
The fifth set provides routines for atomically modifying existing PTEs. These routines need to be used to avoid potential race conditions when PTEs that are marked present in a page table are modified. For example, consider a machine with two CPUs. If one CPU performs a write access to a page just as the other CPU attempts to clear the dirty bit, the bit could end up with the wrong value if it is not cleared atomically. To avoid such problems, the interface provides ptep_test_and_clear_dirty() to atomically read and then clear the D bit. The routine expects one argument, pte_ptr, which must be a pointer to the PTE to be modified. The return value is non-zero if the D bit was set and 0 otherwise. Similarly, ptep_test_and_clear_young() can be used to atomically test and clear the A bit. The ptep_mkdirty() and ptep_set_wrprotect() routines are atomic variants of pte_mkdirty() and pte_wrprotect(), respectively. The last atomic routine, ptep_get_and_clear(), is a catch-all routine in the sense that it can be used to atomically modify any bits in the PTE. It works by atomically reading and then clearing the PTE pointed to by pte_ptr. It is a catch-all because once the PTE is cleared, its present bit is off and all future accesses by other CPUs will cause a page fault. In the page fault handler, the other CPUs will have to follow the normal MP-locking protocol to avoid any potential race conditions. In other words, the other routines to atomically modify PTEs are not strictly needed: they could all be emulated with ptep_get_and_clear(). However, providing specialized routines for the most common operations makes sense because such routines are generally more efficient than implementations based on ptep_get_and_clear().
Manipulating swap entries
The final set of PTE-related routines provides the means to support moving virtual pages to swap space and back. Specifically, pte_to_swp_entry() converts a PTE to the corresponding swap entry. The PTE is specified by argument pte, and the return value is the corresponding swap entry. This operation is meaningful only if pte has the present bit turned off, i.e., the PTE must already be in the format illustrated in Figure 4.23 (b). The swp_entry_to_pte() routine provides the reverse operation and translates the swap entry passed in argument swp_entry to an equivalent PTE. The TYPE and OFFSET fields can be extracted from a swap entry with SWP_TYPE() and SWP_OFFSET(), respectively. Conversely, SWP_ENTRY() can be used to construct a swap entry in accordance with the values of the type and offset arguments.
Note that no routine directly translates a PTE in the present format to a PTE in the not present format. Instead, Linux uses the following canonical sequence to move a page out to swap space:
1. Call ptep_get_and_clear() to read the old PTE and then clear it.
2. Reserve swap space for the page; the swap space location reserved determines the value of the TYPE and OFFSET fields.
3. Write the page frame to swap space.
4. Build the swap entry by calling SWP_ENTRY().
5. Call swp_entry_to_pte() to translate the swap entry into the corresponding PTE.
6. Install the new PTE in the page table.
Similarly, when a page is moved from disk to memory, the corresponding steps happen in reverse. In other words, when a virtual page is moved from memory to disk, or vice versa, the old PTE is essentially "thrown away" and a new one is constructed from the ground up, which is why there is no need for routines that explicitly manipulate the present bit.
IA-64 implementation
Figure 4.25 illustrates the PTE format used by Linux/ia64. To help distinguish the logical PTE format in Figure 4.23 from the IA-64-specific format shown here, the figure uses lowercase names for the field names. The figure illustrates that there is a present bit p, an accessed bit a, a dirty bit d, and a page frame number field pfn just as in the logical format. These bits serve exactly the same purpose as the corresponding bits in the logical format.
Figure 4.25. Format of IA-64 PTE when present bit is set (p = 1).
The IA-64 PTE format does not have separate permission bits. Instead, it uses the 2-bit privilege-level field pl and the 3-bit access-rights field ar to determine the protection status of a page. IA-64 supports four privilege levels, but Linux uses only two of them: 0 is the most-privileged level and is used for accesses by the kernel; 3 is the least-privileged level and is used for user accesses. Table 4.3 shows how the different values of ar and pl affect the rights for kernel and user-level accesses. Note that for the kernel, the access rights are determined solely by ar; pl has no influence. For user-level accesses, pl = 0 generally means that the page is owned by the kernel and therefore inaccessible. However, something interesting happens when ar = 7. In the table, this entry is marked XP0, which should be read as execute, promote to privilege-level 0. This means that the page is executable even at the user level. Furthermore, if an epc (enter privileged code) instruction is executed in such a page, the privilege level is raised to privilege-level 0 (kernel). This mechanism can be used to implement system calls and, as we see in Chapter 5, Kernel Entry and Exit, is also useful for signal handling. The gate page described in Section 4.2.4 is mapped in this fashion, and the IA-64-specific constant PAGE_GATE can be used to map other pages in this way.
Table 4.3. IA-64 page access rights.
ar    kernel    user, pl = 0    user, pl = 3
 0    R         -               R
 1    RX        -               RX
 2    RW        -               RW
 3    RWX       -               RWX
 4    RW        -               R
 5    RWX       -               RX
 6    RW        -               RWX
 7    RX        XP0             X
Another interesting aspect of the IA-64 protection scheme is that there is no way to grant write permission without also granting read permission. Fortunately, for the R and X permission bits, Linux allows platform-specific code to grant the access right even if it was not set in the logical protection value (this is not true for the W bit, of course). Thus, Linux/ia64 simply maps logical protection values that have only the W bit set to privilege-level and access-right values that grant both read and write permission.
The IA-64 PTE defines two fields, ma and ed, that have no equivalent in the logical format. The ma field specifies the memory attribute of the page. For all normal memory pages, this value is 0, indicating that accesses to the page are cacheable. However, to support memory-mapped I/O, this field can be used to select different types of uncached accesses. For example, ma = 4 marks the page as uncacheable. A value of ma = 6 marks the page as write coalescing, which is similar to uncacheable, except that consecutive writes can be coalesced into a single write transaction. As a special case, ma = 7 marks a page as a NaT Page. Speculative loads from such a page will deposit a NaT token in the target register. Other memory accesses, such as nonspeculative loads or stores, will trigger a NAT PAGE CONSUMPTION FAULT. The ed bit is also related to speculation. It is meaningful only for pages that are executable. For such pages, the bit indicates the speculation model that is in effect. A value of 1 indicates the recovery model, and a value of 0 indicates the no-recovery model. See the discussion on page 199 for more details on the speculation model. Linux supports only the recovery model, so this bit should always be 1.
Finally, note that some bits in the IA-64 PTE are marked rv. These bits are reserved for future architectural uses. Attempting to set them will result in a RESERVED REGISTER/FIELD FAULT when the PTE is loaded into the CPU. In contrast, the most significant 11 bits of the PTE are defined as ignored. These bits are completely ignored by the CPU and are available for use by the operating system. Linux/ia64 uses bit 63 to mark a page that has no access rights. Because such a page is inaccessible, there is no need to back it with a physical page frame. Thus, such a page can be considered to be present even if the p bit in the PTE is turned off.
Figure 4.26 shows the IA-64 PTE format when the present bit p is cleared. In this case, the bits other than the present bit are all ignored by the CPU and are available for use by the operating system. Linux uses bits 1 through 8 to encode the swap entry TYPE field and bits 9 through 62 to encode the swap entry OFFSET field. Bit 63 is again used to mark a page with no access rights. Note that if this bit is set, Linux considers the page present even though bit p is cleared.
Figure 4.26. Format of IA-64 PTE when present bit is cleared (p = 0).
Given the IA-64 PTE format, it is clear that types pte_t and pgprot_t must be wide enough to hold a 64-bit word. They could be defined as being equivalent to unsigned long. However, to take advantage of C type-checking rules, both types are declared as separate structures that contain just one 64-bit word each. This is a compiler trick that makes it easier to detect programming errors early on because passing a PTE where a protection value is expected (or vice versa) results in a type violation that can be reported by the C compiler.
The IA-64 implementations of the many PTE manipulation routines defined in Figure 4.24 are straightforward; most of them involve just simple bit manipulation in PTE or protection values. The PTE manipulation routines that need to execute atomically are all implemented with the cmpxchg8 instruction that we already encountered in Chapter 3, Processes, Tasks, and Threads.
4.3.5 Page-table accesses
Just like PTEs, Linux never directly manipulates directory entries. Instead, it manipulates them through the interface shown in Figure 4.27. In this interface, all virtual address arguments are called addr and must be values of type unsigned long. Global- and middle-directory entries are represented by types pgd_t and pmd_t, respectively. As we have seen already, PTE-directory entries are represented by type pte_t. All types usually occupy one machine word, but for some platforms, they are more complex structures occupying multiple words.
Figure 4.27. Kernel interface to access page tables.
The first routine in this interface, pgd_index(), simply extracts the global-directory index from virtual address addr and returns it.
The pgd_offset() routine looks up the global-directory entry for a user-space address. Argument mm specifies the mm structure (address space), and addr specifies the virtual address to look up. Similarly, pgd_offset_k() can be used to look up the global-directory entry for a kernel-space address. Because there is only one kernel page table, this routine needs just one argument: the virtual address addr, which must point inside the page-table-mapped kernel segment.
Once the global-directory entry has been found, pmd_offset() can be used to find the middle-directory entry of a virtual address. This routine expects two arguments, namely, pgd_entry_ptr, which must be a pointer to the global-directory entry found previously, and addr, the virtual address being looked up. This routine can be used both for user-space and kernel-space addresses.
Once the middle-directory entry has been found, pte_offset() can be used to find the PTE-directory entry of a virtual address. This routine again expects two arguments, namely, pmd_entry_ptr, which must be a pointer to the middle-directory entry found previously, and addr, the virtual address being looked up. This routine also can be used both for user-space and kernel-space addresses.
The final eight routines in the interface check the validity of a directory entry. The 3-character prefix of the routine name indicates the page-table level at which the check is used (pgd for global-, pmd for middle-, and pte for PTE-directory entries). Routines whose names end in none check whether the entry corresponds to a mapped portion of the address space. Routines whose names end in present check whether the entry corresponds to a portion of the address space that is present in physical memory. Because Linux page tables are not paged themselves, they are never stored in the swap space, and the pgd_present() and pmd_present() routines always return the complement of pgd_none() and pmd_none() (an entry is present if it is mapped, and vice versa). Routines whose names end in bad check whether the entry is invalid. These routines are intended to make it easier to detect page-table corruption as early as possible but are not strictly required for the correct operation of Linux. Thus, on platforms that cannot (easily) distinguish valid and invalid entries, these routines can always return false. Also, note that pte_bad() does not exist, so there is no means to check whether a PTE is invalid.
The page-table access interface has been designed to support two primary access methods: address-driven and range-driven accesses. The former starts with a virtual address and, level by level, determines the global-, middle-, and PTE-directory entries that correspond to this address. In contrast, range-driven accesses visit all directory entries that fall into a certain address range. It is best to illustrate each method with an example. We do so next.
Example: Printing physical address of a virtual address
A good way to illustrate address-driven accesses is to look at how a user-space virtual address can be translated into the corresponding physical address. The code below implements a function called print_pa() that accomplishes this translation. It expects two arguments: a pointer, mm, to an mm structure and a virtual address, va. The function attempts to translate the virtual address by using the page table associated with mm. If the translation exists, the function prints the virtual address and the physical address it maps to. If no translation exists, nothing is printed.
print_pa (struct mm_struct *mm, unsigned long va)
{
        pgd_t *pgd = pgd_offset(mm, va);

        if (pgd_present(*pgd)) {
                pmd_t *pmd = pmd_offset(pgd, va);
                if (pmd_present(*pmd)) {
                        pte_t *pte = pte_offset(pmd, va);
                        if (pte_present(*pte))
                                printk("va 0x%lx -> pa 0x%lx\n", va,
                                       page_address(pte_page(*pte)));
                }
        }
}
Taking a closer look at the function, we see that it first uses pgd_offset() to obtain a pointer pgd to the global-directory entry for the virtual address. Then it uses pgd_present() to check whether the entry is present, i.e., to check whether the middle directory exists. If it exists, the function descends to the next level of the page-table tree where it uses pmd_offset() and pmd_present() to accomplish the equivalent steps for the middle directory. The page-table traversal is completed by a call to pte_offset(). This call accesses the middle directory and returns a pointer to the PTE. The function uses pte_present() to check whether the PTE maps a page that is present in physical memory. If so, it uses pte_page() to translate the PTE into the corresponding page frame descriptor. In the final step, the function uses page_address() to obtain the physical address of the page frame and prints both the virtual and physical addresses with a call to printk().
Let us emphasize that the above three-level lookup works correctly regardless of how many levels are actually implemented on a platform. For example, on a platform that has the middle level collapsed into the global level, pmd_offset() would return the value of pgd after casting it to a pointer to a middle-directory entry (pmd_t), and routine pmd_present() would always return true. This behavior not only ensures proper operation but also lets the compiler optimize away the middle-directory access completely. Thus, even though logically a three-level lookup is being used, it would perform exactly like a two-level lookup.
Example: Counting the number of present pages
The kernel uses the range-driven access method when it needs to perform a certain operation on the directory entries that correspond to a given address range. For example, in response to a munmap() system call, the kernel needs to remove the page-table directory entries that fall in the range specified by the system call.
To see how this access method works, let us consider the simple example of counting the number of pages that are present in a given address range. This is what the count_pages() function below implements. It expects three arguments: a pointer, mm, to an mm structure and two user-space addresses, start and end, that specify the starting and ending point of the address range in which to count present pages (end points to the first byte beyond the desired range; both start and end are assumed to be global-directory aligned).
count_pages (struct mm_struct *mm, unsigned long start, unsigned long end)
{
        int i, j, k, num_present = 0;
        pgd_t *pgd = pgd_offset(mm, 0);

        for (i = pgd_index(start); i < pgd_index(end); ++i)
                if (pgd_present(pgd[i])) {
                        pmd_t *pmd = pmd_offset(pgd + i, 0);

                        for (j = 0; j < PTRS_PER_PMD; ++j)
                                if (pmd_present(pmd[j])) {
                                        pte_t *pte = pte_offset(pmd + j, 0);

                                        for (k = 0; k < PTRS_PER_PTE; ++k)
                                                if (pte_present(pte[k]))
                                                        ++num_present;
                                }
                }
        printk("%d pages present\n", num_present);
}
The function calls pgd_offset() to obtain a pointer pgd to the global directory. The first argument to this function is the mm pointer as usual. However, the second argument is 0. This causes pgd_offset() to return a pointer to the very first entry in the global directory. Since we know that the directory contains PTRS_PER_PGD entries, we can equally well treat this value as a pointer to the global directory itself and use array indexing to access particular entries. This is what the first for-loop does: It iterates variable i over the global-directory indices that correspond to the starting and ending addresses. Note how pgd_index() is used to extract these indices from the corresponding addresses. In the loop body, pgd_present() checks whether the ith entry in the directory is present. If not, the entry is skipped. If it is present, the function descends to the next level of the tree (the middle directory). Here the same basic steps are repeated: pmd_offset() obtains a pointer, pmd, to the middle directory, and a for-loop iterates over each index j in the middle directory (0 to PTRS_PER_PMD - 1). In the third level, the analogous steps are repeated for the PTE directory. In the body of the third for-loop, the function iterates k over the indices in the PTE directory (0 to PTRS_PER_PTE - 1); if the indexed entry refers to a page frame that is present in physical memory, the function increments the counter num_present. After iterating over the entire address range, the function prints the resulting page count by calling printk().
There are two points worth emphasizing: First, just as in the previous example, the above code works properly (and efficiently) when one or more levels of the page-table tree are collapsed into a higher level of the tree; second, note that given the global-, middle-, and PTE-directory indices i, j, k, there is no platform-independent way to construct a virtual address that would yield these indices. Thus, in the above example it would not be possible, e.g., to print the virtual address of each page that has been found to be present. There is nothing fundamentally difficult about providing such an operation, but because Linux does not need it, there is little point in defining it.
IA-64 implementation
Given the IA-64 page-table structure described in Section 4.3.3, it is straightforward to see how the page-table access interface is implemented. In fact, all operations are so simple that they are implemented either as macros or as inlined functions. The directory entry types (pgd_t, pmd_t, and pte_t) can all be thought of as being equivalent to a 64-bit word. However, to take advantage of C type-checking rules, each type has its own structure declaration. This ensures that if, say, a pmd entry is accidentally used when a pte entry is expected, the C compiler can detect the problem and issue a suitable error message.
The IA-64 implementation of pgd_index() simply extracts the pgdh and pgdl components (see Figure 4.21 on page 159) from the virtual address, concatenates them, and returns the result. Similarly, the implementations of pgd_offset(), pgd_offset_k(), pmd_offset(), and pte_offset() extract the relevant address bits from the virtual address to form an index and then use this value to index the directory identified by the first argument passed to these routines.
In all three levels of the directory, Linux/ia64 uses a value of 0 to indicate an entry that is not mapped. Thus, routines pgd_none(), pmd_none(), and pte_none() simply test whether the directory entry passed to the routine is equal to 0. Similarly, pgd_present() and pmd_present() test whether the directory entry is not 0. Because virtual pages can be paged out to disk, pte_present() is different and checks the present bit in the PTE. As a special case, a virtual page with no access rights is always considered present. Finally, on IA-64 the pgd_bad() and pmd_bad() routines verify the validity of a directory entry by checking whether they contain a valid physical address. Specifically, they check whether any of the unimplemented physical address bits are non-zero and return true if so.
4.3.6 Page-table directory creation
Now that we understand how page tables can be accessed, we are ready to explore how they can be created in the first place. Figure 4.28 shows the interface Linux uses to abstract platform differences in how this creation is accomplished. The first three routines can be used to create directories at each of the three levels of the page-table tree. Specifically, pgd_alloc() creates a global directory, pmd_alloc_one() creates a middle directory, and pte_alloc_one() creates a PTE directory. The routines return a pointer to the newly created directory or NULL if they fail for any reason. The newly created directory is initialized such that its entries are marked as not mapped. The memory needed for the directory is normally obtained from the page allocator. This implies that if the page allocator is short on memory, the routines can block execution temporarily.
Figure 4.28. Kernel interface to create page tables.
Once a directory is no longer needed, it can be freed by a call to one of pgd_free(), pmd_free(), or pte_free(), depending on whether it is a global, middle, or PTE directory, respectively. To reduce allocation and deallocation overheads, most platforms maintain a cache of free directories. This is usually done by implementing these routines such that they add the newly freed directory to a list of unused directories, instead of returning their memory to the page allocator. Access to this cache is provided through two separate allocation routines: pmd_alloc_one_fast() and pte_alloc_one_fast(). These work exactly like the normal middle- and PTE-directory allocation routines, except that they guarantee not to block execution. The cache is particularly effective because it avoids not just the normal page allocator overheads but also the overhead of initializing the directory. To see this, consider that before a directory is freed, Linux removes all existing mappings from it, meaning that by the time pgd_free(), pmd_free(), or pte_free() is called, the directory is guaranteed to have all entries marked as not mapped. Thus, when a directory is taken from the cache, it is already initialized and can be returned directly, without the need to clear its content again. While the cache generally improves performance, it could create problems if it grew too large. To prevent this, the interface defines the do_check_pgt_cache() routine to make it possible to reduce or limit the size of the cache. The routine takes two arguments, low and high, which are both page counts that are interpreted as follows: if the cache occupies more than high pages, then the routine should free up directories until the cache occupies no more than low pages.
The final six routines set or clear directory entries. The pgd_populate() routine sets the pmd identified by argument pmd_entry_ptr as the new value for the pgd entry pointed to by argument pgd_entry_ptr. In contrast, pgd_clear() marks the entry pointed to by pgd_entry_ptr as not mapped. After this operation, pgd_none() would return true for this entry. The pmd_populate(), pmd_clear(), set_pte(), and pte_clear() routines implement the equivalent operations for middle- and PTE-directory entries. Backward compatibility is the only reason why set_pte() is not called pte_populate().
IA-64 implementation
The IA-64 implementation of the page-table creation interface is straightforward, largely because each directory occupies one page frame. Memory management of directory nodes is therefore particularly easy because the normal page allocator can be used to allocate and free directories. Like most other platforms, Linux/ia64 implements the nonblocking versions of the directory allocation routines by maintaining a cache of unused directories.