Page Table Management
Linux layers the machine independent/dependent layer in an unusual manner in comparison to other operating systems [CP99]. Other operating systems have objects that manage the underlying physical pages, such as the pmap object in BSD. Linux instead maintains the concept of a three-level page table in the architecture-independent code even if the underlying architecture does not support it. Although this is conceptually easy to understand, it also means that the distinction between different types of pages is very blurry, and page types are identified by their flags or what lists they exist on rather than the objects they belong to.
Architectures that manage their Memory Management Unit (MMU) differently are expected to emulate the three-level page tables. For example, on the x86 without PAE enabled, only two page table levels are available. The Page Middle Directory (PMD) is defined to be of size 1 and "folds back" directly onto the Page Global Directory (PGD), so the PMD level is optimized out at compile time. Unfortunately, for architectures that do not manage their cache or Translation Lookaside Buffer (TLB) automatically, architecture-dependent hooks have to be explicitly left in the code for when the TLB and CPU caches need to be altered and flushed, even if they are null operations on some architectures like the x86. These hooks are discussed further in Section 3.8.
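The PMD folding on a two-level x86 can be pictured with a small user-space sketch. The types and the sketch_ names are hypothetical stand-ins, and the shift values are assumptions for an x86 without PAE; the real definitions live in the architecture-specific headers and differ in detail.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's entry types; user-space sketch only. */
typedef struct { uint32_t pgd; } sketch_pgd_t;
typedef struct { uint32_t pmd; } sketch_pmd_t;

/* Assumed values for x86 without PAE: the PMD covers the same bits as the PGD. */
#define SKETCH_PGDIR_SHIFT  22
#define SKETCH_PMD_SHIFT    22
#define SKETCH_PTRS_PER_PGD 1024
#define SKETCH_PTRS_PER_PMD 1

/* Select the PGD entry that covers a linear address. */
static inline sketch_pgd_t *sketch_pgd_offset(sketch_pgd_t *pgd_base, uint32_t addr)
{
    return pgd_base + ((addr >> SKETCH_PGDIR_SHIFT) & (SKETCH_PTRS_PER_PGD - 1));
}

/*
 * With a one-entry PMD there is nothing to index. The "middle" lookup hands
 * back the PGD entry itself, viewed as a PMD entry. Because this is only a
 * cast, the compiler can inline the whole step away.
 */
static inline sketch_pmd_t *sketch_pmd_offset(sketch_pgd_t *pgd_entry, uint32_t addr)
{
    (void)addr; /* unused: a one-entry PMD has no index bits of its own */
    return (sketch_pmd_t *)pgd_entry;
}
```

Architecture-independent code written this way can always perform three lookups, and on a two-level machine the middle one costs nothing.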
This chapter begins by describing how the page table is arranged and what types are used to describe the three separate levels of the page table. Next is how a virtual address is broken up into its component parts for navigating the table. After this is covered, I discuss the lowest-level entry, the PTE, and what bits are used by the hardware. After that, the macros used for navigating a page table and for setting and checking attributes are discussed, followed by how the page table is populated and how pages are allocated and freed for use with page tables. The initialization stage is then discussed, which shows how the page tables are initialized during bootstrapping. Finally, I cover how the TLB and CPU caches are utilized.
3.1 Describing the Page Directory
Each process has a pointer (mm_struct→pgd) to its own PGD, which is a physical page frame. This frame contains an array of type pgd_t, which is an architecture-specific type defined in <asm/page.h>. The page tables are loaded differently depending on the architecture. On the x86, the process page table is loaded by copying mm_struct→pgd into the cr3 register, which has the side effect of flushing the TLB. In fact, this is how the function __flush_tlb() is implemented in the architecture-dependent code.
Each active entry in the PGD table points to a page frame containing an array of PMD entries of type pmd_t, which in turn point to page frames containing PTEs of type pte_t, which finally point to page frames containing the actual user data. In the event that the page has been swapped out to backing storage, the swap entry is stored in the PTE and used by do_swap_page() during a page fault to find the swap entry containing the page data. The page table layout is illustrated in Figure 3.1.
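The chain of lookups from PGD to PMD to PTE to page can be modeled in user space. The following is a minimal sketch under stated assumptions, not kernel code: every toy_ name is hypothetical, entries hold plain C pointers instead of physical frame addresses with flag bits, and the 2/9/9/12 bit split is chosen only so that all three levels are real.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed toy layout: 2 + 9 + 9 index bits and a 12-bit page offset. */
#define TOY_PAGE_SHIFT   12
#define TOY_PMD_SHIFT    21
#define TOY_PGDIR_SHIFT  30
#define TOY_PTRS_PER_PTE 512
#define TOY_PTRS_PER_PMD 512
#define TOY_PTRS_PER_PGD 4

/* Hypothetical entry types: each holds a plain pointer, NULL when absent. */
typedef struct { char *page; }      toy_pte_t; /* frame holding the data */
typedef struct { toy_pte_t *ptes; } toy_pmd_t; /* frame holding a PTE array */
typedef struct { toy_pmd_t *pmds; } toy_pgd_t; /* frame holding a PMD array */

static inline unsigned toy_pgd_index(uint32_t a)
{
    return (a >> TOY_PGDIR_SHIFT) & (TOY_PTRS_PER_PGD - 1);
}

static inline unsigned toy_pmd_index(uint32_t a)
{
    return (a >> TOY_PMD_SHIFT) & (TOY_PTRS_PER_PMD - 1);
}

static inline unsigned toy_pte_index(uint32_t a)
{
    return (a >> TOY_PAGE_SHIFT) & (TOY_PTRS_PER_PTE - 1);
}

/*
 * Map addr to page, allocating lower-level tables on demand.
 * Returns 0 on success, -1 if an allocation fails.
 * Tables are never freed in this sketch.
 */
static inline int toy_map(toy_pgd_t *pgd, uint32_t addr, char *page)
{
    toy_pgd_t *pgd_e = &pgd[toy_pgd_index(addr)];
    if (!pgd_e->pmds) {
        pgd_e->pmds = calloc(TOY_PTRS_PER_PMD, sizeof(toy_pmd_t));
        if (!pgd_e->pmds)
            return -1;
    }
    toy_pmd_t *pmd_e = &pgd_e->pmds[toy_pmd_index(addr)];
    if (!pmd_e->ptes) {
        pmd_e->ptes = calloc(TOY_PTRS_PER_PTE, sizeof(toy_pte_t));
        if (!pmd_e->ptes)
            return -1;
    }
    pmd_e->ptes[toy_pte_index(addr)].page = page;
    return 0;
}

/* Walk PGD -> PMD -> PTE -> page; returns NULL if any level is missing. */
static inline char *toy_walk(const toy_pgd_t *pgd, uint32_t addr)
{
    const toy_pgd_t *pgd_e = &pgd[toy_pgd_index(addr)];
    if (!pgd_e->pmds)
        return NULL;
    const toy_pmd_t *pmd_e = &pgd_e->pmds[toy_pmd_index(addr)];
    if (!pmd_e->ptes)
        return NULL;
    char *page = pmd_e->ptes[toy_pte_index(addr)].page;
    if (!page)
        return NULL;
    /* The low bits select the byte within the page. */
    return page + (addr & ((1u << TOY_PAGE_SHIFT) - 1));
}
```

A missing entry at any level ends the walk early, which loosely corresponds to the situation in which the real kernel would take a page fault.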
Figure 3.1 Page Table Layout
Any given linear address may be broken up into parts to yield offsets within these three page table levels and an offset within the actual page. To help break up the linear address into its component parts, a number of macros are provided in triplets for each page table level, namely a SHIFT, a SIZE and a MASK macro. The SHIFT macros specify the length in bits that are mapped by each level of the page tables as illustrated in Figure 3.2.
Figure 3.2 Linear Address Bit Size Macros
The MASK values can be ANDed with a linear address to clear the lower bits, leaving only the upper bits, and are frequently used to determine if a linear address is aligned to a given level within the page table. The SIZE macros reveal how many bytes are addressed by each entry at each level. The relationship between the SIZE and MASK macros is illustrated in Figure 3.3.
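As a brief illustration of how a MASK is used for alignment checks, the sketch below builds the top-level triplet and two helpers on top of it. The value 22 for PGDIR_SHIFT is an assumption for an x86 without PAE, and the helper names are hypothetical rather than kernel functions.

```c
/* Top-level triplet, assuming PGDIR_SHIFT is 22 (x86 without PAE). */
#define PGDIR_SHIFT 22
#define PGDIR_SIZE  (1UL << PGDIR_SHIFT)   /* bytes mapped by one PGD entry */
#define PGDIR_MASK  (~(PGDIR_SIZE - 1))    /* ones in the upper bits only */

/* Hypothetical helper: start of the region covered by addr's PGD entry. */
static inline unsigned long pgdir_base(unsigned long addr)
{
    return addr & PGDIR_MASK;
}

/* Hypothetical helper: aligned when every bit below PGDIR_SHIFT is zero. */
static inline int is_pgdir_aligned(unsigned long addr)
{
    return (addr & ~PGDIR_MASK) == 0;
}
```

With these values, each PGD entry addresses 4MiB, so 0x00400000 is aligned to the top level but 0x00401000 is not.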
Figure 3.3 Linear Address Size and Mask Macros
For the calculation of each of the triplets, only SHIFT is important because the other two are calculated based on it. For example, the three macros for page level on the x86 are:
#define PAGE_SHIFT 12
#define PAGE_SIZE (1UL << PAGE_SHIFT)
#define PAGE_MASK (~(PAGE_SIZE-1))
PAGE_SHIFT is the length in bits of the offset part of the linear address, which is 12 bits on the x86. The size of a page is easily calculated as 2^PAGE_SHIFT, which is equivalent to the definition of PAGE_SIZE in the previous code. Finally, the mask is calculated as the negation of the bits that make up PAGE_SIZE - 1. If an address needs to be aligned on a page boundary, PAGE_ALIGN() is used. This macro adds PAGE_SIZE - 1 to the address before simply ANDing it with PAGE_MASK to zero out the page offset bits.
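The rounding performed by PAGE_ALIGN() follows directly from that description. The sketch below writes the macro out as described and adds one hypothetical helper, pages_needed(), which is not a kernel function and is included only to show a typical use.

```c
/* Page-level triplet as given for the x86. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/*
 * Round addr up to the next page boundary. Adding PAGE_SIZE - 1 pushes any
 * address with a nonzero offset into the next page. The AND then clears the
 * offset bits. Addresses that are already aligned come back unchanged.
 */
#define PAGE_ALIGN(addr) (((addr) + PAGE_SIZE - 1) & PAGE_MASK)

/* Hypothetical helper: number of whole pages needed to hold len bytes. */
static inline unsigned long pages_needed(unsigned long len)
{
    return PAGE_ALIGN(len) >> PAGE_SHIFT;
}
```

For instance, PAGE_ALIGN(0x1001) evaluates to 0x2000, while the already aligned 0x1000 is returned unchanged.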
PMD_SHIFT is the number of bits in the linear address that are mapped by the second-level part of the table. The PMD_SIZE and PMD_MASK are calculated in a similar way to the page-level macros.
PGDIR_SHIFT is the number of bits that are mapped by the top, or first level, of the page table. The PGDIR_SIZE and PGDIR_MASK are calculated in the same manner.
The last three macros of importance are the PTRS_PER_x macros, which determine the number of entries in each level of the page table. PTRS_PER_PGD is the number of pointers in the PGD, which is 1,024 on an x86 without PAE. PTRS_PER_PMD is for the PMD, which is one on the x86 without PAE, and PTRS_PER_PTE is for the lowest level, which is 1,024 on the x86.
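The entry counts and the shifts are two views of the same layout, which a short sketch can make explicit. It assumes a 32-bit x86 without PAE, with PMD_SHIFT and PGDIR_SHIFT both taken to be 22. ADDRESS_BITS, struct addr_parts and split_address() are hypothetical names introduced here, and the kernel headers may simply define the PTRS_PER_x values as constants rather than deriving them this way.

```c
#include <stdint.h>

/* Assumed shifts for x86 without PAE (32-bit linear addresses). */
#define ADDRESS_BITS 32   /* hypothetical name, not a kernel macro */
#define PAGE_SHIFT   12
#define PMD_SHIFT    22
#define PGDIR_SHIFT  22

/* Entry counts derived from the shifts, for illustration only. */
#define PTRS_PER_PTE (1U << (PMD_SHIFT - PAGE_SHIFT))      /* 1,024 */
#define PTRS_PER_PMD (1U << (PGDIR_SHIFT - PMD_SHIFT))     /* 1 */
#define PTRS_PER_PGD (1U << (ADDRESS_BITS - PGDIR_SHIFT))  /* 1,024 */

/* Hypothetical container for the component parts of a linear address. */
struct addr_parts {
    unsigned pgd;     /* index into the PGD */
    unsigned pmd;     /* index into the PMD */
    unsigned pte;     /* index into the PTE table */
    unsigned offset;  /* byte offset within the page */
};

/* Break a linear address into its per-level indices and page offset. */
static inline struct addr_parts split_address(uint32_t addr)
{
    struct addr_parts p;
    p.pgd    = (addr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1);
    p.pmd    = (addr >> PMD_SHIFT) & (PTRS_PER_PMD - 1); /* always 0 here */
    p.pte    = (addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
    p.offset = addr & ((1U << PAGE_SHIFT) - 1);
    return p;
}
```

Under these assumptions, the address 0xC0123456 splits into PGD index 768, PMD index 0, PTE index 291 and page offset 0x456.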