3.2 Describing a Page Table Entry
As mentioned, each entry is described by the structs pte_t, pmd_t and pgd_t for PTEs, PMDs and PGDs respectively. Even though these are often just unsigned integers, they are defined as structs for two reasons. The first is type protection, so that they will not be used inappropriately. The second is for features like PAE on the x86, where an additional 4 bits are used for addressing more than 4GiB of memory. To store the protection bits, pgprot_t is defined, which holds the relevant flags and is usually stored in the lower bits of a page table entry.
For type casting, four macros are provided in asm/page.h, which take the previous types and return the relevant part of the structs. They are pte_val(), pmd_val(), pgd_val() and pgprot_val(). To reverse the type casting, four more macros are provided: __pte(), __pmd(), __pgd() and __pgprot().
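The wrapping and unwrapping can be illustrated with a minimal userspace sketch. This is not the kernel's asm/page.h: the field names and the use of uint32_t are assumptions chosen for illustration, and only the macro names come from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mock-up: one distinct struct type per page table level.
 * The field names are illustrative assumptions, not the kernel's. */
typedef struct { uint32_t pte; } pte_t;
typedef struct { uint32_t pmd; } pmd_t;
typedef struct { uint32_t pgd; } pgd_t;
typedef struct { uint32_t pgprot; } pgprot_t;

/* Unwrap: take the typed entry and return the raw integer inside it. */
#define pte_val(x)    ((x).pte)
#define pmd_val(x)    ((x).pmd)
#define pgd_val(x)    ((x).pgd)
#define pgprot_val(x) ((x).pgprot)

/* Wrap: build a typed entry from a raw integer. */
#define __pte(x)    ((pte_t) { (x) })
#define __pmd(x)    ((pmd_t) { (x) })
#define __pgd(x)    ((pgd_t) { (x) })
#define __pgprot(x) ((pgprot_t) { (x) })

/* The wrapper costs nothing in storage on this mock-up. */
static_assert(sizeof(pte_t) == sizeof(uint32_t), "pte_t is one 32-bit word");

/* A function that accepts only PTEs. Passing a pmd_t or a bare integer
 * here is a compile-time error, which is the type protection at work. */
static inline uint32_t raw_pte(pte_t pte)
{
    return pte_val(pte);
}
```

With these definitions, raw_pte(__pte(0x1000)) compiles and returns 0x1000, while raw_pte(__pmd(0x1000)) is rejected by the compiler.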
Where exactly the protection bits are stored is architecture dependent. For illustration purposes, we will examine the case of an x86 architecture without PAE enabled, but the same principles apply across architectures. On an x86 without PAE, the pte_t is simply a 32-bit integer within a struct. Each pte_t points to an address of a page frame, and all the addresses pointed to are guaranteed to be page aligned. Therefore, there are PAGE_SHIFT (12) bits in that 32-bit value that are free for status bits of the page table entry. A number of the protection and status bits are listed in Table 3.1, but what bits exist and what they mean varies between architectures.
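Because the frame address is page aligned, its low PAGE_SHIFT bits are always zero, which is what frees them for flags. The following sketch shows the split for a 32-bit entry. The helper names pte_frame_addr(), pte_flags() and make_pte() are hypothetical and exist only for this example, and PAGE_SIZE and PAGE_MASK are defined locally rather than taken from kernel headers.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  ((uint32_t)1 << PAGE_SHIFT)  /* 4096 bytes */
#define PAGE_MASK  (~(PAGE_SIZE - 1))           /* 0xFFFFF000 */

/* Upper 20 bits: physical address of the page frame. */
static inline uint32_t pte_frame_addr(uint32_t pte)
{
    return pte & PAGE_MASK;
}

/* Lower 12 bits: protection and status flags. */
static inline uint32_t pte_flags(uint32_t pte)
{
    return pte & ~PAGE_MASK;
}

/* Combine a page-aligned frame address with a set of flag bits. */
static inline uint32_t make_pte(uint32_t frame_addr, uint32_t flags)
{
    assert((frame_addr & ~PAGE_MASK) == 0);  /* must be page aligned */
    assert((flags & PAGE_MASK) == 0);        /* must fit in the low bits */
    return frame_addr | flags;
}
```

For example, the raw value 0x12345067 splits into the frame address 0x12345000 and the flag bits 0x067.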
Table 3.1. Page Table Entry Protection and Status Bits
Bit | Function |
---|---|
_PAGE_PRESENT | Page is resident in memory and not swapped out. |
_PAGE_PROTNONE | Page is resident, but not accessible. |
_PAGE_RW | Set if the page may be written to |
_PAGE_USER | Set if the page is accessible from userspace |
_PAGE_DIRTY | Set if the page is written to |
_PAGE_ACCESSED | Set if the page is accessed |
These bits are self-explanatory except for _PAGE_PROTNONE, which I will discuss further. On the x86 with Pentium III and higher, this bit is called the Page Attribute Table (PAT) while earlier architectures such as the Pentium II had this bit reserved. The PAT bit is used to indicate the size of the page that the PTE is referencing. In a PGD entry, this same bit is instead called the Page Size Extension (PSE) bit, so these bits are meant to be used in conjunction.
Because Linux does not use the PSE bit for user pages, the PAT bit is free in the PTE for other purposes. There is a requirement for having a page resident in memory, but inaccessible to the userspace process, such as when a region is protected with mprotect() with the PROT_NONE flag. When the region is to be protected, the _PAGE_PRESENT bit is cleared, and the _PAGE_PROTNONE bit is set. The macro pte_present() checks if either of these bits is set, so the kernel itself knows the PTE is present. It is just inaccessible to userspace, which is a subtle, but important, point. Because the hardware bit _PAGE_PRESENT is clear, a page fault will occur if the page is accessed, so Linux can enforce the protection while still knowing the page is resident if it needs to swap it out or the process exits.
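The distinction between what the hardware sees and what the kernel sees can be sketched in userspace as follows. The bit values 0x001 and 0x080 are assumed here to mirror the non-PAE x86 layout, with _PAGE_PROTNONE sharing bit 7. The pte_low field and the helpers hw_present() and pte_protect_none() are hypothetical names for illustration. In the kernel pte_present() is a macro, but it is written as a function here for readability.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions for x86 without PAE. */
#define _PAGE_PRESENT  0x001u
#define _PAGE_PROTNONE 0x080u  /* shares bit 7 with PAT/PSE */

typedef struct { uint32_t pte_low; } pte_t;

/* What the MMU checks: only the hardware present bit. */
static inline bool hw_present(pte_t pte)
{
    return (pte.pte_low & _PAGE_PRESENT) != 0;
}

/* What the kernel checks: resident if either bit is set. */
static inline bool pte_present(pte_t pte)
{
    return (pte.pte_low & (_PAGE_PRESENT | _PAGE_PROTNONE)) != 0;
}

/* Mark a resident page as PROT_NONE: clear the hardware present bit
 * so any access faults, but set _PAGE_PROTNONE so the kernel still
 * knows the page is in memory. */
static inline pte_t pte_protect_none(pte_t pte)
{
    assert(pte_present(pte));  /* sketch assumes a resident page */
    pte.pte_low &= ~_PAGE_PRESENT;
    pte.pte_low |= _PAGE_PROTNONE;
    return pte;
}
```

After pte_protect_none(), hw_present() is false while pte_present() remains true, which is the state the text describes: the access faults in hardware, yet the kernel can still find and account for the resident page.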