3.10 What's New in 2.6
Most of the mechanics for page table management are essentially the same in 2.6, but the changes that have been introduced are wide reaching, and their implementations are quite involved.
MMU-less Architecture Support A new file called mm/nommu.c has been introduced. This source file contains replacement code for functions that assume the existence of an MMU, such as mmap(). It exists to support architectures, usually microcontrollers, that have no MMU. Much of the work in this area was developed by the uCLinux Project (www.uclinux.org).
Reverse Mapping The most significant and important change to page table management is the introduction of Reverse Mapping (rmap). Referring to it as "rmap" is deliberate because it is the common use of the acronym and should not be confused with the -rmap tree developed by Rik van Riel, which has many more alterations to the stock VM than just the reverse mapping.
In a single sentence, rmap grants the ability to locate all PTEs that map a particular page given just the struct page. In 2.4, the only way to find all PTEs that map a shared page, such as a memory mapped shared library, is to linearly search all page tables belonging to all processes. This is far too expensive, and Linux tries to avoid the problem by using the swap cache (see Section 11.4). This means that, with many shared pages, Linux may have to swap out entire processes regardless of the page age and usage patterns. Instead, 2.6 has a PTE chain associated with every struct page, which may be traversed to remove a page from all page tables that reference it. This way, pages in the LRU can be swapped out in an intelligent manner without resorting to swapping entire processes.
As might be imagined by the reader, the implementation of this simple concept is a little involved. The first step in understanding the implementation is the union pte that is a field in struct page. This union has two fields, a pointer to a struct pte_chain called chain and a pte_addr_t called direct. The union is an optimization whereby direct is used to save memory if there is only one PTE mapping the entry. Otherwise, a chain is used. The type pte_addr_t varies between architectures, but, whatever its type, it can be used to locate a PTE, so we will treat it as a pte_t for simplicity.
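To make the layout concrete, the following is a minimal sketch of the union as it appears in struct page, with pte_addr_t treated as an unsigned long purely for illustration and all other fields of struct page omitted:

    /* Sketch only: pte_addr_t is architecture dependent in reality. */
    typedef unsigned long pte_addr_t;

    struct pte_chain;                        /* real definition lives in mm/rmap.c */

    struct page {
            /* ... other struct page fields omitted ... */
            union {
                    struct pte_chain *chain; /* used when several PTEs map the page */
                    pte_addr_t        direct;/* used when only one PTE maps the page */
            } pte;
    };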
The struct pte_chain is a little more complex. The struct itself is very simple, but it is compact with overloaded fields, and a lot of development effort has been spent on making it small and efficient. Fortunately, this does not make it indecipherable.
First, it is the responsibility of the slab allocator to allocate and manage struct pte_chains because this is the type of task the slab allocator is best at. Each struct pte_chain can hold up to NRPTE pointers to PTE structures. After those slots have been filled, a new struct pte_chain is allocated and added to the chain.
The struct pte_chain has two fields. The first is unsigned long next_and_idx, which has two purposes. When next_and_idx is ANDed with NRPTE, it returns the number of PTEs currently in this struct pte_chain and indicates where the next free slot is. When next_and_idx is ANDed with the negation of NRPTE (i.e., ~NRPTE), a pointer to the next struct pte_chain in the chain is returned. This is basically how a PTE chain is implemented.
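The following sketch, trimmed from the definitions in mm/rmap.c for illustration (so the exact form in any particular release may differ slightly), shows the structure and two helpers that decode next_and_idx:

    #define NRPTE ((L1_CACHE_BYTES - sizeof(unsigned long)) / sizeof(pte_addr_t))

    struct pte_chain {
            unsigned long next_and_idx;      /* next chain pointer and index, overloaded */
            pte_addr_t ptes[NRPTE];          /* slots for up to NRPTE PTE addresses */
    } ____cacheline_aligned;

    /* Pointer to the next pte_chain in the chain. */
    static inline struct pte_chain *pte_chain_next(struct pte_chain *pte_chain)
    {
            return (struct pte_chain *)(pte_chain->next_and_idx & ~NRPTE);
    }

    /* Index information encoding where the next free slot in ptes[] is. */
    static inline int pte_chain_idx(struct pte_chain *pte_chain)
    {
            return pte_chain->next_and_idx & NRPTE;
    }

The trick works because the structure is cacheline aligned, so the low-order bits of a pointer to it are always zero and can be reused to store the index.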
To give you a taste of the rmap intricacies, I'll give an example of what happens when a new PTE needs to map a page. The basic process is to have the caller allocate a new pte_chain with pte_chain_alloc(). This allocated chain is passed with the struct page and the PTE to page_add_rmap(). If the existing PTE chain associated with the page has slots available, it will be used, and the pte_chain allocated by the caller is returned. If no slots were available, the allocated pte_chain will be added to the chain, and NULL returned.
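As a rough sketch of that calling convention, loosely modeled on how the fault handlers in mm/memory.c use the API and with locking and error paths omitted, the sequence looks something like this:

    struct pte_chain *pte_chain;

    /* Preallocate a pte_chain while it is still safe to sleep or fail. */
    pte_chain = pte_chain_alloc(GFP_KERNEL);
    if (!pte_chain)
            goto out_of_memory;

    /* ... set up the PTE mapping the page ... */
    set_pte(page_table, entry);

    /* Record the reverse mapping.  If the page's existing chain had a free
     * slot, the preallocated pte_chain is handed back; otherwise it is
     * consumed and NULL is returned. */
    pte_chain = page_add_rmap(page, page_table, pte_chain);

    /* Free the pte_chain if it was not needed. */
    pte_chain_free(pte_chain);

The chain is preallocated because page_add_rmap() is called in a context where a failed allocation would be awkward to recover from.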
There is quite a substantial API associated with rmap for tasks such as creating chains and adding PTEs to and removing them from a chain, but a full listing is beyond the scope of this section. Fortunately, the API is confined to mm/rmap.c, and the functions are heavily commented so that their purpose is clear.
There are two main benefits, both related to pageout, that come with the introduction of reverse mapping. The first is with the setup and teardown of page tables. As will be seen in Section 11.4, pages being paged out are placed in a swap cache, and information is written into the PTE that is necessary to find the page again. This can lead to multiple minor faults because pages are put into the swap cache and then faulted again by a process. With rmap, the setup and removal of PTEs is atomic. The second major benefit is when pages need to be paged out: finding all PTEs referencing the pages is a simple operation, whereas it was impractical with 2.4, hence the swap cache.
Reverse mapping is not without its cost, though. The first, and most obvious, is the additional space required for the PTE chains. Arguably, the second is a CPU cost associated with reverse mapping, but it has not been proven to be significant. What is important to note, though, is that reverse mapping is only a benefit when pageouts are frequent. If the machine's workload does not result in much pageout, or memory is ample, reverse mapping is all cost with little or no benefit. At the time of writing, the merits and downsides of rmap are still the subject of a number of discussions.
Object-Based Reverse Mapping The reverse mapping required for each page can have very expensive space requirements. To compound the problem, many of the reverse mapped pages in a VMA will be essentially identical. One way of addressing this is to reverse map based on the VMAs rather than individual pages. That is, instead of having a reverse mapping for each page, all the VMAs that map a particular page would be traversed and the page unmapped from each. Note that objects in this case refer to the VMAs, not an object in the object-oriented sense of the word. At the time of writing, this feature has not been merged yet and was last seen in kernel 2.5.68-mm1, but a strong incentive exists to have it available if the problems with it can be resolved. For the very curious reader, the patch for just the file/device-backed objrmap at this release is available.
Two tasks require all PTEs that map a page to be traversed. The first task is page_referenced(), which checks all PTEs that map a page to see if the page has been referenced recently. The second task is when a page needs to be unmapped from all processes with try_to_unmap(). To complicate matters further, two types of mappings must be reverse mapped: those that are backed by a file or device and those that are anonymous. In both cases, the basic objective is to traverse all VMAs that map a particular page and then walk the page table for that VMA to get the PTE. The only difference is how it is implemented. The case where the page is backed by some sort of file is the easier one and was implemented first, so I'll deal with it first. For the purposes of illustrating the implementation, I'll discuss how page_referenced() is implemented.
page_referenced() calls page_referenced_obj(), which is the top-level function for finding all PTEs within VMAs that map the page. As the page is mapped for a file or device, page→mapping contains a pointer to a valid address_space. The address_space has two linked lists that contain all VMAs that use the mapping with the address_space→i_mmap and address_space→i_mmap_shared fields. For every VMA that is on these linked lists, page_referenced_obj_one() is called with the VMA and the page as parameters. The function page_referenced_obj_one() first checks if the page is in an address managed by this VMA and, if so, traverses the page tables of the mm_struct using the VMA (vma→vm_mm) until it finds the PTE mapping the page for that mm_struct.
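Glossing over locking and the details of the actual objrmap patch, the traversal might be sketched as follows (the name of the VMA list linkage field is an assumption for illustration):

    static int page_referenced_obj(struct page *page)
    {
            struct address_space *mapping = page->mapping;
            struct vm_area_struct *vma;
            int referenced = 0;

            /* Walk every VMA on the mapping's private list... */
            list_for_each_entry(vma, &mapping->i_mmap, shared)
                    referenced += page_referenced_obj_one(vma, page);

            /* ...and every VMA on the shared list. */
            list_for_each_entry(vma, &mapping->i_mmap_shared, shared)
                    referenced += page_referenced_obj_one(vma, page);

            return referenced;
    }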
Anonymous page tracking is a lot trickier and was implemented in a number of stages. It only made a very brief appearance and was removed again in 2.5.65-mm4 because it conflicted with a number of other changes. The first stage in the implementation was to use the page→mapping and page→index fields to track mm_struct and address pairs. These fields previously had been used to store a pointer to swapper_space and a pointer to the swp_entry_t (see Chapter 11). Exactly how that is now addressed is beyond the scope of this section, but the summary is that the swp_entry_t is stored in page→private.
try_to_unmap_obj() works in a similar fashion, and, obviously, with this method all the PTEs that reference a page can be found without needing to reverse map the individual pages. However, a serious search complexity problem prevents it from being merged. The scenario that describes the problem is as follows.
Take a case where 100 processes have 100 VMAs mapping a single file. To unmap a single page in this case with object-based reverse mapping would require 10,000 VMAs to be searched, most of which are totally unnecessary. With page-based reverse mapping, only 100 pte_chain slots need to be examined, one for each process. An optimization was introduced to order VMAs in the address_space by virtual address, but the search for a single page is still far too expensive for object-based reverse mapping to be merged.
PTEs in High Memory In 2.4, page table entries exist in ZONE_NORMAL because the kernel needs to be able to address them directly during a page table walk. This was acceptable until it was found that, with high memory machines, ZONE_NORMAL was being consumed by the third-level page table PTEs. The obvious answer is to move PTEs to high memory, which is exactly what 2.6 does.
As we will see in Chapter 9, addressing information in high memory is far from free, so moving PTEs to high memory is a compile-time configuration option. In short, the problem is that the kernel must map pages from high memory into the lower address space before they can be used, but only a very limited number of slots are available for these mappings, which introduces a troublesome bottleneck. However, for applications with a large number of PTEs, there is little other option. At the time of writing, a proposal has been made for having a User Kernel Virtual Area (UKVA), which would be a region in kernel space private to each process, but it is unclear if it will be merged for 2.6 or not.
To take the possibility of high memory mapping into account, the macro pte_offset() from 2.4 has been replaced with pte_offset_map() in 2.6. If PTEs are in low memory, this will behave the same as pte_offset() and return the address of the PTE. If the PTE is in high memory, it will first be mapped into low memory with kmap_atomic(), so it can be used by the kernel. This PTE must be unmapped as quickly as possible with pte_unmap().
In programming terms, this means that page table walk code looks slightly different. In particular, to find the PTE for a given address, the code now reads as (taken from mm/memory.c):
    ptep = pte_offset_map(pmd, address);
    if (!ptep)
            goto out;

    pte = *ptep;
    pte_unmap(ptep);
Additionally, the PTE allocation API has changed. Instead of pte_alloc(), there is now a pte_alloc_kernel() for use with kernel PTE mappings and a pte_alloc_map() for userspace mappings. The principal difference between them is that pte_alloc_kernel() will never use high memory for the PTE.
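A short sketch of how the userspace variant fits into a page table walk, loosely modeled on the fault-handling path in mm/memory.c with locking, declarations of mm and address, and error handling omitted:

    pgd_t *pgd;
    pmd_t *pmd;
    pte_t *pte;

    pgd = pgd_offset(mm, address);        /* top-level entry, always addressable */
    pmd = pmd_alloc(mm, pgd, address);    /* allocate the middle level if needed */
    if (pmd) {
            /* Allocates the PTE page if necessary and, when PTEs live in
             * high memory, maps it before returning a usable pointer. */
            pte = pte_alloc_map(mm, pmd, address);
            if (pte) {
                    /* ... examine or set *pte ... */
                    pte_unmap(pte);       /* drop the temporary mapping */
            }
    }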
In memory management terms, the overhead of having to map the PTE from high memory should not be ignored. Only one PTE at a time may be mapped per CPU, although a second may be mapped with pte_offset_map_nested(). This introduces a penalty when all PTEs need to be examined, such as during zap_page_range() when all PTEs in a given range need to be unmapped.
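For instance, when page tables are copied at fork() time, the source and destination PTE pages must be mapped at the same time. A simplified sketch of that pattern, with the surrounding declarations and loop omitted, is:

    dst_pte = pte_alloc_map(dst_mm, dst_pmd, address);
    src_pte = pte_offset_map_nested(src_pmd, address);  /* second mapping slot */

    /* ... copy entries from *src_pte to *dst_pte ... */

    pte_unmap_nested(src_pte);   /* release the nested mapping first */
    pte_unmap(dst_pte);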
At the time of writing, a patch has been submitted that places PMDs in high memory using essentially the same mechanism and API changes. It is likely that it will be merged.
Huge TLB Filesystem Most modern architectures support more than one page size. For example, on many x86 architectures, there is an option to use 4KiB pages or 4MiB pages. Traditionally, Linux only used large pages for mapping the actual kernel image and nowhere else. Because TLB slots are a scarce resource, it is desirable to be able to take advantage of the large pages, especially on machines with large amounts of physical memory.
In 2.6, Linux allows processes to use huge pages, the size of which is determined by HPAGE_SIZE. The number of available huge pages is determined by the system administrator by using the /proc/sys/vm/nr_hugepages proc interface, which ultimately uses the function set_hugetlb_mem_size(). Because the success of the allocation depends on the availability of physically contiguous memory, the allocation should be made during system startup.
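For illustration only, a hedged sketch of how a startup program might size the pool by writing to the proc interface (the value 20 is an arbitrary example):

    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/proc/sys/vm/nr_hugepages", "w");

            if (!f) {
                    perror("nr_hugepages");
                    return 1;
            }
            fprintf(f, "20\n");     /* request a pool of 20 huge pages */
            fclose(f);
            return 0;
    }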
The root of the implementation is a Huge TLB Filesystem (hugetlbfs), which is a pseudofilesystem implemented in fs/hugetlbfs/inode.c. Basically, each file in this filesystem is backed by a huge page. During initialization, init_hugetlbfs_fs() registers the file system and mounts it as an internal filesystem with kern_mount().
There are two ways that huge pages may be accessed by a process. The first is to use shmget() to set up a shared region backed by huge pages, and the second is to call mmap() on a file opened in the huge page filesystem.
When a shared memory region should be backed by huge pages, the process should call shmget() and pass SHM_HUGETLB as one of the flags. This results in hugetlb_zero_setup() being called, which creates a new file in the root of the internal hugetlbfs. The name of the file is determined by an atomic counter called hugetlbfs_counter, which is incremented every time a shared region is set up.
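A minimal userspace sketch of this approach, in the spirit of the example in Documentation/vm/hugetlbpage.txt, with the region size chosen arbitrarily as a multiple of the huge page size and only basic error handling:

    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    #ifndef SHM_HUGETLB
    #define SHM_HUGETLB 04000            /* may not be in older libc headers */
    #endif

    #define LENGTH (4UL * 1024 * 1024)   /* must be a multiple of the huge page size */

    int main(void)
    {
            int shmid;
            char *addr;

            /* SHM_HUGETLB asks for the segment to be backed by huge pages. */
            shmid = shmget(IPC_PRIVATE, LENGTH,
                           SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);
            if (shmid < 0) {
                    perror("shmget");
                    return 1;
            }

            addr = shmat(shmid, NULL, 0);
            if (addr == (char *)-1) {
                    perror("shmat");
                    return 1;
            }

            addr[0] = 1;                 /* touch the region */

            shmdt(addr);
            shmctl(shmid, IPC_RMID, NULL);
            return 0;
    }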
To create a file backed by huge pages, a filesystem of type hugetlbfs must first be mounted by the system administrator. Instructions on how to perform this task are detailed in Documentation/vm/hugetlbpage.txt. After the filesystem is mounted, files can be created as normal with the system call open(). When mmap() is called on the open file, the file_operations struct hugetlbfs_file_operations ensures that hugetlbfs_file_mmap() is called to set up the region properly.
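A similar hedged sketch of the mmap() approach, assuming the administrator has mounted hugetlbfs at /mnt/huge (the mount point and file name are arbitrary):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define LENGTH (4UL * 1024 * 1024)   /* must be a multiple of the huge page size */

    int main(void)
    {
            char *addr;
            int fd = open("/mnt/huge/example", O_CREAT | O_RDWR, 0755);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }

            /* hugetlbfs_file_mmap() sets up the huge page backed region. */
            addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (addr == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }

            addr[0] = 1;                 /* touch the region */

            munmap(addr, LENGTH);
            close(fd);
            unlink("/mnt/huge/example");
            return 0;
    }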
Huge TLB pages have their own functions for the management of page tables, address space operations and filesystem operations. The names of the functions for page table management can all be seen in <linux/hugetlb.h>, and they are named very similarly to their normal page equivalents. The implementations of the hugetlbfs functions are located near their normal page equivalents, so they are easy to find.
Cache Flush Management The changes here are minimal. The API function flush_page_to_ram() has been totally removed, and a new API flush_dcache_range() has been introduced.