Windows 2000 Memory Management
The Windows 2000 kernel makes heavy use of the protected-mode virtual memory management mechanisms of the Intel i386 CPU class. To get a better understanding of how Windows 2000 manages its main memory, it is important to be at least basically familiar with some architectural issues of the i386 CPU. The term i386 might look somewhat anachronistic because the 80386 CPU dates back to the Stone Age of Windows computing. Windows 2000 is designed for Pentium CPUs and better. However, even these newer processors rely on the memory management model originally designed for the 80386 CPU—with some important enhancements, of course. Therefore, Microsoft usually labels the Windows NT and 2000 versions built for Intel processors "i386" or even "x86". Don't be confused about that—whenever you read the numbers 86 or 386 in this book, keep in mind that the corresponding information refers to a specific CPU architecture, not a specific processor release.
i386 Memory Management Data Structures
Some portions of the sample code that follows deal with low-level memory management and peek inside the mechanisms outlined above. For convenience, I have defined several C data structures that make this task easier. Because many data items inside the i386 CPU are concatenations of single bits or bit groups, C bit-fields come in handy. Bit-fields are an efficient way to access individual bits of larger data words or to extract contiguous bit groups from them. Microsoft Visual C/C++ generates quite clever code for bit-field operations. Listing 1 is part one of a series of CPU data type definitions, containing the following items:
X86_REGISTER is a basic unsigned 32-bit integral type that can represent various CPU registers. This comprises all general-purpose, index, pointer, control, debug, and test registers.
X86_SELECTOR represents a 16-bit segment selector, as stored in the segment registers CS, DS, ES, FS, GS, and SS. In Figure 1, selectors are depicted as the upper third of a logical 48-bit address, serving as an index into a descriptor table. For computational convenience, the 16-bit selector value is extended to 32 bits, with the upper half marked "reserved". Note that the X86_SELECTOR structure is a union of two structures. The first one specifies the selector value as a packed 16-bit WORD named wValue, while the second breaks it up into bit-fields. The RPL field specifies the Requested Privilege Level, which is either 0 (kernel-mode) or 3 (user-mode) on Windows 2000. The TI bit switches between the Global and Local Descriptor Tables (GDT/LDT).
X86_DESCRIPTOR defines the format of a table entry pointed to by a selector. It is a 64-bit quantity with a very convoluted structure due to its historic evolution. The linear base address defining the start location of the associated segment is scattered among three bit-fields named Base1, Base2, and Base3, with Base1 being the least significant part. The 20-bit segment limit specifying the segment size minus one is divided into the pair Limit1 and Limit2, with the former representing the least significant 16 bits. The remaining bit-fields store various segment properties. For instance, the G bit defines the segment granularity. If zero, the segment limit is specified in bytes; otherwise, it is counted in 4KB units, so that the maximum limit value 0xFFFFF covers the entire 4GB address space. Like X86_SELECTOR, the X86_DESCRIPTOR structure is made up of a union to allow different interpretations of its value. The dValueLow and dValueHigh members are helpful if you have to copy descriptors without regard to their internal structure.
X86_GATE looks somewhat similar to X86_DESCRIPTOR. In fact, both structures are related—while X86_DESCRIPTOR is a GDT entry and describes the memory properties of a segment, X86_GATE is an entry inside the Interrupt Descriptor Table (IDT) and describes the memory properties of an interrupt handler. The IDT can contain Task, Interrupt, and Trap Gates. (No, Bill Gates is not stored in the IDT!) The X86_GATE structure matches all three types, with the Type bit-field determining the identity. Type 5 identifies a Task Gate, Types 6 and 14 Interrupt Gates, and Types 7 and 15 Trap Gates. The most significant type bit specifies the size of the gate: 16-bit gates have this bit set to zero; otherwise, it is a 32-bit gate.
X86_TABLE is a tricky structure that is used to read the values of the GDTR or IDTR by means of the assembly language instructions SGDT (store GDT register) and SIDT (store IDT register), respectively. Both instructions require a 48-bit memory operand, where the limit and base address values will be stored. To maintain DWORD alignment for the 32-bit base address, X86_TABLE starts out with the 16-bit dummy member wReserved. Depending on whether the SGDT or SIDT instruction is applied, the base address must be interpreted as a descriptor or gate pointer, as suggested by the union of PX86_DESCRIPTOR and PX86_GATE types. The wLimit member is the same for both table types.
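The selector layout described in the list above can be sketched with plain shifts and masks, which avoids any dependence on how a particular compiler lays out bit-fields. This is only a minimal illustration: the SelectorFields type and both function names are hypothetical helpers and are not part of the listings in this chapter.

```c
#include <stdint.h>

/* Hypothetical helper type: the three selector fields described above. */
typedef struct
{
    unsigned rpl;    /* requested privilege level, bits 1..0 */
    unsigned ti;     /* table indicator, bit 2: 0=GDT, 1=LDT */
    unsigned index;  /* descriptor table index, bits 15..3   */
} SelectorFields;

/* Split a 16-bit selector value into its fields (no bit-fields used). */
static SelectorFields DecodeSelector(uint16_t wValue)
{
    SelectorFields f;
    f.rpl   =  wValue       & 0x3u;
    f.ti    = (wValue >> 2) & 0x1u;
    f.index = (wValue >> 3) & 0x1FFFu;
    return f;
}

/* Byte offset of the selected entry inside its descriptor table.
   Each descriptor is 8 bytes, so this equals index * 8, which is
   the selector value with the RPL and TI bits cleared. */
static uint32_t SelectorToTableOffset(uint16_t wValue)
{
    return (uint32_t)(wValue & 0xFFF8u);
}
```

For the sample value 0x001B, DecodeSelector() yields RPL 3, TI 0, and index 3, and SelectorToTableOffset() returns 0x18.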
Listing 1 i386 Registers, Selectors, Descriptors, Gates, and Tables.
// =================================================================
// INTEL X86 STRUCTURES, PART 1 OF 3
// =================================================================

typedef DWORD X86_REGISTER, *PX86_REGISTER, **PPX86_REGISTER;

// -----------------------------------------------------------------

typedef struct _X86_SELECTOR
{
    union
    {
        struct
        {
            WORD wValue;                // packed value
            WORD wReserved;
        };
        struct
        {
            unsigned RPL      :  2;     // requested privilege level
            unsigned TI       :  1;     // table indicator: 0=gdt, 1=ldt
            unsigned Index    : 13;     // index into descriptor table
            unsigned Reserved : 16;
        };
    };
}
X86_SELECTOR, *PX86_SELECTOR, **PPX86_SELECTOR;

#define X86_SELECTOR_ sizeof (X86_SELECTOR)

// -----------------------------------------------------------------

typedef struct _X86_DESCRIPTOR
{
    union
    {
        struct
        {
            DWORD dValueLow;            // packed value
            DWORD dValueHigh;
        };
        struct
        {
            unsigned Limit1   : 16;     // bits 15..00
            unsigned Base1    : 16;     // bits 15..00
            unsigned Base2    :  8;     // bits 23..16
            unsigned Type     :  4;     // segment type
            unsigned S        :  1;     // type (0=system, 1=code/data)
            unsigned DPL      :  2;     // descriptor privilege level
            unsigned P        :  1;     // segment present
            unsigned Limit2   :  4;     // bits 19..16
            unsigned AVL      :  1;     // available to programmer
            unsigned Reserved :  1;
            unsigned DB       :  1;     // 0=16-bit, 1=32-bit
            unsigned G        :  1;     // granularity (1=4KB)
            unsigned Base3    :  8;     // bits 31..24
        };
    };
}
X86_DESCRIPTOR, *PX86_DESCRIPTOR, **PPX86_DESCRIPTOR;

#define X86_DESCRIPTOR_ sizeof (X86_DESCRIPTOR)

// -----------------------------------------------------------------

typedef struct _X86_GATE
{
    union
    {
        struct
        {
            DWORD dValueLow;            // packed value
            DWORD dValueHigh;
        };
        struct
        {
            unsigned Offset1    : 16;   // bits 15..00
            unsigned Selector   : 16;   // segment selector
            unsigned Parameters :  5;   // parameters
            unsigned Reserved   :  3;
            unsigned Type       :  4;   // gate type and size
            unsigned S          :  1;   // always 0
            unsigned DPL        :  2;   // descriptor privilege level
            unsigned P          :  1;   // segment present
            unsigned Offset2    : 16;   // bits 31..16
        };
    };
}
X86_GATE, *PX86_GATE, **PPX86_GATE;

#define X86_GATE_ sizeof (X86_GATE)

// -----------------------------------------------------------------

typedef struct _X86_TABLE
{
    WORD wReserved;                     // force 32-bit alignment
    WORD wLimit;                        // table limit
    union
    {
        PX86_DESCRIPTOR pDescriptors;   // used by sgdt instruction
        PX86_GATE       pGates;         // used by sidt instruction
    };
}
X86_TABLE, *PX86_TABLE, **PPX86_TABLE;

#define X86_TABLE_ sizeof (X86_TABLE)

// =================================================================
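To make the gate layout of Listing 1 more tangible, the following sketch decodes a gate from its two packed DWORDs (dValueLow and dValueHigh) using shifts and masks. The GateKind enumeration and all function names are hypothetical and serve illustration only. The last helper assumes that a table limit is the table size in bytes minus one and that every entry occupies 8 bytes.

```c
#include <stdint.h>

/* Hypothetical classification of the Type field values named in the text. */
typedef enum { GATE_OTHER, GATE_TASK, GATE_INTERRUPT, GATE_TRAP } GateKind;

/* Type occupies bits 11..8 of the high DWORD. */
static GateKind ClassifyGate(uint32_t dValueHigh)
{
    switch ((dValueHigh >> 8) & 0xFu)
    {
        case 5:           return GATE_TASK;
        case 6: case 14:  return GATE_INTERRUPT;
        case 7: case 15:  return GATE_TRAP;
        default:          return GATE_OTHER;
    }
}

/* Most significant Type bit (bit 11): 0 = 16-bit gate, 1 = 32-bit gate. */
static int GateIs32Bit(uint32_t dValueHigh)
{
    return (int)((dValueHigh >> 11) & 1u);
}

/* Handler offset: Offset2 (high DWORD, bits 31..16) joined with
   Offset1 (low DWORD, bits 15..0). */
static uint32_t GateOffset(uint32_t dValueLow, uint32_t dValueHigh)
{
    return (dValueHigh & 0xFFFF0000u) | (dValueLow & 0x0000FFFFu);
}

/* Segment selector: bits 31..16 of the low DWORD. */
static uint16_t GateSelector(uint32_t dValueLow)
{
    return (uint16_t)(dValueLow >> 16);
}

/* Number of 8-byte entries in a table whose limit is wLimit. */
static uint32_t TableEntryCount(uint16_t wLimit)
{
    return ((uint32_t)wLimit + 1u) / 8u;
}
```

With the made-up sample values dValueLow = 0x00081234 and dValueHigh = 0x80468E00, the gate is classified as a 32-bit Interrupt Gate with selector 0x0008 and handler offset 0x80461234. A table limit of 0x07FF corresponds to 256 entries.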
Figure 1 Flat 4GB memory segmentation.
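As a rough illustration of how the scattered Base and Limit parts of a descriptor fit together, the sketch below reassembles them from the packed dValueLow and dValueHigh values. The function names are hypothetical, and the sample descriptor values are made up for the example rather than read from a live system.

```c
#include <stdint.h>

/* Base = Base3 (high 31..24) : Base2 (high 7..0) : Base1 (low 31..16). */
static uint32_t DescriptorBase(uint32_t dValueLow, uint32_t dValueHigh)
{
    return  (dValueHigh & 0xFF000000u)
         | ((dValueHigh & 0x000000FFu) << 16)
         |  (dValueLow >> 16);
}

/* Raw 20-bit limit = Limit2 (high 19..16) : Limit1 (low 15..0). */
static uint32_t DescriptorRawLimit(uint32_t dValueLow, uint32_t dValueHigh)
{
    return (dValueHigh & 0x000F0000u) | (dValueLow & 0x0000FFFFu);
}

/* Limit in bytes. If G (bit 23 of the high DWORD) is set, the raw
   limit counts 4KB units, so the low 12 bits are filled with ones. */
static uint32_t DescriptorByteLimit(uint32_t dValueLow, uint32_t dValueHigh)
{
    uint32_t limit = DescriptorRawLimit(dValueLow, dValueHigh);

    if (dValueHigh & (1u << 23))
    {
        limit = (limit << 12) | 0xFFFu;
    }
    return limit;
}
```

For a flat segment in the sense of Figure 1, for example dValueLow = 0x0000FFFF and dValueHigh = 0x00CF9B00, the base evaluates to 0x00000000 and the byte limit to 0xFFFFFFFF, that is, the segment spans the entire 4GB linear address space.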
The next set of i386 memory management structures collected in Listing 2 relates to demand paging and contains several items illustrated in Figures 2 and 3:
X86_PDBR is, of course, a structural representation of the CPU's CR3 register, also known as the page-directory base register (PDBR). The upper 20 bits contain the page-frame number (PFN), which is an index into the array of physical 4KB pages. PFN=0 corresponds to physical address 0x00000000, PFN=1 to 0x00001000, and so forth. 20 bits are just enough to cover the entire 4GB address space. The PFN in the PDBR is the index of the physical page that holds the page-directory. Most of the remaining bits are reserved, except for bit #3, controlling page-level write-through (PWT), and bit #4 (PCD), disabling page-level caching if set.
X86_PDE_4M and X86_PDE_4K are alternative incarnations of page-directory entries (PDEs) for 4MB and 4KB pages, respectively. A page-directory contains a maximum of 1,024 PDEs. Again, PFN is the page-frame number, pointing to the subordinate page. For a 4MB PDE, the PFN bit-field is only 10 bits wide, addressing a 4MB data page. The 20-bit PFN of a 4KB PDE points to a page-table that ultimately selects the physical data pages. The remaining bits define various properties. The most interesting ones are the "Page Size" bit PS, controlling the page size (0 = 4KB, 1 = 4MB), and the "Present" bit P, indicating whether the subordinate data page (4MB mode) or page-table (4KB mode) is present in physical memory.
X86_PTE_4K defines the internal structure of a page-table entry (PTE) contained in a page-table. Like a page-directory, a page-table can contain up to 1,024 entries. X86_PTE_4K differs from X86_PDE_4K in two bits: it lacks the PS bit, which is not required because the page size must be 4KB, as determined by the PDE's PS bit, and it carries the dirty bit D at bit #6, a position that Listing 2 marks as reserved in the 4KB PDE. Note that there is no such thing as a 4MB PTE, since the 4MB memory model doesn't require an intermediate page-table layer.
X86_PNPE represents a "page-not-present entry" (PNPE); that is, a PDE or PTE where the P bit is zero. According to the Intel manuals, the remaining 31 bits are "Available to Operating System or Executive." If a linear address maps to a PNPE, this means that this address is either unused, or it points to a page that is currently swapped out to one of the pagefiles. Windows 2000 uses the 31 unassigned bits of the PNPE to store status information of the page. The structure of this information is undocumented, but it seems that bit #10, named PageFile in Listing 2, is set if the page is swapped out. In this case, the Reserved1 and Reserved2 bit-fields contain values that enable the system to locate the page in the pagefiles, so it can be swapped in as soon as one of its linear addresses is touched by a memory read/write instruction.
X86_PE is included for convenience. It is merely a union of all possible forms a page entry can take, comprising the PDBR contents, 4MB and 4KB PDEs, PTEs, and PNPEs.
Listing 2 i386 PDBR, PDE, PTE, and Page-Not-Present Entry Values.
// =================================================================
// INTEL X86 STRUCTURES, PART 2 OF 3
// =================================================================

typedef struct _X86_PDBR            // page-directory base register (cr3)
{
    union
    {
        struct
        {
            DWORD dValue;               // packed value
        };
        struct
        {
            unsigned Reserved1 :  3;
            unsigned PWT       :  1;    // page-level write-through
            unsigned PCD       :  1;    // page-level cache disabled
            unsigned Reserved2 :  7;
            unsigned PFN       : 20;    // page-frame number
        };
    };
}
X86_PDBR, *PX86_PDBR, **PPX86_PDBR;

#define X86_PDBR_ sizeof (X86_PDBR)

// -----------------------------------------------------------------

typedef struct _X86_PDE_4M          // page-directory entry (4-MB page)
{
    union
    {
        struct
        {
            DWORD dValue;               // packed value
        };
        struct
        {
            unsigned P         :  1;    // present (1 = present)
            unsigned RW        :  1;    // read/write
            unsigned US        :  1;    // user/supervisor
            unsigned PWT       :  1;    // page-level write-through
            unsigned PCD       :  1;    // page-level cache disabled
            unsigned A         :  1;    // accessed
            unsigned D         :  1;    // dirty
            unsigned PS        :  1;    // page size (1 = 4-MB page)
            unsigned G         :  1;    // global page
            unsigned Available :  3;    // available to programmer
            unsigned Reserved  : 10;
            unsigned PFN       : 10;    // page-frame number
        };
    };
}
X86_PDE_4M, *PX86_PDE_4M, **PPX86_PDE_4M;

#define X86_PDE_4M_ sizeof (X86_PDE_4M)

// -----------------------------------------------------------------

typedef struct _X86_PDE_4K          // page-directory entry (4-KB page)
{
    union
    {
        struct
        {
            DWORD dValue;               // packed value
        };
        struct
        {
            unsigned P         :  1;    // present (1 = present)
            unsigned RW        :  1;    // read/write
            unsigned US        :  1;    // user/supervisor
            unsigned PWT       :  1;    // page-level write-through
            unsigned PCD       :  1;    // page-level cache disabled
            unsigned A         :  1;    // accessed
            unsigned Reserved  :  1;    // reserved (position of the dirty bit in 4-MB PDEs and PTEs)
            unsigned PS        :  1;    // page size (0 = 4-KB page)
            unsigned G         :  1;    // global page
            unsigned Available :  3;    // available to programmer
            unsigned PFN       : 20;    // page-frame number
        };
    };
}
X86_PDE_4K, *PX86_PDE_4K, **PPX86_PDE_4K;

#define X86_PDE_4K_ sizeof (X86_PDE_4K)

// -----------------------------------------------------------------

typedef struct _X86_PTE_4K          // page-table entry (4-KB page)
{
    union
    {
        struct
        {
            DWORD dValue;               // packed value
        };
        struct
        {
            unsigned P         :  1;    // present (1 = present)
            unsigned RW        :  1;    // read/write
            unsigned US        :  1;    // user/supervisor
            unsigned PWT       :  1;    // page-level write-through
            unsigned PCD       :  1;    // page-level cache disabled
            unsigned A         :  1;    // accessed
            unsigned D         :  1;    // dirty
            unsigned Reserved  :  1;
            unsigned G         :  1;    // global page
            unsigned Available :  3;    // available to programmer
            unsigned PFN       : 20;    // page-frame number
        };
    };
}
X86_PTE_4K, *PX86_PTE_4K, **PPX86_PTE_4K;

#define X86_PTE_4K_ sizeof (X86_PTE_4K)

// -----------------------------------------------------------------

typedef struct _X86_PNPE            // page not present entry
{
    union
    {
        struct
        {
            DWORD dValue;               // packed value
        };
        struct
        {
            unsigned P         :  1;    // present (0 = not present)
            unsigned Reserved1 :  9;
            unsigned PageFile  :  1;    // page swapped to pagefile
            unsigned Reserved2 : 21;
        };
    };
}
X86_PNPE, *PX86_PNPE, **PPX86_PNPE;

#define X86_PNPE_ sizeof (X86_PNPE)

// -----------------------------------------------------------------

typedef struct _X86_PE              // general page entry
{
    union
    {
        DWORD      dValue;              // packed value
        X86_PDBR   pdbr;                // page-directory base register
        X86_PDE_4M pde4M;               // page-directory entry (4-MB page)
        X86_PDE_4K pde4K;               // page-directory entry (4-KB page)
        X86_PTE_4K pte4K;               // page-table entry (4-KB page)
        X86_PNPE   pnpe;                // page not present entry
    };
}
X86_PE, *PX86_PE, **PPX86_PE;

#define X86_PE_ sizeof (X86_PE)

// =================================================================
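Since X86_PE merely overlays several interpretations of the same 32 bits, code that inspects a page-directory entry first has to decide which interpretation applies. The following sketch shows one possible way to do this with masks on the packed dValue. The enumeration and function names are hypothetical, and the meaning of bit #10 in a not-present entry follows the undocumented observation described above, so treat that branch as an assumption.

```c
#include <stdint.h>

/* Hypothetical result codes for a page-directory entry value. */
typedef enum
{
    PDE_NOT_PRESENT,   /* P = 0, PageFile = 0                   */
    PDE_IN_PAGEFILE,   /* P = 0, PageFile = 1 (assumed meaning) */
    PDE_PAGE_4M,       /* P = 1, PS = 1: maps a 4MB data page   */
    PDE_TABLE_4K       /* P = 1, PS = 0: points to a page-table */
} PdeKind;

static PdeKind ClassifyPde(uint32_t dValue)
{
    if (!(dValue & 0x001u))                     /* P bit, bit #0 */
    {
        return (dValue & 0x400u)                /* PageFile, bit #10 */
               ? PDE_IN_PAGEFILE : PDE_NOT_PRESENT;
    }
    return (dValue & 0x080u)                    /* PS bit, bit #7 */
           ? PDE_PAGE_4M : PDE_TABLE_4K;
}

/* Physical base address referenced by a present PDE.
   4MB PDE: the 10-bit PFN occupies bits 31..22.
   4KB PDE: the 20-bit PFN occupies bits 31..12. */
static uint32_t PdeTargetBase(uint32_t dValue)
{
    return (dValue & 0x080u) ? (dValue & 0xFFC00000u)
                             : (dValue & 0xFFFFF000u);
}
```

For example, the made-up value 0x004001E3 is classified as a 4MB PDE whose data page starts at physical address 0x00400000, whereas 0x00030067 is a 4KB PDE referring to a page-table at 0x00030000.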
Figure 2 Double-layered paging with 4KB pages.
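The double-layered lookup of Figure 2 can be imitated in user-mode code with a tiny simulated physical memory. The sketch below is not how the CPU or Windows 2000 implements paging; it only replays the two table lookups. The array g_phys and the functions SimPage(), SimMap4K(), and Translate4K() are hypothetical names invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define SIM_PAGES   4u      /* simulated physical pages (PFN 0..3) */
#define PAGE_DWORDS 1024u   /* 4KB page = 1,024 DWORD entries      */

static uint32_t g_phys[SIM_PAGES][PAGE_DWORDS];

/* Return the simulated page for a PFN, or NULL if out of range. */
static uint32_t *SimPage(uint32_t pfn)
{
    return (pfn < SIM_PAGES) ? g_phys[pfn] : NULL;
}

/* Enter a PDE and a PTE so that 'linear' maps to page 'dataPfn'.
   Both entries get P, RW, and US set (0x7). Returns 1 on success. */
static int SimMap4K(uint32_t dirPfn, uint32_t tablePfn,
                    uint32_t linear, uint32_t dataPfn)
{
    uint32_t *dir   = SimPage(dirPfn);
    uint32_t *table = SimPage(tablePfn);

    if (dir == NULL || table == NULL || dataPfn >= SIM_PAGES) return 0;
    dir  [ linear >> 22           ] = (tablePfn << 12) | 0x7u;
    table[(linear >> 12) & 0x3FFu ] = (dataPfn  << 12) | 0x7u;
    return 1;
}

/* Walk page-directory and page-table. Returns 1 and stores the
   physical address on success, 0 if any level is not present. */
static int Translate4K(uint32_t cr3, uint32_t linear, uint32_t *physical)
{
    uint32_t  pdi    =  linear >> 22;            /* bits 31..22 */
    uint32_t  pti    = (linear >> 12) & 0x3FFu;  /* bits 21..12 */
    uint32_t  offset =  linear & 0xFFFu;         /* bits 11..0  */
    uint32_t *dir, *table, pde, pte;

    if ((dir = SimPage(cr3 >> 12)) == NULL) return 0;
    pde = dir[pdi];
    if (!(pde & 0x1u) || (pde & 0x80u)) return 0; /* absent or 4MB PDE */

    if ((table = SimPage(pde >> 12)) == NULL) return 0;
    pte = table[pti];
    if (!(pte & 0x1u)) return 0;                  /* page not present */

    *physical = (pte & 0xFFFFF000u) | offset;
    return 1;
}
```

After SimMap4K(1, 2, 0x00401ABC, 3), a call to Translate4K() with a CR3 value of 0x00001000 resolves the linear address 0x00401ABC to the physical address 0x00003ABC: PDI and PTI are both 1, and the offset 0xABC is carried over unchanged.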
Figure 3 Single-layered paging with 4MB pages.
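For comparison, the single-layered scheme of Figure 3 needs only the PDE itself. The following sketch, with the hypothetical function names PageDirectoryIndex() and Translate4M(), combines the 10-bit PFN of a 4MB PDE with the 22-bit offset of the linear address. The sample values are made up.

```c
#include <stdint.h>

/* The upper 10 bits of a linear address select the PDE. */
static uint32_t PageDirectoryIndex(uint32_t linear)
{
    return linear >> 22;
}

/* Translate a linear address through a 4MB PDE. Returns 1 and stores
   the physical address if the PDE is present (P = 1) and describes a
   4MB page (PS = 1); returns 0 otherwise. */
static int Translate4M(uint32_t pde, uint32_t linear, uint32_t *physical)
{
    if (!(pde & 0x01u) || !(pde & 0x80u)) return 0;

    *physical = (pde    & 0xFFC00000u)   /* PFN,    bits 31..22 */
              | (linear & 0x003FFFFFu);  /* offset, bits 21..0  */
    return 1;
}
```

With the PDE value 0x004001E3, the linear address 0x80123456 selects PDE number 512 and translates to the physical address 0x00523456.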
In Listing 3, I have added structural representations of linear addresses. These structures are formal definitions of the "Linear Address" boxes in Figures 2 and 3:
X86_LINEAR_4M is the format of linear addresses that point into a 4MB data page, as shown in Figure 3. The page-directory index PDI is an index into the page-directory currently addressed by the PDBR, selecting one of its PDEs. The 22-bit Offset member points to the target address within the corresponding 4MB physical page.
X86_LINEAR_4K is the 4KB variant of a linear address. As outlined in Figure 2, it is composed of three bit-fields. As in a 4MB address, the upper 10 PDI bits select a PDE. The page-table index PTI has a similar duty, pointing to a PTE inside the page-table addressed by this PDE. The remaining 12 bits are the offset into the resulting 4KB physical page.
X86_LINEAR is another convenience structure that simply unites X86_LINEAR_4M and X86_LINEAR_4K in a single data type.
Listing 3 i386 Linear Addresses.
// =================================================================
// INTEL X86 STRUCTURES, PART 3 OF 3
// =================================================================

typedef struct _X86_LINEAR_4M       // linear address (4-MB page)
{
    union
    {
        struct
        {
            PVOID pAddress;             // packed address
        };
        struct
        {
            unsigned Offset : 22;       // offset into page
            unsigned PDI    : 10;       // page-directory index
        };
    };
}
X86_LINEAR_4M, *PX86_LINEAR_4M, **PPX86_LINEAR_4M;

#define X86_LINEAR_4M_ sizeof (X86_LINEAR_4M)

// -----------------------------------------------------------------

typedef struct _X86_LINEAR_4K       // linear address (4-KB page)
{
    union
    {
        struct
        {
            PVOID pAddress;             // packed address
        };
        struct
        {
            unsigned Offset : 12;       // offset into page
            unsigned PTI    : 10;       // page-table index
            unsigned PDI    : 10;       // page-directory index
        };
    };
}
X86_LINEAR_4K, *PX86_LINEAR_4K, **PPX86_LINEAR_4K;

#define X86_LINEAR_4K_ sizeof (X86_LINEAR_4K)

// -----------------------------------------------------------------

typedef struct _X86_LINEAR          // general linear address
{
    union
    {
        PVOID         pAddress;         // packed address
        X86_LINEAR_4M linear4M;         // linear address (4-MB page)
        X86_LINEAR_4K linear4K;         // linear address (4-KB page)
    };
}
X86_LINEAR, *PX86_LINEAR, **PPX86_LINEAR;

#define X86_LINEAR_ sizeof (X86_LINEAR)

// =================================================================
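To see the union technique of Listing 3 at work outside of a 32-bit Windows build, here is a reduced variant that replaces PVOID with a fixed 32-bit integer. LINEAR32 and MakeLinear() are hypothetical names introduced for this sketch. It also assumes that the compiler allocates bit-fields starting at the least significant bit, as Microsoft Visual C/C++ and GCC do on Intel processors; the C language itself leaves this layout to the implementation.

```c
#include <stdint.h>

/* Reduced, hypothetical counterpart of X86_LINEAR with a 32-bit
   packed value instead of a pointer, so it stays 4 bytes in size
   even when compiled for a 64-bit target. */
typedef union
{
    uint32_t dAddress;              /* packed address */
    struct
    {
        unsigned Offset : 22;       /* offset into 4MB page */
        unsigned PDI    : 10;       /* page-directory index */
    } linear4M;
    struct
    {
        unsigned Offset : 12;       /* offset into 4KB page */
        unsigned PTI    : 10;       /* page-table index     */
        unsigned PDI    : 10;       /* page-directory index */
    } linear4K;
} LINEAR32;

/* Fails to compile if the bit-fields do not pack into one DWORD. */
_Static_assert(sizeof(LINEAR32) == 4, "LINEAR32 must be 32 bits wide");

/* Wrap a packed address so that its parts can be read by name. */
static LINEAR32 MakeLinear(uint32_t dAddress)
{
    LINEAR32 linear;
    linear.dAddress = dAddress;
    return linear;
}
```

Under the stated layout assumption, MakeLinear(0x80123456) gives linear4K.PDI = 0x200, linear4K.PTI = 0x123, and linear4K.Offset = 0x456, while the 4MB view of the same address yields linear4M.PDI = 0x200 and linear4M.Offset = 0x123456.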