Linux Kernel Architecture
Let’s begin this section by discussing the architecture of the Linux kernel, including responsibilities of the kernel, its organization and modules, services of the kernel, and process management.
Kernel Responsibilities
The kernel (also called the operating system) has two major responsibilities:
To interact with and control the system’s hardware components
To provide an environment in which applications can run
Some operating systems allow applications to access hardware components directly, although this capability is very uncommon nowadays. UNIX-like operating systems hide all the low-level hardware details from applications. If an application wants to make use of a hardware resource, it must make a request to the operating system. The operating system then evaluates the request and, if it is valid, interacts with the hardware component on behalf of the application. To enforce this kind of scheme, the operating system relies on hardware capabilities that prevent applications from interacting with the hardware directly.
Organization and Modules
Like many other UNIX-like operating systems, the Linux kernel is monolithic. This means that even though Linux is divided into subsystems that control various components of the system (such as memory management and process management), all of these subsystems are tightly integrated to form the whole kernel. In contrast, microkernel operating systems provide only bare, minimal functionality, and all other operating system layers are implemented on top of the microkernel as separate processes. Microkernel operating systems are generally slower because of the message passing between the various layers, but they can be extended very easily.
The Linux kernel can be extended by modules. A module is an object that can be linked into the kernel at runtime, which gives a monolithic kernel much of the extensibility of a microkernel without the performance penalty.
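To make this concrete, here is a minimal sketch of a loadable module; the file name, messages, and function names are invented for the example, but the init/exit entry points and the module_init/module_exit macros are the standard mechanism. The module does nothing more than announce itself when it is linked into the kernel and when it is removed:

    /* hello.c - a minimal loadable kernel module (illustrative sketch) */
    #include <linux/init.h>
    #include <linux/module.h>
    #include <linux/kernel.h>

    MODULE_LICENSE("GPL");

    /* Called when the module is linked into the running kernel (insmod). */
    static int __init hello_init(void)
    {
        printk(KERN_INFO "hello: module loaded\n");
        return 0;    /* a nonzero return here would abort the load */
    }

    /* Called when the module is removed from the kernel (rmmod). */
    static void __exit hello_exit(void)
    {
        printk(KERN_INFO "hello: module unloaded\n");
    }

    module_init(hello_init);
    module_exit(hello_exit);

Once compiled against the kernel's build tree, such a module is typically loaded with insmod and removed with rmmod, and lsmod lists the modules currently linked into the kernel.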
Using Kernel Services
The kernel provides a set of interfaces for applications running in user mode to interact with the system. These interfaces, also known as system calls, give applications access to hardware and other kernel resources. System calls not only provide applications with abstracted hardware, but also ensure security and stability.
Most applications do not use system calls directly. Instead, they are programmed to an application programming interface (API). It is important to note that an API does not have to correspond directly to a particular system call. APIs are provided as part of libraries for applications to make use of, and each API function is generally implemented through one or more system calls.
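For example, an application usually calls a library routine such as getpid(), and the C library issues the corresponding system call on its behalf. The short sketch below (an illustration, not part of any particular library) makes the same request both through the API and through the raw system call interface:

    /* api_vs_syscall.c - the same service reached through a library API
     * and through the raw system call interface (illustrative sketch). */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <unistd.h>        /* getpid() - the library API              */
    #include <sys/syscall.h>   /* SYS_getpid - the raw system call number */

    int main(void)
    {
        pid_t via_api = getpid();                    /* glibc wrapper      */
        pid_t via_raw = (pid_t) syscall(SYS_getpid); /* direct system call */

        printf("API getpid(): %d  raw syscall: %d\n", via_api, via_raw);
        return 0;
    }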
/proc File System—External Performance View
The /proc file system provides the user with a view of internal kernel data structures. It also lets you look at and change some of those data structures, thereby changing the kernel's behavior. The /proc file system provides an easy way to fine-tune system resources to improve the performance not only of applications but of the overall system.
/proc is a virtual file system that is created dynamically by the kernel to provide data. It is organized into various directories. Each of these directories corresponds to tunables for a given subsystem. Appendix A explains in detail how to use the /proc file system to fine-tune your system.
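As a small illustration of this view, the sketch below reads one such tunable, vm.swappiness, through its /proc entry; the particular tunable is chosen only as an example:

    /* read_swappiness.c - peek at one kernel tunable exposed through /proc
     * (sketch; the tunable chosen here is just an example). */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/sys/vm/swappiness", "r");
        int value;

        if (f == NULL) {
            perror("fopen");
            return 1;
        }
        if (fscanf(f, "%d", &value) == 1)
            printf("vm.swappiness = %d\n", value);
        fclose(f);
        return 0;
    }

Writing a new value into the same file (as root) changes the kernel's behavior immediately, which is the fine-tuning mechanism described above.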
Another essential of the Linux system is memory management. In the next section, we’ll cover five aspects of how Linux handles this management.
Memory Management
The various aspects of memory management in Linux include address space, physical memory, memory mapping, paging, and swapping.
Address Space
One of the advantages of virtual memory is that each process thinks it has all the address space it needs. The virtual memory can be many times larger than the physical memory in the system. Each process in the system has its own virtual address space. These virtual address spaces are completely separate from each other. A process running one application cannot affect another, and the applications are protected from each other. The virtual address space is mapped to physical memory by the operating system. From an application point of view, this address space is a flat linear address space. The kernel, however, treats the user virtual address space very differently.
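Before looking at how the address space is laid out, the separation between processes can be demonstrated with a short sketch: a parent and a child process print the same virtual address for a variable, yet each sees its own value, because the identical virtual addresses are mapped to different physical pages:

    /* separate_spaces.c - two processes see the same virtual address but
     * different contents, illustrating separate address spaces (sketch). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int value = 1;

        pid_t pid = fork();
        if (pid == 0) {       /* child */
            value = 2;        /* touches only the child's copy of the page */
            printf("child : &value=%p value=%d\n", (void *)&value, value);
            exit(0);
        }
        wait(NULL);
        printf("parent: &value=%p value=%d\n", (void *)&value, value);
        return 0;
    }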
The linear address space is divided into two parts: user address space and kernel address space. The user address space changes every time a context switch occurs, whereas the kernel address space remains constant. How much space is allocated to user space and to the kernel depends mainly on whether the system is a 32-bit or 64-bit architecture. For example, x86 is a 32-bit architecture and supports only a 4GB address space. Out of this 4GB, 3GB is reserved for user space and 1GB is reserved for the kernel. The location of the split is determined by the PAGE_OFFSET kernel configuration variable.
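For instance, on a 32-bit x86 kernel with the default 3GB/1GB split, PAGE_OFFSET is 0xC0000000, and every address an application can touch lies below that boundary. The sketch below simply prints a few user-space addresses so they can be compared with the boundary; it assumes the default configuration, and other splits move the boundary:

    /* split_check.c - print user-space addresses for comparison with the
     * default 32-bit x86 PAGE_OFFSET boundary (illustrative sketch). */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int on_stack;
        void *on_heap = malloc(16);

        printf("stack variable at %p\n", (void *)&on_stack);
        printf("heap block at     %p\n", on_heap);
        printf("PAGE_OFFSET       0xc0000000 (default 32-bit x86 split)\n");

        free(on_heap);
        return 0;
    }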
Physical Memory
Linux uses an architecture-independent way of describing physical memory in order to support various architectures.
Physical memory can be arranged into banks, with each bank being a particular distance from the processor. This type of memory arrangement is becoming very common, with more machines employing NUMA (Nonuniform Memory Access) technology. Linux VM represents this arrangement as a node. Each node is divided into a number of blocks called zones that represent ranges within memory. There are three different zones: ZONE_DMA, ZONE_NORMAL, and ZONE_HIGHMEM. For example, x86 has the following zones:
ZONE_DMA: First 16MB of memory
ZONE_NORMAL: 16MB to 896MB
ZONE_HIGHMEM: 896MB to end of memory
Each zone has its own use. Some of the legacy ISA devices have restrictions on where they can perform I/O from and to. ZONE_DMA addresses those requirements.
ZONE_NORMAL is used for all kernel operations and allocations. It is extremely crucial for system performance.
ZONE_HIGHMEM is the rest of the memory in the system. It's important to note that ZONE_HIGHMEM cannot be used for kernel allocations and data structures—it can only be used for user data.
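The zones configured on a running system can be inspected through the /proc interface described earlier. The sketch below prints the zone header lines from /proc/zoneinfo; the exact layout of that file varies between kernel versions, so treat this as an illustration:

    /* list_zones.c - print the memory zones the running kernel reports in
     * /proc/zoneinfo (sketch; output format differs across versions). */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        FILE *f = fopen("/proc/zoneinfo", "r");
        char line[256];

        if (f == NULL) {
            perror("fopen");
            return 1;
        }
        /* Zone headers look like: "Node 0, zone      DMA" */
        while (fgets(line, sizeof(line), f))
            if (strncmp(line, "Node", 4) == 0)
                fputs(line, stdout);
        fclose(f);
        return 0;
    }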
Memory Mapping
To see how kernel memory is mapped, we will use x86 as an example. As mentioned earlier, the kernel has only 1GB of virtual address space for its own use; the other 3GB is reserved for user space. The kernel maps the physical memory in ZONE_DMA and ZONE_NORMAL directly into its address space. This means that the first 896MB of physical memory in the system is mapped into the kernel's virtual address space, leaving only 128MB of kernel virtual address space. This 128MB of virtual space is used for operations such as vmalloc and kmap.
This mapping scheme works well as long as physical memory sizes are small (less than 1GB). However, these days, all servers support tens of gigabytes of memory. Intel has added PAE (Physical Address Extension) to its Pentium processors to support up to 64GB of physical memory. Because of the preceding memory mapping, handling physical memory in the tens of gigabytes is a major source of problems for x86 Linux. The Linux kernel handles high memory (all memory above 896MB) as follows: When the Linux kernel needs to address a page in high memory, it maps that page into a small virtual address space (kmap) window, operates on that page, and unmaps the page. The 64-bit architectures do not have this problem because their address space is huge.
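In kernel code, this map-operate-unmap pattern looks roughly like the sketch below; it is an illustration only, not a complete driver, and the function name is made up for the example:

    /* highmem_peek.c - kernel-side sketch of the map/operate/unmap pattern
     * used for high-memory pages (illustrative only). */
    #include <linux/mm.h>
    #include <linux/gfp.h>
    #include <linux/highmem.h>
    #include <linux/string.h>

    static void zero_highmem_page(void)
    {
        /* The page may come from ZONE_HIGHMEM on a 32-bit system. */
        struct page *page = alloc_page(GFP_HIGHUSER);
        void *vaddr;

        if (!page)
            return;

        vaddr = kmap(page);           /* map into the small kmap window     */
        memset(vaddr, 0, PAGE_SIZE);  /* operate through the temporary map  */
        kunmap(page);                 /* release the window for other users */

        __free_page(page);
    }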
Paging
Virtual memory is implemented in many ways, but the most effective way is hardware-based. Virtual address space is divided into fixed-size chunks called pages. Virtual memory references are translated into addresses in physical memory using page tables. To support various architectures and page sizes, Linux uses a three-level paging mechanism. The three types of page tables are as follows:
Page Global Directory (PGD)
Page Middle Directory (PMD)
Page Table (PTE)
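To make the translation concrete, the sketch below splits a 32-bit virtual address into table indices and a page offset. The bit widths are illustrative; they correspond to the classic non-PAE x86 layout, where the middle (PMD) level is folded away in practice, and the real widths depend on the architecture and page size:

    /* va_split.c - decompose a 32-bit virtual address into page-table
     * indices (illustrative bit widths). */
    #include <stdio.h>
    #include <stdint.h>

    #define OFFSET_BITS 12    /* 4KB pages                       */
    #define PTE_BITS    10    /* entries per page table          */

    int main(void)
    {
        uint32_t vaddr = 0xbfff1234;  /* an arbitrary user-space address */

        uint32_t offset = vaddr & ((1u << OFFSET_BITS) - 1);
        uint32_t pte    = (vaddr >> OFFSET_BITS) & ((1u << PTE_BITS) - 1);
        uint32_t pgd    = vaddr >> (OFFSET_BITS + PTE_BITS);

        printf("vaddr 0x%08x -> pgd index %u, pte index %u, offset 0x%03x\n",
               vaddr, pgd, pte, offset);
        return 0;
    }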
Address translation provides a way to separate the virtual address space of a process from the physical address space. Each page of virtual memory can be marked as present or not present in main memory. If a process references an address in virtual memory that is not present, the hardware generates a page fault, which the kernel handles by bringing the page into main memory. In this process, the system might have to replace an existing page to make room for the new one.
The replacement policy is one of the most critical aspects of the paging system. Linux 2.6 fixed various problems surrounding the page selection and replacement that were present in previous versions of Linux.
Swapping
Swapping is the moving of an entire process to and from secondary storage when main memory is low. Many modern operating systems, including Linux, do not use this approach, mainly because context switches are very expensive; instead, they use paging. In Linux, swapping is performed at the page level rather than at the process level. The main advantage of swapping is that it expands the address space usable by a process. As the kernel needs to free up memory to make room for new pages, it may need to discard some of the less frequently used or unused pages. Pages that are backed by files on disk can simply be dropped and read back from the file when they are needed again. Pages that are not backed by a file, however, cannot be freed so easily: they have to be copied to a backing store (the swap area) and read back from it when needed. One major disadvantage of swapping is speed. Generally, disks are very slow, so swapping should be eliminated whenever possible.
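The amount of swap space in use can be checked through the sysinfo() system call (or by reading /proc/meminfo); the sketch below reports the totals:

    /* swap_usage.c - report total and free swap space using the sysinfo()
     * system call (illustrative sketch). */
    #include <stdio.h>
    #include <sys/sysinfo.h>

    int main(void)
    {
        struct sysinfo si;

        if (sysinfo(&si) != 0) {
            perror("sysinfo");
            return 1;
        }

        /* Field values are in units of si.mem_unit bytes. */
        unsigned long long unit  = si.mem_unit ? si.mem_unit : 1;
        unsigned long long total = (unsigned long long)si.totalswap * unit;
        unsigned long long avail = (unsigned long long)si.freeswap * unit;

        printf("swap total: %llu MB, swap free: %llu MB\n",
               total >> 20, avail >> 20);
        return 0;
    }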