seL4: A Security-Focused Microkernel
One of the nice things about being in Cambridge is the number of interesting talks. Over the summer, a lot of visitors from NICTA spoke about seL4, which is one of the more exciting operating system projects that I've heard about in a while.
The L4 Legacy
If you've paid any attention to operating system design in the last decade or two, you've probably heard about L4. Like its predecessor L3, L4 aimed at fixing the design shortfalls in Mach.
The one thing that everyone knows about Mach is that it was slow. Most of the overhead in Mach message-passing involved the kernel checking that the sender had the correct rights. In L4, this was determined to be the message receiver's job. This distinction is important because in most cases anyone who has a handle for the port is allowed to send messages, so the time spent doing the checks in Mach was largely wasted. Mach port-rights checks are similar to filesystem access checks, but they're implemented on every message send. Most things that would be system calls on a monolithic kernel are message sends (often pairs of message sends) in Mach, so this bottleneck was significant.
Message sending in Mach also had other disadvantages. It was asynchronous, so performing the equivalent of a system call involved sending a message containing the handle for a return port—and then waiting. This potentially required waiting for the process receiving the message (that is, the one implementing a filesystem or block device) to be scheduled, which caused terrible cache performance. In fact, the microkernel itself was so big that it often didn't fit into the cache on machines in the 1990s.
L4 is closer to the microkernel philosophy, implementing just address spaces, threads, a timer (required for scheduling), and a synchronous IPC mechanism. The synchronous IPC is quite important, because it means that the message typically stays in cache between the sender and receiver. It's also more efficient to implement asynchronous IPC on top of synchronous IPC than the other way around.
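For concreteness, here's a sketch of the receiving side of such a synchronous rendezvous, borrowing names from seL4 (discussed below); the exact call signatures vary between kernel versions and configurations. The server blocks until a client sends on the endpoint, and the reply unblocks that client directly, so the message never sits in a kernel buffer.

#include <sel4/sel4.h>

/* Sketch of a server loop on a synchronous endpoint. seL4_Recv blocks until
 * some client performs a send or call on the endpoint; seL4_Reply unblocks
 * that client, so the message hops directly between the two parties. */
void serve_forever(seL4_CPtr endpoint)
{
    for (;;) {
        seL4_Word badge = 0;
        seL4_MessageInfo_t request = seL4_Recv(endpoint, &badge);

        /* ... handle the request held in the message registers ... */
        (void)request;
        (void)badge;

        seL4_MessageInfo_t reply = seL4_MessageInfo_new(0, 0, 0, 0);
        seL4_Reply(reply);
    }
}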
Adding Capabilities
The seL4 microkernel adds a capability model to an L4-inspired design. Having been exposed to the Cambridge security group's capability-groupthink, I'm somewhat biased in favor of this idea. A capability is an unforgeable token of authority, granting access to some resource. Capabilities are stored in a capability table that the microkernel maintains for each address space (the seL4 equivalent of a process).
In seL4, capabilities are used to refer to blocks of memory. You can delegate access to parts of your address space simply by passing the capability to another process. You can also implement nested processes by providing them with only the capabilities to talk to you. Memory in seL4 is typed. For example, some memory regions hold capabilities; to preserve the unforgeability guarantee, they can be modified only by the microkernel. On boot, the kernel reserves some memory for itself and then delegates the remainder to the initial process as an untyped memory (UM) capability.
Untyped memory can be subdivided into power-of-two-sized slabs, and it can also be retyped explicitly. To create a new thread, for example, you refine a UM capability down to the size of a thread-control block and then set its type.
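Here is a minimal sketch of that retyping step, using the seL4_Untyped_Retype invocation from the public seL4 API; the slot numbers are placeholders that would normally come from the boot information or your own CSpace bookkeeping, and argument conventions have shifted a little between seL4 releases.

#include <sel4/sel4.h>

/* Sketch: carve a thread-control block out of an untyped memory (UM)
 * capability. untyped_slot and dest_slot are placeholder slot numbers. */
seL4_Error make_tcb(seL4_CPtr untyped_slot, seL4_Word dest_slot)
{
    return seL4_Untyped_Retype(
        untyped_slot,             /* UM capability being refined              */
        seL4_TCBObject,           /* new object type: thread-control block    */
        0,                        /* size_bits: used only for variable-sized
                                     objects such as smaller untypeds         */
        seL4_CapInitThreadCNode,  /* CNode to receive the new capability      */
        0, 0,                     /* index/depth locating that CNode          */
        dest_slot,                /* empty slot for the new TCB capability    */
        1);                       /* create one object                        */
}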
Memory can also be used for storing a page-table abstraction, for IPC endpoints, and for interrupts. This design gives you everything you need to implement processes: You can receive timer interrupts for scheduling, set up virtual-to-physical address mappings, and create threads. On a system with memory-mapped I/O, granting access to a memory region containing I/O pages also grants access to that device, so you can use this mechanism to implement device drivers in userspace. All communication with device drivers and other operating system services happens via the generic IPC mechanism, so the operations that would be UNIX system calls on a monolithic kernel are issued through a mechanism that allows interposition.
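As a rough illustration of what such a "system call" looks like from the client's side, the following sketch sends a read request to a hypothetical file server over a synchronous endpoint. The FS_READ label, the argument layout, and the fs_endpoint capability are all inventions of this example; seL4 leaves the protocol layered on top of IPC entirely to userspace.

#include <sel4/sel4.h>

/* Sketch: a "read" request to a hypothetical file server. */
#define FS_READ 1

seL4_Word fs_read(seL4_CPtr fs_endpoint, seL4_Word file_id, seL4_Word offset)
{
    seL4_MessageInfo_t msg = seL4_MessageInfo_new(FS_READ, 0, 0, 2);
    seL4_SetMR(0, file_id);
    seL4_SetMR(1, offset);

    /* seL4_Call blocks until the server has received the message and
     * replied: the synchronous rendezvous described earlier. */
    seL4_MessageInfo_t reply = seL4_Call(fs_endpoint, msg);
    (void)reply;                 /* the reply label could carry an error code */

    return seL4_GetMR(0);        /* the server's result, by convention */
}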
If you create a subprocess, you can choose to grant direct access to some operating system services (for example, network or filesystem) just by passing your capabilities for these services. Alternatively, you can pass the subprocess a capability for an IPC endpoint that you control, so that you can add some extra access control. This approach makes it very easy to implement sandboxes. For example, something like Chrome running on seL4 would create a subprocess for each tab that had access only to a password store guarded by explicit access-control checks, access to a small part of the filesystem for caching, and very little else. That tab would then create sandboxes for the JavaScript VM, the PNG and JPEG decoders, and so on. Compromising the entire browser would require escaping from multiple levels of sandboxing.
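One plausible way to set up such a sandbox boundary, sketched below, is to mint a badged copy of an endpoint you control into the child's capability space; every message the child sends then arrives tagged with its badge, so the parent can identify the caller and apply its own access-control checks. The slot arguments are placeholders, and the rights and badge argument types differ slightly between seL4 versions.

#include <sel4/sel4.h>

/* Sketch: copy a badged version of an endpoint we control into a child's
 * capability space. Every message the child sends on its copy arrives with
 * SANDBOX_BADGE attached. */
#define SANDBOX_BADGE 42

seL4_Error delegate_endpoint(seL4_CNode child_cnode, seL4_Word child_slot,
                             seL4_CNode our_cnode,   seL4_Word endpoint_slot)
{
    return seL4_CNode_Mint(
        child_cnode, child_slot, seL4_WordBits,   /* destination slot         */
        our_cnode, endpoint_slot, seL4_WordBits,  /* source endpoint cap      */
        seL4_AllRights,      /* a real sandbox would restrict these rights    */
        SANDBOX_BADGE);      /* badge identifying this particular child       */
}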
Formal Verification
The most unusual thing about seL4 is that it's intended to be formally verified. The initial design is an executable specification written in Haskell. This is then the subject of two translations: one into C, and the other into a format understood by an automated theorem prover. The team then has to prove that these two forms are equivalent, which is possible only because they chose a restricted subset of C to use.
One interesting result of this project, even for those who never use seL4 in any way, is a price tag for formally verifying a system. It's much cheaper to design a system for formal verification from the start than to try to verify an existing system after the fact. The numbers produced by NICTA indicate that it costs around 25 times as much to develop a system this way as to develop one with no verification, testing, or QA. The proofs are done largely by a different team than the one doing the implementation, so this is one case in which adding extra manpower can result in a speedup.
However, one question always arises when you begin proving a system: "What properties are you going to prove?" If you asked the seL4 developers whether they had a test suite, their answer would quickly tell you whether you were talking to someone on the specification and proof side, or on the implementation side. Theorists tend to reply, "Of course not. We proved it was correct," whereas implementers tend to say, "No, but I wish we had…."
That's not to say that formal verification is useless. It certainly proved some properties of the system, and in some cases probably saved a lot of effort. It's important not to see formal verification as a panacea, however; it can only prove properties that you can conveniently express mathematically. For example, it can be used to prove certain real-time properties, such as a maximum interrupt latency. But it can't answer questions like, "Is this system secure?" or "Is it easy to write complex applications on top of this system?" Formulating those questions correctly is too difficult.
Interrupts Disabled
A lot of work has been done recently on mainstream operating systems to allow them to run with interrupts enabled for most of their runtime. This is considered very important because long delays before servicing interrupts can affect system responsiveness. If you have interrupts disabled while doing some long-running task, you won't handle any events (new network packets, keypresses, and so on) until it's finished.
However, the cost of running with interrupts enabled is increased complexity in the kernel. Any time an interrupt arrives, control passes immediately to the interrupt handler, which may handle the event directly, or simply add it to a queue and resume normal scheduling. Any code that runs with interrupts enabled must be able to handle suddenly being suspended for a short period while the interrupt is handled, which makes control flow very difficult to predict: Any line of code (or, indeed, any instruction) may immediately be followed by a jump into the interrupt handler. Typically, kernels protect certain critical regions with a lightweight spinlock that also disables interrupts.
The seL4 kernel makes the opposite decision: It runs with interrupts disabled most of the time. This approach has the obvious advantage of making control flow easier to reason about, but with the disadvantage of potentially increasing interrupt latency to an unusable level. To avoid this situation, every path through the kernel has to be verified as having a bounded maximum length.
To implement this rule, every long-running or (more importantly) nondeterministic-length path through the kernel must be implemented in a continuation-passing style. The microkernel does part of the work, parcels the rest into something that allows it to resume later, and returns with an error code indicating that the caller should try again. The kernel can then poll for interrupts and deliver them between these short system calls. Interrupts can happen in only a very small number of places, at the expense of potentially requiring users to restart some system calls that would be longer-running on another system.
The longest-running system call in seL4 is probably the capability-revoke call. This call must inspect the capability table of every process, so its runtime scales with the number of capability tables in the system. Other examples that may need restarting include closing an IPC endpoint, which must perform several steps, such as putting the sending end into an invalid state, cleaning up the buffer, and so on.
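From a caller's point of view, the pattern looks roughly like the following self-contained sketch. The names E_PREEMPTED and revoke_some are invented for illustration and are not part of the seL4 API; the point is only that each invocation does a short, bounded amount of work, and pending interrupts can be delivered between invocations.

#include <stdio.h>

/* Conceptual sketch of the restart pattern described above. */
#define E_OK        0
#define E_PREEMPTED 1

static int entries_left = 5;     /* stand-in for capability-table entries */

static int revoke_some(void)
{
    if (entries_left == 0)
        return E_OK;             /* nothing left: the revoke is complete  */
    entries_left--;              /* one bounded step, "interrupts off"    */
    return E_PREEMPTED;          /* caller should try again               */
}

int main(void)
{
    while (revoke_some() == E_PREEMPTED) {
        /* Between attempts, the kernel can poll for and deliver interrupts. */
    }
    puts("revocation complete");
    return 0;
}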
What's the Point?
You almost certainly won't be seeing seL4 as your desktop operating system any time soon, or even as a competitor to Android or iOS. But it's very interesting in one situation: Modern mobile phones come with an ARM feature called TrustZone, which allows a secure kernel to sit under the main kernel in a completely protected state. L4 is already quite common there, and it would be a very sensible deployment target for seL4. You'd still run your more traditional operating system in the main ARM environment, but seL4 in TrustZone would handle privileged operations, such as communicating with your bank or running the radio stack.
The aim of seL4 is to see how small the trusted part of a system can be, and whether it's possible to create a fully verified implementation of it. Of course, a microkernel is only part of the trusted computing base for a typical system; there are also drivers, network and storage stacks, and other complex components. But having a trustworthy microkernel and an IOMMU limits the damage that bugs in these components can do.