4.3 Context Switching
The kernel switches among threads in an effort to share the CPU effectively; this activity is called context switching. When a thread executes for the duration of its time slice or when it blocks because it requires a resource that is currently unavailable, the kernel finds another thread to run and context switches to it. The system can also interrupt the currently executing thread to run a thread triggered by an asynchronous event, such as a device interrupt. Although both scenarios involve switching the execution context of the CPU, switching between threads occurs synchronously with respect to the currently executing thread, whereas servicing interrupts occurs asynchronously with respect to the current thread. In addition, interprocess context switches are classified as voluntary or involuntary. A voluntary context switch occurs when a thread blocks because it requires a resource that is unavailable. An involuntary context switch takes place when a thread executes for the duration of its time slice or when the system identifies a higher-priority thread to run.
Each type of context switching is done through a different interface. Voluntary context switching is initiated with a call to the sleep() routine, whereas an involuntary context switch is forced by direct invocation of the low-level context-switching mechanism embodied in the mi_switch() and setrunnable() routines. Asynchronous event handling is triggered by the underlying hardware and is effectively transparent to the system. Our discussion will focus on how asynchronous event handling relates to synchronizing access to kernel data structures.
Thread State
Context switching between threads requires that both the kernel- and user-mode context be changed. To simplify this change, the system ensures that all a thread's user-mode state is located in one data structure: the thread structure (most kernel state is kept elsewhere). The following conventions apply to this localization:
- Kernel-mode hardware-execution state: Context switching can take place only in kernel mode. The kernel's hardware-execution state is defined by the contents of the thread control block (TCB) that is located in the thread structure.
- User-mode hardware-execution state: When execution is in kernel mode, the user-mode state of a thread (such as copies of the program counter, stack pointer, and general registers) always resides on the kernel's execution stack that is located in the thread structure. The kernel ensures this location of user-mode state by requiring that the system-call and trap handlers save the contents of the user-mode execution context each time that the kernel is entered (see Section 3.1).
- The process structure: The process structure always remains resident in memory.
- Memory resources: Memory resources of a process are effectively described by the contents of the memory-management registers located in the TCB and by the values present in the process and thread structures. As long as the process remains in memory, these values will remain valid, and context switches can be done without the associated page tables being saved and restored. However, these values need to be recalculated when the process returns to main memory after being swapped to secondary storage.
Low-Level Context Switching
The localization of the context of a process in its thread structure permits the kernel to do context switching simply by changing the notion of the current thread structure and (if necessary) process structure, and restoring the context described by the TCB within the thread structure (including the mapping of the virtual address space). Whenever a context switch is required, a call to the mi_switch() routine causes the highest-priority thread to run. The mi_switch() routine first selects the appropriate thread from the scheduling queues, and then resumes the selected thread by loading its context from its TCB. Once mi_switch() has loaded the execution state of the new thread, it must also check the state of the new thread for a nonlocal return request (such as when a process first starts execution after a fork; see Section 4.5).
Voluntary Context Switching
A voluntary context switch occurs whenever a thread must await the availability of a resource or the arrival of an event. Voluntary context switches happen frequently in normal system operation. For example, a thread typically blocks each time that it requests data from an input device, such as a terminal or a disk. In FreeBSD, voluntary context switches are initiated through the sleep() routine. When a thread no longer needs the CPU, it invokes sleep() with a scheduling priority and a wait channel. The priority specified in a sleep() call is the priority that should be assigned to the thread when that thread is awakened. This priority does not affect the user-level scheduling priority.
The wait channel is typically the address of some data structure that identifies the resource or event for which the thread is waiting. For example, the address of a disk buffer is used while the thread is waiting for the buffer to be filled. When the buffer is filled, threads sleeping on that wait channel will be awakened. In addition to the resource addresses that are used as wait channels, there are some addresses that are used for special purposes:
- The global variable lbolt is awakened by the scheduler once per second. Threads that want to wait for up to 1 second can sleep on this global variable. For example, the terminal-output routines sleep on lbolt while waiting for output-queue space to become available. Because queue space rarely runs out, it is easier simply to check for queue space once per second during the brief periods of shortages than it is to set up a notification mechanism such as that used for managing disk buffers. Programmers can also use the lbolt wait channel as a crude watchdog timer when doing debugging.
- When a parent process does a wait system call to collect the termination status of its children, it must wait for one of those children to exit. Since it cannot know which of its children will exit first, and since it can sleep on only a single wait channel, there is a quandary on how to wait for the next of multiple events. The solution is to have the parent sleep on its own process structure. When a child exits, it awakens its parent's process-structure address rather than its own. Thus, the parent doing the wait will awaken independent of which child process is the first to exit. Once running, it must scan its list of children to determine which one exited.
- When a thread does a sigpause system call, it does not want to run until it receives a signal. Thus, it needs to do an interruptible sleep on a wait channel that will never be awakened. By convention, the address of the user structure is given as the wait channel.
Sleeping threads are organized in an array of queues (see Figure 4.3). The sleep() and wakeup() routines hash wait channels to calculate an index into the sleep queues. The sleep() routine takes the following steps in its operation:
1. Prevent interrupts that might cause thread-state transitions by acquiring the sched_lock mutex (mutexes are explained in the next section).
2. Record the wait channel in the thread structure and hash the wait-channel value to locate a sleep queue for the thread.
3. Set the thread's priority to the priority that the thread will have when the thread is awakened and set the SLEEPING flag.
4. Place the thread at the end of the sleep queue selected in step 2.
5. Call mi_switch() to request that a new thread be scheduled; the sched_lock mutex is released as part of switching to the other thread.
Figure 4.3 Queueing structure for sleeping threads.
A sleeping thread is not selected to execute until it is removed from a sleep queue and is marked runnable. This operation is done by the wakeup() routine, which is called to signal that an event has occurred or that a resource is available. Wakeup() is invoked with a wait channel, and it awakens all threads sleeping on that wait channel. All threads waiting for the resource are awakened to ensure that none are inadvertently left sleeping. If only one thread were awakened, it might not request the resource on which it was sleeping, and any other threads waiting for that resource would be left sleeping forever. A thread that needs an empty disk buffer in which to write data is an example of a thread that may not request the resource on which it was sleeping. Such a thread can use any available buffer. If none is available, it will try to create one by requesting that a dirty buffer be written to disk and then waiting for the I/O to complete. When the I/O finishes, the thread will awaken and will check for an empty buffer. If several are available, it may not use the one that it cleaned; had only that one thread been awakened, the other threads waiting for the buffer that it cleaned would have been left sleeping forever.
In instances where a thread will always use a resource when it becomes available, wakeup_one() can be used instead of wakeup(). The wakeup_one() routine wakes up only the first thread that it finds waiting for a resource. The assumption is that when the awakened thread is done with the resource it will issue another wakeup_one() to notify the next waiting thread that the resource is available. The succession of wakeup_one() calls will continue until all threads waiting for the resource have been awakened and had a chance to use it.
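The sleep() routine described here corresponds to the tsleep()/msleep() family of functions in the FreeBSD kernel. The sketch below shows the conventional wait loop under the assumption of a hypothetical driver structure with its own mutex and an sc_ready flag; only the wait-channel convention and the msleep()/wakeup() calls reflect the real interface.

```c
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>

struct softc {
    struct mtx sc_mtx;      /* protects sc_ready */
    int        sc_ready;    /* event state; its address is the wait channel */
};

/* Consumer: sleep on &sc->sc_ready until the event occurs. */
static void
wait_for_event(struct softc *sc)
{
    mtx_lock(&sc->sc_mtx);
    while (sc->sc_ready == 0) {
        /* Priority 0 leaves the thread's priority unchanged; a sleep
           priority would normally be supplied here. */
        msleep(&sc->sc_ready, &sc->sc_mtx, 0, "scwait", 0);
    }
    sc->sc_ready = 0;
    mtx_unlock(&sc->sc_mtx);
}

/* Producer (for example, an interrupt thread): post the event. */
static void
post_event(struct softc *sc)
{
    mtx_lock(&sc->sc_mtx);
    sc->sc_ready = 1;
    wakeup(&sc->sc_ready);  /* awakens every thread sleeping on this channel */
    mtx_unlock(&sc->sc_mtx);
}
```

Because wakeup() awakens every thread sleeping on the channel, the consumer rechecks its condition in a loop before proceeding; wakeup_one() could be substituted when each awakened thread is certain to consume the resource.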
To avoid having excessive numbers of threads awakened, kernel programmers try to use wait channels with fine enough granularity that unrelated uses will not collide on the same resource. Thus, they put locks on each buffer in the buffer cache rather than putting a single lock on the buffer cache as a whole. The problem of many threads awakening for a single resource is further mitigated on a uniprocessor by the latter's inherently single-threaded operation. Although many threads will be put into the run queue at once, only one at a time can execute. Since the uniprocessor kernel runs nonpreemptively, each thread will run its system call to completion before the next one will get a chance to execute. Unless the previous user of the resource blocked in the kernel while trying to use the resource, each thread waiting for the resource will be able to get and use the resource when it is next run.
A wakeup() operation processes entries on a sleep queue from front to back. For each thread that needs to be awakened, wakeup() does the following:
- Removes the thread from the sleep queue.
- Recomputes the user-mode scheduling priority if the thread has been sleeping longer than 1 second.
- Makes the thread runnable if it is in a SLEEPING state and places the thread on the run queue if its process is not swapped out of main memory. If the process has been swapped out, the swapin process will be awakened to load it back into memory (see Section 5.12); if the thread is in a STOPPED state, it is not put on a run queue until it is explicitly restarted by a user-level process, either by a ptrace system call (see Section 4.9) or by a continue signal (see Section 4.7).
If wakeup() moved any threads to the run queue and one of them had a scheduling priority higher than that of the currently executing thread, it will also request that the CPU be rescheduled as soon as possible.
Synchronization
Historically, BSD systems ran only on uniprocessor architectures. Once a process began running in the top half of the kernel, it would run to completion or until it needed to sleep waiting for a resource to become available. The only contention for data structure access occurred at the points at which it slept or during an interrupt that needed to update a shared data structure. Synchronization with other processes was handled by ensuring that all shared data structures were consistent before going to sleep. Synchronization with interrupts was handled by raising the processor priority level to guard against interrupt activity while the shared data structure was manipulated.
Simple multiprocessor support was added to FreeBSD 3.0. It worked by creating a giant lock that protected the kernel. When a process entered the kernel it would have to acquire the giant lock before it could proceed. Thus, at most one processor could run in the kernel at a time. All the other processors could run only processes executing in user mode.
Symmetric multiprocessing (SMP) first appeared in FreeBSD 5.0 and required the addition of new synchronization schemes to eliminate the uniprocessor assumptions implicit in the FreeBSD 4.0 kernel [Schimmel, 1994]. Some subsystems in the FreeBSD 5.2 kernel have not yet been converted to symmetric multiprocessing and are still protected by the giant lock. However, most of the heavily used parts of the kernel have been moved out from under the giant lock, including much of the virtual memory system, the networking stack, and the filesystem.
Table 4.3 shows the hierarchy of locking that is necessary to support symmetric multiprocessing. The column labeled sleep in Table 4.3 shows whether a lock of that type may be held when a thread goes to sleep. At the lowest level, the hardware must provide a memory-interlocked test-and-set instruction. The test-and-set instruction must allow two operations to be done on a main-memory location (the reading of the existing value followed by the writing of a new value) without any other processor being able to read or write that memory location between the two memory operations. Some architectures support more complex versions of the test-and-set instruction. For example, the PC provides a memory-interlocked compare-and-swap instruction.
Table 4.3. Locking hierarchy.
| Level | Type | Sleep | Description |
|---|---|---|---|
| Lowest | hardware | no | memory-interlocked test-and-set |
| | spin mutex | no | spin lock |
| | sleep mutex | no | spin for a while, then sleep |
| | lock manager | yes | sleep lock |
| Highest | witness | yes | partially ordered sleep locks |
Spin locks are built from the hardware test-and-set instruction. A memory location is reserved for the lock, with a value of zero showing that the lock is free and a value of one showing that the lock is held. The test-and-set instruction tries to acquire the lock. The lock value is tested and the lock unconditionally set to one. If the tested value is zero, then the lock was successfully acquired and the thread may proceed. If the value is one, then some other thread held the lock, so the thread must loop doing the test-and-set until the thread holding the lock (and running on a different processor) stores a zero into the lock to show that it is done with it. Spin locks are used for locks that are held only briefly, for example, to protect a list while adding or removing an entry from it.
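As a concrete illustration, here is a minimal spin lock built from a test-and-set primitive. It is written with the portable C11 atomic operations rather than the kernel's machine-dependent ones, and it omits the interrupt blocking and witness bookkeeping that real kernel spin locks perform.

```c
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;     /* clear = free, set = held */

static void
spin_lock(atomic_flag *l)
{
    /* Test-and-set: atomically read the old value and set the flag. */
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;       /* lock already held on another CPU; keep spinning */
}

static void
spin_unlock(atomic_flag *l)
{
    /* Clear the flag (store a zero) to show that the holder is done with it. */
    atomic_flag_clear_explicit(l, memory_order_release);
}
```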
It is wasteful of CPU cycles to use spin locks for resources that will be held for long periods of time. For example, a spin lock would be inappropriate for a disk buffer that would need to be locked throughout the time that a disk I/O was being done. Here a sleep lock should be used. When a thread trying to acquire a sleep-type lock finds that the lock is held, it is put to sleep so that other threads can run until the lock becomes available.
The time to acquire a lock can be variable, for example, a lock needed to search for and remove an item from a list. The list usually will be short, for which a spin lock would be appropriate, but it will occasionally grow long. Here a hybrid lock is used; the lock spins for a while, but if it is unsuccessful after a specified number of attempts, it reverts to a sleep-type lock. Most architectures require 100 to 200 instructions to put a thread to sleep and then awaken it again. A spin lock that can be acquired in less than this time is going to be more efficient than a sleep lock. The hybrid lock is usually set to try for about half this time before reverting to a sleep lock.
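The spin-then-sleep behavior can be sketched at user level as follows, using C11 atomics for the lock word and a POSIX condition variable as a stand-in for the kernel's sleep mechanism; SPIN_LIMIT and all the names are illustrative assumptions.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <pthread.h>

#define SPIN_LIMIT 128  /* tuned so spinning costs roughly half a sleep/wakeup */

struct hybrid_lock {
    atomic_bool     locked; /* true while some thread holds the lock */
    pthread_mutex_t mtx;    /* serializes sleepers with the releaser */
    pthread_cond_t  cv;     /* where unsuccessful spinners sleep */
};

static void
hybrid_init(struct hybrid_lock *l)
{
    atomic_init(&l->locked, false);
    pthread_mutex_init(&l->mtx, NULL);
    pthread_cond_init(&l->cv, NULL);
}

static void
hybrid_lock(struct hybrid_lock *l)
{
    bool expected;

    /* Spin briefly in the hope that the holder releases the lock soon. */
    for (int i = 0; i < SPIN_LIMIT; i++) {
        expected = false;
        if (atomic_compare_exchange_weak(&l->locked, &expected, true))
            return;         /* acquired while spinning */
    }
    /* Spinning was unsuccessful; revert to sleeping. */
    pthread_mutex_lock(&l->mtx);
    for (;;) {
        expected = false;
        if (atomic_compare_exchange_weak(&l->locked, &expected, true))
            break;
        pthread_cond_wait(&l->cv, &l->mtx);
    }
    pthread_mutex_unlock(&l->mtx);
}

static void
hybrid_unlock(struct hybrid_lock *l)
{
    atomic_store(&l->locked, false);
    pthread_mutex_lock(&l->mtx);
    pthread_cond_signal(&l->cv);    /* wake one sleeper, if any */
    pthread_mutex_unlock(&l->mtx);
}
```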
Spin locks are never appropriate on a uniprocessor, since the only way that a resource held by another thread will ever be released will be when that thread gets to run. Thus, spin locks are always converted to sleep locks when running on a uniprocessor.
The highest-level locking prevents threads from deadlocking when locking multiple resources. Suppose that two threads, A and B, require exclusive access to two resources, R1 and R2, to do some operation as shown in Figure 4.4. If thread A acquires R1 and thread B acquires R2, then a deadlock occurs when thread A tries to acquire R2 and thread B tries to acquire R1. To avoid deadlock, FreeBSD 5.2 maintains a partial ordering on all the locks. The two partial-ordering rules are as follows:
1. A thread may acquire only one lock in each class.
2. A thread may acquire only a lock in a higher-numbered class than the highest-numbered class for which it already holds a lock.
Figure 4.4 Partial ordering of resources.
In Figure 4.4 thread A holds R1 and can request R2 as R1 and R2 are in different classes and R2 is in a higher-numbered class than R1. However, thread B must release R2 before requesting R1, since R2 is in a higher class than R1. Thus, thread A will be able to acquire R2 when it is released by thread B. After thread A completes and releases R1 and R2, thread B will be able to acquire both of those locks and run to completion without deadlock.
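Expressed in code, the rule simply means that every thread acquires the two locks in the same class order. The fragment below uses the mutex interface described in the next subsection, with hypothetical locks r1_mtx and r2_mtx standing in for R1 and R2.

```c
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

/* Hypothetical locks for R1 and R2, initialized elsewhere with mtx_init(). */
static struct mtx r1_mtx, r2_mtx;

static void
use_both_resources(void)
{
    /* Both thread A and thread B honor the order: R1's class before R2's. */
    mtx_lock(&r1_mtx);
    mtx_lock(&r2_mtx);
    /* ... operate on R1 and R2 ... */
    mtx_unlock(&r2_mtx);
    mtx_unlock(&r1_mtx);
}
```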
Historically, the class members and ordering were poorly documented and unenforced. Violations were discovered when threads deadlocked, and a careful analysis was then done to figure out what ordering had been violated. With an increasing number of developers and a growing kernel, this ad hoc method of maintaining the partial ordering of locks became untenable. A witness module was added to the kernel to derive and enforce the partial ordering of the locks. The witness module keeps track of the locks acquired and released by each thread. It also keeps track of the order in which locks are acquired relative to each other. Each time a lock is acquired, the witness module uses these two lists to verify that a lock is not being acquired in the wrong order. If a lock-order violation is detected, then a message is output to the console detailing the locks involved and the locations in question. The witness module also verifies that no spin locks are held when requesting a sleep lock or voluntarily going to sleep.
The witness module can be configured to either panic or drop into the kernel debugger when an order violation occurs or some other witness check fails. When running the debugger, the witness module can output the list of locks held by the current thread to the console along with the filename and line number at which each lock was last acquired. It can also dump the current order list to the console. The code first displays the lock order tree for all the sleep locks. Then it displays the lock order tree for all the spin locks. Finally, it displays a list of locks that have not yet been acquired.
Mutex Synchronization
Mutexes are the primary method of thread synchronization. The major design considerations for mutexes are
- Acquiring and releasing uncontested mutexes should be as cheap as possible.
- Mutexes must have the information and storage space to support priority propagation.
- A thread must be able to recursively acquire a mutex if the mutex is initialized to support recursion.
There are currently two flavors of mutexes: those that sleep and those that do not. By default, mutexes will sleep when they are already held. As a machine-dependent optimization, they may spin for some amount of time before sleeping. Most kernel code uses the default lock type, which allows the thread to be disconnected from the CPU if it cannot get the lock. The machine-dependent implementation may treat the lock as a short-term spin lock under some circumstances. However, it is always safe to use these forms of locks in an interrupt thread without fear of deadlock against an interrupted thread on the same CPU. If a thread interrupted the thread that held a mutex and then tried to acquire the mutex, it would be put to sleep until the thread holding the mutex ran to completion and released the mutex.
Mutexes that do not sleep are called spin mutexes. A spin mutex will not relinquish the CPU when it cannot immediately get the requested lock, but it will loop, waiting for the mutex to be released by another CPU. This spinning can result in deadlock if a thread interrupted the thread that held a mutex and then tried to acquire the mutex. To protect an interrupt service routine from blocking against itself all interrupts are blocked on the local processor while holding a spin lock. Thus, the interrupt can run only on another processor during the period that the mutex is held.
Spin locks are specialized locks that are intended to be held for short periods of time; their primary purpose is to protect portions of the code that implement default (i.e., sleep) locks. These mutexes should only be used to protect data shared with devices that require nonpreemptive interrupts and with low-level scheduling code. On most architectures both acquiring and releasing an uncontested spin mutex is more expensive than the same operation on a nonspin mutex. It is permissible to hold multiple spin mutexes, but it is required that they be released in the opposite order to that in which they were acquired. It is not permissible to go to sleep while holding a spin mutex.
The mtx_init() function must be used to initialize a mutex before it can be passed to mtx_lock(). The mtx_init() function specifies a type that the witness code uses to classify a mutex when doing checks of lock ordering. It is not permissible to pass the same mutex to mtx_init() multiple times without intervening calls to mtx_destroy().
The mtx_lock() function acquires a mutual exclusion lock for the currently running kernel thread. If another kernel thread is holding the mutex, the caller will sleep until the mutex is available. The mtx_lock_spin() function is similar to the mtx_lock() function except that it will spin until the mutex becomes available. Interrupts are blocked on the local processor while spinning and remain blocked following the acquisition of the lock.
It is possible for the same thread to recursively acquire a mutex with no ill effects if the MTX_RECURSE bit was passed to mtx_init() during the initialization of the mutex. The witness module verifies that a thread does not recurse on a non-recursive lock. A recursive lock is useful if a resource may be locked at two or more levels in the kernel. By allowing a recursive lock, a lower layer need not check if the resource has already been locked by a higher layer; it can simply lock and release the resource as needed.
The mtx_trylock() function tries to acquire a mutual exclusion lock for the currently running kernel thread. If the mutex cannot be immediately acquired, mtx_trylock() will return 0; otherwise the mutex will be acquired and a nonzero value will be returned.
The mtx_unlock() function releases a mutual exclusion lock; if a higher-priority thread is waiting for the mutex, the releasing thread will be put to sleep to allow the higher-priority thread to acquire the mutex and run. A mutex that allows recursive locking maintains a reference count showing the number of times that it has been locked. Each successful lock request must have a corresponding unlock request. The mutex is not released until the final unlock has been done, causing the reference count to drop to zero.
The mtx_unlock_spin() function releases a spin-type mutual exclusion lock; the interrupt state from before the lock was acquired is restored.
The mtx_destroy() function destroys a mutex so the data associated with it may be freed or otherwise overwritten. Any mutex that is destroyed must previously have been initialized with mtx_init(). It is permissible to have a single reference to a mutex when it is destroyed. It is not permissible to hold the mutex recursively or have another thread blocked on the mutex when it is destroyed. If these rules are violated, the kernel will panic.
The giant lock that protects subsystems in FreeBSD 5.2 that have not yet been converted to operate on a multiprocessor must be acquired before acquiring other mutexes. Put another way: it is not permissible to acquire giant nonrecursively while holding another mutex. It is possible to acquire other mutexes while holding giant, and it is possible to acquire giant recursively while holding other mutexes.
Sleeping while holding a mutex (except for giant) is almost never safe and should be avoided. There are numerous assertions that will fail if this is attempted.
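A typical usage pattern for the mutex interface described above is sketched below; the foo names and the data structure being protected are hypothetical.

```c
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>

static struct mtx foo_mtx;

static void
foo_init(void)
{
    /* Default sleep mutex; add MTX_RECURSE to permit recursive acquisition. */
    mtx_init(&foo_mtx, "foo", NULL, MTX_DEF);
}

static void
foo_operation(void)
{
    mtx_lock(&foo_mtx);
    /* ... manipulate the data protected by foo_mtx ... */
    mtx_unlock(&foo_mtx);
}

static int
foo_try_operation(void)
{
    if (mtx_trylock(&foo_mtx) == 0)
        return (0);     /* lock was held; caller must retry later */
    /* ... */
    mtx_unlock(&foo_mtx);
    return (1);
}

static void
foo_fini(void)
{
    mtx_destroy(&foo_mtx);  /* must not be held recursively or contested */
}
```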
Lock-Manager Locks
Interprocess synchronization to a resource typically is implemented by associating the resource with a lock structure. The kernel has a lock manager that manipulates these locks. The operations provided by the lock manager are
- Request shared: Get one of many possible shared locks. If a thread holding an exclusive lock requests a shared lock, the exclusive lock will be downgraded to a shared lock.
- Request exclusive: Stop granting further shared locks; when the existing shared locks have all been cleared, grant a pending upgrade (see the following) if one exists, and then grant an exclusive lock. Only one exclusive lock may exist at a time, except that a thread holding an exclusive lock may get additional exclusive locks if it explicitly sets the canrecurse flag in the lock request or if the canrecurse flag was set when the lock was initialized.
- Request upgrade: The thread must hold a shared lock that it wants to have upgraded to an exclusive lock. Other threads may get exclusive access to the resource between the time that the upgrade is requested and the time that it is granted.
- Request exclusive upgrade: The thread must hold a shared lock that it wants to have upgraded to an exclusive lock. If the request succeeds, no other threads will have gotten exclusive access to the resource between the time that the upgrade is requested and the time that it is granted. However, if another thread has already requested an upgrade, the request will fail.
- Request downgrade: The thread must hold an exclusive lock that it wants to have downgraded to a shared lock. If the thread holds multiple (recursive) exclusive locks, they will all be downgraded to shared locks.
- Request release: Release one instance of a lock.
- Request drain: Wait for all activity on the lock to end, and then mark it decommissioned. This feature is used before freeing a lock that is part of a piece of memory that is about to be released.
Locks must be initialized before their first use by calling the lockinit() function. Parameters to the lockinit() function include the following:
- A top-half kernel priority at which the thread should run once it has acquired the lock
- Flags such as canrecurse that allow the thread currently holding an exclusive lock to get another exclusive lock rather than panicking with a "locking against myself" failure
- A string, referred to as the wait-channel message, that describes the resource that the lock protects
- An optional maximum time to wait for the lock to become available
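As a rough sketch of how these pieces fit together, the fragment below initializes a lock-manager lock and then takes shared and exclusive locks on a hypothetical resource. The exact lockmgr() prototype has varied across FreeBSD releases (the interlock and thread arguments shown here approximate the 5.x-era interface), so the calls, the PVFS priority, and the res_lock name should all be read as illustrative assumptions.

```c
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/proc.h>
#include <sys/lock.h>
#include <sys/lockmgr.h>

static struct lock res_lock;        /* hypothetical resource lock */

static void
res_lock_init(void)
{
    /* wakeup priority, wait-channel message, timeout (0 = none), flags */
    lockinit(&res_lock, PVFS, "reslck", 0, LK_CANRECURSE);
}

static void
res_read(void)
{
    lockmgr(&res_lock, LK_SHARED, NULL, curthread);     /* one of many readers */
    /* ... examine the resource ... */
    lockmgr(&res_lock, LK_RELEASE, NULL, curthread);
}

static void
res_modify(void)
{
    lockmgr(&res_lock, LK_EXCLUSIVE, NULL, curthread);  /* the sole writer */
    /* ... modify the resource ... */
    lockmgr(&res_lock, LK_RELEASE, NULL, curthread);
}
```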
Other Synchronization
In addition to the general lock manager locks and mutexes, there are three other types of locking available for use in the kernel. These other types of locks are less heavily used, but they are included here for completeness.
Shared/exclusive locks are used to protect data that are read far more often than they are written. Mutexes are inherently more efficient than shared/exclusive locks. However, shared/exclusive locks are faster than lock-manager locks because they lack many of the lock-manager lock features such as the ability to be upgraded or downgraded.
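In FreeBSD, shared/exclusive locks are provided by the sx(9) interface. A brief sketch of protecting a read-mostly table follows; the table and the function names are hypothetical.

```c
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/sx.h>

static struct sx table_lock;

static void
table_init(void)
{
    sx_init(&table_lock, "table");
}

static void
table_lookup(void)
{
    sx_slock(&table_lock);      /* shared: many readers may proceed in parallel */
    /* ... search the table ... */
    sx_sunlock(&table_lock);
}

static void
table_update(void)
{
    sx_xlock(&table_lock);      /* exclusive: a writer excludes everyone else */
    /* ... modify the table ... */
    sx_xunlock(&table_lock);
}
```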
Condition variables are used with mutexes to wait for conditions to occur. Threads wait on condition variables by calling cv_wait(), cv_wait_sig() (wait unless interrupted by a signal), cv_timedwait() (wait for a maximum time), or cv_timedwait_sig() (wait unless interrupted by a signal or for a maximum time). Threads unblock waiters by calling cv_signal() to unblock one waiter, or cv_broadcast() to unblock all waiters. The cv_waitq_remove() function removes a waiting thread from a condition variable wait queue if it is on one.
A thread must hold a mutex before calling cv_wait(), cv_wait_sig(), cv_timedwait(), or cv_timedwait_sig(). When a thread waits on a condition, the mutex is atomically released before the thread is blocked, and then atomically reacquired before the function call returns. All waiters must use the same mutex with a condition variable. A thread must hold the mutex while calling cv_signal() or cv_broadcast().
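A minimal producer/consumer sketch using this interface (with the FreeBSD 5-era prototypes, in which cv_wait() takes a mutex) is shown below; the queue counter and the names are hypothetical.

```c
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/condvar.h>

static struct mtx q_mtx;
static struct cv  q_cv;
static int        q_len;

static void
q_init(void)
{
    mtx_init(&q_mtx, "qmtx", NULL, MTX_DEF);
    cv_init(&q_cv, "qwait");
}

static void
consumer(void)
{
    mtx_lock(&q_mtx);
    while (q_len == 0)
        cv_wait(&q_cv, &q_mtx); /* q_mtx released while blocked, reacquired on return */
    q_len--;
    mtx_unlock(&q_mtx);
}

static void
producer(void)
{
    mtx_lock(&q_mtx);
    q_len++;
    cv_signal(&q_cv);   /* unblock one waiter; cv_broadcast() would wake them all */
    mtx_unlock(&q_mtx);
}
```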
Counting semaphores provide a mechanism for synchronizing access to a pool of resources. Unlike mutexes, semaphores do not have the concept of an owner, so they can also be useful in situations where one thread needs to acquire a resource and another thread needs to release it. Each semaphore has an integer value associated with it. Posting (incrementing) always succeeds, but waiting (decrementing) can only successfully complete if the resulting value of the semaphore is greater than or equal to zero. Semaphores are not used where mutexes and condition variables will suffice. Semaphores are a more complex synchronization mechanism than mutexes and condition variables, and are not as efficient.
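A sketch of a counting semaphore managing a fixed pool of interchangeable resources appears below, using the kernel's sema(9) interface; the pool itself and the names are hypothetical.

```c
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/sema.h>

#define POOL_SIZE 8             /* number of interchangeable items */

static struct sema pool_sema;

static void
pool_init(void)
{
    sema_init(&pool_sema, POOL_SIZE, "pool");
}

static void
pool_get(void)
{
    sema_wait(&pool_sema);      /* blocks until the count can be decremented */
    /* ... take one item from the pool ... */
}

static void
pool_put(void)
{
    /* May legitimately be called by a thread other than the one in pool_get(). */
    sema_post(&pool_sema);
}
```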