7.2 Preemption
Preemption is the switching of one task to another. We mentioned how schedule() and scheduler_tick() decide which task to switch to next, but we haven't described how the Linux kernel decides when to switch. The 2.6 kernel introduces kernel preemption, which means that both user space and kernel space code can be preempted at various times. Because kernel preemption is the standard in Linux 2.6, we describe how full kernel and user preemption operates in Linux.
7.2.1 Explicit Kernel Preemption
The easiest preemption to understand is explicit kernel preemption. This occurs in kernel space when kernel code calls schedule(). Kernel code can call schedule() in two ways, either by directly calling schedule() or by blocking.
When the kernel is explicitly preempted, as when a device driver blocks on a wait queue, control is simply passed to the scheduler and a new task is chosen to run.
7.2.2 Implicit User Preemption
When the kernel has finished processing a kernel space task and is ready to pass control to a user space task, it first checks to see which user space task it should pass control to. This might not be the user space task that passed its control to the kernel. For example, if Task A invokes a system call, after the system call completes, the kernel could pass control of the system to Task B.
Each task on the system has a "rescheduling necessary" flag that is set whenever a task should be rescheduled:
-----------------------------------------------------------------------
include/linux/sched.h
988  static inline void set_tsk_need_resched(struct task_struct *tsk)
989  {
990    set_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
991  }
992
993  static inline void clear_tsk_need_resched(struct task_struct *tsk)
994  {
995    clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
996  }
...
1003 static inline int need_resched(void)
1004 {
1005   return unlikely(test_thread_flag(TIF_NEED_RESCHED));
1006 }
-----------------------------------------------------------------------
Lines 988–996
set_tsk_need_resched and clear_tsk_need_resched are the interfaces provided to set and clear the architecture-specific flag TIF_NEED_RESCHED.
Lines 1003–1006
need_resched tests the current thread’s flag to see if TIF_NEED_RESCHED is set.
When the kernel is returning to user space, it chooses a process to pass control to, as described in schedule() and scheduler_tick(). Although scheduler_tick() can mark a task as needing rescheduling, only schedule() operates on that knowledge. schedule() repeatedly chooses a new task to execute until the newly chosen task does not need to be rescheduled. After schedule() completes, the new task has control of the processor.
Thus, while a process is running, the system timer causes an interrupt that triggers scheduler_tick(). scheduler_tick() can mark that task as needing rescheduling and move it to the expired array, but it does not perform the switch itself; the task could continue running, and other interrupts could occur, with the kernel retaining control of the processor. Upon completion of kernel operations, schedule() is invoked to choose the next task to run. So, scheduler_tick() marks processes and rearranges queues, but schedule() chooses the next task and passes CPU control.
7.2.3 Implicit Kernel Preemption
New in Linux 2.6 is the implementation of implicit kernel preemption. When a kernel task has control of the CPU, it can only be preempted by another kernel task if it does not currently hold any locks. Each task has a field, preempt_count, which marks whether the task is preemptible. The count is incremented every time the task obtains a lock and decremented whenever the task releases a lock. The schedule() function disables preemption while it determines which task to run next.
There are two possibilities for implicit kernel preemption: Either the kernel code is emerging from a code block that had preemption disabled or processing is returning to kernel code from an interrupt. If control is returning to kernel space from an interrupt, the interrupt calls schedule() and a new task is chosen in the same way as just described.
If the kernel code is emerging from a code block that disabled preemption, the act of enabling preemption can cause the current task to be preempted:
-----------------------------------------------------------------------
include/linux/preempt.h
46 #define preempt_enable() \
47 do { \
48   preempt_enable_no_resched(); \
49   preempt_check_resched(); \
50 } while (0)
-----------------------------------------------------------------------
Lines 46–50
preempt_enable() calls preempt_enable_no_resched(), which decrements the preempt_count on the current task by one and then calls preempt_check_resched():
-----------------------------------------------------------------------
include/linux/preempt.h
40 #define preempt_check_resched() \
41 do { \
42   if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) \
43     preempt_schedule(); \
44 } while (0)
-----------------------------------------------------------------------
Lines 40–44
preempt_check_resched() sees if the current task has been marked for rescheduling; if so, it calls preempt_schedule().
-----------------------------------------------------------------------
kernel/sched.c
2328 asmlinkage void __sched preempt_schedule(void)
2329 {
2330   struct thread_info *ti = current_thread_info();
2331
2332   /*
2333    * If there is a non-zero preempt_count or interrupts are disabled,
2334    * we do not want to preempt the current task. Just return..
2335    */
2336   if (unlikely(ti->preempt_count || irqs_disabled()))
2337     return;
2338
2339 need_resched:
2340   ti->preempt_count = PREEMPT_ACTIVE;
2341   schedule();
2342   ti->preempt_count = 0;
2343
2344   /* we could miss a preemption opportunity between schedule and now */
2345   barrier();
2346   if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
2347     goto need_resched;
2348 }
-----------------------------------------------------------------------
Lines 2336–2337
If the current task still has a positive preempt_count, likely from nesting preempt_disable() commands, or the current task has interrupts disabled, we return control of the processor to the current task.
Lines 2340–2347
The current task has no locks because preempt_count is 0, and IRQs are enabled. Thus, we set the current task's preempt_count to note that it is undergoing preemption, and call schedule(), which chooses another task. After schedule() returns, the flag is tested again (lines 2346–2347) in case a preemption opportunity arose in the meantime.
If the task emerging from the code block needs rescheduling, the kernel must ensure it is safe to take the processor away from the current task, so it checks the task's preempt_count. If preempt_count is 0, and thus the current task holds no locks, schedule() is called and a new task is chosen for execution. If preempt_count is non-zero, it is unsafe to pass control to another task, and control returns to the current task until it releases its locks. Each time the current task releases a lock, a test is made to see whether it needs rescheduling; when it releases its final lock and preempt_count drops to 0, scheduling occurs immediately.