Process Termination
It is sad, but eventually processes must die. When a process terminates, the kernel releases the resources owned by the process and notifies the child's parent of its unfortunate demise.
Typically, process destruction occurs when the process calls the exit() system call, either explicitly when it is ready to terminate or implicitly on return from the main subroutine of any program (that is, the C compiler places a call to exit() after main() returns). A process can also terminate involuntarily. This occurs when the process receives a signal or exception it cannot handle or ignore. Regardless of how a process terminates, the bulk of the work is handled by do_exit(), which completes a number of chores:
First, it set the PF_EXITING flag in the flags member of the task_struct.
Second, it calls del_timer_sync() to remove any kernel timers. Upon return, it is guaranteed that no timer is queued and that no timer handler is running.
Next, if BSD process accounting is enabled, do_exit() calls acct_process() to write out accounting information.
Now it calls __exit_mm() to release the mm_struct held by this process. If no other process is using this address space (in other words, if it is not shared), then deallocate it.
Next, it calls exit_sem(). If the process is queued waiting for an IPC semaphore, it is dequeued here.
It then calls __exit_files(), __exit_fs(), exit_namespace(), and exit_sighand() to decrement the usage count of objects related to file descriptors, filesystem data, the process namespace, and signal handlers, respectively. If any usage counts reach zero, the object is no longer in use by any process and it is removed.
Subsequently, it sets the task's exit code, stored in the exit_code member of the task_struct, to the code provided by exit() or whatever kernel mechanism forced the termination. The exit code is stored here for optional retrieval by the parent.
It then calls exit_notify() to send signals to the task's parent, reparents any of the task's children to another thread in their thread group or the init process, and sets the task's state to TASK_ZOMBIE.
Finally, do_exit() calls schedule() to switch to a new process (see Chapter 4). Because TASK_ZOMBIE tasks are never scheduled, this is the last code the task will ever execute.
The code for do_exit() is defined in kernel/exit.c.
At this point, all objects associated with the task (assuming the task was the sole user) are freed. The task is not runnable (and in fact no longer has an address space in which to run) and is in the TASK_ZOMBIE state. The only memory it occupies is its kernel stack, the thread_info structure, and the task_struct structure. The task exists solely to provide information to its parent. After the parent retrieves the information, or notifies the kernel that it is uninterested, the remaining memory held by the process is freed and returned to the system for use.
Removal of the Process Descriptor
After do_exit() completes, the process descriptor for the terminated process still exists but the process is a zombie and is unable to run. As discussed, this allows the system to obtain information about a child process after it has terminated. Consequently, the acts of cleaning up after a process and removing its process descriptor are separate. After the parent has obtained information on its terminated child, or signified to the kernel that it does not care, the child's task_struct is deallocated.
The wait() family of functions are implemented via a single (and complicated) system call, wait4(). The standard behavior is to suspend execution of the calling task until one of its children exits, at which time the function returns with the PID of the exited child. Additionally, a pointer is provided to the function that on return holds the exit code of the terminated child.
When it is time to finally deallocate the process descriptor, release_task() is invoked. It does the following:
First, it calls free_uid() to decrement the usage count of the process's user. Linux keeps a per-user cache of information related to how many processes and files a user has opened. If the usage count reaches zero, the user has no more open processes or files and the cache is destroyed.
Second, release_task() calls unhash_process() to remove the process from the pidhash and remove the process from the task list.
Next, if the task was ptraced, release_task() reparents the task to its original parent and removes it from the ptrace list.
Ultimately, release_task(), calls put_task_struct() to free the pages containing the process's kernel stack and thread_info structure and deallocate the slab cache containing the task_struct.
At this point, the process descriptor and all resources belonging solely to the process have been freed.
The Dilemma of the Parentless Task
If a parent exits before its children, some mechanism must exist to reparent the child tasks to a new process, or else parentless terminated processes would forever remain zombies, wasting system memory. The solution, hinted upon previously, is to reparent a task's children on exit to either another process in the current thread group or, if that fails, the init process. In do_exit(), notify_parent() is invoked, which calls forget_original_parent() to perform the reparenting:
struct task_struct *p, *reaper = father; struct list_head *list; if (father->exit_signal != -1) reaper = prev_thread(reaper); else reaper = child_reaper; if (reaper == father) reaper = child_reaper;
This code sets reaper to another task in the process's thread group. If there is not another task in the thread group, it sets reaper to child_reaper, which is the init process. Now that a suitable new parent for the children is found, each child needs to be located and reparented to reaper:
list_for_each(list, &father->children) { p = list_entry(list, struct task_struct, sibling); reparent_thread(p, reaper, child_reaper); } list_for_each(list, &father->ptrace_children) { p = list_entry(list, struct task_struct, ptrace_list); reparent_thread(p, reaper, child_reaper); }
This code iterates over two lists: the child list and the ptraced child list, reparenting each child. The rationale behind having both lists is interesting; it is a new feature in the 2.6 kernel. When a task is ptraced, it is temporarily reparented to the debugging process. When the task's parent exits, however, it must be reparented along with its other siblings. In previous kernels, this resulted in a loop over every process in the system looking for children. The solution, as noted previously, is simply to keep a separate list of a process's children that are being ptraced—reducing the search for one's children from every process to just two relatively small lists.
With the process successfully reparented, there is no risk of stray zombie processes. The init process routinely calls wait() on its children, cleaning up any zombies assigned to it.