An Engineer's Options & Futures: A Sneak-Peek into Linux Kernel

Finally I found some time to get back to continuing this effort of writing about Linux kernel. This chapter is about how the process or task gets terminated.

In Linux, a task is terminated by an exit() system call, made either explicitly by the task or implicitly when the main() function of the task ends. After the task is terminated, its parent has to be informed about the termination through SIGCHLD signal. Major chunk of exit() operation is performed in do_exit() function in kernel/exit.c.

The first major process in do_exit() function is to validate the credential for exit(). Validation happens in cred.c with this piece of code:


if (tsk->cred == tsk->real_cred) {
if (unlikely(read_cred_subscribers(tsk->cred) <>cred)))
    goto invalid_creds;
} else { if (unlikely(read_cred_subscribers(tsk->real_cred) <>cred) <>real_cred) ||
                  creds_are_invalid(tsk->cred)))
    goto invalid_creds;
}

Oh.. yes; kernel developers never shun away from the usually dreaded goto statement. If they want the control to go to somewhere, they just goto there.

If the validation is successful, then exit_irq_thread() function, implemented in kernel/irq/manage.c is called to set the IRQTF_DIED flag that prevents any attempt to wake up the thread. Then PF_EXITING flag is set on the task_struct of the exiting task. Then the exiting task is protected from cleaning the area of pi futex. A futex is the same as mutex (fast user-space mutex = futex) that is used in the Linux kernel to implement locks. Out of that, pi futex or to be technically correct, PI enabled futex stand for Priority Inheritance enabled futex – a set of lightweight futex, about which we will discuss later.

Following this, the do_exit() invokes exit_mm() function that clears out the user memory space allocated to the process. exit_mm() first releases the user-space by calling mm_release() in kernel/fork.c. mm_release() first removes all the register state saved by the process and inform the parent sleeping on vfork(). The user-space thread id field is wiped off if the exit is normal. On the contrary, if the exit is due to some signal like segmentation fault or bus fault (checked using PF_SIGNALED flag of the task_struct, then it is not wiped so that this information can be written in core dump. This is followed in order by invocations of exit_sem, exit_files, and exit_fs that dequeue if the task has queued any semaphore, removes the locks on files and file space respectively. Then the task status is set to tsk->state = TASK_DEAD.

As it can be observed here, the task_struct instance of the exiting task is set to a state: TASK_DEAD and it is not entirely wiped off. That means that the parent can still gather information about the finished task. Removal of the task_struct altogether is managed by invoking release_task(). The release_task() has this following code:


rcu_read_lock();
atomic_dec(&__task_cred(p)->user->processes);
rcu_read_unlock();

RCU stands for Read-Copy-Update, a synchronization mechanism used in Linux. So whatever falls between rcu_read_lock() and rcu_read_unlock() is a read-side critical section – just wanted to show a real piece of the OS kernel code that establishes a critical section. release_task() also calls do_notify_parent() function which notifies the parent with SIGCHLD. This is followed by a final disposal of the task by a call to architecture specific release_thread() function. In x86, it involves freeing up vm86 irq for this task; whereas in ARM or PowerPC, this function does nothing (serious). And thus, a task reaches its demise.

So in the past three episodes, we discussed the rise and fall of a task or process. But what happens in between is more interesting. The next chapter would be on task scheduling.