An Engineer's Options & Futures: September 2010

Wednesday, September 29, 2010

A Sneak-Peek into Linux Kernel - Chapter 3: Process Termination

I finally found some time to continue this series on the Linux kernel. This chapter is about how a process, or task, gets terminated.

In Linux, a task is terminated by the exit() system call, made either explicitly by the task or implicitly when its main() function returns. After the task has terminated, its parent has to be informed through the SIGCHLD signal. The major chunk of the exit() work is performed by the do_exit() function in kernel/exit.c.

The first major step in do_exit() is to validate the task's credentials. The validation happens in kernel/cred.c with this piece of code:


if (tsk->cred == tsk->real_cred) {
    if (unlikely(read_cred_subscribers(tsk->cred) < 2 ||
                 creds_are_invalid(tsk->cred)))
        goto invalid_creds;
} else {
    if (unlikely(read_cred_subscribers(tsk->real_cred) < 1 ||
                 read_cred_subscribers(tsk->cred) < 1 ||
                 creds_are_invalid(tsk->real_cred) ||
                 creds_are_invalid(tsk->cred)))
        goto invalid_creds;
}

Oh yes, kernel developers never shy away from the usually dreaded goto statement. If they want control to go somewhere, they just goto there.
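
The idiom is easy to try outside the kernel. Below is a small user-space sketch of the same shape: several checks that all jump to one failure label. struct toy_cred, its fields, the thresholds and toy_validate_creds() are stand-ins I made up for illustration; they are not the kernel's struct cred or its real checks.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for a credential object; not the kernel layout. */
struct toy_cred {
    int subscribers;   /* how many users currently reference this object */
    int corrupted;     /* non-zero if the object looks damaged */
};

/* Returns 0 if the credentials look sane, -1 otherwise. */
int toy_validate_creds(const struct toy_cred *cred,
                       const struct toy_cred *real_cred)
{
    if (cred == real_cred) {
        /* One object referenced through both pointers. */
        if (cred->subscribers < 2 || cred->corrupted)
            goto invalid_creds;
    } else {
        /* Two distinct objects, each referenced at least once. */
        if (real_cred->subscribers < 1 || cred->subscribers < 1 ||
            real_cred->corrupted || cred->corrupted)
            goto invalid_creds;
    }
    return 0;

invalid_creds:
    /* Every failing check lands here, so the error handling lives once. */
    fprintf(stderr, "toy_validate_creds: invalid credentials\n");
    return -1;
}
```

The attraction is that the happy path reads straight down, and the error handling is written exactly once at the label.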

If the validation succeeds, exit_irq_thread(), implemented in kernel/irq/manage.c, is called to set the IRQTF_DIED flag, which prevents any further attempt to wake up the thread. Then the PF_EXITING flag is set in the task_struct of the exiting task. The futex code checks this flag to guard against races with an exiting task that is cleaning up its PI futexes. A futex (fast user-space mutex) is a kernel-assisted primitive that user-space code uses to build locks: the uncontended path stays in user space, and the kernel gets involved only when a task has to sleep or be woken. A PI futex, or to be technically correct a PI-enabled futex, is a futex with Priority Inheritance, which we will discuss later.
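
To get a feel for the plain (non-PI) futex, here is a minimal sketch that calls the futex system call directly through syscall(), since glibc provides no futex() wrapper. The helper name futex_wait_errno() and the 10 ms timeout are my own choices. It shows the compare-before-sleep rule: the kernel puts the caller to sleep only if the word still holds the expected value.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/*
 * Store `actual` in a futex word and ask the kernel to sleep on it,
 * provided the word still equals `expected`. A short relative timeout
 * keeps the demo from blocking forever.
 * Returns 0 if woken normally, otherwise the errno set by the kernel.
 */
int futex_wait_errno(uint32_t actual, uint32_t expected)
{
    uint32_t word = actual;
    struct timespec timeout = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };

    long rc = syscall(SYS_futex, &word, FUTEX_WAIT, expected,
                      &timeout, NULL, 0);
    return rc == -1 ? errno : 0;
}
```

When the word does not match, the call fails immediately with EAGAIN; when it matches and nobody wakes the caller, it times out with ETIMEDOUT. A real lock would pair this with FUTEX_WAKE from the unlocking side.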

Following this, do_exit() invokes exit_mm(), which releases the user memory space allocated to the process. exit_mm() first calls mm_release() in kernel/fork.c. mm_release() discards the register state cached for the process and informs a parent sleeping on vfork(). The user-space thread id field is wiped if the exit is normal. If, on the contrary, the exit is due to a signal such as a segmentation fault or bus error (checked using the PF_SIGNALED flag of the task_struct), it is not wiped, so that this information can be written to the core dump. This is followed, in order, by calls to exit_sem(), exit_files(), and exit_fs(), which respectively dequeue the task from any semaphore it has queued on, release its open files, and release its filesystem information. Finally the task state is set: tsk->state = TASK_DEAD.
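
The difference between a normal exit and an exit caused by a signal is also visible to the parent through the status returned by waitpid(). A minimal sketch, with a helper name (child_termination_signal) of my own:

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that either exits normally (sig == 0) or kills itself
 * with the given signal. Returns the terminating signal number,
 * 0 for a normal exit, or -1 on error.
 */
int child_termination_signal(int sig)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        if (sig != 0) {
            signal(sig, SIG_DFL);   /* make sure the default action applies */
            raise(sig);
        }
        _exit(0);                   /* reached only on the normal path */
    }

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);    /* the child was terminated by a signal */
    return 0;
}
```

WIFSIGNALED() is the user-space view of the distinction the kernel makes on the exit path.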

As can be observed here, the task_struct of the exiting task is only set to the state TASK_DEAD; it is not wiped entirely. That means the parent can still gather information about the finished task. Removing the task_struct altogether is handled by release_task(), which contains the following code:


rcu_read_lock();
atomic_dec(&__task_cred(p)->user->processes);
rcu_read_unlock();

RCU stands for Read-Copy-Update, a synchronization mechanism used in Linux. Whatever falls between rcu_read_lock() and rcu_read_unlock() is a read-side critical section; I just wanted to show a real piece of OS kernel code that establishes one. release_task() can also call do_notify_parent(), which notifies the parent with SIGCHLD. This is followed by the final disposal of the task through a call to the architecture-specific release_thread() function. On x86, this involves freeing the vm86 IRQs held by the task, whereas on ARM or PowerPC the function does nothing (serious). And thus, a task reaches its demise.

So in the past three episodes, we discussed the rise and fall of a task, or process. But what happens in between is more interesting. The next chapter will be on task scheduling.

Wednesday, September 01, 2010

A Sneak-Peek into Linux Kernel - Chapter 2: Process Creation

In the last chapter, we looked at the basics of a process, or task, in the Linux kernel, with a brief overview of struct task_struct. Now we are going to discuss how a process or task gets created.

In Linux, a new task is created with the fork() system call. fork() creates an almost exact copy of the parent process; the differences are the pid (which is unique) and the parent pointer. The child gets a copy of the task descriptor, while its resource utilization counters start at 0. But there are a lot of fields in the task descriptor that must not be copied to the child. So forking is implemented in the kernel through the clone() system call, which in turn calls the do_fork() function in kernel/fork.c. do_fork() relies mainly on an unsigned long variable called clone_flags to determine what has to be shared or copied. The first call made by do_fork() is to copy_process(), which has the following code:


retval = security_task_create(clone_flags);

The first step of copy_process() is to call security_task_create(), implemented in security/security.c, which dispatches through a structure called struct security_operations. This function takes clone_flags as input and determines whether the current process has sufficient permission to create a child process. With SELinux, this is done by selinux_task_create() in security/selinux/hooks.c, which calls current_has_perm() to check the required permission using the access vector cache, a component that caches access decisions so that they are not computed over and over again. If security_task_create() returns a non-zero value, copy_process() cannot create the new task.
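
The calling convention is the usual kernel one: a hook returns 0 to allow the operation and a negative error code to refuse it. Here is a toy user-space sketch of a hook table consulted at the start of task creation. struct toy_security_ops, allow_all(), deny_all() and toy_copy_process() are names invented for this example, and the -EPERM value is just an illustrative error code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical, much-reduced analogue of a security operations table. */
struct toy_security_ops {
    int (*task_create)(unsigned long clone_flags);
};

/* A permissive policy: every request is allowed. */
int allow_all(unsigned long clone_flags)
{
    (void)clone_flags;
    return 0;
}

/* A strict policy: every request is refused. */
int deny_all(unsigned long clone_flags)
{
    (void)clone_flags;
    return -EPERM;
}

/* Consult the hook first; any non-zero answer aborts task creation. */
int toy_copy_process(const struct toy_security_ops *ops,
                     unsigned long clone_flags)
{
    int retval = ops->task_create(clone_flags);
    if (retval)
        goto fork_out;

    /* ... a real implementation would duplicate the task here ... */
    return 0;

fork_out:
    return retval;
}
```

Swapping the function pointer in the table is all it takes to change the policy, which is the point of routing the check through a structure of hooks.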


p = dup_task_struct(current);
ftrace_graph_init_task(p);
rt_mutex_init_task(p);
retval = copy_creds(p, clone_flags);

The next step is to duplicate the struct task_struct by calling dup_task_struct(). Once the struct task_struct is duplicated, the pointers to parent tasks that do not make sense in the child are set to appropriate values, as explained in the next paragraph. This is followed by a call to copy_creds(), implemented in kernel/cred.c. copy_creds() copies the credentials of the parent to the child, and at this point a new clone flag called CLONE_THREAD has to be introduced.

In Linux, there is no fundamental difference between threads and tasks (or processes). They are created and destroyed in the same way, but handled a little differently. The CLONE_THREAD flag says that the child has to be placed in the same thread group as the caller. The parent of the child in this case is the parent of the task that called clone(), not the task that called clone() itself. The process and session keyrings (handled later) are shared between all the threads in a process. When CLONE_THREAD is set, copy_process() sets the child's real_parent to the caller's real_parent, so the thread that invoked clone() has no parental control over the newly created task. The second pointer in struct task_struct, parent, names the task that receives SIGCHLD and that wait() reports to; it normally equals real_parent and differs only while the task is being traced. getppid() returns the common parent, and only that parent is sent SIGCHLD, once all the threads in the group have terminated. For all these threads, copy_creds() therefore normally shares one set of credentials and keyrings instead of making a private copy for each. So threads in Linux are processes that share a parent and an address space. Note that the threads we are talking about here are the user threads that a program creates with libraries such as pthreads or ZThreads. They are completely different from the kernel threads that the Linux kernel creates for its internal operations.
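
A toy model may make the bookkeeping clearer. The sketch below is plain user-space C and only imitates, under my own simplifying assumptions, how the identifiers of a new task depend on CLONE_THREAD. struct toy_task, toy_clone(), toy_getpid() and toy_getppid() are hypothetical names; TOY_CLONE_THREAD reuses the bit value of the real flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name; same bit value as CLONE_THREAD. */
#define TOY_CLONE_THREAD 0x00010000UL

struct toy_task {
    int pid;                    /* unique per task */
    int tgid;                   /* thread group id, shared by all threads */
    struct toy_task *parent;    /* the task reported as the parent */
};

/* Fill in `child` as if `caller` had called clone() with `flags`. */
void toy_clone(struct toy_task *child, struct toy_task *caller,
               int new_pid, unsigned long flags)
{
    child->pid = new_pid;
    if (flags & TOY_CLONE_THREAD) {
        child->tgid = caller->tgid;       /* join the caller's group */
        child->parent = caller->parent;   /* sibling, not child, of caller */
    } else {
        child->tgid = new_pid;            /* leader of a new group */
        child->parent = caller;
    }
}

/* What getpid() would report for this task in the model. */
int toy_getpid(const struct toy_task *t)
{
    return t->tgid;
}

/* What getppid() would report for this task in the model. */
int toy_getppid(const struct toy_task *t)
{
    return t->parent ? t->parent->tgid : 0;
}
```

In this model a thread created by task 100 still reports 100 as its process id and reports the parent of task 100 as its own parent, while a regular child of task 100 reports 100 as its parent.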

After this, the function sets the child's CPU time counters to zero and splits the parent's CPU time to give a share to the child. Then the sched_fork() function defined in kernel/sched.c is invoked. Process scheduling is a separate topic of discussion.
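
The zeroed CPU time counters can be checked from user space with getrusage(). In this sketch the parent first burns at least 50 ms of CPU time, then forks, and the child reports its own counters back through a pipe. The helper names and the 50 ms figure are arbitrary choices of mine, and the spin loop is capped so that the function gives up (returning -1) rather than hanging if CPU accounting does not advance.

```c
#include <assert.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* User plus system CPU time consumed by this process, in microseconds. */
static long cpu_usec(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == -1)
        return 0;
    return (long)(ru.ru_utime.tv_sec + ru.ru_stime.tv_sec) * 1000000L
           + ru.ru_utime.tv_usec + ru.ru_stime.tv_usec;
}

/*
 * Returns 1 if a freshly forked child reports less CPU time than its
 * parent had accumulated, 0 if not, and -1 if the check could not run.
 */
int child_cpu_time_restarts(void)
{
    volatile unsigned long spin = 0;
    for (int round = 0; round < 2000 && cpu_usec() < 50000; round++)
        for (unsigned long i = 0; i < 100000; i++)
            spin++;

    long parent_usec = cpu_usec();
    if (parent_usec < 50000)
        return -1;

    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        long mine = cpu_usec();             /* the child's own counters */
        ssize_t n = write(fds[1], &mine, sizeof mine);
        _exit(n == (ssize_t)sizeof mine ? 0 : 1);
    }

    close(fds[1]);
    long child_usec = -1;
    ssize_t n = read(fds[0], &child_usec, sizeof child_usec);
    close(fds[0]);
    waitpid(pid, NULL, 0);

    if (n != (ssize_t)sizeof child_usec)
        return -1;
    return child_usec < parent_usec;
}
```

The child inherits a copy of almost everything else, but not the time its parent has already spent on the CPU.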

In this chapter, we looked at how a task is created in the Linux kernel, or rather which important operations are performed during task creation. In the next chapter, we will look at task termination. In the meantime, you can put printk() statements in the copy_process() function of kernel/fork.c and watch the kernel log to observe it at work.
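
If rebuilding a kernel is too much for an evening, the visible effects of fork() can also be checked from user space. A minimal sketch, with a function name of my own: the child verifies that it has a new pid and that its parent is the caller, changes its copy of a variable, and the parent confirms that its own copy is untouched.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns the parent's value of a local variable after the child has
 * modified its own copy, or -1 if anything unexpected happened.
 */
int fork_copies_memory(void)
{
    int value = 41;
    pid_t parent = getpid();

    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        value++;    /* changes only the child's copy */
        int ok = getpid() != parent     /* the child has its own pid */
              && getppid() == parent    /* and the caller as its parent */
              && value == 42;
        _exit(ok ? 0 : 1);
    }

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    return value;   /* the parent's copy was never touched */
}
```

The function returns the parent's value, which stays at 41 even though the child counted up to 42 in its own copy.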