The Not Especially System Call Lab
The next lab is, in some sections, known as the "System Call Lab". You'll notice that I've renamed it "The Mini-UNIX Shell". I also toyed with the name "The Process Lab". So, what's in a name?Well, the lab does make use of system calls. The library calls that are new to this lab such as, fork(), wait(), and waitpid() have system calls underneath the hood, to be sure. But, this is not new. We've used system calls before. Any time we've made use of, for example, I/O, the C library has made use of system calls to get the job done.
Who's In Charge Here, Anyway
So, what is a system call, anyway? Well, to answer that question, we need to first take a better look at the operating system. Most people operate under the assumption that the operating system is something akin to air traffic control or a paramilitary police force. The idea is that the operating system controls everything that is going on, that it controls who gets what and when, and that it can kill any process that is offensive.And, this model is useful in that it does offer an explanation for a lot of what we see when we look casually at our computers and their operating systems. But, it does fail a more careful look. You guys have written a machine simulator. You know the basic life of a processor -- they just charge through memory and execute instructions. So, when the processor is in the context of a particular process chewing along following the instructions of that program, where is the operating system, this great controlling ruler, running? Answer: It is not. If the processor is running a user program, it is not simultaneously running the operating system. Yep, you got that right. Barring anything else, a program can run forever, starving the operating system, itself, from getting to the processor.
The Timer and Interrupts
Well, obviously that system won't do. It doesn't explain reality -- that the OS does seem to be in charge. Here's how the game is played. There is a piece of hardware known as the timer. At periodic intervals, this timer counts doen to zero and interrupts the processor. It pokes the processor directly, or via some arbitration hardware, and let's it know that it time to let the operating system run again.The processor makes the operating system run again via what is, in effect, an array of function pointers known as the Interrupt Vector Table (IVT). Every device on the system has a handler, a function that services it. These handlers are known as Interrupt Service Routines (ISRs). Each interrupt has a number, for example 0 for the timer. When the hardware device wants attention, it uses its wire to signal the processor. The processor then executes the appropriate function fron the IVT.
So, if we ask the disk to get some data for us, when that data is ready, it can signal the processor, which will then run the disk's ISR to copy the data. When data shows up on the network, the network adapter can signal the processor via its own ISR. This mechanism is also used for the processor to signal the operating system that there is a problem with the process that is currently running, for example a divide-by-zero or an attempt to make an invalid memory access. This type of situation often results in the OS ending the current process as you have seen with SIGSEGV.
Regardless, periodically, the timer will signal the processor that "time is up" for the current processes. When this timer interrupt goes off, the processor runs the timer handler, which is in-effect a part of the operating system's scheduler. The OS can then look around and decide if that process should continue running or if it should take a break to allow another process to run for a while.
This interrupt system enables the operating system to schedule multiple processes to take turns so quickly that they all seem to be running at the same time. So, for example, your web page can update at the same time that you type an email and move the mouse.
How does all of this get set up, you ask? When a computer starts up, it goes throug ha boots strap process. The boot sector is read off of the disk. This boot sector contains the instructions about how to initialize, a.k.a. boot strap, the operating system. Part of this process is initializing the IVT and storing its address in a register so that the processor can find it. Another part of it is telling the timer how often to interrupt the processor.
Protection, Supervisor Mode and Interrupts
Well, as fun a story as this is, what does it have to do with so-called system calls? Well, a little more background first. A quick story about protection, to be specific. If a process, in effect, owns the processor while it is running, what prevents it from reading another process's memory? The OS's memory? Or from accessing a file on disk that contains a user's private information?The hardware limits the access to hardware devices as well as to certain parts of memory that are specially tagged so that user processes can't, under normal circumstances, access them. So, for example, a process is normally not able to access the disk or to read protected memory.
Instead accessing these protected resources can only be done when in a special mode often known as supervisor mode. And, there is exactly one way to get into supervisor mode -- by executing one of the functions via the IVT. In this way, supervisor mode can only be entered in a very controlled way -- only through the functions that the operating system, itself, set up in the IVT.
So, for example, when the disk needs attention, and the disk handler runs, it can access the disk, the operating system's data structures, and the destination process's memory, because it is running via an ISR. When the function ends, privilege is lost.
And, again, I want to observe that the protected resources include not only hardware things like the disk, the network, and memory, but also more abstract resources, such as the data structures that represent processes, communication channels, &c.
See how nice and pretty? We now see how the OS can periodically get back in the saddle via the timer, as well as how it can gain access to resources not normally accesible via ISRs.
So, What's a System Call, Anyway?
Enough already! What is a so-called system call? Well, we know how the hardware can rquest's the OS's attention and invokes the OS's special privileged functions via the ISRs. But, what happens when a program wants to invoke the OS to do something that requires privilege, such as request data from the network or the disk? Or to do something that otherwise requries the OS, such as create a new process?Well, for it to access the privileged resources, we know that it has to go through the IVT -- if it doesn't it can't get into the special mode needed to access the protected resources. And, this is exactly what it does. There is a special instruction often knwon as TRAP. This instruction causes the trap-handler to run from the IVT. This handler looks on the stack for a single argument -- a number which indicates what the user wants to do. This number, whcih is #defined to a useful name, suhc as OPEN, CLOSE, READ, WRITE, &c is then fed to a big switch statement which calls the appropriate function. But, since this function is invoked via the ISR, it has privilege.
So, what is a so-called system call? It is a function that is invoked indirectly via a TRAP instruction such that it passes through the IVT and has privilege.
We've seen thing that are, at least traditionally, system calls: read() and write(), for example. And, today we'll see some more, such as fork(), wait(), waitpid(), and getpid(). No magic here. They are just C library calls that, deep down inside, require privs.
Notice that I said, "at least traditionally". This is because the so-called system call interface has evolved over the years and is no longer the same as it once was -- and isn't the same across all flavors of UNIX. But, the details there are really the domain of 15-410 -- not 15-123!
The Life Cycle of a Process
Okay. So. Now that we've got a better understanding of the role of the OS and the nature of a system call, let's move in the direction of this week's lab. It involves the management of processes. So, let's begin that discussion by considering the lifecycle of a process:
A newly created process is said to be ready or runnable. It has everything it needs to run, but until the operating system's schedule dispatches it onto a processor, it is just waiting. So, it is put onto a list of runnable processes. Eventually, the OS selects it, places it onto the processor, and it is actually running.
If the timer interrupts its execution and the OS decides that it is time for another process to run, the other process is said to preempt it. The preempted process returns to the ready/runnable list until it gets the opportunity to run again.
Sometimes, a running process asks the operating system to do something that can take a long time, such as read from the disk or the network. When that happens, the operating system doesn't want to force the processor to idle while the process is waiting for the slow action. Instead, it blocks the process. It moves the proces to a wait list associated with the slow resource. It then chooses another process from the ready/runnabel list to run.
Eventually, the resource, via an interrupt, will let the OS know that the process can again be made ready to run. The OS will do what it needs to do, and ready, a.k.a., amke runnable, the previosly blocked process by moving it to the ready/runnable list.
Eventually a program may die. It might call exit under the programmer's control, in which case it is said to exit or it might end via some exception, in which case the more general term, terminated might be more descriptive. When this happens, the process doesn't immediately go away. Instead, it is said to be a zombie. The process remains a zombie until its parent uses wait() or waitpid() to collect its status -- and set it free.
If the parent died before the child, or if it died before waiting for the child, the child becomes an orphan. Shoudl this happen, the OS will reparent the orphan process to a special process called init. Init, by convention, has pid 1, and is used at boot time to start up other processes. But, it also has the special role of waiting() for all of the orphans that are reparented to it. In this way, all processes can eventually be cleaned up. When a process is set free by a wait()/waitpid(), it is said to be reaped.
fork()
In order to create a process, we make use of the fork() call. Tradionally, for is, itself, a system call. These days, the actual system call may, or may not be a fork(). But, this isn't important. What is important is that when a process calls fork(), it creates a nearly exact clone of itself. One process goes into the phone booth -- and two step out.The original process is known as the parent process. The new one is known as the child. The parent and the child are virtually alike. For our purposes, there are only two differences. The first difference is that they have differnt process ids (pids). A pid is just a number that the OS uses to identify a process. The second difference is that fork() returns a 0 to the child, but returns the pid of the child to the parent. This is convenient, because it allows the program, now running in two differnt processes, to do different things in each of the two processes.
The Exec Family
For example, it is not uncommon to want the child process to assume some other identity -- to, in effect, become a different program. this is done with one of the exec calls (man execve, execvl, execvp, &c). What an exec call does is to load, from disk, a new process image into the current process. So, the current process's memory is dumped and the new process's image is loaded into memory from its executible file.Take careful note, when an exec function is called, in the normal case, it doesn't return. This is beause the process has assumed a new identity -- the exec is gone. Most ofen, an exec faisl becuase the path to the executible is wrong and the executible can't be found.
The "l" versions of the functions, execl() and execlp() take the arguments in a long list, each of the new programs' arguments are passed separately as arguments to exec. It is very important to note that the last argument to exec() and execl() must be a NULL pointer -- otherwise, since it doesn't know how many arguments there are, when walking down the stack, it doesn't know when to stop.
The "v" version of the functions, execv() and execvp(), take the arguments in an array, mcuh like main gets its arguments via the argv[] array. Again, since the array has no length, it is important that the last entry be a NULL.
The "p" version of the functions, execvp() and execlp() will search the path for a matching executible, as compared to execv() and execl(), where the full path, e.g., /bin/ls, must be specified.
Lastly, the 0th argument should be the name of the program, not a "real" command-line argument. It doesn't have to match the executible name -- but normally does. It is passed in as argv[0] to the new program's main().
So, basically, the argv[] array that is passed into execl(), execlp(), &c, is the same as the argv[] array as you are accustomed to receiving within main(). execl() and execlp() take the same list of arguments, including the 0th and terminating NULL -- but take them flattened out, with each being passed as a separate argument to exec.
To really understand this, you'll need to read the man page, look at our example, and search the Web for another example or two.
wait() and waitpid()
The two function calls wait() and waitpid() are normally used to wait for a process to end. So, consider a UNIX shell. When you start up "vi" or "emacs", your shell waits until the editor eds before producing a new prompt. The shell waits by calling wait() or waitpid().Fork w/copy-on-writeIn either case, the wait call will block until the child process is done. Once the child is done, the wait call will return. By waiting for a child in this way, the parent also reaps the child. As we discussed earlier, the child will remain a zombie until reaped by a wait.
The wait calls give the caller back an integer that contains some information about the child's state. We're not going to worry about it here. Either pass in a pointer to a real interger, or a NULL. But, if you are curious, do a "man 2 wait". Notice the macros, such as WIFEXITED() that can be used to decode this status.
The wait() call will wait for any child. If it is desirable to wait for a particular child, the waitpid() call can be used. It will only wait for the child who's pid is specified. waitpid() can be made to wait for any child by passing in a pid of 0 or -1. Upon success, both forms of wait return the pid of the child they reaped.
waitpid() also has another argument that will become important for this lab -- the flags. If a flag of WNOHANG is specified, wait will not actually block. If there is an available zombie, it will collect its information and return its PID. If no child is currently available, the WNOHANG glag prefents wait from blocking. Instead, it will return -1 if there are no more children, or 0 if there are children -- but none are zombies.
Copying all of the pages of memory associated with a process is a very expensive thing to do. It is even more expensive considering that very often the first act of the child is to deallocate this recently created space.One alternative to a traditional fork implementation is called copy-on-write. the details of this mechanism won't be completely clear until we study memory management, but we can get the flavor now.
The basic idea is that we mark all of the parent's memory pages as read-only, instead of duplicating them. If either the parent or any child try to write to one of these read-only pages, a page-fault occurs. At this point, a new copy of the page is created for the writing process. This adds some overhead to page accesses, but saves us the cost of unnecessarly copying pages.
vfork()
Another alternative is also available -- vfork(). vfork is even faster, but can also be dangerous in the worng hands. With vfork(), we do not duplicate or mark the parent's pages, we simply loan them, and the stack frame to the child process. During this time, the parent remains blocked (it can't use the pages). The dangerous part is this: any changes the child makes will be seen by the aprent process.vfork() is most useful when it is immediately followed by an exec_(). This is because an exec() will create a completely new process-space, anyway. There is no reason to create a new task space for the child, just to have it throw it away as part of an exec(). Instead, we can loan it the parent's space long enough for it to get started (exec'd).
Although there are several (4) different functions in the exec-family, the only difference is the way they are parameterizes; under-the-hood, they all work identically (and are often one).
After a new task is created, the parent will often want to wait for it (and any siblings) to finish. We discussed the defunct and zombie states last class. The wait-family of calls is used for this purpose.
A Quick Overview of the File System from the OS Point of View
The operating system maintains two data structures representing the state of open files: the per-process file descriptor table and the system-wide open file table.
When a process calls open(), a new entry is created in the open file table. A pointer to this entry is stored in the process's file descriptor table. The file descriptor table is a simple array of pointers into the open file table. We call the index into the file descriptor table a file descriptor. It is this file descriptor that is returned by open(). When a process accesses a file, it uses the file descriptor to index into the file descriptor table and locate the corresponding entry in the open file table.
The open file table contains several pieces of information about each file:
- the current offset (the next position to be accessed in the file)
- a reference count (we'll explain below in the section about fork())
- the file mode (permissions),
- the flags passed into the open() (read-only, write-only, create, &c),
- a pointer to an in-RAM version of the inode (a slightly light-weight version of the inode for each open file is kept in RAM -- others are on disk), and a structure that contains pointers to all of the .
- A pointer to the structure containing pointers to the functions that implement the behaviors like read(), write(), close(), lseek(), &c on the file system that contains this file. This is the same structure we looked at last week when we discussed the file system interface to I/O devices.
Each entry in the open file table maintains its own read/write pointer for three important reasons:
- Reads by one process don't affect the file position in another process
- Write are visible to all processes, if the file pointer subsequently reaches the location of the write
- The program doesn't have to supply this information each call.
One important note: In modern operating systems, the "open file table" is usually a doubly linked list, not a static table. This ensures that it is typically a reasonable size while capable of accomodating workloads that use massive numbers of files.
Session Semantics
Consider the cost of many reads or writes may to one file.
- Each operation could require pathname resolution, protection checking, &c.
- Implicit information, such as the current location (offset) into the file must be maintained,
- Long term state must also be maintained, especially in light of the fact that several processes using the file might require different view.
Caches or buffers may need to be initialized The solution is to amortize the cost of this overhead over many operations by viewing operations on a file as within a session. open() creates a session and returns a handle and close() ends the session and destroys the state. The overhead can be paid once and shared by all operations.
Consequences of Fork()ing
In the absence of fork(), there is a one-to-one mapping from the file descriptor table to the open file table. But fork introduces several complications, since the parent task's file descriptor table is cloned. In other words, the child process inherits all of the parent's file descriptors -- but new entries are not created in the system-wide open file table.
One interesting consequence of this is that reads and writes in one process can affect another process. If the parent reads or writes, it will move the offset pointer in the open file table entry -- this will affect the parent and all children. The same is of course true of operations performed by the children.
What happens when the parent or child closes a shared file descriptor?
- remember that open file table entries contain a reference count.
- this reference count is decremented by a close
- the file's storage is not reclaimed as long as the reference count is non-zero indicating that an open file entry to it exists
- once the reference count reaches zero, the storage can be reclaimed
- i.e., "rm" may reduce the link count to 0, but the file hangs around until all "opens" are matched by "closes" on that file.
Why clone the file descriptors on fork()?
- it is consistent with the notion of fork creating an exact copy of the parent
- it allows the use of anonymous files by children. The never need to know the names of the files they are using -- in fact, the files may no longer have names.
- The most common use of this involves the shell's implementation of I/O redirection (< and >). Remember doing this?