Lecture 5: Unix File System Review

Files

Before discussing file systems, it makes sense to discuss files. What is a file? Think back to the beginning of the semester when we discussed the first of many abstractions -- the process. We said that a process is an abstraction for a unit of work to be managed by the operating system on behalf of some user (a person, an agent, or some aspect of the system). We said that the PCB was an operating system data structure that represented a process within the operating system.

Similarly, a file is an abstraction. It is a collection of data that is organized by its users. The data within the file isn't necessarily meaningful to the OS; the OS may not know how it is organized -- or even why some collection of bytes has been organized together as a file. Nonetheless, it is the job of the operating system to provide a convenient way of locating each such collection of data and manipulating it, while protecting it from unintended or malicious damage by those who should not have access to it, and ensuring its privacy, as appropriate.

Introduction: File Systems

A file system is nothing more than the component of the operating system charged with managing files. It is responsible for interacting with the lower-level I/O subsystem used to access the file data, as well as managing the files themselves and providing the API by which application programmers can manipulate the files.

Factors In Filesystem Design

  1. naming
  2. operations
  3. storage layout
  4. failure resilience
  5. efficiency (lost space is not recovered when a process ends, as it is with RAM; the penalty for frequent access is also higher -- by a factor of roughly 10^6)
  6. sharing and concurrency
  7. protection

Naming

The simplest type of naming scheme is a flat space of objects. In this model, there are only two real issues: naming and aliasing.

Naming involves:

Aliasing

Aliasing is the ability to have more than one name for the same file. If aliasing is to be permitted, we must determine what types to allow. It is useful for several reasons:

There are two basic types:

In order to implement hard links, we must have low level names.

UNIX has low-level names; they are called inodes. The pair (device number, inode #) is unique. The inode also serves as the data structure that represents the file within the OS, keeping track of all of its metadata. In contrast, MS-DOS uniquely names files by their location on disk -- this scheme does not allow for hard links.

Hierarchical Naming

Real systems use hierarchical names, not flat names. The reason for this relates to scale. The human mind copes with large scale in a hierarchical fashion. It is essentially a human cognitive limitation: we deal with large numbers of things by categorizing them. Every large human organization is hierarchical: armies, companies, churches, etc.

Furthermore, too many names are hard to remember and it can be hard to generate unique names.

With a hierarchical name space only a small fraction of the full namespace is visible at any level. Internal nodes are directories and leaf nodes are files. The pathname is a representation of the path from the root of the tree to the leaf node.

The process of translating a pathname is known as name resolution. We must translate the pathname one step at a time to allow for symbolic links.
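As a toy illustration of why this must happen one component at a time, the sketch below (user-level code, not the kernel's implementation; the buffer sizes and output format are arbitrary) walks an absolute pathname piece by piece and uses lstat(), which does not follow a trailing symbolic link, to show where a link would have to be chased before resolution could continue.

/* pathwalk.c -- a toy illustration (not the kernel's algorithm) of why
 * resolution proceeds one component at a time: any component along the
 * way may turn out to be a symbolic link. Usage: ./pathwalk /usr/local/bin */
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main (int argc, char *argv[])
{
  char path[4096], prefix[4096] = "", target[4096];
  struct stat st;
  char *component;

  if (argc != 2) {
    fprintf (stderr, "usage: %s /absolute/path\n", argv[0]);
    return 1;
  }
  strncpy (path, argv[1], sizeof(path) - 1);
  path[sizeof(path) - 1] = '\0';

  for (component = strtok (path, "/"); component; component = strtok (NULL, "/")) {
    strcat (prefix, "/");
    strcat (prefix, component);              /* the portion resolved so far */

    if (lstat (prefix, &st) < 0) {           /* lstat() does not follow the final link */
      perror (prefix);
      return 1;
    }

    if (S_ISLNK (st.st_mode)) {
      ssize_t n = readlink (prefix, target, sizeof(target) - 1);
      if (n >= 0) {
        target[n] = '\0';
        printf ("%s -> symbolic link to %s\n", prefix, target);
      }
    } else {
      printf ("%s (inode %lu)\n", prefix, (unsigned long) st.st_ino);
    }
  }
  return 0;
}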

Every process is associated with a current directory. chdir() resolves a pathname to its low-level name and makes that the current directory. If we follow a symbolic link to a location and try to "cd ..", we won't follow the symbolic link back to our original location -- the system doesn't remember how we got there; it takes us to the parent directory.

The ".." relationship superimposes a Directed Acyclic Graph (DAG) onto the directory structure, which may contain cycles via links.

Have you ever seen duplicate listings for the same page in Web search engines? This is because it is impossible to impose a DAG onto Web space -- not only is it not a DAG at any level, it is very highly connected.

Each directory is created with two implicit components: ".", the directory itself, and "..", its parent.

Directory Entries

What exactly is inside of each directory entry aside from the file or directory name?

UNIX directory entries are simple: name and inode #. The inode contains all of the metadata about the file -- everything you see when you type "ls -l". It also contains the information about where (which sectors) on disk the file is stored.

MS-DOS directory entries are much more complex. They actually contain the meta-data about the file:

Unix keeps similar information in the inode. We'll discuss the inode in detail very soon.

File System Operations

File system operations generally fall into one of three categories:

From open() to the inode

The operating system maintains two data structures representing the state of open files: the per-process file descriptor table and the system-wide open file table.

When a process calls open(), a new entry is created in the open file table. A pointer to this entry is stored in the process's file descriptor table. The file descriptor table is a simple array of pointers into the open file table. We call the index into the file descriptor table a file descriptor. It is this file descriptor that is returned by open(). When a process accesses a file, it uses the file descriptor to index into the file descriptor table and locate the corresponding entry in the open file table.
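As a rough sketch of this relationship -- the structure and field names below are invented for illustration and are not any real kernel's declarations -- the two tables might look something like this:

#include <sys/types.h>

#define MAX_FILES 256

struct inode;                       /* stands in for the file's on-disk metadata */

struct open_file {                  /* one entry in the system-wide open file table */
  struct inode *inode;              /* the file this open session refers to */
  int           mode;               /* read, write, append, ... */
  off_t         offset;             /* current read/write position */
  int           refcount;           /* how many descriptors point here (think fork) */
};

struct pcb {                        /* per-process state */
  struct open_file *fd_table[MAX_FILES];  /* the file descriptor table; a file
                                             descriptor is just an index here */
  /* ... */
};

/* Conceptually, read(fd, buf, n) begins with
 *   struct open_file *of = current->fd_table[fd];
 * and then reads from of->inode starting at of->offset. */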

The open file table contains several pieces of information about each file:

Each entry in the open file table maintains its own read/write pointer for three important reasons:

One important note: In modern operating systems, the "open file table" is usually a doubly linked list, not a static table. This ensures that it is typically a reasonable size while capable of accommodating workloads that use massive numbers of files.

Session Semantics

Consider the cost of performing many reads or writes to one file.

The solution is to amortize the cost of this overhead over many operations by viewing operations on a file as within a session. open() creates a session and returns a handle and close() ends the session and destroys the state. The overhead can be paid once and shared by all operations.
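For instance, in the minimal example below (which assumes /etc/services exists and is readable, as it does on most Unix systems), the expensive work -- name resolution, permission checking, creating the table entries -- is paid once by open(), and each read() in the loop pays only for the data transfer:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main (void)
{
  char buf[4096];
  ssize_t n;
  int fd = open ("/etc/services", O_RDONLY);    /* session begins: pay the overhead once */

  if (fd < 0) {
    perror ("open");
    return 1;
  }

  while ((n = read (fd, buf, sizeof(buf))) > 0)
    ;                                           /* many cheap operations share the session */

  close (fd);                                   /* session ends: state is destroyed */
  return 0;
}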

Consequences of Fork()ing

In the absence of fork(), there is a one-to-one mapping from the file descriptor table to the open file table. But fork introduces several complications, since the parent task's file descriptor table is cloned. In other words, the child process inherits all of the parent's file descriptors -- but new entries are not created in the system-wide open file table.

One interesting consequence of this is that reads and writes in one process can affect another process. If the parent reads or writes, it will move the offset pointer in the open file table entry -- this will affect the parent and all children. The same is of course true of operations performed by the children.
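The small demonstration below (assuming /etc/services exists and holds at least a few bytes) makes the shared offset visible: the child's read() advances the file position that the parent then observes with lseek():

#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main (void)
{
  char buf[4];
  int fd = open ("/etc/services", O_RDONLY);  /* any readable file will do */

  if (fd < 0) {
    perror ("open");
    return 1;
  }

  if (fork () == 0) {              /* child inherits the descriptor ...        */
    read (fd, buf, sizeof(buf));   /* ... and advances the *shared* offset     */
    _exit (0);
  }
  wait (NULL);                     /* let the child run first                  */

  /* Prints 4, not 0: both descriptors index the same open file table entry. */
  printf ("parent sees offset %ld\n", (long) lseek (fd, 0, SEEK_CUR));
  close (fd);
  return 0;
}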

What happens when the parent or child closes a shared file descriptor?

Why clone the file descriptors on fork()?

Memory-Mapped Files

Earlier this semester, we got off on a bit of a tangent and discussed memory-mapped I/O. I promised we'd touch on it again -- and now seems like a good time, since we just talked about how the file system maintains and accesses files. Remember that it is actually possible to hand a file over to the VMM and ask it to manage it, as if it were backing store for virtual memory. If we do this, we only use the file system to set things up -- and then, only to name the file.

If we do this, when a page is accessed, a page fault will occur, and the page will be read into a physical frame. The access to the data in file is conducted as if it were an access to data in the backing-store. The contents of the file are then accessed via an address in virtual memory. The file can be viewed as an array of chars, ints, or any other primitive variable or struct.

Only those pages that are actually used are read into memory. The pages are cached in physical memory, so frequently accessed pages will not need to be read from external storage on each access. It is important to realize that the placement and replacement of the pages of the file in physical memory competes with the pages from other memory-mapped files and those from other virtual memory sources like program code, data, &c and is subject to the same placement/replacement scheme.

As is the case with virtual memory, changes are written upon page-out and unmodified pages do not require a page-out.

The system call to memory map a file is mmap(). It returns a pointer to the beginning of the mapped region. The pages of the file are faulted in as is the case with any other pages of memory. This call takes several parameters. See "man mmap" for the full details. But a simplified version is this:

void *mmap (int fd, int flags, int protection)

The file descriptor is associated with an already open file. In this way the filesystem does the work of locating the file. Protection specifies the usual sort of thing: readable, writable, executable, &c. Flags is something new.

Consider what happens if multiple processes are using a memory-mapped file. Can they both share the same page? What if one of them changes a page? Will each see it?

MAP_PRIVATE ensures that pages are duplicated on write, ensuring that the calling process cannot affect another process's view of the file.

MAP_SHARED does not force the duplication of dirty pages -- this implies that changes are visible to all processes.

A memory mapped file is unmapped upon a call to munmap(). This call destroys the memory mapping of a file, but it should still be closed using close() (Remember -- it was opened with open()). A simplified interface follows. See "man munmap" for the full details.

int munmap (void *address) // address was returned by mmap.

If we want to ensure that changes to a memory-mapped file have been committed to disk, instead of waiting for a page-out, we can call msync(). Again, this is a bit simplified -- there are a few options. You can see "man msync" for the details.

int msync (void *address)
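Putting mmap(), msync(), and munmap() together, here is a hedged sketch of updating a file in place through a MAP_SHARED mapping. It uses the full, unsimplified interfaces; see the man pages for the exact signatures.

#include <ctype.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main (int argc, char *argv[])
{
  int fd;
  struct stat info;
  char *data;

  if (argc != 2) {
    fprintf (stderr, "usage: %s file\n", argv[0]);
    return 1;
  }

  fd = open (argv[1], O_RDWR);
  if (fd < 0 || fstat (fd, &info) < 0) {
    perror (argv[1]);
    return 1;
  }
  if (info.st_size == 0) {
    fprintf (stderr, "%s is empty\n", argv[1]);
    return 1;
  }

  /* MAP_SHARED: dirty pages are written back to the file itself */
  data = mmap (NULL, info.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (data == MAP_FAILED) {
    perror ("mmap");
    return 1;
  }

  data[0] = toupper ((unsigned char) data[0]);  /* modify the file through memory */

  msync (data, info.st_size, MS_SYNC);          /* commit the change now, not at page-out */
  munmap (data, info.st_size);                  /* destroy the mapping ...       */
  close (fd);                                   /* ... but still close the file  */
  return 0;
}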

Cost of Memory Mapped Access To Files

Memory mapping files reduces the cost of access imposed by traditional I/O, which must copy the data first from the device into system space and then again from system space into user space.

But it does come at another, somewhat interesting cost. Since the file is being memory mapped into the VM space, it is competing with regular memory pages for frames. That is to say that, under sufficient memory pressure, access to a memory-mapped file can force the VMM to push a page of program text, data, or stack off to disk.

Now, let's consider the cost of a copy. Consider, for example, this "quick and dirty" copy program:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main (int argc, char *argv[])
{
  int fd_source; 
  int fd_dest;
  struct stat info;
  unsigned char *data;

  if (argc != 3) {
    fprintf (stderr, "usage: %s source dest\n", argv[0]);
    return 1;
  }

  fd_source = open (argv[1], O_RDONLY);
  fd_dest = open (argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0666);

  fstat (fd_source, &info);              /* how big is the source file? */

  /* map the whole source; its pages are faulted in as write() touches them */
  data = mmap (0, info.st_size, PROT_READ, MAP_SHARED, fd_source, 0);
  write (fd_dest, data, info.st_size);   /* copy it with a single write() */

  munmap (data, info.st_size);
  close (fd_source);
  close (fd_dest);

  return 0;
}

Notice that in copying the file, the file is viewed as a collection of pages and each page is mapped into the address space. As the write() writes the file, each page, individually, will be faulted into physical memory. Each page of the source file will only be accessed once. After that, the page won't be used again.

The unfortunate thing is that these pages can force pages that are likely to be used out of memory -- even, for example, the text area of the copy program. The observation is that memory mapping files is best for small files, or those (or parts) that will be frequently accessed.

Storage Management

The key problems of storage management include:

These problems are different in several ways from the problems we encountered in memory management:

Blocks and Fragmentation

During our discussion of memory management, we said that a byte was the smallest addressable unit of memory. But our memory management systems created larger and more convenient memory abstractions -- pages and/or segments. The file system will employ similar medicine.

Although the sector is the smallest addressable unit in hardware, the file system manages storage in units of multiple sectors. Different operating systems give this unit a different name. CP/M called it an extent. MS-DOS called it a cluster. UNIX systems generally call it a block. We'll follow the UNIX nomenclature and call it a block. But regardless of what we call it, in some sense it becomes a logical sector. Except when interacting with the hardware, the operating system will perform all operations on whole blocks.
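For example, assuming 512-byte hardware sectors and 4KB blocks (a typical, but not universal, configuration), translating a block-level operation into the sector-level operations the hardware understands is simple arithmetic:

/* Assumed sizes for illustration: 512-byte hardware sectors, 4KB blocks. */
#include <stdio.h>

#define SECTOR_SIZE 512
#define BLOCK_SIZE  4096
#define SECTORS_PER_BLOCK (BLOCK_SIZE / SECTOR_SIZE)   /* 8 */

int main (void)
{
  unsigned long block = 1000;                          /* a logical block   */
  unsigned long first = block * SECTORS_PER_BLOCK;     /* its first sector  */

  printf ("block %lu = sectors %lu..%lu\n",
          block, first, first + SECTORS_PER_BLOCK - 1);
  return 0;
}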

Internal fragmentation results from allocating storage in whole-block units -- even when less storage is requested. But, much as was the case with RAM, this approach avoids external fragmentation.

Key Differences from RAM Storage Management

Storage Allocation

Now that we've considered the role of the file system and the characteristics of the media that it manages, let's consider storage allocation. During this discussion we will consider several different policies and data structures used to decide which disk blocks are allocated to a particular file.

Contiguous Allocation

Please think back to our discussion of memory management techniques. We began with a simple proposal. We suggested that each unit of data could be stored contiguously in physical memory. We suggested that this approach could be managed using a free list, a placement policy such as first-fit, and storage compaction.

This simple approach is applicable to a file system. But, unfortunately, it suffers from the same fatal shortcomings:

Linked Lists

In order to eliminate the external fragmentation problem, we need to break the association between physical contiguity and logical contiguity -- we have to gain the ability to satisfy a request with non-adjacent blocks, while preserving the illusion of contiguity. To accomplish this we need a data structure that stores the information about the logical relationship among the disk blocks. This data structure must answer the question: which physical blocks are logically adjacent to each other?

In many ways, this is the same problem that we had in virtual memory -- we're trying to establish a virtual file address space for each file, much like we did a virtual address space for each process.

One approach might be to call upon our time-honored friend, the linked list. The linked list solves so many problems -- why not this one?

We could consider the entire disk to be a collection of linked lists, where each block is a node. Specifically, each block could contain a pointer to the next block in the file. But this approach has problems also:

Well, unfortunately, the linked list isn't the solution this time:

File Allocation Table

Another approach might be to think back to our final solution for RAM -- the page table. A page table-proper won't work for disk, because each process does not have its own mapping from logical addresses to physical addresses. Instead this mapping is universal across the entire file system.

Remember the inverted page table? This was a system-wide mapping. We could apply a similar system-wide mapping in the file system. Actually, it gets easier in the file system. We don't need a complicated hashing system or a forward mapping on disk. Let's consider MS-DOS. We said that the directory entry associated the high-level "8 + 3" file name assigned by the user with a low-level name, the number of the first cluster populated by the file. Now we can explain the reason for this.

MS-DOS uses an approach similar to an inverted page table. It maintains a table with one entry for each cluster on disk. Each entry contains a pointer to the cluster that logically follows it. When a directory entry is opened, it provides the address (cluster number) of the first cluster in the corresponding file. This number is used as an index into the mapping table called the File Allocation Table, a.k.a FAT. This entry provides the number of the next cluster in the file. This process can be repeated until the entry in the table corresponding to the last cluster in the file is inspected -- this entry contains a sentinel value, not a cluster address.

A complicated hash is not needed, because the directory tree structure provides the mapping. We don't need the forward mapping, because all clusters must be present on disk -- (for the most part) there is no backing store for secondary storage. To make use of this system, the only "magic" required is a priori knowledge as to the whereabouts of the FAT on disk (actually MS-DOS keeps redundant copies of the FAT, with a write-all, read-one policy).
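The chain walk itself is trivial. The toy sketch below uses a made-up in-memory FAT and a made-up end-of-chain sentinel -- it is not the real FAT-16 on-disk format -- but it shows the table lookup that replaces a per-cluster pointer stored in the data itself:

#include <stdint.h>
#include <stdio.h>

#define EOC 0xFFFF                    /* made-up end-of-chain sentinel */

static uint16_t fat[8] = {            /* fat[i] holds the cluster that follows cluster i */
  /* 0 */ EOC, /* 1 */ EOC,
  /* 2 */ 5,   /* 3 */ EOC,           /* one file occupies clusters 2 -> 5 -> 7 */
  /* 4 */ EOC, /* 5 */ 7,
  /* 6 */ EOC, /* 7 */ EOC,
};

int main (void)
{
  uint16_t cluster = 2;               /* first cluster, found in the directory entry */

  while (cluster != EOC) {            /* walk until the sentinel                     */
    printf ("read cluster %u\n", cluster);
    cluster = fat[cluster];           /* the next logical cluster of the file        */
  }
  return 0;
}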

But this approach also has limitations:

inode Based Allocation

UNIX uses a more sophisticated and elegant system than MS-DOS. It is based on a data structure known as the inode.

There are two important characteristics of the i-node approach:

Each level-0 or outermost inode is divided into several different fields:

Files up to a certain size are mapped using only the direct mappings. If the file grows past a certain threshold, then Indirect_1 mappings are also used. As it keeps growing, Indirect_2 and Indirect_3 mappings are used. This system allows for a balance between storage compactness in secondary storage and overhead in the allocation system. In some sense, it amounts to a special optimization for small files.
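The sketch below shows how the direct and indirect mappings cooperate to translate a file's logical block number into a disk block number. The constants, the in-memory "disk" of pointer blocks, and the helper names are invented for illustration (and the triple-indirect case is omitted, though it follows the same pattern); no real file system looks exactly like this.

#include <stdint.h>
#include <stdio.h>

#define NDIRECT        12     /* direct entries in the inode (assumed, ext2-like) */
#define PTRS_PER_BLOCK 4      /* kept tiny for the demo; a 4KB block of 4-byte    */
                              /* pointers would really hold 1024                  */

struct sketch_inode {
  uint32_t direct[NDIRECT];   /* map the first NDIRECT logical blocks directly */
  uint32_t indirect1;         /* a block full of pointers                      */
  uint32_t indirect2;         /* a block of pointers to pointer blocks         */
};

/* A stand-in "disk" of pointer blocks; a real fs would read these from disk. */
static uint32_t pointer_block[16][PTRS_PER_BLOCK];

static uint32_t entry (uint32_t block_no, uint32_t i)
{
  return pointer_block[block_no][i];
}

static uint32_t logical_to_physical (struct sketch_inode *ip, uint32_t lbn)
{
  if (lbn < NDIRECT)                       /* small files never leave the inode */
    return ip->direct[lbn];

  lbn -= NDIRECT;
  if (lbn < PTRS_PER_BLOCK)                /* one extra lookup                  */
    return entry (ip->indirect1, lbn);

  lbn -= PTRS_PER_BLOCK;                   /* two extra lookups                 */
  return entry (entry (ip->indirect2, lbn / PTRS_PER_BLOCK), lbn % PTRS_PER_BLOCK);
}

int main (void)
{
  struct sketch_inode ino = { {100, 101, 102},  /* logical blocks 0..2 live at 100..102 */
                              3,                /* indirect block is pointer_block[3]   */
                              0 };
  pointer_block[3][0] = 500;                    /* logical block NDIRECT lives at 500   */

  printf ("%u %u\n", logical_to_physical (&ino, 0), logical_to_physical (&ino, NDIRECT));
  return 0;
}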

Estimating Maximum File Size

Given K direct entries and I indirect entries per block, the biggest file we can store is (K + I + I^2 + I^3) blocks.

If we would need to allocate files larger than we currently can, we could reduce the number of Direct Block entries and add an Indirect_4 entry. This process could be repeated until the entire table consisted of indirect entries.
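For a concrete, assumed example: with ext2-like parameters of K = 12 direct entries, 4KB blocks, and 4-byte block pointers, I = 4096 / 4 = 1024, and the quick computation below shows the maximum file size working out to a bit over 4 TB:

#include <stdio.h>

int main (void)
{
  unsigned long long K = 12, I = 4096 / 4, block_size = 4096;
  unsigned long long blocks = K + I + I*I + I*I*I;   /* direct + three indirect levels */

  printf ("%llu blocks = %llu bytes (about %.1f TB)\n",
          blocks, blocks * block_size, blocks * block_size / 1e12);
  return 0;
}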

A Quick Look Back At Traditional File Systems

We've looked at "General Purpose inode-based file systems" such as UFS and ext2. They are the workhorses of the world. They are reasonably fast, but have some limitations, including:

Hybrid file systems

Today we are going to talk about a newer generation of file systems that keep the best characteristics of traditional file systems, add some improvements, and also add logging to increase availability in the event of failure. These file systems, in particular, support much larger file systems than could reasonably be managed using the older file systems, and do so more robustly -- and often faster.

Much like the traditional file systems that we talked about have common characteristics, such as similar inode structures, buffer cache organizations, &c, these file systems will often share some of the same characteristics:

ReiserFS

The ReiserFS isn't the most sophisticated among this class of filesystems, but it is a reasonably new filesystem. Furthermore, despite the availability of journaling file systems for other platforms, Reiser was among the first available for Linux and is the first, and only, hybrid file system currently part of the official Linux kernel distribution.

As with the other filesystems that we discussed, ReiserFS only journals metadata. And, it is based on a variation of the B+ tree, the B* tree. Unlike the B+ tree, which does 1-2 splits, the B* tree does 2-3 splits. This increases the overall packing density of the tree at the expense of only a small amount of code complexity.

It also offers a unique tail optimization. This feature helps to mitigate internal fragmentation. It allows the tails of files, the end portions of files that occupy less than a whole block, to be stored together to more completely fill a block.

Unlike the other file systems, its space management is still pretty "old-school". It uses a simple block-based allocator and manages free space using a simple bit-map, instead of a more efficient extent-based allocator and/or B-tree based free space management. Currently the block size is 4KB, the maximum file size is 4GB, and the maximum file system size is 16TB. Furthermore, ReiserFS doesn't support sparse files -- all blocks of a file are mapped. Reiser4, scheduled for release this fall, will address some of these limitations by including extents and a variable block size of up to 64KB.

For the moment, free blocks are found using a linear search of the bitmap. The search is in the order of increasing block number to match the spin of the disk. The allocator tries to keep things together by searching the bitmap beginning with the position representing the left neighbor. This was empirically determined to be the better of the following:

ReiserFS allows for the dynamic allocation of inodes and keeps inodes and the directory structure organized within a single B* tree. This tree organizes four different types of nodes:

Items are stored in the tree using a key, which is a tuple:

<parent directory ID, offset within object, item type/uniqueness>, where

Each key structure also contains a unique item number, basically the inode number. But this isn't used to determine ordering. Instead, the tree sorts keys using each tuple, in order of position. This orders the files in the tree in a way that keeps files within the same directory together, and then sorts them by file or directory name.

The leaf nodes are data nodes. Unformatted nodes contain whole blocks of data. "Formatted" nodes hold the tails of files. They are formatted to allow more than one tail to be stored within the same block. Since the tree is balanced, the path to any of these data nodes is the same length.

A file is composed of a set of indirect items and, at most, 2 direct items for the tail. (Why not always one? If a tail is smaller than an unformatted node but larger than the space available in a formatted node, it needs to be broken apart and placed into two direct items.)

SGI's XFS

In many ways SGI's XFS is similar to ReiserFS. But, it is in many ways more sophisticated. It may be the most sophisticated among the systems we'll consider. This being said, unlike ReiserFS, XFS uses B+ trees instead of B* trees.

The extent-based allocator is rather sophisticated. In particular, it has three pretty cool features. First, it allows for delayed allocation. Basically, this allows the system to build a virtual extent in RAM and then allocate it in one piece at the end. This mitigates the "and one more thing" syndrome that can lead to a bunch of small extents instead of one big one. It also allows for the preallocation of an extent. This allows the system to reserve an extent that is big enough in advance so that the right sized extent can be used -- without consuming memory for delayed allocation or running the risk of running out of space later on. The system also allows for the coalescing of extents as they are freed to reduce fragmentation.

The file system is organized into different partitions called allocation groups (AGs). Each allocation group has its own data structures -- for practical purposes, they are separate instances of the same file system class. This helps to keep the data structures to a manageable scale. It also allows for parallel activity on multiple AGs, without concurrency control mechanisms creating hot spots.

Inodes are created dynamically in chunks of 64 inodes. Each inode is numbered using a tuple that includes both the chunk number and the inode's index within its chunk. The location of an inode can be discovered by a lookup in a B+ tree by chunk number. The B+ tree also contains a bitmap showing which inodes within each chunk are used.

Free space is managed using two different B+ trees of extents. One B+ tree is organized by size, whereas the other is organized by location. This allows for efficient allocation -- both by size and by locality.

Directories are also stored in a B+ tree. Instead of storing the name itself in the tree, a hash of the name is stored. This is done because it is more complicated to organize a B tree to work with names of different sizes. But, regardless of the size of the name, it will hash to a key of the same size.
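The idea is easy to sketch: whatever the name's length, a hash reduces it to a fixed-size key that the tree can store and compare cheaply. The function below is an ordinary string hash chosen for illustration; it is not the hash XFS actually uses.

#include <stdint.h>
#include <stdio.h>

/* Illustration only: reduce variable-length names to fixed-size keys. */
static uint32_t name_hash (const char *name)
{
  uint32_t h = 0;

  while (*name)
    h = h * 31 + (unsigned char) *name++;   /* classic multiplicative string hash */
  return h;
}

int main (void)
{
  /* Both names, despite different lengths, become 32-bit keys. */
  printf ("%08x %08x\n", name_hash ("a"), name_hash ("a-much-longer-file-name.txt"));
  return 0;
}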

Each file within this tree contains its own storage map (inode). Initially, the inode stores the extents directly: each entry holds a block offset and an extent size measured in blocks. When the file grows and overflows the inode, the storage allocation is stored in a tree rooted at the inode. This tree is indexed by the offset of the extent and stores the size of the extent. In this way, the directory structure is really a tree of inodes, which in turn are trees of the file's actual storage.

Much like ReiserFS, XFS logs only metadata changes, not changes to the file's actual data. In the event of a crash, it replays these logs to obtain consistent metadata. XFS also includes a repair program, similar to fsck, that is capable of fixing other types of corruption. This repair tool was not in the first release of XFS, but was demanded by customers and added later. Logging can be done to a separate device to prevent the log from becoming a hot-spot in high-throughput applications. Normally asynchronous logging is used, but synchronous is possible (albeit expensive).

XFS offers a variable block size ranging from 512 bytes to 64KB and an extent-based allocator. The maximum file size is 9 thousand petabytes. The maximum file system size is 18 thousand petabytes.

IBM's JFS

IBM's JFS isn't one of the best performers among this class of file systems. But, that is probably because it was one of the first. What to say? Things get better over time -- and I think everyone benefitted from IBM's experience here.

File system partitions correspond to what are known in DFS as aggregates. Within each partition lives an allocation group, similar to that of XFS. Within each allocation group are one or more filesets. A fileset is nothing more than a mountable tree. JFS supports extents within each allocation group.

Much like XFS, JFS uses a B+ tree to store directories. And, again, it also uses a B+ tree to track allocations within a file. Unlike XFS, the B+ tree is used to track even small allocations. The only exception is an optimization that allows symlinks to live directly in the inode.

Free space is represented as an array with 1 bit per block. This bit array can be viewed as an array of 32-bit words. These words then form a binary tree sorted by size. This makes it easy to find a contiguous chunk of space of the right size, without a linear search of the available blocks. The same array is also indexed by another tree as a "Binary Buddy". This allows for easy coalescing and easy tracking of the allocated size.

These trees actually have a somewhat complicated structure. We won't spend the time here to cover it in detail. This really was one of the "original attempts" and not very efficient. I can provide you with some references, if you'd like more detail.

As for sideline statistics, the block size can be 512B, 1KB, 2KB, or 4KB. The maximum file size ranges from 512TB with a 512-byte block size to 4 petabytes with a 4KB block size. Similarly, the maximum file system size ranges from 4PB with 512-byte blocks to 32 petabytes with a 4KB block size.

Ext3

Ext3 isn't really a new file system. It is basically a journaling layer on top of Ext2, the "standard" Linux file system. It is both forward and backward compatible with Ext2. One can actually mount any ext2 file system as ext3, or mount any ext3 filesystem as ext2. This filesystem is particularly noteworthy because it is backed by Red Hat and is their "official" file system of choice.

Basically Red Hat wanted to have a path into journaling file systems for their customers, but also wanted as little transitional headache and risk as possible. Ext3 offers all of this. There is no need, in any real sense, to convert an existing ext2 file system to it -- really, ext3 just needs to be enabled. Furthermore, the unhappy customer can always go back to ext2. And, in a pinch, the file system can always be mounted as ext2 and the old fsck remains perfectly effective.

The journaling layer of ext3 is really separate from the filesystem layer. There are only two differences between ext2 and ext3. The first, which really isn't a change to ext2-proper, is that ext3 has a "logging layer" to log the file system changes. The second change is the addition in ext3 of a communication interface from the file system to the logging layer. Additionally, one ext2 inode is used for the log file, but this really doesn't matter from a compatibility point of view -- unless the ext2 file system is (or otherwise would be) completely full.

Three types of things are logged by ext3. These must be logged atomically (all or none).

Periodically, the in-memory log is check-pointed by writing outstanding entries to an in-memory journal. This journal is committed periodically to disk. The level of journaling is a mount option. Basically, writes to the log file are cached, like any other writes. The classic performance versus recency trade-off involves how often we sync the log to disk.

As for the sideline stats, the block size is variable between 1KB and 4KB. The maximum file size is 2GB and the maximum filesystem size is 4TB.

As you can see, this is nothing more than a version of ext2, which supports a journaling/logging layer that provides for a faster, and optionally more thorough, recovery mode. I think Red Hat made the wrong choice. My bet is that people want more than compatibility - more than Ext3 offers. Instead, I think that the ultimate winner will be the new version of ReiserFS or XFS. Or, perhaps, something new -- but not this.

Handling Multiple File Systems

So far we have discussed the role of file systems and the implementation of UNIX-like file systems. But our model of the world was a little simplified -- it aimed to capture essential properties without the added complexity of optimization or real-world idiosyncrasies. Today, we are going to take a closer look at the mechanisms used within Linux.

In real world systems, many different file systems may be in use on the same system at the same time. Many different file systems exist -- some are specialized for particular applications, others are just vendor-specific or vestigial general-purpose file systems. The commercial success of a new entry in the OS market often depends on its ability to support a plethora of file systems -- no one wants to convert all of their old data (applications present enough trauma).

The Virtual File System (VFS), originally proposed by Sun and now a part of SYSVR4, is a file system architecture designed to facilitate support for multiple file systems. It uses an object-oriented paradigm to represent file systems. The VFS model can be viewed as consisting of an abstract class that represents a file system with derived classes for each specific type of file system.

The abstract base class defines the minimal interface to the file system. The derived class implements these behaviors in a way that is appropriate to the file system and defines additional behaviors as necessary.


Source: Rusling, David A, The Linux Kernel, V0.8-3, LDP, 1999, S.9.2.

Sun also defined a similar abstraction to represent a file, called the vnode. The vnode is basically an abstract base class that, when implemented by a derived class, serves the role of a traditional inode. The vnode defines the universal interface and the derived classes implement these behaviors and others for the specific file system.

Linux is fairly loyal to the VFS architecture and has adopted many of the ideas of the vnode into its inode structure. Its inode structure is not, however, an exact implementation of a vnode. Linux maintains the general architecture of the vnode, without employing as strong an OO model. One note: whereas a vnode # is unique across file systems, an inode # is only unique within the file system. For this reason, it is necessary to use the device # and the inode # as a unique identifier for a file in Linux.

Major Data Structures

The following are the major data structures in the Linux file system infrastructure. We'll walk our way through them today.

Per Process File Information

So far, in lecture, we've suggested that the only file system state that is associated with a process is the file descriptor table in the PCB.

This is almost true in the real world, but not quite. There are a few other pieces of information that prove useful and a few optimizations. In Linux, the file system information associated with a process is kept in a struct files_struct within the task_struct. The task_struct is Linux's version of the PCB.

include/linux/sched.h:

struct files_struct { /* kept within task_struct (PCB) */
        atomic_t count;
        rwlock_t file_lock;
        int max_fds;
        int max_fdset;
        int next_fd;
        struct file ** fd;      /* current fd array */
        fd_set *close_on_exec;
        fd_set *open_fds;
        fd_set close_on_exec_init;
        fd_set open_fds_init;
        struct file * fd_array[NR_OPEN_DEFAULT];
};

We find the struct file **fd, the array of file descriptors, just as expected. But, it is dynamic, not static. Initially it references a small, default array, struct file *fd_array[NR_OPEN_DEFAULT], but if necessary, it can grow. If this happens a new array is allocated for fd and the contents are copied. This can happen repeatedly, if necessary.

The count variable is a reference count on the structure (it can be shared by clone()'d tasks), and file_lock is a reader-writer lock that is used to protect operations on the structure.

There are a few bit-masks of type fd_set. These sets contain one bit per file descriptor. In the case of open_fds, this bit indicates whether or not the corresponding file descriptor is in use. In the case of close_on_exec, each bit indicates whether or not the corresponding file should be closed in the event of an exec(). If the new process knows nothing about the open files of its predecessor, it makes sense to close them and free the associated resources. But, in other cases, the open files can provide an anonymous way for the predecessor and successor to cooperate.

The open_fds_init and close_on_exec_init fd_sets are used to initialize the fd_sets of a clone()'d process. The Linux clone() call is much like a super-set of Fork() and SharedFork() in Yalnix. It can create traditional processes or thread-like relationships.

next_fd is an index into the array that is used when searching for an available file descriptor. It prevents an increasingly long linear search starting at the beginning of the array, in the event that many files are in use.
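A rough user-level sketch of the idea follows. The names are invented, the kernel uses a true bit-set rather than a byte array, and the detail that closing a descriptor below next_fd must move next_fd back down (to preserve the lowest-free-descriptor rule) is omitted.

#include <stdio.h>

#define MAX_FDS 256

struct files_sketch {
  unsigned char in_use[MAX_FDS];  /* one bit per descriptor in the kernel; a byte here */
  int next_fd;                    /* where the next search begins                      */
};

static int get_unused_fd (struct files_sketch *f)
{
  int fd;

  for (fd = f->next_fd; fd < MAX_FDS; fd++) {
    if (!f->in_use[fd]) {
      f->in_use[fd] = 1;
      f->next_fd = fd + 1;        /* skip what we already know is taken */
      return fd;
    }
  }
  return -1;                      /* no free slot (the kernel would grow the table) */
}

int main (void)
{
  struct files_sketch f = { {0}, 0 };
  int a = get_unused_fd (&f);
  int b = get_unused_fd (&f);

  printf ("%d %d\n", a, b);       /* 0 1 */
  return 0;
}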

System-wide File Information

List of files in use: (struct list_head) sb->s_files

List of free files: struct list_head free_list

List of newly created files: struct list_head anon_list

struct file

The elements of the open file list, s_files, are of type struct file. Each node represents the use of a file by a process. The only exception occurs in the case of a clone()'ing. Several clone()'d or fork()'d processes may share the same file node.

struct file {
        struct list_head        f_list;  /* Head of the list */
        struct dentry          *f_dentry; /* The name--> inode mapping */
        struct file_operations *f_op; /* Remember this from I/O? The op pointers*/
        atomic_t                f_count; /* Reference count -- needed because of fork(), &c */
        unsigned int            f_flags; /* O_RDONLY, O_WRONLY, &c */
        mode_t                  f_mode; /* just as in chmod() */
        loff_t                  f_pos; /* The current position in the file -- allows for sequential reads, writes, &c */
        unsigned long           f_reada, f_ramax, f_raend, f_ralen,
                                f_rawin;  /* Used for read-ahead magic */
        struct fown_struct      f_owner; /* owner to notify for async I/O (F_SETOWN/SIGIO) */
        unsigned int            f_uid, f_gid; /* userid and group id */
        int                     f_error; /* needed for NFS return codes */
        unsigned long           f_version; /* Needed for cache validation */

        /* needed for tty driver, and maybe others */
        void                    *private_data;
};

struct file_operations

This structure should seem familiar to everyone -- we discussed it in the context of device drivers. It contains pointers to the functions that implement the standard interface. Most of the operations defined in this structure should probably be familiar to you.

Please remember that although each file has a pointer to this structure, many of these pointers will reference the same structure. Typically there is only one file_operations structure for each type of file supported by the file system.

struct file_operations {
        loff_t (*llseek) (struct file *, loff_t, int);
        ssize_t (*read) (struct file *, char *, size_t, loff_t *);
        ssize_t (*write) (struct file *, const char *, size_t, loff_t *);
        int (*readdir) (struct file *, void *, filldir_t);
        unsigned int (*poll) (struct file *, struct poll_table_struct *);
        int (*ioctl) (struct inode *, struct file *, unsigned int, 
                           unsigned long );
        int (*mmap) (struct file *, struct vm_area_struct *);
        int (*open) (struct inode *, struct file *);
        int (*flush) (struct file *);
        int (*release) (struct inode *, struct file *);
        int (*fsync) (struct file *, struct dentry *);
        int (*fasync) (int, struct file *, int);
        int (*check_media_change) (kdev_t dev);
        int (*revalidate) (kdev_t dev);
        int (*lock) (struct file *, int, struct file_lock *);
};
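The point of the table is indirection: generic code calls through f_op without knowing which file system or driver is underneath. The user-space miniature below (all names invented; this shows the pattern, not the kernel's actual read path) demonstrates the dispatch:

#include <stdio.h>
#include <string.h>
#include <sys/types.h>

struct vfile;                                   /* forward declaration          */

struct vfile_operations {                       /* analogue of file_operations  */
  ssize_t (*read) (struct vfile *, char *, size_t);
};

struct vfile {                                  /* analogue of struct file      */
  const struct vfile_operations *f_op;
  const char *backing;                          /* toy "file contents"          */
  size_t pos;
};

/* One implementation -- what one particular file type might register. */
static ssize_t mem_read (struct vfile *f, char *buf, size_t n)
{
  size_t left = strlen (f->backing) - f->pos;

  if (n > left)
    n = left;
  memcpy (buf, f->backing + f->pos, n);
  f->pos += n;
  return (ssize_t) n;
}

static const struct vfile_operations mem_ops = { mem_read };

/* The "generic" layer: it knows nothing about the file type underneath. */
static ssize_t vfs_read (struct vfile *f, char *buf, size_t n)
{
  return f->f_op->read (f, buf, n);
}

int main (void)
{
  struct vfile f = { &mem_ops, "hello, vfs\n", 0 };
  char buf[32];
  ssize_t n = vfs_read (&f, buf, sizeof(buf) - 1);

  buf[n] = '\0';
  fputs (buf, stdout);
  return 0;
}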

struct inode

Most of the fields in the inode should probably be self-explanatory. Please remember that the inode number is only unique within the file system. It takes the (device #, inode #) tuple to uniquely identify a file in a global context.

We'll talk more about the struct super_block shortly. The same is true for the struct vm_area_struct.

Please also notice the union within the inode. This allows the inode structure to be used with the several different types of file systems.

struct inode {
        struct list_head        i_hash;
        struct list_head        i_list;
        struct list_head        i_dentry;

        unsigned long           i_ino;
        kdev_t                  i_dev;

        /* Usual metadata, such as might be seen with "ls -l" */
        /* blah, blah, blah */

        struct inode_operations *i_op;
        struct super_block      *i_sb;
        wait_queue_head_t       i_wait;
        struct vm_area_struct   *i_mmap;
        struct pipe_inode_info  *i_pipe;

        union {
                struct minix_inode_info         minix_i;
                struct ext2_inode_info          ext2_i;
                ...
        } u;
};

Memory Mapping

We won't cover this in too much detail. But this is the structure that defines virtual memory areas. When a file is memory mapped, this defines the relationship between virtual memory and the file's blocks. The struct vm_operations_struct implements the operations on the memory mapped area. Obviously the implementation of these operations is different for different media types, file system types, &c.

/*
 * This struct defines a memory VMM memory area. There is one of these
 * per VM-area/task.  A VM area is any part of the process virtual memory
 * space that has a special rule for the page-fault handlers (ie a shared
 * library, the executable area etc).
 */
struct vm_area_struct {
        struct mm_struct * vm_mm;       /* VM area parameters */
        unsigned long vm_start;
        unsigned long vm_end;

        /* linked list of VM areas per task, sorted by address */
        struct vm_area_struct *vm_next;

        pgprot_t vm_page_prot;
        unsigned short vm_flags;

        /* AVL tree of VM areas per task, sorted by address */
        short vm_avl_height;
        struct vm_area_struct * vm_avl_left;
        struct vm_area_struct * vm_avl_right;

        /* For areas with inode, the list inode->i_mmap, for shm areas,
         * the list of attaches, otherwise unused.
         */
        struct vm_area_struct *vm_next_share;
        struct vm_area_struct **vm_pprev_share;

        struct vm_operations_struct * vm_ops;
        unsigned long vm_offset;
        struct file * vm_file;
        void * vm_private_data;         /* was vm_pte (shared mem) */
};

Memory Mapping Operations

These operations should seem reasonably meaningful to you. The advise() operation is a BSD-ism that is not implemented in Linux. wppage() is also unimplemented; I believe that it is also a BSD-ism.

struct vm_operations_struct {
        void (*open)(struct vm_area_struct * area);
        void (*close)(struct vm_area_struct * area);
        void (*unmap)(struct vm_area_struct *area, unsigned long, size_t);
        void (*protect)(struct vm_area_struct *area, unsigned long, size_t, 
                        unsigned int newprot);
        int (*sync)(struct vm_area_struct *area, unsigned long, size_t, 
                    unsigned  int flags);
        void (*advise)(struct vm_area_struct *area, unsigned long, size_t, 
                       unsigned int advise);
        unsigned long (*nopage)(struct vm_area_struct * area, 
                                unsigned long address, int write_access);
        unsigned long (*wppage)(struct vm_area_struct * area, 
                                unsigned long address, unsigned long page);
        int (*swapout)(struct vm_area_struct *, struct page *);
};

Inode Cache

The Linux inode cache is organized as an open chain hash table. The hashing function hashes the inode # and the device #.

All inodes in the hash table are also linked into one of three LRU lists:

If cache pressure forces an entry out of the cache, a clean one is preferred, since it does not need to be written to disk. The unused list is of course the preferred source of victims. The entries from deleted files, &c are placed in the unused list instead of freeing them to reduce the overhead of allocating and freeing structures within the OS -- as we discussed earlier, this is a common strategy.

struct dentry

In our earlier discussion of UNIX-like file systems, we very much oversimplified the directory entry -- we alleged that it was simply a mapping.

Here we see that it does exactly that -- but it also has another purpose. It maintains the structure of the directory tree by keeping references to siblings (d_child), the parent (d_parent), and subdirectories/children (d_subdirs).

The structure also contains some meta-data used to cache the entries (d_lru, d_hash, d_time), as well as mounting information (d_mounts = a directory mounted on top of this one, d_covers = the directory that this directory is mounted on top of).

The d_operations structure defines operations on directory entries -- mostly cache related. More soon.

struct dentry {
        int d_count;
        unsigned int d_flags;
        struct inode  * d_inode;        /* Where the name belongs to */
        struct dentry * d_parent;       /* parent directory */
        struct dentry * d_mounts;       /* mount information */
        struct dentry * d_covers;
        struct list_head d_hash;        /* lookup hash list */
        struct list_head d_lru;         /* d_count = 0 LRU list */
        struct list_head d_child;       /* child of parent list */
        struct list_head d_subdirs;     /* our children */
        struct list_head d_alias;       /* inode alias list */
        struct qstr d_name;
        unsigned long d_time;           /* used by d_revalidate */
        struct dentry_operations  *d_op;
        struct super_block * d_sb;      /* The root of the dentry tree */
        unsigned long d_reftime;        /* last time referenced */
        void * d_fsdata;                /* fs-specific data */

	/* small names */
        unsigned char d_iname[DNAME_INLINE_LEN]; 
};

dentry_operations

These operations should be mostly self-explanatory. revalidate() is needed to revalidate a cached entry if it is possible that something other than the VFS changed it -- this is typically only the case in shared file systems.

struct dentry_operations {
        int (*d_revalidate)(struct dentry *, int);
        int (*d_hash) (struct dentry *, struct qstr *);
        int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
        void (*d_delete)(struct dentry *);
        void (*d_release)(struct dentry *);
 };

Dcache

The dcache, a.k.a. the name cache or directory cache, provides a fast way of mapping a name to an inode. Without the dcache, every file access by name would require a traversal of the directory structure -- this could get painful.

By now, the structure of the dcache should be of no surprise to you: an open-chained hash table, hashed on the name, with entries also linked into an LRU list for replacement. Free dentry structures are kept in a separate list for reuse. The only surprise is that only names of 15 or fewer characters can be cached -- fortunately, this covers most names.

Replacement:

level1_cache/level1_head

level_2_cache/level2_head

Level 2 is safer -- entries can only be displaced by a repeatedly accessed entry, not by random new entries.

struct file_system_type

Since Linux can support multiple file systems, there is a structure that maintains the basic information about each one. These structures are kept in a singly linked list. When you mount a file system, the kernel walks this list until it finds a name that matches the type provided to the mount operation. If it can't find a matching type, the mount will fail. The next pointer is the link to the next node in the list, or NULL.

The most critical field in the list is the pointer to the super_block structure. The super block contains the meta-data that describes and organizes the file system.

struct file_system_type {
        const char *name;
        int fs_flags;
        struct super_block * (*read_super) 
				(struct super_block *, void *, int);
        struct file_system_type * next;
};

struct super_block

Most of the fields in the super_block should be self-explanatory. I have no idea what the purpose of the "basket" fields might be. As far as I know they are a recent addition to this structure and aren't used anywhere within the kernel -- perhaps they are a hint of coming attractions? I can only assume that they describe an unordered linked list of inodes.

Please notice the use of the union to permit the super_block structure to support multiple different file systems.

struct super_block {
        struct list_head        s_list;         /* Keep this first */
        kdev_t                  s_dev;
        unsigned long           s_blocksize;
        unsigned char           s_lock;
        unsigned char           s_rd_only;
        unsigned char           s_dirt;
        struct inode           *s_ibasket;
        short int               s_ibasket_count;
        short int               s_ibasket_max;
        struct list_head        s_dirty;        /* dirty inodes */
        struct list_head        s_files;
        ...
        union { 
                struct minix_sb_info    minix_sb;
                struct ext2_sb_info     ext2_sb;
                struct hpfs_sb_info     hpfs_sb;
                ....
        } u;
};

struct super_operations

The super block operations manipulate the meta-data associated with the file system. Their purpose is more-or-less self-evident.

struct super_operations {
        void (*read_inode) (struct inode *);
        void (*write_inode) (struct inode *);
        void (*put_inode) (struct inode *);
        void (*delete_inode) (struct inode *);
        int (*notify_change) (struct dentry *, struct iattr *);
        void (*put_super) (struct super_block *);
        void (*write_super) (struct super_block *);
        int (*statfs) (struct super_block *, struct statfs *, int);
        int (*remount_fs) (struct super_block *, int *, char *);
        void (*clear_inode) (struct inode *);
        void (*umount_begin) (struct super_block *);
};

The Buffer Cache

The buffer cache provides a way of caching file system blocks (data and metadata) to avoid repeated accesses to disk. Please remember that there is only one buffer cache per system -- not one cache per file system. The same buffers can be shared by multiple file systems. Heavy use of one file system will reduce the number of buffers used by another, &c.

There are buffers of different sizes, so the cache is really a collection of caches -- one for each block size.

Some people like to think of each block buffer as the representation of a request. That is to say that the contents of a buffer represent the results of a recent request. I prefer to think of buffers as simple containers -- but the request analogy isn't bad and might be useful to you.

Two main parts:

Properties:

Victim Selection

Each block buffer is maintained on one of the following LRU lists:

The victim is the best clean buffer. If a victim can't be found, the system will try to create more buffers. If that fails, it will try to free block buffers of other sizes and try again.

The bdflush Kernel Daemon

The bdflush daemon flushes dirty blocks, creating clean blocks. It normally sleeps, but wakes up:

struct buffer_head

The buffer_head structure is the structure that represents an individual block buffer (or, if you prefer, a buffered request). At this point, most of the fields should be reasonably familiar.

struct buffer_head {
        /* First cache line: */
        struct buffer_head *b_next;     /* Hash queue list */
        unsigned long b_blocknr;        /* block number */
        unsigned short b_size;          /* block size */
        unsigned short b_list;          /* List that this buffer appears */
        kdev_t b_dev;                   /* device (B_FREE = free) */
        atomic_t b_count;               /* users using this block */
        kdev_t b_rdev;                  /* Real device */

        unsigned long b_state;          /* buffer state bitmap (see above) */
        unsigned long b_flushtime;      /* Time to write (dirty) buffer */

        struct buffer_head *b_next_free;/* lru/free list linkage */
        struct buffer_head *b_prev_free;/* doubly linked list of buffers */
        struct buffer_head *b_reqnext;  /* request queue */
        struct buffer_head **b_pprev;   /* 2x linked list of hash-queue */

        char *b_data;                   /* pointer to data block (1024 bytes) */

        void (*b_end_io)(struct buffer_head *bh, int uptodate); /* I/O completion */
        void *b_dev_id;
        unsigned long b_rsector;        /* Real buffer location on disk */
        wait_queue_head_t b_wait;
        struct kiobuf * b_kiobuf;       /* kiobuf which owns this IO */
};