Lecture 27 (November 6, 2002)

Lecture 27 (November 6, 2002)

A Quick Look Back At Traditional File Systems

We've looked at "General Purpose inode-based file systems" such as UFS and ext2. They are the workhorses of the world. They are reasonable fast, but have some limitations, including:

Internal fragmentation: waste w/in allocated blocks
Slow disk performance due to seeking (although some like FFS try to reduce this using extents. They keep track of multisector blocks of various sizes and can allocate that way)
Large logical blocks decrease transfer cost and metadata overhead (number of indirect blocks needed, &c), but increase fragmentation.
Small logical blocks do the opposite, reduce fragmentation, but increase inode overhead and transfer time.
Opportunity for meta-data inconsistency due to crashes. Long recover time due to intensive process used by tools such as the venerable fsck to discover lost information or otherwise force consistency.
Block sizes vs. meta data, and the need for fsck to scan whole partions makes these inappropriate for large partitions - to much wastage and too much time wasted to fsck for high availability.
Free space tracking, typically a bit for each block, is also a big RAM waste and time-consumer for large file systems.
Directory lookups are painfully slow - sequentially open and search directory files - caching helps a good bit, but there are plenty of cold misses and capacity misses, &c.
Static inode allocation limits the number of files and wastes space if not needed.

A Bit of a Historical Perspective

In the early days of file systems, we were concerned about the basics -- naming and ownership, and organizing things. Then we became concerned about efficiency at scale -- more and bigger files, and files of varying sizes.
More recently, verifying metadata consistency has become a huge concern. fscking (File System CheKING) a file system's metadata at boot can take a long time. Maintaining metadata (or even data) logs is a big-step to being able to do that in a time-efficient way when needed, because they enable us to efficiently check only the things that need checking -- not everything.
We don't talk about LFS because it is a good solution to real-world probelems. I think there are probably relatively few situations where it is the best solution. The assumptions are just not aligned with practice.
But, we do talk about it because it is an interesting thought experiment about how we can make certain things more efficient and was one fo the very early solutiions involving logs -- which are critically important today.

Berkeley's Log Structure File Systems

Introduction

In our discussion of file systems we talked much about how the data structures and techniques of a typical general purpose file system. But there are many specialized file systems that don't fit this bill. Once such file system is BSD's log structured file system (LFS).
Traditional file systems have two limitations that LFS was designed to mitigate. The first is the expense involved with moving the head around the disk to support the request pattern of software. The other is recovery from failure. Traditional file systems lack an intrinsic checkpointing and recover mechanism.

Aside:
Ask 100 system administrators how often the perform "cold" backups -- and then ask them how they do it. I suspect many will, after consideration, confess "close enough" -- doubt it! Checkpointing is a real issue -- many so called "bare metal" or "cold" backups are actually warm backups that were made while the system was running and contain inconsistent information resulting from dirty metadata.
If the data is valuable, checkpointing is best when it is a built-in part of the file system and doesn't rely on time-pressured human processes.

Much of the latency that results from seek and rotational delay can be overcome using caching and "smart allocation" techniques. The Berkley Fast File System (FFS) is the classic example of a file system designed with this in mind. It is very careful to allocated related items, data and metadata, nearby on disk. This combined with standard caching resulted in a very small physical latency. Log structured file systems attempt to accomplish the same thing, while introducing checkpointing and recovery.
LFS isn't a file system that is currently in use (as far as I know), or, to be honest, that I think was ever a good solution to common challenges. But, we're discussing it because, real-world performance concerns aside, there is a good bit to be learned from it, I think.

The Basic Idea

The basic idea is to structure the entire file system as a log. Every time we make a change to the file system, we simply append it to the log. Using this approach, we don't have to move the head very much from write to write -- each write is right after the previous one. For this to work, related data and metadata should be updated together and appended at the end of the log. Although LFS is an example, other file systems also use this approach, collectively, they are known as journaling file systems.
This system is further improved throguh buffering. Not every write is written to disk. Instead, updates are collected into segments and each segment is written to disk at once. This prevents a collection of small writes from pounding on the same place on disk.

Reality #1: Required Consistency

Alas, the real world isn't always a pretty place. Some operations require that the disk be left in a consistent state -- whole segment or otherwise. synch is one such operation, close() is another such operation, as are any write()s by NFS.
Since the segment size is usually very large -- on the order of one or more physical tracks, this would potentially require wasting a large amount of storage for very small operations. Top solve this problem LFS defines a structure, called a partial segment within a segment. A segment consists of one or more partial segments. A partial segment is a collection of data and metadata that is written at a time. To support this, the structure of the log must allow for the distinction among segments -- and also partial segments. We'll talk more about the orgainization of the log soon. But for now, let's just realize that the log isn't one long stream of changes -- it has a structure and meta data of its own.

Reality #2: How Do We Find the inodes?

LFS is very efficient for writes -- they are buffered most of the time, and when they are written, it is contiguuously without unnecessary physical delays. But what about reads? They are the more common case in general purpose file systems, right?
Well, reads require finding the data. If the data is spread about multiple places in the log file, how do we find those places? Well, we could look in the inode, as we always have before, but how do we find the inode? We might have multiple versions of it spread across the same log file.
To solve this problem LFS introduces a new data structure, the inode map. The inode map is a RAM-based data structure that contains the most up-to-date mapping from inode numbers to disk locations. This mapping is periodically written to disk as a file. In the event of a crash, the most up-to-date version of this file is found and all changes after this checkpoint are replayed to rebuild the in-RAM inode map. Clearly, the more often this file is checkpointed to disk, the faster the recovery process can proceed. Of course, the trade-off is that the process of writing the inode map to disk takes time. Making the common case fast means making the recovery case slower -- this is an administrative trade-off.

Reality #3: Disks Aren't Infinite

It would be nice if we could just keep extending our log forever, but unfortunately, disks aren't infinite. We can't keep a copy fo every change ever made to any piece of data or metadata -- this would become an expensive prospect.
To address this problem, LFS includes a cleaner process. This process periodically starts at the begining of the log and throws away replaced or deleted entries; active entries are rewritten at the beginning of the log. This process is carried out on a segment by segment basis. To determine if an inode is current, it looks in the inode map and checks for this inode at this location. To check to see if a data block is current, it looks up the associated inode in the inode map and then checks to see if the block in question is in use by the inode.
To allow for the reuse of segments, they are organized as a linked list. That is to say that segments can be logically contiguous without being physically contiguous. Each segment contains the physical number of the next logical segment.
The cleaner might run when the filesystem runs our of space, between certain high and low water marks, or as a low priority almost "idle" process. Obviously, the cleaner results in a massive performance hit and the policy decisions here can really affect overall performance.

Reality #4: The Not-So Universal Buffer Cache

Recall our discussion of the buffer cache. We said that it is file system independent and that only one buffer cache exists across all file systems (acknoledging that this cache may consist of several caches of different sized blocks).
Well, when it comes to LFS, this cache can't be so file system independent, afterall. Why not? Two reasons:

Under cache pressure, the buffer cache will flush one block at a time. If we write one block at a time from the cache, we might as well not bother with the segment buffering. The gain in efficiency that is realized with a single, large, contiguous write is lost. To solve this problem all LFS blocks in the buffer cache are maintained in the unflushable LOCKED queue. This makes the caching less efficient, but preserves the efficiency of writes.
If we change a data block, this requires updating the metadata. In a traditional file system, this isn't problematic, because the new data metadata is written on top of the old metadata. Unfortunately, since we are writting all chenages into new blocks, this means that we need to allocate new buffers for the metadata, before writing it out. If we are out of memory, we can't do this. The solution is to maintain a pool of otherwise unused buffers to use for potential metadata changes. Again, this decreases the efficiency of the buffer cache, but "What to do?"

Reality #5: Directory Operations Require Proper Ordering

Operations like create(), link(), mkdir(), remove(), &c must be synchronous. This is a problem for recovery, if certain inodes are written out, but not others. This could result in prior operations being logged, but not later ones. This is because these operations might affect multiple inodes, so the changes in the log might not be ordered correctly -- dirty bits are used to write out the most up-to-date version of a block in memory.
To solve this problem, a hack involving some special flags is used. These flags delay the write out of certain inodes, to ensure that the dependencies imposed by the directory ordering are enforced.

The LFS disk layout

The LFS file system is divided into segments. The first and last logical segments contain copies of the super block. The disk label appears once at the beginning of the disk.

Disk label

segment 1 = super block

segment 2

...

segment n = super block

Each segment is composed of multiple partial segments. Each segment contains a segment summary that describes the partial segments as well as the associated inode blocks and data blocks.

Segment summary

data block

inode block

data block

inode block

inode block

...

&c

The segment summary contains the information that describes a partial segment. It contains the file information structure that describe files within the segment as well as other metadata.

Checksum over partial segment

Next segment (for linked list)

inode count

file information

file information

...

disk address for inode

disk address for inode

...

The file information contains the information about a file within the partial segment. This information includes a version (generation) number, inode number, and logical block numbers.

number of blocks in structure

file version number

last block size (might not be full)

logical block #1

logical block #2

...

logical block #n

The Inode Map

Remember that the inode map, known as the index file or ifile is a regular file on disk. It has the following structure:

#dirty segments

#clean segments

Segment info 1

Segment info 2

...

Segment info n

Inode info 1

Inode info 2

...

Inode info n

Each segment info structure contains the following fields:

live byte count

timestamp

dirty flag

active flag

super block flag

Each inode info structure contains the head node of a linked list. Each node in the list has the same structure:

version number

disk address

pointer to next free inode, only if inode is unused and on the free list

Cost-Benefit Analysis

Does LFS reduce latency? Well, this depends. There are a collection of published papers that claim it does. But I yell, "Foul!" These papers are based on very large caches and buffers. The performance boost is based on enough memory for very large track buffers and inefficient cache use.
Most systems don't dedicate this amount of RAM to the filesystem, and perhaps shouldn't. Perhaps the real questions shouldn't be, "Given infinite RAM, which file system performs best?", but should be, "In building a general purpose system, balancing the costs and ebefits, which would we choose?" I suspect that the answer to this -- as we see in almost all, if not all, general purpose systems, is a more traditional file system.
So why do we talk about LFS? Well, it is a very good idea when checkpointing and recovery is a primary concern. It can also be used as a supplement for a traditional file system: most accesses are through a traditional means, but a log is also maintained for recovery. It might also be a good idea for certain write-often application, especially those requiring robust recovery -- for example, logging the transactions on a major stock exchange.

/proc: More Abstraction With File Systems

In class today, I posed a challenge to the class. I described the /proc file system and asked the class how it could be implemented under Linux.
The /proc file system is a virtual file system. It is a convenient way for user processes to get access to information about the state of the system, such as a list of devices, a list of file systems, various performance statistics, &c. It is also a convenient way for users to get access to information about specific processes, such as their current working directory, their virtual memory map, their process state, and their command line and environment. In addition to reading this information, some information such as the kernel tunables and the VM maps are writable to authorized users.
The general system-wide information is represented as a collection of files under /proc. Information about subsystems like the network is maintained in subdirectories, such as /net.
There is a subdirectory under /proc for each process named by the process's pid. This subdirectory contains all of the files relating to the process.
These files and subdirectories aren't real -- or, more precisely they don't live on disk. It wouldn't make much sense to write them to disk, since, for the most part they do not persist across reboots and are their content is highly variable.

The directory structure

The class decided that the directory structure should be very similar to the directory structure of a traditional file system, except it should only live in RAM.
This is because the directory is organized much like a traditional directory: parent, children, siblings, &c, but doesn't need to be written to disk since it is small and not persistent. Since the root of /proc always contains the same files, these directory entries could be static. The others must be dynamically.

The inode_operations

The inode traditionally maps a file to its storage. But in the case of /proc, the files don't actually exist and are often constantly in flux. So, what should the inode do?
It should generate the file data dynamically upon demand. If the user requests the file that contains information about X, the inode should collect the data and build the temporary file that represents it in memory. Now that we have a the data, we can represent it and manipulate it as we did before with the file_struct and file_operations. The inode can also contain the metadata necessary to maintain ownership protection, and access modes (actually, in /proc, Linux puts this metadata into the directory entry -- it is more persistent than the data itself).

file_struct and file_operations

As before, files are represented with using the file_struct and operations are represented using the file_operations struct. The file operations structure define how the standard operations like read(), write(), open(), close(), &c will behave. In the cases where these operations modify the data, they can invoke appropriate inode operations to write it back out to the kernel data structures. It is possible for some buffering to take place at this level.

The Super Block

The /proc system should have a super block like any other file system. By register itself in a normal way, it will be possible to mount /proc and play all of the normal file system games. The super block should, as usual, contain a reference to the root inode, so that the root directory can be found. Since the root directory of /proc is static, it can be initialized when the super block is initialized.

Hybrid file systems

Today we are going to talk about a new generation of file systems that keep the best characteristics of traditional file systems, plus some improvements, and also logging to increase availability in the event of failure. These file systems, in particular, support much large file systems than could reasonably be managed using the older file systems and do so more robustly -- and often faster.
Much like the traditional file systems that we talk about have common characteristics, such as similar inode structures, buffer cache organizations, &c, these file systems will often share some of the same characteristics:

They will log only the metadata, not the data. This allows for a fast recovery at boot time, but doesn't provide any consistency or correctness guarantees about the data blocks of the files, themselves.

Unlike LFS, these file systems will manage the data blocks (or extends) in a random-access way -- writes to data blocks will replace the original data, not land at the end of the log file. (The log, itself, however, is an append log).

Make extensive use of the B+ tree data structure to organize information.

The free block (or extent) list is one such case -- it is often maintained by B+ trees. A common organization keeps two trees -- one ordered by size and another ordered by location. This makes it fast to find an extend that is large enough, and also to find one that is nearby other data in the file -- this minimizes seek delay.

Some filesystems further improve this by organizing extents into two different B+ trees, one by physical address and another by size.

Give my an extent at least size X", is fast
As is, "Give me a block or extent near by X".

Directory files are replaced with B+ trees.

Typically one file-system-wide B+ tree containing all directories and
One per-directory B+ tree of entries.
But, some use one B+ tree for whole file system

Organize inodes by file name using a B+ tree

Keep small-sized files' data directly in the inodes.
Keep medium-sized files' data in extents or blocks directly named from the inodes
Keep large-sized files' blocks or extents organized in B+ tree indexed by offset and named by inode.
Many allow dynamic inode allocation to avoid wasted space - there's no more need to index into an array by number. The inode is directly named by the B+ tree

In general, performance comparable to standard file systems, with more space efficiency and higher reliability.

ReiserFS

The ReiserFS isn't the most sophisticated among this class of filesystems, but it is a reasonably new filesystem. Furthermore, despite the availability of journaling file systems for other platforms, Reiser was among the first availble for Linux and is the first, and only, hybrid file system currently part of the official Linux kernel distribution.
As with the other filesystems that we dicsussed, ReiserFS only journals metadata. And, it is based on a variation of the B+ tree, the B* tree. Unlike the B+ tree, which does 1-2 splits, the B* tree does 2-3 splits. This increases the overall packing density of the tree at the expense of only a small amount of code complexity.
It also offers a unique tail optimization. This feature helps to mitigate internal fragmentation. It allows the tails of files, the end portions of files that occupy less than a whole block, to be stored together to more completely fill a block.
Unlike the other file systems, its space management is still pretty "old-school". It uses a simple block-based allocator and manages free space using a simple bit-map, instead of a more efficient extent-based allocator and/or B-tree based free space management. Currently the block size is 4KB and the maximum file size 4GB, and the maximum file system size is 16TB, Furthermore, ReiserFS doesn't support sparse files -- all blocks of a file are mapped. Reiser4, scheduled for release this fall, will address some of these limitations by including extents and a variable block size of up to 64KB.
For the moment, free block are found using linear search of bitmap. The search is in the order of increasing block number to match disk spin. It tries to keep things together by searching bitmap beginning with position representing the left neighbor. This was empirically determined to be the better of the following:

Starting at the beginning (no locality, really
Starting at the right neighbor (begins past us, given disk spin)
Starting at the left neighbor (if space, right in-between; but costly to find left neighbor)

ReiserFS allows for the dynamic allocation of inodes and keeps inodes and the directory structure organized within a single B* tree. This tree organizes four different types of nodes:

Direct items - tails of files packed together or one small file
Indirect items - unformatted [data] nodes; hold whole blocks of file data
Directory items - key for first directory entry, plus number of directory entires
Stat items - metadata (configuration option to combine these with directory item)

Items are stored in the tree using a key, which is a tuple:
<parent directory ID, offset within object, item type/uniqueness>, where

parent ID is ID of parent object
For files, the offset indicates the offset of the first byte stored in this item. For directories, it contains the first 4 bytes of the filename of the first file stored within the node
The item type/uniqueness field indicates the type of the node:

0 - stat
-1 direct
-2 - indirect
500 - directory + unique number for files matching in first 4/bytes

Each key structure also contains a unique item number, basically the inode number. But, this isn't used to determine ordering. Instead, the tree sorts keys using each tuple, in order of position. This orders the files in the tree in a way that keep files within the same directory together, and then these sorted by file or directory name
The leaf nodes are data nodes. Unformatted nodes contain whole blocks of data. "Formatted" nodes hold the tails of files. They are formatted to allow more than one tail to be stored within the same block. Since the tree is balanced, the path to to any of these data nodes is the same length.
A file is composed of set of indirect items and, at most 2 direct items for the tail Why not always one? If a tail is smaller than unformatted node, but larger than formatted node, it needs to be broken apart and placed into two direct nodes).

SGI's XFS

In many ways SGI's XFS is similar to ReiserFS. But, it is in many ways more sophisticated. It may be the most sophisticated among the systems we'll consider. This being said, unlike ReiserFS, XFS uses B+ trees instead of B* trees.
The extent-based allocator is rather sophisticated. In particular, it has three pretty cool features. First, it allows for delayed allocation. Basically, this allows the system to build a virtual extent in RAM and then allocate it in one piece at the end. This mitgates the "and one more thing" syndrom that can lead to a bunch of small extents instead of one bit one. It also allows for the preallocation of an extent. This allows the system to reserve an extent that is big enough in advance so that the right sized extent can be used -- without consuming memory for delayed allocation or running the risk of running out of space later on. The system also allows for the ?coalecing of extents as they are freed to reduce fragmentation.
The file system organized into different partions called allocation groups (AGs). Each allocation group has own data structures -- for practical purposes, they are seaparate instances of the same file system class. This helps to keeps data structures to a normal scale. It also allows for parallel activity on multiple AGs, without concurrency control mechanisms creating hot spots.
Inodes are created dynamically in chunks of 64 inodes. Each inode is numbered using a tuple that includes both the chunk number and the inode's index within its chunk. The location of an inode can be discovered by lookup in B+ tree by chunk number. The B+ tree also contains bitmap showing which inodes within each chunk are used.
Free space is managed using two different B+ tree of extents. One B+ tree is organized by size, whereas the other is organized by location. This allows for efficient allocation -- btoh by size and locality.
Directories are also stored in a B+ tree. Instead of storing the name, itself in the tree, a hash of the name is stored. This is done, because it is more complicated to organize a B tree to work with names of different sizes. But, regardless of the size of the name, it will hash to the same sized key.
Each file within this tree contains its own storage map (inode). Initially, each node stores block offset and extent size measured in blocks. When the file grows and overflows the inode, the storage allocation is stored in a tree rooted at the inode. This tree is indexed by the offset of the extent and stores the size of the extent. In this way, the directory structure is really a tree of inodes, which in turn are trees of the file's actual storage. Much like ReiserFS, XFS logs only metadata changes, not changes to the file's actual metadata. In the event of a crach, it replays these logs the obtain consistent metadata. XFS also includes a repair program, similar to fsck, that is capable of fixing other types of corruption. This repair tool was not in the first release of XFS, but was demanded by customers and added later. Logging can be done to a separate device to prevent the log from becoming a hot-spot in high-throughput applications. Normally asynchronous logging is used, but synchronous is possible (be it expensive).
XFS offers variable block size ranging from 512 bytes - 64K and an extent-based alloctor. The maximum file size is 9 thousand petabytes. The maximum file system size is 18 thousand petabytes.

IBM's JFS

IBM's JFS isn't one of the best performers among this class of file system. But, that is probably becuase it was one of the first. What to say? Things get better over time -- and I think everyone benefitted from IBM's experience here.
File system partitions correspond to what are known in DFS as aggregates. Wthin each partition lives an allocation group, similar to that of XFS. Within each allocation group is one or more fileset. A fileset is nothing more than a mountable tree. JFS supports extents within each allocation group.
Much like XFS, JFS uses a B+ tree to store directories. And, again, it also uses a B+ tree to track allocations within a file. Unlike JFS, the B+ tree is used to track even small allocations. The only exception is an optimization that allows symlinks to live directly in the inode.
Free space is represented as array w/1 bit per block. This bit array can be viewed as an array of 32-bit words. These words then form a binary tree sorted by size. This makes it easy to find a contiguous chunk of space of the right size, without a linear search of the available blocks. The same array is also indexed by another tree as a "Binary Buddy". This allows for easy coalescing and easy tracking of the allocated size.
These trees actually have a somewhat complicated structure. We won't spend the time here to cover it in detail. This really was one of the "original attempts" and not very efficient. I can provide you with some references, if you'd like more detail.
As for sidelines statistics, the block size can be 512B, 1KB, 2KB, or 4KB. The maximum file size ranges from 512TB with a 512 byte block size to 4 petabytes with a 4KB blocks size. Similarly, the maximum file system size ranges from 4PB with a 512 byte blocks to 32 petabytes with a 4KB byte block size.

Ext3

Ext3 isn't really a new file system. It is basically a journaling layer on top of Ext2, the "standard" Linux file system. It is both forward and backward compatible with Ext2. One can actually mount any ext2 file system as ext3, or mount any ext3 filesystem as ext2. This filesystem is particularly noteworthy because it is backed by Red Hat and is their "official" file system of choice.
Basically RedHat wanted to have a path into journaling file systems for their customers, but also wanted as little transitional headache and risk as possible. Ext3 offers all of this. There is no need, in any real sense, to convert an existing ext2 file system to it -- really ext3 just needs to be enabled. Furthermore, the unhappy customer cna always go back to ext2. And, in a pinch, the file system can always be mounted as ext2 and the old fsk remains perfectly effective.
The journaling layer of ext3 is really separate from the filesystem layer. There are only two differences between ext2 and ext3. The first, which really isn't a change to ext2-proper, is that ext3 has a "logging layer" to log the file system changes. The second change is the addition in ext3 of a communication interface from the file system to the logging layer. Additionally, one ext2 inode is used for log file, but this really doesn't matter from a compatibility point of view -- unless the ext2 file system is (or otherwise would be) completely full.
Three types of things are logged by ext3. These must be logged atomically (all or none).

Metadata - The whole block of updated metadata (even if only small part of block changed). This is basically a shadow copy of the updated block.
Descriptor Blocks - These tell where each metadata block should be copied on recovery. These are written before metadata block. Rememebr, the metadata blocks are just unformatted blocks of data -- the descriptor blocks are necessary to tell us which are which.
Header Blocks - These describe the log file, itself. In particular, we need to know the head and tail of journal file, as well as the current sequence number. These tell us where current head and tail of log are for updates

Periodically, the in-memory log is check-pointed by writing outstanding entries to an in memory journal. This journal is committed periodically to disk. The level of journaling is a mount option. Basically, writes to the log file are cached, like any other writes. The classic performance versus recency trade-off involves how often we sync the log to disk.
As for the sideline stats, the block size is variable between 1KB and 4KB. The maximum file size is 2GB and the maximum filesystem size is 4TB.
As you can see, this is nothing more than a version of ext2, which supports a journaling/logging layer that provides for a faster, and optionally more thorough, recovery mode. I think Red Hat made the wrong choice. My bet is that people want more than compatibility - more than Ext3 offers. Instead, I think that the ultimate winner will be the new version of ReiserFS or XFS. Or, perhaps, something new -- but not this.

Ext4

Ext4 is the next in the lineage. It is a big step forward from ext3 and, unlike ext3 vs ext2, the on-disk data structures are not backwardly compatible. The discussion in class was based upon the following:

Ext4: The Next Generation of Ext2/3 Filesystem (Cao, et al)
Anatomy of Ext4 (M. Jones, IBM)