Hard Drives
Hard disk drives are the venerable solution when it comes to bulk data storage. I've been hearing for my entire adult life that they'll soon be replaced by this-or-that. And, Flash-Based SSD devices are getting there. We'll talk about them next class. But, for right now, a lot of drives are spinning out there. And, that'll be the case for a very long time.
How Hard Drives Work, An Overview
Hard drives are stacks of two-sided disks called platters. The platters rotate at a constant rate of anywhere between 3600 RPM and 10,000 RPM. Unlike CLV CDs, the rate of rotation for hard drives is constant; they are said to be Constant Angular Velocity (CAV). This leads to an organization of tracks that are concentric circles, rather than a single spiral.

The bits are encoded using magnetic polarity. For the moment, you can imagine a north pole facing upward as a 1-bit and a south pole facing upward as a 0-bit. But, in reality, it gets more complex than that. A head, essentially a small coil of wire, is positioned over a track and senses the flux resulting from the transition of a north pole to a south pole or vice versa. This small electrical signal is amplified, cleaned up, and interpreted to produce the stream of 0s and 1s that represent the data. Because coils of wire can sense changes in magnetic fields, not constant magnetic fields, the bits are encoded to prevent long strings of 0s or 1s. Modern hard drives use variants of a scheme called Run-Length Limited (RLL) encoding.
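To make the transition-sensing idea concrete, here's a minimal sketch of NRZI-style encoding, where a 1-bit is a flux transition and a 0-bit is the absence of one. This is a deliberate simplification, not the RLL scheme real drives use, but it shows why long runs without transitions are troublesome: they give the head, and the clock recovery, nothing to lock onto.

```python
# NRZI-style encoding: a 1-bit is a flux transition, a 0-bit leaves the
# polarity alone. A stand-in for real RLL variants, for illustration only.

def encode_nrzi(bits):
    """Map a bit string to a sequence of magnetic polarities (+1/-1)."""
    polarity, flux = 1, []
    for b in bits:
        if b == '1':
            polarity = -polarity   # a 1-bit flips the polarity
        flux.append(polarity)      # a 0-bit leaves it unchanged
    return flux

def decode_nrzi(flux):
    """Recover the bits by sensing only changes, as a coil would."""
    bits, prev = [], 1
    for p in flux:
        bits.append('1' if p != prev else '0')
        prev = p
    return ''.join(bits)

data = '101100101'
assert decode_nrzi(encode_nrzi(data)) == data
print(encode_nrzi('10000001'))  # the run of 0s produces no transitions
```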
The heads are actually stacked, so that there is one head for each surface, or side, of each platter. For reasons of simplicity, cost, and efficiency, the heads are not independent; they move together. Specifically, they seek from track to track by moving in and out on the drive. Technically speaking, they pivot in and out on an arc, rather than moving straight in and out. But, you'll see that for yourself soon enough.
As was the case with CDs, and is the case for most any other storage device or communication system, disks aren't organized as endless streams of bits. They need to have small, manageable parts so that they can be easily addressed, and so that the data can easily be found, edited, and checked for errors. Tracks, themselves, vary in size, with the smallest at the inside, where the circumference is small, and the largest at the outside, where the circumference is large. And, depending on one's perspective, an outside track might be viewed as too large for convenient management. So, tracks are broken down into sectors.
In addition to the data, itself, sectors contain metadata. This includes the sector number, synchronization fields, ECCs, and status bits. A status bit might indicate if the sector is in use or defective.
There are small gaps between the sectors. These, of course, allow for tolerances, but they also allow for one sector to be processed by the electronics, before the next one shows up. There is a similar gap between tracks, also to allow for tolerances.
A Historical Model of Sectors, Tracks, and Cylinders
Back in "The Day", when things were simple, it was easy to sketch pictures of the organization of the data on a hard drive. The picture below shows the old school organization, where each surface contains concentric circle tracks. And, because the disk spins at a constant rotational velocity, there are an equal number of them on the inside and the outside. The bit-density is higher on the inside and lower on the outside. But, since the bits pass under the heads at the same rate, the electronics didn't know that, as they timed it out.Beyond this, since the heads moved together, they moved across corresponding tracks on each surface. There was a negligible delay to switch from one head to another. As a result, data was often written from head-to-head, then sector-to-sector. And, only then would the head move for a seek.
It is important to note that moving the heads was not, in the early days, nor is it now, a fast thing to do. The head needs to overcome inertia, speed up, coast, slow down, stop, and become stable. As it stops, it oscillates. So, waiting for it to become stable basically means waiting for this oscillation to damp out, and tuning, if necessary. Reading requires less stability than writing, because some error can be managed without degrading future access. But, writes need tighter tolerances, to ensure that future reads will be doable.
The basic picture, in the old days, looked like this:
![Old-school geometry: concentric tracks, the same number of sectors per track, stacked into cylinders]()
Modern Disks
In the early days of disk storage, the electronics were the limiting factor. We could store bits more densely than we could clock them flying by. This meant that there was no loss in having a lower bit density on the outside than on the inside, since the speed, not the density, was the enemy.

Over time, disk heads, and especially the electronics for signal processing, got dramatically better. We became able to clock data flying by faster than we could squeeze the bits together, at least on the inside of the disk. As a result, it became inefficient to lose storage by keeping the same number and size of bits on the inside and outside.
One can't make a flat, circular disk where all of the tracks are the same length, so if bits are stored at the same linear density everywhere, the tracks can't all hold the same amount of data. Varying the size of the sector is impractical, because less-than-maximum-size sectors would waste space in hardware and system buffers, never mind force a lot of physical knowledge into the organization of simple reads and writes.
So, modern disks vary what they can -- the number of sectors per track. Outer tracks have more sectors than inner tracks. An adjacent group of tracks that has the same number of sectors per track is called a zone. So, we say that the surface is divided into zones, each of which has the same number of sectors per track.
So, we now have a picture that looks like this:
![Zoned bit recording: outer zones have more sectors per track than inner zones]()
Source: http://www.pcguide.com
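To see what zoning buys, here's a sketch with an invented zone table (all of the numbers are made up for illustration, not taken from any real drive): the same surface holds noticeably more when outer tracks carry more sectors.

```python
# Zoned bit recording: each zone is a band of adjacent tracks sharing a
# sectors-per-track count. Outer zones hold more because tracks are longer.

SECTOR_BYTES = 512

# (tracks_in_zone, sectors_per_track), outermost zone first. Invented numbers.
zones = [
    (5000, 1000),
    (5000, 850),
    (5000, 700),
    (5000, 600),
]

zoned_sectors = sum(tracks * spt for tracks, spt in zones)
# the old-school alternative: every track limited to the innermost density
flat_sectors = sum(tracks for tracks, _ in zones) * zones[-1][1]

print(f"zoned:   {zoned_sectors * SECTOR_BYTES / 1e9:.1f} GB per surface")
print(f"unzoned: {flat_sectors * SECTOR_BYTES / 1e9:.1f} GB per surface")
```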
Because, in the push for increased capacity, tolerances have gotten smaller, there is a delay associated with switching from one head to the next. It is smaller than a seek delay, because head movement is usually not needed. But, it does take time to tune the electronics to the signal from a different head.
Rather than corresponding sectors forming a true cylinder, they actually form more of a spiral, with a skew from one platter to the next. This allows the head time to get tuned to the track on the next surface before the logically sequential sector comes under the head. For example, surfaces might be stacked as follows:
![Track skew from one surface to the next]()
Source: http://www.pcguide.com
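How much skew is enough? A back-of-the-envelope sketch with assumed numbers (the head-switch delay and geometry are invented for illustration): the next surface needs to be offset by at least the number of sectors that pass under the head during one head switch.

```python
import math

# Sectors that fly by during one head switch determine the minimum skew.
RPM = 7200
SECTORS_PER_TRACK = 800
HEAD_SWITCH_MS = 0.8   # assumed; see the latency discussion below

rotation_ms = 60_000 / RPM                     # 8.33 ms per revolution
sector_time_ms = rotation_ms / SECTORS_PER_TRACK

skew = math.ceil(HEAD_SWITCH_MS / sector_time_ms)
print(f"skew the next surface by {skew} sectors")  # 77, with these numbers
```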
Disks and Latency
Historically, disks are said to encounter three types of latency, in order of significance:
- Seek: The time it takes to start the head moving, get it going, slow it down, and let it settle into a single track.
- Rotational: The time it takes to wait for the disk, spinning at a constant speed, to spin around to position the desired data under the head for reading
- Transfer: The time it takes to read all desired adjacent sectors on a track, once the first sector is under the head
In modern drives, things get somewhat more complex:
- Seek: The time it takes to start the head moving, get it going, slow it down, and let it settle into a single track.
A very short seek might take 0.7ms to 2ms, whereas a "full stroke" seek from inside to outside might take 10ms. An "average" seek is typically said to be about 1/3 of a full-stroke seek, or about 3ms. (The numbers vary, a lot, with the drive.)
The reason for this is that short moves encounter a lot of overhead to overcome inertia and to stabilize the head, and this overhead is amortized over larger moves. But, beyond that, since a head is never perfectly still, sometimes it is possible to "settle" into a nearby track without encountering the full overhead of even a short seek. This is done through tuning the electronics to filter the signal, and by a "bump-like movement" of the heads.
- Head switch delay: The time it takes to switch heads, to go from paying attention to one surface to another.
On older drives, this wasn't an issue. With newer drives, the tolerances are tight. Reading a track involves not only positioning the heads "well enough", but also tuning the signal processing to the signal that is actually being received, which will vary, even within the tolerances that allow for the track to be read.
In general, a head switch delay is no greater than the track-to-track seek time, and can be shorter. Depending on the drive, times of 0.7ms to 1ms might be typical. As we discussed earlier, the penalty of a head switch is partially hidden by skewing the tracks from one surface to the next, to avoid paying both the head-switch delay and a rotational delay for missing the next logical sector because of it.
- Rotational: The time it takes to wait for the disk, spinning at a constant speed, to spin around to position the desired data under the head for reading (Nothing strange here).
This is a function of the drive's rotational speed. A full rotation, to the sector just before the one just read, takes 60/DRIVE_RPM seconds. For example, a full rotational delay on a 3600 RPM drive is 1/(3600 RPM / 60 sec/min) = 1/(60 rev/sec) = 16.7ms. On a 7200 RPM drive, it is half this, 8.33ms. And, on a high-performance 10,000 RPM drive, it is 1/(10,000 RPM / 60 sec/min) = 6ms. An average rotational delay is, as you might expect, half a spin, or half of this maximum rotational delay. (These delays are combined into a rough service-time model in the sketch after this list.)
- Transfer: The time it takes to read all desired adjacent sectors on a track, once the first sector is under the head
- Effect of Buffering: Hard drives are heavily buffered, i.e., internally cached. They can hold whole tracks in cache. And, some of their buffers are write-back, rather than write-through. This means that, for a time, the only copy of the data is kept in volatile storage. But, to mitigate this, these drives actually have the capacity to finish the write-back as the drive spins down, despite the decreasing speed. Obviously, only a relatively small buffer space can be managed this way. But, by buffering whole tracks, much rotational delay can be relieved, in light of the way many applications "keep coming back for more". Buffering policies can be quite sophisticated.
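Putting these pieces together, here's a rough service-time model for a single random read. All of the drive parameters here are assumed for illustration, not taken from any real spec sheet.

```python
# A back-of-the-envelope model of one random read, built from the latency
# components above. Every parameter is an assumed, illustrative value.

RPM = 7200
AVG_SEEK_MS = 3.0          # ~1/3 of an assumed 10 ms full-stroke seek
SECTORS_PER_TRACK = 800    # pretend we stay within a single zone

rotation_ms = 60_000 / RPM           # one full spin: 8.33 ms at 7200 RPM
avg_rotational_ms = rotation_ms / 2  # on average, wait half a spin

def read_time_ms(n_sectors):
    """Average seek + rotational delay + transfer of n adjacent sectors."""
    transfer_ms = rotation_ms * (n_sectors / SECTORS_PER_TRACK)
    return AVG_SEEK_MS + avg_rotational_ms + transfer_ms

# An 8-sector (4 KB) read pays about 7.2 ms of positioning for less than
# 0.1 ms of transfer -- which is why caching whole tracks pays off.
print(f"{read_time_ms(8):.2f} ms")
```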
Logical vs. Physical Geometry
Back in "The Day", drive manufacturers used to tell use the actual number of surfaces, tracks per surface, and sectors per track. This information could then be used by operating systems to tune disk access.These days, the actual configuration of the drive is too complicated, e.g. zones, and proprietary. Instead, drives hide their actual geometry and we can basically pretend that they have any logical geometry that adds up to the capacity of the drive.
This, of course, means that the system software's assumptions about the adjacency of sectors might be incorrect. But, in general, the logical sectors are ordered in the same way as the physical sectors. So, even if the system software can't predict exactly when a seek might occur, nearby sectors remain, in the average case, much faster to "seek" to than more distant logical sectors.
SMART Hard Drives
Manufacturers have included a bunch of monitoring within hard drives. Drives with this capability are said to have Self-Monitoring, Analysis, and Reporting Technology (SMART). Although this monitoring might be able to predict some failures, this should not be viewed as its primary purpose. Instead, it collects information that, in the event the drive fails and is returned to the manufacturer, can be used by the manufacturer to better understand field conditions and drive failure.

The reality is that most hard drive failures are not predicted by SMART drives, and some SMART warnings may be innocuous (but don't bet on it).
Wikipedia has a good article on SMART drives. The only thing to which I want to call your attention is that SMART stats count down from their starting value, e.g. 255. So, for example, for reallocations, 254 is a better number than 6.
Bad Sector Relocation
Hard drives are not perfect. They have bad sectors leaving the factory. And, they can accumulate (or discover) bad sectors over time. In order to make things easier for the operating system, disks hide these errors. They do this by having a chunk of spare sectors that aren't normally addressable. Bad sectors are reallocated to use these "spares". The user asks for the same "physical sector", which we now know is actually a logical sector, and instead gets its replacement.

This is achieved through the use of relocation lists, known as the P-List and the G-List, or the defect tables. The P(ermanent) list contains the defects mapped at the factory. The G(rowth) list contains bad sectors mapped after the fact, either by the controller's firmware, or by disk repair utilities.
It is interesting to note that the controller only remaps bad sectors upon write -- not upon read. Upon read, it only notes that they are bad. This is because nothing is to be gained by remapping a sector that can't be read: there is no data to move, doing so could destroy data that recovery software might be able to tease out later, and, at the least, it is a waste of time. Instead, these are known as pending sectors or pending relocations. SMART attributes can tell you the number of items on the G-List and the number pending. Specialized software can let you see and edit the G-List, and can let you access the mapped-out bad sectors for forensic purposes.
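Here's a minimal sketch of that remap-on-write behavior, under an assumed, much-simplified model of the controller's bookkeeping (the class and method names are invented for illustration).

```python
# Remap-on-write with a growth (G) list. A failed read only marks the
# sector pending; a later write redirects it to a spare, since the write
# supplies fresh data and nothing readable is lost.

class SectorRemapper:
    def __init__(self, spares):
        self.spares = list(spares)  # reserved, normally unaddressable sectors
        self.g_list = {}            # logical sector -> spare replacement
        self.pending = set()        # bad sectors awaiting a write

    def resolve(self, sector):
        """Translate the sector the host asked for to where it really lives."""
        return self.g_list.get(sector, sector)

    def on_read_failure(self, sector):
        self.pending.add(sector)    # note it, but don't move unreadable data

    def write(self, sector):
        if sector in self.pending:
            self.g_list[sector] = self.spares.pop()  # now remap to a spare
            self.pending.discard(sector)
        return self.resolve(sector)

remap = SectorRemapper(spares=[90001, 90002])
remap.on_read_failure(1234)   # read failed: pending, not yet remapped
print(remap.resolve(1234))    # still 1234
print(remap.write(1234))      # the write triggers the remap -> 90002
```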
Firmware
What many people don't realize is that a disk controller's firmware is not entirely on the controller board. Instead, much, if not most, of it is actually on a chunk of the disk that is not accessible by normal means. The firmware on the controller gets things going -- and then loads the rest from the disk. This reduces cost and increases agility for the manufacturer. But, it also increases complexity for the user in the event of a failure.

Problems that seem like electronics problems might actually not be on the board, but instead in the on-disk firmware. There are usually two copies of it on disk. But, nonetheless, it can be damaged by all of the usual stuff, e.g. power failure or head crash while running, drive wear, alpha-beta-gamma-omega-purple-rays.
Venerable Disk Scheduling Approaches
We often fall into the trap of considering disk access in a one-task environment. More often than not, we are actually working on a system with multiple tasks sharing the same disk (or disks). In these multiprogramming environments, we need to be concerned about the ordering of the requests to disk.
Under these circumstances a disk scheduler is employed to determine the proper ordering of requests. Requests can be organized to ensure fairness to the requestors, to advance some system policy, or to minimize the average wait-time. The rules, algorithm, or process of ordering requests is known as a scheduling policy.
The proper scheduling policy is often unclear and depends both on the workload and on the purpose of the system. Oftentimes, trade-offs exist between better average-case performance and fairness.
- FCFS/FIFO
- PRI (Priority)
- LIFO (Last-In, First-Out)
- SSTF (Shortest Service Time First)
- SCAN
- LOOK
- C-SCAN
- F-SCAN
FCFS/FIFO
- First-come, first-serve
- First in, first-out
- Services requests in the order in which they were received
- Perfectly fair
- No possibility for starvation
- Requires no knowledge of physical factors
- Completely unoptimized seeking (track-to-track movement)
PRI
- Priority based scheduling
- Priorities set by system administrator to advance system goals
- Intentionally unfair
- Depending on priority system, starvation possible
- Completely unoptimized seeking (track-to-track movement)
- Requires no knowledge of physical factors
LIFO
- Last-in, first-out
- Most recent requestor is serviced first
- Intentionally unfair
- Takes advantage of locality in specific programs
- Starvation possible
- Improves seek distances without knowledge of physical factors
SSTF
- Shortest Service Time First
- Always select request closest to current head position
- Reduces seek time (but not necessarily optimal in avg case)
- Starvation is possible
- Knowledge of physical factors is required
- "arm stickiness" -- pounding on same track or nearby tracks can starve access to others.
SCAN
- Head moves in only one direction, and then reverses
- Performance similar to SSTF, with less starvation
- Starvation only possible by "sticky arm"
- Biased in favor of middle tracks
- Biased against older requests (when direction is reversed, new requests are encountered first).
- Ignores program locality
Let's consider the biases in SCAN more carefully. It is clear that SCAN is biased against older requests. When it turns around, it services the most recent requests first -- the tracks are being covered in the opposite order.

In reasoning about the performance of the algorithm, it is important to remember that it is not the track that we should consider, but the request. So, we must ask ourselves the question: if a request arrives for a particular track, how long do we expect that it will have to wait for service? Let's consider the following table:
Head Movement by Request Location Using SCAN (movement is 0-->4; 4-->0 is symmetrical)

| Current Head Position (Track #) | Request at Track 0 | Track 1 | Track 2 | Track 3 | Track 4 |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 | 4 |
| 1 | 7 | 0 | 1 | 2 | 3 |
| 2 | 6 | 5 | 0 | 1 | 2 |
| 3 | 5 | 4 | 3 | 0 | 1 |
| 4 | 4 | 3 | 2 | 1 | 0 |
| Avg. Movement (Num. of Tracks) | 4.4 | 2.6 | 1.6 | 1.4 | 2.0 |

Average Head Movement Using SCAN -- Both Directions

| Request Location (Track #) | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Average Movement (Number of Tracks) | (4.4 + 2.0)/2 = 3.2 | (2.6 + 1.4)/2 = 2.0 | (1.6 + 1.6)/2 = 1.6 | (1.4 + 2.6)/2 = 2.0 | (2.0 + 4.4)/2 = 3.2 |

From this we can see that the bias is against the outside tracks. But why? The answer is that the best case for all tracks is exactly the same -- the head is already there. But in the worst case, it can take the head almost two full passes to get to a request on an outside edge. In the average case, the inside tracks win!
C-SCAN
The biases that we discussed in SCAN above can be reduced by eliminating the reversal at the outside tracks. We can do this by considering the disk to be circular and seeking back to track 0 before SCANning again. This costs a full seek. But, on modern disk drives, a full seek is much faster than several consecutive seeks. Still, this approach isn't FIFO-fair, it still suffers from "head stickiness", and it wastes the time required for the full seek.
- Circular SCAN
- Head moves in only one direction, then returns and starts over.
- More fair than SCAN - removes bias against old requests
- More uniform service time than SCAN
- Worse average performance than SCAN
LOOK
- Used with SCAN based approaches
- Look at queue; if no more requests in current direction then reverse
- Improves performance
- Minimally improves fairness - new requests ahead of head that would otherwise be serviced on return trip now wait for second pass.
- Not always better -- could turn around right before a request arrives.
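To pull the sweep-based policies together, here's a toy order-of-service sketch of SCAN, C-SCAN, and LOOK over the same queue as before, with the head starting at track 53 and moving toward higher tracks. The track count is an assumption, and C-SCAN's wraparound seek is charged to its total even though nothing is serviced along the way.

```python
# SCAN, C-SCAN, and LOOK on one queue; distances are tracks traveled.

MAX_TRACK = 199  # assumed: tracks 0..199

def travel(path):
    """Total tracks traveled visiting `path` positions in order."""
    pos, total = path[0], 0
    for p in path[1:]:
        total += abs(p - pos)
        pos = p
    return total

def scan(head, reqs):
    up = sorted(r for r in reqs if r >= head)
    down = sorted((r for r in reqs if r < head), reverse=True)
    # ride to the edge before reversing (the edge visit only costs anything
    # when requests wait behind the head)
    path = [head] + up + ([MAX_TRACK] if down else []) + down
    return up + down, travel(path)

def look(head, reqs):
    up = sorted(r for r in reqs if r >= head)
    down = sorted((r for r in reqs if r < head), reverse=True)
    path = [head] + up + down     # reverse at the last request, not the edge
    return up + down, travel(path)

def c_scan(head, reqs):
    up = sorted(r for r in reqs if r >= head)
    wrapped = sorted(r for r in reqs if r < head)
    # after the edge, one long seek back to track 0, then sweep up again
    path = [head] + up + ([MAX_TRACK, 0] if wrapped else []) + wrapped
    return up + wrapped, travel(path)

queue = [98, 183, 37, 122, 14, 124, 65, 67]
for name, fn in [("SCAN", scan), ("C-SCAN", c_scan), ("LOOK", look)]:
    order, dist = fn(53, queue)
    print(f"{name:7s} {dist:3d} tracks, order {order}")
```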
Readings and Discussion
Now that we've gotten through the basics, we turn our attention to some highlights from some interesting research. Please read the full papers.
- Pinheiro, E., Weber, W.-D., and Barroso, L.A. (Google), "Failure Trends in a Large Disk Drive Population", FAST '07. (pdf)
- Brewer, E., et al. (Google), "Disks for Data Centers", FAST '16. (pdf)
- Worthington, B., Ganger, G., and Patt, Y., "Scheduling Algorithms for Modern Disk Drives", SIGMETRICS '94, pp. 241-251. (pdf)
- Ruemmler, C. and Wilkes, J., "An Introduction to Disk Drive Modeling", IEEE Computer 27(3):17-29, March 1994. (pdf)
- Talagala, N., Arpaci-Dusseau, R., and Patterson, D., "Microbenchmark-based Extraction of Local and Global Disk Characteristics", UC Berkeley Technical Report, 1999. (pdf)