File Systems for Data Intensive Scalable Computing (DISC)
What to say? We're talking about Data-Intensive Scalable Computing, we probably should pay some attention to the care and feeding of the data. Let's see if we can think about the use case, and what it tells us about the file system design.GFS is the Google File System that is designed to solve this problem. Some details of its design were publsihed a few years back. But, I doubt if they were entirely current then -- and they certainly aren't now. HDFS, the Hadoop File System, is the Open Source alternative thqat is closely affiliated with Hadoop. It is in some ways a "Knock-Off" of that few years old paper. Over time, it'll probably evolve in parallel. But, for right now, it is a pretty mostly straight-forward knock-off based ont he old white paper.
Where Does It Live?
We've already said that, ideally, the data and processing are scattered across the same racks and that the data is replicated on more than one node. In the context of our discussion last class, this enabled the data to be processed in parallel by more nodes that would be possible if it lived in only one place.It is pretty straight-forward to understand why. If the processing is located nearby the data, the more places we put the data, the more processing nodes have rapid access to it. When we talk about "nearby" we are measuring it in terms of the number of switches involved. We aren't going to put processing on the same host as the data. But, we might put it on the same rack and/or within a switch of that is pretty easily accomplished.
By locating our data on distant nodes, we enable processing it in parallel by different sets of sufficiently nearby nodes. And, keeping in mind that portions of more than one data set might be sprinkled onto the same node, more replicas also enables better load balancing. It gives us more options about who can do the work.
So, How Many Replicas?
Back in the day, the Google paper suggested that the data was generally replicated three times. And, the Hadoop tutorial on HDFS describes their implementation of this policy: Two nearby replicas and one distant one. Okay, first, everyone, with me, "There ain't nothing special about three. There ain't nothing special about three."Ultimately, there are three goals for the replicas: (a) parallel access, (b) load balancing, and (c) data integrity. If we only have one copy, and that node fails -- bad day. If we only have two copies, and both nodes fail, bad day. You get the idea. The more the merrier. But, the probability of overlapping failures drops off pretty quickly.
Similarly, if all of our replicas are on one rack, and the switch fails, at last a temporary bad time. And, to be honest, we probably don't want to race around every time hardware fails -- but we'll explore that shortly.