How Many Map-Reduce Phases Are Optimal? When Do Multiple Phases Make Sense? How Many Map Instances Should We Have?
Now that you guys are well into the Hadoop project with real data, we revisited our earlier discussion of these topics. For a refresher, please visit those notes.
How Many Reduce Instances Should We Have? Can I Make Things Better By Having More Reducers?
Generally speaking, the number of reducers you need is determined by the number of output files you need. If you can get away with more output files, that is probably a win. For example, if you can take the "Top M" items from each of n output files, that might be better than reducing these files to a single file and selecting the top n*M items (yes, I know that these two sets are not exactly the same). The reason for this is that, if you can take this shortcut, you are saving some work.

It doesn't usually make sense to reduce your maps to some large number of files and then repeat identity-Maps and Reduces to perform a multi-phase merge down to one large file. The reason is that Hadoop can do this on its own, without wasting time on the Maps: it can do large external merge sorts.
The only time you'd want to do this is if you could throw away results with the initial reduces or subsequent merges, making the problem substantially smaller and reducing the amount of actual work to be done.
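To make the "Top M from each of n output files" idea concrete, here is a minimal sketch of a reducer that keeps only its local top M items. It is not a reference solution -- the class name, the value of M, and the <item, count> record format are all assumptions made for illustration.

```java
// A minimal sketch of the "top M from each reducer" shortcut, under made-up
// assumptions: the map phase emits <item, count> pairs, and we are content to
// take the top M items from *each* reducer's output file rather than a single
// global top M.
import java.io.IOException;
import java.util.AbstractMap;
import java.util.Map;
import java.util.PriorityQueue;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TopMReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    private static final int M = 10;  // hypothetical cutoff

    // Min-heap ordered by count, so the smallest of the current top M is
    // always cheap to evict.
    private final PriorityQueue<Map.Entry<String, Long>> topM =
        new PriorityQueue<Map.Entry<String, Long>>(
            (a, b) -> Long.compare(a.getValue(), b.getValue()));

    @Override
    protected void reduce(Text item, Iterable<LongWritable> counts, Context context) {
        long total = 0;
        for (LongWritable c : counts) {
            total += c.get();
        }
        topM.add(new AbstractMap.SimpleEntry<String, Long>(item.toString(), total));
        if (topM.size() > M) {
            topM.poll();  // drop the smallest so only M entries survive
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Each reducer writes only its local top M.  With n reducers, the
        // global answer can be found among the n small output files in a
        // cheap post-pass, instead of funneling everything through one reducer.
        while (!topM.isEmpty()) {
            Map.Entry<String, Long> e = topM.poll();
            context.write(new Text(e.getKey()), new LongWritable(e.getValue()));
        }
    }
}
```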
Processing Multiple Data Files
One question that has come up with a couple of teams during office hours has been about using the Map-Reduce paradigm with multiple files. Map can take its input from multiple files at the same time. It can even munge on records of different types, if need be. But, a relational database, a Map-Reduce engine is surely not. Things which are very easily expressed in SQL can take a lot of work to express in Map-Reduce code. The basic problem we run into is that a very commonplace database operation, the join, is not well expressed in Map-Reduce. It takes a lot of work to express the join -- and this is appropriate, as joins can be expensive.

A table in a database is a lot like a file that contains structured records of the same type. Each table has only one type of record -- but each table can have a different type of record. The basic idea is that each record is a row and each field is a column. A join operation matches records in one table with those in another table that share a common key. If you ever take databases, you'll learn that there are a lot of different ways of expressing a join -- each with its own implementation efficiencies. But, for Map-Reduce, none of those efficiencies are available.
In class, we discussed, essentially, what database folks call an inner join. It is one of many types of relational operations that are challenging for Map-Reduce. The idea behind an inner join is that we have two tables that share at least one field. In effect, we match the records in the two tables based on this one field to produce a new table in which each record contains the union of the matching records from both tables. We then filter these results based on whatever criteria we'd like. These criteria can include fields that were originally part of different tables, since we are now, at least logically, looking at uber-records that contain both sets of fields.
In order to implement this in Map-Reduce, we end up going through more than one phase. Here I'll describe one idiom -- but not the only one. The first phase uses a Map operation to process each file independently. The Map produces each record in a new format that contains the common element as the key and a value composed of the rest of the fields. In addition, it performs any per-record filtering that does not depend on records in the other file. The new record format might be very specific, e.g. <field1, field2, field3>, or it might be more flexible, e.g. <TYPE, field1, field2, field3>.
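Here is a rough sketch of what one of these first-phase Maps might look like, using the more flexible <TYPE, ...> format. The comma-separated input, the position of the join key, and the single-character "A" tag are all assumptions for the sake of the example; the other file would get its own Mapper with a different tag.

```java
// A sketch of the first-phase Map for one of the two input files, under
// made-up assumptions: the input is comma-separated text, the join key is
// the first field, and the tag "A" marks which file a record came from
// (the other file's Mapper would use "B").  Any per-record filtering that
// only needs this file's fields would go here too.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TagFileAMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text joinKey = new Text();
    private final Text taggedValue = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Split once on the first comma: the join key, then the rest of the record.
        String[] parts = line.toString().split(",", 2);
        if (parts.length < 2) {
            return;  // malformed record; drop it (per-record filtering)
        }
        joinKey.set(parts[0].trim());
        taggedValue.set("A," + parts[1].trim());  // <TYPE, field1, field2, ...>
        context.write(joinKey, taggedValue);
    }
}
```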
Although there are two different types of Map operations, one for each input file, they produce a common output format. These can be hashed to the same set of Reduce operations. This is, in effect, where the join happens. As you know, the output from the Map is sorted en route to the Reduce so that the records with the same key can be brought together. And the Reduce does exactly this, in effect producing the join table. As this is happening, the filter criteria that depend on the relationship between the two tables can be applied, and only those records that satisfy both criteria need be produced.
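And here is a matching sketch of the Reduce where the join happens, continuing the made-up "A"/"B" tagging scheme from the Mapper above. Again, this illustrates the idiom, not the only way to write it.

```java
// A sketch of the joining Reduce, continuing the hypothetical tagging scheme:
// values arrive as "A,..." or "B,..." strings grouped by join key.  It buffers
// the "A" records for a key, then pairs them with each "B" record.  A filter
// that needs fields from both sides would be applied before writing.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JoinReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text joinKey, Iterable<Text> taggedValues, Context context)
            throws IOException, InterruptedException {
        // Note: buffering one side in memory is itself a scaling concern when
        // many records share a key -- part of why joins are expensive here.
        List<String> sideA = new ArrayList<>();
        List<String> sideB = new ArrayList<>();

        // Separate the co-grouped records by the tag the Maps attached.
        for (Text value : taggedValues) {
            String v = value.toString();
            if (v.startsWith("A,")) {
                sideA.add(v.substring(2));
            } else if (v.startsWith("B,")) {
                sideB.add(v.substring(2));
            }
        }

        // Inner join: emit one output record per matching A/B pair.
        for (String a : sideA) {
            for (String b : sideB) {
                // A cross-table filter criterion would be checked here before writing.
                context.write(joinKey, new Text(a + "," + b));
            }
        }
    }
}
```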
At this point, what we have is essentially the join table. It is important to note that we might do the filtering upon the individual records in the same pass as we render the records into a new format, or in a prior or subsequent pass. The same is true of the filtering based on the new record. But we can only rarely use a combiner to join the records -- the two records to be joined are necessarily the output of different Map operations and will only come together at the Reduce.
It might also be important to note that, since the records from two different files are converging at a single reducer, there are likely to be a huge number of records. In practice, this means a huge sort, likely external, will need to occur at the reducer. In reality, this might need to be handled as a distributed sort, with multiple reduce phases.
This is a lot of work for an operation that is, essentially, the backbone of modern databases. In class, we illustrated this with the example of finding the "Top Customer This Week" from a Web access log <sourceIP, destinationURL, date, $$> and also determining the average rank of the pages s/he visited from a separate datafile. This problem was solvable, but only after several phases of processing. It was adapted from a database column article.
Big Picture View
It is fairly straight-forward to represent any problem that involves processing individual records independently as a Map-Reduce problem. It is straight-forward to represent any problem that is the aggregation of the results of such individual processing as a Map-Reduce problem -- as long as the aggregation can be performed in a "running" way and without any data structures that grow as one goes.

It is challenging to solve problems that involve relating different types of data to each other in the Map-Reduce paradigm, because these involve "matching" rather than individual records. The more relational our problem, and the more we have to match to find our answer, the more phases we are likely to need. And we could end up processing a lot of intermediate records. A whole lot of intermediate records -- and that can make storage of this intermediate output a concern. Also, multiple reduce phases mean that we are going deep, and wide is really where the advantage lies.
But, again, you've got to look at the problem. Some problems are just big and deep -- and that's not only okay. It is the only way.
Future Direction
Because we spent the first part of class talking about Map-Reduce and Hadoop, we punted the DNS lecture until tomorrow and instead did the "Wrap Up, Future Direction" lecture today. Here are some bullet points from this discussion:
- Distributed systems for scientific computing continue to be a research area. It is probably the oldest of the distributed systems research areas that is still alive. But it is a somewhat small community. It is bolstered by the fact that we've hit the end of Moore's Law when it comes to silicon and are finding limits in supporting system design.
- The "Great Distributed Operating System In the Sky" model is dead. DSM killed it. But, it does leave behind some nice middleware vestiges such as Condor, OpenPBS, etc.
- Ubiquitous computing, wearable computing, context-aware computing, and related areas from the late 90's and early 2000's seem to have quieted down. There was no show stopper. But they have (possibly temporarily) turned out to be a solution looking for a problem. And, in some ways, they have turned into HCI questions as much as raw technology questions. But they aren't gone. And, in fact, there are commercial applications -- for example, GPS systems that cooperate and share traffic information via a subscription-based service.
- Data-Intensive Scalable Computing (DISC) is hot right now: representing problems under the Map-Reduce idiom, solving the database problem, improving performance, understanding the interaction, etc. It is clear that DISC is very important in a lot of non-systems research domains -- and that making it work well is a living, breathing, not completely answered systems question. This area has a few more years of life, at least.
- Distributed processing is generally going to increase in importance within embedded systems and appliances. We really can't build monolithic processors cost effectively -- even commercial processors are multi-core. In environments where monolithic processing was never a requirement, but a convenience, it'll go away. If ten things need to happen -- we might well use ten processors. There is research that needs to continue to happen in how this should be done -- and how to demonstrate the correctness of the solutions (we don't want, oh, Brand-X cars that can't stop!).
- Distributed processing for energy and cost conservation is about to bloom, I think. We already discussed the fact that it is easier to get a higher utilization from one processor than from many processors. But more powerful processors consume more energy per unit of work -- it is a function of the capacitance of the junctions as the clock rates get faster. Over the last few years, we've seen tremendous improvements in low-cost, low-power systems. Within the next couple of years, the technology will likely somewhat plateau. But it will do so with a great energy economy, and hopefully with a good economy in terms of the initial purchase cost. The result is that a large number of problems that could be solved by a monolithic processor with a higher utilization might be cheaper to solve with a distributed system of energy-conserving systems. It might take less energy to do the computation and require less cooling. It might be cheaper, even counting the initial investment. And it might not take any longer. Figuring out which classes of problems can be solved this way, and how to do it, is, I think, going to be a research area for the next few years, at the least.