How Many Map-Reduce Phases Is Optimal?
One question that we've gotten a bunch over the last few days is, "How many Map-Reduce phases should we have?", which is sometimes phrased, "In designing a Map-Reduce approach, should we use many phases or just a few?" The answer to this question is, "Ideally, you should have one phase."
Much of the power in a distributed Map-Reduce comes from the work that is distributed in the Map phase. In an ideal world, the Mappers will keep a lot of workers busy for a long time. Keep in mind that, whereas the nature of the data and processing determines the number of Mappers that can efficiently run concurrently, the number of Reducers is limited by the number of output files that the end-user application is willing to accept. So, although we can go really wide on a Map, and as a consequence get a lot done at a time, the Reduce can be a bottleneck.
In an ideal world, a metric boat load of Mappers each process a relatively small chunk of the data in parallel and the results are locally combined into a much smaller set. These are then sorted, perhaps externally, and fed into relatively few Reducers, each of which performs only a very small amount of work to take the new information and add it to the current bucket.
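The ideal flow described above can be sketched in a few lines of Python. This is only an in-memory illustration, not a real distributed engine: the function names (map_chunk, combine, shuffle, reduce_counts) and the sample chunks are made up for this example, and word counting is assumed as the task.

```python
from collections import Counter, defaultdict

def map_chunk(chunk):
    """Mapper: emit a (word, 1) pair for every word in one small chunk."""
    return [(word, 1) for word in chunk.split()]

def combine(pairs):
    """Combiner: sum counts locally so fewer records leave the mapper."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())

def shuffle(mapper_outputs):
    """Sort/group step: bring all values for the same key together."""
    grouped = defaultdict(list)
    for pairs in mapper_outputs:
        for word, count in pairs:
            grouped[word].append(count)
    return sorted(grouped.items())

def reduce_counts(word, counts):
    """Reducer: very little work per key, just add to the running total."""
    total = 0
    for count in counts:
        total += count
    return word, total

# Hypothetical input: each string stands in for one chunk of a big file.
chunks = ["the cat and the hat", "the dog and the log"]

raw_records = sum(len(map_chunk(chunk)) for chunk in chunks)
combined = [combine(map_chunk(chunk)) for chunk in chunks]
combined_records = sum(len(pairs) for pairs in combined)

word_counts = dict(reduce_counts(w, c) for w, c in shuffle(combined))
```

Even on this toy input the combiner shrinks what has to be shipped to the Reducers (raw_records versus combined_records); on real data with many repeated keys per chunk, the reduction is what keeps the Reduce side cheap.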
Although multiple Map-Reduce phases are possible on the same data, it is almost always better to structure the work into fewer phases, if possible. Remember, Mappers read their data from the global file system and write it into a local cache. The Reducers get the data from this local cache into their own local cache via RPC calls and then write the results back into the global file system, which distributes and replicates it.
If we can do more processing on a single unit of data in the first pass, we cut out a huge amount of overhead. We save the work of shipping cached temporary results into the global file system, where they get replicated, etc. We also save the overhead of sucking them back into cache file systems, which might or might not be on the same nodes. There's also the overhead of managing another phase of computation.
When Do Multiple Phases Make Sense?
There are times when multiple phases do make sense. We've seen one example of this already: the first lab. In the first phase, we counted the word occurrences. In the second phase, we flipped them to sort by count rather than key. If we could have done this in one phase, it surely would have been more efficient -- but we couldn't. So, we either had to do it in two phases -- or with post-processing after the fact. In the case of the lab, we did it with a second Map-Reduce phase.
Another general situation that might involve multiple Map-Reduce phases is when we need to draw inferences across the output of the first phase, rather than about the individual elements. For example, "Find some list of records that match X, and then, determine the Y of those".
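The two-phase word count from the first lab might look roughly like the sketch below. The run_phase helper is a hypothetical stand-in for the framework (it maps, groups by key in sorted order, and reduces, all in memory); the lab's actual code may have been organized differently.

```python
from collections import defaultdict

def run_phase(records, mapper, reducer):
    """Run one Map-Reduce phase in memory: map, sort/group by key, reduce."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    output = []
    for key in sorted(grouped):  # the framework's sort by key
        output.extend(reducer(key, grouped[key]))
    return output

# Phase 1: count the occurrences of each word.
def count_map(line):
    return [(word, 1) for word in line.split()]

def count_reduce(word, ones):
    return [(word, sum(ones))]

# Phase 2: flip (word, count) into (count, word) so the sort is by count.
def flip_map(pair):
    word, count = pair
    return [(count, word)]

def flip_reduce(count, words):
    return [(count, word) for word in sorted(words)]

lines = ["a b a", "c a b"]  # hypothetical input lines
phase1 = run_phase(lines, count_map, count_reduce)
phase2 = run_phase(phase1, flip_map, flip_reduce)
```

The second phase cannot be folded into the first because a word's final count is not known until its Reducer has seen every occurrence, and only then is there a count to sort on.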
Processing Multiple Data Files
Map can take its input from multiple files at the same time. It can even munge on records of different types, if need be. But, a relational database, a Map-Reduce engine is surely not. Things which are very easily expressed in SQL can take a lot of work to express in Map-Reduce code. The basic problem we run into is that a very commonplace database operation, the join, is not well expressed in Map-Reduce. It takes a lot of work to express the join -- and, this is appropriate, as joins can be expensive.
A table in a database is a lot like a file that contains structured records of the same type. Each table has only one type of record -- but each table can have a different type of record. The basic idea is that each record is a row and each field is a column. A join operation matches records in one table with those in another table that share a common key. If you ever take a databases course, you'll learn that there are a lot of different ways of implementing a join -- each with its own efficiencies. But Map-Reduce gives us none of these efficiencies for free.
In class, we discussed, essentially, what database folks call an inner join. It is one of many types of relational operations that are challenging for Map-Reduce. The idea behind an inner join is that we have two tables that share at least one field. In effect, we match the records in the two tables based on this one field to produce a new table in which each record contains the union of the matching records from both tables. We then filter these results based on whatever criteria we'd like. These criteria can include fields that were originally part of different tables, as we are now, at least logically, looking at uber-records that contain both sets of fields.
In order to implement this in Map-Reduce, we end up going through more than one phase. Here I'll describe one idiom -- but not the only one. The first phase uses a Map operation to process each file independently. The Map produces each record in a new format that contains the common element as the key and a value composed of the rest of the fields. In addition, it performs any per-record filtering that does not depend on records in the other file. The new record format might be very specific, e.g. <field1, field2, field3>, or it might be more flexible, e.g. <TYPE, field1, field2, field3>.
Although there are two different types of Map operations, one for each file type, they produce a common output format. Their outputs can be hashed to the same set of Reduce operations. This is, in effect, where the join happens. As you know, the output from the Map is sorted en route to the Reduce so that the records with the same key can be brought together. And the Reduce does exactly this, in effect producing the join table. As this is happening, the filter criteria that depend on the relation between the two tables can be applied, and only those records that satisfy both sets of criteria need be produced.
At this point, what we have is essentially the join table. It is important to note that we might do the filtering upon the individual records in the same pass as we render the records into a new format, or in a prior or subsequent pass. The same is true of the filtering based on the new record. But, we can only rarely use a combiner to join the records -- as the two records to be joined are necessarily the output of different Map operations and will only come together at the Reduce, where the join happens.
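The join idiom described above can be sketched as follows. Everything here is illustrative: the two "files" (customers and orders), their fields, the amount filter, and the function names are all hypothetical, and the shuffle function stands in for the framework's sort and grouping by key.

```python
from collections import defaultdict

# Hypothetical inputs: two "files" that share a customer id field.
customers = [("c1", "Ada"), ("c2", "Grace"), ("c3", "Edsger")]
orders = [("o1", "c1", 30), ("o2", "c2", 5), ("o3", "c1", 12)]

def map_customer(record):
    """Map for file 1: key on the shared field, tag the value with its TYPE."""
    cust_id, name = record
    return (cust_id, ("CUSTOMER", name))

def map_order(record):
    """Map for file 2: same output format, plus a per-record filter."""
    order_id, cust_id, amount = record
    if amount < 10:  # filtering that does not depend on the other file
        return None
    return (cust_id, ("ORDER", order_id, amount))

def shuffle(pairs):
    """Stand-in for the sort en route to the Reduce: group values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_join(cust_id, values):
    """Reduce: pair every CUSTOMER record with every ORDER sharing the key."""
    names = [v[1] for v in values if v[0] == "CUSTOMER"]
    matching_orders = [v[1:] for v in values if v[0] == "ORDER"]
    return [(cust_id, name, order_id, amount)
            for name in names
            for order_id, amount in matching_orders]

mapped = [map_customer(r) for r in customers]
mapped += [pair for pair in (map_order(r) for r in orders) if pair is not None]

joined = []
for key, values in shuffle(mapped):
    joined.extend(reduce_join(key, values))
```

Keys that appear in only one file (c3 here, and c2 once its small order is filtered out) produce nothing, which is the inner-join behavior. Note that reduce_join holds all of a key's values in memory at once; that is fine for a sketch but is exactly the kind of growth that hurts at scale.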
It might also be important to note that since the records from two different files are converging at a single reducer, there are likely to be a huge number of records. In practice, this means a huge sort, likely external, will need to occur at the reducer. In reality, this might need to be handled as a distributed sort, with multiple reduce phases.
This is a lot of work for an operation that is, essentially, the backbone of modern databases. In class, we illustrated this with the example of finding the "Top Customer This Week" from a Web access log <sourceIP, destinationURL, date, $$> and also determining the average rank of the pages s/he visited from a datafile. This problem was solvable, but only after several phases of processing. This problem was adapted from a database column article.
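One possible way to break the "Top Customer This Week" problem into phases is sketched below. It is not necessarily the breakdown used in class. The access-log fields follow the layout given above, but the layout of the rank file (<url, rank>), the sample data, the week boundaries, and every function name are assumptions made up for this illustration.

```python
from collections import defaultdict

def run_phase(records, mapper, reducer):
    """Run one Map-Reduce phase in memory: map, sort/group by key, reduce."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    output = []
    for key in sorted(grouped):
        output.extend(reducer(key, grouped[key]))
    return output

# Hypothetical data. Log records: (sourceIP, destinationURL, date, revenue).
visits = [
    ("1.1.1.1", "/a", "2024-01-02", 5.0),
    ("1.1.1.1", "/b", "2024-01-03", 7.0),
    ("2.2.2.2", "/a", "2024-01-03", 9.0),
    ("2.2.2.2", "/b", "2023-12-01", 50.0),  # outside the week of interest
]
ranks = [("/a", 10), ("/b", 20)]  # assumed rank-file layout: (url, rank)
WEEK = ("2024-01-01", "2024-01-07")  # assumed date range, ISO strings

# Phase 1: join the log with the rank file on URL, filtering by date.
def join_map(record):
    kind, fields = record
    if kind == "VISIT":
        ip, url, date, revenue = fields
        if WEEK[0] <= date <= WEEK[1]:
            return [(url, ("VISIT", ip, revenue))]
        return []
    url, rank = fields
    return [(url, ("RANK", rank))]

def join_reduce(url, values):
    page_ranks = [v[1] for v in values if v[0] == "RANK"]
    page_visits = [v[1:] for v in values if v[0] == "VISIT"]
    return [(ip, revenue, rank)
            for rank in page_ranks for ip, revenue in page_visits]

# Phase 2: total revenue and average page rank per customer (sourceIP).
def customer_map(record):
    ip, revenue, rank = record
    return [(ip, (revenue, rank))]

def customer_reduce(ip, values):
    total = sum(revenue for revenue, _ in values)
    average_rank = sum(rank for _, rank in values) / len(values)
    return [(ip, total, average_rank)]

# Phase 3: funnel everything to one key and keep the biggest spender.
def top_map(record):
    return [("ALL", record)]

def top_reduce(_key, values):
    return [max(values, key=lambda r: r[1])]

tagged = [("VISIT", v) for v in visits] + [("RANK", r) for r in ranks]
joined = run_phase(tagged, join_map, join_reduce)
per_customer = run_phase(joined, customer_map, customer_reduce)
top_customer = run_phase(per_customer, top_map, top_reduce)[0]
```

Each phase's output is the next phase's input, so in a real deployment every arrow in that chain is a round trip through the global file system, which is where the cost of going deep comes from.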
Big Picture View
It is fairly straightforward to represent any problem that involves processing individual records independently as a Map-Reduce problem. It is also straightforward to represent any problem that is the aggregation of the results of such individual processing as a Map-Reduce problem -- as long as the aggregation can be performed in a "running" way and without any data structures that grow as one goes.
It is challenging to solve problems that involve relating different types of data to each other in the Map-Reduce paradigm, because these involve "matching" records rather than processing individual records. The more relational our problem, the more we have to match to find our answer, and the more phases we are likely to need. And, we could end up processing a lot of intermediate records. A whole lot of intermediate records -- and that can make storage of this intermediate output a concern. Also, multiple reduce phases means that we are going deep, and wide is really where the advantage lies.
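To make the "running" aggregation idea concrete, here is a small sketch that computes a mean by carrying only a (sum, count) pair. The names to_partial and merge and the sample values are hypothetical; the point is that the state stays a fixed size no matter how many records flow through.

```python
from functools import reduce

def to_partial(value):
    """Map side: turn one value into a fixed-size (sum, count) partial."""
    return (value, 1)

def merge(left, right):
    """Combine two partials; the result is the same size as the inputs."""
    return (left[0] + right[0], left[1] + right[1])

# Hypothetical values, as seen by three separate Mappers.
values_by_mapper = [[4, 8], [6], [2, 10, 6]]

# Each Mapper can pre-merge its own partials, combiner-style.
partials = [reduce(merge, map(to_partial, values))
            for values in values_by_mapper]

# The Reducer merges the handful of partials and finishes the arithmetic.
total, count = reduce(merge, partials)
mean = total / count
```

Because merge gives the same answer regardless of how the partials are grouped, it can run in a combiner as well as in the Reducer. An aggregate such as an exact median has no fixed-size partial like this, so it does not fit the running pattern as easily.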
But, again, you've got to look at the problem. Some problems are just big and deep -- and that's not only okay, it is the only way.