Question

I have only a high-level understanding of MapReduce but a specific question about what is allowed in implementations.

I want to know whether it's easy (or possible) for a Mapper to uniformly distribute the given key-value pairs among the reducers. It might be something like

(k,v) -> (proc_id, (k,v))

where proc_id is the unique identifier for a processor (assume that every key k is unique).

The central question is: if the number of reducers is not fixed (i.e., it is determined dynamically based on the size of the input; is this even how it's done in practice?), how can a mapper produce sensible ids? One way would be for the mapper to know the total number of key-value pairs. Does MapReduce allow mappers to have this information? Another would be to perform some small number of extra rounds of computation.

What is the appropriate way to do this?

Solution

The distribution of keys to reducers is handled by a Partitioner. If you don't specify otherwise, Hadoop uses its default HashPartitioner, which partitions by the key's hashCode and tends to distribute the keys very uniformly when every key is unique.
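For reference, the logic of Hadoop's HashPartitioner fits in one method. The sketch below reproduces that logic (the class name HashLikePartitioner is mine, for illustration):

    import org.apache.hadoop.mapreduce.Partitioner;

    // Reproduces the logic of Hadoop's default HashPartitioner: the reducer
    // index is derived from the key's hashCode. With unique keys this spreads
    // records close to uniformly across the numReduceTasks partitions.
    public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
        @Override
        public int getPartition(K key, V value, int numReduceTasks) {
            // Mask the sign bit so the modulo result is always non-negative.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

Note that the framework passes the actual reducer count into getPartition at runtime, which is exactly why a mapper never needs to know it up front.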

I'm assuming that what you actually want is to process random groups of records in parallel, and that the keys k have nothing to do with how the records should be grouped. That suggests that you should focus on doing the work on the map side instead. Hadoop is pretty good at cleanly splitting up the input into parallel chunks for processing by the mappers, so unless you are doing some kind of arbitrary aggregation I see no reason to reduce at all.
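If your processing really is per-record, a map-only job is just a matter of setting the reducer count to zero. A minimal sketch, assuming a mapper class of your own called MyMapper:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MapOnlyJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "map-only example");
            job.setJarByClass(MapOnlyJob.class);
            job.setMapperClass(MyMapper.class); // MyMapper: your per-record logic (assumed)
            // Zero reducers: mapper output goes straight to the output path,
            // skipping the shuffle and sort phases entirely.
            job.setNumReduceTasks(0);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }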

Often the procId technique you mention is used to take otherwise heavily-skewed groups and un-skew them (for example, when performing a join operation). In your case the key is all but meaningless.
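For completeness, a common form of that trick is key "salting": the mapper appends a small random suffix to a hot key so the partitioner fans its records out over several reducers, and a later step re-combines the partial results. A hypothetical sketch (extractKey and isHotKey stand in for your own record parsing and skew detection):

    import java.io.IOException;
    import java.util.Random;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical sketch of key salting for skew mitigation: records with a
    // heavily-skewed key are spread over NUM_SALTS reducer groups instead of one.
    public class SaltingMapper extends Mapper<Object, Text, Text, Text> {
        private static final int NUM_SALTS = 10;
        private final Random random = new Random();
        private final Text saltedKey = new Text();

        @Override
        protected void map(Object offset, Text record, Context context)
                throws IOException, InterruptedException {
            String k = extractKey(record);
            if (isHotKey(k)) {
                // Fan the hot key out; downstream logic must merge "k#0".."k#9".
                saltedKey.set(k + "#" + random.nextInt(NUM_SALTS));
            } else {
                saltedKey.set(k);
            }
            context.write(saltedKey, record);
        }

        // Stand-ins for real parsing/skew detection, assumed for this sketch.
        private String extractKey(Text record) { return record.toString().split("\t", 2)[0]; }
        private boolean isHotKey(String k) { return "hotKey".equals(k); }
    }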

Licensed under: CC-BY-SA with attribution