Question

Why is Reduce required in MapReduce? If a job such as counting the number of words in a book produces the same outcome whether it is performed by a single process or MapReduced over a farm of servers, what possibility exists that duplicates will have to be removed? I'm assuming the Reduce step, at least in this example, would simply SUM the outcome from each worker process and deliver the total count of words in the book. I don't understand where duplicates of anything come into the picture.

Solution

The Reduce step is not meant for removing duplicates (although that is a possible use case in some situations). Reduce is meant for aggregating the mapper outputs that share the same key.

For example, in the word count case, Node 1 might get 10 instances of a word, say "school", Node 2 might have 15 instances, and Node 3 might have 12. Now how will the sum be calculated? The partial results 10, 15, and 12 are on different nodes. A shuffle phase brings all these values to one node (the reducer allotted to the key "school" by the partitioner). The reducer then has all the values for this key and can sum them up.
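
To make the three phases concrete, here is a minimal self-contained sketch in plain Python (not actual Hadoop code); the node texts and counts are made up to match the example above:

```python
from collections import defaultdict

def map_phase(text):
    """Mapper: emit a <word, 1> pair for every word it sees."""
    for word in text.lower().split():
        yield (word, 1)

def shuffle(mapper_outputs):
    """Shuffle: group all emitted values by key, as the framework
    does before handing each key to its allotted reducer."""
    grouped = defaultdict(list)
    for pairs in mapper_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reducer: receives every value for one key and sums them."""
    return (key, sum(values))

# Three "nodes", each holding a different part of the book.
node1 = map_phase("school bus school yard school")
node2 = map_phase("school school")
node3 = map_phase("school")

grouped = shuffle([node1, node2, node3])
print([reduce_phase(k, v) for k, v in grouped.items()])
# [('school', 6), ('bus', 1), ('yard', 1)]
```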

EDIT: As Tudor mentioned, by aggregation I mean aggregation in the more general sense of "bringing together".

EDIT2: To clarify RaffiM's doubt: continuing the above example, let's say Node 1 had pages 1-10, Node 2 had pages 11-20, and Node 3 had pages 21-30. So, after the map phase, we know that pages 1-10 have the word "school" 10 times, pages 11-20 have it 15 times, and pages 21-30 have it 12 times. Now what we need is the total number of times the word appears in the whole book, so we still need to add these up. We need 10 + 15 + 12 + the counts from any other page ranges...

If you don't use the combiner, the mapper just sends "1" for each time the word appears. So for pages 1-10, it will send <"school", 1> as the output key-value pair 10 times. To make this more efficient, we use the combiner, which sums things up at the mapper level. So if you use the combiner, it consumes those 10 pairs on Node 1 itself and emits a single consolidated output <"school", 10> from Node 1.
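
Again in plain Python rather than the real Hadoop Combiner API, a small sketch of what the combiner changes: with it, only one pair per distinct word leaves the node instead of one pair per occurrence:

```python
from collections import Counter

def map_without_combiner(text):
    # One <word, 1> pair travels over the network per occurrence.
    return [(word, 1) for word in text.lower().split()]

def map_with_combiner(text):
    # The combiner pre-sums on the mapper's own node, so only one
    # <word, local_count> pair per distinct word gets shuffled.
    return list(Counter(text.lower().split()).items())

# Pretend pages 1-10 contain the word "school" 10 times.
pages_1_to_10 = "school " * 10

print(len(map_without_combiner(pages_1_to_10)))  # 10 pairs shuffled
print(map_with_combiner(pages_1_to_10))          # [('school', 10)]
```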

OTHER TIPS

Reduce is a much more general operation. It does not necessarily mean "aggregate a bunch of numerical values by repeatedly applying an operation (e.g. summation)". The formal definition of Map-Reduce is that it is a transformation composed of the following stages:

  1. Map: (K k, V v) -> (K' k, V' v1 [, v2,...]) - an operation that takes as input a key-value pair of types K and V, respectively, and produces a key-value or key-list-of-values result of potentially different types.
  2. A shuffle phase where partitioning is performed.
  3. Reduce: (K' k, V' v1 [, v2,...]) -> (K' k, V'' v1 [, v2,...]) - an operation that takes as input a key-list-of-values pair, where the list of values contains all values produced by the Map phase for key k, and produces a key-value or key-list-of-values pair, where the output key must have the same type as the input key and the value is of arbitrary type.

Thus, you see that the Reduce operation is much more flexible and general.
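
For instance, a Reduce need not sum anything at all. A hedged Python sketch (with made-up documents) of building an inverted index, where Reduce emits a list of values for each key, matching the signature above:

```python
from collections import defaultdict

def map_phase(doc_id, text):
    # (K=doc_id, V=text) -> (K'=word, V'=doc_id)
    for word in text.lower().split():
        yield (word, doc_id)

def reduce_phase(word, doc_ids):
    # (K'=word, [V'=doc_id, ...]) -> (K'=word, V''=sorted doc list)
    return (word, sorted(set(doc_ids)))

docs = {1: "map reduce basics", 2: "reduce is general", 3: "map basics"}

grouped = defaultdict(list)
for doc_id, text in docs.items():
    for word, value in map_phase(doc_id, text):
        grouped[word].append(value)

index = dict(reduce_phase(w, ids) for w, ids in grouped.items())
print(index["basics"])  # [1, 3]
print(index["reduce"])  # [1, 2]
```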

Licensed under: CC-BY-SA with attribution