Question

I'm still trying to get an intuition as to when to use the Hadoop combiner class (I saw a few articles but they did not specifically help in my situation).

My question is, is it appropriate to use a combiner class when the value of the pair is of the Text class? For instance, let's say we have the following output from the mapper:

fruit apple
fruit orange
fruit banana
...
veggie carrot
veggie celery
...

Can we apply a combiner class here to be:

fruit apple orange banana
...
veggie carrot celery
...

before it even reaches the reducer?


Solution

Combiners are typically suited to problems where you perform some form of aggregation on the data, such as a min or max. These values can be calculated in the combiner for each mapper's output, and then calculated again in the reducer over all the combined outputs. This is useful because you do not transfer all of the data across the network between the mappers and the reducer.
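To make the saving concrete, here is a minimal sketch in plain Java (no Hadoop dependency) that simulates two map tasks and counts how many records would be shuffled with and without a combiner. The class name `MaxCombinerSketch`, the helper names, and the integer values are all made up for illustration; in a real job the same logic would sit in a `Reducer` subclass registered with `job.setCombinerClass(...)`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MaxCombinerSketch {

    // Keeps the largest value per key. Because max is associative and
    // commutative, the same function can act as combiner and as reducer.
    static Map<String, Integer> maxByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> best = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            best.merge(pair.getKey(), pair.getValue(), Integer::max);
        }
        return best;
    }

    // Simulates the shuffle: returns the records that would cross the network.
    static List<Map.Entry<String, Integer>> shuffle(
            List<List<Map.Entry<String, Integer>>> mapOutputs, boolean useCombiner) {
        List<Map.Entry<String, Integer>> sent = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> oneMapper : mapOutputs) {
            if (useCombiner) {
                // One record per key per mapper.
                sent.addAll(maxByKey(oneMapper).entrySet());
            } else {
                // Every mapper record is sent as-is.
                sent.addAll(oneMapper);
            }
        }
        return sent;
    }

    public static void main(String[] args) {
        // Hypothetical output of two map tasks.
        List<List<Map.Entry<String, Integer>>> mapOutputs = List.of(
                List.of(Map.entry("fruit", 3), Map.entry("fruit", 7), Map.entry("veggie", 2)),
                List.of(Map.entry("fruit", 5), Map.entry("veggie", 9), Map.entry("veggie", 4)));

        List<Map.Entry<String, Integer>> plain = shuffle(mapOutputs, false);
        List<Map.Entry<String, Integer>> combined = shuffle(mapOutputs, true);

        System.out.println("records shuffled without combiner: " + plain.size());
        System.out.println("records shuffled with combiner:    " + combined.size());
        System.out.println("reducer output without combiner: " + maxByKey(plain));
        System.out.println("reducer output with combiner:    " + maxByKey(combined));
    }
}
```

The reducer produces the same answer either way; only the number of shuffled records changes.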

Now, there is no reason you can't introduce a combiner to accumulate a list of the values observed for each key (I assume this is what your example shows), but there are some things that would make it trickier.

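If you output <Text, Text> pairs from the mapper and consume <Text, Text> pairs in the reducer, then your combiner can easily concatenate the values for a key and output the result as a single Text value. In your reducer you can do the same: concatenate all the values together and form one big output. Keep in mind that Hadoop may run the combiner zero, one, or several times, so the reducer must produce the right result whether or not the values were already combined.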
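As a rough illustration of the concatenation approach, the sketch below uses plain Java strings as stand-ins for `Text` values and a single hypothetical helper, `concat`, in both the combiner and reducer roles. It assumes a space is an acceptable separator and that the order of values does not matter, since Hadoop does not guarantee value order within a key unless you arrange a secondary sort.

```java
import java.util.List;

public class ConcatCombinerSketch {

    // Joins values with a single space. Input and output are both plain
    // strings, so the output of one call can be fed into another call.
    static String concat(List<String> values) {
        return String.join(" ", values);
    }

    public static void main(String[] args) {
        // Values for the key "fruit" from two hypothetical map tasks,
        // each pre-joined by the combiner.
        String fromMapper1 = concat(List.of("apple", "orange"));
        String fromMapper2 = concat(List.of("banana"));

        // The reducer receives the already-joined values and joins them again.
        String reduced = concat(List.of(fromMapper1, fromMapper2));
        System.out.println("fruit " + reduced);

        // If no combiner had run, the reducer would join the raw values.
        String direct = concat(List.of("apple", "orange", "banana"));
        System.out.println("same as without combiner: " + reduced.equals(direct));
    }
}
```

Because the helper takes and returns the same type, it does not matter how many times it is applied before the reducer sees the values.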

You may run into a problem if you want to sort and dedup the output list, because the combiner/reducer logic would need to tokenize the Text object back into words, sort and dedup them, and then rebuild the list of words.
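The tokenize, sort, dedup, and rebuild cycle might look roughly like the following plain-Java sketch. The class and method names are hypothetical, strings stand in for `Text`, and it assumes individual words never contain whitespace, which is what makes splitting on whitespace safe.

```java
import java.util.List;
import java.util.TreeSet;

public class SortDedupCombinerSketch {

    // Accepts values that are either single words or space-separated lists
    // produced by an earlier call, and returns one sorted, duplicate-free list.
    static String mergeSortedUnique(List<String> values) {
        TreeSet<String> words = new TreeSet<>(); // sorted and unique by construction
        for (String value : values) {
            // Tokenize the value back into individual words.
            for (String word : value.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        }
        // Rebuild the space-separated list.
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        // Combiner pass on one hypothetical mapper's values for "fruit".
        String combined = mergeSortedUnique(List.of("orange", "apple", "orange"));
        // Reducer pass over a combined value and a raw value.
        String reduced = mergeSortedUnique(List.of(combined, "banana", "apple"));
        System.out.println("fruit " + reduced);
    }
}
```

The extra parsing on every pass is the cost mentioned above; whether it pays off depends on how many duplicates each mapper produces.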

To directly answer your question of when it would be appropriate, I can think of some examples:

  • If you want to find the lexicographically smallest or largest value associated with each key
  • If you have millions of values for each key and you want to 'randomly' sample a small set of the values

OTHER TIPS

A combiner can be used when the operation is commutative and associative. Commutative example:

a*b*c = c*b*a. During the combine step, compute a*b = d and then send d and c to the reducer. Without a combiner, the reducer has to perform two operations (a*b = d, then d*c) to get the final answer. With a combiner, it only needs to compute d*c.

Similarly for associativity: (a+b)+c = a+(b+c). With associative (grouping) and commutative (reordering) operations, the result does not depend on how you group or order the multiplications or additions. Combiners are mainly used with data and operations that obey the associative and commutative laws.
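A small plain-Java sketch can spot-check these properties on sample values. The class and method names are invented for illustration, and checking a few numbers is only a sanity test, not a proof that an operation is safe to use in a combiner.

```java
import java.util.function.LongBinaryOperator;

public class CombinerSafetySketch {

    // (a op b) op c: a and b are pre-combined, then merged with c.
    static long leftGrouped(LongBinaryOperator op, long a, long b, long c) {
        return op.applyAsLong(op.applyAsLong(a, b), c);
    }

    // a op (b op c): b and c are pre-combined, then merged with a.
    static long rightGrouped(LongBinaryOperator op, long a, long b, long c) {
        return op.applyAsLong(a, op.applyAsLong(b, c));
    }

    // Spot check only: compares both groupings and one swapped pair.
    static boolean looksCombinerSafe(LongBinaryOperator op, long a, long b, long c) {
        boolean sameGrouping = leftGrouped(op, a, b, c) == rightGrouped(op, a, b, c);
        boolean sameOrder = op.applyAsLong(a, b) == op.applyAsLong(b, a);
        return sameGrouping && sameOrder;
    }

    public static void main(String[] args) {
        LongBinaryOperator multiply = (x, y) -> x * y;
        LongBinaryOperator subtract = (x, y) -> x - y;

        // Multiplication: d = a*b, then d*c gives the same answer as a*(b*c).
        System.out.println("multiply: " + leftGrouped(multiply, 2, 3, 4)
                + " vs " + rightGrouped(multiply, 2, 3, 4));

        // Subtraction: the two groupings disagree, so a combiner would change the result.
        System.out.println("subtract: " + leftGrouped(subtract, 10, 3, 2)
                + " vs " + rightGrouped(subtract, 10, 3, 2));

        System.out.println("multiply looks safe: " + looksCombinerSafe(multiply, 2, 3, 4));
        System.out.println("subtract looks safe: " + looksCombinerSafe(subtract, 10, 3, 2));
    }
}
```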

Advantages of a combiner:

  • It reduces network I/O between the mappers and the reducer
  • It reduces disk I/O in the reducer, as part of the execution happens in the combiner.
Licensed under: CC-BY-SA with attribution