Question

I'm currently working on a Java EMR project where my key is composed of two Text objects. In one of my steps I set a NaturalKeyGroupingComparator that compares only the left part of the key.

Now this is the Java code for the Reducer:

     public void reduce(Pair key, Iterable<Data> values, Context context)
             throws IOException, InterruptedException {

         int totalOccurrences = 0;
         for (Data value : values) {
             if (key.getRight().toString().equals("*")) {
                 totalOccurrences += value.getOccurrences();
             } else {
                 value.setCount(new IntWritable(totalOccurrences));
             }
         }
     }

Everything works as planned, but I don't understand exactly what is happening. How can the key change in the middle of a reduce call?


Solution

Your question is a good beginner's question :)

I have written about it here.

I guess the biggest thing to keep in mind is that the Iterable is not backed by a collection; it is computed on the fly, as and when the next() method is invoked.

Once you're done with the above post, and if you're the "I want to see the code" kind of person, here it is:

    // Line number 157

    if (hasMore) {
      nextKey = input.getKey();
      nextKeyIsSame = comparator.compare(currentRawKey.getBytes(), 0,
                                         currentRawKey.getLength(),
                                         nextKey.getData(),
                                         nextKey.getPosition(),
                                         nextKey.getLength() - nextKey.getPosition()
                                         ) == 0;
    } else {
      nextKeyIsSame = false;
    }

This is a snippet from ReduceContextImpl.

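This code runs each time you invoke next(). It checks whether the key changes in the underlying stream. If the grouping comparator says the next key is the same (remember that keys are sorted), it just hands you the next value; otherwise it ends the current Iterable so that the reduce method can be called again with a new key and a new Iterable. The key object itself is reused and refilled from the stream as you advance, which is why its right part can change in the middle of a reduce call while the comparator, looking only at the left part, still treats it as the same group.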

The underlying stream is always a sequence of key/value pairs; ReduceContextImpl gives you the illusion (abstraction) of a key/collection pair.
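To make this concrete, here is a toy model of the idea in plain Java. It is only a sketch and not Hadoop's actual code: the names StreamingReduceDemo, Pair, Entry and GroupValues are made up for illustration, and the real framework deserializes bytes and fills the key before reduce() is entered rather than walking a List. What it shows is one shared key object being overwritten on every next(), while group membership is decided by the left part only.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Toy model of a reduce-side value iterator (hypothetical, not Hadoop code). */
final class StreamingReduceDemo {

    /** Mutable composite key, standing in for the question's Pair. */
    static final class Pair {
        String left;
        String right;
    }

    /** One record of the already sorted key/value stream. */
    static final class Entry {
        final String left;
        final String right;
        final int value;

        Entry(String left, String right, int value) {
            this.left = left;
            this.right = right;
            this.value = value;
        }
    }

    /** Walks one group of the stream, rewriting the shared key on every step. */
    static final class GroupValues implements Iterator<Integer> {
        private final List<Entry> stream;
        private final Pair key;          // the same object the "reducer" reads
        private final String groupLeft;  // left part that defines this group
        private int pos;                 // index of the next unread record

        GroupValues(List<Entry> stream, int start, Pair key) {
            this.stream = stream;
            this.pos = start;
            this.key = key;
            this.groupLeft = stream.get(start).left;
        }

        @Override
        public boolean hasNext() {
            // Stand-in for the grouping comparator: only the left part is compared.
            return pos < stream.size() && stream.get(pos).left.equals(groupLeft);
        }

        @Override
        public Integer next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            Entry e = stream.get(pos++);
            key.left = e.left;    // the key is overwritten in place...
            key.right = e.right;  // ...so its right part may change mid-group
            return e.value;
        }

        int position() {
            return pos;
        }
    }

    /** Same shape as the reducer in the question: sum the "*" records, then use the total. */
    static List<String> run(List<Entry> sorted) {
        List<String> out = new ArrayList<>();
        Pair key = new Pair();
        int pos = 0;
        while (pos < sorted.size()) {
            GroupValues values = new GroupValues(sorted, pos, key);
            int total = 0;
            while (values.hasNext()) {
                int v = values.next();
                if (key.right.equals("*")) {
                    total += v;
                } else {
                    out.add(key.left + "," + key.right + " total=" + total);
                }
            }
            pos = values.position();
        }
        return out;
    }

    public static void main(String[] args) {
        List<Entry> sorted = List.of(
                new Entry("a", "*", 3), new Entry("a", "*", 2), new Entry("a", "x", 1),
                new Entry("b", "*", 4), new Entry("b", "y", 1));
        run(sorted).forEach(System.out::println);
    }
}
```

Because "*" sorts before the other right-hand values, the total is complete by the time the first non-"*" record of a group shows up, which is the ordering the reducer in the question relies on.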

As I said at the start: the Iterable is not backed by a collection; it is computed on the fly, as and when next() is invoked.

This theme is common across the MapReduce framework: computations are done on streams, and nothing is ever loaded entirely into memory. It took me a while to get this :) hence the eagerness to share it.

OTHER TIPS

The reduce() method is executed once for every key group in the reducer's input. In your case, when both texts were used as the key, keys were grouped on both texts, so your output would be

 KeyGroup1, count1
 KeyGroup2, count2

Now, when grouping is based only on the left part of the key, the grouping seen by the reducer changes too, giving an output of

 NewKeyGroup1, count1
 NewKeyGroup2, count2
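The difference between the two groupings can be sketched in plain Java, without Hadoop. This is an illustration only: GroupingDemo, FULL_KEY and NATURAL_KEY are hypothetical names, keys are modelled as two-element String arrays, and the comparators stand in for whatever is registered on the job with setGroupingComparatorClass. Each group found here corresponds to one reduce() call.

```java
import java.util.Comparator;
import java.util.List;

/** Counts reduce() calls for a sorted key list under a given grouping comparator. */
final class GroupingDemo {

    /** Groups on both parts of the key: every distinct (left, right) is its own group. */
    static final Comparator<String[]> FULL_KEY = (a, b) -> {
        int c = a[0].compareTo(b[0]);
        return c != 0 ? c : a[1].compareTo(b[1]);
    };

    /** Groups on the left part only, like the question's NaturalKeyGroupingComparator. */
    static final Comparator<String[]> NATURAL_KEY = (a, b) -> a[0].compareTo(b[0]);

    /** A new group starts whenever a key differs from its predecessor under the comparator. */
    static int countGroups(List<String[]> sortedKeys, Comparator<String[]> grouping) {
        int groups = 0;
        String[] previous = null;
        for (String[] key : sortedKeys) {
            if (previous == null || grouping.compare(previous, key) != 0) {
                groups++;
            }
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String[]> sortedKeys = List.of(
                new String[] {"a", "*"}, new String[] {"a", "x"}, new String[] {"a", "y"},
                new String[] {"b", "*"}, new String[] {"b", "z"});
        System.out.println("full key groups:    " + countGroups(sortedKeys, FULL_KEY));
        System.out.println("natural key groups: " + countGroups(sortedKeys, NATURAL_KEY));
    }
}
```

With the five sample keys, grouping on the full key yields five groups, while grouping on the left part yields two, one for "a" and one for "b".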

For a deeper understanding, go through the section on Secondary Sort in Chapter 8 of Hadoop: The Definitive Guide.

Licensed under: CC-BY-SA with attribution