Question

I ran into an interesting situation, and now I am looking for how to do it intentionally. On my local single-node setup, I ran two jobs simultaneously from the terminal. Both of my jobs use the same reducer; they only differ in the map function (the aggregation key, i.e. the group by). The output of both jobs was written to the output directory of the first job (the second job did create its own folder, but it was empty). What I am working on is providing rollup aggregations across various levels, and this behavior is fascinating to me: the aggregation output from two different levels is available to me in one single file (and perfectly sorted).

My question is how to achieve the same on a real Hadoop cluster, where we have multiple data nodes, i.e. I programmatically initiate multiple jobs, all accessing the same input file, mapping the data differently, but using the same reducer, and the output is available in one single file, not in 5 different output files.

Please advise.

I was taking a look at "merge output files after reduce phase" before I decided to ask my question.

Thanks and Kind regards,

Moiz Ahmed.


Solution

When different Mappers consume the same input file, in other words the same data structure, the source code for all of these mappers can be placed into separate methods of a single Mapper implementation, using a parameter from the context to decide which map functions to invoke. On the plus side, you only need to start one MapReduce job. Here is a sketch:

import java.io.IOException;
import java.util.BitSet;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ComplexMapper extends Mapper<Object, Text, Text, Text> {

    protected BitSet mappingBitmap = new BitSet();

    @Override
    protected void setup(Context context) {
        String params = context.getConfiguration().get("params", "");
        // Analyze params and set the corresponding bits,
        // e.g. "A" -> bit 1, "B" -> bit 2, "C" -> bit 3.
        if (params.contains("A")) mappingBitmap.set(1);
        if (params.contains("B")) mappingBitmap.set(2);
        if (params.contains("C")) mappingBitmap.set(3);
    }

    protected void mapA(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... derive the aggregation key for level A from the record ...
        Text keyA = new Text("A:" + value.toString());   // placeholder key derivation
        context.write(keyA, value);
    }

    protected void mapB(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... derive the aggregation key for level B from the record ...
        Text keyB = new Text("B:" + value.toString());   // placeholder key derivation
        context.write(keyB, value);
    }

    protected void mapC(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... derive the aggregation key for level C from the record ...
        Text keyC = new Text("C:" + value.toString());   // placeholder key derivation
        context.write(keyC, value);
    }

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        if (mappingBitmap.get(1)) {
            mapA(key, value, context);
        }
        if (mappingBitmap.get(2)) {
            mapB(key, value, context);
        }
        if (mappingBitmap.get(3)) {
            mapC(key, value, context);
        }
    }
}

Of course, this can be implemented more elegantly with interfaces and so on.
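For instance, a minimal sketch of that interface-based variant could look like the following. The MapFunction interface and the level-specific lambdas are hypothetical, not part of the Hadoop API, and the key derivation is just a placeholder assuming comma-separated records:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical helper interface: one implementation per aggregation level.
interface MapFunction {
    void map(Object key, Text value, Mapper<Object, Text, Text, Text>.Context context)
            throws IOException, InterruptedException;
}

public class DelegatingMapper extends Mapper<Object, Text, Text, Text> {

    private final List<MapFunction> delegates = new ArrayList<>();

    @Override
    protected void setup(Context context) {
        String params = context.getConfiguration().get("params", "");
        // Register one delegate per requested aggregation level.
        if (params.contains("A")) {
            delegates.add((k, v, ctx) ->
                    ctx.write(new Text("A:" + v.toString().split(",")[0]), v));
        }
        if (params.contains("B")) {
            delegates.add((k, v, ctx) ->
                    ctx.write(new Text("B:" + v.toString().split(",")[1]), v));
        }
    }

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each record is fed through every registered map function.
        for (MapFunction f : delegates) {
            f.map(key, value, context);
        }
    }
}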

In the job setup just add:

Configuration conf = new Configuration();
conf.set("params", "AB");

Job job = new Job(conf);

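Putting it together, a fuller driver sketch might look like the code below. It assumes the ComplexMapper above and a shared reducer class called RollupReducer, which is a placeholder name for whatever reducer both of your original jobs already use:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RollupDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell the combined mapper which aggregation levels to emit.
        conf.set("params", "AB");

        Job job = Job.getInstance(conf, "rollup");   // newer replacement for new Job(conf)
        job.setJarByClass(RollupDriver.class);
        job.setMapperClass(ComplexMapper.class);
        job.setReducerClass(RollupReducer.class);    // the one reducer shared by all levels (placeholder name)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}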
As Praveen Sripati mentioned, having a single output file forces you into having just one reducer, which might be bad for performance. You can always concatenate the part-* files when you download them from HDFS. Example:

hadoop fs -text /output_dir/part* > wholefile.txt

OTHER TIPS

Usually each reduce task produces a separate file in HDFS, so that the reduce tasks can operate in parallel. If the requirement is to have one output file from the reduce phase, then configure the job to have a single reduce task. The number of reducers can be configured using the mapred.reduce.tasks property, which defaults to 1. The con of this approach is that there is only one reducer, which might become a bottleneck for the job.
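In code, assuming the Job object from the driver sketch above, this amounts to:

// Force a single output file by running exactly one reduce task
// (equivalent to setting mapred.reduce.tasks=1).
job.setNumReduceTasks(1);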

Another option is to use an output format that allows multiple reducers to write to the same sink simultaneously, like DBOutputFormat. Once the job processing is complete, the results can be exported from the DB into a flat file. This approach enables multiple reduce tasks to run in parallel.
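A rough sketch of wiring a job to DBOutputFormat is shown below. The JDBC connection details, table name and column names are invented, and the job's output key class would additionally have to implement DBWritable:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;

public class DbSinkSetup {
    public static Job configure(Configuration conf) throws Exception {
        // JDBC connection details are placeholders.
        DBConfiguration.configureDB(conf,
                "com.mysql.jdbc.Driver",
                "jdbc:mysql://dbhost:3306/rollups",
                "user", "password");

        Job job = Job.getInstance(conf, "rollup-to-db");
        job.setOutputFormatClass(DBOutputFormat.class);
        // Invented table and columns; the output key class must implement DBWritable.
        DBOutputFormat.setOutput(job, "rollup_results", "level", "group_key", "total");
        return job;
    }
}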

Another option is to merge the output files, as mentioned in the OP. So, based on the pros and cons of each approach and the volume of data to be processed, one of them can be chosen.

Licensed under: CC-BY-SA with attribution