Question

As part of a workaround, I want to use two MapReduce jobs (instead of one) that run in sequence to produce the desired effect.

The map function in each job simply emits each key/value pair without processing it. The reduce functions in the two jobs are different, as they do different kinds of processing.

I stumbled upon Oozie, and it seems to write directly to the input stream of the subsequent job (or does it?). This would be great, since the intermediate data is large (the I/O would otherwise become a bottleneck).

How can I achieve this with Oozie (two MR jobs in the workflow)?

I did go through the resource below, but it simply runs a single job as a workflow: https://cwiki.apache.org/confluence/display/OOZIE/Map+Reduce+Cookbook

Help appreciated.

Cheers


Solution 2

Oozie is a system for describing the workflow of a job, where that workflow may contain a set of MapReduce jobs, Pig scripts, FS operations, etc., and it supports forking and joining of the data flow.

It doesn't, however, allow you to stream the output of one MR job directly as the input to another. The map-reduce action in Oozie still requires an output format of some type, typically a file-based one, so the output of job 1 will still be serialized via HDFS before being processed by job 2.

The Oozie documentation has an example with multiple MR jobs, including a fork:

http://oozie.apache.org/docs/3.2.0-incubating/WorkflowFunctionalSpec.html#Appendix_B_Workflow_Examples
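
For concreteness, here is a rough sketch of what a two-action workflow might look like (not taken from that page; the class names under com.example and the HDFS paths are placeholders). Job 1 writes its output to an intermediate HDFS directory, which job 2 then reads as its input:

    <workflow-app name="two-stage-mr" xmlns="uri:oozie:workflow:0.2">
        <start to="first-mr"/>

        <action name="first-mr">
            <map-reduce>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <!-- Clear the intermediate directory so reruns don't fail -->
                <prepare>
                    <delete path="${nameNode}/user/${wf:user()}/intermediate"/>
                </prepare>
                <configuration>
                    <property>
                        <name>mapred.mapper.class</name>
                        <value>com.example.PassThroughMapper</value>
                    </property>
                    <property>
                        <name>mapred.reducer.class</name>
                        <value>com.example.FirstReducer</value>
                    </property>
                    <property>
                        <name>mapred.input.dir</name>
                        <value>/user/${wf:user()}/input</value>
                    </property>
                    <!-- Job 1's output is serialized to HDFS here... -->
                    <property>
                        <name>mapred.output.dir</name>
                        <value>/user/${wf:user()}/intermediate</value>
                    </property>
                </configuration>
            </map-reduce>
            <ok to="second-mr"/>
            <error to="fail"/>
        </action>

        <action name="second-mr">
            <map-reduce>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <configuration>
                    <property>
                        <name>mapred.mapper.class</name>
                        <value>com.example.PassThroughMapper</value>
                    </property>
                    <property>
                        <name>mapred.reducer.class</name>
                        <value>com.example.SecondReducer</value>
                    </property>
                    <!-- ...and read back from HDFS as job 2's input -->
                    <property>
                        <name>mapred.input.dir</name>
                        <value>/user/${wf:user()}/intermediate</value>
                    </property>
                    <property>
                        <name>mapred.output.dir</name>
                        <value>/user/${wf:user()}/output</value>
                    </property>
                </configuration>
            </map-reduce>
            <ok to="end"/>
            <error to="fail"/>
        </action>

        <kill name="fail">
            <message>MR job failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
        </kill>
        <end name="end"/>
    </workflow-app>

Note that these property names use the old mapred API, which is what the map-reduce action expects by default; if your mapper and reducer are written against the new API, you would additionally set mapred.mapper.new-api and mapred.reducer.new-api to true in the action configuration.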

OTHER TIPS

There is a way to avoid hitting the disk: look at the ChainMapper class in Hadoop. It allows you to forward the output of one mapper directly into the input of the next mapper without writing to disk in between.
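
As an illustration, here is a minimal driver sketch using the new-API ChainMapper/ChainReducer (org.apache.hadoop.mapreduce.lib.chain); the TokenizeMapper, LowerCaseMapper, and SumReducer classes are made-up examples, not from the question. Keep in mind that ChainMapper chains mappers within a single job, following the pattern MAP+ / REDUCE / MAP*, so you still get at most one reduce phase per job: it saves the disk I/O between the chained map stages, but it cannot by itself replace the two separate reducers described in the question.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
    import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ChainedMapDriver {

        // First mapper in the chain: splits each line into (word, 1) pairs.
        public static class TokenizeMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String word : value.toString().split("\\s+")) {
                    if (!word.isEmpty()) ctx.write(new Text(word), ONE);
                }
            }
        }

        // Second mapper: receives the first mapper's output in memory
        // (no intermediate write to disk) and lower-cases the keys.
        public static class LowerCaseMapper
                extends Mapper<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void map(Text key, IntWritable value, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(new Text(key.toString().toLowerCase()), value);
            }
        }

        // Single reducer at the end of the chain: sums the counts.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "chained-mappers");
            job.setJarByClass(ChainedMapDriver.class);

            // Map stage 1 -> map stage 2: the output key/value types of one
            // mapper must match the input types of the next.
            ChainMapper.addMapper(job, TokenizeMapper.class,
                    LongWritable.class, Text.class, Text.class, IntWritable.class,
                    new Configuration(false));
            ChainMapper.addMapper(job, LowerCaseMapper.class,
                    Text.class, IntWritable.class, Text.class, IntWritable.class,
                    new Configuration(false));

            // The chain still ends in (at most) one reduce phase.
            ChainReducer.setReducer(job, SumReducer.class,
                    Text.class, IntWritable.class, Text.class, IntWritable.class,
                    new Configuration(false));

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

You would package and run this like any other MR job, e.g. hadoop jar chain.jar ChainedMapDriver <input> <output> (the jar name is hypothetical).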

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow