What is the correct way to use Oozie to write to multiple output streams for a MapReduce job?

https://stackoverflow.com/questions/9809751

25-05-2021

Question

I'm using the new Hadoop API to write a sequence of map-reduce jobs. I plan to use Oozie to pipeline all of these together, but I can't seem to find a way to produce multiple output streams from a map-reduce node in the workflow.

Normally, to write multiple outputs I would use code similar to the example in the MultipleOutputs javadoc, but Oozie gets all its configuration from the workflow.xml file, so the named outputs cannot be configured in code the way they are in that example.

I've come across a thread discussing the use of multiple outputs in Oozie, but no solution was presented beyond creating a Java task and adding it to the Oozie pipeline directly.

Is there a way to do this via a map-reduce node in the workflow.xml?

Edit:

Chris's solution did work, though I wish there was a better way. Here are the exact changes I made.

I added the following to the workflow.xml file:

<property>
    <name>mapreduce.multipleoutputs</name>
    <value>${output1} ${output2}</value>
</property>
<property>
    <name>mapreduce.multipleoutputs.namedOutput.${output1}.key</name>
    <value>org.apache.hadoop.io.Text</value>
</property>
<property>
    <name>mapreduce.multipleoutputs.namedOutput.${output1}.value</name>
    <value>org.apache.hadoop.io.LongWritable</value>
</property>
<property>
    <name>mapreduce.multipleoutputs.namedOutput.${output1}.format</name>
    <value>org.apache.hadoop.mapreduce.lib.output.TextOutputFormat</value>
</property>
<property>
    <name>mapreduce.multipleoutputs.namedOutput.${output2}.key</name>
    <value>org.apache.hadoop.io.Text</value>
</property>
<property>
    <name>mapreduce.multipleoutputs.namedOutput.${output2}.value</name>
    <value>org.apache.hadoop.io.LongWritable</value>
</property>
<property>
    <name>mapreduce.multipleoutputs.namedOutput.${output2}.format</name>
    <value>org.apache.hadoop.mapreduce.lib.output.TextOutputFormat</value>
</property>

I added the following to the job.properties file that is fed to Oozie at startup:

output1=totals
output2=uniques

Then in the reducer I wrote to the named outputs totals and uniques.

Solution

The addNamedOutput utility method of MultipleOutputs just sets configuration properties, so look at an instance of your job that has already run and extract the MultipleOutputs properties from it (they are in the job.xml, linked from the JobTracker page).

Alternatively, look through the source for MultipleOutputs and see what configuration properties are being set when you call this method.

Once you know the properties being set, add them to the configuration section of the map-reduce element in your Oozie workflow.
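
As a rough sketch of where these properties end up, a map-reduce action might look something like the fragment below. The property names are the ones reported in the question's edit; the action name, the ok/error transition targets, and the ${jobTracker} and ${nameNode} parameters are placeholders assumed for illustration, and the literal names totals and uniques stand in for the question's ${output1} and ${output2}. Check the names against the job.xml of a real run before relying on them.

```xml
<!-- Sketch only: action name, transitions, and the ${jobTracker}/${nameNode}
     parameters are placeholders, not taken from the original workflow. -->
<action name="multi-output-job">
    <map-reduce>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <!-- Mapper, reducer, and input/output settings go here as usual. -->

            <!-- Space-separated list of all named outputs. -->
            <property>
                <name>mapreduce.multipleoutputs</name>
                <value>totals uniques</value>
            </property>

            <!-- Key class, value class, and output format for "totals". -->
            <property>
                <name>mapreduce.multipleoutputs.namedOutput.totals.key</name>
                <value>org.apache.hadoop.io.Text</value>
            </property>
            <property>
                <name>mapreduce.multipleoutputs.namedOutput.totals.value</name>
                <value>org.apache.hadoop.io.LongWritable</value>
            </property>
            <property>
                <name>mapreduce.multipleoutputs.namedOutput.totals.format</name>
                <value>org.apache.hadoop.mapreduce.lib.output.TextOutputFormat</value>
            </property>

            <!-- Repeat the same three properties for "uniques". -->
        </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
</action>
```

Each named output needs its own key, value, and format entries, which is the same information a single addNamedOutput call would otherwise record in the job configuration.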

OTHER TIPS

As of Hadoop 2.x the property names have changed from mapreduce.multipleoutputs.* to mo.*, so the new configuration properties would now look like this:

<property>
    <name>mo.namedOutputs</name>
    <value>${output1} ${output2}</value>
</property>
<property>
    <name>mo.namedOutput.${output1}.key</name>
    <value>org.apache.hadoop.io.Text</value>
</property>
<property>
    <name>mo.namedOutput.${output1}.value</name>
    <value>org.apache.hadoop.io.LongWritable</value>
</property>
<property>
    <name>mo.namedOutput.${output1}.format</name>
    <value>org.apache.hadoop.mapreduce.lib.output.TextOutputFormat</value>
</property>
<property>
    <name>mo.namedOutput.${output2}.key</name>
    <value>org.apache.hadoop.io.Text</value>
</property>
<property>
    <name>mo.namedOutput.${output2}.value</name>
    <value>org.apache.hadoop.io.LongWritable</value>
</property>
<property>
    <name>mo.namedOutput.${output2}.format</name>
    <value>org.apache.hadoop.mapreduce.lib.output.TextOutputFormat</value>
</property>

Tested and verified on Hadoop 2.4.x and Oozie 4.0.0.

Licensed under: CC-BY-SA with attribution
