Question

Is it possible to use Oozie to concatenate the output of a MapReduce job into a single file? Let's say I have the output ...

part-r-00000
part-r-00001
part-r-00002

and I just want...

output.csv

I know I can pull them down as a single file with hadoop fs -getmerge, but I'm curious if it's possible with a workflow application and HDFS.


Solution

Two simple options I can think of:

  1. Amend the job that produced this output so that it uses a single reducer
  2. Run a second map-reduce action over that output with an identity mapper, an identity reducer, and the number of reducers set to one

OTHER TIPS

You can probably use Pig or Java to call FileSystem.concat:

http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#concat-org.apache.hadoop.fs.Path-org.apache.hadoop.fs.Path:A-

or maybe add it to your own fork of Oozie's fs-action.

Alternatively, use the WebHDFS concat operation: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Concat_Files

You could wrap the curl call shown there in a shell or ssh action.
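For illustration, here is a rough Java sketch of the request that curl call would send. The host name, port 9870, and the /out paths are placeholders I made up; the POST method and the op=CONCAT and sources parameters come from the WebHDFS docs linked above. As far as I understand, concat appends the sources onto a target file that already exists, so the sketch uses part-r-00000 as the target, and you would still need a rename afterwards to end up with output.csv. The request is only built and printed here, not sent, and authentication (for example a user.name parameter) is left out.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.List;

class WebHdfsConcat {

    // Builds the WebHDFS CONCAT URI: the target file goes in the URL path,
    // the source files go in the comma-separated "sources" query parameter.
    static URI concatUri(String host, int port, String target, List<String> sources) {
        String joined = String.join(",", sources);
        String encoded = URLEncoder.encode(joined, StandardCharsets.UTF_8);
        return URI.create("http://" + host + ":" + port
                + "/webhdfs/v1" + target
                + "?op=CONCAT&sources=" + encoded);
    }

    // CONCAT is issued as a POST with an empty body.
    static HttpRequest concatRequest(String host, int port, String target, List<String> sources) {
        return HttpRequest.newBuilder(concatUri(host, port, target, sources))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical NameNode address and output directory.
        HttpRequest request = concatRequest(
                "namenode.example.com", 9870,
                "/out/part-r-00000",
                List.of("/out/part-r-00001", "/out/part-r-00002"));
        System.out.println(request.method() + " " + request.uri());
    }
}
```

To actually send it you could pass the request to java.net.http.HttpClient, or just keep using curl from the shell action; the point is only to show how the target and sources map onto the URL.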

Licensed under: CC-BY-SA with attribution