Question

I want to write both gzip-compressed and LZO-compressed output from a single job.

I used MultipleOutputs and added two named outputs like this:

MultipleOutputs.addNamedOutput(job, "LzoOutput", GBKTextOutputFormat.class, Text.class, Text.class);

GBKTextOutputFormat.setOutputCompressorClass(job, LzoCodec.class);

MultipleOutputs.addNamedOutput(job, "GzOutput", TextOutputFormat.class, Text.class, Text.class);

TextOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

(GBKTextOutputFormat is a class I wrote myself; it extends FileOutputFormat.)

They are used in the reducer like this:

multipleOutputs.write("LzoOutput", NullWritable.get(), value, "/user/hadoop/lzo/"+key.toString());

multipleOutputs.write("GzOutput", NullWritable.get(), value, "/user/hadoop/gzip/"+key.toString());

The result is:

I get output in both paths, but both are in gzip format.

Can someone help me? Thanks!

==========================================================================

More:

I just looked at the source code of setOutputCompressorClass in FileOutputFormat, which calls conf.setClass("mapred.output.compression.codec", codecClass, CompressionCodec.class);

So it seems that mapred.output.compression.codec in the job configuration is overwritten every time setOutputCompressorClass is called.

Does that mean the actual compression format is whichever one was set last, and we cannot set two different compression formats in the same job? Or is there something I have overlooked?
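To make the suspected overwrite concrete, here is a minimal sketch in plain Java. It does not use Hadoop at all: ToyConf and CodecOverwriteDemo are hypothetical stand-ins for the job configuration, and the codecs are represented by plain strings. The only thing taken from the real code is the key name quoted above. It assumes the setter does nothing more than store the codec under that single key.

```java
import java.util.HashMap;
import java.util.Map;

public class CodecOverwriteDemo {
    // The key that setOutputCompressorClass writes to, as quoted from the source above.
    static final String CODEC_KEY = "mapred.output.compression.codec";

    // Hypothetical stand-in for the job configuration: a flat key/value map.
    static final class ToyConf {
        private final Map<String, String> props = new HashMap<>();

        void set(String key, String value) {
            props.put(key, value);
        }

        String get(String key) {
            return props.get(key);
        }
    }

    // Mimics the static setter: it does not know which output format
    // it was called through, it just writes the one shared key.
    static void setOutputCompressorClass(ToyConf conf, String codecName) {
        conf.set(CODEC_KEY, codecName);
    }

    // Repeats the call order from the question and returns the codec that survives.
    static String effectiveCodec() {
        ToyConf conf = new ToyConf();
        setOutputCompressorClass(conf, "LzoCodec");  // via GBKTextOutputFormat in the question
        setOutputCompressorClass(conf, "GzipCodec"); // via TextOutputFormat in the question
        return conf.get(CODEC_KEY);
    }

    public static void main(String[] args) {
        System.out.println(effectiveCodec()); // prints "GzipCodec"
    }
}
```

Under that assumption, the second call replaces the first, which would match the observation that both outputs end up gzip-compressed.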


Solution

As a possible workaround, try setting the codec you want directly in the configuration

context.getConfiguration().setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);

just before the write call to each of the outputs (using LzoCodec.class for the LZO output). It looks like MultipleOutputs does not handle output format configuration parameters other than the key class, value class and output path well, so we may have to write a bit of code to compensate for that oversight.

Licensed under: CC-BY-SA with attribution