Write some data (lines) from my mappers to separate directories depending on some logic in my mapper code

https://stackoverflow.com/questions/11091723

15-06-2021

Question

I am using mrjob for my EMR needs.

How do I write some data (lines) from my mappers to "separate directories", depending on some logic in my mapper code, so that I can:

1. tar and gzip them, and
2. upload them to separate S3 buckets (depending on the directory name) after the job finishes or terminates abruptly?

I guess the '--output-dir' option only allows you to upload the final job output to that directory, but I would also like to write to other directories from my mappers from time to time.

Solution

No, you can't, at least not in the traditional sense.

Reason: mrjob internally uses Hadoop Streaming to run map/reduce jobs when running on a Hadoop cluster. I am assuming that this is the same for Amazon Elastic MapReduce as it is for a Hadoop cluster.

The --output-dir option is actually an input to Hadoop Streaming that specifies where the output of the reducers will be collected. You cannot use this mechanism to segregate data into different folders.

[Edit: In response to comment]

My understanding is that boto is only a library for connecting to Amazon services and accessing EC2, S3, etc.

In a non-traditional sense you can still write to different directories, I guess.

I have not tested this idea and don't recommend this approach. It would be like opening a file and writing to it directly from within the reducers. Theoretically you could do that: instead of just writing the reducer output to stdout, you could open and write to S3 objects directly. You have to ensure that each reducer opens a different file, because the job spawns multiple reducers.
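To make the "different files per task" point concrete, here is a minimal, untested-on-EMR sketch. It only writes to local files, one per category and per task, and leaves the tar/gzip and S3 upload steps out. The names task_suffix, side_output_path and SideWriter are hypothetical helpers, not part of mrjob, and the environment variable names are an assumption about what Hadoop Streaming exposes on your cluster.

```python
import os
import uuid


def task_suffix():
    """Return a string intended to be unique per task attempt."""
    # Assumption: Hadoop Streaming exports job configuration as environment
    # variables (dots replaced by underscores). The names tried here may
    # differ between Hadoop versions.
    for name in ("mapreduce_task_attempt_id", "mapred_task_id"):
        value = os.environ.get(name)
        if value:
            return value
    # Fallback for local runs: a random identifier.
    return uuid.uuid4().hex


def side_output_path(base_dir, category, suffix):
    """Build a per-category, per-task path: <base_dir>/<category>/part-<suffix>."""
    return os.path.join(base_dir, category, "part-" + suffix)


class SideWriter:
    """Lazily opens one file per category for the current task."""

    def __init__(self, base_dir, suffix=None):
        self.base_dir = base_dir
        self.suffix = suffix or task_suffix()
        self._files = {}

    def write(self, category, line):
        f = self._files.get(category)
        if f is None:
            path = side_output_path(self.base_dir, category, self.suffix)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            f = self._files[category] = open(path, "a")
        # Normalize to exactly one trailing newline per line.
        f.write(line.rstrip("\n") + "\n")

    def close(self):
        for f in self._files.values():
            f.close()
        self._files.clear()
```

In a mapper or reducer you would create one SideWriter per task, call write(category, line) wherever your logic decides a line belongs in a side directory, and call close() at the end. Because the file name includes the task suffix, two tasks writing to the same category do not clobber each other.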

This is what I learned while using MrJob with Hadoop cluster: http://pyfunc.blogspot.com/2012/05/hadoop-map-reduce-with-mrjob.html

OTHER TIPS

I think Hadoop's MultipleOutputs feature can help you; in your custom OutputFormat you can specify the path and filename.

You can follow the approach of creating a custom JAR and customizing your OutputFormat in order to multiplex outputs into different folders/files. Create a subclass of MultipleTextOutputFormat and override a few of its methods, mainly generateFileNameForKeyValue(Text key, Text value, String leaf) and generateActualKey(Text key, Text value).

For more details you can refer to this: http://www.infoq.com/articles/HadoopOutputFormat

Licensed under: CC-BY-SA with attribution
