Question

I am working with Amazon's MapReduce web service (Elastic MapReduce) for a university project. In order to use the data for MapReduce, I need to dump it from a relational database (AWS RDS) into S3. After MapReduce finishes, I need to split the output file and load chunks of it into their own S3 buckets.

What is a good way to do this within the Amazon Web Services environment?

Best case: could this be accomplished without using extra EC2 instances besides the ones used for RDS and MapReduce?

I use Python for the mapper and reducer functions and JSON specifiers for the MapReduce job flow. Otherwise I am not bound to any particular language or technology.


Solution

As the Amazon Elastic MapReduce Developer Guide explains, you need to specify the S3 locations of the input data, the output data, the mapper script and the reducer script in order to create a MapReduce job flow.
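
For illustration, here is a minimal sketch of creating such a streaming job flow programmatically with boto 2.x (the region, bucket names and script paths are placeholders):

    import boto.emr
    from boto.emr.step import StreamingStep

    # Connect to Elastic MapReduce (region chosen as an example).
    emr = boto.emr.connect_to_region('us-east-1')

    # A streaming step pointing at the mapper, reducer, input and output in S3.
    step = StreamingStep(
        name='Example streaming step',
        mapper='s3n://my-bucket/scripts/mapper.py',
        reducer='s3n://my-bucket/scripts/reducer.py',
        input='s3n://my-bucket/input/',
        output='s3n://my-bucket/output/',
    )

    # Launch a small job flow that runs the single step above.
    jobflow_id = emr.run_jobflow(
        name='University project job flow',
        steps=[step],
        num_instances=3,
        master_instance_type='m1.small',
        slave_instance_type='m1.small',
    )
    print(jobflow_id)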

If you need to do some pre-processing (such as dumping the MapReduce input file from a database) or post-processing (such as splitting the MapReduce output file to other locations in S3), you will have to automate those tasks separately from the MapReduce job flow.
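
On the pre-processing side, here is a sketch of dumping a table from a MySQL-flavoured RDS instance into a local, line-oriented file that Hadoop streaming can consume (host, credentials, table and column names are all hypothetical):

    import MySQLdb  # assuming a MySQL-based RDS instance and the MySQLdb driver

    # Hypothetical connection details for the RDS instance.
    conn = MySQLdb.connect(
        host='mydb.xxxxxxxx.us-east-1.rds.amazonaws.com',
        user='dbuser',
        passwd='secret',
        db='project',
    )

    # Write the rows MapReduce should consume as tab-separated lines,
    # the line-oriented input format Hadoop streaming expects.
    cursor = conn.cursor()
    cursor.execute('SELECT id, payload FROM measurements')
    with open('input.tsv', 'w') as out:
        for row in cursor:
            out.write('\t'.join(str(col) for col in row) + '\n')
    cursor.close()
    conn.close()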

You may use the boto library to write those pre-processing and post-processing scripts. They can be run on an EC2 instance or any other computer with access to the S3 bucket. Data transfer from EC2 may be cheaper and faster, but if you don't have an EC2 instance available for this, you could run the scripts on your own computer... unless there is too much data to transfer!
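
For instance, here is a sketch with boto 2.x of both sides: uploading the dumped input file to S3 before launching the job flow, and splitting the MapReduce output into chunks that each land in their own bucket (bucket names, key prefixes and the chunk size are placeholders):

    import boto
    from boto.s3.key import Key

    s3 = boto.connect_s3()
    bucket = s3.get_bucket('my-bucket')

    # Pre-processing: upload the dumped input file to the job flow's input prefix.
    input_key = Key(bucket)
    input_key.key = 'input/input.tsv'
    input_key.set_contents_from_filename('input.tsv')

    # Post-processing: gather the lines of all part-* files the reducers wrote.
    lines = []
    for key in bucket.list(prefix='output/part-'):
        lines.extend(key.get_contents_as_string().splitlines())

    # Split the combined output into fixed-size chunks and upload each chunk
    # to its own bucket.
    chunk_size = 10000
    for i in range(0, len(lines), chunk_size):
        chunk = '\n'.join(lines[i:i + chunk_size])
        target = s3.create_bucket('my-output-chunk-%d' % (i // chunk_size))
        chunk_key = Key(target)
        chunk_key.key = 'chunk.txt'
        chunk_key.set_contents_from_string(chunk)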

You can go as far as you want with automation: you may even orchestrate the whole process of generating the input, launching a new MapReduce job flow, waiting for the job to finish, and processing the output accordingly, so that, given the proper configuration, the whole thing is reduced to pushing a button :)
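
A sketch of the waiting part of such an orchestration, again with boto 2.x and a placeholder job flow id (in practice, the id returned by run_jobflow in the earlier sketch):

    import time
    import boto.emr

    emr = boto.emr.connect_to_region('us-east-1')
    jobflow_id = 'j-XXXXXXXXXXXXX'  # placeholder; use the id returned by run_jobflow

    # Poll the job flow until it reaches a terminal state.
    while True:
        state = emr.describe_jobflow(jobflow_id).state
        print('Job flow state: %s' % state)
        if state in ('COMPLETED', 'FAILED', 'TERMINATED'):
            break
        time.sleep(60)

    if state == 'COMPLETED':
        # Kick off the post-processing script from the previous sketch here.
        pass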

Licensed under: CC-BY-SA with attribution