Question

I have to build a fairly complex data processing system using Amazon EC2 + S3 + RDS + EMR, and I have some general questions I hope you can help me with:

  • I need to use R, so I have to use a Streaming Job Flow. Does that mean I lose the power of Hive and can't execute a Hive query on top of the EMR job to work with that data?
  • Can I have multiple Job Flows running and interact with them?
  • How can I use Dependent Jobs?
  • Can you re-run a job once it is done? I don't want to do the calculation just once; I want it to evolve with the data.
  • Can I pass variables to Jobs?
  • What is the correct way to automate this?

Solution

I need to use R, so I have to use a Streaming Job Flow. Does that mean I lose the power of Hive and can't execute a Hive query on top of the EMR job to work with that data?

You can mix jobs in whatever way you want: for example, an R streaming job that reads from S3 and writes to HDFS, followed by a Hive job that reads that data from HDFS and writes back to S3. They are all just MapReduce jobs.
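
As a concrete sketch of that pattern (assuming a current EMR release, boto3, and command-runner.jar; the bucket, scripts, and cluster ID are placeholders — the era of the original answer would have used the elastic-mapreduce CLI instead):

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    # Step 1: Hadoop streaming with R scripts as mapper/reducer.
    # Reads raw data from S3 and writes intermediate output to HDFS.
    r_streaming_step = {
        "Name": "R streaming: S3 -> HDFS",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", "s3://my-bucket/scripts/mapper.R,s3://my-bucket/scripts/reducer.R",
                "-mapper", "mapper.R",    # scripts must be executable (or use "Rscript mapper.R")
                "-reducer", "reducer.R",
                "-input", "s3://my-bucket/raw/",
                "-output", "hdfs:///intermediate/",
            ],
        },
    }

    # Step 2: Hive script that reads the intermediate HDFS data and writes
    # the final result back to S3.
    hive_step = {
        "Name": "Hive: HDFS -> S3",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script",
                     "--args", "-f", "s3://my-bucket/scripts/aggregate.q"],
        },
    }

    # Queue both on the same running cluster; steps execute in the order given.
    emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXX",
                           Steps=[r_streaming_step, hive_step])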

Can I have multiple Job Flows running and interact with them?

EMR places no limit on the number of job flows you can have running at once; the only limit enforced is your EC2 instance quota. There is no support yet for moving data directly between the HDFS of two clusters, but you can go via S3 easily enough.
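
If you do need to hand one cluster's HDFS output to another job flow, one way (a sketch assuming boto3 and that S3DistCp is available on the cluster; all paths and IDs are placeholders) is to add a copy step that pushes the data to S3, which the second job flow then reads as its input:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    # Copy this cluster's HDFS results to S3 so a second job flow can read them.
    handoff_step = {
        "Name": "Copy HDFS results to S3",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp",
                     "--src", "hdfs:///intermediate/",
                     "--dest", "s3://my-bucket/handoff/"],
        },
    }

    emr.add_job_flow_steps(JobFlowId="j-FIRSTCLUSTER", Steps=[handoff_step])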

How can I use Dependent Jobs?

It depends on what you mean by dependent jobs. You can use the step mechanism to queue jobs to run one after another, so as long as your workflow can be described as a single sequence you're OK; see [1] and the sketch below.
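
A sketch of that (boto3; the script locations and cluster ID are placeholders): steps submitted together run sequentially in the order given, and ActionOnFailure="CANCEL_AND_WAIT" keeps a later step from running if an earlier one fails:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    def hive_step(name, script):
        # CANCEL_AND_WAIT cancels any pending steps if this one fails,
        # so the "downstream" job never runs on bad or missing data.
        return {
            "Name": name,
            "ActionOnFailure": "CANCEL_AND_WAIT",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hive-script", "--run-hive-script",
                         "--args", "-f", script],
            },
        }

    # "aggregate" depends on "clean": it only runs after "clean" succeeds.
    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXX",
        Steps=[
            hive_step("clean", "s3://my-bucket/scripts/clean.q"),
            hive_step("aggregate", "s3://my-bucket/scripts/aggregate.q"),
        ],
    )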

Can you re-run a job once it is done? I don't want to do the calculation just once; I want it to evolve with the data.

For debugging and exploratory work it is often easiest to start a cluster with --alive, SSH to the master node, and submit jobs directly. Once you're happy with the workflow, you can use the step mechanism to orchestrate it.
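
The programmatic equivalent of --alive today is to keep the cluster alive when no steps are queued, roughly like this (a boto3 sketch; the release label, instance types, key pair, and roles are placeholders):

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    # Start a cluster that stays up with no steps queued (the equivalent of
    # --alive), so you can SSH to the master node and experiment directly.
    response = emr.run_job_flow(
        Name="exploratory-cluster",
        ReleaseLabel="emr-6.15.0",
        Applications=[{"Name": "Hive"}],
        LogUri="s3://my-bucket/emr-logs/",
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,  # do not auto-terminate
            "Ec2KeyName": "my-keypair",           # needed to SSH in
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print("Started job flow:", response["JobFlowId"])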

Can I pass variables to Jobs?

Yes; your steps give you full access to the job you're submitting, so you can pass whatever arguments it needs.
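
For example (a boto3 sketch with placeholder paths), a Hive step can take -d variables that the script references as ${DATE}, and a streaming step can export environment variables to the R scripts via -cmdenv:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    run_date = "2024-01-01"  # whatever parameter this run needs

    # Hive: -d defines variables the script can use as ${DATE} / ${OUTPUT}.
    hive_step = {
        "Name": "Hive report for " + run_date,
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script", "--args",
                     "-f", "s3://my-bucket/scripts/report.q",
                     "-d", "DATE=" + run_date,
                     "-d", "OUTPUT=s3://my-bucket/reports/" + run_date + "/"],
        },
    }

    # Streaming: -cmdenv exports an environment variable that the R mapper
    # can read with Sys.getenv("RUN_DATE"); this one is a map-only job.
    streaming_step = {
        "Name": "R scoring for " + run_date,
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hadoop-streaming",
                     "-files", "s3://my-bucket/scripts/mapper.R",
                     "-cmdenv", "RUN_DATE=" + run_date,
                     "-mapper", "mapper.R",
                     "-numReduceTasks", "0",
                     "-input", "s3://my-bucket/raw/" + run_date + "/",
                     "-output", "hdfs:///scored/" + run_date + "/"],
        },
    }

    emr.add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXX",
                           Steps=[hive_step, streaming_step])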

What is the correct way to automate this?

As long as your workflow is linear, the step mechanism should be enough: start the cluster, queue up the things to do, make sure the last step writes its output to S3, and just let the cluster terminate itself.
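
Tying it together, a sketch of that pattern (boto3; everything named here is a placeholder): launch the cluster with its steps queued up front, have the last step write to S3, and let the cluster shut itself down:

    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    final_step = {
        # Last step in the sequence: writes the results to S3 so nothing is
        # lost when the cluster's HDFS disappears at termination.
        "Name": "Hive: final results -> S3",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script",
                     "--args", "-f", "s3://my-bucket/scripts/final.q"],
        },
    }

    emr.run_job_flow(
        Name="nightly-pipeline",
        ReleaseLabel="emr-6.15.0",
        Applications=[{"Name": "Hive"}],
        LogUri="s3://my-bucket/emr-logs/",
        Instances={
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # No --alive: once the queued steps finish, the cluster terminates.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        Steps=[final_step],   # in practice, the whole linear sequence goes here
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )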

Mat

[1] http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/index.html?ProcessingCycle.html

Licensed under: CC-BY-SA with attribution