Question

With my limited understanding of Redshift, this is my plan for going about my problem...

I want to take the results of a query and use them as the input for an EMR job. What is the best way to go about this programmatically?

Currently my EMR job takes a flat file from S3 as its input, and I use the AWS SDK for Java to set the job up.
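
For context, the setup looks roughly like the sketch below (AWS SDK for Java; the cluster id, jar location, and S3 paths are placeholders, not my real values):

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.AddJobFlowStepsRequest;
import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class SubmitEmrStep {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        // The job jar reads the flat-file input path from its first argument.
        HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
                .withJar("s3://my-bucket/jobs/my-job.jar")
                .withArgs("s3://my-bucket/input/flat-file.csv",
                          "s3://my-bucket/output/");

        StepConfig step = new StepConfig()
                .withName("process-flat-file")
                .withActionOnFailure("CONTINUE")
                .withHadoopJarStep(jarStep);

        // Submit the step to a running cluster.
        emr.addJobFlowSteps(new AddJobFlowStepsRequest()
                .withJobFlowId("j-XXXXXXXXXXXXX")
                .withSteps(step));
    }
}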

Should I write the output of my Redshift query to S3, point my EMR job there, and then remove the file after the EMR job has completed?

Or do Redshift and the AWS SDK offer a more direct way to pipe the query results from Redshift to EMR, cutting out the S3 step?

Thanks

I recently spoke with members of the Amazon Redshift team, and they said a solution for this is in the works.

Solution

This is pretty easy - there's no need for Sqoop. Add a Cascading Lingual step at the front of your job that executes a Redshift UNLOAD command to S3:

UNLOAD ('select_statement')
TO 's3://object_path_prefix'
[ WITH ] CREDENTIALS [AS] 'aws_access_credentials' 
[ option [ ... ] ]
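
For example, since Redshift speaks the PostgreSQL wire protocol, you can issue the UNLOAD from the same Java code that drives EMR (a minimal sketch over JDBC; the endpoint, credentials, table, and S3 prefix below are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RedshiftUnload {
    public static void main(String[] args) throws Exception {
        // Placeholder cluster endpoint and database name.
        String url = "jdbc:postgresql://my-cluster.abc123.us-east-1"
                   + ".redshift.amazonaws.com:5439/mydb";

        try (Connection conn = DriverManager.getConnection(url, "masteruser", "password");
             Statement stmt = conn.createStatement()) {
            // Export the query result to S3 as pipe-delimited text,
            // ready to be picked up by the EMR job.
            stmt.execute(
                "UNLOAD ('select id, name from my_table') "
              + "TO 's3://my-bucket/unload/my_table_' "
              + "CREDENTIALS 'aws_access_key_id=<access-key>;"
              + "aws_secret_access_key=<secret-key>' "
              + "DELIMITER '|'");
        }
    }
}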

Then you can either process the export directly on S3, or add an S3DistCp step to bring the data onto HDFS first.
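
If you go the HDFS route, the copy can be another step submitted through the same SDK call (a sketch; the s3distcp jar path below is the classic EMR AMI location and may differ on newer releases):

import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
import com.amazonaws.services.elasticmapreduce.model.StepConfig;

public class S3DistCpStep {
    // Builds a step that copies the UNLOAD output from S3 onto the cluster's HDFS.
    public static StepConfig toHdfs(String srcPrefix, String destDir) {
        HadoopJarStepConfig jarStep = new HadoopJarStepConfig()
                .withJar("s3://elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar")
                .withArgs("--src", srcPrefix, "--dest", destDir);
        return new StepConfig()
                .withName("s3distcp-to-hdfs")
                .withActionOnFailure("CONTINUE")
                .withHadoopJarStep(jarStep);
    }
}

Add the returned step to the AddJobFlowStepsRequest ahead of the main processing step, e.g. toHdfs("s3://my-bucket/unload/", "hdfs:///input/my_table/").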

This will be a lot more performant than adding Sqoop, and a lot simpler to maintain.
