Question

We're currently using Redshift as our data warehouse, which we're very happy with. However, we now have a requirement to do machine learning against the data in our warehouse. Given the volume of data involved, ideally I'd want to run the computation in the same location as the data rather than shifting the data around, but this doesn't seem possible with Redshift. I've looked at MADlib, but this is not an option as Redshift does not support UDFs (which MADlib requires). I'm currently looking at shifting the data over to EMR and processing it with the Apache Spark machine learning library (or maybe H2O, or Mahout, or whatever). So my questions are:

  1. Is there a better way?
  2. If not, how should I make the data accessible to Spark? The options I've identified so far are: use Sqoop to load it into HDFS, use DBInputFormat, or do a Redshift export to S3 and have Spark grab it from there. What are the pros and cons of these approaches (and any others) when using Spark?

Note that this is offline batch learning, but we'd like to run it as quickly as possible so that we can iterate on experiments rapidly.


Solution

The new Amazon Machine Learning Service may work for you. It works directly with Redshift and might be a good way to start. http://aws.amazon.com/machine-learning/

If you're looking to process using EMR, then you can use Redshift's UNLOAD command to land data on S3. Spark on EMR can then access it directly without you having to pull it into HDFS.

Spark on EMR: https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923
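A minimal sketch of that flow, assuming hypothetical cluster, table, bucket, and IAM role names (adjust all of them for your environment): issue the UNLOAD from any Redshift client, then point Spark at the resulting S3 prefix.

```python
# Sketch only: host, table, bucket, and IAM role below are placeholders.
import psycopg2
from pyspark.sql import SparkSession

# 1. Ask Redshift to UNLOAD the table to S3 (the export runs in parallel on the cluster).
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="warehouse", user="analyst", password="secret")
with conn.cursor() as cur:
    cur.execute("""
        UNLOAD ('SELECT * FROM training_data')
        TO 's3://my-bucket/exports/training_data_'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
        DELIMITER '|' GZIP;
    """)
conn.commit()
conn.close()

# 2. Spark on EMR reads the exported files straight from S3 -- no copy into HDFS needed.
spark = SparkSession.builder.appName("redshift-unload-demo").getOrCreate()
df = spark.read.csv("s3://my-bucket/exports/training_data_*", sep="|")
df.printSchema()
```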

OTHER TIPS

You can try the spark-redshift connector: https://github.com/databricks/spark-redshift
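As a rough sketch of what reading a table through that connector looks like (the JDBC URL, table name, and temp bucket are placeholders, and the connector jar has to be added to the job, e.g. via --packages):

```python
from pyspark.sql import SparkSession

# Start Spark with the connector on the classpath, e.g.
#   spark-submit --packages com.databricks:spark-redshift_2.11:3.0.0-preview1 ...
spark = SparkSession.builder.appName("spark-redshift-demo").getOrCreate()

# Behind the scenes the connector UNLOADs to the temp S3 dir, then reads the files from there.
df = (spark.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://my-cluster:5439/warehouse?user=analyst&password=secret")
      .option("dbtable", "training_data")
      .option("tempdir", "s3n://my-bucket/spark-redshift-temp/")
      .load())

df.show(5)
```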

Otherwise, follow Rahul's answer: UNLOAD the data to S3 and then load it into Spark. Spark on EMR runs on top of YARN, and the Spark context's textFile method supports "s3://" paths natively.
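A tiny sketch of that RDD route (the S3 path is a placeholder for wherever the UNLOAD landed):

```python
from pyspark import SparkContext

sc = SparkContext(appName="s3-textfile-demo")

# textFile reads the pipe-delimited UNLOAD output directly from S3.
lines = sc.textFile("s3://my-bucket/exports/training_data_*")
rows = lines.map(lambda line: line.split("|"))
print(rows.take(3))
```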

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange