Question

I was looking into Hive on AWS (EMR in particular). It provides two options:

  1. Spawning an ad-hoc cluster, where the EMR cluster is torn down after the pre-specified Hive query (supplied in bootstrap) finishes executing.
  2. Spawning a Hive cluster in interactive mode, where one can SSH into the master node and submit Hive queries using the hive command-line client.

Obviously, with the second option the cluster remains alive until explicitly terminated.

I want to modify the number of slave nodes in a keep-alive Hive cluster. I read in the EMR FAQ that it supports both addition and removal of task nodes, but only addition (not removal) of core nodes. Core nodes contribute to HDFS storage, whereas task nodes do not.

I want to add more core nodes to a running cluster and scale them down when fewer queries are being run. Is there a way to achieve this (perhaps using CloudWatch)?


Solution

Scaling up and down with the number of queries is more relevant to the number of task nodes (the compute part of Hadoop) than to the number of core nodes (the data-storage part of Hadoop), since the amount of data is not changing.

Rebalancing and redistributing data every time you want to scale your query capacity up and down is not a good idea: it is too slow and too complex to give any real benefit.
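Resizing the task-node group, by contrast, is a single call to EMR's ModifyInstanceGroups API. A minimal sketch of building the request payload, assuming boto3 is available for the actual call (the instance-group ID below is a placeholder):

```python
def build_resize_request(instance_group_id, target_count):
    """Payload shape expected by EMR's ModifyInstanceGroups API."""
    return {
        "InstanceGroups": [
            {
                "InstanceGroupId": instance_group_id,
                "InstanceCount": target_count,
            }
        ]
    }

# Scale a (hypothetical) task instance group down to 2 nodes.
payload = build_resize_request("ig-TASKGROUP", 2)

# With boto3 installed and AWS credentials configured, the call would be:
#   import boto3
#   boto3.client("emr").modify_instance_groups(**payload)
print(payload["InstanceGroups"][0]["InstanceCount"])  # → 2
```

Because task nodes hold no HDFS blocks, shrinking this group triggers no data rebalancing.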

The "pay for what you use" model and EMR's quick, no-configuration launch should encourage you to kill your cluster when you don't need it and launch a new one when you do. You can configure Hive on EMR to store its table metadata in an external MySQL database between cluster launches, to avoid losing or re-creating table definitions.
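As a sketch, an external metastore is configured in hive-site.xml with the standard JDBC connection properties (the host name, database name, and credentials below are placeholders):

```xml
<!-- hive-site.xml: point the metastore at an external MySQL database -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/hive_metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive_user</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive_password</value>
</property>
```

With this in place, every newly launched cluster sees the same table definitions as the one it replaced.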

OTHER TIPS

You could take a look at Themis, an EMR autoscaling framework developed at Atlassian. Its current features include proactive as well as reactive autoscaling; it comes with a Web UI and is very easy to configure.

(Apologies for posting in an old thread, but the answer may still be interesting for readers discovering this thread.)

There is some value in having the data nodes scale up as well. Scaling too far with only task nodes on long-running clusters can result in an HDFS bottleneck (if there is a lot of intermediate data).

Have you considered looking at Qubole? Qubole provides automatic scaling up and down based on load. The user configures a cluster with minimum and maximum slave-node counts; these can be both task nodes and data nodes.

I know I am a little late to the party here, but I have faced a similar problem many times, and I wanted to share one possible alternative. I have written a Java tool to dynamically resize an EMR cluster during processing. It might help someone. Check it out at:

http://www.lopakalogic.com/articles/hadoop-articles/dynamically-resize-emr/

The source code is available on GitHub.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow