Question

I was looking into Hive on AWS (EMR in particular). It provides two options:

  1. Spawning an ad-hoc cluster, where the EMR cluster is torn down after the pre-specified Hive query (supplied in bootstrap) finishes executing.
  2. Spawning a Hive cluster in interactive mode, where one can SSH into the master node and submit Hive queries using the hive command-line client.

Obviously, with the second option the cluster remains alive until explicitly terminated.

I want to modify the number of slave nodes in a keep-alive Hive cluster. I read in the EMR FAQ that it supports both addition and removal of task nodes, but only addition (not removal) of core nodes. Core nodes contribute to HDFS storage, whereas task nodes do not.

I want to add more core nodes to a running cluster and scale them down when fewer queries are being run. Is there a way to achieve this (perhaps using CloudWatch)?


Solution

Scaling up and down with the number of queries is more relevant to the number of task nodes (the compute part of Hadoop) than to the number of core nodes (the data-storage part of Hadoop), since the amount of data is not changing.

Rebalancing and redistributing data every time you want to scale your query capacity up and down is not a good idea: it is too slow and too complex to give any real benefit.
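Resizing the task-node group, by contrast, is a single call to EMR's ModifyInstanceGroups API. A minimal sketch of building the request payload, assuming boto3 is available for the actual call (the instance-group ID below is a placeholder):

```python
def build_resize_request(instance_group_id, target_count):
    """Payload shape expected by EMR's ModifyInstanceGroups API."""
    return {
        "InstanceGroups": [
            {
                "InstanceGroupId": instance_group_id,
                "InstanceCount": target_count,
            }
        ]
    }

# Scale a (hypothetical) task instance group down to 2 nodes.
payload = build_resize_request("ig-TASKGROUP", 2)

# With boto3 installed and AWS credentials configured, the call would be:
#   import boto3
#   boto3.client("emr").modify_instance_groups(**payload)
print(payload["InstanceGroups"][0]["InstanceCount"])  # → 2
```

Because task nodes hold no HDFS blocks, shrinking this group triggers no data rebalancing.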

The "pay for what you use" model and EMR's quick, no-configuration launch should encourage you to kill your cluster when you don't need it and launch a new one when you do. You can configure Hive on EMR to store its table metadata in an external MySQL database between cluster launches, to avoid losing or re-creating table definitions.
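As a sketch, an external metastore is configured in hive-site.xml with the standard JDBC connection properties (the host name, database name, and credentials below are placeholders):

```xml
<!-- hive-site.xml: point the metastore at an external MySQL database -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/hive_metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive_user</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive_password</value>
</property>
```

With this in place, every newly launched cluster sees the same table definitions as the one it replaced.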

OTHER TIPS

You could take a look at Themis, an EMR autoscaling framework developed at Atlassian. Its current features include proactive as well as reactive autoscaling; it comes with a Web UI and is very easy to configure.

(Apologies for posting in an old thread, but the answer may still be interesting for readers discovering this thread.)

There is some value in having the data nodes scale up as well. Scaling too far with only task nodes on long-running clusters can result in an HDFS bottleneck (if there is a lot of intermediate data).

Have you considered looking at Qubole? Qubole provides automatic scaling up and down based on load. The user configures a cluster with minimum and maximum slave-node counts; these can be both task nodes and data nodes.

I know I am a little late to the party here, but I have faced a similar problem many times, and I wanted to share one possible alternative. I have written a Java tool to dynamically resize an EMR cluster during processing. It might help someone. Check it out at:

http://www.lopakalogic.com/articles/hadoop-articles/dynamically-resize-emr/

The source code is available on GitHub.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow