Question

I'm evaluating EC2/EMR for running a ~20-node Hadoop cluster (a custom JAR cluster). I ran the simple WordCount example on a single-node local VMware instance (3.3 GHz, 2 GB RAM), where it completes in less than 10 seconds. On EMR with 2 c1.medium instances, the same example takes 3 minutes, excluding the 3-5 minutes of startup time. It takes the same time on 2 m1.small instances. Some overhead for running a job on EMR is expected, and this problem size may simply be too small, so the result seems understandable.

At about what problem size do you begin to see the performance advantage of the cloud? Or at about how many nodes or compute units?


Solution

Spinning up an EMR job essentially means asking Amazon to provide an on-demand cluster of N machines. Provisioning and handing over those machines can easily take several minutes on its own, and the machines then still need to be set up, may run bootstrap actions, and so on. I've rarely seen EMR jobs (even big ones) take more than 10 minutes to have the cluster ready, but I've also rarely seen a cluster be up in less than a couple of minutes.

If you have a job that runs frequently (for example, every hour), the cost of setting up and shutting down an EMR cluster each time might be too high. In that case it would be a good idea to create your cluster with some reserved instances on EC2. With reserved instances, you have your own cluster that is always up and administered by you, so no time is lost setting up or shutting down the cluster. It behaves like a regular Hadoop cluster.

What I've been doing for the past couple of years is running an always-up EC2 cluster on reserved instances, with all regular jobs running on it. Jobs that are very large and wouldn't fit on that cluster, I run on EMR, where I can choose how many nodes I want. Since these are large jobs, the time to set up and shut down the cluster is small compared to the total runtime. I would not recommend using EMR for small or frequent jobs.
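To make the "startup is small compared to total runtime" argument concrete, here is a rough back-of-the-envelope sketch in Python. The 5-minute startup figure is the upper end of the range mentioned in the question, and the job runtimes are purely hypothetical values chosen for illustration, not measurements.

```python
def startup_overhead_fraction(startup_min, job_min):
    """Fraction of total wall-clock time spent waiting for the cluster to come up."""
    return startup_min / (startup_min + job_min)


# Assumption: 5 minutes of cluster startup (upper end of the 3-5 minutes in the question).
STARTUP_MIN = 5.0

# Hypothetical job runtimes in minutes: a tiny job, a medium job, a large job.
for job_min in (3, 30, 300):
    frac = startup_overhead_fraction(STARTUP_MIN, job_min)
    print(f"{job_min:>4} min job: {frac:.0%} of wall-clock time is cluster startup")
```

Under these assumptions, startup dominates a job of a few minutes and becomes close to negligible for a job of several hours, which is the pattern described above.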

License: CC-BY-SA with attribution
Not affiliated with StackOverflow