Question

I have an EC2 image that I made with Hadoop installed. However, I set it up to be roleless upon instantiation (it isn't a slave or a master). In order to start a Hadoop cluster, I launch as many instances (nodes) as I need on EC2, and then I have to do the following three things to each node:

  1. Update /etc/hosts to contain the necessary IP addresses.
  2. If it is the master node, update $HADOOP_HOME/conf/masters and $HADOOP_HOME/conf/slaves.
  3. Enable SSH access between the nodes.

I'd like to find a way to do this automatically, so that for an arbitrary number of nodes I don't have to go in and configure all of this on each one.

How do other people deal with setting up Hadoop clusters automatically? Is there a way to automate the networking part?

I'm not sure it would be possible, since the IP addresses will be different every time, but I want to know what other people have tried or what is commonly used. Is there a good way to automate these processes, so that every time I set up a cluster for testing I don't have to repeat them for every node? I don't know much about Linux scripting; is this possible with a script? Or will I just have to deal with configuring every node manually?
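
To make this concrete, the kind of script I'm imagining would look something like the sketch below. I have not actually written this; the IPs, login user, key path, and Hadoop location are all placeholders, and the SSH key exchange from step 3 is only noted, not implemented:

```python
#!/usr/bin/env python3
"""Rough sketch only -- the IPs, user, key path, and Hadoop location are placeholders."""
import subprocess

# The IPs change on every launch, so in practice they would have to come
# from somewhere (the EC2 API, a file, command-line arguments, ...).
MASTER = "10.0.0.10"
SLAVES = ["10.0.0.11", "10.0.0.12"]
USER = "ubuntu"                          # placeholder login user on the AMI
KEY = "~/.ssh/my-ec2-key.pem"            # placeholder key pair
HADOOP_CONF = "/usr/local/hadoop/conf"   # placeholder Hadoop install location

def ssh(host, command):
    """Run a single command on a remote node (assumes the key allows passwordless login)."""
    subprocess.check_call(
        ["ssh", "-i", KEY, "-o", "StrictHostKeyChecking=no",
         USER + "@" + host, command])

def main():
    # 1. Append hostname entries for all nodes to /etc/hosts on every node
    #    (assumes the login user has passwordless sudo, as on stock Ubuntu AMIs).
    hosts_block = "\n".join(
        ["{0} master".format(MASTER)]
        + ["{0} slave{1}".format(ip, i) for i, ip in enumerate(SLAVES)])
    for node in [MASTER] + SLAVES:
        ssh(node, "echo '{0}' | sudo tee -a /etc/hosts".format(hosts_block))

    # 2. Write the masters and slaves files on the master node only.
    ssh(MASTER, "echo master | sudo tee {0}/masters".format(HADOOP_CONF))
    slave_names = "\n".join("slave{0}".format(i) for i in range(len(SLAVES)))
    ssh(MASTER, "echo '{0}' | sudo tee {1}/slaves".format(slave_names, HADOOP_CONF))

    # 3. SSH access between the nodes would still need key distribution,
    #    which is not shown here.

if __name__ == "__main__":
    main()
```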


Solution

I have no experience with Hadoop, but in general the task you describe is called "configuration management". You write "recipes" and define "roles" (master, slave) for your servers. Such a role may contain config files for services, packages to be installed, hostname changes, SSH keys, etc. After the servers have initially started up, you tell each one which role it should take, and it configures itself automatically.

There are different tools available for these tasks; examples are Puppet and Salt. A comparison is available on Wikipedia.
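
To illustrate the idea only (this is a plain-Python sketch, not actual Puppet or Salt syntax, and the package names and paths are made up): a role is essentially a declaration of what a server of that type needs, and a freshly booted server just applies whatever its assigned role declares.

```python
# Illustration of the "role" idea only -- real tools (Puppet, Salt, ...) have
# their own declarative formats; the package names and paths here are made up.
import subprocess

ROLES = {
    "master": {
        "packages": ["openjdk-7-jdk"],
        "files": {"/usr/local/hadoop/conf/masters": "master\n"},
    },
    "slave": {
        "packages": ["openjdk-7-jdk"],
        "files": {},
    },
}

def apply_role(role_name):
    """Apply a role to the local machine: install its packages, write its config files.
    Assumes it runs as root on a Debian/Ubuntu system."""
    role = ROLES[role_name]
    for pkg in role["packages"]:
        subprocess.check_call(["apt-get", "install", "-y", pkg])
    for path, content in role["files"].items():
        with open(path, "w") as f:
            f.write(content)

# After boot, each server is told its role exactly once, e.g.:
# apply_role("slave")
```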

Other tips

I went looking to see whether utilities like this exist, but could not find any.

So I built a Hadoop provisioning automation utility using Python, Salt, and Fabric.

There are quite a number of steps involved in getting a Hadoop cluster ready:

  • Spin up the EC2 instances (a sketch of this step is shown below).
  • Create security groups.
  • Set up SSH keys so that the master instance can SSH to the slaves.
  • Install the JDK.
  • Install Hadoop.
  • Designate nodes as namenode, secondary namenode, and slaves, and make the corresponding Hadoop config file changes.
  • Start the services.

Doing all of this by hand for, say, 4 nodes takes about an hour. For the work I want to do, I need to do it repeatedly and often, sometimes with a large number of nodes, hence the need for automation.
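
As an example of how the spin-up step can be scripted against the EC2 API, here is a minimal sketch using boto3. It is not my utility's actual code; the AMI ID, key pair, security group, and instance type are placeholders:

```python
# Minimal sketch of the spin-up step using boto3; the AMI ID, key pair,
# security group, and instance type are placeholders.
import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

def launch_cluster(num_slaves):
    """Launch one master and num_slaves slaves from the pre-built Hadoop AMI."""
    instances = ec2.create_instances(
        ImageId="ami-12345678",          # placeholder: the Hadoop AMI
        InstanceType="m3.large",
        KeyName="hadoop-key",            # placeholder key pair
        SecurityGroups=["hadoop-cluster"],
        MinCount=1 + num_slaves,
        MaxCount=1 + num_slaves,
    )
    # Tag the first instance as the master and the rest as slaves, so later
    # steps (hosts file, masters/slaves config) can find them by role.
    for i, instance in enumerate(instances):
        role = "master" if i == 0 else "slave"
        instance.create_tags(Tags=[{"Key": "hadoop-role", "Value": role}])
        instance.wait_until_running()
    return instances

if __name__ == "__main__":
    for inst in launch_cluster(num_slaves=3):
        inst.reload()  # refresh to pick up the assigned private IP
        print(inst.id, inst.private_ip_address)
```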

For steps that need to be done on every node (e.g. the JDK install, the Hadoop package install), I used Salt for configuration management. Salt provides capabilities similar to Puppet and Chef.
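
For the SSH key step, here is a rough illustration of how Fabric (2.x style) can push the master's public key to the slaves. The hosts, user, and key file are placeholders, and this is not my utility's actual code:

```python
# Rough illustration of the SSH-key step using Fabric 2.x; the hosts, user,
# and key file are placeholders.
from fabric import Connection

MASTER = "10.0.0.10"
SLAVES = ["10.0.0.11", "10.0.0.12"]
USER = "ubuntu"
KEY_FILE = "~/.ssh/my-ec2-key.pem"

def conn(host):
    return Connection(host=host, user=USER,
                      connect_kwargs={"key_filename": KEY_FILE})

def distribute_master_key():
    """Generate a key pair on the master and authorize its public key on every node."""
    master = conn(MASTER)
    # Create a key on the master if one does not exist yet.
    master.run("test -f ~/.ssh/id_rsa || ssh-keygen -t rsa -N '' -f ~/.ssh/id_rsa")
    pubkey = master.run("cat ~/.ssh/id_rsa.pub", hide=True).stdout.strip()

    # Authorize that key on each slave (and on the master itself).
    for host in [MASTER] + SLAVES:
        conn(host).run("echo '{0}' >> ~/.ssh/authorized_keys".format(pubkey))

if __name__ == "__main__":
    distribute_master_key()
```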

Please feel free to check out https://github.com/varmarakesh/aws-hadoop

If you already have an AWS account, it is designed to be easy to set up and run.

License: CC-BY-SA with attribution