Question

I am planning to implement a hadoop cluster with about 5 machines. With some background study, I understood that I need to install hadoop on each of those machines in order to implement the cluster.

Earlier I was planning to install a Linux distribution on each of these machines, and then install hadoop separately, and configure each machine to work in parallel.

Recently I came through some Hadoop distributions, such as Cloudera and Hortonworks. My question is, should I install a distribution such as Cloudera or Hortonworks in each of those machines, or should I install hadoop separately as I described earlier?

Will using a distribution make my task easier or would it need more knowledge to handle them than pure hadoop installation?

Était-ce utile?

La solution

I'm a beginner in Hadoop too (~1.5 month), using a distribution can be very helpful if you use the automated way to install (Cloudera Manager for Cloudera or Ambari for Hortonworks). It install and deploy Hadoop and services you choose (hive, impala, spark, hue ...) on all the cluster very quickly. The main disadvantages in my opinion is that you can't really optimize and personalize your installation but for a first time it's much easier to run some simple cases.

Autres conseils

I would highly recommend using a distro rather than doing it manually. Even using a distro will be complicated the first time as there are a lot of separate services that need to be running depending on what you want into addition to a base Hadoop install.

Also, do you intend to have a cluster size of just 5 machines? If so Hadoop may not be the right solution for you. You could potentially run all the masters on a single server and have a 4 node cluster, but that is probably not going to perform all that well. Note that the typical redundancy for HDFS is 3, so 4 nodes is just barely enough. If one or two machines goes down you could easily lose data in a production cluster. Personally I would recommend at least 8 nodes and one or two servers for the masters, so a total cluster size of 9 or 10, preferably 10.

Licencié sous: CC-BY-SA avec attribution
Non affilié à StackOverflow
scroll top