Question

Hadoop will run many jobs that read data from HBase and write data back to HBase. Suppose I have 100 nodes. There are two ways I can build my Hadoop/HBase cluster:

  1. A single 100-node cluster running both Hadoop and HBase (1 big Hadoop & HBase cluster)

  2. Separate the database (HBase), giving two clusters: a 60-node Hadoop cluster and a 40-node HBase cluster (1 Hadoop + 1 HBase)

Which option is better? Why?

Thanks.

Solution

I would say option 2 is better.
My reasoning: even though your workload is mostly MapReduce jobs that read from and write to HBase, a lot goes on behind the scenes for HBase to optimise those reads and writes. The HMaster will have to rebalance regions often unless your region keys are perfectly balanced. Tables can develop hotspots. On the RegionServers there will be major compactions, and if your JVM tuning is not that good, occasional stop-the-world garbage collection pauses can happen. All the regions may start splitting at the same time, a RegionServer can go down, and so on.
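To illustrate the point about balanced region keys and hotspotting: one common mitigation when natural keys are sequential is to salt the row key with a hash-derived prefix. The following is only a minimal sketch, not an HBase API. The function names, the bucket count of 16, and the `NN|key` layout are assumptions made for illustration.

```python
import hashlib

# Hypothetical bucket count; in practice this is often chosen near the
# number of regions or RegionServers that should share the write load.
NUM_BUCKETS = 16


def salted_row_key(natural_key: str, num_buckets: int = NUM_BUCKETS) -> bytes:
    """Prefix the key with a stable, hash-derived bucket number.

    The same natural key always maps to the same bucket, so point reads
    can recompute the salted key without any lookup table.
    """
    digest = hashlib.md5(natural_key.encode("utf-8")).digest()
    bucket = digest[0] % num_buckets
    return f"{bucket:02d}|{natural_key}".encode("utf-8")


def bucket_histogram(keys, num_buckets: int = NUM_BUCKETS):
    """Count how many keys land in each bucket (a rough balance check)."""
    counts = [0] * num_buckets
    for key in keys:
        salted = salted_row_key(key, num_buckets)
        bucket = int(salted[:2].decode("ascii"))
        counts[bucket] += 1
    return counts


if __name__ == "__main__":
    # Sequential keys such as timestamps or counters would otherwise
    # all be written to the same region.
    sequential_keys = [f"event-{i:08d}" for i in range(10_000)]
    print(bucket_histogram(sequential_keys))
```

The trade-off is that a range scan over the natural key order now has to fan out across every bucket, so this suits write-heavy workloads better than scan-heavy ones.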
The main point is that tuning HBase takes time. If you have just one node dedicated to HBase, the probability of the aforementioned problems is higher. It's always better to have more than one node, so that all the performance pressure doesn't fall on a single node. And by the way, HBase's selling point is its inherently distributed nature; you wouldn't want to kill that.
All that said, you can experiment with the ratio of nodes between Hadoop and HBase, maybe 70:30 or 80:20. Your mileage may vary according to your application's requirements.

OTHER TIPS

The main reason to separate HBase and Hadoop is when they have different usage scenarios, i.e. HBase serves low-latency random reads and writes while Hadoop runs sequential batch jobs. In this case the different access patterns can interfere with each other, and it can be better to separate the clusters.

If you're just using HBase in batch mode, you can use the same cluster (and you should probably rethink using HBase at all, since it is slower than raw Hadoop for batch work).

Note that you would need to tune HBase along the lines mentioned by Chandra Kant regardless of the path you take.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow