Question

I am new to Apache Hadoop. I have an Apache Hadoop cluster of 3 nodes. I am trying to load a file with 4.5 billion records, but it is not getting distributed to all nodes. The behavior looks like region hotspotting.

I have removed the "hbase.hregion.max.filesize" parameter from the hbase-site.xml config file.

I observed that if I use a 4-node cluster, it distributes data to 3 nodes, and if I use a 3-node cluster, it distributes data to 2 nodes.

I think I am missing some configuration.

Solution

Generally, with HBase the main issue is designing rowkeys that are not monotonically increasing. If they are, all writes land in the same region, so only one region server is used at a time: http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/
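One common way to break up monotonically increasing keys is to "salt" them with a short prefix derived from a hash of the key. The following is only a minimal sketch of that idea in plain Python, not something from the original setup: the function name salted_rowkey, the bucket count of 10, and the two-digit "NN-" prefix format are all assumptions made for illustration.

```python
import hashlib

NUM_BUCKETS = 10  # assumed value; pick it to match the number of regions you want


def salted_rowkey(natural_key, num_buckets=NUM_BUCKETS):
    """Return the key prefixed with a deterministic two-digit bucket.

    The bucket comes from a hash of the key, so consecutive natural keys
    no longer sort next to each other. (Hypothetical helper, for illustration.)
    """
    digest = hashlib.md5(natural_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % num_buckets
    return f"{bucket:02d}-{natural_key}"


# Sequential keys end up spread over the buckets instead of one hot range.
keys = [salted_rowkey(f"{i:010d}") for i in range(1000)]
buckets_used = {k[:2] for k in keys}
```

Because the prefix is computed from the key itself, a reader that knows the natural key can recompute the full rowkey for a point lookup. The trade-off is that a range scan over natural keys has to be issued once per bucket.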

This is the HBase Reference Guide section on rowkey design:
http://hbase.apache.org/book.html#rowkey.design

And one more really good article:
http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/

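In our case, pre-splitting the table into regions at creation time also improved the loading time, because the regions can be spread across the region servers before the load starts: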

create 'Some_table', { NAME => 'fam'}, {SPLITS=> ['a','d','f','j','m','o','r','t','z']}
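The split points have to match how your rowkeys actually start, so the letters above only help if keys begin with letters. As a rough sketch, assuming rowkeys that begin with a two-digit bucket prefix such as "00-", "01-" and so on (an assumption, not part of the original answer), the matching shell statement could be generated like this. The helper names split_points and create_statement are hypothetical.

```python
def split_points(num_buckets):
    """Boundaries between two-digit bucket prefixes: '01', '02', ...

    N buckets need N - 1 split points, which gives N regions.
    """
    return [f"{b:02d}" for b in range(1, num_buckets)]


def create_statement(table, family, splits):
    """Build an HBase shell 'create' statement with the given split points."""
    quoted = ",".join(f"'{s}'" for s in splits)
    return f"create '{table}', {{NAME => '{family}'}}, {{SPLITS => [{quoted}]}}"


# Print a statement that can be pasted into the HBase shell.
print(create_statement("Some_table", "fam", split_points(10)))
```

The nine split points in the statement above likewise produce ten regions.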

Regards
Pawel

Licensed under: CC-BY-SA with attribution