Question

I am new to Apache Hadoop. I have an Apache Hadoop cluster of 3 nodes. I am trying to load a file with 4.5 billion records into HBase, but the data is not being distributed to all nodes. The behavior looks like region hotspotting.

I have removed the "hbase.hregion.max.filesize" parameter from the hbase-site.xml config file.

I observed that with a 4-node cluster the data is distributed to 3 nodes, and with a 3-node cluster it is distributed to 2 nodes.

I think I am missing some configuration.

Solution

Generally, with HBase the main issue is to design rowkeys that are not monotonically increasing. If they are, consecutive rows all land in the same region, so only one region server is used at a time: http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/

The HBase Reference Guide has a section on rowkey design:
http://hbase.apache.org/book.html#rowkey.design
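
One common way to break up monotonically increasing keys is salting: prepend a small bucket prefix derived from a hash of the key. Below is a minimal sketch of that idea in Python. The function name `salted_rowkey`, the bucket count of 9, and the `NN-key` layout are all illustrative assumptions, not something taken from the question or from HBase itself.

```python
import hashlib

# Hypothetical bucket count; in practice it would be chosen to match
# the number of regions you want writes spread across.
NUM_BUCKETS = 9


def salted_rowkey(key: str, num_buckets: int = NUM_BUCKETS) -> str:
    """Prefix the key with a stable, hash-derived bucket number.

    The same key always maps to the same bucket, so a point lookup
    can recompute the prefix. Sequential keys end up in different
    buckets, which spreads writes over several regions.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % num_buckets
    return f"{bucket:02d}-{key}"


if __name__ == "__main__":
    # Sequential ids no longer sort next to each other.
    for i in range(5):
        print(salted_rowkey(f"{i:010d}"))
```

The trade-off is that a range scan over the original key order now needs one scan per bucket, so this fits write-heavy loads better than scan-heavy ones.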

And one more really good article:
http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/

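In our case, pre-splitting the table into regions at creation time also improved the loading time. The split points must match how your rowkeys actually start; the 9 split points below give 10 regions: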

create 'Some_table', { NAME => 'fam'}, {SPLITS=> ['a','d','f','j','m','o','r','t','z']}
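
If the rowkeys start with a numeric bucket prefix rather than letters, the split points can be generated instead of typed by hand. This is only a sketch under that assumption: the helper names `split_points` and `create_statement` are made up for illustration, and the zero-padded two-digit prefix is a hypothetical key layout. The output is just the text of a shell `create` statement in the same form as above.

```python
from typing import List


def split_points(num_regions: int, width: int = 2) -> List[str]:
    """Return num_regions - 1 zero-padded split points.

    With keys prefixed "00".."08", the points "01".."08" give
    9 regions, one per prefix.
    """
    return [f"{i:0{width}d}" for i in range(1, num_regions)]


def create_statement(table: str, family: str, splits: List[str]) -> str:
    """Format an HBase shell create statement with explicit SPLITS."""
    quoted = ",".join(f"'{s}'" for s in splits)
    return f"create '{table}', {{ NAME => '{family}'}}, {{SPLITS=> [{quoted}]}}"


if __name__ == "__main__":
    # Paste the printed line into the HBase shell.
    print(create_statement("Some_table", "fam", split_points(9)))
```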

Regards
Pawel

Licensed under: CC-BY-SA with attribution