Question

I have a huge HBase table of about half a billion rows, with about 100 columns of data (the exact set of columns varies per row).

I would like to query this data, based on any column qualifier value, as fast as possible.

I know that HBase is optimized for fast reads when the row key is known, but I want to query based on column values. Applying column filters (using the Java API) leads to full table scans, which slows the system down.
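
For reference, the kind of filter scan I mean looks roughly like this (a minimal sketch with the HBase 2.x client; the table, family, and qualifier names are made up):

    import java.io.IOException;
    import org.apache.hadoop.hbase.CompareOperator;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ValueFilterScan {
        public static void main(String[] args) throws IOException {
            try (Connection conn = ConnectionFactory.createConnection();
                 Table table = conn.getTable(TableName.valueOf("mytable"))) {
                // Match rows where cf:some_column == "some_value". HBase still
                // has to read every row to evaluate the filter, so this is a
                // full table scan regardless of how selective the filter is.
                SingleColumnValueFilter filter = new SingleColumnValueFilter(
                        Bytes.toBytes("cf"),
                        Bytes.toBytes("some_column"),
                        CompareOperator.EQUAL,
                        Bytes.toBytes("some_value"));
                // Skip rows that don't have this column (columns vary per row).
                filter.setFilterIfMissing(true);

                Scan scan = new Scan();
                scan.setFilter(filter);
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result r : scanner) {
                        System.out.println(Bytes.toString(r.getRow()));
                    }
                }
            }
        }
    }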

What are my options?

  • INDEXING: The set of columns present varies from row to row. Can I still build an index?
  • Do I continue to use HBase to store the data, or use it along with Solr or Elasticsearch?
  • What sort of performance can I expect for random queries on arbitrary column values across maybe a billion rows?

Any other suggestions are welcome.

No correct solution

OTHER TIPS

Getting data by row key is fast in HBase, but since values are not indexed, querying with a value filter is very slow. If the number of columns to be indexed is small, you can consider a reversed index table: a second table whose row key is built from the indexed value, pointing back to the row keys in the main table.
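
A minimal sketch of that pattern with the HBase 2.x Java client (the table names, separator byte, and indexed column are all made up for illustration):

    import java.io.IOException;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ReversedIndexExample {
        static final byte[] CF = Bytes.toBytes("cf");
        // Key separator; assumes indexed values never contain a NUL byte.
        static final byte[] SEP = Bytes.toBytes("\u0000");

        public static void main(String[] args) throws IOException {
            try (Connection conn = ConnectionFactory.createConnection();
                 Table data = conn.getTable(TableName.valueOf("mytable"));
                 Table index = conn.getTable(TableName.valueOf("mytable_idx"))) {

                // On write: put the data row, then an index row whose key is
                // <column value> + separator + <data row key>.
                byte[] rowKey = Bytes.toBytes("row-42");
                byte[] value = Bytes.toBytes("some_value");
                data.put(new Put(rowKey)
                        .addColumn(CF, Bytes.toBytes("some_column"), value));
                byte[] indexKey = Bytes.add(value, SEP, rowKey);
                index.put(new Put(indexKey)
                        .addColumn(CF, Bytes.toBytes("ref"), rowKey));

                // On read: a prefix scan on the small index table finds all
                // matching data row keys -- no full scan of the main table.
                Scan scan = new Scan().setRowPrefixFilter(Bytes.add(value, SEP));
                try (ResultScanner scanner = index.getScanner(scan)) {
                    for (Result r : scanner) {
                        byte[] dataKey = r.getValue(CF, Bytes.toBytes("ref"));
                        System.out.println(Bytes.toString(dataKey));
                    }
                }
            }
        }
    }

Note that the two puts are not atomic, so the index can briefly disagree with the data after a failure; if you want this handled for you, Apache Phoenix builds exactly this kind of secondary index on top of HBase.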

But if you want more, like multi-criteria queries, you should have a look at Elasticsearch and use it to store only the index on your columns, keeping the data itself in HBase. Don't forget to disable the source store with "_source": {"enabled": false} when creating your index; all your data is already in HBase, so don't waste your HDD :)
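
For example, with the Elasticsearch low-level Java REST client (the index name and field are made up, and the mapping syntax assumes Elasticsearch 7+):

    import java.io.IOException;
    import org.apache.http.HttpHost;
    import org.elasticsearch.client.Request;
    import org.elasticsearch.client.RestClient;

    public class CreateIndexNoSource {
        public static void main(String[] args) throws IOException {
            try (RestClient client = RestClient.builder(
                    new HttpHost("localhost", 9200, "http")).build()) {
                Request req = new Request("PUT", "/mytable_idx");
                // With _source disabled, Elasticsearch keeps only the inverted
                // index, not a copy of the documents -- the data stays in HBase.
                req.setJsonEntity(
                    "{\n" +
                    "  \"mappings\": {\n" +
                    "    \"_source\": { \"enabled\": false },\n" +
                    "    \"properties\": {\n" +
                    "      \"some_column\": { \"type\": \"keyword\" }\n" +
                    "    }\n" +
                    "  }\n" +
                    "}");
                client.performRequest(req);
            }
        }
    }

A common pattern is to use the HBase row key as the Elasticsearch document _id: a search then returns exactly the keys you need for fast HBase gets, with no document source stored on the Elasticsearch side.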

Licensed under: CC-BY-SA with attribution