Question

Have a background in GAE's Big Table. From what I have read, HBase is the open source version of Big Table and should be very comparable in its features.

Using Big Table, this object could be indexed and queried in O(log n) time:

Object

widget{
     names:     ['spike', 'cheeta', 'badger']
     counts:    [4, 6, 7]
     size:      387331209
}

Query

SELECT * FROM widget_table WHERE names = 'spike' AND counts = 6 ORDER BY size

I have been poring over the HBase documentation for a few hours and still can't find a definitive answer to this question:

Question:

Can I use HBase to perform a search with one non-equality operator and two or more equality operators in roughly O(log n) time?

This is possible in GAE's Big Table, as referenced here: https://developers.google.com/appengine/docs/python/datastore/queries#Restrictions_on_Queries

Thanks so much!


Solution

Chris, maybe this will help you at least somewhat. In HBase everything depends on your row key design (look especially at the OpenTSDB case). For example, in your case the key might look like the following:

[name-code] [counts-code] [...]

In this case you can easily select the range of all records with a certain name / counts in O(log n) complexity. If the key doesn't include a component calculated from size, you will have O(n) complexity when searching for a certain size. If the key includes size (or at least some calculation based on size), this speeds up the process, because it lets you narrow the range in O(log n).
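To make the key layout above concrete, here is a minimal sketch in plain Python that simulates a sorted key space with a list and `bisect`; it does not talk to HBase. The byte encoding, the `row_key`, `index_rows` and `prefix_scan` helpers, and the idea of writing one index row per (name, count) pair are all assumptions for illustration, not something HBase does for you.

```python
import bisect
import struct
from itertools import product


def row_key(name, count, size):
    """Build a byte key that sorts by name, then count, then size."""
    # The NUL byte terminates the name; big-endian unsigned integers sort
    # the same way as bytes and as numbers.
    return name.encode("utf-8") + b"\x00" + struct.pack(">IQ", count, size)


def index_rows(widget_id, widget):
    """One index row per (name, count) pair (an assumed design)."""
    # A real table would also need a unique suffix in the key so that two
    # widgets with the same name, count and size do not collide.
    return [
        (row_key(name, count, widget["size"]), widget_id)
        for name, count in product(widget["names"], widget["counts"])
    ]


def prefix_scan(sorted_rows, name, count):
    """Binary-search to the start key, then read sequentially to the stop key."""
    prefix = name.encode("utf-8") + b"\x00" + struct.pack(">I", count)
    start = bisect.bisect_left(sorted_rows, (prefix,))
    # Nine 0xff bytes sort after any 8-byte size field.
    stop = bisect.bisect_left(sorted_rows, (prefix + b"\xff" * 9,))
    return [widget_id for _, widget_id in sorted_rows[start:stop]]


widgets = {
    "w1": {"names": ["spike", "cheeta", "badger"], "counts": [4, 6, 7], "size": 387331209},
    "w2": {"names": ["spike"], "counts": [6], "size": 120},
    "w3": {"names": ["spike"], "counts": [5], "size": 50},
}
rows = sorted(r for wid, w in widgets.items() for r in index_rows(wid, w))

result = prefix_scan(rows, "spike", 6)  # ['w2', 'w1'], already ordered by size
```

Because size is the last key component, the matching rows come back in size order without a separate sort; locating the range costs O(log n) and reading it costs time proportional to the number of matches.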

HBase is a very straightforward tool that lets you do magical things, but only if you really know how it works; and yes, it is something like a 'raw engine' with minimal abstraction.

Please also note that if you have a lot of records per names / counts field value, you will probably need to balance the load of such requests across the cluster nodes, which affects your table / row key design. For example, I currently have a design where a linear full scan of the table with perfect load balancing is better than a limited scan without balancing.
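One common way to spread such a hot key range is to prefix ("salt") the key with a bucket number. The sketch below is only an illustration under assumptions: `NUM_BUCKETS`, the use of CRC32 on a hypothetical widget id, and all helper names are made up here, and nothing in it calls HBase.

```python
import zlib

NUM_BUCKETS = 4  # hypothetical; would be chosen to match the cluster


def bucket_for(widget_id):
    """Deterministically assign a record to a bucket."""
    return zlib.crc32(widget_id.encode("utf-8")) % NUM_BUCKETS


def salted_key(key, widget_id):
    """Prefix the key with one bucket byte so writes spread across buckets."""
    return bytes([bucket_for(widget_id)]) + key


def prefix_stop(prefix):
    """Smallest key greater than every key that starts with prefix.

    Assumes prefix contains at least one byte other than 0xff.
    """
    trimmed = prefix.rstrip(b"\xff")
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def salted_scan_ranges(prefix):
    """A prefix query now needs one (start, stop) range per bucket."""
    return [
        (bytes([b]) + prefix, prefix_stop(bytes([b]) + prefix))
        for b in range(NUM_BUCKETS)
    ]


query_prefix = b"spike\x00\x00\x00\x00\x06"
ranges = salted_scan_ranges(query_prefix)
example_key = salted_key(query_prefix + b"\x00" * 8, "w1")
```

The trade-off is the one described above: each logical query becomes `NUM_BUCKETS` smaller scans that can run in parallel on different nodes, instead of one scan that hits a single node.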

Other tips

I agree with Roman.

HBase

  • is a distributed key/value store

  • has no built-in index structure (apart from third-party tools as described here)

  • has no built-in query language support (using Hive may ease this, but it keeps you from using the data stored in HBase from a programming language without third-party library support. Alternatively, you can use HCatalog instead of the Hive/Pig gang, but this turns it into an ordinary RDBMS with a seek latency for every row, as RDBMS platforms incur with their B-tree-like structures)

  • is very good at batch reads by rowkey (the only built-in index available); if you design your rowkey well, you need only one fast seek to the startrowkey and can then read sequentially, at the disk transfer rate, up to the stoprowkey.

If you can design your data this way, HBase will be very well suited to it.

Apart from that, you can of course filter your data, whether the filter is on the rowkey or on the payload. But if there is no startrowkey or stoprowkey, the query (or map/reduce job, if one is used) has to read the entire data set, even if you put filters on the payload or on the rowkey.
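The difference between a filter and a start/stop key range can be seen by counting how many rows each approach has to examine. This is a rough simulation over an in-memory sorted list, not HBase code; the `user00042`-style keys and both function names are invented for the example.

```python
import bisect


def filtered_full_scan(rows, predicate):
    """No start/stop key: every row is read, the filter only drops results."""
    examined = 0
    hits = []
    for key, value in rows:
        examined += 1
        if predicate(key, value):
            hits.append(key)
    return hits, examined


def range_scan(rows, start_key, stop_key):
    """Seek to start_key, then read sequentially until stop_key (exclusive)."""
    lo = bisect.bisect_left(rows, (start_key,))
    hi = bisect.bisect_left(rows, (stop_key,))
    return [key for key, _ in rows[lo:hi]], hi - lo


# 10,000 rows sorted by key, standing in for a table ordered by rowkey.
table = sorted((f"user{i:05d}".encode(), i) for i in range(10_000))

start, stop = b"user00100", b"user00110"
scan_hits, scan_examined = range_scan(table, start, stop)
filter_hits, filter_examined = filtered_full_scan(
    table, lambda key, _value: start <= key < stop
)
```

Both calls return the same ten keys, but the range scan touches only those ten rows while the filtered scan reads all 10,000.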

So you must consider these when you make your evaluations.

PS: Because of this, rowkey design and the choice of startrowkey and stoprowkey are crucial. You may create a compound rowkey, but in that case the order of the fields is very important.
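A small sketch of why field order matters in a compound rowkey: with keys ordered as (name, count), all rows for one name sit next to each other, while rows for one count are scattered through the table. Tuples stand in for encoded byte keys here, and the helper names are made up for the illustration.

```python
def matching_positions(keys, predicate):
    """Positions, in sorted key order, of the keys that match predicate."""
    return [i for i, key in enumerate(sorted(keys)) if predicate(key)]


def is_contiguous(positions):
    """True if the positions form one unbroken run (one start/stop range)."""
    return all(b - a == 1 for a, b in zip(positions, positions[1:]))


# Compound keys with name as the leading field and count as the second.
keys = [
    (name, count)
    for name in ("badger", "cheeta", "spike")
    for count in (4, 6, 7)
]

by_name = matching_positions(keys, lambda k: k[0] == "spike")
by_count = matching_positions(keys, lambda k: k[1] == 6)

name_is_one_range = is_contiguous(by_name)    # True: a single range scan works
count_is_one_range = is_contiguous(by_count)  # False: matches are scattered
```

A query on the leading field maps to a single start/stop range; a query on a later field alone does not, so it would need either a full scan with a filter or a second table keyed the other way round.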

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow