How to store data in HBase for efficient fetching with Partial Key scanning?

https://stackoverflow.com/questions/7040487

23-12-2020

Question

My key has three components: num, type, name

The 'type' is only of two kinds A and B while num can have more values e.g. 0,1,2..,30

I have to fetch data with respect to num and type i.e. fetch all rows which have keys with the specified num and type.

I can either store data in the form: 1. num|type|name or 2. type|num|name

Considering how HBase scans through data if I use partial key scanning, which is the best strategy to store data?

This is how I will set up my partial key scan:

For 1.

scan.setStartRow(Bytes.toBytes(num));
scan.setStopRow(Bytes.toBytes(num + 1));

For 2.

scan.setStartRow(Bytes.toBytes(type + "|" + num));
scan.setStopRow(Bytes.toBytes(type + "|" + (num + 1)));

Solution

First, I would recommend against using a pipe as the delimiter: it is ASCII 124, which falls after all letters and digits, so the sort order will not be what you expect (unless you left-pad everything, but that makes for overly large keys). For HBase row key delimiters, use a character that sorts lexicographically before all of your valid key characters, so that correct sorting is preserved. Tab, at ASCII 9, works well.
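To make the delimiter point concrete, here is a small sketch in plain Java (no HBase dependency) that compares keys byte by byte as unsigned values, which is how row keys are ordered. The class name, the helper, and the sample keys are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class DelimiterOrder {
    // Hypothetical helper: compare two keys byte by byte as unsigned values;
    // on a shared prefix the shorter key sorts first.
    static int compareKeys(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.US_ASCII);
        byte[] y = b.getBytes(StandardCharsets.US_ASCII);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int diff = (x[i] & 0xff) - (y[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return x.length - y.length;
    }

    public static void main(String[] args) {
        // Pipe (124) sorts after the digit '0' (48), so num 10 lands before num 1.
        System.out.println(compareKeys("1|A|foo", "10|A|foo") > 0);     // true
        // Tab (9) sorts before every digit, so num 1 stays ahead of num 10.
        System.out.println(compareKeys("1\tA\tfoo", "10\tA\tfoo") < 0); // true
    }
}
```

Note that the tab only fixes the shared-prefix case: "10" still sorts before "2" as text unless num is padded or stored as fixed-width bytes.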

Considering that type has only two valid values, and assuming a random distribution, I would go with num, type. This lets you select on num alone if you need to in the future. With the reverse order (type, num), selecting on num alone takes two fetches, once for type 'A' and again for type 'B'. Not the most efficient.

If you will rarely select on just the number, then it does make sense to go with type, num, as that is the most selective at the row level, if inflexible.
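A rough way to picture the trade-off, without a cluster, is to stand in for the sorted table with a `TreeMap` and treat a partial key scan as a range over [prefix, prefix with its last character bumped by one). This is only a simulation under assumptions: the class, the helpers, the zero-padded num values, and the sample rows are all invented for the example.

```java
import java.util.SortedMap;
import java.util.TreeMap;

class LayoutComparison {
    // Smallest key that sorts after every key starting with prefix:
    // bump the last character by one (sufficient for the ASCII keys used here).
    static String stopKey(String prefix) {
        int last = prefix.length() - 1;
        return prefix.substring(0, last) + (char) (prefix.charAt(last) + 1);
    }

    // Simulated partial key scan: start row inclusive, stop row exclusive.
    static SortedMap<String, String> prefixScan(TreeMap<String, String> table, String prefix) {
        return table.subMap(prefix, stopKey(prefix));
    }

    public static void main(String[] args) {
        TreeMap<String, String> numFirst = new TreeMap<>();
        TreeMap<String, String> typeFirst = new TreeMap<>();
        // Sample rows as {num, type, name}; num is zero-padded (an assumption).
        String[][] rows = {{"07", "A", "x"}, {"07", "B", "y"}, {"08", "A", "z"}};
        for (String[] r : rows) {
            numFirst.put(r[0] + "\t" + r[1] + "\t" + r[2], "v");
            typeFirst.put(r[1] + "\t" + r[0] + "\t" + r[2], "v");
        }
        // num + type: a single contiguous range in either layout.
        System.out.println(prefixScan(numFirst, "07\tA\t").size());   // 1
        System.out.println(prefixScan(typeFirst, "A\t07\t").size());  // 1
        // num only: one range with num first ...
        System.out.println(prefixScan(numFirst, "07\t").size());      // 2
        // ... but one range per type with type first.
        int total = prefixScan(typeFirst, "A\t07\t").size()
                + prefixScan(typeFirst, "B\t07\t").size();
        System.out.println(total);                                    // 2
    }
}
```

Both layouts return the same rows for num 07; the difference is that the type-first layout needs a separate range per type to get there.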

Really you should try them both out and see what works best with your data.

Other tips

There are a couple of approaches you can take.

1) Choose whichever layout you will be scanning more frequently. For the less frequent scan type, do a full scan (or restrict it to a range if you can) and use a row filter that filters out everything but the items you want. Regarding filters: http://hbase.apache.org/apidocs/index.html

2) You can duplicate your data by storing it twice (once under each row key layout). This will slow writes, but help reads a lot if you scan on both. Of course, disk usage is also doubled.

3) You can construct an index with the alternative row names to point to the relevant rows.

Which approach you take will depend heavily on the access patterns of your data and your read/write ratio.
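For what it's worth, option 3 could look roughly like the sketch below, again with in-memory `TreeMap`s standing in for the main table and the index table. Everything here (class name, methods, key layout, sample values) is hypothetical and only meant to show the shape of the idea: the index stores the main row key rather than a second copy of the data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class IndexedTable {
    // Main table keyed num<TAB>type<TAB>name; index keyed type<TAB>num<TAB>name.
    final TreeMap<String, String> main = new TreeMap<>();
    final TreeMap<String, String> index = new TreeMap<>();

    // Every write touches both tables, which is the extra write cost.
    void put(String num, String type, String name, String value) {
        String rowKey = num + "\t" + type + "\t" + name;
        main.put(rowKey, value);
        index.put(type + "\t" + num + "\t" + name, rowKey);
    }

    // Read by type alone: range-scan the index, then look up each main row.
    List<String> valuesForType(String type) {
        List<String> out = new ArrayList<>();
        // '\n' is '\t' + 1, so [type\t, type\n) covers every key with this type prefix.
        for (String rowKey : index.subMap(type + "\t", type + "\n").values()) {
            out.add(main.get(rowKey));
        }
        return out;
    }

    public static void main(String[] args) {
        IndexedTable t = new IndexedTable();
        t.put("07", "A", "x", "v1");
        t.put("07", "B", "y", "v2");
        t.put("08", "A", "z", "v3");
        System.out.println(t.valuesForType("A")); // [v1, v3]
    }
}
```

Compared with option 2, the index rows stay small, at the price of an extra lookup per row on the indexed read path.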

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow