Question

I want to run aggregate queries (say, counting how many times the value '1' repeats within a column/column family for certain row keys). I would like to run these queries directly on the compressed data, since skipping decompression entirely should give better performance.

I am currently using HBase's aggregation client, and it looks like (I haven't checked the source code yet, so I may be wrong) HBase finds the row keys using its B-tree index, decompresses the containing block, and reads the data sequentially. Is there a way to skip the decompression step?

It looks like Redshift (ParAccel) also does aggregation the same way.

How can I make HBase compute the count by working only on compressed data? If that is not possible, are there other columnar databases that provide this functionality?


Solution

Redshift employs "zone maps" to avoid unnecessary decompression. The min/max/count (plus a few other statistics) of each column are stored for each compressed 1 MB block.

The content of each block is determined by the sort key. If your query aligns with the sort key (and can be answered from the zone map), Redshift will not decompress the data unnecessarily.
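To make the idea concrete, here is a minimal toy sketch of zone-map pruning (illustrative only; the block layout, helper names, and 5-row "blocks" are my own assumptions, not Redshift internals):

```python
import json
import zlib
from dataclasses import dataclass

@dataclass
class ZoneMap:
    min_val: int
    max_val: int
    count: int  # number of rows in the block

@dataclass
class Block:
    zone_map: ZoneMap
    data: bytes  # compressed column values

def make_block(values):
    zm = ZoneMap(min(values), max(values), len(values))
    return Block(zm, zlib.compress(json.dumps(values).encode()))

def count_equal(blocks, target):
    total = 0
    for blk in blocks:
        zm = blk.zone_map
        if target < zm.min_val or target > zm.max_val:
            continue  # zone map rules the block out: skip decompression
        if zm.min_val == zm.max_val == target:
            total += zm.count  # answered entirely from the zone map
        else:
            values = json.loads(zlib.decompress(blk.data))
            total += values.count(target)  # partial overlap: must decompress
    return total

# Sorting on the queried column (the "sort key") clusters equal values,
# so most blocks are either skipped or counted from metadata alone.
rows = sorted([1, 1, 1, 2, 3, 1, 1, 5, 7, 1])
blocks = [make_block(rows[i:i + 5]) for i in range(0, len(rows), 5)]
print(count_equal(blocks, 1))  # 6
```

Note how the first block (all 1s) is counted from its zone map without touching the compressed bytes; only the block whose min/max range straddles the target needs decompressing.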

All of the above is AFAIK, based on reading the docs and extensive use. YMMV, of course.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow