Question

I only need the row count of a scan result. The problem with the code below is that it returns the row key and the column key for every match, which makes it very slow when shipping the data to the client. I only need to send the row count from the scan result to the client. Is there a specific way to do this directly?

scan 'consumer-data', {FILTER => "
PrefixFilter('test_row') 
AND KeyOnlyFilter() 
AND FirstKeyOnlyFilter() 
AND (ColumnPrefixFilter('test_col:test1') 
AND ValueFilter(=, 'binary:test 1')) 
"}

Any help would be appreciated.


Solution

The code you wrote is very slow. First, a scan runs sequentially (no map/reduce), so it is slow to begin with. Then you use two slow filters: one that looks at column names and, worse, one that actually looks at values. What you get is a one-by-one sequential read that examines each column, and the value of every matching column.

If you expect to run queries like this on a regular basis, you should rethink your row key. Also, redo this as a map/reduce job so that at least the work is divided across the cluster.
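One small step that fits the advice about the key: the query already restricts rows by the prefix test_row, so the scan can be bounded by a start row and a stop row instead of relying on PrefixFilter alone, which by itself may not make the scan start at the prefix. Below is a minimal sketch in plain Java of how the exclusive stop row for a prefix can be computed. The class and method names are hypothetical and are not part of the HBase API; the resulting bytes would be passed to whatever start/stop row settings your client version offers.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not an HBase class.
class PrefixRange {

    /**
     * Returns the smallest row key that sorts after every key starting with
     * the given prefix. Returns an empty array when no such key exists
     * (empty prefix, or a prefix made only of 0xFF bytes); HBase treats an
     * empty stop row as "scan to the end of the table".
     */
    static byte[] stopRowForPrefix(byte[] prefix) {
        // Trailing 0xFF bytes cannot be incremented, so drop them first.
        int end = prefix.length;
        while (end > 0 && prefix[end - 1] == (byte) 0xFF) {
            end--;
        }
        if (end == 0) {
            return new byte[0];
        }
        // Copy the remaining bytes and increment the last one.
        byte[] stop = Arrays.copyOf(prefix, end);
        stop[end - 1]++;
        return stop;
    }

    public static void main(String[] args) {
        byte[] start = "test_row".getBytes(StandardCharsets.UTF_8);
        byte[] stop = stopRowForPrefix(start);
        // Prints "test_rox": rows in [test_row, test_rox) all share the prefix.
        System.out.println(new String(stop, StandardCharsets.UTF_8));
    }
}
```

Bounding the scan this way only narrows the rows that are read; the column and value filters still have to inspect every cell inside that range.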

Other tips

The easiest option I have used is to create a Hive table on top of HBase and then query the Hive table with HQL (you can add a WHERE clause and all sorts of conditions). Hive automatically creates a MapReduce job for you and runs it on the cluster, so you don't have to worry about multi-threading or writing MR code.

Example below:

CREATE EXTERNAL TABLE emp(id int, city string, name string, occupation string, salary int) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,
                   personal_data:city,
                   personal_data:name,
                   professional_data:occupation,
                   professional_data:salary")
TBLPROPERTIES ("hbase.table.name" = "emp", "hbase.mapred.output.outputtable" = "emp");

Select count(*) from emp where city = 'LA';

For this case, I would like to give you 2 options.

  1. Use MapReduce jobs. You should write your own MR jobs to actually run the counting job throughout the cluster.

  2. Use multi-threading. You can write multi-threaded scan tasks that count with filters on individual HBase regions, for example one thread per region.

For reference: I have tried both strategies, and in my tests they performed similarly. My measurements may not be exact, but both are definitely faster than your current implementation.
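The second option can be sketched as follows. This is only an illustration of the pattern (one task per key range, partial counts summed at the end), not a working HBase client: an in-memory TreeMap stands in for the table, the class and method names are hypothetical, and the range boundaries are passed in by hand. In a real job the boundaries would come from the table's region start and end keys (for example via the client's RegionLocator), and countRange would run a filtered scan limited to that range.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; the TreeMap simulates a table of rowKey -> value.
class ParallelRowCount {

    // Stand-in for a scan bounded to [startKey, stopKey) with a value filter.
    static long countRange(NavigableMap<String, String> table,
                           String startKey, String stopKey, String wantedValue) {
        long count = 0;
        for (String value : table.subMap(startKey, true, stopKey, false).values()) {
            if (value.equals(wantedValue)) {
                count++;
            }
        }
        return count;
    }

    // boundaries holds n+1 sorted keys that describe n adjacent ranges.
    static long parallelCount(NavigableMap<String, String> table,
                              List<String> boundaries, String wantedValue) {
        int ranges = Math.max(1, boundaries.size() - 1);
        ExecutorService pool = Executors.newFixedThreadPool(ranges);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            for (int i = 0; i + 1 < boundaries.size(); i++) {
                final String start = boundaries.get(i);
                final String stop = boundaries.get(i + 1);
                Callable<Long> task = () -> countRange(table, start, stop, wantedValue);
                parts.add(pool.submit(task)); // one task per range
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // only the partial counts are combined
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("counting failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        table.put("alpha_1", "test 1");
        table.put("test_row1", "test 1");
        table.put("test_row2", "other");
        table.put("test_row3", "test 1");
        table.put("zeta_9", "test 1");
        List<String> boundaries = Arrays.asList("test_row", "test_row2", "test_rox");
        // Prints 2: test_row1 and test_row3 match inside the prefix range.
        System.out.println(parallelCount(table, boundaries, "test 1"));
    }
}
```

The point of the design is that each worker returns a single number, so no row keys or column keys have to be collected in one place.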

You can do it from the HBase shell by executing the following command (as written, it counts every row in the table, not only the rows matching a filter):

hbase(main):002:0> count 'consumer-data'
Licensed under: CC-BY-SA with attribution