How to count rows with a filter in HBase?

Question 1

The code you wrote is very slow. First, a scan works sequentially (no map/reduce), so it is slow to begin with. Then you use two slow filters: one that looks at column names and, worse, one that actually looks at values. What you get is a one-by-one sequential read that examines every column, plus the value of each matching column.

If you expect to run queries like these on a regular basis, you should rethink your key. Also redo this as a map/reduce job, so that at least the work is divided across the cluster.
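To illustrate what rethinking the key can buy you, here is a minimal sketch in plain Java with no HBase dependency, so it runs on its own. A sorted TreeMap stands in for an HBase table (both keep rows ordered by key). The "city|id" key layout, the class KeyDesignSketch, and the sample rows are all hypothetical. The point is only that when the filtered attribute leads the row key, the count becomes a bounded range read instead of a full scan that checks a value on every row.

```java
import java.util.Map;
import java.util.TreeMap;

// Stand-in model (an assumption, not HBase itself): a TreeMap keeps its keys
// sorted, much as an HBase table keeps rows sorted by row key.
class KeyDesignSketch {

    // Full "scan" with a value filter: every row is read and its value examined.
    static long countByValueFilter(TreeMap<String, String> rowsById, String city) {
        long count = 0;
        for (Map.Entry<String, String> row : rowsById.entrySet()) {
            if (row.getValue().equals(city)) {
                count++;
            }
        }
        return count;
    }

    // Range read on a hypothetical "city|id" key: only keys in [start, stop) are visited.
    static long countByKeyPrefix(TreeMap<String, String> rowsByCityAndId, String city) {
        String start = city + "|";
        String stop = start + Character.MAX_VALUE;
        return rowsByCityAndId.subMap(start, true, stop, false).size();
    }

    public static void main(String[] args) {
        // Key = id, value = city: counting by city needs a look at every value.
        TreeMap<String, String> rowsById = new TreeMap<>();
        rowsById.put("001", "LA");
        rowsById.put("002", "NY");
        rowsById.put("003", "LA");

        // Key = city|id: rows for one city sit next to each other.
        TreeMap<String, String> rowsByCityAndId = new TreeMap<>();
        rowsByCityAndId.put("LA|001", "Alice");
        rowsByCityAndId.put("NY|002", "Bob");
        rowsByCityAndId.put("LA|003", "Carol");

        System.out.println(countByValueFilter(rowsById, "LA"));
        System.out.println(countByKeyPrefix(rowsByCityAndId, "LA"));
    }
}
```

In HBase terms, the second method corresponds to a scan bounded by start and stop rows, which only has to read the part of the table that holds the matching key range. The trade-off is that a key laid out for one query may not suit your other access patterns.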

Question 2

The easiest option I have used is to create a Hive table on top of the HBase table and then query the Hive table using HQL (you can add a WHERE clause and all sorts of conditions). Hive automatically creates a MapReduce job for you and runs it on the cluster, so you don't have to worry about multi-threading or writing MR code.

Example below:

-- Map the existing HBase table "emp" to an external Hive table.
-- Keep the column mapping free of whitespace; spaces can end up as part of the column names.
CREATE EXTERNAL TABLE emp(id int, city string, name string, occupation string, salary int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,personal_data:city,personal_data:name,professional_data:occupation,professional_data:salary")
TBLPROPERTIES ("hbase.table.name" = "emp", "hbase.mapred.output.outputtable" = "emp");

-- Count the rows whose city is LA.
SELECT COUNT(*) FROM emp WHERE city = 'LA';

Question 3

For this case, I would suggest two options.

1. Use MapReduce. Write your own MR job so the counting runs across the whole cluster.
2. Use multi-threading. Write multi-threaded scan tasks that count with filters on the HBase regions, for example one thread per region.

Just for your reference: I have tried both strategies, and in my tests they performed similarly. My results may not be conclusive, but either option is definitely faster than your current implementation.
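The one-thread-per-region idea can be sketched roughly as follows, in plain Java with no HBase dependency so it runs on its own. A TreeMap stands in for the table, the regionStartKeys list stands in for the region boundaries, and RegionCountSketch and the sample data are hypothetical. In a real client, the boundaries would come from the table's region locator (for example RegionLocator.getStartEndKeys()), and each task would run a Scan bounded by one region's start and stop rows with your filter attached.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

// Stand-in model (an assumption, not the HBase client API).
class RegionCountSketch {

    // Counts the matching rows of a single key range ("region").
    static long countRegion(NavigableMap<String, String> region, Predicate<String> filter) {
        long count = 0;
        for (String value : region.values()) {
            if (filter.test(value)) {
                count++;
            }
        }
        return count;
    }

    // Runs one counting task per region and sums the partial counts.
    // regionStartKeys must be sorted; "" marks the start of the first region.
    static long parallelCount(NavigableMap<String, String> table,
                              List<String> regionStartKeys,
                              Predicate<String> filter) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, regionStartKeys.size()));
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (int i = 0; i < regionStartKeys.size(); i++) {
                String start = regionStartKeys.get(i);
                // The last region has no stop key and runs to the end of the table.
                String stop = (i + 1 < regionStartKeys.size()) ? regionStartKeys.get(i + 1) : null;
                NavigableMap<String, String> region = (stop == null)
                        ? table.tailMap(start, true)
                        : table.subMap(start, true, stop, false);
                Callable<Long> task = () -> countRegion(region, filter);
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get();
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while counting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A region count failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Row key -> city value; keys and values are made up for the example.
        TreeMap<String, String> table = new TreeMap<>();
        table.put("a1", "LA");
        table.put("h2", "NY");
        table.put("k3", "LA");
        table.put("z9", "LA");

        List<String> regionStartKeys = Arrays.asList("", "h", "q");
        System.out.println(parallelCount(table, regionStartKeys, city -> city.equals("LA")));
    }
}
```

Each task only touches its own key range, so the tasks need no coordination beyond summing the partial counts at the end. With a real cluster you would likely cap the pool size rather than start one thread for every region of a large table.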

Question 4

You can do it from the HBase shell by running the following command (as written, it counts every row in the table):

hbase(main):002:0> count 'consumer-data'