Cassandra efficient table walk

Question

Here are two ideas that may help you.

1) You can efficiently scan rows in any table using the following approach. Consider a table with PRIMARY KEY (pk, sk, tk). Let's use a fetch size of 1000, but you can try other values.

First query (Q1):

select whatever_columns from allocated limit 1000;

Process these and then record the value of the three columns that form the primary key. Let's say these values are pk_val, sk_val, and tk_val. Here is your next query (Q2):

select whatever_columns from allocated where token(pk) = token(pk_val) and sk = sk_val and tk > tk_val limit 1000;

The above query will look for records for the same pk and sk, but for the next values of tk. Keep repeating as long as you keep getting 1000 records. When get anything less, you ignore the tk, and do greater on sk. Here is the query (Q3):

select whatever_columns from allocated where token(pk) = token(pk_val) and sk > sk_val limit 1000;

Again, keep doing this as long as you get 1000 rows. Once you are done, you run the following query (Q4):

select whatever_columns from allocated where token(pk) > token(pk_val) limit 1000;

Now, you again use the pk_val, sk_val, tk_val from the last record, and run Q2 with these values, then Q3, then Q4.....

You are done when Q4 returns less than 1000.

2) I am assuming that 'report_name, view_name, col_name and row_name' are not unique and that's why you maintain a hashmap to keep track of the total amount whenever you see the same combination again. Here is something that may work better. Create a table in cassandra where key is a combination of these four values (maybe delimited). If there were three, you could have simply used a composite key for those three. Now, you also need a column called amounts which is a list. As you are scanning the allocate table (using the approach above), for each row, you do the following:

update amounts_table set amounts = amounts + whatever_amount where my_primary_key = four_col_values_delimited;

Once you are done, you can scan this table and compute the sum of the list for each row you see and dump it wherever you want. Note that since there is only one key, you can scan using only token(primary_key) > token(last_value_of_primary_key).

Sorry if my description is confusing. Please let me know if this helps.