Why do we need secondary indexes in Cassandra and how do they really work?

https://stackoverflow.com/questions/22650126

21-06-2023

Question

I was trying to understand why secondary indexes were even necessary on Cassandra.

I know that secondary indexes are used because:

"Secondary indexes allow for efficient querying by specific values using equality predicates (where column x = value y). Also, queries on indexed values can apply additional filters to perform operations such as range queries."

from: http://www.datastax.com/docs/0.7/data_model/secondary_indexes

But what I did not understand is why a query like:

get users where birth_date = 1973;

required that birth_date have a secondary index. Why is it necessary for secondary indexes to even exist? Can't Cassandra just go through the table and return the rows where the constraint is matched? Why do we need to treat columns that we might want to query in any special way?

I am assuming that, because Cassandra is distributed, going through the whole table might not be easy, since each row key is allocated to a different node. But I don't really understand how being distributed complicates the problem or how secondary indexes resolve it (i.e. how does Cassandra handle this issue?).

Related to this question, is it true that secondary indexes and primary keys are the only things that can be queried in the form of SELECT * FROM column_family_table WHERE col_x = constraint? Why is the primary key special?

Solution

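Given the amount of data these NoSQL databases are meant to deal with, a full table scan or region scan is not an option. That is why Cassandra restricts queries on non-row-key columns and allows them only if a secondary index is enabled on the column. The row key is special because it determines which node stores a row, so a lookup by row key can go straight to that node. A secondary index is kept on the same data node as the rows it indexes, so each node can answer an indexed query from its own index instead of scanning all of its rows.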
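To make the difference concrete, here is a toy sketch in Python. It is not how Cassandra is actually implemented: the ToyCluster and Node classes are made up for illustration, and hashing the row key modulo the node count only stands in for the real partitioner. It shows that a row-key lookup touches one node, while an indexed lookup asks each node to consult its own local index rather than scan its rows.

```python
import hashlib
from collections import defaultdict


class Node:
    """One toy storage node: its rows plus an index over those rows only."""

    def __init__(self):
        self.rows = {}                        # row_key -> {column: value}
        self.local_index = defaultdict(set)   # indexed value -> row keys on this node

    def put(self, row_key, columns, indexed_column):
        # Drop the stale index entry if the row is being overwritten.
        old = self.rows.get(row_key)
        if old is not None and indexed_column in old:
            self.local_index[old[indexed_column]].discard(row_key)
        self.rows[row_key] = columns
        if indexed_column in columns:
            self.local_index[columns[indexed_column]].add(row_key)


class ToyCluster:
    """Hypothetical cluster; md5 modulo node count is an assumed stand-in for the partitioner."""

    def __init__(self, node_count, indexed_column):
        self.nodes = [Node() for _ in range(node_count)]
        self.indexed_column = indexed_column

    def _owner(self, row_key):
        digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, row_key, columns):
        self._owner(row_key).put(row_key, columns, self.indexed_column)

    def get_by_key(self, row_key):
        # The row key tells us which single node to ask.
        return self._owner(row_key).rows.get(row_key)

    def get_by_index(self, value):
        # Every node is asked, but each one only reads its local index.
        matches = {}
        for node in self.nodes:
            for row_key in node.local_index.get(value, ()):
                matches[row_key] = node.rows[row_key]
        return matches

    def full_scan(self, column, value):
        # What an unindexed query would need: read every row on every node.
        return {
            row_key: columns
            for node in self.nodes
            for row_key, columns in node.rows.items()
            if columns.get(column) == value
        }


cluster = ToyCluster(node_count=3, indexed_column="birth_date")
cluster.put("alice", {"name": "Alice", "birth_date": 1973})
cluster.put("bob", {"name": "Bob", "birth_date": 1980})
cluster.put("carol", {"name": "Carol", "birth_date": 1973})

by_index = cluster.get_by_index(1973)
by_scan = cluster.full_scan("birth_date", 1973)
```

Both by_index and by_scan hold the same rows, but the scan had to read every row in the cluster, while the indexed lookup only read the index entries for 1973 on each node.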

Hope it helps.

-Vivek

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow