What is the difference between a secondary index and an inverted index in Cassandra?

Question

The main difference is that secondary indexes in Cassandra are not distributed in the same way a manual inverted index would be. With the inbuilt secondary indexes, each node indexes only the data it stores locally (using the LocalPartitioner). With manual indexing, the index entries are partitioned by the indexed value, so they are distributed independently of the nodes that store the rows.

This means that, for the inbuilt indexes, each query must go to every node, whereas with a manual inverted index you would go to just one node (plus replicas) to look up the value you are searching for. One advantage of having the index stored locally is that the index can be updated atomically with the data. (Since Cassandra 1.2, atomic batches can be used to get the same effect with a manual index, although they are a bit slower.)
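To make the routing difference concrete, here is a minimal toy sketch in plain Python (it does not talk to Cassandra). The cluster size, the `owner` function, and the table of users and emails are all made up for illustration; `owner` is only a stand-in for a real partitioner.

```python
import hashlib

NUM_NODES = 4  # hypothetical cluster size


def owner(key: str) -> int:
    """Toy partitioner (an assumption, not Cassandra's): hash a key to a node."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES


# Hypothetical table: row key -> email (the indexed column).
rows = {f"user{i}": f"user{i}@example.com" for i in range(20)}

# Inbuilt style: each node indexes only the rows it stores itself.
local_indexes = [dict() for _ in range(NUM_NODES)]
# Manual inverted index: an entry lives on the node that owns the indexed value.
inverted_index = [dict() for _ in range(NUM_NODES)]

for row_key, email in rows.items():
    local_indexes[owner(row_key)].setdefault(email, []).append(row_key)
    inverted_index[owner(email)].setdefault(email, []).append(row_key)


def lookup_local(value):
    """Ask every node's local index; return (row keys, nodes contacted)."""
    hits, contacted = [], 0
    for node_index in local_indexes:
        contacted += 1
        hits.extend(node_index.get(value, []))
    return hits, contacted


def lookup_inverted(value):
    """Go straight to the node owning the value; return (row keys, nodes contacted)."""
    node = owner(value)
    return list(inverted_index[node].get(value, [])), 1


print(lookup_local("user7@example.com"))
print(lookup_inverted("user7@example.com"))
```

Both lookups find the same row, but the local-index version contacts all four nodes while the inverted-index version contacts one. Replication is left out of the sketch to keep it short.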

This is why Cassandra's inbuilt indexes are not recommended for very high cardinality data. If you are doing a lookup on every node but there are only one or two results, it is inefficient, and a manual inverted index will be better. If your lookup returns many results, then you will need to look them up on each node anyway, so the inbuilt indexes work well.
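The cardinality argument can be illustrated with another small toy model. Everything here is hypothetical: ten nodes, a thousand rows placed round-robin, a unique `email` column and a five-valued `region` column. The per-node list scan simply stands in for a local index lookup.

```python
NUM_NODES = 10   # hypothetical cluster size
NUM_ROWS = 1000  # hypothetical table size

# Toy placement (an assumption): row i lives on node i % NUM_NODES.
nodes = [[] for _ in range(NUM_NODES)]
for i in range(NUM_ROWS):
    row = {
        "id": i,
        "email": f"user{i}@example.com",      # unique: very high cardinality
        "region": f"region{(i // 10) % 5}",   # five values: low cardinality
    }
    nodes[i % NUM_NODES].append(row)


def scatter_query(column, value):
    """Query every node; return (matching rows, number of nodes with a match)."""
    matches, useful_nodes = [], 0
    for local_rows in nodes:
        # Stand-in for a lookup in this node's local index.
        hits = [r for r in local_rows if r[column] == value]
        if hits:
            useful_nodes += 1
        matches.extend(hits)
    return matches, useful_nodes


email_rows, email_nodes = scatter_query("email", "user42@example.com")
region_rows, region_nodes = scatter_query("region", "region0")
print(len(email_rows), email_nodes)    # one row, found on a single node
print(len(region_rows), region_nodes)  # many rows, spread over every node
```

In the unique-column case nine of the ten node visits return nothing, which is the waste a manual inverted index avoids. In the low-cardinality case every node contributes rows, so visiting all of them costs nothing extra.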

A further advantage of using Cassandra's inbuilt indexing is that the indexes are updated lazily, so you don't need to do a read on every update. (See CASSANDRA-2897.) This can be a significant speed improvement for indexed tables with high write throughput.
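As a rough illustration of why skipping the read helps, here is a toy model of the two update strategies. It is only a sketch of the general idea, not Cassandra's actual implementation: the function names are made up, and cleaning up stale entries at query time is a simplifying assumption for the example.

```python
def write_with_read(data, index, key, value):
    """Eager update: read the old value first so its index entry can be removed."""
    old = data.get(key)  # this is the read-before-write
    if old is not None:
        index[old].discard(key)
    data[key] = value
    index.setdefault(value, set()).add(key)


def write_lazy(data, index, key, value):
    """Lazy update: no read; the old index entry is simply left behind as stale."""
    data[key] = value
    index.setdefault(value, set()).add(key)


def query_lazy(data, index, value):
    """Check index entries against the data and purge stale ones while reading."""
    keys = index.get(value, set())
    live = {k for k in keys if data.get(k) == value}
    keys.intersection_update(live)  # drop stale entries in place
    return sorted(live)


data, index = {}, {}
write_lazy(data, index, "u1", "red")
write_lazy(data, index, "u1", "blue")   # "red" -> "u1" is now a stale entry
print(query_lazy(data, index, "red"))   # stale entry is filtered out
print(query_lazy(data, index, "blue"))
```

The lazy write path never touches the existing row, so its cost does not depend on a read; the price is a little extra filtering when the index is queried.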