Cassandra: Which is the best choice for manual indexes?

https://stackoverflow.com/questions/14342382

15-01-2022

Question

First, please excuse my English; it is not my native language. I'm working on moving a SQL database to Cassandra, but I have a question I have not been able to resolve. Let's say I have a SQL table where I store songs. Each song has an ID as its primary key, which gives access to all its related data, stored in the fields of the row identified by that key. I also have some indexes for searching by different criteria such as author, genre, title...

When I think about moving this to a Cassandra schema, my idea is to create an equivalent column family, where the song ID is the row key and the song attributes are the columns. Then I can create 5 or 6 manual indexes to search by author, title, genre and more. The author, title... will be the column name (with some extra data added to keep it unique, using a composite column name), and the value will be the song ID, which is then used to look up the song in the static column family where each row is identified by the song ID.

But here is my doubt. Which is better: each index CF storing only the ID, or storing all the attributes? The first option reduces the amount of memory needed, but I need (at least) 2 reads to get each song's attributes. With the second option I need more memory, because the same information is repeated once per index, but I can get all the attributes I need in one read. I think I can accept the extra memory if this schema is faster, but will it really be faster? Won't a bigger database make it slower? Or is the slower operation looking up each row given by the index CF, because of the way Cassandra stores rows and because of the 2 reads?
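To make the two options concrete, here is a minimal in-memory sketch. It is not Cassandra code: plain Python dicts stand in for column families, and the sample songs, the `by_author` row key, and the function names are all hypothetical. It only illustrates how many reads each layout needs.

```python
# Dicts stand in for column families: row key -> {column name: value}.
# All names and sample data below are hypothetical.
songs = {
    "s1": {"title": "Blue Train", "author": "Coltrane"},
    "s2": {"title": "Naima", "author": "Coltrane"},
    "s3": {"title": "So What", "author": "Davis"},
}

# Option 1: index columns hold only the song ID.
author_index_ids = {"by_author": {}}
# Option 2: index columns hold a full copy of the attributes.
author_index_full = {"by_author": {}}

for song_id, attrs in songs.items():
    # Composite column name (author, song_id) keeps each column unique.
    column = (attrs["author"], song_id)
    author_index_ids["by_author"][column] = song_id
    author_index_full["by_author"][column] = dict(attrs)


def find_by_author_ids(author):
    """Option 1: one index read, then one get-by-key per matching song."""
    row = author_index_ids["by_author"]
    reads = 1
    results = []
    for column in sorted(row):
        if column[0] == author:
            results.append(songs[row[column]])
            reads += 1
    return results, reads


def find_by_author_full(author):
    """Option 2: a single index read returns the attributes directly."""
    row = author_index_full["by_author"]
    results = [row[column] for column in sorted(row) if column[0] == author]
    return results, 1
```

Both functions return the same songs; the difference is the read count (one per matching song plus the index read for option 1, a single read for option 2) and the duplicated attribute data that option 2 carries in every index.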

Another detail: I have calculated that the second option (storing all the attributes in the CFs that work as "indexes") needs about 80% more memory than the first option (where the CFs really work as indexes to find the right data in the "main" songs CF).

Any help would be much appreciated.

Thanks in advance!

Solution

You will also want to check out the wide row pattern. Some libraries, such as PlayOrm, implement the pattern for you, so you can then do something like Scalable SQL (i.e. with partitions). You can have as many partitions as you like. I am sure more and more NoSQL object-mapping libraries will appear in the future as well. There is also a patterns page on PlayOrm's wiki that covers both NoSQL patterns and PlayOrm patterns; you may want to check out the NoSQL ones.
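As a rough illustration of the wide row idea with partitions, the sketch below keeps one sorted "row" per partition and finds matches with a range scan over composite column names. This is not PlayOrm's API and not Cassandra code; the `WideRowIndex` class, the partition keys, and the sample values are assumptions made up for the example.

```python
import bisect


class WideRowIndex:
    """Hypothetical index: one wide row per partition, columns kept sorted."""

    def __init__(self):
        # partition key -> sorted list of composite columns (value, song_id)
        self.rows = {}

    def add(self, partition, value, song_id):
        row = self.rows.setdefault(partition, [])
        bisect.insort(row, (value, song_id))

    def lookup(self, partition, value):
        """Slice the sorted columns of one partition for an exact value."""
        row = self.rows.get(partition, [])
        # (value, "") sorts before every (value, song_id) with a non-empty ID.
        start = bisect.bisect_left(row, (value, ""))
        ids = []
        for column_value, song_id in row[start:]:
            if column_value != value:
                break
            ids.append(song_id)
        return ids


index = WideRowIndex()
index.add("p0", "Coltrane", "s2")
index.add("p0", "Coltrane", "s1")
index.add("p0", "Davis", "s3")
index.add("p1", "Coltrane", "s9")
```

Here `index.lookup("p0", "Coltrane")` scans only the columns of partition `p0`, so each index row stays bounded in size as long as the partitioning scheme spreads the entries.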

OTHER TIPS

Of course there are all sorts of tradeoffs with different data models, but it sounds like your primary concerns are the data set size and the access speed. Cassandra can handle extremely large quantities of data in a linearly scalable fashion, as long as you can give it the necessary resources to do the job. On the other hand, doing two lookups is very cheap when you're doing a get-by-key. My intuition would be to store just the ID, if for no other reason than that it makes it easier to update your attributes. Then you can optimize if you find the queries aren't fast enough. Coming from an RDBMS, though, I'm guessing it will be plenty fast.
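The point about updates can be sketched as follows. Again, dicts stand in for column families, and the `plays` attribute, the set of indexed fields, and the function names are hypothetical. The sketch only covers updating a non-indexed attribute; changing an indexed value would also mean moving index columns in either design.

```python
# Hypothetical main CF and two manual indexes (by title and by author).
INDEXED_FIELDS = ("title", "author")
songs = {"s1": {"title": "Naima", "author": "Coltrane", "plays": 0}}

# ID-only indexes: composite column (value, song_id) -> song_id
id_indexes = {
    field: {(songs["s1"][field], "s1"): "s1"} for field in INDEXED_FIELDS
}
# Denormalized indexes: composite column -> full copy of the attributes
full_indexes = {
    field: {(songs["s1"][field], "s1"): dict(songs["s1"])}
    for field in INDEXED_FIELDS
}


def update_with_id_indexes(song_id, field, value):
    """Only the main row changes; the indexes still point at the right ID."""
    songs[song_id][field] = value
    return 1  # number of writes


def update_with_full_indexes(song_id, field, value):
    """The main row and every denormalized copy must be rewritten."""
    songs[song_id][field] = value
    writes = 1
    for indexed_field in INDEXED_FIELDS:
        column = (songs[song_id][indexed_field], song_id)
        full_indexes[indexed_field][column][field] = value
        writes += 1
    return writes
```

With ID-only indexes the update is a single write, while the denormalized layout needs one write per index on top of the main row, and every copy has to be kept consistent by the application.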

Licensed under: CC-BY-SA with attribution
