Cassandra/BigTable data model - what's the best approach for building indexes?

https://stackoverflow.com/questions/3318773

28-09-2020

Question

I'm in the process of spiking a conversion from MySQL to Cassandra for PenWag.com. In Cassandra, I'm storing Users keyed off of a GUID, but users sign in with their email, not the GUID (obviously). GUID as a key for Users makes sense to me more than email for two reasons. From a practical perspective it seems that it's too cumbersome to change or delete/add a row with all of its SuperColumns. From a theoretical standpoint, it's still the same user, why should their key change?

Nevertheless, here's my question: I'm building an index in a separate ColumnFamily, mapping email->GUID to support login. It's a Standard type CF, where the column name is the email and the value is the GUID. It's Standard, not Super, to avoid loading an entire SC for every mapping. Supporting "change email" is easy; it's just a column delete/add. But it seems that an alternative is to store the index as rows instead of columns, where the row key is the email and a single column holds the GUID. Delete/add on those rows would not be cumbersome, since there's only one column (the GUID) to manage.

It seems that either approach works. What are the pros and cons of each? Is there a best practice?
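For concreteness, the two layouts can be sketched with plain Python dictionaries. This is only a toy in-memory model, not a Cassandra client: a column family is represented as {row_key: {column_name: value}}, and the row key "email_to_guid" and column name "guid" are hypothetical names chosen for illustration.

```python
# Toy model: a column family is {row_key: {column_name: value}}.
# Nothing here talks to Cassandra; it only shows the shape of each layout.

INDEX_ROW_KEY = "email_to_guid"  # hypothetical key of the single wide row
GUID_COLUMN = "guid"             # hypothetical column name in layout B


# Layout A: one wide row, one column per email (column name = email).
def wide_row_put(cf, email, guid):
    cf.setdefault(INDEX_ROW_KEY, {})[email] = guid


def wide_row_get(cf, email):
    return cf.get(INDEX_ROW_KEY, {}).get(email)


def wide_row_change_email(cf, old_email, new_email):
    # "Change email" is a column delete plus a column add in the same row.
    row = cf[INDEX_ROW_KEY]
    row[new_email] = row.pop(old_email)


# Layout B: one row per email (row key = email), a single GUID column.
def per_row_put(cf, email, guid):
    cf[email] = {GUID_COLUMN: guid}


def per_row_get(cf, email):
    row = cf.get(email)
    return row[GUID_COLUMN] if row else None


def per_row_change_email(cf, old_email, new_email):
    # "Change email" is a row delete plus a row add; each row has one column.
    cf[new_email] = cf.pop(old_email)
```

In both layouts a login lookup is a single read and "change email" is a single delete/add; what differs is whether every mapping lives under one row key or each mapping has its own.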

Solution

Since I have no hands-on experience with Cassandra or similar databases, you'll need to take my answer with a grain of salt :)

If you stored each mapping as a column, using the email address as the column name, this would imply a single row containing an enormous number of columns. According to Wikipedia^[1]:

Every operation under a single row key is atomic per replica no matter how many columns are being read or written into.

This could result in significant locking overhead if all mappings are stored in a single row.

The Cassandra Wiki states^[2]:

The row key is what determines what machine data is stored on.

This makes me believe that it's more efficient to do lookups based on row key than on column name. Based on this information, I would suggest using the email address as the row key and storing the GUID in a column.
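To illustrate why the row key matters for placement, here is a rough sketch that hashes a row key to pick a node. It is a deliberate simplification: the node names are made up, and a plain "MD5 modulo node count" stands in for Cassandra's actual token-ring assignment, which works differently in detail. The point is only that a single wide index row always lands on the same place, while one row per email spreads across the cluster.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster


def node_for(row_key):
    """Pick a node from the row key alone (simplified: hash modulo node count)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest, "big") % len(NODES)]


def nodes_hit_wide_row(emails, index_row_key="email_to_guid"):
    # Every lookup goes through the same row key, whatever the email is.
    return {node_for(index_row_key) for _ in emails}


def nodes_hit_row_per_email(emails):
    # Each email is its own row key, so lookups are spread by the hash.
    return {node_for(email) for email in emails}
```

With a few hundred distinct emails, the wide-row layout touches exactly one node in this model, whereas the row-per-email layout touches several.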

OTHER TIPS

Niels is correct; one row per user would be the right way to do this manually.

I qualify that because in 0.7 you could just have an email column in the row with the rest of your keyed-by-UUID user data and ask Cassandra to index it: http://www.riptano.com/blog/whats-new-cassandra-07-secondary-indexes
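Conceptually, a secondary index takes over the bookkeeping that the manual email->GUID column family requires. The sketch below is a toy in-memory model of that idea, not Cassandra's implementation or API; the class name IndexedUsers and its methods are hypothetical. User rows stay keyed by GUID, and the store keeps an email lookup in sync whenever the email column is written.

```python
class IndexedUsers:
    """Toy store: rows keyed by GUID, with an automatically maintained email index."""

    def __init__(self):
        self.rows = {}      # guid -> {column_name: value}
        self.by_email = {}  # email -> guid (the "secondary index")

    def insert(self, guid, columns):
        previous = self.rows.get(guid, {})
        merged = {**previous, **columns}
        old_email = previous.get("email")
        new_email = merged.get("email")
        if old_email != new_email:
            # Keep the index in sync so callers never touch it directly.
            if old_email is not None:
                self.by_email.pop(old_email, None)
            if new_email is not None:
                self.by_email[new_email] = guid
        self.rows[guid] = merged

    def get_by_email(self, email):
        guid = self.by_email.get(email)
        if guid is None:
            return None
        return guid, self.rows[guid]
```

Changing a user's email then becomes a single write to the user's row, and the row key (the GUID) never changes.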

Licensed under: CC-BY-SA with attribution
