Question

Would it be bad practice to use an email address as a primary key in Cassandra? Would this cause problems with replication, since the first component of the primary key (the partition key) is what replication is based on?

According to the documentation it is not a good idea to use high-cardinality 'keys' as indices. It says that one should create a dynamic column family (table) for queries against high cardinality columns.

The main thing I'm keeping track of in the database is the USER, who logs in to the app with their EMAIL, so it doesn't make sense to me to use anything other than the EMAIL as the primary key.

Is it efficient to use EMAIL as the row key? Would there be a reason to use UUID over this?

The problem I am (perhaps ignorantly) foreseeing with using a UUID as the row key and then adding the email as another primary key column is the loss of uniqueness of the email address. Multiple accounts could then be created with the same email, unless extra checks ensure that the email has not already been used (which necessitates either an index or a dynamic table?).

This leads to the second question. What exactly is a dynamic table? I don't see where this high-cardinality key is used in the dynamic table. Is it now the row key (and if so, why not make it the row key to begin with)?

Does a lookup by row key perform better than a lookup through a created index?

Does anyone have any insight into this? I would really appreciate it!

If dynamic column family just means that the columns are 'dynamically' added then I don't see how this helps for high-cardinality columns in terms of indexing.


Solution

You are mixing up primary keys with secondary indexes. The cardinality vs. efficiency trade-off applies to secondary indexes, not to the primary key. Primary key values are unique by definition, and looking a row up by its primary key is the most efficient way to find and access a single row. Have a look at this summary about indexes in Cassandra.
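To make the difference concrete, here is a minimal toy model in plain Python (not the Cassandra driver). Everything in it is an assumption for illustration: the 4-node cluster, the `ToyCluster` class, and MD5-modulo as a stand-in for Cassandra's real token ring. It only shows the general idea that a primary key lookup can be routed to the node that owns the key, while a lookup on some other column has to ask every node.

```python
import hashlib

NODES = 4  # hypothetical cluster size, chosen only for this sketch


def node_for(partition_key: str) -> int:
    """Toy partitioner: hash the key and map it to exactly one node.

    Real clusters place keys on a token ring; MD5 modulo the node count
    is just a simple stand-in for that mapping.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NODES


class ToyCluster:
    """Hypothetical in-memory 'cluster': one dict of rows per node."""

    def __init__(self):
        self.nodes = [dict() for _ in range(NODES)]

    def insert(self, email, row):
        # The email is the primary (partition) key, so it decides the node.
        self.nodes[node_for(email)][email] = row

    def get_by_key(self, email):
        """Primary key lookup: returns (row, number_of_nodes_contacted)."""
        node = self.nodes[node_for(email)]
        return node.get(email), 1

    def find_by_column(self, column, value):
        """Lookup on a non-key column: every node has to be asked."""
        matches = []
        for node in self.nodes:
            matches.extend(
                email for email, row in node.items() if row.get(column) == value
            )
        return sorted(matches), len(self.nodes)


cluster = ToyCluster()
cluster.insert("alice@example.com", {"name": "Alice", "country": "DE"})
cluster.insert("bob@example.com", {"name": "Bob", "country": "DE"})

row, contacted = cluster.get_by_key("alice@example.com")
print(row, "nodes contacted:", contacted)

emails, contacted = cluster.find_by_column("country", "DE")
print(emails, "nodes contacted:", contacted)
```

A high-cardinality key such as an email address works in favour of the first kind of lookup, because distinct keys spread across the nodes and each lookup touches only one of them.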

There is absolutely no problem with using the user's email address as the primary key of a user table if that is what uniquely identifies your users and associates them with their detail information.

A dynamic column family is a "table" for which the number of columns is not fixed. You add information not only by adding rows but also by adding columns on the fly, e.g. to build a time series of events. A column family is always dynamic, though I think the CQL layer obscures the fact. Whether you treat it as such or as a fixed set of columns is up to you. For some theoretical background, look up the BigTable concept and how Cassandra implements it.
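As a rough mental model of a dynamic column family, assuming nothing about Cassandra's actual storage format, you can picture each row key pointing at its own map of column names to values. The `ToyColumnFamily` class and the example emails and timestamps below are made up for illustration.

```python
class ToyColumnFamily:
    """Hypothetical model: row key -> {column name -> value}."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, column_name, value):
        # A new column name simply becomes a new column of that one row.
        self.rows.setdefault(row_key, {})[column_name] = value

    def get_row(self, row_key):
        # Return the row's columns ordered by column name.
        return sorted(self.rows.get(row_key, {}).items())


events = ToyColumnFamily()

# Time series: the timestamp is the column name, the event is the value.
events.put("alice@example.com", "2013-05-01T10:00:00", "login")
events.put("alice@example.com", "2013-05-01T10:05:00", "upload")
events.put("alice@example.com", "2013-05-01T09:00:00", "signup")
events.put("bob@example.com", "2013-05-02T08:30:00", "login")

for column_name, value in events.get_row("alice@example.com"):
    print(column_name, value)
```

Note that the two rows end up with a different number of columns, and that the high-cardinality value (the email) is the row key here, not an indexed column.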

Licensed under: CC-BY-SA with attribution