Question

I'm trying to store some time series data on the following column family:

create column family t_data with comparator=TimeUUIDType and default_validation_class=UTF8Type and key_validation_class=UTF8Type;

I'm successfully inserting data this way:

import datetime

# col_fam is a pycassa ColumnFamily bound to t_data
data = {datetime.datetime(2013, 3, 4, 17, 8, 57, 919671): 'VALUE'}
key = 'row_id'
col_fam.insert(key, data)

As you can see, when I use a datetime object as the column name, pycassa correctly converts it to a TimeUUID.

[default@keyspace] get t_data[row_id];

=> (column=f36ad7be-84ed-11e2-af42-ef3ff4aa7c40, value=VALUE, timestamp=1362423749228331)

Sometimes the application needs to update some data. The problem is that when I try to update that column by passing the same datetime object, pycassa creates a different UUID (the time part is the same), so instead of updating the column it creates another one.

[default@keyspace] get t_data[row_id];

=> (column=f36ad7be-84ed-11e2-af42-ef3ff4aa7c40, value=VALUE, timestamp=1362423749228331)

=> (column=**f36ad7be**-84ed-11e2-b2fa-a6d3e28fea13, value=VALUE, timestamp=1362424025433209)

The question is: how can I update TimeUUID-based columns with pycassa by passing the datetime object? Or, if this is not the correct way to do it, what is the recommended way?


Solution

Unless you do a read-modify-write you can't. UUIDs are by their nature unique. They exist to solve the problem of how to get unique IDs that sort in chronological order but at the same time avoid collisions for things that happen at exactly the same time.
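To make the point concrete, here is a minimal sketch using only the Python standard library. It builds two version 1 UUIDs from the same datetime: the time part is identical, but the clock sequence and node bits are random, so the two values differ. The helper name time_uuid_from_datetime is hypothetical, and this is not pycassa's actual conversion code. It only illustrates the same effect, and it assumes the naive datetime is in UTC.

```python
import datetime
import random
import uuid

# Number of 100 ns intervals between the UUID epoch (1582-10-15)
# and the Unix epoch (1970-01-01).
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def time_uuid_from_datetime(dt):
    """Build a version 1 UUID for dt with random clock sequence and node bits.

    Hypothetical helper for illustration; dt is assumed to be a naive UTC datetime.
    """
    delta = dt - datetime.datetime(1970, 1, 1)
    micros = (delta.days * 86400 + delta.seconds) * 1000000 + delta.microseconds
    timestamp = micros * 10 + UUID_EPOCH_OFFSET  # 100 ns units

    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | 0x1000  # version 1

    clock_seq = random.getrandbits(14)
    clock_seq_low = clock_seq & 0xFF
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80  # RFC 4122 variant

    node = random.getrandbits(48)
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


when = datetime.datetime(2013, 3, 4, 17, 8, 57, 919671)
first = time_uuid_from_datetime(when)
second = time_uuid_from_datetime(when)

# Same embedded timestamp, different UUIDs, hence two different column names.
print(first.time == second.time)
print(first == second)
```

This is why the two columns in your output share the f36ad7be prefix but are still distinct column names.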

So to update that column you need to first read it, so you can find its column key, change its value and write it back again.
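A rough sketch of that read-modify-write, assuming col_fam is your pycassa ColumnFamily. The function name update_by_datetime is made up for illustration. It relies on my understanding that, for a TimeUUIDType comparator, pycassa accepts a datetime for column_start and column_finish and turns them into the lowest and highest UUID for that instant; please verify that against your pycassa version.

```python
def update_by_datetime(col_fam, key, when, new_value):
    """Overwrite every column in row `key` whose TimeUUID carries the time `when`.

    Hypothetical helper: col_fam is expected to behave like a pycassa
    ColumnFamily (get() returning a mapping of column name -> value).
    """
    # Read: slice on the datetime to discover the real UUID column names.
    # As far as I know, pycassa raises NotFoundException if nothing matches;
    # that exception is left to propagate here.
    existing = col_fam.get(key, column_start=when, column_finish=when)

    # Modify: keep the existing column names, replace the values.
    updated = dict((name, new_value) for name in existing)

    # Write: inserting under the existing names overwrites instead of adding.
    col_fam.insert(key, updated)
    return list(updated)
```

Note that if the row already contains several columns for the same instant (as in your second listing), this updates all of them, and that it costs one extra round trip per update.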

It's not a particularly elegant solution. You should really avoid read-modify-write in Cassandra. Perhaps TimeUUID isn't the right type for your column keys? Or perhaps there's another way you can design your application to avoid having to go back and change things.

Without knowing what your query patterns look like I can't say exactly what you should do instead, but here are some suggestions that hopefully are relevant:

Don't update values, just write new values. If something was true at time T, it will always have been true for time T, even if it changes at time T + 1. When things change, you write a new value with the time of the change and let the old values be. When you read the timeline you resolve these conflicts by picking the most recent value, and since the values are sorted in chronological order, the most recent value will always be the last one. This is very similar to how Cassandra does things internally, and it's a very powerful pattern.

Don't worry that this will use up more disk space or require some extra CPU when reading the time series. The cost will most likely be tiny in comparison with the read-modify-write complexity that you would otherwise have to implement.
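As a sketch of this write-only pattern, again assuming col_fam is a pycassa ColumnFamily with a TimeUUIDType comparator: the names record_change and latest_value are hypothetical, while column_reversed and column_count are, to my knowledge, regular arguments of ColumnFamily.get().

```python
def record_change(col_fam, key, when, value):
    """Append a new column for `when`; never touch the older columns.

    `when` can be a datetime (pycassa converts it to a TimeUUID) or a UUID.
    """
    col_fam.insert(key, {when: value})


def latest_value(col_fam, key):
    """Return (column_name, value) for the most recent column in the row."""
    # Reversed slice with a count of 1: the newest column comes first.
    newest = col_fam.get(key, column_reversed=True, column_count=1)
    for name, value in newest.items():
        return name, value
    return None
```

Reading the current state is then a single small slice, and the full history is still available if you read the row in its natural order.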

There might be other ways to solve your problem, and if you give us some more details maybe we can come up with something that fits better.

Licensed under: CC-BY-SA with attribution