Question

I want to design a Cassandra schema for my timeline:

I have users, and every user has a timeline. Each timeline value is a string.

Since every user has a timeline of strings, I need to populate each user's timeline from the end as quickly as possible.

I estimate that every user will have 10,000 timeline objects and that I will have 10,000 users, so I will easily have 100,000,000 objects. This means that speed is very important.

This is the code I have used. Is it right?

$pool = new ConnectionPool('Keyspace', array('127.0.0.1'));
$cf = new ColumnFamily($pool, 'timeline');

// Insert a few records
$columns = array(microtime() => "event1", microtime() => "event2", microtime() => "event3", microtime() => "event4" );
$cf->insert('usera', $columns);

Solution

You can use TimeUUIDs as column names. They guarantee unique names even when multiple application servers write data at the same time (it is very unlikely, but two application servers could insert something at exactly the same microtime value for the same user). With the TimeUUIDType comparator they also sort in chronological order, just like a regular timestamp.
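
To illustrate why unique column names matter, here is a minimal sketch in plain PHP. It uses an array as a stand-in for one Cassandra row, where writing a column whose name already exists replaces the old value. The timestamp value and the `server-a` / `server-b` suffixes are made up for illustration; the suffixed names only mimic what a real TimeUUID provides and are not a replacement for one.

```php
<?php
// Two events that carry the same timestamp, e.g. written by two servers
// in the same microsecond. The timestamp value is made up for illustration.
$timestamp = '1330000000.123456';

// Plain timestamp as the column name: the second event replaces the first.
$byTimestamp = array();
$byTimestamp[$timestamp] = 'event from server A';
$byTimestamp[$timestamp] = 'event from server B';

// Hypothetical stand-in for a TimeUUID: timestamp plus a per-writer id.
// A real TimeUUID (version 1 UUID) carries a timestamp, a clock sequence
// and a node id, which is what keeps concurrent writers apart.
$byUniqueName = array();
$byUniqueName[$timestamp . ':server-a'] = 'event from server A';
$byUniqueName[$timestamp . ':server-b'] = 'event from server B';

echo count($byTimestamp), "\n";  // 1: one event was silently lost
echo count($byUniqueName), "\n"; // 2: both events survive
```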

You might also want to use a reverse comparator if you expect to display the most recent items most often (for example, if you want to show the ten most recent timeline items for a user). Using a reverse comparator means that Cassandra will store the columns within each row in reverse order, with the most recent items first. The most recent items will then be the easiest for Cassandra to find, and you will get very good performance.

Another thing to think about is just how wide your rows will get. If you don't expect a timeline to be longer than a million or so items (exactly how many depends on how much data there is in each item), then having a single row per user will probably work (but again, try using a reverse comparator, otherwise reading the most recent items will be slow). If you expect your users to generate millions and millions of timeline items, you need to think of a way to split a user's timeline into many rows, perhaps one row per user per month, or per day. It needs to be something deterministic so that you don't have to do a query to find which row you should read. Since your columns are sorted on time, using time to partition into multiple rows is natural.
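
As a rough sketch of the one-row-per-user-per-month idea, the row key can be computed from the user id and the event time alone, so no lookup is needed to know which row to read. The helper names `timelineRowKey` and `recentTimelineRowKeys` and the `user:YYYY-MM` key layout are assumptions made for this example, not part of Cassandra or any client library.

```php
<?php
// Hypothetical helper: builds the row key for the month that contains $timestamp.
// The "user:YYYY-MM" layout is an assumption for illustration.
function timelineRowKey($userId, $timestamp)
{
    // gmdate() formats in UTC, so every application server derives the same key.
    return $userId . ':' . gmdate('Y-m', $timestamp);
}

// Hypothetical helper: row keys to read, newest month first,
// covering the month of $timestamp and the $months - 1 months before it.
function recentTimelineRowKeys($userId, $timestamp, $months)
{
    $year  = (int) gmdate('Y', $timestamp);
    $month = (int) gmdate('n', $timestamp);
    $keys  = array();
    for ($i = 0; $i < $months; $i++) {
        $keys[] = sprintf('%s:%04d-%02d', $userId, $year, $month);
        $month--;
        if ($month === 0) { // step back across a year boundary
            $month = 12;
            $year--;
        }
    }
    return $keys;
}

$now = gmmktime(12, 0, 0, 1, 15, 2012); // 2012-01-15 12:00:00 UTC
echo timelineRowKey('usera', $now), "\n";
print_r(recentTimelineRowKeys('usera', $now, 3));
```

To show the latest items, read the first key and move on to the next one only if the current month's row does not hold enough columns.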

Licensed under: CC-BY-SA with attribution