Question

I am trying to create a nested data model in a Cassandra database, similar to:

Forums = {
    forum001: {
        name: "General News",
        topics: {
            topic000001: {
                subject: "This is what I think",
                date: "2012-08-24 10:12:13",
                posts: {
                    post20120824.101213: { username: "tom", content: "Blah blah", datetime: "2012-08-24 10:12:13" },
                    post20120824.101513: { username: "dick", content: "Blah blah blah", datetime: "2012-08-24 10:15:13" },
                    post20120824.103213: { username: "harry", content: "Blah blah", datetime: "2012-08-24 10:32:13" }
                }
            },
            topic000002: {
                subject: "OMG Look at this",
                date: "2012-08-24 10:42:13",
                posts: {
                    post20120824.104213: { username: "tom", content: "Blah blah", datetime: "2012-08-24 10:42:13" },
                    post20120824.104523: { username: "dick", content: "Blah blah blah", datetime: "2012-08-24 10:45:23" },
                    post20120824.104821: { username: "harry", content: "Blah blah", datetime: "2012-08-24 10:48:21" }
                }
            }
        }
    },
    forum002: {
        name: "Specific News",
        topics: {
            topic000003: {
                subject: "Whinge whine",
                date: "2012-08-24 10:12:13",
                posts: {
                    post20120824.101213: { username: "tom", content: "Blah blah", datetime: "2012-08-24 10:12:13" },
                    post20120824.101513: { username: "dick", content: "Blah blah blah", datetime: "2012-08-24 10:15:13" }
                }
            }
        }
    }
}

The basic design is a set of maps nested inside one another. I have read that this is not a reasonable approach because such a structure is difficult to query. What would be a better way to structure this data?


Solution

If you want to query a range of anything that can be sorted (like the date in your example), then that value needs to be part of the column name.

First, I would make the forum IDs the row keys, so the column family would look something like this:

*Row*: "forum001"
=> *column*: "name" - *value*: "General News"
=> *column*: "post::20120824101213::[some_uuid]" - *value*: "[serialized blob of data representing everything in the post]"

With this layout, to get all the posts from March 2012, for example, you ask for the columns in the range post::201203 to post::201204.
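As a rough illustration of that range read, here is a minimal Python sketch that imitates a column slice over one row in memory. It does not call any Cassandra client; the `slice_columns` helper, the short UUID placeholders (`a1`, `b2`, ...) and the sample values are assumptions made up for the example.

```python
from bisect import bisect_left

# In-memory stand-in for one wide row ("forum001"). Cassandra keeps the
# columns of a row sorted by name; here we sort them ourselves.
row = {
    "name": "General News",
    "post::20120229235959::a1": "{...}",
    "post::20120301000000::b2": "{...}",
    "post::20120315120000::c3": "{...}",
    "post::20120401000000::d4": "{...}",
}

def slice_columns(row, start, end):
    """Return (name, value) pairs with start <= name < end, in name order."""
    names = sorted(row)
    lo = bisect_left(names, start)
    hi = bisect_left(names, end)
    return [(name, row[name]) for name in names[lo:hi]]

# All posts from March 2012: everything from "post::201203" up to,
# but not including, "post::201204".
march_posts = slice_columns(row, "post::201203", "post::201204")
```

Because the timestamp is zero-padded and sits right after the type prefix, plain string comparison is enough to select the month.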

Something to remember is that rows are spread across your Cassandra cluster by a hash of the row key (if you keep Cassandra's default settings, which is advised), so rows with neighbouring keys do not end up next to each other. The columns of a single row all live on the same node and are stored sorted, so columns are what you use for ranges of values.
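The effect of hashing row keys can be sketched in a few lines of Python. This is a deliberate simplification: a real cluster assigns token ranges on a ring rather than taking a modulo, and the node names and the `node_for` helper are hypothetical.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical three-node cluster

def node_for(row_key):
    """Pick a node from the MD5 hash of the row key (simplified model)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest, "big") % len(NODES)]

# Neighbouring row keys may land on unrelated nodes, so a range of row
# keys is not a cheap read. Every column of one row shares that row's
# node, which is why ranges belong in the column names.
placement = {key: node_for(key) for key in ("forum001", "forum002", "forum003")}
```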

For the column name, I like to use the type of the object serialized in the column as the prefix (this way I can have many types in the same row). Then you have a few choices for how to represent the date in the column name:

  • ISO format date + a random UUID: the ISO format gives you readability for debugging and sorts correctly as a string; the appended UUID guarantees the uniqueness of the column name (otherwise you might accidentally overwrite columns in periods of high traffic).
  • TimeUUID: gives you time ordering and uniqueness in one go, but you won't be able to read the date yourself from the Cassandra console tools.
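Both naming options can be sketched with Python's standard `uuid` module. The helper names below are hypothetical, and `uuid.uuid1()` is used only as a stand-in for a TimeUUID (a version 1, time-based UUID).

```python
import uuid
from datetime import datetime

def iso_column_name(kind, when):
    """Build '<kind>::<YYYYMMDDHHMMSS>::<random uuid>' (readable, sorts as text)."""
    return f"{kind}::{when.strftime('%Y%m%d%H%M%S')}::{uuid.uuid4()}"

def timeuuid_column_name(kind):
    """Build '<kind>::<version 1 uuid>'.

    The timestamp is embedded in the UUID, but its text form does not sort
    chronologically; time ordering has to come from a TimeUUID comparator
    on the Cassandra side.
    """
    return f"{kind}::{uuid.uuid1()}"

first = iso_column_name("post", datetime(2012, 8, 24, 10, 12, 13))
same_second = iso_column_name("post", datetime(2012, 8, 24, 10, 12, 13))
later = iso_column_name("post", datetime(2012, 8, 24, 10, 15, 13))
```

Two posts written in the same second (`first` and `same_second`) still get distinct column names, so neither overwrites the other.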

You will have to use a different row key for each kind of query criterion (author, date, size, ...), so denormalize: write the same data under every row you will need to read it from.
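A minimal sketch of that denormalization, again using an in-memory dictionary rather than a real Cassandra client. The `user::` row-key prefix, the `save_post` helper and the JSON serialization are assumptions chosen for the example.

```python
import json
import uuid
from collections import defaultdict

# Hypothetical stand-in for a column family: row key -> {column name: value}.
column_family = defaultdict(dict)

def save_post(forum_id, username, timestamp, content):
    """Write the same serialized post under one row per query pattern."""
    column = f"post::{timestamp}::{uuid.uuid4()}"
    blob = json.dumps({
        "forum": forum_id,
        "username": username,
        "datetime": timestamp,
        "content": content,
    })
    column_family[forum_id][column] = blob              # read: posts in a forum
    column_family[f"user::{username}"][column] = blob   # read: posts by an author
    return column

save_post("forum001", "tom", "20120824101213", "Blah blah")
save_post("forum001", "dick", "20120824101513", "Blah blah blah")
save_post("forum002", "tom", "20120824104213", "Blah blah")

# Each query pattern is now a read of a single row, sorted by time.
toms_posts = sorted(column_family["user::tom"])
```

The cost is extra writes and duplicated data; the benefit is that every read pattern maps to one row and one column range.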

A good read (I think I have pasted this a thousand times) is this two-part article from eBay:
Cassandra Data Modeling Best Practices, Part 1
Cassandra Data Modeling Best Practices, Part 2

Licensed under: CC-BY-SA with attribution