Question

I'm looking to create a table to store information containing details on the URL parameters of visitors to a bunch of websites. I'm getting a dictionary of parameters and values per user_id-session_id pair for each site, each with a timestamp.

In summary: Each site has Users (who have sessions) that have parameters that have values at times.

I'm going to want to query this data like so...

For each website, given a daterange, get the customer_id of the customers with a certain parameter value.

p.s. I don't readily have a list of user id/session id so I can't use those as row keys without doing a 'get_range()' query. Which may be expensive.

Attempted solutions/thoughts...

I figured that dynamic columns would be a good idea here as we don't really know all the different sets of parameters and I'd rather not make an attempt at hard coding them in. That gives me the name of the parameter as the column header and the value of the cells as the parameter value. That's all well and good but it leaves me with composite row keys of (site, date, customer_id, session_id).

I'm all for big composite row keys but I don't think I (would love to be corrected) could query the data to get ranges of dates in this way. If so, how can I do that please?

Also thought about having site-datetime composite column headers as that allows me to filter for those properties nicely. Then I could have the row keys as (param - param value) composite keys and build a set of user_ids as values. Actually, that's sounding like my best bet at the moment I think, but it still seems fairly messy.

If anybody has got any ideas, I'd love to hear them!

Thanks,

Matt

p.p.s I'm using Pycassa, CQL3 and Cassandra 1.2 so have all the tools available to me that those platforms offer

Was it helpful?

Solution

@Kali

All problem has proper solution

For query, we have:

  • site as value

We would like to search by date range

CREATE TABLE my_index
(
   site text, // row key, also known as partition key
   date timestamp, // search by date range
   user_id bigint, // user id
   session_id uuid, // session id
   parameters map<text,text> // pairs of param key/param value
   PRIMARY KEY (site,date,user_id,session_id)
)

For each site, we have one row because site = row key

The column name is a composite of (date,*user_id*,*session_id*). Date is the first component so that you get query by date.

*user_id* and *session_id* are there to guarantee UNICITY, meaning that for each timestamp, you can store parameters for more than 1 couple of user_id/session_id

Now, to store parameters, use the MAP feature introduced by CQL3. Parameter names are keys of the map and can be stored as text. For parameter values, since they can be of any type (long, int, string, date ...), it is better to JSON-serialize them as text so you can store them together in the map.

This solution require a little serialize/deserialize before persisting/readign from Cassandra

Example of query:

Get all users,session and parameters for website www.stackoverflow.com between '2013-09-03' and '2013-09-04'

SELECT user_id,session_id,parameters FROM my_index 
WHERE site = 'www.stackoverflow.com' AND date>=1378166400 
AND date<=1378252800

Get all parameters for website www.stackoverflow.com at date '2013-09-03' and user_id =10

SELECT parameters FROM my_index 
WHERE site = 'www.stackoverflow.com' 
AND date=1378166400 AND user_id=10

OTHER TIPS

cassandra data modeling is based on query based approach.

For cassandra or noSQL data modelling beginners, there is a process involved in data modelling your application, like

1- Understand your data, design a concept diagram
2- List all your quires in detail
3- Map your queries using defined rules and patterns, best suitable for cassandra
4- Create a logical design, table with fields derived from queries
5- Now create a schema and test its acceptance.

if we model it well, then it is easy to handle issues such as new complex queries, data over loading, data consistency setc.

After taking this free online data modelling training, you will get more clarity

https://academy.datastax.com/courses/ds220-data-modeling

Good Luck!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top