Question

I am trying to model a Cassandra data set for a time series where I capture the sum of bytes over a minute for a given user, using a counter column. Using CQL3 I have this CREATE TABLE statement:

CREATE TABLE minute_usr (
min varchar,
usr varchar,
bytes counter,
PRIMARY KEY (min, usr)
)

I am able to update the column using this query:

UPDATE minute_usr SET bytes = bytes + 200 WHERE usr = 'testuser' AND min = '1369448220';

However, I would now like to get back the value of bytes for a range of minutes where usr = 'testuser':

select min, bytes from minute_usr WHERE usr = 'testuser' AND min >= '1369448160' and min <= '1369448220';

and I am getting back the error:

Bad Request: Only EQ and IN relation are supported on the partition key for random partitioners (unless you use the token() function)

I was under the assumption that CQL was creating a composite key out of min and usr when creating the table, allowing me to query a range of composite keys. The CLI is actually telling me that the row key is just the value of min:

[default@data_use] list minute_usr; 
Using default limit of 100 
Using default column limit of 100  
RowKey: 1369448220
=> (column=testuser:, value=, timestamp=1371066133370000)
=> (counter=testuser:bytes, value=1400)

 RowKey: 1369448160
=> (column=testuser:, value=, timestamp=1371066138506000)
=> (counter=testuser:bytes, value=1600)

Am I simply limited to doing an individual query for each timestamp / username combination with this data model?

I have also tried modeling the data with the username as the first part of the key, which allows me to do range queries:

CREATE TABLE usr_minute (
min varchar,
usr varchar,
bytes counter,
PRIMARY KEY (usr, min)
)

And then I could do a range query on the column names for minutes like so:

select bytes from usr_minute WHERE usr = 'testuser' AND min >= '1369448160' and min <= '1369448220';

However, I know that these values are now being stored in a single row, which has limitations in terms of the number of columns that can be stored, and I would like to let this data set grow forever.

Not sure how to proceed.


Solution

If you want a composite partition key, you need extra brackets:

CREATE TABLE minute_usr (
  min varchar,
  usr varchar,
  bytes counter,
  PRIMARY KEY ((min, usr))
);

However, you then can't do range queries on min or usr. Range queries are only supported on clustering columns, i.e. the part of the primary key outside the partition key.
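As a quick sketch, reusing the sample values from the question, each (min, usr) combination would then need its own equality query:

select bytes from minute_usr WHERE min = '1369448160' AND usr = 'testuser';
select bytes from minute_usr WHERE min = '1369448220' AND usr = 'testuser';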

You can use your second model, with usr as the partition key, and shard on some time bucket to stop the rows growing too large. For example, you could have a partition per user per day:

CREATE TABLE usr_day_minute (
  day varchar,
  min varchar,
  usr varchar,
  bytes counter,
  PRIMARY KEY ((usr, day), min)
);

but now you will need to do separate queries per day if you want results for multiple days. You could choose a larger time bucket, but at the cost of larger rows.
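For example (just a sketch: the 'YYYYMMDD' day format and the specific day values are my assumption, not part of the original model), fetching results that span two days means one query per (usr, day) partition:

select min, bytes from usr_day_minute
WHERE usr = 'testuser' AND day = '20130524';

select min, bytes from usr_day_minute
WHERE usr = 'testuser' AND day = '20130525'
AND min >= '1369448160' AND min <= '1369448220';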

Note you probably want to use a numeric data type or timestamp for min so that comparisons are numeric rather than lexicographic string comparisons.
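As a sketch of that change (not part of the original answer), the same table with min declared as a timestamp would be:

CREATE TABLE usr_day_minute (
  day varchar,
  min timestamp,
  usr varchar,
  bytes counter,
  PRIMARY KEY ((usr, day), min)
);

The minutes in each partition are then clustered in chronological order rather than string order.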

OTHER TIPS

I don't think you really need to worry about row size. A single partition can hold up to 2 billion cells, so you are probably okay.

If you really expect the data distribution to be bad, just add a bucket column that is populated at random.

create table network_usage (
    usr varchar,
    bucket_enum int,
    when timestamp,
    bytes counter,
    PRIMARY KEY ((usr, bucket_enum), when)
);

Now, you could derive bucket_enum from some point in time in the timestamp, but that would mean that for certain windows of time all the data lands on one node, which I'd imagine would be a concern. Instead, size bucket_enum based on roughly how widely you want the data sharded up. You can potentially just use the number of nodes in the cluster (and query for it at runtime).

Then, to update consumption information you might do something like:

# language of your choice
ts = now()
bucket = random_integer() % sharding_factor

# now in CQL
consistency one;
update network_usage set bytes = bytes + 200
    where usr = 'testuser' and bucket_enum = :bucket and when = :ts;

The bucket is really just an abuse of CQL to get the primary key sharded across the nodes in the cluster. Now we exploit that at query time. Let's say the sharding factor is 6:

# you may want a different consistency level, but since this is mostly historical data,
# one should really be enough.
consistency one;
select when, bytes from network_usage
where usr = 'testuser' AND
    bucket_enum in (0, 1, 2, 3, 4, 5) AND
    when >= :start_time and
    when < :end_time;

Sum the returned bytes values on the client to get the total usage for the window. The difference from the other approach is that you can control exactly how much data distribution there is, you can easily rebalance the data whenever you want, you don't have a hot-spot partition, and you are spreading the data and the query processing load over as many nodes in the cluster as you want. The disadvantage is that you are spreading the data and query processing load over as many nodes in the cluster as you want. ;-)

This approach is particularly helpful if you do your queries as Hadoop/Spark jobs, and it allows for completely flexible time resolution.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow