Question

I am trying to find a good way to organize my row-keys to perform range scans on them without creating my own index lists.

I have a MySQL installation with currently about 15,000 databases of ~50 tables each, i.e. roughly 750,000 tables. Because 99% of the data is always read via a unique identifier, that data is planned to move into a Cassandra cluster.

For some maintenance cases (listing the contents of a complete table, removing a complete table, or dropping a database) I need to read the contents of an entire table or even an entire database. Range scans seem to be the perfect fit for that.

Currently I am planning to generate UUIDs for each part of the old structure and put them together separated by a | (DB + Table + Id = UUID1|UUID2|UUID3).

Example:

07424eaa-4761-11e1-ac67-12313c033ac4|0619a6ec-4525-11e1-906e-12313c033ac4|0619a6ec-4795-12e9-906e-78313c033ac4

The CF with the data should be sorted with org.apache.cassandra.db.marshal.AsciiType.

As client I am using phpcassa.

For the range scans I want to use the database UUID followed by | as the start key, and as the end of the range the same key with chr(255) or z appended. The byte values of both characters are larger than those of any character that can follow at that position in the keys.
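To sanity-check that ordering trick, here is a minimal Python sketch (with hypothetical UUIDs) — assuming keys are compared byte-wise, as AsciiType does:

```python
# Sketch: verify that start = "<db-uuid>|" and end = "<db-uuid>|" + chr(255)
# bound exactly the keys belonging to one database. UUIDs are hypothetical.
db1 = "07424eaa-4761-11e1-ac67-12313c033ac4"
db2 = "99999999-9999-9999-9999-999999999999"
tbl = "0619a6ec-4525-11e1-906e-12313c033ac4"
rid1 = "0619a6ec-4795-12e9-906e-78313c033ac4"
rid2 = "ffffffff-ffff-ffff-ffff-ffffffffffff"

keys = sorted([
    f"{db1}|{tbl}|{rid1}",
    f"{db1}|{tbl}|{rid2}",
    f"{db2}|{tbl}|{rid1}",
])

start = db1 + "|"            # inclusive lower bound of the scan
end = db1 + "|" + chr(255)   # chr(255) sorts after hex digits, '-' and '|'

in_range = [k for k in keys if start <= k <= end]
# in_range contains exactly the two db1 keys, and no db2 key
```

Note that chr(255) is safe here precisely because every character that can appear after the first | in a real key (hex digits, hyphens, and | itself) has a smaller byte value.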

Is this a solid approach for achieving the goals described above with range scans?

Solution

Cassandra best practice is to use the RandomPartitioner - this gives you 'free' load balancing, as long as your tokens are evenly distributed. Unfortunately, with the random partitioner, row range queries (i.e. get_range_slices) return keys in a random order.

This is fine for paging through the entire column family (and if that is what you want to do, then your approach will work). But if you just want to page through a smaller, contiguous range of row keys, it will not work.

One option to solve this is to use wide rows and composite columns. For example, a column family which looks like this:

{ 
  row1 -> {column1: value1, column2: value2},
  row2 -> {column3: value3, column4: value4},
  ... 
}

Would be transposed to look like this:

{
  row1-10 -> {
              [row1, column1]: value1, [row1, column2]: value2,
              [row2, column3]: value3, [row2, column4]: value4,
              ...
             }
  ...
}

And you can do a range query by doing a column slice (get_slice) on the right row, between the right columns, i.e.

get_range_slice(start=row1, end=row2)

becomes:

get_slice(row=row1-10, start=[row1, null], end=[row2, null])

Note the null second dimension on the column keys.
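The transposition can be simulated in plain Python — an illustrative in-memory model, not a Cassandra client. Composite column keys sort as tuples, and a slice is just a contiguous range over that sorted order:

```python
import bisect

# In-memory model of one wide row ("row1-10"): composite column keys
# (old_row, old_column) kept sorted, as Cassandra stores them on disk.
wide_row = {
    "row1-10": sorted([
        (("row1", "column1"), "value1"),
        (("row1", "column2"), "value2"),
        (("row2", "column3"), "value3"),
        (("row2", "column4"), "value4"),
    ])
}

def get_slice(row, start, end):
    """Sketch of a composite-column slice: all columns whose first
    dimension lies between start and end, inclusive of both bounds."""
    cols = wide_row[row]
    keys = [k for k, _ in cols]
    lo = bisect.bisect_left(keys, (start,))        # open second dimension
    hi = bisect.bisect_left(keys, (end + "\xff",)) # past the end bucket
    return cols[lo:hi]

# get_range_slice(start=row1, end=row2) becomes:
result = get_slice("row1-10", "row1", "row2")  # all four columns
```

The open (null/empty) second dimension on the bounds is what lets one slice span every column of the selected first-dimension range.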

The trick is to pick your row ('bucket') keys such that your columns won't grow too large (overly wide rows perform badly in Cassandra), but such that your queries won't need to read too many rows. This will depend on your average query size and the distribution of your UUIDs, but a good candidate might be to use UUID1 as the row key and [UUID2, UUID3] as the first two dimensions of the column keys.
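Applied to the question's scheme, that mapping makes "list the contents of one old table" a column slice whose first dimension is pinned to the table UUID. A hypothetical in-memory sketch (UUIDs and values invented for illustration):

```python
import bisect

# Hypothetical mapping of the old DB|Table|Id structure onto wide rows:
# row key = db UUID (UUID1), composite column key = (table UUID, id UUID).
db = "07424eaa-4761-11e1-ac67-12313c033ac4"
t1 = "0619a6ec-4525-11e1-906e-12313c033ac4"
t2 = "1111aaaa-4525-11e1-906e-12313c033ac4"

store = {db: sorted([
    ((t1, "0619a6ec-4795-12e9-906e-78313c033ac4"), "payload-a"),
    ((t1, "ffffffff-ffff-ffff-ffff-ffffffffffff"), "payload-b"),
    ((t2, "0619a6ec-4795-12e9-906e-78313c033ac4"), "payload-c"),
])}

def table_contents(db_uuid, table_uuid):
    """All (id, value) pairs of one old table: a column slice with the
    first dimension fixed to table_uuid (a sketch, not the phpcassa API)."""
    cols = store[db_uuid]
    keys = [k for k, _ in cols]
    lo = bisect.bisect_left(keys, (table_uuid,))
    hi = bisect.bisect_left(keys, (table_uuid + "\xff",))
    return [(k[1], v) for k, v in cols[lo:hi]]

rows = table_contents(db, t1)  # only table t1's two entries
```

Dropping a whole table is then a batched delete over that same slice, and dropping a database is a single row delete.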

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow