Row ordering in Cassandra

https://stackoverflow.com/questions/22706658

23-06-2023
Question

I have the following column family in Cassandra 2.0.5, using the Murmur3Partitioner. In this column family I store the number of occurrences of unique hashes in a timeframe (the hashes are extracted from events occurring over time, which is not really relevant here).

My use case is to select all the hashes and their counts for a given timeframe (the hour field).

Since the amount of data can be very large, I tried to paginate by using LIMIT and continuing from the last returned hash, as in the example below. It seems to work, as the hashes appear to be returned in ascending sorted order.

Can someone explain if this really works and why? Especially since I found this link which states that the rows are...not ordered, so now that I think about it, the hashes should be returned randomly.

I did validate the procedure by counting the number of rows using the pagination approach and by using COUNT in cqlsh, but I can't really check if all the right hashes are returned due to the large amount of data.

cqlsh:db> DESCRIBE COLUMNFAMILY hashes ;
CREATE TABLE hashes (
  hour text,
  hash text,
  count counter,
  PRIMARY KEY (hour, hash)
) WITH COMPACT STORAGE AND
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND 
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='99.0PERCENTILE' AND
  memtable_flush_period_in_ms=0 AND 
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};


cqlsh:db> SELECT * FROM hashes WHERE hour = '2014032710' LIMIT 10;

 hour       | hash                                                             | count
------------+------------------------------------------------------------------+----------
 2014032710 | 000034d4b821c9af90bbf39cd803d45b25d7c14777697b8d9fc71c3a102c360f |        1
 2014032710 | 000063b39f526788dc026a07abe1bc1365652772e9c66be9a7408b16c61962fa |        2
 2014032710 | 00009c38834cedfb37bfd95355bba1a225aea6ee74f5ddc4ace820bfc33eb7a6 |        1
 2014032710 | 0000a68de59092e0326b3ceff8d9a1167c7f5ea0aac804389c259f336956e520 |        1
 2014032710 | 0000b0fed9e2f8f70e5e46f084be1872f0d1944c0e89a8850e6b7c3be17b8935 |        9
 2014032710 | 0001204a0fb29d3a8ac7164e451662069d19307ea56e014215a64cc606cf4df9 |        1
 2014032710 | 00015c165622a3c8b88d33e471d740088d9b6203dd81235d50ec129c40282229 |        1
 2014032710 | 00019ed1b3287ed808c24146d1f2e145238478b49ad3740fb58cb46bc509965a |       10
 2014032710 | 00019fa833cee60e7a1b8ed5d5c6fbef8c401a144e1537e15c9a5f65672d44fb |        1
 2014032710 | 0001df8d8319524a93ed523382a6cce8de9234211d5f3dc46bb4c530d9385150 |        1

(10 rows)

cqlsh:db> SELECT * FROM hashes WHERE hour = '2014032710' AND hash > '0001df8d8319524a93ed523382a6cce8de9234211d5f3dc46bb4c530d9385150' LIMIT 10;

 hour       | hash                                                             | count
------------+------------------------------------------------------------------+----------
 2014032710 | 000200428d93eb478c6a9ae0d9daa21fac88ca8dd4e536f60ae992dbea6155d4 |        2
 2014032710 | 00024447d8983fc0f022df4301eb69eca4ccc7cf0fc2e9361046dbaedbe830bc |        1
 2014032710 | 00025c6b3ef861fa3ef047d618f078927c9f8cf875e9b935c8e556189969bc17 |        1
 2014032710 | 00026f67e525bd11b67062e3122eb625799c6878f7812da8f23f0c8e9bd9f9d5 |        2
 2014032710 | 00028ded6dfe5d8616cc0eef559cfdf15fd51d5a36c17f2b9852785e8ca55c27 |        4
 2014032710 | 00028f8fab859c702fe0cc51db390ce7ae85ca97807a751ddf12fed57639239f |        1
 2014032710 | 0002f4046ef35e169fa79e2abf0b92212c1438487819dd8318301991ff99acac |       32
 2014032710 | 000381054a59d46c87164fcfb69952afa1e77acd71f88b25e09eab3eacc1b21a |        1
 2014032710 | 0003aca7fd2cab16a03d79fa7ac1505f144f9ba04fea87a050bef919aa628e74 |        1
 2014032710 | 0003e6a549b01cf1634c1b2844618d4e96ac00d74be30b9401b3fbbbc5bdb7e2 |        1

(10 rows)

Solution

Please read about sorted wide rows and CLUSTERING ORDER BY. Some excerpts from the CQL specification, section "Partition key and clustering columns":

In CQL, the order in which columns are defined for the PRIMARY KEY matters. The first column of the key is called the partition key. It has the property that all the rows sharing the same partition key (even across tables, in fact) are stored on the same physical node. Also, insertion/update/deletion on rows sharing the same partition key for a given table are performed atomically and in isolation. Note that it is possible to have a composite partition key, i.e. a partition key formed of multiple columns, using an extra set of parentheses to define which columns form the partition key.

The remaining columns of the PRIMARY KEY definition, if any, are called clustering columns. On a given physical node, rows for a given partition key are stored in the order induced by the clustering columns, making the retrieval of rows in that clustering order particularly efficient (see SELECT).
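
The clustering order described in this excerpt is what makes the question's "hash > last_hash ... LIMIT n" paging work. As a minimal sketch, the Python below models one partition as a sorted in-memory list instead of talking to Cassandra. The names fetch_page and iterate_partition and the shortened sample hashes are hypothetical, chosen only for illustration.

```python
from bisect import bisect_right

# Hypothetical in-memory model of ONE partition (hour = '2014032710').
# Cassandra keeps the rows of a partition sorted by the clustering column
# (hash); a sorted list of (hash, count) tuples plays that role here.
partition = sorted([
    ("0001df8d", 1),
    ("000034d4", 1),
    ("0000b0fe", 9),
    ("000063b3", 2),
    ("00009c38", 1),
])


def fetch_page(rows, last_hash, limit):
    """Mimic: SELECT * FROM hashes WHERE hour = ? AND hash > ? LIMIT ?"""
    if last_hash is None:
        start = 0  # first page: no lower bound on hash
    else:
        keys = [h for h, _ in rows]
        # Index of the first row whose hash is strictly greater than last_hash.
        start = bisect_right(keys, last_hash)
    return rows[start:start + limit]


def iterate_partition(rows, page_size):
    """Walk the whole partition page by page, resuming after the last hash seen."""
    last_hash = None
    while True:
        page = fetch_page(rows, last_hash, page_size)
        if not page:
            return
        yield from page
        last_hash = page[-1][0]


paged = list(iterate_partition(partition, page_size=2))
```

In this model every page resumes strictly after the last hash already returned, so the pages together cover the partition once, in clustering order. That matches what the question observed in cqlsh.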

OTHER TIPS

Try using the token function together with LIMIT to page through multiple rows. Because you have defined a compound primary key, rows within a partition come back in sorted order. You may also want to look at the CLUSTERING ORDER BY option when creating a column family.

hope it helps. -Vivek
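
For context on the token suggestion: token() applies to the partition key (hour in this table), so it helps when walking across partitions, while the order inside one partition comes from the clustering column. Below is a rough Python sketch of token-ordered paging over partition keys. It uses an MD5-based stand-in for the token, an assumption made only to keep the example self-contained, since the Murmur3Partitioner computes tokens differently. The names fake_token and page_partition_keys are hypothetical.

```python
import hashlib


def fake_token(partition_key):
    """Stand-in for token(hour). NOT Murmur3; for illustration only."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def page_partition_keys(keys, last_key, limit):
    """Mimic a query of the form: ... WHERE token(hour) > token(?) LIMIT ?"""
    # Partitions are ordered by token, not by the key's own value.
    ordered = sorted(keys, key=fake_token)
    if last_key is not None:
        floor = fake_token(last_key)
        ordered = [k for k in ordered if fake_token(k) > floor]
    return ordered[:limit]


hours = ["2014032708", "2014032709", "2014032710", "2014032711"]

seen = []
last = None
while True:
    page = page_partition_keys(hours, last, limit=2)
    if not page:
        break
    seen.extend(page)
    last = page[-1]
```

The partition keys come back in token order, which generally differs from the lexical order of the hour strings. This is the sense in which rows are "not ordered" across partitions, even though they are ordered within one.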

Licensed under: CC-BY-SA with attribution