What is the ordering for Cassandra UTF8Type?
All the documentation led me to expect a lexicographical sort order (essentially, alphabetical order). That doesn't appear to be the order Cassandra uses, and I can't work out what order it is using instead.
I built a table to count interactions affecting named "applications", organized in time buckets of one day. (This is a simple example to demonstrate the cause of my confusion.) I want to be able to look for a particular application.
The CQL description of the table is as follows:
CREATE TABLE "appMetrics" (app text,time timestamp,counter_val counter,
PRIMARY KEY (app, time)) WITH COMPACT STORAGE;
I load it with data:
update "appMetrics" set counter_val = counter_val+1 WHERE app='ab' AND time='2014-02-14 00:00:00';
update "appMetrics" set counter_val = counter_val+1 WHERE app='a' AND time='2014-02-14 00:00:00';
update "appMetrics" set counter_val = counter_val+1 WHERE app='c' AND time='2014-02-14 00:00:00';
update "appMetrics" set counter_val = counter_val+1 WHERE app='b' AND time='2014-02-14 00:00:00';
update "appMetrics" set counter_val = counter_val+1 WHERE app='bc' AND time='2014-02-14 00:00:00';
update "appMetrics" set counter_val = counter_val+1 WHERE app='ca' AND time='2014-02-14 00:00:00';
I select from the table and see this result:
select * from "appMetrics";
app | time                     | counter_val
----+--------------------------+------------
a   | 2014-02-14 00:00:00-0500 | 1
c   | 2014-02-14 00:00:00-0500 | 1
ab  | 2014-02-14 00:00:00-0500 | 1
ca  | 2014-02-14 00:00:00-0500 | 1
bc  | 2014-02-14 00:00:00-0500 | 1
b   | 2014-02-14 00:00:00-0500 | 1
(6 rows)
This order is not alphabetical, not the order of insertion, and not any other order I can recognize. It isn't random, though, or at least it is repeatable when I restrict the query by token:
cqlsh:simplex> select * from "appMetrics" where token(app) >= token('ab');
app | time                     | counter_val
----+--------------------------+------------
ab  | 2014-02-14 00:00:00-0500 | 1
ca  | 2014-02-14 00:00:00-0500 | 1
bc  | 2014-02-14 00:00:00-0500 | 1
b   | 2014-02-14 00:00:00-0500 | 1
(4 rows)
cqlsh:simplex> select * from "appMetrics" where token(app) <= token('ab');
app | time                     | counter_val
----+--------------------------+------------
a   | 2014-02-14 00:00:00-0500 | 1
c   | 2014-02-14 00:00:00-0500 | 1
ab  | 2014-02-14 00:00:00-0500 | 1
(3 rows)
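To double-check that the three result sets agree with each other, I compared them in a small script. This is only a sanity check on the output pasted above, not a claim about how Cassandra computes the order. The use of Python and the variable names are my own choices for illustration, and the lists are copied by hand from the cqlsh results.

```python
# Row order copied from the unrestricted SELECT.
full_scan = ["a", "c", "ab", "ca", "bc", "b"]

# Row order copied from the two token()-restricted queries.
at_or_after_ab = ["ab", "ca", "bc", "b"]   # token(app) >= token('ab')
at_or_before_ab = ["a", "c", "ab"]         # token(app) <= token('ab')

pivot = full_scan.index("ab")

# If the full scan is ordered by token(app), the ">=" result should be the
# tail of the scan starting at 'ab', and the "<=" result should be the head
# of the scan ending at 'ab'.
tail_matches = full_scan[pivot:] == at_or_after_ab
head_matches = full_scan[:pivot + 1] == at_or_before_ab

# For comparison: is the scan simply in alphabetical order?
is_alphabetical = full_scan == sorted(full_scan)

print(tail_matches, head_matches, is_alphabetical)
```

Both restricted queries line up with contiguous slices of the full scan, and the scan is not in alphabetical order. So the rows seem to follow whatever ordering token(app) imposes, but I still don't understand what that ordering is.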
For what it's worth, the column family is described as:
ColumnFamily: appMetrics
Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type
Default column value validator: org.apache.cassandra.db.marshal.CounterColumnType
Cells sorted by: org.apache.cassandra.db.marshal.TimestampType
GC grace seconds: 864000
Compaction min/max thresholds: 4/32
Read repair chance: 0.1
DC Local Read repair chance: 0.0
Populate IO Cache on flush: false
Replicate on write: true
Caching: KEYS_ONLY
Default time to live: 0
Bloom Filter FP chance: 0.01
Index interval: 128
Speculative Retry: 99.0PERCENTILE
Built indexes: []
Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
Compression Options:
sstable_compression: org.apache.cassandra.io.compress.LZ4Compressor
Can someone explain how these are ordered?