Question

I have a Cassandra schema similar to this:

CREATE TABLE table1 (a1 text,
  a2 text,
  b1 text,
  b2 text,
  c1 text,
  c2 text,
  PRIMARY KEY ( (a1, a2), b1, b2) )
);

This is for time series data. a1 is my id and a2 is my time bucket.
QUESTION: How do I properly select an entire time bucket?

I can select by a1 or a2. But if I select by a1 AND a2 the request times out.

SELECT * FROM table1 WHERE a1 = 'id1' AND a2 = 'time_bucket37' LIMIT 100 ALLOW FILTERING;

Result: Request did not complete within rpc_timeout.

Here is my::

cqlsh:testmulti1> DESCRIBE TABLE datasimple;

CREATE TABLE datasimple (
  dc_name text,
  time_bucket text,
  partition int,
  perf_name text,
  vm_name text,
  measurement_time timestamp,
  perf_data int,
  platform text,
  PRIMARY KEY ((dc_name, time_bucket, partition), perf_name, vm_name, measurement_time)
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='99.0PERCENTILE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};

CREATE INDEX datasimple_dc_name_idx_10 ON datasimple (dc_name);

CREATE INDEX datasimple_time_bucket_idx_10 ON datasimple (time_bucket);

CREATE INDEX datasimple_perf_name_idx_10 ON datasimple (perf_name);

CREATE INDEX datasimple_vm_name_idx_10 ON datasimple (vm_name);

CREATE INDEX datasimple_platform_idx_10 ON datasimple (platform);

I am trying to:

SELECT * 
FROM DataSimple 
WHERE dc_name = 'dc1' 
AND time_bucket = '2014031922' 
AND vm_name = 'vm1' LIMIT 1000;
Was it helpful?

Solution

SELECT * 
FROM DataSimple 
WHERE dc_name = 'dc1' 
AND time_bucket = '2014031922'
AND vm_name = 'vm1'
LIMIT 1000;

That's not going to work. Your current PRIMARY KEY is:

PRIMARY KEY ((dc_name, time_bucket, partition),
  perf_name, vm_name, measurement_time)

Try to think of using a compound PRIMARY KEY in Cassandra as a hierarchical path, starting from left to right. dc_name, time_bucket, and partition are a part of your partitioning key, so you'll need to include each of those in your WHERE to ensure that Cassandra can find where your data is stored.

All of the columns after that are clustering columns (perf_name, vm_name, and measurement_time), which determine the sort order of your data on disk. Again, you can't pick and choose which of these to include. You need to search on them in the order that they are defined in your schema. So if you want to query by vm_name, then you also need to make sure that the preceding column keys (perf_name in this case) are also present in your WHERE clause.

Basically to query this schema, add ANDs to your WHERE clause for:

  • partition
  • perf_name
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top