Question

We're running DataStax Enterprise 4.0.1 and experimenting with running different M/R jobs against a CF in Cassandra. We've set up the column family as follows:

CREATE TABLE pageviews (
  website text,
  date text,
  created timestamp,
  browser_id text,
  ip text,
  referer text,
  user_agent text,
  PRIMARY KEY ((website, date), created, browser_id)
) WITH bloom_filter_fp_chance=0.001000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=1.000000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='NONE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'min_sstable_size': '52428800', 'class': 'SizeTieredCompactionStrategy'} AND
  compression={'chunk_length_kb': '64', 'sstable_compression': 'LZ4Compressor'};

The benefit of Hive is that it handles the CQL3 "flattening", abstracting away Cassandra's underlying column/row storage mechanism. The downside appears to be that it doesn't use Cassandra's partition key or primary key to perform fast lookups, e.g.

SELECT COUNT(1) FROM pageviews WHERE website = "blah" AND date = "blah";

Running that MR job appears to perform a full table scan instead of pre-narrowing the keys it has to parse through. Is it possible to tell Hive not to perform a full table scan if there are obvious benefits to filtering based on partition/primary key?

Side note: When using Pig, it appears that it can and does use Cassandra's partition/primary key to perform fast lookups. The downside of Pig is that we have to do all of our filtering and flattening ourselves, which greatly increases the time it takes to create jobs.

Solution

The best bet is to use Pig with a cql:// URL and CqlStorage(), which does the heavy lifting of flattening the Cassandra data for you, e.g.

grunt> pageviews = LOAD 'cql://ks/pageviews' USING CqlStorage();
grunt> describe pageviews;
pageviews: {website: chararray,date: chararray,created: long,browser_id: chararray,ip: chararray,referer: chararray,user_agent: chararray}
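
As a rough sketch, the count from the question could then be written against that flattened relation along these lines. The keyspace name ks and the 'blah' values are the placeholders used above; the aliases filtered, grouped and counted are made up for illustration. Note the assumption: the FILTER here runs in Pig after loading, and this sketch makes no claim that it is pushed down to Cassandra's partition key.

```pig
-- Load the CQL3 table; CqlStorage() exposes the columns as named fields.
pageviews = LOAD 'cql://ks/pageviews' USING CqlStorage();

-- Keep only the rows for one (website, date) pair.
-- Assumption: this filter is applied by Pig, not by Cassandra.
filtered = FILTER pageviews BY website == 'blah' AND date == 'blah';

-- COUNT needs a bag, so collapse everything into a single group first.
grouped = GROUP filtered ALL;
counted = FOREACH grouped GENERATE COUNT(filtered);

DUMP counted;
```

Because the fields are already named after the CQL columns, no manual flattening of the underlying storage rows should be needed before filtering.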
Licensed under: CC-BY-SA with attribution