DSE 4.0.1: hive count different than cassandra count

https://stackoverflow.com/questions/22701505

22-06-2023

Question

We're running DataStax Enterprise 4.0.1 and hitting a very strange issue: after inserting rows into Cassandra, querying Hive for COUNT(1) returns a different number than Cassandra does.

The setup: DSE 4.0.1, Cassandra 2.0, Hive, brand new cluster. Insert 10,000 rows into Cassandra and then, from cqlsh:

cqlsh:pageviews> select count(1) from pageviews_v1 limit 100000;

 count
-------
 10000

(1 rows)

cqlsh:pageviews>

But from Hive:

hive> select count(1) from pageviews_v1 limit 100000;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201403272330_0002, Tracking URL = http://ip:50030/jobdetails.jsp?jobid=job_201403272330_0002
Kill Command = /usr/bin/dse hadoop job  -kill job_201403272330_0002
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 1
2014-03-27 23:38:22,129 Stage-1 map = 0%,  reduce = 0%
<snip>
2014-03-27 23:38:49,324 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 11.31 sec
MapReduce Total cumulative CPU time: 11 seconds 310 msec
Ended Job = job_201403272330_0002
MapReduce Jobs Launched:
Job 0: Map: 4  Reduce: 1   Cumulative CPU: 11.31 sec   HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 310 msec
OK
1723
Time taken: 38.634 seconds, Fetched: 1 row(s)

Only 1,723 rows. I'm so confused. The CQL3 ColumnFamily definition is:

CREATE TABLE pageviews_v1 (
  website text,
  date text,
  created timestamp,
  browser_id text,
  ip text,
  referer text,
  user_agent text,
  PRIMARY KEY ((website, date), created, browser_id)
) WITH CLUSTERING ORDER BY (created DESC, browser_id ASC) AND
  bloom_filter_fp_chance=0.001000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=1.000000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='NONE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'min_sstable_size': '52428800', 'class': 'SizeTieredCompactionStrategy'} AND
  compression={'chunk_length_kb': '64', 'sstable_compression': 'LZ4Compressor'};

And in Hive we have:

CREATE EXTERNAL TABLE pageviews_v1(
  website string COMMENT 'from deserializer',
  date string COMMENT 'from deserializer',
  created timestamp COMMENT 'from deserializer',
  browser_id string COMMENT 'from deserializer',
  ip string COMMENT 'from deserializer',
  referer string COMMENT 'from deserializer',
  user_agent string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.cassandra.cql3.serde.CqlColumnSerDe'
STORED BY
  'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
WITH SERDEPROPERTIES (
  'serialization.format'='1',
  'cassandra.columns.mapping'='website,date,created,browser_id,ip,referer,ua')
LOCATION
  'cfs://ip/user/hive/warehouse/pageviews.db/pageviews_v1'
TBLPROPERTIES (
  'cassandra.partitioner'='org.apache.cassandra.dht.Murmur3Partitioner',
  'cassandra.ks.name'='pageviews',
  'cassandra.cf.name'='pageviews_v1',
  'auto_created'='true')

Has anyone else experienced something similar?

Solution 3

The issue appears to be with CLUSTERING ORDER BY. Removing that clause from the Cassandra table definition resolves the COUNT misreporting from Hive.
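
For illustration, here is a sketch of what the table from the question might look like without the clustering clause. It keeps the same columns and primary key; leaving every other table option at its default is an assumption made for brevity, not something confirmed above.

```sql
-- Sketch only: same columns and primary key as pageviews_v1 in the question,
-- with the CLUSTERING ORDER BY clause omitted, so both clustering columns
-- (created, browser_id) fall back to the default ascending order.
-- All other table options are left at their defaults (an assumption).
CREATE TABLE pageviews_v1 (
  website text,
  date text,
  created timestamp,
  browser_id text,
  ip text,
  referer text,
  user_agent text,
  PRIMARY KEY ((website, date), created, browser_id)
);
```

As far as I know, the clustering order of an existing table cannot be altered, so this likely means recreating the table and reloading the data. Queries that need the newest rows first should still be able to ask for them within a single partition using ORDER BY created DESC.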

OTHER TIPS

It's probably the consistency setting on the Hive table, as per this document.

Change the hive query to "select count(*) from pageviews_v1 ;"

Licensed under: CC-BY-SA with attribution