DSE 4.0.1: hive count different than cassandra count

https://stackoverflow.com/questions/22701505

22-06-2023
|

문제

We're running Datastax Enterprise 4.0.1 and running into a very strange issue when inserting rows into Cassandra and then querying hive for the COUNT(1).

The setup: DSE 4.0.01, Cassandra 2.0, Hive, brand new cluster. Insert 10,000 rows into Cassandra and then:

cqlsh:pageviews> select count(1) from pageviews_v1 limit 100000;

 count
-------
 10000

(1 rows)

cqlsh:pageviews>

But from Hive:

hive> select count(1) from pageviews_v1 limit 100000;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201403272330_0002, Tracking URL = http://ip:50030/jobdetails.jsp?jobid=job_201403272330_0002
Kill Command = /usr/bin/dse hadoop job  -kill job_201403272330_0002
Hadoop job information for Stage-1: number of mappers: 4; number of reducers: 1
2014-03-27 23:38:22,129 Stage-1 map = 0%,  reduce = 0%
<snip>
2014-03-27 23:38:49,324 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 11.31 sec
MapReduce Total cumulative CPU time: 11 seconds 310 msec
Ended Job = job_201403272330_0002
MapReduce Jobs Launched:
Job 0: Map: 4  Reduce: 1   Cumulative CPU: 11.31 sec   HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 310 msec
OK
1723
Time taken: 38.634 seconds, Fetched: 1 row(s)

Only 1,723 rows. I'm so confused. The CQL3 ColumnFamily definition is:

CREATE TABLE pageviews_v1 (
  website text,
  date text,
  created timestamp,
  browser_id text,
  ip text,
  referer text,
  user_agent text,
  PRIMARY KEY ((website, date), created, browser_id)
) WITH CLUSTERING ORDER BY (created DESC, browser_id ASC) AND
  bloom_filter_fp_chance=0.001000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=1.000000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='NONE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'min_sstable_size': '52428800', 'class': 'SizeTieredCompactionStrategy'} AND
  compression={'chunk_length_kb': '64', 'sstable_compression': 'LZ4Compressor'};

And in Hive we have:

CREATE EXTERNAL TABLE pageviews_v1(
  website string COMMENT 'from deserializer',
  date string COMMENT 'from deserializer',
  created timestamp COMMENT 'from deserializer',
  browser_id string COMMENT 'from deserializer',
  ip string COMMENT 'from deserializer',
  referer string COMMENT 'from deserializer',
  user_agent string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.cassandra.cql3.serde.CqlColumnSerDe'
STORED BY
  'org.apache.hadoop.hive.cassandra.cql3.CqlStorageHandler'
WITH SERDEPROPERTIES (
  'serialization.format'='1',
  'cassandra.columns.mapping'='website,date,created,browser_id,ip,referer,ua')
LOCATION
  'cfs://ip/user/hive/warehouse/pageviews.db/pageviews_v1'
TBLPROPERTIES (
  'cassandra.partitioner'='org.apache.cassandra.dht.Murmur3Partitioner',
  'cassandra.ks.name'='pageviews',
  'cassandra.cf.name'='pageviews_v1',
  'auto_created'='true')

Has anyone else experienced similar?

해결책 3

The issue appears to be with CLUSTERING ORDERY BY. Removing that resolves the COUNT misreporting from Hive.

다른 팁

It's probably the consistency setting on the HIVE table as per this document.

Change the hive query to "select count(*) from pageviews_v1 ;"

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow