Question

I'm trying to use Shark on EMR, but I can't recover the partitions of a table whose location is set to an S3 bucket. I get nothing when I try to show the partitions.

shark> MSCK REPAIR TABLE logs ;
OK
Time taken: 1.79 seconds
shark> SHOW PARTITIONS logs ;
OK
Time taken: 0.073 seconds

I create my table like this:

SET hive.exec.dynamic.partition = true ;
SET hive.exec.dynamic.partition.mode = nonstrict ;

CREATE EXTERNAL TABLE IF NOT EXISTS logs (
  time STRING,
  thread STRING,
  logger STRING,
  identity STRING,
  message STRING,
  logtype STRING,
  logsubtype STRING,
  node STRING,
  storageallocationstatus STRING,
  nodelist STRING,
  userid STRING,
  nodeid STRING,
  path STRING,
  datablockid STRING,
  hash STRING,
  size STRING,
  value STRING,
  exception STRING,
  server STRING,
  app STRING,
  version STRING
)
PARTITIONED BY (
  dt STRING,
  level STRING
)
ROW FORMAT
  DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://my-log/parsed-logs/' ;

My log bucket contains one log file located in s3://my-log/parsed-logs/dt=2014-01-03/level=ERROR/.

According to the Hive language manual, MSCK REPAIR TABLE logs should be equivalent to Amazon's Hive extension ALTER TABLE logs RECOVER PARTITIONS, but when I run it in Shark I get no visible partitions. I tried the exact same thing in Hive with ALTER TABLE logs RECOVER PARTITIONS and it worked like a charm.

hive> ALTER TABLE logs RECOVER PARTITIONS ;
OK
Time taken: 0.975 seconds
hive> SHOW PARTITIONS logs ;
OK
dt=2014-01-03/level=ERROR
Time taken: 0.078 seconds, Fetched: 1 row(s)

Am I missing something here when I'm using Shark?

Solution

I spoke to AWS, and they said that my only option at the moment is to stick with Hive. MSCK REPAIR TABLE has some issues when addressing a table located in S3, which is the reason they added the ALTER TABLE RECOVER PARTITIONS command.
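
If you need to stay in Shark, one thing that might be worth trying is registering the partition explicitly with standard Hive DDL instead of relying on discovery. This is not something AWS suggested and I have not verified it on Shark; it assumes Shark passes ADD PARTITION through to the Hive metastore. The table name, partition values and S3 path are the ones from the question.

```sql
-- Assumption: Shark accepts standard Hive partition DDL.
-- Registers the single partition directory described in the question.
ALTER TABLE logs ADD IF NOT EXISTS
  PARTITION (dt = '2014-01-03', level = 'ERROR')
  LOCATION 's3://my-log/parsed-logs/dt=2014-01-03/level=ERROR/';

-- Check whether the partition is now registered.
SHOW PARTITIONS logs;
```

This needs one statement per partition directory, so it is only practical for a handful of partitions or when the statements are generated by a script.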

Licensed under: CC-BY-SA with attribution