Question

I am trying to process some log files on a bucket in amazon s3.

I create the table :

CREATE EXTERNAL TABLE apiReleaseData2 (
messageId string, hostName string, timestamp string, macAddress string, apiKey string,
userAccountId string, userAccountEmail string, numFiles string)
ROW FORMAT
SERDE 'com.amazon.elasticmapreduce.JsonSerde'
WITH SERDEPROPERTIES ('paths'='messageId, hostName, timestamp, macAddress, apiKey, userAccountId, userAccountEmail, numFiles')
LOCATION 's3://apireleasecandidate1/regression/transferstatistics/2013/12/31/';

Then I run my HiveQL query and get the desired output in the file without any issues. My directories are set up in the following manner:

s3://apireleasecandidate1/regression/transferstatistics/2013/12/31/ < All the log files for this day >

What I want to do is specify the LOCATION only up to 's3://apireleasecandidate1/regression/transferstatistics/' and then call the

ALTER TABLE <Table Name> ADD PARTITION (<path>) 

statement or the

ALTER TABLE <Table Name> RECOVER PARTITIONS;

statement to access the files in the subdirectories. But when I do this, there is no data in my table.

I tried the following:

CREATE EXTERNAL TABLE apiReleaseDataUsingPartitions (
messageId string, hostName string, timestamp string, macAddress string, apiKey string,
userAccountId string, userAccountEmail string, numFiles string)
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT
SERDE 'com.amazon.elasticmapreduce.JsonSerde'
WITH SERDEPROPERTIES ('paths'='messageId, hostName, timestamp, macAddress, apiKey, userAccountId, userAccountEmail, numFiles')
LOCATION 's3://apireleasecandidate1/regression/transferstatistics/';

and then I run the following ALTER command:

ALTER TABLE apiReleaseDataUsingPartitions ADD PARTITION (year='2013', month='12', day='31');

But running a SELECT statement on the table returns no results.

Can someone please tell me what I am doing wrong? Am I missing something important?

Cheers, Tanzeel

Was it helpful?

Solution

In HDFS anyway, the partitions manifest in a key/value format like this:

hdfs://apireleasecandidate1/regression/transferstatistics/year=2013/month=12/day=31

I can't vouch for S3 but an easy way to check would be to write some data into a dummy partition and see where it creates the file.
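As a sketch of that check (the table name `some_existing_table` is a placeholder for any table that already has at least one row):

```sql
-- Create a throwaway partitioned table with the default layout.
CREATE TABLE dummyPartitionTest (val STRING)
PARTITIONED BY (year STRING, month STRING, day STRING);

-- Write one row into a dummy partition.
INSERT OVERWRITE TABLE dummyPartitionTest
PARTITION (year='2013', month='12', day='31')
SELECT 'test' FROM some_existing_table LIMIT 1;

-- Then inspect the table's storage location: the file should land under
-- .../dummyPartitionTest/year=2013/month=12/day=31/
```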

ADD PARTITION supports an optional LOCATION parameter, so you might be able to deal with this by saying

ALTER TABLE apiReleaseDataUsingPartitions ADD PARTITION (year='2013', month='12', day='31') LOCATION 's3://apireleasecandidate1/regression/transferstatistics/2013/12/31/';
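If the partition registers correctly, it should show up in SHOW PARTITIONS, and a SELECT restricted to that partition should return rows:

```sql
-- Should list the partition as year=2013/month=12/day=31
SHOW PARTITIONS apiReleaseDataUsingPartitions;

-- Restricting on the partition columns reads only that S3 prefix
SELECT * FROM apiReleaseDataUsingPartitions
WHERE year = '2013' AND month = '12' AND day = '31'
LIMIT 10;
```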

Again I've not dealt with S3 but would be interested to hear if this works for you.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow