Question

I have a file test_file_1.txt containing:

20140101,value1
20140102,value2

and file test_file_2.txt containing:

20140103,value3
20140104,value4

In HCatalog there is a table:

create table stage.partition_pk (value string)
Partitioned by(date string)
stored as orc;

These two scripts work nicely:

Script 1:

LoadFile = LOAD 'test_file_1.txt' using PigStorage(',') AS (date : chararray, wartosc : chararray);
store LoadFile into 'stage.partition_pk' using org.apache.hcatalog.pig.HCatStorer();

Script 2:

LoadFile = LOAD 'test_file_2.txt' using PigStorage(',') 
AS (date : chararray, wartosc : chararray);
store LoadFile into 'stage.partition_pk' using org.apache.hcatalog.pig.HCatStorer();

Table partition_pk contains four partitions - everything is as expected.

But let's say there is another file containing data that should be inserted into one of the existing partitions. Pig is unable to write into a partition that already contains data (or did I miss something?). How do you manage loading into existing partitions (or into non-empty non-partitioned tables)? Do you read the partition, union it with the new data, delete the partition (how?) and insert the result as a new partition?


Solution

The HCatalog documentation at https://cwiki.apache.org/confluence/display/Hive/HCatalog+UsingHCat says: "Once a partition is created records cannot be added to it, removed from it, or updated in it." So, by the nature of HCatalog, you can't add data to a partition that already has data in it.

There are JIRA issues around this that the developers are working on. Some of them were fixed in Hive 0.13:

https://issues.apache.org/jira/browse/HIVE-6405 (still unresolved) - the issue used to track the others
https://issues.apache.org/jira/browse/HIVE-6406 (resolved in 0.13) - separate table property for mutable
https://issues.apache.org/jira/browse/HIVE-6476 (still unresolved) - specific to dynamic partitioning
https://issues.apache.org/jira/browse/HIVE-6475 (resolved in 0.13) - specific to static partitioning
https://issues.apache.org/jira/browse/HIVE-6465 (still unresolved) - adds DDL support to HCatalog

Basically, it looks like if you don't want to use dynamic partitioning, then 0.13 might work for you. You just need to remember to set the appropriate property.

What I've found that works for me is to create another partition key that I call build_num. I then pass the value of this parameter via the command line and set it in the store statement. Like so:

create table stage.partition_pk (value string) Partitioned by(date string,build_num string) stored as orc;

STORE LoadFile into 'stage.partition_pk' using org.apache.hcatalog.pig.HCatStorer('build_num=${build_num}');

Just leave the build_num partition key out of your queries. I generally set build_num to the timestamp of when the job ran.
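
Put together, a run could look roughly like the sketch below. The script name load_partition.pig, the input file test_file_3.txt and the sample build_num value are made up for illustration; the -param and -useHCatalog options are standard Pig command-line options, but check them against your installed version.

```pig
-- load_partition.pig (hypothetical file name)
-- Example invocation, with build_num set to a timestamp of the run:
--   pig -useHCatalog -param build_num=20140105120000 load_partition.pig

-- test_file_3.txt is an assumed new file with rows for an already loaded date.
LoadFile = LOAD 'test_file_3.txt' USING PigStorage(',')
           AS (date : chararray, wartosc : chararray);

-- date is not listed in the HCatStorer argument, so it is taken from the data;
-- build_num is fixed for the whole run, so every run writes to new
-- (date, build_num) partitions instead of existing ones.
STORE LoadFile INTO 'stage.partition_pk'
      USING org.apache.hcatalog.pig.HCatStorer('build_num=${build_num}');
```

As long as each run gets a build_num that has not been used before for the same date, the store never targets a partition that already holds data.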

OTHER TIPS

Try using multiple partitions:

create table stage.partition_pk (value string) Partitioned by(date string, counter string) stored as orc;

Storing looks like this:

LoadFile = LOAD 'test_file_2.txt' using PigStorage(',') AS (date : chararray, wartosc : chararray);
store LoadFile into 'stage.partition_pk' using org.apache.hcatalog.pig.HCatStorer('date=20161120, counter=0');

So now you can store data into the same date partition again by increasing the counter.
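
Reading the data back does not need to mention the extra partition key. A minimal sketch, assuming HCatLoader from the same org.apache.hcatalog.pig package as the HCatStorer used above; the relation names Stored and OneDay are only illustrative:

```pig
-- Start Pig with HCatalog support, e.g.: pig -useHCatalog
Stored = LOAD 'stage.partition_pk' USING org.apache.hcatalog.pig.HCatLoader();

-- Filter on the date partition key only. counter is deliberately left out,
-- so rows stored under counter=0, counter=1, ... are all returned for that date.
OneDay = FILTER Stored BY date == '20161120';

DUMP OneDay;
```

The extra key only exists to keep each write in its own partition, so readers that care about a day as a whole can ignore it.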

Licensed under: CC-BY-SA with attribution