Question

I need to load the data for a certain partition (date) in Pig. This data was created in Hive and is partitioned on date, so I want to load it in Pig via HCatalog.

The HCatalog documentation says that to load a certain partition in Pig, you first load the whole dataset and then filter on it, i.e.:

a = load 'web_logs' using org.apache.hcatalog.pig.HCatLoader();
b = filter a by datestamp > '20110924';

https://cwiki.apache.org/confluence/display/Hive/HCatalog+LoadStore

But I am afraid this first loads the whole dataset into a, and only then filters it in b. Am I correct?

In Hive (without HCat) you can target just the partition you want, i.e.:

LOAD DATA  INPATH 'filepath' INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

What is the equivalent of this construct in Pig with HCatalog?

Thanks!

Solution

I see two parts to your question.

Part 1) But I am afraid this first loads the whole dataset into a, and only then filters it in b. Am I correct?

Ans 1) No. When a filter on the partition column comes immediately after the load statement, Pig pushes it down into HCatLoader, so only the partitions matched by the filter are read, not the whole table.
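
For the single-date case in the question, a minimal sketch might look like this. It assumes the table is web_logs and the partition column is datestamp, stored as a string, as in the documentation's example; adjust both names to your own table.

```pig
-- Assumption: table 'web_logs' with string partition column 'datestamp'.
a = load 'web_logs' using org.apache.hcatalog.pig.HCatLoader();
-- Keep the partition filter directly after the load so it can be pushed down.
b = filter a by datestamp == '20110924';
dump b;
```

Run the script with pig -useHCatalog so the HCatalog jars are on the classpath.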

Part 2) LOAD DATA INPATH 'filepath' INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

What is the equivalent of this construct in Pig with HCatalog?

Ans 2) LOAD DATA ... INTO TABLE writes data into a partition, so the Pig equivalent is a store through HCatStorer with the partition spec as its argument: store a into 'tablename' using org.apache.hcatalog.pig.HCatStorer('partcol1=val1,partcol2=val2');

For example: store a into 'tablename' using org.apache.hcatalog.pig.HCatStorer('datestamp=20110924');
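
If the relation already carries the partition column as a field, the HCatalog LoadStore documentation describes a second form: call HCatStorer with no arguments and let it take the partition value from the data. A rough sketch, assuming a source table web_logs and a hypothetical target table web_logs_copy that already exists in Hive with the same columns and is partitioned on datestamp:

```pig
-- Newer Hive releases ship these classes as org.apache.hive.hcatalog.pig.*
a = load 'web_logs' using org.apache.hcatalog.pig.HCatLoader();
b = filter a by datestamp == '20110924';
-- 'web_logs_copy' is a hypothetical table name; datestamp must be a field of b.
store b into 'web_logs_copy' using org.apache.hcatalog.pig.HCatStorer();
```

As with loading, start Pig with pig -useHCatalog so the HCatalog classes can be found.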

Please drop a comment if you have any doubts.

Thanks

Other tips

The documentation states that if the load (using HCatLoader()) is immediately followed by a filter on the partition column, the loader reads only the specified partitions, as opposed to loading the entire dataset and then filtering out records.

From the book "Programming Pig":

"HCatalog includes the load function HCatLoader. The location string for HCatLoader is the name of the table. It implements LoadMetadata, so you do not need to specify the schema as part of your load statement; Pig will get it from HCatLoader. Also, because it implements this interface, Pig can work with HCatalog’s partitioning. If you place the filter statement that describes which partitions you want to read immediately after the load, Pig will push that into the load so that HCatalog returns only the relevant partitions. "

The book is very good, and currently offered as open source material here: http://chimera.labs.oreilly.com/books/1234000001811/ch12.html#cassandra

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow