Question

While running the Pig script below, I get an error on line 4 (the WORDBYCOUNT statement) when it uses GROUP in upper case. If I change 'GROUP' to 'group' on that line, the script runs.

What is the difference between group and GROUP?

LINES = LOAD '/user/cloudera/datapeople.csv' USING PigStorage(',') AS ( firstname:chararray, lastname:chararray, address:chararray, city:chararray, state:chararray, zip:chararray );

WORDS = FOREACH LINES GENERATE FLATTEN(TOKENIZE(zip)) AS ZIPS;

WORDSGROUPED = GROUP WORDS BY ZIPS;

WORDBYCOUNT = FOREACH WORDSGROUPED GENERATE GROUP AS ZIPS, COUNT(WORDS);

WORDSSORT = ORDER WORDBYCOUNT BY $1 DESC;

DUMP WORDSSORT;

Solution

In the FOREACH, 'group' in strictly lower case is the name of the field that holds the key you grouped by.

http://squarecog.wordpress.com/2010/05/11/group-operator-in-apache-pig/ says:

When you group a relation, the result is a new relation with two columns: “group” and the name of the original relation.

Column names are case sensitive, so you have to use lower-case 'group' in your FOREACH.

'GROUP' in upper case is read as the grouping operator, not as the field name, so it cannot stand in for the key inside GENERATE. Use lower-case 'group' there.
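Applied to the script in the question, the fix might look like the sketch below. It reuses the WORDS relation from the question; the DESCRIBE step and the CNT alias are additions for illustration and are not part of the original script.

```pig
-- GROUP (any case) here is the operator that builds the grouped relation.
WORDSGROUPED = GROUP WORDS BY ZIPS;

-- Optional check: the schema should list a field named "group" (the key)
-- and a bag named WORDS (the rows that share that key).
DESCRIBE WORDSGROUPED;

-- Lower-case "group" here is the key field, not the operator.
-- CNT is a hypothetical alias added only to name the count column.
WORDBYCOUNT = FOREACH WORDSGROUPED GENERATE group AS ZIPS, COUNT(WORDS) AS CNT;
```

The rest of the script (ORDER and DUMP) can stay as it is.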

OTHER TIPS

Normally GROUP/COGROUP is used to group a relation by some key. After you group the relation, describe the grouped relation. For example:

describe grp;
grp: {group: chararray,A: {(name: chararray,session: chararray,gpa: float)}}

In the above result you can find the field "group".

If you want to perform some operation on the grouped relation (grp), you should refer to that field as "group", not GROUP.
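As a rough sketch of how a relation with that schema could be produced and then used: the input path 'student.txt', the choice of name as the grouping key, and the result alias are assumptions made for illustration only.

```pig
-- Hypothetical input file with tab-separated fields: name, session, gpa.
A = LOAD 'student.txt' AS (name:chararray, session:chararray, gpa:float);

-- Grouping by a chararray key gives a "group" field of type chararray.
grp = GROUP A BY name;
DESCRIBE grp;

-- Refer to the key as lower-case "group"; the bag keeps the name of
-- the original relation (A), so its columns are reached as A.gpa.
result = FOREACH grp GENERATE group AS name, AVG(A.gpa) AS avg_gpa;
DUMP result;
```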

Licensed under: CC-BY-SA with attribution