Question

I'm new to Pig and am trying to figure out how to get the minimum rank within a group. I want to go from the following dataset:

ID clickcounter
A   A1
A   A2
A   A3
B   B1
B   B2
C   C1
D   D1
E   E1
E   E2
E   E3
E   E4

... to the following dataset:

ID  clickcounter   Rank     minRank_of_ID
A    A1             1       1
A    A2             2       1   
A    A3             3       1
B    B1             4       4
B    B2             5       4
C    C1             6       6
D    D1             7       7
E    E1             8       8
E    E2             9       8
E    E3             10      8
E    E4             11      8

I tried the following code and it works, but I'm wondering whether there is a better solution.

A = LOAD 'datapath' using PigStorage() as (ID:chararray, clickcount:chararray);
B = rank A;
C = group B by ID;
D = foreach C generate  group, flatten($1.clickcount), MIN($1.rank_A);
E = rank D;
Dump D;

Solution

You are on the right track. In your code everything up to D is correct. You'll get your expected output with just a couple of changes:

D = FOREACH C GENERATE group, FLATTEN(B.($0, clickcount)), MIN(B.$0) ;
-- D should now be your expected output!

Since the output of C is like:

(A, {(1, A, A1), (2, A, A2), (3, A, A3)})
(B, {(4, B, B1), (5, B, B2)})
etc.

Your FLATTEN is going to need both the rank given in B and the clickcount field. The RANK in E is not going to do what you expect because the data is no longer guaranteed to be in the same order it was in the file.
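Putting the pieces above together, the whole script might look like the sketch below. It is only a sketch under a few assumptions: the aliases row_rank and minRank_of_ID are names chosen here for readability, and the final ORDER BY is an addition that is not part of the answer above. It replaces the second RANK so that the output is sorted by the original rank instead of relying on whatever order the GROUP produces.

```pig
-- 'datapath' is the placeholder path from the question.
A = LOAD 'datapath' USING PigStorage() AS (ID:chararray, clickcount:chararray);

-- RANK without a BY clause prepends a sequential field named rank_A.
B = RANK A;

-- Each group holds a bag of (rank_A, ID, clickcount) tuples.
C = GROUP B BY ID;

-- Flatten the rank together with clickcount, and attach the group's minimum rank.
-- row_rank and minRank_of_ID are illustrative aliases, not names from the answer.
D = FOREACH C GENERATE
        group AS ID,
        FLATTEN(B.(rank_A, clickcount)) AS (row_rank, clickcount),
        MIN(B.rank_A) AS minRank_of_ID;

-- Assumption: sort explicitly by the original rank, since row order after GROUP is not guaranteed.
E = ORDER D BY row_rank;
DUMP E;
```

Note that the columns come out as ID, rank, clickcount, minimum rank. If the exact column order from the question matters, a final FOREACH ... GENERATE can reorder them.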

Licensed under: CC-BY-SA with attribution