You are on the right track. Everything in your code up to D is correct. You'll get your expected output with just a couple of changes:
D = FOREACH C GENERATE group, FLATTEN(B.($0, clickcount)), MIN(B.$0) ;
-- D should not be your expected output!
The output of C looks like:
(A, {(1, A, A1), (2, A, A2), (3, A, A3)})
(B, {(4, B, B1), (5, B, B2)})
etc.
Your FLATTEN therefore needs both the rank given in B and the clickcount field. The RANK in E is not going to do what you expect, because the data is no longer guaranteed to be in the same order it was in the file.
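One way to use the rank from B instead of re-ranking is sketched below. This is only an assumption about how the remaining step could look, not necessarily what your final script needs: it assumes D has the positional layout (group, rank, clickcount, minimum rank in the group), and that each group's rows occupy consecutive ranks in B, as in the sample output of C above. The names grp and rank_in_group are made up for illustration.

```pig
-- Hypothetical replacement for the RANK in E; output field names are illustrative.
-- Assumed positional layout of D:
--   $0 = group key, $1 = rank from B, $2 = clickcount, $3 = MIN(B.$0)
E = FOREACH D GENERATE
        $0 AS grp,
        $2 AS clickcount,
        ($1 - $3 + 1) AS rank_in_group;  -- shift each group's ranks to start at 1
```

For the sample rows, group B has ranks 4 and 5 with a minimum of 4, so those rows would get 1 and 2. Because this relies only on values carried in each row, it does not depend on the order in which Pig processes the data.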