Question

I am working with Apache Pig and Mahout, currently on Mahout's frequent pattern growth (FP-Growth) implementation. I have data in the following format:

    user items
    1     i1
    1     i2
    1     i3
    2     i2
    2     i5
    2     i6
    3     i1
    3     i4

--load the data

data = LOAD '$input' AS (user,item);

And then I grouped my data by user

grpdata = GROUP data BY user;

and I get

1 {(1,i1),(1,i2),(1,i3)}
2 {(2,i2),(2,i5),(2,i6)}
3 {(3,i1),(3,i4)}

Here is my question: how can I change the bag created by the grouping into the following format?

1 i1,i2,i3
2 i2,i5,i6
3 i1,i4

Solution

You can obtain just the field you are interested in by using bag projection:

proj = FOREACH grpdata GENERATE group, data.item;

This will give you

1 {(i1),(i2),(i3)}
2 {(i2),(i5),(i6)}
3 {(i1),(i4)}

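Unfortunately, older Pig releases have no built-in way to control how a bag is serialized into a string, so there you will need to write a UDF that does that piece for you. Newer releases include the BagToString built-in, which joins the elements of a bag with a delimiter.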
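
If your Pig version ships the BagToString built-in (to my knowledge Pig 0.11 and later), a UDF may not be needed. The following is only a sketch under that assumption. It also assumes the input is tab-delimited, and '$output' is a hypothetical parameter that does not appear in the question:

```pig
-- Sketch only: assumes a Pig release where BagToString is a built-in.
data    = LOAD '$input' AS (user:chararray, item:chararray);
grpdata = GROUP data BY user;

-- data.item projects the item field out of each group's bag;
-- BagToString joins the projected values using the given delimiter.
result  = FOREACH grpdata GENERATE group AS user, BagToString(data.item, ',') AS items;

-- '$output' is a hypothetical parameter. PigStorage's default tab delimiter
-- separates the user from the comma-joined items.
STORE result INTO '$output';
```

Each stored line should then hold the user, a tab, and the comma-separated items. Note that Pig does not guarantee the order of elements inside a bag, so the items may not come out in input order; for frequent pattern mining that usually does not matter.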

OTHER TIPS

To obtain:

(i1,i2,i3)
(i2,i5,i6)
(i1,i4)

you can do this:

res = foreach grpdata generate FLATTEN(BagToTuple($1.item));
Licensed under: CC-BY-SA with attribution