Question

I have two sets

personCounts 
(personName:chararray, count:int)

whitelist
(empID:int, empName:chararray)

What I want is the people who are in personCounts but not in whitelist. I know that JOIN returns the elements that appear in both. Is there a way to return those that would be dropped instead? I was thinking I could do it with CROSS, but then I think I would end up with extra rows.

crossed = CROSS personCounts, whitelist;
filcrs = FILTER crossed BY NOT personCounts::personName MATCHES whitelist::empName;

Solution

You can do this with a FULL outer JOIN.

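-- FULL outer join keeps unmatched rows from both sides, padded with nulls
joined = JOIN personCounts BY personName FULL, whitelist BY empName;
-- drop rows whose personName ($0) is null or empty, which removes whitelist-only rows
joined = FILTER joined BY NOT $0 MATCHES '';
-- keep only rows with no whitelist match, where empName ($3) is null
joined = FILTER joined BY $3 IS null;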

After both filters, each tuple in joined has the form (personName, count, , ): the whitelist fields empID and empName are null.
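
Since only the personCounts side matters here, a LEFT OUTER join may be enough and avoids the first filter. The following is a minimal sketch, not tested, assuming both relations are already loaded with the schemas from the question; the aliases unmatched and result are made up for illustration.

```pig
-- LEFT OUTER keeps every personCounts row; rows without a whitelist
-- match get null in the whitelist fields (empID, empName).
joined = JOIN personCounts BY personName LEFT OUTER, whitelist BY empName;

-- Keep only the rows that found no partner in whitelist.
unmatched = FILTER joined BY whitelist::empName IS NULL;

-- Project back to the original personCounts columns.
result = FOREACH unmatched GENERATE
    personCounts::personName AS personName,
    personCounts::count AS count;
```

If this works as intended, result should have the same schema as personCounts and contain only the people missing from whitelist.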

Other tips

I think what you want to achieve is the set difference between personCounts and whitelist, correct?

If so, try the following (not tested!!!):

CGRP = COGROUP personCounts BY personName, whitelist BY empName;
PC_MINUS_WL = FILTER CGRP BY IsEmpty(whitelist);
PC_MINUS_WL = FOREACH PC_MINUS_WL GENERATE group AS name;
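
The FOREACH above keeps only the name. If the counts are needed as well, one possible variant (also not tested) is to flatten the personCounts bag instead of projecting the group key; the aliases PC_ONLY and PC_MINUS_WL_FULL are hypothetical.

```pig
CGRP = COGROUP personCounts BY personName, whitelist BY empName;

-- Groups whose whitelist bag is empty had no matching empName.
PC_ONLY = FILTER CGRP BY IsEmpty(whitelist);

-- FLATTEN un-nests the personCounts bag, restoring one row per
-- original (personName, count) tuple.
PC_MINUS_WL_FULL = FOREACH PC_ONLY GENERATE FLATTEN(personCounts);
```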

I found the following two resources helpful:

http://agiletesting.blogspot.de/2012/02/set-operations-in-apache-pig.html

http://www.cs.tufts.edu/comp/150CPA/notes/Advanced_Pig.pdf

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow