Question
I have two sets
personCounts
(personName:chararray, count:int)
whitelist
(empID:int, empName:chararray)
What I want is the people who are in personCounts but not in whitelist. I know that JOIN returns the elements that appear in both. Is there a way to return those that would be dropped instead? I was thinking I could do it with CROSS, but then I think I would get extra rows:
crossed = CROSS personCounts, whitelist;
filcrs = FILTER crossed BY NOT personCounts::personName MATCHES whitelist::empName;
Solution 2
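You can do this with a FULL OUTER JOIN.
joined = JOIN personCounts BY personName FULL OUTER, whitelist BY empName;
-- drop rows with a null or empty personName (names that appear only in whitelist)
joined = FILTER joined BY NOT $0 MATCHES '';
-- keep rows with no matching whitelist entry ($3 is empName)
joined = FILTER joined BY $3 IS NULL;
Then each tuple in joined is (personName, count, , ), with both whitelist fields null.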
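Since only rows from personCounts are wanted, a LEFT OUTER join may be enough, which would make the first filter unnecessary. A minimal sketch, assuming the relation and field names from the question; the aliases unmatched and result are made up for illustration, and this has not been run against real data:

```pig
-- Keep every personCounts row; whitelist fields are null where no empName matches.
joined = JOIN personCounts BY personName LEFT OUTER, whitelist BY empName;

-- Rows with a null empName had no whitelist entry.
unmatched = FILTER joined BY whitelist::empName IS NULL;

-- Project back to the original personCounts columns.
result = FOREACH unmatched GENERATE personCounts::personName AS personName,
                                    personCounts::count AS count;
```

Referring to the fields by name (whitelist::empName) instead of by position ($3) should keep the script working if columns are added to either relation later.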
Other suggestions
I think what you want is the set difference between personCounts and whitelist, correct?
If so, try the following (not tested):
CGRP = COGROUP personCounts BY personName, whitelist BY empName;
PC_MINUS_WL = FILTER CGRP BY IsEmpty(whitelist);
PC_MINUS_WL = FOREACH PC_MINUS_WL GENERATE group AS name;
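The last statement above keeps only the name. If the count is needed as well, flattening the personCounts bag instead of projecting group might do it. A sketch under the same assumptions (equally untested; the aliases PC_ONLY and PC_WITH_COUNT are made up for illustration):

```pig
-- One group per distinct name, holding a bag of rows from each relation.
CGRP = COGROUP personCounts BY personName, whitelist BY empName;

-- An empty whitelist bag means the name never appears as an empName.
PC_ONLY = FILTER CGRP BY IsEmpty(whitelist);

-- Unnest the personCounts bag to recover the (personName, count) rows.
PC_WITH_COUNT = FOREACH PC_ONLY GENERATE FLATTEN(personCounts);
```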
I found the two following resources helpful:
http://agiletesting.blogspot.de/2012/02/set-operations-in-apache-pig.html