Pergunta

Let's say I JOIN two relations like:

-- part looks like:
-- 1,5.3
-- 2,4.9
-- 3,4.9

-- original looks like:
-- 1,Anju,3.6,IT,A,1.6,0.3
-- 2,Remya,3.3,EEE,B,1.6,0.3
-- 3,akhila,3.3,IT,C,1.3,0.3

jnd = JOIN part BY $0, original BY $0;

The output will be:

1,5.3,1,Anju,3.6,IT,A,1.6,0.3
2,4.9,2,Remya,3.3,EEE,B,1.6,0.3
3,4.9,3,akhila,3.3,IT,C,1.3,0.3

Notice that $0 is shown twice in each tuple. EG:

1,5.3,1,Anju,3.6,IT,A,1.6,0.3
^     ^
|-----|

I can remove the duplicate key manually by doing:

jnd = foreach jnd generate $0,$1,$3,$4 ..;

Is there a way to remove this dynamically? Like remove(the duplicate key joiner).

Foi útil?

Solução

Have faced the same kind of issue while working on Data Set Joining and other data processing techniques where in output the column names get repeated.

So was working on UDF which will remove the duplicates column by using schema name of that field and retaining the first unique column occurrence data.

Pre-Requisite:

Name of all the fields should be present

You need to download this UDF file and make it jar so as to use it.

UDF file location from GitHub : GitHub UDF Java File Location

We will take the above question as example.

--Data Set A contains this data
-- 1,5.3
-- 2,4.9
-- 3,4.9

--Data Set B contains this data
-- 1,Anju,3.6,IT,A,1.6,0.3
-- 2,Remya,3.3,EEE,B,1.6,0.3
-- 3,Akhila,3.3,IT,C,1.3,0.3

PIG Script:

REGISTER /home/user/
DSA = LOAD '/home/user/DSALOC' AS (ROLLNO:int,CGPA:float);
DSB = LOAD '/home/user/DSBLOC' AS (ROLLNO:int,NAME:chararray,SUB1:float,BRANCH:chararray,GRADE:chararray,SUB2:float);
JOINOP = JOIN DSA BY ROLLNO,DSB BY ROLLNO;

We will get column name after joining as DSA::ROLLNO:int,DSA::CGPA:float,DSB::ROLLNO:int,DSB::NAME:chararray,DSB::SUB1:float,DSB::BRANCH:chararray,DSB::GRADE:chararray,DSB::SUB2:float

For making it to DSA::ROLLNO:int,DSA::CGPA:float,DSB::NAME:chararray,DSB::SUB1:float,DSB::BRANCH:chararray,DSB::GRADE:chararray,DSB::SUB2:float

DSB::ROLLNO:int is removed.

We need to use the UDF as

JOINOP_NODUPLICATES = FOREACH JOINOP GENERATE FLATTEN(org.imagine.REMOVEDUPLICATECOLUMNS(*));

Where org.imagine.REMOVEDUPLICATECOLUMNS is the UDF.

This UDF removes duplicate columns by using Name in schema.So DSA::ROLLNO:int is retained and DSB::ROLLNO:int is removed from the dataset.

Licenciado em: CC-BY-SA com atribuição
Não afiliado a StackOverflow
scroll top