I think you have to load the data twice to achieve what you want.
i.e.
A1 = LOAD 'data' AS (a1:int,a2:int,a3:int);
A2 = LOAD 'data' AS (a1:int,a2:int,a3:int);
B = CROSS A1, A2;
Domanda
If one have data like those:
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
And then a cross-join is done on A, A:
B = CROSS A, A;
DUMP B;
(1,2,3)
(4,2,1)
Why is second A optimized out from the query?
info: pig version 0.11
== UPDATE ==
If I sort A like:
C = ORDER A BY a1;
D = CROSS A, C;
It will give a correct cross-join.
Soluzione 2
I think you have to load the data twice to achieve what you want.
i.e.
A1 = LOAD 'data' AS (a1:int,a2:int,a3:int);
A2 = LOAD 'data' AS (a1:int,a2:int,a3:int);
B = CROSS A1, A2;
Altri suggerimenti
davek is correct -- you cannot CROSS
(or JOIN
) a relation with itself. If you wish to do this, you must create a copy of the data. In this case, you can use another LOAD
statement. If you want to do this with a relation further down a pipeline, you'll need to duplicate it using FOREACH
.
I have several macros that I use frequently and IMPORT
by default in all of my Pig scripts in case I need them. One is used for just this purpose:
DEFINE DUPLICATE(in) RETURNS out
{
$out = FOREACH $in GENERATE *;
};
This will work for you wherever in your pipeline you need a duplicate:
A1 = LOAD 'data' AS (a1:int,a2:int,a3:int);
A2 = DUPLICATE(A1);
B = CROSS A1, A2;
Note that even though A1
and A2
are identical, you cannot assume that the records are in the same order. But if you are doing a CROSS
or JOIN
, this probably doesn't matter.