Self cross-join in pig is disregarded

https://stackoverflow.com/questions/15256621

18-03-2022
Question

Suppose you have data like this:

A = LOAD 'data' AS (a1:int,a2:int,a3:int);

DUMP A;
(1,2,3)
(4,2,1)

And then you cross-join A with itself:

B = CROSS A, A;

DUMP B;
(1,2,3)
(4,2,1)

Why is the second A optimized out of the query?

Info: Pig version 0.11

== UPDATE ==

If I sort A first and cross the original with the sorted copy:

C = ORDER A BY a1;
D = CROSS A, C;

This gives a correct cross-join.

Solution 2

I think you have to load the data twice to achieve what you want.

For example:

A1 = LOAD 'data' AS (a1:int,a2:int,a3:int);
A2 = LOAD 'data' AS (a1:int,a2:int,a3:int);
B = CROSS A1, A2;
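
As a rough sketch of what to expect, assuming 'data' holds only the two rows shown in the question, the cross-join of the two loads should produce four six-field tuples. The order of the output is not guaranteed:

```pig
A1 = LOAD 'data' AS (a1:int,a2:int,a3:int);
A2 = LOAD 'data' AS (a1:int,a2:int,a3:int);
B = CROSS A1, A2;

DUMP B;
-- expected tuples, in some order (2 rows x 2 rows = 4 tuples):
-- (1,2,3,1,2,3)
-- (1,2,3,4,2,1)
-- (4,2,1,1,2,3)
-- (4,2,1,4,2,1)
```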

Other suggestions

davek is correct -- you cannot CROSS (or JOIN) a relation with itself. If you wish to do this, you must create a copy of the data. In this case, you can use another LOAD statement. If you want to do this with a relation further down a pipeline, you'll need to duplicate it using FOREACH.

I have several macros that I use frequently and IMPORT by default in all of my Pig scripts in case I need them. One is used for just this purpose:

DEFINE DUPLICATE(in) RETURNS out
{
        $out = FOREACH $in GENERATE *;
};

This will work for you wherever in your pipeline you need a duplicate:

A1 = LOAD 'data' AS (a1:int,a2:int,a3:int);
A2 = DUPLICATE(A1);
B = CROSS A1, A2;

Note that even though A1 and A2 are identical, you cannot assume that the records are in the same order. But if you are doing a CROSS or JOIN, this probably doesn't matter.
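
If you would rather not define a macro, the same duplication can be written inline. This is a minimal sketch, assuming the schema from the question; the names left_a1 and right_a1 are made up for illustration. After the CROSS, both sides carry fields with the same names, so the :: operator is used to say which side a field comes from:

```pig
A1 = LOAD 'data' AS (a1:int,a2:int,a3:int);
-- copy the relation so that CROSS sees two distinct aliases
A2 = FOREACH A1 GENERATE *;
B = CROSS A1, A2;

-- both sides have a field called a1, so qualify it with the source alias
C = FOREACH B GENERATE A1::a1 AS left_a1, A2::a1 AS right_a1;
DESCRIBE C;
```

Qualifying the fields this way also applies when you JOIN the two copies instead of crossing them.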

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow