Hadoop cartesian product of list with itself

https://stackoverflow.com/questions/17385295

02-06-2022

Question

Using Hadoop MapReduce

I have a list as input (A, B, C in the example below).

And I want to get the Cartesian product of the list with itself:

A => A,f(A,A)
A => B,f(A,B)
A => C,f(A,C)
B => A,f(B,A)
B => B,f(B,B)
B => C,f(B,C)
C => A,f(C,A)
C => B,f(C,B)
C => C,f(C,C)

f() is a function that gives a value for a pair of keys.

How do I do that in a simple manner using Hadoop MapReduce in Java?

Of course I can't hold the entire input list in memory.

Thanks!!

Answer

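You can implement this in Java MapReduce. Suppose you want the cross product of two files, A and B, with 3 and 4 splits respectively (for a list crossed with itself, A and B are the same file). You then have to write a custom input format that splits up the two datasets and creates one input split for each pair of subsets, so that every pair is handled by its own map task and no task needs the whole input in memory.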

So your splits would look like:

 A1 X B1
 A1 X B2
 A1 X B3
 A1 X B4
 A2 X B1
 A2 X B2
 A2 X B3
 A2 X B4
 A3 X B1
 A3 X B2
 A3 X B3
 A3 X B4

See https://github.com/adamjshook/mapreducepatterns/blob/master/MRDP/src/main/java/mrdp/ch5/CartesianProduct.java for a worked example.
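
To make the split pairing concrete, here is a minimal plain-Java sketch of the idea. It deliberately uses no Hadoop classes, so it is not the custom input format itself, only the logic such a format would distribute across map tasks. The class name `CartesianSplitsSketch`, the methods `chunk`, `crossChunks` and `selfProduct`, and the stand-in `f` are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;

// Plain-Java illustration only; every name here is hypothetical, not a Hadoop API.
class CartesianSplitsSketch {

    // Cut the input into consecutive chunks of at most chunkSize items.
    // Each chunk stands in for one input split.
    static List<List<String>> chunk(List<String> items, int chunkSize) {
        List<List<String>> chunks = new ArrayList<>();
        for (int i = 0; i < items.size(); i += chunkSize) {
            chunks.add(items.subList(i, Math.min(i + chunkSize, items.size())));
        }
        return chunks;
    }

    // Cross one left chunk with one right chunk.
    // In a real job this would be the work of a single map task.
    static List<String> crossChunks(List<String> left, List<String> right,
                                    BiFunction<String, String, String> f) {
        List<String> out = new ArrayList<>();
        for (String a : left) {
            for (String b : right) {
                out.add(a + " => " + b + "," + f.apply(a, b));
            }
        }
        return out;
    }

    // Pair every chunk with every chunk of the same list (A1 X A1, A1 X A2, ...)
    // and cross each pair. Here the pairs run sequentially in one process.
    static List<String> selfProduct(List<String> items, int chunkSize,
                                    BiFunction<String, String, String> f) {
        List<List<String>> chunks = chunk(items, chunkSize);
        List<String> out = new ArrayList<>();
        for (List<String> left : chunks) {
            for (List<String> right : chunks) {
                out.addAll(crossChunks(left, right, f));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for the asker's f(); it just renders the call as text.
        BiFunction<String, String, String> f = (a, b) -> "f(" + a + "," + b + ")";
        for (String line : selfProduct(Arrays.asList("A", "B", "C"), 2, f)) {
            System.out.println(line);
        }
    }
}
```

With a chunk size of 2 the list A, B, C becomes two chunks, so four chunk pairs are processed and together they emit the nine lines from the question, although grouped by chunk pair rather than in the order shown there. In an actual job each chunk pair would be one input split, and only the two chunks of that pair would need to be read by a given task.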

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow