Can Hadoop's Shuffle/Sort (or partition) phase be customized to perform a graph traversal?

https://stackoverflow.com/questions/16702223

30-05-2022
|

문제

I am still learning about the MapReduce framework, specifically implemented by Hadoop, and I wonder if it can be modified to perform the following task:

The Map() function will emit (key,value) pairs, whose keys are arrays of size 2, say int[2]. I would like every pair containing any of the two integers in common to be mapped to the same reducer.

So for example, if Map() emits: ([2,3],4),([2,4],5),([6,5],2),([5,7],1), then Reduce1 should receive the first two pairs, and Reduce2 the second two (first two sharing 2, second two sharing 5). This can be viewed as a connected component problem, where the vertices are the integers in the int[], and edges are shared between any two integers in the same int[].

해결책

A change in algorithm and you can probably achieve this - but you're going to need emit each edge twice

For each edge you currently output, you should output them for both vertex IDs, amend the outputted value to include the other edge, the weight and optionally a direction (if edge direction is important to your algorithm).

So instead of this:

([2,3],4)
([2,4],5)
([6,5],2)
([5,7],1)

Output this (S denotes the key was the source, D denotes the key was the destination):

(2, [3, 4, S])
(3, [2, 4, D])
(2, [4, 5, S])
(4, [2, 5, D])
(6, [5, 2, S])
(5, [6, 2, D])
(5, [7, 1, S])
(7, [5, 1, D])

Now in your reducer, you'll be grouping by the vertex ID, and will be able to iterate other the tuples containing the other vertex ID, a weight, and the direction:

(2, [3, 4, S])
(2, [4, 5, S])

(3, [2, 4, D])

(4, [2, 5, D])

(5, [6, 2, D])
(5, [7, 1, S])

(6, [5, 2, S])

(7, [5, 1, D])

You still need to be aware that each edge could potentially be processed twice, especially if the edges exist in both directions between two vertices.

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow