質問

I am confused how pagerank algorithm work with mapreduce model.

The main confusion is that after phaseII, the val is inlinks to the key URL(not the outlinks), so how can it work in the next iteration?

See my example below:

txt:
A->B
A->C
B->A
C->B

     WORKER1              WORKER2
LOAD
      A->B                B->A  
      A->C                C->B
MAP
     (A,B)                (B,A)
     (A,C)                (C,B)
SHUFFLE AND DISTRIBUTE
     (A,[B,C])            (B,[A])
                          (C,[B])
REDUCE
     (A,(PR(A),[B,C],2))  (B,(PR(B),[A],1))
                          (C,(PR(C),[B],1))
MAP(PHASE2)
     (B,(PR(A)/2,2))      (A,(PR(B)/1,1))
     (C,(PR(A)/2,2))      (B,(PR(C)/1,1))
SHUFFLED AND DISTRIBUTE
     (A,[PR(B)/1])        (B,[PR(A)/2,PR(C)/1])
                          (C,[PR(A)/2])
RERUCE
     (A,(NEWPR(A),[B],2)) (B,(NEWPR(B),[A,C],1))
                          (C,(NEWPR(C),[A],1))

Till now, I lose the outlinks info, where is my mistake?

役に立ちましたか?

解決

You need a structure (node_id, page_rank, adjacency_list) to store the page link, PR and the adjacency list.

A good book to train your MapReduce thought is Data-Intensive Text Processingwith MapReduce. In 5.3 PAGERANK there is a detail about how to implement PageRank in MapReduce.

ライセンス: CC-BY-SA帰属
所属していません StackOverflow
scroll top