문제

I am confused how pagerank algorithm work with mapreduce model.

The main confusion is that after phaseII, the val is inlinks to the key URL(not the outlinks), so how can it work in the next iteration?

See my example below:

txt:
A->B
A->C
B->A
C->B

     WORKER1              WORKER2
LOAD
      A->B                B->A  
      A->C                C->B
MAP
     (A,B)                (B,A)
     (A,C)                (C,B)
SHUFFLE AND DISTRIBUTE
     (A,[B,C])            (B,[A])
                          (C,[B])
REDUCE
     (A,(PR(A),[B,C],2))  (B,(PR(B),[A],1))
                          (C,(PR(C),[B],1))
MAP(PHASE2)
     (B,(PR(A)/2,2))      (A,(PR(B)/1,1))
     (C,(PR(A)/2,2))      (B,(PR(C)/1,1))
SHUFFLED AND DISTRIBUTE
     (A,[PR(B)/1])        (B,[PR(A)/2,PR(C)/1])
                          (C,[PR(A)/2])
RERUCE
     (A,(NEWPR(A),[B],2)) (B,(NEWPR(B),[A,C],1))
                          (C,(NEWPR(C),[A],1))

Till now, I lose the outlinks info, where is my mistake?

도움이 되었습니까?

해결책

You need a structure (node_id, page_rank, adjacency_list) to store the page link, PR and the adjacency list.

A good book to train your MapReduce thought is Data-Intensive Text Processingwith MapReduce. In 5.3 PAGERANK there is a detail about how to implement PageRank in MapReduce.

라이센스 : CC-BY-SA ~와 함께 속성
제휴하지 않습니다 StackOverflow
scroll top