Pergunta

I am confused how pagerank algorithm work with mapreduce model.

The main confusion is that after phaseII, the val is inlinks to the key URL(not the outlinks), so how can it work in the next iteration?

See my example below:

txt:
A->B
A->C
B->A
C->B

     WORKER1              WORKER2
LOAD
      A->B                B->A  
      A->C                C->B
MAP
     (A,B)                (B,A)
     (A,C)                (C,B)
SHUFFLE AND DISTRIBUTE
     (A,[B,C])            (B,[A])
                          (C,[B])
REDUCE
     (A,(PR(A),[B,C],2))  (B,(PR(B),[A],1))
                          (C,(PR(C),[B],1))
MAP(PHASE2)
     (B,(PR(A)/2,2))      (A,(PR(B)/1,1))
     (C,(PR(A)/2,2))      (B,(PR(C)/1,1))
SHUFFLED AND DISTRIBUTE
     (A,[PR(B)/1])        (B,[PR(A)/2,PR(C)/1])
                          (C,[PR(A)/2])
RERUCE
     (A,(NEWPR(A),[B],2)) (B,(NEWPR(B),[A,C],1))
                          (C,(NEWPR(C),[A],1))

Till now, I lose the outlinks info, where is my mistake?

Foi útil?

Solução

You need a structure (node_id, page_rank, adjacency_list) to store the page link, PR and the adjacency list.

A good book to train your MapReduce thought is Data-Intensive Text Processingwith MapReduce. In 5.3 PAGERANK there is a detail about how to implement PageRank in MapReduce.

Licenciado em: CC-BY-SA com atribuição
Não afiliado a StackOverflow
scroll top