Question

I have read the explanation at http://en.wikipedia.org/wiki/PageRank and I understand that PageRank is calculated from incoming and outgoing links.

I have a crawler which crawls web pages and stores them in a database, and I need a PageRank algorithm. My database has the following values:

Title 
url
content_html
outgoing_links (links to external domains)
internal_links (links on the same domain as the url)

Can you please explain whether I need any other values to compute the PageRank, and how to compute it in Java?


Solution

You have a few options. If you want to do it all yourself, then duffymo's solution is perfect, but if you want to use an existing library I would suggest something like JUNG for graphs.

I'm not sure if you're familiar with graphs, but they can be used to store the structure of the links, and many graph libraries include PageRank. Depending on how you want to do it, JUNG is a good in-memory solution, but if you need persistent database storage then loading your data into Neo4j would work (you would have to install Gremlin to do the PageRank).
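To make the "store the structure of the links" idea concrete, here is a minimal sketch in plain Java (no library). It assumes each crawled row gives you the page URL plus the actual list of link URLs found on it, i.e. outgoing_links and internal_links combined and already normalized; that is my assumption about the schema, and the class and method names are made up for illustration. The point is that PageRank needs to know which page links to which, and only links to pages you have actually crawled can become edges.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: turns crawled rows into a directed link graph.
final class LinkGraphSketch {

    // linksByUrl: crawled page URL -> every link URL found on that page.
    // Returns: page URL -> set of crawled pages it links to.
    static Map<String, Set<String>> build(Map<String, List<String>> linksByUrl) {
        Map<String, Set<String>> graph = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> row : linksByUrl.entrySet()) {
            String source = row.getKey();
            Set<String> targets = new LinkedHashSet<>(); // a set, so duplicate links count once
            for (String target : row.getValue()) {
                // Keep only edges to pages that were crawled, and skip self-links.
                if (linksByUrl.containsKey(target) && !target.equals(source)) {
                    targets.add(target);
                }
            }
            graph.put(source, targets);
        }
        return graph;
    }

    public static void main(String[] args) {
        Map<String, List<String>> rows = new LinkedHashMap<>();
        rows.put("http://a.example/",
                List.of("http://b.example/", "http://a.example/", "http://uncrawled.example/"));
        rows.put("http://b.example/",
                List.of("http://a.example/", "http://a.example/"));
        System.out.println(build(rows));
    }
}
```

Whether to drop self-links and count duplicate links only once is a design choice made here for simplicity, not something PageRank requires.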

The above are Java solutions, but if you want to do it yourself (and, like me, don't like dry research papers) then I would highly suggest the book Programming Collective Intelligence. It goes through (chapter 4, I think) creating a search engine from scratch that includes PageRank and neural networks to monitor clicks. The only problem, based on your requirements above, is that the book is written in Python, but you can easily apply the logic to Java. If you know a bit of Python already, you can even download the book's source code for free and check out the software (but the source code contains no explanation of the math behind it).

Hope that helps

Other answers

PageRank is, at its heart, a linear algebra eigenvalue problem:

http://www.rose-hulman.edu/~bryan/googleFinalVersionFixed.pdf

If you don't know linear algebra or eigenvalue problems, or aren't willing to read this paper, it's unlikely that you'll be able to tackle this problem. As Einstein said, "Make the problem as simple as possible, but no simpler..."

The paper's title is old; it refers to Google's market cap circa 2004. It's up to $211B this morning.

The technology hasn't stood still in all that time. Google continues to tweak the algorithm in proprietary ways. But this paper explains the heart of it.
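For what it's worth, the eigenvector in question is usually approximated by power iteration: start with equal rank everywhere and repeatedly push each page's rank along its outgoing links until the numbers settle. The following is only a minimal sketch of that idea in Java, not Google's algorithm. The class name, the Map-based graph representation, the damping value of 0.85 and the fixed iteration count are all assumptions chosen for illustration, and pages with no outgoing links are handled by spreading their rank evenly over all pages, which is one common convention.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of PageRank by power iteration.
final class PageRankSketch {

    // graph: page -> set of pages it links to. Every target must also be a key.
    static Map<String, Double> compute(Map<String, Set<String>> graph,
                                       double damping, int iterations) {
        int n = graph.size();
        Map<String, Double> rank = new LinkedHashMap<>();
        for (String page : graph.keySet()) {
            rank.put(page, 1.0 / n); // start from a uniform distribution
        }
        for (int i = 0; i < iterations; i++) {
            // Rank held by pages without outgoing links is spread over all pages.
            double danglingMass = 0.0;
            for (Map.Entry<String, Set<String>> e : graph.entrySet()) {
                if (e.getValue().isEmpty()) {
                    danglingMass += rank.get(e.getKey());
                }
            }
            double base = (1.0 - damping) / n + damping * danglingMass / n;
            Map<String, Double> next = new LinkedHashMap<>();
            for (String page : graph.keySet()) {
                next.put(page, base);
            }
            // Each page passes an equal share of its damped rank to every page it links to.
            for (Map.Entry<String, Set<String>> e : graph.entrySet()) {
                Set<String> targets = e.getValue();
                if (targets.isEmpty()) {
                    continue;
                }
                double share = damping * rank.get(e.getKey()) / targets.size();
                for (String target : targets) {
                    next.merge(target, share, Double::sum);
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> graph = new LinkedHashMap<>();
        graph.put("A", Set.of("B", "C"));
        graph.put("B", Set.of("C"));
        graph.put("C", Set.of("A"));
        compute(graph, 0.85, 50)
                .forEach((page, score) -> System.out.println(page + " " + score));
    }
}
```

The ranks always sum to 1, so they can be read as the probability that a random surfer is on each page. A real implementation would stop when the change between iterations drops below a tolerance rather than after a fixed number of rounds.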

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow