PageRank implementation for Research

Question 1

That is a great article to learn about pagerank. I implemented a Perl version from it here to use with Textrank. However, if you want to just learn about pagerank and how the various aspects discussed in the article affect the results (dampening factor, direct or undirected graph, etc.), I would recommend running experiments in R or Octave. If you want to learn how to implement it efficiently, then programming it up from scratch, as you are doing, is best.

Most web graphs (or networks) are very sparse, which means most of the entries in the matrix representation of the graph are zero. A common data structure used to represent a sparse matrix is a hash-map, where the zero values are not stored. For example, if the matrix was

1, 0, 0
0, 0, 2,
0, 3, 0

a two dimension hash-map would store only the values for hm(0,0)=1, hm(1,2)=2, and hm(2,1)=3. So in a 1,000,000 by 1,000,000 matrix of a web graph, I would expect only a few million values to be non-zero. If each row averages only 5 non-zero values, a hash-map will use about 5*(8+8+8)*10^6 bytes ~ 115mb to store it (8 for the left int index, 8 for the right int index, and 8 for the double value). The square matrix will use 8*10^6*10^6 ~ 7 terabytes.

Implementing an efficient sparse matrix-vector multiply in Java is not trivial, and there are some already implemented if you don't want to devote time to that aspect of the algorithm. The sparse-matrix multiply is the most difficult aspect to implement of the pagerank algorithm, so after that it gets easier (and more interesting).

Question 2

Python networkx module has a nice implementation of pagerank. It uses scipy/numpy for the matrix implementation. The below two questions on stackoverflow should be enough to get you started.

Question 3

A few suggestions:

Use python, not Java: python is an excellent prototyping language, and has available sparse matrices (in scipy) as well as many other goodies. As others have noted, it also has a pagerank implementation.
Store your data not all in memory: any type of lightweight database would be fine, for instance sqlite, hibernate, ...
Work on tiles of the data: if there is a big matrix NxN, break it up into small tiles MxM where M is a fraction of N, that fit in memory. Combined with sparse matrices this allows you to work with really big N (hundreds of millions to billions, depending on how sparse the data is).

Question 4

As Dan W suggested, try to increase the heap size. If you run your Java application from the command line, just add the switch -Xmx with the desired heap size. Let's assume you compiled your Java code into a runnable JAR file called pagerank.jar, and you want to set your heap size to 512 MB, you would issue the following command:

java -jar -Xmx512m pagerank.jar

EDIT: But that only works if you don't have that many "pages" ... A 1 Million x 1 Million array is too big to fit into your RAM (1 trillion times * 64 bit double value = 7.27595761 terabytes). You should change your algorithm to load chunks of data from the disk, manipulate it, and store it back to disk.

You could use a graph database like Neo4j for that purpose.

Question 5

You don't have to store the whole 1000000x1000000 matrix, because most matrix entries will be zero. Instead, you can (for example) store a list of nonzero entries for each row, and write your matrix functions to use it directly, without expanding it into a full matrix.

This kind of compressed representation is called a sparse matrix format, and most matrix libraries have an option to build and work with sparse matrices.

One disadvantage with sparse matrices is that multiplying two of them will result in a matrix which is much less sparse. However, the PageRank algorithm is designed so that you don't need to do that: the hyperlink matrix is constant, and only the score vector is updated.

Question 6

PageRank is performed by Google using the 'Pregel' BSP (really just keywords) framework.

I remembered Apache Giraph (another Pregel), which includes a version of PageRank in its benchmark package.

Here's a video about Giraph: it's an introduction, and it specifically talks about handling PageRank.

If that doesn't work:

In Java there is an implementation of Pregel called GoldenOrb.

Pseudo code for the PageRank algorithm is here (on a different implementation of Pregel).

You'll have to read around BSP, and PageRank to handle the size of data you have.

Question 7

Because the matrix is sparse you can implement dimensionality reduction like svd,pca,mds or Lsi that includes svd. There is a library to implement this kind of processes which is called Jama. You can find it here