I'm trying to implement the PageRank algorithm on a set of web pages. For that I need a sample dataset of web pages and the web graph corresponding to them, which represents the links between the pages in the dataset.

I need the web graph so I can build the transition matrix and do the necessary calculations. Example:

URL1 -> URL2
URL3390 -> URL5

URLxxxx is an ID that is somehow mapped to the corresponding web page.
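For illustration, here is a minimal sketch of how an edge list in that form could feed a PageRank calculation by power iteration. The function name `pagerank`, the damping factor of 0.85, and the uniform handling of pages without outgoing links are assumptions chosen for the example, not something fixed by the question.

```python
def pagerank(edges, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration over a list of (source, target) pairs.

    Assumptions for this sketch: damping of 0.85, and rank held by
    pages with no outgoing links is spread evenly over all pages.
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank sitting on dangling pages is redistributed uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        # Each page passes an equal share of its rank along every out-link,
        # which is one column of the transition matrix applied implicitly.
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Tiny made-up graph in the "URLx -> URLy" form shown above.
edges = [("URL1", "URL2"), ("URL2", "URL3"), ("URL3", "URL1"), ("URL4", "URL1")]
ranks = pagerank(edges)
for url, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(url, round(score, 4))
```

The transition matrix is never stored explicitly here, which keeps memory use proportional to the number of links rather than the square of the number of pages.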

My question is: how or where can I get this resource? I've tried many links on the internet, but nothing really helps. I would also like it not to be very large (my internet connection is limited). If I can't get it in this form, could you give me some advice on what I should do?

Update: for people who may consider this off topic, and they may be right: sites like Software Recommendations or Computer Science don't even have corresponding tags and don't really fit this kind of question. I appreciate your help.

Solution

Maybe Site Visualizer is the tool you're looking for. The app can generate a visual sitemap.

Download and install the app (Standard or Pro version), click the Create new project toolbar button, type the URL of the website you need to crawl, and then click the Start button.

After the crawl has finished, click the Draw button on the Visual Sitemap tab. The graph of the website will be drawn as a set of pages (rectangles) and links (lines with arrows). Click a box to select that page and highlight its outbound links.

You can get a dataset of all the website's links from the All Links report (on the Reports tab). The 'From URL' and 'To URL' columns are what you need.
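As a rough sketch of the next step, the snippet below turns such a report into numeric IDs and an edge list. It assumes the report has been saved as a CSV file whose header contains 'From URL' and 'To URL'; the CSV export, the function name `read_link_report`, and the sample URLs are assumptions for illustration, not documented behaviour of the tool.

```python
import csv
import io


def read_link_report(handle, src_col="From URL", dst_col="To URL"):
    """Map each URL to an integer ID and return (url_ids, edges).

    Assumes a CSV with a header row containing the two column names;
    that export format is an assumption, not confirmed by the tool's docs.
    """
    url_ids = {}
    edges = []
    for row in csv.DictReader(handle):
        pair = []
        for col in (src_col, dst_col):
            url = row[col].strip()
            # First time a URL is seen it gets the next free integer ID.
            pair.append(url_ids.setdefault(url, len(url_ids)))
        edges.append(tuple(pair))
    return url_ids, edges


# Made-up sample standing in for an exported report.
sample = io.StringIO(
    "From URL,To URL\n"
    "http://example.com/,http://example.com/a\n"
    "http://example.com/a,http://example.com/\n"
)
url_ids, edges = read_link_report(sample)
print(url_ids)
print(edges)
```

With a real file, the `io.StringIO` sample would be replaced by something like `open("links.csv", newline="", encoding="utf-8")`, where the file name is again hypothetical.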

Besides that, you can extract a dataset of pages or links from the crawled website with your own SQL query. For instance, go to the Database tab, type the following query, and click the Execute toolbar button:

SELECT * FROM links WHERE link_type='A'

The result set will contain only A-tag links, excluding images, CSS files, JS, etc.

The program has a full-featured 30-day trial period, so you can carry out your tasks for free.

Other tips

You might try searching for datasets used in the supplementary information of PageRank papers. Here's an example. This paper: http://langvillea.people.cofc.edu/ReorderingPageRank.pdf

uses this dataset: http://www.cs.cornell.edu/Courses/cs685/2002fa/data/gr0.California, which supposedly contains 9,664 nodes and 16,773 links. The links are at the end of the file and appear to be in a connection format similar to what you're looking for.

The dataset comes from this page (which also has other datasets): http://www.cs.cornell.edu/Courses/cs685/2002fa/
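If the file turns out to be laid out as node lines of the form "n <id> <url>" followed by edge lines of the form "e <from> <to>", something like the following sketch could read it. That layout is an assumption on my part (check the first and last few lines of the file before relying on it), and the function name and sample lines are made up.

```python
def parse_node_edge_lines(lines):
    """Split a node/edge listing into an id->url map and an edge list.

    Assumed layout (not verified against the real file):
      n <id> <url>     declares a page
      e <from> <to>    declares a link between two page IDs
    Lines that do not match either pattern are ignored.
    """
    urls = {}
    edges = []
    for line in lines:
        parts = line.split()
        if len(parts) == 3 and parts[0] == "n":
            urls[int(parts[1])] = parts[2]
        elif len(parts) == 3 and parts[0] == "e":
            edges.append((int(parts[1]), int(parts[2])))
    return urls, edges


# Made-up lines in the assumed layout.
sample = [
    "n 0 http://www.example.edu/",
    "n 1 http://www.example.org/",
    "e 0 1",
    "e 1 0",
]
urls, edges = parse_node_edge_lines(sample)
print(urls)
print(edges)
```

For the downloaded file, the sample list would be replaced by an open file handle, since iterating over a text file yields its lines.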

Here are a few other pages that aggregate network datasets:

  1. http://snap.stanford.edu/data/, see particularly http://snap.stanford.edu/data/web-Stanford.html
  2. http://www.datawrangling.com/some-datasets-available-on-the-web
  3. http://networkdata.ics.uci.edu/resources.php
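As far as I know, the SNAP web graphs are plain-text edge lists: a few comment lines starting with "#", then one "FromNodeId ToNodeId" pair per line separated by whitespace. Assuming that layout, a minimal reader could look like the sketch below. The function name and the sample IDs are invented for the example.

```python
def parse_snap_edge_list(lines):
    """Read whitespace-separated 'from to' integer pairs, skipping '#' comments.

    Assumes the SNAP-style layout described above; blank lines are ignored.
    """
    edges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        src, dst = line.split()[:2]
        edges.append((int(src), int(dst)))
    return edges


# Made-up sample in the assumed layout.
sample = [
    "# FromNodeId\tToNodeId",
    "1\t6548",
    "1\t15409",
    "6548\t57031",
]
edges = parse_snap_edge_list(sample)
print(edges)
```

The node IDs in these files are already integers, so no separate URL-to-ID mapping is needed before building the transition matrix.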

Good luck!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow