Question
What is the most efficient way of crawling Wikipedia starting from a seed?
What I would like to do is to start from a seed (i.e., a specific page) and then crawl the pages that are at a maximum distance of N from the seed. The crawling should be done by navigating the links contained in each page.
For instance, in the case N=2, I would expand to each page that is linked from the seed (distance=1) and then, for each one of these pages, expand again to the pages it links to (distance=2).
A Java solution would be preferred, but a script (e.g., Python) is fine too.
Solution
You can use the MediaWiki API to do this, specifically the links module.
The initial query would look like this:
http://en.wikipedia.org/w/api.php?action=query&prop=links&titles=Page
There are libraries for accessing the API from almost any language.
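As a minimal sketch in Python (which the question allows), the breadth-first crawl to depth N might look like the following. It assumes the standard MediaWiki API endpoint shown above and handles the `plcontinue` continuation token the links module uses for long pages; the `crawl` function takes the link fetcher as a parameter, so the traversal logic is separate from the network code.

```python
import json
import urllib.parse
import urllib.request
from collections import deque

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_links(title):
    """Fetch all outgoing links of a page via prop=links,
    following 'continue' tokens until the list is complete."""
    links, cont = [], {}
    while True:
        params = {
            "action": "query", "prop": "links", "titles": title,
            "pllimit": "max", "format": "json", **cont,
        }
        url = API_URL + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            data = json.load(resp)
        for page in data["query"]["pages"].values():
            links.extend(l["title"] for l in page.get("links", []))
        if "continue" not in data:
            return links
        cont = data["continue"]

def crawl(seed, max_depth, get_links=fetch_links):
    """Breadth-first crawl: return the set of page titles reachable
    from the seed within max_depth link hops."""
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        title, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand pages at the distance limit
        for link in get_links(title):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen
```

For example, `crawl("Albert Einstein", 2)` would return every page within two link hops of the seed. Note that link counts grow very quickly with depth, so for N > 2 you will likely want to throttle requests and persist intermediate results.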