Question

Which is the most efficient way of crawling Wikipedia starting from a seed?

What I would like to do is start from a seed (i.e., a specific page) and then crawl the pages that are at a maximum distance of N from the seed. The crawling should be done by following the links contained in each page.

For instance, in the case N=2, I would expand to each page that is linked in the seed (distance=1) and then, for each one of these pages, expand again to the pages it links (distance=2).

A Java solution would be preferred, but a script (e.g., Python) is fine too.
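The depth-limited expansion described above is a breadth-first search over the link graph. A minimal Python sketch is below; the `get_links` callback and the page names in the toy graph are placeholders, not real Wikipedia data — against the live site it would instead fetch each page's links (for example via the MediaWiki API's `action=query&prop=links` endpoint) rather than read them from a dictionary:

```python
from collections import deque

def crawl(seed, max_depth, get_links):
    """Breadth-first traversal: collect every page within max_depth
    link-hops of the seed, recording each page's distance."""
    visited = {seed: 0}          # page -> distance from seed
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        depth = visited[page]
        if depth == max_depth:   # frontier pages are recorded but not expanded
            continue
        for link in get_links(page):
            if link not in visited:
                visited[link] = depth + 1
                queue.append(link)
    return visited

# Toy in-memory link graph standing in for real pages (hypothetical names).
graph = {
    "Seed": ["A", "B"],
    "A": ["C"],
    "B": ["Seed", "D"],
    "C": ["E"],
}

print(crawl("Seed", 2, lambda page: graph.get(page, [])))
# → {'Seed': 0, 'A': 1, 'B': 1, 'C': 2, 'D': 2}
```

BFS is the natural fit here because it visits pages in order of distance, so the `max_depth` cutoff is exact; the `visited` set also prevents re-fetching pages reachable by multiple paths, which matters on a graph as densely linked as Wikipedia. The same structure translates directly to Java with an `ArrayDeque` and a `HashMap`.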

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow