Question

Which is the most efficient way of crawling Wikipedia starting from a seed?

What I would like to do is start from a seed (i.e., a specific page) and then crawl the pages that are at a maximum distance of N from the seed. The crawling should be done by following the links contained in each page.

For instance, in the case N=2, I would expand to each page that is linked in the seed (distance=1) and then, for each one of these pages, expand again to the pages it links (distance=2).

A Java solution would be preferred, but a script (e.g., Python) is fine too.
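The depth-limited expansion described above is a breadth-first search over the link graph. A minimal Python sketch is below; the `get_links` callback and the page names in the toy graph are placeholders, not real Wikipedia data — against the live site it would instead fetch each page's links (for example via the MediaWiki API's `action=query&prop=links` endpoint) rather than read them from a dictionary:

```python
from collections import deque

def crawl(seed, max_depth, get_links):
    """Breadth-first traversal: collect every page within max_depth
    link-hops of the seed, recording each page's distance."""
    visited = {seed: 0}          # page -> distance from seed
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        depth = visited[page]
        if depth == max_depth:   # frontier pages are recorded but not expanded
            continue
        for link in get_links(page):
            if link not in visited:
                visited[link] = depth + 1
                queue.append(link)
    return visited

# Toy in-memory link graph standing in for real pages (hypothetical names).
graph = {
    "Seed": ["A", "B"],
    "A": ["C"],
    "B": ["Seed", "D"],
    "C": ["E"],
}

print(crawl("Seed", 2, lambda page: graph.get(page, [])))
# → {'Seed': 0, 'A': 1, 'B': 1, 'C': 2, 'D': 2}
```

BFS is the natural fit here because it visits pages in order of distance, so the `max_depth` cutoff is exact; the `visited` set also prevents re-fetching pages reachable by multiple paths, which matters on a graph as densely linked as Wikipedia. The same structure translates directly to Java with an `ArrayDeque` and a `HashMap`.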

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow