Question

I've tried the WebSphinx application.

I've noticed that if I put wikipedia.org as the starting URL, it doesn't crawl any further.

So how do I actually crawl the whole of Wikipedia? Can anyone give me some guidelines? Do I need to go and find those URLs myself and use multiple starting URLs?

Does anyone have suggestions for a good website with a tutorial on using WebSphinx's API?

Solution

If your goal is to crawl all of Wikipedia, you might want to look at the available database dumps. See http://download.wikimedia.org/.
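If you do go the dump route, the pages-articles XML can be streamed with a pull parser rather than loaded into memory. Here is a rough sketch using Java's standard StAX API; the file name is just a placeholder for whichever decompressed dump you download:

```java
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class DumpTitleLister {
    public static void main(String[] args) throws Exception {
        // Placeholder path: a decompressed pages-articles dump.
        String dumpPath = "enwiki-latest-pages-articles.xml";

        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader =
                factory.createXMLStreamReader(new FileInputStream(dumpPath));

        boolean inTitle = false;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && "title".equals(reader.getLocalName())) {
                inTitle = true;
            } else if (event == XMLStreamConstants.CHARACTERS && inTitle) {
                // Print each page title as it streams past.
                System.out.println(reader.getText());
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && "title".equals(reader.getLocalName())) {
                inTitle = false;
            }
        }
        reader.close();
    }
}
```

Streaming this way keeps memory flat even though the full dump is tens of gigabytes uncompressed.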

OTHER TIPS

I'm not sure, but maybe WebSphinx's user agent is blocked by Wikipedia's robots.txt:

http://en.wikipedia.org/robots.txt
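A quick way to check is to fetch that robots.txt and look for any rule block that names the crawler. A small sketch using Java 11's HttpClient (the "sphinx" substring match is only a guess at what WebSphinx's agent string might contain):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RobotsCheck {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://en.wikipedia.org/robots.txt"))
                .build();
        String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

        // Print every User-agent line plus anything that mentions the crawler by name.
        for (String line : body.split("\n")) {
            String lower = line.toLowerCase();
            if (lower.startsWith("user-agent") || lower.contains("sphinx")) {
                System.out.println(line);
            }
        }
    }
}
```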

I think you couldn't select the required configuration for that. Switch to the advanced mode, crawl the subdomain, and remove the limits on page size and crawl time.

However, WebSphinx probably can't crawl the whole of Wikipedia: it slows down as the data grows and eventually stops once around 200 MB of memory is in use. I recommend Nutch, Heritrix, or Crawler4j instead.
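Of those, Crawler4j needs only a little setup code. A minimal sketch, assuming a recent crawler4j version on the classpath (the shouldVisit signature differs slightly between versions, and the storage folder is just a placeholder):

```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class WikipediaCrawler extends WebCrawler {

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // Stay inside English Wikipedia articles only.
        return url.getURL().toLowerCase().startsWith("https://en.wikipedia.org/wiki/");
    }

    @Override
    public void visit(Page page) {
        System.out.println("Visited: " + page.getWebURL().getURL());
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl"); // placeholder: any writable folder
        config.setPolitenessDelay(1000);            // be polite to Wikipedia's servers
        config.setMaxDepthOfCrawling(-1);           // -1 = unlimited depth

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // Seed with the front page; add more seeds as needed.
        controller.addSeed("https://en.wikipedia.org/wiki/Main_Page");
        controller.start(WikipediaCrawler.class, 4); // 4 crawler threads
    }
}
```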

You probably need to start with a random article, and then crawl all articles you can get to from that starting one. When that search tree has been exhausted, start with a new random article. You could seed your searches with terms you think will lead to the most articles, or start with the featured article on the front page.

Another question: why didn't WebSphinx crawl any further? Does Wikipedia block bots that identify as 'WebSphinx'?

In addition to using the Wikipedia database dump mentioned above, you can use Wikipedia's API to run queries, for example retrieving 100 random articles:

http://www.mediawiki.org/wiki/API:Query_-Lists#random.2F_rn
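For instance, the list=random module can be called directly over HTTP. A minimal sketch using Java 11's HttpClient (the User-Agent string is a placeholder; Wikipedia asks clients to identify themselves, and rnlimit is capped by the wiki's API limits):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RandomArticles {
    public static void main(String[] args) throws Exception {
        // list=random in namespace 0 (articles), up to 100 titles per request.
        String url = "https://en.wikipedia.org/w/api.php"
                + "?action=query&list=random&rnnamespace=0&rnlimit=100&format=json";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("User-Agent", "example-crawler/0.1 (you@example.com)") // placeholder UA
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // The JSON response contains query.random[], each entry with an id and a title.
        System.out.println(response.body());
    }
}
```

Each response gives you a fresh batch of article titles to use as crawl seeds.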

Have a look at DBpedia, a structured version of Wikipedia.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow