Question

I am trying to get the list of people from http://en.wikipedia.org/wiki/Category:People_by_occupation . I have to go through all the sections and get the people from each section.

How should I go about it? Should I use a crawler to fetch the pages and search through them with BeautifulSoup?
Or is there another way to get the same data from Wikipedia?

Solution

I would go with the Pywikipediabot Python project.

Have a look at category.py. You could use:

* tree        - show a tree of subcategories of a given category
* listify     - make a list of all of the articles that are in a category
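
To make the idea behind tree and listify concrete, here is a minimal sketch of walking a category and collecting its articles. It is not Pywikipediabot's own code: it assumes the public MediaWiki API (`action=query` with `list=categorymembers`) instead, and the names `fetch_json`, `category_members`, `walk_category` and the User-Agent string are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def fetch_json(params):
    """Send one GET request to the MediaWiki API and decode the JSON reply."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    # Placeholder User-Agent (an assumption); use one that identifies your script.
    request = urllib.request.Request(
        url, headers={"User-Agent": "category-walk-sketch/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def category_members(category, fetch=fetch_json):
    """Yield (namespace, title) for every direct member of `category`."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": "page|subcat",
        "cmlimit": "500",
        "format": "json",
    }
    while True:
        data = fetch(params)
        for member in data["query"]["categorymembers"]:
            yield member["ns"], member["title"]
        if "continue" not in data:
            break
        # Copy the continuation tokens into the next request.
        params = {**params, **data["continue"]}


def walk_category(category, max_depth=2, fetch=fetch_json):
    """Collect article titles from `category` and up to `max_depth` levels of subcategories."""
    articles = set()
    seen = {category}  # guards against category loops
    frontier = [category]
    for _ in range(max_depth + 1):
        next_frontier = []
        for cat in frontier:
            for ns, title in category_members(cat, fetch):
                if ns == 14:  # namespace 14 is "Category"
                    if title not in seen:
                        seen.add(title)
                        next_frontier.append(title)
                elif ns == 0:  # namespace 0 is the main article namespace
                    articles.add(title)
        frontier = next_frontier
    return articles
```

Calling `walk_category("Category:People by occupation", max_depth=1)` would query the live site. The tree under that category is very large and also contains list pages and other non-biography articles, so keep the depth small and filter the results.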

OTHER TIPS

If you want, you can just download the entire Wikipedia dump and work from there. The one you would probably want is the articles-only dump dated 3 Feb 2010. But beware: it's 5.6 GB in size.
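
If you go the dump route, one possible approach is to stream the XML and look for category links in each page's wikitext. The sketch below assumes the usual `<page>`, `<title>` and `<text>` elements of the export format; the function names and the file name in the usage note are placeholders. It only finds categories written directly in the page text, so categories added through templates would be missed.

```python
import re
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def pages_in_categories(xml_file, category_pattern):
    """Yield (title, matching category names) for pages with matching category links."""
    link = re.compile(r"\[\[\s*Category\s*:\s*([^\]|]+)", re.IGNORECASE)
    title, text = None, ""
    # iterparse reads the file incrementally instead of loading it whole.
    for _, elem in ET.iterparse(xml_file, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            categories = [c.strip() for c in link.findall(text)]
            hits = [c for c in categories if category_pattern.search(c)]
            if hits:
                yield title, hits
            title, text = None, ""
            elem.clear()  # drop the finished page's children to keep memory use low
```

With a compressed dump you could pass in a file opened via `bz2.open("pages-articles.xml.bz2", "rb")` (placeholder name) together with a pattern such as `re.compile(r"physicists$")`.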

You can use the CatScan tool to search categories.

Instructions here
http://meta.wikimedia.org/wiki/CatScan

Example search - note that the HTML format maxes out at 1000 results, so choose CSV export to retrieve all the results. Also, be sure to adjust the category depth and other options as needed.

The pywikipediabot already mentioned is another option.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow