Question

I am trying to use Rails to extract data from Wikipedia, based on a search term.

For example,

1) If I have the string "American Idol", I want to pass it to Wikipedia and get a list of related articles. My goal is to take the first 3 hyperlinks and display them on the website.

2) One step further would be extracting small pieces of data from Wikipedia, say the infobox or the first few words of the article.

Any tips?

Thanks!


Solution

You don't need to resort to screen-scraping: MediaWiki has a very comprehensive API for precisely this kind of thing. See https://github.com/jpatokal/mediawiki-gateway for a handy Ruby wrapper around it.
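
To illustrate the first part of the question, here is a minimal sketch that calls the MediaWiki API directly with Ruby's standard library rather than through a wrapper gem. It uses the API's opensearch action, which responds with a JSON array of the form [query, [titles], [descriptions], [urls]]. The method names (opensearch_uri, parse_opensearch, search_wikipedia) and the User-Agent string are made up for this example; adjust them to your application.

```ruby
require "json"
require "net/http"
require "uri"

WIKIPEDIA_API = "https://en.wikipedia.org/w/api.php"

# Build the opensearch request URI for a search term.
def opensearch_uri(term, limit: 3)
  uri = URI(WIKIPEDIA_API)
  uri.query = URI.encode_www_form(
    action: "opensearch", search: term, limit: limit, format: "json"
  )
  uri
end

# Turn the opensearch response body into an array of title/url hashes.
# The response is a JSON array: [query, [titles], [descriptions], [urls]].
def parse_opensearch(body)
  _query, titles, _descriptions, urls = JSON.parse(body)
  titles.zip(urls).map { |title, url| { title: title, url: url } }
end

# Perform the request and return the parsed results.
def search_wikipedia(term, limit: 3)
  uri = opensearch_uri(term, limit: limit)
  request = Net::HTTP::Get.new(uri)
  # Placeholder User-Agent (an assumption for this sketch); use one that
  # identifies your own application.
  request["User-Agent"] = "ExampleRailsApp/0.1 (admin@example.com)"
  response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
    http.request(request)
  end
  unless response.is_a?(Net::HTTPSuccess)
    raise "Wikipedia API returned #{response.code}"
  end
  parse_opensearch(response.body)
end
```

In a Rails controller you could then call search_wikipedia("American Idol") and pass the resulting hashes to link_to in the view. Keeping URL building and parsing separate from the network call makes both easy to unit-test without hitting Wikipedia.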

Alternatively, if you're only interested in data like infoboxes, see DBpedia for the database version of Wikipedia.
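
For the "first few words of the article" case, the MediaWiki API on Wikipedia can also return a plain-text introduction through prop=extracts (provided by the TextExtracts extension). The sketch below is one possible way to use it with the standard library only; the method names (extract_uri, parse_extract, article_intro), the 30-word cutoff, and the User-Agent string are assumptions made for illustration.

```ruby
require "json"
require "net/http"
require "uri"

# Build a query URI asking for the plain-text intro of one article.
def extract_uri(title)
  uri = URI("https://en.wikipedia.org/w/api.php")
  uri.query = URI.encode_www_form(
    action: "query", prop: "extracts", exintro: 1, explaintext: 1,
    redirects: 1, titles: title, format: "json"
  )
  uri
end

# Pull the extract out of the response and trim it to max_words words.
# Returns nil when the page is missing or has no extract.
def parse_extract(body, max_words: 30)
  pages = JSON.parse(body).dig("query", "pages") || {}
  text = pages.values.first&.fetch("extract", nil)
  return nil if text.nil? || text.empty?

  words = text.split
  return words.join(" ") if words.length <= max_words

  words.first(max_words).join(" ") + "..."
end

# Fetch and parse in one step.
def article_intro(title, max_words: 30)
  uri = extract_uri(title)
  request = Net::HTTP::Get.new(uri)
  # Placeholder User-Agent (an assumption); identify your own application.
  request["User-Agent"] = "ExampleRailsApp/0.1 (admin@example.com)"
  response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
    http.request(request)
  end
  parse_extract(response.body, max_words: max_words)
end
```

Infobox fields are not exposed this way, which is where a structured source such as DBpedia is the better fit.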

OTHER TIPS

There is another gem that you can use: https://github.com/kenpratt/wikipedia-client

This gem seems to return only the first result of a search, but check its documentation to be sure.

As for the content, once you have the page, the gem lets you access the different parts of the article, such as its links and images.

You can use Mechanize and Nokogiri to do that. This is a great cheat sheet for Mechanize:

http://www.e-tobi.net/blog/files/ruby-mechanize-cheat-sheet.pdf

Mechanize is a toolbox for simulating website requests, and Nokogiri is an HTML/XML parser. With the two of them, this should be simple to implement.

Licensed under: CC-BY-SA with attribution