Question

I'm still stuck on my problem of trying to parse articles from Wikipedia. Specifically, I want to parse the infobox section of Wikipedia articles: my application has references to countries, and for each country page I would like to show the infobox from the corresponding Wikipedia article. I'm using PHP here, and I would greatly appreciate any code snippets or advice on what I should be doing.

Thanks again.


EDIT

Well, I have a db table with the names of countries, and I have a script that takes a country and shows its details. I would like to grab the infobox - the blue box with all the country details, images, etc. - exactly as it appears on Wikipedia and show it on my page. Alternatively, a script that just downloads the infobox information to a local or remote system that I could access later would also work. I'm open to ideas here - the end result I want is to see the infobox on my page, of course with a little "Content by Wikipedia" link at the bottom :)


EDIT

I think I found what I was looking for on http://infochimps.org - they have loads of datasets, I think in YAML format. I could use this information as it is, but I would need a way to update it from Wikipedia now and then, although I believe infoboxes rarely change, especially for countries, unless a nation decides to change its capital city or something similar.

Solution

I suggest performing a web request against Wikipedia. From there you will have the page, and you can simply parse or query out the data that you need using a regex, a character crawl, or some other approach you are familiar with. Essentially a screen scrape!
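In PHP, a minimal sketch of that scrape could use nothing but the built-in DOM extension. It assumes Wikipedia still marks the box up as a table with the infobox class and that a descriptive User-Agent header is sent (Wikipedia tends to reject anonymous default agents); treat it as a starting point rather than tested code.

    <?php
    // Hypothetical sketch: fetch a country article and pull out the infobox table.
    $country = 'Niger';

    $context = stream_context_create([
        'http' => [
            // Use your own application name and contact address here.
            'header' => "User-Agent: MyCountryApp/1.0 (contact@example.com)\r\n",
        ],
    ]);
    $html = file_get_contents(
        'https://en.wikipedia.org/wiki/' . rawurlencode($country), false, $context
    );

    // Parse the page and grab the first table whose class contains "infobox".
    $doc = new DOMDocument();
    @$doc->loadHTML($html);                       // suppress warnings from HTML5 markup
    $xpath   = new DOMXPath($doc);
    $infobox = $xpath->query('//table[contains(@class, "infobox")]')->item(0);

    if ($infobox !== null) {
        echo $doc->saveHTML($infobox);            // the infobox HTML, ready to embed
    }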

EDIT - I would add to this answer that you can use HtmlAgilityPack for those in C# land. For PHP, SimpleHtmlDom looks like the equivalent. Having said that, it looks like Wikipedia has a more than adequate API. This question probably answers your needs best:

Is there a Wikipedia API?
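If you do go the SimpleHtmlDom route instead, a hedged sketch might look like the following; it assumes the library's file_get_html() and find() helpers and, again, that the infobox carries the infobox CSS class.

    <?php
    // Hypothetical sketch using the Simple HTML DOM library (simple_html_dom.php).
    require_once 'simple_html_dom.php';

    $html = file_get_html('https://en.wikipedia.org/wiki/Niger');

    // First table whose class includes "infobox".
    $infobox = $html->find('table.infobox', 0);

    if ($infobox) {
        echo $infobox->outertext;   // raw HTML of the infobox
    }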

OTHER TIPS

I'd use the Wikipedia (MediaWiki) API. You can get data back in JSON, XML, PHP native format, and others. You'll then still need to parse the returned information to extract and format the info you want, but the infobox start, stop, and information types are clear.

Run your query with just rvsection=0, as this first section gets you the material before the first section break, including the infobox. Then you'll need to parse the infobox content, which shouldn't be too hard. See en.wikipedia.org/w/api.php for the formal Wikipedia API documentation, and www.mediawiki.org/wiki/API for the manual.

Run, for example, the query: http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xmlfm&titles=fortran&rvsection=0
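Consumed from PHP, that query might be handled roughly as follows. The sketch requests the section-0 wikitext as JSON and then cuts out the {{Infobox ...}} template by counting braces, since the template can contain nested templates; error handling and caching are omitted.

    <?php
    // Hypothetical sketch: fetch the wikitext of section 0 and cut out the infobox template.
    function extract_infobox($wikitext) {
        $start = stripos($wikitext, '{{Infobox');
        if ($start === false) {
            return null;
        }
        $depth = 0;
        for ($i = $start, $len = strlen($wikitext); $i < $len - 1; $i++) {
            $pair = substr($wikitext, $i, 2);
            if ($pair === '{{') {
                $depth++;
                $i++;                       // skip the second brace
            } elseif ($pair === '}}') {
                $depth--;
                $i++;
                if ($depth === 0) {
                    return substr($wikitext, $start, $i - $start + 1);
                }
            }
        }
        return null;                        // unbalanced braces
    }

    $url  = 'https://en.wikipedia.org/w/api.php?action=query&prop=revisions'
          . '&rvprop=content&format=json&rvsection=0&titles=Niger';
    $data = json_decode(file_get_contents($url), true);
    $page = current($data['query']['pages']);

    echo extract_infobox($page['revisions'][0]['*']);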

I suggest you use DBpedia instead, which has already done the work of turning the data in Wikipedia into usable, linkable, open forms.
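As a rough illustration, a query against DBpedia's public SPARQL endpoint (http://dbpedia.org/sparql) could look like the sketch below; the dbo:capital and dbo:populationTotal properties are assumptions drawn from the DBpedia ontology, so double-check them for the exact fields you need.

    <?php
    // Hypothetical sketch: ask DBpedia for a couple of infobox-style facts about Niger.
    $sparql = '
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?capital ?population WHERE {
            <http://dbpedia.org/resource/Niger> dbo:capital ?capital ;
                                                dbo:populationTotal ?population .
        }';

    $url = 'http://dbpedia.org/sparql?query=' . urlencode($sparql)
         . '&format=' . urlencode('application/sparql-results+json');

    $results = json_decode(file_get_contents($url), true);

    // Standard SPARQL JSON results: one binding per row.
    foreach ($results['results']['bindings'] as $row) {
        echo $row['capital']['value'], ' / ', $row['population']['value'], "\n";
    }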

It depends which route you want to take. Here are some possibilities:

  1. Install MediaWiki with appropriate modifications. It is, after all, a PHP app designed precisely to parse wikitext...
  2. Download the static HTML version, and parse out the parts you want.
  3. Use the Wikipedia API with appropriate caching.

DO NOT just hit the latest version of the live page and redo the parsing every time your app wants the box. This is a huge waste of resources for both you and Wikimedia.
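As a hedged illustration of option 3, a very simple file-based cache around the API call might look like this (the 24-hour lifetime, cache path, and function name are arbitrary choices):

    <?php
    // Hypothetical sketch: fetch section 0 of a country article, caching it for a day.
    function get_country_intro($country, $cacheDir = '/tmp/wiki-cache', $maxAge = 86400) {
        $cacheFile = $cacheDir . '/' . md5($country) . '.json';

        if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAge) {
            return file_get_contents($cacheFile);          // serve from cache
        }

        $url  = 'https://en.wikipedia.org/w/api.php?action=query&prop=revisions'
              . '&rvprop=content&format=json&rvsection=0&titles=' . urlencode($country);
        $body = file_get_contents($url);

        if ($body !== false) {
            if (!is_dir($cacheDir)) {
                mkdir($cacheDir, 0777, true);
            }
            file_put_contents($cacheFile, $body);          // refresh the cache
        }
        return $body;
    }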

There are a number of semantic data providers from which you can extract structured data instead of trying to parse it manually:

  • DbPedia - as already mentioned, provides a SPARQL endpoint which can be used for data queries. There are a number of libraries available for multiple platforms, including PHP.

  • Freebase - another Creative Commons data provider. The initial dataset is based on parsed Wikipedia data, but some information is taken from other sources. The dataset can be edited by anyone and, in contrast to Wikipedia, you can add your own data into your own namespace using a custom-defined schema. It uses its own query language called MQL, which is based on JSON. Data has WebID links back to the corresponding Wikipedia articles. Freebase also provides a number of downloadable data dumps and has a number of client libraries, including one for PHP.

  • Geonames - a database of geographical locations. Has an API which provides country and region information for given coordinates, as well as nearby locations (e.g. city, railway station, etc.); see the sketch after this list.

  • OpenStreetMap - a community-built map of the world. Has an API that allows querying for objects by location and type.

  • Wikimapia API - another location service
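For instance, the Geonames lookup mentioned above might look like this sketch. It assumes the countryInfoJSON web service, a registered GeoNames username, and the field names shown, all of which should be verified against the GeoNames documentation.

    <?php
    // Hypothetical sketch: look up basic country facts from GeoNames (NE = Niger).
    $url  = 'http://api.geonames.org/countryInfoJSON?country=NE&username=your_geonames_user';
    $data = json_decode(file_get_contents($url), true);

    if (!empty($data['geonames'][0])) {
        $country = $data['geonames'][0];
        echo $country['countryName'], ' - capital: ', $country['capital'],
             ', population: ', $country['population'], "\n";
    }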

To load the parsed first section, simply add this parameter to the end of the API URL:

rvparse

Like this: http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xmlfm&titles=fortran&rvsection=0&rvparse

Then parse the HTML to get the infobox table (using a regex):

    $url = "http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&titles=Niger&rvsection=0&rvparse";
    $data = json_decode(file_get_contents($url), true);
    $data = current($data['query']['pages']);
    $regex = '#<\s*?table\b[^>]*>(.*)</table\b[^>]*>#s';
    $code = preg_match($regex, $data["revisions"][0]['*'], $matches);
    echo($matches[0]);

If you want to parse all the articles in one go, Wikipedia makes complete database dumps available in XML format:

http://en.wikipedia.org/wiki/Wikipedia_database

Otherwise you can screen scrape individual articles, as the other answers describe.

To update this a bit: a lot of the data in Wikipedia infoboxes is now taken from Wikidata, which is a free database of structured information. See the data page for Germany for an example, and https://www.wikidata.org/wiki/Wikidata:Data_access for information on how to access the data programmatically.
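As a rough sketch of that access, the wbgetentities API call returns an item's labels and claims as JSON. Here Q183 (assumed to be the item for Germany) and P36 (assumed to be the "capital" property) are hard-coded for illustration and should be verified against Wikidata:

    <?php
    // Hypothetical sketch: fetch the Wikidata item for Germany and read its capital claim.
    $url = 'https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q183'
         . '&props=labels|claims&languages=en&format=json';

    $data   = json_decode(file_get_contents($url), true);
    $entity = $data['entities']['Q183'];

    echo $entity['labels']['en']['value'], "\n";   // the English label, e.g. "Germany"

    // P36 is assumed to be the "capital" property; its value is another item ID (e.g. Q64).
    $capitalId = $entity['claims']['P36'][0]['mainsnak']['datavalue']['value']['id'];
    echo 'Capital item: ', $capitalId, "\n";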

Licensed under: CC-BY-SA with attribution