Getting info from Wikipedia - how do I get HTML form?

https://stackoverflow.com/questions/853450

21-08-2019
|

Question

I'm using curl to retrieve information from wikipedia. So far I've been successful in retrieving basic text information but I really would want to retrieve it in HTML.

Here is my code:

$s = curl_init();       

$url = 'http://boss.yahooapis.com/ysearch/web/v1/site:en.wikipedia.org+'.$article_name.'?appid=myID';
curl_setopt($s,CURLOPT_URL, $url);
curl_setopt($s,CURLOPT_HEADER,false);
curl_setopt($s,CURLOPT_RETURNTRANSFER,1);

$rs = curl_exec($s);

$rs = Zend_Json::decode($rs);

$rs = ($rs['ysearchresponse']['resultset_web']);

$rs = array_shift($rs);
$article= str_replace('http://en.wikipedia.org/wiki/', '', $rs['url']);

$url = 'http://en.wikipedia.org/w/api.php?';
$url.='format=json';
$url.=sprintf('&action=query&titles=%s&rvprop=content&prop=revisions&redirects=1', $article);

curl_setopt($s,CURLOPT_URL, $url);
curl_setopt($s,CURLOPT_HEADER,false);
curl_setopt($s,CURLOPT_RETURNTRANSFER,1);

$rs = curl_exec($s);
//curl_close( $s );
$rs = Zend_Json::decode($rs);

$rs = array_pop(array_pop(array_pop($rs)));
$rs = array_shift($rs['revisions']);
$articleText = $rs['*'];

However the text retrieved this way isnt well enough to be displayed :( its all in this kind of format

'''Aix-les-Bains''' is a [[Communes of France|commune]] in the [[Savoie]] [[Departments of France|department]] in the [[Rhône-Alpes]] [[regions of France|region]] in southeastern [[France]].

It lies near the [[Lac du Bourget]], {{convert|9|km|mi|abbr=on}} by rail north of [[Chambéry]].

==History== ''Aix'' derives from [[Latin]] ''Aquae'' (literally, "waters"; ''cf'' [[Aix-la-Chapelle]] (Aachen) or [[Aix-en-Provence]]), and Aix was a bath during the [[Roman Empire]], even before it was renamed ''Aquae Gratianae'' to commemorate the [[Emperor Gratian]], who was assassinated not far away, in [[Lyon]], in [[383]]. Numerous Roman remains survive. [[Image:IMG 0109 Lake Promenade.jpg|thumb|left|Lac du Bourget Promenade]]

How do I get the HTML of the wikipedia article?

UPDATE: Thanks but I'm kinda new to this here and right now I'm trying to run an xpath query [albeit for the first time] and can't seem to get any results. I actually need to know a couple of things here.

How do I request just a part of an article?
How do I get the HTML of the article requested.

I went through this url on data mining from wikipedia - it put an idea to make a second request to wikipedia api with the retrieved wikipedia text as parameters and that would retrieve the html - although it hasn't seemed to work so far :( - I don't want to just grab the whole article as a mess of html and dump it. Basically my application what it does is that you have some locations and cities pin pointed on the map - you click on the city marker and it would request via ajax details of the city to be shown in an adjacent div. This information I wish to get from wikipedia dynamically. I'll worry about about dealing with articles that don't exist for a particular city later on just need to make sure its working at this point.

Does anyone know of a nice working example that does what I'm looking for i.e. read and parse through selected portions of a wikipedia article.

According to the url provided - it says I should post the wikitext to the wikipedia api location for it to return parsed html. The issue is that if I post the information I get no response and instead an error that I'm denied access - however if I try to include the wikitext as GET it parses with no issue. But it fails of course when I have waaaaay too much text to parse.

Is this a problem with the wikipedia api? Because I've been hacking at it for two days now with no luck at all :(

Solution

The simplest solution would probably be to grab the page itself (e.g. http://en.wikipedia.org/wiki/Combination ) and then extract the content of <div id="content">, potentially with an xpath query.

OTHER TIPS

There is a PEAR Wiki Filter that I have used and it does a very decent job.

Text Wiki

Phil

Try looking at the printable version of the desired Wikipedia article in question.

In other words, change this line of your source code:

$url.=sprintf('&action=query&titles=%s&rvprop=content&prop=revisions&redirects=1', $article);

to something like:

$url.=sprintf('&action=query&titles=%s&printable=yes&redirects=1', $article);

Disclaimer: Have not tested, and this is just a guess at how your API might work.

As far as I understand it, the Wikipedia software converts the Wiki markup into HTML when the page is requested. So using your current method, you'll need to deal with the results.

A good place to start is the Mediawiki API. You can also use http://pear.php.net/package/Text_Wiki to format the results retrieved via cURL.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow