How to get the first image of any wiki page
Question
I need to get the first (main) image of any given wiki page. I could scrape the page for it, and I am currently using curl to do so, but, perhaps because of a slow internet connection, scraping even a single wiki page takes a long time. On top of that, I need to display at least 7-8 different wiki images at the same time, depending on the user's query.
So there is no point in using curl for this. I tried the wiki API:
https://en.wikipedia.org/w/api.php?action=query&titles=India&prop=images&imlimit=1
But there are no other parameters I can pass to sort this list. The first image this API returns is usually not the main image you see at the top of the page, and sometimes it has little to do with the page's subject.
I need to display just one image for each wiki title. Thanks in advance.
Solution
It seems the images are returned in alphabetical order... weird.
Anyway, this might work better:
https://en.wikipedia.org/w/api.php?action=parse&text={{Barack_Obama}}&prop=images
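Unfortunately, only the first image is usable, but at least it's the right one. Note that text={{Barack_Obama}} asks the parser to expand the template of that name; passing page=Barack_Obama instead of the text parameter parses the article itself.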
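Reading the first entry out of that reply could look like the PHP sketch below. It assumes format=json is added to the request and that the decoded reply has the shape {"parse": {"images": [...]}} with bare file names; the helper name firstParsedImage and the sample data are made up for illustration.

```php
<?php
// Hypothetical helper: return the first file name listed by
// action=parse&prop=images, or null when the list is missing or empty.
function firstParsedImage(array $response): ?string
{
    $images = $response['parse']['images'] ?? null;
    if (!is_array($images) || count($images) === 0) {
        return null;
    }
    return (string) reset($images);
}

// Sample reply in the assumed shape; the file names are invented.
$sample = json_decode(
    '{"parse":{"title":"Example","images":["First.jpg","Second.png"]}}',
    true
);
echo firstParsedImage($sample), "\n";
```

The result is only a file name, so a further lookup would still be needed to turn it into an image URL.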
Other tips
For what is often a very good guess at the "main image", use prop=pageimages, provided by the MediaWiki extension "PageImages":
The PageImages extension collects information about images used on a page.
Its aim is to return the single most appropriate thumbnail associated with an article, attempting to return only meaningful images, e.g. not those from maintenance templates, stubs or flag icons. Currently it uses the first non-meaningless image used in the page.
(Text is cc-by-sa 3.0; list of authors)
Usage
To quote from the MediaWiki API documentation:
Returns information about images on the page, such as thumbnail and presence of photos.
Parameters:
  piprop: Which information to return:
    thumbnail: URL and dimensions of image associated with page, if any.
    name: Image title.
    Values (separate with "|"): thumbnail, name
    Default: thumbnail|name
  pithumbsize: Maximum thumbnail dimension. Default: 50
  pilimit: Properties of how many pages to return. No more than 50 (100 for bots) allowed. Default: 1
  picontinue: When more results are available, use this to continue.
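Since the question asks for seven or eight thumbnails at once, these parameters can be combined into a single request. The PHP sketch below is one way to build such a URL. It assumes that several titles are joined with "|", that JSON output is requested with format=json, and that pilimit is raised to cover every title; the function name pageImagesUrl is hypothetical.

```php
<?php
// Hypothetical helper: build one prop=pageimages request for several titles.
function pageImagesUrl(array $titles, int $thumbSize = 300): string
{
    $params = array(
        'action'      => 'query',
        'format'      => 'json',                 // assumption: JSON output is wanted
        'prop'        => 'pageimages',
        'piprop'      => 'thumbnail|name',
        'pithumbsize' => $thumbSize,
        // One result per title, capped at the documented maximum of 50.
        'pilimit'     => max(1, min(count($titles), 50)),
        'titles'      => implode('|', $titles),  // "|" separates multiple titles
    );
    // http_build_query() percent-encodes "|" and turns spaces into "+".
    return 'https://en.wikipedia.org/w/api.php?' . http_build_query($params, '', '&');
}

echo pageImagesUrl(array('India', 'Barack Obama', 'Cats')), "\n";
```

One request for all titles avoids the per-page round trips that made scraping slow in the first place.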
Example
https://en.wikipedia.org/w/api.php?action=query&titles=India&prop=pageimages&pithumbsize=300&format=json
Return value:
{
    "query": {
        "pages": {
            "14533": {
                "pageid": 14533,
                "ns": 0,
                "title": "India",
                "thumbnail": {
                    "source": "https://upload.wikimedia.org/wikipedia/commons/thumb/b/b8/Political_map_of_India_EN.svg/256px-Political_map_of_India_EN.svg.png",
                    "width": 256,
                    "height": 300
                },
                "pageimage": "Political_map_of_India_EN.svg"
            }
        }
    }
}
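Pulling the thumbnail URLs out of a reply shaped like the return value above could look like this in PHP. This is only a sketch: the helper name thumbnailsByTitle is hypothetical, and it assumes that a page without a usable image simply lacks the "thumbnail" key, so such pages are skipped.

```php
<?php
// Hypothetical helper: map each page title to its thumbnail URL.
function thumbnailsByTitle(array $response): array
{
    $thumbs = array();
    foreach ($response['query']['pages'] ?? array() as $page) {
        // Skip pages for which no thumbnail was returned.
        if (isset($page['title'], $page['thumbnail']['source'])) {
            $thumbs[$page['title']] = $page['thumbnail']['source'];
        }
    }
    return $thumbs;
}

// Compact copy of the reply shown above.
$json = '{"query":{"pages":{"14533":{"pageid":14533,"ns":0,"title":"India",'
      . '"thumbnail":{"source":"https://upload.wikimedia.org/wikipedia/commons/thumb/b/b8/Political_map_of_India_EN.svg/256px-Political_map_of_India_EN.svg.png",'
      . '"width":256,"height":300},"pageimage":"Political_map_of_India_EN.svg"}}}}';
print_r(thumbnailsByTitle(json_decode($json, true)));
```

Because the result is keyed by title, the same function works unchanged when the request carries several titles.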
Further examples:
api.php?action=query&titles=India&prop=images
This gives you the full list of all images, sorted alphabetically. On the regular (non-API) page you can instead retrieve the first image in document order. Combining both will probably get you the most out of it:
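$topic = 'India';
// Wiki titles use underscores for spaces; rawurlencode() escapes the rest.
$url = sprintf('https://en.wikipedia.org/wiki/%s', rawurlencode(str_replace(' ', '_', $topic)));
$options = array(
    'http' => array(
        'user_agent' => 'Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.10',
    )
);
$context = stream_context_create($options);
libxml_set_streams_context($context);
// Real-world HTML triggers parser warnings; collect them instead of printing.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTMLFile($url);
$xp = new DOMXPath($doc);
// First <img> in document order whose class attribute is exactly "thumbimage".
$result = $xp->query('(//img[@class = "thumbimage"])[1]');
$image = ($result && $result->length) ? $result->item(0) : NULL;
// saveXML(NULL) would dump the whole document, so only print on a match.
if ($image !== NULL) {
    echo $doc->saveXML($image), "\n";
}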
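A cruder alternative is to fetch the page and collect every <img> tag with a regular expression:
$wikipage = file_get_contents('https://en.wikipedia.org/wiki/Cats');
// $matches[0] holds every <img ...> tag in document order.
preg_match_all('/<img[^<]+?>/', $wikipage, $matches);
Typically the main image will be the second match ($matches[0][1]), after the padlock icon (http://upload.wikimedia.org/wikipedia/commons/thumb/f/fc/Padlock-silver.svg/20px-Padlock-silver.svg.png).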
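To turn the matched tags into an image URL, a small PHP helper could pick the src of the n-th <img>. This is a sketch under assumptions: nthImageSrc is a hypothetical name, the sample markup is invented, and the pattern only handles double-quoted src attributes.

```php
<?php
// Hypothetical helper: return the src of the n-th <img> tag (0-based), or null.
function nthImageSrc(string $html, int $n): ?string
{
    // Assumes src="..." with double quotes; $matches[1] holds the captured URLs.
    preg_match_all('/<img\b[^>]*?\bsrc="([^"]*)"/i', $html, $matches);
    return $matches[1][$n] ?? null;
}

// Invented markup standing in for a fetched page:
// a padlock icon first, the main image second.
$html = '<img alt="" src="//example.org/padlock.png"><p>text</p>'
      . '<img src="//example.org/main.jpg" width="220">';
echo nthImageSrc($html, 1), "\n";
```

Relying on a fixed position is fragile, since the number of icons ahead of the main image can differ from page to page.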