How to get the first image of any wiki page

https://stackoverflow.com/questions/10248475

02-06-2021
Question

I need to get the first (main) image of any given wiki page. I could use a scraping tool for this, and I am currently using curl to scrape the page. However, perhaps because of a slow internet connection, scraping even a single wiki page takes a long time. On top of that, I need to display at least 7-8 different wiki images at the same time, depending on the user's query.

So there is no point in using curl for this. I tried the wiki API:

https://en.wikipedia.org/w/api.php?action=query&titles=India&prop=images&imlimit=1

But there are no other parameters I can pass to sort this list. The first image this API returns is usually not the main image you see at the top of the page, and sometimes it has little to do with the page's subject.

I need to display just one image for each wiki title. Thanks in advance.

Solution

It seems the images are returned in alphabetical order, which is odd.

Anyway, this might work better:

https://en.wikipedia.org/w/api.php?action=parse&text={{Barack_Obama}}&prop=images

Unfortunately, only the first image is usable, but at least it's the right one.

Other tips

For what is often a very good guess at the "main image", use prop=pageimages, provided by the MediaWiki extension "PageImages":

The PageImages extension collects information about images used on a page.

Its aim is to return the single most appropriate thumbnail associated with an article, attempting to return only meaningful images, e.g. not those from maintenance templates, stubs or flag icons. Currently it uses the first non-meaningless image used in the page.

(Text is CC BY-SA 3.0; list of authors)

Usage

To quote from the MediaWiki API documentation:

Returns information about images on the page, such as thumbnail and
presence of photos.
Parameters:

piprop
    Which information to return:

    thumbnail
        URL and dimensions of image associated with page, if any.
    name
        Image title.

    Values (separate with "|"): thumbnail, name
    Default: thumbnail|name

pithumbsize
    Maximum thumbnail dimension. 
    Default: 50

pilimit
    Properties of how many pages to return. 
    No more than 50 (100 for bots) allowed.
    Default: 1

picontinue
    When more results are available, use this to continue.
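Since the question asks for 7-8 images at once, the parameters above can be combined into a single request. The following is a minimal sketch, assuming several titles are passed in `titles` separated by `|` and that JSON output is requested with `format=json`. `buildPageImagesUrl` is a hypothetical helper name, not part of any library, and `pilimit` is set explicitly because the quoted documentation lists a default of 1.

```php
<?php
// Hypothetical helper: builds one prop=pageimages request for several titles.
function buildPageImagesUrl(array $titles, int $thumbSize = 300): string
{
    $params = [
        'action'      => 'query',
        'format'      => 'json',
        'prop'        => 'pageimages',
        'piprop'      => 'thumbnail|name',
        'pithumbsize' => $thumbSize,
        // Ask for one result per title; the quoted docs cap this at 50.
        'pilimit'     => max(1, min(count($titles), 50)),
        'titles'      => implode('|', $titles),
    ];
    // http_build_query() percent-encodes the "|" separators and any spaces.
    return 'https://en.wikipedia.org/w/api.php?' . http_build_query($params);
}

echo buildPageImagesUrl(['India', 'Barack Obama']), "\n";
```

Batching the titles this way replaces 7-8 page fetches with one API call, which is the main reason to prefer it over scraping.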

Example

https://en.wikipedia.org/w/api.php?action=query&titles=India&prop=pageimages&pithumbsize=300

Return value:

{
    "query": {
        "pages": {
            "14533": {
                "pageid": 14533,
                "ns": 0,
                "title": "India",
                "thumbnail": {
                    "source": "https://upload.wikimedia.org/wikipedia/commons/thumb/b/b8/Political_map_of_India_EN.svg/256px-Political_map_of_India_EN.svg.png",
                    "width": 256,
                    "height": 300
                },
                "pageimage": "Political_map_of_India_EN.svg"
            }
        }
    }
}
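To use a response shaped like the one above, the thumbnail URL has to be pulled out of the `pages` map, which is keyed by page ID. A minimal sketch, assuming the JSON has already been fetched as a string; `extractThumbnails` is a hypothetical helper name, and the sample input is the return value shown above.

```php
<?php
// Hypothetical helper: maps each page title to its thumbnail URL,
// or to null when the page has no thumbnail.
function extractThumbnails(string $json): array
{
    $data = json_decode($json, true);
    $result = [];
    // "pages" is keyed by page ID, so iterate instead of indexing by key.
    foreach ($data['query']['pages'] ?? [] as $page) {
        if (!isset($page['title'])) {
            continue;
        }
        $result[$page['title']] = $page['thumbnail']['source'] ?? null;
    }
    return $result;
}

$sample = <<<'JSON'
{"query":{"pages":{"14533":{"pageid":14533,"ns":0,"title":"India",
"thumbnail":{"source":"https://upload.wikimedia.org/wikipedia/commons/thumb/b/b8/Political_map_of_India_EN.svg/256px-Political_map_of_India_EN.svg.png",
"width":256,"height":300},"pageimage":"Political_map_of_India_EN.svg"}}}}
JSON;

print_r(extractThumbnails($sample));
```

Returning null for pages without a thumbnail lets the caller decide whether to show a placeholder or fall back to another method.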

Further examples:

api.php?action=query&titles=India&prop=images

This gives you the full list of all images, sorted alphabetically. The first image in document order can instead be read from the regular (non-API) page. Combining both approaches will probably get you the most out of it:

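// Fetch the rendered article and take the first thumbnail in document order.
$topic = 'India';
// rawurlencode() encodes spaces as %20; urlencode() would emit "+" in the path.
$url = sprintf('https://en.wikipedia.org/wiki/%s', rawurlencode($topic));
$options = array(
    'http' => array(
        // Send a User-Agent header; requests without one may be rejected.
        'user_agent' => 'Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.10',
    )
);
$context = stream_context_create($options);
libxml_set_streams_context($context);
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTMLFile($url);
$xp = new DOMXPath($doc);
// Assumes thumbnails carry class="thumbimage"; adjust if the page markup differs.
$result = $xp->query('(//img[@class = "thumbimage"])[1]');
$image = ($result && $result->length) ? $result->item(0) : NULL;
// saveXML(NULL) would dump the whole document, so handle the no-match case.
echo $image ? $doc->saveXML($image) : 'No thumbnail found', "\n";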

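// Fetch the article HTML and collect every <img> tag in document order.
$wikipage = file_get_contents('https://en.wikipedia.org/wiki/Cats');
preg_match_all('/<img[^<]+?>/', $wikipage, $matches);
// $matches[0] holds the complete tags; $matches[0][1] is the second one.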

Typically the main image is the second match, after the padlock icon (http://upload.wikimedia.org/wikipedia/commons/thumb/f/fc/Padlock-silver.svg/20px-Padlock-silver.svg.png).
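The "second match" heuristic can be wrapped in a small function that also extracts the `src` attribute. This is only a sketch of that heuristic: `guessMainImageSrc` is a hypothetical helper name, and the HTML snippet is a made-up stand-in for a fetched page that reuses the two image URLs quoted in this thread.

```php
<?php
// Hypothetical helper: returns the src of the second <img> tag,
// falling back to the first, or null when the HTML has no images.
function guessMainImageSrc(string $html): ?string
{
    preg_match_all('/<img[^<]+?>/', $html, $matches);
    $tags = $matches[0];
    $tag = $tags[1] ?? $tags[0] ?? null;
    if ($tag === null) {
        return null;
    }
    // Pull the src attribute out of the chosen tag.
    return preg_match('/\ssrc="([^"]*)"/', $tag, $m) ? $m[1] : null;
}

// Made-up page fragment: a padlock icon followed by the article image.
$html = '<a><img alt="lock" src="http://upload.wikimedia.org/wikipedia/commons/thumb/f/fc/Padlock-silver.svg/20px-Padlock-silver.svg.png" /></a>'
      . '<div><img alt="map" src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b8/Political_map_of_India_EN.svg/256px-Political_map_of_India_EN.svg.png" /></div>';

echo guessMainImageSrc($html), "\n";
```

The heuristic depends entirely on the page layout, so it will pick the wrong image on pages without a padlock icon or with other images ahead of the main one.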

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow