Extracting relevant image from a web-page

https://stackoverflow.com/questions/3129857

01-10-2019
|

Question

I have a couple of twitter-powered news aggregation website. I have been planning to add images from articles that I find on twitter.

If I download the page and extract image using <img> tag, I get a bunch of images; not all of them relevant to the article. For example, images of button, icons, ads etc are captured. How do I extract the image accompanying the article? I know there is a solution -- Facebook link sharer does this pretty well.

Mithun

Duplicate of : How to find and extract "main" image in website

Solution

It's been a long time. But this may help next time.

You can use this API https://urlmeta.org/

It's very simple to use and result is the best we need.

example for using API:

<?php
$url = "http://timesofindia.indiatimes.com/business/india-business/Raghuram-Rajan-not-fit-to-be-RBI-Governor-Subramanian-Swamy/articleshow/52236298.cms";

$result = file_get_contents('https://api.urlmeta.org/?url='.$url);
$array = json_decode($result,1);
print_r($array['meta']['image']);

?>

And that's the result you needed.

OTHER TIPS

Download all images from the page, blacklist all images coming from an ad server. then find some heuristic which will get you the correct image...

I think something like:

Biggest resolution += 5pts
Biggest filesize += 10 pts
Jpeg += 2 pts

then take the image with the most points and throw the rest away

Probably works for majority of sites.

(Would require some fiddling with the heuristics though)

I kind of came-up with a solution that is a bit hacky but works for me. Here is what I do to get thumbnails.

Say the headline of the page I find is "this is a headline"
I use this as a query to the Google Image API and then extract the first thumbnail I find.

It actually works quite well for a majority of the cases. Check it out for yourself http://cricketfresh.in

Mithun

ps: I think this is a good answer. Will give credit to someone who comes with a more elegant answer.

I would guess that Facebook has a link extractor for the various sites it supports. Something like id="content" -> img (1st).

Guess I am wrong. Seems that Facebook uses the Open Graph Protocol to define which image (og:image) and which metadata to use.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow

Extracting *relevant* image from a web-page

Extracting relevant image from a web-page