What to detect in a URL's HTML as a thumbnail? [closed]

Question 1

I'm not sure exactly how Facebook does it (the Facebook docs or a quick search may help), but here's something to get you started.

First of all, have a fallback check for the old style:

<link rel="image_src" href="/myimage.jpg"/>
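A minimal sketch of that check, assuming PHP's bundled DOM extension is available. The function name findDeclaredThumbnail is hypothetical, and trying the Open Graph og:image tag before the old-style link is an assumption about the preferred order, not something confirmed by Facebook's docs here.

```php
<?php
// Hypothetical helper: returns the thumbnail URL the page declares, or null.
function findDeclaredThumbnail(string $html): ?string
{
    if (trim($html) === '') {
        return null; // nothing to parse
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // Assumption: prefer the Open Graph tag when it is present.
    $og = $xpath->query('//meta[@property="og:image"]/@content');
    if ($og->length > 0 && $og->item(0)->nodeValue !== '') {
        return $og->item(0)->nodeValue;
    }

    // Fall back to the old-style link tag.
    $link = $xpath->query('//link[@rel="image_src"]/@href');
    if ($link->length > 0 && $link->item(0)->nodeValue !== '') {
        return $link->item(0)->nodeValue;
    }

    return null;
}
```

The returned value may be a relative path such as /myimage.jpg, so it still has to be resolved against the page URL before fetching.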

If that fails, you need to select an appropriate image yourself. You could get really fancy and do Google-esque scraping that tries to put things into context, such as looking for images inside the main content frame only (identified by comparing other URLs on the same website and finding the common layout template). But to start with, you could try:

1. Get all image tags and parse out the src attribute.
2. Purge any sources that aren't unique (repeats might indicate things like social icons).
3. Fetch all remaining images to a temp directory.
4. Purge any images whose size is not indicative of a featured image (e.g. anything smaller than 300px, maybe; you'd have to play with the threshold).
5. Purge any images whose aspect ratio is wildly outside that of an expected featured image.
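The list above (apart from the download in step 3) could be sketched roughly as follows. All three function names are hypothetical, and the thresholds (300px on the shorter side, a maximum 3:1 aspect ratio) are guesses to tune, not known-good values.

```php
<?php
// Step 1: collect every non-empty src attribute from the page's <img> tags.
function extractImageSources(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate invalid markup
    $doc->loadHTML($html);
    libxml_clear_errors();

    $sources = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $src = trim($img->getAttribute('src'));
        if ($src !== '') {
            $sources[] = $src;
        }
    }
    return $sources;
}

// Step 2: drop every source that appears more than once on the page.
function keepUniqueSources(array $sources): array
{
    $counts = array_count_values($sources);
    return array_values(array_filter($sources, function ($src) use ($counts) {
        return $counts[$src] === 1;
    }));
}

// Steps 4 and 5: test dimensions against assumed size and ratio limits.
function looksLikeFeaturedImage(int $width, int $height, int $minSide = 300, float $maxRatio = 3.0): bool
{
    if ($width < $minSide || $height < $minSide) {
        return false; // too small to be a featured image
    }
    $ratio = max($width, $height) / min($width, $height);
    return $ratio <= $maxRatio;
}
```

For the download step, once an image has been saved to the temp directory, PHP's getimagesize() returns an array whose first two elements are the width and height, which can be passed straight to looksLikeFeaturedImage().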

Optionally before step 3, you could try removing any images which are within close proximity of another image in the source code, which could identify things like image navigation menus.
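One rough way to approximate "close proximity" is to measure the character distance between neighbouring img tags in the raw source. This is only a sketch: the function name dropClusteredImages and the 200-character gap are assumptions, and a regex is a crude stand-in for real DOM analysis.

```php
<?php
// Hypothetical helper: returns the src of every <img> that has no other <img>
// within $minGap characters of it in the raw HTML source.
function dropClusteredImages(string $html, int $minGap = 200): array
{
    // Capture each <img> tag together with its byte offset in the source.
    preg_match_all('/<img\b[^>]*>/i', $html, $matches, PREG_OFFSET_CAPTURE);
    $tags = $matches[0];

    $kept = [];
    foreach ($tags as $i => [$tag, $start]) {
        $end = $start + strlen($tag);
        $isolated = true;

        // Gap between the end of the previous tag and the start of this one.
        if ($i > 0) {
            [$prevTag, $prevStart] = $tags[$i - 1];
            if ($start - ($prevStart + strlen($prevTag)) < $minGap) {
                $isolated = false;
            }
        }
        // Gap between the end of this tag and the start of the next one.
        if (isset($tags[$i + 1]) && $tags[$i + 1][1] - $end < $minGap) {
            $isolated = false;
        }

        // Keep only isolated images that carry a quoted src attribute.
        if ($isolated && preg_match('/\ssrc\s*=\s*["\']([^"\']+)["\']/i', $tag, $m)) {
            $kept[] = $m[1];
        }
    }
    return $kept;
}
```

Running this before the download step means clustered icons and menu images are never fetched at all.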

Anything more than that would probably require a contextual understanding of the webpage being scraped (which is probably what Facebook does). An image followed by several paragraphs, for example, could indicate a featured article image.

On top of all of that, if you made it a factory class into which you can plug additional parsers for specific sites, you could build more specific parsers for common website layouts, such as WordPress and other CMSs. There, 90% of the time, you could reasonably expect to identify the main content area of the website, at the very least to narrow your search (if not the exact article image, provided the template isn't too customised).
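A minimal sketch of such a pluggable design is shown here. Every name in it (ThumbnailParser, WordPressParser, ThumbnailParserFactory) is hypothetical, and the WordPress detection relies on assumptions: that the page references wp-content and that the theme puts the wp-post-image class on the featured image, which customised templates may not do.

```php
<?php
// Hypothetical interface: each site-specific parser says whether it applies
// to a page and, if so, returns candidate image URLs.
interface ThumbnailParser
{
    public function supports(string $html): bool;
    public function candidates(string $html): array;
}

// Illustrative parser for WordPress-like markup (assumptions noted above).
class WordPressParser implements ThumbnailParser
{
    public function supports(string $html): bool
    {
        return stripos($html, 'wp-content') !== false;
    }

    public function candidates(string $html): array
    {
        // Find <img> tags whose class list contains wp-post-image.
        preg_match_all('/<img\b[^>]*class="[^"]*wp-post-image[^"]*"[^>]*>/i', $html, $tags);
        $urls = [];
        foreach ($tags[0] as $tag) {
            if (preg_match('/\ssrc="([^"]+)"/i', $tag, $m)) {
                $urls[] = $m[1];
            }
        }
        return $urls;
    }
}

class ThumbnailParserFactory
{
    private $parsers = [];

    public function register(ThumbnailParser $parser): void
    {
        $this->parsers[] = $parser;
    }

    // Returns the first registered parser that claims the page, or null.
    public function forPage(string $html): ?ThumbnailParser
    {
        foreach ($this->parsers as $parser) {
            if ($parser->supports($html)) {
                return $parser;
            }
        }
        return null;
    }
}
```

When forPage() returns null, the caller would fall back to the generic image filtering described earlier.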

Question 2

You can use simple_html_dom. You can do the work as shown below, searching for different types of tags (img, og tags, etc.):

<?php
include_once('simple_html_dom.php');

$url = ''; // URL of the page to crawl
$images = array();

$html = file_get_html($url); // returns false if the page cannot be loaded
if ($html !== false) {
    // 'img' is just one option; other selectors can be searched the same way.
    foreach ($html->find('img') as $img) {
        $src = $img->getAttribute('src');
        if (!empty($src)) {
            $images[] = $src;
        }
    }
}

EDIT: I have shown how to crawl an HTML page and find tags such as img. However, the main problem here is how to find the images. I have given img as one option only, and as I said, you can search for other tags as well.