Question

I would like to use HTML Agility Pack to determine the main article body and then extract the main article image from it.

I have noticed that on most sites the main content container includes an H1 tag, but that is not always the case, so I cannot rely on it.

The two screenshots below are from these two sites:

http://www.24matins.fr/the-walking-dead-saison-4-le-deces-de-ce-personnage-ne-sera-pas-anodin-40685

http://www.lasemainedansleboulonnais.fr/actualite/la_une/2013/04/04/article__20_ans_prison_meurtre_de_sa_mere_boulogne.shtml

These are just some examples of the websites that I want to scrape.

[Two screenshots, each captioned "content to rip"]

Thank you!

No accepted solution

Other suggestions

In fact, there is no single rule that achieves what you want in a generic way.

First, bear in mind that websites differ from one another and can change at any moment, so trying to build an infallible algorithm is a waste of time in most situations.

If you only have a few websites to parse, you can work out the current content layout of each one and parse it with HTML Agility Pack. For example:

24matins: there is a div with the class "post-header" whose first <img> is the main article image, so with HAP you could write:

// Requires: using System; using HtmlAgilityPack;
var web = new HtmlWeb();
var doc = web.Load("http://www.24matins.fr/the-walking-dead-saison-4-le-deces-de-ce-personnage-ne-sera-pas-anodin-40685");
// SelectSingleNode returns null when nothing matches, so check before reading src.
var img = doc.DocumentNode.SelectSingleNode("//div[@class='post-header']/img");
if (img != null) Console.WriteLine(img.GetAttributeValue("src", ""));

lasemainedansleboulonnais: there is a single div with the class "illustrations", so:

web = new HtmlWeb();
doc = web.Load("http://www.lasemainedansleboulonnais.fr/actualite/la_une/2013/04/04/article__20_ans_prison_meurtre_de_sa_mere_boulogne.shtml");
// Same guard as before: the pattern may stop matching if the site changes its markup.
img = doc.DocumentNode.SelectSingleNode("//div[@class='illustrations']/img");
if (img != null) Console.WriteLine(img.GetAttributeValue("src", ""));
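If the list of sites grows, the per-site patterns above could be kept in one lookup table rather than in copied snippets. This is only a sketch: it assumes the two XPath expressions from this answer still match the live pages, and the SiteRules class and FindMainImage method are names made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class SiteRules
{
    // Host name -> XPath of the main article image (patterns taken from this answer).
    static readonly Dictionary<string, string> ImageXPath =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            ["www.24matins.fr"] = "//div[@class='post-header']/img",
            ["www.lasemainedansleboulonnais.fr"] = "//div[@class='illustrations']/img",
        };

    // Returns the image URL, or null when the host is unknown or no node matches.
    public static string FindMainImage(Uri pageUrl, HtmlDocument doc)
    {
        if (!ImageXPath.TryGetValue(pageUrl.Host, out var xpath)) return null;
        var img = doc.DocumentNode.SelectSingleNode(xpath);
        return img?.GetAttributeValue("src", null);
    }
}
```

You would load the page with new HtmlWeb().Load(url.AbsoluteUri) and pass the same Uri and the resulting document to SiteRules.FindMainImage. A null result is the signal that a site changed its layout and its entry needs updating.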

Also, I would suggest using the sites' RSS feeds to get the relevant information. They generally include the article pictures and are more likely to follow a recognizable pattern, as you can see at www.24matins.fr/feed/rss-toutes-actualites.
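As a rough sketch of the feed approach: RSS 2.0 items may carry an <enclosure> element whose url attribute points at an attached media file, and some news feeds use it for the article image. Whether the 24matins feed does so is an assumption that has not been checked here. The FeedImages class is hypothetical and uses only System.Xml.Linq.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

static class FeedImages
{
    // Maps each item's <link> to its <enclosure url="..."> in an RSS 2.0 document.
    // Items missing either element are skipped.
    public static Dictionary<string, string> FromRss(string rssXml)
    {
        var result = new Dictionary<string, string>();
        foreach (var item in XDocument.Parse(rssXml).Descendants("item"))
        {
            var link = (string)item.Element("link");
            var image = (string)item.Element("enclosure")?.Attribute("url");
            if (link != null && image != null) result[link.Trim()] = image;
        }
        return result;
    }
}
```

If a feed puts its images somewhere else (for example inside the item description), the lookup inside the loop is the only part that would need to change.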

Hope it helps.

You can also scan the HTML of the given URL for the meta tags aimed at social websites. For example, for Facebook (Open Graph) it would be:

<meta property="og:image" content="_here_is_URL_of_main_article_image_" />

But as natenho said, there is no single sure way that will always work.
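The meta-tag lookup can be done with HTML Agility Pack as well. A minimal sketch, assuming the page declares og:image exactly as in the tag shown above; the OpenGraph class name is made up for illustration:

```csharp
using HtmlAgilityPack;

static class OpenGraph
{
    // Returns the content of <meta property="og:image">, or null if the page has none.
    public static string FindImage(HtmlDocument doc)
    {
        var meta = doc.DocumentNode.SelectSingleNode("//meta[@property='og:image']");
        return meta?.GetAttributeValue("content", null);
    }
}
```

This could be tried first for any site, falling back to a site-specific pattern when it returns null.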

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow