What algorithms could I use to identify content on a web page

https://stackoverflow.com/questions/1999228

22-09-2019
|

문제

I have a web page loaded up in the browser (i.e. its DOM and element positioning are both accessible to me) and I want to find the block element (or a sorted list of these elements), which likely contains the most content (as in a continuous block of text). The goal is to exclude things like menus, headers, footers and such.

해결책

This is my personal favorite: VIPS: a Vision-based Page Segmentation Algorithm

다른 팁

First, if you need to parse a web page, I would use HTMLAgilityPack to transform it to an XML. It will speed everything and will enable you, using a simple XPath to go directly to the BODY.

After that, you have to run on all the divs (You can get all the DIV elements in a list from the agility pack), and get whatever you want.

There's a simple technique to do this,based on analysing how "noisy" HTML is, i.e., what is the ratio of markup to displayed text through an html page. The Easy Way to Extract Useful Text from Arbitrary HTML describes this tex, giving some python code to illustrate.

Cf. also the HTML::ContentExtractor Perl module, which implements this idea. It would make sense to clean the html first, if you wanted to use this, using beautifulsoup.

I would recommend Vit Baisa's thesis on Web Content Cleaning, I think he has some code too, but I can't find a link for it. There is also a discussion of the very same problem on the natural language processing LingPipe blog.

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow