Clean HTML using C#

https://stackoverflow.com/questions/1754258

20-09-2019
|

Question

How do I repair malformed HTML using C#? A great answer would be an HTML Agility Pack sample!

I'm scraping a site (for legitimate use). The site's HTML is OK but there are some annoying problems.

One way I could go would be through regular expressions. I used Expression Web to analyse the problems and the regular expressions needed to correct them. So one way would be to use a tool such as RegexBuddy to generate C# code for these regular expressions.

However, the recommended tool for processing malformed HTML in C# is the HTML Agility Pack (HAP). Moreover, I've analysed only a handful of pages and I'm afraid that future pages will contain patterns I've not yet solved, and I would hate to enter the "find the errors in the next few pages and correct them" maintenance business. So, if HAP already has a solid, always-working solution, this would be great. The problem is that except for a few mentions here at SO I could not find any how-to-use documentation for this tool, except for the object-by-object API help file.

So - before I spend $ and learning time on RegexBuddy (no free evaluation version), or break my teeth on HAP's API documentation - is there an easy way to do this? An HAP sample would help... :-)

Solution 2

What I took from the answers here: 1) If you're scraping a website you don't control, you'll always enter a maintenance mode where you have to fix your scraper every time the layout of the page you're scraping changes. 2) If you are limited to this known site, why not write your scraper to adjust the problems

So, if I have to go into maintenance mode, it should be as easy as possible. Therefore, my process is as follows:

I use Webius's SWExplorerAutomation to detect scenes in Web pages. The idea is that a Scene is a collection of conditions you define for IE. When a web page is loaded, IE tries to see which set of conditions is met (e.g. - page title is "Account Login", the page contains a "Login" text box a "Password" text box). If a set of conditions corresponding to a scene is detected, IE reports that the scene has been detected. This model provides an abstraction layer - Some changes in the web page can translate to changes in the scene file, saving the code from having to change. Additionally, this shields me from IE's event driven model: I call "scene. I'm evaluating this product but I'm not yet sure I'll use it, mainly because the documentation is terrible. Another alternative is Watin, and one more reason I haven't yet bought SWEA is this article accusing its author of spamming against Watin.
Once the web page has been acquired, I use Expression Web to run compatibility checks and identify errors.
I use RegexMagic to remove and correct errors. I really love this tool. Sure, sometimes it make you murderously angry because it doesn't let you do things that should be really easy, but it's a sweet, sweet tool, and the documentation is amazing.
Finally, after all the errors I know have been corrected, I use HTML Agility Pack to convert to XHTML - cross the ts and dot the is, so to speak: all lower case, quotes across attributes, and so on.

Hope this helps!

Avi

OTHER TIPS

can you tell me what kind of annoying problems are you having?
but you dont need to use regex to clean the html, HAP will let you access the elemtents of a malformed html using Xpath Queries.
and basically you need to learn Xpath to know how to get the html elements you want.
it really depends on the kind of html you are parsing using HAP.
but there is several ways to get the elements.
like by id or class or even you can get the element that follows another element that contain a given text like "name:" for example.
you can goto W3 schools Xpath Tutorial for a nice xpath tutorial

Regex can't be used for HTML Cleaning. Does http://tidy.sourceforge.net/ helps?

If you're scraping a website you don't control, you'll always enter a maintenance mode where you have to fix your scraper every time the layout of the page you're scraping changes. It doesn't matter if you're using the regex <td color="red">\d+</td> to get the big red number from a page or if you're using a DOM parser to get the 3rd cell in the 2nd row in the table with id numbers to get the same. The regex breaks if the webmaster replaces the color attribute with a class attribute. The DOM parser breaks if the webmaster adds another row to the top of the table.

If you're scraping larger parts of a web page and want to embed them in your own web page, it may be easier to get over your desire for web standards compliance and just let the browser figure out how to display things.

Since you're using Html Agility Pack and know of the problems that occur, if you are limited to this known site, why not write your scraper to adjust the problems when you've loaded the HtmlDocument.

i.e.: If you know the element always appears after the , insert the element into the first child position of the tag.....

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow