Parsing HTML Fragments

https://stackoverflow.com/questions/250292

05-07-2019

Question

What's the best way to parse fragments of HTML in C#?

For context, I've inherited an application that uses a great many composite controls, which is fine, but many of them are rendered using a long sequence of literal controls, which is fairly terrifying. I'm trying to get the application under unit tests, and I want tests that will find out whether these controls generate well-formed HTML and, in a dream solution, validate that HTML.

Solution

If the HTML is XHTML-compliant, you can use the built-in System.Xml namespace.
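
As a rough sketch of what that could look like in a unit test, the helper below reads a fragment with XmlReader in fragment mode and reports whether it is well-formed. The class and method names (FragmentChecker, IsWellFormed) are made up for illustration. It assumes the controls emit XHTML-style markup; HTML named entities such as &nbsp; are not declared in plain XML, so a fragment containing them will be reported as an error unless they are replaced or declared first.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of System.Xml.
public static class FragmentChecker
{
    // Returns true when the markup parses as a well-formed XML fragment;
    // otherwise returns false and puts the parser's message in 'error'.
    public static bool IsWellFormed(string fragment, out string error)
    {
        var settings = new XmlReaderSettings
        {
            // A rendered control usually emits several top-level nodes,
            // so do not require a single root element.
            ConformanceLevel = ConformanceLevel.Fragment,
            DtdProcessing = DtdProcessing.Prohibit
        };

        try
        {
            using (var reader = XmlReader.Create(new StringReader(fragment), settings))
            {
                // Reading to the end forces the whole fragment to be parsed.
                while (reader.Read()) { }
            }
            error = null;
            return true;
        }
        catch (XmlException ex)
        {
            error = ex.Message;
            return false;
        }
    }
}
```

In a test, the rendered output of a control would be passed in as the fragment string, with the error message used as the assertion failure text. This only checks well-formedness; validating against an XHTML schema or DTD would be a separate step.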

OTHER TIPS

Have a look at the HTML Agility Pack. Its API closely resembles the .NET XmlDocument class, but it is much more forgiving of HTML that isn't clean, valid XHTML.
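
A minimal sketch of using the HTML Agility Pack on a fragment, assuming the HtmlAgilityPack NuGet package is referenced. The FragmentInspector class and its method names are invented for this example. What the parser chooses to report in ParseErrors depends on the library version and the markup, so treat that list as a hint and not as a strict validator.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper built on the HTML Agility Pack.
public static class FragmentInspector
{
    // Loads a fragment leniently and returns the href value of every link.
    public static List<string> LinkTargets(string fragment)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(fragment);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return new List<string>();
        }

        return anchors.Select(a => a.GetAttributeValue("href", "")).ToList();
    }

    // Returns whatever problems the parser recorded while loading the fragment.
    public static List<string> ParseProblems(string fragment)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(fragment);

        return doc.ParseErrors
                  .Select(e => $"{e.Line}:{e.LinePosition} {e.Reason}")
                  .ToList();
    }
}
```

Because the parser does not throw on sloppy markup, a test can assert on the structure it expects (for example, that a particular link or table cell exists) even when the control's output is not valid XHTML.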

I've used SgmlReader to produce a valid XML document from HTML, and then either extracted what I needed using XPath or transformed it to another format using XSLT.
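
The conversion step described above might be sketched as follows, assuming the SgmlReader library (namespace Sgml) is referenced. The HtmlToXml class name is an invention for this example, and the exact shape of the resulting XML (for instance, which wrapper elements the reader adds around a fragment) is best checked against the library version in use.

```csharp
using System.IO;
using System.Xml;
using Sgml;

// Hypothetical wrapper around SgmlReader.
public static class HtmlToXml
{
    // Reads loose HTML through SgmlReader and returns it as an XmlDocument.
    public static XmlDocument Convert(string html)
    {
        var sgml = new SgmlReader
        {
            DocType = "HTML",                    // use the built-in HTML DTD
            CaseFolding = CaseFolding.ToLower,   // normalize tag names for XPath
            WhitespaceHandling = WhitespaceHandling.All,
            InputStream = new StringReader(html)
        };

        var doc = new XmlDocument { PreserveWhitespace = true };
        doc.Load(sgml);
        return doc;
    }
}
```

Once the document is loaded, the usual System.Xml tools apply: for example, HtmlToXml.Convert(html).SelectNodes("//a/@href") for XPath queries, or XslCompiledTransform for an XSLT transformation.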

You can also look into HTML Tidy for HTML parsing and cleanup. I don't think it has specific .NET libraries, but you might be able to run the binary from the command line, or use IKVM to run the Java libraries.

Licensed under: CC-BY-SA with attribution