Parsing HTML Fragments
-
05-07-2019 - |
Question
What's the best way to parse fragments of HTML in C#?
For context, I've inherited an application that uses a great deal of composite controls, which is fine, but a good deal of the controls are rendered using a long sequence of literal controls, which is fairly terrifying. I'm trying to get the application into unit tests, and I want to get these controls under tests that will find out if they're generating well formed HTML, and in a dream solution, validate that HTML.
Solution
If the HTML is XHTML compliant, you can use the built in System.Xml namespace.
OTHER TIPS
Have a look at the HTMLAgility pack. It's very compatible with the .NET XmlDocument class, but it much more forgiving about HTML that's not clean/valid XHTML.
I've used an SGMLReader to produce a valid Xml document from HTML and then parse what is required using XPath or to another format using XSLT. .
You can also look into HTML Tidy for HTML parsing/cleanup. I don't think they have specific .NET libraries, but you might be able to run the binary via command-line, or IKVM the java libraries.