Question

I'm looking for a .NET library that can generate a clean XML tree, ideally a System.Xml.XmlDocument, from invalid HTML. That is, it should make the kind of best-effort guesses, repairs, and substitutions that browsers make when confronted with bad markup, and produce a pretend XmlDocument. The library should also be well maintained. :)

I realize this is a lot (too much?) to ask, and I would appreciate any useful leads. There seem to be a fair number of implementations of this for Java, but I would rather not generate my own bindings. So far for .NET I have found http://www.majestic12.co.uk/projects/html_parser.php, http://users.rcn.com/creitzel/tidy.html#dotnet, and http://sourceforge.net/projects/tidyfornet.

I have not yet built or tested any of these, but judging from the (sparse) docs and rare updates, they do not seem to offer what I'm looking for. So what recommendations do you have, either among these choices or from your past experience?

Solution

The HTML Agility Pack is highly rated. It will certainly handle the parsing and the best-guess repair of malformed markup.

The model is intentionally similar to XmlDocument, including SelectNodes etc. for querying.

If you need XHTML output, there is an OptionOutputAsXml flag; I assume that setting this to true and calling Save results in XHTML.
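A minimal sketch of that approach, assuming the HtmlAgilityPack NuGet package is referenced. The sample HTML string is made up for illustration, and I have not confirmed that the saved output is well-formed XML for every input, so the final load into XmlDocument is wrapped in a try/catch:

```csharp
using System;
using System.IO;
using System.Xml;
using HtmlAgilityPack; // assumption: the "HtmlAgilityPack" NuGet package is installed

class Program
{
    static void Main()
    {
        // Hypothetical malformed input: unclosed <p> tags and mis-nested <b>/<i>.
        string html = "<html><body><p>First<p>Second <b>bold <i>both</b></i></body></html>";

        var doc = new HtmlDocument();
        doc.OptionOutputAsXml = true; // ask the library to emit XML when saving
        doc.LoadHtml(html);

        // Query the parsed tree directly with XPath, much like XmlDocument.
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs != null)
        {
            foreach (HtmlNode p in paragraphs)
            {
                Console.WriteLine(p.InnerText);
            }
        }

        // Save the repaired markup to a string, then try to load it as XML.
        var writer = new StringWriter();
        doc.Save(writer);

        try
        {
            var xml = new XmlDocument();
            xml.LoadXml(writer.ToString());
            Console.WriteLine(xml.DocumentElement.Name);
        }
        catch (XmlException ex)
        {
            // The saved output was not well-formed XML for this particular input.
            Console.WriteLine("Could not load as XML: " + ex.Message);
        }
    }
}
```

If you only need to query the document, the SelectNodes step may be enough on its own, and the round trip through XmlDocument can be skipped.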

Licensed under: CC-BY-SA with attribution