Question

I have a project that will accept input in several formats (HTML, SGML, XML and TXT).

I have no problem parsing the XML and TXT files. Can you suggest some tools I can use for parsing HTML or SGML files?

Solution

For HTML parsing, use the Html Agility Pack, an open source HTML parser for .NET.

What exactly is the Html Agility Pack (HAP)?

This is an agile HTML parser that builds a read/write DOM and supports plain XPath or XSLT (you don't have to understand XPath or XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to the one System.Xml offers, but for HTML documents (or streams).

You can use this to query HTML and extract whatever data you wish.
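As a rough illustration of that kind of query, here is a minimal sketch that collects the href of every link in a page. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names (LinkExtractor, ExtractLinks) are made up for this example and are not part of the library.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // assumes the "HtmlAgilityPack" NuGet package is installed

// Hypothetical helper for illustration only; not part of the library.
public static class LinkExtractor
{
    // Returns the href value of every <a> element that has one.
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of unclosed or badly nested tags

        var links = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return links;
        }

        foreach (var anchor in anchors)
        {
            links.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return links;
    }
}
```

The null check matters in practice: iterating the result of SelectNodes directly throws when the page contains no matching nodes.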

For SGML Parser

Check out this link, SGMLReader - Convert any HTML to valid XML:

http://developer.mindtouch.com/Community/SgmlReader
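A minimal sketch of how SgmlReader is typically wired up, assuming the SgmlReader package (namespace Sgml) is referenced. SgmlReader derives from XmlReader, so its output can be loaded straight into an XmlDocument. The wrapper class and method names (SgmlToXml, Load) are hypothetical, and the "HTML" DocType selects the HTML DTD bundled with the library; for other SGML dialects you would need to point the reader at your own DTD.

```csharp
using System.IO;
using System.Xml;
using Sgml; // assumes the "SgmlReader" package is installed

// Hypothetical helper for illustration only; not part of the library.
public static class SgmlToXml
{
    // Parses loosely structured HTML/SGML markup into a well-formed XmlDocument.
    public static XmlDocument Load(string markup)
    {
        using (var input = new StringReader(markup))
        {
            var reader = new SgmlReader
            {
                DocType = "HTML",                     // use the built-in HTML DTD
                CaseFolding = CaseFolding.ToLower,    // normalize tag names to lower case
                WhitespaceHandling = WhitespaceHandling.All,
                InputStream = input
            };

            var doc = new XmlDocument
            {
                PreserveWhitespace = true,
                XmlResolver = null                    // do not fetch external resources
            };
            doc.Load(reader); // SgmlReader is an XmlReader, so this just works
            return doc;
        }
    }
}
```

Once the markup is in an XmlDocument, the same XPath and System.Xml code already used for the XML inputs can be reused.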

Reference: SGML parser .NET recommendations

OTHER TIPS

For parsing HTML I can't recommend anything other than the Html Agility Pack (http://htmlagilitypack.codeplex.com/). Since SGML is basically the same but with other elements, you could probably use it for that as well.

Licensed under: CC-BY-SA with attribution