HTML parsing using pugixml or an actual HTML parser

https://stackoverflow.com/questions/10079451

30-05-2021
|

Question

I'm interested in using pugixml to parse HTML documents, but HTML has some optional closing tags. Here is an example: <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">

Pugixml stops reading the HTML as soon as it encounters a tag that's not closed, but in HTML missing a closing tag does not necessarily mean that there is a start-end tag mismatch.

A simple test of parsing the HTML documentation of pugixml fails because the meta tag is the second line of the HTML document: http://pugixml.googlecode.com/svn/tags/latest/docs/quickstart.html

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
<title>pugixml 1.0</title>
<link rel="stylesheet" href="pugixml.css" type="text/css">
<meta name="generator" content="DocBook XSL Stylesheets V1.75.2">
<link rel="home" href="quickstart.html" title="pugixml 1.0">
</head>
<!--- etc... -->

A lot of HTML documents in the wild would fail if I try to parse them with pugixml. Is there a way to avoid that? If there is no way to "fix" that, then is there another HTML parsing tool that's as about as fast as pugixml?

Update

It would also be great if the HTML parser also supports XPATH.

Solution

I ended up taking pugixml, converting it into an HTML parser and I created a github project for it: https://github.com/rofldev/pugihtml

For now it's not fully compliant with the HTML specifications, but it does a decent enough job at parsing HTML that I can use it. I'm working on making it compliant with the HTML specifications.

OTHER TIPS

One way to address this is to do some pre-processing that converts the HTML to XHTML, then it would "officially" be considered XML and usable with XML tools. If you want to go that route, see this question: How to convert HTML to XHTML?

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow