html5lib with lxml treebuilder doesn't parse namespaces correctly

https://stackoverflow.com/questions/12253791

30-06-2021

Question

I'm trying to parse some HTML content with html5lib using the lxml treebuilder. Note: I'm using the requests library to grab the content and the content is HTML5 (tried with XHTML - same result).

When I simply output the HTML source, it looks alright:

response = requests.get(url)
return response.text

returns

<html xmlns:foo="http://www.example.com/ns/foo">

But when I actually parse it with html5lib, something odd happens:

tree = html5lib.parse(response.text, treebuilder='lxml', namespaceHTMLElements=True)
html = tree.getroot()
return lxml.etree.tostring(html, pretty_print=False)

returns

<html:html xmlns:html="http://www.w3.org/1999/xhtml" xmlnsU0003Afoo="http://www.example.com/ns/foo">

Note the xmlnsU0003Afoo thing.

Also, the html.nsmap dict does not contain the foo namespace, only html.

Does anyone have an idea about what's going on and how I could fix this?

Later edit:

It seems that this is expected behavior:

"If the XML API being used restricts the allowable characters in the local names of elements and attributes, then the tool may map all element and attribute local names [...] to a set of names that are allowed, by replacing any character that isn't supported with the uppercase letter U and the six digits of the character's Unicode code [...]" (from "Coercing an HTML DOM into an infoset")
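
To make that mapping concrete, here is a minimal sketch that reverses it for the attribute name seen above. It assumes the escape is the letter U followed by five uppercase hex digits, which is what the output in this question shows (the quoted text speaks of six digits). decode_coerced_name is a hypothetical helper written for illustration, not part of html5lib or lxml.

```python
import re

# Assumption: each unsupported character was replaced by "U" plus five
# uppercase hex digits, as in "xmlnsU0003Afoo" from the output above.
_ESCAPE = re.compile(r"U([0-9A-F]{5})")


def decode_coerced_name(name):
    """Turn escapes such as U0003A back into the original character."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), name)


print(decode_coerced_name("xmlnsU0003Afoo"))  # xmlns:foo
```

A name that legitimately contains "U" followed by five hex digits would be decoded as well, so this is only safe on names known to have been coerced.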

Solution

A few observations:

HTML5 doesn't seem to support xmlns attributes in its HTML syntax. Quoting section 1.6 of the latest HTML5 specification: "...namespaces cannot be represented using the HTML syntax, but they are supported in the DOM and in the XHTML syntax." You say you tried XHTML as well, but the document you are parsing now is HTML5, so that may be the problem. U+003A is the Unicode code point for the colon, so the xmlns:foo attribute is being kept, but with its colon escaped.
There is an open issue about custom namespace elements, at least for the PHP version.
I don't understand the role of html5lib here. Why not just use lxml directly:

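from lxml import etree

# resp_text is the response body (response.text in the question).
# etree.fromstring is an XML parser, so the input must be well-formed.
tree = etree.fromstring(resp_text)
print(etree.tostring(tree, pretty_print=True, encoding="unicode"))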

That seems to do what you want, without html5lib and without the goofy xmlnsU0003Afoo error. With the test HTML I used, I got the right output (follows), and tree.nsmap contained an entry for 'foo'.

<html xmlns:foo="http://www.example.com/ns/foo">
    <head>
        <title>yo</title>
    </head>
    <body>
        <p>test</p>
    </body>
</html>

Alternatively, if you wish to use pure html5lib, you can just use the included simpletree:

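# simpletree was the default treebuilder in older html5lib releases;
# later releases removed it, so this only works on those older versions.
tree = html5lib.parse(resp_text, namespaceHTMLElements=True)
print(tree.toxml())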

While this doesn't muck up the xmlns attribute, simpletree unfortunately lacks the more powerful ElementTree functions like xpath().
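
If you need to stay with the lxml treebuilder, one possible workaround is to read the declarations back out of the coerced attributes. The following is only a sketch: it parses the serialized root element from the question with the standard library's xml.etree.ElementTree (standing in for lxml, whose elements expose the same attrib mapping), and it assumes that "U0003A" is the escaped colon. recover_namespaces is a hypothetical helper, not an html5lib or lxml function.

```python
import xml.etree.ElementTree as ET

# Serialized root element as shown in the question (made self-closing here).
COERCED = (
    '<html:html xmlns:html="http://www.w3.org/1999/xhtml" '
    'xmlnsU0003Afoo="http://www.example.com/ns/foo"/>'
)

# Assumption: "U0003A" is the escaped form of ":" (U+003A).
PREFIX = "xmlnsU0003A"


def recover_namespaces(element):
    """Collect prefix -> URI pairs from coerced xmlns:* attributes."""
    return {
        name[len(PREFIX):]: uri
        for name, uri in element.attrib.items()
        if name.startswith(PREFIX)
    }


root = ET.fromstring(COERCED)
print(recover_namespaces(root))  # {'foo': 'http://www.example.com/ns/foo'}
```

This only recovers the declared prefix-to-URI pairs; it does not move any elements into those namespaces.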

Licensed under: CC-BY-SA with attribution

Not affiliated with Stack Overflow