Question

I'm validating user-supplied HTML with html5lib. The problem is that html5lib adds html, head, and body tags, which I don't need.

parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("simpleTree"))
f = open('/home/user/ex.html')
doc = parser.parse(f)
doc.toxml()
'<html><head/><body><div>\n  <a href="http://speedhunters.com">speedhunters.com\n</a></div><a href="http://speedhunters.com">\n</a></body></html>'

The result is validated and can be sanitized, but how can I remove these tags, or prevent them from being added to the tree in the first place? I'd like to avoid resorting to replace().

Solution

Wow, html5lib has horrible documentation.

Looking through the source, and working on a quick test case, this appears to work:

import html5lib
from html5lib import treebuilders

parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("simpleTree"))
with open('test.html') as test:
    doc = parser.parse(test)

# Iterating the document visits every node; keep only the direct children of <body>.
body_children = [child.toxml() for child in doc if child.parent.name == "body"]
print(''.join(body_children))

It's a bit hackish, but less so than a replace().
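Another option that may be worth trying is to parse the input as a fragment instead of a full document, so the wrapper elements are never created. The following is only a minimal sketch, assuming a reasonably recent html5lib in which html5lib.parseFragment and html5lib.serialize are available at the top level; the helper name fragment_to_html is made up for illustration and is not part of the library.

```python
import html5lib


def fragment_to_html(markup):
    """Parse markup as a fragment and serialize it back to a string.

    fragment_to_html is a hypothetical helper name, not an html5lib API.
    """
    # parseFragment treats the input as the contents of a container element
    # ("div" by default), so no <html>, <head>, or <body> nodes are created.
    fragment = html5lib.parseFragment(markup, treebuilder="etree")
    # serialize walks the etree fragment and emits HTML text;
    # quote_attr_values="always" keeps attribute values quoted.
    return html5lib.serialize(fragment, tree="etree", quote_attr_values="always")


if __name__ == "__main__":
    sample = '<div><a href="http://speedhunters.com">speedhunters.com</a></div>'
    print(fragment_to_html(sample))
```

The parser still repairs malformed markup along the way, so the output is not guaranteed to match the input byte for byte.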

Other tips

It seems that we can use the hidden property of Tags in order to prevent the tag itself from being 'exported' when casting a tag/soup to string/unicode:

>>> from bs4 import BeautifulSoup
>>> html = u"<div><footer><h3>foot</h3></footer></div><div>foo</div>"
>>> soup = BeautifulSoup(html, "html5lib")
>>> print(soup.body.prettify())
<body>
 <div>
  <footer>
   <h3>
    foot
   </h3>
  </footer>
 </div>
 <div>
  foo
 </div>
</body>

Essentially, the questioner's goal is to get the entire content of the body tag without the <body> wrapper itself. This works:

>>> soup.body.hidden=True
>>> print(soup.body.prettify())
 <div>
  <footer>
   <h3>
    foot
   </h3>
  </footer>
 </div>
 <div>
  foo
 </div>

I found this by going through the BeautifulSoup source. After calling soup = BeautifulSoup(html), the root tag has the internal name '[document]'. By default, only the root tag has hidden==True. This prevents its name from ending up in any HTML output.
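The behaviour described above can be checked with a small script. This is a rough sketch rather than anything taken from the BeautifulSoup documentation: it uses the built-in "html.parser" builder only to avoid an extra dependency (so no body tag is synthesized here), and describe is a hypothetical helper written for this example.

```python
from bs4 import BeautifulSoup

# "html.parser" is an assumption made to keep the sketch dependency-light;
# the hidden flag behaves the same way regardless of the parser chosen.
soup = BeautifulSoup("<div><p>foo</p></div>", "html.parser")


def describe(tag):
    """Return (name, hidden flag as bool) for a tag; hypothetical helper."""
    return tag.name, bool(tag.hidden)


root_info = describe(soup)     # the root BeautifulSoup object
div_info = describe(soup.div)  # an ordinary tag

# Hiding the <div> drops its own start and end tags from the output,
# while its children are still rendered.
soup.div.hidden = True
inner_html = str(soup)

print(root_info, div_info, inner_html)
```

Since hidden is an internal attribute found by reading the source, Tag.decode_contents(), which returns a tag's inner markup as a string, may be a less fragile way to get the same result.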

lxml may be a better choice if you're dealing with "uncommon" html.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow