How can I parse HTML with html5lib, and query the parsed HTML with XPath?

https://stackoverflow.com/questions/2558056

23-09-2019
|

Question

I am trying to use html5lib to parse an html page in to something I can query with xpath. html5lib has close to zero documentation and I've spent too much time trying to figure this problem out. Ultimate goal is to pull out the second row of a table:

<html>
    <table>
        <tr><td>Header</td></tr>
        <tr><td>Want This</td></tr>
    </table>
</html>

so lets try it:

>>> doc = html5lib.parse('<html><table><tr><td>Header</td></tr><tr><td>Want This</td> </tr></table></html>', treebuilder='lxml')
>>> doc
<lxml.etree._ElementTree object at 0x1a1c290>

that looks good, lets see what else we have:

>>> root = doc.getroot()
>>> print(lxml.etree.tostring(root))
<html:html xmlns:html="http://www.w3.org/1999/xhtml"><html:head/><html:body><html:table><html:tbody><html:tr><html:td>Header</html:td></html:tr><html:tr><html:td>Want This</html:td></html:tr></html:tbody></html:table></html:body></html:html>

LOL WUT?

seriously. I was planning on using some xpath to get at the data I want, but that doesn't seem to work. So what can I do? I am willing to try different libraries and approaches.

Solution

Lack of documentation is a good reason to avoid a library IMO, no matter how cool it is. Are you wedded to using html5lib? Have you looked at lxml.html?

Here is a way to do this with lxml:

from lxml import html
tree = html.fromstring(text)
[td.text for td in tree.xpath("//td")]

Result:

['Header', 'Want This']

OTHER TIPS

What you want to use is the namespaceHTMLElements argument, which for some reason defaults to True.

doc = html5lib.parse('''<html>
    <table>
        <tr><td>Header</td></tr>
        <tr><td>Want This</td></tr>
    </table>
</html>
''', treebuilder='lxml', namespaceHTMLElements=False)

print lxml.html.tostring(doc)

It's probably still easier to use lxml.html however.

I always recommend to try out lxml library. It's blazingly fast and has many features.

It has also support for html5lib parser if you need that: html5parser

>>> from lxml.html import fromstring, tostring

>>> html = """
... <html>
...     <table>
...         <tr><td>Header</td></tr>
...         <tr><td>Want This</td></tr>
...     </table>
... </html>
... """
>>> doc = fromstring(html)
>>> tr = doc.cssselect('table tr')[1]
>>> print tostring(tr)
<tr><td>Want This</td></tr>

With BeautifulSoup, you can do that with

>>> soup = BeautifulSoup.BeautifulSoup('<html><table><tr><td>Header</td></tr><tr><td>Want This</td></tr></table></html>')
>>> soup.findAll('td')[1].string
u'Want This'
>>> soup.findAll('tr')[1].td.string
u'Want This'

(Obviously that's a really crude example, but ya.)

i believe you can do css search on lxml objects.. like so

elements = root.cssselect('div.content')
data = elements[0].text

Since html5lib (by default) creates trees that contain (correct) namespace information you have specify (the right) namespaces in your queries, as well.

Example with an XPath query:

import html5lib
inp='''<html>
    <table>
        <tr><td>Header</td></tr>
        <tr><td>Want This</td></tr>
    </table>
</html>'''
xns = '{http://www.w3.org/1999/xhtml}'
d = html5lib.parse(inp)
s = d.findall('.//{}td'.format(xns))[-1].text
print(s)

Output:

Want This

The same result without XPath:

s = d.find(xns+'body').find(xns+'table').find(xns+'tbody') \
     .findall(xns+'tr')[-1].find(xns+'td').text

Alternatively, you can also tell html5lib to avoid adding any namespace information during parsing:

d = html5lib.parse(inp, namespaceHTMLElements=False)
s = d.findall('.//td')[-1].text
print(s)

Output:

Want This

try using jquery. and you can retrieve all elements. alternately, you can put an id on your row and pull it out.

1) ... ...

$("td")[1].innerHTML will be what you want

2) ... ...

$("#blah").text() will be what you want

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow