Question

I'm trying to extract every HTML tag that contains a match for a regular expression. For example, suppose I want every tag containing the string "name", and I have an HTML document like this:

<html>
  <head>
    <title>This tag includes 'name', so it should be retrieved</title>
  </head>
  <body>
    <h1 class="name">This is also a tag to be retrieved</h1>
    <h2>Generic h2 tag</h2>
  </body>
</html>

Probably I should try a regular expression that catches every match between an opening "<" and a closing ">". However, I'd also like to traverse the parsed tree starting from those matches, so that I can get their siblings, parents, or next elements. In the example above, that amounts to getting <head>*</head> or maybe <h2>*</h2> once I know they are parents or siblings of a tag containing the match.

I tried BeautifulSoup, but it seems to me it's mainly useful when you already know what kind of tag you're looking for, or when you search by a tag's contents. In this case, I want to get a match first, take that match as a starting point, and then navigate the tree as BeautifulSoup and other HTML parsers are able to do.

Suggestions?

Solution

Use lxml.html. It's a great parser, and it supports XPath, which can express queries like this easily.

The example below uses this xpath expression:

//*[contains(text(),'name')]/parent::*/following-sibling::*[1]/*[@class='name']/text()

That means, in English:

Find me any tag that contains the word 'name' in its text, then get the parent, and then the next sibling, and find inside that any tag with the class 'name' and finally return the text content of that.

The result of running the code is:

['This is also a tag to be retrieved']

Here's the full code:

text = """
<html>
  <head>
    <title>This tag includes 'name', so it should be retrieved</title>
  </head>
  <body>
    <h1 class="name">This is also a tag to be retrieved</h1>
    <h2>Generic h2 tag</h2>
  </body>
</html>
"""

import lxml.html

doc = lxml.html.fromstring(text)
# $stuff is an XPath variable; its value is passed as a keyword argument.
print(doc.xpath('//*[contains(text(), $stuff)]/parent::*/'
    'following-sibling::*[1]/*[@class=$stuff]/text()', stuff='name'))
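
Since the question asks for a regular-expression match followed by tree navigation, here is a minimal sketch of that variant. It assumes lxml's EXSLT regular-expression support (the `re:test` XPath function, enabled by passing the EXSLT namespace) and uses `getparent()` and `getnext()` to move from each match. The names `REGEX_NS`, `pattern`, and `matches` are illustrative only, not part of the original answer.

```python
import lxml.html

# Same sample document as in the answer above.
text = """
<html>
  <head>
    <title>This tag includes 'name', so it should be retrieved</title>
  </head>
  <body>
    <h1 class="name">This is also a tag to be retrieved</h1>
    <h2>Generic h2 tag</h2>
  </body>
</html>
"""

# EXSLT regular-expression namespace, needed for re:test in XPath.
REGEX_NS = {"re": "http://exslt.org/regular-expressions"}

doc = lxml.html.fromstring(text)

# Select elements that have a direct text node or an attribute value
# matching the regular expression passed in as $pattern.
matches = doc.xpath(
    "//*[text()[re:test(., $pattern)] or @*[re:test(., $pattern)]]",
    namespaces=REGEX_NS,
    pattern="name",
)

# Use each match as a starting point and navigate the tree from it.
for el in matches:
    parent = el.getparent()
    sibling = el.getnext()  # None when there is no following sibling element
    print(el.tag,
          "| parent:", parent.tag,
          "| next sibling:", sibling.tag if sibling is not None else None)
```

With the sample document, the matches should be the <title> and <h1> elements; from there, `getparent()` reaches <head> and <body>, and `getnext()` on the <h1> reaches the <h2>.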

Obligatory reading: the "please don't parse HTML with regex" answer is here: https://stackoverflow.com/a/1732454/17160

OTHER TIPS

Assuming that a tag counts as a match when either of the following holds:

  • The match occurs in the value of an attribute on the tag
  • The match occurs in a text node which is a direct child of the tag

you can use Beautiful Soup:

from bs4 import BeautifulSoup
from bs4 import NavigableString
import re

html = '''<html>
  <head>
    <title>This tag includes 'name', so it should be retrieved</title>
  </head>
  <body>
    <h1 class="name">This is also a tag to be retrieved</h1>
    <h2>Generic h2 tag</h2>
  </body>
</html>'''

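# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html, "html.parser")
p = re.compile("name")

def match(patt):
    def closure(tag):
        # Check text nodes that are direct children of the tag.
        for c in tag.contents:
            if isinstance(c, NavigableString) and patt.search(str(c)):
                return True
        # Check attribute values. Multi-valued attributes such as "class"
        # come back as lists, so join them into one string before searching.
        for v in tag.attrs.values():
            if isinstance(v, list):
                v = " ".join(v)
            if patt.search(v):
                return True
        return False
    return closure

for t in soup.find_all(match(p)):
    print(t)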

Output:

<title>This tag includes 'name', so it should be retrieved</title>
<h1 class="name">This is also a tag to be retrieved</h1>
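
Once a match is found, it can serve as the starting point for navigation, which is what the question asks for. The following is a minimal sketch, assuming Beautiful Soup 4 with the built-in "html.parser"; the variable names (`text_node`, `title_tag`, `h1_tag`) are illustrative only.

```python
import re
from bs4 import BeautifulSoup

html = '''<html>
  <head>
    <title>This tag includes 'name', so it should be retrieved</title>
  </head>
  <body>
    <h1 class="name">This is also a tag to be retrieved</h1>
    <h2>Generic h2 tag</h2>
  </body>
</html>'''

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile("name")

# Match on text first: find() with string= returns the matching text node,
# and .parent gives the tag that directly contains it.
text_node = soup.find(string=pattern)
title_tag = text_node.parent

# Navigate upwards, then sideways. Passing True matches any tag, so
# whitespace-only text nodes between tags are skipped.
head_tag = title_tag.parent
body_tag = head_tag.find_next_sibling(True)

# Match on an attribute value: class_ accepts a compiled regular expression.
h1_tag = soup.find(class_=pattern)
h2_tag = h1_tag.find_next_sibling(True)

print(title_tag.name, head_tag.name, body_tag.name)
print(h1_tag.name, h2_tag.name)
```

The same navigation attributes and methods work on the tags returned by `find_all` with a custom matching function.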

Licensed under: CC-BY-SA with attribution