How to retrieve an HTML tag based on a regular expression
28-04-2021
Question
I'm trying to extract every HTML tag that contains a match for a regular expression. For example, suppose I want to get every tag that includes the string "name", and I have an HTML document like this:
<html>
<head>
<title>This tag includes 'name', so it should be retrieved</title>
</head>
<body>
<h1 class="name">This is also a tag to be retrieved</h1>
<h2>Generic h2 tag</h2>
</body>
</html>
I could probably use a regular expression to catch every match between an opening "<" and a closing ">". However, I'd like to be able to traverse the parsed tree based on those matches, so I can get the siblings, parents or 'nextElements'. In the example above, that amounts to getting <head>*</head> or maybe <h2>*</h2> once I know they're parents or siblings of a tag containing the match.
I tried BeautifulSoup, but it seems to be useful mainly when you already know what kind of tag you're looking for, or when you search based on its contents. In this case, I want to find a match first, take that match as a starting point and then navigate the tree, as BeautifulSoup and other HTML parsers are able to do.
Suggestions?
Solution
Use lxml.html. It's a great parser, and it supports XPath, which can express this kind of query easily.
The example below uses this xpath expression:
//*[contains(text(),'name')]/parent::*/following-sibling::*[1]/*[@class='name']/text()
In English, that means: find any tag that contains the word 'name' in its text, then get its parent, then the parent's next sibling, and find inside that any tag with the class 'name'; finally, return the text content of that tag.
The result of running the code is:
['This is also a tag to be retrieved']
Here's the full code:
text = """
<html>
<head>
<title>This tag includes 'name', so it should be retrieved</title>
</head>
<body>
<h1 class="name">This is also a tag to be retrieved</h1>
<h2>Generic h2 tag</h2>
</body>
</html>
"""
import lxml.html
doc = lxml.html.fromstring(text)
# $stuff is an XPath variable; its value is passed as the stuff= keyword argument.
print(doc.xpath('//*[contains(text(), $stuff)]/parent::*/'
                'following-sibling::*[1]/*[@class=$stuff]/text()', stuff='name'))
Obligatory reading: the "please don't parse HTML with regex" answer is here: https://stackoverflow.com/a/1732454/17160
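Note that contains() in the expression above is a plain substring test, not a regular expression. As a minimal sketch of how a real regex could be used and how the matched elements can then serve as starting points for navigation, the snippet below relies on the EXSLT regular-expression functions that lxml exposes in XPath. The variable names (EXSLT_RE, matches, pat) are illustrative, and the XPath expression is one possible way to cover both direct text nodes and attribute values; it is not part of the original answer.

```python
import lxml.html

# Namespace URI under which lxml provides the EXSLT regex functions.
EXSLT_RE = "http://exslt.org/regular-expressions"

text = """<html>
<head>
<title>This tag includes 'name', so it should be retrieved</title>
</head>
<body>
<h1 class="name">This is also a tag to be retrieved</h1>
<h2>Generic h2 tag</h2>
</body>
</html>"""

doc = lxml.html.fromstring(text)

# Keep elements that have a direct text node or an attribute value
# matching the regex bound to the XPath variable $pat.
matches = doc.xpath(
    "//*[text()[re:test(., $pat)] or @*[re:test(., $pat)]]",
    pat="name",
    namespaces={"re": EXSLT_RE},
)

for el in matches:
    parent = el.getparent()   # enclosing element
    sibling = el.getnext()    # next sibling element, or None
    print(el.tag, parent.tag, sibling.tag if sibling is not None else None)
```

Each item in matches is an ordinary lxml element, so getparent(), getnext(), getprevious() and further xpath() calls are all available from it.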
OTHER TIPS
Assuming a tag counts as a match when either of the following holds:
- The match occurs in the value of an attribute on the tag
- The match occurs in a text node which is a direct child of the tag
You can use Beautiful Soup:
from bs4 import BeautifulSoup
from bs4 import NavigableString
import re
html = '''<html>
<head>
<title>This tag includes 'name', so it should be retrieved</title>
</head>
<body>
<h1 class="name">This is also a tag to be retrieved</h1>
<h2>Generic h2 tag</h2>
</body>
</html>'''
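soup = BeautifulSoup(html, "html.parser")
p = re.compile("name")

def match(patt):
    # Build a filter for find_all(): a tag is accepted when patt matches one
    # of its direct text children or one of its attribute values.
    def closure(tag):
        for c in tag.contents:
            if isinstance(c, NavigableString) and patt.search(str(c)):
                return True
        for v in tag.attrs.values():
            # Multi-valued attributes such as class are returned as lists.
            values = v if isinstance(v, list) else [v]
            if any(patt.search(x) for x in values):
                return True
        return False
    return closure

for t in soup.find_all(match(p)):
    print(t)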
Output:
<title>This tag includes 'name', so it should be retrieved</title>
<h1 class="name">This is also a tag to be retrieved</h1>
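Since the question asks about moving through the tree from each match, here is a rough sketch of that step with Beautiful Soup. It repeats a compact version of the filter so that it runs on its own; has_match and results are hypothetical names chosen for this illustration, and the built-in "html.parser" backend is an assumption.

```python
import re
from bs4 import BeautifulSoup, NavigableString

html = """<html>
<head>
<title>This tag includes 'name', so it should be retrieved</title>
</head>
<body>
<h1 class="name">This is also a tag to be retrieved</h1>
<h2>Generic h2 tag</h2>
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")
pattern = re.compile("name")

def has_match(tag):
    # Hypothetical helper: True when the pattern matches a direct text
    # child of the tag or any of its attribute values.
    candidates = [str(c) for c in tag.contents if isinstance(c, NavigableString)]
    for value in tag.attrs.values():
        # Multi-valued attributes such as class are returned as lists.
        candidates.extend(value if isinstance(value, list) else [value])
    return any(pattern.search(s) for s in candidates)

results = []
for tag in soup.find_all(has_match):
    parent = tag.parent                 # enclosing tag
    sibling = tag.find_next_sibling()   # next sibling tag, or None
    results.append((tag.name, parent.name,
                    sibling.name if sibling is not None else None))

for row in results:
    print(row)
```

Every tag returned by find_all() is a regular Tag object, so .parent, find_next_sibling(), find_previous_sibling() and .next_element can be used from it just as in any other Beautiful Soup traversal.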