Using BeautifulSoup to find an HTML tag that contains certain text
21-08-2019
Question
I'm trying to get the elements in an HTML document whose text contains the following pattern: #\S{11}
<h2> this is cool #12345678901 </h2>
So the element above would be matched by using:
soup('h2',text=re.compile(r' #\S{11}'))
And the results would be something like:
[u'blahblah #223409823523', u'thisisinteresting #293845023984']
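I'm able to get all the text that matches (see the line above). But I want the parent element of the text to be matched, so that I can use it as a starting point for traversing the document tree. In this case, I'd want all the h2 elements to be returned, not the matching text.
Ideas?
Solution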
from BeautifulSoup import BeautifulSoup
import re
html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""
soup = BeautifulSoup(html_text)
for elem in soup(text=re.compile(r' #\S{11}')):
    # Each match is a text node; .parent is the tag that contains it.
    print elem.parent
Prints:
<h2>this is cool #12345678901</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
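The same idea can be sketched for Python 3 and bs4. This is only an illustration, assuming bs4 4.4.0 or later (where the argument is spelled string=) and the built-in "html.parser"; the name h2_parents is made up for this example. Because a string-only search is not restricted to one tag name, the sketch filters the parents down to h2 explicitly.

```python
import re
from bs4 import BeautifulSoup

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r" #\S{11}")

# A string-only search returns text nodes, whatever tag they sit in.
strings = soup.find_all(string=pattern)

# Step up to each parent tag and keep only the <h2> elements.
h2_parents = [s.parent for s in strings if s.parent.name == "h2"]

for tag in h2_parents:
    print(tag)
```

Filtering on s.parent.name is a design choice for this sketch; dropping the condition returns the parent of every matching text node, including the h1.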
Other tips
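BeautifulSoup search operations return [a list of] BeautifulSoup.NavigableString objects when text= is used as a criterion, as opposed to BeautifulSoup.Tag objects in other cases. Check the object's __dict__ to see the attributes made available to you. Of these attributes, parent is preferred over previous because of changes in BS4.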
from BeautifulSoup import BeautifulSoup
from pprint import pprint
import re
html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""
soup = BeautifulSoup(html_text)
# Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
pattern = re.compile(r'cool')
pprint(soup.find(text=pattern).__dict__)
#>> {'next': u'\n',
#>> 'nextSibling': None,
#>> 'parent': <h2>this is cool #12345678901</h2>,
#>> 'previous': <h2>this is cool #12345678901</h2>,
#>> 'previousSibling': None}
print soup.find('h2')
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern)
#>> this is cool #12345678901
print soup.find('h2', text=pattern).parent
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern) == soup.find('h2')
#>> False
print soup.find('h2', text=pattern) == soup.find('h2').text
#>> True
print soup.find('h2', text=pattern).parent == soup.find('h2')
#>> True
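For readers on bs4, the distinction between the two result types can be checked directly. The following is a small sketch rather than part of the original answer; it assumes bs4 4.4.0 or later and the "html.parser" backend, and the variable names by_string and by_tag are invented for illustration.

```python
import re
from bs4 import BeautifulSoup, NavigableString, Tag

soup = BeautifulSoup("<h2>this is cool #12345678901</h2>", "html.parser")
pattern = re.compile(r"cool")

by_string = soup.find(string=pattern)  # search by text only
by_tag = soup.find("h2")               # search by tag name only

# The text-only search yields a text node, the name search yields a tag.
print(isinstance(by_string, NavigableString))
print(isinstance(by_tag, Tag))

# .parent on the text node leads back to the enclosing tag.
print(by_string.parent is by_tag)
```

All three lines should print True under those assumptions, which is why .parent is a convenient bridge from a text match back to the element.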
With BS4 (Beautiful Soup 4), the OP's attempt works exactly as expected:
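import re
from bs4 import BeautifulSoup
# Naming the parser explicitly avoids bs4's "no parser was explicitly specified" warning.
soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>", "html.parser")
soup('h2',text=re.compile(r' #\S{11}'))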
This returns [<h2> this is cool #12345678901 </h2>].
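Since the question was about using the matched elements as a starting point for traversal, here is a hedged sketch of what that might look like in bs4. The p elements in the sample markup are hypothetical and not from the question, and the sketch assumes bs4 4.4.0 or later with "html.parser". Note that combining a tag name with string= compares the pattern against the tag's .string, so it is only expected to match tags whose sole child is a single text node.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: each heading is followed by a paragraph.
html_text = """
<h2>this is cool #12345678901</h2>
<p>details for the first heading</p>
<h2>this is nothing</h2>
<p>unrelated paragraph</p>
"""

soup = BeautifulSoup(html_text, "html.parser")

# Name plus string= returns Tag objects rather than text nodes.
headings = soup.find_all("h2", string=re.compile(r" #\S{11}"))

for h2 in headings:
    # Walk sideways from the matched heading to the next <p> sibling.
    following = h2.find_next_sibling("p")
    print(h2.get_text(), "->", following.get_text())
```

For headings that contain nested markup, the parent-of-text-node approach from the accepted solution is likely the more robust choice.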