Using BeautifulSoup to find an HTML tag that contains certain text

https://stackoverflow.com/questions/866000

21-08-2019
Question

I am trying to get the elements in an HTML document that contain text matching the following pattern: #\S{11}
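For reference, here is a small sketch of what this pattern matches on plain strings, using only the standard re module. The first three sample strings come from this page; "short #1234" is a made-up counterexample. Note that a search only needs '#' followed by at least 11 non-whitespace characters, so a 12-digit number matches as well.

```python
import re

# The pattern as used in the code on this page:
# a space, '#', then 11 non-whitespace characters.
pattern = re.compile(r' #\S{11}')

samples = [
    "this is cool #12345678901",  # 11 digits after '#'
    "foo #126666678901",          # 12 digits: the first 11 still match
    "this is nothing",            # no '#...' token at all
    "short #1234",                # hypothetical: fewer than 11 characters
]

# search() looks for the pattern anywhere in the string.
matches = [s for s in samples if pattern.search(s)]
print(matches)
```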

<h2> this is cool #12345678901 </h2>

So, the above would be matched by using:

soup('h2',text=re.compile(r' #\S{11}'))

And the result would be something like:

[u'blahblah #223409823523', u'thisisinteresting #293845023984']

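I am able to get all the matching text (see the line above). But I want the parent element of the text to be matched, so I can use it as a starting point for traversing the document tree. In this case, I would want all the h2 elements to be returned, not the matching text.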

Ideas?

Solution

from BeautifulSoup import BeautifulSoup
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)


for elem in soup(text=re.compile(r' #\S{11}')):
    print elem.parent

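Prints:

<h2>this is cool #12345678901</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>

The h1 element appears as well, because the search is by text only and is not restricted to h2.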
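For readers on Python 3 and Beautiful Soup 4, the same idea can be sketched roughly as follows. This is an adaptation, not the original answer's code: it assumes the bs4 package is installed (4.4.0 or later, for the string= argument), uses the built-in "html.parser", and the filter on the parent's tag name is an addition so that only h2 elements are kept.

```python
import re
from bs4 import BeautifulSoup

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r' #\S{11}')

# Searching by string alone yields the matching text nodes (NavigableString);
# .parent steps up to the tag that contains each one.
parents = [node.parent for node in soup.find_all(string=pattern)]

# Keep only the h2 parents (this filter is an addition for illustration).
h2_elements = [tag for tag in parents if tag.name == "h2"]

for tag in h2_elements:
    print(tag)
```

Filtering on the parent after the search keeps the text match and the tag restriction as two separate, visible steps.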

Other tips

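BeautifulSoup search operations deliver [a list of] BeautifulSoup.NavigableString objects when text= is used as a criterion, as opposed to BeautifulSoup.Tag objects in other cases. Check the object's __dict__ to see the attributes made available to you. Of these attributes, parent is favored over previous because of changes in BS4.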

from BeautifulSoup import BeautifulSoup
from pprint import pprint
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)

# Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
pattern = re.compile(r'cool')

pprint(soup.find(text=pattern).__dict__)
#>> {'next': u'\n',
#>>  'nextSibling': None,
#>>  'parent': <h2>this is cool #12345678901</h2>,
#>>  'previous': <h2>this is cool #12345678901</h2>,
#>>  'previousSibling': None}

print soup.find('h2')
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern)
#>> this is cool #12345678901
print soup.find('h2', text=pattern).parent
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern) == soup.find('h2')
#>> False
print soup.find('h2', text=pattern) == soup.find('h2').text
#>> True
print soup.find('h2', text=pattern).parent == soup.find('h2')
#>> True

With BS4 (Beautiful Soup 4), the OP's attempt works exactly as expected:

import re
from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>")
soup('h2',text=re.compile(r' #\S{11}'))

This returns [<h2> this is cool #12345678901 </h2>].
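In Beautiful Soup 4.4.0 and later, the text argument is also available under the name string. Below is a minimal sketch, assuming bs4 is installed, of how combining a tag name with string= yields Tag objects that can be used directly as starting points for traversal. The surrounding div and the p sibling are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: one matching h2 followed by a paragraph.
html_text = "<div><h2> this is cool #12345678901 </h2><p>details</p></div>"
soup = BeautifulSoup(html_text, "html.parser")

# Tag name plus string= returns the h2 Tag objects themselves.
headings = soup.find_all("h2", string=re.compile(r' #\S{11}'))

for h2 in headings:
    # Each result is a Tag, so tree navigation works from here.
    print(h2.name, "->", h2.find_next_sibling("p").get_text())
```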

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow