Using BeautifulSoup to find an HTML tag that contains certain text

https://stackoverflow.com/questions/866000

21-08-2019

Question

I'm trying to get the elements in an HTML document that contain text matching the following pattern: #\S{11}

<h2> this is cool #12345678901 </h2>

The text above can be matched using:

soup('h2',text=re.compile(r' #\S{11}'))

and the result is something like:

[u'blahblah #223409823523', u'thisisinteresting #293845023984']

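I'm able to get all the text that matches (see the line above). But I want the parent element of the text to be matched, so that I can use it as a starting point for traversing the document tree. In this case, I'd want all the h2 elements to be returned, not the text matches.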

Ideas?

Solution

from BeautifulSoup import BeautifulSoup
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)


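# Calling soup(...) is shorthand for soup.findAll(...). With text=, each
# result is a NavigableString, so .parent gives the enclosing tag.
for elem in soup(text=re.compile(r' #\S{11}')):
    print elem.parent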

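This prints:

<h2>this is cool #12345678901</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>

The h1 appears as well because the search is not restricted to h2 and its text also matches the pattern.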
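
The snippet above targets BeautifulSoup 3 on Python 2. A rough equivalent for Python 3 might look like the following sketch, which assumes the bs4 package is installed and uses the built-in "html.parser" backend. The names pattern, parents and h2_parents are only illustrative, and the extra filter on tag.name is an addition so that only h2 elements are kept.

```python
import re
from bs4 import BeautifulSoup

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

# Assumption: bs4 with the standard-library parser.
soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r" #\S{11}")

# With only string=, find_all returns the matching NavigableString objects;
# .parent is the tag that directly contains each string.
parents = [s.parent for s in soup.find_all(string=pattern)]

# The unrestricted search also matches the <h1>, so keep only <h2> parents.
h2_parents = [tag for tag in parents if tag.name == "h2"]

for tag in h2_parents:
    print(tag)
```

Each entry in h2_parents is a Tag, so it can serve as the starting point for walking the tree.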

Other tips

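BeautifulSoup search operations return [a list of] BeautifulSoup.NavigableString objects when text= is used as a criterion, as opposed to BeautifulSoup.Tag objects in other cases. Check the object's __dict__ to see the attributes available to you. Of these attributes, parent is preferred over previous because of changes in BS4.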

from BeautifulSoup import BeautifulSoup
from pprint import pprint
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)

# Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
pattern = re.compile(r'cool')

pprint(soup.find(text=pattern).__dict__)
#>> {'next': u'\n',
#>>  'nextSibling': None,
#>>  'parent': <h2>this is cool #12345678901</h2>,
#>>  'previous': <h2>this is cool #12345678901</h2>,
#>>  'previousSibling': None}

print soup.find('h2')
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern)
#>> this is cool #12345678901
print soup.find('h2', text=pattern).parent
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern) == soup.find('h2')
#>> False
print soup.find('h2', text=pattern) == soup.find('h2').text
#>> True
print soup.find('h2', text=pattern).parent == soup.find('h2')
#>> True
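
In bs4 the navigation attributes shown in the __dict__ dump have snake_case names (parent, previous_element, next_element, previous_sibling, next_sibling). A minimal sketch of reading them from a matched string, assuming bs4 with the "html.parser" backend and a shortened two-heading document chosen for illustration:

```python
import re
from bs4 import BeautifulSoup

# Shortened, illustrative input with no whitespace between the tags.
html_text = "<h2>this is cool #12345678901</h2><h2>this is nothing</h2>"
soup = BeautifulSoup(html_text, "html.parser")

# With only string=, find returns the matching NavigableString.
match = soup.find(string=re.compile(r"cool"))

print(match.parent)            # the enclosing <h2> tag
print(match.previous_element)  # element parsed just before the string
print(match.next_element)      # element parsed just after the string
print(match.previous_sibling)  # the string has no siblings inside its <h2>
print(match.next_sibling)
```

Here parent and previous_element happen to be the same tag, but parent states the intent directly and does not depend on parse order.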

With bs4 (Beautiful Soup 4), the OP's attempt works exactly as expected:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>", "html.parser")
# text= is the older spelling; bs4 4.4.0+ also accepts string=
soup('h2', text=re.compile(r' #\S{11}'))

returns [<h2> this is cool #12345678901 </h2>].
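
Because bs4 returns Tag objects when a tag name and a text pattern are combined, the result can be used directly as a starting point for traversal, which is what the question asked for. The sketch below assumes bs4 with the "html.parser" backend. The p elements are hypothetical content added only to have something to traverse to.

```python
import re
from bs4 import BeautifulSoup

# The <p> elements are made up for illustration; they are not in the question.
html_text = """
<h2> this is cool #12345678901 </h2>
<p>details for the first heading</p>
<h2>this is nothing</h2>
<p>unrelated paragraph</p>
"""
soup = BeautifulSoup(html_text, "html.parser")

# Tag name plus string= pattern: the matches are <h2> Tag objects.
headings = soup.find_all("h2", string=re.compile(r" #\S{11}"))

for h2 in headings:
    # find_next_sibling("p") skips the whitespace text node between the tags.
    following = h2.find_next_sibling("p")
    print(h2.get_text(strip=True), "->", following.get_text())
```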

License: CC-BY-SA with attribution

Not affiliated with StackOverflow