Using BeautifulSoup to find an HTML tag that contains certain text

https://stackoverflow.com/questions/866000

21-08-2019

Question

I'm trying to get the elements in an HTML document that contain the following pattern of text: #\S{11}

<h2> this is cool #12345678901 </h2>

So the element above would be matched by using:

soup('h2',text=re.compile(r' #\S{11}'))

And the results would be something like:

[u'blahblah #223409823523', u'thisisinteresting #293845023984']

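I'm able to get all the text that matches (see the line above). But I want the parent element of the text to match, so I can use it as a starting point for traversing the document tree. In this case, I'd want all the h2 elements to be returned, not the text matches.

Ideas?

Solution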

from BeautifulSoup import BeautifulSoup
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)


for elem in soup(text=re.compile(r' #\S{11}')):
    print elem.parent

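Prints:

<h2>this is cool #12345678901</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>

The h1 is printed as well, because the search is not restricted to a tag name.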
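For readers on Python 3, the same idea can be sketched with the bs4 package. This is only an illustration, not part of the original answer: it assumes bs4 4.4.0 or later (for the string= argument) and the built-in "html.parser", and the check on node.parent.name is an addition that keeps only h2 parents.

```python
import re
from bs4 import BeautifulSoup

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r" #\S{11}")

# Search the text nodes, step up to each parent tag,
# and keep only the parents that are h2 elements.
h2_matches = [
    node.parent
    for node in soup.find_all(string=pattern)
    if node.parent.name == "h2"
]

for tag in h2_matches:
    print(tag)
```

Each item in h2_matches is a Tag, so it can be used directly as the starting point for further traversal.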

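Other tips

BeautifulSoup search operations return [a list of] BeautifulSoup.NavigableString objects when text= is used as a criterion, as opposed to BeautifulSoup.Tag objects in other cases. Check the object's __dict__ to see the attributes available to you. Of these attributes, parent is preferred over previous because of changes in BS4.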

from BeautifulSoup import BeautifulSoup
from pprint import pprint
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)

# Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
pattern = re.compile(r'cool')

pprint(soup.find(text=pattern).__dict__)
#>> {'next': u'\n',
#>>  'nextSibling': None,
#>>  'parent': <h2>this is cool #12345678901</h2>,
#>>  'previous': <h2>this is cool #12345678901</h2>,
#>>  'previousSibling': None}

print soup.find('h2')
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern)
#>> this is cool #12345678901
print soup.find('h2', text=pattern).parent
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern) == soup.find('h2')
#>> False
print soup.find('h2', text=pattern) == soup.find('h2').text
#>> True
print soup.find('h2', text=pattern).parent == soup.find('h2')
#>> True
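The preference for parent over previous can be illustrated with a small bs4 sketch. It assumes Python 3, bs4 4.4.0 or later, and the built-in "html.parser"; the second document with a nested b tag is a made-up example, not taken from the question. In bs4 the attribute is spelled previous_element, and it points at whatever was parsed immediately before the text node, which is not always the enclosing tag.

```python
import re
from bs4 import BeautifulSoup

pattern = re.compile(r"cool")

# Hypothetical inputs: one plain heading, one with nested markup.
simple = BeautifulSoup("<h2>this is cool #12345678901</h2>", "html.parser")
nested = BeautifulSoup(
    "<h2><b>note</b> this is cool #12345678901</h2>", "html.parser"
)

# A string= search without a tag name returns the text node, not the tag.
simple_text = simple.find(string=pattern)
nested_text = nested.find(string=pattern)

# For a lone text child, the parent and the previously parsed element coincide.
print(simple_text.parent is simple_text.previous_element)

# With nested markup they differ: previous_element is the text inside <b>,
# while parent is still the enclosing h2.
print(repr(nested_text.previous_element))
print(nested_text.parent.name)
```

Relying on parent therefore gives the enclosing element regardless of what precedes the text inside it.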

With bs4 (Beautiful Soup 4), the OP's attempt works exactly as expected:

import re
from bs4 import BeautifulSoup

# Naming the parser explicitly avoids the "no parser specified" warning.
soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>", "html.parser")
soup('h2',text=re.compile(r' #\S{11}'))

which returns [<h2> this is cool #12345678901 </h2>].
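One caveat worth sketching: when a tag name and a string filter are combined in bs4, the filter is compared against the tag's .string, which is None if the tag has more than one child. The example below is a hedged illustration, assuming Python 3, bs4 4.4.0 or later, and "html.parser"; the second h2 and the helper h2_with_pattern are hypothetical additions. A function filter on get_text() is one possible way to catch headings with nested markup.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical input: the second heading contains nested markup.
html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is <em>nested</em> #12345678901</h2>
"""

soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r" #\S{11}")

# Tag name plus string=: only tags whose .string matches are returned.
by_string = soup.find_all("h2", string=pattern)


def h2_with_pattern(tag):
    """Hypothetical helper: match h2 tags on their full text content."""
    return tag.name == "h2" and pattern.search(tag.get_text()) is not None


# Function filter: also finds the heading whose text is split by <em>.
by_function = soup.find_all(h2_with_pattern)

print(len(by_string), len(by_function))
```

Both calls return Tag objects, so either result can serve as the starting point for walking the tree.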

License: CC-BY-SA with attribution

Not affiliated with StackOverflow