Using BeautifulSoup to find an HTML tag that contains certain text

https://stackoverflow.com/questions/866000

21-08-2019

Question

I'm trying to get the elements in an HTML document that contain the following text pattern: #\S{11}

<h2> this is cool #12345678901 </h2>

So the element above would be matched by:

soup('h2',text=re.compile(r' #\S{11}'))

And the results would be something like:

[u'blahblah #223409823523', u'thisisinteresting #293845023984']
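A side note on the pattern itself, since the sample results above carry 12 digits after the `#`: as far as I can tell, BeautifulSoup applies a compiled regex as an unanchored search, so `\S{11}` only demands at least 11 non-whitespace characters at that spot. The sketch below uses only the standard `re` module; the sample strings and the stricter `strict` pattern are illustrative assumptions, not part of the original question.

```python
import re

# Same pattern as in the question: a space, '#', then 11 non-whitespace characters.
pattern = re.compile(r' #\S{11}')

samples = [
    "this is cool #12345678901",   # exactly 11 characters after '#'
    "blahblah #223409823523",      # 12 characters: the first 11 still match
    "too short #12345",            # only 5 characters: no match
]

# search() is unanchored, so a match anywhere in the string counts.
matched = [s for s in samples if pattern.search(s)]
print(matched)

# Hypothetical stricter variant: exactly 11 digits followed by a word boundary.
strict = re.compile(r' #\d{11}\b')
strict_matched = [s for s in samples if strict.search(s)]
print(strict_matched)
```

If the IDs are always exactly 11 digits, something like the stricter variant would avoid picking up longer numbers.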

I'm able to get all of the matching text (see the line above). But what I want is the parent element of each text match, so I can use it as a starting point for traversing the document tree. In this case, I'd want all of the h2 elements to be returned, not the text matches.

Any ideas?

Solution

from BeautifulSoup import BeautifulSoup
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)


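for elem in soup(text=re.compile(r' #\S{11}')):
    # A text-only search matches text nodes under any tag (the h1 included),
    # so keep only the matches whose parent element is an h2.
    if elem.parent.name == 'h2':
        print elem.parent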

Prints:

<h2>this is cool #12345678901</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
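For anyone on Python 3, here is a rough port of the same idea to the `bs4` package. This is a sketch under a few assumptions: `beautifulsoup4` is installed, the built-in `"html.parser"` backend is acceptable, and `string=` is used as the newer spelling of `text=`. The names `text_nodes` and `h2_tags` are mine.

```python
import re
from bs4 import BeautifulSoup  # assumes the third-party beautifulsoup4 package

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text, "html.parser")

# A string-only search returns the matching text nodes, not their tags.
text_nodes = soup.find_all(string=re.compile(r' #\S{11}'))

# Step up from each text node to its enclosing tag and keep only the h2s.
h2_tags = [node.parent for node in text_nodes if node.parent.name == "h2"]

for tag in h2_tags:
    print(tag)
```

Each item in `h2_tags` is a regular tag object, so it can serve as the starting point for walking the tree.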

Other tips

BeautifulSoup search operations return [a list of] BeautifulSoup.NavigableString objects when text= is used as a criterion, as opposed to BeautifulSoup.Tag objects in other cases. Check the object's __dict__ to see the attributes available to you. Of these attributes, parent is preferred over previous because of changes in BS4.

from BeautifulSoup import BeautifulSoup
from pprint import pprint
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)

# Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
pattern = re.compile(r'cool')

pprint(soup.find(text=pattern).__dict__)
#>> {'next': u'\n',
#>>  'nextSibling': None,
#>>  'parent': <h2>this is cool #12345678901</h2>,
#>>  'previous': <h2>this is cool #12345678901</h2>,
#>>  'previousSibling': None}

print soup.find('h2')
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern)
#>> this is cool #12345678901
print soup.find('h2', text=pattern).parent
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern) == soup.find('h2')
#>> False
print soup.find('h2', text=pattern) == soup.find('h2').text
#>> True
print soup.find('h2', text=pattern).parent == soup.find('h2')
#>> True

And with BS4 (Beautiful Soup 4), the OP's attempt works exactly as expected:

import re
from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>")
soup('h2',text=re.compile(r' #\S{11}'))

which returns [<h2> this is cool #12345678901 </h2>].
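To make the BS4 behaviour a bit more concrete, here is a small sketch showing that combining a tag name with a string filter gives back tag objects you can navigate from directly. It assumes `beautifulsoup4` is installed; the second h2 in the sample markup and the variable names are added for illustration only.

```python
import re
from bs4 import BeautifulSoup, Tag  # assumes the third-party beautifulsoup4 package

markup = "<h2> this is cool #12345678901 </h2><h2>this is nothing</h2>"
soup = BeautifulSoup(markup, "html.parser")
pattern = re.compile(r' #\S{11}')

# With a tag name plus string=, bs4 returns the matching tags themselves.
tags = soup.find_all("h2", string=pattern)
print(tags)

# The result is a Tag, so tree navigation works without going through .parent.
first = tags[0]
next_h2 = first.find_next_sibling("h2")
print(next_h2)
```

One caveat, as far as I understand the bs4 docs: this form compares the pattern against the tag's `.string`, so an h2 that contains nested child tags may not match, and the `.parent` approach could be the safer choice there.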

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow