Using BeautifulSoup to find an HTML tag that contains certain text

https://stackoverflow.com/questions/866000

21-08-2019

Question

I am trying to get the elements in an HTML document that contain the following pattern of text: #\S{11}

<h2> this is cool #12345678901 </h2>

The element above would be matched by using:

soup('h2',text=re.compile(r' #\S{11}'))

And the results would be something like:

[u'blahblah #223409823523', u'thisisinteresting #293845023984']

I am able to get all the text that matches (see the line above). But I want to match the parent element of the text, so that I can use it as a starting point for traversing the document tree. In this case, I would want all the h2 elements returned, not the text matches.

Any ideas?

Solution

from BeautifulSoup import BeautifulSoup
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)


# text= matches the text nodes themselves; .parent is the tag enclosing each match
for elem in soup(text=re.compile(r' #\S{11}')):
    print elem.parent

Prints:

<h2>this is cool #12345678901</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
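
For readers on Python 3, the same idea can be sketched with Beautiful Soup 4. This is an adaptation rather than the answer's original code: it assumes the `bs4` package is installed, uses Python's built-in `html.parser`, and the names `matching_parents` and `h2_only` are made up for illustration. Since the search is done without a tag name, the h1 heading matches the pattern as well, so the sketch filters on the parent's tag name to keep only the h2 elements.

```python
import re
from bs4 import BeautifulSoup

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

pattern = re.compile(r' #\S{11}')
soup = BeautifulSoup(html_text, "html.parser")

# string= is the Beautiful Soup 4 spelling of the older text= argument.
# Without a tag name, find_all() returns the matching text nodes.
matching_parents = [s.parent for s in soup.find_all(string=pattern)]

# Keep only the h2 parents (assumption: h2 is the tag of interest, as in the question).
h2_only = [tag for tag in matching_parents if tag.name == "h2"]

for tag in h2_only:
    print(tag)
```

Each entry in `h2_only` is a regular tag object, so it can serve as the starting point for further tree traversal.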

Other tips

BeautifulSoup search operations return [a list of] BeautifulSoup.NavigableString objects when text= is used as a criterion, as opposed to BeautifulSoup.Tag objects in other cases. Check the object's __dict__ to see the attributes available to you. Of these attributes, parent is preferred over previous because of changes in BS4.

from BeautifulSoup import BeautifulSoup
from pprint import pprint
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)

# Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
pattern = re.compile(r'cool')

pprint(soup.find(text=pattern).__dict__)
#>> {'next': u'\n',
#>>  'nextSibling': None,
#>>  'parent': <h2>this is cool #12345678901</h2>,
#>>  'previous': <h2>this is cool #12345678901</h2>,
#>>  'previousSibling': None}

print soup.find('h2')
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern)
#>> this is cool #12345678901
print soup.find('h2', text=pattern).parent
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern) == soup.find('h2')
#>> False
print soup.find('h2', text=pattern) == soup.find('h2').text
#>> True
print soup.find('h2', text=pattern).parent == soup.find('h2')
#>> True
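
The attribute names in the `__dict__` dump above are those of BeautifulSoup 3. As a rough sketch of the Beautiful Soup 4 counterparts (assuming the `bs4` package and Python's built-in `html.parser`; the variable name `match` is only illustrative), the navigation attributes are spelled `previous_element`, `next_sibling` and so on, while `parent` keeps its name:

```python
import re
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
"""

soup = BeautifulSoup(html_text, "html.parser")

# Searching by string alone yields the text node, not the tag.
match = soup.find(string=re.compile(r'cool'))

print(isinstance(match, NavigableString))      # the result is a text node
print(match.parent)                            # the enclosing h2 tag
print(match.previous_element is match.parent)  # here the node parsed just before the text is its own tag
print(match.next_sibling)                      # the text is the only child of its h2
```

In this small document `previous_element` happens to coincide with `parent`, but in general it is simply whatever was parsed immediately before, which is one reason to prefer `parent`.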

With BS4 (Beautiful Soup 4), the OP's attempt works exactly as expected:

import re
from bs4 import BeautifulSoup
soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>")
soup('h2',text=re.compile(r' #\S{11}'))

returns [<h2> this is cool #12345678901 </h2>].
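
To make the difference concrete, here is a small sketch for Beautiful Soup 4 (assuming the `bs4` package and the built-in `html.parser`). The second heading with a nested `<b>` tag and the helper `h2_with_code` are hypothetical additions for illustration. As far as I know, a string filter combined with a tag name is compared against the tag's `.string`, which is `None` when the tag has several children, so such a heading is probably only found by a function filter that looks at the full text.

```python
import re
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

pattern = re.compile(r' #\S{11}')
soup = BeautifulSoup(
    "<h2> this is cool #12345678901 </h2>"
    "<h2>this is <b>bold</b> #12345678901</h2>",
    "html.parser",
)

# Without a tag name: matching text nodes.
strings_only = soup.find_all(string=pattern)

# With a tag name: tags whose string matches.
tags = soup.find_all("h2", string=pattern)

def h2_with_code(tag):
    """Hypothetical helper: match h2 tags by their complete text, nested tags included."""
    return tag.name == "h2" and pattern.search(tag.get_text()) is not None

# find_all() also accepts a function that receives each tag and returns True or False.
by_function = soup.find_all(h2_with_code)

print(strings_only)
print(tags)
print(by_function)
```

The function filter is the more general option when the heading may contain inline markup; for plain headings, passing the tag name together with the pattern is shorter.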

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow