Using BeautifulSoup to find an HTML tag that contains certain text

https://stackoverflow.com/questions/866000

21-08-2019
Question

I'm trying to get the elements in an HTML document that contain the following text pattern: #\S{11}

<h2> this is cool #12345678901 </h2>

So the element above would be matched by using:

soup('h2',text=re.compile(r' #\S{11}'))

And the results would be something like:

[u'blahblah #223409823523', u'thisisinteresting #293845023984']

I'm able to get all the text that matches (see the line above). But I want the parent element of the matching text, so I can use it as a starting point for traversing the document tree. In this case, I want all the h2 elements to be returned, not the text matches.

Ideas?

Solution

from BeautifulSoup import BeautifulSoup
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)


for elem in soup(text=re.compile(r' #\S{11}')):
    print elem.parent

Prints:

<h2>this is cool #12345678901</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
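
For readers on Python 3 with Beautiful Soup 4, the same idea can be sketched as follows. This is a port under stated assumptions, not part of the original answer: it assumes bs4 4.4.0 or later (for the string= keyword) and the built-in html.parser, and the names parents and h2_only are illustrative. Because a search on text alone is not restricted to any tag, the sketch filters the parents by tag name to keep only the h2 elements.

```python
import re
from bs4 import BeautifulSoup

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

# Assumption: the standard-library parser; any installed parser should do.
soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r" #\S{11}")

# With no tag name, string= matches text nodes (NavigableString objects),
# so .parent is needed to get back to the enclosing element.
parents = [node.parent for node in soup.find_all(string=pattern)]

# Keep only the h2 elements; the h1 also contains matching text.
h2_only = [tag for tag in parents if tag.name == "h2"]

for tag in h2_only:
    print(tag)
```

Each item in h2_only is a Tag, so it can serve as the starting point for further navigation (for example via .find_next_sibling() or .parent).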

Other tips

BeautifulSoup search operations deliver [a list of] BeautifulSoup.NavigableString objects when text= is used as a criterion, as opposed to BeautifulSoup.Tag objects in other cases. Check the object's __dict__ to see the attributes made available to you. Of these attributes, parent is favored over previous because of changes in BS4.

from BeautifulSoup import BeautifulSoup
from pprint import pprint
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)

# Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
pattern = re.compile(r'cool')

pprint(soup.find(text=pattern).__dict__)
#>> {'next': u'\n',
#>>  'nextSibling': None,
#>>  'parent': <h2>this is cool #12345678901</h2>,
#>>  'previous': <h2>this is cool #12345678901</h2>,
#>>  'previousSibling': None}

print soup.find('h2')
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern)
#>> this is cool #12345678901
print soup.find('h2', text=pattern).parent
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern) == soup.find('h2')
#>> False
print soup.find('h2', text=pattern) == soup.find('h2').text
#>> True
print soup.find('h2', text=pattern).parent == soup.find('h2')
#>> True

With BS4 (Beautiful Soup 4), the OP's attempt works exactly as expected:

import re
from bs4 import BeautifulSoup

# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>", "html.parser")
soup('h2', text=re.compile(r' #\S{11}'))

returns [<h2> this is cool #12345678901 </h2>].
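
Since BS4 returns the Tag itself when a tag name and a text filter are combined, the result can be used directly as a starting point for walking the tree. The following is a minimal sketch, not taken from the thread: the sample markup, the sibling p elements, and the names matches and following are made up for illustration, and it assumes bs4 4.4.0 or later, where string= is the preferred spelling of text=.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: each heading is followed by a paragraph.
html_text = """
<div><h2>this is cool #12345678901</h2><p>first section</p></div>
<div><h2>this is nothing</h2><p>second section</p></div>
"""

soup = BeautifulSoup(html_text, "html.parser")

# Tag name plus string= yields Tag objects whose .string matches the pattern.
matches = soup.find_all("h2", string=re.compile(r" #\S{11}"))

# Navigate from each matching h2 to the paragraph that follows it.
following = [h2.find_next_sibling("p").get_text() for h2 in matches]

print(matches)
print(following)
```

One caveat: the filter is applied to the tag's .string, which is None when the tag has more than one child, so an h2 that wraps additional markup would not match this way. In that case, searching the text nodes and taking .parent remains the fallback.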

Licensed under: CC-BY-SA with attribution
