Using BeautifulSoup to find an HTML tag that contains certain text

https://stackoverflow.com/questions/866000

21-08-2019

Question

I'm trying to get the elements in an HTML document that contain the following pattern of text: #\S{11}

<h2> this is cool #12345678901 </h2>

So the element above would be matched by:

soup('h2',text=re.compile(r' #\S{11}'))

And the results would be something like:

[u'blahblah #223409823523', u'thisisinteresting #293845023984']

I'm able to get all the text that matches (see the line above). But I want the parent element of the matching text, so I can use it as a starting point for traversing the document tree. In this case, I'd want all the h2 elements returned, not the text matches.

Any ideas?

Solution

from BeautifulSoup import BeautifulSoup
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)


for elem in soup(text=re.compile(r' #\S{11}')):
    print elem.parent

This prints:

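<h2>this is cool #12345678901</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>

The h1 is printed as well, because this search matches on text alone and is not restricted to h2 tags.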
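For readers on Python 3, the same idea can be sketched with the bs4 package. This is only an illustration under a few assumptions: it uses bs4's built-in "html.parser" backend and the string= keyword (the newer name for text=), and the names parents and h2_parents are made up for this example. Since a text-only search is not limited to any tag, the sketch filters the parents by name afterwards.

```python
import re

from bs4 import BeautifulSoup

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h1>foo #126666678901</h1>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

# Assumption: the stdlib-backed "html.parser" is good enough for this markup.
soup = BeautifulSoup(html_text, "html.parser")
pattern = re.compile(r" #\S{11}")

# With no tag name, find_all(string=...) yields the matching text nodes;
# each node's .parent is the tag that directly contains it.
parents = [node.parent for node in soup.find_all(string=pattern)]

# Keep only the <h2> parents, because the search above also matches the <h1>.
h2_parents = [tag for tag in parents if tag.name == "h2"]

for tag in h2_parents:
    print(tag)
```

Note that \S{11} only requires at least 11 non-whitespace characters after the #, which is why the 12-digit numbers match too.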

Other tips

BeautifulSoup search operations deliver a list of BeautifulSoup.NavigableString objects when text= is used as a criterion, as opposed to BeautifulSoup.Tag objects in other cases. Check the object's __dict__ to see the attributes made available to you. Of these attributes, parent is favored over previous because of changes in BS4.

from BeautifulSoup import BeautifulSoup
from pprint import pprint
import re

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
<h2>this is interesting #126666678901</h2>
<h2>this is blah #124445678901</h2>
"""

soup = BeautifulSoup(html_text)

# Even though the OP was not looking for 'cool', it's more understandable to work with item zero.
pattern = re.compile(r'cool')

pprint(soup.find(text=pattern).__dict__)
#>> {'next': u'\n',
#>>  'nextSibling': None,
#>>  'parent': <h2>this is cool #12345678901</h2>,
#>>  'previous': <h2>this is cool #12345678901</h2>,
#>>  'previousSibling': None}

print soup.find('h2')
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern)
#>> this is cool #12345678901
print soup.find('h2', text=pattern).parent
#>> <h2>this is cool #12345678901</h2>
print soup.find('h2', text=pattern) == soup.find('h2')
#>> False
print soup.find('h2', text=pattern) == soup.find('h2').text
#>> True
print soup.find('h2', text=pattern).parent == soup.find('h2')
#>> True
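A rough Python 3 counterpart of the inspection above, assuming the bs4 package and its "html.parser" backend, might look like the sketch below. In bs4 the BS3 attributes next, previous, nextSibling and previousSibling are spelled next_element, previous_element, next_sibling and previous_sibling, while parent keeps its name. The variable name node is just for illustration.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html_text = """
<h2>this is cool #12345678901</h2>
<h2>this is nothing</h2>
"""

soup = BeautifulSoup(html_text, "html.parser")

# Searching by string alone returns the text node, not the tag around it.
node = soup.find(string=re.compile(r"cool"))

print(isinstance(node, NavigableString))   # the match is a text node
print(isinstance(node.parent, Tag))        # its parent is the enclosing tag
print(node.parent)                         # the <h2> element itself
print(node.previous_element is node.parent)  # element parsed just before the text
print(node.previous_sibling)               # no siblings inside this <h2>
```

Here parent and previous_element happen to point at the same h2 only because the text is the first child of its tag; parent is the one that always means "the enclosing element".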

With bs4 (Beautiful Soup 4), the OP's attempt works exactly as expected:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>", "html.parser")
soup('h2', text=re.compile(r' #\S{11}'))

returns [<h2> this is cool #12345678901 </h2>].
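To make the difference in return types concrete, here is a small sketch, again assuming bs4 with the "html.parser" backend and the string= keyword (the newer spelling of text=). The names tags and strings are illustrative only.

```python
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup("<h2> this is cool #12345678901 </h2>", "html.parser")
pattern = re.compile(r" #\S{11}")

# With a tag name, the string filter is checked against each tag's .string
# and the matching tags themselves are returned.
tags = soup.find_all("h2", string=pattern)

# Without a tag name, the matching text nodes are returned instead.
strings = soup.find_all(string=pattern)

print(tags)
print(strings)
print(strings[0].parent is tags[0])
```

One caveat worth keeping in mind: the tag-plus-string form relies on the tag's .string, so it may not match an h2 whose text is split across nested child tags. In that case, searching for the text node and taking its parent is the more forgiving route.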

Licensed under: CC-BY-SA with attribution