What is this function doing in Python involving urllib2 and BeautifulSoup?

https://stackoverflow.com/questions/991967

13-09-2019

Question

I asked an earlier question about retrieving high scores from an HTML page, and another user gave me the following code to help. I'm new to Python and BeautifulSoup, so I'm trying to work through other people's code piece by piece. I understand most of it, but I don't get what this piece of code is or what it does:

    def parse_string(el):
       text = ''.join(el.findAll(text=True))
       return text.strip()

Here is the full code:

    from urllib2 import urlopen
    from BeautifulSoup import BeautifulSoup
    import sys

    URL = "http://hiscore.runescape.com/hiscorepersonal.ws?user1=" + sys.argv[1]

    # Grab page html, create BeautifulSoup object
    html = urlopen(URL).read()
    soup = BeautifulSoup(html)

    # Grab the <table id="mini_player"> element
    scores = soup.find('table', {'id':'mini_player'})

    # Get a list of all the <tr>s in the table, skip the header row
    rows = scores.findAll('tr')[1:]

    # Helper function to return concatenation of all character data in an element
    def parse_string(el):
       text = ''.join(el.findAll(text=True))
       return text.strip()

    for row in rows:

       # Get all the text from the <td>s
       data = map(parse_string, row.findAll('td'))

       # Skip the first td, which is an image
       data = data[1:]

       # Do something with the data...
       print data

Solution

el.findAll(text=True) returns all the text contained within an element and its sub-elements. By text I mean everything that is not inside a tag; so in <b>hello</b>, "hello" would be the text, but <b> and </b> would not.

Therefore, this function joins together all the text found beneath the given element and strips the whitespace off the front and back.
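
To make the join-then-strip behavior concrete, here is a minimal sketch that mimics it using only the Python 3 standard library. It is not BeautifulSoup and not the code from the question; TextCollector and extract_text are hypothetical names made up for this illustration, and html.parser stands in for findAll(text=True) on the assumption that collecting every run of character data is a fair approximation of it.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers every text node and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of character data found between tags
        self.chunks.append(data)


def extract_text(html_fragment):
    """Rough stand-in for parse_string: join all text pieces, then strip."""
    collector = TextCollector()
    collector.feed(html_fragment)
    collector.close()
    # Same two steps as parse_string: concatenate, then trim both ends
    return ''.join(collector.chunks).strip()


print(extract_text('<td>  <b>hello</b> <i>world</i>  </td>'))  # prints: hello world
```

The tags contribute nothing to the result; only the text between them is kept, and the whitespace just inside the <td> is removed by strip().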

Here is a link to the findAll documentation: http://www.crummy.com/software/BeautifulSoup/documentation.html#arg-text

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow