How do I extract the data I need from an HTML file?

https://stackoverflow.com/questions/560936

05-09-2019

Question

This is the HTML I have:

p_tags = '''<p class="foo-body">
  <font class="test-proof">Full name</font> Foobar<br />
  <font class="test-proof">Born</font> July 7, 1923, foo, bar<br />
  <font class="test-proof">Current age</font> 27 years 226 days<br />
  <font class="test-proof">Major teams</font> <span style="white-space: nowrap">Japan,</span> <span style="white-space: nowrap">Jakarta,</span> <span style="white-space: nowrap">bazz,</span> <span style="white-space: nowrap">foo,</span> <span style="white-space: nowrap">foobazz</span><br />
  <font class="test-proof">Also</font> bar<br />
  <font class="test-proof">foo style</font> hand <br />
  <font class="test-proof">bar style</font> ball<br />
  <font class="test-proof">foo position</font> bak<br />
  <br class="bar" />
</p>'''

This is my Python code, using Beautiful Soup:

def get_info(p_tags):
    """Returns brief information."""

    head_list = []
    detail_list = []
    # This works fine
    for head in p_tags.findAll('font', 'test-proof'):
        head_list.append(head.contents[0])

    # Some problem with this?
    for index in xrange(2, 30, 4):
        detail_list.append(p_tags.contents[index])


    return dict([(l, detail_list[head_list.index(l)]) for l in head_list])

I get the right head_list from the HTML, but detail_list does not work.

head_list = [u'Full name',
 u'Born',
 u'Current age',
 u'Major teams',
 u'Also',
 u'foo style',
 u'bar style',
 u'foo position']

I wanted something like this:

{
  'Full name': 'Foobar', 
  'Born': 'July 7, 1923, foo, bar', 
  'Current age': '27 years 226 days', 
  'Major teams': 'Japan, Jakarta, bazz, foo, foobazz', 
  'Also': 'bar', 
  'foo style': 'hand', 
  'bar style': 'ball', 
  'foo position': 'bak'
}

Any help would be appreciated. Thanks in advance.

Solution

Sorry for the unnecessarily complex code, I badly need a big dose of caffeine ;)

import re

str = """<p class="foo-body">
  <font class="test-proof">Full name</font> Foobar<br />
  <font class="test-proof">Born</font> July 7, 1923, foo, bar<br />
  <font class="test-proof">Current age</font> 27 years 226 days<br />
  <font class="test-proof">Major teams</font> <span style="white-space: nowrap">Japan,</span> <span style="white-space: nowrap">Jakarta,</span> <span style="white-space: nowrap">bazz,</span> <span style="white-space: nowrap">foo,</span> <span style="white-space: nowrap">foobazz</span><br />
  <font class="test-proof">Also</font> bar<br />
  <font class="test-proof">foo style</font> hand <br />
  <font class="test-proof">bar style</font> ball<br />
  <font class="test-proof">foo position</font> bak<br />
  <br class="bar" />
</p>"""

# Raw strings keep the backslashes in the patterns intact
R_EXTRACT_DATA = re.compile(r"<font\s[^>]*>\s*(.*?)\s*</font>\s*(.*?)\s*<br />", re.IGNORECASE)
R_STRIP_TAGS = re.compile(r"<span\s[^>]*>|</span>", re.IGNORECASE)

def strip_tags(text):
    """Strip unnecessary <span> tags
    """
    return R_STRIP_TAGS.sub("", text)

def get_info(text):
    """Extract useful info from the given string
    """
    # findall returns one (label, raw value) tuple per <font>...<br /> line
    data = R_EXTRACT_DATA.findall(text)
    data_dict = {}

    for label, value in data:
        data_dict[label] = strip_tags(value)

    return data_dict

print(get_info(str))

Other tips

I started answering this before I realized you were using Beautiful Soup, but here is a parser, written with the HTMLParser library, that I think works with your example string:

try:
   from HTMLParser import HTMLParser   # Python 2
except ImportError:
   from html.parser import HTMLParser  # Python 3

results = {}
class myParse(HTMLParser):

   def __init__(self):
      self.state = ""
      HTMLParser.__init__(self)

   def handle_starttag(self, tag, attrs):
      attrs = dict(attrs)
      # A <font class="test-proof"> tag starts a new label (key)
      if tag == "font" and attrs.get("class") == "test-proof":
         self.state = "getKey"

   def handle_endtag(self, tag):
      # After the label's </font>, the text that follows is the value
      if self.state == "getKey" and tag == "font":
         self.state = "getValue"

   def handle_data(self, data):
      data = data.strip()
      if not data:
         return
      if self.state == "getKey":
         self.resultsKey = data
      elif self.state == "getValue":
         # A value can arrive in several pieces (one per <span>); join them
         if self.resultsKey in results:
            results[self.resultsKey] += " " + data
         else:
            results[self.resultsKey] = data


if __name__ == "__main__":
   p_tags = """<p class="foo-body">  <font class="test-proof">Full name</font> Foobar<br />  <font class="test-proof">Born</font> July 7, 1923, foo, bar<br />  <font class="test-proof">Current age</font> 27 years 226 days<br />  <font class="test-proof">Major teams</font> <span style="white-space: nowrap">Japan,</span> <span style="white-space: nowrap">Jakarta,</span> <span style="white-space: nowrap">bazz,</span> <span style="white-space: nowrap">foo,</span> <span style="white-space: nowrap">foobazz</span><br />  <font class="test-proof">Also</font> bar<br />  <font class="test-proof">foo style</font> hand <br />  <font class="test-proof">bar style</font> ball<br />  <font class="test-proof">foo position</font> bak<br />  <br class="bar" /></p>"""
   parser = myParse()
   parser.feed(p_tags)
   print(results)

This gives the result:

{'foo position': 'bak', 
'Major teams': 'Japan, Jakarta, bazz, foo, foobazz', 
'Also': 'bar', 
'Current age': '27 years 226 days', 
'Born': 'July 7, 1923, foo, bar' , 
'foo style': 'hand', 
'bar style': 'ball', 
'Full name': 'Foobar'}

The problem is that your HTML is not very well thought out: you have a "mixed content model" in which your labels and your data are interleaved. Your labels are wrapped in <font> tags, but your data sits in NavigableString nodes.

You need to iterate over the contents of p_tag. There will be two kinds of nodes: Tag nodes (which hold your <font> tags) and NavigableString nodes, which hold the other bits of text.

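# Assumes Beautiful Soup 4 (bs4) and the p_tags HTML string from the question
from bs4 import BeautifulSoup, NavigableString, Tag

p_tag = BeautifulSoup(p_tags, "html.parser").p
label_value_pairs = []
for n in p_tag.contents:
    if isinstance(n, Tag) and n.name == "font":
        # A <font> tag starts a new label with an empty value
        label_value_pairs.append([n.get_text(strip=True), ""])
    elif isinstance(n, NavigableString) and label_value_pairs:
        # Plain text belongs to the most recent label
        label_value_pairs[-1][1] += n
    elif isinstance(n, Tag) and n.name == "span" and label_value_pairs:
        # <span> text (as in "Major teams") is part of the value too
        label_value_pairs[-1][1] += n.get_text()
    # Anything else is generally a <br /> and is skipped

# Collapse the whitespace left over from the HTML layout
print(dict((label, " ".join(value.split())) for label, value in label_value_pairs))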

Something roughly like that.

You want to find the strings that are preceded by > and followed by <, ignoring any leading or trailing whitespace. You can do this fairly easily with a loop that looks at each character in the string, or regular expressions could help. Something like >[ \t]*[^<]+[ \t]*<.
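
A minimal sketch of the regular-expression idea, run on a shortened copy of the question's HTML. The names `html` and `TEXT_BETWEEN_TAGS` are illustrative and not from the original, and the whitespace is stripped in Python rather than inside the pattern:

```python
import re

# Shortened copy of the question's HTML (an assumption for illustration)
html = '''<p class="foo-body">
  <font class="test-proof">Full name</font> Foobar<br />
  <font class="test-proof">Major teams</font> <span style="white-space: nowrap">Japan,</span> <span style="white-space: nowrap">Jakarta</span><br />
  <font class="test-proof">Also</font> bar<br />
</p>'''

# Capture whatever sits between a '>' and the next '<'
TEXT_BETWEEN_TAGS = re.compile(r">([^<]+)<")

# Strip each chunk and drop the ones that were only whitespace
chunks = [c.strip() for c in TEXT_BETWEEN_TAGS.findall(html)]
chunks = [c for c in chunks if c]
print(chunks)
```

The result is a flat list of text chunks. Labels and values are not distinguished, so pairing them up still needs extra logic.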

You could also use re.split with a regex that matches a tag, something like <[^>]*>, as the separator. You will get empty entries in the resulting list, but those are easily removed.
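
A rough sketch of the re.split idea, again on a shortened copy of the question's HTML. Splitting each record on `<br />` first and treating the first text piece as the label is an assumption added here to build the dictionary; the names `html`, `TAG`, `BR`, `text_pieces`, and `info` are hypothetical:

```python
import re

# Shortened copy of the question's HTML (an assumption for illustration)
html = '''<p class="foo-body">
  <font class="test-proof">Full name</font> Foobar<br />
  <font class="test-proof">Major teams</font> <span style="white-space: nowrap">Japan,</span> <span style="white-space: nowrap">Jakarta</span><br />
  <font class="test-proof">Also</font> bar<br />
</p>'''

TAG = re.compile(r"<[^>]*>")     # any tag, used as the separator
BR = re.compile(r"<br\b[^>]*>")  # assumption: every record ends with a <br> tag

def text_pieces(fragment):
    """Split on tags and drop the empty or whitespace-only entries."""
    return [p.strip() for p in TAG.split(fragment) if p.strip()]

info = {}
for record in BR.split(html):
    pieces = text_pieces(record)
    if len(pieces) >= 2:
        # First piece is the label, the remaining pieces form the value
        info[pieces[0]] = " ".join(pieces[1:])
print(info)
```

This relies on the layout in the question, one label followed by its value on each `<br />`-terminated line, and would need adjusting for differently structured HTML.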

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow