How can I extract the data I need from an HTML file?

https://stackoverflow.com/questions/560936

05-09-2019

Question

This is the HTML I have:

p_tags = '''<p class="foo-body">
  <font class="test-proof">Full name</font> Foobar<br />
  <font class="test-proof">Born</font> July 7, 1923, foo, bar<br />
  <font class="test-proof">Current age</font> 27 years 226 days<br />
  <font class="test-proof">Major teams</font> <span style="white-space: nowrap">Japan,</span> <span style="white-space: nowrap">Jakarta,</span> <span style="white-space: nowrap">bazz,</span> <span style="white-space: nowrap">foo,</span> <span style="white-space: nowrap">foobazz</span><br />
  <font class="test-proof">Also</font> bar<br />
  <font class="test-proof">foo style</font> hand <br />
  <font class="test-proof">bar style</font> ball<br />
  <font class="test-proof">foo position</font> bak<br />
  <br class="bar" />
</p>'''

This is my Python code, using Beautiful Soup:

def get_info(p_tags):
    """Returns brief information."""

    head_list = []
    detail_list = []
    # This works fine
    for head in p_tags.findAll('font', 'test-proof'):
        head_list.append(head.contents[0])

    # Some problem with this?
    for index in xrange(2, 30, 4):
        detail_list.append(p_tags.contents[index])


    return dict([(l, detail_list[head_list.index(l)]) for l in head_list])

I get the correct head_list from the HTML, but the detail_list does not work.

head_list = [u'Full name',
 u'Born',
 u'Current age',
 u'Major teams',
 u'Also',
 u'foo style',
 u'bar style',
 u'foo position']

I wanted something like this:

{
  'Full name': 'Foobar', 
  'Born': 'July 7, 1923, foo, bar', 
  'Current age': '27 years 226 days', 
  'Major teams': 'Japan, Jakarta, bazz, foo, foobazz', 
  'Also': 'bar', 
  'foo style': 'hand', 
  'bar style': 'ball', 
  'foo position': 'bak'
}

Any help would be appreciated. Thanks in advance.

Solution

Sorry for the unnecessarily complex code, I badly need a big dose of caffeine ;)

import re

str = """<p class="foo-body">
  <font class="test-proof">Full name</font> Foobar<br />
  <font class="test-proof">Born</font> July 7, 1923, foo, bar<br />
  <font class="test-proof">Current age</font> 27 years 226 days<br />
  <font class="test-proof">Major teams</font> <span style="white-space: nowrap">Japan,</span> <span style="white-space: nowrap">Jakarta,</span> <span style="white-space: nowrap">bazz,</span> <span style="white-space: nowrap">foo,</span> <span style="white-space: nowrap">foobazz</span><br />
  <font class="test-proof">Also</font> bar<br />
  <font class="test-proof">foo style</font> hand <br />
  <font class="test-proof">bar style</font> ball<br />
  <font class="test-proof">foo position</font> bak<br />
  <br class="bar" />
</p>"""

# Capture the label inside each <font> tag and the text that follows it up to the next <br />.
R_EXTRACT_DATA = re.compile(r"<font\s[^>]*>\s*(.*?)\s*</font>\s*(.*?)\s*<br />", re.IGNORECASE)
# Match opening and closing <span> tags so they can be removed from the values.
R_STRIP_TAGS = re.compile(r"<span\s[^>]*>|</span>", re.IGNORECASE)

def strip_tags(text):
    """Strip unnecessary <span> tags."""
    return R_STRIP_TAGS.sub("", text)

def get_info(text):
    """Extract label/value pairs from the given HTML string."""
    pairs = R_EXTRACT_DATA.findall(text)
    return dict((label, strip_tags(value)) for label, value in pairs)

print(get_info(str))

Other tips

I started answering this before I realised you were using Beautiful Soup, but here is a parser written with the HTMLParser library that I think works with your example string:

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

results = {}
class myParse(HTMLParser):

   def __init__(self):
      self.state = ""
      HTMLParser.__init__(self)

   def handle_starttag(self, tag, attrs):
      attrs = dict(attrs)
      if tag == "font" and attrs.get("class") == "test-proof":
         self.state = "getKey"

   def handle_endtag(self, tag):
      if self.state == "getKey" and tag == "font":
         self.state = "getValue"

   def handle_data(self, data):
      data = data.strip()
      if not data:
         return
      if self.state == "getKey":
         self.resultsKey = data
      elif self.state == "getValue":
         if self.resultsKey in results:
            results[self.resultsKey] += " " + data 
         else: 
            results[self.resultsKey] = data


if __name__ == "__main__":
   p_tags = """<p class="foo-body">  <font class="test-proof">Full name</font> Foobar<br />  <font class="test-proof">Born</font> July 7, 1923, foo, bar<br />  <font class="test-proof">Current age</font> 27 years 226 days<br />  <font class="test-proof">Major teams</font> <span style="white-space: nowrap">Japan,</span> <span style="white-space: nowrap">Jakarta,</span> <span style="white-space: nowrap">bazz,</span> <span style="white-space: nowrap">foo,</span> <span style="white-space: nowrap">foobazz</span><br />  <font class="test-proof">Also</font> bar<br />  <font class="test-proof">foo style</font> hand <br />  <font class="test-proof">bar style</font> ball<br />  <font class="test-proof">foo position</font> bak<br />  <br class="bar" /></p>"""
   parser = myParse()
   parser.feed(p_tags)
   print(results)

This gives the result:

{'foo position': 'bak', 
'Major teams': 'Japan, Jakarta, bazz, foo, foobazz', 
'Also': 'bar', 
'Current age': '27 years 226 days', 
'Born': 'July 7, 1923, foo, bar' , 
'foo style': 'hand', 
'bar style': 'ball', 
'Full name': 'Foobar'}

The problem is that your HTML is not very well thought out: you have a "mixed content model" in which your labels and your data are interleaved. Your labels are wrapped in <font> tags, but your data is in NavigableString nodes.

You need to iterate over the contents of p_tag. There will be two kinds of nodes: Tag nodes (your <font> tags) and NavigableString nodes, which hold the other bits of text.

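# Written against BeautifulSoup 4 (bs4); the original answer targeted the older BeautifulSoup 3 package.
from bs4 import BeautifulSoup, NavigableString, Tag

# p_tags is the HTML string from the question.
p_tag = BeautifulSoup(p_tags, "html.parser").p
label_value_pairs = []
label = None
for n in p_tag.contents:
    if isinstance(n, Tag) and n.name == "font":
        label = n.string
    elif isinstance(n, NavigableString):
        value = n.strip()
        # Skip whitespace-only nodes and any text before the first label.
        if value and label is not None:
            label_value_pairs.append((label, value))
    else:
        # Generally n.name == "br". <span> tags (the "Major teams" values)
        # also end up here and would need extra handling.
        pass
print(dict(label_value_pairs))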

Something roughly like that.
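As a rough illustration of the same node-walking idea, the sketch below also collects the text inside <span> tags, so that a value like the "Major teams" list is not lost. It assumes BeautifulSoup 4 (the bs4 package) is installed; the function name extract_pairs and the shortened example_html string are made up for this example and are not part of the question's code.

```python
from bs4 import BeautifulSoup, Tag  # assumption: BeautifulSoup 4 is installed


def extract_pairs(html):
    """Map each <font class="test-proof"> label to the text that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    info = {}
    for font in soup.find_all("font", class_="test-proof"):
        pieces = []
        for node in font.next_siblings:
            if isinstance(node, Tag) and node.name == "br":
                break  # the value ends at the next <br />
            # Tags such as <span> contribute their text; plain strings are used as-is.
            pieces.append(node.get_text() if isinstance(node, Tag) else str(node))
        # Collapse the runs of whitespace left over from the markup.
        info[font.get_text(strip=True)] = " ".join("".join(pieces).split())
    return info


# Hypothetical, shortened version of the HTML from the question.
example_html = """<p class="foo-body">
  <font class="test-proof">Full name</font> Foobar<br />
  <font class="test-proof">Major teams</font> <span style="white-space: nowrap">Japan,</span> <span style="white-space: nowrap">Jakarta,</span><br />
  <font class="test-proof">foo style</font> hand <br />
</p>"""

print(extract_pairs(example_html))
```

Walking next_siblings from each label, rather than indexing into contents at fixed positions, means the result does not depend on how many whitespace or <span> nodes happen to sit between two labels.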

You want to find the strings that come after a > and before a <, ignoring leading or trailing whitespace. You can do this quite easily with a loop that looks at each character in the string, or regular expressions could help. Something like >[ \t]*[^<]+[ \t]*<.
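A minimal sketch of that regular-expression idea, using only the standard library. The pattern is a slight variant of the one suggested above: it uses \s so that newlines are trimmed too, and a capture group so that only the text is returned. The names TEXT_BETWEEN_TAGS, text_chunks, and sample_html are invented for this illustration.

```python
import re

# Text that sits between a ">" and the next "<", with surrounding whitespace trimmed.
TEXT_BETWEEN_TAGS = re.compile(r">\s*([^<]*?)\s*<")


def text_chunks(html):
    """Return the non-empty pieces of text found between tags."""
    return [chunk for chunk in TEXT_BETWEEN_TAGS.findall(html) if chunk]


# Hypothetical two-row excerpt of the HTML from the question.
sample_html = (
    '<p><font class="test-proof">Full name</font> Foobar<br />\n'
    '  <font class="test-proof">Born</font> July 7, 1923, foo, bar<br /></p>'
)

print(text_chunks(sample_html))
# ['Full name', 'Foobar', 'Born', 'July 7, 1923, foo, bar']
```

This only yields a flat list of text pieces. Deciding which piece is a label and which is a value still has to be done separately, and a regex like this is fragile on HTML that is less regular than the sample.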

You can also use re.split with a regex that matches the tags, something like <[^>]*>, as the splitter. You will get some empty entries in the resulting list, but these are easily removed.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow
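The re.split variant might look roughly like this. The names TAG, split_on_tags, and teams_html are assumptions made for the example, and the input is a shortened copy of the "Major teams" row from the question.

```python
import re

# Anything that looks like a tag: "<", then any characters except ">", then ">".
TAG = re.compile(r"<[^>]*>")


def split_on_tags(html):
    """Split the markup on tags and drop empty or whitespace-only entries."""
    pieces = (piece.strip() for piece in TAG.split(html))
    return [piece for piece in pieces if piece]


# Hypothetical shortened version of the "Major teams" row.
teams_html = (
    '<font class="test-proof">Major teams</font> '
    '<span style="white-space: nowrap">Japan,</span> '
    '<span style="white-space: nowrap">Jakarta,</span><br />'
)

print(split_on_tags(teams_html))
# ['Major teams', 'Japan,', 'Jakarta,']
```

As with the previous approach, the split produces a flat list, so values that are spread over several <span> tags come back as separate entries and need to be joined again.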