Question

For the following XHTML snippet, I need to use either BS4 or XPath to extract attribute name/value pairs from the structured HTML. The attribute name is in an h5 tag, and its value follows in either a span tag or a p tag.

For the code below, I should get the following output as a dictionary:

Husbandary Management: 'Animal: Cow, Farmer: Mr smith'

Milch Category: 'Milk supply'

Services: 'cow milk, ghee'

animal colors: 'red,green...'

<div id="animalcontainer" class="container last fixed-height">

                <h5>
                  Husbandary Management
                </h5>
                <span>
                  Animal: Cow
                </span>
                <span>
                  Farmer: Mr smith
                </span>
                <h5>
                  Milch Category
                </h5>
                <p>
                  Milk supply
                </p>
                <h5>
                  Services
                </h5>
                <p>
                  cow milk, ghee
                </p>
                <h5>
                  animal colors
                </h5>
                <span>
                  green,red
                </span>


              </div>

htmlcode.findAll('h5') finds the h5 elements, but I want each h5 element together with the elements that follow it, up to the next h5.


Solution

Example solution using lxml.html and XPath:

  1. select all h5 elements
  2. and for each h5 element,
    1. select next siblings elements -- following-sibling::*
    2. that are not h5 themselves, -- [not(self::h5)]
    3. and that have exactly as many preceding h5 siblings as the position of the current h5 -- [count(preceding-sibling::h5) = 1], then 2, then 3...

(the position comes from enumerate() in the for loop, starting at 1)

Sample code, with simple prints of the text content of the elements (using lxml.html's .text_content() on elements):

import lxml.html
html = """<div id="animalcontainer" class="container last fixed-height">

                <h5>
                  Husbandary Management
                </h5>
                <span>
                  Animal: Cow
                </span>
                <span>
                  Farmer: Mr smith
                </span>
                <h5>
                  Milch Category
                </h5>
                <p>
                  Milk supply
                </p>
                <h5>
                  Services
                </h5>
                <p>
                  cow milk, ghee
                </p>
                <h5>
                  animal colors
                </h5>
                <span>
                  green,red
                </span>


              </div>"""
doc = lxml.html.fromstring(html)
headers = doc.xpath('//div/h5')
for i, header in enumerate(headers, start=1):
    print("--------------------------------")
    print(header.text_content().strip())
    # siblings after this h5 that are not h5 and have exactly i h5 elements before them
    for following in header.xpath("""following-sibling::*
                                     [not(self::h5)]
                                     [count(preceding-sibling::h5) = %d]""" % i):
        print("\t" + following.text_content().strip())

This outputs:

--------------------------------
Husbandary Management
    Animal: Cow
    Farmer: Mr smith
--------------------------------
Milch Category
    Milk supply
--------------------------------
Services
    cow milk, ghee
--------------------------------
animal colors
    green,red
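
To get the dictionary the question asks for rather than printed lines, the same XPath can feed a dict. This is only a minimal sketch, assuming lxml is installed; the function name h5_groups and the shortened sample markup are made up for illustration, and joining the values with ", " is one reading of the desired output:

```python
import lxml.html


def h5_groups(html):
    """Map each h5 heading text to the joined text of the siblings that follow it."""
    doc = lxml.html.fromstring(html)
    result = {}
    for i, header in enumerate(doc.xpath('//div/h5'), start=1):
        # Same predicate as above: non-h5 siblings with exactly i h5 elements before them
        values = header.xpath(
            'following-sibling::*[not(self::h5)]'
            '[count(preceding-sibling::h5) = %d]' % i)
        # Collapse the indentation and newlines inside each element's text
        key = ' '.join(header.text_content().split())
        result[key] = ', '.join(' '.join(v.text_content().split()) for v in values)
    return result


# Shortened, hypothetical copy of the markup from the question
sample = """<div id="animalcontainer">
  <h5> Husbandary Management </h5>
  <span> Animal: Cow </span>
  <span> Farmer: Mr smith </span>
  <h5> Milch Category </h5>
  <p> Milk supply </p>
</div>"""

groups = h5_groups(sample)
print(groups)
```

Each heading becomes a key, and the text of its span or p elements is joined into a single string value.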

Other tips

I finally did it using BeautifulSoup. It can probably be done more efficiently, since the following solution rebuilds the list of siblings for every h5:

# addinfo is assumed to be the parsed <div id="animalcontainer"> element
h5s = addinfo.find_all('h5')
datad = {}
for h5el in h5s:
    hcontents = list(h5el.next_siblings)
    txtcontents = []
    for con in hcontents:
        # text nodes between tags (whitespace) have no tag name; skip them
        if getattr(con, 'name', None) is None:
            continue
        if con.name == 'h5':
            break
        txtcontents.append(con.contents)
    datad["\n".join(h5el.contents)] = txtcontents
print(datad)
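
On the efficiency point: one way to avoid walking the siblings again for every h5 is to go through the container's direct children once and remember the most recent heading. A rough sketch, assuming BeautifulSoup 4 with the built-in html.parser; collect_by_heading and the shortened sample markup are hypothetical names and data, not part of the original code:

```python
from bs4 import BeautifulSoup


def collect_by_heading(container):
    """Walk the container's direct child tags once, grouping text under each h5."""
    data = {}
    current = None
    # find_all(True, recursive=False) yields only direct child tags, no text nodes
    for child in container.find_all(True, recursive=False):
        text = child.get_text(" ", strip=True)
        if child.name == "h5":
            current = text
            data[current] = []
        elif current is not None:
            # Any tag after a heading belongs to the most recent h5
            data[current].append(text)
    return {key: ", ".join(values) for key, values in data.items()}


# Shortened, hypothetical copy of the markup from the question
html = """<div id="animalcontainer" class="container last fixed-height">
  <h5> Husbandary Management </h5>
  <span> Animal: Cow </span>
  <span> Farmer: Mr smith </span>
  <h5> Services </h5>
  <p> cow milk, ghee </p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
result = collect_by_heading(soup.find("div", id="animalcontainer"))
print(result)
```

Each child is visited exactly once, so the work grows with the number of children rather than with headings times siblings.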
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow