Example solution using lxml.html and XPath:
- select all h5 elements
- and for each h5 element,
  - select the following sibling elements -- following-sibling::*
  - that are not h5 themselves -- [not(self::h5)]
  - and that have exactly as many preceding h5 siblings as the position of the current h5 -- [count(preceding-sibling::h5) = 1], then 2, then 3... (using the for loop's enumerate() starting at 1)
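To see what the predicates select for a single header, here is a minimal sketch on a small, made-up fragment (the fragment and the variable names are assumptions for illustration, not part of the question's data). It fixes the count at 1, so only the siblings between the first and the second h5 should match:

```python
import lxml.html

# Made-up fragment, only to show what the predicates select.
fragment = lxml.html.fromstring(
    "<div><h5>A</h5><p>a1</p><p>a2</p><h5>B</h5><p>b1</p></div>"
)
first_h5 = fragment.xpath("//div/h5")[0]

# Siblings after the first h5 that are not h5 themselves
# and that have exactly one h5 among their preceding siblings.
selected = first_h5.xpath(
    "following-sibling::*[not(self::h5)][count(preceding-sibling::h5) = 1]"
)
texts = [el.text_content() for el in selected]
print(texts)  # ['a1', 'a2']
```

The element b1 is left out because two h5 elements precede it, which is why the count has to grow with each header.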
Sample code, with simple prints of the text content of the elements (using lxml.html's .text_content() on elements):
import lxml.html
html = """<div id="animalcontainer" class="container last fixed-height">
<h5>
Husbandary Management
</h5>
<span>
Animal: Cow
</span>
<span>
Farmer: Mr smith
</span>
<h5>
Milch Category
</h5>
<p>
Milk supply
</p>
<h5>
Services
</h5>
<p>
cow milk, ghee
</p>
<h5>
animal colors
</h5>
<span>
green,red
</span>
</div>"""
doc = lxml.html.fromstring(html)
headers = doc.xpath('//div/h5')
# enumerate from 1 so that i equals the number of h5 siblings seen so far
for i, header in enumerate(headers, start=1):
    print("--------------------------------")
    print(header.text_content().strip())
    # siblings after this h5 that are not h5 and have exactly i h5 siblings before them
    for following in header.xpath("""following-sibling::*
                                     [not(self::h5)]
                                     [count(preceding-sibling::h5) = %d]""" % i):
        print("\t" + following.text_content().strip())
This outputs:
--------------------------------
Husbandary Management
	Animal: Cow
	Farmer: Mr smith
--------------------------------
Milch Category
	Milk supply
--------------------------------
Services
	cow milk, ghee
--------------------------------
animal colors
	green,red
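If the grouped values are needed as data rather than printed, the same loop could collect them into a mapping. The following is only a sketch under assumptions: group_by_h5 is a hypothetical helper name, the sample is a shortened version of the HTML above, and it assumes all h5 elements share one parent and have distinct texts (a repeated heading would overwrite the earlier entry).

```python
import lxml.html
from collections import OrderedDict


def group_by_h5(html_text):
    """Hypothetical helper: map each h5 text to the texts of the siblings after it."""
    doc = lxml.html.fromstring(html_text)
    sections = OrderedDict()
    for i, header in enumerate(doc.xpath("//div/h5"), start=1):
        # Same predicates as above: not an h5, and exactly i h5 siblings before it.
        siblings = header.xpath(
            "following-sibling::*[not(self::h5)]"
            "[count(preceding-sibling::h5) = %d]" % i
        )
        sections[header.text_content().strip()] = [
            s.text_content().strip() for s in siblings
        ]
    return sections


# Shortened sample, assumed for illustration.
sample = """<div>
<h5> Services </h5>
<p> cow milk, ghee </p>
<h5> animal colors </h5>
<span> green,red </span>
</div>"""

sections = group_by_h5(sample)
for title, items in sections.items():
    print(title, items)
```

An ordered mapping keeps the headings in document order, which matches the order of the printed output above.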