lxml: how to iterate only on the first/second siblings?

https://stackoverflow.com/questions/21575699

07-10-2022

Question

I have an html-document of this kind:

<html>
 <head></head>
 <body>
   <p>
     <dfn>text</dfn>sometext
     **<i>othertext</i>**
     <i>...</i>
     <i>...</i></p>
   <p>
     <dfn>text</dfn>sometext
     **<i>othertext</i>**
     <i>...</i>
     <i>...</i></p>
  </body>
 </html>

I need to parse it so that I get the text inside the first i-tag of each paragraph, paired with the text of the corresponding dfn (the dfn text is what I ultimately want to extract). At the moment I have this code:

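import re
from lxml import etree

places = []
tree = etree.parse(filename)
for dfn in tree.iter('dfn'):
    bu = dfn.text
    # Walks over every following sibling of the dfn, not just the first one
    for sibling in dfn.itersiblings():
        su = sibling.text
        # The regex pattern is elided ("..") in the original post
        if su is not None and bu is not None and re.findall(..,su):
            places.append(bu)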

This goes through every i-tag, which sometimes gives me erroneous output. How can I limit the iteration to only the first siblings of the dfn?

Solution

Break out of the itersiblings() loop once you have found your match:

for dfn in tree.iter('dfn'):
    bu = dfn.text
    for sibling in dfn.itersiblings():
        su = sibling.text
        if su is not None and bu is not None and re.findall(..,su):
            places.append(bu)
            break

The break statement ends the for sibling loop early, so no further siblings of that dfn are processed. The outer for dfn loop then continues with the next dfn element.
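The following self-contained sketch shows the break approach on a small sample. The sample markup, the regex `PATTERN`, and the function names are hypothetical stand-ins, since the original post elides the actual pattern and input file. The second function is a variant that is not part of the answer above: it assumes you only ever want to inspect the first i sibling, and uses lxml's `itersiblings(tag=...)` filter together with `next()` to fetch just that element.

```python
import re
from lxml import etree

# Hypothetical sample input and pattern, for illustration only.
SAMPLE = """
<html>
 <head></head>
 <body>
   <p><dfn>alpha</dfn>sometext <i>in Paris</i> <i>in Rome</i> <i>other</i></p>
   <p><dfn>beta</dfn>sometext <i>nothing here</i> <i>in Oslo</i></p>
 </body>
</html>
""".strip()
PATTERN = re.compile(r"\bin [A-Z]\w+")


def first_match_per_dfn(root):
    """Record a dfn's text at its first matching sibling, then stop scanning."""
    places = []
    for dfn in root.iter("dfn"):
        bu = dfn.text
        if bu is None:
            continue
        for sibling in dfn.itersiblings():
            su = sibling.text
            if su is not None and PATTERN.search(su):
                places.append(bu)
                break  # skip the remaining siblings of this dfn
    return places


def first_i_only(root):
    """Variant: test only the first <i> sibling that follows each dfn."""
    places = []
    for dfn in root.iter("dfn"):
        # next() with a default returns None when there is no <i> sibling
        first_i = next(dfn.itersiblings(tag="i"), None)
        if first_i is None or dfn.text is None or first_i.text is None:
            continue
        if PATTERN.search(first_i.text):
            places.append(dfn.text)
    return places


root = etree.fromstring(SAMPLE)
print(first_match_per_dfn(root))
print(first_i_only(root))
```

The two functions differ when the first i-tag does not match but a later one does: the break version still records that dfn, while the `next()` version skips it. Which behaviour is wanted depends on whether "first sibling" means the first sibling at all or the first matching one.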

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow