Question

I need to parse an XML document that looks like this:

<tag>
   text1 text2 text3
  <some-tag/>
       More text
  <some-tag/>
       Some more text
  <some-tag/>
  Even more text
</tag>

Using ElementTree's text and tail attributes, I can get to "text1 text2 text3" and "Even more text".

However, I am unable to come up with a way to reach the text in the middle ("More text" and "Some more text").

Due to the idiosyncrasies of the software generating the XML, I cannot predict the names of the stray tags, so I can't rely on find('some-tag').

Is there any way to parse this XML using Python?

Thanks

Solution

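"More text" and "Some more text" are each the tail of the some-tag element that precedes them: ElementTree stores the text that follows an element, up to the next tag, in that element's tail attribute. See the following: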

>>> import xml.etree.ElementTree as et  # cElementTree was removed in Python 3.9
>>> text = """<tag>
   text1 text2 text3
  <some-tag/>
       More text
  <some-tag/>
       Some more text
  <some-tag/>
  Even more text
</tag>"""
>>> root = et.fromstring(text)
>>> for element in root:  # leaving aside the text and tail of root for the moment
    print(element.tag, ': text =>', element.text or '', 'tail =>', element.tail)

some-tag : text =>  tail =>  # the tail also has a newline character and white space at its beginning
       More text

some-tag : text =>  tail => 
       Some more text

some-tag : text =>  tail => 
  Even more text

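So the text before the first child is in the parent's text attribute, and every later piece is in the tail of the child that precedes it. Iterate over the children of each element and read each child's tail; this works without knowing the names of the child tags.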
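As a minimal sketch of that approach (assuming Python 3 and the sample document from the question), the pieces can be collected in document order by taking the parent's text followed by each child's tail. The helper name text_chunks is hypothetical, chosen only for this illustration:

```python
import xml.etree.ElementTree as ET

XML = """<tag>
   text1 text2 text3
  <some-tag/>
       More text
  <some-tag/>
       Some more text
  <some-tag/>
  Even more text
</tag>"""


def text_chunks(element):
    """Return the non-empty text pieces directly inside `element`, in document order.

    Hypothetical helper for illustration; it never looks at child tag names.
    """
    pieces = [element.text]                          # text before the first child
    pieces.extend(child.tail for child in element)   # text following each child
    # text and tail may be None, and usually carry surrounding whitespace
    return [p.strip() for p in pieces if p and p.strip()]


root = ET.fromstring(XML)
chunks = text_chunks(root)
print(chunks)
# ['text1 text2 text3', 'More text', 'Some more text', 'Even more text']
```

If text nested inside the stray tags should be included as well, root.itertext() yields every text and tail in document order, so the same strip-and-filter step can be applied to its output instead.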

Licensed under: CC-BY-SA with attribution