Question

I have this XML input file:

<?xml version="1.0"?>
<zero>
  <First>
    <second>
      <third-num>1</third-num>
      <third-def>object001</third-def>
      <third-len>458</third-len>
    </second>
    <second>
      <third-num>2</third-num>
      <third-def>object002</third-def>
      <third-len>426</third-len>
    </second>
    <second>
      <third-num>3</third-num>
      <third-def>object003</third-def>
      <third-len>998</third-len>
    </second>
  </First>
</zero>

My goal is to remove every <second> element whose <third-def> does not match a given value. To do that, I wrote this code:

try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET
inputfile='inputfile.xml'
tree = ET.parse(inputfile)
root = tree.getroot()

elem = tree.find('First')
for elem2 in tree.iter(tag='second'):
    if elem2.find('third-def').text == 'object001':
        pass
    else:
        elem.remove(elem2)
        #elem2.clear()

My problem is elem.remove(elem2). It skips every other second level. Here is the output of this code:

<?xml version="1.0" ?>
<zero>
  <First>
    <second>
      <third-num>1</third-num>
      <third-def>object001</third-def>
      <third-len>458</third-len>
    </second>
    <second>
      <third-num>3</third-num>
      <third-def>object003</third-def>
      <third-len>998</third-len>
    </second>
  </First>
</zero>

Now if I un-comment the elem2.clear() line, the script works perfectly, but the output is less nice as it keeps all the removed second levels:

<?xml version="1.0" ?>
<zero>
  <First>
    <second>
      <third-num>1</third-num>
      <third-def>object001</third-def>
      <third-len>458</third-len>
    </second>
    <second/>
    <second/>
  </First>
</zero>

Does anybody have a clue why my element.remove() statement is wrong?

Solution

You are looping over the live tree:

for elem2 in tree.iter(tag='second'):

which you then change while iterating. The iterator's position counter is not told that the number of elements has changed, so after it looks at element 0 and you remove that element, it moves on to element number 1. But what was element number 1 is now element number 0, so that element is never visited.

Capture a list of all the elements first, then loop over that:

for elem2 in tree.findall('.//second'):

.findall() returns a list of results, which doesn't update as you alter the tree.

Now the iteration won't skip the last element:

>>> print(ET.tostring(root, encoding='unicode'))
<zero>
  <First>
    <second>
      <third-num>1</third-num>
      <third-def>object001</third-def>
      <third-len>458</third-len>
    </second>
    </First>
</zero>
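
Putting the fix together, here is a minimal self-contained sketch. It assumes Python 3 and embeds the question's XML as a string instead of reading inputfile.xml; the helper name keep_only is made up for illustration and is not part of ElementTree.

```python
import xml.etree.ElementTree as ET

# Inline copy of the question's input, so the sketch runs without inputfile.xml.
XML = """<zero>
  <First>
    <second>
      <third-num>1</third-num>
      <third-def>object001</third-def>
      <third-len>458</third-len>
    </second>
    <second>
      <third-num>2</third-num>
      <third-def>object002</third-def>
      <third-len>426</third-len>
    </second>
    <second>
      <third-num>3</third-num>
      <third-def>object003</third-def>
      <third-len>998</third-len>
    </second>
  </First>
</zero>"""


def keep_only(root, wanted):
    """Remove every <second> under <First> whose <third-def> text is not `wanted`.

    Hypothetical helper for illustration only.
    """
    parent = root.find('First')
    # findall() returns a plain list, so removing children from `parent`
    # inside the loop cannot disturb the iteration.
    for second in parent.findall('second'):
        if second.findtext('third-def') != wanted:
            parent.remove(second)
    return root


root = keep_only(ET.fromstring(XML), 'object001')
remaining = [s.findtext('third-def') for s in root.iter('second')]
print(remaining)
```

Looking the children up on the parent itself (parent.findall('second')) also guarantees that every element passed to parent.remove() really is a direct child of that parent.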

This phenomenon is not limited to ElementTree trees; see Loop "Forgets" to Remove Some Items
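
The same skipping can be reproduced with a plain Python list, which may make the mechanism easier to see. This is only an illustrative sketch; the list contents mirror the question's <third-def> values and the variable names are made up.

```python
names = ['object001', 'object002', 'object003']

# Buggy: removing from the list being iterated shifts later items left,
# so the item that slides into the freed slot is never visited.
buggy = list(names)
for name in buggy:
    if name != 'object001':
        buggy.remove(name)

# Fixed: iterate over a snapshot (a copy) and remove from the original.
fixed = list(names)
for name in list(fixed):
    if name != 'object001':
        fixed.remove(name)

print(buggy)  # ['object001', 'object003']
print(fixed)  # ['object001']
```

The buggy loop leaves 'object003' behind, just as the question's loop left the third <second> element in place.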

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow