Question

I have an XML file similar to this:

<root>
<a>Some <b>bad</b> text <i>that</i> I <u>do <i>not</i></u> want to keep.</a>
</root>

I want to remove all text in <b> or <u> elements (and descendants), and print the rest. This is what I tried:

from __future__ import print_function
import xml.etree.ElementTree as ET

tree = ET.parse('a.xml')
root = tree.getroot()

parent_map = {c:p for p in root.iter() for c in p}

for item in root.findall('.//b'):
  parent_map[item].remove(item)
for item in root.findall('.//u'):
  parent_map[item].remove(item)
print(''.join(root.itertext()).strip())

(I used the recipe in this answer to build the parent_map). The problem, of course, is that with remove(item) I'm also removing the text after the element, and the result is:

Some that I

whereas what I want is:

Some  text that I  want to keep.

Is there any solution?

Solution

If nothing better turns up, you can use clear() instead of remove(). Since clear() also resets the element's tail, save the tail first and restore it afterwards:

import xml.etree.ElementTree as ET


data = """<root>
<a>Some <b>bad</b> text <i>that</i> I <u>do <i>not</i></u> want to keep.</a>
</root>"""

tree = ET.fromstring(data)
a = tree.find('a')
for element in a:
    if element.tag in ('b', 'u'):
        tail = element.tail
        element.clear()
        element.tail = tail

# tostring() returns bytes on Python 3, so decode before printing
print(ET.tostring(tree).decode())

prints (see empty b and u tags):

<root>
<a>Some <b /> text <i>that</i> I <u /> want to keep.</a>
</root>
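
If the empty tags are unwanted, one option is to remove the element but first move its tail onto the preceding sibling (or onto the parent's text when there is no preceding sibling). The following is only a sketch under that assumption; remove_keep_tail is a hypothetical helper name, not part of ElementTree:

```python
import xml.etree.ElementTree as ET


def remove_keep_tail(root, tags):
    """Remove every element whose tag is in `tags`, keeping its tail text."""
    # Snapshot the elements first, because the tree is modified while walking it.
    for parent in list(root.iter()):
        previous = None  # last child of `parent` that was kept
        for child in list(parent):
            if child.tag in tags:
                tail = child.tail or ''
                if previous is None:
                    # No kept sibling before it: the tail joins the parent's text.
                    parent.text = (parent.text or '') + tail
                else:
                    # Otherwise the tail joins the previous sibling's tail.
                    previous.tail = (previous.tail or '') + tail
                parent.remove(child)
            else:
                previous = child


data = """<root>
<a>Some <b>bad</b> text <i>that</i> I <u>do <i>not</i></u> want to keep.</a>
</root>"""

root = ET.fromstring(data)
remove_keep_tail(root, {'b', 'u'})
result = ''.join(root.itertext()).strip()
print(result)  # Some  text that I  want to keep.
```

Unlike the loop above, this walks the whole tree, so it should also handle b or u elements that are nested deeper than the direct children of a.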

Also, here's a solution using xml.dom.minidom:

import xml.dom.minidom

data = """<root>
<a>Some <b>bad</b> text <i>that</i> I <u>do <i>not</i></u> want to keep.</a>
</root>"""

dom = xml.dom.minidom.parseString(data)
a = dom.getElementsByTagName('a')[0]
# Iterate over a copy: removeChild() mutates childNodes, which could skip siblings
for child in list(a.childNodes):
    if getattr(child, 'tagName', '') in ('u', 'b'):
        a.removeChild(child)

print(dom.toxml())

prints:

<?xml version="1.0" ?><root>
<a>Some  text <i>that</i> I  want to keep.</a>
</root>
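
Since the question only asks to print the remaining text, another possibility is to leave the tree untouched and skip the unwanted elements while collecting text. This is a minimal sketch for Python 3 (it uses yield from); text_without is a hypothetical helper, not a library function:

```python
import xml.etree.ElementTree as ET


def text_without(element, skip):
    """Yield the text under `element`, skipping elements whose tag is in `skip`.

    The tail of a skipped element is still yielded, because the tail belongs
    to the surrounding text rather than to the element's own content.
    """
    if element.text:
        yield element.text
    for child in element:
        if child.tag not in skip:
            yield from text_without(child, skip)
        if child.tail:
            yield child.tail


data = """<root>
<a>Some <b>bad</b> text <i>that</i> I <u>do <i>not</i></u> want to keep.</a>
</root>"""

root = ET.fromstring(data)
result = ''.join(text_without(root, {'b', 'u'})).strip()
print(result)  # Some  text that I  want to keep.
```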
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow