How can I differentiate regular whitespace from escaped whitespace (&#32;) when parsing XML with xml.etree.ElementTree (Python)?

StackOverflow https://stackoverflow.com/questions/20675545

Question

I'm using xml.etree.ElementTree to parse an XML file. How can I force it either to strip only regular whitespace from the text (plain spaces and newlines, not &#32;), or to keep the whitespace and leave the escapes unresolved (as they appear in the source)? Here is my problem:

import xml.etree.ElementTree

xml_text = """
<root>
    <mytag>
        data_with_space&#32;
    </mytag>
</root>"""
root = xml.etree.ElementTree.fromstring(xml_text)
mytag = root.find("mytag")
# The parser has already turned &#32; into a plain space at this point.
print("original text: ", repr(mytag.text))
print("stripped text: ", repr(mytag.text.strip()))

It prints:

original text:  '\n        data_with_space \n    '
stripped text:  'data_with_space'

What I need:

'data_with_space '

or (which I can escape by other means):

'data_with_space&#32;'

A solution using xml.etree.ElementTree is preferable, because otherwise I'd have to rewrite a whole lot of code.


Solution

The standard XML library treats &#32; and ' ' as equal: the parser resolves character references before ElementTree ever sees the text. If you call fromstring(xml_text) directly, the two forms are identical in the parsed tree, so there is no way to tell them apart afterwards. To stop the reference from being resolved, translate it into something else before calling fromstring(), and translate it back afterwards. The placeholder must be a string that does not otherwise occur in the document.
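As a quick check of the claim that the two forms cannot be told apart after parsing, here is a minimal sketch. The element name `t` and the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Same content, once with a character reference and once with a literal space.
from_reference = ET.fromstring("<t>x&#32;</t>").text
from_literal = ET.fromstring("<t>x </t>").text

# Both come back as the same Python string; the reference is gone.
print(repr(from_reference), repr(from_literal))
print(from_reference == from_literal)
```

Because the information is lost during parsing, any fix has to act on the raw XML string first, which is what the workaround below does.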


import xml.etree.ElementTree

PLACEHOLDER = "|STOP_ESCAPE|"  # must not occur anywhere else in the document

def stop_escape(text):
    # Hide numeric character references (&#...;) from the parser.
    return text.replace("&#", PLACEHOLDER)

def resume_escape(text):
    # Restore the references in the parsed text.
    return text.replace(PLACEHOLDER, "&#")

xml_text = """
<root>
    <mytag>
        data_with_space&#32;
    </mytag>
</root>"""
root = xml.etree.ElementTree.fromstring(stop_escape(xml_text))
mytag_txt = resume_escape(root.find("mytag").text)
print("original text: ", repr(mytag_txt))
print("stripped text: ", repr(mytag_txt.strip()))

You would get:

original text:  '\n        data_with_space&#32;\n    '
stripped text:  'data_with_space&#32;'        
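
The question also asks for the other form, 'data_with_space ', where regular whitespace is stripped but the escaped space survives as a real space. A minimal sketch of one way to get there with the same placeholder idea: strip first, then decode the protected references. The helper names `protect_char_refs` and `strip_then_decode` and the placeholder string are assumptions for illustration, not part of xml.etree.ElementTree.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical marker; it must not occur anywhere in the real document.
PLACEHOLDER = "|STOP_ESCAPE|"

# Matches a protected decimal (32;) or hexadecimal (x20;) character reference.
_CHAR_REF = re.compile(re.escape(PLACEHOLDER) + r"(x[0-9A-Fa-f]+|[0-9]+);")

def protect_char_refs(xml_text):
    """Hide numeric character references from the XML parser."""
    return xml_text.replace("&#", PLACEHOLDER)

def strip_then_decode(text):
    """Strip regular whitespace, then turn protected references into characters."""
    def decode(match):
        ref = match.group(1)
        code = int(ref[1:], 16) if ref.startswith("x") else int(ref)
        return chr(code)
    return _CHAR_REF.sub(decode, text.strip())

xml_text = """
<root>
    <mytag>
        data_with_space&#32;
    </mytag>
</root>"""

root = ET.fromstring(protect_char_refs(xml_text))
result = strip_then_decode(root.find("mytag").text)
print(repr(result))  # 'data_with_space '
```

One caveat with this approach: the blanket replace also touches any "&#" that appears inside a CDATA section, where it is literal text rather than a reference, so it is only safe if the input contains no such content.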
Licensed under: CC-BY-SA with attribution