How can I differentiate regular whitespaces and escaped ones (&#32;) when parsing XML with xml.etree.ElementTree (python)

https://stackoverflow.com/questions/20675545

19-09-2022

Question

I'm using xml.etree.ElementTree to parse an XML file. How can I make it either strip only regular whitespace from the text (not spaces escaped as &#32;), or keep the spaces and leave the escapes undecoded? Here is my problem:

import xml.etree.ElementTree

xml_text = """
<root>
    <mytag>
        data_with_space&#32;
    </mytag>
</root>"""
root = xml.etree.ElementTree.fromstring(xml_text)
mytag = root.find("mytag")
print("original text: ", repr(mytag.text))
print("stripped text: ", repr(mytag.text.strip()))

It prints:

original text:  '\n        data_with_space \n    '
stripped text:  'data_with_space'

What I need:

'data_with_space '

or (which I can escape by other means):

'data_with_space&#32;'

A solution using xml.etree.ElementTree is preferable, because otherwise I'd have to rewrite a whole lot of code.

Solution

The standard XML library treats &#32; and ' ' as equal: the parser decodes the character reference into a plain space, so once you call fromstring(xml_text) directly the two can no longer be told apart. The only way to stop the decoding is to translate the reference into something else before calling fromstring(), and translate it back afterwards.

import xml.etree.ElementTree

def stop_escape(text):
    # Hide numeric character references from the parser.
    return text.replace("&#", "|STOP_ESCAPE|")

def resume_escape(text):
    # Restore the hidden references after parsing.
    return text.replace("|STOP_ESCAPE|", "&#")

xml_text = """
<root>
    <mytag>
        data_with_space&#32;
    </mytag>
</root>"""
root = xml.etree.ElementTree.fromstring(stop_escape(xml_text))
mytag_txt = resume_escape(root.find("mytag").text)
print("original text: ", repr(mytag_txt))
print("stripped text: ", repr(mytag_txt.strip()))

You would get:

original text:  '\n        data_with_space&#32;\n    '
stripped text:  'data_with_space&#32;'
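
If what you want is the first form from the question ('data_with_space ' with a real trailing space), the same trick can be narrowed down. The following is only a sketch under some assumptions: PLACEHOLDER, parse_keeping_escaped_spaces and clean_text are names made up for this example, the placeholder character is assumed never to occur in your real data, and only the literal spelling &#32; is handled (not &#x20;, and an &#32; inside a CDATA section would also be rewritten).

```python
import xml.etree.ElementTree as ET

# Hypothetical marker: a private-use code point assumed to be absent from the input.
# It is a legal XML character and is not treated as whitespace by str.strip().
PLACEHOLDER = "\ue000"

def parse_keeping_escaped_spaces(xml_text):
    """Parse xml_text after swapping each &#32; for the marker."""
    return ET.fromstring(xml_text.replace("&#32;", PLACEHOLDER))

def clean_text(text):
    """Strip ordinary whitespace, then turn the markers back into real spaces."""
    return text.strip().replace(PLACEHOLDER, " ")

xml_text = """
<root>
    <mytag>
        data_with_space&#32;
    </mytag>
</root>"""

root = parse_keeping_escaped_spaces(xml_text)
result = clean_text(root.find("mytag").text)
print(repr(result))
```

Because the marker is not whitespace, strip() removes only the indentation and newlines around the text, and the escaped space comes back as an ordinary space afterwards.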

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow
