How to iterate through all XML Elements and apply logic to each Element's value with ElementTree for Python

https://stackoverflow.com/questions/15667263

29-03-2022
|

Вопрос

I am currently trying to apply logic to Element values in a XML file. Specifically I am trying to encode all the values to UTF-8 while not touching any of the element names/attributes themselves.

Here is the sample XML:

<?xml version="1.0"?>
<sd_1>
    <sd_2>
        <sd_3>\311 is a fancy kind of E</sd_3>
    </sd_2>
</sd_1>

Currently I have tried 3 methods to achieve this with no success:

First I tried the looping through each element retrieving the values with .text and using .parse:

import xml.etree.ElementTree as ET

et = ET.parse('xml/test.xml')

for child in et.getroot():
    for core in child:
        core_value = str(core.text)
        core.text = core_value.encode('utf-8')

et.write('output.xml')

This results in an XML file that does not have the text \311 altered correctly, it just stays as it is.

Next I tried the .iterparse with cElementTree to no avail:

import xml.etree.cElementTree as etree

xml_file_path = 'xml/test.xml'
with open(xml_file_path) as xml_file:
    tree = etree.iterparse(xml_file) 
    for items in tree:
        for item in items:
            print item.text

etree.write('output1.xml')

This results in:

 "...print item.text\n', "AttributeError: 'str' object has no attribute 'text'..."

Not sure what I am doing wrong there, I have seen multiple examples with the same arrangement, but when I print through the elements without the .text I see the tuple with a string value of 'end' at the start and I think that is causing the issue with this method.

How do I properly iterate through my elements, and without specifying the element names e.g. .findall(), apply logic to the values housed in each Element so that when I write the xml to file it saves the changes made when the program was iterating through element values?

Решение

Is this what you are looking for?

import xml.etree.ElementTree as ET

et = ET.parse('xml/test.xml')

for child in et.getroot():
    for core in child:
        core_value = str(core.text)
        core.text = core_value.decode('unicode-escape')

et.write('output.xml')

Другие советы

This is an interesting question. Let's focus on the first method you proposed, as that should be a totally fine way to approach this problem. When I print out the lines one by one, here is what I get:

>>> core_value
'\\311 is a fancy kind of E'

What happened for me is that the character was read as a literal '\', which must be escaped to be printed as such. If we change the escaped character (\\) to a non-escaped character (\), we get the following:

>>> cv = core_value.replace('\\311','\311')
'\xc9 is a fancy kind of E'
>>> print cv
É is a fancy kind of E

The weird piece here is that you don't know when in the original file \311 is "supposed to be" one character or four. If you know for a fact that those will all be one character, you can write some vile code based on this answer:

Python Unicode, have unicode number in normal string, want to print unicode

To transorm all of the things that come after a \ into the correct unicode characters and delete the \.

Лицензировано под: CC-BY-SA с атрибуция

Не связан с StackOverflow