How to iterate through all XML Elements and apply logic to each Element's value with ElementTree for Python

StackOverflow https://stackoverflow.com/questions/15667263

Question

I am currently trying to apply logic to Element values in a XML file. Specifically I am trying to encode all the values to UTF-8 while not touching any of the element names/attributes themselves.

Here is the sample XML:

<?xml version="1.0"?>
<sd_1>
    <sd_2>
        <sd_3>\311 is a fancy kind of E</sd_3>
    </sd_2>
</sd_1>

Currently I have tried 3 methods to achieve this with no success:

First I tried the looping through each element retrieving the values with .text and using .parse:

import xml.etree.ElementTree as ET

et = ET.parse('xml/test.xml')

for child in et.getroot():
    for core in child:
        core_value = str(core.text)
        core.text = core_value.encode('utf-8')

et.write('output.xml')

This results in an XML file that does not have the text \311 altered correctly, it just stays as it is.

Next I tried the .iterparse with cElementTree to no avail:

import xml.etree.cElementTree as etree

xml_file_path = 'xml/test.xml'
with open(xml_file_path) as xml_file:
    tree = etree.iterparse(xml_file) 
    for items in tree:
        for item in items:
            print item.text

etree.write('output1.xml')

This results in:

 "...print item.text\n', "AttributeError: 'str' object has no attribute 'text'..."

Not sure what I am doing wrong there, I have seen multiple examples with the same arrangement, but when I print through the elements without the .text I see the tuple with a string value of 'end' at the start and I think that is causing the issue with this method.

How do I properly iterate through my elements, and without specifying the element names e.g. .findall(), apply logic to the values housed in each Element so that when I write the xml to file it saves the changes made when the program was iterating through element values?

Était-ce utile?

La solution

Is this what you are looking for?

import xml.etree.ElementTree as ET

et = ET.parse('xml/test.xml')

for child in et.getroot():
    for core in child:
        core_value = str(core.text)
        core.text = core_value.decode('unicode-escape')

et.write('output.xml')

Autres conseils

This is an interesting question. Let's focus on the first method you proposed, as that should be a totally fine way to approach this problem. When I print out the lines one by one, here is what I get:

>>> core_value
'\\311 is a fancy kind of E'

What happened for me is that the character was read as a literal '\', which must be escaped to be printed as such. If we change the escaped character (\\) to a non-escaped character (\), we get the following:

>>> cv = core_value.replace('\\311','\311')
'\xc9 is a fancy kind of E'
>>> print cv
É is a fancy kind of E

The weird piece here is that you don't know when in the original file \311 is "supposed to be" one character or four. If you know for a fact that those will all be one character, you can write some vile code based on this answer:

Python Unicode, have unicode number in normal string, want to print unicode

To transorm all of the things that come after a \ into the correct unicode characters and delete the \.

Licencié sous: CC-BY-SA avec attribution
Non affilié à StackOverflow
scroll top