Question

This is the first of several questions, with more to follow once I have figured out what is really going on. I have read a lot about encoding, decoding, and XML standards, but I did not find an answer to this specific topic.

import elementtree.ElementTree as ET

root = ET.Element("Prüfung")
main = ET.SubElement(root,'Test')
main.text='\xe4 '+'ä'.decode('UTF-8')
tree=ET.ElementTree(root)
tree.write('testout.xml')

My first question is about the line main.text='\xe4 '+'ä'.decode('UTF-8'). I understand that \xe4 is the code for the letter ä. Does this mean that I have to decode every string passed to my interpreter as UTF-8 for it to work properly? When I read special characters from a .txt file using Python's readline method, they seem to be decoded correctly already.

A related but slightly different question concerns the line root = ET.Element("Prüfung"). It does not seem possible to use non-ASCII characters in XML tags (at least not with ElementTree). Is this because of the XML standard, or is it just another decoding/encoding problem?


Solution

You can have non-ASCII characters in element names (and element content). Use Unicode strings and it should work.

At http://effbot.org/zone/element.htm#the-element-type, it says:

All elements must have a tag, but all other properties are optional. All strings can either be Unicode strings, or 8-bit strings containing US-ASCII only.
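To illustrate the distinction behind that rule, here is a minimal Python 3 sketch (not part of the original demo) of the difference between encoded bytes and decoded text. The variable names raw and text are only illustrative. The idea is that bytes coming from outside the program are decoded once, and only text strings are handed to ElementTree.

```python
# Python 3 sketch: encoded bytes versus decoded text (str).
raw = b'\xc3\xa4'            # the two UTF-8 bytes that encode the letter ä
text = raw.decode('utf-8')   # decode once, at the input boundary

# '\xe4' in a text string is the single code point U+00E4 (ä),
# not a byte, so it equals the decoded value.
print(text == '\xe4')        # True
print(len(raw), len(text))   # 2 1
```

In Python 2, a plain 'ä' literal is a byte string in the source file's encoding, which is why the question's code needed .decode('UTF-8') while u'ä' does not.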

Demo program (tested with Python 2.7):

# coding: utf-8

import xml.etree.ElementTree as ET

root = ET.Element(u'Prüfung') 
main = ET.SubElement(root, 'Test')
main.text = u'\xe4 ' + u'ä'
tree = ET.ElementTree(root)
tree.write('testout.xml', encoding="utf-8")    # The default encoding is us-ascii

Output (in testout.xml):

<Prüfung><Test>ä ä</Test></Prüfung>

The above program also works unchanged in Python 3.3+. The leading u characters are redundant, but allowed (the u'unicode' syntax is restored for str objects in Python 3.3).
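As a rough way to check the result under Python 3, the following sketch writes the same tree to an in-memory buffer instead of testout.xml, parses it back, and also shows what the default us-ascii encoding does to element text. Using io.BytesIO and the variable names here are assumptions made for illustration, not part of the original demo.

```python
import io
import xml.etree.ElementTree as ET

# Build the same tree as the demo program (Python 3: str is Unicode).
root = ET.Element('Prüfung')
main = ET.SubElement(root, 'Test')
main.text = '\xe4 ' + 'ä'

# Serialize to memory as UTF-8 instead of writing testout.xml.
buffer = io.BytesIO()
ET.ElementTree(root).write(buffer, encoding='utf-8')
data = buffer.getvalue()
print(data.decode('utf-8'))   # <Prüfung><Test>ä ä</Test></Prüfung>

# Parse the bytes back and confirm that tag and text survive the round trip.
parsed = ET.fromstring(data)
print(parsed.tag == 'Prüfung', parsed.find('Test').text == 'ä ä')   # True True

# With the default us-ascii encoding, non-ASCII text is written
# as numeric character references instead of raw characters.
ascii_bytes = ET.tostring(main)
print(ascii_bytes)            # b'<Test>&#228; &#228;</Test>'
```

The character references in the last line are equivalent XML for the element text, so a parser reads them back as ä. Passing encoding="utf-8" mainly keeps the file readable and, in this sketch, is what lets the non-ASCII tag name be written as a raw character.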

Licensed under: CC-BY-SA with attribution