Question

I am receiving some XML data. One example is as follows:

xmlString = '<location>san diego, ça</location>'

This is currently a string. I need to convert it to an XML object using ElementTree's fromstring() method. The import is as follows:

import xml.etree.ElementTree as ET

The method call is:

xml = ET.fromstring(xmlString)

I keep getting this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position xxx: ordinal not in range(128)

To deal with this, I searched Stack Overflow and the Python docs quite a bit.

One suggestion seems to be to encode and then decode the string:

xmlString = xmlString.encode('utf-8', 'ignore')
xmlString = xmlString.decode('ascii', 'ignore')

The 'ignore' argument is supposed to suppress errors, but they still occur. I do this before converting xmlString into an XML object, yet the error still comes up!

Any ideas?

The full code is:

xmlString = '<?xml version="1.0" encoding="UTF-8"?><o><location>san diego, ça</location></o>'
xmlString = xmlString.encode('utf-8', 'ignore')
xmlString = xmlString.decode('ascii', 'ignore')
xml = ET.fromstring(xmlString)

Using Python 2.7


Solution

You are calling str.encode(). In Python 2, a str is already an encoded byte string, so Python tries to do the right thing: it first decodes the value to unicode so that it can then encode it back to a byte string for you.

This implicit decode is done with the default codec, ASCII:

>>> '<?xml version="1.0" encoding="UTF-8"?><o><location>san diego, ça</location></o>'.encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 62: ordinal not in range(128)

Note that I called .encode() but the exception is UnicodeDecodeError; Python was decoding here first.
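The implicit step can be made visible by performing the decode explicitly. This is only a sketch of what Python 2 does behind the scenes inside str.encode(); the names raw and error are made up for illustration, and the bytes \xc3\xa7 are the UTF-8 encoding of "ç":

```python
# UTF-8 encoded bytes, equivalent to the Python 2 str 'san diego, ça'.
raw = b'san diego, \xc3\xa7a'

error = None
try:
    # The decode Python 2 performs implicitly before it can encode.
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    error = exc

# Decoding with the codec the bytes were actually written in succeeds.
text = raw.decode('utf-8')
```

Here error holds the same kind of exception as in the traceback above, raised at the 0xc3 byte, while decoding with 'utf-8' works.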

However, because ET.fromstring() already accepts UTF-8 encoded bytes, you do not need to re-encode the value at all.
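A minimal sketch of parsing the bytes directly, assuming the data really is UTF-8 as the XML declaration says. The "ç" is written as the escape \xc3\xa7 so the example does not depend on how the source file was saved; the names xml_bytes, root and location are illustrative:

```python
import xml.etree.ElementTree as ET

# UTF-8 encoded XML; \xc3\xa7 is the UTF-8 encoding of "ç".
xml_bytes = (b'<?xml version="1.0" encoding="UTF-8"?>'
             b'<o><location>san diego, \xc3\xa7a</location></o>')

# No .encode() or .decode() first: the parser reads the declared encoding.
root = ET.fromstring(xml_bytes)
location = root.find('location').text
```

The parsed text comes back as a Unicode string (unicode in Python 2), so any encoding for output should happen only at the point where the value is written out.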

If you still see problems parsing the string value, make sure your text editor saved your Python source code using the right codec, UTF-8.

Licensed under: CC-BY-SA with attribution