Alternative XML parser for ElementTree to ease UTF-8 woes?

https://stackoverflow.com/questions/1139090

16-09-2019

Question

I am parsing some XML with ElementTree's parse() function. It works, except for some UTF-8 characters (single-byte characters above 128). I see that the default parser is XMLTreeBuilder, which is based on expat.

Is there an alternative parser that I can use that may be less strict and allow utf-8 characters?

This is the error I'm getting with the default parser:

ExpatError: not well-formed (invalid token): line 311, column 190

The character causing this is a single byte x92 (in hex). I'm not certain this is even a valid utf-8 character. But it would be nice to handle it because most text editors display this as: í

EDIT: The context of the character is: canít, where I assume it is supposed to be a fancy apostrophe, but in the hex editor that same sequence is: 63 61 6E 92 74

Solution

I'll start from the question: "Is there an alternative parser that I can use that may be less strict and allow utf-8 characters?"

All XML parsers will accept data encoded in UTF-8. In fact, UTF-8 is the default encoding.

An XML document may start with a declaration like this:

`<?xml version="1.0" encoding="UTF-8"?>`

or like this: `<?xml version="1.0"?>`, or have no declaration at all. In each of these cases (absent a UTF-16 byte order mark) the parser will decode the document as UTF-8.

However, your data is NOT encoded in UTF-8; it's probably Windows-1252, also known as cp1252.

If the encoding is not UTF-8, then either the creator should include a declaration (or the recipient can prepend one) or the recipient can transcode the data to UTF-8. The following showcases what works and what doesn't:

# Python 2 session: plain string literals are byte strings, and StringIO is its own module
>>> import xml.etree.ElementTree as ET
>>> from StringIO import StringIO as sio

>>> raw_text = '<root>can\x92t</root>' # text encoded in cp1252, no XML declaration

>>> t = ET.parse(sio(raw_text))
[tracebacks omitted]
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 9
# parser is expecting UTF-8

>>> t = ET.parse(sio('<?xml version="1.0" encoding="UTF-8"?>' + raw_text))
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 47
# parser is expecting UTF-8 again

>>> t = ET.parse(sio('<?xml version="1.0" encoding="cp1252"?>' + raw_text))
>>> t.getroot().text
u'can\u2019t'
# parser was told to expect cp1252; it works

>>> import unicodedata
>>> unicodedata.name(u'\u2019')
'RIGHT SINGLE QUOTATION MARK'
# not quite an apostrophe, but better than an exception

>>> fixed_text = raw_text.decode('cp1252').encode('utf8')
# alternative: we transcode the data to UTF-8

>>> t = ET.parse(sio(fixed_text))
>>> t.getroot().text
u'can\u2019t'
# UTF-8 is the default; no declaration needed
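
The session above is Python 2. As a rough sketch of the failing case and the two fixes on Python 3 (an assumption, since the original answer predates it), byte strings become `b'...'` literals and `ET.fromstring` can take the bytes directly. The variable names are only illustrative:

```python
import xml.etree.ElementTree as ET

# cp1252-encoded bytes with no XML declaration, as in the session above
raw_bytes = b'<root>can\x92t</root>'

# Failing case: no declaration, so the parser assumes UTF-8 and rejects 0x92
try:
    ET.fromstring(raw_bytes)
    rejected = False
except ET.ParseError:
    rejected = True

# Fix 1: prepend a declaration that names the real encoding
declared = b'<?xml version="1.0" encoding="cp1252"?>' + raw_bytes
text_from_declared = ET.fromstring(declared).text

# Fix 2: transcode to UTF-8 first; no declaration is needed then
transcoded = raw_bytes.decode('cp1252').encode('utf-8')
text_from_transcoded = ET.fromstring(transcoded).text

print(rejected, ascii(text_from_declared), ascii(text_from_transcoded))
```

Note that Python 3 reports the failure as `ET.ParseError` rather than `ExpatError`.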

OTHER TIPS

It looks like you have CP1252 text. If so, it should be specified at the top of the file, e.g.:

<?xml version="1.0" encoding="CP1252" ?>

This does work with ElementTree.

If you're creating these files yourself, don't write them in this encoding. Save them as UTF-8 and do your part to help kill obsolete text encodings.

If you're receiving CP1252 data with no encoding specification, and you know for sure that it's always going to be CP1252, you can just convert it to UTF-8 before sending it to the parser:

s.decode("CP1252").encode("UTF-8")
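
When the data lives in a file, one way to apply the same conversion on Python 3 is to open the file in text mode with the known encoding and hand the file object to `ET.parse`. This is only a sketch under the assumption stated above (the file is always CP1252 and carries no conflicting declaration); `parse_cp1252_file` is a hypothetical helper name, not part of ElementTree:

```python
import xml.etree.ElementTree as ET

def parse_cp1252_file(path):
    """Parse an XML file that is known to be CP1252 but does not say so."""
    # Text mode decodes the bytes as CP1252 before the parser sees them.
    # Decoding is strict, so a byte undefined in CP1252 raises UnicodeDecodeError.
    with open(path, encoding="cp1252") as handle:
        return ET.parse(handle)
```

A call such as `parse_cp1252_file("feed.xml").getroot()` (the file name is made up) would then return the root element with proper Unicode text.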

Byte 0x92 is never valid as the first byte of a UTF-8 character. It can be valid as a subsequent byte, however. See this UTF-8 guide for a table of valid byte sequences.
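
To make the byte-level rule concrete, here is a small Python 3 sketch. UTF-8 continuation bytes have the bit pattern 10xxxxxx (0x80 through 0xBF), and 0x92 falls in that range. `is_continuation_byte` is a hypothetical helper written for this illustration:

```python
def is_continuation_byte(value):
    """True if value has the bit pattern 10xxxxxx (0x80 through 0xBF)."""
    return (value & 0xC0) == 0x80

# 0x92 is a continuation byte, so it cannot start a character
can_start_character = not is_continuation_byte(0x92)

# Decoding the bytes from the question as UTF-8 fails at the 0x92 byte
try:
    b'can\x92t'.decode('utf-8')
    error_offset = None
except UnicodeDecodeError as exc:
    error_offset = exc.start  # index of the offending byte

# As a trailing byte it is fine: U+2012 FIGURE DASH is encoded as E2 80 92
figure_dash = b'\xe2\x80\x92'.decode('utf-8')
```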

Could you give us an idea of what bytes are surrounding 0x92? Does the XML declaration include a character encoding?

Ah. That is "can´t", obviously, and indeed, 0x92 is an apostrophe in many Windows code pages. Your editor assumes instead that it's a Mac file. ;)
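
The two readings of the same byte can be checked directly. A minimal Python 3 sketch, using the standard `cp1252` and `mac_roman` codecs:

```python
import unicodedata

byte = b'\x92'

as_windows = byte.decode('cp1252')    # how Windows-1252 reads the byte
as_mac = byte.decode('mac_roman')     # how classic Mac OS Roman reads it

print(ascii(as_windows), unicodedata.name(as_windows))
print(ascii(as_mac), unicodedata.name(as_mac))
```

The first line names the right single quotation mark the author intended; the second names the accented i that the editor displayed.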

If it's a one-off, fixing the file is the right thing to do. But when you import other people's XML, there are almost always a lot of things that simply do not agree with the stated encoding. I've found that the best solution is to transcode with a lenient error handler such as 'xmlcharrefreplace' (in Python this handler applies when encoding; 'replace' is its counterpart when decoding), and in severe cases to do your own custom character replacement that fixes the most common problems for that particular customer.

I'd also recommend lxml as an XML library for Python, but that's not the problem here.
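
One possible shape for such a lenient clean-up step on Python 3 is sketched below. It is an illustration, not the answerer's actual code: `to_ascii_xml` and its `fallback` parameter are hypothetical, and the choice of CP1252 as the fallback is an assumption about the sender. The idea is to try UTF-8 first, fall back to the legacy encoding, and then write every non-ASCII character as a numeric character reference so the result parses regardless of an ASCII-compatible declared encoding:

```python
import xml.etree.ElementTree as ET

def to_ascii_xml(data, fallback='cp1252'):
    """Return ASCII-only XML bytes, with non-ASCII text as character references."""
    try:
        # 'utf-8-sig' also drops a leading byte order mark if one is present
        text = data.decode('utf-8-sig')
    except UnicodeDecodeError:
        # Assumed legacy encoding; undefined bytes become U+FFFD instead of raising
        text = data.decode(fallback, errors='replace')
    # 'xmlcharrefreplace' is an encode-side handler: U+2019 becomes &#8217;
    return text.encode('ascii', errors='xmlcharrefreplace')

cleaned = to_ascii_xml(b'<root>can\x92t</root>')
root_text = ET.fromstring(cleaned).text
```

This only works for non-ASCII characters in text and attribute values; character references are not expanded inside comments, CDATA sections, or element names, so documents that rely on those would need a different approach.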

Licensed under: CC-BY-SA with attribution
