Gracefully recover from parse error in expat

https://stackoverflow.com/questions/5381577

28-10-2019

Question

XML is supposed to be strict, and so there are some Unicode characters which aren't allowed in XML. However I'm trying to work with RSS feeds which often contain these characters anyway, and I'd like to either avoid parse errors from invalid characters or recover gracefully from them and present the document anyway.

See an example here (on March 21 anyway): http://feeds.feedburner.com/chrisblattman

What's the recommended way to handle these Unicode characters in the XML feed? Detect the characters and substitute in null bytes, edit the parser, or some other method?

Solution

Looks like that RSS feed contained a form feed character \x0c, which is illegal per the XML 1.0 spec.
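To see how expat reacts to such a character, here is a minimal sketch, assuming Python 3. The helper name first_parse_error and the two one-line feeds are made up for illustration; they are not taken from the real feed.

```python
from xml.parsers import expat

def first_parse_error(data):
    """Return (message, line, column) for the first expat error, or None."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
    except expat.ExpatError as err:
        # ExpatError carries the error code and the position of the failure
        return expat.ErrorString(err.code), err.lineno, err.offset
    return None

# Hypothetical one-line feeds, not the real chrisblattman feed
good = b"<rss><title>ab</title></rss>"
bad = b"<rss><title>a\x0cb</title></rss>"  # contains a form feed

print(first_parse_error(good))
print(first_parse_error(bad))
```

The parser stops at the first offending character, so this tells you where the problem is but does not give you the rest of the document.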

My advice is to filter out the illegal characters before passing the data to expat, rather than attempting to catch errors and recover. Here is a routine to filter out the Unicode characters which are illegal. I tested it on your chrisblattman.xml RSS feed:

import re
from xml.parsers import expat

# XML 1.0 character ranges to strip: illegal ones plus a few discouraged ones
# See http://www.w3.org/TR/REC-xml/#charsets
XML_ILLEGALS = u'|'.join(u'[%s-%s]' % (s, e) for s, e in [
    (u'\u0000', u'\u0008'),             # null and C0 controls
    (u'\u000B', u'\u000C'),             # vertical tab and form feed
    (u'\u000E', u'\u001F'),             # remaining C0 controls
    (u'\u007F', u'\u009F'),             # DEL and C1 controls (legal but discouraged)
    (u'\uD800', u'\uDFFF'),             # high and low surrogate areas
    (u'\uFDD0', u'\uFDDF'),             # noncharacters, not permitted for interchange
    (u'\uFFFE', u'\uFFFF'),             # noncharacters (U+FFFE is a reversed BOM)
    ])

RE_SANITIZE_XML = re.compile(XML_ILLEGALS, re.M | re.U)

# decode, filter illegals out, then encode back to utf-8
with open('chrisblattman.xml', 'rb') as f:
    data = f.read().decode('utf-8')
data = RE_SANITIZE_XML.sub(u'', data).encode('utf-8')

# parse the cleaned bytes; True marks this as the final chunk
pr = expat.ParserCreate('utf-8')
pr.Parse(data, True)
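As a rough end-to-end sketch of the same idea, assuming Python 3, the following filters an in-memory feed and then collects the text of every title element. The names _STRIP and feed_titles and the sample feed are hypothetical. It also decodes with errors='replace', which is an extra assumption on my part so that invalid UTF-8 bytes become U+FFFD instead of raising.

```python
import re
from xml.parsers import expat

# Same ranges as the routine above, written as a single character class
_STRIP = re.compile(
    '[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f'
    '\ud800-\udfff\ufdd0-\ufddf\ufffe\uffff]')

def feed_titles(raw_bytes):
    """Sanitize a UTF-8 feed and return the text of every <title> element."""
    text = raw_bytes.decode('utf-8', errors='replace')
    clean = _STRIP.sub('', text).encode('utf-8')

    titles = []
    state = {'in_title': False}

    def start(name, attrs):
        if name == 'title':
            state['in_title'] = True
            titles.append('')

    def end(name):
        if name == 'title':
            state['in_title'] = False

    def chars(data):
        # expat may deliver one text node in several pieces, so append
        if state['in_title']:
            titles[-1] += data

    parser = expat.ParserCreate('utf-8')
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(clean, True)
    return titles

# Hypothetical feed with a form feed inside the title
sample = b"<rss><channel><title>Dev\x0celopment</title></channel></rss>"
print(feed_titles(sample))  # ['Development']
```

Filtering first means the handlers see the whole document, whereas catching the error afterwards would leave you with only the part before the bad character.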

Update: Wikipedia has a page about XML character validity. RE_SANITIZE_XML above also filters out DEL and the C1 control range (U+007F to U+009F), which XML 1.0 permits but discourages, so you may want to allow those characters depending on your application.
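If you do want to keep the C1 controls, one option is to invert the test and strip everything outside the Char production of the XML 1.0 spec. This is a sketch under the assumption of Python 3.3 or later, where strings hold full code points; the names NOT_XML_CHAR and strip_non_xml_chars are made up here.

```python
import re

# Complement of the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
NOT_XML_CHAR = re.compile(
    '[^\x09\x0a\x0d\x20-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]')

def strip_non_xml_chars(text):
    """Remove every character that XML 1.0 does not allow in a document."""
    return NOT_XML_CHAR.sub('', text)

# Form feed and U+FFFE are dropped; U+0085 and the emoji are kept
print(repr(strip_non_xml_chars('a\x0cb\x85c\ufffed\U0001F600')))
```

Unlike the range list above, this variant keeps U+007F to U+009F and the U+FDD0 block, since the spec only discourages them.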

OTHER TIPS

You could try Beautiful Soup, which can parse HTML/XML documents even if they are not well formed.

Licensed under: CC-BY-SA with attribution
