How to parse RSS with GB2312 encoding in Python

https://stackoverflow.com/questions/7569256

30-01-2021

Question

I have an RSS feed which is encoded in GB2312.

When I try to parse it using the following code:

for item in XML.ElementFromURL(feed).xpath('//item'):
    title = item.find('title').text

it is not able to parse the feed.

Any idea how to parse a GB2312-encoded RSS feed?

The error log from Plex Media Server is below, after passing the encoding as follows:

for item in XML.ElementFromURL(feed, encoding='gb2312').xpath('//item'):
    title = item.find('title').text

***Error Log:***
>  File "C:\Documents and Settings\subhendu.swain\Local Settings\Application Data\Plex Media Server\Plug-ins\Zaobao.bundle\Contents\Code\__init__.py", line 24, in GetDetails
    for item in XML.ElementFromURL(feed, encoding='gb2312').xpath('//item'):
  File "C:\Documents and Settings\subhendu.swain\Local Settings\Application Data\Plex Media Server\Plug-ins\Framework.bundle\Contents\Resources\Versions\2\Python\Framework\api\parsekit.py", line 81, in ElementFromURL
    return self.ElementFromString(self._core.networking.http_request(url, values, headers, cacheTime, autoUpdate, encoding, errors, immediate=True, sleep=sleep, opener=self._opener, txn_id=self._txn_id).content, isHTML=isHTML)
  File "C:\Documents and Settings\subhendu.swain\Local Settings\Application Data\Plex Media Server\Plug-ins\Framework.bundle\Contents\Resources\Versions\2\Python\Framework\api\parsekit.py", line 76, in ElementFromString
    return self._core.data.xml.from_string(string, isHTML)
  File "C:\Documents and Settings\subhendu.swain\Local Settings\Application Data\Plex Media Server\Plug-ins\Framework.bundle\Contents\Resources\Versions\2\Python\Framework\components\data.py", line 134, in from_string
    return etree.fromstring(markup)
  File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48270)
  File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:71812)
  File "parser.pxi", line 1424, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:70673)
  File "parser.pxi", line 938, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67442)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
  File "parser.pxi", line 565, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64088)
XMLSyntaxError: switching encoding: encoder error, line 1, column 36

2011-09-28 09:34:33,453 (9d0) :  DEBUG (core) - Response: 404

Solution

I assume you are using the Plex XML API. The documentation states that you can call XML.ElementFromURL(feed, encoding='gb2312') if you know that this is really the encoding being used.

If the XML really is encoded with GB2312, then the declaration must be <?xml version="1.0" encoding="gb2312"?>, otherwise the XML is invalid. If there is no encoding in the XML declaration and no byte order mark (which would indicate UTF-16), parsers must assume UTF-8 by default, so it is invalid to use any other character encoding for XML without naming it in the declaration. Since not specifying the encoding produces an error for you, I think it is possible that the RSS feed is not valid XML.
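To check which encoding a feed actually declares, something like the following sketch may help. It is written for Python 3, the names declared_encoding and effective_encoding are made up for illustration, and it assumes the document is in an ASCII-compatible encoding with no byte order mark (so it does not cover UTF-16).

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the document. Assumption: the bytes are ASCII-compatible, so the
# declaration itself can be read without knowing the encoding yet.
_XML_DECL_ENCODING = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(data):
    """Return the encoding named in the XML declaration, or None if absent."""
    match = _XML_DECL_ENCODING.match(data)
    return match.group(1).decode("ascii") if match else None


def effective_encoding(data):
    """Encoding a parser would assume for BOM-less input: declared, else UTF-8."""
    return declared_encoding(data) or "utf-8"
```

Calling effective_encoding on the raw bytes of the feed shows what the parser will be told to expect, which can then be compared with what the bytes really contain.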

OTHER TIPS

Your error message is XMLSyntaxError: switching encoding: encoder error, line 1, column 36. You asked for ideas. Here's a novel idea: Tell us what is in the first 50 or so bytes of "line 1". Then somebody may be able to come up with a remedy.

Update: The encoding declaration is incorrect. The data is NOT encoded in gb2312. It's at least GBK aka cp936. GB2312-80 (that's 80 as in the year 1980) is a limited character set. Chinese websites that are not using UTF-8 would be using at least the superset GBK (been in use for well over 10 years) and moving to the supersuperset GB18030 (which is itself a UTF). See below:

[Python 2.7.1]
>>> import urllib
>>> url = "http://www.zaobao.com/sp/sp.xml"
>>> data = urllib.urlopen(url).read()
>>> len(data)
10071
>>> data[:100]
'<?xml version="1.0" encoding="GB2312"?>\n\n<rss version="2.0"\n>\n\n<channel>\n<title>\xc1\xaa\xba\xcf\xd4\xe7\xb1\xa8\xcd\xf8 zaobao.co'
>>> x = data.decode('gb2312')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 1771-1772: illegal multibyte sequence
>>> data[1771:1773]
'\x95N'
>>> x = data.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\python27\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc1 in position 80: invalid start byte
>>> x = data.decode('gbk')
>>> y = data.decode('cp936')
>>> x == y
True

I suggest that you try XML.ElementFromURL(feed, encoding='gbk').

If that works, you may wish to bullet-proof your code against this not-uncommon problem by reading the data with urllib, checking for gb2312 and, if you find it, using gb18030 instead.
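A minimal sketch of that idea, outside the Plex API and written for Python 3: the function name parse_gb_feed is hypothetical, and the standard library's xml.etree.ElementTree stands in for whatever parser you actually use. It decodes the bytes itself (treating a declared gb2312, gbk or cp936 as gb18030), drops the XML declaration, and hands the parser plain UTF-8 so that no encoding switch is needed.

```python
import re
import xml.etree.ElementTree as ET

# Captures the encoding named in a leading XML declaration (ASCII-compatible
# input assumed).
_DECL_ENCODING = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']'
)
# Encodings that are decoded with the gb18030 superset instead.
_GB_FAMILY = {"gb2312", "gbk", "cp936"}


def parse_gb_feed(data):
    """Parse feed bytes, decoding a declared GB2312/GBK feed as GB18030."""
    match = _DECL_ENCODING.match(data)
    declared = match.group(1).decode("ascii").lower() if match else "utf-8"
    codec = "gb18030" if declared in _GB_FAMILY else declared
    text = data.decode(codec)
    # Remove the declaration: once re-encoded as UTF-8 its encoding claim
    # would be wrong, and a parser assumes UTF-8 when none is given.
    text = re.sub(r'^<\?xml[^>]*\?>', '', text, count=1).lstrip()
    return ET.fromstring(text.encode("utf-8"))
```

In Python 3 the bytes could come from urllib.request.urlopen(url).read(); the titles are then available as [item.findtext('title') for item in parse_gb_feed(data).iter('item')].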

Update 2: In case anyone mentions chardet: because GBK uses the many slots left unused in GB2312, and chardet neither works from the slots that are actually used nor verifies its answer with a trial decode, chardet guesses GB2312.
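The trial decode mentioned here can be done by hand. The sketch below (Python 3; first_decodable and its candidate list are illustrative, not part of chardet or any library) returns the first candidate codec that decodes the data without error. A successful decode does not prove the guess is right, so the candidates are ordered from strictest to most permissive.

```python
def first_decodable(data, candidates=("utf-8", "gb2312", "gbk", "gb18030")):
    """Return the first codec name in candidates that decodes data cleanly."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            # This codec rejects the bytes; try the next, wider one.
            continue
        return name
    return None
```

For the two bytes '\x95N' from the session above, this skips utf-8 and gb2312 and settles on gbk.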

Licensed under: CC-BY-SA with attribution
