Question

I'm reading SQLASCII strings from a database and encountered some bytes that did not decode properly with the big5 encoding I declared. Below is a simplified version of the problem. It appears that Python's big5 encoding table does not know how to decode these two characters. As far as I can tell (I am not an expert), these are valid Chinese characters, since I can open them in Notepad++, change the Encoding, and have them display as Chinese characters. I compared what they look like in Notepad++ with the table on this website, and the characters match, so I assume they are valid bytes in the big5 encoding table.

http://ash.jp/code/cn/big5tbl.htm

by = b'\xBD\xC6\xBB\x73'
print(by,len(by))
print(by.decode('big5'))

b'\xbd\xc6\xbbs' 4

Traceback (most recent call last):
  File "qtest1.py", line 15, in <module>
    print(by.decode('big5'))
  File "C:\Python32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-1: character maps to <undefined>

Any help greatly appreciated...


Solution

Look carefully: that's a UnicodeEncodeError - it's failing to encode, not decode. Also look at the module it's using: ...\lib\encodings\cp1252.py. So something's trying to encode text to cp1252.
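To see the two steps separately, here is a minimal sketch that decodes the bytes from the question and then attempts the cp1252 encode by hand, which is roughly what print() does on a console using that code page. The variable names `text` and `error` are only illustrative.

```python
by = b'\xBD\xC6\xBB\x73'

# Step 1: decoding from big5 succeeds and yields a str of two characters.
text = by.decode('big5')

# Step 2: encoding that str to cp1252 is what fails, because cp1252
# has no mapping for these characters.
error = None
try:
    text.encode('cp1252')
except UnicodeEncodeError as exc:
    error = exc
    print(type(exc).__name__, 'at positions', exc.start, 'to', exc.end - 1)
```

If the decode itself were the problem, the exception would be a UnicodeDecodeError raised in step 1 instead.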

In fact, decoding as big5 works fine - I can run your exact code and get Chinese characters[1]. The problem is your terminal - Python's trying to encode the Chinese characters using your Windows code page (cp1252), which doesn't know what to do with them. You should be able to write them to a file opened with a suitable encoding (UTF-8 or big5), or do whatever else you need to with them; you just can't write them to the terminal.

[1] Most Linux terminals use UTF-8, so any character works.
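As a rough sketch of the workaround, the decoded text can be written to a file opened with an explicit encoding, and ascii() can be used when something has to go to the terminal. The file name `big5_demo.txt` and the use of the temp directory are assumptions for illustration, not part of the original code.

```python
import os
import tempfile

by = b'\xBD\xC6\xBB\x73'
text = by.decode('big5')

# ascii() escapes non-ASCII characters, so this prints safely
# even on a console that uses cp1252.
print(ascii(text))

# Write the text to a file with an explicit encoding instead of
# sending it to the terminal. The path is a hypothetical placeholder.
path = os.path.join(tempfile.gettempdir(), 'big5_demo.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Read it back to confirm the characters survived the round trip.
with open(path, encoding='utf-8') as f:
    round_trip = f.read()
```

Opening the resulting file in an editor set to UTF-8 should show the same characters that Notepad++ displayed for the original bytes.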

Licensed under: CC-BY-SA with attribution