Converting incorrectly encoded Chinese characters in MySQL to UTF-8

https://stackoverflow.com/questions/18271667

24-06-2022

Question

I have a large MySQL table filled with Chinese characters in an incorrect encoding. I believe they were supposed to be encoded in latin1 (iso-8859-1), but I just can't find a way to get the Chinese characters from the contents of the database rows.

Converting between latin1 and utf8 doesn't help - the fields remain unchanged. I've tried re-importing the database with various encodings - always the same results.

Some examples of the current contents and what they should be:

æƒ¨äº‹ should be 惨事
ä¸ should be 不
æœ€ should be 最

I've also tried using Python to try and 'decode' the contents, but again without success. I've tried various combinations of this:

databasefield.decode('iso-8859-1').encode('utf8')

But I can't get anything like that to work either.

Sorry for asking such a vague question, but I just don't know how to continue trying to figure this out!

Does anyone know what the problem is here?

Answer

You are looking at UTF-8 bytes that were decoded as Windows codepage 1252 instead:

>>> print u'惨事'.encode('utf8').decode('cp1252')
æƒ¨äº‹
>>> print u'最'.encode('utf8').decode('cp1252')
æœ€

Fixing this requires going the other way:

>>> print u'æƒ¨äº‹'.encode('cp1252').decode('utf8')
惨事
>>> print u'æœ€'.encode('cp1252').decode('utf8')
最
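
The sessions above are Python 2. As a minimal sketch, the same round trip in Python 3 could look like this; the function name `fix_mojibake` is a hypothetical helper, not part of any library:

```python
def fix_mojibake(text: str) -> str:
    """Reverse UTF-8 bytes that were mistakenly decoded as cp1252.

    Raises UnicodeEncodeError or UnicodeDecodeError if the text
    does not fit that pattern.
    """
    # Turn the garbled characters back into the original bytes,
    # then decode those bytes as the UTF-8 they really were.
    return text.encode("cp1252").decode("utf-8")


# Reproduce the damage, then undo it.
garbled = "惨事".encode("utf-8").decode("cp1252")
print(garbled)                # æƒ¨äº‹
print(fix_mojibake(garbled))  # 惨事
```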

There may have been some loss there though, as the UTF-8 encoding for 不 contains the byte 0x8D, which codepage 1252 does not map to any character:

>>> u'不'.encode('utf8')
'\xe4\xb8\x8d'
>>> print u'不'.encode('utf8').decode('cp1252')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/cp1252.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 2: character maps to <undefined>
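
Whether such characters can be recovered depends on what happened to the unmapped bytes. The sketch below (Python 3) assumes they survived in the stored text as the matching control characters (for example U+008D for byte 0x8D); that is an assumption about the data, and `fix_mojibake_lenient` is a hypothetical helper name. If the bytes were dropped or replaced instead, the original text cannot be rebuilt this way.

```python
def fix_mojibake_lenient(text: str) -> str:
    """Undo UTF-8-read-as-cp1252 damage, tolerating bytes cp1252 leaves undefined."""
    raw = bytearray()
    for ch in text:
        try:
            raw += ch.encode("cp1252")
        except UnicodeEncodeError:
            # Python's cp1252 codec has no character for bytes 0x81, 0x8D,
            # 0x8F, 0x90 and 0x9D. Assumption: those bytes were kept as
            # U+0081, U+008D, ... so latin-1 maps them straight back.
            # Characters outside latin-1 still raise UnicodeEncodeError.
            raw += ch.encode("latin-1")
    return raw.decode("utf-8")


print(fix_mojibake_lenient("ä¸\x8d"))  # 不
```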

Several other Windows codepages are also candidates here; 1254, for example, would produce similar output with only minor differences.
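
One way to compare candidates is to attempt the reverse conversion with each codec and keep whichever ones succeed. This is a rough Python 3 sketch; the candidate list and the name `try_repairs` are illustrative assumptions, and a successful decode only suggests a match, so the results still need checking by eye.

```python
CANDIDATES = ("cp1252", "cp1254", "latin-1")  # illustrative list, extend as needed


def try_repairs(text: str, codecs=CANDIDATES) -> dict:
    """Return {codec name: repaired text} for every codec that round-trips."""
    results = {}
    for name in codecs:
        try:
            results[name] = text.encode(name).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # This codec cannot explain the garbled text; skip it.
            continue
    return results


for codec, repaired in try_repairs("æœ€").items():
    print(codec, repaired)
```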

License: CC-BY-SA with attribution

Not affiliated with StackOverflow