Question

My database contains some strings with invalid ASCII codes mixed in. How can I concatenate those strings without errors?

My situation looks like this (some character codes are 128 or larger):

>>> s=b'\xb0'
>>> addstr='read '+s
>>> print addstr
read ░

>>> addstr.encode('ascii','ignore')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 5: ordinal not in range(128)
>>> addstr.encode('utf_8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 5: ordinal not in range(128)

I can do:

>>> addstr.decode("windows-1252").encode('utf-8')
'read \xc2\xb0'

but as you can see, the windows-1252 encoding changes my character.

I would like to convert addstr to unicode. How can I do that?


Solution

addstrUnicode = addstr.decode("unicode-escape")

You should not be concerned about the character changing. UTF-8 needs two bytes, not one, for code points between 0x80 and 0x7FF, so when you encode as UTF-8 a lead byte (0xC2) is added in front of 0xB0. The character itself is unchanged; only its byte representation differs.
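To make the byte counts concrete, here is a minimal sketch. It is written in Python 3 syntax, which is an assumption (the session in the question is Python 2), and the helper name `utf8_len` is hypothetical.

```python
# Python 3 sketch (assumption: the question itself uses Python 2).

def utf8_len(code_point):
    """Hypothetical helper: number of bytes UTF-8 uses for one code point."""
    return len(chr(code_point).encode("utf-8"))

degree = "\u00b0"                     # the character the answer assumes for byte 0xB0
encoded = degree.encode("utf-8")      # two bytes: 0xC2 followed by 0xB0
round_trip = encoded.decode("utf-8")  # decoding gives back the same character

# Byte lengths on either side of the 0x80 and 0x7FF boundaries.
lengths = {cp: utf8_len(cp) for cp in (0x7F, 0x80, 0x7FF, 0x800)}
```

Decoding the two bytes returns the original single character, so nothing is lost by the extra byte.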

This is a useful link to read to help understand different types of encodings.

Additionally, make sure you know the original encoding of the data before you try to decode it. You described it as "ascii code", but the ASCII character set only extends up to 127, so the byte 0xB0 cannot be ASCII-encoded. I'm assuming here that it is simply Unicode code point \u00B0 (the degree sign).
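As a sketch of decoding with an explicitly named encoding before concatenating: the code below uses Python 3 syntax (an assumption, since the question is Python 2) and assumes the database bytes are Latin-1, which should be checked against the real data source.

```python
# Python 3 sketch: decode bytes with a known encoding, then concatenate text.

raw = b"\xb0"                   # the byte from the question

# Assumption: the stored bytes are Latin-1 (ISO-8859-1).
text = raw.decode("latin-1")
addstr = "read " + text         # both operands are now Unicode text

# For this particular byte, windows-1252 maps to the same character.
same_in_cp1252 = raw.decode("cp1252") == text

# ASCII has no meaning for byte 0xB0, so decoding as ASCII fails.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

Once the text is decoded, `addstr.encode("utf-8")` produces the same `read \xc2\xb0` bytes shown in the question.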

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow