Question

My code:

a = '汉'
b = u'汉'

These are the same Chinese character, but a is an encoded byte string and b is a unicode string, so a == b is False. How do I fix this? Note that I can't convert a to UTF-8 because I have no access to the code that defines it; I need to convert b to whatever encoding a uses.

So my question is: how do I encode b the same way a is encoded?

No correct solution

OTHER TIPS

If you don't know a's encoding, you'll need to:

  1. detect a's encoding
  2. encode b using the detected encoding

First, to detect a's encoding, let's use chardet.

$ pip install chardet

Now let's use it:

>>> import chardet
>>> a = '汉'
>>> chardet.detect(a)
{'confidence': 0.505, 'encoding': 'utf-8'}

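Keep in mind that chardet only guesses, and a confidence of 0.505 on a single character is weak. With that caveat, to actually accomplish what you requested: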

>>> encoding = chardet.detect(a)['encoding']
>>> b = u'汉'
>>> b_encoded = b.encode(encoding)
>>> a == b_encoded
True
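
Because b is already known here, a guess-free alternative is to try a short list of candidate encodings and keep the one under which b encodes to exactly the bytes in a. This is only a minimal sketch: the function name find_encoding and the candidate list are made up for illustration and are not part of chardet or the standard library, and a is simulated with GBK because the real encoding is unknown.

```python
# -*- coding: utf-8 -*-
def find_encoding(raw, text, candidates=('utf-8', 'gbk', 'big5', 'utf-16')):
    """Return the first candidate encoding under which text encodes to raw, else None."""
    for name in candidates:
        try:
            if text.encode(name) == raw:
                return name
        except UnicodeEncodeError:
            # text cannot be represented in this encoding at all; try the next one
            continue
    return None

b = u'\u6c49'            # the character 汉
a = b.encode('gbk')      # stand-in for the byte string you cannot change

encoding = find_encoding(a, b)
b_encoded = b.encode(encoding) if encoding else None
```

If no candidate matches, find_encoding returns None, which is a safer signal than a low-confidence guess. Extend the candidate list with whatever encodings are plausible for your data.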

Alternatively, decode the byte string a with str.decode and compare the result to b:

>>> a = '汉'
>>> b = u'汉'
>>> a.decode('utf-8') == b
True

NOTE: Replace utf-8 with the encoding your source file is actually saved in, since that is the encoding a was written with.
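
To see why the codec has to match, here is a small sketch that builds the byte string for 汉 under two encodings and decodes each one. The variable names are illustrative only, and the byte strings are created with encode so the example does not depend on how this file is saved.

```python
# -*- coding: utf-8 -*-
b = u'\u6c49'                  # the character 汉
a_utf8 = b.encode('utf-8')     # what a holds if the source file is UTF-8
a_gbk = b.encode('gbk')        # what a holds if the source file is GBK

# Decoding with the matching codec recovers the same unicode string.
same_utf8 = a_utf8.decode('utf-8') == b
same_gbk = a_gbk.decode('gbk') == b

# Decoding GBK bytes as UTF-8 fails instead of producing b.
try:
    a_gbk.decode('utf-8')
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True
```

The two byte strings differ (three bytes in UTF-8, two in GBK), so a comparison only succeeds when both sides go through the same codec.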

Both a.decode and b.encode work:

In [133]: a.decode('utf') == b
Out[133]: True

In [134]: b.encode('utf') == a
Out[134]: True

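Note that str.encode and unicode.decode also exist in Python 2, but they go through an implicit ASCII conversion first and fail on non-ASCII data like this, so don't mix them up. See What is the difference between encode/decode?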
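
One way to avoid calling the wrong method is to wrap the direction in small helpers that check the type first. This is a sketch, not a standard API: to_text and to_bytes are hypothetical names, and UTF-8 as the default is an assumption you should replace with your real encoding.

```python
# -*- coding: utf-8 -*-
text_type = type(u'')    # unicode on Python 2, str on Python 3

def to_text(value, encoding='utf-8'):
    """Return value as a unicode string, decoding only if it is a byte string."""
    if isinstance(value, text_type):
        return value
    return value.decode(encoding)

def to_bytes(value, encoding='utf-8'):
    """Return value as a byte string, encoding only if it is a unicode string."""
    if isinstance(value, text_type):
        return value.encode(encoding)
    return value

b = u'\u6c49'             # the character 汉
a = b.encode('utf-8')     # stand-in for the byte string a

texts_match = to_text(a) == to_text(b)
bytes_match = to_bytes(a) == to_bytes(b)
```

Comparing to_text(a) with to_text(b) works regardless of which of the two is already unicode, so the caller never has to remember whether decode or encode applies.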

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow