Question

I am trying to see what different strings would look like in different encodings...

For example:

>>> str1 = "asdf"
>>> str1.encode('utf-16')
'\xff\xfea\x00s\x00d\x00f\x00'
>>> str1.encode('base64')
'YXNkZg==\n'

Both of those give me what I want.

But I'd like to see what certain strings would look like in gbk, gb2312, or gb18030.

>>> str1.encode('gbk')
'asdf'
>>> str1.encode('gb2312')
'asdf'
>>> str1.encode('gb18030')
'asdf'

Shouldn't the outputs be something other than 'asdf'?

I have Python 2.7, and I can see gbk.py and the other codec files in lib/encodings.

I was wondering whether I see no change in the output because those letters look the same in those encodings, or because I need to somehow enable the use of those encodings (is some sort of import needed?).


Solution

For characters in the ASCII range (code points 0–127), these encodings are byte-for-byte identical to ASCII. The same is true of UTF-8. To really see a difference, try encoding some actual Chinese text.
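
For example, with two arbitrary Chinese characters (a minimal sketch on Python 2.7; u'\u4e2d\u6587' is 中文, chosen purely for illustration):

>>> s = u'\u4e2d\u6587'   # a unicode string, not a byte string
>>> s.encode('gbk')
'\xd6\xd0\xce\xc4'
>>> s.encode('gb2312')
'\xd6\xd0\xce\xc4'
>>> s.encode('gb18030')
'\xd6\xd0\xce\xc4'
>>> s.encode('utf-8')
'\xe4\xb8\xad\xe6\x96\x87'

These two characters fall in the common GB2312 subset, so GBK and GB18030 happen to produce the same bytes here, while UTF-8 clearly differs.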

Other tips

From the Wikipedia page on GBK:

A character is encoded as 1 or 2 bytes. A byte in the range 00–7F is a single byte that means the same thing as it does in ASCII. Strictly speaking, there are 96 characters and 32 control codes in this range.

So no, your test string of ASCII characters should not encode to anything different (at least not in GBK; I didn't check the other variants).
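
You can check that 1-or-2-byte behaviour directly (again on Python 2.7; the character U+4E2D, 中, is just an arbitrary example):

>>> u'a'.encode('gbk')        # ASCII range: 1 byte, same as ASCII
'a'
>>> u'\u4e2d'.encode('gbk')   # Chinese character: 2 bytes
'\xd6\xd0'
>>> len(u'a\u4e2d'.encode('gbk'))
3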
