Question

Given a character code as an integer in one encoding, how can you get the same character's code in, say, utf-8, again as an integer?

Solution

UTF-8 is a variable-length encoding, so I'll assume you really meant "Unicode code point". In Python 2, use chr() to convert the character code to a one-byte string, decode it from the source encoding to unicode, and use ord() to get the code point.

>>> ord(chr(145).decode('koi8-r'))
9618
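
In Python 3 chr() returns text rather than a byte, and str has no decode() method, so the one-liner above won't run there. A rough equivalent might look like this, assuming a single-byte source encoding (the function name is just made up for illustration):

```python
def to_code_point(code, encoding):
    """Return the Unicode code point for a single-byte character code."""
    # bytes([code]) builds a one-byte bytes object; it raises ValueError
    # if code is outside the range 0-255.
    char = bytes([code]).decode(encoding)
    return ord(char)

print(to_code_point(145, 'koi8-r'))  # 9618
```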

OTHER TIPS

You can only map an "integer number" from one encoding to another if they are both single-byte encodings.

Here's an example using "iso-8859-15" and "cp1252" (aka "ANSI"):

>>> s = u'€'
>>> s.encode('iso-8859-15')
'\xa4'
>>> s.encode('cp1252')
'\x80'
>>> ord(s.encode('cp1252'))
128
>>> ord(s.encode('iso-8859-15'))
164

Note that ord is here being used to get the ordinal number of the encoded byte. Using ord on the original unicode string would give its unicode code point:

>>> ord(s)
8364
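
If you want the byte-to-byte mapping between two single-byte encodings as a reusable function, here is a minimal Python 3 sketch. The name recode_byte and the explicit length check are my own additions, not anything from the standard library:

```python
def recode_byte(code, src, dst):
    """Map a character code from one single-byte encoding to another."""
    char = bytes([code]).decode(src)  # byte -> Unicode character
    encoded = char.encode(dst)        # Unicode character -> bytes in dst
    if len(encoded) != 1:
        # dst needed more than one byte, so no single integer code exists
        raise ValueError('%r is not a single byte in %s' % (char, dst))
    return encoded[0]                 # indexing bytes yields an int in Python 3

print(recode_byte(164, 'iso-8859-15', 'cp1252'))  # 128
print(recode_byte(128, 'cp1252', 'iso-8859-15'))  # 164
```

Codes that are undefined in the source encoding, or characters that don't exist in the target, will raise UnicodeDecodeError or UnicodeEncodeError respectively.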

The reverse operation to ord can be done using either chr (for codes in the range 0 to 255, giving a byte string) or unichr (for codes in the range 0 to sys.maxunicode, giving a unicode string):

>>> print chr(65)
A
>>> print unichr(8364)
€

For multi-byte encodings, a simple "integer number" mapping is usually not possible.

Here's the same example as above, but using "iso-8859-15" and "utf-8":

>>> s = u'€'
>>> s.encode('iso-8859-15')
'\xa4'
>>> s.encode('utf-8')
'\xe2\x82\xac'
>>> [ord(c) for c in s.encode('iso-8859-15')]
[164]
>>> [ord(c) for c in s.encode('utf-8')]
[226, 130, 172]

The "utf-8" encoding uses three bytes to encode the same character, so a one-to-one mapping is not possible. Having said that, many encodings (including "utf-8") are designed to be ASCII-compatible, so a mapping is usually possible for codes in the range 0-127 (but only trivially so, because the code will always be the same).
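
If you still need the result "as integers", one option is to return the list of byte values, or to pack the bytes into a single big-endian integer. This is only a sketch in Python 3; recode_to_bytes is a hypothetical helper, and packing the bytes into one number is just one possible convention, not a standard "UTF-8 code":

```python
def recode_to_bytes(code, src, dst='utf-8'):
    """Return the byte values that encode the given single-byte code in dst."""
    char = bytes([code]).decode(src)
    return list(char.encode(dst))

euro = recode_to_bytes(164, 'iso-8859-15')
print(euro)                                # [226, 130, 172]
print(int.from_bytes(bytes(euro), 'big'))  # 14844588, i.e. 0xE282AC
print(recode_to_bytes(65, 'iso-8859-15'))  # [65], ASCII range is unchanged
```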

Here's an example of how the encode/decode dance works:

>>> s = b'd\x06'               # perhaps start with bytes encoded in utf-16 (little-endian)
>>> map(ord, s)                # show those bytes as integers
[100, 6]
>>> u = s.decode('utf-16-le')  # turn the bytes into unicode, byte order made explicit
>>> print u                  # show what the character looks like
٤
>>> print ord(u)             # show the unicode code point as an integer
1636
>>> t = u.encode('utf-8')    # turn the unicode into bytes with a different encoding
>>> map(ord, t)              # show that encoding as integers
[217, 164]

Hope this helps :-)

If you need to construct the unicode directly from an integer, use unichr:

>>> u = unichr(1636)
>>> print u
٤
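
For anyone on Python 3, the same round trip might look roughly like this. There chr covers the whole Unicode range (unichr is gone) and iterating over bytes already yields integers, so map(ord, ...) isn't needed. The variable names are arbitrary, and I'm assuming little-endian UTF-16 input without a BOM:

```python
raw = b'd\x06'                  # bytes encoded in UTF-16, little-endian
print(list(raw))                # [100, 6]

text = raw.decode('utf-16-le')  # bytes -> str
print(ord(text))                # 1636

utf8 = text.encode('utf-8')     # str -> bytes in a different encoding
print(list(utf8))               # [217, 164]

print(chr(1636) == text)        # True, chr builds the character directly
```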
Licensed under: CC-BY-SA with attribution