Question

In my program I get Shift-JIS character codes as Python integers, which I need to convert to their corresponding UTF-8 character codes (also as integers). How can I do that? For ASCII there are the helpful functions ord()/chr(), which let you convert an integer into an ASCII string that you can easily convert to Unicode later. I can't find anything like that for other encodings.

Using Python 2.

EDIT: the final code. Thanks everyone:

def shift_jis2unicode(charcode): # charcode is an integer
    if charcode <= 0xFF:
        string = chr(charcode)
    else:
        string = chr(charcode >> 8) + chr(charcode & 0xFF)

    return ord(string.decode('shift-jis'))

print shift_jis2unicode(0x8140)  # hex literal: decimal 8140 is 0x1FCC, which does not decode to a single character

Solution

There's no such thing as "utf8 character codes (which should also be in integers)".

Unicode defines "code points", which are integers. UTF-8 defines how to convert those code points to an array of bytes.
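To make the distinction concrete, here is a minimal sketch that prints both views of one character. The choice of U+FF0E (FULLWIDTH FULL STOP) is only for illustration, and the snippet is written to run on both Python 2 and Python 3.

```python
# One character, two representations.
text = u"\uFF0E"  # FULLWIDTH FULL STOP, chosen only as an illustration

# The code point is a single integer.
code_point = ord(text)

# The UTF-8 encoding is a sequence of bytes (three of them for this character).
utf8_bytes = list(bytearray(text.encode("utf-8")))

print("code point: U+%04X" % code_point)
print("UTF-8 bytes: " + " ".join("%02X" % b for b in utf8_bytes))
```

The code point is one number, while the UTF-8 form is a list of byte values, so "the UTF-8 character code as an integer" is ambiguous.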

So I think you want the Unicode code points. In that case:

def shift_jis2unicode(charcode): # charcode is an integer
    if charcode <= 0xFF:
        shift_jis_string = chr(charcode)
    else:
        shift_jis_string = chr(charcode >> 8) + chr(charcode & 0xFF)

    unicode_string = shift_jis_string.decode('shift-jis')

    assert len(unicode_string) == 1
    return ord(unicode_string)

print "U+%04X" % shift_jis2unicode(0x8144)
print "U+%04X" % shift_jis2unicode(0x51)

(Also: I don't think 8100 is a valid shift-JIS character code...)
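If invalid codes such as that one can show up in the input, a variant that returns None instead of raising may be more convenient. This is only a sketch: the function name shift_jis_code_to_code_point is made up for this example, and it uses bytearray so that the same code should run on Python 2.6+ and Python 3.

```python
def shift_jis_code_to_code_point(charcode):
    """Return the Unicode code point for an integer Shift-JIS code,
    or None if the code does not decode to exactly one character.

    Hypothetical helper, not part of any library.
    """
    if not 0 <= charcode <= 0xFFFF:
        return None

    # One byte for codes up to 0xFF, otherwise lead byte + trail byte.
    if charcode <= 0xFF:
        raw = bytearray([charcode])
    else:
        raw = bytearray([charcode >> 8, charcode & 0xFF])

    try:
        text = raw.decode('shift_jis')
    except UnicodeDecodeError:
        return None

    # Two unrelated single-byte characters glued together are not one code.
    if len(text) != 1:
        return None
    return ord(text)


for code in (0x8140, 0x51, 0x8100):
    result = shift_jis_code_to_code_point(code)
    if result is None:
        print("0x%X: not a valid Shift-JIS code" % code)
    else:
        print("0x%X: U+%04X" % (code, result))
```

Returning None is a design choice for this sketch; raising ValueError would work just as well if callers prefer exceptions.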

OTHER TIPS

There may be a better way to do this, but since there are no other answers yet here is an option.

You could use this table to convert your shift-jis integers to unicode code points, then use unichr() to convert your data into a Python unicode object, and then convert it from unicode to utf8 using unicode.encode('utf-8').
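As a rough sketch of that table-based route: the dictionary below is a hypothetical three-entry excerpt standing in for the full table, and the names SJIS_TO_CODE_POINT and sjis_code_to_utf8_bytes are invented for this example. The small version check is only there so the snippet runs on Python 3 as well, where unichr() is spelled chr().

```python
import sys

# unichr() exists only on Python 2; on Python 3 chr() does the same job.
to_char = unichr if sys.version_info[0] == 2 else chr

# Hypothetical excerpt; a real table would cover every Shift-JIS code.
SJIS_TO_CODE_POINT = {
    0x8140: 0x3000,  # IDEOGRAPHIC SPACE
    0x8141: 0x3001,  # IDEOGRAPHIC COMMA
    0x8142: 0x3002,  # IDEOGRAPHIC FULL STOP
}


def sjis_code_to_utf8_bytes(charcode):
    """Look up the code point, then encode it as a list of UTF-8 byte values."""
    code_point = SJIS_TO_CODE_POINT[charcode]  # KeyError if not in the table
    text = to_char(code_point)
    return list(bytearray(text.encode('utf-8')))


print(["0x%02X" % b for b in sjis_code_to_utf8_bytes(0x8141)])
```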

def from_shift_jis(seq):
    chars = [chr(c) if c <= 0xff else chr(c>>8) + chr(c&0xff) for c in seq]
    return ''.join(chars).decode('shift-jis')

# shift_jis_input is assumed to be a sequence of integer Shift-JIS codes
utf8_output = [ord(c) for c in from_shift_jis(shift_jis_input).encode('utf-8')]
Licensed under: CC-BY-SA with attribution