Question

How can I replace all characters in a string by their Unicode-representation?

Minimal example:

>>> s = 'ä'
>>> new = some_function(s, 'unicode')
>>> new
'\u00E4'
Was it helpful?

Solution 2

The first step is to convert from a byte string to a Unicode string:

u = s.decode('utf-8')

The second step is to create a new string with every character replaced by its Unicode escape sequence.

new = ''.join('\\u{:04x}'.format(ord(c)) for c in u)

If your intent was only to replace the non-ASCII characters then a slight modification will do:

new = ''.join(c if 32 <= ord(c) < 128 else '\\u{:04x}'.format(ord(c)) for c in u)

Note that the \u0000 notation only works for Unicode codepoints in the base Unicode plane. You need the \U00000000 notation for anything larger. You can also use \x00 notation for anything less than 256. The following handles all cases and is probably a bit easier to read:

def unicode_notation(c):
    x = ord(c)
    if 32 <= x < 128:
        return c
    if x < 256:
        return '\\x{:02x}'.format(x)
    if x < 0x10000:
        return '\\u{:04x}'.format(x)
    return '\\U{:08x}'.format(x)

new = ''.join(unicode_notation(c) for c in u)

OTHER TIPS

Not sure what you mean by “unicode representation”. Here is solution to my interpretation of your question.

For Python 3:

print('\\u' + hex(ord('ä'))[2:].upper().rjust(4, '0'))

For Python 2:

print('\\u' + repr(u'ä')[4:-1].upper().rjust(4, '0'))

I assume that's what you want?

>>> new = s.decode('utf-8')
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top