Question

How can I replace all characters in a string by their Unicode-representation?

Minimal example:

>>> s = 'ä'
>>> new = some_function(s, 'unicode')
>>> new
'\u00E4'

Solution

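The first step is to convert from a byte string to a Unicode string. This is needed in Python 2, where a plain string literal is a byte string, and it assumes the bytes are UTF-8 encoded. In Python 3, str is already Unicode, so you can skip the decode and use u = s: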

u = s.decode('utf-8')

The second step is to create a new string with every character replaced by its Unicode escape sequence.

new = ''.join('\\u{:04x}'.format(ord(c)) for c in u)
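As a rough illustration of the two steps together, here is a minimal sketch assuming Python 3. The variable raw is a hypothetical stand-in for a UTF-8 byte string; if you already have a str, the decode step is unnecessary.

```python
# Sketch assuming Python 3; `raw` is a made-up example input.
raw = 'ä'.encode('utf-8')      # UTF-8 bytes, b'\xc3\xa4'

# Step 1: bytes -> Unicode string
u = raw.decode('utf-8')

# Step 2: replace every character by its \uXXXX escape
new = ''.join('\\u{:04x}'.format(ord(c)) for c in u)
print(new)  # \u00e4
```

The hex digits come out in lowercase here. If you want the uppercase form from the question (\u00E4), the format spec {:04X} should give that instead.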

If you only want to replace the non-ASCII characters, a slight modification will do. It keeps code points 32 to 127 unchanged, so control characters below 32 are escaped as well:

new = ''.join(c if 32 <= ord(c) < 128 else '\\u{:04x}'.format(ord(c)) for c in u)
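To see what that variant does on mixed input, here is a small sketch assuming Python 3. The function name escape_non_ascii and the sample string are made up for illustration.

```python
def escape_non_ascii(text):
    # Hypothetical helper wrapping the one-liner above:
    # keep code points 32..127, escape everything else as \uXXXX.
    return ''.join(
        c if 32 <= ord(c) < 128 else '\\u{:04x}'.format(ord(c))
        for c in text
    )

sample = 'Käse\n'              # one non-ASCII letter and one control character
print(escape_non_ascii(sample))  # K\u00e4se\u000a
```

The newline is escaped too, because its code point (10) is below 32.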

Note that the \u0000 notation only works for Unicode code points in the Basic Multilingual Plane (up to U+FFFF). You need the eight-digit \U00000000 notation for anything larger. You can also use the \x00 notation for anything less than 256. The following handles all cases and is probably a bit easier to read:

def unicode_notation(c):
    x = ord(c)
    if 32 <= x < 128:
        return c
    if x < 256:
        return '\\x{:02x}'.format(x)
    if x < 0x10000:
        return '\\u{:04x}'.format(x)
    return '\\U{:08x}'.format(x)

new = ''.join(unicode_notation(c) for c in u)
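A quick way to sanity-check the function is to run it on a string that hits every branch. The sketch below assumes Python 3 and repeats the function so that it runs on its own; the sample string is made up. It also compares the result with Python's built-in unicode_escape codec, which appears to use the same three notations for this input.

```python
def unicode_notation(c):
    # Same function as above, repeated so the sketch is self-contained.
    x = ord(c)
    if 32 <= x < 128:
        return c
    if x < 256:
        return '\\x{:02x}'.format(x)
    if x < 0x10000:
        return '\\u{:04x}'.format(x)
    return '\\U{:08x}'.format(x)

sample = 'aä€😀'   # ASCII, below 256, BMP, beyond the BMP
new = ''.join(unicode_notation(c) for c in sample)
print(new)  # a\xe4\u20ac\U0001f600

# Built-in alternative: the unicode_escape codec returns bytes,
# so decode them back to str for comparison.
builtin = sample.encode('unicode_escape').decode('ascii')
print(builtin == new)  # True for this sample
```

The two are not interchangeable in general: the codec also escapes backslashes and writes newlines as \n rather than \x0a, so pick whichever behaviour matches what you need.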

Other tips

I'm not sure what you mean by “unicode representation”. Here is a solution based on my interpretation of your question.

For Python 3:

print('\\u' + hex(ord('ä'))[2:].upper().rjust(4, '0'))

For Python 2:

print('\\u' + repr(u'ä')[4:-1].upper().rjust(4, '0'))

I assume that's what you want?

>>> new = s.decode('utf-8')
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow