Question

I know how to get '4f60597d' from u'\u4f60\u597d':

>>> u_str= u'你好'
>>> repr(u_str).replace('\u', '')[2:-1] 
'4f60597d'

But if the string contains ASCII characters:

>>> u_str= u'12你好'    
>>> repr(u_str).replace('\u', '')[2:-1] 
'124f60597d'

This is not the result I want.

I expect output like this: 003100324f60597d

How can I get that?

Solution

You could use ord() to get the integer codepoint for each character and format that instead:

''.join(format(ord(c), '04x') for c in u_str)

Demo:

>>> u_str = u'12你好'  
>>> ''.join(format(ord(c), '04x') for c in u_str)
'003100324f60597d'
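
To make the role of the '04x' format spec visible, here is a small sketch that formats each character with and without zero padding. The variable names `padded` and `unpadded` are only illustrative, and the string is written with escapes so it is equivalent to u'12你好':

```python
# Same text as u'12你好', written with escapes to avoid source-encoding issues.
u_str = u'12\u4f60\u597d'

# '04x' = lowercase hex, zero-padded to at least four digits.
padded = [format(ord(c), '04x') for c in u_str]

# Plain 'x' leaves the ASCII characters at two digits,
# which is what the repr()-based approach effectively produced.
unpadded = [format(ord(c), 'x') for c in u_str]

print(padded)
print(unpadded)
print(''.join(padded))
```

The padding is what turns '1' (0x31) into '0031', so every BMP character contributes exactly four hex digits.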

Alternatively, you could encode to UTF-16 (big endian) and call binascii.hexlify() on the result; this is probably the faster option:

from binascii import hexlify

hexlify(u_str.encode('utf-16-be'))

Demo:

>>> from binascii import hexlify
>>> hexlify(u_str.encode('utf-16-be'))
'003100324f60597d'

The latter also handles characters outside the BMP, which need 4 bytes per code point in UTF-16 and are encoded as surrogate pairs:

>>> hexlify(u'\U0001F493'.encode('utf-16-be'))
'd83ddc93'
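
The demos above are Python 2 sessions. As a rough sketch of how the two approaches compare on Python 3, assuming that hexlify() returns bytes there and so needs a final decode: the helper names `to_utf16_hex` and `to_codepoint_hex` are made up for this illustration and are not part of any library.

```python
from binascii import hexlify


def to_utf16_hex(text):
    """Hex string of the UTF-16 (big endian) code units of text."""
    # On Python 3, hexlify() returns bytes, so decode to get a str.
    return hexlify(text.encode('utf-16-be')).decode('ascii')


def to_codepoint_hex(text):
    """Hex string built from each code point, padded to four digits."""
    return ''.join(format(ord(c), '04x') for c in text)


bmp_text = u'12\u4f60\u597d'   # same text as u'12你好'
astral = u'\U0001F493'         # a character outside the BMP

# Both approaches agree for BMP-only text.
print(to_utf16_hex(bmp_text))
print(to_codepoint_hex(bmp_text))

# Outside the BMP they differ: surrogate pair vs. one five-digit code point.
print(to_utf16_hex(astral))
print(to_codepoint_hex(astral))
```

On Python 3 the ord()-based version yields a single five-digit value for a non-BMP character, so the output is no longer a fixed four digits per character; the UTF-16 version always emits four hex digits per code unit.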
Licensed under: CC-BY-SA with attribution