Question

I use Python 2.7 and I want to convert a byte string that contains a Unicode escape sequence into a unicode string.

print u'abc' == unicode('abc')  #True  
print u'\u0026abc' == unicode('\u0026abc')  #False

What I want is to store '\u0026abc' in a variable and convert it to u'\u0026abc'.
But as you can see, unicode('\u0026abc') does not equal u'\u0026abc'.
Is there a way to convert a variable holding '\u0026abc' into u'\u0026abc'?


Solution 2

If you try to print unicode("\u0026abc"), you will see the root of your problem:

>>> a = u"abc"
>>> ua = unicode("abc")
>>> a == ua
True
>>> b = u"\u0026abc"
>>> b
u'&abc'
>>> ub = unicode("\u0026abc")
>>> ub
u'\\u0026abc'
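
The doubled backslash in that last result is the whole story: in a byte string, \u0026 stays as six separate characters, while in a unicode literal it collapses into one. A minimal sketch to make the difference visible, assuming the byte string is written with an explicit b prefix and an escaped backslash (the names raw and text are only for illustration):

```python
# In Python 2.7 a plain '...' literal is a byte string; the b prefix just makes that explicit.
raw = b'\\u0026abc'      # same bytes as the literal '\u0026abc' in Python 2.7
text = u'\u0026abc'      # in a unicode literal, \u0026 is the single character '&'

raw_len = len(raw)       # backslash, 'u', '0', '0', '2', '6', 'a', 'b', 'c'
text_len = len(text)     # '&', 'a', 'b', 'c'
```

So unicode() has nothing to interpret; it only converts the nine bytes it was given into nine unicode characters.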

You can fix it this way:

>>> ub = unicode("&abc")
>>> ub
u'&abc'
>>> b == ub
True

But that requires a human to change the code. To do it programmatically, you might try:

>>> c = "\u0026abc"
>>> c
'\\u0026abc'
>>> cc = "u\'" + c + "\'"
>>> cc
"u'\\u0026abc'"
>>> eval(cc)
u'&abc'

However, this solution is not very general, and calling eval on a string you do not control is unsafe. Daniel's answer provides a better one.

OTHER TIPS

In byte strings, '\uxxxx' is not a special escape sequence; it is simply a backslash followed by 'u' and four more characters. If you really have a byte string containing \u sequences, use a regular expression to convert them to unicode:

import re
text = '\\u0026abc'
# Replace each literal \uXXXX (four hex digits) with the character it names
text = re.sub(r'\\u([0-9a-fA-F]{4})', lambda m: unichr(int(m.group(1), 16)), text)
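
As an alternative to a hand-written regular expression, the standard library's 'unicode_escape' codec may also do the job. This is only a sketch, assuming the input is a byte string whose remaining bytes are plain ASCII (the codec treats any non-escape bytes as Latin-1, which may not be what you want for other encodings); the names raw and decoded are just for illustration:

```python
import codecs

# Byte string holding a literal backslash, 'u', and four hex digits.
raw = b'\\u0026abc'

# The 'unicode_escape' codec interprets \uXXXX escapes the way a
# u'...' literal would and returns a unicode string.
decoded = codecs.decode(raw, 'unicode_escape')
```

In Python 2.7 the same thing can be written as raw.decode('unicode_escape'). Note that this codec also interprets other backslash escapes such as \n and \t, whereas the regular expression above touches only \uXXXX.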
Licensed under: CC-BY-SA with attribution