Can't decode a Cyrillic string in Python

https://stackoverflow.com/questions/23691094

29-07-2023
Question

I have an encoded file with strings like

b'1'    b'\xca\xee\xef\xe5\xe9\xf1\xea' b'1'    b'ADMIN'    b'2013-07-08 00:21:55'  
b'2'    b'\xd7\xe5\xeb\xff\xe1\xe8\xed\xf1\xea' b'1'    b'ADMIN'    b'2013-07-08 00:22:05'

How should I decode it? I tried using codecs and decoding/encoding with cp1251, but it didn't work.

file -bi reports charset=us-ascii

The strings should actually be Cyrillic (cp1251).

I'm using Python 2.7.

The output:

>>> w=r'\xd7\xe5\xe\xff\xe1\xe8\xed\xf1\xea'
>>> w='\xd7\xe5\xe\xff\xe1\xe8\xed\xf1\xea'
ValueError: invalid \x escape
>>> w=r'\xd7\xe5\xe\xff\xe1\xe8\xed\xf1\xea'
>>> w.decode('raw_unicode_escape')
u'\\xd7\\xe5\\xe\\xff\\xe1\\xe8\\xed\\xf1\\xea'
>>> w.decode('utf-8')
u'\\xd7\\xe5\\xe\\xff\\xe1\\xe8\\xed\\xf1\\xea'
>>> unicode(w)
u'\\xd7\\xe5\\xe\\xff\\xe1\\xe8\\xed\\xf1\\xea'
>>> unicode(w, 'utf-8')
u'\\xd7\\xe5\\xe\\xff\\xe1\\xe8\\xed\\xf1\\xea'

I tried everything: decode("utf-8"), unicode() and so on, but nothing changes. Every time I get the same set of bytes back.

Solution

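The "invalid \x escape" error occurs because the third escape in your w variable is \xe, which is missing its second hex digit: it should be \xeb. A \x escape in a Python string literal must be followed by exactly two hex digits. With the escape corrected, decoding as cp1251 works: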

>>> w = '\xd7\xe5\xeb\xff\xe1\xe8\xed\xf1\xea'
>>> w.decode('cp1251')
u'\u0427\u0435\u043b\u044f\u0431\u0438\u043d\u0441\u043a'
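If the file really contains the literal text b'\xd7\xe5...' (which file -bi reporting us-ascii suggests, though that is an assumption about the file), each field first has to be turned back into real bytes before cp1251 can be applied. Here is a minimal sketch of that idea in Python 3 rather than 2.7; the helper name decode_line and the regular expression are illustrative, not part of the original answer:

```python
import ast
import re

# Matches one Python bytes literal such as b'ADMIN' or b'\xd7\xe5',
# allowing backslash escapes inside the quotes.
BYTES_LITERAL = re.compile(r"b'(?:[^'\\]|\\.)*'")

def decode_line(line, encoding="cp1251"):
    """Decode every b'...' field found in one line of the file.

    Assumes the line holds the *text* of bytes literals, not raw bytes.
    """
    fields = []
    for literal in BYTES_LITERAL.findall(line):
        raw = ast.literal_eval(literal)  # safely parse the literal into bytes
        fields.append(raw.decode(encoding))
    return fields

# Sample line taken from the question (raw string keeps the backslashes literal).
line = r"b'2'    b'\xd7\xe5\xeb\xff\xe1\xe8\xed\xf1\xea' b'1'    b'ADMIN'    b'2013-07-08 00:22:05'"
print(decode_line(line))
```

ast.literal_eval is used instead of eval so that only literals are accepted, which matters if the file comes from an untrusted source.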

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow