This works:
s = u'Zåìôèðà'
print s.encode('latin1').decode('cp1251')
# Zемфира
Explanation: Zåìôèðà is mistakenly treated as a unicode string, while it is actually a sequence of bytes that means Zемфира in cp1251. By applying encode('latin1') we convert this "unicode" string back to bytes, using the code point numbers as byte values, and then convert these bytes back to unicode, telling decode that the encoding is cp1251.
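To see why the round trip works, it can help to reproduce the damage first. The sketch below assumes Python 3 (the snippet above is Python 2), and the variable names are only illustrative. It encodes the text to cp1251 bytes, misreads them as latin1 to produce the garbled string, and then reverses that step.

```python
original = 'Zемфира'

# The bytes as they were actually stored (cp1251).
raw = original.encode('cp1251')

# Misreading those bytes as latin1 maps each byte to the code point
# with the same number, which produces the garbled string.
garbled = raw.decode('latin1')
print(garbled)  # Zåìôèðà

# encode('latin1') inverts that mapping and recovers the original bytes,
# which can then be decoded with the correct codec.
recovered = garbled.encode('latin1').decode('cp1251')
print(recovered)  # Zемфира
```

latin1 is convenient here because it maps every byte value 0-255 to the code point with the same number, so the encode step cannot lose information.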
As for automatic decoding, the following brute-force approach seems to work with your examples:
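import re, itertools

def guess_decode(s):
    encodings = ['cp1251', 'cp1252', 'utf8']
    # Try chains of 2, 4, 6 and 8 alternating encode/decode steps.
    for steps in range(2, 10, 2):
        for encs in itertools.product(encodings, repeat=steps):
            r = s
            try:
                for enc in encs:
                    # unicode -> bytes, then bytes -> unicode, and so on
                    r = r.encode(enc) if isinstance(r, unicode) else r.decode(enc)
            except (UnicodeEncodeError, UnicodeDecodeError) as e:
                continue
            # Accept the first result made only of ASCII word characters,
            # whitespace and Cyrillic letters.
            if re.match(ur'^[\w\sа-яА-Я]+$', r):
                print 'debug', encs, r
                return r
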
print guess_decode(u'ZÐµÐ¼Ñ„Ð¸Ñ€Ð°')
print guess_decode(u'Zåìôèðà')
print guess_decode(u'ZÃ¥Ã¬Ã´Ã¨Ã°Ã\xA0')
Results:
debug ('cp1252', 'utf8') Zемфира
Zемфира
debug ('cp1252', 'cp1251') Zемфира
Zемфира
debug ('cp1252', 'utf8', 'cp1252', 'cp1251') Zемфира
Zемфира
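On Python 3, where unicode and the print statement no longer exist, the same search could look roughly like the sketch below. It is an adaptation rather than the code that produced the results above. The names ENCODINGS, PLAUSIBLE and max_steps are introduced here for illustration, and the function returns the chain of encodings alongside the text instead of printing it. The re.ASCII flag is an assumption made to keep \w limited to ASCII, as it is in the Python 2 version.

```python
import itertools
import re

ENCODINGS = ['cp1251', 'cp1252', 'utf8']
# Accept only ASCII word characters, whitespace and basic Cyrillic letters.
PLAUSIBLE = re.compile(r'^[\w\sа-яА-Я]+$', re.ASCII)


def guess_decode(s, max_steps=8):
    """Try alternating encode/decode chains until the result looks plausible."""
    for steps in range(2, max_steps + 1, 2):
        for encs in itertools.product(ENCODINGS, repeat=steps):
            r = s
            try:
                for enc in encs:
                    # str -> bytes, then bytes -> str, and so on
                    r = r.encode(enc) if isinstance(r, str) else r.decode(enc)
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue
            if PLAUSIBLE.match(r):
                return r, encs
    # No chain produced plausible text.
    return None, None


# Build a garbled sample instead of pasting one in literally.
sample = 'Zемфира'.encode('cp1251').decode('cp1252')
text, chain = guess_decode(sample)
print(text, chain)
```

The number of chains grows as 3^steps, so this only stays practical for a short list of candidate encodings and a small max_steps.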