Question

I am parsing a long string of Persian text in Python, and I am opening the file like this:

fp = codecs.open(f+i, 'r', encoding='utf-8').readlines()

and then printing a line with

print(line[1])

but instead of readable Persian, the terminal shows output like this:

Ø§Ø·Ù
     Ø§Ø¹âØ±Ø³Ø§Ù

On the web page, the same text displays fine.

What is the issue? Thank you.

Solution

You have a CP1252 Mojibake here. The first character is the code point U+0627 ARABIC LETTER ALEF, encoded to UTF-8, but then interpreted as CP1252:

>>> print u'\u0627'.encode('utf8').decode('cp1252')
Ø§

Your SSH session is misconfigured somewhere: the remote shell assumes your terminal expects UTF-8 and sends UTF-8 bytes, while your local terminal displays those bytes as if they were CP1252.
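
One way to check what the Python side believes is to print the encodings it picked up. This is only a minimal sketch for Python 3, meant to be run on the remote machine; the helper name `describe_encodings` is made up for illustration and is not part of the question's code.

```python
import locale
import sys


def describe_encodings():
    """Return the encodings Python uses for terminal output and for the locale."""
    return {
        # Encoding print() uses when writing text to the terminal (may be None
        # when stdout is redirected to something without an encoding).
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding derived from the locale settings (LANG, LC_ALL, LC_CTYPE).
        "locale": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print(name, "=", value)
```

If both values report UTF-8, Python is probably doing the right thing, and the character-set setting of the local terminal emulator is the more likely place to look.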

What I can decipher is:

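The Ù character is CP1252 for the byte 0xD9, which is the UTF-8 lead byte for every code point in the U+0640 to U+067F range; the second byte is missing for both occurrences here, so the actual letters cannot be determined. Ditto for the â character (byte 0xE2, which starts a three-byte sequence): the following bytes weren't printable in CP1252, so they are again missing.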
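
The range covered by a lone lead byte follows from the UTF-8 bit layout (110xxxxx 10xxxxxx for two-byte sequences). A small Python 3 sketch of that arithmetic; `two_byte_range` is a hypothetical helper written for this illustration:

```python
def two_byte_range(lead_byte):
    """Return (lowest, highest) code point reachable from a two-byte UTF-8 lead byte."""
    if not 0xC2 <= lead_byte <= 0xDF:
        raise ValueError("not a two-byte UTF-8 lead byte")
    high_bits = (lead_byte & 0x1F) << 6  # the 5 payload bits carried by the lead byte
    return high_bits, high_bits | 0x3F   # the continuation byte supplies the low 6 bits


# "\u00d9" is the character Ù; in CP1252 it stands for the single byte 0xD9.
lead = "\u00d9".encode("cp1252")[0]
low, high = two_byte_range(lead)
print(hex(lead), "U+%04X" % low, "U+%04X" % high)  # prints: 0xd9 U+0640 U+067F
```

Without the continuation byte, any of the 64 code points in that block is possible, which is why those letters cannot be recovered.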

Overall, what I can recover is:

>>> print u'Ø§Ø· - Ø§Ø¹ - Ø±Ø³Ø§'.encode('cp1252').decode('utf8')
اط - اع - رسا
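
The same round trip can be wrapped in a small function for Python 3. This is a sketch under the assumption that the text really is UTF-8 misread as CP1252; `fix_cp1252_mojibake` is a name invented here, and it only repairs a string after the fact rather than fixing the terminal configuration.

```python
def fix_cp1252_mojibake(text):
    """Undo UTF-8 bytes that were wrongly decoded as CP1252.

    Returns the input unchanged when the round trip is not possible,
    for example when a continuation byte was lost along the way.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Build a garbled sample from ALEF (U+0627) and TAH (U+0637), then repair it.
original = "\u0627\u0637"
garbled = original.encode("utf-8").decode("cp1252")
print(fix_cp1252_mojibake(garbled) == original)  # prints: True
```

A truncated sequence such as a lone Ù fails to decode and comes back unchanged, which matches the pieces that could not be recovered above.
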
Licensed under: CC-BY-SA with attribution