Python print to terminal shell unicode

https://stackoverflow.com/questions/23415741

13-07-2023

Question

I am parsing a long string of Persian text in Python, and I am opening the file like this:

fp = codecs.open(f+i, 'r', encoding='utf-8').readlines()

and printing with

print(line[1])

but instead of readable Persian, the terminal shows output like this:

Ø§Ø·Ù
     Ø§Ø¹âØ±Ø³Ø§Ù

On the webpage, the same text displays correctly.

What is causing this? Thank you.

Solution

You have a CP1252 Mojibake here. The first character is the code point U+0627 ARABIC LETTER ALEF, encoded to UTF-8, but then interpreted as CP1252:

>>> print(u'\u0627'.encode('utf8').decode('cp1252'))
Ø§

Your SSH session is misconfigured somewhere: the remote shell thinks your terminal uses UTF-8, while your local terminal is displaying the UTF-8 bytes it receives as if they were CP1252 bytes.
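To see which side is at fault, it can help to ask Python on the remote machine what encoding it believes the terminal uses. The following is a minimal sketch; the function name describe_output_encoding is made up for illustration. If it reports UTF-8, the bytes being sent are most likely correct and the local terminal emulator is the part that needs to be switched to UTF-8.

```python
import locale
import sys


def describe_output_encoding():
    """Return the encodings Python will use when writing text to stdout."""
    return {
        # Encoding Python applies to print() output; may be None if stdout
        # has been replaced by an object without an encoding.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding derived from the locale settings (LANG, LC_ALL, ...).
        "preferred": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for name, value in describe_output_encoding().items():
        print(name, "=", value)
```

This only reports what the remote Python process assumes. It cannot detect how the local terminal interprets the bytes, so that setting has to be checked in the terminal emulator itself.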

What I can decipher is:

The Ù character is byte 0xD9 read as CP1252, which in UTF-8 is the lead byte for every code point in the U+0640 to U+067F range; we cannot see the second byte for the two occurrences here. Ditto for the â character (byte 0xE2, the lead byte of a three-byte sequence); the continuation bytes weren't printable in CP1252, so they are again missing.
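The reasoning about a lone lead byte can be checked with a little bit arithmetic. The sketch below assumes Python 3, and the helper name two_byte_lead_range is hypothetical. A two-byte UTF-8 sequence has the form 110xxxxx 10yyyyyy, so the lead byte fixes the upper five bits of the code point and the missing second byte supplies the lower six.

```python
def two_byte_lead_range(lead):
    """Return (lowest, highest) code point that a 2-byte UTF-8 lead byte can start."""
    if not 0xC2 <= lead <= 0xDF:
        raise ValueError("not a two-byte UTF-8 lead byte")
    # Keep the five payload bits of the lead byte and make room for the
    # six payload bits of the (missing) continuation byte.
    low = (lead & 0x1F) << 6
    return low, low | 0x3F


# The stray 'Ù' on screen is the single byte 0xD9 displayed as CP1252.
lead = "Ù".encode("cp1252")[0]
low, high = two_byte_lead_range(lead)
print("U+%04X to U+%04X" % (low, high))  # U+0640 to U+067F
```

Without the second byte there is no way to tell which of those 64 code points was originally printed.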

Overall, what I can recover is:

>>> print(u'Ø§Ø· - Ø§Ø¹ - Ø±Ø³Ø§'.encode('cp1252').decode('utf8'))
اط - اع - رسا
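For text that has already been copied in its garbled form, the same round trip can be wrapped in a small helper. This is a sketch under the assumption of Python 3; fix_cp1252_mojibake is an illustrative name, not a library function. It returns the input unchanged when the round trip fails, which happens when bytes are missing, as with the lone Ù above.

```python
def fix_cp1252_mojibake(text):
    """Undo a UTF-8-read-as-CP1252 mix-up, or return text unchanged if that fails."""
    try:
        # Recover the original bytes, then decode them as what they really are.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text contains characters CP1252 cannot represent
        # (so it was never this kind of Mojibake), or the recovered bytes
        # are not valid UTF-8 (for example, a truncated sequence).
        return text


print(fix_cp1252_mojibake("Ø§Ø·"))  # اط
```

This is only a way to read damaged text after the fact. The underlying fix is still to make the terminal and the remote shell agree on UTF-8.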

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow