Question

here is my code:

# -*- coding: utf-8 -*-
array=["à","á","â","ã","ä","å","æ","ç","è","é","ê","ë","ì","í","î","ï","ð","ñ","ó","ô","õ","ö","ø","ù","ú","û","ü","ý","þ","ÿ"]
array1=["א","ב","ג","ד","ה","ו","ז","ח","ט","י","ך","כ","ל","ם","מ","ן","נ","ס","ע","ף","פ","ץ","צ","ק","ר","ש","ת"]
str="áï éäåãä"
message=""
for i in range(0,len(str)):
    s=str[i]
    index=-1
    for j in range(0,len(array)):
        if(array[j]==s):
            index=j
            break
    if(index!=-1):
        message+=array1[index]
        print array1[index]
print message

the error is:

SyntaxError: EOL while scanning string literal

in line 2

I have a text file in Hebrew, but it always displays as gibberish, no matter what the encoding is. This is a Python program to convert it to Hebrew. The original file is in ISO-8859-1.

Was it helpful?

Solution

As @Martijn suggests, decoding your original file correctly would be a better solution. If your file is Hebrew but displays the Latin characters in your array, it is probably being decoded as latin1 or cp1252; cp1255 looks like a close match (and perhaps your array1 isn't quite right). Also note that strings are iterable, so you can simplify your arrays:

# coding: utf8
array  = u'àáâãäåæçèéêëìíîïðñóôõöøùúûüýþÿ'
array1 = u'אבגדהוזחטיךכלםמןנסעףפץצקרשת'
print(array)
print(array1)
print(array.encode('cp1252').decode('cp1255',errors='replace'))

The last line above reverses the "incorrect" encoding and decodes it with cp1255 (a Hebrew encoding) instead. Output:

àáâãäåæçèéêëìíîïðñóôõöøùúûüýþÿ
אבגדהוזחטיךכלםמןנסעףפץצקרשת
אבגדהוזחטיךכלםמןנסףפץצרשת��‎‏�

It's not a perfect match, but close enough that I think your original file was encoded with cp1255.
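
If that diagnosis is right, the whole lookup loop in the question collapses to a single round trip. Here is a minimal sketch of that, assuming Python 3 (the garbled sample string is taken from the question):

# coding: utf8
# Undo the mojibake: re-encode the "gibberish" back to the cp1252 bytes
# it came from, then decode those bytes with the Hebrew codec (cp1255).
garbled = "áï éäåãä"
fixed = garbled.encode('cp1252').decode('cp1255')
print(fixed)  # בן יהודה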

Other tips

You used a ' where you should have used a ":

'ÿ"

for the last entry in:

array=["à","á","â","ã","ä","å","æ","ç","è","é","ê","ë","ì","í","î","ï","ð","ñ","ó","ô","õ","ö","ø","ù","ú","û","ü","ý","þ",'ÿ"]

Make that single quote a double.

As for your translation program: it sounds as if your file's encoding is incorrect, or it is being decoded incorrectly. Perhaps you should figure out the correct encoding instead, rather than blindly replacing Latin-1 bytes with UTF-8 sequences for Hebrew codepoints?

If you were to use the codecs module to open the file with the correct codec and decode to Unicode, you would most probably find the data is correctly encoded anyway.
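
A minimal sketch of that approach, assuming the file really is cp1255-encoded; the filename hebrew.txt is hypothetical:

import codecs

# Open the file with the actual codec; read() then returns a Unicode
# string of real Hebrew letters, so no manual character mapping is needed.
with codecs.open('hebrew.txt', 'r', encoding='cp1255') as f:
    text = f.read()
print(text)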

I strongly urge you to study up on Unicode, codecs and Python before you continue.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow