Question

I've been struggling with encoding for a while as I'm building a multilingual database with sqlite3 in Python. So far, I've solved everything, thanks to Google and articles on Stack Overflow. I had problems with Russian, Slovenian, Polish, Spanish, French... but it's all solved, apart from this ONE file I can't fix.

I thought I had found a possible solution on this website: http://www.smashingmagazine.com/2012/06/06/all-about-unicode-utf8-character-sets/. I even found a decoder, which got me really close to solving the problem, but it only produced partially understandable Russian... (I'm sure it can help in other cases though: http://2cyr.com/decode/?lang=fr and it also exists in English).

But this last file is gonna be the end of me. Here's the major issue: I KNOW it's Russian because the linguist who gave it to me built it, and knows it's in Russian. BUT, the file itself looks like this:

£ËÁÀÝÅÅ UNK £ËÁÀÝÉÊ UNKA
£ËÁÀÝÅÇÏ    UNK £ËÁÀÝÉÊ UNKA
£ËÁÀÝÅÊ UNK £ËÁÀÝÉÊ UNKA
£ËÁÀÝÅÍ UNK £ËÁÀÝÉÊ UNKA
£ËÁÀÝÅÍÕ    UNK £ËÁÀÝÉÊ UNKA

According to my shell, it's encoded in UTF-8. I've therefore been trying to decode it as UTF-8 and encode it into all the Russian encodings I could find (ISO-8859-5, koi8_r, koi8_u, cp1252, cp1251...). It never worked. I also tried saving the file in all these encodings and decoding the other way around, without much success...

It has to go in a database (SQLite), and I know the required encoding for this is UTF-8. The previous Russian file I dealt with was "properly" written (in Cyrillic), and I just had to figure out which encoding to use. But here, I feel like I've tried everything, and I'm just not getting any results...

I'm actually wondering if decoding such a file is even possible, since it's not Cyrillic to start with.

Any suggestion would be welcome :)

Solution

The first and foremost problem: the text is not really UTF-8; the underlying bytes are KOI8-R. If you need to decode it in Python, you can refer to this answer (string encode / decode), which might give you some clues.
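As a rough illustration of that decoding in Python 3, here is a minimal sketch. It assumes the garbled characters are KOI8-R bytes that were read as Latin-1 somewhere along the way (which would also explain why the shell reports the file as UTF-8). The function name `repair_koi8r` is made up for this example.

```python
def repair_koi8r(garbled: str) -> str:
    """Undo KOI8-R text that was mistakenly decoded as Latin-1."""
    # Latin-1 maps code points 0-255 one-to-one onto byte values,
    # so encoding with it recovers the original bytes.
    raw = garbled.encode("latin-1")
    # Interpret those bytes with the encoding they were written in.
    return raw.decode("koi8_r")


# First sample line from the question.
sample = "£ËÁÀÝÅÅ UNK £ËÁÀÝÉÊ UNKA"
fixed = repair_koi8r(sample)
```

If the file on disk turns out to hold raw KOI8-R bytes rather than UTF-8-encoded garbage, opening it with `open(path, encoding="koi8_r")` (where `path` is your file) should be enough on its own.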

I have decoded the text specified by you - enjoy:

ёкающее UNK ёкающий UNKA
ёкающего    UNK ёкающий UNKA
ёкающей UNK ёкающий UNKA
ёкающем UNK ёкающий UNKA
ёкающему    UNK ёкающий UNKA
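Since the goal is to get this into SQLite, here is a small sketch of the whole round trip. It assumes the lines reach Python in the garbled form shown in the question, and that each line has four whitespace-separated fields. The table and column names (`forms`, `form`, `tag`, `lemma`, `lemma_tag`) are invented for illustration. Python's `sqlite3` module takes ordinary `str` values, and SQLite stores text as UTF-8 by default, so no extra encoding step should be needed once the text is decoded.

```python
import sqlite3

# Garbled sample lines copied from the question.
garbled = [
    "£ËÁÀÝÅÅ UNK £ËÁÀÝÉÊ UNKA",
    "£ËÁÀÝÅÊ UNK £ËÁÀÝÉÊ UNKA",
]

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(
    "CREATE TABLE forms (form TEXT, tag TEXT, lemma TEXT, lemma_tag TEXT)"
)

for line in garbled:
    # Recover the original bytes, then decode them as KOI8-R.
    fixed = line.encode("latin-1").decode("koi8_r")
    # Split on whitespace into the four assumed fields.
    conn.execute("INSERT INTO forms VALUES (?, ?, ?, ?)", fixed.split())

rows = conn.execute(
    "SELECT form, lemma FROM forms ORDER BY rowid"
).fetchall()
conn.close()
```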
Licensed under: CC-BY-SA with attribution