Question

I cannot read Cyrillic characters in PHP from a .txt file with an unknown encoding. I have tried almost everything I could find on the web. Which PHP function do I need to use to get the contents of this file?

https://www.dropbox.com/s/w7cex4wiogyytvm/100004-6.txt

EDIT

Input:

    $path = WWW_ROOT . 'files' . DS . '100002-6.txt';
    $string = file_get_contents($path);
    debug($string);

Output: debug() breaks, and saving the value to the database fails (the BOM causes trouble and the value cannot be saved).

Input:

    $path = WWW_ROOT . 'files' . DS . '100002-6.txt';
    $string = file_get_contents($path);
    $string = mb_convert_encoding ($string , 'utf-8');
    debug($string);

Output:

    '????? ???:300/500V
    ???? ???:2000V
    ????? ???? ??????: ? +70??
    ?? ??? ?? (????? 5 ??.): ? +160??
    ????? ?????? ?? ?????: ? +5??   '

Input:

    $path = WWW_ROOT . 'files' . DS . '100002-6.txt';
    $string = file_get_contents($path);
    $string = iconv("UTF-16", "UTF-8//TRANSLIT//IGNORE", $string);
    debug($string);

Output:

췮㌰〯㔰ざഊ죱㈰〰嘍્⃰⃲㨠‫㜰냑ഊ쿰⃱밠⣭㔠⤺⃤⬱㘰냑ഊ췠볭

Input:

    $path = WWW_ROOT . 'files' . DS . '100002-6.txt';
    $string = file_get_contents($path);
    $string = iconv("ISO-8859-5", "UTF-8//TRANSLIT//IGNORE", $string);
    debug($string);

Output:

    Эюьшэрыхэ эряюэ:300/500V
    Шёяшђхэ эряюэ:2000V
    ЭрМтшёюър №рсюђэр ђхьях№рђѓ№р: фю +70Аб
    Я№ш ъ№рђюъ ёяюМ (эрМьэюуѓ 5 ёхъ.): фю +160Аб
    ЭрМэшёър ђхьях№рђѓ№р я№ш шэёђрырішМр: фю +5Аб

Now that I have tested multiple files, I no longer think the input file is Unicode-encoded. I succeeded in reading my test file, but the one that matters (whose encoding I don't know) still fails. So I changed the question; the encoding is still unknown.

A little more for clarification: I can open this file in Notepad and it displays normally. It contains the Cyrillic characters that cause this problem.


Solution

The file is encoded in CP1251 a.k.a. MS-CYRL a.k.a. "Cyrillic (Windows)".

    $string = file_get_contents($path);
    $string = iconv('CP1251', 'UTF-8', $string);

How did I figure this out? I opened the file in a text editor and tried a few relevant encodings until it looked right. There is hardly anything else you can do if the file encoding is unknown.
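
The same trial-and-error can be done from PHP itself. What follows is only a rough sketch, not something from the original answer: `previewEncodings()` is a hypothetical helper, the candidate list is an assumption about which Cyrillic encodings are worth trying, and the sample text is made up to stand in for the real file. It converts the same bytes from each candidate to UTF-8 and prints the results so that a person can pick the one that reads correctly. Single-byte encodings rarely produce conversion errors, so the script cannot choose for you.

```php
<?php
// Hypothetical helper (not part of the original answer): convert the raw
// bytes from each candidate encoding to UTF-8 for visual comparison.
function previewEncodings(string $bytes, array $candidates): array
{
    $previews = [];
    foreach ($candidates as $encoding) {
        // iconv() returns false when the conversion fails, for example if
        // the encoding name is not supported on this system.
        $converted = @iconv($encoding, 'UTF-8', $bytes);
        $previews[$encoding] = ($converted === false) ? null : $converted;
    }
    return $previews;
}

// Stand-in for file_get_contents($path): a made-up sample encoded as CP1251.
$bytes = iconv('UTF-8', 'CP1251', 'Привет, мир');

// Quick check: preg_match() with the /u modifier fails on invalid UTF-8,
// so a result of 1 means the bytes are already valid UTF-8.
$isUtf8 = preg_match('//u', $bytes) === 1;

// Assumed list of common single-byte Cyrillic encodings.
$previews = previewEncodings($bytes, ['CP1251', 'KOI8-R', 'ISO-8859-5']);

foreach ($previews as $encoding => $text) {
    echo str_pad($encoding, 12), $text ?? '(conversion failed)', PHP_EOL;
}
```

With a real file, replace the sample line with `file_get_contents($path)` and look for the row that shows readable text; that row's encoding name is the first argument to pass to `iconv()`.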

Licensed under: CC-BY-SA with attribution