on http://www.gnu.org/software/libiconv/ there are like 20 types of encoding for Chinese:

Chinese EUC-CN, HZ, GBK, CP936, GB18030, EUC-TW, BIG5, CP950, BIG5-HKSCS, BIG5-HKSCS:2004, BIG5-HKSCS:2001, BIG5-HKSCS:1999, ISO-2022-CN, ISO-2022-CN-EXT

So I have a text file that is not UTF-8. It's ASCII. And I want to convert it to UTF-8 using iconv(). But for that I need to know the character encoding of the source.

How can I do that if I don't know chinese? :(

I noticed that:

$str = iconv('GB18030', 'UTF-8', $str);
file_put_contents('file.txt', $str);

produces an UTF-8 encoded file, while other encodings I tried (CP950, GBK and EUC-CN) produced an ASCII file. Could that mean that iconv is able to detect if the input encoding is wrong for the given string?

有帮助吗?

解决方案

This may work for your needs (but I really cant tell). Setting the locale and utf8_decode, and using mb_check_encoding instead of mt_detect_encoding seems to give some useful output..

// some text from http://chinesenotes.com/chinese_text_l10n.php
// have tried both as string and content loaded from a file
$chinese = '譧躆 礛簼繰 剆坲姏 潧 騔鯬 跠 瘱瘵瘲 忁曨曣 蛃袚觙';
$chinese=utf8_decode($chinese);

$chinese_encodings ='EUC-CN,HZ,GBK,CP936,GB18030,EUC-TW,BIG5,CP950,BIG5-HKSCS,BIG5-HKSCS:2004,BIG5-HKSCS:2001,BIG5-HKSCS:1999,ISO-2022-CN,ISO-2022-CN-EXT';

$encodings = explode(',',$chinese_encodings);

//set chinese locale
setlocale(LC_CTYPE, 'Chinese');

foreach($encodings as $encoding) {
    if (@mb_check_encoding($chinese, $encoding)) {
        echo 'The string seems to be compatible with '.$encoding.'<br>';
    } else {
        echo 'Not compatible with '.$encoding.'<br>';
    }
}

outputs

The string seems to be compatible with EUC-CN
The string seems to be compatible with HZ
The string seems to be compatible with GBK
The string seems to be compatible with CP936
Not compatible with GB18030
The string seems to be compatible with EUC-TW
The string seems to be compatible with BIG5
The string seems to be compatible with CP950
Not compatible with BIG5-HKSCS
Not compatible with BIG5-HKSCS:2004
Not compatible with BIG5-HKSCS:2001
Not compatible with BIG5-HKSCS:1999
Not compatible with ISO-2022-CN
Not compatible with ISO-2022-CN-EXT

It is total guess. Now it at least seems to recognise some of the chinese encodings. Delete if it is total junk.

其他提示

I have zero experience with chinese encoding and I know this question is tagged iconv, but if it will get the job done, then you may try mb_detect_encoding to detect your encoding; The second argument is list of encodings to check, and there is a user-crafted comment about chinese encodings:

For Chinese developers: please note that the second argument of this function DOES NOT include 'GB2312' and 'GBK' and the return value is 'EUC-CN' when it is detected as a GB2312 string.

so maybe it will work if you explicitly provide full list of chinese encodings as a second argument? It could work like this:

$encoding = mb_detect_encoding($chineseString, 'GB2312,GBK,(...)');
if($encoding) $utf8text = iconv($encoding, 'UTF-8', $str);

you may also want to play with third argument (strict)

What makes it hard to detect the encoding is the fact that octet sequences decode to valid characters in several encodings, but the result makes sense in only the correct encoding. What I've done in these cases is take the decoded text and go to an automatic translation service and see if you get back legible text or a jumble of syllables.

You can do this programmatically for example by analyzing trigraph frequencies in the input text. Libraries like this one have already been created to solve this problem, and there are external programs that do it, but I have yet to see anything with a PHP API. This approach is not fool-proof though.

许可以下: CC-BY-SA归因
不隶属于 StackOverflow
scroll top