Question

This is a strange scenario, not the conventional case of converting one encoding to another.

I use the Readability API to retrieve the main content from a given URL. It works fine if the target page is encoded in UTF-8, but when the target page is encoded in GB2312 (one of the Chinese encodings), I get garbage instead: the Chinese characters are wrongly encoded, while English letters and digits come through fine.

Deep Research

I inspected the HTTP headers that the Readability API returns; they indicate that the API response is encoded in UTF-8.

Here's a snippet of the wrongly encoded Chinese characters:

ÄÉ´ï¶û¾ø¾³ÏÂ´ó·´»÷¾Ü¾øÀäÃÅÄæ×ª½ú¼¶ÖÐÍøËÄÇ¿

Length: 42

The original text is:

纳达尔绝境下大反击拒绝冷门逆转晋级中网四强

Length: 21

However, if you convert the correct Chinese into Unicode, it should be:

纳达尔绝境下大反击拒绝冷门逆转晋级中网四强
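The two lengths above fit the pattern of each GB2312 byte being treated as a separate single-byte character, since GB2312 stores every Chinese character in two bytes. A minimal sketch to check this, assuming the script file is saved as UTF-8 and that the local iconv build knows the name "GB2312":

```php
<?php
// Assumption: this source file is saved as UTF-8.
$utf8 = "纳达尔绝境下大反击拒绝冷门逆转晋级中网四强";

// Re-encode as GB2312: each Chinese character becomes two bytes.
$gb = iconv("UTF-8", "GB2312", $utf8);

echo mb_strlen($utf8, "UTF-8"), "\n"; // 21 characters
echo strlen($gb), "\n";               // 42 bytes

// Reading every GB2312 byte as one ISO-8859-1 character reproduces
// the kind of garbled text shown in the snippet.
echo mb_convert_encoding($gb, "UTF-8", "ISO-8859-1"), "\n";
```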

Tried But Not Working

iconv("GB2312", "UTF-8", $str);
iconv("GBK", "UTF-8", $str);
iconv("GB18030", "UTF-8", $str);
mb_convert_encoding($str, "UTF-8", "GB2312");
mb_convert_encoding($str, "UTF-8", "GB18030");
mb_convert_encoding($str, "UTF-8", "GBK");

Solution Requested

Since the Readability API doesn't provide a parameter for the charset of the target URL (I call the API like https://www.readability.com/api/content/v1/parser?url=http://sports.sina.com.cn/t/2013-10-04/14596813815.shtml&token=my_token_here), I have to do the conversion when handling the API response.

I would very much appreciate any ideas about this issue.

Environment Info: PHP 5.3.6


Solution

It seems that the individual bytes that make up the characters have been encoded as HTML numeric entities as if they were characters from ISO-8859-1 or some other 8-bit encoding. To undo the numeric entity encoding you can use mb_decode_numericentity:

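// The garbled text as numeric entities, one entity per GB2312 byte.
$str = "&#196;&#201;&#180;&#239;&#182;&#251;&#190;&#248;&#190;&#179;&#207;&#194;&#180;&#243;&#183;&#180;&#187;&#247;&#190;&#220;&#190;&#248;&#192;&#228;&#195;&#197;&#196;&#230;&#215;&#170;&#189;&#250;&#188;&#182;&#214;&#208;&#205;&#248;&#203;&#196;&#199;&#191;";

// Turn each entity (0-255) back into a single raw byte.
$str = mb_decode_numericentity($str, array(0, 255, 0, 255), "ISO-8859-1");

// The raw bytes are GB2312; convert them to UTF-8.
echo iconv("GB2312", "UTF-8", $str);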

This produces the expected output of 纳达尔绝境下大反击拒绝冷门逆转晋级中网四强.
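For repeated use, the two steps can be wrapped in a small function. The following is only a sketch: the name fix_entity_encoded_bytes and its $sourceCharset parameter are made up for illustration, and it assumes the input contains only ASCII plus byte-wise numeric entities. Unlike the convmap in the answer, this one starts at 0x80, so entities for ASCII characters (for example an escaped angle bracket) are left untouched.

```php
<?php
// Hypothetical helper, not part of any library.
// Assumption: $text is ASCII plus entities such as &#196;, one per byte.
function fix_entity_encoded_bytes($text, $sourceCharset = "GB2312")
{
    // Decode only entities in the range 128..255 back to single bytes.
    $convmap = array(0x80, 0xFF, 0, 0xFF);
    $bytes = mb_decode_numericentity($text, $convmap, "ISO-8859-1");

    // Interpret the recovered bytes in the page's real charset.
    $utf8 = iconv($sourceCharset, "UTF-8", $bytes);

    // If iconv fails, fall back to the input unchanged.
    return $utf8 === false ? $text : $utf8;
}

echo fix_entity_encoded_bytes("&#196;&#201;&#180;&#239;&#182;&#251;"), "\n"; // 纳达尔
```

If the target page might use characters outside GB2312, passing "GBK" or "GB18030" as the second argument may be worth trying, since both are supersets of GB2312.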

Licensed under: CC-BY-SA with attribution