I have a system whose HTML encoding was previously set to ISO-8859-1, which caused all Chinese characters to be stored in the format "&#36830;&#34915;&#35033;".

So my question is: how can I convert the format above back into Chinese characters in UTF-8?

For your information, I have tried utf8_decode and iconv, but neither of them works. :(

Thank you very much.


Solution

The current text encoding of that string is largely irrelevant. What you have there are HTML entities (numeric character references); they have little to do with the underlying "physical" encoding like ISO-8859-1 or UTF-8. What you want is to decode those HTML entities into a byte representation of the characters in a specific encoding, in this case UTF-8. Therefore:

echo html_entity_decode('&#36830;&#34915;&#35033;', ENT_COMPAT, 'UTF-8');
// 连衣裙
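A slightly fuller sketch of the same call, wrapped in a helper so it can be applied to every stored value. The function name `entities_to_utf8` and the choice of `ENT_QUOTES | ENT_HTML5` are assumptions for illustration, not part of the original answer:

```php
<?php
// Hypothetical helper name; it only wraps PHP's built-in html_entity_decode().
function entities_to_utf8(string $stored): string
{
    // ENT_QUOTES also decodes single-quote entities; ENT_HTML5 selects the HTML5 entity table.
    return html_entity_decode($stored, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

$stored  = '&#36830;&#34915;&#35033;';
$decoded = entities_to_utf8($stored);

echo $decoded, "\n";         // 连衣裙
echo strlen($decoded), "\n"; // 9 (three characters, three UTF-8 bytes each)
```

The byte count is a quick sanity check that the result really is UTF-8 rather than three unconverted placeholders.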

Other tips

You need to use:

utf8_encode($data);

and not utf8_decode, to convert your current ISO-8859-1 data to UTF-8.
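Note that utf8_encode() is deprecated as of PHP 8.2, and that this kind of conversion only applies to text actually stored as ISO-8859-1 bytes. A minimal sketch of the same conversion using mb_convert_encoding(), assuming the mbstring extension is loaded; the helper name `latin1_to_utf8` is hypothetical:

```php
<?php
// Hypothetical helper name; requires the mbstring extension.
function latin1_to_utf8(string $data): string
{
    // Arguments are: input string, target encoding, source encoding.
    return mb_convert_encoding($data, 'UTF-8', 'ISO-8859-1');
}

$latin1 = "caf\xE9";           // "café" in ISO-8859-1 (é is the single byte 0xE9)
$utf8   = latin1_to_utf8($latin1);

echo bin2hex($utf8), "\n";     // 636166c3a9 (é became the two bytes C3 A9)
```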

Some native PHP functions such as strtolower(), strtoupper() and ucfirst() do not always work correctly with UTF-8 strings. Possible solutions: convert to Latin-1 first, or add the following line to your code:

setlocale(LC_CTYPE, 'C');
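Another option, not mentioned in this answer, is to use the multibyte-aware counterparts from the mbstring extension. A short sketch, assuming mbstring is available:

```php
<?php
$word = "\u{00E9}cole"; // "école" as a UTF-8 string

// Multibyte-aware case conversion; the encoding is passed explicitly.
$upper = mb_strtoupper($word, 'UTF-8');
$title = mb_convert_case($word, MB_CASE_TITLE, 'UTF-8');

echo $upper, "\n"; // ÉCOLE
echo $title, "\n"; // École
```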

Make sure not to save your PHP files with a UTF-8 BOM (Byte Order Mark); otherwise your browser might show the BOM characters between PHP pages on your site.
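If a file or string may already carry a BOM, it can be detected and removed in code. A minimal sketch; the helper name `strip_utf8_bom` is hypothetical:

```php
<?php
// Hypothetical helper: removes a leading UTF-8 BOM (bytes EF BB BF) if present.
function strip_utf8_bom(string $contents): string
{
    if (strncmp($contents, "\xEF\xBB\xBF", 3) === 0) {
        // The cast guards against older PHP versions returning false for an empty remainder.
        return (string) substr($contents, 3);
    }
    return $contents;
}

$withBom = "\xEF\xBB\xBF<html>";
echo strip_utf8_bom($withBom), "\n"; // <html>
```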

Just for your reference:

ISO-8859-1 => Albanian, Brazilian, Catalan, Danish, Dutch, English, Finnish, French, German, Portuguese, Norwegian, Spanish, Swedish

UTF-8 => Chinese (simplified), Chinese (traditional), Japanese, Persian

There are many tools that can convert character references to characters, and writing such a tool is rather straightforward, especially if you know the references are all decimal. So the answer really depends on the software environment.
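As an illustration of how small such a tool can be, here is a sketch that handles decimal references only. It assumes PHP 7.2 or later with the mbstring extension (for mb_chr()); the function name `decode_decimal_ncr` is hypothetical:

```php
<?php
// Hypothetical helper: converts decimal numeric character references (&#NNNN;) to UTF-8.
function decode_decimal_ncr(string $text): string
{
    $result = preg_replace_callback(
        '/&#(\d+);/',
        static function (array $m): string {
            $char = mb_chr((int) $m[1], 'UTF-8');
            // mb_chr() returns false for invalid code points; leave those references untouched.
            return $char === false ? $m[0] : $char;
        },
        $text
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $text;
}

echo decode_decimal_ncr('&#36830;&#34915;&#35033;'), "\n"; // 连衣裙
```

Hexadecimal references (&#x...;) and named entities are deliberately left alone here; html_entity_decode() covers those cases.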

For example, to do such a conversion for an individual HTML document, you could use the BabelPad editor: command Convert → Numeric Character References (NCR) → NCR to Unicode, and save the result as UTF-8.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow