Question

I'm using the mb_detect_encoding() function to check whether a string contains non-Latin-1 (ISO-8859-1) characters.

Since Japanese isn't part of Latin-1, I'm using it as the text in the test string, yet when the string is passed to the function it reports a match for ISO-8859-1. Example code:

$str = "これは日本語のテキストです。読めますか";
$res = mb_detect_encoding($str,"ISO-8859-1",true);

print $res;

I've tried using 'ASCII' instead of 'ISO-8859-1', which correctly returns false. Is anyone able to explain the discrepancy?

Solution

I wanted to be funny and say hexdump could explain it:

0000000 81e3 e393 8c82 81e3 e6af a597 9ce6 e8ac
0000010 9eaa 81e3 e3ae 8683 82e3 e3ad b982 83e3
0000020 e388 a781 81e3 e399 8280 aae8 e3ad 8182
0000030 81e3 e3be 9981 81e3 0a8b  

But alas, it's quite the opposite. (Read the dump with byte pairs swapped: hexdump's default output prints 16-bit little-endian words, so 81e3 is actually the byte sequence E3 81.)

In ISO-8859-1, practically the only problematic code points are \x80-\x9F (the C1 control range). But those are exactly the byte values that the UTF-8 continuation bytes of your Japanese characters occupy.
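To see this concretely, here's a quick sketch (plain PHP, no extensions needed) that lists every byte of the UTF-8 string falling into that range:

```php
<?php
// Collect every byte of the UTF-8 string that lands in \x80-\x9F,
// the C1 control range that plain Latin-1 text should never contain.
$str = "これは日本語のテキストです。読めますか";
$suspect = [];
for ($i = 0; $i < strlen($str); $i++) {   // strlen() counts bytes, not characters
    $byte = ord($str[$i]);
    if ($byte >= 0x80 && $byte <= 0x9F) {
        $suspect[] = sprintf("\\x%02X", $byte);
    }
}
echo count($suspect) . " bytes in \\x80-\\x9F: " . implode(' ', $suspect) . "\n";
```

For this string you get well over a dozen such bytes, which is exactly why a strict Latin-1 check ought to reject it.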

Anyway, mb_detect_encoding uses heuristics, and it fails in this example. My conjecture is that it mistakes ISO-8859-1 for ISO-8859-15 or, worse, CP1252, the incompatible Windows charset, which assigns printable characters to those code points.
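One practical consequence (behavior can vary across PHP versions): the order of the candidate list matters, so putting UTF-8 ahead of ISO-8859-1 lets the stricter check win before the permissive one is tried:

```php
<?php
$str = "これは日本語のテキストです。読めますか";

// Only ISO-8859-1 offered: the heuristic has nothing stricter to try,
// so it reports a match even though the bytes are really UTF-8.
var_dump(mb_detect_encoding($str, "ISO-8859-1", true));

// UTF-8 listed first: the stricter encoding is tried (and matches)
// before the permissive fallback is ever considered.
var_dump(mb_detect_encoding($str, ["UTF-8", "ISO-8859-1"], true)); // "UTF-8"
```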

I would suggest a workaround: test it yourself. The one check that guarantees a byte in a string is certainly not a Latin-1 text character is:

preg_match('/[\x7F-\x9F]/', $str);
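Wrapped up as a small helper (the function name is my own invention, not anything built in):

```php
<?php
// Hypothetical helper: true only when no byte of $str falls into the
// \x7F-\x9F range, i.e. nothing that plain Latin-1 text would contain.
// Note: without the /u modifier, preg_match matches byte by byte here,
// which is exactly what we want.
function looks_like_latin1(string $str): bool
{
    return preg_match('/[\x7F-\x9F]/', $str) === 0;
}

var_dump(looks_like_latin1("Hello, world"));          // true
var_dump(looks_like_latin1("これは日本語のテキスト")); // false: UTF-8 continuation bytes hit the range
```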

I'm linking to the German Wikipedia, because their article shows the differences best: http://de.wikipedia.org/wiki/ISO_8859-1

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow