Question

So I ran into this issue and I've simplified it as much as possible.

$test = 'XXX' . chr(241) . 'XXX';
print($test); // XXX�XXX
print(mb_strlen($test, 'UTF-8')); // 4
print(count(str_split($test))); // 7

So basically my question is: why doesn't chr(241) add one single character, making the length of the string 7? The string is six characters, I add one, and now it's four characters? Why is chr(241) not equal to HTML entity 241?

Other information listed below. Note that as long as you don't add X AFTER the chr(241), everybody is happy:

print(mb_detect_encoding($test)); // UTF-8
print(mb_strlen('XX' . chr(241) . 'XX', 'UTF-8')); // 3
print(mb_strlen('X' . chr(241) . 'X', 'UTF-8')); // 2
print(mb_strlen('' . chr(241) . 'X', 'UTF-8')); // 1
print(mb_strlen('X' . chr(241) . '', 'UTF-8')); // 2
print(mb_strlen('XXX' . chr(241) . '', 'UTF-8')); // 4
print(mb_strlen(chr(241), 'UTF-8')); // 1

It seems like an encoding issue but how? The file is saved as UTF-8, the internal encoding is UTF-8, and I'm not passing data anywhere to mess it up.


Solution

In UTF-8, the ASCII characters (code points 0 to 127) are each represented by a single byte of the form 0xxxxxxx, and code points above 127 are represented by multi-byte sequences. A multi-byte sequence consists of a leading byte followed by one or more continuation bytes.

The leading byte's high-order bits tell us how many continuation bytes follow: it starts with two or more 1 bits followed by a 0, i.e. its high bits are 110, 1110 or 11110 (the original UTF-8 design also allowed longer prefixes such as 111110, but current UTF-8 stops at four bytes). The number of leading 1 bits equals the total length of the sequence, that is, the leading byte plus the continuation bytes:

110   means 1 leading byte + 1 continuation byte 
1110  means 1 leading byte + 2 continuation bytes
11110 means 1 leading byte + 3 continuation bytes

Continuation bytes which follow a leading byte have the format 10xxxxxx.
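
The bit patterns above can be checked with a few bit masks. The following is only a minimal sketch: classifyUtf8Byte() is a hypothetical helper written for this illustration, not a PHP built-in, and it looks at a single byte in isolation without validating the whole sequence.

```php
<?php
// Hypothetical helper (not part of PHP): classify one byte by its high-order bits.
function classifyUtf8Byte(string $byte): string
{
    $value = ord($byte[0]);
    if (($value & 0x80) === 0x00) {
        return 'ascii';         // 0xxxxxxx
    }
    if (($value & 0xC0) === 0x80) {
        return 'continuation';  // 10xxxxxx
    }
    if (($value & 0xE0) === 0xC0) {
        return 'lead-2';        // 110xxxxx: leading byte + 1 continuation byte
    }
    if (($value & 0xF0) === 0xE0) {
        return 'lead-3';        // 1110xxxx: leading byte + 2 continuation bytes
    }
    if (($value & 0xF8) === 0xF0) {
        return 'lead-4';        // 11110xxx: leading byte + 3 continuation bytes
    }
    return 'invalid';           // 11111xxx is not used by current UTF-8
}

// Print each byte in binary next to its classification.
foreach (['X', chr(241), chr(0x82), chr(0xe2)] as $byte) {
    printf("%08b %s\n", ord($byte), classifyUtf8Byte($byte));
}
```

With this helper, 'X' is classified as ascii and chr(241) as lead-4, which is the starting point for the explanation that follows.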

Applying the above to your $test string:

First we have three 'X' bytes. ord('X') is below 128, so each of them is an ASCII character and counts as one character per byte.

Then we have chr(241), whose binary representation is 11110001, so it is a leading byte, since it starts with two or more high-order 1 bits.

Since it starts with four 1 bits, the code point it announces consists of 1 leading byte plus 3 continuation bytes. The 3 'X' bytes that remain in the string are therefore treated by mb_strlen() as continuation bytes*, and although together with chr(241) they make up four bytes, they are counted as a single UTF-8 code point. That gives 3 + 1 = 4.

*Here we must state that those trailing 'X's are not valid continuation bytes, since they do not have the 10xxxxxx form of a continuation byte. However, mb_strlen() will consume up to 3 more bytes after the chr(241), as explained above. You can test this by adding another 'X' to, or removing 'X's from, the end of the $test string.
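
To make the counting rule concrete, here is a small simulation of it. This is a sketch under stated assumptions: leadByteDrivenLength() is a hypothetical function written for this answer, it only mimics the "trust the leading byte" behaviour described above, and the real mb_strlen() may behave differently between PHP versions.

```php
<?php
// Hypothetical helper: count "characters" by trusting only the leading byte
// and skipping the announced number of bytes without validating them.
function leadByteDrivenLength(string $bytes): int
{
    $count = 0;
    $i = 0;
    $n = strlen($bytes);
    while ($i < $n) {
        $value = ord($bytes[$i]);
        if ($value >= 0xF0) {
            $step = 4;  // 1111xxxx: treated as a 4-byte sequence (simplification)
        } elseif ($value >= 0xE0) {
            $step = 3;  // 1110xxxx
        } elseif ($value >= 0xC0) {
            $step = 2;  // 110xxxxx
        } else {
            $step = 1;  // ASCII byte or stray continuation byte
        }
        $i += $step;    // may run past the end of the string, which ends the loop
        $count++;
    }
    return $count;
}

echo leadByteDrivenLength('XXX' . chr(241) . 'XXX'), PHP_EOL; // 4
echo leadByteDrivenLength('XX' . chr(241) . 'XX'), PHP_EOL;   // 3
echo leadByteDrivenLength('X' . chr(241)), PHP_EOL;           // 2
echo leadByteDrivenLength(chr(241) . 'X'), PHP_EOL;           // 1
```

The simulated counts match the numbers reported in the question, which supports the idea that the trailing 'X' bytes are swallowed by the lone leading byte.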

UPDATE: Verifying the findings:

/*
 * The following strings are not valid UTF-8.
 * We test whether mb_strlen() consumes invalid UTF-8
 * byte strings as if they were valid (driven by the leading bytes).
 */

/*
 * 0xc0 as a leading byte should consume one continuation byte,
 * so the length reported should be 6
 */
$test = 'XXX' . chr(0xc0) . 'XXX';
echo '6 == ', mb_strlen($test, 'UTF-8'), PHP_EOL;

/*
 * 0xe0 as a leading byte should consume two continuation bytes,
 * so the length reported should be 5
 */
$test = 'XXX' . chr(0xe0) . 'XXX';
echo '5 == ', mb_strlen($test, 'UTF-8'), PHP_EOL;

// results in 6 == 6 and 5 == 5
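
As a complement, the sketch below checks that the string from the question is not valid UTF-8 at all, and shows one way to get the character the question seems to expect (U+00F1, the code point behind HTML entity 241). It assumes the mbstring extension is loaded and PHP 7.2 or later for mb_chr(); the variable names are only illustrative.

```php
<?php
// A lone 0xF1 byte followed by ASCII is not a valid UTF-8 sequence.
$broken = 'XXX' . chr(241) . 'XXX';

// mb_chr() encodes code point 241 (U+00F1) as the two bytes 0xC3 0xB1.
$valid = 'XXX' . mb_chr(241, 'UTF-8') . 'XXX';

var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($valid, 'UTF-8'));  // bool(true)

echo bin2hex(mb_chr(241, 'UTF-8')), PHP_EOL;   // c3b1
echo strlen($valid), PHP_EOL;                  // 8 bytes
echo mb_strlen($valid, 'UTF-8'), PHP_EOL;      // 7 characters
```

In other words, chr() produces a single raw byte, while a UTF-8 string needs the encoded form of the code point; validating input with mb_check_encoding() before measuring it avoids surprises like the one in the question.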

UPDATE 2:

An example of constructing the same symbol with chr() in a single-byte encoding and in UTF-8.

$euroSignAscii = chr(0x80); // Windows-1252 (an "extended ASCII" code page; in strict ISO-8859-1 the byte 0x80 is a control character)
$euroSignUtf8 = chr(0xe2) . chr(0x82) . chr(0xac); // UTF-8

Note that when you echo these strings, what you see depends on the encoding of your console or web page: if it is Windows-1252, $euroSignAscii displays correctly; if it is UTF-8, $euroSignUtf8 displays correctly.
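
Rather than building the UTF-8 bytes by hand, the single-byte form can be converted. This is a sketch that assumes the mbstring extension is available and that the byte 0x80 is meant as the euro sign of Windows-1252; the variable names are made up for the example.

```php
<?php
// Euro sign as a single byte in Windows-1252.
$euroCp1252 = chr(0x80);

// Convert it to UTF-8: arguments are (string, target encoding, source encoding).
$euroUtf8 = mb_convert_encoding($euroCp1252, 'UTF-8', 'Windows-1252');

echo bin2hex($euroUtf8), PHP_EOL;          // e282ac
echo strlen($euroUtf8), PHP_EOL;           // 3 bytes
echo mb_strlen($euroUtf8, 'UTF-8'), PHP_EOL; // 1 character
```

The converted string has the same three bytes as the hand-built chr(0xe2) . chr(0x82) . chr(0xac).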


Links:

A good reference is the UTF-8 article on Wikipedia.

A classic post from Joel Spolsky: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

And to get a feel for the byte values: UTF-8 encoding table and Unicode characters.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow