Why does PHP's preg_split split the Hebrew letter “נ” in UTF-8 when splitting on “\s”?
26-09-2019
Question
This doesn't work; it turns the letter into gibberish:
$foo = 'נ';
$bar = mb_convert_encoding($foo, 'UTF-8', mb_detect_encoding($foo));
print_r(preg_split('/\s/', $bar));
Array ( [0] => � [1] => )
But this works:
$foo = 'נ';
$bar = mb_convert_encoding($foo, 'ISO-8859-8', mb_detect_encoding($foo));
$baz = preg_split('/\s/', $bar);
echo(mb_convert_encoding($baz[0], 'UTF-8', 'ISO-8859-8'));
נ
The problem is only with the letter "נ". It works fine with all the other Hebrew letters. Is there a solution for that?
Solution
When working with UTF-8 data, always use the u modifier in your patterns:
/\s/u
Without it, the pattern and the subject are treated as plain byte strings rather than as UTF-8.
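A minimal sketch of the fix, assuming the input is valid UTF-8. The letter is written as its raw bytes so the example does not depend on how the source file is saved, and the variable names are only illustrative:

```php
<?php
// The Hebrew letter nun (U+05E0), written as its two UTF-8 bytes.
$nun = "\xD7\xA0";

// With the u modifier, pattern and subject are treated as UTF-8,
// so \s is tested against whole characters, not single bytes.
$single = preg_split('/\s/u', $nun);
$words  = preg_split('/\s/u', $nun . ' ' . $nun);

print_r($single); // one element: the intact letter
print_r($words);  // two elements: the letter on each side of the space
```

If the subject is not valid UTF-8, preg_split with the u modifier returns false, so it can be worth checking the return value before using the result.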
In this case, the character נ (U+05E0) is encoded as the two bytes 0xD7 0xA0 in UTF-8, and \s represents any whitespace character (according to PCRE):
The \s characters are HT (9), LF (10), FF (12), CR (13), and space (32).
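To see the bytes involved, the following sketch prints the UTF-8 encoding of the letter. It assumes the mbstring extension and PHP 7.2 or later for mb_chr; the variable names are illustrative.

```php
<?php
// Build U+05E0 from its code point (mb_chr needs PHP >= 7.2 and mbstring).
$nun = mb_chr(0x05E0, 'UTF-8');

$hex       = bin2hex($nun);            // hexadecimal dump of the raw bytes
$byteCount = strlen($nun);             // strlen counts bytes
$charCount = mb_strlen($nun, 'UTF-8'); // mb_strlen counts characters

echo $hex, "\n";       // d7a0
echo $byteCount, "\n"; // 2
echo $charCount, "\n"; // 1
```

The second byte, 0xA0, is the one that matters for the rest of the explanation.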
Along with UTF-8 support, PCRE also gained a special option called PCRE_UCP, which makes \b, \d, \s, and \w match not just US-ASCII characters but also other Unicode characters according to their Unicode properties:
By default, in UTF-8 mode, characters with values greater than 128 never match \d, \s, or \w, and always match \D, \S, and \W. […] However, if PCRE is compiled with Unicode property support, and the PCRE_UCP option is set, the behaviour is changed so that Unicode properties are used to determine character types, as follows:
\d  any character that \p{Nd} matches (decimal digit)
\s  any character that \p{Z} matches, plus HT, LF, FF, CR
\w  any character that \p{L} or \p{N} matches, plus underscore
The non-breaking space U+00A0 has the separator property (\p{Z}).
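These property classes can be checked directly. The sketch below assumes a PHP build whose PCRE has Unicode property support. The last check also relies on PHP enabling PCRE_UCP together with the u modifier, which is believed to be the case since PHP 5.3.2, so treat that result as dependent on the build.

```php
<?php
$nbsp = "\xC2\xA0"; // U+00A0 NO-BREAK SPACE encoded as UTF-8
$nun  = "\xD7\xA0"; // U+05E0 HEBREW LETTER NUN encoded as UTF-8

// \p{Z} matches separators and \p{L} matches letters (u modifier required here).
$nbspIsSeparator = preg_match('/^\p{Z}$/u', $nbsp) === 1;
$nunIsLetter     = preg_match('/^\p{L}$/u', $nun) === 1;
$nunIsSeparator  = preg_match('/^\p{Z}$/u', $nun) === 1;

// With PCRE_UCP in effect, \s is expected to match U+00A0 as well.
$nbspIsWhitespace = preg_match('/^\s$/u', $nbsp) === 1;

var_dump($nbspIsSeparator, $nunIsLetter, $nunIsSeparator, $nbspIsWhitespace);
```

In UTF-8 mode the letter נ is classified as a letter and never as a separator, even though its encoding contains the byte 0xA0.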
So although your pattern is not in UTF-8 mode, it seems that \s does match the byte 0xA0 in the UTF-8 sequence 0xD7A0, splitting the string at that position and returning an array that is equivalent to array("\xD7", "").
And that’s obviously a bug: the pattern is not in UTF-8 mode, and 0xA0 is greater than 0x80, so \s should not match it (additionally, in UTF-8 the character U+00A0 would be encoded as 0xC2A0, not as a lone 0xA0). Bug #52971, "PCRE-Meta-Characters not working with utf-8", could be related to this.
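To check how a given PHP build behaves, one can compare the byte-mode and UTF-8-mode results side by side. The helper name hexParts is hypothetical, and the byte-mode result is deliberately left unasserted, since it is the behaviour under discussion and may differ between builds.

```php
<?php
// Hypothetical helper: split on the pattern and show each piece as hex bytes.
function hexParts(string $pattern, string $subject): array
{
    return array_map('bin2hex', preg_split($pattern, $subject));
}

$nun = "\xD7\xA0"; // U+05E0 as UTF-8

$byteMode = hexParts('/\s/', $nun);  // affected builds give ['d7', '']
$utf8Mode = hexParts('/\s/u', $nun); // the letter stays whole

// A stand-alone U+00A0 is two bytes in UTF-8, not a lone 0xA0.
$nbspHex = bin2hex(mb_convert_encoding("\xA0", 'UTF-8', 'ISO-8859-1'));

print_r($byteMode);
print_r($utf8Mode);
echo $nbspHex, "\n";
```

If the first array shows two pieces, the build treats the byte 0xA0 as whitespace in byte mode, and the u modifier is the way around it.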