Why does PHP's preg_split split the Hebrew letter “נ” in UTF-8 when splitting on “\s”?

https://stackoverflow.com/questions/4231864

26-09-2019

Question

This doesn't work; it turns the letter into gibberish:

$foo = 'נ';
$bar = mb_convert_encoding($foo, 'UTF-8', mb_detect_encoding($foo));
print_r(preg_split('/\s/', $bar));

Array ( [0] => � [1] => )

But this works:

$foo = 'נ';
$bar = mb_convert_encoding($foo, 'ISO-8859-8', mb_detect_encoding($foo));
$baz = preg_split('/\s/', $bar);
echo(mb_convert_encoding($baz[0], 'UTF-8', 'ISO-8859-8'));

נ

The problem is only with the letter "נ". It works fine with all the other Hebrew letters. Is there a solution for that?

Solution

When working with UTF-8 data, always use the u modifier in your patterns:

/\s/u

Otherwise, the pattern and the subject string are not interpreted as UTF-8 but as sequences of single bytes.
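As a minimal sketch of that advice (the variable names are only illustrative, and the Hebrew letters are written as explicit UTF-8 byte escapes so the encoding of the source file does not matter), splitting with the u modifier leaves “נ” intact:

```php
<?php
// "נ" (U+05E0) as explicit UTF-8 bytes.
$nun = "\xD7\xA0";

// With the u modifier, pattern and subject are both treated as UTF-8.
$single = preg_split('/\s/u', $nun);
var_dump($single === [$nun]);   // bool(true): there is no whitespace to split on

// "א נ ב": three Hebrew letters separated by ordinary spaces.
$words = preg_split('/\s/u', "\xD7\x90 \xD7\xA0 \xD7\x91");
var_dump(count($words));        // int(3)
var_dump(bin2hex($words[1]));   // string(4) "d7a0", both bytes of "נ" stay together
```

Note that the u modifier also makes preg_split fail (returning false) when the subject is not valid UTF-8, so the input encoding should be checked first.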

In this case, the character נ (U+05E0) is encoded in UTF-8 as the two bytes 0xD7 0xA0. And \s represents any whitespace character (according to the PCRE documentation):

The \s characters are HT (9), LF (10), FF (12), CR (13), and space (32).
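The byte values involved can be checked directly. The following is a small sketch, assuming PHP 7.0 or later for the "\u{...}" escape; the variable names are illustrative. It prints the UTF-8 bytes of “נ” and compares each one against the ASCII whitespace codes listed above:

```php
<?php
// "\u{05E0}" produces the UTF-8 encoding of U+05E0 (PHP 7.0+).
$nun = "\u{05E0}";
$hex = bin2hex($nun);
echo $hex, "\n";                        // d7a0

// The character codes that \s matches according to the quoted PCRE text.
$asciiWhitespace = [9, 10, 12, 13, 32];

$hits = 0;
foreach (str_split($nun) as $byte) {
    $isWhitespace = in_array(ord($byte), $asciiWhitespace, true);
    $hits += $isWhitespace ? 1 : 0;
    printf("0x%02X is in the \\s list: %s\n", ord($byte), $isWhitespace ? 'yes' : 'no');
}
```

Neither 0xD7 nor 0xA0 appears in that list, so by the documented ASCII-only definition \s should not match anywhere inside “נ”.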

PCRE also provides a special option called PCRE_UCP that makes \b, \d, \s, and \w match not just US-ASCII characters but also other Unicode characters, based on their Unicode properties:

By default, in UTF-8 mode, characters with values greater than 128 never match \d, \s, or \w, and always match \D, \S, and \W. […] However, if PCRE is compiled with Unicode property support, and the PCRE_UCP option is set, the behaviour is changed so that Unicode properties are used to determine character types, as follows:

\d any character that \p{Nd} matches (decimal digit)

\s any character that \p{Z} matches, plus HT, LF, FF, CR

\w any character that \p{L} or \p{N} matches, plus underscore

The non-breaking space U+00A0 has the separator property (\p{Z}).
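The separator property can be tested explicitly with \p{Z}. This is a sketch that assumes a PCRE build with Unicode property support (needed for \p{...}) and PHP 7.0 or later for the "\u{...}" escapes:

```php
<?php
$nbsp = "\u{00A0}";   // NO-BREAK SPACE
$nun  = "\u{05E0}";   // HEBREW LETTER NUN

// In UTF-8 mode, \p{Z} matches the non-breaking space ...
$nbspIsSeparator = preg_match('/^\p{Z}$/u', $nbsp);
var_dump($nbspIsSeparator);     // int(1)

// ... but not the Hebrew letter.
$nunIsSeparator = preg_match('/\p{Z}/u', $nun);
var_dump($nunIsSeparator);      // int(0)

// As a real UTF-8 character, U+00A0 takes two bytes, not the single byte 0xA0.
var_dump(bin2hex($nbsp));       // string(4) "c2a0"
```

Whether a plain \s with the u modifier also matches U+00A0 depends on whether PCRE_UCP is in effect, so this sketch tests the property itself rather than \s.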

So although your pattern is not in UTF-8 mode, it seems that \s does match that 0xA0 in the UTF-8 code word 0xD7A0, splitting the string at that position and returning an array that is equivalent to array("\xD7", "").
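The described result can be reproduced deterministically by splitting on the byte 0xA0 explicitly instead of relying on what \s happens to match. This is only an illustration of the effect, not a reproduction of the underlying PCRE behaviour, and it assumes the mbstring extension for mb_check_encoding:

```php
<?php
// "נ" (U+05E0) as its two UTF-8 bytes.
$nun = "\xD7\xA0";

// Without the u modifier, \xA0 in the pattern matches the single byte 0xA0.
$parts = preg_split('/\xA0/', $nun);
var_dump($parts === ["\xD7", ""]);                  // bool(true)

// The remaining lead byte 0xD7 is not valid UTF-8 on its own,
// which is why it is displayed as a replacement character.
var_dump(mb_check_encoding($parts[0], 'UTF-8'));    // bool(false)
```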

That looks like a bug: the pattern is not in UTF-8 mode, and 0xA0 is greater than 0x80, so it should not match \s (additionally, in UTF-8 the character U+00A0 would be encoded as 0xC2 0xA0, not as the single byte 0xA0). Bug #52971, "PCRE-Meta-Characters not working with utf-8", could be related to this.

Licensed under: CC-BY-SA with attribution