PHP PCRE unicode escape [duplicate]

Question 1

$regx = preg_replace("/\\u(\w+)/i", "\x$1", $regx);

This doesn't work because the backslashes need to be escaped twice: once for PHP and once for the regex engine.

As things stand, \\u is inside a PHP double-quoted string, which means that PHP collapses the \\ down to a single backslash.

This single backslash is then given to PCRE, so the regex parser just sees \u. This fails because \u is not a supported escape sequence in PCRE.
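One way to confirm this is to print the pattern string before it reaches the regex engine. The snippet below is only a sketch: the sample input '\u0041' and the variable names are made up for illustration, and the @ is there just to keep the compile warning out of the output.

```php
<?php
// The pattern exactly as written in the question.
$pattern = "/\\u(\w+)/i";

// PHP has already collapsed \\ to \, so PCRE receives /\u(\w+)/i.
echo $pattern, "\n";

// PCRE rejects \u when compiling the pattern, so preg_replace() returns null.
// The @ only hides the "Compilation failed" warning.
$result = @preg_replace($pattern, '\x$1', '\u0041');
var_dump($result);
```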

If you want to match a literal backslash character in a PHP regex, you need to supply four backslashes: PHP reduces them to two, and PCRE then reads those two as one escaped backslash.

$regx = preg_replace("/\\\\u(\w+)/i", "\x$1", $regx);

Yep. It's ugly. But that's how it is.
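To see the four-backslash version at work, here is a small sketch. The input string is hypothetical, and the replacement is written in single quotes purely for readability; it holds the same four characters as the double-quoted "\x$1" above.

```php
<?php
// Four backslashes in the source: PHP turns them into two,
// and PCRE reads those two as one literal backslash.
$pattern = "/\\\\u(\w+)/i";
echo $pattern, "\n"; // /\\u(\w+)/i

// Hypothetical input containing a literal \u escape.
$input = 'caf\u00e9';

// Replace the leading \u with \x and keep the captured characters.
$output = preg_replace($pattern, '\x$1', $input);
echo $output, "\n"; // caf\x00e9
```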

Technically, this applies to any regex backslash, so in theory your \w should have a double backslash too. You can get away with leaving it, and most others, as it is, because \w has no meaning to PHP, so PHP passes it through untouched. This is helpful behaviour, but it does make things more confusing when it goes wrong, as in this case.
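A numeric backreference is one case where the shortcut does not hold, because \1 is an octal escape in a PHP double-quoted string. The sketch below uses made-up patterns and input to contrast it with \w.

```php
<?php
// \w is not a PHP escape, so both characters survive: backslash + w.
var_dump(strlen("\w")); // int(2)

// \1 IS a PHP escape (octal), so it becomes the single byte 0x01.
var_dump(strlen("\1")); // int(1)

// The backreference is lost in double quotes...
$broken = preg_match("/(a)\1/", 'aa');

// ...but survives in single quotes, or with a doubled backslash.
$single  = preg_match('/(a)\1/', 'aa');
$doubled = preg_match("/(a)\\1/", 'aa');

var_dump($broken, $single, $doubled); // int(0) int(1) int(1)
```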

Question 2

\u won't work in a PHP regex, but \x will. The PCRE documentation explains:

\x{hhh..} character with hex code hhh.. (non-JavaScript mode)
\uhhhh    character with hex code hhhh (JavaScript mode only)

Don't forget the u modifier: it puts PCRE into UTF-8 mode, which \x{hhh..} needs for code points above 0xFF. ("JavaScript mode" is an internal PCRE flag that PHP does not expose.)
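The effect of the u modifier on \x{hhh..} can be sketched as follows. The subject string and variable names are illustrative; the character is written as raw UTF-8 bytes so the result does not depend on how the source file is encoded.

```php
<?php
// "é" (U+00E9) as raw UTF-8 bytes: 0xC3 0xA9.
$subject = "caf\xC3\xA9";

// With /u, \x{e9} means code point U+00E9 and matches the character.
$withU = preg_match('/\x{e9}/u', $subject);

// Without /u, \x{e9} means the single byte 0xE9, which is not in the string.
$withoutU = preg_match('/\x{e9}/', $subject);

// Without /u, a value above 0xFF does not even compile, so the call
// returns false (the @ hides the warning).
$tooLarge = @preg_match('/\x{20ac}/', $subject);

var_dump($withU, $withoutU, $tooLarge); // int(1) int(0) bool(false)
```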

Another solution for interpreting Unicode escape sequences (\u as well as \U) is to use the intl extension's Transliterator (PHP >= 5.4):

$in = '\u0041\U00000062';
$out = transliterator_create('Hex-Any')->transliterate($in);
var_dump($out); # string(2) "Ab"
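If the intl extension is not available, something similar can be approximated with preg_replace_callback() and mb_chr(). This is only a sketch under some assumptions: it needs the mbstring extension and PHP >= 7.4, the helper name decode_u_escapes is made up, and it handles only four-digit \uXXXX escapes (no \U, no surrogate-pair joining).

```php
<?php
// Hypothetical helper: replace each \uXXXX (exactly four hex digits)
// with the corresponding UTF-8 character.
function decode_u_escapes(string $text): string
{
    return preg_replace_callback(
        '/\\\\u([0-9a-fA-F]{4})/', // four backslashes -> one literal backslash
        static function (array $m): string {
            $char = mb_chr(hexdec($m[1]), 'UTF-8');
            // mb_chr() returns false for invalid code points such as
            // lone surrogates; leave those escapes untouched.
            return $char === false ? $m[0] : $char;
        },
        $text
    );
}

$out = decode_u_escapes('\u0041\u0062');
var_dump($out); // string(2) "Ab"
```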