Question

I am working with a regexp database that contains expressions with "\uXXXX", which, of course, breaks PHP PCRE.

So, a two-part question. First, is there a way to tell PCRE to accept those sequences?

Second, I got around the issue (luckily there was only the one sequence) by doing:

$regx = str_ireplace('\u00a7', '\xa7', $regx);

but when I was attempting to do:

$regx = preg_replace("/\\u(\w+)/i", "\x$1", $regx);

I was still getting:

Warning: preg_replace() [function.preg-replace]: Compilation failed: PCRE does not support \L, \l, \N{name}, \U, or \u at offset 1

It took double-escaping the \u (\\\\u, not simply \\u) to make it work. Why is that, and is there a better way? Note: I had to do the same thing, and then some, to get the correct string into this post.

Update: running PHP 5.3.3 on our server.


Solution

$regx = preg_replace("/\\u(\w+)/i", "\x$1", $regx);

This doesn't work because you need to double-escape your backslashes.

As things stand, \\u is inside a PHP double-quoted string, which means that PHP reduces the \\ to a single backslash.

This single backslash is then passed to PCRE, so the regex parser just sees \u. This fails because \u is not a valid escape sequence in PCRE.

To match a literal backslash character in a PHP regex, you need to supply four backslashes: PHP turns them into two, and PCRE reads those two as one escaped backslash.

$regx = preg_replace("/\\\\u(\w+)/i", "\x$1", $regx);

Yep. It's ugly. But that's how it is.

Technically, this applies to any regex backslash, so in theory your \w should have a double backslash too, but you can get away with that, and most others, because \w has no meaning to PHP, so it doesn't parse it. This is helpful behaviour, but does make things more confusing when it goes wrong, as in this case.
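
To make the two layers of escaping visible, here is a minimal sketch that prints what PCRE actually receives for each spelling. The variable names and the sample subject are made up for illustration.

```php
<?php
// Each string below is what PHP hands to PCRE after its own escape handling.
$two    = "/\\u(\w+)/i";    // PCRE receives /\u(\w+)/i   (\u is unsupported, so compilation fails)
$four   = "/\\\\u(\w+)/i";  // PCRE receives /\\u(\w+)/i  (\\ matches one literal backslash)
$single = '/\\\\u(\w+)/i';  // single quotes also collapse \\, so four are still needed

echo $two, "\n";     // /\u(\w+)/i
echo $four, "\n";    // /\\u(\w+)/i
echo $single, "\n";  // /\\u(\w+)/i

// Hypothetical subject: a literal backslash, then u00a7, inside other text
$subject = 'a\u00a7b';
$found = preg_match($four, $subject, $m);
echo $found, ' ', $m[1], "\n";  // 1 00a7b
```

Note the captured text in the last line: (\w+) is greedy, so it swallows the b that follows the four hex digits. If the sequences sit next to other word characters, ([0-9a-f]{4}) is probably a safer capture group.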

OTHER TIPS

\u won't work in PHP's PCRE, but \x will. From the PCRE documentation:

\x{hhh..} character with hex code hhh.. (non-JavaScript mode)
\uhhhh    character with hex code hhhh (JavaScript mode only)

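Don't forget the u modifier: without it, \x{hhh..} cannot go above FF and the pattern is matched byte by byte rather than as UTF-8. ("JavaScript mode" is an internal PCRE flag that PHP does not expose.)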
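
Putting that together, one possible approach is to rewrite each \uhhhh into \x{hhhh} and then run the result with the u modifier. This is only a sketch: the sample pattern is invented, it assumes the stored expressions never contain an escaped backslash directly followed by u, and it assumes they do not contain the / delimiter.

```php
<?php
// Hypothetical pattern as it might come out of the regexp database
$regx = 'section \u00a7\d+';

// Rewrite every \uXXXX (exactly four hex digits) as \x{XXXX}.
// In single quotes, '\\\\' reaches PCRE as \\ (one literal backslash),
// and '\\x{$1}' is the replacement text \x{$1}.
$converted = preg_replace('/\\\\u([0-9a-fA-F]{4})/', '\\x{$1}', $regx);
echo $converted, "\n";  // section \x{00a7}\d+

// The u modifier makes PCRE treat both pattern and subject as UTF-8.
// "\xC2\xA7" is the UTF-8 encoding of U+00A7 (the section sign).
$subject = "section \xC2\xA7" . "12";
$matched = preg_match('/' . $converted . '/u', $subject);
echo $matched, "\n";    // 1
```

Limiting the capture to four hex digits avoids the over-matching that (\w+) would cause, and the braces keep PCRE from reading only the first two digits after \x.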

Another solution, which interprets Unicode escape sequences (both \u and \U), is to use intl's Transliterator (PHP >= 5.4):

$in = '\u0041\U00000062';
$out = transliterator_create('Hex-Any')->transliterate($in);
var_dump($out); # string(2) "Ab"
Licensed under: CC-BY-SA with attribution