Question

The problem starts when I search a string for words that start or end with a non-ASCII (multibyte UTF-8) character: no match is found. If the word starts and ends with an ASCII character, everything works fine. The code:

$str= 'String: ābols, abols, abŌls, abōls, aboļŠ, aboĻs';
$find = array('ābols', 'abols', 'abōls', 'aboļš', 'aboļs');
preg_match_all("/(*UTF8)\b(" . implode("|", $find) . ")\b/i", $str, $matches);

As the result shows, words that start with a non-ASCII character are not found. Image of the result: http://i.stack.imgur.com/qZku3.png

What am I doing wrong? Thanks.

Solution

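The reason you don't see words that begin (or end) with a non-ASCII character is simple: \b is a word boundary, which by default is the position between a character from \w (and only from \w, that is [a-zA-Z0-9_]) and a character that is not in \w, or the start or end of the string. A letter such as ā is not in that default \w, so there is no boundary between it and the space or comma next to it.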
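
The difference is easy to check in isolation. This is a minimal sketch, assuming the script is saved as UTF-8 and PHP runs with its default C locale; the variable names are illustrative only and not part of the question:

```php
<?php
// Assumption: this file is saved as UTF-8.
$str = 'String: ābols, abols';

// Without the u modifier, \w is ASCII-only, so there is no word
// boundary between the space and the first byte of "ā".
$asciiHits = preg_match('/\bābols\b/', $str);

// With the u modifier, "ā" counts as a word character and \b
// matches between the space and the letter.
$unicodeHits = preg_match('/\bābols\b/u', $str);

var_dump($asciiHits);   // int(0)
var_dump($unicodeHits); // int(1)
```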

To change the behaviour of \b (so that it works with all the numbers and all the letters of the galaxy), use the u modifier. With this modifier, \w matches all Unicode letters and digits:

preg_match_all("/(*UTF8)\b(" . implode("|", $find) . ")\b/iu", $str, $matches);
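
Put together with the data from the question, a complete script might look like the sketch below. It makes two assumptions that are not in the original answer: the (*UTF8) prefix is dropped, because the u modifier already switches on UTF-8 mode (and, as far as I know, PHP builds based on PCRE2, i.e. 7.3 and later, may reject that spelling), and each word is passed through preg_quote. The name $alternatives is illustrative.

```php
<?php
// Assumption: this file is saved as UTF-8.
$str  = 'String: ābols, abols, abŌls, abōls, aboļŠ, aboĻs';
$find = array('ābols', 'abols', 'abōls', 'aboļš', 'aboļs');

// Escape each word so that any regex metacharacter in the list stays literal.
$alternatives = implode('|', array_map(function ($word) {
    return preg_quote($word, '/');
}, $find));

// i = case-insensitive, u = UTF-8 mode with Unicode-aware \w and \b.
preg_match_all('/\b(' . $alternatives . ')\b/iu', $str, $matches);

echo implode(', ', $matches[1]), "\n";
// ābols, abols, abŌls, abōls, aboļŠ, aboĻs
```

The words in this list contain no metacharacters, so preg_quote changes nothing here; it only matters if a search word contains something like . or ?.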

Another way is to replace the word boundaries with lookarounds:

preg_match_all("/(*UTF8)(?<=^|[\s\pP])(" . implode("|", $find) . ")(?=[\s\pP]|$)/i", $str, $matches);
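
The lookaround form is not an exact substitute for \b: it accepts only whitespace, punctuation, or the string edges as delimiters. The sketch below compares the two on the question's string and on a made-up edge case ('ābols+1', not from the question). It assumes a UTF-8 source file and uses the u modifier in place of the (*UTF8) prefix; $lookaround, $boundary and $edge are illustrative names.

```php
<?php
// Assumption: this file is saved as UTF-8.
$find = array('ābols', 'abols', 'abōls', 'aboļš', 'aboļs');
$alt  = implode('|', array_map(function ($word) {
    return preg_quote($word, '/');
}, $find));

// Lookaround version: the match must sit between whitespace,
// punctuation, or the start/end of the string.
$lookaround = '/(?<=^|[\s\pP])(' . $alt . ')(?=[\s\pP]|$)/iu';
// Word-boundary version, for comparison.
$boundary   = '/\b(' . $alt . ')\b/iu';

$str = 'String: ābols, abols, abŌls, abōls, aboļŠ, aboĻs';
$lookaroundCount = preg_match_all($lookaround, $str, $m1);
$boundaryCount   = preg_match_all($boundary, $str, $m2);
var_dump($lookaroundCount, $boundaryCount); // int(6) int(6)

// "+" is a math symbol, not punctuation, so only \b treats it as a delimiter.
$edge = 'ābols+1';
$lookaroundEdge = preg_match_all($lookaround, $edge, $m3);
$boundaryEdge   = preg_match_all($boundary, $edge, $m4);
var_dump($lookaroundEdge, $boundaryEdge); // int(0) int(1)
```
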
Licensed under: CC-BY-SA with attribution