Question

According to the MongoDB documentation, it uses the PCRE engine, and PCRE supports \b and \w for Unicode characters. However, the following regex does not match Unicode strings. Is there a solution or an alternative syntax for it?

// in Mongo console:
 db.col.find({word:/\b\pL\b/});

// in PHP
(new Mongo())->db->col->find(['word'=>new MongoRegex('/\b\pL\b/u')]);

Solution

I couldn't find any documentation on exactly which features MongoDB's PCRE implementation supports, but if it includes the \pL Unicode character class as well as look-ahead and look-behind assertions, then a Unicode-aware replacement for \b would be:

(?:(?=\pL)(?<!\pL)|(?!\pL)(?<=\pL))

Basically, (?=\pL)(?<!\pL) matches if the next character is a letter while the previous one is not, whereas (?!\pL)(?<=\pL) conversely matches if the previous character is a letter but the next one is not.
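To see which positions this boundary actually matches, here is a minimal sketch in plain JavaScript (Node.js), not in MongoDB itself. It assumes an engine with look-behind and Unicode property escapes. JavaScript does not accept the PCRE shorthand \pL, so the sketch spells it \p{L} and sets the u flag. The helper name boundaryPositions is made up for illustration.

```javascript
// Unicode-aware word boundary. Written with \p{L} and the `u` flag because
// JavaScript does not accept the PCRE shorthand \pL used in the answer.
const unicodeBoundary = /(?:(?=\p{L})(?<!\p{L})|(?!\p{L})(?<=\p{L}))/gu;

// Hypothetical helper: return the string indices where the boundary matches.
function boundaryPositions(text) {
  return [...text.matchAll(unicodeBoundary)].map((m) => m.index);
}

// "да нет" has letters at indices 0-1 and 3-5, with a space at index 2.
console.log(boundaryPositions("да нет")); // [ 0, 2, 3, 6 ]

// JavaScript's built-in \b only knows ASCII word characters,
// so it finds no boundary at all in the same string.
console.log([..."да нет".matchAll(/\b/gu)].length); // 0
```

Note that this boundary is defined purely in terms of letters, so digits and underscores count as non-word characters here, unlike with \b and \w.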

Of course, this regexp can be simplified a lot if we already know something about what the adjacent characters can be. For example, the Unicode-aware version of \b\pL+\b can be written simply as:

(?<!\pL)\pL+(?!\pL)
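The simplified pattern can be tried out the same way. This is again only a sketch in plain JavaScript (Node.js) with \p{L} in place of \pL, and letterWords is a hypothetical helper name. Whether MongoDB's regex engine accepts the same constructs is the assumption stated at the top of this answer.

```javascript
// Simplified Unicode-aware equivalent of \b\pL+\b: a run of letters that is
// neither preceded nor followed by another letter.
const letterRun = /(?<!\p{L})\p{L}+(?!\p{L})/gu;

// Hypothetical helper: return every run of letters found in the text.
function letterWords(text) {
  return text.match(letterRun) ?? [];
}

console.log(letterWords("Привет, мир! 日本語"));
// [ 'Привет', 'мир', '日本語' ]
```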
Licensed under: CC-BY-SA with attribution