preg_match a keyword variable against a list of latin and non-latin chars keywords in a local UTF-8 encoded file

https://stackoverflow.com/questions/8631951

30-03-2021
|

Question

I have a bad words filter that uses a list of keywords saved in a local UTF-8 encoded file. This file includes both Latin and non-Latin chars (mostly English and Arabic). Everything works as expected with Latin keywords, but when the variable includes non-Latin chars, the matching does not seem to recognize these existing keywords.

How do I go about matching both Latin and non-Latin keywords.

The badwords.txt file includes one word per line as in this example



bad

nasty

racist

سفالة

وساخة

جنس

Code used for matching:




$badwords = file_get_contents("badwords.txt");
$badtemp = explode("\n", $badwords);
$badwords = array_unique($badtemp);
$hasBadword = 0;
$query = strtolower($query);

foreach ($badwords as $key => $val) {
    if (!empty($val)) {
        $val = trim($val);
        $regexp = "/\b" . $val . "\b/i";
        if (preg_match($regexp, $query))
            $badFlag = 1;

        if ($badFlag == 1) {
           // Bad word detected die...
        }
    }
}

I've read that iconv, multibyte functions (mbstring) and using the operator /u might help with this, and I tried a few things but do not seem to get it right. Any help would be much appreciated in resolving this, and having it match both Latin and non-Latin keywords.

Solution

The problem seems to relate to recognizing word boundaries; the \b construct is apparently not “Unicode aware.” This is what the answers to question php regex word boundary matching in utf-8 seem to suggest. I was able to reproduce the problem even with text containing Latin letters like “é” when \b was used. And the problem seems to disappear (i.e., Arabic words get correctly recognized) when I set

$wstart = '(^|[^\p{L}])';
$wend = '([^\p{L}]|$)';

and modify the regexp as follows:

$regexp = "/" . $wstart . $val . $wend . "/iu";

OTHER TIPS

Some string functions in PHP cannot be used on UTF-8 strings, they're supposedly going to fix it in version 6, but for now you need to be careful what you do with a string.

It looks like strtolower() is one of them, you need to use mb_strtolower($query, 'UTF-8'). If that doesn't fix it, you'll need to read through the code and find every point where you process $query or badwords.txt and check the documentation for UTF-8 bugs.

As far as I know, preg_match() is ok with UTF-8 strings, but there are some features disabled by default to improve performance. I don't think you need any of them.

Please also double check that badwords.txt is a UTF-8 file and that $query contains a valid UTF-8 string (if it's coming from the browser, you set it with a <meta> tag).

If you're trying to debug UTF-8 text, remember most web browsers do not default to the UTF-8 text encoding, so any PHP variable you print out for debugging will not be displayed correctly by the browser, unless you select UTF-8 (in my browser, with View -> Encoding -> Unicode).

You shouldn't need to use iconv or any of the other conversion API's, most of them will simply replace all of the non-latin characters with latin ones. Obviously not what you want.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow