preg_match_all removes latin letters

https://stackoverflow.com/questions/12134471

28-06-2021

Question

I have a problem with non-ASCII Latin characters (ž, š, đ, č, ć). Here is the code:

$stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www', 'on', 'ona', 'ja');

$string = preg_replace('/\s\s+/i', '', $string); // remove runs of two or more whitespace characters
$string = trim($string); // trim the string

$string = preg_replace('/[^a-zA-Z0-9žšđčćŽŠĐČĆ -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…

$string = mb_strtolower($string); // make it lowercase

preg_match_all('/\b.*?\b/i', $string, $matchWords);

$matchWords = $matchWords[0];

foreach ( $matchWords as $key=>$item ) {
    if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
        unset($matchWords[$key]);
    }
}

$wordCountArr = array();
if ( is_array($matchWords) ) {
    foreach ( $matchWords as $key => $val ) {
        $val = strtolower($val);
        if ( isset($wordCountArr[$val]) ) {
            $wordCountArr[$val]++;
        } else {
            $wordCountArr[$val] = 1;
        }
    }
}
arsort($wordCountArr);
$wordCountArr = array_slice($wordCountArr, 0, 10);
return $wordCountArr;

When I return $matchWords[0] from this call:

preg_match_all('/\b.*?\b/i', $string, $matchWords);

I get this string after imploding the array with spaces:

ti si mi znaj na srcu kvar znaj znaj znaj srcu ž urka

There is a space inside "ž urka": the word žurka has been split in two.

Solution

From the docs: A word boundary is a position in the subject string where the current character and the previous character do not both match \w or \W (i.e. one matches \w and the other matches \W), or the start or end of the string if the first or last character matches \w, respectively.

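By default \w covers only ASCII letters, digits and the underscore, so the ž (together with the space before it) matches \W while the u matches \w. A word boundary therefore falls between them, and you get ž and urka.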

At the end of the string, these characters will not be matched by the pattern at all:

 žšđčć ŽŠĐČĆ :)

...they are all \W characters and need to be followed by a \w character for the second \b to match.
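The byte-level behaviour described above can be checked with a small sketch. It assumes the script is saved as UTF-8 and runs under the default "C" locale; the variable names are only illustrative.

```php
<?php
// Assumption: this file is UTF-8 encoded and no locale has been set via setlocale().
$word = 'žurka';

// 'ž' is stored as two bytes (0xC5 0xBE), so the 5-letter word is 6 bytes long.
$byteLength = strlen($word);

// Neither byte of 'ž' is an ASCII word character, so \w finds nothing in it.
$zIsWordChar = preg_match('/\w/', 'ž');

// \w+ therefore only picks up the ASCII tail of the word.
preg_match_all('/\w+/', $word, $matches);

var_dump($byteLength, $zIsWordChar, $matches[0]);
```

Because the two bytes of ž count as \W, the first word boundary in žurka sits just before the u, which is where the split in the question comes from.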

I guess you are looking for the u modifier. Try:

preg_match_all('/\b.*?\b/iu', $string, $matchWords);
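As a rough sketch of how the whole counting routine might look once it is made Unicode-aware: the function name topWords, its parameters and the shortened stop-word list are hypothetical, and the pattern [\p{L}\p{N}]+ is swapped in for \b.*?\b so that only words are captured. It assumes UTF-8 input and the mbstring extension, which the original code already relies on.

```php
<?php
// Hypothetical helper: the name, parameters and sample data are illustrative only.
function topWords(string $text, array $stopWords, int $limit = 10): array
{
    $text = mb_strtolower($text, 'UTF-8');

    // \p{L} = any letter, \p{N} = any number; /u makes PCRE read the subject as UTF-8.
    // With /u, invalid UTF-8 input makes preg_match_all() return false.
    if (preg_match_all('/[\p{L}\p{N}]+/u', $text, $m) === false) {
        return [];
    }

    $counts = [];
    foreach ($m[0] as $word) {
        // mb_strlen counts characters; strlen would count bytes ('ž' is 2 bytes).
        if (mb_strlen($word, 'UTF-8') <= 3 || in_array($word, $stopWords, true)) {
            continue;
        }
        $counts[$word] = ($counts[$word] ?? 0) + 1;
    }

    arsort($counts);                               // highest count first, keys preserved
    return array_slice($counts, 0, $limit, true);  // keep the words as keys
}

$result = topWords('Žurka, žurka! Ti si mi na srcu kvar, znaj znaj.', ['i', 'a', 'na']);
print_r($result);
```

Besides the u modifier, the sketch uses mb_strlen where the question uses strlen, since the length filter would otherwise count bytes rather than characters for words containing ž, š, đ, č or ć.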

Licensed under: CC-BY-SA with attribution
