PHP Multi Byte str_replace?

https://stackoverflow.com/questions/1451144

12-09-2019
|

Question

I'm trying to do accented character replacement in PHP but get funky results, my guess being because i'm using a UTF-8 string and str_replace can't properly handle multi-byte strings..

$accents_search     = array('á','à','â','ã','ª','ä','å','Á','À','Â','Ã','Ä','é','è',
'ê','ë','É','È','Ê','Ë','í','ì','î','ï','Í','Ì','Î','Ï','œ','ò','ó','ô','õ','º','ø',
'Ø','Ó','Ò','Ô','Õ','ú','ù','û','Ú','Ù','Û','ç','Ç','Ñ','ñ'); 

$accents_replace    = array('a','a','a','a','a','a','a','A','A','A','A','A','e','e',
'e','e','E','E','E','E','i','i','i','i','I','I','I','I','oe','o','o','o','o','o','o',
'O','O','O','O','O','u','u','u','U','U','U','c','C','N','n'); 

$str = str_replace($accents_search, $accents_replace, $str);

Results I get:

Ørjan Nilsen -> �orjan Nilsen

Expected Result:

Ørjan Nilsen -> Orjan Nilsen

Edit: I've got my internal character handler set to UTF-8 (according to mb_internal_encoding()), also the value of $str is UTF-8, so from what I can tell, all the strings involved are UTF-8. Does str_replace() detect char sets and use them properly?

Solution

Looks like the string was not replaced because your input encoding and the file encoding mismatch.

OTHER TIPS

According to php documentation str_replace function is binary-safe, which means that it can handle UTF-8 encoded text without any data loss.

It's possible to remove diacritics using Unicode normalization form D (NFD) and Unicode character properties.

NFD converts something like the "ü" umlaut from "LATIN SMALL LETTER U WITH DIAERESIS" (which is a letter) to "LATIN SMALL LETTER U" (letter) and "COMBINING DIAERESIS" (not a letter).

header('Content-Type: text/plain; charset=utf-8');

$test = implode('', array('á','à','â','ã','ª','ä','å','Á','À','Â','Ã','Ä','é','è',
'ê','ë','É','È','Ê','Ë','í','ì','î','ï','Í','Ì','Î','Ï','œ','ò','ó','ô','õ','º','ø',
'Ø','Ó','Ò','Ô','Õ','ú','ù','û','Ú','Ù','Û','ç','Ç','Ñ','ñ'));

$test = Normalizer::normalize($test, Normalizer::FORM_D);

// Remove everything that's not a "letter" or a space (e.g. diacritics)
// (see http://de2.php.net/manual/en/regexp.reference.unicode.php)
$pattern = '/[^\pL ]/u';

echo preg_replace($pattern, '', $test);

Output:

aaaaªaaAAAAAeeeeEEEEiiiiIIIIœooooºøØOOOOuuuUUUcCNn

The Normalizer class is part of the PECL intl package. (The algorithm itself isn't very complicated but needs to load a lot of character mappings afaik. I wrote a PHP implementation a while ago.)

(I'm adding this two months late because I think it's a nice technique that's not known widely enough.)

Try this function definition:

if (!function_exists('mb_str_replace')) {
    function mb_str_replace($search, $replace, $subject) {
        if (is_array($subject)) {
            foreach ($subject as $key => $val) {
                $subject[$key] = mb_str_replace((string)$search, $replace, $subject[$key]);
            }
            return $subject;
        }
        $pattern = '/(?:'.implode('|', array_map(create_function('$match', 'return preg_quote($match[0], "/");'), (array)$search)).')/u';
        if (is_array($search)) {
            if (is_array($replace)) {
                $len = min(count($search), count($replace));
                $table = array_combine(array_slice($search, 0, $len), array_slice($replace, 0, $len));
                $f = create_function('$match', '$table = '.var_export($table, true).'; return array_key_exists($match[0], $table) ? $table[$match[0]] : $match[0];');
                $subject = preg_replace_callback($pattern, $f, $subject);
                return $subject;
            }
        }
        $subject = preg_replace($pattern, (string)$replace, $subject);
        return $subject;
    }
}

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow