Question

I'm trying to replace the accented characters in a PHP string with their unaccented equivalents (for example, replace ó with o and á with a). I tried using the PHP Normalizer::normalize function, as in the following code:

if (!Normalizer::isNormalized($word, Normalizer::FORM_C))
{
    echo "original: ".$word;
    $word = Normalizer::normalize($word, Normalizer::FORM_C);

    echo "\tnormalized: ".$word."<br />";
    exit; // see if it worked without having to go through every file
}

However, Normalizer::normalize returned null and the output from that code was:

original: adiós normalized:

Since this method didn't seem to be working, I went and found a function that was supposed to remove special characters. Here is the function:

function normalize ($string) {
    $table = array(
        'Š'=>'S', 'š'=>'s', 'Đ'=>'Dj', 'đ'=>'dj', 'Ž'=>'Z', 'ž'=>'z', 'Č'=>'C', 'č'=>'c', 'Ć'=>'C', 'ć'=>'c',
        'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
        'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O',
        'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U', 'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss',
        'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c', 'è'=>'e', 'é'=>'e',
        'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o',
        'ô'=>'o', 'õ'=>'o', 'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ü'=>'u', 'ý'=>'y', 'þ'=>'b',
        'ÿ'=>'y', 'Ŕ'=>'R', 'ŕ'=>'r',
    );

    return strtr($string, $table);
}

This code had no noticeable effect, however, and returned the same string that was passed in.

I'm obtaining my strings from *.txt files in Windows 7. I've never been very good at encodings, and would appreciate any help on this issue.
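
One possible explanation, offered as an assumption rather than a diagnosis: text files saved by Windows editors are often Windows-1252 rather than UTF-8, and Normalizer::normalize only accepts valid UTF-8, which would be consistent with the failed call. Note also that FORM_C composes characters and does not remove accents; decomposing with FORM_D and then dropping the combining marks does. The sketch below assumes the intl and iconv extensions are loaded. The helper names to_utf8 and strip_accents are hypothetical.

```php
<?php
// Sketch only. Assumes non-UTF-8 input is Windows-1252 (a guess based on
// "*.txt files in Windows 7"); pass another source encoding if that is wrong.
function to_utf8(string $text, string $assumedSource = 'Windows-1252'): string
{
    // With the /u modifier, preg_match fails on a subject that is not valid
    // UTF-8, so a return value of 1 means the text can be used as it is.
    if (preg_match('//u', $text) === 1) {
        return $text;
    }
    $converted = iconv($assumedSource, 'UTF-8', $text);
    if ($converted === false) {
        throw new RuntimeException("Could not convert from $assumedSource");
    }
    return $converted;
}

// Requires the intl extension (Normalizer).
function strip_accents(string $utf8): string
{
    // FORM_D splits "ó" into "o" followed by a combining acute accent.
    $decomposed = Normalizer::normalize($utf8, Normalizer::FORM_D);
    if (!is_string($decomposed)) {
        throw new RuntimeException('Input is not valid UTF-8');
    }
    // \p{Mn} matches the non-spacing combining marks left by decomposition.
    return preg_replace('/\p{Mn}+/u', '', $decomposed) ?? $decomposed;
}
```

Calling strip_accents(to_utf8($word)) should then turn adiós into adios. Letters that have no decomposition, such as ø, đ, ß and æ, pass through unchanged and would still need a lookup table.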

Solution

I copied and pasted your code into my editor and something interesting happened. Instead of getting adios I was getting adjiós (notice the j after the d). This came from the 'đ'=>'dj' entry in the first line of the table. Apparently my editor had changed the đ to a regular d, so every d was being replaced with dj, and the ó was not converted. I removed this key/value pair and suddenly it worked for me. Are you sure all of your keys are stored correctly by your editor? (Does your editor accept alternative character sets?) Here is my test file (with the đ removed):

<html>
<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">
</head>
<body>
<?php

function normalize ($string) {
    $table = array(
        'Š'=>'S', 'š'=>'s', 'Ð'=>'Dj', 'Ž'=>'Z', 'ž'=>'z', 'C'=>'C', 'c'=>'c', 'C'=>'C', 'c'=>'c',
        'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
        'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O',
        'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U', 'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss',
        'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c', 'è'=>'e', 'é'=>'e',
        'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o',
        'ô'=>'o', 'õ'=>'o', 'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ü'=>'u', 'ý'=>'y', 'þ'=>'b',
        'ÿ'=>'y', 'R'=>'R', 'r'=>'r',
    );

    return strtr($string, $table);
}

$word = 'adiós';
$length = strlen($word);

echo 'original: '. $word;
echo '<br />';
echo 'normalized: '. normalize($word); 
echo '<br />';
echo 'loop: ';

for($i = 0; $i < $length; $i++) {
    echo normalize($word[$i]);
}

?>

</body>
</html>

When I loop through each character with the 'd' => 'dj' entry in the array map, I get adjios, as expected.
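
The editor problem described above, where đ was silently saved as d, can be sidestepped by writing the table keys as escape sequences, so that the source file contains only ASCII. This is a minimal sketch, assuming PHP 7 or later (for the "\u{...}" syntax) and UTF-8 input. The function name normalize_escaped is hypothetical, and the table is deliberately incomplete.

```php
<?php
// Sketch only: keys are written as Unicode escapes, so the editor's
// encoding cannot alter them. Input is assumed to be UTF-8.
function normalize_escaped(string $utf8): string
{
    $table = [
        "\u{00E1}" => 'a',  // U+00E1 a with acute
        "\u{00E9}" => 'e',  // U+00E9 e with acute
        "\u{00ED}" => 'i',  // U+00ED i with acute
        "\u{00F3}" => 'o',  // U+00F3 o with acute
        "\u{00FA}" => 'u',  // U+00FA u with acute
        "\u{00F1}" => 'n',  // U+00F1 n with tilde
        "\u{0110}" => 'Dj', // U+0110 capital D with stroke
        "\u{0111}" => 'dj', // U+0111 small d with stroke
    ];

    // strtr with an array replaces each key once and never rescans its output.
    return strtr($utf8, $table);
}

echo normalize_escaped("adi\u{00F3}s"), "\n"; // adios
```

Because a plain d is no longer a key, it is left alone, while a real đ in the input still becomes dj.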

OTHER TIPS

There's a great tip from this page: How to remove diacritics from text? Here's my version of it:

/** Normalize a string so that it can be compared with others without being too fussy.
*   e.g. "Ádrèñålînë" would return "adrenaline"
*   Note: Some letters are converted into more than one letter, 
*   e.g. "ß" becomes "sz", or "æ" becomes "ae"
*/
function normalize_string($string) {
    // collapse each run of whitespace into a single space
    $string = preg_replace('/\s+/', ' ', $string);
    // flick diacritics off of their letters: encode them as HTML entities
    // (e.g. "&oacute;"), then keep only the base letter(s) of each entity
    $entities = htmlentities($string, ENT_COMPAT, 'UTF-8');
    $string = preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml|caron);~i', '$1', $entities);
    // lower case
    $string = strtolower($string);
    return $string;
}

It's good because, unlike iconv-based approaches, there's no converting between character sets (they're a minefield).
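
To make the trick easier to follow, here is a small illustration of the intermediate step: htmlentities turns each accented letter into a named entity, and the regular expression then keeps only the leading letter(s) of that entity name. The helper entity_steps is hypothetical and exists only to expose both stages. The input is assumed to be UTF-8.

```php
<?php
// Illustration only: returns the entity-encoded form and the stripped form.
function entity_steps(string $utf8): array
{
    // "ó" becomes "&oacute;", "ß" becomes "&szlig;", and so on.
    $entities = htmlentities($utf8, ENT_COMPAT, 'UTF-8');

    // Keep the one or two base letters and drop the rest of the entity.
    $stripped = preg_replace(
        '~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml|caron);~i',
        '$1',
        $entities
    );

    return [$entities, $stripped];
}

[$entities, $stripped] = entity_steps("adi\u{00F3}s");
echo $entities, "\n"; // adi&oacute;s
echo $stripped, "\n"; // adios
```

Characters without a named entity in PHP's default HTML 4.01 table (đ and ž, for example) are not touched by this approach.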

Licensed under: CC-BY-SA with attribution