PHP readdir with european characters

https://stackoverflow.com/questions/1766863

21-09-2019

Question

I receive image files that have Czech characters in the filename (e.g. ěščřžýáíé), and I want to rename them without the accents so that they are more compatible with the web. I thought I could use a simple str_replace call, but it doesn't seem to work on the file names in the array the same way it does on a string literal.

I read the files with readdir, after checking for extension.

function readFiles($dir, $ext = false) {
    if (is_dir($dir)) {
        if ($dh = opendir($dir)) {
            while (($file = readdir($dh)) !== false) {
                if($ext){  
                    if(end(explode('.', $file)) == $ext) {
                        $f[] = $file;
                    }
                } else {
                    $f[] = $file;
                }
            }

            closedir($dh);
            return $f;
        } else {
            return false;
        }
    } else {
        return false;
    }
}

$files = readFiles(".", "jpg");

$search = array('š','á','ž','í','ě','é','ř','ň','ý','č',' ');
$replace = array('s','a','z','i','e','e','r','n','y','c','-');

$string = "čšěáýísdjksnalci sášěééalskcnkkjy+ěéší";
$safe_string = str_replace($search, $replace, $string);

echo '<pre>';

foreach($files as $fl) {
    $safe_files[] = str_replace($search, $replace, $fl);
}

var_dump($files);
var_dump($safe_files);

var_dump($string);
var_dump($safe_string);

echo '</pre>';

Output

array(6) {
  [0]=>
  string(21) "Hl�vka s listem01.jpg"
  [1]=>
  string(23) "Hl�vky v atelieru02.jpg"
  [2]=>
  string(17) "Jarn� v�hon03.jpg"
  [3]=>
  string(17) "Mlad� chmel04.jpg"
  [4]=>
  string(23) "Stavba chmelnice 05.jpg"
  [5]=>
  string(21) "Zimni chmelnice06.jpg"
}
array(6) {
  [0]=>
  string(21) "Hl�vka-s-listem01.jpg"
  [1]=>
  string(23) "Hl�vky-v-atelieru02.jpg"
  [2]=>
  string(17) "Jarn�-v�hon03.jpg"
  [3]=>
  string(17) "Mlad�-chmel04.jpg"
  [4]=>
  string(23) "Stavba-chmelnice-05.jpg"
  [5]=>
  string(21) "Zimni-chmelnice06.jpg"
}
string(53) "čšěáýísdjksnalci sášěééalskcnkkjy+ěéší"
string(38) "cseayisdjksnalci-saseeealskcnkkjy+eesi"

Right now I'm running on WAMP but answers that work across platforms are even better :)

Solution

Judging by the U+FFFD marks (which Firefox displays as diamonds with a question mark inside), you already aren't reading the file names using the correct encoding (which would be Unicode / UTF-8). I also found this bug, which seems to be related.

Here's another SO topic about that: php readdir problem with japanese language file name

To the point: wait until PHP 6 is stable and then use it (note that PHP 6 was ultimately never released).

Unrelated to the problem: the Normalizer is a better tool to get rid of diacritical marks.
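A minimal sketch of the Normalizer approach, assuming the intl extension is loaded and the input is already valid UTF-8 (which raw Windows file names from readdir may not be). The helper name stripDiacritics is hypothetical, not part of PHP:

```php
<?php
// Sketch only: requires ext/intl and valid UTF-8 input.
function stripDiacritics(string $utf8): string
{
    // Decompose: "č" becomes "c" followed by a combining caron (U+030C).
    $decomposed = Normalizer::normalize($utf8, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $utf8; // input was not valid UTF-8; leave it untouched
    }
    // Drop the combining marks (Unicode category Mn), keeping the base letters.
    return preg_replace('/\p{Mn}+/u', '', $decomposed) ?? $utf8;
}

echo stripDiacritics("Hlávka s listem01.jpg"), "\n";
```

This only removes marks that decompose into a base letter plus a combining character, which covers the Czech letters in the question; letters such as "ł" or "đ" have no such decomposition and pass through unchanged.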

OTHER TIPS

If it works with strings but not with arrays, just apply it to the strings one at a time :-)

$search = array('š','á','ž','í','ě','é','ř','ň','ý','č',' ');
$replace = array('s','a','z','i','e','e','r','n','y','c','-');

// assumes $safe_files already holds the file names to clean
$len = count($safe_files);

for ($i = 0; $i < $len; $i++)
    $safe_files[$i] = str_replace($search, $replace, $safe_files[$i]);

str_replace also accepts an array as its third parameter and then returns an array of results, but the explicit loop above works just as well.

If you do have a real encoding problem, it could simply be that your OS uses a single-byte encoding for file names while your source file uses another one, probably UTF-8.

In that case, do something like :

$search = array('š','á','ž','í','ě','é','ř','ň','ý','č',' ');
$replace = array('s','a','z','i','e','e','r','n','y','c','-');

$code_encoding = "UTF-8";  // this is my guess, but put whatever is yours
$os_encoding = "CP1250";   // this is my guess, but put whatever is yours

// assumes $safe_files already holds the file names to clean
$len = count($safe_files);

for ($i = 0; $i < $len; $i++)
{
    // convert to the source file's encoding before replacing
    $safe_files[$i] = iconv($os_encoding, $code_encoding, $safe_files[$i]);
    /*
     Alternatively:
     $safe_files[$i] = mb_convert_encoding($safe_files[$i], $code_encoding, $os_encoding);
    */
    $safe_files[$i] = str_replace($search, $replace, $safe_files[$i]);
}

mb_convert_encoding() requires the ext/mbstring extension and iconv() requires ext/iconv.

This may not directly answer your question, but you might want to take a look at the iconv() function in PHP, and in particular the //TRANSLIT option that you can append to the second argument. I've used it several times to turn French and Eastern European strings into their a-z, URL-friendly counterparts.

From PHP.net (http://www.php.net/manual/en/function.iconv.php)

If you append the string //TRANSLIT to out_charset transliteration is activated. This means that when a character can't be represented in the target charset, it can be approximated through one or several similarly looking characters.

Your source code (and the test string) appear to be in UTF-8, while the file names seem to use a single-byte encoding. I'd suggest you use the same encoding for your replacement strings. To avoid source encoding issues, it'd be better to write accented chars in your code in hex form (like \xE8 for "č" in Windows-1250).
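A sketch of the hex-escape idea, assuming the file names really are Windows-1250 (the byte counts in the question's var_dump output, e.g. 17 bytes for "Jarn� v�hon03.jpg", are consistent with one byte per accented letter). The function name safeName1250 is made up for illustration:

```php
<?php
// Search bytes are Windows-1250 code points written as hex escapes, so they
// do not depend on the encoding this source file is saved in.
// Order: š á ž í ě é ř ň ý č, then space.
$search  = ["\x9A", "\xE1", "\x9E", "\xED", "\xEC", "\xE9", "\xF8", "\xF2", "\xFD", "\xE8", ' '];
$replace = ['s',    'a',    'z',    'i',    'e',    'e',    'r',    'n',    'y',    'c',    '-'];

// Byte-wise replacement on the raw name returned by readdir().
function safeName1250(string $rawName, array $search, array $replace): string
{
    return str_replace($search, $replace, $rawName);
}

// "Jarní výhon03.jpg" as raw Windows-1250 bytes.
echo safeName1250("Jarn\xED v\xFDhon03.jpg", $search, $replace), "\n"; // Jarni-vyhon03.jpg
```

Because the replacement works on raw bytes, it must not be applied to names that are already UTF-8, where these byte values can occur inside multi-byte sequences.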

So I got it working on my Windows XP system with this:

$search = array('š','á','ž','í','e','é','r','n','ý','c',' ');
$replace = array('s','a','z','i','e','e','r','n','y','c','-');

$files = readFiles(".", "jpg");
$len = count($files);

for($i = 0; $i < $len; $i++){
  if(mb_check_encoding($files[$i], 'ASCII')){
    $safe_files[$i] = $files[$i];
  }else{
    $safe_files[$i] = str_replace(
        $search, $replace, iconv("iso-8859-1", "utf-8//TRANSLIT", $files[$i]));
  }
  if($files[$i] != $safe_files[$i]){
    rename($files[$i], $safe_files[$i]);
  }
}

I don't know if it's a coincidence or not, but calling mb_get_info() shows

[internal_encoding] => ISO-8859-1

Here is another function I found helpful on the PHP strtr page

<?php
// Windows-1250 to ASCII
// This function replaces all Windows-1250 accented characters with
// their unaccented equivalents. Useful for Czech and Slovak.

function win2ascii($str)    {   

$str = StrTr($str,
    "\xE1\xE8\xEF\xEC\xE9\xED\xF2",
    "\x61\x63\x64\x65\x65\x69\x6E");

$str = StrTr($str,
    "\xF3\xF8\x9A\x9D\xF9\xFA\xFD\x9E\xF4\xBC\xBE",
    "\x6F\x72\x73\x74\x75\x75\x79\x7A\x6F\x4C\x6C");

$str = StrTr($str,
    "\xC1\xC8\xCF\xCC\xC9\xCD\xC2\xD3\xD8",
    "\x41\x43\x44\x45\x45\x49\x4E\x4F\x52");

$str = StrTr($str,
    "\x8A\x8D\xDA\xDD\x8E\xD2\xD9\xEF\xCF",
    "\x53\x54\x55\x59\x5A\x4E\x55\x64\x44");

return $str;
}
?>

Basically, converting the European characters to an ASCII equivalent wasn't much of a problem, but I could find no reliable way to rename the files (i.e. to reference files with non-ASCII characters in their names).

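To get UTF-8, use the PHP function utf8_encode, which converts ISO-8859-1 to UTF-8 (it is deprecated as of PHP 8.2). On a Western European Windows system the file names come back in a single-byte encoding close to ISO-8859-1, so a conversion is necessary. Note that ISO-8859-1 does not contain Czech letters such as č or ě, so this fits Western European names rather than the ones in the question.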

Example - listing the files in a dir:

<?php
$dir_handle = opendir(".");
while (false !== ($file = readdir($dir_handle)))
{
  echo utf8_encode($file)."<br>";
}
?>

Area5one has it right - it's a problem of different encoding.

When I upgraded my machine from XP to Win7, I also upgraded my version of MySQL and PHP. Somewhere along the way, PHP programs that used to work stopped working. In particular, scandir, readdir and utf-8 had lived happily together, but no longer.

So, I've modified my code. Variables holding data taken from the hard disk end in "_iso" to reflect Windows' ISO-8859-1 encoding; data from the MySQL database goes in variables ending in "_utf". Thus, the code from area5one would look like this:

$dir_handle_iso = opendir(".");
while (false !== ($file_iso = readdir($dir_handle_iso))) {
    $file_utf = utf8_encode($file_iso);
    ...
}

This works for me 100%:

setlocale(LC_ALL,"cs_CZ");
$new_str = iconv("UTF-8","ASCII//TRANSLIT",$orig_str);

$file = mb_convert_encoding($file, 'UTF-8', "iso-8859-1");

This worked for me (Windows, Danish characters).

Licensed under: CC-BY-SA with attribution
