Question

On OS X (PHP 5.2.11) I have a file, siësta.doc (and a thousand others with Unicode filenames), and I want to convert the file names to a web-consumable format (a-zA-Z0-9.). If I hardcode the file name above, the conversion works as expected:

<?php
  $file = 'siësta.doc';
  echo preg_replace("/[^a-zA-Z0-9.]/u", '_', $file);
  // Output: si_sta.doc
?>

But if I read the file names with scandir, I get strange conversions:

<?php
  $files = scandir(DIRNAME);
  foreach ($files as $file) {
    echo preg_replace("/[^a-zA-Z0-9.]/u", '_', $file);
    // Output for the file above: sie_sta.doc
  }
?>

I tried to detect the encoding, set the encoding, and convert it with the iconv functions. I also tried the mb_ functions, but that only made things worse. What am I doing wrong?

Thanks in advance

Solution

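Interesting. After a bit of research, I found that OS X stores filenames as "decomposed Unicode" (see http://developer.apple.com/mac/library/qa/qa2001/qa1173.html). That is, "ë" is stored as "e" followed by a combining diaeresis (U+0308, which is the bytes 0xCC 0x88 in UTF-8). The regex therefore keeps the plain "e" and replaces only the combining mark, which is why scandir gives you sie_sta.doc.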
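
One way to get the same result for both forms is to treat a base character plus its combining marks as a single unit before applying the original regex. The following is only a sketch: the helper name web_safe_name is made up for illustration, and it assumes PCRE was built with Unicode property support (\p{M}).

```php
<?php
// Hypothetical helper, not part of the original question or answer.
function web_safe_name($name)
{
    // Collapse a base character followed by combining marks
    // (e.g. "e" + U+0308) into one underscore, so the decomposed
    // form is treated like the precomposed "ë".
    $name = preg_replace('/\P{M}\p{M}+/u', '_', $name);
    // Replace anything else outside a-zA-Z0-9. with an underscore.
    return preg_replace('/[^a-zA-Z0-9.]/u', '_', $name);
}

$decomposed  = "sie\xCC\x88sta.doc"; // the form scandir returns on OS X
$precomposed = "si\xC3\xABsta.doc";  // the form a hardcoded literal usually has

echo web_safe_name($decomposed), "\n";  // si_sta.doc
echo web_safe_name($precomposed), "\n"; // si_sta.doc
```

If the intl extension is available (bundled from PHP 5.3, so probably not on a stock 5.2.11), Normalizer::normalize($name, Normalizer::FORM_C) is another option: it recomposes "e" + U+0308 into "ë" before the regex runs.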

OTHER TIPS

Did you try utf8_encode? (It works on Windows, at least.)

<?php
  $files = scandir(DIRNAME);
  foreach ($files as $file) {
    echo preg_replace("/[^a-zA-Z0-9.]/u", '_', utf8_encode($file));
    // utf8_encode assumes ISO-8859-1 input, so this only helps when scandir returns Latin-1 names
  }
?>

The problem is the communication between Windows and PHP. It is not possible to get Unicode filenames, because the names PHP receives depend on the Windows language setting for non-Unicode applications.

The best solution is to run a dir command and process its output, but you must do so through cmd and ask for the Windows short (8.3) names:

chcp 65001
dir /x c:\test\ > myinfo.txt

myinfo.txt then contains something like:

 El volumen de la unidad C es Windows8_OS
 El número de serie del volumen es: 14A3-025F

 Directorio de C:\test

22/12/2015  22:11    <DIR>                       .
22/12/2015  22:11    <DIR>                       ..
22/12/2015  22:12                 0              a.txt
22/12/2015  22:10    <DIR>                       English
22/12/2015  22:10    <DIR>          ESPAOL~1     Español
22/12/2015  22:11    <DIR>          8311~1       ру́сский язы́к
22/12/2015  22:10    <DIR>          _0B41~1      عربي ,عربى
22/12/2015  22:10    <DIR>          8F4C~1       北方話
               1 archivos              0 bytes
               7 dirs  839.672.786.944 bytes libres

You can then read myinfo.txt to map each original name to its Windows short name.

Some PHP functions work fine with short names, and you can build an array keyed by short name for when you need to display the original name:

$array['short_name']= $original_name;

For example, is_dir and is_file work fine. However, scandir and is_readable fail even with short names. To cover those cases, rerun the dir command recursively.

To extract the information from the text file, you can use a regular expression or substr, discarding the first five lines and the last two. For example:

for($k=6;$k<(count($array)-2);$k++) ...
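
The regular-expression variant of that parsing step could look roughly like the sketch below. The function name parse_dir_x is hypothetical. It assumes the dd/mm/yyyy and 24-hour layout of the listing above, and it guesses that a column is a short name when it looks like NAME~N or NAME~N.EXT, so a long name that happens to start that way would be misread.

```php
<?php
// Hypothetical helper, not from the original answer: short name => original name.
function parse_dir_x(array $lines)
{
    $map = array();
    foreach ($lines as $line) {
        // An entry line is: date, time, "<DIR>" or a size, then the name column(s).
        $entry = '/^\d{2}\/\d{2}\/\d{4}\s+\d{2}:\d{2}\s+(?:<DIR>|[\d.,]+)\s+(.+)$/';
        if (!preg_match($entry, rtrim($line), $m)) {
            continue; // header, summary, or blank line
        }
        $rest = $m[1];
        // Heuristic: a short-name column looks like XXXXXX~N or XXXXXX~N.EXT.
        if (preg_match('/^([^\s~]{1,8}~\d+(?:\.\S{1,3})?)\s+(.+)$/', $rest, $n)) {
            $map[$n[1]] = $n[2];
        } elseif ($rest !== '.' && $rest !== '..') {
            // No short name listed: the name itself is used as the key.
            $map[$rest] = $rest;
        }
    }
    return $map;
}

$sample = array(
    ' Directorio de C:\test',
    '',
    '22/12/2015  22:11    <DIR>                       .',
    '22/12/2015  22:12                 0              a.txt',
    '22/12/2015  22:10    <DIR>          ESPAOL~1     Español',
    '               1 archivos              0 bytes',
);
print_r(parse_dir_x($sample));
```

With the real file, the call would be something like parse_dir_x(file('myinfo.txt')). Matching on the entry pattern instead of counting header and footer lines avoids depending on exactly how many lines dir prints around the listing.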
Licensed under: CC-BY-SA with attribution