Question

I have just started working on a Perl application and need some advice on how to correctly deal with Unicode filenames versus filenames stored in file content, in a portable way.

There are several issues: the Windows and Unix worlds use different Unicode encodings for filenames (Unixes use UTF-8; Windows, I don't know), and Linux and Mac OS X use different Unicode normalization for filenames (OS X enforces NFD, Linux "usually" uses NFC).

All the advice I have read so far says to always normalize Unicode data at the boundaries of the application. But the question is: what is the correct, most portable way to do that?

The problem is that OS X (when creating text files) uses NFC for the content. I don't know what other systems use.

So the question is: what is the correct method for writing portable applications that deal with filenames in:

  • opendir/readdir
  • glob and similar "file-operations"
  • textfiles (what will contain filenames)
  • perl internals...
  • other?

When and where should normalization happen? How should I save UTF-8 text files that contain filenames in their content?
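To make the text-file part concrete, here is roughly what I have in mind. This is only a sketch, and the helper name is my own invention: normalize filenames to one form (NFC here) before writing them through an explicit UTF-8 encoding layer.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Illustrative helper: write (already decoded) filenames into a
# manifest file, normalized to NFC and encoded as UTF-8 on disk.
sub write_filename_manifest {
    my ($manifest, @names) = @_;
    open(my $fh, '>:encoding(UTF-8)', $manifest) or die $!;
    print {$fh} NFC($_), "\n" for @names;
    close($fh) or die $!;
}

# An NFD spelling ("U" + COMBINING DIAERESIS) is stored as composed NFC:
write_filename_manifest('manifest.txt', "U\x{308}bersee.txt");
```

Is something along these lines the right idea, or should the normal form be chosen per platform?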

I know there are already many Perl/Unicode-related questions on StackOverflow. I have dug through most of them, but I still don't understand what the "recommended" practice is for dealing with the list of questions above.

Will I need to write my own modules to deal with the differences between specific operating systems? Or are there already CPAN modules that handle OS differences in file operations?

Can somebody point me to a good resource with recommended practices? Or is it much simpler than I currently think?


Solution

Note: Asking for off-site resources is not encouraged on StackOverflow. Also, the question of how to normalize Unicode text in general is too broad.

Regarding filenames returned from readdir or glob, it's good practice to decode and normalize them. Consider the following code:

#!/usr/bin/perl
use strict;
use warnings;
use utf8;                  # string literals below contain non-ASCII

use File::Slurp;
use Unicode::Normalize;

binmode(STDOUT, ':utf8');  # STDOUT receives character strings

write_file("Unicode Test - Übersee.txt", "text");

opendir(my $dh, ".") or die($!);
while (my $entry = readdir($dh)) {
    utf8::decode($entry);  # raw bytes from readdir -> character string

    if ($entry =~ /^Unicode Test - (.*)\.txt/) {
        my $word = $1;
        print("got $word\n");
        print("matches 'Übersee': ", $word eq "Übersee" ? "yes" : "no", "\n");
        my $nfc = NFC($word);
        print("NFC matches 'Übersee': ", $nfc eq "Übersee" ? "yes" : "no", "\n");
    }
}
closedir($dh);

On OS X, this will output:

got Übersee
matches 'Übersee': no
NFC matches 'Übersee': yes

This is due to the variant of NFD that HFS+ uses to normalize filenames.

In essence, normalize all input from sources where you can't be sure that it's in normal form. In most cases, you should use NFC because most data will be in NFC already.
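One way to apply that advice is a small boundary function that every filename from the OS passes through. This is a sketch, not a standard API, and the function name is my own; it also assumes the OS hands back UTF-8 bytes, which is typical on Unix-like systems but not guaranteed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical boundary function: turn raw filename bytes from
# readdir/glob into an NFC-normalized Perl character string.
sub from_os {
    my ($raw) = @_;
    utf8::decode($raw);   # UTF-8 bytes -> Perl character string
    return NFC($raw);     # pick one normal form at the boundary
}

# NFD bytes ("U" + combining diaeresis, 0xCC 0x88) come out composed:
my $name = from_os("U\xCC\x88bersee.txt");
```

With this in place, the rest of the program only ever compares NFC strings, regardless of which platform produced the names.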

OTHER TIPS

As far as I can tell, MS enforces no normalization on its file system. This means that if you plan for this worst-case scenario, you'll be fine on the other OSes.

A technique that seems to work is to query the OS for the files it sees: build a hash keyed on the normalized form of your choice, with the names the OS actually reports as the values. It's not elegant, but it works.
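That lookup-table idea could be sketched as follows (the function name is illustrative, and decoding the entries as UTF-8 is an assumption): lookups become normalization-insensitive, while file operations still use the OS's own spelling.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Map each directory entry's NFC form to the exact name the OS
# reported, so "Übersee" matches whether the disk stores NFC or NFD.
sub build_nfc_index {
    my ($dir) = @_;
    my %by_nfc;
    opendir(my $dh, $dir) or die $!;
    while (my $entry = readdir($dh)) {
        utf8::decode($entry);            # assume UTF-8 filenames
        $by_nfc{ NFC($entry) } = $entry; # NFC key -> OS's own name
    }
    closedir($dh);
    return \%by_nfc;
}

# Usage: normalize the query the same way, then look it up.
my $index   = build_nfc_index('.');
my $os_name = $index->{ NFC("U\x{308}bersee.txt") };  # undef if absent
```

The value (not the key) is what you pass back to open, unlink, and friends, since it is the byte-for-byte name the file system knows.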

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow