Question

I have just started working on a Perl application and need some advice on how to correctly deal with Unicode filenames versus filenames stored in file content, in a portable way.

There are several issues: the Windows and Unix worlds use different Unicode encodings for filenames (Unixes use UTF-8; Windows, I don't know), and Linux and Mac OS X use different Unicode normalization for filenames (OS X enforces NFD, Linux "usually" uses NFC).

All the advice I have read so far says to always normalize Unicode data at the boundaries of the application. But the question is: what is the correct, most portable way to do that?

The problem is that OS X (when creating text files) uses NFC for the content. I don't know what other systems use.

So the question is: what is the correct method for writing portable applications that deal with filenames in:

  • opendir/readdir
  • glob and similar "file-operations"
  • textfiles (what will contain filenames)
  • perl internals...
  • other?

When and where should normalization happen? How should I save UTF-8 text files that contain filenames in their content?
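To make the text-file part concrete, here is roughly what I have in mind. This is only a sketch, and the helper name is my own invention: normalize filenames to one form (NFC here) before writing them through an explicit UTF-8 encoding layer.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Illustrative helper: write (already decoded) filenames into a
# manifest file, normalized to NFC and encoded as UTF-8 on disk.
sub write_filename_manifest {
    my ($manifest, @names) = @_;
    open(my $fh, '>:encoding(UTF-8)', $manifest) or die $!;
    print {$fh} NFC($_), "\n" for @names;
    close($fh) or die $!;
}

# An NFD spelling ("U" + COMBINING DIAERESIS) is stored as composed NFC:
write_filename_manifest('manifest.txt', "U\x{308}bersee.txt");
```

Is something along these lines the right idea, or should the normal form be chosen per platform?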

I know there are already many Perl/Unicode-related questions on StackOverflow. I have dug through most of them, but I still don't understand what the "recommended" practice is for dealing with the list of questions above.

Will I need to write my own modules to deal with the differences between specific operating systems? Or are there already CPAN modules that handle OS differences in file operations?

Can somebody point me to a good resource with recommended practices? Or is it much simpler than I currently think?


Solution

Note: Asking for off-site resources is not encouraged on StackOverflow. Also, the question of how to normalize Unicode text in general is too broad.

Regarding filenames returned from readdir or glob, it's good practice to decode and normalize them. Consider the following code:

#!/usr/bin/perl
use strict;
use warnings;
use utf8;                  # string literals below contain non-ASCII

use File::Slurp;
use Unicode::Normalize;

binmode(STDOUT, ':utf8');  # STDOUT receives character strings

write_file("Unicode Test - Übersee.txt", "text");

opendir(my $dh, ".") or die($!);
while (my $entry = readdir($dh)) {
    utf8::decode($entry);  # raw bytes from readdir -> character string

    if ($entry =~ /^Unicode Test - (.*)\.txt/) {
        my $word = $1;
        print("got $word\n");
        print("matches 'Übersee': ", $word eq "Übersee" ? "yes" : "no", "\n");
        my $nfc = NFC($word);
        print("NFC matches 'Übersee': ", $nfc eq "Übersee" ? "yes" : "no", "\n");
    }
}
closedir($dh);

On OS X, this will output:

got Übersee
matches 'Übersee': no
NFC matches 'Übersee': yes

This is due to the variant of NFD that HFS+ uses to normalize filenames.

In essence, normalize all input from sources where you can't be sure that it's in normal form. In most cases, you should use NFC because most data will be in NFC already.
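One way to apply that advice is a small boundary function that every filename from the OS passes through. This is a sketch, not a standard API, and the function name is my own; it also assumes the OS hands back UTF-8 bytes, which is typical on Unix-like systems but not guaranteed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical boundary function: turn raw filename bytes from
# readdir/glob into an NFC-normalized Perl character string.
sub from_os {
    my ($raw) = @_;
    utf8::decode($raw);   # UTF-8 bytes -> Perl character string
    return NFC($raw);     # pick one normal form at the boundary
}

# NFD bytes ("U" + combining diaeresis, 0xCC 0x88) come out composed:
my $name = from_os("U\xCC\x88bersee.txt");
```

With this in place, the rest of the program only ever compares NFC strings, regardless of which platform produced the names.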

OTHER TIPS

As far as I can tell, MS enforces no normalization on its file system. This means that if you plan for this worst-case scenario, you'll be fine on the other OSes.

A technique that seems to work is to query the OS for the files it sees: build a hash keyed on the normalized form of your choice, with the names the OS actually reports as the values. It's not elegant, but it works.
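That lookup-table idea could be sketched as follows (the function name is illustrative, and decoding the entries as UTF-8 is an assumption): lookups become normalization-insensitive, while file operations still use the OS's own spelling.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Map each directory entry's NFC form to the exact name the OS
# reported, so "Übersee" matches whether the disk stores NFC or NFD.
sub build_nfc_index {
    my ($dir) = @_;
    my %by_nfc;
    opendir(my $dh, $dir) or die $!;
    while (my $entry = readdir($dh)) {
        utf8::decode($entry);            # assume UTF-8 filenames
        $by_nfc{ NFC($entry) } = $entry; # NFC key -> OS's own name
    }
    closedir($dh);
    return \%by_nfc;
}

# Usage: normalize the query the same way, then look it up.
my $index   = build_nfc_index('.');
my $os_name = $index->{ NFC("U\x{308}bersee.txt") };  # undef if absent
```

The value (not the key) is what you pass back to open, unlink, and friends, since it is the byte-for-byte name the file system knows.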

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow