Note: Asking for off-site resources is discouraged on Stack Overflow, and the question of how to normalize Unicode text in general is too broad. This answer covers the filename case.
Regarding filenames returned from readdir or glob: it's good practice to decode and normalize them. Consider the following code:
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use File::Slurp;
use Unicode::Normalize;

binmode(STDOUT, ':utf8');

# Create a file whose name contains a non-ASCII character (NFC in the source).
write_file("Unicode Test - Übersee.txt", "text");

opendir(my $dh, ".") or die($!);
while (my $entry = readdir($dh)) {
    utf8::decode($entry);   # bytes from the filesystem -> characters
    if ($entry =~ /^Unicode Test - (.*)\.txt/) {
        my $word = $1;
        print("got $word\n");
        print("matches 'Übersee': ", $word eq "Übersee" ? "yes" : "no", "\n");
        my $nfc = NFC($word);
        print("NFC matches 'Übersee': ", $nfc eq "Übersee" ? "yes" : "no", "\n");
    }
}
closedir($dh);
On OS X, this will output:
got Übersee
matches 'Übersee': no
NFC matches 'Übersee': yes
This is because HFS+ stores filenames in a variant of NFD, so the name read back from disk is decomposed even though the script created it in NFC.
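The difference is visible at the code point level. As a quick illustration (independent of any filesystem), the precomposed Ü is a single code point, while its decomposed form is two, and a plain string comparison between the two forms fails until you normalize:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFC NFD);

binmode(STDOUT, ':utf8');

my $nfc = "\x{00DC}";   # LATIN CAPITAL LETTER U WITH DIAERESIS (one code point)
my $nfd = NFD($nfc);    # "U" followed by COMBINING DIAERESIS (two code points)

printf("NFC: %d code point(s)\n", length($nfc));   # 1
printf("NFD: %d code point(s)\n", length($nfd));   # 2

# eq compares code point sequences, so the forms are not equal as-is.
print("equal as-is: ",     ($nfd eq $nfc      ? "yes" : "no"), "\n");  # no
print("equal after NFC: ", (NFC($nfd) eq $nfc ? "yes" : "no"), "\n");  # yes
```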
In short: normalize all input from sources where you can't be sure of its normalization form. In most cases you should normalize to NFC, because most data is already in NFC.
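That advice can be sketched as a small helper. The name normalize_name is hypothetical, and it assumes the filesystem hands back UTF-8 bytes (true on OS X; on other systems the encoding may differ):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical helper: turn a raw byte string from readdir/glob
# (assumed UTF-8) into NFC-normalized characters.
sub normalize_name {
    my ($bytes) = @_;
    utf8::decode($bytes);   # decode in place: bytes -> characters
    return NFC($bytes);
}

# Usage: normalize every directory entry before matching or comparing.
opendir(my $dh, ".") or die($!);
my @names = map { normalize_name($_) } readdir($dh);
closedir($dh);
```

With this in place, the `$word eq "Übersee"` comparison from the example above succeeds regardless of which form the filesystem used.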