Question

Without looping over the entire range of Unicode characters, how can I get a list of characters that have a given property? In particular I want a list of all characters that are digits (i.e. those that match /\d/). I have looked at Unicode::UCD, and it is useful for determining the properties of a given character, but there doesn't seem to be a way to get a list characters that have a property out of it.

Was it helpful?

Solution

The list of Unicode characters for each class is generated from the Unicode spec when you compile Perl, and is typically stored in /usr/lib/perl-YOURPERLVERSION/unicore/lib/gc_sc/

For example, the list of Unicode character ranges that match IsDigit (a.k.a. \d) is stored in the file /usr/lib/perl-YOURPERLVERSION/unicore/lib/gc_sc/Digit.pl

OTHER TIPS

Even better than unicore/lib/gc_sc/Digit.pl is unicore/To/Digit.pl. It is a direct mapping of Unicode digit characters (well, really their offsets) to their numeric values. This means instead of:

use Unicode::Digits qw/digit_to_int/;

my @digits;
for (split "\n", require "unicore/lib/gc_sc/Digit.pl") {
    my ($s, $e) = map hex, split;
    for (my $ord = $s; $ord <= $e; $ord++) {
        my $chr = chr $ord;
        push @{$digits[digits_to_int $chr]}, $chr;
    }
}

for my $i (0 .. 9) {
    my $re = join '', "[", @{$digits[$i]}, "]";
    $digits[$i] = qr/$re/;
}

I can say:

my @digits;
for (split "\n", require "unicore/To/Digit.pl") {
    my ($ord, $val) = split;
    my $chr = chr hex $ord;
    push @{$digits[$val]}, $chr;
}

for my $i (0 .. 9) {
    my $re = join '', "[", @{$digits[$i]}, "]";
    $digits[$i] = qr/$re/;
}

Or even better:

my @digits;
for (split "\n", require "unicore/To/Digit.pl") {
    my ($ord, $val) = split;
    $digits[$val] .= "\\x{$ord}";
}
@digits = map { qr/[$_]/ } @digits;

which characters /\d/ match depends entirely on your regexp implementation (although standard 0-9 are guaranteed). In the case of perl the perl locale used defines which characters are considered alphabetic and digits.

There is no way to do that without iterating through all the characters. (if you create a huge string with all of them and use a regexp you still have to do the loop at least once, to create the string).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top