What is the right way to get a grapheme?

https://stackoverflow.com/questions/9428891

12-11-2019

Question

Why does this print a U and not a Ü?

#!/usr/bin/env perl
use warnings;
use 5.014;
use utf8;
binmode STDOUT, ':utf8';
use charnames qw(:full);

my $string = "\N{LATIN CAPITAL LETTER U}\N{COMBINING DIAERESIS}";

while ( $string =~ /(\X)/g ) {
        say $1;
}

# Output: U

Solution

Your code is correct.

You really do need to play these things by the numbers; don’t trust what a "terminal" displays. Pipe it through the uniquote program, probably with -x or -v, and see what it is really doing.

Eyes deceive, and programs are even worse. Your terminal program is buggy, so it is lying to you. Normalization shouldn’t matter: \X matches the whole grapheme whether the string is composed or decomposed.

$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say "crème brûlée"'
crème brûlée
$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say "crème brûlée"' | uniquote -x
cr\x{E8}me br\x{FB}l\x{E9}e
$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say NFD "crème brûlée"' 
crème brûlée
$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say NFD "crème brûlée"' | uniquote -x
cre\x{300}me bru\x{302}le\x{301}e

$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say NFC scalar reverse NFD "crème brûlée"' 
éel̂urb em̀erc
$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say NFC scalar reverse NFD "crème brûlée"' | uniquote -x
\x{E9}el\x{302}urb em\x{300}erc
$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say scalar reverse NFD "crème brûlée"'
éel̂urb em̀erc
$ perl -CS -Mutf8 -MUnicode::Normalize -E 'say scalar reverse NFD "crème brûlée"' | uniquote -x
e\x{301}el\x{302}urb em\x{300}erc
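
If uniquote is not at hand, the same by-the-numbers check can be sketched in plain Perl. The helper name codepoints is made up for this illustration; it lists every code point in a string, so the result does not depend on how the terminal renders combining marks.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper (not part of any module): render a string
# as a space-separated list of U+XXXX code points.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $str;
}

my $nfc = NFC("cr\x{E8}me");   # precomposed e-grave
my $nfd = NFD($nfc);           # e followed by COMBINING GRAVE ACCENT

print codepoints($nfc), "\n";  # U+0063 U+0072 U+00E8 U+006D U+0065
print codepoints($nfd), "\n";  # U+0063 U+0072 U+0065 U+0300 U+006D U+0065
```

The output is pure ASCII, so no binmode or -CS setting is needed to read it reliably.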

Other tips

This works for me, though I have an older version of Perl, 5.12, on Ubuntu. My only change to your script is: use 5.012;

$ perl so.pl 
Ü

May I suggest that it's the displayed output that is incorrect? It's easy to check: replace your loop code with:

my $counter;
while ( $string =~ /(\X)/g ) {
  say ++$counter, ': ', $1;
}

... and see how many times the regex matches. My guess is that it will still match only once.

Alternatively, you can use this code:

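sub codepoint_hex {
    # $1 is already a decoded character string, so decoding it again is not needed.
    # List every code point in the grapheme, not only the first one.
    return join ' ', map { sprintf "%04x", ord($_) } split //, shift;
}

... and then print codepoint_hex($1) instead of plain $1 within the while loop.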

1) Apparently, your terminal can't render combining characters correctly. On my terminal, it prints:

U¨

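2) \X doesn't do what you think it does. It matches an extended grapheme cluster, that is, a base character together with any combining marks that follow it; it does not merge them into a single code point. If you use the string "fu\N{COMBINING DIAERESIS}r", your program displays: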

f
u¨
r

Note how the diacritic mark isn't printed alone but with its corresponding character.

3) To compose each base character and its combining marks into a single precomposed code point (where one exists), use NFC from the module Unicode::Normalize:

use Unicode::Normalize;

my $string = "fu\N{COMBINING DIAERESIS}r";
$string = NFC($string);

while ( $string =~ /(\X)/g ) {
    say $1;
}

It displays:

f
ü
r
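
The difference between grouping with \X and composing with NFC can be checked by counting. This is a minimal sketch; the helper name grapheme_count is made up for the illustration. It compares the number of code points (length) with the number of \X matches, before and after NFC.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

# Hypothetical helper: count extended grapheme clusters with \X.
sub grapheme_count {
    my ($str) = @_;
    my $count = () = $str =~ /\X/g;
    return $count;
}

my $decomposed = "fu\x{308}r";      # f, u, COMBINING DIAERESIS, r
my $composed   = NFC($decomposed);  # f, U+00FC, r

printf "decomposed: %d code points, %d graphemes\n",
    length($decomposed), grapheme_count($decomposed);  # 4 code points, 3 graphemes
printf "composed:   %d code points, %d graphemes\n",
    length($composed), grapheme_count($composed);      # 3 code points, 3 graphemes
```

The grapheme count is the same in both forms; only the number of code points changes, which is why \X works with or without normalization.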

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow