Question

I have a Perl script that is being called by third parties to send me names of people who have registered my software. One of these parties encodes the names in UTF-8, so I have adapted my script accordingly to decode UTF-8 to ASCII with Encode::decode_utf8(...).

This usually works fine, but every 6 months or so one of the names contains Cyrillic, Greek or Romanian characters, so decoding the name results in garbage characters such as "ПодражанÑкаÑ". I then have to follow up with the customer and ask him for a "Latin character version" of his name in order to issue a registration code.

So, is there any Perl module that can detect whether there are such characters and automatically translate them to their closest ASCII representation if necessary?

It seems that I can use Lingua::Cyrillic::Translit::ICAO plus Lingua::DetectCharset to handle Cyrillic, but I would prefer something that works with other character sets as well.

Solution

I believe you could use Text::Unidecode for this; transliterating Unicode text to plain ASCII is precisely what it tries to do.
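
A minimal sketch of how this might look, assuming the name arrives as raw UTF-8 bytes and that Text::Unidecode is installed from CPAN. The helper name ascii_name and the sample bytes are hypothetical, chosen only for illustration. The input is decoded strictly first, and unidecode is called only when the decoded name contains non-ASCII characters.

```perl
use strict;
use warnings;
use Encode ();
use Text::Unidecode;    # exports unidecode() by default

# Hypothetical helper: takes raw UTF-8 bytes, returns an ASCII-only string.
sub ascii_name {
    my ($bytes) = @_;

    # Decode a copy, because a strict check mode may modify its source.
    # FB_CROAK makes decode() die on malformed UTF-8 instead of guessing.
    my $copy = $bytes;
    my $name = Encode::decode('UTF-8', $copy, Encode::FB_CROAK);

    # A name that is already pure ASCII needs no transliteration.
    return $name unless $name =~ /[^\x00-\x7F]/;

    # Otherwise let Text::Unidecode pick an ASCII approximation
    # for each character.
    return unidecode($name);
}

# UTF-8 bytes for three Cyrillic letters (sample input for illustration).
my $bytes = "\xD0\x9F\xD0\xBE\xD0\xB4";
my $ascii = ascii_name($bytes);
print "$ascii\n";
```

Wrapping the call in eval { ... } would let the script catch malformed input and fall back to asking the customer, rather than dying.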

OTHER TIPS

If you have to deal with UTF-8 data outside the ASCII range, your best bet is to change your backend so it doesn't choke on UTF-8. How would you go about transliterating kanji characters?

If you get Cyrillic text, there is no "closest ASCII representation" for many characters.

In the documentation for Text::Unidecode, under "Caveats", this sentence appears to be misleading:

Make sure that the input data really is a utf8 string.

UTF-8 is a variable-length byte encoding, whereas Text::Unidecode expects a decoded Perl character string, in which each element is one Unicode character rather than one byte. So that sentence would be clearer as:

Make sure that the input data really is a decoded Unicode character string, not raw UTF-8 bytes.

Perl's documentation calls this a character string, as opposed to a byte string.

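If you want to convert strings that really are UTF-8 encoded bytes, decode them first, like so:

use Text::Unidecode;
# utf8::decode() converts the bytes to characters in place and returns false on malformed input
my $decode_status = utf8::decode($input_to_be_converted);
my $converted_string = unidecode($input_to_be_converted);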
Licensed under: CC-BY-SA with attribution