Question

I have a reasonably sized flat-file database of text documents, mostly saved in 8859 format, which have been collected through a web form (using Perl scripts). Up until recently I was handling the common 1252 characters (curly quotes, apostrophes, etc.) with a simple set of regexes:

$line =~ s/\x91/&#8216;/g; # smart apostrophe left
$line =~ s/\x92/&#8217;/g; # smart apostrophe right

... etc.

However, since I decided I ought to be going Unicode and converted all my scripts to read in and output UTF-8 (which works a treat for all new material), the regexes for these existing 1252 characters no longer work, and my Perl HTML output emits the four characters '\x92', '\x93', etc. literally. At least, that's how it appears in a browser in UTF-8 mode; downloading the file (by FTP, not HTTP) and opening it in a text editor (TextPad) is different, with a single undefined character remaining, and opening the output file in Firefox's default 8859 mode (no content-type header) renders the correct character.

The new utf8 pragmas at the start of the script are:

use CGI qw(-utf8);
use open IO => ':utf8';

I understand this is because UTF-8 mode makes those characters double-byte instead of single-byte, and that this applies to characters in the 0x80 to 0xFF range, having read the Wikibooks article on the subject; however, I was none the wiser as to how to filter them. Ideally I know I ought to resave all the documents in UTF-8 (since the flat-file database now contains a mixture of 8859 and UTF-8), but I will need some kind of filter in the first place if I'm going to do that anyway.

And I could be wrong about the two-byte internal storage, since the article did seem to imply that Perl handles things very differently depending on the circumstances.

If anybody could provide me with a regex solution, or some other method, I would be very grateful. I have been tearing my hair out for weeks on this with various attempts and failed hacking. There are only about six 1252 characters that commonly need replacing, and with a filter method I could resave the whole flippin' lot in UTF-8 and forget there ever was a 1252...

Solution

Ikegami already mentioned the Encoding::FixLatin module.

Another way to do it, if you know that each string will be either UTF-8 or CP1252, but not a mixture of both, is to read it as a binary string and do:

unless ( utf8::decode($string) ) {
    require Encode;
    $string = Encode::decode(cp1252 => $string);
}

Compared to Encoding::FixLatin, this has two small advantages: a slightly lower chance of misinterpreting CP1252 text as UTF-8 (because the entire string must be valid UTF-8) and the possibility of replacing CP1252 with some other fallback encoding. A corresponding disadvantage is that this code could fall back to CP1252 on strings that are not entirely valid UTF-8 for some other reason, such as because they were truncated in the middle of a multi-byte character.
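
For example, a minimal sketch (the helper name is mine, not part of any module) that wraps this check into a reusable filter might look like:

use strict;
use warnings;
use Encode ();

# Hypothetical helper: keep the string as UTF-8 if it decodes cleanly,
# otherwise assume it is CP1252 and decode it as such.
sub bytes_to_text {
    my ($bytes) = @_;
    return $bytes if utf8::decode($bytes);   # decoded in place on success
    return Encode::decode( cp1252 => $bytes );
}

my $text = bytes_to_text("\x91mixed legacy input\x92");

Running every stored document through a helper like this and writing the result back out through a UTF-8 file handle is one way to retire the CP1252 data for good.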

OTHER TIPS

Encoding::FixLatin was specifically written to help fix data broken in the same manner as yours.
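
A minimal usage sketch, assuming the module is installed (fix_latin() is its main exported function):

use strict;
use warnings;
use Encoding::FixLatin qw(fix_latin);

# $octets mixes a CP1252 smart quote (\x91) with a valid UTF-8 sequence;
# fix_latin() converts the former and passes the latter through unchanged.
my $octets = "\x91 Foo \xE2\x98\xBA";
my $string = fix_latin($octets);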

You could also use Encode.pm's support for fallback.

use Encode qw[decode];

my $octets = "\x91 Foo \xE2\x98\xBA \x92";
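
# The third argument to decode() is a fallback handler: Encode calls it with
# the ordinal value of each octet that is not valid UTF-8, and here each such
# octet is reinterpreted as Windows-1252.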
my $string = decode('UTF-8', $octets, sub {
    my ($ordinal) = @_;
    return decode('Windows-1252', pack 'C', $ordinal);
});

printf "<%s>\n", 
  join ' ', map { sprintf 'U+%.4X', ord $_ } split //, $string;

Output:

<U+2018 U+0020 U+0046 U+006F U+006F U+0020 U+263A U+0020 U+2019>

Did you recode the data files? If not, opening them as UTF-8 won't work. You can simply open them as

open $filehandle, '<:encoding(cp1252)', $filename or die ...;

and everything (tm) should work.
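
For instance, a rough sketch of recoding one legacy file (the file names are placeholders) could be:

use strict;
use warnings;

open my $in,  '<:encoding(cp1252)', 'doc.txt'      or die "read: $!";
open my $out, '>:encoding(UTF-8)',  'doc.txt.utf8' or die "write: $!";
print {$out} $_ while <$in>;   # decoded from CP1252, re-encoded as UTF-8
close $out or die "close: $!";

Once a file has been recoded this way, it can be opened with '<:encoding(UTF-8)' like the rest of the data.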

If you did recode, something seems to have gone wrong, and you need to analyze what it is and fix it. I recommend using hexdump to find out what is actually in a file. Text consoles and editors sometimes lie to you; hexdump never lies.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow