Question

I am using Perl to load some 'macro' files. These macros can, however, be written in various encodings, so a directive is defined for users writing their macros (e.g.

#encoding iso-8859-2

at the beginning of the macro).

Every time this directive is encountered in the macro, a function that sets the encoding is called; it looks something like this:

use IO::Handle;   # provides ->flush() on filehandles (needed on older Perls)

sub change_encoding {
  my ($file_handle, $encoding) = @_;
  $file_handle->flush();                          # flush any pending buffered data
  binmode($file_handle);                          # pop the old I/O layers (back to raw)
  binmode($file_handle, ":encoding($encoding)");  # push the new encoding layer
}

The problem is that when I read the macro using the standard

while($line = <$file_handle>){
  process_macro($line);
}

I get messages saying "utf8 "\xXY" does not map to Unicode", but only if characters with diacritics are near the #encoding directive. I tried several examples and was able to get half of a string as \xXY codes and the other half as correctly decoded characters, like here:

sub macro5_fn {
  print "\xBElu\xBBou\xE8k\xFD k\xF9\xF2 úpěl ďábelské ódy\n";
}

If I put more comments before the function, all the characters are OK:

sub macro5_fn {
  print "žluťoučký kůň úpěl ďábelské ódy\n";
}

Simply put, the number of correctly decoded characters depends on their distance from the #encoding directive; the ones that are close to it are not decoded correctly.

It seems to me that this is an issue of Perl and PerlIO (not) flushing the buffer. Or am I doing something wrong?

Thank you for your answers.


Solution

The problem is that <> reads more than just one line at a time, so the next line or so has already been interpreted under the old encoding before you ever see the #encoding directive for the new one.
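
One way to see this read-ahead for yourself (a hypothetical probe, not part of the original answer; the file name is made up): read a single line through an :encoding layer, then compare its length with the byte position of the underlying file descriptor, which sysseek reports because it bypasses PerlIO's buffer:

use Fcntl qw(SEEK_CUR);

open my $fh, '<:encoding(iso-8859-2)', 'macro.txt' or die "open: $!";
my $first = <$fh>;

# sysseek queries the OS-level file position and ignores PerlIO's buffer;
# mixing it with buffered reads is only safe for a read-only probe like this.
printf "chars in first line: %d, bytes already read from disk: %d\n",
  length($first), sysseek($fh, 0, SEEK_CUR);
# The byte count is typically a whole buffer (e.g. 8192), not one line's worth.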

Your best bet is probably to read the file in binary mode and use the Encode module to decode each line from the current encoding.
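
A minimal sketch of that approach, assuming the directive format from the question and a default of UTF-8 until a directive appears (process_macro is the question's own routine): open the file without an encoding layer so PerlIO never decodes ahead of the directive, and decode each raw line with the Encode module only after deciding which encoding applies:

use Encode qw(decode);

sub process_file {
  my ($path) = @_;
  open my $fh, '<', $path or die "open $path: $!";
  binmode($fh);                 # raw bytes; no decoding happens during read-ahead
  my $encoding = 'utf-8';       # assumed default until a directive is seen

  while (my $raw = <$fh>) {
    if ($raw =~ /^#encoding\s+(\S+)/) {   # the directive is plain ASCII, so matching
      $encoding = $1;                     # the raw bytes is safe for ASCII-compatible
      next;                               # encodings such as iso-8859-2
    }
    process_macro(decode($encoding, $raw));   # decode this one line only
  }
  close $fh;
}

Because each line is decoded only after it has been read, a new #encoding directive takes effect on the very next line; there is no lookahead buffer holding bytes that were decoded under the wrong encoding.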

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow