The suggestion I would make depends very much on the actual problem you're trying to solve. Looking at this question in isolation, I would drop most of the encoding/decoding 'magic' and simply work with the raw bytes, since the script doesn't need to know anything about the characters themselves for this task. The following produces the expected result for the input and output you described:
use v5.014;
use warnings;
use autodie;
use Carp::Always;
use Tie::File;
my $file_in = 'test_in.txt';
my $file_out = 'test_tie.txt';
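# Start from an empty output file. The -e guard is there because, under
# autodie, a failed unlink (such as a missing file) is fatal.
unlink $file_out if -e $file_out;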
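# No encoding layers anywhere: Tie::File reads and writes raw bytes.
tie my @tied, 'Tie::File', $file_out, recsep => "\x0D\x0A" or die 'tie failed';
# Copy the input line by line; Tie::File appends the CRLF record separator.
open my $fh, '<', $file_in;
while (my $line = <$fh>) {
    chomp $line;
    push @tied, $line;
}
close $fh;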
my $i = 0;
say $i++ . ' ' . $_ foreach @tied;
untie @tied;
However, you probably do want to do some processing on that text in the middle, in which case you want decoded characters. As I see it there are two options:
- Encode manually before handing off to the tied array
- Figure out what the issue is with Tie::File
The second option is probably non-trivial: from a quick scan of the Tie::File source, it looks like it assumes it will always be given bytes. The only part that you can seemingly affect is the binmode at https://metacpan.org/source/TODDR/Tie-File-0.98/lib/Tie/File.pm#L111, which you are already doing.
Tie::File makes a lot of seek calls, and perldoc has this to say about seek ( http://perldoc.perl.org/functions/seek.html ):
Note the in bytes: even if the filehandle has been set to operate on characters (for example by using the :encoding(utf8) open layer), tell() will return byte offsets, not character offsets (because implementing that would render seek() and tell() rather slow).
So it appears that Tie::File is using character lengths to determine its byte offsets for records. Therefore it can end up in the middle of a UTF-8 character sequence. This seems a likely cause for your errors.
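To make the mismatch concrete, here is a minimal sketch of how a character count and a byte count diverge for the same text. It does not reproduce Tie::File's internals; the sample string and variable names are purely illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A decoded string: U+00E9 (e with acute) is one character but two bytes in UTF-8.
my $chars = "caf\x{E9}";
my $bytes = encode('UTF-8', $chars);

my $char_len = length $chars;   # counts characters
my $byte_len = length $bytes;   # counts bytes

print "characters: $char_len, bytes: $byte_len\n";

# An offset derived from the character count stops one byte short of the
# real end of the record, in the middle of the two-byte sequence.
my $truncated = substr $bytes, 0, $char_len;
printf "last byte at character-based offset: 0x%02X\n", ord(substr($truncated, -1));
```

Any code that seeks by the character count lands on the lead byte 0xC3 here rather than at the end of the record, which is the kind of split sequence that would produce malformed UTF-8 warnings.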
In general, I stay away from binmode when relying on an external module to read from or write to a file handle. In this case I would have a simple sub calling Encode::encode('UTF-8', ...) on the data before pushing onto @tied.
The exception is when the module's documentation clearly states the behaviour for decoded data, or when the source is simple enough for me to verify the behaviour myself.
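The manual-encoding approach might look roughly like the sketch below. It assumes no binmode or encoding layer is applied to the tied file, so Tie::File only ever sees bytes. The helper name to_bytes, the output file name, and the sample lines are hypothetical and not taken from your script.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Tie::File;

my $file_out = 'test_tie_encoded.txt';   # hypothetical file name
unlink $file_out if -e $file_out;

# No binmode or encoding layer: Tie::File works purely on bytes.
tie my @tied, 'Tie::File', $file_out, recsep => "\x0D\x0A"
    or die "tie failed: $!";

# Hypothetical helper: turn decoded characters into UTF-8 bytes.
sub to_bytes { return encode('UTF-8', $_[0]) }

# Decoded text, as it would look after any processing in the middle.
my @lines = ("na\x{EF}ve", "\x{263A} smile");

push @tied, map { to_bytes($_) } @lines;

# Records come back as bytes, so decode a copy before treating them as text.
my @round_trip = map { my $rec = $_; decode('UTF-8', $rec) } @tied;

untie @tied;
```

With this arrangement the record lengths Tie::File computes are byte lengths, so they agree with the byte offsets that seek and tell use.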