Thanks to everyone's comments and some further research, I figured out how to solve the problem. The cause was slightly different from what I had thought: it turned out to be a combination of a split() issue and an encoding issue. The fix had two parts. I had to declare the encoding in an explicit open statement instead of relying on the implicit open in the for loop, and I had to skip the first two bytes of the file (presumably the byte-order mark).
Here's what the corrected, working code looks like for the section I posted in my question:
    for my $infile (@ARGV) {
        my $outfile = $infile . '.out';

        # SOLUTION part 1: added explicit open statement that declares the encoding
        open(INFILE, "<:raw:encoding(UCS-2le):crlf", $infile)
            or die "Error opening $infile: $!";

        # SOLUTION part 2: had to skip the first two bytes of the file
        seek INFILE, 2, 0;

        # Three-argument open, so the file name is never parsed as part of the mode
        open(OUTFILE, '>', $outfile)
            or die "Couldn't write to $outfile: $!\n";
        binmode(OUTFILE, ":utf8");
        print OUTFILE "Line#\tOriginal_Entry\tLangCode\tOffending_Char(s)\n";

        $tBad   = 0;
        $tTot   = 0;
        $lineNo = 1;

        while (<INFILE>) {
            chomp;
            $tTot++;
            # SOLUTION part 3: deleted the "if" block I had here before that was handling encoding
            # Rest of code in the original block is the same
        }
    }
My code now properly recognizes tab characters adjacent to characters not part of the extended Latin set, and splits on tabs as it should.
NOTE: Another solution would have been to enclose the foreign words in double quotes, but, in our case, we couldn't guarantee that our input files would be formatted that way.
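One caveat with the unconditional seek: it assumes every input file starts with a two-byte BOM, so a file saved without one would lose its first character. A more defensive variant is to decode the whole file and strip the decoded BOM (U+FEFF) from the first line only when it is actually there. The sketch below is a minimal illustration of that idea, not the code from my script; the helper name read_utf16le_rows and the sample data are made up for the example, and it assumes the input is UTF-16LE with CRLF line endings.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Hypothetical helper (not from the original script): read a tab-separated
# UTF-16LE file and return its rows, stripping a leading BOM only if present.
sub read_utf16le_rows {
    my ($path) = @_;
    open(my $in, '<:raw:encoding(UTF-16LE):crlf', $path)
        or die "Error opening $path: $!";
    my @rows;
    while (my $line = <$in>) {
        chomp $line;
        # The BOM decodes to U+FEFF; remove it from the first line if it is there.
        $line =~ s/^\x{FEFF}// if $. == 1;
        push @rows, [ split /\t/, $line ];
    }
    close $in;
    return \@rows;
}

# Build a small sample file: BOM, then two CRLF-terminated, tab-separated lines.
my ($tmp, $sample) = tempfile(UNLINK => 1);
binmode($tmp, ':raw');
print {$tmp} "\xFF\xFE", encode('UTF-16LE', "\x{4F60}\x{597D}\tzh\r\nhello\ten\r\n");
close $tmp;

# Read the sample back and show how many fields each line split into.
my $rows = read_utf16le_rows($sample);
binmode(STDOUT, ':encoding(UTF-8)');
for my $row (@$rows) {
    printf "%d fields: %s\n", scalar @$row, join(' | ', @$row);
}
```

Because the BOM is removed after decoding rather than by byte offset, the same helper should also cope with files that have no BOM at all.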
Thanks to everyone who commented and helped me out!