Question

I am trying to detect whether an uploaded file is valid UTF-8, and only then do some operations on its content. Non-UTF-8 files are detected correctly, but when the file is valid UTF-8 the while(){} loop gets no data to process. Where is my mistake?

use utf8;
use CGI qw(:all -utf8);
use Encode;

my $q           = new CGI;

my $file        = $q->param('importfile');
my $file_handle = $q->upload('importfile');
my $fhtest      = do {
        local $/;
        <$file_handle>;
};

my $utf8;
eval { $utf8 = decode( "utf8", $fhtest, Encode::FB_CROAK ) };
if ($@) {
        die 'Not a valid UTF-8 file';
}

binmode $file_handle, ':encoding(UTF-8)';
while (<$file_handle>) {
        chomp();
        # my code here
}

Solution

When you use readline (aka <$fh>), you read the next line after where you left off. You left off at the end of the file.
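To make the point concrete, here is a minimal sketch of what happens to the read position. It uses an in-memory filehandle and made-up data as a stand-in for the upload handle, so the names and contents are assumptions, not your actual code:

```perl
use strict;
use warnings;

# In-memory filehandle standing in for the upload handle (illustrative data).
my $data = "first\nsecond\n";
open my $fh, '<', \$data
    or die "Cannot open in-memory handle: $!";

# Slurp everything: the read position is now at the end of the file.
my $slurped = do { local $/; <$fh> };

# A second read starts where the first one stopped, so it finds no lines.
my @lines_after_slurp = <$fh>;
my $at_eof = eof($fh) ? 1 : 0;

printf "slurped %d bytes, then read %d more lines (eof: %d)\n",
    length $slurped, scalar @lines_after_slurp, $at_eof;
```

The second read is not failing; it simply has nothing left to return, which is why the while loop body never runs.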

Sure, you might be able to use seek to rewind the file handle (assuming it's not a pipe), but why would you want to read from the file again? You already have the whole thing in memory, and it's already decoded too! Just split it into lines.

my $file_contents;
{
    local $/;    # undef the input record separator to slurp the whole file
    $file_contents = <$file_handle>;
}

utf8::decode($file_contents)
   or die 'Not a valid UTF-8 file';

for (split /^/m, $file_contents, -1) {
    chomp;
    ...
}

Or since you're chomping anyway,

for (split /\n/, $file_contents) {
    ...
}
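The two split forms are not quite interchangeable when the file ends with blank lines. A small sketch with made-up contents (the sample string is an assumption) shows the difference:

```perl
use strict;
use warnings;

my $file_contents = "a\n\nb\n\n";    # illustrative: ends with a blank line

# split /^/m keeps each line's newline, so chomp is needed afterwards.
my @kept = split /^/m, $file_contents, -1;
chomp @kept;

# split /\n/ removes the newlines but also drops trailing empty fields.
my @dropped = split /\n/, $file_contents;

print scalar(@kept), " lines vs ", scalar(@dropped), " lines\n";
```

If trailing blank lines matter to you, prefer the first form, or pass -1 as the limit to the second and discard the final empty field yourself.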

I avoided the do block because it causes an extra copy of the file to be created in memory.

OTHER TIPS

You've already read the entire filehandle in the do block that creates $fhtest, so the read position is at the end of the file. If you want to go back to the beginning, you can use seek:

use Fcntl ':seek';    # import constants
...
my $fhtest      = do {
        local $/;
        <$file_handle>;
};

my $utf8;
eval { $utf8 = decode( "utf8", $fhtest, Encode::FB_CROAK | Encode::LEAVE_SRC) };
if ($@) {
        die 'Not a valid UTF-8 file';
}

seek $file_handle, 0, SEEK_SET;

# now you can start over with $file_handle
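As a rough end-to-end sketch of the rewind approach: the temp file below is only a stand-in for the CGI upload handle, and its contents are made up, so treat those parts as assumptions. It slurps the raw bytes, seeks back, and only then adds the decoding layer for the line-by-line pass:

```perl
use strict;
use warnings;
use Fcntl ':seek';
use File::Temp qw(tempfile);

# A temp file stands in for the CGI upload handle (illustrative only).
my ($file_handle, $path) = tempfile(UNLINK => 1);
binmode $file_handle;
print {$file_handle} "caf\xC3\xA9\nbar\n";    # "café" as raw UTF-8 bytes
seek($file_handle, 0, SEEK_SET) or die "seek failed: $!";

# First pass: slurp the raw bytes, leaving the position at end of file.
my $raw = do { local $/; <$file_handle> };

# Rewind, then switch to decoded reads for the second pass.
seek($file_handle, 0, SEEK_SET) or die "seek failed: $!";
binmode $file_handle, ':encoding(UTF-8)';

my @lines;
while (my $line = <$file_handle>) {
    chomp $line;
    push @lines, $line;
}
```

Note that this only works on a seekable handle; it will not help if the data arrives through a pipe.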

Of course, since you've already loaded all the data into memory in $fhtest, you could just split it on newlines (or whatever) and loop over the results. Or you could open a fake filehandle to what you already have in memory:

open my $fake_fh, '<', \$fhtest
    or die "Cannot open in-memory filehandle: $!";
while ( <$fake_fh> ) {
    ...
}
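One thing to watch with the in-memory handle: if $fhtest still holds the raw bytes (which it does when LEAVE_SRC is used), the lines you read back are bytes too. A possible variant, sketched here with made-up sample bytes, opens the handle with a decoding layer. It also uses the strict 'UTF-8' encoding name rather than Perl's laxer "utf8"; that swap is my suggestion, not part of the original code:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Illustrative bytes standing in for the slurped upload: "é\nok\n" in UTF-8.
my $fhtest = "\xC3\xA9\nok\n";

# Validate without consuming $fhtest (LEAVE_SRC keeps the source intact).
my $valid = eval {
    decode('UTF-8', $fhtest, Encode::FB_CROAK | Encode::LEAVE_SRC);
    1;
};
die 'Not a valid UTF-8 file' unless $valid;

# Read the same bytes back line by line, decoding as we go.
open my $fake_fh, '<:encoding(UTF-8)', \$fhtest
    or die "Cannot open in-memory handle: $!";
my @lines;
while (my $line = <$fake_fh>) {
    chomp $line;
    push @lines, $line;
}
close $fake_fh;
```

Each element of @lines is then a decoded character string, which matches what the binmode call in the question was trying to achieve.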
Licensed under: CC-BY-SA with attribution