Question

I have hundreds of thousands of files that I would like to analyze. Specifically, I would like to calculate the percentage of printable characters in a sample of arbitrary size taken from each file. Some of these files come from mainframes, Windows, Unix, etc., so it is likely that binary and control characters are included.

I started by using the Linux "file" command, but it did not provide enough detail for my purposes. The following code conveys what I am trying to do, but does not always work.

    #!/usr/bin/perl -n

    use strict;
    use warnings;

    my $cnt_n_print = 0;
    my $cnt_print = 0;
    my $cnt_total = 0;
    my $prc_print = 0;

    #Count the number of non-printable characters
    while ($_ =~ m/[^[:print:]]/g) {$cnt_n_print++};

    #Count the number of printable characters
    while ($_ =~ m/[[:print:]]/g) {$cnt_print++};

    $cnt_total = $cnt_n_print + $cnt_print;
    $prc_print = $cnt_print/$cnt_total;

    #Print the # total number of bytes read followed by the % printable
    print "$cnt_total|$prc_print\n"

This is a test call that works:

    echo "test_string of characters" | /home/user/scripts/prl/s16_count_chars.pl

This is how I intend to call it, and works for one file:

    find /fct/inbound/trans/ -name "TRNST.20121115231358.xf2" -type f -print0 | xargs -0 head -c 2000 | /home/user/scripts/prl/s16_count_chars.pl

This does not work correctly:

    find /fct/inbound/trans/ -type f -print0 | xargs -0 head -c 2000 | /home/user/scripts/prl/s16_count_chars.pl

Neither does this:

    find /fct/inbound/trans/ -type f -print0 | xargs -0 head -c 2000 | perl -0 /home/user/scripts/prl/s16_count_chars.pl

Instead of executing the script once for EACH file returned by find, it executes ONCE for ALL of the results combined.

Thanks in advance.


Research so far:

Pipe and XARGS and separators

http://help.lockergnome.com/linux/help-understand-pipe-xargs--ftopict549399.html

http://en.wikipedia.org/wiki/Xargs#The_separator_problem


Clarification(s):
1.) Desired output: if there are 932 files in a directory, the output would be a 932-line list of file names, the total bytes read from each file, and the % that were printable characters.
2.) Many of the files are binary. The script needs to handle embedded binary EOL or EOF sequences.
3.) Many of the files are large, so I would like to read only the first/last xx bytes. I had been trying to use head -c 256 or tail -c 128 to read either the first 256 bytes or the last 128 bytes respectively. The solution could either work in a pipeline or limit the bytes within the perl script.

Was it helpful?

Solution 3

Here is my working solution based on the feedback provided.

I would appreciate any further feedback on form or more efficient methods:

    #!/usr/bin/perl

    use strict;
    use warnings;

    # This program receives one or more file paths/names.
    # For each file it attempts to read the first 2000 bytes.
    # The output is a list of files, the number of bytes
    # actually read and the percent of the bytes that are
    # ASCII "printable" aka [\x20-\x7E].

    die "Pass the file name(s) on the command line.\n" unless @ARGV;

    # loop through each file named on the command line
    foreach my $file_name (@ARGV) {

       # open the file read only, using the three-argument form of open
       # and a lexical filehandle
       open(my $fh, '<', $file_name) or die "Can't open $file_name: $!";

       # read in binary mode to handle non-printable characters
       binmode $fh;

       # try to read 2000 bytes from the file, save the results in $data
       # and the actual number of bytes read in $n_bytes
       my $data;
       my $n_bytes = read($fh, $data, 2000);

       my $cnt_n_print = 0;

       # count the number of non-printable characters
       ++$cnt_n_print while ($data =~ m/[^[:print:]]/g);

       my $cnt_print = $n_bytes - $cnt_n_print;

       # guard against dividing by zero on empty files
       my $prc_print = $n_bytes ? $cnt_print/$n_bytes : 0;

       print "$file_name|$n_bytes|$prc_print\n";
       close($fh);
    }

Here is a sample of how to call the above script:

    find /some/path/to/files/ -type f -exec perl this_script.pl {} +
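
Clarification 3 also mentions sampling the last xx bytes of each file (e.g. tail -c 128). A minimal sketch of one way to do that in the same style, using seek and a hypothetical read_tail helper with an assumed tail size of 128 bytes:

    #!/usr/bin/perl

    use strict;
    use warnings;

    # hypothetical helper: read up to $len bytes from the end of a file
    sub read_tail {
        my ($file_name, $len) = @_;
        open(my $fh, '<', $file_name) or die "Can't open $file_name: $!";
        binmode $fh;

        # start $len bytes before the end, or at the start for short files
        my $size   = -s $fh;
        my $offset = $size > $len ? $size - $len : 0;
        seek($fh, $offset, 0) or die "Can't seek in $file_name: $!";

        my $data;
        my $n_bytes = read($fh, $data, $len);
        defined $n_bytes or die "Can't read $file_name: $!";
        close($fh);
        return ($data, $n_bytes);
    }

    foreach my $file_name (@ARGV) {
        my ($data, $n_bytes) = read_tail($file_name, 128);

        my $cnt_n_print = 0;
        ++$cnt_n_print while ($data =~ m/[^[:print:]]/g);

        my $cnt_print = $n_bytes - $cnt_n_print;
        my $prc_print = $n_bytes ? $cnt_print/$n_bytes : 0;

        print "$file_name|$n_bytes|$prc_print\n";
    }

It can be called the same way as the script above.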

Here's a list of references I found helpful:

POSIX Bracket Expressions
Opening files in binmode
Read function
Open file read only

Other tips

The -n option wraps your entire code in a while (defined($_ = <ARGV>)) { ... } block. This means your my $cnt_print and the other variable declarations are repeated for every line of input, essentially resetting all your counters on each line.
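
Roughly speaking, that means the original script behaves as if it had been written like this (a simplified sketch, not the exact code perl generates):

    #!/usr/bin/perl

    use strict;
    use warnings;

    # simplified sketch of what -n does: the whole script body runs once
    # per input line, so these my() variables are re-created and reset to
    # zero on every iteration, and a result is printed for every line
    while (defined($_ = <ARGV>)) {
        my $cnt_n_print = 0;
        my $cnt_print   = 0;

        ++$cnt_n_print while ($_ =~ m/[^[:print:]]/g);
        ++$cnt_print   while ($_ =~ m/[[:print:]]/g);

        my $cnt_total = $cnt_n_print + $cnt_print;
        print "$cnt_total|", $cnt_print/$cnt_total, "\n" if $cnt_total;
    }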

The workaround is to use global variables (declare them with our if you want to keep using use strict), and not to initialize them to 0 in the declaration, since that initialization would run again for every line of input. You could say something like

    our $cnt_print //= 0;

if you don't want $cnt_print and its friends to be undefined for the first line of input.
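
Putting those pieces together, a minimal sketch of that workaround might look like the following (assuming the intent is to print the totals once at the end, from an END block, rather than once per line):

    #!/usr/bin/perl -n

    use strict;
    use warnings;

    # globals declared with our survive across the implicit per-line loop
    our ($cnt_n_print, $cnt_print);
    $cnt_n_print //= 0;
    $cnt_print   //= 0;

    ++$cnt_n_print while ($_ =~ m/[^[:print:]]/g);
    ++$cnt_print   while ($_ =~ m/[[:print:]]/g);

    # END runs once, after the last line of input has been processed
    END {
        my $cnt_total = ($cnt_n_print // 0) + ($cnt_print // 0);
        print "$cnt_total|", $cnt_print/$cnt_total, "\n" if $cnt_total;
    }

Note that //= requires Perl 5.10 or later, consistent with the suggestion above.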

See this recent question with a similar issue.

You could have find pass you one arg at a time.

    find /fct/inbound/trans/ -type f -exec perl script.pl {} \;

But I'd continue passing multiple files at a time, either through xargs, or using GNU find's -exec +.

    find /fct/inbound/trans/ -type f -exec perl script.pl {} +
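
The equivalent xargs form (assuming the script reads file names from @ARGV, as the snippets below do) would be along these lines:

    find /fct/inbound/trans/ -type f -print0 | xargs -0 perl script.pl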

The following code snippets support both.

You can continue reading a line at a time:

    #!/usr/bin/perl

    use strict;
    use warnings;

    my $cnt_total   = 0;
    my $cnt_n_print = 0;

    while (<>) {
        $cnt_total += length;
        ++$cnt_n_print while /[^[:print:]]/g;
    } continue {
        # eof (with no argument) is true at the end of each file named in
        # @ARGV, so the per-file totals are printed and reset here
        if (eof) {
            my $cnt_print = $cnt_total - $cnt_n_print;
            my $prc_print = $cnt_print/$cnt_total;

            print "$ARGV: $cnt_total|$prc_print\n";

            $cnt_total   = 0;
            $cnt_n_print = 0;
        }
    }

Or you could read a whole file at a time:

    #!/usr/bin/perl

    use strict;
    use warnings;

    # with the input record separator undefined, <> returns a whole file
    # at a time instead of a line at a time
    local $/;

    while (<>) {
        my $cnt_n_print = 0;
        ++$cnt_n_print while /[^[:print:]]/g;

        my $cnt_total = length;
        my $cnt_print = $cnt_total - $cnt_n_print;
        my $prc_print = $cnt_print/$cnt_total;

        print "$ARGV: $cnt_total|$prc_print\n";
    }
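
If you only want to sample part of each file (per clarification 3), one option (a sketch, not part of the original answer) is to truncate the slurped contents before counting; note that this still reads each whole file into memory, unlike the read-based approach in the solution above:

    #!/usr/bin/perl

    use strict;
    use warnings;

    local $/;
    while (<>) {
        # keep only the first 2000 bytes of the slurped file (assumed
        # sample size; adjust as needed)
        my $data = substr($_, 0, 2000);

        my $cnt_n_print = 0;
        ++$cnt_n_print while ($data =~ m/[^[:print:]]/g);

        my $cnt_total = length $data;
        my $cnt_print = $cnt_total - $cnt_n_print;
        my $prc_print = $cnt_total ? $cnt_print/$cnt_total : 0;

        print "$ARGV: $cnt_total|$prc_print\n";
    }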
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow