Question

I have hundreds of thousands of files that I would like to analyze. Specifically, I would like to calculate the percentage of printable characters in a sample of arbitrary size taken from each file. Some of these files come from mainframes, Windows, Unix, etc., so it is likely that binary and control characters are included.

I started by using the Linux "file" command, but it did not provide enough detail for my purposes. The following code conveys what I am trying to do, but does not always work.

    #!/usr/bin/perl -n

    use strict;
    use warnings;

    my $cnt_n_print = 0;
    my $cnt_print = 0;
    my $cnt_total = 0;
    my $prc_print = 0;

    #Count the number of non-printable characters
    while ($_ =~ m/[^[:print:]]/g) {$cnt_n_print++};

    #Count the number of printable characters
    while ($_ =~ m/[[:print:]]/g) {$cnt_print++};

    $cnt_total = $cnt_n_print + $cnt_print;
    $prc_print = $cnt_print/$cnt_total;

    #Print the # total number of bytes read followed by the % printable
    print "$cnt_total|$prc_print\n"

This is a test call that works:

    echo "test_string of characters" | /home/user/scripts/prl/s16_count_chars.pl

This is how I intend to call it, and works for one file:

    find /fct/inbound/trans/ -name "TRNST.20121115231358.xf2" -type f -print0 | xargs -0 head -c 2000 | /home/user/scripts/prl/s16_count_chars.pl

This does not work correctly:

    find /fct/inbound/trans/ -type f -print0 | xargs -0 head -c 2000 | /home/user/scripts/prl/s16_count_chars.pl

Neither does this:

    find /fct/inbound/trans/ -type f -print0 | xargs -0 head -c 2000 | perl -0 /home/user/scripts/prl/s16_count_chars.pl

Instead of executing the script once for EACH file returned by find, it executes ONCE for ALL of the results combined.

Thanks in advance.


Research so far:

Pipe and XARGS and separators

http://help.lockergnome.com/linux/help-understand-pipe-xargs--ftopict549399.html

http://en.wikipedia.org/wiki/Xargs#The_separator_problem


Clarification(s):
1.) Desired output: if there are 932 files in a directory, the output would be a 932-line list of file names, the total bytes read from each file, and the % that were printable characters.
2.) Many of the files are binary. The script needs to handle embedded binary EOL or EOF sequences.
3.) Many of the files are large, so I would like to read only the first/last xx bytes. I had been trying to use head -c 256 or tail -c 128 to read either the first 256 bytes or the last 128 bytes respectively. The solution could either work in a pipeline or limit the bytes within the perl script.

Was it helpful?

Solution 3

Here is my working solution based on the feedback provided.

I would appreciate any further feedback on form or more efficient methods:

    #!/usr/bin/perl

    use strict;
    use warnings;

    # This program receives one or more file paths/names.
    # For each file it attempts to read the first 2000 bytes.
    # The output is a list of files, the number of bytes
    # actually read and the percent of the bytes that are
    # ASCII "printable" aka [\x20-\x7E].

    die "Pass the file name(s) on the command line.\n" unless @ARGV;

    # loop through each file named on the command line
    foreach my $file_name (@ARGV) {

       # open the file read only, using the three-argument form of open
       # and a lexical filehandle
       open(my $fh, '<', $file_name) or die "Can't open $file_name: $!";

       # read in binary mode to handle non-printable characters
       binmode $fh;

       # try to read 2000 bytes from the file, save the results in $data
       # and the actual number of bytes read in $n_bytes
       my $data;
       my $n_bytes = read($fh, $data, 2000);

       my $cnt_n_print = 0;

       # count the number of non-printable characters
       ++$cnt_n_print while ($data =~ m/[^[:print:]]/g);

       my $cnt_print = $n_bytes - $cnt_n_print;

       # guard against dividing by zero on empty files
       my $prc_print = $n_bytes ? $cnt_print/$n_bytes : 0;

       print "$file_name|$n_bytes|$prc_print\n";
       close($fh);
    }

Here is a sample of how to call the above script:

    find /some/path/to/files/ -type f -exec perl this_script.pl {} +
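
Clarification 3 also mentions sampling the last xx bytes of each file (e.g. tail -c 128). A minimal sketch of one way to do that in the same style, using seek and a hypothetical read_tail helper with an assumed tail size of 128 bytes:

    #!/usr/bin/perl

    use strict;
    use warnings;

    # hypothetical helper: read up to $len bytes from the end of a file
    sub read_tail {
        my ($file_name, $len) = @_;
        open(my $fh, '<', $file_name) or die "Can't open $file_name: $!";
        binmode $fh;

        # start $len bytes before the end, or at the start for short files
        my $size   = -s $fh;
        my $offset = $size > $len ? $size - $len : 0;
        seek($fh, $offset, 0) or die "Can't seek in $file_name: $!";

        my $data;
        my $n_bytes = read($fh, $data, $len);
        defined $n_bytes or die "Can't read $file_name: $!";
        close($fh);
        return ($data, $n_bytes);
    }

    foreach my $file_name (@ARGV) {
        my ($data, $n_bytes) = read_tail($file_name, 128);

        my $cnt_n_print = 0;
        ++$cnt_n_print while ($data =~ m/[^[:print:]]/g);

        my $cnt_print = $n_bytes - $cnt_n_print;
        my $prc_print = $n_bytes ? $cnt_print/$n_bytes : 0;

        print "$file_name|$n_bytes|$prc_print\n";
    }

It can be called the same way as the script above.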

Here's a list of references I found helpful:

POSIX Bracket Expressions
Opening files in binmode
Read function
Open file read only

Other tips

The -n option wraps your entire code in a while (defined($_ = <ARGV>)) { ... } block. This means your my $cnt_print and the other variable declarations are repeated for every line of input, essentially resetting all your counters on each line.
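
Roughly speaking, that means the original script behaves as if it had been written like this (a simplified sketch, not the exact code perl generates):

    #!/usr/bin/perl

    use strict;
    use warnings;

    # simplified sketch of what -n does: the whole script body runs once
    # per input line, so these my() variables are re-created and reset to
    # zero on every iteration, and a result is printed for every line
    while (defined($_ = <ARGV>)) {
        my $cnt_n_print = 0;
        my $cnt_print   = 0;

        ++$cnt_n_print while ($_ =~ m/[^[:print:]]/g);
        ++$cnt_print   while ($_ =~ m/[[:print:]]/g);

        my $cnt_total = $cnt_n_print + $cnt_print;
        print "$cnt_total|", $cnt_print/$cnt_total, "\n" if $cnt_total;
    }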

The workaround is to use global variables (declare them with our if you want to keep using use strict), and not to initialize them to 0 in the declaration, since that initialization would run again for every line of input. You could say something like

    our $cnt_print //= 0;

if you don't want $cnt_print and its friends to be undefined for the first line of input.
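
Putting those pieces together, a minimal sketch of that workaround might look like the following (assuming the intent is to print the totals once at the end, from an END block, rather than once per line):

    #!/usr/bin/perl -n

    use strict;
    use warnings;

    # globals declared with our survive across the implicit per-line loop
    our ($cnt_n_print, $cnt_print);
    $cnt_n_print //= 0;
    $cnt_print   //= 0;

    ++$cnt_n_print while ($_ =~ m/[^[:print:]]/g);
    ++$cnt_print   while ($_ =~ m/[[:print:]]/g);

    # END runs once, after the last line of input has been processed
    END {
        my $cnt_total = ($cnt_n_print // 0) + ($cnt_print // 0);
        print "$cnt_total|", $cnt_print/$cnt_total, "\n" if $cnt_total;
    }

Note that //= requires Perl 5.10 or later, consistent with the suggestion above.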

See this recent question with a similar issue.

You could have find pass you one arg at a time.

    find /fct/inbound/trans/ -type f -exec perl script.pl {} \;

But I'd continue passing multiple files at a time, either through xargs, or using GNU find's -exec +.

    find /fct/inbound/trans/ -type f -exec perl script.pl {} +
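
The equivalent xargs form (assuming the script reads file names from @ARGV, as the snippets below do) would be along these lines:

    find /fct/inbound/trans/ -type f -print0 | xargs -0 perl script.pl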

The following code snippets support both.

You can continue reading a line at a time:

    #!/usr/bin/perl

    use strict;
    use warnings;

    my $cnt_total   = 0;
    my $cnt_n_print = 0;

    while (<>) {
        $cnt_total += length;
        ++$cnt_n_print while /[^[:print:]]/g;
    } continue {
        # eof (with no argument) is true at the end of each file named in
        # @ARGV, so the per-file totals are printed and reset here
        if (eof) {
            my $cnt_print = $cnt_total - $cnt_n_print;
            my $prc_print = $cnt_print/$cnt_total;

            print "$ARGV: $cnt_total|$prc_print\n";

            $cnt_total   = 0;
            $cnt_n_print = 0;
        }
    }

Or you could read a whole file at a time:

    #!/usr/bin/perl

    use strict;
    use warnings;

    # with the input record separator undefined, <> returns a whole file
    # at a time instead of a line at a time
    local $/;

    while (<>) {
        my $cnt_n_print = 0;
        ++$cnt_n_print while /[^[:print:]]/g;

        my $cnt_total = length;
        my $cnt_print = $cnt_total - $cnt_n_print;
        my $prc_print = $cnt_print/$cnt_total;

        print "$ARGV: $cnt_total|$prc_print\n";
    }
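
If you only want to sample part of each file (per clarification 3), one option (a sketch, not part of the original answer) is to truncate the slurped contents before counting; note that this still reads each whole file into memory, unlike the read-based approach in the solution above:

    #!/usr/bin/perl

    use strict;
    use warnings;

    local $/;
    while (<>) {
        # keep only the first 2000 bytes of the slurped file (assumed
        # sample size; adjust as needed)
        my $data = substr($_, 0, 2000);

        my $cnt_n_print = 0;
        ++$cnt_n_print while ($data =~ m/[^[:print:]]/g);

        my $cnt_total = length $data;
        my $cnt_print = $cnt_total - $cnt_n_print;
        my $prc_print = $cnt_total ? $cnt_print/$cnt_total : 0;

        print "$ARGV: $cnt_total|$prc_print\n";
    }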
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow