كيف يمكنني حساب الأحرف والكلمات والخطوط في ملف باستخدام بيرل؟

https://stackoverflow.com/questions/782087

13-09-2019
|

سؤال

ما هي طريقة جيدة / أفضل طريقة لحساب عدد الأحرف والكلمات وخطوط ملف نصي باستخدام PERL (بدون استخدام WC)؟

المحلول

إليك رمز بيرل. يمكن أن تكون الكلمات ذاتية ذاتية إلى حد ما، لكنني أقول فقط إنها أي سلسلة من الأحرف التي ليست بيضاء.

open(FILE, "<file.txt") or die "Could not open file: $!";

my ($lines, $words, $chars) = (0,0,0);

while (<FILE>) {
    $lines++;
    $chars += length($_);
    $words += scalar(split(/\s+/, $_));
}

print("lines=$lines words=$words chars=$chars\n");

نصائح أخرى

هناك اختلاف في إجابة BMDhacks التي ستنتج نتائج أفضل هي استخدام S + (أو حتى أفضل W +) كمسؤول. النظر في السلسلة "الثعلب البني السريع" (مساحات إضافية إذا لم تكن واضحة). باستخدام محدد شخصية واحدة من الأحاديث سوف تعطي عدد الكلمات من ستة ليس أربعة. لذا حاول:

open(FILE, "<file.txt") or die "Could not open file: $!";

my ($lines, $words, $chars) = (0,0,0);

while (<FILE>) {
    $lines++;
    $chars += length($_);
    $words += scalar(split(/\W+/, $_));
}

print("lines=$lines words=$words chars=$chars\n");

باستخدام W + حيث أن المحدد سوف يتوقف عن علامات الترقيم (من بين أشياء أخرى) من العد ككلمات.

ال أداة عدد الكلمات تحسب الأحرف والكلمات والخطوط في الملفات النصية

هنا. جرب إصدار Unicode-Savvy من برنامج WC.

يتخطى حجج غير الملفات (الأنابيب، الدلائل، المقابس، إلخ).
يفترض نص UTF-8.
يحسب أي unicode whitespace كفاصل كلمة.
كما يقبل الترميزات البديلة إذا كان هناك .ENCODING في نهاية اسم الملف، مثل foo.cp1252, foo.latin1, foo.utf16, ، إلخ.
كما تعمل مع الملفات التي تم ضغطها في مجموعة متنوعة من التنسيقات.
انه يعطي التهم الفقرات والخطوط والكلمات والجرالم والأحرف, ، و بايت.
إنه يفهم جميع تسلسل تندل يونيكود.
تحذر من رسائل النصية التالفة مع أخطاء ليلة.

إليك مثال على تشغيله:

   Paras    Lines    Words   Graphs    Chars    Bytes File
       2     2270    82249   504169   504333   528663 /tmp/ap
       1     2404    11163    63164    63164    66336 /tmp/b3
    uwc: missing linebreak at end of corrupted textfiile /tmp/bad
      1*       2*        4       19       19       19 /tmp/bad
       1       14       52      273      273      293 /tmp/es
      57      383     1369    11997    11997    12001 /tmp/funny
       1   657068  3175429 31205970 31209138 32633834 /tmp/lw
       1        1        4       27       27       27 /tmp/nf.cp1252
       1        1        4       27       27       34 /tmp/nf.euc-jp
       1        1        4       27       27       27 /tmp/nf.latin1
       1        1        4       27       27       27 /tmp/nf.macroman
       1        1        4       27       27       54 /tmp/nf.ucs2
       1        1        4       27       27       56 /tmp/nf.utf16
       1        1        4       27       27       54 /tmp/nf.utf16be
       1        1        4       27       27       54 /tmp/nf.utf16le
       1        1        4       27       27      112 /tmp/nf.utf32
       1        1        4       27       27      108 /tmp/nf.utf32be
       1        1        4       27       27      108 /tmp/nf.utf32le
       1        1        4       27       27       39 /tmp/nf.utf7
       1        1        4       27       27       31 /tmp/nf.utf8
       1    26906   101528   635841   636026   661202 /tmp/o2
     131      346     1370     9590     9590     4486 /tmp/perl5122delta.pod.gz
     291      814     3941    25318    25318     9878 /tmp/perl51310delta.pod.bz2
       1     2551     5345   132655   132655   133178 /tmp/tailsort-pl.utf8
       1       89      334     1784     1784     2094 /tmp/til
       1        4       18       88       88      106 /tmp/w
     276     1736     5773    53782    53782    53804 /tmp/www

هنا يا الذهاب:

#!/usr/bin/env perl 
#########################################################################
# uniwc - improved version of wc that works correctly with Unicode
#
# Tom Christiansen <tchrist@perl.com>
# Mon Feb 28 15:59:01 MST 2011
#########################################################################

use 5.10.0;

use strict;
use warnings FATAL => "all";
use sigtrap qw[ die untrapped normal-signals ];

use Carp;

$SIG{__WARN__}  = sub {
    confess("FATALIZED WARNING: @_")  unless $^S;
};

$SIG{__DIE__}  = sub {
    confess("UNCAUGHT EXCEPTION: @_")  unless $^S;
};

$| = 1;

my $Errors = 0;
my $Headers = 0;

sub yuck($) {
    my $errmsg = $_[0];
    $errmsg =~ s/(?<=[^\n])\z/\n/;
    print STDERR "$0: $errmsg";
}

process_input(\&countem);

sub countem { 
    my ($_, $file) = @_;

    my (
        @paras, @lines, @words,
        $paracount, $linecount, $wordcount, 
        $grafcount, $charcount, $bytecount,
    );

    if ($charcount = length($_)) {
        $wordcount = eval { @words = split m{ \p{Space}+  }x }; 
        yuck "error splitting words: $@" if $@;

        $linecount = eval { @lines = split m{ \R     }x }; 
        yuck "error splitting lines: $@" if $@;

        $grafcount = 0;
        $grafcount++ while /\X/g;
        #$grafcount = eval { @lines = split m{ \R     }x }; 
        yuck "error splitting lines: $@" if $@;

        $paracount = eval { @paras = split m{ \R{2,} }x }; 
        yuck "error splitting paras: $@" if $@;

        if ($linecount && !/\R\z/) {
            yuck("missing linebreak at end of corrupted textfiile $file");
            $linecount .= "*";
            $paracount .= "*";
        } 
    }

    $bytecount = tell;
    if (-e $file) {
        $bytecount = -s $file;
        if ($bytecount != -s $file) {
            yuck "filesize of $file differs from bytecount\n";
            $Errors++;
        }
    } 
    my $mask = "%8s " x 6 . "%s\n";
    printf  $mask => qw{ Paras Lines Words Graphs Chars Bytes File } unless $Headers++;

    printf $mask => map( { show_undef($_) } 
                                $paracount, $linecount, 
                                $wordcount, $grafcount, 
                                $charcount, $bytecount,
                       ), $file;
} 

sub show_undef {
    my $value = shift;
    return defined($value)
             ? $value
             : "undef";
} 

END { 
    close(STDOUT) || die "$0: can't close STDOUT: $!";
    exit($Errors != 0);
}

sub process_input {

    my $function = shift();

    my $enc;

    if (@ARGV == 0 && -t) {
        warn "$0: reading from stdin, type ^D to end or ^C to kill.\n";
    }

    unshift(@ARGV, "-") if @ARGV == 0;

FILE:

    for my $file (@ARGV) {
        # don't let magic open make an output handle

        next if -e $file && ! -f _;

        my $quasi_filename = fix_extension($file);

        $file = "standard input" if $file eq q(-);
        $quasi_filename =~ s/^(?=\s*[>|])/< /;

        no strict "refs";
        my $fh = $file;   # is *so* a lexical filehandle! ☺
        unless (open($fh, $quasi_filename)) {
            yuck("couldn't open $quasi_filename: $!");
            next FILE;
        }
        set_encoding($fh, $file) || next FILE;

        my $whole_file = eval {
            use warnings "FATAL" => "all";
            local $/;
            scalar <$fh>;
        };

        if ($@) {
            $@ =~ s/ at \K.*? line \d+.*/$file line $./;
            yuck($@);
            next FILE;
        }

        $function->($whole_file, $file);

        unless (close $fh) {
            yuck("couldn't close $quasi_filename at line $.: $!");
            next FILE;
        }

    } # foreach file

}

sub set_encoding(*$) {
    my ($handle, $path) = @_;

    my $enc_name = "utf8";

    if ($path && $path =~ m{ \. ([^\s.]+) \z }x) {
        my $ext = $1;
        die unless defined $ext;
        require Encode;
        if (my $enc_obj = Encode::find_encoding($ext)) {
            my $name = $enc_obj->name || $ext;
            $enc_name = "encoding($name)";
        }
    }

    return 1 if eval {
        use warnings FATAL => "all";
        no strict "refs";
        binmode($handle, ":$enc_name");
        1;
    };

    for ($@) {
        s/ at .* line \d+\.//;
        s/$/ for $path/;
    }

    yuck("set_encoding: $@");

    return undef;
}

sub fix_extension {
    my $path = shift();
    my %Compress = (
        Z       =>  "zcat",
        z       => "gzcat",            # for uncompressing
        gz      => "gzcat",
        bz      => "bzcat",
        bz2     => "bzcat",
        bzip    => "bzcat",
        bzip2   => "bzcat",
        lzma    => "lzcat",
    );

    if ($path =~ m{ \. ( [^.\s] +) \z }x) {
        if (my $prog = $Compress{$1}) {
            return "$prog $path |";
        } 
    } 

    return $path;

}

تعثرت على هذا بينما googling لحل عدد الأحرف. من المسلم به أنني أعرف بجوار أي شيء عن بيرل، لذلك قد يكون بعضها خارج القاعدة، ولكن فيما يلي قرصي في حل Newt.

أولا، هناك متغير عدد المدمج في الخط المدمج على أي حال، لذلك اعتدت ذلك للتو. هذا ربما أكثر كفاءة بعض الشيء، أعتقد. كما هو الحال، يتضمن عدد الأحرف أحرفا جديدة، والتي ربما لا تكون ما تريده، لذلك اخترت $ _. اشتكى بيرل أيضا من الطريقة التي يتم بها الانقسام (الانقسام الضمني، انظر: لماذا يشتكي بيرل يشكو من "استخدام الانقسام الضمني إلى_ ) لذلك اخترت ذلك. ملفات الإدخال الخاصة بي هي UTF-8 لذلك فتحتها على هذا النحو. ربما يساعد ذلك في الحصول على عدد الأحرف الصحيحة في ملف الإدخال يحتوي على أحرف غير ASCII.

إليك الرمز:

open(FILE, "<:encoding(UTF-8)", "file.txt") or die "Could not open file: $!";

my ($lines, $words, $chars) = (0,0,0);
my @wordcounter;
while (<FILE>) {
    chomp($_);
    $chars += length($_);
    @wordcounter = split(/\W+/, $_);
    $words += @wordcounter;
}
$lines = $.;
close FILE;
print "\nlines=$lines, words=$words, chars=$chars\n";

هنالك ال أدوات السلطة بيرل المشروع الذي هدفه هو إعادة بناء جميع مرافق UNIX بن المرافق، في المقام الأول لأولئك على أنظمة التشغيل المحرومين من UNIX. نعم لقد فعلوا مرحاض. وبعد التنفيذ مبالغ فيه، لكنه متوافق بوسيكس.

إنه يحصل قليلا سخيفة عندما تنظر إلى تنفيذ جنو متوافق حقيقي.

إجابة غير جدية:

system("wc foo");

قد تكون قراءة الملف في قطع قطع ثابتة الحجم أكثر كفاءة من قراءة خط القراءة. ال wc ثنائي يفعل هذا.

#!/usr/bin/env perl

use constant BLOCK_SIZE => 16384;

for my $file (@ARGV) {
    open my $fh, '<', $file or do {
        warn "couldn't open $file: $!\n";
        continue;
    };

    my ($chars, $words, $lines) = (0, 0, 0);

    my ($new_word, $new_line);
    while ((my $size = sysread $fh, local $_, BLOCK_SIZE) > 0) {
        $chars += $size;
        $words += /\s+/g;
        $words-- if $new_word && /\A\s/;
        $lines += () = /\n/g;

        $new_word = /\s\Z/;
        $new_line = /\n\Z/;
    }
    $lines-- if $new_line;

    print "\t$lines\t$words\t$chars\t$file\n";
}

لتكون قادرة على حساب الأحرف وليس بايت، فكر في هذا:
(جربه مع الحروف الصينية أو السيريلية والملف المحفوظة في UTF8)

use utf8;

my $file='file.txt';
my $LAYER = ':encoding(UTF-8)';
open( my $fh, '<', $file )
  || die( "$file couldn't be opened: $!" );
binmode( $fh, $LAYER );
read $fh, my $txt, -s $file;
close $fh;

print length $txt,$/;
use bytes;
print length $txt,$/;

قد يكون هذا مفيدا لبيتل للمبتدئين. حاولت محاكاة وظائف حساب MS Word وأضفت ميزة أخرى لا تظهر باستخدام WC في Linux.

عدد الخطوط
عدد الكلمات
عدد الشخصيات مع الفضاء
عدد الأحرف دون مساحة (WC لن يمنح ذلك في إخراجه، لكن كلمات Microsoft تعرضها.)

هنا هو عنوان URL: عد الكلمات والشخصيات والخطوط في ملف

مرخصة بموجب: CC-BY-SA مع الإسناد

لا تنتمي إلى StackOverflow