Question

I'm programming in Perl. I need to read a paragraph and print each sentence on its own line.

Does anyone know how to do it?

Below is my code:

#! /C:/Perl64/bin/perl.exe

use utf8;

if (! open(INPUT, '< text1.txt')){
die "cannot open input file: $!";
}

if (! open(OUTPUT, '> output.txt')){
die "cannot open output file: $!";
}

select OUTPUT;

while (<INPUT>){
print "$_";
}

close INPUT;
close OUTPUT;
select STDOUT;

Solution

If you are given the paragraph as a string, you can split() it on characters that mark the end of a sentence.

For example:

my @sentences = split /[.?!]/, $paragraph;
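Note that split() discards whatever the pattern matches, so the sentences come back without their closing punctuation and with leading spaces. A minimal sketch of how that one-liner could be wrapped up, assuming the whole paragraph is already in a string (the helper name split_sentences is hypothetical, not part of the answer):

```perl
use strict;
use warnings;

# Hypothetical helper: return the sentences of $paragraph, one per element.
sub split_sentences {
    my ($paragraph) = @_;
    # split() drops the delimiter, so the closing punctuation is lost here.
    my @sentences = split /[.?!]/, $paragraph;
    # Trim surrounding whitespace from each piece, in place.
    s/^\s+|\s+$//g for @sentences;
    # Drop pieces that are empty after trimming.
    return grep { length } @sentences;
}

my $paragraph = "First sentence. Is this the second? Yes!";
print "$_\n" for split_sentences($paragraph);
```

If the punctuation has to survive, the pattern needs to match only the whitespace after it, which is what the look-behind approach in another answer does.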

Other tips

Rather than handle file names, I'll let Perl do that.

This is very crude on multiple levels, and the full job is undoubtedly tough.

sentence.pl

#!/usr/bin/env perl
use strict;
use warnings;
use Lingua::EN::Sentence qw(get_sentences);

sub normalize
{
    my($str) = @_;
    $str =~ s/\n/ /gm;
    $str =~ s/\s\s+/ /gm;
    return $str;
}

{
    local $/ = "\n\n";
    while (<>)
    {
        chomp;
        print "Para: [[$_]]\n";
        my @sentences = split m/(?<=[.!?])\s+/m, $_;
        foreach my $sentence (@sentences)
        {
            $sentence = normalize $sentence;
            print "Ad Hoc Sentence: $sentence\n";
        }
        my $sref = get_sentences($_);
        foreach my $sentence (@$sref)
        {
            $sentence = normalize $sentence;
            print "Lingua Sentence: $sentence\n";
        }
    }
}

The split regex looks for one or more whitespace characters preceded by a full stop (period), exclamation mark or question mark. Because \s also matches newlines, sentences that end at a line break are split as well; the /m modifier only changes how ^ and $ behave, so it has no effect on this pattern. The look-behind (?<=[.!?]) means that the punctuation is kept with the sentence. The normalize function simply flattens newlines into spaces and collapses runs of whitespace into single spaces. (Note that this would not properly recognize a parenthetical sentence.) This sentence would be counted as part of the previous one, because the . before the closing parenthesis is not followed by a blank.
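As an aside, the parenthetical case can be handled by a second look-behind that accepts a closing bracket or quote after the punctuation. This is only a sketch of one possible extension, not part of sentence.pl; the helper name split_ad_hoc and the set of closing characters are assumptions. Two separate look-behinds are used so that each one has a fixed length.

```perl
use strict;
use warnings;

# Hypothetical helper: split on whitespace that follows sentence-ending
# punctuation, optionally followed by one closing bracket or quote.
sub split_ad_hoc {
    my ($text) = @_;
    return split /(?:(?<=[.!?])|(?<=[.!?][)\]"']))\s+/, $text;
}

my $text = "One sentence. (A parenthetical sentence.) Another sentence.";
print "$_\n" for split_ad_hoc($text);
```

This still splits after abbreviations such as 'Mr.', so it only addresses the parenthetical limitation.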

Sample input

This is a paragraph with more than one sentence in it.  How many will be
determined later.  Mr. A. P. McDowney has been rather busy.  This
incomplete sentence will still be counted as one

This is the second paragraph.  With three sentences in it, it is a lot
less exciting than the first paragraph, but the middle sentence extends
over multiple lines and   there   is     some         wonky spacing too.
But 'tis time to finish.

Sample output

Para: [[This is a paragraph with more than one sentence in it.  How many will be
determined later.  Mr. A. P. McDowney has been rather busy.  This
incomplete sentence will still be counted as one]]
Ad Hoc Sentence: This is a paragraph with more than one sentence in it.
Ad Hoc Sentence: How many will be determined later.
Ad Hoc Sentence: Mr.
Ad Hoc Sentence: A.
Ad Hoc Sentence: P.
Ad Hoc Sentence: McDowney has been rather busy.
Ad Hoc Sentence: This incomplete sentence will still be counted as one
Lingua Sentence: This is a paragraph with more than one sentence in it.
Lingua Sentence: How many will be determined later.
Lingua Sentence: Mr. A. P. McDowney has been rather busy.
Lingua Sentence: This incomplete sentence will still be counted as one
Para: [[This is the second paragraph.  With three sentences in it, it is a lot
less exciting than the first paragraph, but the middle sentence extends
over multiple lines and   there   is     some         wonky spacing too.
But 'tis time to finish.
]]
Ad Hoc Sentence: This is the second paragraph.
Ad Hoc Sentence: With three sentences in it, it is a lot less exciting than the first paragraph, but the middle sentence extends over multiple lines and there is some wonky spacing too.
Ad Hoc Sentence: But 'tis time to finish.
Lingua Sentence: This is the second paragraph.
Lingua Sentence: With three sentences in it, it is a lot less exciting than the first paragraph, but the middle sentence extends over multiple lines and there is some wonky spacing too.
Lingua Sentence: But 'tis time to finish.

Notice how Lingua::EN::Sentence managed to handle 'Mr. A. P. McDowney' better than the simple-minded regex does.

Identifying sentences is very hard and language-specific. You'll need help. Maybe Lingua::EN::Sentence is the way to go?
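To illustrate why this is hard without a library, here is a rough sketch that glues the ad hoc pieces back together when a piece ends in a known title or a single capital initial. The abbreviation list and the helper name split_with_abbreviations are assumptions made for this example; this is not how Lingua::EN::Sentence works internally.

```perl
use strict;
use warnings;

# Assumed, deliberately tiny list of titles that do not end a sentence.
my %abbrev = map { $_ => 1 } qw(Mr Mrs Ms Dr Prof);

sub split_with_abbreviations {
    my ($text) = @_;
    my @sentences;
    my $pending = '';
    for my $piece (split /(?<=[.!?])\s+/, $text) {
        # Append the new piece to whatever is still being collected.
        $pending = length($pending) ? "$pending $piece" : $piece;
        # Capture the last word if the collected text ends in a full stop.
        my ($last) = $pending =~ /(\S+)\.$/;
        # A known title or a single capital initial: keep collecting.
        next if defined $last && ($abbrev{$last} || $last =~ /^[A-Z]$/);
        push @sentences, $pending;
        $pending = '';
    }
    # Flush a trailing fragment that never reached a real sentence end.
    push @sentences, $pending if length $pending;
    return @sentences;
}

print "$_\n" for split_with_abbreviations(
    "How many will be determined later. Mr. A. P. McDowney has been rather busy.");
```

The heuristic fails in the other direction as soon as a sentence really does end in a single capital letter (for example 'We chose plan A.'), which is the kind of case that makes a maintained module the safer choice.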

License: CC-BY-SA with attribution
Not affiliated with StackOverflow