How to strip headers/footers from Project Gutenberg texts?

https://stackoverflow.com/questions/1269146

13-09-2019

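Question

I've tried various methods to strip the license from Project Gutenberg texts, for use as a corpus in a language learning project, but I can't come up with an unsupervised, reliable approach. The best heuristic I've found so far is stripping the first 28 lines and the last 398. Any suggestions for automatically stripping the text (it is very similar across many of the texts, but with slight differences in each case, and there are a few different templates as well), and for verifying that the text has been stripped accurately, would be very useful.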

Solution

You weren't kidding. It's almost as if they were trying to make the job AI-complete. I can think of only two approaches, and neither of them is perfect.

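1) Set up a script in, say, Perl to handle the most common patterns (e.g., look for the phrase "Produced by", keep going down to the next blank line, and cut there), but put in plenty of assertions about what you expect to see (e.g., the next text should be the title or the author). That way, when a pattern fails, you'll know about it. The first time a pattern fails, do it by hand. The second time, modify the script.

2) Try Amazon's Mechanical Turk.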
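
As a rough illustration of approach 1, here is a minimal sketch in Python (rather than Perl, purely for brevity). The two marker patterns and the function name strip_boilerplate are assumptions modeled on common Project Gutenberg wording, not a complete catalogue of templates. The point is that the script fails loudly rather than guessing when a file does not match.

```python
import re

# Assumed marker patterns; real files vary, so extend these as failures turn up.
START_RE = re.compile(r"^\*\*\* ?START OF TH(IS|E) PROJECT GUTENBERG EBOOK", re.IGNORECASE)
END_RE = re.compile(r"^(\*\*\* ?)?END OF (THE |THIS )?PROJECT GUTENBERG", re.IGNORECASE)


def strip_boilerplate(text):
    """Return the text between the start and end markers, or raise ValueError."""
    lines = text.splitlines()
    starts = [i for i, line in enumerate(lines) if START_RE.match(line)]
    # Assertion 1: exactly one start marker, otherwise the template is unknown.
    if len(starts) != 1:
        raise ValueError(f"expected exactly 1 start marker, found {len(starts)}")
    start = starts[0]
    ends = [i for i, line in enumerate(lines) if i > start and END_RE.match(line)]
    # Assertion 2: an end marker must follow the start marker.
    if not ends:
        raise ValueError("no end marker found after the start marker")
    body = lines[start + 1 : ends[0]]
    # Assertion 3: something must be left once the boilerplate is cut.
    if not any(line.strip() for line in body):
        raise ValueError("body between markers is empty")
    return "\n".join(body).strip("\n")
```

Files that raise ValueError are the ones to inspect by hand; once the same failure shows up twice, add a pattern for it.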

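Other tips

For years I had also wanted a tool to strip Project Gutenberg headers and footers, so that I could play with natural language processing without contaminating the analysis with boilerplate mixed in with the etext. After reading this question I finally pulled my finger out and wrote a Perl filter whose output you can pipe into any other tool.

It's built as a state machine using per-line regexes. It's written to be easy to understand, since speed is not an issue at the typical size of an etext. So far it works on the dozen etexts I have here, but there are bound to be more variations in the wild that need to be added. Hopefully the code is clear enough that anybody can add to it: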


#!/usr/bin/perl

# stripgutenberg.pl < in.txt > out.txt
#
# designed for piping
# Written by Andrew Dunbar (hippietrail), released into the public domain, Dec 2010

use strict;
use warnings;

my $debug = 0;

my $state = 'beginning';
my $print = 0;
my $printed = 0;

# while (<>) tests definedness, so a final line of "0" is not mistaken for EOF
while (<>) {

    # strip UTF-8 BOM
    if ($. == 1 && index($_, "\xef\xbb\xbf") == 0) {
        $_ = substr($_, 3);
    }

    if ($state eq 'beginning') {
        if (/^(The Project Gutenberg [Ee]Book( of|,)|Project Gutenberg's )/) {
            $state = 'normal pg header';
            $debug && print "state: beginning -> normal pg header\n";
            $print = 0;
        } elsif (/^$/) {
            $state = 'beginning blanks';
            $debug && print "state: beginning -> beginning blanks\n";
        } else {
            die "unrecognized beginning: $_";
        }
    } elsif ($state eq 'normal pg header') {
        if (/^\*\*\*\ ?START OF TH(IS|E) PROJECT GUTENBERG EBOOK,? /) {
            $state = 'end of normal header';
            $debug && print "state: normal pg header -> end of normal pg header\n";
        } else {
            # body of normal pg header
        }
    } elsif ($state eq 'end of normal header') {
        if (/^(Produced by|Transcribed from)/) {
            $state = 'post header';
            $debug && print "state: end of normal pg header -> post header\n";
        } elsif (/^$/) {
            # blank lines
        } else {
            $state = 'etext body';
            $debug && print "state: end of normal header -> etext body\n";
            $print = 1;
        }
    } elsif ($state eq 'post header') {
        if (/^$/) {
            $state = 'blanks after post header';
            $debug && print "state: post header -> blanks after post header\n";
        } else {
            # multiline Produced / Transcribed
        }
    } elsif ($state eq 'blanks after post header') {
        if (/^$/) {
            # more blank lines
        } else {
            $state = 'etext body';
            $debug && print "state: blanks after post header -> etext body\n";
            $print = 1;
        }
    } elsif ($state eq 'beginning blanks') {
        if (/<!-- #INCLUDE virtual=\"\/include\/ga-books-texth\.html\" -->/) {
            $state = 'header include';
            $debug && print "state: beginning blanks -> header include\n";
        } elsif (/^Title: /) {
            $state = 'aus header';
            $debug && print "state: beginning blanks -> aus header\n";
        } elsif (/^$/) {
            # more blanks
        } else {
            die "unexpected stuff after beginning blanks: $_";
        }
    } elsif ($state eq 'header include') {
        if (/^$/) {
            # blanks after header include
        } else {
            $state = 'aus header';
            $debug && print "state: header include -> aus header\n";
        }
    } elsif ($state eq 'aus header') {
        if (/^To contact Project Gutenberg of Australia go to http:\/\/gutenberg\.net\.au$/) {
            $state = 'end of aus header';
            $debug && print "state: aus header -> end of aus header\n";
        } elsif (/^A Project Gutenberg of Australia eBook$/) {
            $state = 'end of aus header';
            $debug && print "state: aus header -> end of aus header\n";
        }
    } elsif ($state eq 'end of aus header') {
        if (/^((Title|Author): .*)?$/) {
            # title, author, or blank line
        } else {
            $state = 'etext body';
            $debug && print "state: end of aus header -> etext body\n";
            $print = 1;
        }
    } elsif ($state eq 'etext body') {
        # here's the stuff
        if (/^<!-- #INCLUDE virtual="\/include\/ga-books-textf\.html" -->$/) {
            $state = 'footer';
            $debug && print "state: etext body -> footer\n";
            $print = 0;
        } elsif (/^(\*\*\* ?)?end of (the )?project/i) {
            $state = 'footer';
            $debug && print "state: etext body -> footer\n";
            $print = 0;
        }
    } elsif ($state eq 'footer') {
        # nothing more of interest
    } else {
        die "unknown state '$state'";
    }

    if ($print) {
        print;
        ++$printed;
    } else {
        $debug && print "## $_";
    }
}
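
The filter above dies on input it does not recognize, which covers half of the verification problem. A complementary check, sketched here in Python, is to scan the stripped output for phrases that suggest boilerplate survived. The phrase list and the function name leftover_boilerplate are hypothetical and should be tuned to your corpus.

```python
import re

# Hypothetical list of tell-tale phrases; tune it to your corpus.
SUSPECT_RE = re.compile(
    r"project gutenberg|gutenberg\.(org|net)|distributed proofread",
    re.IGNORECASE,
)


def leftover_boilerplate(text):
    """Return (line_number, line) pairs that still look like boilerplate."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if SUSPECT_RE.search(line)
    ]
```

An empty result is only weak evidence that the strip worked. A non-empty result needs a human look, because a book can legitimately mention Gutenberg in its own text.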

Wow, this question is so old now. Nevertheless, the gutenbergr package in R seems to do an OK job of removing headers, including the junk that comes after the 'official' end of the header.

First you need to install R/RStudio, then run:

install.packages('gutenbergr')
library(gutenbergr)
t <- gutenberg_download('25519')  # give it the id number of the text

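The header-stripping argument (named strip in current gutenbergr releases) is TRUE by default. You will probably also want to remove the illustration markers: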

library(data.table)
t <- as.data.table(t)  # I hate tibbles -- datatables are easier to work with
head(t)  # get the column names

# filter out lines that are illustrations and joins all lines with a space
# the \\[ searches for the [ character, the \\ are used to 'escape' the special [ character
# the !like() means find rows where the text column is not like the search string
no_il <- t[!like(text, '\\[Illustration'), 'text']
# collapse the text into a single character string
t_cln <- do.call(paste, c(no_il, collapse = ' '))
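
For readers not working in R, the same filter-and-join step can be sketched in Python. The function name drop_illustrations is made up for this example, and it assumes the text is already a list of lines with the headers stripped.

```python
def drop_illustrations(lines):
    """Join lines with single spaces, skipping any that contain '[Illustration'."""
    kept = [line for line in lines if "[Illustration" not in line]
    return " ".join(kept)
```

Like the R version, this only drops lines that contain the opening marker, so a caption that wraps onto following lines would leave its tail behind.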

License: CC-BY-SA with attribution

Not affiliated with StackOverflow