如何剥去Project Gutenberg的文本页眉/页脚？

https://stackoverflow.com/questions/1269146

13-09-2019
|

题

我已经试过各种方法来剥去古登堡计划的文本许可证，用作语言学习项目中的主体，但我似乎无法拿出一个无监督的，可靠的方法。我想出了迄今为止最好的启发式剥离前二十八行和最后398，其工作了大量的文本。作为方法，我可以自动剥文本的任何建议（这是大量文本的非常相似，但在每种情况下的细微差别，和几个不同的模板，以及），以及建议如何验证文本已被准确地剥离，将是非常有用的。

解决方案

您不是在开玩笑。这是几乎一样，如果他们试图使工作AI-完成。我能想到的只有两种方法，都没有完美的。

1）成立，譬如Perl脚本，以解决最常见的模式（例如，寻找“所生产的”这句话，继续前进到下一个空行，切那里），但投入大量的关于什么的期望断言（例如下一个文本应该是标题或作者）。当模式失败这样的话，你就会知道它。第一次的模式失败，手工做。在第二时间，修改脚本。

2）尝试亚马逊的Mechanical Turk 。

其他提示

我也想剥夺古登堡计划的页眉和页脚年自然语言处理打不污染与样板与etxt混在分析的工具。阅读这个问题后，我终于把我的手指了，写了一个Perl过滤器，可通过管道进入任何其他工具。

它是由如使用每行的正则表达式的状态机。这是写是很容易理解，因为速度不是与电子文本的典型大小的问题。到目前为止，它工作在几十个电子文本我在这里，但在野外也有一定为需要添加更多的变化。希望的代码是足够清楚的是任何人都可以添加到它：


#!/usr/bin/perl

# stripgutenberg.pl < in.txt > out.txt
#
# designed for piping
# Written by Andrew Dunbar (hippietrail), released into the public domain, Dec 2010

use strict;

my $debug = 0;

my $state = 'beginning';
my $print = 0;
my $printed = 0;

while (1) {
    $_ = <>;

    last unless $_;

    # strip UTF-8 BOM
    if ($. == 1 && index($_, "\xef\xbb\xbf") == 0) {
        $_ = substr($_, 3);
    }

    if ($state eq 'beginning') {
        if (/^(The Project Gutenberg [Ee]Book( of|,)|Project Gutenberg's )/) {
            $state = 'normal pg header';
            $debug && print "state: beginning -> normal pg header\n";
            $print = 0;
        } elsif (/^$/) {
            $state = 'beginning blanks';
            $debug && print "state: beginning -> beginning blanks\n";
        } else {
            die "unrecognized beginning: $_";
        }
    } elsif ($state eq 'normal pg header') {
        if (/^\*\*\*\ ?START OF TH(IS|E) PROJECT GUTENBERG EBOOK,? /) {
            $state = 'end of normal header';
            $debug && print "state: normal pg header -> end of normal pg header\n";
        } else {
            # body of normal pg header
        }
    } elsif ($state eq 'end of normal header') {
        if (/^(Produced by|Transcribed from)/) {
            $state = 'post header';
            $debug && print "state: end of normal pg header -> post header\n";
        } elsif (/^$/) {
            # blank lines
        } else {
            $state = 'etext body';
            $debug && print "state: end of normal header -> etext body\n";
            $print = 1;
        }
    } elsif ($state eq 'post header') {
        if (/^$/) {
            $state = 'blanks after post header';
            $debug && print "state: post header -> blanks after post header\n";
        } else {
            # multiline Produced / Transcribed
        }
    } elsif ($state eq 'blanks after post header') {
        if (/^$/) {
            # more blank lines
        } else {
            $state = 'etext body';
            $debug && print "state: blanks after post header -> etext body\n";
            $print = 1;
        }
    } elsif ($state eq 'beginning blanks') {
        if (/<!-- #INCLUDE virtual=\"\/include\/ga-books-texth\.html\" -->/) {
            $state = 'header include';
            $debug && print "state: beginning blanks -> header include\n";
        } elsif (/^Title: /) {
            $state = 'aus header';
            $debug && print "state: beginning blanks -> aus header\n";
        } elsif (/^$/) {
            # more blanks
        } else {
            die "unexpected stuff after beginning blanks: $_";
        }
    } elsif ($state eq 'header include') {
        if (/^$/) {
            # blanks after header include
        } else {
            $state = 'aus header';
            $debug && print "state: header include -> aus header\n";
        }
    } elsif ($state eq 'aus header') {
        if (/^To contact Project Gutenberg of Australia go to http:\/\/gutenberg\.net\.au$/) {
            $state = 'end of aus header';
            $debug && print "state: aus header -> end of aus header\n";
        } elsif (/^A Project Gutenberg of Australia eBook$/) {
            $state = 'end of aus header';
            $debug && print "state: aus header -> end of aus header\n";
        }
    } elsif ($state eq 'end of aus header') {
        if (/^((Title|Author): .*)?$/) {
            # title, author, or blank line
        } else {
            $state = 'etext body';
            $debug && print "state: end of aus header -> etext body\n";
            $print = 1;
        }
    } elsif ($state eq 'etext body') {
        # here's the stuff
        if (/^<!-- #INCLUDE virtual="\/include\/ga-books-textf\.html" -->$/) {
            $state = 'footer';
            $debug && print "state: etext body -> footer\n";
            $print = 0;
        } elsif (/^(\*\*\* ?)?end of (the )?project/i) {
            $state = 'footer';
            $debug && print "state: etext body -> footer\n";
            $print = 0;
        }
    } elsif ($state eq 'footer') {
        # nothing more of interest
    } else {
        die "unknown state '$state'";
    }

    if ($print) {
        print;
        ++$printed;
    } else {
        $debug && print "## $_";
    }
}

哇，这现在的问题是这样的历史。尽管如此，gutenbergr包中的R似乎做去除报头，包括报头的“官方”结束后垃圾一个确定工作。

首先，您需要安装R / Rstudio，然后

install.packages('gutenbergr')
library(gutenbergr)
t <- gutenberg_download('25519')  # give it the id number of the text

在strip_headers ARG为T默认。你也可能会想要删除说明：

library(data.table)
t <- as.data.table(t)  # I hate tibbles -- datatables are easier to work with
head(t)  # get the column names

# filter out lines that are illustrations and joins all lines with a space
# the \\[ searches for the [ character, the \\ are used to 'escape' the special [ character
# the !like() means find rows where the text column is not like the search string
no_il <- t[!like(text, '\\[Illustration'), 'text']
# collapse the text into a single character string
t_cln <- do.call(paste, c(no_il, collapse = ' '))

许可以下： CC-BY-SA 和归因

不隶属于 StackOverflow