How to strip headers/footers from Project Gutenberg texts?

https://stackoverflow.com/questions/1269146

13-09-2019

Question

I've tried various methods to strip the license from Project Gutenberg texts, for use as a corpus for a language learning project, but I can't seem to come up with an unsupervised, reliable approach. The best heuristic I've come up with so far is stripping the first twenty-eight lines and the last 398, which worked for a large number of the texts. Any suggestions for ways to strip the text automatically (it is very similar across many of the texts, but with slight differences in each case, and there are a few different templates as well), along with suggestions for how to verify that the text has been stripped accurately, would be very useful.
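One way to check a stripping heuristic like the fixed line counts above is to scan the result for phrases that should only occur in the license boilerplate. The following is a minimal Python sketch under that assumption. The function names and the pattern list are hypothetical and chosen for illustration, and a hit only flags a line for manual review, since a phrase such as "produced by" can also occur in a book's own text.

```python
import re

# Phrases that normally appear only in Project Gutenberg boilerplate.
# This list is an assumption for illustration, not an exhaustive catalogue.
BOILERPLATE_PATTERNS = [
    re.compile(r"project gutenberg", re.IGNORECASE),
    re.compile(r"\*\*\* ?(start|end) of", re.IGNORECASE),
    re.compile(r"produced by", re.IGNORECASE),
]


def strip_fixed(lines, head=28, tail=398):
    """Drop a fixed number of lines from each end (the heuristic in the question)."""
    if len(lines) <= head + tail:
        return []
    return lines[head:len(lines) - tail]


def leftover_boilerplate(lines):
    """Return (line_number, text) pairs that still look like boilerplate."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if any(p.search(line) for p in BOILERPLATE_PATTERNS)
    ]
```

Running leftover_boilerplate over every stripped file and listing the files with a non-empty result gives a cheap way to find the texts where the fixed offsets were wrong.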

Solution

You weren't kidding. It's almost as if they were trying to make the job AI-complete. I can think of only two approaches, neither of them perfect.

1) Set up a script in, say, Perl, to handle the most common patterns (e.g., look for the phrase "produced by", continue down to the next blank line and cut there), but put in lots of assertions about what you expect (e.g., the next text should be the title or author). That way, when a pattern fails, you'll know it. The first time a pattern fails, do it by hand. The second time, modify the script.

2) Try Amazon's Mechanical Turk.
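The first approach could be sketched as follows. This is a minimal illustration in Python rather than Perl, and it covers only the "produced by" pattern mentioned above. The exception class, the function name, and the "short non-blank line" test standing in for "looks like a title or author" are all assumptions made for the example.

```python
import re


class UnexpectedLayout(Exception):
    """Raised when a text does not match the pattern this sketch expects."""


def cut_after_produced_by(lines):
    """Return the lines that follow the 'Produced by' credit block.

    The credit block is assumed to start on a line beginning with
    'Produced by' and to end at the next blank line.
    """
    start = next(
        (i for i, line in enumerate(lines)
         if re.match(r"produced by", line, re.IGNORECASE)),
        None,
    )
    if start is None:
        raise UnexpectedLayout("no 'Produced by' line found")

    # Walk forward to the blank line that ends the credit block.
    end = start
    while end < len(lines) and lines[end].strip():
        end += 1
    if end == len(lines):
        raise UnexpectedLayout("credit block runs to end of file")

    body = lines[end + 1:]
    # Assertion in the spirit of the answer: the next text should look like
    # a title or author. A short non-blank line is a hypothetical stand-in.
    first = next((line.strip() for line in body if line.strip()), None)
    if first is None or len(first) > 80:
        raise UnexpectedLayout(f"unexpected text after header: {first!r}")
    return body
```

Failing loudly with an exception is the point of the design: a text that does not fit the pattern is reported instead of being silently mangled, so it can be handled by hand or used to extend the script.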

Other answers

I've also wanted a tool to strip Project Gutenberg headers and footers for years, for playing with natural language processing without contaminating the analysis with boilerplate mixed in with the etext. After reading this question I finally pulled my finger out and wrote a Perl filter that you can pipe into any other tool.

It's built as a state machine using per-line regular expressions. It's written to be easy to understand, since speed is not an issue at the typical size of etexts. So far it works on the couple of dozen etexts I have here, but in the wild there are sure to be many more variations that need to be added. Hopefully the code is clear enough that anybody can add to it:


#!/usr/bin/perl

# stripgutenberg.pl < in.txt > out.txt
#
# designed for piping
# Written by Andrew Dunbar (hippietrail), released into the public domain, Dec 2010

use strict;
use warnings;

my $debug = 0;

my $state = 'beginning';
my $print = 0;
my $printed = 0;

# while (<>) tests definedness, so a final line consisting of "0" is not dropped
while (<>) {

    # strip UTF-8 BOM
    if ($. == 1 && index($_, "\xef\xbb\xbf") == 0) {
        $_ = substr($_, 3);
    }

    if ($state eq 'beginning') {
        if (/^(The Project Gutenberg [Ee]Book( of|,)|Project Gutenberg's )/) {
            $state = 'normal pg header';
            $debug && print "state: beginning -> normal pg header\n";
            $print = 0;
        } elsif (/^$/) {
            $state = 'beginning blanks';
            $debug && print "state: beginning -> beginning blanks\n";
        } else {
            die "unrecognized beginning: $_";
        }
    } elsif ($state eq 'normal pg header') {
        if (/^\*\*\*\ ?START OF TH(IS|E) PROJECT GUTENBERG EBOOK,? /) {
            $state = 'end of normal header';
            $debug && print "state: normal pg header -> end of normal header\n";
        } else {
            # body of normal pg header
        }
    } elsif ($state eq 'end of normal header') {
        if (/^(Produced by|Transcribed from)/) {
            $state = 'post header';
            $debug && print "state: end of normal header -> post header\n";
        } elsif (/^$/) {
            # blank lines
        } else {
            $state = 'etext body';
            $debug && print "state: end of normal header -> etext body\n";
            $print = 1;
        }
    } elsif ($state eq 'post header') {
        if (/^$/) {
            $state = 'blanks after post header';
            $debug && print "state: post header -> blanks after post header\n";
        } else {
            # multiline Produced / Transcribed
        }
    } elsif ($state eq 'blanks after post header') {
        if (/^$/) {
            # more blank lines
        } else {
            $state = 'etext body';
            $debug && print "state: blanks after post header -> etext body\n";
            $print = 1;
        }
    } elsif ($state eq 'beginning blanks') {
        if (/<!-- #INCLUDE virtual=\"\/include\/ga-books-texth\.html\" -->/) {
            $state = 'header include';
            $debug && print "state: beginning blanks -> header include\n";
        } elsif (/^Title: /) {
            $state = 'aus header';
            $debug && print "state: beginning blanks -> aus header\n";
        } elsif (/^$/) {
            # more blanks
        } else {
            die "unexpected stuff after beginning blanks: $_";
        }
    } elsif ($state eq 'header include') {
        if (/^$/) {
            # blanks after header include
        } else {
            $state = 'aus header';
            $debug && print "state: header include -> aus header\n";
        }
    } elsif ($state eq 'aus header') {
        if (/^To contact Project Gutenberg of Australia go to http:\/\/gutenberg\.net\.au$/) {
            $state = 'end of aus header';
            $debug && print "state: aus header -> end of aus header\n";
        } elsif (/^A Project Gutenberg of Australia eBook$/) {
            $state = 'end of aus header';
            $debug && print "state: aus header -> end of aus header\n";
        }
    } elsif ($state eq 'end of aus header') {
        if (/^((Title|Author): .*)?$/) {
            # title, author, or blank line
        } else {
            $state = 'etext body';
            $debug && print "state: end of aus header -> etext body\n";
            $print = 1;
        }
    } elsif ($state eq 'etext body') {
        # here's the stuff
        if (/^<!-- #INCLUDE virtual="\/include\/ga-books-textf\.html" -->$/) {
            $state = 'footer';
            $debug && print "state: etext body -> footer\n";
            $print = 0;
        } elsif (/^(\*\*\* ?)?end of (the )?project/i) {
            $state = 'footer';
            $debug && print "state: etext body -> footer\n";
            $print = 0;
        }
    } elsif ($state eq 'footer') {
        # nothing more of interest
    } else {
        die "unknown state '$state'";
    }

    if ($print) {
        print;
        ++$printed;
    } else {
        $debug && print "## $_";
    }
}
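For readers who want to see the state-machine idea in isolation, here is a much reduced sketch in Python. It is not a port of the filter above: it knows only the standard "*** START OF" and "*** END OF" marker lines, it ignores the Project Gutenberg of Australia layouts, and it does not skip the "Produced by" block. The function name and the three state names are assumptions made for the example.

```python
import re

START = re.compile(r"^\*\*\* ?START OF TH(IS|E) PROJECT GUTENBERG EBOOK", re.IGNORECASE)
END = re.compile(r"^(\*\*\* ?)?END OF (THE |THIS )?PROJECT GUTENBERG", re.IGNORECASE)


def strip_markers(lines):
    """Yield only the lines between the START and END marker lines."""
    state = "header"
    for line in lines:
        if state == "header":
            if START.match(line):
                state = "body"
        elif state == "body":
            if END.match(line):
                state = "footer"
            else:
                yield line
        # state == "footer": nothing more of interest
    if state != "footer":
        # Fail loudly when a text does not fit the expected template.
        raise ValueError(f"input ended in state {state!r}; markers not found")
```

Used as a pipe filter, this would be called as sys.stdout.writelines(strip_markers(sys.stdin)) after importing sys. Because the error is raised only once the input is exhausted, a caller that needs all-or-nothing output should collect the lines in a list before writing them.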

Wow, this question is so old now. Still, the gutenbergr package in R seems to do an OK job of removing headers, including junk after the 'official' end of the header.

First you need to install R / RStudio, then

install.packages('gutenbergr')
library(gutenbergr)
t <- gutenberg_download('25519')  # give it the id number of the text

Header stripping is on by default (the strip argument of gutenberg_download defaults to TRUE). You will probably also want to remove illustrations:

library(data.table)
t <- as.data.table(t)  # I hate tibbles -- datatables are easier to work with
head(t)  # get the column names

# filter out lines that are illustrations (the remaining lines are joined with a space further down)
# the \\[ searches for the [ character, the \\ are used to 'escape' the special [ character
# the !like() means find rows where the text column is not like the search string
no_il <- t[!like(text, '\\[Illustration'), 'text']
# collapse the text into a single character string
t_cln <- do.call(paste, c(no_il, collapse = ' '))

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow