How do I strip headers and footers from Project Gutenberg texts?

https://stackoverflow.com/questions/1269146

13-09-2019

Question

I've tried a number of methods to automatically strip the license from Project Gutenberg texts, for use as a corpus in a language-learning project, but I can't seem to come up with a reliable, unsupervised approach. The best heuristic I've found so far is to drop the first 28 lines and the last 398, which worked for a large number of the texts. The boilerplate is very similar across many of the texts, but with slight differences in each case, and there are a few different templates as well. Any suggestions for ways to strip it automatically, and for how to verify that a text has been stripped accurately, would be very useful.
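
For the verification half of the question, one possible check is to scan the stripped output for phrases that normally occur only in the boilerplate. The sketch below is in Python rather than the Perl and R used elsewhere in this thread; the function name `find_residue` and the pattern list are assumptions made for illustration, not part of any existing tool.

```python
import re

# Phrases that normally appear only in Project Gutenberg boilerplate.
# This list is an assumption for illustration; extend it for your corpus.
RESIDUE_PATTERNS = [
    re.compile(r"project gutenberg", re.IGNORECASE),
    re.compile(r"\*\*\* ?(start|end) of", re.IGNORECASE),
]


def find_residue(text):
    """Return (line_number, line) pairs that still look like boilerplate."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if any(p.search(line) for p in RESIDUE_PATTERNS):
            hits.append((number, line))
    return hits
```

An empty result does not prove the text is clean, and a book that legitimately mentions Project Gutenberg would be flagged, so this is best used to pick out files for manual review.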

Solution

You weren't kidding. It's almost as if they were trying to make the job AI-complete. I can think of only two approaches, neither of them perfect.

1) Set up a script in, say, Perl, to handle the most common patterns (for example, look for the phrase "produced by", keep going down to the next blank line, and cut there), but put in lots of assertions about what you expect (for example, the next text should be the title or the author). That way, when a pattern fails, you'll know. The first time a pattern fails, do that text by hand. The second time, modify the script.

2) Try Amazon's Mechanical Turk.
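
The first approach might look roughly like the following. This is a minimal sketch in Python rather than Perl, handling only the "produced by" pattern mentioned above; the names `cut_after_credit`, `PatternMismatch`, and the `search_limit` parameter are hypothetical, and the checks are deliberately strict so that an unfamiliar layout fails loudly instead of being cut in the wrong place.

```python
import re


class PatternMismatch(Exception):
    """Raised when a text does not match the expected header layout."""


def cut_after_credit(lines, search_limit=100):
    """Drop everything up to the blank line following a 'Produced by' credit.

    `search_limit` is a hypothetical safety bound: if the credit line is not
    found within that many lines, fail loudly instead of guessing.
    """
    credit = None
    for i, line in enumerate(lines[:search_limit]):
        if re.match(r"produced by", line, re.IGNORECASE):
            credit = i
            break
    if credit is None:
        raise PatternMismatch(
            "no 'Produced by' line in the first %d lines" % search_limit
        )

    # Skip the (possibly multi-line) credit block up to the next blank line.
    i = credit
    while i < len(lines) and lines[i].strip():
        i += 1
    # Skip the blank lines that follow it.
    while i < len(lines) and not lines[i].strip():
        i += 1
    if i == len(lines):
        raise PatternMismatch("nothing follows the credit block")
    return lines[i:]
```

Texts that raise `PatternMismatch` are the ones to handle by hand the first time and to teach the script about the second time.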

Other answers

For years I've also wanted a tool to strip Project Gutenberg headers and footers, so I could play with natural language processing without contaminating the analysis with boilerplate mixed in with the etext. After reading this question I finally pulled my finger out and wrote a Perl filter, which you can pipe into any other tool.

It's built as a state machine that applies regular expressions line by line. It's written to be easy to understand, since speed is not an issue at the typical size of an etext. So far it works on the couple of dozen etexts I have here, but in the wild there are sure to be many more variations that need to be added. Hopefully the code is clear enough that anyone can add to it:


#!/usr/bin/perl

# stripgutenberg.pl < in.txt > out.txt
#
# designed for piping
# Written by Andrew Dunbar (hippietrail), released into the public domain, Dec 2010

use strict;
use warnings;

my $debug = 0;

my $state = 'beginning';
my $print = 0;
my $printed = 0;

# while (<>) tests defined(), so a final line consisting of "0" is not mistaken for end of input
while (<>) {

    # strip UTF-8 BOM
    if ($. == 1 && index($_, "\xef\xbb\xbf") == 0) {
        $_ = substr($_, 3);
    }

    if ($state eq 'beginning') {
        if (/^(The Project Gutenberg [Ee]Book( of|,)|Project Gutenberg's )/) {
            $state = 'normal pg header';
            $debug && print "state: beginning -> normal pg header\n";
            $print = 0;
        } elsif (/^$/) {
            $state = 'beginning blanks';
            $debug && print "state: beginning -> beginning blanks\n";
        } else {
            die "unrecognized beginning: $_";
        }
    } elsif ($state eq 'normal pg header') {
        if (/^\*\*\*\ ?START OF TH(IS|E) PROJECT GUTENBERG EBOOK,? /) {
            $state = 'end of normal header';
            $debug && print "state: normal pg header -> end of normal pg header\n";
        } else {
            # body of normal pg header
        }
    } elsif ($state eq 'end of normal header') {
        if (/^(Produced by|Transcribed from)/) {
            $state = 'post header';
            $debug && print "state: end of normal pg header -> post header\n";
        } elsif (/^$/) {
            # blank lines
        } else {
            $state = 'etext body';
            $debug && print "state: end of normal header -> etext body\n";
            $print = 1;
        }
    } elsif ($state eq 'post header') {
        if (/^$/) {
            $state = 'blanks after post header';
            $debug && print "state: post header -> blanks after post header\n";
        } else {
            # multiline Produced / Transcribed
        }
    } elsif ($state eq 'blanks after post header') {
        if (/^$/) {
            # more blank lines
        } else {
            $state = 'etext body';
            $debug && print "state: blanks after post header -> etext body\n";
            $print = 1;
        }
    } elsif ($state eq 'beginning blanks') {
        if (/<!-- #INCLUDE virtual=\"\/include\/ga-books-texth\.html\" -->/) {
            $state = 'header include';
            $debug && print "state: beginning blanks -> header include\n";
        } elsif (/^Title: /) {
            $state = 'aus header';
            $debug && print "state: beginning blanks -> aus header\n";
        } elsif (/^$/) {
            # more blanks
        } else {
            die "unexpected stuff after beginning blanks: $_";
        }
    } elsif ($state eq 'header include') {
        if (/^$/) {
            # blanks after header include
        } else {
            $state = 'aus header';
            $debug && print "state: header include -> aus header\n";
        }
    } elsif ($state eq 'aus header') {
        if (/^To contact Project Gutenberg of Australia go to http:\/\/gutenberg\.net\.au$/) {
            $state = 'end of aus header';
            $debug && print "state: aus header -> end of aus header\n";
        } elsif (/^A Project Gutenberg of Australia eBook$/) {
            $state = 'end of aus header';
            $debug && print "state: aus header -> end of aus header\n";
        }
    } elsif ($state eq 'end of aus header') {
        if (/^((Title|Author): .*)?$/) {
            # title, author, or blank line
        } else {
            $state = 'etext body';
            $debug && print "state: end of aus header -> etext body\n";
            $print = 1;
        }
    } elsif ($state eq 'etext body') {
        # here's the stuff
        if (/^<!-- #INCLUDE virtual="\/include\/ga-books-textf\.html" -->$/) {
            $state = 'footer';
            $debug && print "state: etext body -> footer\n";
            $print = 0;
        } elsif (/^(\*\*\* ?)?end of (the )?project/i) {
            $state = 'footer';
            $debug && print "state: etext body -> footer\n";
            $print = 0;
        }
    } elsif ($state eq 'footer') {
        # nothing more of interest
    } else {
        die "unknown state '$state'";
    }

    if ($print) {
        print;
        ++$printed;
    } else {
        $debug && print "## $_";
    }
}
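
For readers who want the shape of the state machine without all the variants, here is a reduced sketch in Python. It is not a port of the script above: it keeps only three states and the two marker patterns taken from the script's `normal pg header` and `etext body` branches, and it does not handle the "Produced by" block or the Project Gutenberg of Australia layout. The function name `strip_markers` is an assumption for illustration.

```python
import re

# Marker patterns adapted from the Perl script above.
START = re.compile(r"^\*\*\* ?START OF TH(IS|E) PROJECT GUTENBERG EBOOK")
END = re.compile(r"^(\*\*\* ?)?END OF (THE )?PROJECT", re.IGNORECASE)


def strip_markers(lines):
    """Yield only the lines between the START and END markers."""
    state = "header"
    for line in lines:
        if state == "header":
            if START.match(line):
                state = "body"
        elif state == "body":
            if END.match(line):
                state = "footer"
            else:
                yield line
        # In the "footer" state every remaining line is ignored.
```

If no START marker is ever seen, nothing is yielded, so an empty result is itself a signal that the text uses a layout this sketch does not know.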

Wow, this question is old. Still, the gutenbergr package in R seems to do a good job of removing headers, including the junk after the 'official' end of the header.

First you'll need to install R/RStudio, then:

install.packages('gutenbergr')
library(gutenbergr)
t <- gutenberg_download('25519')  # give it the id number of the text

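The header-stripping argument is TRUE by default (this answer originally called it strip_headers; the gutenbergr documentation lists it as strip, so check the version you have installed). You'll probably also want to remove illustrations: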

library(data.table)
t <- as.data.table(t)  # I hate tibbles -- datatables are easier to work with
head(t)  # get the column names

# filter out lines that are illustrations and joins all lines with a space
# the \\[ searches for the [ character, the \\ are used to 'escape' the special [ character
# the !like() means find rows where the text column is not like the search string
no_il <- t[!like(text, '\\[Illustration'), 'text']
# collapse the text into a single character string
t_cln <- do.call(paste, c(no_il, collapse = ' '))
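
For readers not working in R, the same line-based filter can be sketched in Python. The function name `drop_illustrations` is hypothetical, and like the R code above it only drops lines that contain the `[Illustration` tag itself, so a caption that wraps onto following lines would leave its continuation lines behind.

```python
def drop_illustrations(lines):
    """Remove lines containing '[Illustration' and join the rest with spaces."""
    kept = [line for line in lines if "[Illustration" not in line]
    return " ".join(kept)
```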

Licensed under: CC-BY-SA with attribution