Question

From a given noun list in a .txt file, where nouns are separated by newlines, such as this one:

hooligan
football
brother
bollocks

...and a separate .txt file containing a series of regular expressions separated by new lines, like this:

[a-z]+\tNN(S)?
[a-z]+\tJJ(S)?

...I would like to run the regular expressions against each sentence of a corpus. Every time a regexp matches, and the matched text contains one of the nouns in the list, I would like to print that noun followed by a tab and the regular expression that matched it. Here is an example of what the resulting output could look like:

football    [a-z]+NN(S)?\'s POS[a-z]+NN(S)?
hooligan    [a-z]+NN(S)?,,[a-z]+JJ[a-z]+NN(S)?
hooligan    [a-z]+NN(S)?,,[a-z]+JJ[a-z]+NN(S)?
football    [a-z]+NN(S)?[a-z]+NN(S)?
brother [a-z]+PP$[a-z]+NN(S)?
bollocks    [a-z]+DT[a-z]+NN(S)?
football    [a-z]+NN(s)?(be)VBZnotRB

The corpus I would use is huge (tens of GB) and has the following format (each sentence is enclosed in <s> tags):

<s>
Hooligans   hooligan    NNS 1   4   NMOD
,   ,   ,   2   4   P
unbridled   unbridled   JJ  3   4   NMOD
passion passion NN  4   0   ROOT
-   -   :   5   4   P
and and CC  6   4   CC
no  no  DT  7   9   NMOD
executive   executive   JJ  8   9   NMOD
boxes   box NNS 9   4   COORD
.   .   SENT    10  0   ROOT
</s>
<s>
Hooligans   hooligan    NNS 1   4   NMOD
,   ,   ,   2   4   P
unbridled   unbridled   JJ  3   4   NMOD
passion passion NN  4   0   ROOT
-   -   :   5   4   P
and and CC  6   4   CC
no  no  DT  7   9   NMOD
executive   executive   JJ  8   9   NMOD
boxes   box NNS 9   4   COORD
.   .   SENT    10  0   ROOT
</s>
<s>
Portsmouth  Portsmouth  NP  1   2   SBJ
bring   bring   VVP 2   0   ROOT
something   something   NN  3   2   OBJ
entirely    entirely    RB  4   5   AMOD
different   different   JJ  5   3   NMOD
to  to  TO  6   5   AMOD
the the DT  7   12  NMOD
Premiership Premiership NP  8   12  NMOD
:   :   :   9   12  P
football    football    NN  10  12  NMOD
's  's  POS 11  10  NMOD
past    past    NN  12  6   PMOD
.   .   SENT    13  2   P
</s>
<s>
This    this    DT  1   2   SBJ
is  be  VBZ 2   0   ROOT
one one CD  3   2   PRD
of  of  IN  4   3   NMOD
Britain Britain NP  5   10  NMOD
's  's  POS 6   5   NMOD
most    most    RBS 7   8   AMOD
ardent  ardent  JJ  8   10  NMOD
football    football    NN  9   10  NMOD
cities  city    NNS 10  4   PMOD
:   :   :   11  2   P
think   think   VVP 12  2   COORD
Liverpool   Liverpool   NP  13  0   ROOT
or  or  CC  14  13  CC
Newcastle   Newcastle   NP  15  19  SBJ
in  in  IN  16  15  ADV
miniature   miniature   NN  17  16  PMOD
,   ,   ,   18  15  P
wound   wind    VVD 19  13  COORD
back    back    RB  20  19  ADV
three   three   CD  21  22  NMOD
decades decade  NNS 22  19  OBJ
.   .   SENT    23  2   P
</s>

I started working on a Perl script to achieve my goal. To avoid running out of memory with such a huge dataset, I used the module Tie::File so that my script would read one line at a time (instead of trying to load the entire corpus file into memory). This would work perfectly with a corpus where each sentence corresponds to a single line, but not in the current case, where sentences are spread over several lines and delimited by a tag.

Is there a way to achieve what I want using a combination of Unix terminal commands (e.g. cat and grep)? Alternatively, what would be the best solution for this problem? (Some code examples would be great.)

Solution 2

I ended up writing a quick script that solves my problem. I used Tie::File to handle huge text datasets and specified </s> as the record separator, as suggested by Jonathan Leffler (the solution proposed by Dave Sherohman seems very elegant, but I couldn't try it). After separating the sentences, I isolate the columns that I need (2nd and 3rd) and run the regular expressions against them. Before printing the output, I check whether the matched word is present in my word list: if it is not, the match is excluded from the output.

I share my code here (comments included) in case someone else needs something similar.

It's a bit dirty and could definitely be optimized, but it works for me and it supports very large corpora (I tested it with a 10 GB corpus: it completed successfully in a few hours).

use strict;
use Tie::File; #This module makes a file look like a Perl array, each array element corresponds to a line of the file.

if ($#ARGV < 0 ) {  print "Usage: perl albzcount.pl corpusfile\n"; exit; }

#read nouns list (.txt file with one word per line - line breaks LF)
my $nouns_list = "nouns.txt";
open(my $nouns_fh, '<', $nouns_list) || die("Could not open the config file $nouns_list: $!");
my @nouns_contained_in_list=<$nouns_fh>;
close($nouns_fh);

# Reading regexp list (.txt file with one regexp per line - line breaks LF)
my $regex_list = "regexp.txt";
open(my $regex_fh, '<', $regex_list) || die("Could not open the config file $regex_list: $!");
my @regexps_contained_in_list=<$regex_fh>;
close($regex_fh);

# Reading Corpus File (each sentence is spread over several lines and ends with the tag </s>)
my $corpusfile = $ARGV[0]; #Corpus filename (passed as an argument through the command)

# With TIE I don't load the entire file into an array. Perl thinks it's an array but the records are fetched from the file on demand
# (with recsep => '</s>' each array element is one whole sentence rather than one line).
# This is the key to manipulating huge text files without running out of memory
tie my @raw_corpus_data, 'Tie::File', $corpusfile,  recsep => '</s>' or die "Can't read file: $!\n";

#START go through the sentences of the corpus (each spread over multiple lines and ending with </s>), one by one
foreach my $corpus_line (@raw_corpus_data){

#take a single sentence (that is spread over several lines).
#NB each line contains "columns" separated by tab
my @corpus_sublines = split('\n', $corpus_line); 

#declare a new variable (it shadows the loop variable above). Later values will be appended to it
my $corpus_line = "";

    #for each line that composes a sentence
    foreach my $sentence_newline(@corpus_sublines){

    #explode by tab (column separator)
    my @corpus_columns = split('\t', $sentence_newline); 

    #skip lines that have no lemma and tag columns (e.g. the <s> tag or empty lines)
    next if @corpus_columns < 3;

    #put together new sentences using just column 2 and 3 (lemma and tag) for each original sentence
    $corpus_line .= "$corpus_columns[1]\t$corpus_columns[2]\n";

    #... Now the corpus has the format I want and can be processed
    }

    #foreach regex
    foreach my $single_regexp(@regexps_contained_in_list){ 

        # Remove the new lines (both \n and \r - depending on the OS) from the regexp present in the file. 
        # Without this, the regular expressions read from the file don't always work.
        $single_regexp =~ s/\r|\n//g; 

            #if the corpus line analyzed in this cycle matches the regexp
            if($corpus_line =~ m/$single_regexp/) { 

            # explode by tab the matched results so the first word $onematch[0] can be isolated
            # $& is the entire matched string
            my @onematch = split('\t', $&);

                # OUTPUT RESULTS
                #if the matched noun is not empty and it is part of the word list
                #(\Q...\E keeps any regex metacharacters in the matched word literal)
                if ($onematch[0] ne "" && grep( /^\Q$onematch[0]\E$/, @nouns_contained_in_list )) { 
                print "$onematch[0]\t$single_regexp\n";
                } # END OUTPUT RESULTS
            } #END if the corpus line analyzed in this cycle matches the regexp
    } #END foreach regex
} #END go through the sentences of the corpus, one by one

# Untie the source corpus file
untie @raw_corpus_data; 
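
For comparison, here is a minimal sketch of the same pipeline that sets Perl's input record separator $/ to </s> instead of tying the file, so only one sentence is held in memory at a time. It also compiles each regexp once and looks nouns up in a hash rather than scanning the list with grep. The sub names (read_list, lemma_tag_lines, matches_in_sentence, process_corpus) are made up for this illustration. Unlike the script above, it reports every match of a regexp in a sentence, not only the first. It has not been timed against the full corpus, so treat it as a starting point rather than a drop-in replacement.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Read a one-item-per-line file, dropping line endings (LF or CRLF) and empty lines.
sub read_list {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my @items;
    while (my $line = <$fh>) {
        $line =~ s/[\r\n]+\z//;
        push @items, $line if length $line;
    }
    close $fh;
    return @items;
}

# Reduce one raw sentence record to "lemma\ttag\n" lines (columns 2 and 3).
sub lemma_tag_lines {
    my ($record) = @_;
    my $out = '';
    for my $line (split /\n/, $record) {
        my @cols = split /\t/, $line;
        next if @cols < 3;    # skips <s>, </s> and blank lines
        $out .= "$cols[1]\t$cols[2]\n";
    }
    return $out;
}

# Return [noun, pattern source] pairs for every match whose first
# tab-separated field is one of the wanted nouns.
sub matches_in_sentence {
    my ($sentence, $patterns, $is_noun) = @_;
    my @hits;
    for my $p (@$patterns) {
        my $re = $p->{re};
        while ($sentence =~ /$re/g) {
            my $matched = substr($sentence, $-[0], $+[0] - $-[0]);
            my ($first) = split /\t/, $matched;
            push @hits, [$first, $p->{src}]
                if defined $first && $is_noun->{$first};
        }
    }
    return @hits;
}

# Stream the corpus one sentence at a time and print "noun<TAB>regexp" lines.
sub process_corpus {
    my ($in, $patterns, $is_noun, $out) = @_;
    local $/ = '</s>';    # each read returns one whole sentence
    while (my $record = <$in>) {
        my $sentence = lemma_tag_lines($record);
        next if $sentence eq '';
        print {$out} "$_->[0]\t$_->[1]\n"
            for matches_in_sentence($sentence, $patterns, $is_noun);
    }
}

# Runs only when a corpus file is given on the command line.
if (@ARGV) {
    my %is_noun  = map { $_ => 1 } read_list('nouns.txt');
    my @patterns = map { +{ src => $_, re => qr/$_/ } } read_list('regexp.txt');
    open my $corpus, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    process_corpus($corpus, \@patterns, \%is_noun, \*STDOUT);
    close $corpus;
}
```

It would be invoked the same way as the script above, with the corpus file as the only argument and nouns.txt and regexp.txt in the current directory.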

OTHER TIPS

A simple regex alternation is sufficient to extract matching data from the noun list and Regexp::Assemble can handle the requirement for identifying which pattern from the other file matched. And, as Jonathan Leffler mentions in his comment, setting the input record separator allows you to read a single record at a time, even when each record spans multiple lines.

Combining all that into a running example, we get:

#!/usr/bin/env perl    

use strict;
use warnings;
use 5.010;

use Regexp::Assemble;

my @nouns = qw( hooligan football brother bollocks );
my @patterns = ('[a-z]+\s+NN(S)?', '[a-z]+\s+JJ(S)?');

my $name_re = '(' . join('|', @nouns) . ')'; # Assumes no regex metacharacters

my $ra = Regexp::Assemble->new(track => 1);
$ra->add(@patterns);

local $/ = '<s>';

while (my $line = <DATA>) {
  my $match = $ra->match($line);
  next unless defined $match;

  while ($line =~ /$name_re/g) {
    say "$1\t\t$match";
  }
}


__DATA__
...

...where the content of the __DATA__ section is the sample corpus provided in the original question. I didn't include it here in the interest of keeping the answer compact. Note also that, in both patterns, I changed \t to \s+; this is because the tabs were not preserved when I copied and pasted your sample corpus.

Running that code, I get the output:

hooligan        [a-z]+\s+NN(S)?
hooligan        [a-z]+\s+NN(S)?
football        [a-z]+\s+NN(S)?
football        [a-z]+\s+NN(S)?
football        [a-z]+\s+JJ(S)?
football        [a-z]+\s+JJ(S)?

Edit: Corrected regexes. I initially replaced \t with \s, causing it to match NN or JJ only when preceded by exactly one space. It now also matches multiple spaces, which better emulates the original \t.
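
One caveat about the noun alternation: it is searched for anywhere in the record, so it also matches inside longer words (football inside footballer) and can report the same token twice when the word and lemma columns are identical. The sketch below is one possible way to tighten it, assuming the corpus really is tab-separated. It escapes each noun with quotemeta and only accepts a noun that fills the whole second (lemma) column of a line. The sub names noun_regex and nouns_in_record are invented for this example.

```perl
use strict;
use warnings;

# Build one regex that matches any listed noun, but only when it is the
# complete second tab-separated field (the lemma) of a corpus line.
sub noun_regex {
    my @nouns = @_;
    my $alt = join '|', map { quotemeta($_) } @nouns;    # escape metacharacters
    return qr/^[^\t\n]*\t($alt)\t/m;
}

# Return the listed nouns found in one multi-line sentence record, in order.
sub nouns_in_record {
    my ($record, $re) = @_;
    my @found;
    push @found, $1 while $record =~ /$re/g;
    return @found;
}

my $re     = noun_regex(qw( hooligan football brother bollocks ));
my $record = "<s>\nFootballers\tfootballer\tNNS\t1\t2\tSBJ\n"
           . "football\tfootball\tNN\t2\t0\tROOT\n</s>";
print "$_\n" for nouns_in_record($record, $re);
```

With this regex, the footballer line is ignored and the football line yields a single hit, because only the lemma column is inspected.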

Licensed under: CC-BY-SA with attribution