I ended up writing a quick script that solves my problem. I used Tie::File to handle huge text datasets and specified </s> as the record separator, as suggested by Jonathan Leffler (the solution proposed by Dave Sherohman looks very elegant, but I couldn't try it).
After splitting the corpus into sentences, I isolate the columns I need (2nd and 3rd) and run the regular expressions. Before printing the output I check whether the matched word is in my word list; if it is not, it is excluded from the output.
I'm sharing my code here (comments included) in case someone else needs something similar.
It's a bit dirty and could definitely be optimized, but it works for me and it supports very large corpora (I tested it with a 10 GB corpus: it completed successfully in a few hours).
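Before the full script, here is the record-separator idea in isolation. This is only a minimal sketch under a few assumptions: the tiny corpus is made-up sample data written to a temporary file, and opening the file with `mode => O_RDONLY` is an extra precaution that the full script does not use.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY);
use File::Temp qw(tempfile);
use Tie::File;

# Write a tiny made-up corpus to a temporary file (sample data, not a real corpus)
my ($fh, $filename) = tempfile(UNLINK => 1);
print {$fh} "<s>\n1\tdog\tNN\n2\tbarks\tVBZ\n</s>\n<s>\n1\tcat\tNN\n</s>\n";
close($fh) or die "Can't close $filename: $!\n";

# Each array element is now one </s>-terminated record instead of one line.
# mode => O_RDONLY (an assumption, not in the full script) opens the file read-only.
tie my @sentences, 'Tie::File', $filename, recsep => '</s>', mode => O_RDONLY
    or die "Can't tie $filename: $!\n";

# Keep only the 2nd and 3rd tab-separated columns of the first record
my @pairs;
foreach my $line (split /\n/, $sentences[0]) {
    my @columns = split /\t/, $line;
    next if @columns < 3;    # skips the bare <s> line
    push @pairs, "$columns[1]\t$columns[2]";
}
print "$_\n" for @pairs;

untie @sentences;
```

The full script follows; it applies the same two steps to every record and then runs the regular expressions on the result.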
use strict;
use Tie::File; #This module makes a file look like a Perl array; each array element corresponds to a record of the file (a line by default).
if ($#ARGV < 0 ) { print "Usage: perl albzcount.pl corpusfile\n"; exit; }
#read nouns list (.txt file with one word per line - line breaks LF)
my $nouns_list = "nouns.txt";
open(my $nouns_fh, '<', $nouns_list) or die("Could not open the nouns list $nouns_list: $!\n");
my @nouns_contained_in_list = <$nouns_fh>;
close($nouns_fh);
# Reading regexp list (.txt file with one regexp per line - line breaks LF)
my $regex_list = "regexp.txt";
open(my $regex_fh, '<', $regex_list) or die("Could not open the regexp list $regex_list: $!\n");
my @regexps_contained_in_list = <$regex_fh>;
close($regex_fh);
# Reading the corpus file (each sentence is spread over several lines and terminated by the tag </s>)
my $corpusfile = $ARGV[0]; #Corpus filename (passed as an argument on the command line)
# With tie I don't load the entire file into an array. Perl thinks it's an array, but the file is actually read record by record, on demand
# This is the key to manipulating huge text files without running out of memory
tie my @raw_corpus_data, 'Tie::File', $corpusfile, recsep => '</s>' or die "Can't read file: $!\n";
# START: go through the sentences of the corpus (spread over multiple lines and separated by </s>), one by one
foreach my $corpus_line (@raw_corpus_data) {
    # Take a single sentence (which is spread over several lines).
    # NB: each line contains "columns" separated by tabs
    my @corpus_sublines = split(/\n/, $corpus_line);
    # Declare the rebuilt sentence; values will be appended to it below
    my $reduced_sentence = '';
    # For each line that makes up a sentence
    foreach my $sentence_newline (@corpus_sublines) {
        # Split by tab (column separator)
        my @corpus_columns = split(/\t/, $sentence_newline);
        # Build the new sentence using just columns 2 and 3 (noun and tag) of each original line
        $reduced_sentence .= "$corpus_columns[1]\t$corpus_columns[2]\n";
        # ... Now the sentence has the format I want and can be processed
    }
    # For each regex
    foreach my $single_regexp (@regexps_contained_in_list) {
        # Remove the newlines (both \n and \r, depending on the OS) from the regexp read from the file.
        # Without this, the regular expressions read from the file don't always work.
        $single_regexp =~ s/\r|\n//g;
        # If the rebuilt sentence matches the regexp
        if ($reduced_sentence =~ m/$single_regexp/) {
            # Split the matched text by tab so the first word, $onematch[0], can be isolated
            # $& is the entire matched string
            my @onematch = split(/\t/, $&);
            # OUTPUT RESULTS
            # If the matched noun is not empty and is part of the word list
            # (\Q...\E quotes any regex metacharacters in the noun)
            if ($onematch[0] ne "" && grep { /^\Q$onematch[0]\E$/ } @nouns_contained_in_list) {
                print "$onematch[0]\t$single_regexp\n";
            } # END OUTPUT RESULTS
        } # END if the rebuilt sentence matches the regexp
    } # END foreach regex
} # END go through the sentences of the corpus, one by one
# Untie the source corpus file
untie @raw_corpus_data;
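
One optimization that would probably pay off first: the grep over the noun list is a linear scan for every match, and a hash lookup avoids that. Below is a minimal sketch, assuming the same word-list format (one word per line). `build_noun_set` is a hypothetical helper name that does not appear in the script above, and the three words are sample data standing in for the contents of nouns.txt.

```perl
use strict;
use warnings;

# Build a lookup set from raw lines as read from the word list.
# Hypothetical helper, not part of the script above.
sub build_noun_set {
    my @lines = @_;
    my %set;
    for my $line (@lines) {
        (my $word = $line) =~ s/[\r\n]+\z//;    # strip LF or CRLF line endings
        next if $word eq '';                    # ignore blank lines
        $set{$word} = 1;
    }
    return \%set;
}

# Sample data standing in for the contents of nouns.txt
my $noun_set = build_noun_set("house\n", "c++\r\n", "tree\n");

# An exact-match lookup replaces the grep over the whole list
for my $candidate ('house', 'c++', 'hous') {
    my $status = exists $noun_set->{$candidate} ? 'in list' : 'not in list';
    print "$candidate\t$status\n";
}
```

In the script, the set would be built once after reading the word list, and the grep condition would become an `exists` test on the matched noun. Because the lookup is a plain string comparison, nouns containing regex metacharacters need no quoting.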