How to match words from a list in a huge corpus using regexp (in Perl or *nix terminal)?

Question 1

I ended up writing a quick script that solves my problem. I used Tie::File to handle huge text files and specified </s> as the record separator, as suggested by Jonathan Leffler (the solution proposed by Dave Sherohman seems very elegant, but I couldn't try it). After splitting the corpus into sentences, I isolate the columns I need (the 2nd and 3rd) and run the regular expressions against them. Before printing the output, I check whether the matched word is in my word list; if it is not, the match is excluded from the output.

I share my code here (comments included) in case someone else needs something similar.

It's a bit dirty and could definitely be optimized, but it works for me and it supports very large corpora (I tested it with a 10GB corpus: it completed successfully in a few hours).

use strict;
use Tie::File; #This module makes a file look like a Perl array, each array element corresponds to a line of the file.

if ($#ARGV < 0 ) {  print "Usage: perl albzcount.pl corpusfile\n"; exit; }

#read nouns list (.txt file with one word per line - line breaks LF)
my $nouns_list = "nouns.txt";
open(DAT, $nouns_list) || die("Could not open the config file $nouns_list or file doesn't exist!"); 
my @nouns_contained_in_list=<DAT>;
close(DAT);

# Reading regexp list (.txt file with one regexp per line - line breaks LF)
my $regex_list = "regexp.txt";
open(DAT, $regex_list) || die("Could not open the config file $regex_list or file doesn't exist!");
my @regexps_contained_in_list=<DAT>;
close(DAT);

# Reading Corpus File (each sentence spans several lines and ends with the tag </s>)
my $corpusfile = $ARGV[0]; #Corpus filename (passed as an argument through the command)

# With TIE I don't load the entire file in an array. Perl thinks it's an array but the file is actually read line by line
# This is the key to manipulate huge text files without running out of memory
tie my @raw_corpus_data, 'Tie::File', $corpusfile,  recsep => '</s>' or die "Can't read file: $!\n";

#START go through the sentences of the corpus (spread on multiple lines and separated by </s>), one by one
foreach my $corpus_line (@raw_corpus_data){

#take a single sentence (that is spread along different lines).
#NB each line contains "columns" separated by tab
my @corpus_sublines = split('\n', $corpus_line); 

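#declare a fresh variable that shadows the loop variable of the same name. Later values will be appended to it.
#Appending to this copy (rather than to the loop variable) avoids writing back to the tied file.
my $corpus_line; 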

    #for each line that composes a sentence
    foreach my $sentence_newline(@corpus_sublines){

    #explode by tab (column separator)
    my @corpus_columns = split('\t', $sentence_newline); 

    #put together new sentences using just column 2 and 3 (noun and tag) for each original sentence
    $corpus_line .= "$corpus_columns[1]\t$corpus_columns[2]\n";

    #... Now the corpus has the format I want and can be processed
    }

    #foreach regex
    foreach my $single_regexp(@regexps_contained_in_list){ 

        # Remove the new lines (both \n and \r - depending on the OS) from the regexp present in the file. 
        # Without this, the regular expressions read from the file don't always work.
        $single_regexp =~ s/\r|\n//g; 

            #if the corpus line analyzed in this cycle matches the regexp
            if($corpus_line =~ m/$single_regexp/) { 

            # explode by tab the matched results so the first word $onematch[0] can be isolated
            # $& is the entire matched string
            my @onematch = split('\t', $&);

                # OUTPUT RESULTS
                #if the matched noun is not empty and it is part of the word list
                if ($onematch[0] ne "" && grep( /^\Q$onematch[0]\E$/, @nouns_contained_in_list )) { 
                print "$onematch[0]\t$single_regexp\n";
                } # END OUTPUT RESULTS
            } #END if the corpus line analyzed in this cycle matches the regexp
    } #END foreach regex
} #END go through the sentences of the corpus, one by one

# Untie the source corpus file
untie @raw_corpus_data;
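
One possible optimization: the word-list check above rescans the whole noun list with grep for every match. A hash lookup performs the same membership test without the scan. This is only a minimal sketch, assuming one noun per line as in nouns.txt; load_word_set is a hypothetical helper name, and the inline words stand in for the file contents. Unlike the grep version, it compares exact strings rather than treating the matched word as a pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: build a lookup hash from a list of lines,
# stripping trailing LF/CRLF line endings and skipping empty lines.
sub load_word_set {
    my @lines = @_;    # copy, so the caller's data is not modified
    my %set;
    for my $word (@lines) {
        $word =~ s/[\r\n]+\z//;
        next if $word eq '';
        $set{$word} = 1;
    }
    return \%set;
}

# Stand-in for the contents of nouns.txt (one word per line).
my $is_noun = load_word_set("hooligan\n", "football\r\n", "brother\n");

# Exact-string membership test, one hash lookup per candidate.
for my $candidate (qw(football referee)) {
    my $verdict = exists $is_noun->{$candidate} ? "in list" : "not in list";
    print "$candidate\t$verdict\n";
}
```

In the script above, the hash would be built once after reading nouns.txt, and the grep in the output condition would become an exists test on $onematch[0].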

Question 2

A simple regex alternation is sufficient to extract matching data from the noun list and Regexp::Assemble can handle the requirement for identifying which pattern from the other file matched. And, as Jonathan Leffler mentions in his comment, setting the input record separator allows you to read a single record at a time, even when each record spans multiple lines.

Combining all that into a running example, we get:

#!/usr/bin/env perl

use strict;
use warnings;
use 5.010;

use Regexp::Assemble;

my @nouns = qw( hooligan football brother bollocks );
my @patterns = ('[a-z]+\s+NN(S)?', '[a-z]+\s+JJ(S)?');

my $name_re = '(' . join('|', @nouns) . ')'; # Assumes no regex metacharacters

my $ra = Regexp::Assemble->new(track => 1);
$ra->add(@patterns);

local $/ = '<s>';

while (my $line = <DATA>) {
  my $match = $ra->match($line);
  next unless defined $match;

  while ($line =~ /$name_re/g) {
    say "$1\t\t$match";
  }
}


__DATA__
...

...where the content of the __DATA__ section is the sample corpus provided in the original question. I didn't include it here in the interest of keeping the answer compact. Note also that, in both patterns, I changed \t to \s+; this is because the tabs were not preserved when I copied and pasted your sample corpus.

Running that code, I get the output:

hooligan        [a-z]+\s+NN(S)?
hooligan        [a-z]+\s+NN(S)?
football        [a-z]+\s+NN(S)?
football        [a-z]+\s+NN(S)?
football        [a-z]+\s+JJ(S)?
football        [a-z]+\s+JJ(S)?

Edit: Corrected regexes. I initially replaced \t with \s, causing it to match NN or JJ only when preceded by exactly one space. It now also matches multiple spaces, which better emulates the original \t.
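
The $name_re line in the example assumes the nouns contain no regex metacharacters. If the real word list might contain some (a dot or a plus sign, say), each word can be escaped with quotemeta before joining. A minimal sketch under that assumption; build_noun_re is a hypothetical helper name and the sample words are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: turn a word list into one compiled alternation,
# escaping regex metacharacters so every word is matched literally.
sub build_noun_re {
    my @words = @_;
    my $alternation = join '|', map { quotemeta($_) } @words;
    return qr/($alternation)/;
}

my $noun_re = build_noun_re('hooligan', 'football', 'rock.n.roll');

# The last string only differs where the dots are, so it must not match.
for my $text ('football hooligan', 'rock.n.roll', 'rockXnXroll') {
    my @found = $text =~ /$noun_re/g;
    print "$text => ", (@found ? join(',', @found) : 'no match'), "\n";
}
```

Compiling the pattern once with qr// also avoids rebuilding it for every record read from the corpus.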