Question

I have a large file containing 400000 lines; each line contains a number of keywords separated by tabs.

I also have a file containing a list of keywords to match. This file acts as a lookup table.

For each keyword in the lookup table, I need to find all of its occurrences in the large file and print the line number of each occurrence.

I have tried this:

#!usr/bin/perl
use strict;
use warnings;

my $linenum = 0;

print "Enter the file path of lookup table:";
my $filepath1 = <>;

print "Enter the file path that contains keywords :";
my $filepath2 = <>;

open( FILE1, "< $filepath1" );
open FILE2, "< $filepath2" ;

open OUT, ">", "SampleLineNum.txt";

while( $line = <FILE1> )
{
    while( <FILE2> ) 
    {
        $linenum = $., last if(/$line/);
    }
    print OUT "$linenum ";
}

close FILE1;

This gives only the first occurrence of the keyword. But I need all occurrences, and the keyword should match exactly.

The problem I am facing with exact matching is this: say I have the keywords "hello" and "hello world".

If I need to match "hello", the script also returns the line numbers of lines that contain "hello world". It should match only "hello" and give its line number.


Solution

Here is a solution that matches every occurrence of all keywords:

#!/usr/bin/perl
use strict;
use warnings;

#Lexical variable for filehandle is preferred, and always error check opens.
open my $keywords,    '<', 'keywords.txt' or die "Can't open keywords: $!";
open my $search_file, '<', 'search.txt'   or die "Can't open search file: $!";

my $keyword_or = join '|', map {chomp;qr/\Q$_\E/} <$keywords>;
my $regex = qr|\b($keyword_or)\b|;

while (<$search_file>)
{
    while (/$regex/g)
    {
        print "$.: $1\n";
    }
}

keywords.txt:

hello
foo
bar

search.txt:

plonk
food is good
this line doesn't match anything
bar bar bar
hello world
lalalala
hello everyone

Output:

4: bar
4: bar
4: bar
5: hello
7: hello

Explanation:

This creates a single regex that matches all of the keywords in the keywords file.

<$keywords> - when this is used in list context, it returns a list of all lines of the file.

map {chomp;qr/\Q$_\E/} - this removes the newline from each line and applies the \Q...\E quote-literal regex operator to each line (This ensures that if you have a keyword like "foo.bar" it will treat the dot as a literal character, not a regex metacharacter).

join '|', - join the resulting list into a single string, separated by pipe characters.

my $regex = qr|\b($keyword_or)\b|; - create a regex that looks like this:

/\b(\Qhello\E|\Qfoo\E|\Qbar\E)\b/

This regex will match any of your keywords. \b is the word boundary marker, ensuring that only whole words match: food no longer matches foo. The parentheses capture the specific keyword that matched in $1. This is how the output prints the keyword that matched.

I updated the solution to match each keyword on a given line and to only match complete words.
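
For the "hello" versus "hello world" case from the question, one possible refinement is to put longer keywords first in the alternation, so that the longest keyword wins at any given position. The following is only a sketch under that assumption, not part of the solution above: build_keyword_regex and find_matches are hypothetical helper names, and the keywords and lines are held in memory instead of being read from files.

```perl
use strict;
use warnings;

# Hypothetical helper: build one regex from a list of literal keywords.
sub build_keyword_regex {
    my @keywords = @_;
    # Longest first, so "hello world" is tried before "hello".
    # quotemeta does the same job as \Q...\E.
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } @keywords;
    return qr/\b($alternation)\b/;
}

# Hypothetical helper: return [line_number, keyword] pairs for every match.
sub find_matches {
    my ( $regex, @lines ) = @_;
    my @hits;
    my $line_number = 0;
    for my $line (@lines) {
        $line_number++;
        while ( $line =~ /$regex/g ) {
            push @hits, [ $line_number, $1 ];
        }
    }
    return @hits;
}

my $regex = build_keyword_regex( 'hello', 'hello world', 'foo' );
my @hits  = find_matches( $regex, 'food is good', 'hello world', 'hello everyone' );
print "$_->[0]: $_->[1]\n" for @hits;
```

With this ordering, line 2 is reported as "hello world" and line 3 as "hello"; "food" still does not match "foo" because of the \b anchors.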

OTHER TIPS

Is this part of something bigger? Because this is a one-liner with grep:

grep -n hello filewithlotsalines.txt

grep -n "hello world" filewithlotsalines.txt

-n makes grep show the line number before each matching line. Run man grep for more options.

I am assuming here that you are on a Linux or other *nix system.

I have a different interpretation of your request. It seems that you may want to maintain a list of line numbers where certain entries from a lookup table are found on lines of a 'keyword' file. Here's a sample lookup table:

hello world
hello
perl
hash
Test
script

And a tab-delimited 'keyword' file, where multiple keywords may be found on a single line:

programming tests
hello   everyone
hello   hello world perl
scripting   scalar
test    perl    script
hello world perl    script  hash

Given the above, consider the following solution:

use strict;
use warnings;

my %lookupTable;

print "Enter the file path of lookup table: \n";
chomp( my $lookupTableFile = <> );

print "Enter the file path that contains keywords: \n";
chomp( my $keywordsFile = <> );

open my $ltFH, '<', $lookupTableFile or die $!;

while (<$ltFH>) {
    chomp;
    undef @{ $lookupTable{$_} };
}

close $ltFH;

open my $kfFH, '<', $keywordsFile or die $!;

while (<$kfFH>) {
    chomp;
    for my $keyword ( split /\t+/ ) {
        push @{ $lookupTable{$keyword} }, $. if defined $lookupTable{$keyword};
    }
}

close $kfFH;

open my $slFH, '>', 'SampleLineNum.txt' or die $!;

print $slFH "$_: @{ $lookupTable{$_} }\n"
  for sort { lc $a cmp lc $b } keys %lookupTable;

close $slFH;

print "Done!\n";

Output to SampleLineNum.txt:

hash: 6
hello: 2 3
hello world: 3 6
perl: 3 5 6
script: 5 6
Test: 

The script uses a hash of arrays (HoA), where each key is an entry from the lookup table and the associated value is a reference to a list of line numbers where that entry was found in the 'keyword' file. Each entry in %lookupTable is initialized with a reference to an empty array.

Each line of the 'keywords' file is split on the delimiting tabs, and if a field has a corresponding entry in %lookupTable, the line number is pushed onto that entry's list. When done, the %lookupTable keys are sorted case-insensitively and written out to SampleLineNum.txt, along with the list of line numbers where each entry was found, if any.
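
To see the hash-of-arrays idea in isolation, here is a minimal in-memory sketch. It assumes the same tab-separated layout; index_keywords is a hypothetical helper name, and the sample data is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: map each lookup entry to the line numbers where it
# appears as a whole tab-separated field.
sub index_keywords {
    my ( $entries, $lines ) = @_;
    my %found;
    $found{$_} = [] for @$entries;    # every entry starts with an empty list
    my $line_number = 0;
    for my $line (@$lines) {
        $line_number++;
        for my $field ( split /\t+/, $line ) {
            push @{ $found{$field} }, $line_number if exists $found{$field};
        }
    }
    return \%found;
}

my $found = index_keywords(
    [ 'hello', 'hello world', 'perl' ],
    [ "hello\teveryone", "hello\thello world\tperl", "scripting\tscalar" ],
);
print "$_: @{ $found->{$_} }\n" for sort keys %$found;
```

Because each field is compared as a whole string, "hello" and "hello world" are counted separately: here "hello" is found on lines 1 and 2, while "hello world" is found only on line 2.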

There are no sanity checks on the file names entered, so consider adding those.
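
One way such a check might look, using Perl's file test operators. This is only a sketch: check_input_file is a hypothetical helper name, and which conditions you want to reject is up to you.

```perl
use strict;
use warnings;

# Hypothetical helper: return an error message, or undef if the path looks usable.
sub check_input_file {
    my ($path) = @_;
    return "no file name given" unless defined $path && length $path;
    return "'$path' does not exist"      unless -e $path;
    return "'$path' is not a plain file" unless -f $path;
    return "'$path' is not readable"     unless -r $path;
    return;
}

# An empty name and a directory are both rejected.
for my $candidate ( '', '.' ) {
    my $error = check_input_file($candidate);
    print defined $error ? "rejected: $error\n" : "ok: $candidate\n";
}
```

You could call this right after each chomp and die (or ask again) when it returns a message.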

Hope this helps!

To find all of the occurrences, you need to read in the keywords and then, for each line, loop through the keywords to find matches. Below is your code, modified to hold the keywords in an array. In addition, I added a counter that tracks the line number, and the line number is printed only if there is a match. Your code prints an item for each line even if there is no match.

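#!/usr/bin/perl
use strict;
use warnings;

my $linenum = 0;

print "Enter the file path of lookup table:";
chomp( my $filepath1 = <> );

print "Enter the file path that contains keywords :";
chomp( my $filepath2 = <> );

# Three-argument open, with error checking
open( FILE1, '<', $filepath1 ) or die "Can't open $filepath1: $!";
open( FILE2, '<', $filepath2 ) or die "Can't open $filepath2: $!";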

# Read in all of the keywords
my @keywords = <FILE2>; 

# Close the file2
close(FILE2);

# Remove the line returns from the keywords
chomp @keywords;

# Sort and reverse the items to compare the maximum length items
# first (hello there before hello)
@keywords = reverse sort @keywords;

foreach my $k ( @keywords)
{
  print "$k\n";
}
open OUT, ">", "SampleLineNum.txt";
my $line;
# Counter for the lines in the file
my $count = 0;
while( $line = <FILE1> )
{
    # Increment the counter for the number of lines
    $count++;
    # loop through the keywords to find matches
    foreach my $k ( @keywords ) 
    {
        # If there is a match, print out the line number 
        # and use last to exit the loop and go to the 
        # next line
        if ( $line =~ m/\Q$k\E/ )   # \Q...\E treats the keyword literally
        {
            print "$count\n";
            last;
        }
    }
}

close FILE1;

I think there are some questions similar to this one. You can check out:

The File::Grep module is interesting.

As others have already given Perl solutions, I will suggest that you could also use awk here.

> cat temp
abc
bac
xyz

> cat temp2
abc     jbfwerf kfnm
jfjkwebfkjwe    bac     xyz
ndwjkfn abc kenmfkwe    bac     xyz

> awk 'FNR==NR{a[$1];next}{for(i=1;i<=NF;i++)if($i in a)print $i,FNR}' temp temp2
abc 1
bac 2
xyz 2
abc 3
bac 3
xyz 3
>
Licensed under: CC-BY-SA with attribution