Using Perl, how do I show the context around a search term in the search results?
03-07-2019
Question
I am writing a Perl script that is searching for a term in large portions of text. What I would like to display back to the user is a small subset of the text around the search term, so the user can have context of where this search term is used. Google search results are a good example of what I am trying to accomplish, where the context of your search term is displayed under the title of the link.
My basic search is using this:
if ($text =~ /$search/i) {
    print "${title}:${text}\n";
}
($title contains the title of the item the search term was found in.) This prints too much, though, since $text will sometimes hold hundreds of lines of text.
This is going to be displayed on the web, so I could just provide the title as a link to the actual text, but there is no context for the user.
I tried modifying my regex to capture 4 words before and 4 words after the search term, but ran into problems if the search term was at the very beginning or very end of $text.
What would be a good way to accomplish this? I tried searching CPAN because I'm sure someone has a module for this, but I can't think of the right terms to search for. I would like to do this without modules if possible, because getting modules installed here is a pain. Does anyone have any ideas?
Solution
Your initial attempt at 4 words before/after wasn't too far off.
Try:
if ($text =~ /((?:\S+\s+){0,4})($search)((?:\s+\S+){0,4})/i) {
    # Up to four words before, the matched term, and up to four words after.
    my ($pre, $match, $post) = ($1, $2, $3);
    ...
}
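To show how the {0,4} quantifiers cope with a term at the very beginning or end of the text, here is a minimal sketch that wraps the regex above in a function. The name context_snippet, the $words parameter, and the use of \Q...\E to treat the term as literal text are my own assumptions, not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the match with up to $words words of context
# on each side, or undef (empty list) when $search does not occur in $text.
sub context_snippet {
    my ($text, $search, $words) = @_;
    $words = 4 unless defined $words;

    # \Q...\E quotes regex metacharacters in $search (an assumption; drop it
    # if the search term is meant to be a pattern).
    if ($text =~ /((?:\S+\s+){0,$words})(\Q$search\E)((?:\s+\S+){0,$words})/i) {
        my ($pre, $match, $post) = ($1, $2, $3);
        return "$pre$match$post";
    }
    return;
}

my $sample = "The quick brown fox jumps over the lazy dog near the river bank";
print context_snippet($sample, 'lazy'), "\n";  # fox jumps over the lazy dog near the river
print context_snippet($sample, 'The'),  "\n";  # The quick brown fox jumps
print context_snippet($sample, 'bank'), "\n";  # dog near the river bank
```

Because the quantifiers allow zero repetitions, a match at either edge simply comes back with less context instead of failing.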
OTHER TIPS
You can use $` and $' to get the string before and after the match. Then truncate those values appropriately. But as blixtor points out, shlomif is correct to suggest using @+ and @- to avoid the performance penalty imposed by $` and $':
if ($foo =~ /(match)/) {
    my $match = $1;
    #my $before = $`;
    #my $after = $';
    my $before = substr($foo, 0, $-[0]);   # text before the match
    my $after = substr($foo, $+[0]);       # text after the match
    # Keep the first four words after the match (/s lets .* cross newlines).
    $after =~ s/((?:\w+\W+){4}).*/$1/s;
    $before = reverse $before; # reverse the string to limit backtracking.
    $before =~ s/((?:\W+\w+){4}).*/$1/s;
    $before = reverse $before;
    print "$before -> $match <- $after\n";
}
I would suggest using the match-offset arrays @- and @+ (see perldoc perlvar) to find where the match starts and ends in the string, and therefore how long it is.
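As a rough illustration of that idea, the sketch below takes the offsets from $-[0] and $+[0] and cuts a fixed-width window of characters around the match with substr. The function name char_window and the $radius parameter are hypothetical, and quoting the term with \Q...\E assumes it is literal text rather than a pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the match plus up to $radius characters of
# context on each side, or undef (empty list) when there is no match.
sub char_window {
    my ($text, $search, $radius) = @_;
    return unless $text =~ /\Q$search\E/i;

    my ($start, $end) = ($-[0], $+[0]);    # offsets of the whole match
    my $from = $start - $radius;
    $from = 0 if $from < 0;                # clamp at the start of the string
    my $to = $end + $radius;
    $to = length($text) if $to > length($text);   # clamp at the end

    return substr($text, $from, $to - $from);
}

print char_window("The quick brown fox jumps over the lazy dog", "fox", 6), "\n";
# prints: brown fox jumps
```

A character window can cut a word in half, so for display you may still want to trim the result back to whole words.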
You could try the following:
if ($text =~ /(.*?)($search)(.*)/is) {
    # Non-greedy (.*?) stops at the first occurrence; /s lets . match newlines.
    my ($before, $match, $after) = ($1, $2, $3);
    my @before_words = split ' ', $before;
    my @after_words = split ' ', $after;
    my $before_str = get_last_x_words_from_array(@before_words);
    my $after_str = get_first_x_words_from_array(@after_words);
    print $before_str . ' ' . $match . ' ' . $after_str;
}
Some code obviously omitted, but this should give you an idea of the approach.
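For completeness, here is one way the two omitted helpers might look. This is only a guess at what the answer had in mind: the CONTEXT_WORDS constant and its value of 4 are assumptions, chosen to match the four-word context mentioned in the question.

```perl
use strict;
use warnings;

# Assumed number of context words on each side (not given in the answer).
use constant CONTEXT_WORDS => 4;

# Returns the last CONTEXT_WORDS words of the list, joined by single spaces.
sub get_last_x_words_from_array {
    my @words = @_;
    my $first = @words > CONTEXT_WORDS ? @words - CONTEXT_WORDS : 0;
    return join ' ', @words[$first .. $#words];
}

# Returns the first CONTEXT_WORDS words of the list, joined by single spaces.
sub get_first_x_words_from_array {
    my @words = @_;
    my $last = @words > CONTEXT_WORDS ? CONTEXT_WORDS - 1 : $#words;
    return join ' ', @words[0 .. $last];
}

print get_last_x_words_from_array(qw(a b c d e f)), "\n";   # c d e f
print get_first_x_words_from_array(qw(a b c d e f)), "\n";  # a b c d
```

Both helpers return whatever is available when the list has fewer words than requested, so a match at the start or end of the text is handled without special cases.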
As far as extracting the title ... I think this approach does not lend itself to that very well.