Question

I am writing a Perl script that is searching for a term in large portions of text. What I would like to display back to the user is a small subset of the text around the search term, so the user can have context of where this search term is used. Google search results are a good example of what I am trying to accomplish, where the context of your search term is displayed under the title of the link.

My basic search is using this:

if ($text =~ /$search/i ) {
    print "${title}:${text}\n";
}

($title contains the title of the item in which the search term was found.) This prints too much, though, since $text sometimes holds hundreds of lines of text.

This is going to be displayed on the web, so I could just provide the title as a link to the actual text, but there is no context for the user.

I tried modifying my regex to capture 4 words before and 4 words after the search term, but ran into problems if the search term was at the very beginning or very end of $text.

What would be a good way to accomplish this? I tried searching CPAN because I'm sure someone has a module for this, but I can't think of the right terms to search for. I would like to do this without modules if possible, because getting modules installed here is a pain. Does anyone have any ideas?

Solution

Your initial attempt at 4 words before/after wasn't too far off.

Try:

if ($text =~ /((\S+\s+){0,4})($search)((\s+\S+){0,4})/i) {
    my ($pre, $match, $post) = ($1, $3, $4);
    ...
}
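
The {0,4} quantifiers are what make this work at the edges of $text: each context group may match zero words, so a hit at the very beginning or very end still succeeds. A minimal sketch of that behaviour follows. The helper name context_snippet and the sample text are made up for illustration, the inner groups are made non-capturing so the captures are simply $1, $2, $3, and \Q...\E assumes the search term is plain text rather than a pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: returns up to 4 words of context on each side of the
# first match, or an empty list if $search does not occur in $text.
sub context_snippet {
    my ($text, $search) = @_;
    # \Q...\E quotes regex metacharacters in the search term
    # (assumption: the term is literal text, not a regex).
    if ($text =~ /((?:\S+\s+){0,4})(\Q$search\E)((?:\s+\S+){0,4})/i) {
        return ($1, $2, $3);
    }
    return;
}

# The match is the first word, so the leading context is empty.
my ($pre, $match, $post) =
    context_snippet('needle at the very start of a long text', 'needle');
print "[$pre][$match][$post]\n";   # prints "[][needle][ at the very start]"
```

Because $match is captured from $text, it keeps the capitalisation found in the text rather than the one the user typed.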

OTHER TIPS

You can use $` and $' to get the string before and after the match, then truncate those values appropriately. But as blixtor points out, shlomif is correct to suggest using @+ and @- to avoid the performance penalty imposed by $` and $':

$foo =~ /(match)/;

my $match = $1;
#my $before = $`;
#my $after = $';
my $before = substr($foo, 0, $-[0]);
my $after =  substr($foo, $+[0]);

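# /s lets .* cross newlines, so everything past the 4th word is dropped
# even when the text spans many lines.
$after =~ s/((?:(?:\w+)(?:\W+)){4}).*/$1/s;
$before = reverse $before;                   # reverse the string to limit backtracking.
$before =~ s/((?:(?:\W+)(?:\w+)){4}).*/$1/s;
$before = reverse $before;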

print "$before -> $match <- $after\n";

I would suggest using the match-offset arrays @- and @+ (see perldoc perlvar) to find where in the string the match starts and how long it is.
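
As a rough illustration of those two arrays: $-[0] holds the offset where the whole match begins and $+[0] the offset just past its end, so their difference is the match length. The helper name match_span and the sample sentence below are assumptions for the sake of the example, as is the use of \Q...\E to treat the term literally.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (offset, length) of the first match,
# or an empty list when there is no match.
sub match_span {
    my ($text, $search) = @_;
    return unless $text =~ /\Q$search\E/i;
    # $-[0] is where the whole match starts, $+[0] is just past its end.
    return ($-[0], $+[0] - $-[0]);
}

my $text = 'The quick brown fox jumps over the lazy dog';   # example data
my ($start, $length) = match_span($text, 'fox');

# Slice the string around the match with substr instead of $` and $'.
my $before = substr($text, 0, $start);
my $match  = substr($text, $start, $length);
my $after  = substr($text, $start + $length);

print $before, '[', $match, ']', $after, "\n";
# prints "The quick brown [fox] jumps over the lazy dog"
```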

You could try the following:

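# /s lets . match newlines; .*? stops at the first occurrence of the term
if ($text =~ /(.*?)($search)(.*)/is ) {

  # copy the captures before doing anything else with them
  my ($before, $match, $after) = ($1, $2, $3);

  my @before_words = split ' ', $before;
  my @after_words = split ' ', $after;

  my $before_str = get_last_x_words_from_array(@before_words);
  my $after_str = get_first_x_words_from_array(@after_words);

  # print the term as it appears in the text, not as the user typed it
  print $before_str . ' ' . $match . ' ' . $after_str;

}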

Some code obviously omitted, but this should give you an idea of the approach.
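
For completeness, here is one way the two omitted helpers might look. This is only a sketch: the original does not say what x is, so a file-level $X of 4 is assumed, and both helpers return the selected words joined by single spaces.

```perl
use strict;
use warnings;

my $X = 4;   # assumed context size; the original leaves x unspecified

# Last $X words of the list (all of them if there are fewer), space-joined.
sub get_last_x_words_from_array {
    my @words = @_;
    my $from = @words > $X ? @words - $X : 0;
    return join ' ', @words[$from .. $#words];
}

# First $X words of the list (all of them if there are fewer), space-joined.
sub get_first_x_words_from_array {
    my @words = @_;
    my $to = @words > $X ? $X - 1 : $#words;
    return join ' ', @words[0 .. $to];
}

print get_last_x_words_from_array(qw(a b c d e f)), "\n";   # prints "c d e f"
print get_first_x_words_from_array(qw(a b)), "\n";          # prints "a b"
```

Both helpers return an empty string for an empty list, which covers a search term at the very beginning or very end of the text.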

As far as extracting the title ... I think this approach does not lend itself to that very well.

Licensed under: CC-BY-SA with attribution