How can I preserve whitespace when I match and replace several words in Perl?

https://stackoverflow.com/questions/1425023

07-07-2019

Question

Let's say I have some original text:

here is some text that has a substring that I'm interested in embedded in it.

I need to match a part of that text, say: "has a substring".

However, the original text and the matching string may have whitespace differences. For example the match text might be:

has a
substring

has  a substring

and/or the original text might be:

here is some
text that has
a substring that I'm interested in embedded in it.

What I need my program to output is:

here is some text that [match starts here]has a substring[match ends here] that I'm interested in embedded in it.

I also need to preserve the whitespace pattern in the original and just add the start and end markers to it.

Any ideas about a way of using Perl regexes to get this to happen? I tried, but ended up getting horribly confused.

Solution

It's been some time since I've used Perl regular expressions, but what about:

$match =~ s/(has\s+a\s+substring)/[$1]/ig;

This matches one or more whitespace characters, including newlines, between the words. It wraps the entire match in brackets while keeping the original spacing. It ain't automatic, but it does work.

You could play games with this, like taking the string "has a substring" and doing a transform on it to make it "has\s+a\s+substring" to make this a little less painful.
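The transform mentioned above could be sketched roughly as follows. The helper name phrase_to_regex is hypothetical, and the use of quotemeta to keep punctuation literal is an assumption that goes slightly beyond the answer; the /i flag mirrors the substitution shown earlier.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a plain phrase into a whitespace-tolerant regex.
sub phrase_to_regex {
    my ($phrase) = @_;
    # split ' ' skips leading whitespace and splits on runs of whitespace;
    # quotemeta keeps characters such as "." or "(" literal.
    my $pattern = join '\s+', map { quotemeta } split ' ', $phrase;
    return qr/$pattern/i;
}

my $re   = phrase_to_regex("has a\nsubstring");
my $text = "text that has\na  substring that I'm interested in";

# Wrap the match in brackets; $1 carries the original whitespace through.
(my $marked = $text) =~ s/($re)/[$1]/g;
print "$marked\n";
```

Because the replacement reuses $1, whatever spacing or line breaks the original text had between the words are left untouched.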

EDIT: Incorporated ysth's comment that the \s metacharacter matches newlines, and hobbs's corrections to my \s usage.

OTHER TIPS

This pattern will match the string that you're looking to find:

(has\s+a\s+substring)

So, when the user enters a search string, replace any whitespace in the search string with \s+ and you have your pattern. Then just replace every match with [match starts here]$1[match ends here], where $1 is the matched text.

In regexes, you can use + to mean "one or more." So something like this

/has\s+a\s+substring/

matches "has" followed by one or more whitespace chars, followed by "a", followed by one or more whitespace chars, followed by "substring".

Putting it together with a substitution operator, you can say:

my $str = "here is some text that has     a  substring that I'm interested in embedded in it.";
$str =~ s/(has\s+a\s+substring)/\[match starts here]$1\[match ends here]/gs;

print $str;

And the output is:

here is some text that [match starts here]has     a  substring[match ends here] that I'm interested in embedded in it.

As many have suggested, use \s+ to match whitespace. Here is how you do it automatically:

my $original = "here is some text that has a substring that I'm interested in embedded in it.";
my $search = "has a\nsubstring";

my $re = $search;
$re =~ s/\s+/\\s+/g;

# The "[" after $& is escaped so Perl does not read it as an array subscript.
$original =~ s/\b$re\b/[match starts here]$&\[match ends here]/g;

print $original;

Output:

here is some text that [match starts here]has a substring[match ends here] that I'm interested in embedded in it.

You might want to escape any meta-characters in the string. If someone is interested, I could add it.
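For anyone interested, here is a rough sketch of what that escaping might look like. It assumes the search text should be treated entirely as literal words; the sample strings and the $naive, $escaped and $probe variables are made up for illustration. Each word is passed through quotemeta before the words are joined with \s+, since calling quotemeta on the whole string would also escape the whitespace.

```perl
use strict;
use warnings;

my $original = "use v1x0 or v1.0  beta here";
my $search   = "v1.0 beta";

# Naive pattern: "." stays a wildcard, so "v1x0" also satisfies "v1.0".
(my $naive = $search) =~ s/\s+/\\s+/g;

# Escaped pattern: quote each word, then join the words with \s+.
my $escaped = join '\s+', map { quotemeta } split ' ', $search;

my $probe = "v1x0 beta";
printf "naive matches %s: %s\n",   $probe, ($probe =~ /$naive/   ? "yes" : "no");
printf "escaped matches %s: %s\n", $probe, ($probe =~ /$escaped/ ? "yes" : "no");

# Capture the match so the original whitespace is reused in the output.
(my $marked = $original) =~ s/\b($escaped)\b/[match starts here]$1\[match ends here]/g;
print "$marked\n";
```

Note that \b only behaves as intended when the search text starts and ends with a word character; a phrase beginning or ending with punctuation would need a different boundary check.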

This is an example of how you could do that.

#! /opt/perl/bin/perl
use strict;
use warnings;

my $submatch = "has a\nsubstring";

my $str = "
here is some
text that has
a substring that I'm interested in, embedded in it.
";

print substr_match($str, $submatch), "\n";

sub substr_match{
  my($string,$match) = @_;

  $match =~ s/\s+/\\s+/g;

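  # This isn't safe the way it is now, you will need to sanitize $match.
  # In list context, a successful match with no capture groups returns (1),
  # so the print above outputs 1 when the text is found.
  $string =~ /\b$match\b/;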
}

This currently doesn't do anything to check the $match variable for unsafe characters.
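One possible way to sanitize the input and get the markers into the original text is sketched below. This is a variation rather than the code above: match_span is a hypothetical name, it escapes each word with quotemeta, and it returns the offsets of the first match using Perl's @- and @+ arrays so the markers can be spliced in with substr.

```perl
use strict;
use warnings;

# Hypothetical variant of substr_match: returns the (start, end) offsets of
# the first match, or an empty list when nothing matches.
sub match_span {
    my ($string, $match) = @_;

    # Escape each word so the search text is treated literally.
    my $pattern = join '\s+', map { quotemeta } split ' ', $match;

    return unless $string =~ /\b$pattern\b/;
    return ($-[0], $+[0]);    # start and end offsets of the whole match
}

my $str = "here is some\ntext that has\na substring that I'm interested in.";

if (my ($start, $end) = match_span($str, "has a\nsubstring")) {
    # Insert the end marker first so $start is still valid afterwards.
    substr($str, $end,   0) = '[match ends here]';
    substr($str, $start, 0) = '[match starts here]';
}
print "$str\n";
```

Working with offsets leaves every character of the original text, line breaks included, exactly where it was.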

Licensed under: CC-BY-SA with attribution