That is one horrible regex. I would not want to be the poor sucker who is stuck with maintaining it. Also, how did you generate it from your replacement template?
I would suggest something considerably simpler. Use a hash to store the replacements, use word boundaries to prevent partial matches, use the /i modifier to match case-insensitively, and use regular loop logic to avoid replacements on commented lines.
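use strict;
use warnings;

# Each entry has the form keyword::replacement
my @kw = "keyword::(keyword)[#heading-to-jump-to]";
# The regex returns (keyword, replacement) for each entry
my %rep = map { /([^:]+)::(.+)/ } @kw;

while (<DATA>) {
    next if /^#/;    # commented line: jump straight to the continue block
    for my $kw (keys %rep) {
        s/\b\Q$kw\E\b/$rep{$kw}/ig;
    }
} continue {
    print;           # runs for every line, replaced or skipped
}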
__DATA__
This is a text with keywords. Only the keyword 'keyword' should be replaced.
# Dont replace keyword when in a comment
Output:
This is a text with keywords. Only the (keyword)[#heading-to-jump-to] '(keyword)[#heading-to-jump-to]' should be replaced.
# Dont replace keyword when in a comment
Explanation:
- Create the hash of replacement keywords with a map statement, which returns a two-element list for each keyword::replacement string.
- With lines that begin with #, skip directly to print.
- For each keyword in the hash, perform a global (/g), case-insensitive (/i) substitution on each line. Use the word boundary \b to prevent partial matches, and quote metacharacters with \Q ... \E. Substitute with the hash value for that keyword.
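To see why the \Q ... \E quoting matters, here is a small sketch. The keyword a.b and the two helper subs are made up for illustration and are not part of the code above. Without quoting, the dot is a regex wildcard, so the pattern also matches axb.

```perl
use strict;
use warnings;

# Hypothetical helper: whole-word, case-insensitive match with $kw taken literally.
sub matches_literal {
    my ($text, $kw) = @_;
    return $text =~ /\b\Q$kw\E\b/i ? 1 : 0;
}

# Hypothetical helper: same check without \Q...\E, so metacharacters stay active.
sub matches_unquoted {
    my ($text, $kw) = @_;
    return $text =~ /\b$kw\b/i ? 1 : 0;
}

for my $text ('a.b', 'axb') {
    printf "%s: unquoted=%d quoted=%d\n",
        $text, matches_unquoted($text, 'a.b'), matches_literal($text, 'a.b');
}
# prints:
# a.b: unquoted=1 quoted=1
# axb: unquoted=1 quoted=0
```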
As with all language processing, this will have some caveats and edge cases that need handling. For example, the word boundary will still replace foo in foo-bar, because - is not a word character. As for how to control what not to replace under which heading, you would first have to tell me how to identify a heading.
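If the foo-bar case matters for your data, one possible workaround is to replace \b with lookarounds that also treat the hyphen as part of a word. This is only a sketch under that assumption. The sub names and the sample replacement string are made up for the demonstration.

```perl
use strict;
use warnings;

# Plain word boundaries, as in the loop above.
sub replace_word_boundary {
    my ($text, $kw, $rep) = @_;
    $text =~ s/\b\Q$kw\E\b/$rep/ig;
    return $text;
}

# Assumption: a keyword next to a word character or a hyphen is part of a
# longer word and should be left alone.
sub replace_hyphen_aware {
    my ($text, $kw, $rep) = @_;
    $text =~ s/(?<![\w-])\Q$kw\E(?![\w-])/$rep/ig;
    return $text;
}

my $line = "foo and foo-bar";
print replace_word_boundary($line, 'foo', '(foo)[#foo]'), "\n";
print replace_hyphen_aware($line, 'foo', '(foo)[#foo]'), "\n";
# prints:
# (foo)[#foo] and (foo)[#foo]-bar
# (foo)[#foo] and foo-bar
```

Whether a hyphen, an apostrophe, or anything else should count as part of a word depends on your text, so adjust the character class accordingly.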
Update:
If I understand you correctly, what you mean by skipping keywords inside paragraphs with their own heading is something like this:
#heading-to-jump-to
Here is 'keyword' not replaced
Look up the string #heading-to-jump-to and remove keyword from the replacement list.
You might use a lookup hash with the keys being the heading references, and combine that with the generation of the first hash. In this case, though, I would start being concerned that you can have multiple keywords for each link, e.g. both foo and bar point to #foobar, so #foobar should exclude both keywords foo and bar.
my %rep;
my %heading;
for my $str (@kw) {
    chomp $str;
    my ($kw, $rep) = split /::/, $str, 2;    # split into 2 fields
    $rep{$kw} = $rep;
    my ($heading) = $rep =~ /\[([^]]+)\]/;
    push @{ $heading{$heading} }, $kw;
}
And then instead of simply skipping a line with next, do something like:
my @kws = keys %rep;    # default list
while (<DATA>) {
    if (/^(#.+)/) {     # inside heading
        my %exclude = map { $_ => 1 } @{ $heading{$1} };
        @kws = grep { ! $exclude{$_} } @kws;
    } else {
        # not in a heading
        # ...
    }
}
Note that this is just a demonstration of the principle and not intended as working code. As you can see, the tricky part here is knowing when to reset the limited list of @kws and when to use it. You will have to make those decisions, since I do not know your data.
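For what it is worth, here is one way the pieces could fit together into something runnable. It rests on assumptions I cannot verify against your data: every line starting with # is a heading, a heading stays in effect until the next heading, and each new heading resets the keyword list to the full set before applying its own exclusions. The keywords, the sample lines, and the sub name link_keywords are all made up.

```perl
use strict;
use warnings;

# Made-up replacement templates: foo and bar share one heading.
my @kw = (
    "foo::(foo)[#foobar]",
    "bar::(bar)[#foobar]",
    "baz::(baz)[#other]",
);

my (%rep, %heading);
for my $str (@kw) {
    my ($kw, $rep) = split /::/, $str, 2;
    $rep{$kw} = $rep;
    my ($heading) = $rep =~ /\[([^]]+)\]/;
    push @{ $heading{$heading} }, $kw;
}

sub link_keywords {
    my @lines  = @_;
    my @active = keys %rep;    # before any heading, every keyword is active
    my @out;
    for my $line (@lines) {
        if ($line =~ /^(#\S+)/) {
            # Assumption: a new heading resets the list, then excludes
            # the keywords that link to this heading.
            my %exclude = map { $_ => 1 } @{ $heading{$1} || [] };
            @active = grep { !$exclude{$_} } keys %rep;
        } else {
            for my $kw (@active) {
                $line =~ s/\b\Q$kw\E\b/$rep{$kw}/ig;
            }
        }
        push @out, $line;
    }
    return @out;
}

print "$_\n" for link_keywords(
    "intro with foo",
    "#foobar",
    "foo and bar stay, baz is linked",
    "#other",
    "foo is linked, baz stays",
);
# prints:
# intro with (foo)[#foobar]
# #foobar
# foo and bar stay, (baz)[#other] is linked
# #other
# (foo)[#foobar] is linked, baz stays
```

One more caveat with this approach: keywords are applied one after another, so a replacement string that itself contains another keyword as a whole word could be rewritten a second time. The sample data avoids that, but real data may not.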