How to optimize this regex?

https://stackoverflow.com/questions/3477593

28-09-2019

Question

My tool takes plain text and gradually generates "tags" by replacing terms in the text with tags. Because some terms are compound, the only way (I think) is to use a replaceAll regex.

Thanks to the friends at Stack Overflow, I got an excellent regex for my app in my last question, but after some tests a new need emerged:

"A regex to replace every occurrence of a word OUTSIDE a tag AND outside another word"

The original code:

String str = "world worldwide <a href=\"world\">my world</world>underworld world";
str = str.replaceAll("\\bworld\\b(?![^<>]*+>)", "repl");
System.out.println(str);

I now need to replace only "world" (outside a tag, of course) and NOT "underworld" or "worldwide".

Expected result:

repl worldwide <a href="world">my world</world>underworld repl

Solution

I don't think regex is the best tool for the job, but if you just want to tweak and optimize what you have right now, you can use the word boundary \b, throw away the unnecessary capturing group and optional repetition specifier, and use possessive repetition:

\bworld\b(?![^<>]*+>)

The \bworld\b ensures that "world" is surrounded by the zero-width word boundary anchors. This prevents it from matching the "world" in "underworld" and "worldwide". Do note that the word boundary definition may not be exactly what you want: an underscore counts as a word character, so \bworld\b will not match the "world" in "a_world_domination".
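As a quick check of the word boundary behavior described above, here is a minimal sketch. The class name WordBoundaryDemo and the helper containsWord are made up for illustration; only Pattern and Matcher come from the standard library.

```java
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // \b is a zero-width match between a word character
    // (letter, digit, or underscore) and a non-word character.
    private static final Pattern WORLD = Pattern.compile("\\bworld\\b");

    // Hypothetical helper: true if "world" occurs as a standalone word.
    static boolean containsWord(String text) {
        return WORLD.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsWord("hello world"));        // true
        System.out.println(containsWord("underworld"));         // false
        System.out.println(containsWord("worldwide"));          // false
        // '_' is a word character, so there is no boundary around "world" here
        System.out.println(containsWord("a_world_domination")); // false
    }
}
```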

The original pattern also contains a subpattern that looks like (x+)?. This is probably better formulated as simply x*. That is, instead of "zero-or-one" ? of "one-or-more" +, simply "zero-or-more" *.
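To illustrate the (x+)? versus x* point, the sketch below compares two toy patterns, a(b+)?c and ab*c, which are stand-ins chosen for this example and not the asker's original pattern. The class and method names are likewise hypothetical.

```java
import java.util.regex.Pattern;

public class OptionalRepetitionDemo {
    // "zero-or-one of one-or-more" ...
    static final Pattern VERBOSE = Pattern.compile("a(b+)?c");
    // ... accepts the same strings as plain "zero-or-more"
    static final Pattern SIMPLE = Pattern.compile("ab*c");

    // Hypothetical helper: true if both patterns give the same verdict.
    static boolean agree(String input) {
        return VERBOSE.matcher(input).matches() == SIMPLE.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String input : new String[] {"ac", "abc", "abbbc", "adc"}) {
            System.out.println(input + " -> verbose=" + VERBOSE.matcher(input).matches()
                    + ", simple=" + SIMPLE.matcher(input).matches());
        }
    }
}
```

The first three inputs are accepted by both patterns and "adc" is rejected by both.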

The capturing group (…) is functionally not needed, and it doesn't seem like you need the capture for any substitution in the replacement, so getting rid of it can improve performance (when you need the grouping aspect, but not the capturing aspect, you can use non-capturing group (?:…) instead).
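The difference between capturing and non-capturing groups can be seen from Matcher.groupCount(). This is only a small sketch with made-up patterns and a hypothetical helper name; it does not measure the performance difference mentioned above.

```java
import java.util.regex.Pattern;

public class GroupingDemo {
    // Hypothetical helper: number of capturing groups in a regex.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(groupCount("(world)s?"));   // 1: the group is captured
        System.out.println(groupCount("(?:world)s?")); // 0: grouping only, nothing captured

        // A capture is only needed when the replacement refers back to it via $1.
        System.out.println("worlds apart".replaceAll("(world)s?", "[$1]")); // [world] apart
    }
}
```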

Note also that instead of [^<], we now forbid both brackets with [^<>]. Now the repetition can be specified as possessive since no backtracking is required in this case.

(The […] is a character class. Something like [aeiou] matches one of any of the lowercase vowels. [^…] is a negated character class. [^aeiou] matches one of anything but the lowercase vowels.)
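The following sketch shows what "possessive" means and why it is safe here. The a*+a pattern is a toy example added for contrast; the class and helper names are assumptions for illustration.

```java
import java.util.regex.Pattern;

public class PossessiveDemo {
    // The lookahead body from the answer: any run of non-bracket characters, then '>'.
    static final Pattern TO_CLOSING_BRACKET = Pattern.compile("[^<>]*+>");

    // Hypothetical helper: does the text start with "non-brackets followed by '>'"?
    static boolean closesFirst(String text) {
        return TO_CLOSING_BRACKET.matcher(text).lookingAt();
    }

    public static void main(String[] args) {
        // Greedy a* gives one 'a' back so the trailing 'a' can match.
        System.out.println("aaa".matches("a*a"));   // true
        // Possessive a*+ keeps all three, so the trailing 'a' has nothing left.
        System.out.println("aaa".matches("a*+a"));  // false

        // [^<>] can never consume '>', so giving characters back could not help;
        // the possessive form just skips that pointless backtracking.
        System.out.println(closesFirst("\" target=\"x\">")); // true: next bracket is '>'
        System.out.println(closesFirst(" text <b>"));        // false: next bracket is '<'
    }
}
```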

Of course (?!…) is negative lookahead; it asserts that a given pattern can NOT be matched. So the overall pattern reads like this:

\bworld\b(?![^<>]*+>)
\_______/\__________/ NOT the case that
 "world"                      the first bracket to its right is a closing one
 surrounded by
 word boundary anchors

References

regular-expressions.info/Word Boundaries, Brackets for Grouping, Repetition, Possessive, Lookarounds

Note that to get a backslash in a Java string literal, you need to double it, so the whole pattern as a Java string literal is "\\bworld\\b(?![^<>]*+>)".
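Putting the pieces together, here is a minimal runnable sketch that applies the pattern to the sample string from the question. The class name TagAwareReplace and the method replaceOutsideTags are hypothetical. Note that the pattern only checks whether the next bracket to the right is a closing one, so the "world" in the link text "my world" sits between tags, not inside one, and is replaced as well.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagAwareReplace {
    // Backslashes are doubled because this is a Java string literal.
    static final Pattern WORD_OUTSIDE_TAG = Pattern.compile("\\bworld\\b(?![^<>]*+>)");

    // Hypothetical helper: replace standalone "world" unless the next bracket is '>'.
    static String replaceOutsideTags(String text, String replacement) {
        // quoteReplacement keeps '$' and '\' in the replacement literal.
        return WORD_OUTSIDE_TAG.matcher(text).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String str = "world worldwide <a href=\"world\">my world</world>underworld world";
        System.out.println(replaceOutsideTags(str, "repl"));
        // prints: repl worldwide <a href="world">my repl</world>underworld repl
    }
}
```

Compiling the pattern once into a static field avoids recompiling it on every call, which String.replaceAll would do.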

Licensed under: CC-BY-SA with attribution
