Question

BACKGROUND

I have a scenario where I must repeatedly find certain words in text. I currently use a series of regular expressions in a format like this...

"((^)|(\W))(?<c>Word1)((\W)|($))"

"((^)|(\W))(?<c>NextWord)((\W)|($))"

"((^)|(\W))(?<c>AnotherWord)((\W)|($))"

...

This list of Regex objects is then looped through against a chunk of data, and the matches are pulled out (one regex.matches(data) call per regex).

I have done everything I can to optimize them, such as compiling them beforehand.
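For reference, the one-regex-per-word approach described above might look roughly like the following C# sketch. The class and method names (PerWordMatcher, FindAll) are hypothetical and not from the original code. The patterns are the three shown earlier, and RegexOptions.Compiled is assumed to be what "compiling" refers to.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PerWordMatcher
{
    // One precompiled Regex per word, using the pattern format from the question.
    private static readonly List<Regex> Patterns = new List<Regex>
    {
        new Regex(@"((^)|(\W))(?<c>Word1)((\W)|($))", RegexOptions.Compiled),
        new Regex(@"((^)|(\W))(?<c>NextWord)((\W)|($))", RegexOptions.Compiled),
        new Regex(@"((^)|(\W))(?<c>AnotherWord)((\W)|($))", RegexOptions.Compiled),
    };

    // Runs every pattern over the data: one Matches call per word.
    public static List<string> FindAll(string data)
    {
        var found = new List<string>();
        foreach (Regex regex in Patterns)
        {
            foreach (Match match in regex.Matches(data))
            {
                // The named group "c" holds the word without its delimiters.
                found.Add(match.Groups["c"].Value);
            }
        }
        return found;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", FindAll("Word1 NextWord AnotherWord")));
    }
}
```

Each pattern scans the whole input on its own, so the cost grows with the number of words, which is the motivation for combining them.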

However, the list is growing longer, so I decided to start building larger compiled regular expressions to optimize the process, such as...

"((^)|(\W))(?<c>((Word1)|(NextWord)|(AnotherWord)))((\W)|($))"

This provides a HUGE speed improvement; however, there is a side effect I cannot figure out how to correct.

When the words sit side by side in the data (for example space-delimited, as in "Word1 NextWord AnotherWord"), the second word is missed because the match for "Word1" also includes the trailing space. The match that could occur for "NextWord" no longer has its leading space, because that space is part of the previous match.
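The side effect can be reproduced with a short C# sketch like the one below. The class and method names (ConsumedDelimiterDemo, Describe) are made up for illustration. It prints each whole match next to the captured word, which shows the delimiter being swallowed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ConsumedDelimiterDemo
{
    // Combined pattern from the question: the delimiters are part of each match.
    private const string Pattern =
        @"((^)|(\W))(?<c>((Word1)|(NextWord)|(AnotherWord)))((\W)|($))";

    public static List<string> Describe(string data)
    {
        var lines = new List<string>();
        foreach (Match m in Regex.Matches(data, Pattern))
        {
            // m.Value is the whole match including delimiters; group "c" is the word alone.
            string word = m.Groups["c"].Value;
            lines.Add("word='" + word + "' whole='" + m.Value + "' index=" + m.Index + " length=" + m.Length);
        }
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Describe("Word1 NextWord AnotherWord"))
        {
            Console.WriteLine(line);
        }
    }
}
```

For this input only Word1 and AnotherWord are reported. The first match covers "Word1 " (six characters), so the scan resumes at the "N" of NextWord, where neither ^ nor \W can match.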

QUESTION

Can anyone alter this regular expression (.NET format)

Pattern = "((^)|(\W))(?<c>((Word1)|(NextWord)|(AnotherWord)))((\W)|($))"

so that a single call to ".matches(data)" captures all the words in the list, without sacrificing the efficiency gain? Here

data = "Word1 NextWord AnotherWord"

RESULTS

Just thought I would mention this. After applying the suggested correction with the lookahead and lookbehind (which I now know how to use :) ), the code I just modified has improved in speed by 347x (0.00347% of old testing speed). That is definitely something to remember when you get into multiple expressions. Very happy.

Solution

You may want to use either word-boundary checks or lookahead/lookbehind, so that the match checks the surrounding delimiters without consuming them.

Like so:

Pattern = @"\b(Word1|NextWord|AnotherWord)\b"

Or with the lookbehind and lookahead:

Pattern = @"(?<=\W|^)(Word1|NextWord|AnotherWord)(?=\W|$)"
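A minimal C# sketch comparing the two patterns on the sample data from the question. The class name BoundaryDemo, the helper Words, and the "C++" keyword are illustrative assumptions and do not come from the original post.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BoundaryDemo
{
    public const string WordBoundary = @"\b(Word1|NextWord|AnotherWord)\b";
    public const string LookAround = @"(?<=\W|^)(Word1|NextWord|AnotherWord)(?=\W|$)";

    // Returns the captured word (group 1) of every match of the pattern in the data.
    public static List<string> Words(string pattern, string data)
    {
        var words = new List<string>();
        foreach (Match m in Regex.Matches(data, pattern))
        {
            words.Add(m.Groups[1].Value);
        }
        return words;
    }

    public static void Main()
    {
        const string data = "Word1 NextWord AnotherWord";

        // Both patterns are zero-width at the edges, so adjacent words are all found.
        Console.WriteLine(string.Join(", ", Words(WordBoundary, data)));
        Console.WriteLine(string.Join(", ", Words(LookAround, data)));

        // Hypothetical keyword that ends in non-word characters (not in the original list).
        const string text = "I like C++ a lot";
        // \b needs a word/non-word transition; "+" followed by " " is not one.
        Console.WriteLine(Regex.IsMatch(text, @"\bC\+\+\b"));
        // The lookarounds only require a non-word neighbour (or the string edge).
        Console.WriteLine(Regex.IsMatch(text, @"(?<=\W|^)C\+\+(?=\W|$)"));
    }
}
```

For plain alphanumeric words the two forms should behave the same. They differ when a keyword itself starts or ends with a non-word character, in which case the lookaround form is the closer equivalent of the original ((^)|(\W)) ... ((\W)|($)) pattern.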

OTHER TIPS

Use the \b anchor. It matches at a boundary between a word character and a non-word character, without consuming either.

\b(?<c>((Word1)|(NextWord)|(AnotherWord)))\b
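Since the word list in the question keeps growing, the \b pattern could be assembled from a collection rather than written by hand. The following is a sketch under assumptions: KeywordRegex and Build are hypothetical names, the named group c is carried over from the question, and Regex.Escape is added as a precaution in case a word contains regex metacharacters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordRegex
{
    // Builds one compiled regex of the form \b(?<c>w1|w2|...)\b from a word list.
    public static Regex Build(IEnumerable<string> words)
    {
        // Escape each word so that characters such as "." or "+" are matched literally.
        string alternation = string.Join("|", words.Select(w => Regex.Escape(w)));
        return new Regex(@"\b(?<c>" + alternation + @")\b", RegexOptions.Compiled);
    }

    public static void Main()
    {
        Regex regex = Build(new[] { "Word1", "NextWord", "AnotherWord" });

        // A single Matches call now covers the whole list.
        foreach (Match m in regex.Matches("Word1 NextWord AnotherWord"))
        {
            Console.WriteLine(m.Groups["c"].Value + " at " + m.Index);
        }
    }
}
```

Building the Regex once and reusing it matters here, because RegexOptions.Compiled has a noticeable construction cost that only pays off over repeated calls.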
Licensed under: CC-BY-SA with attribution