Question

I have a regular expression that turns the following text

alpha beta + gamma delta - epsilon phi

into

<ref4> + <ref45> - <ref11>

with the references being internal ids. I build the regular expression from the following code

EncodeRegex = new Regex(@"\b(?<nom>" + // word boundary (verbatim string, so \b is not read as a backspace)
String.Join("|", Things.Select(t => Regex.Escape(t.Name)).ToArray()) +
@")\b", // word boundary
RegexOptions.IgnoreCase);

An example for the above text could be

\b(alpha\ beta|gamma\ delta|epsilon\ phi)\b

where "alpha beta" and so on are the text blocks that I must recognize. I then replace the text blocks with their references using a custom MatchEvaluator.

I have a problem though: if I have two text blocks A and B where A is a prefix of B, the result depends on the order of A and B in the regular expression. \b(alpha|alpha\ beta)\b will stop as soon as "alpha" is matched, even if it is followed by "beta".

Apart from ordering the text blocks by descending length, is there a way to tell the regular expression always to match the longest possible text block?
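A minimal reproduction of the order dependence described above, sketched in Python rather than .NET. Both engines try alternatives from left to right and keep the first one that lets the whole pattern succeed. The sample string is made up for illustration.

```python
import re

text = "alpha + alpha beta"

# Alternatives are tried left to right, so the shorter prefix wins here.
short_first = re.compile(r"\b(alpha|alpha\ beta)\b", re.IGNORECASE)
# Listing the longer alternative first lets it match whenever it can.
long_first = re.compile(r"\b(alpha\ beta|alpha)\b", re.IGNORECASE)

short_first_hits = short_first.findall(text)
long_first_hits = long_first.findall(text)

print(short_first_hits)  # ['alpha', 'alpha']
print(long_first_hits)   # ['alpha', 'alpha beta']
```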


@Anirudh: I use the following code

EncodeRegex.Replace(s, new MatchEvaluator(m => Things.Where(r => r.Name.ToUpper() == m.Groups["nom"].Value.ToUpper()).Select(r => "<" + r.Reference + ">").FirstOrDefault()))

Solution

Description

Based on your sample text, there are known delimiters between your groups, so you could use a lookahead to validate the delimiter, as in the following expression. The lookahead prevents the shorter prefix from completing the match when more text follows it.

Regex: (^|[+-]\s)(alpha|alpha\ beta)(?=\s[+-]|$)

Replace with: $1~~~new value~~~


Example

Input text

alpha beta + gamma delta - epsilon phi
alpha + alpha beta + gamma delta - epsilon phi

Sample Code

Imports System
Imports System.Text.RegularExpressions
Module Module1
  Sub Main()
    Dim sourcestring As String = "replace with your source string"
    Dim replacementstring As String = "$1~~~new value~~~"
    ' Delimiter or line start, then a name, then a lookahead for the next delimiter or line end
    Dim matchpattern As String = "(^|[+-]\s)(alpha|alpha\ beta)(?=\s[+-]|$)"
    Console.WriteLine(Regex.Replace(sourcestring, matchpattern, replacementstring, RegexOptions.IgnoreCase Or RegexOptions.Multiline))
  End Sub
End Module

Text After Replacement

~~~new value~~~ + gamma delta - epsilon phi
~~~new value~~~ + ~~~new value~~~ + gamma delta - epsilon phi
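
The sample above substitutes one fixed string. To emit a different reference per name, as the question asks, the same lookahead pattern can be combined with a replacement callback. The following is only a sketch in Python, not the asker's .NET code. The `REFS` table and the `encode` helper are hypothetical names; the ids 4, 45 and 11 come from the question, and the id for "alpha" is invented for the example.

```python
import re

# Hypothetical name -> reference table; ids 4, 45 and 11 come from the
# question, the id for "alpha" is made up for this example.
REFS = {"alpha": 1, "alpha beta": 4, "gamma delta": 45, "epsilon phi": 11}

# Same shape as the expression above: delimiter, name, lookahead for delimiter.
# "alpha" is deliberately listed before "alpha beta".
PATTERN = re.compile(
    r"(^|[+-]\s)("
    + "|".join(re.escape(name) for name in REFS)
    + r")(?=\s[+-]|$)",
    re.IGNORECASE | re.MULTILINE,
)


def encode(text):
    """Replace each delimited name with its <refN> tag, keeping the delimiter."""
    def to_ref(match):
        return "{}<ref{}>".format(match.group(1), REFS[match.group(2).lower()])
    return PATTERN.sub(to_ref, text)


print(encode("alpha beta + gamma delta - epsilon phi"))
# <ref4> + <ref45> - <ref11>
print(encode("alpha + alpha beta + gamma delta - epsilon phi"))
# <ref1> + <ref4> + <ref45> - <ref11>
```

Because the lookahead rejects "alpha" when " beta" follows, the order of the names in the alternation no longer matters in this sketch.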

OTHER TIPS

You might wish to try right-to-left matching (RegexOptions.RightToLeft) if none of your patterns is a suffix of another pattern; see the MSDN tutorial and reference for details.

Another way to go would be to factor out common subexpressions from your match expressions, e.g.

\b(alpha(\ beta)?)\b
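
A quick way to see the effect of factoring out the common prefix, sketched in Python with made-up sample strings. The optional group is greedy, so the engine tries the longer form first and falls back to plain "alpha" only when the longer form cannot end on a word boundary.

```python
import re

# Common prefix factored out; the optional " beta" part is tried first.
factored = re.compile(r"\b(alpha(?:\ beta)?)\b", re.IGNORECASE)

factored_hits = factored.findall("alpha + alpha beta + Alpha Beta")
fallback_hits = factored.findall("alpha betamax")

print(factored_hits)  # ['alpha', 'alpha beta', 'Alpha Beta']
print(fallback_hits)  # ['alpha']
```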

P.S. Check your code again, as the engine should match greedily by default.

Licensed under: CC-BY-SA with attribution