Question

I have a very basic regular expression that I just can't figure out why it's not working so the question is two parts. Why does my current version not work and what is the correct expression.

Rules are pretty simple:

  1. Must have minimum 3 characters.
  2. If a % character is the first character must be a minimum of 4 characters.

So the following cases should work out as follows:

  • AB - fail
  • ABC - pass
  • ABCDEFG - pass
  • % - fail
  • %AB - fail
  • %ABC - pass
  • %ABCDEFG - pass
  • %%AB - pass

The expression I am using is:

^%?\S{3}

Which to me means:

  • ^ - Start of string
  • %? - Greedy check for 0 or 1 % character
  • \S{3} - 3 other characters that are not white space

The problem is, the %? for some reason is not doing a greedy check. It's not eating the % character if it exists so the '%AB' case is passing which I think should be failing. Why is the %? not eating the % character?

Someone please show me the light :)

Edit: The answer I used was Dav below: ^(%\S{3}|[^%\s]\S{2}) Although it was a 2 part answer and Alan's really made me understand why. I didn't use his version of ^(?>%?)\S{3} because it worked but not in the javascript implementation. Both great answers and a lot of help.

Was it helpful?

Solution

The word for the behavior you described isn't greedy, it's possessive. Normal, greedy quantifiers match as much as they can originally, but back off if necessary to allow the whole regex to match (I like to think of them as greedy but accommodating). That's what's happening to you: the %? originally matches the leading percent sign, but if there aren't enough characters left for an overall match, it gives up the percent sign and lets \S{3} match it instead.

Some regex flavors (including Java and PHP) support possessive quantifiers, which never back off, even if that causes the overall match to fail. .NET doesn't have those, but it has the next best thing: atomic groups. Whatever you put inside an atomic group acts like a separate regex--it either matches at the position where it's applied or it doesn't, but it never goes back and tries to match more or less than it originally did just because the rest of the regex is failing (that is, the regex engine never backtracks into the atomic group). Here's how you would use it for your problem:

^(?>%?)\S{3}

If the string starts with a percent sign, the (?>%?) matches it, and if there aren't enough characters left for \S{3} to match, the regex fails.

Note that atomic groups (or possessive quantifiers) are not necessary to solve this problem, as @Dav demonstrated. But they're very powerful tools which can easily make the difference between impossible and possible, or too damn slow and slick as can be.

OTHER TIPS

Regex will always try to match the whole pattern if it can - "greedy" doesn't mean "will always grab the character if it exists", but instead means "will always grab the character if it exists and a match can be made with it grabbed".

Instead, what you probably want is something like this:

^(%\S{3}|[^%\s]\S{2})

Which will match either a % followed by 3 characters, or a non-%, non-whitespace followed by 2 more.

I always love to look at RE questions to see how much time people spend on them to "Save time"

str.len() >= str[0]=='&' ? 4 : 3

Although in real life I'd be more explicit, I just wrote it that way because for some reason some people consider code brevity an advantage (I'd call it an anti-advantage, but that's not a popular opinion right now)

Try the regex modified a little based on Dav's original one:

^(%\S{3,}|[^%\s]\S{2,})

with the regex option "^ and $ match at line breaks" on.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top