Question

EDIT: remove original example because it provoked ancillary answers. also fixed the title.

The question is why the presence of the "$" in the regular expression effects the greedyness of the expression:

Here is a simpler example:

>>> import re
>>> str = "baaaaaaaa"
>>> m = re.search(r"a+$", str)
>>> m.group()
'aaaaaaaa'
>>> m = re.search(r"a+?$", str)
>>> m.group()
'aaaaaaaa'

The "?" seems to be doing nothing. Note the when the "$" is removed, however, then the "?" is respected:

>>> m = re.search(r"a+?", str)
>>> m.group()
'a'

EDIT: In other words, "a+?$" is matching ALL of the a's instead of just the last one, this is not what I expected. Here is the description of the regex "+?" from the python docs: "Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched."

This does not seem to be the case in this example: the string "a" matches the regex "a+?$", so why isn't the match for the same regex on the string "baaaaaaa" just a single a (the rightmost one)?

Was it helpful?

Solution

Matches are "ordered" by "left-most, then longest"; however "longest" is the term used before non-greedy was allowed, and instead means something like "preferred number of repetitions for each atom". Being left-most is more important than the number of repetitions. Thus, "a+?$" will not match the last A in "baaaaa" because matching at the first A starts earlier in the string.

(Answer changed after OP clarification in comments. See history for previous text.)

OTHER TIPS

The non-greedy modifier only affects where the match stops, never where it starts. If you want to start the match as late as possible, you will have to add .+? to the beginning of your pattern.

Without the $, your pattern is allowed to be less greedy and stop sooner, because it doesn't have to match to the end of the string.

EDIT:

More details... In this case:

re.search(r"a+?$", "baaaaaaaa")

the regex engine will ignore everything up until the first 'a', because that's how re.search works. It will match the first a, and would "want" to return a match, except it doesn't match the pattern yet because it must reach a match for the $. So it just keeps eating the a's one at a time and checking for $. If it were greedy, it wouldn't check for the $ after each a, but only after it couldn't match any more a's.

But in this case:

re.search(r"a+?", "baaaaaaaa")

the regex engine will check if it has a complete match after eating the first match (because it's non-greedy) and succeed because there is no $ in this case.

The presence of the $ in the regular expression does not affect the greediness of the expression. It merely adds another condition which must be met for the overall match to succeed.

Both a+ and a+? are required to consume the first a they find. If that a is followed by more a's, a+ goes ahead and consumes them too, while a+? is content with just the one. If there were anything more to the regex, a+ would be willing to settle for fewer a's, and a+? would consume more, if that's what it took to achieve a match.

With a+$ and a+?$, you've added another condition: match at least one a followed by the end of the string. a+ still consumes all of the a's initially, then it hands off to the anchor ($). That succeeds on the first try, so a+ is not required to give back any of its a's.

On the other hand, a+? initially consumes just the one a before handing off to $. That fails, so control is returned to a+?, which consumes another a and hands off again. And so it goes, until a+? consumes the last a and $ finally succeeds. So yes, a+?$ does match the same number of a's as a+$, but it does so reluctantly, not greedily.

As for the leftmost-longest rule that was mentioned elsewhere, that never did apply to Perl-derived regex flavors like Python's. Even without reluctant quantifiers, they could always return a less-then-maximal match thanks to ordered alternation. I think Jan's got the right idea: Perl-derived (or regex-directed) flavors should be called eager, not greedy.

I believe the leftmost-longest rule only applies to POSIX NFA regexes, which use NFA engines under under the hood, but are required to return the same results a DFA (text-directed) regex would.

Answer to original question:

Why does the first search() span multiple "/"s rather than taking the shortest match?

A non-greedy subpattern will take the shortest match consistent with the whole pattern succeeding. In your example, the last subpattern is $, so the previous ones need to stretch out to the end of the string.

Answer to revised question:

A non-greedy subpattern will take the shortest match consistent with the whole pattern succeeding.

Another way of looking at it: A non-greedy subpattern will initially match the shortest possible match. However if this causes the whole pattern to fail, it will be retried with an extra character. This process continues until the subpattern fails (causing the whole pattern to fail) or the whole pattern matches.

There are two issues going on, here. You used group() without specifying a group, and I can tell you are getting confused between the behavior of regular expressions with an explicitly parenthesized group and without a parenthesized group. This behavior without parentheses that you are observing is just a shortcut that Python provides, and you need to read the documentation on group() to understand it fully.

>>> import re
>>> string = "baaa"
>>> 
>>> # Here you're searching for one or more `a`s until the end of the line.
>>> pattern = re.search(r"a+$", string)
>>> pattern.group()
'aaa'
>>> 
>>> # This means the same thing as above, since the presence of the `$`
>>> # cancels out any meaning that the `?` might have.
>>> pattern = re.search(r"a+?$", string)
>>> pattern.group()
'aaa'
>>> 
>>> # Here you remove the `$`, so it matches the least amount of `a` it can.
>>> pattern = re.search(r"a+?", string)
>>> pattern.group()
'a'

Bottom line is that the string a+? matches one a, period. However, a+?$ matches a's until the end of the line. Note that without explicit grouping, you'll have a hard time getting the ? to mean anything at all, ever. In general, it's better to be explicit about what you're grouping with parentheses, anyway. Let me give you an example with explicit groups.

>>> # This is close to the example pattern with `a+?$` and therefore `a+$`.
>>> # It matches `a`s until the end of the line. Again the `?` can't do anything.
>>> pattern = re.search(r"(a+?)$", string)
>>> pattern.group(1)
'aaa'
>>>
>>> # In order to get the `?` to work, you need something else in your pattern
>>> # and outside your group that can be matched that will allow the selection
>>> # of `a`s to be lazy. # In this case, the `.*` is greedy and will gobble up
>>> # everything that the lazy `a+?` doesn't want to.
>>> pattern = re.search(r"(a+?).*$", string)
>>> pattern.group(1)
'a'

Edit: Removed text related to old versions of the question.

Unless your question isn't including some important information, you don't need, and shouldn't use, regex for this task.

>>> import os
>>> p = "/we/shant/see/this/butshouldseethis"
>>> os.path.basename(p)
butshouldseethis
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top