Could you explain why this regex is not working?

https://stackoverflow.com/questions/6378236

28-10-2019

Question

>>> import re
>>> d = "Batman,Superman"
>>> m = re.search(r"(?<!Bat)\w+", d)
>>> m.group(0)
'Batman'

Why isn't group(0) matching Superman? This lookaround tutorial says:

(?<!a)b matches a "b" that is not preceded by an "a", using negative lookbehind

Solution

At a simple level, the regex engine starts from the left of the string and moves progressively towards the right, trying to match your pattern (think of it like a cursor moving through the string). In the case of a lookaround, at each stop of the cursor, the lookaround is asserted, and if true, the engine continues trying to make a match. As soon as the engine can match your pattern, it'll return a match.

At position 0 of your string (i.e. just before the B in Batman), the assertion succeeds because Bat does not appear before the current position. \w+ can therefore match the entire word Batman (remember that quantifiers such as + are greedy by default, i.e. they match as much as possible).
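To make the left-to-right scan concrete, here is a small sketch that lists every non-overlapping match of the original pattern together with its start offset. The variable names are only illustrative and are not from the question.

```python
import re

text = "Batman,Superman"
pattern = re.compile(r"(?<!Bat)\w+")

# finditer scans left to right and reports each non-overlapping match.
matches = [(m.start(), m.group(0)) for m in pattern.finditer(text)]
print(matches)

# search returns only the first of those matches, the one found at offset 0.
first = pattern.search(text)
print(first.start(), first.group(0))
```

Superman is matched too, but only as the second match, which is why re.search never reports it.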

See this page for more information on engine internals.

To achieve what you wanted, you could instead use something like:

\b(?!Bat)\w+

In this pattern, the engine matches a word boundary (\b)¹ followed by one or more word characters, with the assertion that those word characters do not start with Bat. A lookahead is used rather than a lookbehind because a lookbehind here would have the same problem as your original pattern: it would look at the text before the position directly following the word boundary, and since that position has already been determined to be the start of a word, the text before it can never end in Bat, so the negative lookbehind would always succeed.
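As a quick check of the reasoning above, this sketch runs the suggested lookahead pattern next to a lookbehind placed after \b. The variable names are only illustrative.

```python
import re

text = "Batman,Superman"

# Lookahead version: reject words that begin with "Bat".
lookahead = re.search(r"\b(?!Bat)\w+", text)
print(lookahead.group(0))

# Lookbehind version: at the start of a word the text before the cursor
# never ends in "Bat", so the assertion passes and the first word is returned.
lookbehind = re.search(r"\b(?<!Bat)\w+", text)
print(lookbehind.group(0))
```

The first search skips Batman and returns Superman; the second still returns Batman.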

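¹ Note that a word boundary is a position between \w and \W (i.e. between [A-Za-z0-9_] and any other character). \b also matches at the very start or end of the string, but only when the adjacent character is a word character. In Python 3, \w additionally covers Unicode letters and digits for str patterns unless re.ASCII is set. If your boundaries need to be more complex, you'll need a different way of anchoring your pattern.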
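The boundary rules in the footnote can be probed with a few one-liners. This is only a sketch, and the sample strings are made up for illustration.

```python
import re

# \b matches at the start of the string only if the first character is a word character.
starts_with_word = bool(re.match(r"\b", "Batman"))
starts_with_comma = bool(re.match(r"\b", ",Batman"))
print(starts_with_word, starts_with_comma)

# The underscore is a word character, so "Bat_man" contains no inner boundary,
# while the hyphen in "Super-man" splits it into two words.
words = re.findall(r"\b\w+", "Bat_man,Super-man")
print(words)
```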

OTHER TIPS

Batman isn't directly preceded by Bat, so that matches first. In fact, neither is Superman; there's a comma in between in your string, which is enough to allow the RE to match there as well, but that match is never reached because it's possible to match earlier in the string.

Maybe this will explain better: if the string were Batman and you started trying to match from the m, the RE would not match until one character later (giving a match of an), because the position of the m is the only place in the string that is preceded by Bat.
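That thought experiment can be reproduced with the pos argument of a compiled pattern's search method. The sketch relies on the fact that, in CPython's re module, a lookbehind can still see characters before pos (passing a slice such as "Batman"[3:] would behave differently).

```python
import re

pattern = re.compile(r"(?<!Bat)\w+")

# Start scanning at index 3, the "m". The lookbehind sees "Bat" just before it,
# so the attempt at index 3 fails and the engine moves on to index 4.
m = pattern.search("Batman", 3)
print(m.start(), m.group(0))
```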

From the manual:

Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.

http://docs.python.org/library/re.html#regular-expression-syntax

You're looking for the first run of one or more word characters (\w+) that is not preceded by 'Bat'. Batman is the first such match. (Note that negative lookbehind assertions can match at the start of a string.)

To do what you want, you have to constrain the regex to match 'man' specifically; otherwise, as others have pointed out, \w+ greedily matches any word, including 'Batman'. As in:

>>> re.search(r"\w+(?<!Bat)man", "Batman,Superman").group(0)
'Superman'

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow