Question

I use python regular expressions (re module) in my code and noticed different behaviour in theese cases:

re.findall(r'\s*(?:[a-z]\))?[^.)]+', 'a) xyz. b) abc.') # non-capturing group
# results in ['a) xyz', ' b) abc']

and

re.findall(r'\s*(?<=[a-z]\))?[^.)]+', 'a) xyz. b) abc.') # lookbehind
# results in ['a', ' xyz', ' b', ' abc']

What I need to get is just ['xyz', 'abc']. Why are the examples behave differently and how t get the desired result?

Was it helpful?

Solution

The reason a and b are included in the second case is because (?<=[a-z]\)) would first find a) and since lookaround's don't consume any character you are back at the start of string.Now [^.)]+ matches a

Now you are at ).Since you have made (?<=[a-z]\)) optional [^.)]+ matches xyz

This same thing is repeated with b) abc

remove ? from the second case and you would get the expected result i.e ['xyz', 'abc']

OTHER TIPS

The regex you are looking for is:

re.findall(r'(?<=[a-z]\) )[^) .]+', 'a) xyz. b) abc.')

I believe the currently accepted answer by Anirudha explains the differences between your use of positive lookbehind and non-capturing well, however, the suggestion of removing the ? from after the positive lookbehind actually results in [' xyz', ' abc'] (note the included spaces).

This is due to the positive lookbehind not matching the space character as well as not including space in the main matching character class itself.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top