Python re.findall weird behaviour

https://stackoverflow.com/questions/15907553

03-04-2022

Question

>>> import re
>>> text =\
... """xyxyxy testmatch0
... xyxyxy testmatch1
... xyxyxy
... whyisthismatched1
... xyxyxy testmatch2
...  xyxyxy testmatch3
... xyxyxy
... whyisthismatched2
... """
>>> re.findall(r"^\s*xyxyxy\s+([a-z0-9]+).*$", text, re.MULTILINE)
[u'testmatch0', u'testmatch1', u'whyisthismatched1', u'testmatch2', u'testmatch3', u'whyisthismatched2']

I expected the lines containing "whyisthismatched" not to match.

The Python re documentation states the following:

. (Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

Is this really the expected behaviour, or is it a bug? If it is expected, could someone please explain why those lines match and how I should modify my pattern to get the result I expect:

[u'testmatch0', u'testmatch1', u'testmatch2', u'testmatch3']

Solution

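Newlines are whitespace too as far as the \s character class is concerned. On the lines that contain only xyxyxy, the \s+ matches the trailing newline, so the capture group picks up the word at the start of the following line. The dot never has to cross a line break, so the DOTALL rule is not involved. If you want to match spaces only, match [ ] instead: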

>>> re.findall(r"^\s*xyxyxy[ ]+([a-z0-9]+).*$", text, re.MULTILINE)
[u'testmatch0', u'testmatch1', u'testmatch2', u'testmatch3']
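
If tabs should also count as separators, one possible variant is the class [^\S\n], which reads "whitespace, but not a newline". The following is only a sketch in Python 3 syntax (so the results are printed without the u prefix); the variable names are made up for illustration, and replacing the leading \s* as well is an extra precaution that the answer above does not require.

```python
import re

text = """xyxyxy testmatch0
xyxyxy testmatch1
xyxyxy
whyisthismatched1
xyxyxy testmatch2
 xyxyxy testmatch3
xyxyxy
whyisthismatched2
"""

# \s includes "\n", so \s+ can run across the line break after a bare "xyxyxy".
crosses_lines = re.findall(r"^\s*xyxyxy\s+([a-z0-9]+).*$", text, re.MULTILINE)

# [^\S\n] matches any whitespace character except a newline (spaces, tabs, ...),
# so the separator and the captured word must sit on the same line.
same_line = re.findall(r"^[^\S\n]*xyxyxy[^\S\n]+([a-z0-9]+).*$", text, re.MULTILINE)

print(crosses_lines)
print(same_line)
```

The first list still contains the two "whyisthismatched" entries, while the second contains only the four "testmatch" entries.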

Licensed under: CC-BY-SA with attribution
