"Broken" regular expression?

https://stackoverflow.com/questions/17060890

31-05-2022

Question

I have a regular expression for parsing a series of values like a=b c=d e=f, which should result in a dictionary like this: {'a': 'b', 'c': 'd', 'e': 'f'}. I wanted to let the user escape characters in values with \, so instead of a really simple regexp I used ((?:[^\\\s=]+|\\.)+), and I added (?:^|\s) and (?=\s|$) so that the expression wouldn't match partial results.

>>> import re
>>> reg = re.compile(r'(?:^|\s)([\w\d]+)=((?:[^\\\s=]+|\\.)+)(?=\s|$)')
>>> s = r'a=b c=d e=one\two\three'
>>> reg.findall(s)
[('a', 'b'), ('c', 'd'), ('e', 'one\\two\\three')]

But then someone came along and inserted an = into the right-hand side of a pair.

>>> s = r'a=b c=d e=aaaaaaaaaaaaaaaaaaaaaaaaaa\bbbbbbbbbbbbbbbbbbbbbbbbbbbb\ccccccccc=dddddddddddddddd\eeeeeeeeeeeeeee'
>>> reg.findall(s)

And the script got stuck on this line (I waited for several hours and it didn't finish).

Question: is the regular expression that poor (why? how would you write it?), or is this a bug in the regexp implementation?

Note: I'm not asking for solutions to this issue; I'm curious why findall() doesn't finish within a few hours.

Solution

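Your problem is that you nest repetitions (a + inside a group that is itself repeated with +), and when the match fails the regex engine ends up trying every possible way of distributing the characters among them: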

r'(?:^|\s)([\w\d]+)=((?:[^\\\s=]+|\\.)+)(?=\s|$)'
                                ^     ^
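
To get a feel for how many "distributions" that is, here is a minimal sketch that counts the ways a run of n ordinary characters can be cut into non-empty chunks, one chunk per iteration of the outer +. The function name count_splits is made up for this illustration, and the count is only a model of the backtracking search space, not a description of how the re module works internally.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n: int) -> int:
    """Count the ways to cut a run of n characters into non-empty chunks."""
    if n == 0:
        return 1  # nothing left to distribute: one complete split
    # The first chunk takes k characters; the remainder is split recursively.
    return sum(count_splits(n - k) for k in range(1, n + 1))


for n in (1, 2, 3, 10, 26):
    print(n, count_splits(n))
```

The count is 2**(n-1), so the run of 26 a's in the question alone allows 33554432 splits, and the runs separated by backslashes multiply with each other.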

Better:

r'(?:^|\s)([\w\d]+)=((?:[^\\\s=]|\\.)+)(?=\s|$)'

In fact, findall would finish eventually (or run out of memory). You can try this with

s = r'a=b c=d e=aaaaaaa\bbbbbbbb\ccccccccc=ddddddddd\eeeee'

and then successively adding characters after "e="; the running time roughly doubles with each character added to a run.
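
The experiment can be scripted roughly as follows. This is a sketch under assumptions: the names SLOW, FAST and timed_findall are invented here, the test string is a shortened stand-in for the one above, and the absolute timings depend on your machine and Python version. The lengths are kept small so the loop finishes quickly.

```python
import re
import time

# Pattern from the question (nested +) and the variant without the inner +.
SLOW = re.compile(r'(?:^|\s)([\w\d]+)=((?:[^\\\s=]+|\\.)+)(?=\s|$)')
FAST = re.compile(r'(?:^|\s)([\w\d]+)=((?:[^\\\s=]|\\.)+)(?=\s|$)')


def timed_findall(pattern, text):
    """Return (matches, elapsed seconds) for pattern.findall(text)."""
    start = time.perf_counter()
    result = pattern.findall(text)
    return result, time.perf_counter() - start


for n in range(10, 19, 2):
    # The stray "=" after the run of a's is what makes the match fail.
    text = 'a=b e=' + 'a' * n + '=x'
    slow_result, slow_t = timed_findall(SLOW, text)
    fast_result, fast_t = timed_findall(FAST, text)
    print(f'n={n:2d}  nested: {slow_t:.4f}s  flat: {fast_t:.6f}s  '
          f'same result: {slow_result == fast_result}')
```

Both patterns return the same matches on these inputs; only the time spent discovering that the e=... pair cannot match differs.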

OTHER TIPS

Regular expressions aren't the right tool for your task beyond very simple cases. You need to tokenize the input string.

In simple cases you can use str.split():

pairs = {}
for tok in s.split():
    # Split on the first "=" only, so a value may itself contain "=".
    key, value = tok.split("=", 1)
    pairs[key] = value

I haven't written Python in quite some time, so I'm not sure whether the for … in … statement is correct, but you get what I mean.
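
If backslash escapes have to be honoured, as in the question, a small hand-written tokenizer is one way to go. The following is only a sketch: tokenize and parse_pairs are hypothetical names, and keeping the backslashes in the value (as the regex in the question does) is an assumption about the desired output.

```python
def tokenize(text):
    """Yield whitespace-separated tokens; a backslash keeps the next character in the token."""
    token = []
    chars = iter(text)
    for ch in chars:
        if ch == '\\':
            # Keep the backslash and the character after it (if any) verbatim.
            token.append(ch)
            token.append(next(chars, ''))
        elif ch.isspace():
            if token:
                yield ''.join(token)
                token = []
        else:
            token.append(ch)
    if token:
        yield ''.join(token)


def parse_pairs(text):
    """Build a dict from key=value tokens, splitting each token on its first "="."""
    pairs = {}
    for tok in tokenize(text):
        key, sep, value = tok.partition('=')
        if sep and key:  # skip tokens that are not of the form key=...
            pairs[key] = value
    return pairs


print(parse_pairs(r'a=b c=d e=one\two\three'))
print(parse_pairs(r'k=hello\ world x=1=2'))
```

Each character is looked at once, so the running time is linear in the input length no matter where an = shows up.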

>>> import re
>>> reg = re.compile(r'(\w+)=(\S+)')
>>> dict(reg.findall(r'a=b c=d e=one\two\three'))
{'e': 'one\\two\\three', 'a': 'b', 'c': 'd'}
>>> dict(reg.findall(r'a=b c=d e=aaaaaaaaaaaaaaaaaaaaaaaaaa\bbbbbbbbbbbbbbbbbbbbbbbbbbbb\ccccccccc=dddddddddddddddd\eeeeeeeeeeeeeee'))
{'e': 'aaaaaaaaaaaaaaaaaaaaaaaaaa\\bbbbbbbbbbbbbbbbbbbbbbbbbbbb\\ccccccccc=dddddddddddddddd\\eeeeeeeeeeeeeee', 'a': 'b', 'c': 'd'}

Licensed under: CC-BY-SA with attribution
