How can I fix my regex to not match too much with a greedy quantifier? [duplicate]

https://stackoverflow.com/questions/255815

05-07-2019
|

Question

This question already has an answer here:

My regex is matching too much. How do I make it stop? 5 answers

I have the following line:

"14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)"

I parse this by using a simple regexp:

if($line =~ /(\d+:\d+)\ssay;(.*);(.*);(.*);(.*)/) {
    my($ts, $hash, $pid, $handle, $quote) = ($1, $2, $3, $4, $5);
}

But the ; at the end messes things up and I don't know why. Shouldn't the greedy operator handle "everything"?

Solution

The greedy operator tries to grab as much stuff as it can and still match the string. What's happening is the first one (after "say") grabs "0ed673079715c343281355c2a1fde843;2", the second one takes "laka", the third finds "hello " and the fourth matches the parenthesis.

What you need to do is make all but the last one non-greedy, so they grab as little as possible and still match the string:

(\d+:\d+)\ssay;(.*?);(.*?);(.*?);(.*)

OTHER TIPS

(\d+:\d+)\ssay;([^;]*);([^;]*);([^;]*);(.*)

should work better

Although a regex can easily do this, I'm not sure it's the most straight-forward approach. It's probably the shortest, but that doesn't actually make it the most maintainable.

Instead, I'd suggest something like this:

$x="14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)";

if (($ts,$rest) = $x =~ /(\d+:\d+)\s+(.*)/)
{
    my($command,$hash,$pid,$handle,$quote) = split /;/, $rest, 5;
    print join ",", map { "[$_]" } $ts,$command,$hash,$pid,$handle,$quote
}

This results in:

[14:48],[say],[0ed673079715c343281355c2a1fde843],[2],[laka],[hello ;)]

I think this is just a bit more readable. Not only that, I think it's also easier to debug and maintain, because this is closer to how you would do it if a human were to attempt the same thing with pen and paper. Break the string down into chunks that you can then parse easier - have the computer do exactly what you would do. When it comes time to make modifications, I think this one will fare better. YMMV.

Try making the first 3 (.*) ungreedy (.*?)

If the values in your semicolon-delimited list cannot include any semicolons themselves, you'll get the most efficient and straightforward regular expression simply by spelling that out. If certain values can only be, say, a string of hex characters, spell that out. Solutions using a lazy or greedy dot will always lead to a lot of useless backtracking when the regex does not match the subject string.

(\d+:\d+)\ssay;([a-f0-9]+);(\d+);(\w+);([^;\r\n]+)

You could make * non-greedy by appending a question mark:

$line =~ /(\d+:\d+)\ssay;(.*?);(.*?);(.*?);(.*)/

or you can match everything except a semicolon in each part except the last:

$line =~ /(\d+:\d+)\ssay;([^;]*);([^;]*);([^;]*);(.*)/

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow