Pregunta

I'm pretty sure I'm missing something obvious here, but I cannot make R to use non-greedy regular expressions:

> library(stringr)
> str_match('xxx aaaab yyy', "a.*?b")                                         
     [,1]   
[1,] "aaaab"

Base functions behave the same way:

> regexpr('a.*?b', 'xxx aaaab yyy')
[1] 5
attr(,"match.length")
[1] 5
attr(,"useBytes")
[1] TRUE

I would expect the match to be just ab as per 'greedy' comment in http://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html:

By default repetition is greedy, so the maximal possible number of repeats is used. This can be changed to ‘minimal’ by appending ? to the quantifier. (There are further quantifiers that allow approximate matching: see the TRE documentation.)

Could someone please explain me what's going on?

Update. What's crazy is that in some other cases non-greedy patterns behave as expected:

> str_match('xxx <a href="abc">link</a> yyy <h1>Header</h1>', '<a.*>')
     [,1]                                          
[1,] "<a href=\"abc\">link</a> yyy <h1>Header</h1>"
> str_match('xxx <a href="abc">link</a> yyy <h1>Header</h1>', '<a.*?>')
     [,1]              
[1,] "<a href=\"abc\">"
¿Fue útil?

Solución

Difficult concept so I'll try my best... Someone feel free to edit and explain better if it is a bit confusing.

Expressions that match your patterns are searched from left to right. Yes, all of the following strings aaaab, aaab, aab, and ab are matches to your pattern, but aaaab being the one that starts the most to the left is the one that is returned.

So here, your non-greedy pattern is not very useful. Maybe this other example will help you understand better when a non-greedy pattern kicks in:

str_match('xxx aaaab yyy', "a.*?y") 
#      [,1]     
# [1,] "aaaab y"

Here all of the strings aaaab y, aaaab yy, aaaab yyy matched the pattern and started at the same position, but the first one was returned because of the non-greedy pattern.


So what can you do to catch that last ab? Use this:

str_match('xxx aaaab yyy', ".*(a.*b)")
#      [,1]        [,2]
# [1,] "xxx aaaab" "ab"

How does it work? By adding a greedy pattern .* in the front, you are now forcing the process to put the last possible a into the captured group.

Otros consejos

The problem is matching the shortest window between two strings. @flodel correctly mentions that a regex engine is parsing the string from left to right, and thus all the matches are leftmost. Greediness and laziness only apply to the boundaries on the right: greedy quantifiers get the substrings up to the rightmost boundaries, and the lazy ones will match up to the first occurrence of the subpatterns to follow.

See the examples:

> library(stringr)
> str_extract('xxx aaaab yyy', "a[^ab]*b")
[1] "ab"
> str_extract('xxx aaa xxx aaa zzz', "xxx.*?zzz")
[1] "xxx aaa xxx aaa zzz"
> str_extract('xxx aaa xxx aaa zzz', "xxx(?:(?!xxx|zzz).)*zzz")
[1] "xxx aaa zzz"

The first and the third scenarios return the shortest window, the second one is an illustration of the current problem but with a multicharacter input.

Scenario 1. Boundaries are single characters

In case a and b are single characters, the shortest window is found by using a negated character class. a[^ab]*b will easily grab the substring from a till the next b with no as and bs in between.

Scenario 2. Boundaries are not single characters

You may use a tempered greedy token in these cases that can be further unrolled. The xxx(?:(?!xxx|zzz).)*zzz pattern matches xxx, then any 0+ chars other than a linebreak char that is not the starting char of a xxx or zzz char sequence (the (?!xxx|zzz) is a negative lookahead that fails the match if the substring immediately to the right matches the lookahead pattern), and then a zzz.

These matching scenarios can be easily used with base R regmatches (using a PCRE regex flavor that supports lookaheads):

> x <- 'xxx aaa xxx aaa zzz xxx bbb xxx ccc zzz'
> unlist(regmatches(x, gregexpr("xxx(?:(?!xxx|zzz).)*zzz", x, perl = TRUE)))
[1] "xxx aaa zzz" "xxx ccc zzz"

One note: when using a PCRE regex in base R, or the ICU regex in str_extract/str_match, the . does not match linebreak characters, to enable that behavior, you need to add (?s) at the pattern start (an inline DOTALL modifier).

Licenciado bajo: CC-BY-SA con atribución
No afiliado a StackOverflow
scroll top