str_extract_all returns non-matching group

https://stackoverflow.com/questions/18514954

26-06-2022

Question

I'm trying to extract values from some text in R using str_extract_all from the stringr package, and I want to use a non-capturing group (?:...) from Perl regular expressions to extract and clean the relevant values in one line.

When running this code:

library(stringr)

## Example string.
## Not the real string, but I get the same results with this one.
x <- 'WIDTH 4\nsome text that should not be matched.\n\nWIDTH   46 some text.'

## extract values
str_extract_all(x, perl('(?:WIDTH\\s+)[0-9]+'))

I want to get this result:

[[1]]
[1] "4"    "46"

But I get this:

[[1]]
[1] "WIDTH 4"    "WIDTH   46"

What am I doing wrong?

Solution

The regex still matches WIDTH – it just doesn't put it into a capture group. Your regex is equivalent to

WIDTH\s+[0-9]+

Your code extracts the whole substring that was matched by the regex. Whether a group is capturing or non-capturing does not change this.
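A quick way to check this is to run both patterns and compare the results. The sketch below uses base R's regmatches() and gregexpr() with perl = TRUE rather than stringr, an assumption made here so the snippet has no package dependencies; the variable names are illustrative only.

```r
# Example string from the question.
x <- 'WIDTH 4\nsome text that should not be matched.\n\nWIDTH   46 some text.'

# Pattern with the non-capturing group, as in the question.
with_group <- regmatches(x, gregexpr("(?:WIDTH\\s+)[0-9]+", x, perl = TRUE))[[1]]

# The same pattern without any group.
without_group <- regmatches(x, gregexpr("WIDTH\\s+[0-9]+", x, perl = TRUE))[[1]]

# Both calls return the full matched substrings, "WIDTH" prefix included.
print(with_group)
print(identical(with_group, without_group))
```

Grouping only affects what is available as a sub-match; the overall match is the same either way.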

You can use a lookbehind to assert that a certain string comes before the current position, without including it in the matched substring:

(?<=WIDTH\s)[0-9]+

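Most regex engines, PCRE included, do not allow variable-length patterns such as \s+ inside a lookbehind, so the pattern above only matches when exactly one whitespace character separates WIDTH from the number. PCRE offers another form, \K, which discards everything matched so far from the reported match and therefore allows a variable-length prefix: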

WIDTH\s+\K[0-9]+
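The two patterns can be compared on the example string from the question. This is a minimal sketch that assumes base R's gregexpr() with perl = TRUE (PCRE) instead of stringr; the variable names are made up for illustration.

```r
# Example string from the question.
x <- 'WIDTH 4\nsome text that should not be matched.\n\nWIDTH   46 some text.'

# Lookbehind with a single \s: the second value follows three spaces,
# so "WIDTH" plus one whitespace character does not sit directly before it.
lookbehind <- regmatches(x, gregexpr("(?<=WIDTH\\s)[0-9]+", x, perl = TRUE))[[1]]

# \K: "WIDTH" and any amount of whitespace are matched, then dropped
# from the reported match, leaving only the digits.
reset_start <- regmatches(x, gregexpr("WIDTH\\s+\\K[0-9]+", x, perl = TRUE))[[1]]

print(lookbehind)
print(reset_start)
```

On this input the lookbehind version should return only the first value, while the \K version returns both.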

Other tips

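The Perl regular expression does not do what you expect: a non-capturing group is not a zero-width assertion, so WIDTH and the whitespace stay part of the match.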

Here are solutions that do not need perl regular expressions:

sub("WIDTH\\s+", "", str_extract_all(x, 'WIDTH\\s+[0-9]+')[[1]])

Or, more simply, with the gsubfn package:

library(gsubfn)
strapplyc(x, "WIDTH\\s+(\\d+)")

If the results should be returned as numeric values, this also works:

strapply(x, "WIDTH\\s+(\\d+)", as.numeric)
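If neither stringr nor gsubfn is available, the same extract-then-strip idea can be sketched in base R. This variant is an assumption added for illustration, not part of the original answer, and the names matches and widths are hypothetical.

```r
# Example string from the question.
x <- 'WIDTH 4\nsome text that should not be matched.\n\nWIDTH   46 some text.'

# Step 1: extract every "WIDTH <number>" substring.
matches <- regmatches(x, gregexpr("WIDTH\\s+[0-9]+", x))[[1]]

# Step 2: strip the "WIDTH" prefix and convert the remainder to numeric.
widths <- as.numeric(sub("WIDTH\\s+", "", matches))

print(widths)
```

Here the default regex engine is enough, because no Perl-only constructs are used.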

Licensed under: CC-BY-SA with attribution
