Question

I have a string variable to parse into two parts. I figured I'd approach this using str_match from the stringr package, which returns a matrix with the original string in the first column and each extracted part in the other columns.

I found about a dozen regular expressions to extract these two parts. (The parts are a ladder and rung on a pay schedule, and it's very messy. I've verified that my regexes work by defining a function with a bunch of nested ifelse statements.)

library(stringr)
library(data.table)
my_strs <- c("A 01","G 00","A    2")
mydt <- data.table(strs = my_strs)

rx1 <- '^([[:alpha:]] )([[:digit:]]{2})$'
rx2 <- '(A)    ([[:digit:]])'

I want to check the regexes in sequence and extract the parts using the first one that checks out. If I only had one regex, I could do this:

myfun <- function(x){
    y <- str_match(x,rx1)
    return(y)
}
mydt[,myfun(strs)] 
#      [,1]   [,2] [,3]
# [1,] "A 01" "A " "01"
# [2,] "G 00" "G " "00"
# [3,] NA     NA   NA  

(It took me a long time to even get that to work, trying all combinations of Vectorize and as.list on the function and *applying in the call.)

My best attempt at checking the regexes in sequence is this rather ugly kludge:

myfun2 <- function(x){
    y <- str_match(x,rx1)
    ifelse(!is.na(y[1]),"",(y <- str_match(x,rx2))[1])
    return(y)
}
mydt[1:2,myfun2(strs)] 
#      [,1]   [,2] [,3]
# [1,] "A 01" "A " "01"
# [2,] "G 00" "G " "00"
mydt[3,myfun2(strs)] 
#      [,1]     [,2] [,3]
# [1,] "A    2" "A"  "2" 
mydt[1:3,myfun2(strs)]
#      [,1]   [,2] [,3]
# [1,] "A 01" "A " "01"
# [2,] "G 00" "G " "00"
# [3,] NA     NA   NA  

As you can see, it doesn't quite work yet.

Do you have any idea about a better way to approach this? I have about 3.5 million rows in my data set, but only about 2000 unique values for this string, so I'm not really worried about efficiency.


Solution

Try this using strapply from the gsubfn package. We define a function that accepts the matches and returns the first two non-empty ones. Then we apply it to each component of the input vector, using the regular expressions joined into one alternation with paste(..., sep = "|"). The test data below extends my_strs and adds a third regex:

library(gsubfn)

# test data
# The comments on the question added a requirement: also handle a regular
# expression that has only a single capture. Make sure it's at the end.
rx3 <- "^([[:digit:]]{2})$"
my_strs2 <- c(my_strs, "99")    

# code
first2 <- function(...) { x <- c(..., NA); head(x[x != ""], 2) }
strapply(my_strs2, paste(rx1, rx2, rx3, sep = "|"), first2, simplify = TRUE)

The last line returns:

     [,1] [,2] [,3] [,4]
[1,] "A " "G " "A"  "99"
[2,] "01" "00" "2"  NA  

(If there are components of my_strs that do not match at all then a list will be returned in which those components are NULL. In that case you may prefer to drop the simplify = TRUE and always have it return a list.)
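To illustrate the list case, here is a minimal base R sketch that pads NULL components with NA before binding everything into a matrix. The list L_demo is a hand-built stand-in for that kind of result, not actual strapply output, and the name is made up for this example:

```r
# Hypothetical list shaped like the result described above:
# NULL where a string matched none of the regexes
L_demo <- list(c("A ", "01"), NULL, c("A", "2"))

# Replace every NULL component with a pair of NAs
is_null <- vapply(L_demo, is.null, logical(1))
L_demo[is_null] <- list(c(NA_character_, NA_character_))

# One row per input string, one column per extracted part
parts <- do.call(rbind, L_demo)
```

Keeping one row per input string, even for non-matches, makes it easier to line the parts up with the original data afterwards.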

Note: strapplyc in the same package is much faster than strapply because its core is written in Tcl (a string processing language), whereas strapply is written in R. You might therefore split the work up this way to take advantage of the faster routine:

L <- strapplyc(my_strs2, paste(rx1, rx2, rx3, sep = "|"))
sapply(L, first2)
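
If installing gsubfn is not an option, the same "first regex that matches wins" idea can be sketched in base R with regexec and regmatches. This is only a sketch under some assumptions: match_first and its n_groups argument are names invented here, and every regex is assumed to have at most n_groups capture groups.

```r
# Hypothetical helper: try each regex in order, keep the first match per string
match_first <- function(x, patterns, n_groups = 2) {
  # One row per string: full match plus n_groups captures, all NA to start
  out <- matrix(NA_character_, nrow = length(x), ncol = n_groups + 1)
  for (rx in patterns) {
    todo <- which(is.na(out[, 1]))   # strings no earlier regex has matched
    if (length(todo) == 0) break
    m <- regmatches(x[todo], regexec(rx, x[todo]))
    hit <- lengths(m) > 0            # character(0) means "no match"
    for (j in which(hit)) {
      out[todo[j], seq_along(m[[j]])] <- m[[j]]
    }
  }
  out
}

rx1 <- '^([[:alpha:]] )([[:digit:]]{2})$'
rx2 <- '(A)    ([[:digit:]])'
res <- match_first(c("A 01", "G 00", "A    2", "zzz"), list(rx1, rx2))
```

Here res has the same layout as the str_match output in the question: the first three rows are filled in by rx1 or rx2, and the last row stays NA because nothing matched.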

Other tips

For posterity, here is another solution I found today. I made some minor alterations to the regexes and put them in a list:

rx1  <- '^([[:alpha:]] )([[:digit:]]{2})$'
rx2a <- "^(A)    ([[:digit:]])$"
rx3a <- "^()([[:digit:]]{2})$"
rx_list <- list(rx1,rx2a,rx3a)

Then, row by row, I find the first regex that matches and extract the parts with it:

mydt[,{
    # index of the first regex in rx_list that matches this row's string
    i_rx <- min(which(unlist(sapply(rx_list,function(x)grepl(x,strs)))))
    as.list(str_match(strs,rx_list[[i_rx]]))
},by=1:nrow(mydt)]
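
Since the question mentions only about 2000 distinct strings across roughly 3.5 million rows, a row-by-row approach like this could be run once per distinct value and the results expanded back to the full column. Below is a minimal base R sketch of that idea. The names parse_one, rx_demo and strs_demo are invented for illustration, and only two of the regexes are used to keep it short:

```r
rx_demo <- list('^([[:alpha:]] )([[:digit:]]{2})$',
                '^(A)    ([[:digit:]])$')

# Hypothetical helper: parse one string with the first regex that matches
parse_one <- function(s) {
  for (rx in rx_demo) {
    m <- regmatches(s, regexec(rx, s))[[1]]
    if (length(m) > 0) return(m[2:3])   # the two capture groups
  }
  c(NA_character_, NA_character_)       # nothing matched
}

strs_demo <- c("A 01", "G 00", "A 01", "A    2", "G 00")
u <- unique(strs_demo)                          # parse each distinct value once
parsed <- t(vapply(u, parse_one, character(2))) # one row per distinct value
full <- parsed[match(strs_demo, u), , drop = FALSE]  # back to one row per input
rownames(full) <- NULL
```

The match() lookup is what maps each original row to its already-parsed distinct value, so the regexes run only as many times as there are unique strings.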
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow