Question

I am trying to create a new variable in a data table using an if statement: if a string variable contains a given substring, the new variable should equal a numerical value.

My data:

N X
1 aa1aa 
2 bb2bb
3 cc-1bb 
...

The data frame contains several thousand rows.

The result I need is a new column containing the numerical value that is within the string (column X):

N X      Y
1 aa1aa  1
2 bb2bb  2
3 cc-1bb -1 

I was trying with

for (i in 1:length(mydata)){
  if (grep('1', mydata$X) == TRUE) {
    mydata$Y <- 1  }

but I'm not sure if I'm even on the right track... Any help, please?

Solution

This should work on more of your extended samples. Basically it takes out everything that's not a letter from the middle of the string.

X <- c("aa1aa", "bb2bb", "cc-1bb","aa+0.5b","fg-0.25h")
gsub("^[a-z]+([^a-z]*)[a-z]+$", "\\1", X, perl = TRUE)
#[1] "1"     "2"     "-1"    "+0.5"  "-0.25"
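The call above returns character strings, while the question asks for a numerical column. A minimal sketch of the extra step, assuming every value in X has the letters-number-letters shape shown (the data frame `d` is rebuilt here only for illustration):

```r
# Example data in the same shape as the question, plus the extended samples
d <- data.frame(N = 1:5,
                X = c("aa1aa", "bb2bb", "cc-1bb", "aa+0.5b", "fg-0.25h"),
                stringsAsFactors = FALSE)

# Keep only the middle, non-letter part, then convert it to numeric
d$Y <- as.numeric(gsub("^[a-z]+([^a-z]*)[a-z]+$", "\\1", d$X, perl = TRUE))

d$Y
```

Strings that do not match the pattern are returned unchanged by gsub, so as.numeric would turn them into NA with a warning; it may be worth checking for NA values afterwards.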

OTHER TIPS

Using the example data from @Paulo you can use gsub from base R...

d$Y <- gsub( "[^0-9]" , "" , d$X ) 
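Note that "[^0-9]" deletes every non-digit character, so a minus sign or a decimal point is lost and the result is still a character vector. A hedged sketch of one possible variation, assuming the only "+", "-" and "." characters in each string belong to the number:

```r
X <- c("aa1aa", "bb2bb", "cc-1bb", "aa+0.5b")

# Digits only: the sign and the decimal point are dropped
digits_only <- gsub("[^0-9]", "", X)

# Also keep ".", "+" and "-" (the trailing "-" in the class is a literal hyphen)
signed <- gsub("[^0-9.+-]", "", X)

# Convert to numeric so the new column is not a string
Y <- as.numeric(signed)
```

If a string can contain a hyphen or a dot outside the number, this character class would keep those too, and a pattern that matches the number itself is likely a safer choice.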

something like this?

d <- data.frame(N = 1:3,
                X = c('aa1aa', 'bb2bb', 'cc-1bb'),
                stringsAsFactors = FALSE)

library(stringr)

d$Y <- as.numeric(str_extract_all(d$X,"\\(?[0-9,.]+\\)?"))

d

  N      X  Y
1 1  aa1aa  1
2 2  bb2bb  2
3 3 cc-1bb  1
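In the output above, row 3 shows 1 rather than -1 because the pattern does not include the minus sign. A possible base R sketch that matches an optional sign and decimal part; the helper name `extract_number` is made up for this example, and it assumes at most one number per string:

```r
# Hypothetical helper: pull the first signed number out of each string
extract_number <- function(x) {
  pattern <- "[-+]?[0-9]*\\.?[0-9]+"
  m <- regexpr(pattern, x)              # position of the first match, -1 if none
  out <- rep(NA_character_, length(x))  # strings without a number stay NA
  out[m != -1] <- regmatches(x, m)      # regmatches returns matched strings only
  as.numeric(out)
}

d <- data.frame(N = 1:3,
                X = c("aa1aa", "bb2bb", "cc-1bb"),
                stringsAsFactors = FALSE)

d$Y <- extract_number(d$X)
```

With this pattern the third row should come out as -1, and a string without any digits yields NA instead of an error.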

EDIT - Speed test

The gsub approach provided by @Simon is much faster than stringr: on 30,000 rows the median time is about 45 milliseconds versus about 2.8 seconds.

library(microbenchmark)
# 30000 lines data.frame
d1 <- data.frame(N = 1:30000,
                X = rep(c('aa1aa', 'bb2bb', 'cc-1bb'), 10000),
                stringsAsFactors = FALSE)

stringr

microbenchmark(as.numeric(str_extract_all(d1$X,"\\(?[0-9,.]+\\)?")), 
               times = 10L)
Unit: seconds
                                                  expr      min      lq  median       uq      max neval
 as.numeric(str_extract_all(d1$X, "\\(?[0-9,.]+\\)?")) 2.677408 2.75283 2.76473 2.781083 2.796648    10

base gsub

microbenchmark(gsub( "[^0-9]" , "" , d1$X ), times = 10L)
Unit: milliseconds
                     expr      min       lq   median       uq      max neval
 gsub("[^0-9]", "", d1$X) 44.95564 45.05358 45.07238 45.10201 45.23645    10
Licensed under: CC-BY-SA with attribution