Question

I have not used regular expressions much. I have been writing code to extract digits from column names in R.

Column names:

c<- c("Variable182predict", "Variable123Target", "Timestamp", "TargetVariable")

I used the following function in R to extract digits:

numbers<-gsub(pattern=".*e(\\d+).*","\\1", c)

Luckily enough, I got the digits, but there are two results in the output that I don't understand:

"182" "123" "Timestamp" "TargetVariable"

I understand how the digits are extracted, but why does it return the last two column names unchanged? That is the part I can't figure out. Any input will be highly appreciated. Thanks!


Solution

hrbrmstr and Jake Burkhead give you the explanation: what is not matched is not replaced.

Since the last two column names don't contain digits, the pattern doesn't match them, so nothing is replaced and they come back as they were.

A way to solve the problem is to replace all that is not a digit with nothing:

numbers<-gsub(pattern="\\D+","", c)
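As a minimal sketch of what this replacement returns (the variable name `cols` is chosen here only to avoid reusing `c`): elements without digits become empty strings, which you may want to drop afterwards.

```r
# Column names from the question; `cols` is a name chosen for this sketch
cols <- c("Variable182predict", "Variable123Target", "Timestamp", "TargetVariable")

# Remove every run of non-digit characters
numbers <- gsub("\\D+", "", cols)
print(numbers)

# Keep only the elements that still contain something
non_empty <- numbers[nzchar(numbers)]
print(non_empty)
```

Here `numbers` has one entry per column name, while `non_empty` keeps only the two names that actually contained digits.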

Other tips

gsub() is going to take the vector, look for the pattern, replace it where found and return each element whether it was replaced or not. You can use something like this:

library(stringr)

c.names <- c("Variable182predict", "Variable123Target", "Timestamp", "TargetVariable")
as.numeric(na.omit(str_extract(c.names, "\\d+")))

which will return

## [1] 182 123

(I made the assumption you only wanted the numeric output and nothing else)

stringr is a pretty handy package to have around if you do a lot with character vectors.

From ?gsub:

 Elements of character vectors ‘x’ which are not
 substituted will be returned unchanged

So if the regex doesn't match one of the input elements it does nothing to that element. The last 2 elements of your input vector don't match the pattern since they don't contain an e followed by one or more digits, so they are returned untouched.
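One way to see this for yourself is to check which elements match before substituting. The following is a base R sketch, not the only way to do it; `cols`, `pattern` and `matched` are names chosen for illustration.

```r
# Column names from the question; `cols` is a name chosen for this sketch
cols <- c("Variable182predict", "Variable123Target", "Timestamp", "TargetVariable")
pattern <- ".*e(\\d+).*"

# TRUE where the pattern matches, FALSE where gsub() would return the element unchanged
matched <- grepl(pattern, cols)
print(matched)

# Substitute only where there is a match, and mark the rest as NA
numbers <- ifelse(matched, sub(pattern, "\\1", cols), NA_character_)
print(numbers)
```

Using `NA` for the non-matching elements makes it obvious which column names had no digits, instead of silently passing the original names through.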

If you want to extract all digits from text, use this function from the stringi package. "Nd" is the Unicode class of decimal digits.

    library(stringi)
    stri_extract_all_charclass(c(123,43,"66ala123","kot"),"\\p{Nd}")
[[1]]
[1] "123"

[[2]]
[1] "43"

[[3]]
[1] "66"  "123"

[[4]]
[1] NA

Please note that here the numbers 66 and 123 are extracted separately, whereas with the gsub function they are pasted together into 66123.
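The same contrast can be sketched in base R without stringi. This assumes plain ASCII digits (`[0-9]`), which is narrower than the Unicode `Nd` class used above; `x`, `glued` and `runs` are names chosen for illustration.

```r
# Same inputs as above, written as strings
x <- c("123", "43", "66ala123", "kot")

# gsub() deletes the non-digits, so separate digit runs end up joined
glued <- gsub("\\D+", "", x)
print(glued)

# gregexpr() + regmatches() return each run of digits on its own
runs <- regmatches(x, gregexpr("[0-9]+", x))
print(runs)
```

Note that for an element with no digits, `regmatches()` gives an empty character vector rather than `NA`.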

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow