cbind, grep and quotation marks in R

https://stackoverflow.com/questions/22636123

20-06-2023
Question

Consider a minimal working example (e.g., for a binomial model):

test.a.tset <- rnorm(10)
test.b.tset <- rnorm(10)
c <- runif(10)
c[c < 0.5] <- 0
c[c >= 0.5] <- 1
df <- data.frame(test.a.tset,test.b.tset,c)

Using a regex, I want to regress c on all variables with the structure test."anything".tset:

summary(glm(paste("c ~ ",paste(colnames((df[, grep("test\\.\\w+\\.tset", colnames(df))])),
        collapse = "+"), sep = ""), data = df, family=binomial))
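The same formula can be assembled without pasting strings by hand. The following is only a sketch: the seed, the `predictors` and `fit` names, and the compact way of building `df` are assumptions added for illustration, not part of the original example.

```r
# Re-creation of the example data; the seed is an addition for reproducibility
set.seed(1)
df <- data.frame(test.a.tset = rnorm(10),
                 test.b.tset = rnorm(10),
                 c = as.numeric(runif(10) >= 0.5))

# value = TRUE makes grep return the matching names instead of their positions
predictors <- grep("test\\.\\w+\\.tset", colnames(df), value = TRUE)

# reformulate() (stats package) builds c ~ test.a.tset + test.b.tset
f <- reformulate(predictors, response = "c")

fit <- glm(f, data = df, family = binomial)
summary(fit)
```

With only ten observations the fit itself is not meaningful; the point is just that `grep(..., value = TRUE)` plus `reformulate()` yields the same model specification as the pasted string.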

So far, no problems. Now we get to the part where cbind comes into play. Suppose I want to use a different statistical model (e.g. rbprobitGibbs from the bayesm package), which requires a design matrix as input. Thus, I need to transform the data frame into the appropriate format.

X <- cbind(df$test.a.tset,df$test.b.tset)

Or, alternatively, if I want to use regex again (where I even add a second grep to ensure that only the part inside the quotation marks is selected):

X2 <- cbind(grep("[^\"]+",paste(paste("df$", colnames((df[, grep("test\\.\\w+\\.tset", colnames(df))])), 
            sep = ""), collapse = ","), value = TRUE))

But there is a difference:

> X
            [,1]         [,2]
 [1,] -0.4525601 -1.240484170
 [2,]  0.3135625  1.240519383
 [3,] -0.2883953 -0.554670224
 [4,] -1.3696994 -1.373690426
 [5,]  0.8514529 -0.063945537
 [6,] -1.1804205 -0.314132743
 [7,] -1.0161170 -0.001605679
 [8,]  1.0072168  0.938921869
 [9,] -0.8797069 -1.158626865
[10,] -0.9113297  1.641201924
> X2
     [,1]                        
[1,] "df$test.a.tset,df$test.b.tset"

From my point of view, the problem seems to be that grep returns the selected value as a string inside quotation marks and that, while glm effectively ignores the quotation marks in "c ~ test.a.tset+test.b.tset", cbind does not. That is, the call for X2 after the paste is actually read as:

X2 <- cbind("df$test.a.tset,df$test.b.tset")
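This behaviour can be reproduced in isolation. The sketch below uses a small hypothetical data frame (the values and the names `s`, `m` and `f` are made up for illustration) to show that a pasted string stays a string when handed to cbind, whereas a string can be coerced to a formula, which is what the documentation of glm allows for its first argument.

```r
# Hypothetical stand-in for the data frame in the question
df <- data.frame(test.a.tset = c(0.1, 0.2), test.b.tset = c(0.3, 0.4))

# Build the same kind of string as in the X2 call
s <- paste(paste0("df$", colnames(df)), collapse = ",")
print(s)

# cbind() receives a character vector of length 1, so the result is a
# 1 x 1 character matrix; the text is never parsed or evaluated as R code
m <- cbind(s)
dim(m)
is.character(m)

# A formula, by contrast, can be created from a string
f <- as.formula("c ~ test.a.tset+test.b.tset")
class(f)
```

So the quotation marks are not really ignored by glm. The string is converted to a formula object, and no comparable conversion exists for the arguments of cbind.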

Question: Is there a way to get the same result for X2 as for X using a regex?

Solution

The code grep("test\\.\\w+\\.tset", colnames(df)) returns the indices of the columns that match your pattern. To build a matrix from just those columns, you can use:

X3 <- as.matrix(df[,grep("test\\.\\w+\\.tset", colnames(df))])
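As a quick check that this approach gives the same numbers as the hand-written cbind call, here is a minimal sketch. The seed, the `idx` variable and the `drop = FALSE` argument are additions for illustration; `drop = FALSE` is only a precaution so that the selection stays a data frame even when a single column matches.

```r
# Re-creation of the example data; the seed is an addition for reproducibility
set.seed(1)
df <- data.frame(test.a.tset = rnorm(10),
                 test.b.tset = rnorm(10),
                 c = as.numeric(runif(10) >= 0.5))

# Matrix built by naming the columns explicitly, as in the question
X <- cbind(df$test.a.tset, df$test.b.tset)

# Matrix built from the regex match
idx <- grep("test\\.\\w+\\.tset", colnames(df))
X3 <- as.matrix(df[, idx, drop = FALSE])

# X3 keeps the column names, X has none; compare the numeric content only
all.equal(X, unname(X3))
colnames(X3)
```

The only difference between the two matrices should be that X3 carries the column names of the data frame, which is usually helpful when reading model output.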

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow