Question

I have an outcome variable, say Y, and a list of 20 variables that could affect Y (say X1...X20). I would like to test which variables are NOT independent of Y. To do this I want to run a univariable glm for each variable and Y (i.e. Y~X1,...,Y~X20) and then do a likelihood ratio test for each model. Finally, I would like to create a table that has the resulting p-value from the likelihood ratio test for each model.

From what I have seen, the lapply and split functions could be useful for this, but I don't really understand how they work in the examples I've seen.

This is what I tried at first:

> VarNames<-c(names(data[30:47]))
> glms<-glm(intBT~VarNames,family=binomial(logit))
Error in model.frame.default(formula = intBT ~ VarNames, drop.unused.levels = TRUE) : 
  variable lengths differ (found for 'VarNames')

I'm not sure if that was a good approach though.


Solution

It is easier to answer your questions if you provide a minimal example.

One way to go, though certainly not the most elegant, is to use paste to create the formulas as a vector of strings and then call lapply on them. The code could look like this:

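# Toy data, only to show the mechanics
example.data <- data.frame(intBT=1:10, bli=1:10, bla=1:10, blub=1:10)
var.names <- c('bli', 'bla', 'blub')

# One formula string per predictor: "intBT ~ bli", "intBT ~ bla", ...
formulas <- paste('intBT ~', var.names)
# glm defaults to family = gaussian; add family = binomial for a binary outcome
fitted.models <- lapply(formulas, glm, data=example.data)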

This gives a list of fitted models. You can then use the apply functions on fitted.models to run further tests on each model.
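As a rough sketch of that last step, the following fits one logistic model per predictor and collects a likelihood ratio p-value for each into a table. The simulated data and the names lrt.p and p.table are made up for illustration; reformulate is used here in place of paste, and anova with test = "Chisq" is one possible way to get the test, not necessarily what the answer had in mind.

```r
# Hypothetical data: a binary outcome that depends on bli only.
set.seed(1)
n <- 50
bli <- rnorm(n)
example.data <- data.frame(
  intBT = rbinom(n, 1, plogis(bli)),
  bli   = bli,
  bla   = rnorm(n),
  blub  = rnorm(n)
)
var.names <- c("bli", "bla", "blub")

# One univariable logistic model per predictor.
fitted.models <- lapply(var.names, function(v) {
  glm(reformulate(v, response = "intBT"), family = binomial, data = example.data)
})
names(fitted.models) <- var.names

# Likelihood ratio test of each model against the intercept-only model.
# In the anova table, row 1 is the null model and row 2 the added predictor;
# the p-value sits in the "Pr(>Chi)" column.
lrt.p <- sapply(fitted.models, function(m) {
  anova(m, test = "Chisq")[2, "Pr(>Chi)"]
})

# Collect the results in a table, one row per predictor.
p.table <- data.frame(variable = var.names, p.value = lrt.p, row.names = NULL)
print(p.table)
```

Because each model has a single predictor, the test against the intercept-only model is the same as the test of that predictor.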

OTHER TIPS

As Paul said, it really helps if you provide a minimal example, but I think this does what you want.

set.seed(123)
N <- 100
num_vars <- 5
# Simulate five independent standard-normal predictors X1..X5
df <- data.frame(lapply(1:num_vars, function(i) rnorm(N)))
names(df) <- paste0("X", 1:num_vars)
e <- rnorm(N)
# Binary outcome that depends only on X1 and X2
y <- as.numeric((df$X1 + df$X2 + e) > 0.5)

# Fit y ~ var and return the p-value (4th column of the coefficient table)
singlevar <- function(var, y, df) {
  model <- as.formula(paste0("y ~ ", var))
  coef(summary(glm(model, family = "binomial", data = df)))[var, 4]
}

sapply(colnames(df), singlevar, y, df)
          X1           X2           X3           X4           X5 
1.477199e-04 4.193461e-05 8.885365e-01 9.064953e-01 9.702645e-01 

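For comparison, here is the full summary of the X2 model; its Pr(>|z|) entry for X2 is the Wald p-value reported above: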

Call:
glm(formula = y ~ X2, family = "binomial", data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0674  -0.8211  -0.5296   0.9218   2.5463  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.5591     0.2375  -2.354   0.0186 *  
X2            1.2871     0.3142   4.097 4.19e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 130.68  on 99  degrees of freedom
Residual deviance: 106.24  on 98  degrees of freedom
AIC: 110.24

Number of Fisher Scoring iterations: 4
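
The question asks for a likelihood ratio test, and for a single model that p-value can be recovered from the two deviances printed in the summary. The sketch below plugs in the rounded figures from the output above, so the result is only approximate; the variable names are illustrative.

```r
# Deviances and degrees of freedom copied from the printed summary of y ~ X2.
null.dev  <- 130.68
null.df   <- 99
resid.dev <- 106.24
resid.df  <- 98

# LRT statistic: the drop in deviance from the null model to the fitted model,
# compared to a chi-squared distribution on the difference in degrees of freedom.
lrt.stat <- null.dev - resid.dev
lrt.df   <- null.df - resid.df
lrt.p    <- pchisq(lrt.stat, df = lrt.df, lower.tail = FALSE)
lrt.p
```

For a fitted glm object, the same four quantities are stored in its null.deviance, df.null, deviance and df.residual components, so the calculation can be wrapped in a function and passed to sapply in the same way as singlevar.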
Licensed under: CC-BY-SA with attribution