Question

I am trying to use the speedglm package for R to estimate regression models. In general the results match those of base R's glm function, but speedglm behaves unexpectedly when a factor level has been completely removed from a data.frame. For example, see the code below:

dat1 <- data.frame(y=rnorm(100), x1=gl(5, 20)) 
dat2 <- subset(dat1, x1!=1)

glm("y ~ x1", dat2, family="gaussian")
Coefficients:
(Intercept)          x13          x14          x15  
    -0.2497       0.6268       0.3900       0.2811 

speedglm(as.formula("y ~ x1"), dat2)
Coefficients:
(Intercept)          x12          x13          x14          x15  
    0.03145     -0.28114      0.34563      0.10887           NA 

Here the two functions deliver different results because factor level x1==1 has been deleted from dat2. Had I used dat1 instead the results would have been identical. Is there a way to make speedglm act like glm when processing data like dat2?


Solution

I think droplevels is the key.

Compare str(droplevels(dat2)) with str(dat2): even though every row with x1==1 has been removed, level 1 is still listed among the factor levels of dat2.

So speedglm(as.formula("y ~ x1"), droplevels(dat2)) should give the same result as glm("y ~ x1", dat2, family="gaussian").
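As a minimal sketch of the point above, the following base-R snippet shows that subset() leaves the empty level in place and that droplevels() removes it. The names dat3, fit_glm and fit_dropped are introduced here for illustration only, and the speedglm call is guarded because the package may not be installed.

```r
set.seed(2)
dat1 <- data.frame(y = rnorm(100), x1 = gl(5, 20))
dat2 <- subset(dat1, x1 != 1)

# subset() removes the rows but keeps level "1" in the factor
levels(dat2$x1)
table(dat2$x1)

# droplevels() removes levels that no longer occur in the data
dat3 <- droplevels(dat2)
levels(dat3$x1)

# glm() drops unused levels itself, so both fits agree
fit_glm     <- glm(y ~ x1, data = dat2, family = gaussian())
fit_dropped <- glm(y ~ x1, data = dat3, family = gaussian())
all.equal(coef(fit_glm), coef(fit_dropped))

# With the unused level gone, speedglm should report the same coefficients
if (requireNamespace("speedglm", quietly = TRUE)) {
  print(speedglm::speedglm(y ~ x1, data = dat3))
}
```

Calling droplevels() once, right after subsetting, keeps later model calls consistent regardless of which fitting function is used.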

Other tips

The default behavior for glm with a factor independent variable is to use the first non-empty level as a reference category. It appears that speedglm is treating the last level as the reference category. To get comparable results, you can use relevel in the call to glm:

 set.seed(2)
 dat1 <- data.frame(y=rnorm(100), x1=gl(5, 20)) 
 dat2 <- subset(dat1, x1!=1)
 glm(y ~ relevel(x1,"5"), dat2, family="gaussian")

 Coefficients:
   (Intercept)  relevel(x1, "5")2  relevel(x1, "5")3  relevel(x1, "5")4  
     -0.27163            0.27135            0.36688            0.09934  

speedglm(as.formula("y ~ x1"), dat2)
 Coefficients:
 (Intercept)          x12          x13          x14          x15  
     -0.27163      0.27135      0.36688      0.09934           NA  
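One possible explanation for the NA on x15, sketched with base R only: if the design matrix is built without dropping the unused level (an assumption about speedglm's internals, not something confirmed here), the intercept column equals the sum of the four dummy columns, so the matrix is rank deficient and the last dummy cannot be estimated. The names X, cf and fit_ref are hypothetical and used only for this illustration; lm.fit stands in for the fitting step.

```r
set.seed(2)
dat1 <- data.frame(y = rnorm(100), x1 = gl(5, 20))
dat2 <- subset(dat1, x1 != 1)

# model.matrix() keeps the empty level "1" as the baseline,
# so it still creates dummy columns for levels 2 to 5
X <- model.matrix(y ~ x1, data = dat2)
colnames(X)

# No row belongs to level 1, so the intercept equals the sum of the dummies
all(X[, "(Intercept)"] == rowSums(X[, -1]))
qr(X)$rank   # smaller than ncol(X)

# Least squares on the rank-deficient matrix leaves the last dummy as NA
cf <- lm.fit(X, dat2$y)$coefficients
cf

# The estimable coefficients match glm() with level 5 as the reference
fit_ref <- glm(y ~ relevel(x1, "5"), data = dat2, family = gaussian())
all.equal(unname(coef(fit_ref)), unname(cf[1:4]))
```

Under that assumption, level 5 only looks like the reference category because its dummy column is the one dropped as collinear.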
License: CC-BY-SA with attribution