Why is the standard error different in these two fitting methods (R Logistic Regression and Beta Regression) for a common dataset?

https://stackoverflow.com/questions/23204826

07-07-2023
|

Question

I am trying to understand the difference between two different fitting methods for a data set with a bounded response variable. The response variable is a fraction and therefore has a range of [0,1]. I have uncovered through my Google searching that there are a lot of different methods out there as this is a common operation. I am currently interested in the difference between the stock R GLM fit and the Beta regression offered in the betareg package. I am using the GasolineYield data set from the "betareg" package as my sample data set. Before I post the code and the results my two questions are the following:

Am I performing the Logistic Regression fit in R using the builtin R GLM correctly?
Why are the standard errors reported in the Beta regression so much smaller than the standard errors for the R logistic regression?

R Setup Code

library(betareg)
data("GasolineYield", package = "betareg")

Beta Regression code from the "betareg" package

gy = betareg(yield ~ batch + temp, data = GasolineYield)
summary(gy)

Beta Regression summary output

Call:
betareg(formula = yield ~ batch + temp, data = GasolineYield)

Standardized weighted residuals 2:
    Min      1Q  Median      3Q     Max 
-2.8750 -0.8149  0.1601  0.8384  2.0483 

Coefficients (mean model with logit link):
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -6.1595710  0.1823247 -33.784  < 2e-16 ***
batch1       1.7277289  0.1012294  17.067  < 2e-16 ***
batch2       1.3225969  0.1179020  11.218  < 2e-16 ***
batch3       1.5723099  0.1161045  13.542  < 2e-16 ***
batch4       1.0597141  0.1023598  10.353  < 2e-16 ***
batch5       1.1337518  0.1035232  10.952  < 2e-16 ***
batch6       1.0401618  0.1060365   9.809  < 2e-16 ***
batch7       0.5436922  0.1091275   4.982 6.29e-07 ***
batch8       0.4959007  0.1089257   4.553 5.30e-06 ***
batch9       0.3857930  0.1185933   3.253  0.00114 ** 
temp         0.0109669  0.0004126  26.577  < 2e-16 ***

Phi coefficients (precision model with identity link):
      Estimate Std. Error z value Pr(>|z|)    
(phi)    440.3      110.0   4.002 6.29e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Type of estimator: ML (maximum likelihood)
Log-likelihood:  84.8 on 12 Df
Pseudo R-squared: 0.9617
Number of iterations: 51 (BFGS) + 3 (Fisher scoring)

R GLM Logistic Regression code from stock R

glmfit = glm(yield ~ batch + temp, data = GasolineYield, family = "binomial")
summary(glmfit)

R GLM Logistic Regression summary output

Call:
glm(formula = yield ~ batch + temp, family = "binomial", data = GasolineYield)

Deviance Residuals: 
      Min         1Q     Median         3Q        Max  
-0.100459  -0.025272   0.004217   0.032879   0.082113  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.130227   3.831798  -1.600    0.110
batch1       1.720311   2.127205   0.809    0.419
batch2       1.305746   2.481266   0.526    0.599
batch3       1.562343   2.440712   0.640    0.522
batch4       1.048928   2.152385   0.487    0.626
batch5       1.125075   2.176242   0.517    0.605
batch6       1.029601   2.229773   0.462    0.644
batch7       0.540401   2.294474   0.236    0.814
batch8       0.497355   2.288564   0.217    0.828
batch9       0.378315   2.494881   0.152    0.879
temp         0.010906   0.008676   1.257    0.209

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2.34184  on 31  degrees of freedom
Residual deviance: 0.07046  on 21  degrees of freedom
AIC: 36.631

Number of Fisher Scoring iterations: 5

Solution

The standard errors are different because the variance assumptions in the two models are different.

Logistic regression assumes the response has a binomial distribution, while beta regression assumes it has a beta distribution.

The variance functions of the two are different. For the binomial, if you specify the mean (and $n$ is a given) the variance is determined. For the beta there's another free parameter, so it isn't determined by the mean and would presumably be estimated from the data.

This suggests that if you fit a quasibinomial GLM (adding a variance parameter) you might get closer to the same standard errors, but they still won't be the same, since they would weight the observations differently.

What you should actually do:

if your proportions are originally counts divided by some total count, then a binomial GLM would be an appropriate model to consider. (You would need the total counts, though.)
if your proportions are continuous fractions (the proportion of milk that's cream for example), then beta regression is an appropriate model to consider.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow