Dois estágios com menos estágio em r

https://stackoverflow.com/questions/3843882

27-09-2019
|

Pergunta

Quero executar uma regressão de dois estágios de dois estágios em R. Alguém sabe como fazer isso? Existe algum pacote por aí? Eu sei que é possível fazer isso usando o STATA, então imagino que seja possível fazer isso com R.

Solução

Você pode querer ser mais específico quando diz 'quadros de dois etapas de dois estágios'. Como você se refere a um programa STATA que implementa isso, acho que você está falando sobre o pacote CDSIMEQ, que implementa o procedimento de Amemiya (1978) para o modelo Heckit (também conhecido como Tobit generalizado, também conhecido como modelo Tobit tipo II, etc.). Como Grant disse, o SystemFit fará um Tobit por você, mas não com duas equações. O pacote MiqueCon teve um heckit (mas o pacote dividiu tantas vezes que não sei onde está agora).

Se você deseja o que o CDSIMEQ faz, ele pode ser facilmente implementado em R. Eu escrevi uma função que replica o CDSIMEQ:

tspls <- function(formula1, formula2, data) {
    # The Continous model
    mf1 <- model.frame(formula1, data)
    y1 <- model.response(mf1)
    x1 <- model.matrix(attr(mf1, "terms"), mf1)

    # The dicontionous model
    mf2 <- model.frame(formula2, data)
    y2 <- model.response(mf2)
    x2 <- model.matrix(attr(mf2, "terms"), mf2)

    # The matrix of all the exogenous variables
    X <- cbind(x1, x2)
    X <- X[, unique(colnames(X))]

    J1 <- matrix(0, nrow = ncol(X), ncol = ncol(x1))
    J2 <- matrix(0, nrow = ncol(X), ncol = ncol(x2))
    for (i in 1:ncol(x1)) J1[match(colnames(x1)[i], colnames(X)), i] <- 1
    for (i in 1:ncol(x2)) J2[match(colnames(x2)[i], colnames(X)), i] <- 1

    # Step 1:
    cat("\n\tNOW THE FIRST STAGE REGRESSION")
    m1 <- lm(y1 ~ X - 1)
    m2 <- glm(y2 ~ X - 1, family = binomial(link = "probit"))
    print(summary(m1))
    print(summary(m2))

    yhat1 <- m1$fitted.values
    yhat2 <- X %*% coef(m2)

    PI1 <- m1$coefficients
    PI2 <- m2$coefficients
    V0 <- vcov(m2)
    sigma1sq <- sum(m1$residuals ^ 2) / m1$df.residual
    sigma12 <- 1 / length(y2) * sum(y2 * m1$residuals / dnorm(yhat2))

    # Step 2:
    cat("\n\tNOW THE SECOND STAGE REGRESSION WITH INSTRUMENTS")

    m1 <- lm(y1 ~ yhat2 + x1 - 1)
    m2 <- glm(y2 ~ yhat1 + x2 - 1, family = binomial(link = "probit"))
    sm1 <- summary(m1)
    sm2 <- summary(m2)
    print(sm1)
    print(sm2)

    # Step  3:
    cat("\tNOW THE SECOND STAGE REGRESSION WITH CORRECTED STANDARD ERRORS\n\n")
    gamma1 <- m1$coefficients[1]
    gamma2 <- m2$coefficients[1]

    cc <- sigma1sq - 2 * gamma1 * sigma12
    dd <- gamma2 ^ 2 * sigma1sq - 2 * gamma2 * sigma12
    H <- cbind(PI2, J1)
    G <- cbind(PI1, J2)

    XX <- crossprod(X)                          # X'X
    HXXH <- solve(t(H) %*% XX %*% H)            # (H'X'XH)^(-1)
    HXXVXXH <- t(H) %*% XX %*% V0 %*% XX %*% H  # H'X'V0X'XH
    Valpha1 <- cc * HXXH + gamma1 ^ 2 * HXXH %*% HXXVXXH %*% HXXH

    GV <- t(G) %*% solve(V0)    # G'V0^(-1)
    GVG <- solve(GV %*% G)      # (G'V0^(-1)G)^(-1)
    Valpha2 <- GVG + dd * GVG %*% GV %*% solve(XX) %*% solve(V0) %*% G %*% GVG

    ans1 <- coef(sm1)
    ans2 <- coef(sm2)

    ans1[,2] <- sqrt(diag(Valpha1))
    ans2[,2] <- sqrt(diag(Valpha2))
    ans1[,3] <- ans1[,1] / ans1[,2]
    ans2[,3] <- ans2[,1] / ans2[,2]
    ans1[,4] <- 2 * pt(abs(ans1[,3]), m1$df.residual, lower.tail = FALSE)
    ans2[,4] <- 2 * pnorm(abs(ans2[,3]), lower.tail = FALSE)

    cat("Continuous:\n")
    print(ans1)
    cat("Dichotomous:\n")
    print(ans2)
}

Para comparação, podemos replicar a amostra do autor de Cdsimeq em seus artigo sobre o pacote.

> library(foreign)
> cdsimeq <- read.dta("http://www.stata-journal.com/software/sj3-2/st0038/cdsimeq.dta")
> tspls(continuous ~ exog3 + exog2 + exog1 + exog4,
+     dichotomous ~ exog1 + exog2 + exog5 + exog6 + exog7,
+     data = cdsimeq)

        NOW THE FIRST STAGE REGRESSION
Call:
lm(formula = y1 ~ X - 1)

Residuals:
      Min        1Q    Median        3Q       Max 
-1.885921 -0.438579 -0.006262  0.432156  2.133738 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
X(Intercept)  0.010752   0.020620   0.521 0.602187    
Xexog3        0.158469   0.021862   7.249 8.46e-13 ***
Xexog2       -0.009669   0.021666  -0.446 0.655488    
Xexog1        0.159955   0.021260   7.524 1.19e-13 ***
Xexog4        0.316575   0.022456  14.097  < 2e-16 ***
Xexog5        0.497207   0.021356  23.282  < 2e-16 ***
Xexog6       -0.078017   0.021755  -3.586 0.000352 ***
Xexog7        0.161177   0.022103   7.292 6.23e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.6488 on 992 degrees of freedom
Multiple R-squared: 0.5972,     Adjusted R-squared: 0.594 
F-statistic: 183.9 on 8 and 992 DF,  p-value: < 2.2e-16 


Call:
glm(formula = y2 ~ X - 1, family = binomial(link = "probit"))

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.49531  -0.59244   0.01983   0.59708   2.41810  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
X(Intercept)  0.08352    0.05280   1.582 0.113692    
Xexog3        0.21345    0.05678   3.759 0.000170 ***
Xexog2        0.21131    0.05471   3.862 0.000112 ***
Xexog1        0.45591    0.06023   7.570 3.75e-14 ***
Xexog4        0.39031    0.06173   6.322 2.57e-10 ***
Xexog5        0.75955    0.06427  11.818  < 2e-16 ***
Xexog6        0.85461    0.06831  12.510  < 2e-16 ***
Xexog7       -0.16691    0.05653  -2.953 0.003152 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1386.29  on 1000  degrees of freedom
Residual deviance:  754.14  on  992  degrees of freedom
AIC: 770.14

Number of Fisher Scoring iterations: 6


        NOW THE SECOND STAGE REGRESSION WITH INSTRUMENTS
Call:
lm(formula = y1 ~ yhat2 + x1 - 1)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.32152 -0.53160  0.04886  0.53502  2.44818 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
yhat2         0.257592   0.021451  12.009   <2e-16 ***
x1(Intercept) 0.012185   0.024809   0.491    0.623    
x1exog3       0.042520   0.026735   1.590    0.112    
x1exog2       0.011854   0.026723   0.444    0.657    
x1exog1       0.007773   0.028217   0.275    0.783    
x1exog4       0.318636   0.028311  11.255   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.7803 on 994 degrees of freedom
Multiple R-squared: 0.4163,     Adjusted R-squared: 0.4128 
F-statistic: 118.2 on 6 and 994 DF,  p-value: < 2.2e-16 


Call:
glm(formula = y2 ~ yhat1 + x2 - 1, family = binomial(link = "probit"))

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.49610  -0.58595   0.01969   0.59857   2.41281  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
yhat1          1.26287    0.16061   7.863 3.75e-15 ***
x2(Intercept)  0.07080    0.05276   1.342 0.179654    
x2exog1        0.25093    0.06466   3.880 0.000104 ***
x2exog2        0.22604    0.05389   4.194 2.74e-05 ***
x2exog5        0.12912    0.09510   1.358 0.174544    
x2exog6        0.95609    0.07172  13.331  < 2e-16 ***
x2exog7       -0.37128    0.06759  -5.493 3.94e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1386.29  on 1000  degrees of freedom
Residual deviance:  754.21  on  993  degrees of freedom
AIC: 768.21

Number of Fisher Scoring iterations: 6

        NOW THE SECOND STAGE REGRESSION WITH CORRECTED STANDARD ERRORS

Continuous:
                Estimate Std. Error    t value   Pr(>|t|)
yhat2         0.25759209  0.1043073 2.46955009 0.01369540
x1(Intercept) 0.01218500  0.1198713 0.10165068 0.91905445
x1exog3       0.04252006  0.1291588 0.32920764 0.74206810
x1exog2       0.01185438  0.1290754 0.09184073 0.92684309
x1exog1       0.00777347  0.1363643 0.05700519 0.95455252
x1exog4       0.31863627  0.1367881 2.32941597 0.02003661
Dichotomous:
                 Estimate Std. Error    z value     Pr(>|z|)
yhat1          1.26286574  0.7395166  1.7076909 0.0876937093
x2(Intercept)  0.07079775  0.2666447  0.2655134 0.7906139867
x2exog1        0.25092561  0.3126763  0.8025092 0.4222584495
x2exog2        0.22603717  0.2739307  0.8251618 0.4092797527
x2exog5        0.12911922  0.4822986  0.2677163 0.7889176766
x2exog6        0.95609385  0.2823662  3.3860070 0.0007091758
x2exog7       -0.37128221  0.3265478 -1.1369920 0.2555416141

Outras dicas

SystemFit também fará o truque.

Existem vários pacotes disponíveis em R para fazer dois mínimos quadrados de estado. aqui estão alguns

SEM: PENÍMICOS QUADRADOS DOIS ESTÁGIOS
Zelig: Link removido, não mais funcional (28.07.11)

Deixe -me saber se isso serve ao seu propósito.

Licenciado em: CC-BY-SA com atribuição

Não afiliado a StackOverflow