Insertion of column in armadillo gives different results from insertion in R before passing in RcppArmadillo, what have I done wrong?

https://stackoverflow.com/questions/20058365

31-07-2022
|

Question

Here is the cpp file:

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

using namespace Rcpp;

// [[Rcpp::export]]
List mylm1(NumericVector yr, NumericMatrix Xr) {
    int n = Xr.nrow(), k = Xr.ncol();
    arma::mat X(Xr.begin(), n, k, true);
    Rcout << X << std::endl;
    arma::mat allOne(n, 1, arma::fill::ones);

    X.insert_cols(0, allOne);
    Rcout << X << std::endl;
    arma::colvec y(yr.begin(), yr.size(), true);

    arma::colvec coef = arma::solve(X, y);
    arma::colvec resid = y - X*coef;

    double sig2 = arma::as_scalar( arma::trans(resid)*resid/(n-k) );
    arma::colvec stderrest = arma::sqrt( sig2 * arma::diagvec( arma::inv(arma::trans(X)*X)) );
    arma::colvec tval = coef / stderrest;

    return Rcpp::List::create(
        Rcpp::Named("coefficients") = coef,
        Rcpp::Named("stderr")       = stderrest,
        Rcpp::Named("tval")         = tval
    ) ;


}

// [[Rcpp::export]]
List mylm2(NumericVector yr, NumericMatrix Xr) {
    int n = Xr.nrow(), k = Xr.ncol();
    arma::mat X(Xr.begin(), n, k, false);
    Rcout << X << std::endl;
    arma::colvec y(yr.begin(), yr.size(), false);
    arma::colvec coef = arma::solve(X, y);
    arma::colvec resid = y - X*coef;
    double sig2 = arma::as_scalar( arma::trans(resid)*resid/(n-k) );
    arma::colvec stderrest = arma::sqrt( sig2 * arma::diagvec( arma::inv(arma::trans(X)*X)) );
    return Rcpp::List::create(
        Rcpp::Named("coefficients") = coef,
        Rcpp::Named("stderr")       = stderrest
    ) ;
}

Then in R:

sourceCpp('b.cpp')
set.seed(1)
x = matrix(rnorm(100), 25, 4)
y = rnorm(25)
mylm1(y, x)
mylm2(y, cbind(1, x))
summary(lm(y~x))

mylm1(y, x)
#    1.0000  -0.6265  -0.0561   0.3981   0.2914
#    1.0000   0.1836  -0.1558  -0.6120  -0.4433
#    ...
#    1.0000  -1.9894  -0.1123  -0.9341  -1.2246
#    1.0000   0.6198   0.8811  -1.2536  -0.4734
# 
# 
# $stderr
#         [,1]
# [1,] 0.16765
# [2,] 0.18009
# [3,] 0.24104
# [4,] 0.15117
# [5,] 0.21064


mylm2(y, cbind(1, x))
#    1.0000  -0.6265  -0.0561   0.3981   0.2914
#    1.0000   0.1836  -0.1558  -0.6120  -0.4433
#    ...
#    1.0000  -1.9894  -0.1123  -0.9341  -1.2246
#    1.0000   0.6198   0.8811  -1.2536  -0.4734
# 
# $stderr
#         [,1]
# [1,] 0.17179
# [2,] 0.18454
# [3,] 0.24699
# [4,] 0.15490
# [5,] 0.21584

summary(lm(y~x))

# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)   0.1122     0.1718    0.65     0.52
# x1            0.0499     0.1845    0.27     0.79
# x2            0.1076     0.2470    0.44     0.67
# x3           -0.0435     0.1549   -0.28     0.78
# x4           -0.1750     0.2158   -0.81     0.43

According to the output of Rcout, the two methods generates the same X matrix in the end, but they don't give the same stand errors, why?

La solution

Your formula for the stdandard errors is wrong. You forgot to adjust k for the added column. The fact that the matrices actually are the same should have made you a little suspicious about the rest of the code.

And I already mentioned to you in the comments to your last question that you need to treat R2 differently for the intercept and 'no intercept' cases. It is the same thing here with the standard errors, and the root cause of your problem here. In short, no problem with the matrix operations, but a problem when you extract and transform data later.

Take the hint and read the RcppArmadillo sources -- the combination of the fastLm.cpp code and the corresponding R functions is really short. Or else read up on the finer details of how to compute linear least squares and assorted analytics.

Oh, and I also have the same objection to your question title as before. RcppArmadillo is once again not at fault; your usage however is.

Licencié sous: CC-BY-SA avec attribution

Non affilié à StackOverflow