Question

Consider the following R code (which, I think, eventually calls some Fortran):

X <- 1:1000
Y <- rep(1,1000)
summary(lm(Y~X))

Why are values returned by summary? Shouldn't this model fail to fit since there is no variance in Y? More importantly, why is the model R^2 ~= .5?

Edit

I tracked the code from lm to lm.fit and can see this call:

z <- .Fortran("dqrls", qr = x, n = n, p = p, y = y, ny = ny,
   tol = as.double(tol), coefficients = mat.or.vec(p, ny), residuals = y,
   effects = y, rank = integer(1L), pivot = 1L:p, qraux = double(p),
   work = double(2 * p), PACKAGE = "base")

That is where the actual fit seems to happen. Looking at http://svn.r-project.org/R/trunk/src/appl/dqrls.f) did not help me understand what is going on, because I do not know fortran.

Was it helpful?

Solution

Statistically speaking, what should we anticipate (I'd like to say "expect", but that's a very specific term ;-))? The coefficients should be (0,1), rather than "fail to fit". The covariance of (X,Y) is assumed proportional to the variance of X, not the other way around. As X has non-zero variance, there is no problem. As the covariance is 0, the estimated coefficient for X should be 0. So, within machine tolerance, this is the answer you're getting.

There is no statistical anomaly here. There may be a statistical misunderstanding. There's also the issue of machine tolerance, but a coefficient on the order of 1E-19 is rather negligible, given the scale of the predictor and response values.

Update 1: A quick review of simple linear regression can be found on this Wikipedia page. The key thing to note is that Var(x) is in the denominator, Cov(x,y) in the numerator. In this case, the numerator is 0, the denominator is non-zero, so there is no reason to expect a NaN or NA. However, one may ask why isn't the resulting coefficient for x a 0, and that has to do with numerical precision issues of the QR decomposition.

OTHER TIPS

I believe this is simply because the QR decomposition is implemented with floating point arithmetic.

The singular.ok parameter actually refers to the design matrix (i.e. X only). Try

lm.fit(cbind(X, X), Y)

vs.

lm.fit(cbind(X, X), Y, singular.ok=F)

I agree that the problem might be of floating point. but I don't think is singularity.

If you check using solve(t(x1)%*%x1)%*%(t(x1)%*%Y) instead of QR, (t(x1)%*%x1) is not singular

use x1 = cbind(rep(1,1000,X) because lm(Y~X) includes the intercept.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top