Question

I'm getting different values of r^2 (coefficient of determination) when I try OLS fits with these two libraries and I can't quite figure out why. (Some spacing removed for your convenience)

In [1]: import pandas as pd       
In [2]: import numpy as np
In [3]: import statsmodels.api as sm
In [4]: import scipy.stats
In [5]: np.random.seed(100)
In [6]: x = np.linspace(0, 10, 100) + 5*np.random.randn(100)
In [7]: y = np.arange(100)

In [8]: slope, intercept, r, p, std_err = scipy.stats.linregress(x, y)

In [9]: r**2
Out[9]: 0.22045988449873671

In [10]: model = sm.OLS(y, x)
In [11]: est = model.fit()

In [12]: est.rsquared
Out[12]: 0.5327910685035413

What is going on here? I can't figure it out! Is there an error somewhere?


Solution

The 0.2205 comes from a model that also has an intercept term; the 0.5328 is the result you get if you remove the intercept.

Basically, one package (statsmodels, via sm.OLS) is modeling y = bx, whereas the other (scipy.stats.linregress) helpfully assumes that you would also like an intercept term (i.e. y = a + bx). [Note: The advantage of this assumption is that otherwise you would have to bind a column of ones to x every time you wanted to run a regression, or else you'd end up with a biased model.]

Check out this post for a longer discussion.

Good luck!

OTHER TIPS

This is not an answer to the original question, which has already been answered.

About R-squared in a regression without a constant.

One problem is that a regression without an intercept doesn't have the standard definition of R^2.

Essentially, R-squared as a goodness of fit measure in a model with an intercept compares the full model with the model that has only an intercept. If the full model does not have an intercept, then the standard definition of R^2 can produce weird results like negative R^2.

The conventional definition in the regression without a constant divides by the uncentered total sum of squares of the dependent variable (the sum of y^2) instead of the demeaned one (the sum of (y - mean(y))^2). The R^2 values from a regression with a constant and one without cannot really be compared in a meaningful way.
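To make the two definitions concrete, here is a rough sketch that fits the no-intercept model to the question's data by hand, using only NumPy, and computes R^2 both ways. The names (b, ssr, tss_centered, tss_uncentered) are illustrative and not taken from either library.

```python
import numpy as np

np.random.seed(100)
x = np.linspace(0, 10, 100) + 5 * np.random.randn(100)
y = np.arange(100, dtype=float)

# Least-squares slope for the no-intercept model y = b*x.
b = np.dot(x, y) / np.dot(x, x)

# Residual sum of squares of that fit.
ssr = np.sum((y - b * x) ** 2)

# Two candidate denominators for R^2.
tss_centered = np.sum((y - y.mean()) ** 2)  # demeaned total sum of squares
tss_uncentered = np.sum(y ** 2)             # raw total sum of squares

# Standard definition: can be negative when there is no intercept.
r2_centered = 1 - ssr / tss_centered
# No-constant convention: always between 0 and 1 for a least-squares fit.
r2_uncentered = 1 - ssr / tss_uncentered

print("centered R^2:  ", r2_centered)
print("uncentered R^2:", r2_uncentered)
```

For this data the centered version comes out negative, while the uncentered version should be the number that a no-constant OLS fit reports as its R^2.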

See, for example, the issue that triggered the change in statsmodels to handle R^2 "correctly" in the no-constant regression: https://github.com/statsmodels/statsmodels/issues/785

Licensed under: CC-BY-SA with attribution