Question

I'm not sure why I'm getting slightly different results for a simple OLS, depending on whether I go through panda's experimental rpy interface to do the regression in R or whether I use statsmodels in Python.

import pandas
from rpy2.robjects import r

from functools import partial

loadcsv = partial(pandas.DataFrame.from_csv,
                  index_col="seqn", parse_dates=False)

demoq = loadcsv("csv/DEMO.csv")
rxq = loadcsv("csv/quest/RXQ_RX.csv")

num_rx = {}
for seqn, num in rxq.rxd295.iteritems():
    try:
        val = int(num)
    except ValueError:
        val = 0
    num_rx[seqn] = val

series = pandas.Series(num_rx, name="num_rx")
demoq = demoq.join(series)

import pandas.rpy.common as com
df = com.convert_to_r_dataframe(demoq)
r.assign("demoq", df)
r('lmout <- lm(demoq$num_rx ~ demoq$ridageyr)')  # run the regression
r('print(summary(lmout))')  # print from R

From R, I get the following summary:

Call:
lm(formula = demoq$num_rx ~ demoq$ridageyr)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9086 -0.6908 -0.2940  0.1358 15.7003 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -0.1358216  0.0241399  -5.626 1.89e-08 ***
demoq$ridageyr  0.0358161  0.0006232  57.469  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 1.545 on 9963 degrees of freedom
Multiple R-squared: 0.249,  Adjusted R-squared: 0.2489 
F-statistic:  3303 on 1 and 9963 DF,  p-value: < 2.2e-16

Using statsmodels.api to do the OLS:

import statsmodels.api as sm
results = sm.OLS(demoq.num_rx, demoq.ridageyr).fit()
results.summary()

The results are similar to R's output but not the same:

OLS Regression Results
Adj. R-squared:  0.247
Log-Likelihood:  -18488.
No. Observations:    9965    AIC:   3.698e+04
Df Residuals:    9964    BIC:   3.698e+04
             coef   std err  t     P>|t|    [95.0% Conf. Int.]
ridageyr     0.0331  0.000   82.787    0.000        0.032 0.034

The install process is a a bit cumbersome. But, there is an ipython notebook here, that can reproduce the inconsistency.

Was it helpful?

Solution

Looks like Python does not add an intercept by default to your expression, whereas R does when you use the formula interface..

This means you did fit two different models. Try

lm( y ~ x - 1, data)

in R to exclude the intercept, or in your case and with somewhat more standard notation

lm(num_rx ~ ridageyr - 1, data=demoq)

OTHER TIPS

Note that you can still use ols from statsmodels.formula.api:

from statsmodels.formula.api import ols

results = ols('num_rx ~ ridageyr', demoq).fit()
results.summary()

I think it uses patsy in the backend to translate the formula expression, and intercept is added automatically.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top