Question

I'm on Mac OS with rpy2 version 2.3.9.

I'm trying to use the R package 'np' to fit a kernel regression model to data that I preprocess in Python. The 'np' functions are used much like R's basic lm function, yet calling them through rpy2 fails where lm works. The toy example below shows the problem. I try three regression models: lm, gam from the 'mgcv' package, and npreg from the 'np' package. I first use the mtcars data from R's datasets package, then randomly generated data in a pandas DataFrame.

import pandas as pd
import pandas.rpy.common as com
from rpy2.robjects.packages import importr
import rpy2.robjects as ro
import numpy as np

r_ds = importr('datasets')
r_stats = importr('stats')
r_mgcv = importr('mgcv')
r_np = importr('np')

mtcars = r_ds.__rdata__.fetch('mtcars')['mtcars']

All three regression methods work for mtcars:

r_stats.lm(ro.Formula('mpg ~ drat + wt'), data=mtcars)
r_mgcv.gam(ro.Formula('mpg ~ s(drat) + wt'), data=mtcars)
r_np.npreg(ro.Formula('mpg ~ drat + wt'), data=mtcars)

Then I generate a pandas DataFrame and convert it to an R data frame:

py_df = pd.DataFrame(np.random.randn(100,3), columns=['y', 'x_1', 'x_2'])
r_df = com.convert_to_r_dataframe(py_df)

Now the strange thing happens: both of

r_stats.lm(ro.Formula('y ~ x_1 + x_2'), data=r_df)
r_mgcv.gam(ro.Formula('y ~ s(x_1) + x_2'), data=r_df)

work, but

r_np.npreg(ro.Formula('y ~ x_1 + x_2'), data=r_df)

fails with the error message

Error in npregbw.default(xdat = xdat, ydat = ydat, bws = bws, ...) : 
'ydat' must be a vector
---------------------------------------------------------------------------
RRuntimeError                             Traceback (most recent call last)
<ipython-input-22-0ec6b4eeaa3b> in <module>()
----> 1 print r_base.summary(r_np.npreg(ro.Formula('y ~ x_1 + x_2'), data=r_df))

/Users/guest/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/rpy2/robjects/functions.pyc in __call__(self, *args, **kwargs)
     84                 v = kwargs.pop(k)
     85                 kwargs[r_k] = v
---> 86         return super(SignatureTranslatedFunction, self).__call__(*args, **kwargs)

/Users/guest/Library/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/rpy2/robjects/functions.pyc in __call__(self, *args, **kwargs)
     33         for k, v in kwargs.iteritems():
     34             new_kwargs[k] = conversion.py2ri(v)
---> 35         res = super(Function, self).__call__(*new_args, **new_kwargs)
     36         res = conversion.ri2py(res)
     37         return res

RRuntimeError: Error in npregbw.default(xdat = xdat, ydat = ydat, bws = bws, ...) : 
  'ydat' must be a vector

Solution

As pointed out by Ian, the problem comes partly from the conversion of pandas DataFrames to rpy2/R data frames. The other part comes from np's npreg(): other modeling functions work fine, as you noted.

There is work toward improving that in rpy2-2.4.0, so make sure you report issues (and possibly try a recent snapshot of rpy2-2.4.0-dev).

A more immediate (and simple) solution can be obtained as follows (tested with rpy2-2.3.9 and 2.4.0-dev, and with pandas 0.13.0 / R-3.0.2-patched):

# Your pandas DataFrame
py_df = pd.DataFrame(np.random.randn(100,3), columns=['y', 'x_1', 'x_2'])
r_df = com.convert_to_r_dataframe(py_df)

Each column of r_df has the R class AsIs. A simple way to see this is:

>>> [tuple(x.rclass) for x in r_df]
[('AsIs',), ('AsIs',), ('AsIs',)]

Now we just want to drop the class AsIs:

for col in r_df:
    col.rclass = None

The vectors are back to their basic type:

>>> [tuple(x.rclass) for x in r_df]
[('numeric',), ('numeric',), ('numeric',)]

The call now runs without error:

r_np.npreg(ro.Formula('y ~ x_1 + x_2'), data=r_df)
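The class check used above returns the classes without saying which column each belongs to. A minimal sketch of a helper that pairs them with the column names follows. The function name column_classes is hypothetical, and the sketch assumes the data frame exposes a names attribute and that each column exposes rclass, as in the snippets above.

```python
def column_classes(r_df):
    """Map each column name of an R data frame to its tuple of R classes.

    Assumes r_df.names lists the column names in order and that iterating
    over r_df yields columns with an rclass attribute (as used above).
    """
    return dict(zip(r_df.names, (tuple(col.rclass) for col in r_df)))
```

Calling column_classes(r_df) before and after resetting rclass makes it easy to confirm which columns were still marked AsIs.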

OTHER TIPS

Your problem is that when you pass your data frame to convert_to_r_dataframe, it sets the class attribute of each of the vectors that make up your dataframe to "AsIs":

In [56]: R = ro.r

In [57]: R["print"](R.lapply(r_df,R["class"]))
$y
[1] "AsIs"

$x_1
[1] "AsIs"

$x_2
[1] "AsIs"

whereas the function you are passing it to expects each column to be a plain vector type. Compare with what you get if you create the data frame directly in R:

In [58]: R('''r_df2 <- data.frame(y=rnorm(100), x_1=rnorm(100), x_2=rnorm(100))''')

In [59]: R["print"](R.lapply(R["r_df2"],R["class"]))
$y
[1] "numeric"

$x_1
[1] "numeric"

$x_2
[1] "numeric"

If you really need to pass in the data.frame from Python, you can alter the class like this:

In [60]: unAsIs = R('''function(x) {
                       class(x) <- "numeric"
                       return(x) 
                       } ''')

In [61]: r_df3 = R["as.data.frame"](R.lapply(r_df, unAsIs))

In [62]: R["print"](R.lapply(r_df3,R["class"]))
$y
[1] "numeric"

$x_1
[1] "numeric"

$x_2
[1] "numeric"

Be careful with this: in real life the class of your vectors might be more complex than just AsIs, and not all of the columns in your frame should necessarily be numeric. It's possible some should be character or factor.
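One way to act on that caution from the Python side is to touch only the columns whose class is exactly AsIs and leave anything more complex alone. The sketch below is an illustration, not part of either answer's tested code: the name drop_asis_only is hypothetical, and it relies on the rclass attribute and the "col.rclass = None" reset shown in the accepted solution, plus a names attribute on the data frame.

```python
def drop_asis_only(r_df):
    """Reset the class of columns whose R class is exactly ('AsIs',).

    Columns that carry AsIs together with other classes are left untouched
    and their names are returned, so they can be inspected by hand.
    Assumes r_df.names gives the column names in order and that setting
    col.rclass = None drops the class attribute, as in the solution above.
    """
    skipped = []
    for name, col in zip(r_df.names, r_df):
        classes = tuple(col.rclass)
        if classes == ('AsIs',):
            col.rclass = None        # back to the vector's basic type
        elif 'AsIs' in classes:
            skipped.append(name)     # mixed classes: do not guess
    return skipped
```

An empty return value means every AsIs column was reset. Any names returned are columns that need an explicit decision, for example whether they should be a factor or a character vector.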

However, your code above now works (remember to print or save the result):

In [63]: R["print"](r_np.npreg(ro.Formula('y ~ x_1 + x_2'), data=r_df3))

Regression Data: 100 training points, in 2 variable(s)
                    x_1     x_2
Bandwidth(s): 0.6303934 9451331

Kernel Regression Estimator: Local-Constant
Bandwidth Type: Fixed

Continuous Kernel Type: Second-Order Gaussian
No. Continuous Explanatory Vars.: 2
Licensed under: CC-BY-SA with attribution