import numpy
import rpy2
from rpy2 import robjects
import rpy2.robjects.numpy2ri

r = robjects.r
rpy2.robjects.numpy2ri.activate()

x = numpy.array( [1, 5, -99, 4, 5, 3, 7, -99, 6] )
mx = numpy.ma.masked_values( x, -99 )

print x         # works, displays all values
print r.sd(x)   # works, but uses -99 values in calculation

print mx        # works, now -99 values are masked (--)
print r.sd(mx)  # does not work - error

I am a new user of rpy2 and numpy. I am using R 2.14.1, Python 2.7.1, rpy2 2.2.5, and numpy 1.5.1 on RHEL5.

I need to read data into a numpy array and use rpy2 functions on it. However, I need to mask missing values prior to using the array with rpy2.

I have no problem masking values, but I can't get rpy2 to work with the resulting masked array. Looks like maybe the numpy2ri conversion doesn't work on masked numpy arrays? (see error below)

How can I make this work? Is it possible to tell rpy2 to ignore masked values? I'd like to stick with R rather than use scipy/numpy directly, since I'll be doing more advanced stats later.

Thanks.

Traceback (most recent call last):
  File "d.py", line 16, in <module>
    print r.sd(mx)  # does not work - error
  File "/dev/py/lib/python2.7/site-packages/rpy2-2.2.5dev_20120227-py2.7-linux-x86_64.egg/rpy2/robjects/functions.py", line 82, in __call__
    return super(SignatureTranslatedFunction, self).__call__(*args, **kwargs)
  File "/dev/py/lib/python2.7/site-packages/rpy2-2.2.5dev_20120227-py2.7-linux-x86_64.egg/rpy2/robjects/functions.py", line 30, in __call__
    new_args = [conversion.py2ri(a) for a in args]
  File "/dev/py/lib/python2.7/site-packages/rpy2-2.2.5dev_20120227-py2.7-linux-x86_64.egg/rpy2/robjects/numpy2ri.py", line 36, in numpy2ri
    vec = SexpVector(o.ravel("F"), _kinds[o.dtype.kind])
TypeError: ravel() takes exactly 1 argument (2 given)

Update: Since rpy2 can't handle masked numpy arrays, I tried converting my -99 values to numpy NaN values. Apparently rpy2 recognizes numpy NaN values as R-style NA values.

The code below works because in the r.sd() call I can tell rpy2 to not use NA values. But the initial NaN substitution is definitely slower than applying the numpy mask.

Can any of you python wizards give me a faster way to do the -99 to NaN substitution across a large numpy ndarray? Or maybe suggest another approach?

Thanks.

# 'x' is a large numpy ndarray I am working with
# ('x' in the original code above was a small test array)

for i in range(900, 950):           # random slice of numpy ndarray
  for j in range(6225):             # full extent across slice
    if x[i][j] == -99:
      x[i][j] = numpy.NaN

y = x[933]                          # random piece of converted range
sd = r.sd( y, **{'na.rm': True} )   # r.sd() call that ignores numpy NaN values
print sd

Solution

The concept of "masked values" (that is, an array of values coupled with a mask marking which entries to ignore) does not directly exist in R.

In R, values are either set to "missing" (NA), or a subset of the original data structure is taken (so a new object containing only that subset is created).

Behind the scenes, when rpy2 converts a numpy array to an rinterface object, it copies the numpy array into an R array (the other way around, exposing an R array to numpy, does not necessarily require copying). There is no reason why masks could not be handled at that stage, and this may make its way into the code base sooner if someone provides a patch. The alternative is to create a numpy array without the masked values, then pass that to rpy2.
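As a minimal sketch of that alternative, the snippet below uses numpy.ma.compressed to pull the unmasked entries out into a plain ndarray, using the small test array from the question. The helper name unmasked_values is made up for illustration, and the rpy2 call appears only as a comment because it assumes the r object and numpy2ri activation from the question.

```python
import numpy

def unmasked_values(masked):
    """Return a plain 1-D ndarray holding only the unmasked entries.

    Hypothetical helper name, used here for illustration only.
    """
    return numpy.ma.compressed(masked)

x = numpy.array([1, 5, -99, 4, 5, 3, 7, -99, 6])
mx = numpy.ma.masked_values(x, -99)

clean = unmasked_values(mx)   # plain numpy.ndarray, no mask attached

# Assuming rpy2 is set up as in the question (r = robjects.r and numpy2ri
# activated), the plain array could then be passed on, e.g. r.sd(clean).

# Cross-check in numpy alone: ddof=1 gives the sample standard deviation,
# which is the definition R's sd() uses.
sd_numpy = clean.std(ddof=1)
```

Note that the compressed array is always 1-D, so this suits per-vector statistics such as sd; for a 2-D array, the position of each value within its row is lost.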

Other tips

You can speed up replacing the -99 values with NaN by using masked arrays, which numpy provides natively in numpy.ma, as in the following code:

x_masked = numpy.ma.masked_array(x, mask=(x == -99))
x_filled = x_masked.filled(numpy.nan)   # x needs a float dtype to hold NaN

x_masked is a numpy.ma.MaskedArray (masked array); x_filled is a numpy.ndarray (regular numpy array).
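A related approach, not taken from the answer above, is boolean-index assignment, which does the same substitution without building a masked array first. The sketch below assumes the sentinel is -99 and that converting to a float copy is acceptable; the function name sentinel_to_nan is hypothetical. It converts to float explicitly because an integer array cannot store NaN.

```python
import numpy

def sentinel_to_nan(a, sentinel=-99):
    """Return a float copy of `a` with every `sentinel` entry set to NaN.

    Hypothetical helper name, used here for illustration only.
    """
    out = numpy.array(a, dtype=float)   # copy; NaN needs a float dtype
    out[out == sentinel] = numpy.nan    # vectorised, no Python-level loop
    return out

# Small integer stand-in for the large ndarray from the question.
x = numpy.array([[1, 5, -99, 4],
                 [5, 3, 7, -99]])
x_nan = sentinel_to_nan(x)
```

The original x is left untouched, and any row of x_nan can then be handed to r.sd with na.rm set, as in the question's update.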

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow