Question

I expected that scipy.stats.mstats.pearsonr for masked array inputs would give the same result as scipy.stats.pearsonr applied to the unmasked values of the input data, but it doesn't:

from pylab import randn,rand
from numpy import ma
import scipy.stats

# Normally distributed data with noise
x=ma.masked_array(randn(10000),mask=False)
y=x+randn(10000)*0.6

# Randomly mask one tenth of each of x and y
x[rand(10000)<0.1]=ma.masked
y[rand(10000)<0.1]=ma.masked

# Identify indices for which both data are unmasked
bothok=((~x.mask)*(~y.mask))

# print results of both functions, passing only the data where 
# both x and y are good to scipy.stats
print("scipy.stats.mstats.pearsonr:", scipy.stats.mstats.pearsonr(x, y)[0])
print("scipy.stats.pearsonr:", scipy.stats.pearsonr(x[bothok].data, y[bothok].data)[0])

The answer varies a little each run, but the two values differed by about 0.1 for me, and the bigger the masked fraction, the bigger the disagreement.

I noticed that if the same mask was used for both x and y, the results are the same for both functions, i.e.:

mask=rand(10000)<0.1
x[mask]=ma.masked
y[mask]=ma.masked
...

Is this a bug, or am I expected to precondition the input data to make sure the masks in both x and y are identical (surely not)?

I'm using numpy version '1.8.0' and scipy version '0.11.0b1'
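For anyone reproducing this, here is a minimal sketch of the workaround hinted at below: force the union of the two masks onto both arrays before calling either function. The seeded RNG is my addition so the run is deterministic; everything else follows the code above.

```python
import numpy as np
from numpy import ma
import scipy.stats

rng = np.random.default_rng(0)
x = ma.masked_array(rng.standard_normal(10000), mask=False)
y = x + rng.standard_normal(10000) * 0.6
x[rng.random(10000) < 0.1] = ma.masked
y[rng.random(10000) < 0.1] = ma.masked

# Workaround: give both arrays the same (combined) mask up front.
common = ma.getmaskarray(x) | ma.getmaskarray(y)
xc = ma.array(x, mask=common)
yc = ma.array(y, mask=common)

r_mstats = scipy.stats.mstats.pearsonr(xc, yc)[0]
r_plain = scipy.stats.pearsonr(xc.compressed(), yc.compressed())[0]
print(r_mstats, r_plain)  # with identical masks the two agree
```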


Solution

This looks like a bug in scipy.stats.mstats.pearsonr. It appears that the values in x and y are expected to be paired by index, so if one is masked, the other should be ignored. That is, if x and y look like (using -- for a masked value):

x = [1, --,  3,  4,  5]
y = [9,  8, --,  6,  5]

then both (--, 8) and (3, --) are to be ignored, and the result should be the same as scipy.stats.pearsonr([1, 4, 5], [9, 6, 5]).

The bug in the mstats version is that the code to compute the means of x and y does not use the common mask.

I created an issue for this on the scipy github site: https://github.com/scipy/scipy/issues/3645
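A sketch of the intended behavior, assuming the fix is simply to apply the combined mask before computing the means (the helper name is mine, not scipy's):

```python
import numpy as np

def pearsonr_common_mask(x, y):
    # Hypothetical fixed version: drop every pair where either value is
    # masked *before* computing the means, which is the step the buggy
    # mstats code skips.
    x = np.ma.asarray(x)
    y = np.ma.asarray(y)
    common = np.ma.getmaskarray(x) | np.ma.getmaskarray(y)
    xc = np.ma.array(x, mask=common).compressed()
    yc = np.ma.array(y, mask=common).compressed()
    xm = xc - xc.mean()
    ym = yc - yc.mean()
    return np.dot(xm, ym) / np.sqrt(np.dot(xm, xm) * np.dot(ym, ym))

# The example above: the pairs (--, 8) and (3, --) are dropped,
# leaving [1, 4, 5] vs [9, 6, 5].
x = np.ma.array([1, 2, 3, 4, 5], mask=[0, 1, 0, 0, 0])
y = np.ma.array([9, 8, 7, 6, 5], mask=[0, 0, 1, 0, 0])
print(pearsonr_common_mask(x, y))
```

On this toy input the surviving pairs are perfectly anticorrelated, so the result is exactly -1.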

Other tips

We have (at least) two options for missing value handling, complete case deletion and pairwise deletion.

Your use of scipy.stats.pearsonr is complete case deletion: you drop every pair where either variable has a missing value.

numpy.ma.corrcoef gives the same results.

Checking the source of scipy.stats.mstats.pearsonr, it doesn't do complete case deletion when calculating the variance or the mean.

>>> xm = x - x.mean(0)
>>> ym = y - y.mean(0)
>>> np.ma.dot(xm, ym) / np.sqrt(np.ma.dot(xm, xm) * np.ma.dot(ym, ym))
0.7731167378113557

>>> scipy.stats.mstats.pearsonr(x,y)[0]
0.77311673781135637

However, the effect of complete versus pairwise deletion on the means and standard deviations is small.

The main discrepancy seems to come from the missing correction for the different numbers of non-missing elements. Ignoring degrees-of-freedom corrections, I get

>>> np.ma.dot(xm, ym) / bothok.sum() / \
...     np.sqrt(np.ma.dot(xm, xm) / (~xm.mask).sum() * np.ma.dot(ym, ym) / (~ym.mask).sum())
0.85855728319303393

which is close to the complete case deletion case.
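The comparison above can be reproduced end to end. A sketch under the same setup as the question (seeded RNG is my addition): compute the pairwise-deletion sums the way mstats does, rescale each one by the number of elements that actually entered it, and compare against the complete-case estimate.

```python
import numpy as np
from numpy import ma

rng = np.random.default_rng(42)
x = ma.masked_array(rng.standard_normal(10000), mask=False)
y = x + rng.standard_normal(10000) * 0.6
x[rng.random(10000) < 0.1] = ma.masked
y[rng.random(10000) < 0.1] = ma.masked
bothok = ~ma.getmaskarray(x) & ~ma.getmaskarray(y)

# Pairwise-deletion centering: each variable uses only its own mask,
# which is what mstats.pearsonr effectively does.
xm = x - x.mean()
ym = y - y.mean()

# Rescale each masked dot product by the count of entries that entered it.
r_corrected = (ma.dot(xm, ym) / bothok.sum()) / np.sqrt(
    ma.dot(xm, xm) / (~ma.getmaskarray(x)).sum()
    * ma.dot(ym, ym) / (~ma.getmaskarray(y)).sum()
)

# Complete-case estimate for comparison.
r_complete = np.corrcoef(x[bothok].compressed(), y[bothok].compressed())[0, 1]
print(float(r_corrected), r_complete)
```

The two values land close together, matching the observation that the residual gap from pairwise-versus-complete means is small.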

License: CC-BY-SA with attribution
Not affiliated with StackOverflow