Question

I expected that scipy.stats.mstats.pearsonr for masked array inputs would give the same result as scipy.stats.pearsonr applied to the unmasked values of the input data, but it doesn't:

from pylab import randn,rand
from numpy import ma
import scipy.stats

# Normally distributed data with noise
x=ma.masked_array(randn(10000),mask=False)
y=x+randn(10000)*0.6

# Randomly mask one tenth of each of x and y
x[rand(10000)<0.1]=ma.masked
y[rand(10000)<0.1]=ma.masked

# Identify indices for which both data are unmasked
bothok=((~x.mask)*(~y.mask))

# print results of both functions, passing only the data where 
# both x and y are good to scipy.stats
print("scipy.stats.mstats.pearsonr:", scipy.stats.mstats.pearsonr(x, y)[0])
print("scipy.stats.pearsonr:", scipy.stats.pearsonr(x[bothok].data, y[bothok].data)[0])

The answer varies a little each run, but the two values differed by about 0.1 for me, and the bigger the masked fraction, the bigger the disagreement.

I noticed that if the same mask was used for both x and y, the results are the same for both functions, i.e.:

mask=rand(10000)<0.1
x[mask]=ma.masked
y[mask]=ma.masked
...

Is this a bug, or am I expected to precondition the input data to make sure the masks in both x and y are identical (surely not)?

I'm using numpy version '1.8.0' and scipy version '0.11.0b1'
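For anyone reproducing this, here is a minimal sketch of the workaround hinted at below: force the union of the two masks onto both arrays before calling either function. The seeded RNG is my addition so the run is deterministic; everything else follows the code above.

```python
import numpy as np
from numpy import ma
import scipy.stats

rng = np.random.default_rng(0)
x = ma.masked_array(rng.standard_normal(10000), mask=False)
y = x + rng.standard_normal(10000) * 0.6
x[rng.random(10000) < 0.1] = ma.masked
y[rng.random(10000) < 0.1] = ma.masked

# Workaround: give both arrays the same (combined) mask up front.
common = ma.getmaskarray(x) | ma.getmaskarray(y)
xc = ma.array(x, mask=common)
yc = ma.array(y, mask=common)

r_mstats = scipy.stats.mstats.pearsonr(xc, yc)[0]
r_plain = scipy.stats.pearsonr(xc.compressed(), yc.compressed())[0]
print(r_mstats, r_plain)  # with identical masks the two agree
```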


Solution

This looks like a bug in scipy.stats.mstats.pearsonr. It appears that the values in x and y are expected to be paired by index, so if one is masked, the other should be ignored. That is, if x and y look like (using -- for a masked value):

x = [1, --,  3,  4,  5]
y = [9,  8, --,  6,  5]

then both (--, 8) and (3, --) are to be ignored, and the result should be the same as scipy.stats.pearsonr([1, 4, 5], [9, 6, 5]).

The bug in the mstats version is that the code to compute the means of x and y does not use the common mask.

I created an issue for this on the scipy github site: https://github.com/scipy/scipy/issues/3645
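A sketch of the intended behavior, assuming the fix is simply to apply the combined mask before computing the means (the helper name is mine, not scipy's):

```python
import numpy as np

def pearsonr_common_mask(x, y):
    # Hypothetical fixed version: drop every pair where either value is
    # masked *before* computing the means, which is the step the buggy
    # mstats code skips.
    x = np.ma.asarray(x)
    y = np.ma.asarray(y)
    common = np.ma.getmaskarray(x) | np.ma.getmaskarray(y)
    xc = np.ma.array(x, mask=common).compressed()
    yc = np.ma.array(y, mask=common).compressed()
    xm = xc - xc.mean()
    ym = yc - yc.mean()
    return np.dot(xm, ym) / np.sqrt(np.dot(xm, xm) * np.dot(ym, ym))

# The example above: the pairs (--, 8) and (3, --) are dropped,
# leaving [1, 4, 5] vs [9, 6, 5].
x = np.ma.array([1, 2, 3, 4, 5], mask=[0, 1, 0, 0, 0])
y = np.ma.array([9, 8, 7, 6, 5], mask=[0, 0, 1, 0, 0])
print(pearsonr_common_mask(x, y))
```

On this toy input the surviving pairs are perfectly anticorrelated, so the result is exactly -1.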

Other tips

We have (at least) two options for missing value handling, complete case deletion and pairwise deletion.

Your use of scipy.stats.pearsonr is complete case deletion: you drop every pair where either variable has a missing value.

numpy.ma.corrcoef gives the same results.

Checking the source of scipy.stats.mstats.pearsonr, it doesn't do complete case deletion when calculating the variance or the mean.

>>> xm = x - x.mean(0)
>>> ym = y - y.mean(0)
>>> np.ma.dot(xm, ym) / np.sqrt(np.ma.dot(xm, xm) * np.ma.dot(ym, ym))
0.7731167378113557

>>> scipy.stats.mstats.pearsonr(x,y)[0]
0.77311673781135637

However, the effect of complete versus pairwise deletion on the means and standard deviations is small.

The main discrepancy seems to come from the missing correction for the different numbers of non-missing elements. Ignoring degrees-of-freedom corrections, I get

>>> np.ma.dot(xm, ym) / bothok.sum() / \
...     np.sqrt(np.ma.dot(xm, xm) / (~xm.mask).sum() * np.ma.dot(ym, ym) / (~ym.mask).sum())
0.85855728319303393

which is close to the complete case deletion case.
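The comparison above can be reproduced end to end. A sketch under the same setup as the question (seeded RNG is my addition): compute the pairwise-deletion sums the way mstats does, rescale each one by the number of elements that actually entered it, and compare against the complete-case estimate.

```python
import numpy as np
from numpy import ma

rng = np.random.default_rng(42)
x = ma.masked_array(rng.standard_normal(10000), mask=False)
y = x + rng.standard_normal(10000) * 0.6
x[rng.random(10000) < 0.1] = ma.masked
y[rng.random(10000) < 0.1] = ma.masked
bothok = ~ma.getmaskarray(x) & ~ma.getmaskarray(y)

# Pairwise-deletion centering: each variable uses only its own mask,
# which is what mstats.pearsonr effectively does.
xm = x - x.mean()
ym = y - y.mean()

# Rescale each masked dot product by the count of entries that entered it.
r_corrected = (ma.dot(xm, ym) / bothok.sum()) / np.sqrt(
    ma.dot(xm, xm) / (~ma.getmaskarray(x)).sum()
    * ma.dot(ym, ym) / (~ma.getmaskarray(y)).sum()
)

# Complete-case estimate for comparison.
r_complete = np.corrcoef(x[bothok].compressed(), y[bothok].compressed())[0, 1]
print(float(r_corrected), r_complete)
```

The two values land close together, matching the observation that the residual gap from pairwise-versus-complete means is small.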

License: CC-BY-SA with attribution
Not affiliated with StackOverflow