Question

For data X = [0,0,1,1,0] and Y = [1,1,0,1,1]

>> np.corrcoef(X,Y) 

returns

array([[ 1.        , -0.61237244],
       [-0.61237244,  1.        ]])

However, I cannot reproduce this result using np.var and np.cov given the equation shown in http://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html:

>> np.cov([0,0,1,1,0],[1,1,0,1,1])/np.sqrt(np.var([0,0,1,1,0])*np.var([1,1,0,1,1]))

array([[ 1.53093109, -0.76546554],
       [-0.76546554,  1.02062073]])

What's going on here?


Solution

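This happens because np.var defaults to ddof=0 (delta degrees of freedom), so it normalizes by N, whereas np.cov normalizes by N-1 by default, which corresponds to ddof=1.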

In [57]:

X = [0,0,1,1,0]
Y = [1,1,0,1,1]
np.corrcoef(X,Y) 
Out[57]:
array([[ 1.        , -0.61237244],
       [-0.61237244,  1.        ]])
In [58]:

V = np.sqrt(np.array([np.var(X, ddof=1), np.var(Y, ddof=1)])).reshape(1,-1)
np.matrix(np.cov(X,Y))
Out[58]:
matrix([[ 0.3 , -0.15],
        [-0.15,  0.2 ]])
In [59]:

np.matrix(np.cov(X,Y))/(V*V.T)
Out[59]:
matrix([[ 1.        , -0.61237244],
        [-0.61237244,  1.        ]])

Or look at it the other way, taking the variances from the diagonal of the covariance matrix:

In [70]:

V=np.diag(np.cov(X,Y)).reshape(1,-1) #the diagonal elements
In [71]:

np.matrix(np.cov(X,Y))/np.sqrt(V*V.T)
Out[71]:
matrix([[ 1.        , -0.61237244],
        [-0.61237244,  1.        ]])

What is really going on: for np.cov(m, y=None, rowvar=1, bias=0, ddof=None), when neither bias nor ddof is provided, the default normalization is by N-1, N being the number of observations. That is equivalent to a delta degrees of freedom of 1. Unfortunately, np.var(a, axis=None, dtype=None, out=None, ddof=0, keepdims=False) defaults to a delta degrees of freedom of 0, so it normalizes by N instead.
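As a rough check of the explanation above, the following sketch compares the two normalizations on the data from the question. The variable names (var_pop, var_sample, mixed, consistent) are made up for illustration. It also shows why the question's off-diagonal value came out as -0.765 rather than -0.612: mixing an N-1 covariance with N variances inflates the result by a factor of N/(N-1), which is 5/4 here.

```python
import numpy as np

X = [0, 0, 1, 1, 0]
Y = [1, 1, 0, 1, 1]
n = len(X)

cov = np.cov(X, Y)                                  # default: normalized by N - 1
var_pop = (np.var(X), np.var(Y))                    # default ddof=0: normalized by N
var_sample = (np.var(X, ddof=1), np.var(Y, ddof=1)) # ddof=1: normalized by N - 1

# The ddof=1 variances are exactly the diagonal of the covariance matrix.
print(np.allclose(np.diag(cov), var_sample))        # True

# Mixing normalizations (as in the question) versus using ddof=1 throughout.
mixed = cov[0, 1] / np.sqrt(var_pop[0] * var_pop[1])
consistent = cov[0, 1] / np.sqrt(var_sample[0] * var_sample[1])

# The mixed version is off by the factor N / (N - 1).
print(np.isclose(mixed, consistent * n / (n - 1)))  # True
```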

Whenever unsure, the safest way is to grab the diagonal elements of the covariance matrix rather than calculate var separately, to ensure consistent behavior.
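One way to follow that advice is to wrap the normalization in a small function that only ever reads the covariance matrix. This is a minimal sketch, not NumPy's own implementation; corr_from_cov is a hypothetical helper name, and it assumes a square covariance matrix with strictly positive diagonal entries.

```python
import numpy as np

def corr_from_cov(cov):
    """Hypothetical helper: R[i, j] = C[i, j] / sqrt(C[i, i] * C[j, j])."""
    cov = np.asarray(cov, dtype=float)
    std = np.sqrt(np.diag(cov))        # standard deviations from the diagonal
    return cov / np.outer(std, std)    # divide each entry by std[i] * std[j]

X = [0, 0, 1, 1, 0]
Y = [1, 1, 0, 1, 1]

R = corr_from_cov(np.cov(X, Y))
print(np.allclose(R, np.corrcoef(X, Y)))  # True
```

Because the variances come from the same matrix as the covariances, the result does not depend on which ddof was used to build it: the normalization factor cancels.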

OTHER TIPS

According to your link (http://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html) you need to be mindful of the indices...

import numpy as np

c = np.cov([0,0,1,1,0],[1,1,0,1,1])
# Each entry is c[i,j] / sqrt(c[i,i] * c[j,j])
corrcoef = [[ c[0,0]/np.sqrt(c[0,0]*c[0,0]), c[0,1]/np.sqrt(c[0,0]*c[1,1]) ],
            [ c[1,0]/np.sqrt(c[1,1]*c[0,0]), c[1,1]/np.sqrt(c[1,1]*c[1,1]) ]]

print(np.array(corrcoef))
# [[ 1.         -0.61237244]
#  [-0.61237244  1.        ]]

It's right!

Licensed under: CC-BY-SA with attribution