Question

I have the following dataset. When I calculate the Spearman correlation coefficient with scipy.stats.spearmanr, it returns 0.718182.

import pandas as pd
import numpy as np
from scipy.stats import spearmanr

df = pd.DataFrame(
    [
        [7,3],
        [6,5],
        [5,4],
        [3,2],
        [6,4],
        [8,9],
        [9,7]
    ],
    columns=['Set of A','Set of B'])

correlation, pval = spearmanr(df)
print(f'correlation={correlation:.6f}, p-value={pval:.6f}')

It returns this:

correlation=0.718182, p-value=0.069096

However, when I tried to calculate it manually:

df_rank = pd.DataFrame(
    [
        [5,2],
        [3.5,4],
        [2,4],
        [1,1],
        [3.5,4],
        [6,7],
        [7,6]
    ],
    columns=['Rank of A','Rank of B'])
cov_rank=np.cov(df_rank.iloc[:,0],df_rank.iloc[:,1])[0][1]

cov_rank/(df_rank.std().iloc[0]*df_rank.std().iloc[1])

It returns a different value.

0.7105597124064275

The two results differ from the second decimal place onward, and I do not know why.

My question is whether scipy.stats.spearmanr expects the data to be ranked already or not.

Solution

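I think you have a small error in your manual calculation. In Set of B you assigned rank 4 to the values 4, 4, and 5. The two 4s are tied for positions 3 and 4, so each should get the average rank 3.5, and the 5 should get rank 5. That makes the ranks of B 2, 5, 3.5, 1, 3.5, 7, 6. With that correction your calculation gives the same answer as scipy, 0.7181818181818181.

As for the question itself: scipy.stats.spearmanr ranks the data internally, so you pass it the raw values. Passing already-ranked data gives the same result, because ranking a set of ranks leaves them unchanged.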
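To avoid ranking by hand, the ranks can be computed programmatically. The following is a minimal sketch, assuming pandas' DataFrame.rank with method='average' is an acceptable way to assign tied ranks (it gives tied values the mean of the positions they occupy). The variable names manual, rho_raw, and rho_ranked are only illustrative.

```python
import pandas as pd
from scipy.stats import spearmanr

df = pd.DataFrame(
    [[7, 3], [6, 5], [5, 4], [3, 2], [6, 4], [8, 9], [9, 7]],
    columns=['Set of A', 'Set of B'])

# Average ranks: tied values share the mean of the positions they occupy
df_rank = df.rank(method='average')
rank_a = df_rank['Set of A'].tolist()
rank_b = df_rank['Set of B'].tolist()

# Spearman "by hand": the Pearson correlation of the two rank columns
manual = df_rank['Set of A'].corr(df_rank['Set of B'])

# spearmanr on the raw values, and on the ranks themselves
rho_raw, _ = spearmanr(df['Set of A'], df['Set of B'])
rho_ranked, _ = spearmanr(df_rank['Set of A'], df_rank['Set of B'])

print(rank_b)
print(f'manual={manual:.6f}, raw={rho_raw:.6f}, ranked={rho_ranked:.6f}')
```

All three coefficients should agree, which is a quick way to check a hand-built rank table against the library.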

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange