Python SciPy chisquare test returns different p value from Excel and LibreOffice

https://stackoverflow.com/questions/14037939

12-12-2021
|

Question

After reading a recent blog post about an application of the Poisson distribution, I tried reproducing its findings using Python's 'scipy.stats' module, as well as Excel/LibreOffice 'POISSON' and 'CHITEST' functions.

For the expected values shown in the article, I simply used:

import scipy.stats
for i in range(8):
    print(scipy.stats.poisson.pmf(i, 2)*31)

This reproduces the table shown in the blog post - and I also recreated it from within LibreOffice, using a first column A that has values 0 to 7 in cells A1, A2, ..., A8, and the simple formula '=POISSON(A1, 2, 0)*31' repeated in the first 8 lines of column B.

So far so good - now for the chi-squared p-test value:

Under LibreOffice, I just wrote down the observed values in cells C1-C8, and used '=CHITEST(C1:C8, B1:B8)' to reproduce the article's reported p-value of 0.18. Under scipy.stats however, I can't seem to reproduce this value:

import numpy as np
import scipy.stats

obs = [4, 10, 7, 5, 4, 0, 0, 1]
exp = [scipy.stats.poisson.pmf(i, 2)*31 for i in range(8)]

# we only estimated one variable (the rate of 2 killings per year via 62/31) 
# so dof will be N-1-estimates
estimates = 1
print(scipy.stats.chisquare(np.array(obs), np.array(exp), ddof=len(obs)-1-estimates))
# (10.112318133864241, 0.0014728159441179519)
# the p-test value reported is 0.00147, not 0.18...
#
# Maybe I need to aggregate categories with observations less than 5 
# (as suggested in many textbooks of statistics for chi-squared tests)?
observedAggregateLessThan5 = [14, 7, 5, 5]
expectedAggregateLessThan5 = [exp[0]+exp[1], exp[2], exp[3], sum(exp[4:])]
print(scipy.stats.chisquare(np.array(observedAggregateLessThan5), np.array(expectedAggregateLessThan5), ddof=len(observedAggregateLessThan5)-1-estimates))
# (0.53561749342466913, 0.46425467595930309)
# Again the p-test value computed is not 0.18, it is 0.46...

What am I doing wrong?

Solution

You are not using the ddof argument correctly. ddof is the change to make to the default degrees of freedom. The default is one less than the length. So you do not have to specify ddof at all:

In [21]: obs
Out[21]: [4, 10, 7, 5, 4, 0, 0, 1]

In [22]: exp
Out[22]: 
[4.1953937803349941,
 8.3907875606699882,
 8.3907875606699882,
 5.5938583737799901,
 2.796929186889995,
 1.1187716747559984,
 0.37292389158533251,
 0.10654968331009501]

In [23]: chisquare(obs, f_exp=array(exp))
Out[23]: (10.112318133864241, 0.1822973566091409)

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow