문제

After reading a recent blog post about an application of the Poisson distribution, I tried reproducing its findings using Python's 'scipy.stats' module, as well as Excel/LibreOffice 'POISSON' and 'CHITEST' functions.

For the expected values shown in the article, I simply used:

import scipy.stats
for i in range(8):
    print(scipy.stats.poisson.pmf(i, 2)*31)

This reproduces the table shown in the blog post - and I also recreated it from within LibreOffice, using a first column A that has values 0 to 7 in cells A1, A2, ..., A8, and the simple formula '=POISSON(A1, 2, 0)*31' repeated in the first 8 lines of column B.

So far so good - now for the chi-squared p-test value:

Under LibreOffice, I just wrote down the observed values in cells C1-C8, and used '=CHITEST(C1:C8, B1:B8)' to reproduce the article's reported p-value of 0.18. Under scipy.stats however, I can't seem to reproduce this value:

import numpy as np
import scipy.stats

obs = [4, 10, 7, 5, 4, 0, 0, 1]
exp = [scipy.stats.poisson.pmf(i, 2)*31 for i in range(8)]

# we only estimated one variable (the rate of 2 killings per year via 62/31) 
# so dof will be N-1-estimates
estimates = 1
print(scipy.stats.chisquare(np.array(obs), np.array(exp), ddof=len(obs)-1-estimates))
# (10.112318133864241, 0.0014728159441179519)
# the p-test value reported is 0.00147, not 0.18...
#
# Maybe I need to aggregate categories with observations less than 5 
# (as suggested in many textbooks of statistics for chi-squared tests)?
observedAggregateLessThan5 = [14, 7, 5, 5]
expectedAggregateLessThan5 = [exp[0]+exp[1], exp[2], exp[3], sum(exp[4:])]
print(scipy.stats.chisquare(np.array(observedAggregateLessThan5), np.array(expectedAggregateLessThan5), ddof=len(observedAggregateLessThan5)-1-estimates))
# (0.53561749342466913, 0.46425467595930309)
# Again the p-test value computed is not 0.18, it is 0.46...

What am I doing wrong?

도움이 되었습니까?

해결책

You are not using the ddof argument correctly. ddof is the change to make to the default degrees of freedom. The default is one less than the length. So you do not have to specify ddof at all:

In [21]: obs
Out[21]: [4, 10, 7, 5, 4, 0, 0, 1]

In [22]: exp
Out[22]: 
[4.1953937803349941,
 8.3907875606699882,
 8.3907875606699882,
 5.5938583737799901,
 2.796929186889995,
 1.1187716747559984,
 0.37292389158533251,
 0.10654968331009501]

In [23]: chisquare(obs, f_exp=array(exp))
Out[23]: (10.112318133864241, 0.1822973566091409)
라이센스 : CC-BY-SA ~와 함께 속성
제휴하지 않습니다 StackOverflow
scroll top