P-value from Chi sq test statistic in Python

https://stackoverflow.com/questions/11725115

23-06-2021
|

Question

I have computed a test statistic that is distributed as a chi square with 1 degree of freedom, and want to find out what P-value this corresponds to using python.

I'm a python and maths/stats newbie so I think what I want here is the probability denisty function for the chi2 distribution from SciPy. However, when I use this like so:

from scipy import stats
stats.chi2.pdf(3.84 , 1)
0.029846

However some googling and talking to some colleagues who know maths but not python have said it should be 0.05.

Any ideas? Cheers, Davy

Solution

Quick refresher here:

Probability Density Function: think of it as a point value; how dense is the probability at a given point?

Cumulative Distribution Function: this is the mass of probability of the function up to a given point; what percentage of the distribution lies on one side of this point?

In your case, you took the PDF, for which you got the correct answer. If you try 1 - CDF:

>>> 1 - stats.chi2.cdf(3.84, 1)
0.050043521248705147

PDF CDF

OTHER TIPS

To calculate probability of null hypothesis given chisquared sum, and degrees of freedom you can also call chisqprob:

>>> from scipy.stats import chisqprob
>>> chisqprob(3.84, 1)
0.050043521248705189

Notice:

chisqprob is deprecated! stats.chisqprob is deprecated in scipy 0.17.0; use stats.distributions.chi2.sf instead

Update: as noted, chisqprob() is deprecated for scipy version 0.17.0 onwards. High accuracy chi-square values can now be obtained via scipy.stats.distributions.chi2.sf(), for example:

>>>from scipy.stats.distributions import chi2
>>>chi2.sf(3.84,1)
0.050043521248705189
>>>chi2.sf(1424,1)
1.2799986253099803e-311

While stats.chisqprob() and 1-stats.chi2.cdf() appear comparable for small chi-square values, for large chi-square values the former is preferable. The latter cannot provide a p-value smaller than machine epsilon,and will give very inaccurate answers close to machine epsilon. As shown by others, comparable values result for small chi-squared values with the two methods:

>>>from scipy.stats import chisqprob, chi2
>>>chisqprob(3.84,1)
0.050043521248705189
>>>1 - chi2.cdf(3.84,1)
0.050043521248705147

Using 1-chi2.cdf() breaks down here:

>>>1 - chi2.cdf(67,1)
2.2204460492503131e-16
>>>1 - chi2.cdf(68,1)
1.1102230246251565e-16
>>>1 - chi2.cdf(69,1)
1.1102230246251565e-16
>>>1 - chi2.cdf(70,1)
0.0

Whereas chisqprob() gives you accurate results for a much larger range of chi-square values, producing p-values nearly as small as the smallest float greater than zero, until it too underflows:

>>>chisqprob(67,1)
2.7150713219425247e-16
>>>chisqprob(68,1)
1.6349553217245471e-16
>>>chisqprob(69,1)
9.8463440314253303e-17    
>>>chisqprob(70,1)
5.9304458500824782e-17
>>>chisqprob(500,1)
9.505397766554137e-111
>>>chisqprob(1000,1)
1.7958327848007363e-219
>>>chisqprob(1424,1)
1.2799986253099803e-311
>>>chisqprob(1425,1)
0.0

You meant to do:

>>> 1 - stats.chi2.cdf(3.84, 1)
0.050043521248705147

Some of the other solutions are deprecated. Use scipy.stats.chi2 Survival Function. Which is the same as 1 - cdf(chi_statistic, df)

Example:

from scipy.stats import chi2
p_value = chi2.sf(chi_statistic, df)

If you want to understand the math, the p-value of a sample, x (fixed), is

P[P(X) <= P(x)] = P[m(X) >= m(x)] = 1 - G(m(x)^2)

where,

P is the probability of a (say k-variate) normal distribution w/ known covariance (cov) and mean,
X is a random variable from that normal distribution,
m(x) is the mahalanobis distance = sqrt( < cov^{-1} (x-mean), x-mean >. Note that in 1-d this is just the absolute value of the z-score.
G is the CDF of the chi^2 distribution w/ k degrees of freedom.

So if you're computing the p-value of a fixed observation, x, then you compute m(x) (generalized z-score), and 1-G(m(x)^2).

for example, it's well known that if x is sampled from a univariate (k = 1) normal distribution and has z-score = 2 (it's 2 standard deviations from the mean), then the p-value is about .046 (see a z-score table)

In [7]: from scipy.stats import chi2

In [8]: k = 1

In [9]: z = 2

In [10]: 1-chi2.cdf(z**2, k)
Out[10]: 0.045500263896358528

For ultra-high precision, when scipy's chi2.sf() isn't enough, bring out the big guns:

>>> import numpy as np
>>> from rpy2.robjects import r
>>> np.exp(np.longdouble(r.pchisq(19000, 2, lower_tail=False, log_p=True)[0]))
1.5937563168532229629e-4126

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow