Question

I have computed a test statistic that follows a chi-square distribution with 1 degree of freedom, and want to find out what p-value it corresponds to using Python.

I'm a Python and maths/stats newbie, so I think what I want here is the probability density function for the chi2 distribution from SciPy. However, when I use it like so, I get:

from scipy import stats
stats.chi2.pdf(3.84 , 1)
0.029846

Some googling, and some colleagues who know maths but not Python, say it should be 0.05.

Any ideas? Cheers, Davy


Solution

Quick refresher here:

Probability Density Function: think of it as a point value; how dense is the probability at a given point?

Cumulative Distribution Function: this is the mass of probability of the function up to a given point; what percentage of the distribution lies on one side of this point?

In your case, you evaluated the PDF, and the value you got is correct for the PDF. The p-value, however, is the upper-tail probability, which is 1 - CDF:

>>> 1 - stats.chi2.cdf(3.84, 1)
0.050043521248705147
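
To make the distinction concrete, here is a minimal sketch that evaluates all three quantities at the same point. The variable names are illustrative only; chi2.isf() is the inverse of the survival function and shows where the familiar 3.84 cutoff comes from.

```python
from scipy import stats

x, df = 3.84, 1  # test statistic and degrees of freedom from the question

density = stats.chi2.pdf(x, df)      # height of the density curve at x (not a p-value)
lower_tail = stats.chi2.cdf(x, df)   # probability mass to the left of x
upper_tail = stats.chi2.sf(x, df)    # probability mass to the right of x (the p-value)

# Statistic whose upper-tail probability is exactly 0.05
critical_value = stats.chi2.isf(0.05, df)

print(density, lower_tail, upper_tail, critical_value)
```

The density is roughly 0.03, while the upper tail is roughly 0.05, which is the number your colleagues expected.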


OTHER TIPS

To calculate the p-value (the probability, under the null hypothesis, of a statistic at least this large) from the chi-squared statistic and the degrees of freedom, you can also call chisqprob:

>>> from scipy.stats import chisqprob
>>> chisqprob(3.84, 1)
0.050043521248705189

Notice:

chisqprob is deprecated! stats.chisqprob is deprecated in scipy 0.17.0; use stats.distributions.chi2.sf instead

Update: as noted, chisqprob() is deprecated for scipy version 0.17.0 onwards. High accuracy chi-square values can now be obtained via scipy.stats.distributions.chi2.sf(), for example:

>>> from scipy.stats.distributions import chi2
>>> chi2.sf(3.84, 1)
0.050043521248705189
>>> chi2.sf(1424, 1)
1.2799986253099803e-311

While stats.chisqprob() and 1 - stats.chi2.cdf() appear comparable for small chi-square values, for large chi-square values the former is preferable. The latter cannot provide a p-value smaller than machine epsilon, and gives very inaccurate answers close to machine epsilon. As shown by others, the two methods give comparable results for small chi-squared values:

>>> from scipy.stats import chisqprob, chi2
>>> chisqprob(3.84, 1)
0.050043521248705189
>>> 1 - chi2.cdf(3.84, 1)
0.050043521248705147

Using 1-chi2.cdf() breaks down here:

>>> 1 - chi2.cdf(67, 1)
2.2204460492503131e-16
>>> 1 - chi2.cdf(68, 1)
1.1102230246251565e-16
>>> 1 - chi2.cdf(69, 1)
1.1102230246251565e-16
>>> 1 - chi2.cdf(70, 1)
0.0

Whereas chisqprob() gives you accurate results for a much larger range of chi-square values, producing p-values nearly as small as the smallest float greater than zero, until it too underflows:

>>> chisqprob(67, 1)
2.7150713219425247e-16
>>> chisqprob(68, 1)
1.6349553217245471e-16
>>> chisqprob(69, 1)
9.8463440314253303e-17
>>> chisqprob(70, 1)
5.9304458500824782e-17
>>> chisqprob(500, 1)
9.505397766554137e-111
>>> chisqprob(1000, 1)
1.7958327848007363e-219
>>> chisqprob(1424, 1)
1.2799986253099803e-311
>>> chisqprob(1425, 1)
0.0
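
On SciPy installations where chisqprob can no longer be imported, the same comparison can be rerun with chi2.sf(). The following is only a sketch: compare_tails is a made-up helper name, and the exact digits printed may vary between SciPy versions.

```python
from scipy.stats import chi2

def compare_tails(x, df=1):
    """Return (sf, 1 - cdf) for a chi-square statistic x."""
    return chi2.sf(x, df), 1.0 - chi2.cdf(x, df)

for x in (3.84, 67, 100, 1000):
    direct, subtracted = compare_tails(x)
    # For large x the subtraction loses all significant digits,
    # while sf() still returns a small positive number.
    print(f"x={x}: sf={direct:.6g}  1-cdf={subtracted:.6g}")
```

The subtraction fails because cdf() rounds to exactly 1.0 once the tail is far below machine epsilon, so there is nothing left to subtract from.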

You meant to do:

>>> 1 - stats.chi2.cdf(3.84, 1)
0.050043521248705147

Some of the other solutions are deprecated. Use the survival function of scipy.stats.chi2, which computes the same quantity as 1 - cdf(chi_statistic, df).

Example:

from scipy.stats import chi2

chi_statistic, df = 3.84, 1  # example values taken from the question
p_value = chi2.sf(chi_statistic, df)  # upper-tail probability

If you want to understand the math, the p-value of a sample, x (fixed), is

P[P(X) <= P(x)] = P[m(X) >= m(x)] = 1 - G(m(x)^2)

where,

  • P is the probability density of a (say k-variate) normal distribution w/ known covariance (cov) and mean (the outer P[...] denotes a probability),
  • X is a random variable from that normal distribution,
  • m(x) is the Mahalanobis distance = sqrt( < cov^{-1} (x-mean), x-mean > ). Note that in 1-d this is just the absolute value of the z-score.
  • G is the CDF of the chi^2 distribution w/ k degrees of freedom.

So if you're computing the p-value of a fixed observation, x, then you compute m(x) (generalized z-score), and 1-G(m(x)^2).

For example, it's well known that if x is sampled from a univariate (k = 1) normal distribution and has z-score = 2 (it's 2 standard deviations from the mean), then the two-sided p-value is about .046 (see a z-score table):

In [7]: from scipy.stats import chi2

In [8]: k = 1

In [9]: z = 2

In [10]: 1-chi2.cdf(z**2, k)
Out[10]: 0.045500263896358528
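
The same recipe can be sketched for k > 1. The helper name mahalanobis_p_value and the test points below are illustrative assumptions, not part of any library; the function solves a linear system instead of inverting cov explicitly and uses chi2.sf() in place of 1 - cdf.

```python
import numpy as np
from scipy.stats import chi2, norm

def mahalanobis_p_value(x, mean, cov):
    """p-value of observation x under N(mean, cov): 1 - G(m(x)^2), G = chi2 CDF with k dof."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    mean = np.atleast_1d(np.asarray(mean, dtype=float))
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    diff = x - mean
    # Squared Mahalanobis distance: <cov^{-1} (x - mean), x - mean>
    m_squared = float(diff @ np.linalg.solve(cov, diff))
    return chi2.sf(m_squared, df=diff.size)

# k = 1: agrees with the two-sided normal tail for z = 2
p_1d = mahalanobis_p_value(2.0, 0.0, 1.0)
two_sided = 2 * norm.sf(2.0)

# k = 2: standard bivariate normal, point at distance 2 from the mean
p_2d = mahalanobis_p_value([2.0, 0.0], [0.0, 0.0], np.eye(2))

print(p_1d, two_sided, p_2d)
```

Note that the same distance of 2 gives a larger p-value in two dimensions than in one, because the chi-square distribution with more degrees of freedom puts more mass at larger values.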

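For p-values so small that even scipy's chi2.sf() underflows to zero, bring out the big guns: compute the log of the p-value in R (via rpy2) and exponentiate in extended precision, on platforms where np.longdouble is wider than a double: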

>>> import numpy as np
>>> from rpy2.robjects import r
>>> np.exp(np.longdouble(r.pchisq(19000, 2, lower_tail=False, log_p=True)[0]))
1.5937563168532229629e-4126
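
If installing R is not an option, the logarithm of the p-value can sometimes be computed directly in SciPy. The sketch below covers only two special cases and rests on standard identities rather than on any dedicated SciPy routine: for 1 degree of freedom the upper tail equals 2 * Phi(-sqrt(x)), with Phi the standard normal CDF, and for 2 degrees of freedom it equals exp(-x / 2). All function names are made up for illustration.

```python
import math
from scipy.special import log_ndtr
from scipy.stats import chi2

def log_sf_chi2_df1(x):
    """Natural log of the chi-square upper tail, 1 degree of freedom only."""
    # sf(x) = 2 * Phi(-sqrt(x)); log_ndtr stays finite far into the tail
    return math.log(2.0) + float(log_ndtr(-math.sqrt(x)))

def log_sf_chi2_df2(x):
    """Natural log of the chi-square upper tail, 2 degrees of freedom only."""
    return -x / 2.0

def as_mantissa_exponent(log_p):
    """Convert a natural-log probability to (mantissa, base-10 exponent)."""
    log10_p = log_p / math.log(10.0)
    exponent = math.floor(log10_p)
    return 10.0 ** (log10_p - exponent), exponent

# Sanity check against chi2.sf() where it has not underflowed
print(math.exp(log_sf_chi2_df1(3.84)), chi2.sf(3.84, 1))

# Far beyond the range of a double
print(as_mantissa_exponent(log_sf_chi2_df1(2000)))
print(as_mantissa_exponent(log_sf_chi2_df2(19000)))
```

Reporting the result as a mantissa and a base-10 exponent avoids ever forming the tiny number itself, so no extended-precision type is needed.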
Licensed under: CC-BY-SA with attribution