Question

If small p-values are plentiful in big data, what is a comparable replacement for p-values in data with millions of samples?


Solution

There is no replacement in the strict sense of the word. Instead, you should look at other measures.

The other measures you look at depend on what type of problem you are solving. In general, if you have a small p-value, also consider the magnitude of the effect size: a result may be highly statistically significant yet practically meaningless. It is also helpful to report a confidence interval for the effect size.
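As a minimal illustration of that point (simulated data, using NumPy and SciPy), here is a two-sample t-test where a huge sample size makes a practically negligible mean difference of 0.01 "highly significant":

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two huge samples whose true means differ by a practically negligible 0.01.
a = rng.normal(loc=0.00, scale=1.0, size=1_000_000)
b = rng.normal(loc=0.01, scale=1.0, size=1_000_000)

t, p = stats.ttest_ind(a, b)

# Effect size (Cohen's d): mean difference in units of the pooled std dev.
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
d = (b.mean() - a.mean()) / pooled_sd

# 95% confidence interval for the mean difference (normal approximation).
diff = b.mean() - a.mean()
se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
print(f"p = {p:.1e}, Cohen's d = {d:.3f}, "
      f"95% CI = ({diff - 1.96 * se:.4f}, {diff + 1.96 * se:.4f})")
```

The p-value comes out vanishingly small, while Cohen's d stays around 0.01, which is exactly the "significant but meaningless" situation described above.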

I would consider this paper, as mentioned in DanC's answer to this question.

OTHER TIPS

See also When are p-values deceptive?

When there are a lot of variables that can be tested for pair-wise correlation (for example), the replacement is to use one of the corrections for the false discovery rate (to limit the probability that any given discovery is false) or the familywise error rate (to limit the probability of one or more false discoveries). For example, you might use the Holm–Bonferroni method.
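As a short sketch of what this looks like in practice, statsmodels can apply both kinds of correction to a vector of raw p-values (the p-values below are made up for illustration):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Made-up raw p-values from a batch of pairwise tests.
pvals = np.array([0.0001, 0.004, 0.019, 0.030, 0.047, 0.120, 0.440, 0.680])

# Familywise error rate control via Holm–Bonferroni.
reject_holm, p_holm, _, _ = multipletests(pvals, alpha=0.05, method="holm")

# False discovery rate control via Benjamini–Hochberg.
reject_fdr, p_fdr, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

print("Holm rejects:", reject_holm)
print("BH rejects:  ", reject_fdr)
```

The FDR correction will typically reject more hypotheses than Holm–Bonferroni, reflecting its weaker (per-discovery rather than familywise) guarantee.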

In the case of a large sample rather than a lot of variables, something else is needed. As Christopher said, magnitude of effect is a way to treat this. Combining these two ideas, you might use a confidence interval around your magnitude of effect and apply a false discovery rate correction to the p-value of the confidence interval. The effects for which even the lower bound of the corrected confidence interval is high are likely to be strong effects, regardless of the huge data set size. I am not aware of any published paper that combines confidence intervals with false discovery rate correction in this way, but it seems like a straightforward and intuitively understandable approach.
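Since this combination is a suggestion rather than an established procedure, the following is only a rough sketch of one way it might look: apply a Benjamini–Hochberg correction to the p-values, then screen the surviving effects by the lower bound of their confidence intervals. The data are simulated, and the 0.1 practical-significance threshold is hypothetical:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)

# 1,000 candidate effects, ~5% of them real (true mean 0.2), each
# measured with 10,000 observations.
n_effects, n_obs = 1_000, 10_000
true_means = np.where(rng.random(n_effects) < 0.05, 0.2, 0.0)
samples = rng.normal(true_means[:, None], 1.0, size=(n_effects, n_obs))

means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n_obs)
pvals = 2 * stats.norm.sf(np.abs(means / ses))

# Step 1: FDR correction on the p-values.
reject, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

# Step 2: among the survivors, keep only effects whose 95% CI lower
# bound clears a practical-significance threshold (0.1 is hypothetical).
lower = means - 1.96 * ses
strong = reject & (lower > 0.1)
print(f"{reject.sum()} pass FDR; {strong.sum()} are also practically strong")
```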

To make this even better, use a non-parametric way to estimate confidence intervals. Assuming a distribution is likely to give very optimistic estimates here, and even fitting a distribution to the data is likely to be inaccurate. Since the information about the shape of the distribution past the edges of the confidence interval comes from a relatively small subsample of the data, this is where it really pays to be careful. You can use bootstrapping to get a non-parametric confidence interval.
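For example, a simple percentile bootstrap of the mean needs only NumPy (the skewed sample below is simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.lognormal(mean=0.0, sigma=1.0, size=5_000)  # skewed sample

# Percentile bootstrap: resample with replacement, recompute the statistic,
# and take the empirical 2.5th and 97.5th percentiles.
n_boot = 10_000
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(n_boot)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: ({lo:.3f}, {hi:.3f})")
```

If SciPy is available, scipy.stats.bootstrap implements the same idea along with more refined variants such as BCa intervals.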
