p-value.info: What's the significance of 0.05 significance?

Sunday, January 6, 2013


Why do we tend to use a statistical significance level of 0.05? When I teach statistics or mentor colleagues brushing up, I often get the sense that a statistical significance level of α = 0.05 is viewed as some hard and fast threshold, a publishable / not publishable step function. I've seen grad students finish up an empirical experiment and groan to find that p = 0.052. Depressed, they head for the pub. I've seen the same grad students extend their experiment just long enough for statistical variation to swing in their favor to obtain p = 0.049. Happy, they head for the pub.
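To see why extending an experiment until p dips below 0.05 is a problem, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: a two-sided z-test with a known standard deviation of 1, a null hypothesis that is actually true, interim looks every 10 observations from n = 20 to n = 100, and the hypothetical helper names `is_significant` and `simulate`.

```python
import random
from statistics import NormalDist, fmean

ALPHA = 0.05
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA / 2)  # about 1.96

def is_significant(sample):
    """Two-sided z-test of mean 0, assuming a known standard deviation of 1."""
    z = fmean(sample) * len(sample) ** 0.5
    return abs(z) > Z_CRIT

def simulate(n_experiments=2000, start=20, step=10, max_n=100, seed=1):
    rng = random.Random(seed)
    fixed_hits = 0    # significant when tested once, at max_n
    peeking_hits = 0  # significant at any of the interim looks
    for _ in range(n_experiments):
        # The null hypothesis is true: every observation is pure noise.
        data = [rng.gauss(0, 1) for _ in range(max_n)]
        looks = range(start, max_n + 1, step)
        peeking_hits += any(is_significant(data[:n]) for n in looks)
        fixed_hits += is_significant(data)
    return fixed_hits / n_experiments, peeking_hits / n_experiments

fixed_rate, peeking_rate = simulate()
print(f"test once at n=100:       {fixed_rate:.3f}")
print(f"stop at first p < 0.05:   {peeking_rate:.3f}")
```

The single test at n = 100 should reject the true null about 5% of the time. The stop-when-significant strategy can never do better than that, because its last look is that same test, and with this many looks it should land noticeably above 5%.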

Clearly, 0.05 is not the only significance level used. 0.1, 0.01 and some smaller values are common too. The choice depends partly on the field. In my experience, the ecological literature and other fields that are often plagued by small sample sizes are more likely to use 0.1. Engineering and manufacturing, where larger samples are easier to obtain, tend to use 0.01. Most people in most fields, however, use 0.05. It is indeed the default value in most statistical software applications.

This "standard" 0.05 level is typically associated with Sir R. A. Fisher, a brilliant biologist and statistician who pioneered many areas of statistics, including ANOVA and experimental design. However, the true origins make for a much richer story.

Let's start, however, with Fisher's contribution. In Statistical Methods for Research Workers (1925), he states

The value for which P=0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation ought to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant. Using this criterion we should be led to follow up a false indication only once in 22 trials, even if the statistics were the only guide available. Small effects will still escape notice if the data are insufficiently numerous to bring them out, but no lowering of the standard of significance would meet this difficulty.
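The two figures in this passage, 1 in 20 and 1 in 22, can be checked against the normal distribution. The sketch below uses only Python's standard library and assumes a two-sided test; the helper name `two_sided_tail` is made up for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def two_sided_tail(z: float) -> float:
    """Probability that a standard normal deviate falls outside +/- z."""
    return 2 * (1 - std_normal.cdf(z))

for z in (1.96, 2.0):
    p = two_sided_tail(z)
    print(f"z = {z:.2f}: P = {p:.4f}, about 1 in {1 / p:.1f}")
# z = 1.96: P = 0.0500, about 1 in 20.0
# z = 2.00: P = 0.0455, about 1 in 22.0
```

So "1 in 20" belongs to the exact 1.96 cutoff, while "once in 22 trials" is what the rounder "twice the standard deviation" rule actually gives.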

The next year he states, somewhat loosely,

... it is convenient to draw the line at about the level at which we can say: "Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials."...

If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent point), or one in a hundred (the 1 per cent point). Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.

(See http://www.jerrydallal.com/LHSP/p05.htm)

And there you have it. With no theoretical justification, these few sentences drove the standard significance level that we use to this day.

Fisher was not the first to think about this, but he was the first to reframe it as a probability in this manner and the first to state the 0.05 value explicitly.

Those two z-values in the first quote, however, hint at a longer history and at the basis of the different significance levels that we know and love. Cowles & Davis (1982), in "On the Origins of the .05 Level of Statistical Significance", describe a fascinating extended history that reads like a Who's Who of statistical luminaries: De Moivre, Pearson, Gosset (Student), Laplace, Gauss and others.

Our story really begins in 1818 with Bessel, who coined the term "probable error" (well, at least the equivalent in German). Probable error is the semi-interquartile range. That is, ±1PE contains the central 50% of values and, for a normal distribution, is roughly 2/3 of a standard deviation. So, for a uniform distribution ±2PE contains all values, but for a standard normal it contains only the central 82% of values. Finally, and crucially to our story,

±3PE contains the central ~95% of values, and 1 - 0.95 = 0.05.

People like Quetelet and Galton had tended to express variation or errors outside some typical range in terms of ±3PE, even after Pearson coined the term standard deviation.
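The probable-error figures can be reproduced with a few lines of standard-library Python. This is only a sketch under the assumption of a normal distribution; `PE` and `central_coverage` are names chosen here for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()

# One probable error in standard-deviation units: the 75th percentile,
# so that +/- 1 PE brackets the central 50% of a normal distribution.
PE = std_normal.inv_cdf(0.75)

def central_coverage(k: float) -> float:
    """Fraction of a normal distribution lying within +/- k probable errors."""
    z = k * PE
    return std_normal.cdf(z) - std_normal.cdf(-z)

print(f"1 PE = {PE:.4f} standard deviations")
for k in (1, 2, 3):
    print(f"+/- {k} PE = +/- {k * PE:.3f} sd, covers {central_coverage(k):.1%}")
```

One PE comes out at about 0.6745 standard deviations, so ±3PE is roughly ±2.02 standard deviations and covers a little under 96% of values. That leaves a bit more than 4% in the tails, close to the 1 in 22 that Fisher quotes for twice the standard deviation.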

There you have the basis of 0.05 significance: ±3PE was in common use in the late 1890s, and this translates to roughly 0.05. For most people, 1 in 20 is easier to interpret than a z value of 2 or a multiple of PE (Cowles & Davis, 1982), which explains why 0.05 became more popular.

In one paper from the 1890s, Pearson remarks on different p-values obtained as

p = 0.5586 --- "thus we may consider the fit remarkably good"

p = 0.28 --- "fairly represented"

p = 0.1 --- "not very improbable that the observed frequencies are compatible with a random sampling"

p = 0.01 --- "this very improbable result"

and here we see the start of different significance levels. 0.1 is not very improbable and 0.01 is very improbable. 0.05 rests between the two.

Despite this, ±3PE continued to be used as the primary criterion up to the 1920s and is still used in some fields today, especially in physics. It was Fisher who rounded off the probability to 0.05, which in turn shifted the criterion from a clean ±2σ to ±1.96σ.

In summary, ±3PE --> ±2σ --> ±1.96σ --> α = 0.05 more accurately describes the evolution of statistical significance.

14 comments:

Guardian, September 23, 2013 at 8:54 PM
Carl,
I am trying to understand p<0.01 to answer the following: A hypothetical experimental clinical research study found a significant difference between the results for the treatment group and results for the control group (p<.01).

Should we, as consumers of research, have confidence that the statistically significant findings are also clinically significant? What kinds of questions might we want to consider before we can answer that question?

Can you help me or point me to see the "light"? Thank you in advance.
Unknown, May 7, 2015 at 4:19 AM
What conclusions can be drawn about statistical significance when the standard deviation is greater than the F value?
Carl Anderson, May 13, 2015 at 8:09 PM
Lukey,

Conceptually, the F statistic is the ratio of the variance explained by a treatment or independent variable to the variance attributable to noise. Essentially, it is a signal-to-noise ratio.
The higher F, the stronger the signal relative to the noise or error term. Strong signals will typically have a much higher variance from the treatment than from the error term, and those treatment variances will be much larger than the F value. For instance, if F = signal / noise = 100 / 20 = 5, then the treatment variance of 100 is much greater than the F value of 5.

You determine significance not by comparing the standard deviation with F but by looking F up, together with its degrees of freedom, in a table of the F distribution, which gives you a p-value. Of course, all statistical software will both compute F and provide the p-value in its output.

Carl
Unknown, September 13, 2015 at 8:21 PM
Could you explain why 0.05 is used as the cutoff for the p-value?
Unknown, September 19, 2015 at 11:37 PM
Hi there Carl,
I am trying to write up a project proposal for the very first time. The topic I have been given is 'In-silico studies of metasignature genes in Lung Cancer'.
After log transformation and Student's t-test, p-values are obtained at a significance level of 0.05.
What I would like to know is whether we could sum the p-values obtained using a significance level of 0.01, then repeat with the same set of genes at a significance level of 0.02, and so on up to 0.05, and then adjust the p-values using FDR.
Will this produce more significant differentially expressed genes?
Unknown, September 20, 2015 at 5:51 AM
Thank you very much, Carl. Your simple explanation helped me a lot. I am new to all this and struggling to understand.
Unknown, October 23, 2015 at 7:27 PM
Hi Carl. I had a p-value of exactly 0.05. How should I draw my conclusion when my null hypothesis says there is no significant difference?
Carl Anderson, October 24, 2015 at 4:11 AM
John, your null hypothesis simply states that there is no difference. It doesn't say anything about significance. You are trying to find evidence to reject it. You found a probability of 0.05 of seeing the differences you observed, or more extreme ones, if the null hypothesis were true. Because probability is continuous, p < 0.05 is essentially the same as p <= 0.05, so you can consider the result significant at the 5% level and reject the null hypothesis.
Unknown, May 31, 2016 at 6:52 AM
Carl,
Need your help looking over this data.
For 1st condition
OR = 1.57
95% CI = (0.40, 6.33)
For 2nd condition
OR = 1.35
95% CI = (0.20, 8.86)
Both 95% confidence intervals contain 1, thus are not statistically significant at alpha = 0.05.
So I accept the null, right?
I'm a bit rusty with my stats and am having a little bit of a hard time.

Unknown, July 30, 2016 at 9:57 AM
How can I select P>0.05, p>0.01 and p>0.001