Sunday, January 6, 2013

What's the significance of 0.05 significance?

Why do we tend to use a statistical significance level of 0.05? When I teach statistics or mentor colleagues brushing up, I often get the sense that a statistical significance level of α = 0.05 is viewed as some hard and fast threshold, a publishable / not publishable step function. I've seen grad students finish up an empirical experiment and groan to find that p = 0.052. Depressed, they head for the pub. I've seen the same grad students extend their experiment just long enough for statistical variation to swing in their favor to obtain p = 0.049. Happy, they head for the pub. 

Clearly, 0.05 is not the only significance level used. 0.1, 0.01 and some smaller values are common too. This is partly related to field. In my experience, the ecological literature and other fields that are often plagued by small sample sizes are more likely to use 0.1. Engineering and manufacturing where larger samples are easier to obtain tend to use 0.01. Most people in most fields, however, use 0.05. It is indeed the default value in most statistical software applications.

This "standard" 0.05 level is typically associated with Sir R. A. Fisher, a brilliant biologist and statistician who pioneered many areas of statistics, including ANOVA and experimental design. However, the true origins make for a much richer story.

Let's start, however, with Fisher's contribution. In Statistical Methods for Research Workers (1925), he states
The value for which P=0.05, or 1 in 20, is 1.96 or nearly 2; it is convenient to take this point as a limit in judging whether a deviation ought to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant. Using this criterion we should be led to follow up a false indication only once in 22 trials, even if the statistics were the only guide available. Small effects will still escape notice if the data are insufficiently numerous to bring them out, but no lowering of the standard of significance would meet this difficulty.
The next year he states, somewhat loosely,
... it is convenient to draw the line at about the level at which we can say: "Either there is something in the treatment, or a coincidence has occurred such as does not occur more than once in twenty trials."... 
If one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent point), or one in a hundred (the 1 per cent point). Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.

And there you have it. With no theoretical justification, these few sentences drove the standard significance level that we use to this day. 

Fisher was not the first to think about this but he was the first to reframe it as a probability in this manner and the first to state this 0.05 value explicitly. 

Those two z-values in the first quote, however, hint at a longer history and basis of the different significance levels that we know and love. Cowles & Davis (1982), On the Origins of the .05 Level of Statistical Significance, describe a fascinating extended history which reads like a Who's Who of statistical luminaries: De Moivre, Pearson, Gosset (Student), Laplace, Gauss and others.
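The two z-values in Fisher's first quote can be checked numerically. Here is a minimal sketch, assuming a standard normal distribution and using only Python's standard library; the helper name `two_sided_tail` is my own, not anything from Fisher or from Cowles & Davis.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def two_sided_tail(z):
    """Probability that a standard normal deviate falls outside +/- z."""
    return 2 * (1 - std_normal.cdf(z))

p_196 = two_sided_tail(1.96)  # close to 0.05, i.e. about 1 in 20
p_2 = two_sided_tail(2.0)     # a little under 0.05, i.e. about 1 in 22

print(f"z = 1.96: P = {p_196:.4f}, about 1 in {1 / p_196:.1f}")
print(f"z = 2.00: P = {p_2:.4f}, about 1 in {1 / p_2:.1f}")
```

This is why the quote can mention both "1 in 20" and "once in 22 trials": the first belongs to 1.96 standard deviations, the second to exactly 2.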

Our story really begins in 1818 with Bessel who coined the term "probable error" (well, at least the equivalent in German). Probable error is the semi-interquartile range. That is, ±1PE contains the central 50% of values and is roughly 2/3 of a standard deviation. So, for a uniform distribution ±2PE contains all values but for a standard normal it contains only the central 82% of values. Finally, and crucially to our story,
  • ±3PE contains the central ~95% of values. 1 - 0.95 = 0.05
  • People like Quetelet and Galton had tended to express variation or errors outside some typical range in terms of ±3PE, even after Pearson coined the term standard deviation. 
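The probable error figures above can be reproduced with a short sketch, assuming a standard normal distribution. The names `PE_IN_SD` and `central_coverage` are hypothetical helpers introduced only for this illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()

# One probable error (PE) is the half-width of the central 50% of a normal
# distribution, expressed here in units of the standard deviation.
PE_IN_SD = std_normal.inv_cdf(0.75)  # roughly 0.6745, i.e. about 2/3

def central_coverage(k):
    """Fraction of a normal distribution lying within +/- k probable errors."""
    z = k * PE_IN_SD
    return std_normal.cdf(z) - std_normal.cdf(-z)

for k in (1, 2, 3):
    inside = central_coverage(k)
    print(f"+/-{k}PE = +/-{k * PE_IN_SD:.3f} sd: "
          f"inside {inside:.3f}, outside {1 - inside:.3f}")
```

Under these assumptions ±3PE works out to just over 2 standard deviations, leaving a little under 5% of values outside, which is the link between ±3PE, ±2σ and 0.05.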

There you have the basis of 0.05 significance: ±3PE was in common use in the late 1890s and this translates to 0.05. 1 in 20 is easier to interpret for most people than a z value of 2 or in terms of PE (Cowles & Davis, 1982) and thus explains why 0.05 became more popular. 

In one paper from the 1890s, Pearson remarks on different p-values obtained as

p = 0.5586 --- "thus we may consider the fit remarkably good"
p = 0.28 --- "fairly represented"
p = 0.1 --- "not very improbable that the observed frequencies are compatible with a random sampling"
p = 0.01 --- "this very improbable result"

and here we see the start of different significance levels. 0.1 is a little improbable and 0.01 very improbable. 0.05 rests between the two.

Despite this, ±3PE continued to be used as the primary criterion up to the 1920s and is still used in some fields today, especially in physics. It was Fisher who rounded off the probability to 0.05, which in turn shifted the criterion from a clean ±2σ to ±1.96σ.

In summary, ±3PE --> ±2σ --> ±1.96σ --> α = 0.05 more accurately describes the evolution of statistical significance.


  1. Carl,
    I am trying to understand p<0.01 to answer the following: A hypothetical experimental clinical research study found a significant difference between the results for the treatment group and results for the control group (p<.01).

    Should we, as consumers of research, have confidence that the statistically significant findings are also clinically significant? What kinds of questions might we want to consider before we can answer that question?

    Can you help me or point me to see the "light"? Thank you in advance.

    1. That really depends on your null hypothesis and experimental design. Suppose I have one group of people sit in a chair for 5 mins and another run for 5 mins and then I measure their pulse. I find that the runners have a significantly higher pulse. Does that provide any meaningful insights? No, not really. If instead, I give the treatment group a new drug, well that is likely different. Sorry, but it is hard to give a good answer to this because it is very much tied up with the basis of statistical inference. I would suggest reading up on that, starting with the null hypothesis.

  2. what conclusions can be drawn about statistical significance when the standard deviation is greater than the F value?

  3. Lukey,

    Conceptually, the F statistic is the ratio of the variance explained by a treatment or independent variable / variance explained by noise. Essentially it is a signal to noise ratio.
    The higher F, the stronger the signal relative to the noise or error term. Strong signals will typically have much higher variance from the treatment than from the error term, and those treatment variances will be much larger than the F value. For instance, if F = signal / noise = 100 / 20 = 5, then the treatment variance of 100 is much greater than the F value of 5.

    You determine significance not by comparing standard deviation and F but simply by looking F up, together with its degrees of freedom, in a lookup table, which returns a p-value. Of course, all statistical software will both compute F and provide the p-value in its output.
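To make the signal-to-noise idea concrete, here is a minimal sketch of a one-way ANOVA F statistic computed by hand. The three groups are made-up numbers chosen so the arithmetic is easy to follow, and `one_way_f` is a hypothetical helper, not a function from any statistics package.

```python
from statistics import mean

def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Signal: how far the group means sit from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Noise: how far individual values sit from their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between / ms_within, df_between, df_within

# Made-up data: three groups of three observations each.
groups = [[4, 5, 6], [7, 8, 9], [10, 11, 12]]
f_stat, df_between, df_within = one_way_f(groups)
print(f"F = {f_stat:.1f} on ({df_between}, {df_within}) degrees of freedom")
```

Here the between-group mean square is 27 and the within-group mean square is 1, so F = 27. Turning that into a p-value needs the F distribution with those two degrees of freedom; if SciPy is available, `scipy.stats.f.sf(f_stat, df_between, df_within)` should give the upper-tail probability.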


  4. Could you explain why 0.05 is used to limit the p-value ?

    1. Mohd,
      See my response to the comment below. Hopefully that helps clarify things for you.

  5. Hi there Carl,
    I am trying to write up a project proposal for the very first time. I am given 'In-silico studies of metasignature genes in Lung Cancer'.
    After log transformation and Student's t-test, p-values are obtained at the significance level of 0.05.
    What I would like to know is whether we could sum the p-values obtained using a significance level of 0.01, then again using the same set of genes and setting the significance at 0.02, thus calculating till 0.05, and then adjusting the p-values using FDR.
    will these produce more significant differentially expressed genes?

    1. Moona,
      I don't fully understand what you are trying to do but the answer is no. You can't add p-values.
      There is some muddled thinking here I would like to clear up. P-values are completely independent of significance level. Significance is a layer of interpretation after a p-value is obtained.
      Here is the flow:
      1) Set up null hypothesis: metric_control = metric_treatment
      2) set up alternative: metric_control != metric_treatment
      3) compute test statistic: say t = 3.95
      4) statistical software will work out the p-value associated with that test statistic (for the given degrees of freedom). Say, p=0.03
      That p-value is the probability of obtaining that test statistic value, or one more extreme, if the null hypothesis were true.
      Notice that I haven't mentioned significance yet. We got a probability of 0.03, a 3% chance that we would get these results (or more extreme) by chance if the null hypothesis were true.
      5) Now you have to interpret how strong a signal that is. If you choose a 5% significance level, then as 0.03 < 0.05, it is significant at 5%. If you choose a 1% significance level, then as 0.03 > 0.01, it is not significant. A significance level is simply the critical threshold against which the p-value is compared. A p-value of 0.03 is significant at the 3% level, at the 4% level, at the 5% level, ... and at the 99% level.
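The flow above can be sketched in a few lines. Note the swap: instead of the t-test, this uses an exact permutation test on a difference in means, because it needs only the standard library and makes "as extreme or more extreme, if the null were true" very literal. The pulse-like numbers are made up, and `exact_permutation_p` and `decisions` are hypothetical helpers.

```python
from itertools import combinations
from statistics import mean

def exact_permutation_p(control, treatment):
    """Two-sided exact permutation p-value for a difference in means."""
    pooled = control + treatment
    n_treat = len(treatment)
    n_ctrl = len(control)
    observed = abs(mean(treatment) - mean(control))
    total = sum(pooled)
    extreme = 0
    count = 0
    # Under the null, group labels are arbitrary, so try every relabelling.
    for idx in combinations(range(len(pooled)), n_treat):
        treat_sum = sum(pooled[i] for i in idx)
        diff = abs(treat_sum / n_treat - (total - treat_sum) / n_ctrl)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
        count += 1
    return extreme / count

def decisions(p_value, alphas):
    """The p-value is fixed; only the chosen threshold changes the verdict."""
    return {alpha: p_value <= alpha for alpha in alphas}

# Made-up measurements for illustration only.
control = [72, 75, 71, 78, 74]
treatment = [80, 83, 79, 85, 77]

p_value = exact_permutation_p(control, treatment)
verdicts = decisions(p_value, [0.01, 0.05, 0.10])
print(f"p = {p_value:.4f}")
for alpha, significant in verdicts.items():
    print(f"alpha = {alpha:.2f}: {'significant' if significant else 'not significant'}")
```

The p-value is computed once, from the data and the null hypothesis alone. The significance level only enters in the last step, which is why the same p-value can be significant at 5% and not significant at 1%.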

      The p-value is a probability associated with a given null hypothesis. As such, you can't sum them across hypotheses like this.

      (In Bayesian statistics, however, you work with likelihood of events which can be multiplied to get joint probabilities but that is very different statistical approach.)

  6. Thank you very much Carl, your simple explanation helped me a lot. I am new to all this and struggling to understand.