Multiple Testing Problem (Multiple Comparisons)

Testing multiple hypotheses simultaneously requires interpreting the outcome as a whole and not just some individual results. Remember that a p-value of 0.05 means that there is a 5% chance of getting the observed result (or a more extreme one) if the null hypothesis is true.

When a large number of tests is performed, some will yield p-values below 0.05 purely by chance.

An illustrative case of this effect in the pseudo-science of astrology is the Mars effect supposedly discovered by Michel Gauquelin in 1955 [Kel97]. Astrology claims that the positions of the planets at the time and place of birth influence the character and future life of the individual. This claim has been disproved conclusively [Car85], yet many people continue to cling to the ancient superstition [NSF14]. Analysing birth and occupation data of French citizens, Gauquelin was largely unsuccessful in validating astrological claims, but he found somewhat increased probabilities for eminent athletes to be born when Mars was at certain positions in the sky. He also found similar but weaker effects for some other professions and planets.

There are several problems with the Gauquelin studies, such as misreporting by parents and his inability to replicate the results with later data in which birth times were reported by doctors, but it remains an illustrative example of the multiple testing problem, also known as the look-elsewhere effect: when you keep looking for relationships in the data, performing test after test, you will sooner or later find some effects with p-values just below the significance level (in such cases usually set generously at 0.05). A newer Canadian study provides further illustration by associating certain illnesses with certain zodiacal signs [Aus06].

In the following example a 20x20 matrix is filled with random numbers from the normal distribution with μ = 10 and σ = 2. Assume that each column takes on the role of some observed parameter. Let's call the first column P1 and test it for correlation with all other columns, resulting in 19 p-values:

x <- matrix(rnorm(400, 10, 2), 20, 20)                      # 20x20 matrix of draws from N(10, 2)
sapply(2:20, function(i) cor.test(x[,1], x[,i])$p.value)    # test P1 against each of P2..P20
 [1] 0.95779819 0.08835695 0.39435803 0.56684134 0.19903611 0.69688243
 [7] 0.90155729 0.46493124 0.39372404 0.02006015 0.45580134 0.57489424
[13] 0.58140333 0.49712424 0.53188269 0.45003006 0.98367208 0.15195715
[19] 0.94994881

We see a correlation between P1 and P11 that is significant at α = 0.05 (the tenth p-value in the output belongs to column 11, since the first tested column is P2). Interpreting this result in isolation from the rest leads to the common fallacy in the multiple testing problem.

The fact that we get about one false positive in 20 tests is a consequence of our choice of significance level, since this is the meaning of α: the probability of a type I error, i.e. rejecting the null hypothesis when it is true.
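
This can be checked with a small simulation (a sketch only; the counts will vary from run to run, and the 19 tests are not strictly independent since they share P1): repeat the experiment above many times and record how many of the 19 tests come out "significant" at 0.05, even though all columns are independent by construction.

n.sig <- replicate(1000, {
    x <- matrix(rnorm(400, 10, 2), 20, 20)
    p <- sapply(2:20, function(i) cor.test(x[,1], x[,i])$p.value)
    sum(p < 0.05)                      # number of false positives in this run
})
mean(n.sig)        # on average about 19 * 0.05 = 0.95 false positives per run
mean(n.sig >= 1)   # at least one false positive in roughly 1 - 0.95^19 ≈ 62% of runs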

Several methods exist to deal with the multiple testing problem, e.g., simply dividing the individual significance levels by the number of hypotheses tested (i.e. α/19 in the example above). This Bonferroni correction is appropriate when the number of comparisons is small, and only one or two might be significant; see, e.g., [McD14] for a more detailed discussion. The important (and sometimes not trivial) step is to recognize the problem in the first place.
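
As a sketch of what such a correction looks like in R (assuming the 19 p-values from the example above are stored in a vector p), the built-in p.adjust function implements the Bonferroni correction, among other procedures such as Holm's (method = "holm"); multiplying the p-values by 19 is equivalent to comparing them against α/19:

p <- sapply(2:20, function(i) cor.test(x[,1], x[,i])$p.value)
p.adjust(p, method = "bonferroni")                  # each p-value multiplied by 19 (capped at 1)
which(p.adjust(p, method = "bonferroni") < 0.05)    # in the example above, none survives
which(p < 0.05/19)                                  # equivalent comparison against alpha/19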

References

[Kel97] For a discussion see, e.g., Note 3 on page 31 in I. W. Kelly, The Concepts of Modern Astrology: A Critique. University of Saskatchewan, Canada, online at http://www.astrosurf.com/nitschelm/Modern_criticism.pdf (expanded and revised version of an article published in Psychological Reports, 1997, 81, 1035-1066)

[Car85] Among the many studies disproving astrology this one is particularly striking: Shawn Carlson (1985), A double-blind test of astrology. Nature 318 (6045): 419-425, online at http://muller.lbl.gov/papers/Astrology-Carlson.pdf

[NSF14] Science and Engineering Indicators 2014, Chapter 7. Science and Technology: Public Attitudes and Understanding, online at http://www.nsf.gov/statistics/seind14/index.cfm/chapter-7/c7s2.htm. In 2012, about half of Americans (55%) said astrology is "not at all scientific." One-third (32%) said they thought astrology was "sort of scientific," and 10% said it was "very scientific."

[Aus06] P. Austin, M. Mamdani, D. Juurlink, J. Hux (2006), Testing multiple statistical hypotheses resulted in spurious associations: a study of astrological signs and health. J Clin Epidemiol. 2006 Sep;59(9):964-9. Abstract online at http://www.ncbi.nlm.nih.gov/pubmed/16895820

[McD14] J. H. McDonald (2014), Handbook of Biological Statistics (3rd ed.). Sparky House Publishing, Baltimore, Maryland. Relevant pages 254-260 online at http://www.biostathandbook.com/multiplecomparisons.html