Null hypothesis significance testing and Type I error: The domain problem
Introduction
There has been much controversy over the practice of using calculated probability, or p (the most common summary value derived from standard tests of statistical significance), to reject or fail to reject null hypotheses (Abelson, 1997; Bakan, 1966; Carver, 1978, 1993; Chow, 1998; Cohen, 1994; Fisher, 1925, 1973; Hagen, 1997; Howson and Urbach, 1989, 1994; Kass and Raftery, 1995; Krueger, 2001; Mayo, 1996; Meehl, 1967, 1978, 1990, 1997; Mulaik et al., 1997; Nickerson, 2000; Rozeboom, 1960, 1997; Schmidt, 1996; Schmidt and Hunter, 1997; Trafimow, 2003, 2006; Trafimow and Marks, 2015, 2016). As a result of extensive criticism, a variety of defenses for this so-called null hypothesis significance testing procedure (NHST) have been offered by those who believe that it is in fact justified; and these defenses, in turn, have been sharply criticized (Bakan, 1966; Carver, 1993; Cohen, 1994; Fisher, 1973; Kass and Raftery, 1995; Meehl, 1967, 1978, 1990, 1997; Nickerson, 2000; Rozeboom, 1960, 1997; Schmidt, 1996; Schmidt and Hunter, 1997; Trafimow, 2003; Trafimow and Marks, 2015, 2016).
Among other issues, the critics argue that p, despite appearances and despite frequent misunderstandings to the contrary, provides none of the following useful information or outcomes: (1) the probability of the null hypothesis; (2) the probability of the alternative hypothesis; (3) a valid way to disconfirm the competing explanation of chance; (4) a good indicator of the probability of the substantive hypothesis; (5) a valid index of effect size; (6) a valid index of the degree of generalizability of the findings; or (7) a valid indicator of the probability of replication.1 In short, the p-value does not mean what many researchers seem to think it means, and it does not justify the sorts of inferences that many researchers seem to think it can be used to justify. As Fidler (2006) has stated, “There is clear evidence that researchers in psychology have many serious misconceptions about null hypothesis significance testing. [Not only do] the problems persist, [but they] are also exhibited even by many teachers of statistics in psychology” (p. 1).
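Point (1) in particular can be made concrete with a small simulation. The following sketch is our illustration, not an example from the paper, and every numerical assumption in it is ours: a base rate of 90% true nulls among tested hypotheses, a true effect of d = 0.5 when the null is false, and a one-sample z-test with n = 30 and known sigma = 1. Under these assumptions, even among results with p < .05, the null hypothesis is true far more than 5% of the time, which is one way of seeing that p is not the probability of the null.

```python
# Illustrative simulation (assumptions ours, not the paper's): among results
# with p < .05, how often was the null hypothesis actually true?
import math
import random

def p_value(sample_mean, n, sigma=1.0):
    """Two-sided p-value for H0: mu = 0, known sigma (one-sample z-test)."""
    z = sample_mean * math.sqrt(n) / sigma
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) is the standard normal CDF
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

def share_of_true_nulls_among_significant(trials=20_000, n=30,
                                          prior_null=0.9, effect=0.5,
                                          alpha=0.05, seed=1):
    rng = random.Random(seed)
    rejections = 0
    true_null_rejections = 0  # "significant" results where H0 was in fact true
    for _ in range(trials):
        null_true = rng.random() < prior_null
        mu = 0.0 if null_true else effect
        sample_mean = sum(rng.gauss(mu, 1.0) for _ in range(n)) / n
        if p_value(sample_mean, n) < alpha:
            rejections += 1
            true_null_rejections += null_true
    return true_null_rejections / rejections

# print(share_of_true_nulls_among_significant())
# Under these assumptions, roughly a third of "significant" results
# come from true nulls -- nowhere near the 5% a naive reading of p suggests.
```

The exact fraction depends entirely on the assumed base rate and power; the point is only that p < .05 by itself fixes none of these quantities.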
So the problem appears to run deep. However, we do not wish to suggest that the p-value has no use whatsoever in scientific research, at least not without further discussion. Indeed, there is an alternative approach to understanding this statistic that could plausibly justify its use for making certain kinds of inferences, and which, in our view, has not yet received sufficient consideration. This approach holds that p specifically, and the null hypothesis significance testing procedure more generally, is not so much a reliable tool for reaching the common (albeit mistaken) conclusions or inferences we listed above as it is a method for controlling the notorious Type I error rate (e.g., Hays, 1994; Mayo, 1996). As controlling this rate is often regarded as an important goal from a statistical research perspective, perhaps the use of p can be justified after all.
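The "error control" reading can be sketched in a few lines of code. This is a minimal illustration of ours, not the authors' example: when the null hypothesis is true and the test's assumptions hold, the rule "reject whenever p < alpha" produces false rejections in about alpha of all studies in the long run. That long-run frequency, across repeated studies, is what Type I error control means.

```python
# Minimal sketch (our illustration): with a true null and a valid test,
# rejecting at p < alpha yields false rejections in about alpha of studies.
import math
import random

def z_test_p(sample, sigma=1.0):
    """Two-sided p-value for H0: mu = 0, known sigma (one-sample z-test)."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n) / sigma
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) is the standard normal CDF
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

def type_i_error_rate(studies=10_000, n=30, alpha=0.05, seed=7):
    rng = random.Random(seed)
    # Every study samples from N(0, 1), so the null (mu = 0) is always true
    # and every rejection is, by construction, a Type I error.
    rejections = sum(
        z_test_p([rng.gauss(0.0, 1.0) for _ in range(n)]) < alpha
        for _ in range(studies)
    )
    return rejections / studies

# print(type_i_error_rate())  # close to alpha = 0.05 by construction
```

Note that the guarantee is a property of the long run of studies, not of any single result, which is exactly why the question of the relevant domain of studies arises in the sections that follow.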
Section snippets
Controlling Type I error
But can p in fact be used, validly, to accomplish this goal in practice? We offer a skeptical perspective. To appreciate the reasons for our skepticism, it will help to recall a persistent statistical problem that researchers faced almost a century ago. The problem was this: How can one perform an inferential statistical analysis that would allow scientists to draw reasonable, justified conclusions about the implications of their data under conditions of uncertainty? For example,
The “across what?” question
The notion of exercising control over, or protecting against, that which we do not want, whether in statistics or in any other domain, has deep roots as well as contemporary significance. In some ways, it is rather straightforward. Doctors, for example, have to pass onerous exams in order to practice medicine in their field, thereby guarding against unqualified practitioners, who are expected to be unlikely to pass such examinations. A similar scenario applies to lawyers, accountants,
Summary and conclusion
To our knowledge, prior to this paper, no one has attempted to specify, as such, the domain of interest across which, or within which, we should attempt to control the Type I error rate. Of course, we are happy to concede that we have not been exhaustive in our analysis: there might very well be other plausible domains of interest which we did not happen to consider in our paper, and which might not turn out to be so problematic. But for each of the domains we did consider, we reached a
References (75)
The good practices manifesto: Overcoming bad practices pervasive in current research in business. Journal of Business Research (2016).
A retrospective on the significance test ban of 1999 (If there were no significance tests, they would be invented).
Response to Comment on “Estimating the reproducibility of psychological science”. Science (2016).
The universe: From flat earth to quasar.
The test of significance in psychological research. Psychological Bulletin (1966).
The rules of the game called psychological science. Perspectives on Psychological Science (2012).
Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience (2013).
The case against statistical significance testing. Harvard Educational Review (1978).
The case against statistical significance testing, revisited. The Journal of Experimental Education (1993).
Précis of statistical significance: Rationale, validity, and utility. Behavioral and Brain Sciences (1998).
The earth is round (p < .05). American Psychologist (1994).
Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis.
Psychology is not in crisis? Depends on what you mean by “crisis”. Huffington Post.
What did the OSC replication initiative reveal about the crisis in psychology? An open review of the draft paper entitled “Replication initiatives will not salvage the trustworthiness of psychology” by James C. Coyne. BMC Psychology.
How to fix psychology's replication crisis. The Chronicle of Higher Education.
Out, damned spot: Can the “Macbeth Effect” be replicated? Basic and Applied Social Psychology.
Replication, falsification, and the crisis of confidence in social psychology. Frontiers in Psychology.
Bayesian statistical inference for psychological research. Psychological Review.
Relativity: The special and the general theory (Robert W. Lawson, Trans.).
A tragedy of the (academic) commons: Interpreting the replication crisis in psychology as a social dilemma for early-career researchers. Frontiers in Psychology.
Against method.
Should psychology abandon p values and teach CIs instead? Evidence-based reforms in statistics education.
Why figures with error bars should replace p values: Some conceptual arguments and empirical demonstrations. Zeitschrift für Psychologie/Journal of Psychology.
Statistical methods for research workers (1925).
Inverse probability. Proceedings of the Cambridge Philosophical Society.
The fiducial argument in statistical inference. Annals of Eugenics.
The design of experiments.
Statistical methods and scientific inference (1973).
Comment on “Estimating the reproducibility of psychological science”. Science.
Introduction to statistics through resampling methods and R/S-PLUS.
Consequences of prejudice against the null hypothesis. Psychological Bulletin.
Observation oriented modeling: Analysis of cause in the behavioral sciences.
In praise of the null hypothesis statistical test. American Psychologist (1997).
Statistics in physical science: Estimation, hypothesis testing, and least squares.
Statistics (1994).
Scientific reasoning: The Bayesian approach.
Probability, uncertainty and the practice of statistics.