New Ideas in Psychology

Volume 45, April 2017, Pages 19-27

Null hypothesis significance testing and Type I error: The domain problem

https://doi.org/10.1016/j.newideapsych.2017.01.002

Abstract

Although many common uses of p-values for making statistical inferences in contemporary scientific research have been shown to be invalid, no one, to our knowledge, has adequately assessed the main original justification for their use, which is that they can help to control the Type I error rate (Neyman & Pearson, 1928, 1933). We address this issue head-on by asking a specific question: Across what domain, specifically, do we wish to control the Type I error rate? For example, do we wish to control it across all of science, across all of a specific discipline such as psychology, across a researcher's active lifetime, across a substantive research area, across an experiment, or across a set of hypotheses? In attempting to answer these questions, we show that each one leads to troubling dilemmas wherein controlling the Type I error rate turns out to be inconsistent with other scientific desiderata. This inconsistency implies that we must make a choice. In our view, the other scientific desiderata are much more valuable than controlling the Type I error rate and so it is the latter, rather than the former, with which we must dispense. But by doing so—that is, by eliminating the Type I error justification for computing and using p-values—there is even less reason to believe that p is useful for validly rejecting null hypotheses than previous critics have suggested.

Introduction

There has been much controversy over the practice of using calculated probability, or p—the most common summary value derived from standard tests of statistical significance—to reject or fail to reject null hypotheses (Abelson, 1997; Bakan, 1966; Carver, 1978; Carver, 1993; Chow, 1998; Cohen, 1994; Fisher, 1925; Fisher, 1973; Hagen, 1997; Howson and Urbach, 1989; Howson and Urbach, 1994; Kass and Raftery, 1995; Krueger, 2001; Mayo, 1996; Meehl, 1967; Meehl, 1978; Meehl, 1990; Meehl, 1997; Mulaik et al., 1997; Nickerson, 2000; Rozeboom, 1960; Rozeboom, 1997; Schmidt, 1996; Schmidt and Hunter, 1997; Trafimow, 2003; Trafimow, 2006; Trafimow and Marks, 2015; Trafimow and Marks, 2016). As a result of extensive criticism, a variety of defenses for this so-called null hypothesis significance testing procedure (NHST) have been offered by those who believe that it is in fact justified; and these defenses, in turn, have been sharply criticized (Bakan, 1966; Carver, 1993; Cohen, 1994; Fisher, 1973; Kass and Raftery, 1995; Meehl, 1967; Meehl, 1978; Meehl, 1990; Meehl, 1997; Nickerson, 2000; Rozeboom, 1960; Rozeboom, 1997; Schmidt, 1996; Schmidt and Hunter, 1997; Trafimow, 2003; Trafimow and Marks, 2015; Trafimow and Marks, 2016).
Among other issues, the critics argue that p does not, despite appearances and frequent misunderstandings to the contrary, provide any of the following useful information or outcomes: (1) the probability of the null hypothesis; (2) the probability of the alternative hypothesis; (3) a valid way to disconfirm the competing explanation of chance; (4) a good indicator of the probability of the substantive hypothesis; (5) a valid index of effect size; (6) a valid index of the degree of generalizability of the findings; or (7) a valid indicator of the probability of replication.1 In short, the p-value does not mean what many researchers seem to think it means, and it does not justify the sorts of inferences that many researchers seem to think it can be used to justify. As Fidler (2006) has stated, “There is clear evidence that researchers in psychology have many serious misconceptions about null hypothesis significance testing. [Not only do] the problems persist, [but they] are also exhibited even by many teachers of statistics in psychology” (p. 1).

So the problem appears to run deep. However, we do not wish to suggest that the p-value has no use whatsoever in scientific research, at least not without further discussion. Indeed, there is an alternative approach to understanding this statistic that could plausibly justify its use for making certain kinds of inferences, which, in our view, has not yet received sufficient consideration. This approach states that p specifically, and the null hypothesis significance testing procedure more generally, is not so much a reliable tool for reaching the common (albeit mistaken) conclusions or inferences we listed above, as it is a method for controlling the notorious Type I error rate (e.g., Hays, 1994; Mayo, 1996). As controlling this rate is often regarded as an important goal from a statistical research perspective, perhaps the use of p can be justified after all.
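To make the long-run, frequentist sense of "controlling the Type I error rate" concrete: if the null hypothesis is true and one rejects it whenever p < α, then across many repeated experiments one will falsely reject in roughly a proportion α of them. The simulation below is a minimal sketch of this guarantee, using a two-sided one-sample z-test; it is our hypothetical illustration (the test, sample size, and seed are arbitrary choices, not drawn from the paper):

```python
import math
import random

random.seed(42)

def z_test_p(sample, sigma=1.0):
    """Two-sided p-value for a one-sample z-test of H0: mu = 0, known sigma."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n) / sigma
    # Standard normal survival function expressed via the error function.
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))

alpha = 0.05
trials = 20_000
false_rejections = 0
for _ in range(trials):
    # The null hypothesis is true by construction: data are N(0, 1).
    sample = [random.gauss(0.0, 1.0) for _ in range(30)]
    if z_test_p(sample) < alpha:
        false_rejections += 1

rate = false_rejections / trials
print(f"Empirical Type I error rate: {rate:.3f}")  # roughly 0.05
```

Note that this guarantee is a property of the procedure across a collection of repeated experiments, which is precisely why the question of which collection of experiments (the "domain") the rate is controlled across matters.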

Section snippets

Controlling Type I error

But can p in fact be used, validly, to accomplish this goal in practice? We offer a skeptical perspective. To appreciate the reasons for our skepticism, it will be helpful to be reminded of a persistent statistical problem that faced researchers almost a century ago. The problem was this: How can one perform an inferential statistical analysis that would allow scientists to draw reasonable, justified conclusions about the implications of their data under conditions of uncertainty? For example,

The “across what?” question

The notion of exercising control over, or protecting against, that which we do not want, whether in statistics or in any other domain, has deep roots as well as contemporary significance. In some ways, it is rather straightforward. Doctors, for example, must pass onerous exams in order to practice medicine in their area—thereby guarding against unqualified practitioners, who are unlikely to pass such examinations. A similar scenario applies to lawyers, accountants,

Summary and conclusion

To our knowledge, prior to this paper, no one has attempted to specify, as such, the domain of interest across which, or within which, we should attempt to control the Type I error rate. Of course, we are happy to concede that we have not been exhaustive in our analysis: there might very well be other plausible domains of interest which we did not happen to consider in our paper, and which might not turn out to be so problematic. But for each of the domains we did consider, we reached a

References (75)

  • A. Woodside

    The good practices manifesto: Overcoming bad practices pervasive in current research in business

    Journal of Business Research

    (2016)
  • R.P. Abelson

    A retrospective on the significance test ban of 1999 (If there were no significance tests, they would be invented)

  • Christopher J. Anderson et al.

    Response to Comment on “Estimating the reproducibility of psychological science”

    Science

    (2016)
  • I. Asimov

    The universe: From flat earth to quasar

    (1966)
  • D. Bakan

    The test of significance in psychological research

    Psychological Bulletin

    (1966)
  • M. Bakker et al.

    The rules of the game called psychological science

    Perspectives on Psychological Science

    (2012)
  • K.S. Button et al.

    Power failure: Why small sample size undermines the reliability of neuroscience

    Nature Reviews Neuroscience

    (2013)
  • R.P. Carver

    The case against statistical significance testing

    Harvard Educational Review

    (1978)
  • R.P. Carver

    The case against statistical significance testing, revisited

    The Journal of Experimental Education

    (1993)
  • S.L. Chow

    Précis of statistical significance: Rationale, validity, and utility

    Behavioral and Brain Sciences

    (1998)
  • J. Cohen

    The earth is round (p < .05)

    American Psychologist

    (1994)
  • G. Cumming

    Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis

    (2012)
  • B.D. Earp

    Psychology is not in crisis? Depends on what you mean by “crisis”

    Huffington Post

    (2015, September 2)
  • B.D. Earp

    What did the OSC replication initiative reveal about the crisis in psychology? An open review of the draft paper entitled “Replication initiatives will not salvage the trustworthiness of psychology” by James C. Coyne

    BMC Psychology

    (2016)
  • B.D. Earp et al.

    How to fix psychology's replication crisis

    The Chronicle of Higher Education

    (2015, October 26)
  • B.D. Earp et al.

    Out, damned spot: Can the “Macbeth Effect” be replicated?

    Basic and Applied Social Psychology

    (2014)
  • B.D. Earp et al.

    Replication, falsification, and the crisis of confidence in social psychology

    Frontiers in Psychology

    (2015)
  • W. Edwards et al.

    Bayesian statistical inference for psychological research

    Psychological Review

    (1963)
  • A. Einstein

    Relativity: The special and the general theory (Robert W. Lawson, Trans.)

    (1961)
  • J.A.C. Everett et al.

    A tragedy of the (academic) commons: Interpreting the replication crisis in psychology as a social dilemma for early-career researchers

    Frontiers in Psychology

    (2015)
  • P. Feyerabend

    Against method

    (1993)
  • F. Fidler

    Should psychology abandon p values and teach CIs instead? Evidence-based reforms in statistics education

    (2006)
  • F. Fidler et al.

    Why figures with error bars should replace p values: Some conceptual arguments and empirical demonstrations

    Zeitschrift für Psychologie/Journal of Psychology

    (2009)
  • R.A. Fisher

    Statistical methods for research workers

    (1925)
  • R.A. Fisher

    Inverse probability

    Proceedings of the Cambridge Philosophical Society

    (1930)
  • R.A. Fisher

    The fiducial argument in statistical inference

    Annals of Eugenics

    (1935)
  • R.A. Fisher

    The design of experiments

    (1951)
  • R.A. Fisher

    Statistical methods and scientific inference

    (1973)
  • D.T. Gilbert et al.

    Comment on “Estimating the reproducibility of psychological science”

    Science

    (2016)
  • P. Good

    Introduction to statistics through resampling methods and R/S-PLUS

    (2005)
  • A.G. Greenwald

    Consequences of prejudice against the null hypothesis

    Psychological Bulletin

    (1975)
  • J.W. Grice

    Observation oriented modeling: Analysis of cause in the behavioral sciences

    (2011)
  • R.L. Hagen

    In praise of the null hypothesis statistical test

    American Psychologist

    (1997)
  • W.C. Hamilton

    Statistics in physical science: Estimation, hypothesis testing, and least squares

    (1964)
  • W.L. Hays

    Statistics

    (1994)
  • C. Howson et al.

    Scientific reasoning: The Bayesian approach

    (1989)
  • C. Howson et al.

    Probability, uncertainty and the practice of statistics
