The following rule of experimentation is therefore suggested: perform that experiment for which the expected gain in information is the greatest, and continue experimentation until a preassigned amount of information has been attained (Lindley 1956, p. 987)

We aim to explore Bayes Factor Design Analysis (BFDA) as a useful tool to design studies for maximum efficiency and informativeness. In the classical frequentist framework, statistical power refers to the long-term probability (across multiple hypothetical studies) of obtaining a significant p-value if an effect of a certain size exists (Cohen 1988). Classical power analysis is a special case of the broader class of design analysis, which uses prior guesses of effect sizes and other parameters to compute distributions of any study outcome (Gelman and Carlin 2014). The general principle is to assume a certain state of reality, most importantly the expected true effect size, and tune the settings of a research design such that certain desirable outcomes are likely to occur. For example, in frequentist power analysis, the property “sample size” of a design can be tuned such that, say, 80 % of all studies would yield a p-value <.05 if an effect of a certain size exists.

The framework of design analysis is general and can be used both for Bayesian and non-Bayesian designs, and it can be applied to any study outcome of interest. For example, in designs reporting Bayes factors a researcher can plan sample size such that, say, 80 % of all studies result in a compelling Bayes factor, for instance BF10>10 (Weiss 1997; De Santis 2004). One can also determine the sample size such that, with a desired probability of occurrence, a highest density interval for a parameter excludes zero, or a particular parameter is estimated with a predefined precision (Kruschke 2014; Gelman and Tuerlinckx 2000). Hence, the concept of prospective design analysis, which refers to design planning before data are collected, is not limited to null-hypothesis significance testing (NHST), and our paper applies the concept to studies that use Bayes factors (BFs) as an index of evidence.

The first part of this article provides a short introduction to BFs as a measure of evidence for a hypothesis (relative to an alternative hypothesis). The second part describes how compelling evidence is a necessary ingredient for strong inference, which has been argued to be the fastest way to increase knowledge (Platt 1964). The third part of this article elaborates on how to apply the idea of design analysis to research designs with BFs. The fourth part introduces three BF designs, (a) a fixed-n design, (b) an open-ended Sequential Bayes Factor (SBF) design, where researchers can test after each participant and can stop data collection when there is strong evidence for either \(\mathcal {H}_{1}\) or \(\mathcal {H}_{0}\), and (c) a modified SBF design that defines a maximal sample size where data collection is stopped in any case. We demonstrate how to use Monte Carlo simulations and graphical summaries to assess the properties of each design and how to plan for compelling evidence. Finally, we discuss the approach in terms of possible extensions, the issue of (un)biased effect size estimates in sequential designs, and practical considerations.

Bayes factors as an index of evidence

The Bayes factor is “fundamental to the Bayesian comparison of alternative statistical models” (O’Hagan and Forster 2004, p. 55) and it represents “the standard Bayesian solution to the hypothesis testing and model selection problems” (Lewis and Raftery 1997, p. 648) and “the primary tool used in Bayesian inference for hypothesis testing and model selection” (Berger 2006, p. 378). Here we briefly describe the Bayes factor as it applies to the standard scenario where a precise, point-null hypothesis \(\mathcal {H}_{0}\) is compared to a composite alternative hypothesis \(\mathcal {H}_{1}\). Under a composite hypothesis, the parameter of interest is not restricted to a particular fixed value (Jeffreys 1961). In the case of a t-test, for instance, the null hypothesis specifies the absence of an effect, that is, \(\mathcal {H}_{0}: \delta = 0\), whereas the composite alternative hypothesis allows effect size to take on nonzero values.

In order to gauge the support that the data provide for \(\mathcal {H}_{0}\) versus \(\mathcal {H}_{1}\), the Bayes factor hypothesis test requires that both models make predictions. This, in turn, requires that the expectations under \(\mathcal {H}_{1}\) are made explicit by assigning effect size δ a prior distribution, for instance a normal distribution centered on zero with a standard deviation of 1, \(\mathcal {H}_{1}: \delta \sim \mathcal {N}(0,1)\).

After both models have been specified so that they make predictions, the observed data can be used to assess each model's predictive adequacy (Morey et al. 2016; Wagenmakers et al. 2006; Wagenmakers et al. 2016). The ratio of predictive adequacies –the Bayes factor– represents the extent to which the data update the relative plausibility of the competing hypotheses, that is:

$$\underbrace{\frac{p(\mathcal{H}_{0} \mid \text{data})}{p(\mathcal{H}_{1} \mid \text{data})}}_{\substack{\text{Posterior plausibility}\\ \text{about hypotheses}}} = \underbrace{\frac{p(\mathcal{H}_{0})}{p(\mathcal{H}_{1})}}_{\substack{\text{Prior plausibility}\\ \text{about hypotheses}}} \times \underbrace{\frac{p(\text{data} \mid \mathcal{H}_{0})}{p(\text{data} \mid \mathcal{H}_{1})}}_{\substack{\text{Bayes factor =}\\ \text{predictive updating factor}}} \qquad (1)$$

In this equation, the relative prior plausibility of the competing hypotheses is adjusted in light of predictive performance for observed data, and this then yields the relative posterior plausibility. Although the assessment of prior plausibility may be informative and important (e.g., Dreber et al. 2015), the inherently subjective nature of this component has caused many Bayesian statisticians to focus on the Bayes factor –the predictive updating factor– as the metric of interest (Hoijtink et al. 2008; Jeffreys 1961; Kass and Raftery 1995; Ly et al. 2016; Mulder and Wagenmakers 2016; Rouder et al. 2009; Rouder et al. 2012).

Depending on the order of numerator and denominator in the ratio, the Bayes factor is either denoted as BF01 (“\(\mathcal {H}_{0}\) over \(\mathcal {H}_{1}\)”, as in Eq. (1)) or as its inverse BF10 (“\(\mathcal {H}_{1}\) over \(\mathcal {H}_{0}\)”). When the Bayes factor BF01 equals 5, this indicates that the data are five times more likely under \(\mathcal {H}_{0}\) than under \(\mathcal {H}_{1}\), meaning that \(\mathcal {H}_{0}\) has issued a better probabilistic prediction for the observed data than did \(\mathcal {H}_{1}\). In contrast, when BF01 equals 0.25 the data support \(\mathcal {H}_{1}\) over \(\mathcal {H}_{0}\). Specifically, the data are 1/BF01=BF10=4 times more likely under \(\mathcal {H}_{1}\) than under \(\mathcal {H}_{0}\).

The Bayes factor offers several advantages for the practical researcher (Wagenmakers et al. 2016). First, the Bayes factor quantifies evidence both for \(\mathcal {H}_{1}\) and for \(\mathcal {H}_{0}\); second, its predictive underpinnings entail that neither \(\mathcal {H}_{0}\) nor \(\mathcal {H}_{1}\) need be “true” for the Bayes factor to be useful (but see van Erven et al. 2012); third, the Bayes factor does not force an all-or-none decision, but instead coherently reallocates belief on a continuous scale; fourth, the Bayes factor distinguishes between absence of evidence and evidence of absence (e.g., Dienes 2014, 2016); fifth, the Bayes factor does not require adjustment for sampling plans (i.e., the Stopping Rule Principle; Bayarri et al. 2016; Berger and Wolpert 1988; Rouder 2014). A practical corollary is that, in contrast to p-values, Bayes factors retain their meaning in situations common in ecology and astronomy, where nature provides data over time and sampling plans do not exist (Wagenmakers et al. 2016).

Although Bayes factors are defined on a continuous scale, several researchers have proposed to subdivide the scale in discrete evidential categories (Jeffreys 1961; Kass and Raftery 1995; Lee and Wagenmakers 2013). The scheme originally proposed by Jeffreys is shown in Table 1. The evidential categories serve as a rough heuristic whose main goal is to prevent researchers from overinterpreting the evidence in the data. In addition –as we will demonstrate below– the categories permit a concise summary of the results from our simulation studies.

Table 1 A rough heuristic classification scheme for the interpretation of Bayes factors BF10 (Lee & Wagenmakers 2013; adjusted from Jeffreys 1961)

The purpose of design analyses: planning for compelling evidence

In the planning phase of an experiment, the purpose of a prospective design analysis is to facilitate the design of a study that ensures a sufficiently high probability of detecting an effect if it exists. Executed correctly, this is a crucial ingredient of strong inference (Platt 1964), which includes “[d]evising a crucial experiment [...], with alternative possible outcomes, each of which will, as nearly as possible, exclude one or more of the hypotheses” (p. 347). In other words, a study design with strong inferential properties is likely to provide compelling evidence, either for one hypothesis or for the other. Such a study generally does not leave researchers in an inconclusive state of inference.

When a study is underpowered, in contrast, it most likely provides only weak inference. Within the framework of frequentist statistics, underpowered studies result in p-values that are relatively nondiagnostic. Specifically, underpowered studies inflate both false-negative and false-positive results (Button et al. 2013; Dreber et al. 2015; Ioannidis 2005; Lakens and Evers 2014), wasting valuable resources such as the time and effort of participants, the lives of animals, and scientific funding provided by society. Consequently, research unlikely to produce diagnostic outcomes is inefficient and can even be considered unethical (Halpern et al. 2002; Emanuel et al. 2000; but see Bacchetti et al. 2005).

To summarize, the primary purpose of a prospective design analysis is to assist in the design of studies that increase the probability of obtaining compelling evidence, a necessary requirement for strong inference.

Design analysis for Bayes factor designs

We apply design analysis to studies that report the Bayes factor as a measure of evidence. Note, first, that we seek to evaluate the operational characteristics of a Bayesian research design before the data are collected (i.e., a prospective design analysis). Therefore, our work centers on design, not on inference; once specific data have been collected, pre-data design analyses are inferentially irrelevant, at least from a Bayesian perspective (Bayarri et al. 2016; Wagenmakers et al. 2014). Second, our focus is on the Bayes factor as a measure of evidence, and we expressly ignore both prior model probabilities and utilities (Berger 1985; Taroni et al. 2010; Lindley 1997), two elements that are essential for decision making yet orthogonal to the quantification of evidence provided by the observed data. Thus, we consider scenarios where “the object of experimentation is not to reach decisions but rather to gain knowledge about the world” (Lindley 1956, p. 986).

Target outcome of a Bayes factor design analysis: strong evidence and no misleading evidence

In the context of evaluating the empirical support for and against a null hypothesis, Bayes factors quantify the strength of evidence for that null hypothesis \(\mathcal {H}_{0}\) relative to the alternative hypothesis \(\mathcal {H}_{1}\). To facilitate strong inference, we wish to design studies such that they are likely to result in compelling Bayes factors in favor of the true hypothesis – thus, the informativeness of a design may be quantified by the expected Bayes factor (Good 1979; Lindley 1956; Cavagnaro et al. 2009), or an entire distribution of Bayes factors.

Prior to the experiment, one may expect that for the majority of possible data sets the Bayes factor will point towards the correct hypothesis. For particular data sets, however, sampling variability may result in a misleading Bayes factor, that is, a Bayes factor that points towards the incorrect hypothesis. For example, even when \(\mathcal {H}_{0}\) holds in the population, a random sample can show strong evidence in favor of \(\mathcal {H}_{1}\), merely due to sampling fluctuations. We term this situation false positive evidence (FPE). If, in contrast, the data set shows strong evidence for \(\mathcal {H}_{0}\), although in reality \(\mathcal {H}_{1}\) is correct, we term this false negative evidence (FNE). In general terms, misleading evidence is defined as a situation where the data show strong evidence in favor of the incorrect hypothesis (Royall 2000).

Research designs differ with respect to their probability of generating misleading evidence. The probability of yielding misleading evidence is a pre-data concept that should not be confused with a related but different post-data concept, namely the probability that a given evidence in a particular data set is misleading (Blume 2002).

The expected strength of evidence (i.e., the expected BF) and the probability of misleading evidence are conceptually distinct, but practically tightly related properties of a research design (Royall 2000), as in general higher evidential thresholds will lead to lower rates of misleading evidence (Blume 2008; Schönbrodt et al. 2015). To summarize, the joint goal of a prospective design analysis should be a high probability of obtaining strong evidence and a low probability of obtaining misleading evidence, which usually go together.

Dealing with uncertainty in expected effect size

Power in a classical power analysis is a conditional power, because the computed power is conditional on the assumed true (or minimally interesting) effect size. One difficulty is to commit to a point estimate of that parameter when there is considerable uncertainty about it. This uncertainty could be dealt with by computing the necessary sample size for a set of plausible fixed parameter values. For example, previous experiments may suggest that the true effect size is around 0.5, but a researcher feels that the true effect could just as well be 0.3 or 0.7, and computes the necessary sample sizes for these effect size guesses as well. Such a sensitivity analysis gives an idea about the variability of resulting sample sizes.

A problem of this approach, however, is that there is no principled way of choosing an appropriate sample size from this set: Should the researcher aim for the conservative estimate, which would be highly inefficient in case the true effect is larger? Or should she aim for the optimistic estimate, which would lead to a low actual power if the true effect size is at the lower end of plausible values?

Prior effect size distributions quantify uncertainty

Extending the procedure of a sensitivity analysis, however, one can compute the probability of achieving a research goal averaged across all possible effect sizes. For this purpose, one has to define prior plausibilities of the effect sizes, compute the distribution of target outcomes for each effect size, and then obtain a weighted average. This averaged probability of success has been called “assurance” (O’Hagan et al. 2005) or “expected Bayesian power” (Spiegelhalter et al. 2004), and is the expected probability of success with respect to the prior.
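Expressed as a formula (a direct transcription of the verbal recipe above, writing \(p(\delta)\) for the design prior on the effect size), the assurance of a given design is

$$\text{assurance} = \int P(\text{research goal achieved} \mid \delta) \, p(\delta) \, \mathrm{d}\delta,$$

where the research goal could be, for example, obtaining BF10 > 6 with the planned sample size.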

In the above example, not all of the three assumed effect sizes (i.e., 0.3, 0.5, and 0.7) might be equally plausible. For example, one could construct a prior effect size distribution under \(\mathcal {H}_{1}\) that describes the plausibility for each choice (and all effect sizes in between) as a normal distribution centered around the most plausible value of 0.5 with a standard deviation of 0.1: \(\delta \sim \mathcal {N} (0.5, \sigma = 0.1)\), see Fig. 1.

Fig. 1 A hypothetical prior distribution expressing the uncertainty about the true effect size. Figure available at https://osf.io/qny5x/, under a CC-BY4.0 license

Garthwaite et al. (2005) give advice on how to elicit a prior distribution from experts. These procedures help an expert to formulate his or her substantive knowledge in probabilistic form, which in turn can be used for Bayesian computations. Such an elicitation typically includes several steps, for example asking experts about the most plausible value (i.e., about the mode of the prior), or asking about the quantiles, such as ‘Please make a guess about a very high value, such that you feel there is only a 5 % probability the true value would exceed your guess’.

Morris et al. (2014) provide an online tool that can help to fit an appropriate distribution to an expert's input.
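As a minimal illustration of such an elicitation-based fit, assuming a normal prior and two hypothetical elicited judgments (a most plausible value and a value judged to be exceeded with only 5 % probability), the prior standard deviation can be recovered in a few lines of R; the numbers below are illustrative, not taken from any actual elicitation:

```r
# Hypothetical elicited judgments (illustrative numbers):
elicited_mode <- 0.5   # "most plausible effect size"
elicited_q95  <- 0.66  # "only a 5% chance the true effect exceeds this value"

# Assuming a normal prior, the mode is the mean and the elicited 95% quantile
# pins down the standard deviation:
prior_sd <- (elicited_q95 - elicited_mode) / qnorm(0.95)
round(prior_sd, 2)     # approximately 0.1, i.e., delta ~ N(0.5, 0.1) as in Fig. 1
```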

Design priors vs. analysis priors

Two types of priors can be differentiated (Walley et al. 2015; O’Hagan and Stevens 2001). Design priors are used before data collection to quantify prior beliefs about the true state of nature. These design priors are used to do design analyses and in general to assist experimental design. Analysis priors, in contrast, are used for Bayesian statistical analysis after the data are in.

At first glance it might appear straightforward to use the same priors for design planning and for data analysis. Both types of priors, however, serve different goals. The design prior is used to tune the design before data collection, to make compelling evidence likely and to avoid misleading evidence. The target audience for a design analysis is mainly the researcher him- or herself, who wants to design the most informative study. Hence, design priors should be based on the researcher's experience and can incorporate a lot of existing prior information to aid optimal planning of the study's design. Relying on a non-central, highly informative prior (in the extreme case, a point effect size guess as in classical power analysis) can result in a highly efficient design (i.e., with a just large-enough sample size) if the real effect size is close to that guess. On the other hand, it bears the risk of ending up with inconclusive evidence if the true effect is actually smaller. A less informative design prior, in contrast, will typically lead to larger planned sample sizes, as more plausibility is assigned to smaller effect sizes. This increases the chances of compelling evidence in the actual data analysis, but can be inefficient compared to a design that uses a more precise (and valid) effect size guess. Researchers may balance this trade-off based on their subjective certainty about plausible effect sizes, utilities of successful or failed studies, or budget constraints. Whenever prospective design analyses are used to motivate sample size costs in grant applications, the design priors should also be convincing to the funder and the grant reviewers.

The analysis priors that are used to compute the BF, in contrast, should be convincing to a skeptical target audience, and therefore often are less informative than the design priors. In the examples of this paper, we will use an informed, non-central prior distribution for the planning stage, but a default effect size prior (which is less informative) for data analysis.

Three exemplary designs for a Bayes factor design analysis

In the next sections, we will demonstrate how to conduct a Bayes Factor Design Analysis. We consider three design perspectives:

1. Fixed-n design: In this design, a sample of fixed size is collected and one data analysis is performed at the end. From this perspective, one can ask the following design-related questions: Given a fixed sample size and the expected effect size – what BFs can be expected? What sample size do I need to have at least a 90 % probability of obtaining a BF10 of, say, 6 or greater? What is the probability of obtaining misleading evidence?

2. Open-ended sequential designs: Here participants are added to a growing sample and BFs are computed until a desired level of evidence is reached (Schönbrodt et al. 2015). As long as researchers do not run out of participants, time, or money, this approach eliminates the possibility of ending up with weak evidence. With this design, one can ask the following design-related questions: Given the desired level of evidence and the expected effect size – what distribution of sample sizes can be expected? What is the probability of obtaining misleading evidence?

3. Sequential designs with maximal n: In this modification of the open-ended SBF design, participants are added until (a) a desired level of evidence is obtained, or (b) a maximum number of participants has been reached. If sampling is stopped because of (b), the evidence will not be as strong as desired initially, but the direction and the strength of the BF can still be interpreted. With this design, one can ask the following design-related questions: Given the desired level of evidence, the expected effect size, and the maximum sample size – what distribution of sample sizes can be expected? How many studies can be expected to stop because of crossing the evidential threshold, and how many because \(n_{\max}\) has been reached? What is the probability of obtaining misleading evidence?

As most design planning concerns directional hypotheses, we will focus on these in this paper. Furthermore, in our examples we use the JZS default Bayes factor for a two-group t-test provided in the BayesFactor package (Morey and Rouder 2015) for the R Environment for Statistical Computing (R Core Team 2014) and in JASP (JASP Team, 2016). The JZS Bayes factor assumes that effect sizes under \(\mathcal {H}_{1}\) (expressed as Cohen’s d) follow a central Cauchy distribution (Rouder et al. 2009). The Cauchy distribution with a scale parameter of 1 equals a t distribution with one degree of freedom. This prior has several convenient properties and can be used as a default choice when no specific information about the expected effect sizes is available. The width of the Cauchy distribution can be tuned using the scale parameter, which corresponds to smaller or larger plausible effect sizes. In our examples below, we use a default scale parameter of \(\sqrt {2}/2\). This corresponds to the prior expectation that 50 % of probability mass is placed on effect sizes that have an (absolute) size smaller than \(\sqrt {2}/2\), and 50 % larger than \(\sqrt {2}/2\). Note that all computations and procedures outlined here are not restricted to these specific choices and can be easily generalized to undirected tests and all other flavors of Bayes factors as well (Dienes 2014).
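To make the 50 % statement concrete, this property of the central Cauchy prior can be verified with a single line of base R (a quick check, not part of the original analyses):

```r
# For a central Cauchy prior with scale r, exactly half of the prior mass
# falls on effect sizes with |delta| < r; check for r = sqrt(2)/2:
r <- sqrt(2) / 2
pcauchy(r, location = 0, scale = r) - pcauchy(-r, location = 0, scale = r)  # 0.5
```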

Fixed-n design

With a pre-determined sample size, the following questions can be asked in a design analysis: (a) What is the expected distribution of obtained evidence? (b) What is the probability of obtaining misleading evidence? (c) Sample size determination: What is the necessary sample size such that compelling evidence can be expected with sufficiently high probability?

Monte Carlo simulations can be used to answer these questions easily. In our example, we focus on a test for the difference between two population means (i.e., a Bayesian t-test; Rouder et al. 2009). For didactic purposes, we demonstrate this design analysis with a fixed expected effect size (i.e., without a prior distribution). This way the design analysis is analogous to a classical power analysis in the NHST paradigm, which also assumes a fixed effect size under \(\mathcal {H}_{1}\).

The recipe for our Monte Carlo simulations is as follows (see also Kruschke 2014):

1. Define a population that reflects the expected effect size under \(\mathcal {H}_{1}\) and, if prior information is available, other properties of the real data (e.g., specific distributional properties). In the example given below, we used two populations with normal distributions and a fixed standardized mean difference of δ = 0.5.

2. Draw a random sample of size \(n_{\text{fixed}}\) from the populations (all n refer to sample size in each group).

3. Compute the BF for that simulated data using the analysis prior that will also be used in the actual data analysis and save the result. In the example given below, we analyzed simulated data with a Cauchy prior (scale parameter = \(\sqrt {2}/2\)).

4. Repeat steps 2 and 3, say, 10,000 times.

5. In order to compute the probability of false-positive evidence, the same simulation must be done under \(\mathcal {H}_{0}\) (i.e., two populations that have no mean difference).

Researchers do not know in advance whether and to what extent the data will support \(\mathcal {H}_{1}\) or \(\mathcal {H}_{0}\); therefore, all simulations must be carried out both under \(\mathcal {H}_{1}\) and \(\mathcal {H}_{0}\) (see step 5). Figure 2 provides a flow chart of the simulations that comprise a Bayes factor design analysis. For standard designs, readers can conduct their own design analysis simulations using the R package BFDA (Schönbrodt, 2016; see https://github.com/nicebread/BFDA).
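The following sketch shows how steps 1 to 5 of this recipe could be translated into R with the BayesFactor package; the function name simulate_fixed_n and the iteration counts are illustrative, and the BFDA package provides a more complete implementation:

```r
# Minimal sketch of a fixed-n BFDA simulation for a two-group Bayesian t-test.
library(BayesFactor)

simulate_fixed_n <- function(n, delta, n_sim = 10000, rscale = sqrt(2)/2) {
  replicate(n_sim, {
    g1 <- rnorm(n, mean = delta, sd = 1)  # population under H1 (delta = 0 gives H0)
    g2 <- rnorm(n, mean = 0,     sd = 1)
    extractBF(ttestBF(x = g1, y = g2, rscale = rscale))$bf  # BF10 with the analysis prior
  })
}

bf_H1 <- simulate_fixed_n(n = 20, delta = 0.5)  # simulations under H1 (steps 1-4)
bf_H0 <- simulate_fixed_n(n = 20, delta = 0)    # the same simulation under H0 (step 5)

# Categorize the resulting BF distributions at evidential thresholds 1/6 and 6:
mean(bf_H1 > 6)    # probability of true positive evidence
mean(bf_H1 < 1/6)  # probability of false negative evidence
mean(bf_H0 > 6)    # probability of false positive evidence
```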

Fig. 2 Overview of the planning and analysis stages in a Monte Carlo Bayes factor design analysis. Figure available at https://osf.io/qny5x, under a CC-BY4.0 license

The proposed simulations provide a distribution of obtained BFs under \(\mathcal {H}_{1}\), and another distribution under \(\mathcal {H}_{0}\). For these distributions, one can set several thresholds and retrieve the probability that a random study will provide a BF in a certain evidential category. For example, one can set a single threshold at BF10 = 1 and compute the probability of obtaining a BF with the wrong direction. Or, one can aim for more compelling evidence and set thresholds at BF10 = 6 and BF10 = 1/6. This means evidence is deemed inconclusive when 1/6<BF10<6. Furthermore, one can define asymmetric thresholds under \(\mathcal {H}_{0}\) and \(\mathcal {H}_{1}\). Depending on the analysis prior in the computation of the BF, it can be expensive and time-consuming to gather strong evidence for \(\mathcal {H}_{0}\). In these cases one can relax the requirements for strong \(\mathcal {H}_{0}\) support and still aim for strong \(\mathcal {H}_{1}\) support, for example by using thresholds 1/6 and 20 (Weiss 1997).

Expected distribution of BFs and rates of misleading evidence

Figure 3 compares the BF10 distribution that can be expected under \(\mathcal {H}_{1}\) (top row) and under \(\mathcal {H}_{0}\) (bottom row). The simulations were conducted with two fixed sample sizes: n = 20 (left column) and n = 100 (right column). Evidence thresholds were defined at 1/6 and 6. If an effect of δ = 0.5 exists and studies with n = 20 are conducted, 0.3 % of all simulated studies point towards the (wrong) \(\mathcal {H}_{0}\) (BF < 1/6). This is the rate of false negative evidence, and it is visualized as the dark grey area in the top density of Fig. 3A. Conversely, 21.1 % of studies show \(\mathcal {H}_{1}\) support (BF10> 6; light gray area in the top density), which is the probability of true positive results. The remaining 78.5 % of studies yield inconclusive evidence (1/6 <BF10< 6; medium grey area in the top density).

Fig. 3 Distributions of BF10 for a fixed-n design with a true effect size of δ = 0.5 under \(\mathcal {H}_{1}\) and a fixed n of 20 (left column), resp. 100 (right column). Distributions were categorized at BF thresholds of 1/6 and 6. Figure available at https://osf.io/qny5x, under a CC-BY4.0 license

If, however, no effect exists (see bottom density of Fig. 3A), 0.9 % of all studies will yield false-positive evidence (BF10>6), and 13.7 % of all studies correctly support \(\mathcal {H}_{0}\) with the desired strength of evidence (BF10<1/6). A large majority of studies (85.5 %) remain inconclusive under \(\mathcal {H}_{0}\) with respect to that threshold. Hence, a design with that fixed sample size has a high probability of being uninformative under \(\mathcal {H}_{0}\).

With increasing sample size the BF distributions under \(\mathcal {H}_{1}\) and \(\mathcal {H}_{0}\) diverge (see Fig. 3B), making it more likely to obtain compelling evidence for either hypothesis. Consequently, the probability of misleading evidence and the probability of inconclusive evidence are both reduced. At n = 100 and evidential thresholds of 6 and 1/6, the rate of false negative evidence drops from 0.3 % to virtually 0 %, and the rate of false positive evidence drops from 0.9 % to 0.6 %. The probability of detecting an existing effect of δ = 0.5 increases from 21.1 % to 84.0 %, and the probability of finding evidence in favor of a true \(\mathcal {H}_{0}\) increases from 13.7 % to 53.4 %.

Sample size determination

For sample size determination, simulated sample sizes can be adjusted until the computed probability of achieving a research goal under \(\mathcal {H}_{1}\) is close to the desired level. In our example, the necessary sample size for achieving a BF10>6 under \(\mathcal {H}_{1}\) with a probability of 95 % would be n = 146. Such a fixed-n Bayes factor design with n = 146 implies a false negative rate of virtually 0 %, and, under \(\mathcal {H}_{0}\), a false positive rate of 0.4 % and a probability of 61.5 % of correctly supporting \(\mathcal{H}_{0}\).
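A sketch of this search, again with illustrative names and reduced iteration counts (the BFDA package automates it), simply re-runs the fixed-n simulation for increasing n until the target probability is reached:

```r
# Simulation-based sample size determination: find the smallest n for which
# P(BF10 > 6 | H1, delta = 0.5) reaches the desired 95%.
library(BayesFactor)

prob_compelling <- function(n, delta = 0.5, n_sim = 2000, rscale = sqrt(2)/2) {
  bfs <- replicate(n_sim,
    extractBF(ttestBF(x = rnorm(n, delta), y = rnorm(n, 0), rscale = rscale))$bf)
  mean(bfs > 6)
}

for (n in seq(120, 160, by = 10)) {
  cat("n =", n, " P(BF10 > 6) =", prob_compelling(n), "\n")
}
```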

From a pre-data design perspective, the focus is on the frequentist properties of BFs. This can be complemented by investigating the Bayesian properties of BFs. From that perspective, one can look at the probability of a hypothesis being true given a certain BF (Rouder 2014). When \(\mathcal {H}_{1}\) and \(\mathcal {H}_{0}\) have equal prior probability, and when the analysis prior equals the design prior, then a single study with a BF10 of, say, 6 has 6:1 odds of stemming from \(\mathcal {H}_{1}\).
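As an illustration, plugging equal prior odds and a Bayes factor of 6 into Eq. (1) (written here in terms of \(\mathcal{H}_{1}\) over \(\mathcal{H}_{0}\), i.e., using BF10) gives

$$\frac{p(\mathcal{H}_{1} \mid \text{data})}{p(\mathcal{H}_{0} \mid \text{data})} = \frac{p(\mathcal{H}_{1})}{p(\mathcal{H}_{0})} \times \text{BF}_{10} = 1 \times 6 = 6, \qquad \text{so that } p(\mathcal{H}_{1} \mid \text{data}) = \tfrac{6}{7} \approx .86.$$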

The goal of obtaining strong evidence can be achieved by planning a sample size that ensures a strong enough BF with sufficient probability. There is, however, an easier way that guarantees compelling evidence: Sample sequentially and compute the BF until the desired level of evidence is achieved. This design will be explained in the next section.

Open-ended Sequential Bayes Factor design: SBF

In the planning phase of an experiment, it is often difficult to decide on an expected or minimally interesting effect size. If the effect size assumed in the planning stage is smaller than the true effect size, the planned fixed n will be larger than necessary and hence inefficient. More often, presumably, the effect size is overestimated in the planning stage, leading to a lower actual probability of detecting a true effect.

A proposed solution that is less dependent on the true effect size is the Sequential Bayes Factor (SBF) design (Schönbrodt et al. 2015). In this design, the sample size is increased until the desired level of evidence for \(\mathcal {H}_{1}\) or \(\mathcal {H}_{0}\) has been reached (see also Wald 1945; Kass & Raftery 1995; Berger et al. 1994; Dienes 2008; Lindley 1956). This principle of “accumulation of evidence” is also central to optimal models of human perceptual decision making (e.g., random walk models, diffusion models; Bogacz et al. 2006; Forstmann et al. 2016). This accumulation principle allows a flexible adaptation of the sample size based on the actual empirical evidence.

In the planning phase of an SBF design, researchers define an a priori threshold that represents the desired grade of evidence, for example a BF10 of 6 for \(\mathcal {H}_{1}\) and the reciprocal value of 1/6 for \(\mathcal {H}_{0}\). Furthermore, an analysis prior for the effect sizes under \(\mathcal {H}_{1}\) is defined in order to compute the BF. Finally, the researcher may determine a minimum number of participants to be collected regardless, before the optional stopping phase of the experiment (e.g., \(n_{\min}\) = 20 per group).

After a sample of \(n_{\min}\) participants has been collected, a BF is computed. If this BF does not exceed the \(\mathcal {H}_{1}\) threshold or the \(\mathcal {H}_{0}\) threshold, the sample size is increased as often as desired and a new BF is computed at each stage (even after each participant). As soon as one of the thresholds is reached or exceeded, sampling can be stopped. One prominent advantage of sequential designs is that sample sizes are in most cases smaller than those from fixed-n designs with the same error rates. For example, in typical scenarios the SBF design for comparing two group means yielded about 50 % smaller samples on average compared to the optimal NHST fixed-n design with the same error rates (Schönbrodt et al. 2015).

With regard to design analysis in a SBF design, one can ask: (a) What is the probability of obtaining misleading evidence by stopping at the wrong threshold? (b) What is the expected sample size until an evidential threshold is reached?

In the example for the SBF design, we use a design prior for the a priori effect size estimate: \(\delta \sim \mathcal {N} (0.5, \sigma = 0.1)\) (see Fig. 1). In our hypothetical scenario this design prior is inspired by relevant substantive knowledge or results from the published literature. Again, Monte Carlo simulations were used to examine the operational characteristics of this design:

1. Define a population that reflects the expected effect size under \(\mathcal {H}_{1}\) and, if prior information is available, other properties of the real data. In the example given below, we used two populations with normal distributions and a standardized mean difference that has been drawn from a normal distribution \(\mathcal {N} (0.5, \sigma = 0.1)\) at each iteration.

2. Draw a random sample of size \(n_{\min}\) from the populations.

3. Compute the BF for that simulated data set, using the analysis prior that will also be used in the actual data analysis (in our example: a Cauchy prior with scale parameter = \(\sqrt {2}/2\)). If the BF exceeds the \(\mathcal {H}_{1}\) or the \(\mathcal {H}_{0}\) threshold (in our example: > 6 or < 1/6), stop sampling, and save the final BF and the current sample size. If the BF does not exceed a threshold yet, increase sample size (in our example: by 1 in each group). Repeat step 3 until one of the two thresholds is exceeded.

4. Repeat steps 1 to 3, say, 10,000 times.

5. In order to compute the rate of false-positive evidence and the expected sample size under \(\mathcal {H}_{0}\), the same simulation must be done under \(\mathcal {H}_{0}\) (i.e., two populations that have no mean difference).
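A compact sketch of this sequential recipe in R is given below; the function name run_sbf_study and the iteration counts are illustrative, and the BFDA package implements the full procedure:

```r
# Minimal sketch of one iteration (steps 1-3) of the open-ended SBF simulation.
library(BayesFactor)

run_sbf_study <- function(n_min = 20, step = 1, boundary = 6,
                          rscale = sqrt(2)/2, delta_mean = 0.5, delta_sd = 0.1) {
  delta <- rnorm(1, delta_mean, delta_sd)  # step 1: draw true effect from the design prior
  g1 <- rnorm(n_min, mean = delta)         # step 2: initial sample of n_min per group
  g2 <- rnorm(n_min, mean = 0)
  repeat {                                 # step 3: test, then stop or add participants
    bf10 <- extractBF(ttestBF(x = g1, y = g2, rscale = rscale))$bf
    if (bf10 > boundary || bf10 < 1/boundary) break
    g1 <- c(g1, rnorm(step, mean = delta))
    g2 <- c(g2, rnorm(step, mean = 0))
  }
  c(n = length(g1), bf10 = bf10)
}

# Steps 4-5: repeat many times (reduced here) and summarize the stopping behavior.
res <- replicate(1000, run_sbf_study())
mean(res["bf10", ] > 6)          # proportion of studies stopping at the H1 boundary
quantile(res["n", ], c(.5, .8))  # median and 80% quantile of the stopping-n
```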

This design can completely eliminate weak evidence, as data collection is continued until evidence is conclusive in either direction. The consistency property ensures that BFs ultimately drift either towards 0 or towards infinity, and every study ends up producing compelling evidence – unless researchers run out of time, money, or participants (Edwards et al. 1963). We call this design “open-ended” because there is no fixed termination point defined a priori (in contrast to the SBF design with maximal sample size, which is outlined below). “Open-ended”, however, does not imply that data collection can continue forever without hitting a threshold; rather, the consistency property of BFs guarantees that the probability of collecting samples indefinitely without hitting a threshold is zero.

Figure 4 (top) visualizes the evolution of the BF10 in several studies where the true effect size follows the prior distribution displayed in Fig. 1. Each grey line in the plot shows how the BF10 of a specific study evolves with increasing n. Some studies hit the (correct) \(\mathcal {H}_{1}\) boundary sooner, some later, and the distribution of stopping-ns is visualized as the density on top of the \(\mathcal {H}_{1}\) boundary. Although all trajectories are guaranteed to drift towards and across the correct threshold in the limiting case, some hit the wrong \(\mathcal {H}_{0}\) threshold prematurely. Most misleading evidence happens at early stages of the sequential sampling. Consequently, increasing \(n_{\min}\) also decreases the rate of misleading evidence (Schönbrodt et al. 2015). Figure 4 (bottom) shows the same evolution of BFs under \(\mathcal {H}_{0}\).

Fig. 4 The open-ended Sequential Bayes Factor design. The density of sample sizes at the stopping point and the example trajectories are based on a true effect size of \(\delta \sim \mathcal {N} (0.5, \sigma = 0.1)\) under \(\mathcal {H}_{1}\) and evidential thresholds at 6 and 1/6. Figure available at https://osf.io/qny5x, under a CC-BY4.0 license

Expected rates of misleading evidence

If one updates the BF after each single participant under this \(\mathcal {H}_{1}\) of \(\delta \sim \mathcal {N} (0.5, \sigma = 0.1)\) and evidential thresholds at 6 and 1/6, 97.2 % of all studies stop at the correct \(\mathcal {H}_{1}\) threshold (i.e., the true positive rate), and 2.8 % stop incorrectly at the \(\mathcal {H}_{0}\) threshold (i.e., the false negative rate). Under \(\mathcal {H}_{0}\), 93.8 % terminate at the correct \(\mathcal {H}_{0}\) threshold, and 6.2 % at the incorrect \(\mathcal {H}_{1}\) threshold (i.e., the false positive rate).

The algorithm above computes the BF after each single participant. The more often a researcher checks whether the BF has exceeded the thresholds, the higher the probability of misleading evidence, because the chance increases that sampling stops at a random extreme value. In contrast to NHST, however, where the probability of a Type-I error can be pushed towards 100 % if enough interim tests are performed (Armitage et al. 1969), the rate of misleading evidence has an upper limit in the SBF design. When the simulations are conducted with interim tests after each single participant, one obtains this upper bound on the rate of misleading evidence. In the current example this leads to a maximal FPE rate of 6.2 %. If the BF is computed after every 5 participants, the rate is reduced to 5.2 %; after every 10 participants, to 4.5 %. It should be noted that these changes in FPE rate are, from an inferential Bayesian perspective, irrelevant (Rouder 2014).

Expected sample size

In the above example, the average sample size at the stopping point (across both threshold hits) under \(\mathcal {H}_{1}\) is n = 53, the median sample size is n = 36, and 80 % of all studies stop with fewer than 74 participants. Under \(\mathcal {H}_{0}\), the sample size is on average 93, median = 46, and 80 % quantile = 115. Hence, although the SBF design has no a priori defined upper limit of sample size, the prospective design analysis reveals estimates of the expected sample sizes.

Furthermore, this example highlights the efficiency of the sequential design. A fixed-n Bayes factor design that also aims for evidence with BF10≥6 (resp. ≤1/6) with the same true positive rate of 97.2 % requires n = 241 participants (but will have different rates of misleading evidence).

Sequential Bayes factor with maximal n: SBF+maxN

The SBF design is attractive because a study is guaranteed to end up with compelling evidence. A practical drawback of the open-ended SBF design, however, is that the BF can meander in the inconclusive region for hundreds or even thousands of participants when effect sizes are very small (Schönbrodt et al. 2015). In practice, researchers do not have unlimited resources, and usually want to set a maximum sample size based on budget, time, or availability of participants.

The SBF+maxN design extends the SBF design with such an upper limit on the sample size. Data collection is stopped whenever one of the two evidential thresholds has been exceeded, or when the a priori defined maximal sample size has been reached. When sampling is stopped because \(n_{\max}\) has been reached, one can still interpret the final BF. Although it has not reached the threshold for compelling evidence, its direction and strength can still be interpreted.

When planning an SBF+maxN design, one can ask: (a) How many studies can be expected to stop because of crossing an evidential threshold, and how many because of reaching \(n_{\max}\)? (b) What is the probability of obtaining misleading evidence? (c) If sampling stopped at \(n_{\max}\): How many of these studies have a BF that points in the correct direction? (d) What distribution of sample sizes can be expected?

Again, Monte Carlo simulations can be used to examine the operational characteristics of this design. The computation is equivalent to the SBF design above, with the only exception that step 3 is terminated when the BF exceeds the \(\mathcal {H}_{1}\) or \(\mathcal {H}_{0}\) threshold, or n reaches \(n_{\max}\).
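The corresponding change to the simulation sketch above is a single additional termination condition in the sampling loop; the function name run_sbf_maxn is illustrative, and the default settings match the example described next (asymmetric thresholds of 30 and 1/6, \(n_{\min}\) = 40, \(n_{\max}\) = 100 per group):

```r
# Sketch of the SBF+maxN stopping rule: identical to the open-ended SBF loop,
# except that sampling also stops once n_max participants per group are reached.
library(BayesFactor)

run_sbf_maxn <- function(delta, n_min = 40, n_max = 100, upper = 30, lower = 1/6,
                         rscale = sqrt(2)/2) {
  g1 <- rnorm(n_min, mean = delta)
  g2 <- rnorm(n_min, mean = 0)
  repeat {
    bf10 <- extractBF(ttestBF(x = g1, y = g2, rscale = rscale))$bf
    if (bf10 > upper || bf10 < lower || length(g1) >= n_max) break  # extra n_max check
    g1 <- c(g1, rnorm(1, mean = delta))
    g2 <- c(g2, rnorm(1, mean = 0))
  }
  c(n = length(g1), bf10 = bf10)
}
```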

To highlight the flexibility and practicality of the SBF+maxN design, we consider a hypothetical scenario in which a researcher intends to test as efficiently as possible, has practical limitations on the maximal sample size, and wants to keep the rate of false positive evidence low. To achieve this goal, we introduce some changes to the example from the open-ended SBF design above: Asymmetric boundaries, a different minimal sample size, and a maximum sample size.

False positive evidence happens when the \(\mathcal {H}_{1}\) boundary is hit prematurely although \(\mathcal {H}_{0}\) is true. As most misleading evidence happens at early terminations of a sequential design, the FPE rate can be reduced by increasing \(n_{\min}\) (say, \(n_{\min}\) = 40). Furthermore, the FPE rate can be reduced by a high \(\mathcal {H}_{1}\) threshold (say, BF10 ≥ 30). With an equally strong threshold for \(\mathcal {H}_{0}\) (1/30), however, the expected sample size can easily go into the thousands under \(\mathcal {H}_{0}\) (Schönbrodt et al. 2015). To avoid such a protraction, the researcher may set a lenient \(\mathcal {H}_{0}\) threshold of BF10 < 1/6. Finally, due to budget restrictions, the maximum affordable sample size is defined as \(n_{\max}\) = 100. With these settings, the researcher trades in a higher expected rate of false negative evidence (caused by the lenient \(\mathcal {H}_{0}\) threshold) and some probability of weak evidence (when the study is terminated at \(n_{\max}\)) for a smaller expected sample size, a low rate of false positive evidence, and the certainty that the sample size does not exceed \(n_{\max}\).

To summarize, in this final example we set evidential thresholds for BF10 at 30 and 1/6, \(n_{\min}\) = 40, and \(n_{\max}\) = 100. The uncertainty about the effect size under \(\mathcal {H}_{1}\) is expressed as \(\delta \sim \mathcal {N} (0.5, \sigma = 0.1)\). Figure 5 visualizes the trajectories and stopping point distributions under \(\mathcal {H}_{1}\) (results under \(\mathcal {H}_{0}\) not shown). The upper and lower densities show the distribution of n for all studies that hit a threshold. The distribution on the right shows the distribution of BF10 for all studies that stopped at \(n_{\max}\).

Fig. 5 The Sequential Bayes Factor With Maximal n design under \(\mathcal {H}_{1}\) (results under \(\mathcal {H}_{0}\) not shown). The densities and example trajectories are based on a true effect size of \(\delta \sim \mathcal {N} (0.5, \sigma = 0.1)\), evidential thresholds at 30 and 1/6, and \(n_{\max }\) = 100 in each group. Figure available at https://osf.io/qny5x, under a CC-BY4.0 license

Expected stopping threshold (\(\mathcal {H}_{1}\), \(\mathcal {H}_{0}\), or \(n_{\max}\)) and expected rates of misleading evidence

Under \(\mathcal {H}_{1}\) of this example, 70.6 % of all studies hit the correct \(\mathcal {H}_{1}\) threshold (i.e., the true positive rate), and 1.6 % hit the wrong \(\mathcal {H}_{0}\) threshold (i.e., the false negative rate). The remaining 27.8 % of studies stopped at \(n_{\max}\) and remained inconclusive with respect to the a priori set thresholds.

One goal in the example was a low FPE rate. Under \(\mathcal {H}_{0}\) (not displayed), 70.9 % of all studies hit the correct \(\mathcal {H}_{0}\) threshold and 0.6 % hit the wrong \(\mathcal {H}_{1}\) threshold (i.e., the false positive rate). The remaining 28.5 % of studies stopped at \(n_{\max}\) and remained inconclusive with respect to the a priori set thresholds.

Again, these are the maximum rates of misleading evidence, which apply when a test is computed after each participant. More realistic sequential tests, such as testing after every 10 participants, will lower these rates.

Distribution of evidence at \(n_{\max}\)

The BF of studies that did not reach the a priori threshold for compelling evidence can still be interpreted. In the current example, we categorize the inconclusive studies into results that show at least moderate evidence for either hypothesis (BF < 1/3 or BF > 3) or are completely inconclusive (1/3 < BF < 3). Of course any other threshold can be used to categorize the non-compelling studies; in general a BF of 3 provides only weak evidence for a hypothesis and implies, from a design perspective, a high rate of misleading evidence (Schönbrodt et al. 2015).

In the current example, under \(\mathcal {H}_{1}\), 15.5 % of all studies terminated at \(n_{\max}\) with a BF10> 3, meaning that these studies correctly indicated at least moderate evidence for \(\mathcal {H}_{1}\). 11.6 % of studies remained inconclusive (1/3<BF10<3), and 0.7 % pointed towards the wrong hypothesis (BF10<1/3). Under \(\mathcal {H}_{0}\), 1.1 % incorrectly pointed towards \(\mathcal {H}_{1}\), 10.8 % towards \(\mathcal {H}_{0}\), and 16.6 % remained inconclusive.

Expected sample size

The average expected sample size under \(\mathcal {H}_{1}\) (combined across all studies, regardless of the stopping condition) is n = 69, with a median of 65. The average expected sample size under \(\mathcal {H}_{0}\) is n = 66, with a median of 56. Hence, under both hypotheses the average expected sample size is considerably lower than \(n_{\max}\), which was defined as n = 100.

Discussion

We explored the concept of a Bayes Factor Design Analysis and how it can help to plan a study for compelling evidence. Pre-data design analyses allow researchers to plan a study in such a way that strong inference is likely. As in frequentist power analysis, one has to find a trade-off between the rates of misleading evidence, the desired probability of achieving compelling evidence, and practical limits concerning sample size. Additionally, in order to compute the expected outcomes of future studies, one has to make one's assumptions about several key parameters explicit, such as the expected effect size under \(\mathcal {H}_{1}\). Any pre-data analysis is conditional on these assumptions, and the validity of the results depends on the validity of the assumptions. If reality does not follow the assumptions, the actual operational characteristics of a design will differ from the results of the design analysis. For example, if the actual effect size is smaller than anticipated, the chosen design will have higher FNE rates than computed and, in the sequential case, larger expected sample sizes until a threshold is reached.

In contrast to p-values, the interpretation of Bayes factors does not depend on stopping rules (Rouder 2014). This property allows researchers to use flexible research designs without the requirement of special and ad-hoc corrections. For example, the proposed SBF+maxN design stops abruptly at \(n_{\max}\). An alternative procedure is one where the evidential thresholds gradually move closer together as n increases. This implies that a lower grade of evidence is accepted when sampling has not already stopped at a strong evidential threshold, and puts a practical (but not fixed) upper limit on sample size (for an application in response time modeling see Boehm et al. 2015). The properties of this special design (or of any sequential or non-sequential BF design) can be evaluated using the same simulation approach outlined in this paper. This further underscores the flexibility and the generality of the sequential Bayesian procedure.

From the planning stage to the analysis stage

This paper covered the planning stage, before data are collected. After a design has been chosen, based on a careful evaluation of its operational characteristics, the actual study is carried out (see also Fig. 2). A design analysis only relates to the actual inference if the same analysis prior is used in the planning stage and in the analysis stage. Additionally, the BF computation in the analysis stage should contain a sensitivity analysis, which shows whether the inference is robust against reasonable variations in the analysis prior.

It is important to note that, in contrast to NHST, the inference drawn from the actual data set is entirely independent of the planning stage (Berger and Wolpert 1988; Wagenmakers et al. 2014; Dienes 2011). All inferential information is contained in the actual data set, the analysis prior, and the likelihood function. Hypothetical studies from the planning stage (that have not been conducted) cannot add anything. From that perspective, it would be perfectly fine to use a different analysis prior in the actual analysis than in the design analysis. This would not invalidate the inference (as long as the chosen analysis prior is defensible); it would just disconnect the pre-data design analysis, which from a post-data perspective is irrelevant anyway, from the actual analysis.

Unbiasedness of effect size estimates

Concerning the sequential procedures described here, some authors have raised concerns that these procedures result in biased effect size estimates (e.g., Bassler et al. 2010; Kruschke 2014). We believe these concerns are overstated, for at least two reasons.

First, it is true that studies that terminate early at the \(\mathcal {H}_{1}\) boundary will, on average, overestimate the true effect. This conditional bias, however, is balanced by late terminations, which will, on average, underestimate the true effect. Early terminations have a smaller sample size than late terminations, and consequently receive less weight in a meta-analysis. When all studies (i.e., early and late terminations) are considered together, the bias is negligible (Fan et al. 2004; Schönbrodt et al. 2015; Goodman 2007; Berry et al. 2010). Hence, across multiple studies the sequential procedure is approximately unbiased.

Second, the conditional bias of early terminations is conceptually equivalent to the bias that results when only significant studies are reported and non-significant studies disappear into the file drawer (Goodman 2007). In all experimental designs –whether sequential, non-sequential, frequentist, or Bayesian– the average effect size inevitably increases when one selectively averages studies that show a larger-than-average effect size. Selective publishing is a concern across the board, and an unbiased research synthesis requires that one considers significant and non-significant results, as well as early and late terminations.

Although sequential designs have negligible unconditional bias, it may nevertheless be desirable to provide a principled “correction” for the conditional bias at early terminations, in particular when the effect size of a single study is evaluated. For this purpose, Goodman (2007) outlines a Bayesian approach that uses prior expectations about plausible effect sizes (see also Pocock and Hughes 1989). This approach shrinks extreme estimates from early terminations towards more plausible regions. Smaller sample sizes are naturally more sensitive to prior-induced shrinkage, and hence the proposed correction fits the fact that most extreme deviations from the true value are found in very early terminations that have a small sample size (Schönbrodt et al. 2015).

Practical considerations

Many granting agencies require a priori computations for the determination of sample size. This ensures that proposers explicitly consider the expected or minimally relevant effect size. Such calculations are also necessary to pinpoint the amount of money requested to pay participants.

The SBF+maxN design seems especially suitable for a scenario where researchers want to take advantage of the high efficiency of a sequential design but still have to define a fixed (maximum) sample size in a proposal. For this purpose, one could compute a first design analysis based on an open-ended SBF design to determine a reasonable \(n_{\max}\). If, for example, the 80 % quantile of the stopping-n distribution is used as \(n_{\max}\) in an SBF+maxN design, one can expect to hit a boundary before \(n_{\max}\) is reached in 80 % of all studies. Although there is a 20 % risk that a study does not reach compelling evidence within the funding limit, this outcome is not a “failure”, as the direction and the size of the final BF can still be interpreted. In a second design analysis one should then examine the characteristics of that SBF+maxN design and evaluate whether the rates of misleading evidence are acceptable.

This approach enables researchers to define an informed upper limit for sample size, which allows them to apply for a predefined amount of money. Still, one can save resources if the evidence is strong enough for an earlier stop, and in almost all cases the study will be more efficient than a fixed-n NHST design with comparable error rates (Schönbrodt et al. 2015).

Conclusion

In the planning phase of a study it is essential to carry out a design analysis in order to formalize one's expectations and facilitate the design of informative experiments. A large body of literature is available on planning frequentist designs, but little practical advice exists for research designs that employ Bayes factors as a measure of evidence. In this contribution we elaborate on three BF designs –a fixed-n design, an open-ended Sequential Bayes Factor (SBF) design, and an SBF design with maximal sample size– and demonstrate how the properties of each design can be evaluated using Monte Carlo simulations. Based on an analysis of the operational characteristics of a design, the specific settings of the research design can be balanced in such a way that compelling evidence is a likely outcome of the to-be-conducted study, misleading evidence is an unlikely outcome, and sample sizes stay within practical limits.