Commentary, Novel Tools and Methods

Using Simulations to Explore Sampling Distributions: An Antidote to Hasty and Extravagant Inferences

Guillaume A. Rousselet
eNeuro 23 October 2025, 12 (10) ENEURO.0339-25.2025; https://doi.org/10.1523/ENEURO.0339-25.2025
Guillaume A. Rousselet
School of Neuroscience & Psychology, University of Glasgow, United Kingdom

Abstract

Most statistical inferences in neuroscience and psychology are based on frequentist statistics, which rely on sampling distributions: the long-run outcomes of multiple experiments, given a certain model. Yet, sampling distributions are poorly understood and rarely explicitly considered when making inferences. In this tutorial and commentary, I demonstrate how to use simulations to illustrate sampling distributions to answer simple practical questions: for instance, if we could run thousands of experiments, what would the outcome look like? What do these simulations tell us about the results from a single experiment? Such simulations can be run a priori, given expected results, or a posteriori, using existing datasets. Both approaches can help make explicit the data generating process and the sources of variability; they also reveal the large uncertainty in our experimental estimation and lead to the sobering realization that, in most situations, we should not make a big deal out of results from a single experiment. Simulations can also help demonstrate how the selection of effect sizes conditional on some arbitrary cutoff (p ≤ 0.05) leads to a literature filled with false positives, a powerful illustration of the damage done in part by researchers’ over-confidence in their statistical tools. The tutorial focuses on graphical descriptions and covers examples using correlation analyses, proportion data, and response latency data. All the figures and numerical values in this article can be reproduced using code available at https://github.com/GRousselet/sampdist.

  • correlation
  • ERP
  • estimation
  • measurement precision
  • reaction times
  • statistical power

Significance Statement

We all agree that the brain is complex and that statistical modeling is hard, yet, in the literature, it is common to see data from a single experiment analyzed using a simplistic and inappropriate model, followed by sweeping claims based on statistical significance. A potent cure to mindless statistical rituals and over-confidence in our tools is to consider a long-run perspective: what could the results look like if we carried out not one, but thousands of experiments? We can answer this question by using relatively simple simulations to learn more about our data and our tools. This tutorial will help you start on this worthwhile journey.

Introduction

The typical neuroscience or psychology article reports data from one or a few experiments. Inferences are usually made about some unspecified population using frequentist statistics. An arbitrary cutoff is then used to dichotomize effects as significant or not, and to claim a discovery, irrespective of other sources of information (McShane et al., 2019). This mostly mindless ritual gives the illusion of certainty despite the noise and variability inherent to data collection and analysis (Gigerenzer and Marewski, 2015; Gelman, 2018; Yarkoni, 2022). A large part of the so-called replication crisis is probably due to this over-confidence, stemming from the erroneous belief that statistical methods can deliver the truth. In practice, most discoveries can only be made in the long run, following an extensive program of research (Morey, 2018). This long-run perspective is often lost in the flashy short papers that claim a discovery based on one noisy sample. It is therefore essential for researchers to be aware of this long-run perspective. One of the most efficient ways to achieve this goal is to perform simulations (Morris et al., 2019; DeBruine and Barr, 2021; DeBruine, 2025). Here we look at sampling distributions to consider results not from one experiment but from thousands of them. This can be done using synthetic or real data, and the resulting sampling distributions can be used to answer useful questions. In this article, after defining sampling distributions, I propose a series of examples exploring estimation variability across simulated experiments. We will consider analyses of correlations, proportion data, and two types of response latency data. The examples are relatively simple, but they help demonstrate the potentially large benefits of learning to write, and taking the time to write, a few lines of code to simulate experimental results. By engaging in this type of exercise, we must explicitly consider the data generating process, including the experimental design and the shape of the population distributions we sample from when we carry out experiments (Gelman et al., 2020; McElreath, 2020). We also directly visualize the effect of sample sizes. As a result, the simulation process should ultimately lead to better planning of experiments, a better understanding and acceptance of uncertainty, a more modest interpretation of results, and a healthy skepticism of published results (Vasishth and Gelman, 2021).

Sampling Distributions

Sampling distributions are at the core of inferential frequentist statistics. Indeed, frequentist statistics deal with the long-run outcomes of imaginary experiments (Wagenmakers, 2007; Kruschke, 2013). For instance, when performing a t test, we use the sampling distribution of t values assuming that there is no effect to compute the probability of observing a t value at least as extreme as the one we obtained based on our sample—the so-called p value (Greenland et al., 2016). The point is that all users of frequentist statistics already use sampling distributions, although they might not know it (Bayesians also use them when setting informed priors from the literature). But there is more to sampling distributions than t tests and other standardized statistics. Sampling distributions can be used to learn about the long-run behavior of certain quantities over many experiments. For instance, we can ask what we can expect to observe if we did thousands of experiments.
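
As a minimal illustration of this use of a sampling distribution (with arbitrary, hypothetical numbers), the p value of a one-sample t test is simply a tail area of the null sampling distribution of t, which R returns directly:

```r
# Hypothetical example: a one-sample t test with n = 20 gave t = 2.5.
# The two-sided p value is the area of the null sampling distribution of t
# (df = n - 1) at least as extreme as the observed value.
t_obs <- 2.5
n <- 20
2 * pt(-abs(t_obs), df = n - 1)  # ~0.022
```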

A sampling distribution is essentially the outcome of a simulation in which the same experiment is carried out many times (Baguley, 2012). To illustrate, let's imagine that we perform experiments in which we sample from the skewed distribution in Figure 1A. This distribution has mean μ = 1.13 and standard deviation σ = 0.604, and like many quantities we measure in neuroscience and psychology, it takes only positive values. From these population parameters, the standard error of the mean (SEM) is defined as σ/√n; thus 0.191 for n = 10, 0.135 for n = 20, and 0.085 for n = 50. The σ population value is not usually observable, so it is estimated from random samples of observations (aka experiments), using the sample standard deviation (SD), such that SEM = SD/√n.

Figure 1.

Sampling from a lognormal distribution. A, Lognormal distribution with parameters log μ = 0, log σ = 0.5. The corresponding population mean and standard deviation are μ = 1.13, σ = 0.604. The vertical line marks μ. The insets contain the standard errors of the mean (SEM) for three sample sizes. B, Scatterplots of 10 random samples from A. The long vertical line marks the population mean. Each disk is an observation. The short vertical lines mark the sample means. C, Sampling distributions of the mean for 20,000 samples of sizes n = 10, n = 20, n = 50 observations. For each sample size, the standard deviation of the sampling distribution of the sample mean is equal to the SEM. This figure was created using the R notebook lognormal.Rmd.

Random samples of 10 observations are illustrated in Figure 1B. Each of these samples has a different sample mean, a different sample SD, and therefore a different estimated SEM. This seems obvious, but panel B reminds us that there can be a lot of variability across experiments measuring the same phenomenon, something that is easy to forget when considering the outcome of a single experiment.

Instead of a few virtual experiments, we can perform many of them to see how the sample mean is distributed in the long run. Figure 1C shows the outcome of 20,000 experiments in which we randomly sample n observations from the lognormal population (Fig. 1A) and compute the sample mean. The standard deviations of these sampling distributions are almost equal to the population SEMs (compare insets in panels A and C—the small differences are due to the limited number of simulation iterations). This is because the two quantities are the same. The SEM is simply the SD of the sampling distribution of the sample mean (this video clearly explains sampling distributions and the SEM: https://www.youtube.com/watch?v=J1twbrHel3o). Hence, whereas the sample SD tells us about the variability of observations about the sample mean, the SEM tells us about the variability of the sample mean about the population mean (Howell, 2013).
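
The following sketch reproduces the logic of Figure 1 for n = 10 (the notebook lognormal.Rmd is the definitive version; the seed and other details below are arbitrary):

```r
set.seed(21)
nsim <- 20000
n <- 10
# population parameters of a lognormal(meanlog = 0, sdlog = 0.5) distribution
pop.m   <- exp(0 + 0.5^2 / 2)                   # ~1.13
pop.sd  <- sqrt((exp(0.5^2) - 1) * exp(0.5^2))  # ~0.604
pop.sem <- pop.sd / sqrt(n)                     # ~0.191 for n = 10
# sampling distribution of the sample mean over 20,000 simulated experiments
samp.means <- replicate(nsim, mean(rlnorm(n, meanlog = 0, sdlog = 0.5)))
sd(samp.means)    # close to pop.sem
hist(samp.means)  # positively skewed, as in Figure 1C
```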

We can also make two important observations from the sampling distributions. First, they are skewed when drawn from a lognormal population, even for n = 50, which violates the normality assumption of the one-sample t test, leading to inaccurate p values and confidence intervals (Rousselet et al., 2023; Wilcox and Rousselet, 2023). Second, with increasing sample size, the distributions get narrower, which means that each sample mean is on average closer to the population mean. And that's the main reason to use large sample sizes: to get, on average, closer to the truth. Hence, computing sampling distributions and illustrating them can provide intuitive descriptions of the long-run behavior of a quantity, from which we can assess whether our models are appropriate and grasp how far off our experimental results could be from the truth. In the rest of this article, we focus on the second aspect by exploring sampling distributions in a series of examples, looking at analyses of correlations, proportion data, and response latency measurements.

Correlation Analyses

Although correlation analyses may not be the optimal way to answer the most meaningful questions about one's data (Baguley, 2009, 2010), they are omnipresent in the literature. Like the more flexible and meaningful regression analyses, correlation analyses suffer from two major issues. First, standard techniques are not generally robust to violations of their assumptions. Such violations can considerably affect the estimation of correlation and regression coefficients (Wilcox and Rousselet, 2023). Second, the sample size strongly affects the precision of the estimates, and small sample sizes may yield highly inaccurate estimates. Here we focus on the second issue as it pertains to correlation analyses.

The problem with small sample sizes in correlation analyses is not new. Approximately 16 years ago, Vul et al. warned the community about the prevalence of false positives in brain–behavior correlation analyses. They argued that standard practices led to the so-called voodoo correlations (Vul et al., 2009a,b). In one of several replies to their article, Yarkoni suggested that the main cause of voodoo correlations was the lack of power of studies with small sample sizes (Yarkoni, 2009). Yarkoni's argument is this: correlation estimates from small samples are imprecise, such that even if samples are taken from a population with a zero correlation, large correlation estimates can be expected by chance. However, in small samples, only extreme correlation estimates yield statistically significant tests. Thus, lack of power, combined with publication bias toward new, unexpected positive results, can easily lead to a literature replete with overestimates (Forstmeier et al., 2016). Although this problem is well documented, in my experience, it is still very common to see articles with sample sizes too small to estimate correlations with sufficient precision.

Let's illustrate the problems associated with small sample sizes by looking at sampling distributions. But first, let's start with the example in Figure 2. The sample size is 30, and the estimated Pearson's correlation coefficient r is −0.51. It seems we have discovered a relatively strong association between variables 1 and 2! Unfortunately, this effect will not replicate, because the bivariate data in the scatterplot were sampled from a population with zero correlation (ρ = 0, the Greek letter ρ represents the population correlation coefficient). So the true effect is zero, but our sample leads us to believe otherwise. There is nothing new here: inaccurate effect sizes are a natural outcome of studies with small sample sizes. The problem only gets worse once we add questionable research practices (such as selective reporting) and incentives to publish novel, positive results to the equation. (To be fair, the example in Fig. 2 is the outcome of selective reporting: I generated 20 samples, computed their correlations, and picked the sample with the strongest correlation to make a point!)
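
Here is a sketch of that selective-reporting exercise; the article's corr_sim.Rmd notebook contains the actual code, so the seed and generation details below are assumptions:

```r
set.seed(44)
n <- 30
# 20 bivariate samples from a population with zero correlation
r20 <- replicate(20, cor(rnorm(n), rnorm(n)))
round(r20, 2)             # 20 estimates of a true correlation of 0
r20[which.max(abs(r20))]  # the most extreme, "publishable" looking estimate
```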

Figure 2.

Nice looking but random correlation? The sample size is n = 30. The bivariate sample comes from a population with a true effect size ρ of 0. In other words, we are looking at sampling noise. This figure was created using the R notebook corr_sim.Rmd.

The effect size inflation with small sample sizes might seem counter-intuitive because, if a study lacks power, surely the true effect must be very strong to show up with such a small sample size. However appealing, this conclusion is wrong because it ignores a critical aspect of studies with small sample sizes: they are associated with large sampling variability, which means that estimation precision is poor. This error in reasoning has been described as the “which does not kill statistical significance makes it stronger” fallacy (Loken and Gelman, 2017). The problem becomes clear when we draw samples of different sizes from a normal bivariate population with a known population Pearson’s correlation ρ of 0. The sampling distributions of the estimates of ρ for different sample sizes are shown in Figure 3A, which illustrates that the sample estimates often differ considerably from the population value, particularly in small samples. And the situation gets worse when we compare correlation coefficients (Rousselet et al., 2023).
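
A minimal simulation makes the point: with ρ = 0, the spread of Pearson's r shrinks considerably as n grows (a sketch; Figure 3A covers more sample sizes):

```r
set.seed(1)
nsim <- 10000
r10  <- replicate(nsim, cor(rnorm(10), rnorm(10)))    # n = 10, true rho = 0
r100 <- replicate(nsim, cor(rnorm(100), rnorm(100)))  # n = 100, true rho = 0
quantile(r10,  c(0.025, 0.975))  # roughly -0.63 to 0.63
quantile(r100, c(0.025, 0.975))  # roughly -0.20 to 0.20
```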

Figure 3.

Examples of sampling distributions of correlation coefficients. Results are presented for Pearson's correlation. Similar results would be obtained using Spearman's correlation or some other measure of association. A, Sampling distribution for ρ = 0. B, Sampling distribution for ρ = 0, given p ≤ 0.05. C, Sampling distribution for ρ = 0.4. D, Sampling distribution for ρ = 0.4, given p ≤ 0.05. This figure was created using the R notebook corr_sim.Rmd.

Sampling distributions tell us about the behavior of a statistic in the long run, if we did many studies. With larger sample sizes, the sampling distributions are narrower, which means that the individual estimates tend to be more precise. However, a typical article reports only one correlation estimate, which could be completely off. So what sample size should we use to get a precise estimate? The answer depends on the following:

  1. The shape of the univariate and bivariate distributions

  2. The quantity we want to estimate (by default Pearson, a poor choice)

  3. The true effect size (the larger the effect, the fewer trials are needed—see below)

  4. The precision we want to achieve

For the sampling distributions in Figure 3A, we can calculate the proportion of correlation estimates that are within a certain margin from the population correlation (ρ = 0). For instance, as shown by the black arrows in Figure 4:

  1. For 70% of estimates to be within 0.1 of the true correlation value (between r = −0.1 and 0.1), we need at least 110 observations.

  2. For 90% of estimates to be within 0.2 of the true correlation value (between r = −0.2 and 0.2), we need at least 69 observations.

Figure 4.

Correlation: estimation precision. Proportions of estimates near the true value (ρ = 0), for different sample sizes, and for different levels of precision. This figure was created using the R notebook corr_sim.Rmd. Readers interested in matching curves for statistical power will find them in the R notebook corr_power.Rmd.

Using the full curves in Figure 4, researchers can consider different thresholds and reflect on the trade-off between sample size and measurement precision. The bottom line is that even if we're willing to settle for imprecise estimates (up to 0.2 from a true value of 0), we need plenty of observations to achieve this precision often enough—again, there is no guarantee whatsoever for any particular experiment we conduct. This approach fits well with a growing literature that advocates planning experiments for estimation precision rather than the traditional goal of power (Maxwell et al., 2008; Bland, 2009; Schönbrodt and Perugini, 2013; Gelman and Carlin, 2014; Peters and Crutzen, 2017; Trafimow and MacDonald, 2017; Rothman and Greenland, 2018; Trafimow et al., 2019). Researchers interested in estimation accuracy are also encouraged to consider robust alternatives to Pearson's correlation (Pernet et al., 2013; Wilcox and Rousselet, 2023).
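
A sketch of the precision calculation behind Figure 4 (the margin and sample sizes below are a small subset chosen for illustration):

```r
set.seed(777)
nsim <- 10000
# proportion of correlation estimates within +/- margin of the true value (rho = 0)
precision <- function(n, margin = 0.1) {
  r <- replicate(nsim, cor(rnorm(n), rnorm(n)))
  mean(abs(r) <= margin)
}
sapply(c(30, 70, 110), precision)  # increases with n; ~0.70 for n = 110
```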

When planning experiments, it is important to realize that if we meticulously survey the literature, we are very unlikely to obtain curves similar to the ones plotted in Figure 3A,C, because the literature almost certainly provides a biased estimation of the true population correlations (Ioannidis, 2008; Gelman and Weakliem, 2009). Indeed, consider what happens to a literature in which it is exceedingly difficult to publish nonsignificant findings (Smaldino and McElreath, 2016). We can illustrate this problem by looking at the sampling distribution of only those correlation coefficients associated with statistical significance (Fig. 3B).

For a correlation estimate to be significant, its absolute value needs to be large—more so for small samples. Selecting for a significant correlation coefficient therefore removes the part of the sampling distribution around r = 0 and only retains the more extreme values. If authors, reviewers, and editors are biased against nonsignificant findings, this means that readers only get to see an upwardly biased portion of the total distribution of correlation estimates. This, in turn, means that if researchers base their power analyses on the effect sizes encountered in the literature, they are likely to overestimate their targeted effect size and hence the power of their program of research.
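
The filtering can be simulated directly: generate null correlations, keep only those with p ≤ 0.05, and compare the two distributions (a sketch with an assumed seed and sample size):

```r
set.seed(666)
nsim <- 20000
n <- 30
sim <- replicate(nsim, {
  ct <- cor.test(rnorm(n), rnorm(n))  # true correlation is zero
  c(r = unname(ct$estimate), p = ct$p.value)
})
r.all <- sim["r", ]
r.sig <- r.all[sim["p", ] <= 0.05]  # the estimates that would get published
mean(abs(r.all))  # ~0.15: typical size of a null estimate with n = 30
mean(abs(r.sig))  # roughly 0.4: the estimates that survive the filter are much larger
```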

So far, we have considered samples from a population with zero correlation, such that any large positive or negative correlations were due to random sampling. Let us see what happens when there is a nonzero effect for a fixed sample size of 30. As Figure 5 shows, the modes of the sampling distributions increase with increasing population correlations, whereas their spreads decrease.

Figure 5.

Correlation sampling distributions as a function of the population correlation. The sample size is always n = 30. This figure was created using the R notebook corr_sim.Rmd.

Consider in more detail the sampling distributions for ρ = 0.4 (Fig. 3C). The sampling distributions for n < 50 are negatively skewed. Consequently, the typical experiment will tend to overestimate the true value (distribution mode shifted to the right). If we report only correlations associated with p ≤ 0.05, the distributions look very different (Fig. 3D). Again, with small sample sizes, the estimates are inflated, albeit in the correct direction. There is nevertheless a small number of large negative correlations: indeed, in 0.77% of simulations, even though the population value was 0.4, a large, statistically significant (p ≤ 0.05) negative correlation was obtained—also known as a directional or type III error (Shaffer, 2002).
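
The conditional inflation is easy to check for a specific case, for instance ρ = 0.4 with n = 10 (a sketch: the seed is arbitrary and the exact numbers will differ from those in the article's figures):

```r
library(MASS)  # for mvrnorm()
set.seed(3)
nsim <- 20000
n <- 10
Sigma <- matrix(c(1, 0.4, 0.4, 1), nrow = 2)  # population correlation rho = 0.4
sim <- replicate(nsim, {
  xy <- mvrnorm(n, mu = c(0, 0), Sigma = Sigma)
  ct <- cor.test(xy[, 1], xy[, 2])
  c(r = unname(ct$estimate), p = ct$p.value)
})
mean(sim["r", ])                    # close to 0.4 on average
mean(sim["r", sim["p", ] <= 0.05])  # conditional on p <= 0.05: strongly inflated
```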

The main lesson from this section is that the minimum sample size needed to estimate a particular association is probably much larger than we tend to think. And this is not just a point about statistics, because the true correlations we can expect in real data are also probably much weaker than those often reported based on small sample sizes. For instance, large n studies and meta-analyses in the social sciences suggest that correlations between 0.2 and 0.4 should be considered relatively large, and weaker correlations are much more common (Schönbrodt and Perugini, 2013; Plonsky and Oswald, 2014; Frey et al., 2017), with some replication attempts demonstrating much smaller effect sizes when using larger samples than in the original studies (Zhang et al., 2018; Cai et al., 2019). Also, because the literature is biased toward positive results, meta-analyses tend to over-estimate the population effect sizes. In brain imaging, a study estimated correlations between resting-state fMRI data and personality traits across 884 participants (Dubois et al., 2018). The largest r was 0.27. And some of the results were highly susceptible to preprocessing methods, an analytical flexibility that could be exploited to find and only report the best possible outcome, contributing to inflated estimates in the literature. In another study involving 5,216 subjects, correlations between several brain structural measurements and two cognitive tasks were all below 0.20 (Ritchie et al., 2018). A meta-analysis of 88 studies (8,036 subjects) of correlations between brain volume and IQ suggests an overall effect of r = 0.24 (Pietschnig et al., 2015). And a preregistered study using a larger sample size suggests that even lower correlations are more realistic (Nave et al., 2019). Thus, in the absence of large-sample replications, large correlations are best taken with a grain of salt. There are, however, a few notable exceptions where large effect sizes are expected, such as test–retest assessment and the comparison of related tests or measurements.

Proportion Data

By looking at sampling distributions of correlation coefficients, we learnt important lessons about the relationship between estimation precision and sample size and how conditioning on arbitrary cutoffs can dramatically bias effect sizes in the literature. We now turn our attention to another popular type of analysis: the analysis of proportion data. Such data are common in both animal and human experiments that involve behavioral measurements to assess performance (proportion of correct trials in memory and navigation tasks for instance) or when a measurement is expressed as a proportion of another one (experimental group relative to a control group for instance). This section will again help us illustrate the importance of sample sizes but also highlight another important aspect of simulations: being explicit about the data generating process.

Let's consider the simulated data in Figure 6. Panel A illustrates the theoretical relative probabilities of observing different proportions of correct responses for different true performance levels. For instance, imagine a participant who has a true performance level of 10% correct. Also imagine that we ask the participant to perform 100 trials. What will be the participant's actual performance across all these trials? More importantly, what range of actual performance can we expect across experiments? To answer this question, we simulate many experiments. The results for our 10% example are illustrated in the left-most curve in Figure 6A: it shows that for any given experiment, the number of correct trials (out of 100) ranges from roughly 0 to 20. If instead our imaginary participant is on average 50% correct, the values across experiments range roughly from 35 to 65. Thus, the population mean percent correct and standard deviation are dependent: variability is largest at 50% correct and decreases as the mean tends toward 0% or 100%.
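
These ranges come straight from the binomial distribution and are easy to verify (a sketch with the trial numbers from the text and an arbitrary seed):

```r
set.seed(10)
nsim <- 20000
ntrials <- 100
pc10 <- rbinom(nsim, size = ntrials, prob = 0.1) / ntrials  # true level 10%
pc50 <- rbinom(nsim, size = ntrials, prob = 0.5) / ntrials  # true level 50%
quantile(pc10, c(0.001, 0.999))  # roughly 0 to 0.2
quantile(pc50, c(0.001, 0.999))  # roughly 0.35 to 0.65: more variable near 50%
```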

Figure 6.

Percent correct data. A, Binomial sampling distributions. PDFs, probability density functions. B, Beta sampling distribution of proportion correct, with mean 0.7 (vertical line). C, Proportion correct results from 10 random samples from 10 participants. For each participant, the mean percent correct was sampled from the beta distribution in panel B. Then 100 trials were sampled from the corresponding binomial distribution. The long vertical line marks the population value. For each sample, each disk corresponds to the average of 100 trials for a participant and the short vertical line indicates the sample mean across participants. D, Sampling distributions for group results for different numbers of trials (color coded) and different numbers of participants. For each combination of parameters, the distributions are based on a simulation with 20,000 iterations. The vertical line marks the population value. This figure was created using the R notebook pc.Rmd.

Although this type of data is best analyzed using hierarchical (mixed-effect) models (Jaeger, 2008; Kruschke, 2014), in neuroscience and psychology correct/incorrect results and other proportion data tend to be averaged across trials for each condition and participant and entered into an ANOVA. What do the sampling distributions across participants look like? Because participants cannot be less than 0% correct or more than 100% correct, the distributions must be bounded, which rules out unbounded distributions such as the normal distribution as good models (standard analyses such as ANOVAs and t tests rely on a model under which participants could score above 100% correct). Instead, a good candidate to capture the shape of the data is a beta distribution, which is bounded between 0 and 1 (Kruschke, 2014; Heiss, 2021). Figure 6B shows an example of a beta sampling distribution with a mean of 70% correct. The distribution is negatively skewed and quite broad. Using this beta distribution, we can simulate the sampling variability expected in an experiment: each sample from the distribution corresponds to the proportion correct of a participant; then for that value, we generate trials from a binomial distribution. Figure 6C illustrates 10 random samples from 10 participants. For each participant, a random value from the beta distribution determined their proportion of correct trials and 100 trials were generated by sampling from the corresponding binomial distribution. Again, these random samples help us grasp the large variability inherent to experimental sampling.
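
A sketch of this two-level generative process. The beta shape parameters 7 and 3, which give a mean of 0.7, are assumed here for illustration; the notebook pc.Rmd defines the distribution actually used in Figure 6B:

```r
set.seed(7)
nparticipants <- 10
ntrials <- 100
# participant-level true proportions correct, drawn from a beta distribution
true.p <- rbeta(nparticipants, shape1 = 7, shape2 = 3)
# observed proportion correct: 100 binomial trials per participant
obs.pc <- rbinom(nparticipants, size = ntrials, prob = true.p) / ntrials
round(obs.pc, 2)  # one simulated experiment, as in Figure 6C
mean(obs.pc)      # the group mean for that experiment
```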

Using the same approach, we can perform 20,000 simulated experiments in which we vary the number of participants and the number of trials per participant. The results are presented in Figure 6D. As illustrated in our correlation examples, with increasing sample sizes, precision increases (sampling distributions get narrower). The results also suggest something very useful for planning experiments: the precision increases faster with the number of participants than with the number of trials (compare the panels for 10 and 50 participants). For instance, with 10 participants and 100 trials per participant, the probability of observing a group mean more than 5 percentage points from the population mean is about 10%. With only 50 trials per participant, this probability is now 14%. But if we test 50 participants with 10 trials each, the probability is now only 3.5%, even though both situations involve a total of 500 observations. The increased benefit of participants over trials seems to apply to many situations (Rouder and Haaf, 2018).
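
The participants-versus-trials trade-off can be checked with a compact simulation. Because the beta parameters below are assumed (as above), the error probabilities will not match the article's exact values, but the ordering of the two designs should:

```r
set.seed(8)
nsim <- 20000
group.mean <- function(nparticipants, ntrials) {
  true.p <- rbeta(nparticipants, 7, 3)  # assumed participant-level distribution
  mean(rbinom(nparticipants, size = ntrials, prob = true.p) / ntrials)
}
g1 <- replicate(nsim, group.mean(10, 50))  # 10 participants x 50 trials
g2 <- replicate(nsim, group.mean(50, 10))  # 50 participants x 10 trials
mean(abs(g1 - 0.7) > 0.05)  # probability of missing the population mean by > 5 points
mean(abs(g2 - 0.7) > 0.05)  # smaller: more participants beat more trials
```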

Response Latencies

ERP onsets

Besides proportion data, another popular type of data is response latency measurements. Here we consider two examples, one using event-related potential (ERP) data and one using manual reaction time data. As such latency data are bounded and skewed, it is important to consider sampling distributions.

In our first example, ERP onsets (earliest differences between conditions) were estimated in 120 participants engaged in a face versus texture discrimination task; 74 of them were tested in a second session to assess test–retest reliability (Bieniek et al., 2016). Here, for convenience, we merge the two sessions to form a distribution of 194 ERP onsets, which is positively skewed (Fig. 7A). Because the typical ERP experiment has far fewer participants, we can use data-driven simulations to determine sampling distributions given smaller sample sizes. If we are interested in the central tendency of a skewed distribution, it can be informative to estimate the 50th quantile of the distribution—the value that splits the distribution of sorted observations into two equal parts (Rousselet and Wilcox, 2020). To estimate the sampling distributions of the 50th quantile, given our data, we sample from the distribution of onsets 10,000 times, using sample sizes from 10 to 40, incremented in steps of 5. The resulting distributions are presented in Figure 7B.

Figure 7.

ERP onsets. A, Scatterplot of 194 ERP onsets. The vertical lines mark the quartiles. B, Sampling distributions of the 50th (left) and the 15th (right) onset quantiles. The quantiles were computed using the Harrell–Davis estimator, which deals better with tied values than standard estimators (Harrell and Davis, 1982). The results are based on 10,000 simulated experiments. C, Sampling distributions of quantile differences between two experiments. This figure was created using the R notebook onsets.Rmd.

From these distributions, we can ask useful questions by treating the full sample as a population. For instance, what is the proportion of onset estimates that are within ± a certain value of the population onset quantile?

Here we can motivate our choice of threshold value using the neuroscience literature. For instance, it has been estimated that on average, for short latency neurones, the transfer time between cortical areas is ∼10 ms (Nowak and Bullier, 1997). So when looking at fast visual responses, 10 ms seems like a meaningful level of precision we should care about. With a sample size of 10 participants, ∼80% of 50th quantile estimates (simulated experiments) are within ±10 ms of the population value. With a sample size of 35 participants, ∼97% of 50th quantile estimates are within ±10 ms of the population value. We can also determine the number of observations needed to achieve a certain level of estimation precision. For instance, to be within 10 ms of the full sample 50th quantile value in at least 90% of experiments, how many participants do we need? The answer is n = 19 participants.
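
A sketch of this data-driven precision check. The real onsets live in the reproducibility package (onsets.Rmd); a synthetic, positively skewed stand-in is generated below so the example runs on its own, and Hmisc::hdquantile is used as one implementation of the Harrell–Davis estimator:

```r
library(Hmisc)  # hdquantile(): Harrell-Davis quantile estimator
set.seed(5)
# stand-in for the 194 ERP onsets (replace with the real data from the repository)
onsets <- 60 + rgamma(194, shape = 4, scale = 10)
pop.q50 <- hdquantile(onsets, probs = 0.5)  # full sample treated as the population
nsim <- 10000
n <- 10
boot.q50 <- replicate(nsim,
  hdquantile(sample(onsets, n, replace = TRUE), probs = 0.5))
mean(abs(boot.q50 - pop.q50) <= 10)  # proportion of experiments within +/- 10 ms
```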

There is no need to restrict our investigation to one quantile only. If the focus is on processing speed, it is of particular interest to quantify the fastest responses (Bieniek et al., 2016; Rousselet, 2025). This can be done, for instance, by estimating a lower quantile, say the 15th quantile (Fig. 7B).

Given the controversies about the latencies of the first face responses in the brain, it can be useful to perform a simulation in which we quantify, given our data, how far apart estimates can be between two experiments (Fig. 7C). We can determine, given a certain level of precision, the probability of observing similar effects in two experiments. For instance, with a sample size of 10 participants, only ∼63% of 50th quantile estimates are within ±10 ms of each other. With a sample size of 35 participants, ∼86% of 50th quantile estimates are within ±10 ms of each other. Conversely, we can determine the number of observations needed to achieve a certain performance. For instance, for 90% of pairs of experiments to generate results at most 15 ms apart, we need at least 18 observations. Such calculations might suggest that certain discrepancies in the literature are simply due to random sampling fluctuations (Wilcox and Rousselet, 2024) or might point to real differences due to other factors, such as the task or the age of the participants. Whatever the reason, simulations of sampling distributions help put results in perspective.

The sampling distributions in Figure 7C could also be used to calculate prediction intervals (Spence and Stanley, 2024). For instance, in the long run, over an infinite number of experiments, 95% prediction intervals calculated in one experiment will contain the estimate from a replication experiment 95% of the time. But keep in mind that prediction intervals, like confidence intervals, usually make strong parametric assumptions, such that their probability coverage is inaccurate when applied to real data (Wilcox and Rousselet, 2024).

Reaction time data

The previous example considered a situation with only one level of analysis: for simplicity we did not consider within-participant variability because of the large amount of data involved in computing ERP onsets in each participant (Rousselet, 2025). Here we use a dataset containing a large number of participants and a large number of trials for each participant. The data are from the French Lexicon Project (Ferrand et al., 2010): manual reaction times were measured in response to words and nonwords. After discarding participants who clearly did not pay attention, we are left with 959 participants, each with ∼1,000 trials per condition (no further data cleaning was performed, such as removing outlier trials). Examples of individual reaction time distributions are shown in Figure 8A. The distributions are positively skewed, as expected for RT data, and participants tend to be slower in the Non-Word condition compared with the Word condition. Usually, a single number is used to summarize each individual RT distribution—a better alternative would be to use a hierarchical model (Lindeløv, 2019; Rouder and Province, 2019). From 1,000 values to 1, that's some serious data compression! In psychology, the mean is often used, but when there is skewness it can be a misleading measure of location (Trafimow et al., 2018; Rousselet and Wilcox, 2020). Instead, here we use the 20% trimmed mean, which gives a better indication of the location of the typical observation and protects against the influence of outliers (Wilcox and Rousselet, 2023). To compute a 20% trimmed mean, we sort the data, remove the 20% lowest values and the 20% highest values, and average the remaining values. In that context, the median is a 50% trimmed mean and the mean is a 0% trimmed mean. We could trim more or less, but 20% works well in many situations, and this particular choice is irrelevant to the points below.
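
In R, trimming is built into mean(); a quick, hypothetical example shows what the 20% trimmed mean does when a few trials are very slow:

```r
rt <- c(350, 380, 390, 400, 410, 420, 450, 480, 600, 1200)  # hypothetical RTs (ms)
mean(rt)              # 508: pulled up by the two slow trials
mean(rt, trim = 0.2)  # 425: 20% trimmed mean, closer to the bulk of the data
median(rt)            # 415: the median, i.e., a 50% trimmed mean
```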

Figure 8.

Reaction time data from the FLP dataset. A, Examples of individual reaction time distributions from 20 participants, in the two conditions. B, Distributions of 20% trimmed means in 959 participants: for each participant, a 20% trimmed mean was computed across trials to summarize the distribution in each condition. The right panel shows the distribution of differences between 20% trimmed means in the two conditions (Non-Word minus Word). Note how most of the differences are positive, but with very large variability. C, Sampling distributions for the group 20% trimmed means, based on simulations with 5,000 iterations. The vertical line in each panel marks the population 20% trimmed mean. D, Sampling distributions when there is no effect. This was simulated by pooling the two conditions together. The first row illustrates the sampling distributions as they are (raw data); the second row illustrates the same results after conditioning on p ≤ 0.05. This figure was created using the R notebook flp.Rmd.

The distributions of participants’ 20% trimmed means are positively skewed in both conditions (Fig. 8B). And because these distributions differ in skewness, the distribution of differences between the 20% trimmed means in the two conditions is also skewed (Non-Word minus Word). The distribution of differences overlaps very little with zero: 96.2% of participants have a positive difference, meaning that they tended to be faster in the Word than the Non-Word condition.

With this large dataset, as in previous examples, we can pretend that the full dataset is the population we are trying to estimate. We perform data-driven simulations to get sampling distributions. In this example, the sampling is hierarchical: we sample participants with replacement, and for each randomly sampled participant, we sample trials with replacement. Because there is skewness at both levels of analysis, we compute 20% trimmed means across trials for each condition and then compute 20% trimmed means across participants. This was done for simulations with 5,000 iterations in which we independently varied the number of participants and the number of trials (Fig. 8C).
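
A sketch of this hierarchical, data-driven sampling. The French Lexicon Project data are in the reproducibility package (flp.Rmd); here a synthetic stand-in (a list with one vector of reaction times per participant, one condition) keeps the example self-contained:

```r
set.seed(3)
# stand-in: 100 participants, 1,000 skewed reaction times each
flp <- lapply(1:100, function(i) rnorm(1000, mean = 500, sd = 50) + rexp(1000, rate = 1/150))
one.experiment <- function(data, n.part, n.trials) {
  ids <- sample(seq_along(data), n.part, replace = TRUE)  # sample participants
  part.tm <- sapply(ids, function(i)
    mean(sample(data[[i]], n.trials, replace = TRUE), trim = 0.2))  # trial level
  mean(part.tm, trim = 0.2)  # group 20% trimmed mean
}
group.tm <- replicate(2000, one.experiment(flp, n.part = 20, n.trials = 100))
hist(group.tm)  # sampling distribution of the group 20% trimmed mean
```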

As we saw in the previous examples, by increasing the number of participants, we gain in precision: the distributions get narrower. Interestingly, the number of trials has little effect on the width of the sampling distributions. That's because the effects are large and positive in most participants but vary a lot in magnitude across participants—see the right panel in Figure 8B. Thus, in this situation, it is clearly more beneficial to recruit more participants than to increase the number of trials (Rouder and Haaf, 2018).

Similarly to what we did in the previous examples, we could compute the number of trials and participants needed to achieve a certain level of estimation precision. But here we use the data to address a different question: what differences could we observe with this type of data and task if there were no difference between conditions? This situation can be simulated by pooling trials across conditions for each participant and sampling with replacement from that one pool of trials. We proceed by sampling participants with replacement, then for each randomly selected participant, we sample with replacement two sets of random response times from the response times in the two conditions mixed together.

As expected, the sampling distributions are now centered near zero, as on average, in the long run, we expect zero difference—assuming we're trying to estimate an unbiased quantity (Fig. 8D). But the variability across simulated experiments is large, suggesting that if results were selected based on some arbitrary cutoff, researchers could easily fool themselves into reporting false positives of large magnitude. To see what the distributions of results conditional on p ≤ 0.05 would look like, for each simulation, we performed a group t test on 20% trimmed means, a technique that requires some adjustments to the standard t test equation (Tukey and McLaughlin, 1963; Wilcox, 2022). The conditional distributions look very different to the raw ones (Fig. 8D, second row): the distributions are now bimodal with a gap around zero. Based on these distributions, what would the typical result look like? With 10 participants and 50 trials per condition, the median of the absolute group differences is ∼24 ms. Among the simulated experiments that produced results with p ≤ 0.05, 10% have differences at least as large as 37 ms. With 50 participants and 200 trials per condition, the median of the absolute group differences is ∼7 ms, and among the experiments with p ≤ 0.05, 10% produce differences at least as large as 9 ms. So, large sample sizes can strongly damp down the effect sizes of false positives reported in the literature.
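
To make the conditioning step concrete, here is a bare-bones one-sample test on a 20% trimmed mean, following the Tukey and McLaughlin (1963) adjustments, applied to simulated null differences. This is a simplified stand-in (it skips the trial-level resampling and uses synthetic participant-level differences); Wilcox (2022) provides validated functions:

```r
trim.ttest <- function(x, tr = 0.2, null.value = 0) {
  n <- length(x)
  g <- floor(tr * n)
  xs <- sort(x)
  xw <- xs
  xw[1:g] <- xs[g + 1]            # winsorize the lower tail
  xw[(n - g + 1):n] <- xs[n - g]  # winsorize the upper tail
  tm <- mean(x, trim = tr)
  se <- sd(xw) / ((1 - 2 * tr) * sqrt(n))  # SE of the trimmed mean
  tval <- (tm - null.value) / se
  2 * pt(-abs(tval), df = n - 2 * g - 1)   # two-sided p value
}
set.seed(4)
nsim <- 5000
nP <- 10
res <- replicate(nsim, {
  d <- rnorm(nP, mean = 0, sd = 30)  # synthetic null differences (ms)
  c(tm = mean(d, trim = 0.2), p = trim.ttest(d))
})
sig.tm <- res["tm", res["p", ] <= 0.05]
hist(sig.tm)  # bimodal with a gap around zero, as in the second row of Figure 8D
```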

Finally, it is important to consider that the type of data-driven simulations reported here is affected by the relative size difference between the population and the sample. As sample sizes get closer to the population size, sampling distributions are distorted, which biases power estimation, and the sign of the bias depends on the effect size in the population. The problem is explained in detail in Burns et al. (2025).

Summary and Conclusions

In this article, we saw how simulations can be used to bring perspective to results from single experiments. Using simulations, we can explore potential results not from one experiment, but from thousands of them (sampling distributions). The main outcome of this process is the startling realization that in many situations, our measurements are so noisy that we should refrain from drawing strong conclusions about them. Instead, we can learn to plan more ambitious experiments from the results of simulations and develop a healthy skepticism about published results and our own.

Thus, the approach described here contributes to tackling one critical problem faced by many researchers: the over-confidence fueled by the irrational belief that statistical tests can deliver certainty in the face of measurement noise and sampling variability. More than a crisis of replicability, I would argue that we have a crisis of over-confidence, one that can be tackled not by teaching more stats, but by teaching statistical thinking in the broader sense, as well as scientific integrity and humility. In the words of Gelman (2018): “Forget about getting definitive results from a single experiment; instead embrace variation, accept uncertainty, and learn what you can.” And the best way to embrace variation and to accept uncertainty is by simulating data (McElreath, 2020; DeBruine and Barr, 2021; Vasishth and Gelman, 2021).

Simulations can be done using synthetic or real data. With data sharing and large studies on the rise, many useful questions can be addressed using data-driven simulations. It's a great time to be a data parasite! Real data have the advantage of providing rich shape information and within-participant correlations that can be tricky to simulate. So, given the need for large datasets for running simulations, and the benefits of doing so, this is another reason for scientists to release their data and code.

And we really need more datasets, particularly large multisite ones. Indeed, in the examples covered above, we only considered variance across trials and participants, but there are many more sources of variance we need to account for: for instance, variance across stimuli, study parameters, types of equipment, sites, cultures, and time of day. Without a good handle on all these sources of variability, we face a deep crisis of generalizability (Yarkoni, 2022). Maybe we should have lengthy and passionate discussions about sources of variability and how to improve the precision of our measurements, instead of wasting oxygen discussing p values.

Data Availability

All the figures and analyses presented in this article can be reproduced using notebooks in the R programming language (R Core Team, 2021), as part of a reproducibility package available on GitHub at https://github.com/GRousselet/sampdist. All the figures are licensed CC-BY 4.0. Each figure caption ends with the name of the RMarkdown file that can be used to reproduce it. Some of the examples presented in this article were previously posted in a different format as blog posts covering: sampling distributions of correlations (https://garstats.wordpress.com/2018/06/01/smallncorr/) and what happens when the results are conditioned on p values (https://garstats.wordpress.com/2018/06/22/corrcondpval/), reaction time sampling distributions (https://garstats.wordpress.com/2018/01/24/10000/), and illustrations of estimation precision (https://garstats.wordpress.com/2018/08/27/precision/). The main R packages used to generate the data and to make the figures and notebooks are ggplot2 (Wickham, 2016), cowplot (Wilke, 2017), Cairo (Urbanek and Horner, 2019), dplyr (Wickham et al., 2019), Rfast (Papadakis et al., 2019), tibble (Müller and Wickham, 2018), rogme (Rousselet et al., 2017), knitr (Xie, 2018), and the essential beepr (Bååth, 2018).

Footnotes

  • The author declares no competing financial interests.

  • I thank Jan Vanhove for comments on an earlier version of this article.

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license, which permits unrestricted use, distribution and reproduction in any medium provided that the original work is properly attributed.

References

  1. ↵
    1. Bååth R
    (2018) beepr: Easily play notification sounds on any platform. https://CRAN.R-project.org/package=beepr
  2. ↵
    1. Baguley T
    (2009) Standardized or simple effect size: what should be reported? Br J Psychol 100:603–617. https://doi.org/10.1348/000712608X377117
    OpenUrlCrossRefPubMed
  3. ↵
    1. Baguley T
    (2010) When correlations go bad. Psychologist 23:122–123.
    OpenUrl
  4. ↵
    1. Baguley T
    (2012) Serious stats: a guide to advanced statistics for the behavioral sciences. New York: Palgrave.
  5. ↵
    1. Bieniek MM,
    2. Bennett PJ,
    3. Sekuler AB,
    4. Rousselet GA
    (2016) A robust and representative lower bound on object processing speed in humans. Eur J Neurosci 44:1804–1814. https://doi.org/10.1111/ejn.13100
    OpenUrlCrossRefPubMed
  6. ↵
    1. Bland JM
    (2009) The tyranny of power: is there a better way to calculate sample size? BMJ 339:b3985. https://doi.org/10.1136/bmj.b3985
    OpenUrlFREE Full Text
  7. ↵
    1. Burns CDG,
    2. Fracasso A,
    3. Rousselet GA
    (2025) Bias in data-driven replicability analysis of univariate brain-wide association studies. Sci Rep 15:6105. https://doi.org/10.1038/s41598-025-89257-w
    OpenUrl
  8. ↵
    1. Cai Z,
    2. Hahn AC,
    3. Zhang W,
    4. Holzleitner IJ,
    5. Lee AJ,
    6. DeBruine LM,
    7. Jones BC
    (2019) No evidence that facial attractiveness, femininity, averageness, or coloration are cues to susceptibility to infectious illnesses in a university sample of young adult women. Evol Hum Behav 40:156–159. https://doi.org/10.1016/j.evolhumbehav.2018.10.002
    OpenUrlCrossRef
  9. ↵
    1. DeBruine L
    (2025) faux: Simulation for factorial designs. Zenodo. https://doi.org/10.5281/zenodo.2669586
  10. ↵
    1. DeBruine L,
    2. Barr D
    (2021) Understanding mixed-effects models through data simulation. Adv Methods Pract Psychol Sci 4:2515245920965119. https://doi.org/10.1177/2515245920965119
    OpenUrl
  11. ↵
    1. Dubois J,
    2. Galdi P,
    3. Han Y,
    4. Paul LK,
    5. Adolphs R
    (2018) Resting-state functional brain connectivity best predicts the personality dimension of openness to experience. Personal Neurosci 1:e6. https://doi.org/10.1017/pen.2018.8
    OpenUrlCrossRef
  12. ↵
    1. Ferrand L,
    2. New B,
    3. Brysbaert M,
    4. Keuleers E,
    5. Bonin P,
    6. Méot A,
    7. Augustinova M,
    8. Pallier C
    (2010) The French lexicon project: lexical decision data for 38,840 French words and 38,840 pseudowords. Behav Res Methods 42:488–496. https://doi.org/10.3758/BRM.42.2.488
    OpenUrlCrossRefPubMed
  13. ↵
    1. Forstmeier W,
    2. Wagenmakers E-J,
    3. Parker TH
    (2016) Detecting and avoiding likely false-positive findings – a practical guide. Biol Rev 92:1941–1968. https://doi.org/10.1111/brv.12315
    OpenUrl
  14. ↵
    1. Frey R,
    2. Pedroni A,
    3. Mata R,
    4. Rieskamp J,
    5. Hertwig R
    (2017) Risk preference shares the psychometric structure of major psychological traits. Sci Adv 3:e1701381. https://doi.org/10.1126/sciadv.1701381
    OpenUrlFREE Full Text
  15. ↵
    1. Gelman A
    (2018) The failure of null hypothesis significance testing when studying incremental changes, and what to do about it. Pers Soc Psychol Bull 44:16–23. https://doi.org/10.1177/0146167217729162
    OpenUrlCrossRefPubMed
  16. ↵
    1. Gelman A,
    2. Hill J,
    3. Vehtari A
    (2020) Regression and other stories. Cambridge: Cambridge University Press.
  17. ↵
    1. Gelman A,
    2. Carlin J
    (2014) Beyond power calculations: assessing type S (sign) and type M (magnitude) errors. Perspect Psychol Sci 9:641–651. https://doi.org/10.1177/1745691614551642
    OpenUrlCrossRefPubMed
  18. ↵
    1. Gelman A,
    2. Weakliem D
    (2009) Of beauty, sex and power: too little attention has been paid to the statistical challenges in estimating small effects. Am Sci 97:310–316. https://doi.org/10.1511/2009.79.310
    OpenUrlCrossRef
  19. ↵
    1. Gigerenzer G,
    2. Marewski JN
    (2015) Surrogate science: the idol of a universal method for scientific inference. J Manage 41:421–440. https://doi.org/10.1177/0149206314547522
    OpenUrl
  20. ↵
    1. Greenland S,
    2. Senn SJ,
    3. Rothman KJ,
    4. Carlin JB,
    5. Poole C,
    6. Goodman SN,
    7. Altman DG
    (2016) Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol 31:337–350. https://doi.org/10.1007/s10654-016-0149-3
    OpenUrlCrossRefPubMed
  21. ↵
    1. Harrell FE,
    2. Davis CE
    (1982) A new distribution-free quantile estimator. Biometrika 69:635–640. https://doi.org/10.1093/biomet/69.3.635
    OpenUrlCrossRef
  22. ↵
    1. Heiss A
    (2021) A guide to modeling proportions with Bayesian beta and zero-inflated beta regression models. https://doi.org/10.59350/7p1a4-0tw75
  23. ↵
    1. Howell DC
    (2013) Statistical methods for psychology, Ed 8. Wadsworth Cengage Learning.
  24. ↵
    1. Ioannidis JPA
    (2008) Why most discovered true associations are inflated. Epidemiology 19:640–648. https://doi.org/10.1097/EDE.0b013e31818131e7
    OpenUrlCrossRefPubMed
  25. ↵
    1. Jaeger TF
    (2008) Categorical data analysis: away from ANOVAs (transformation or not) and towards logit mixed models. J Mem Lang 59:434–446. https://doi.org/10.1016/j.jml.2007.11.007
    OpenUrlCrossRefPubMed
  26. ↵
    1. Kruschke JK
    (2013) Bayesian estimation supersedes the t test. J Exp Psychol Gen 142:573–603. https://doi.org/10.1037/a0029146
    OpenUrlCrossRefPubMed
  27. ↵
    1. Kruschke JK
    (2014) Doing Bayesian data analysis, Ed 2. Academic Press.
  28. ↵
    1. Lindeløv JK
    (2019) Reaction time distributions: an interactive overview. https://lindeloev.github.io/shiny-rt/#5_notes_on_good_models_for_rt_data
  29. ↵
    1. Loken E,
    2. Gelman A
    (2017) Measurement error and the replication crisis. Science 355:584–585. https://doi.org/10.1126/science.aal3618
    OpenUrlAbstract/FREE Full Text
  30. ↵
    1. Maxwell SE,
    2. Kelley K,
    3. Rausch JR
    (2008) Sample size planning for statistical power and accuracy in parameter estimation. Annu Rev Psychol 59:537–563. https://doi.org/10.1146/annurev.psych.59.103006.093735
    OpenUrlCrossRefPubMed
  31. ↵
    1. McElreath R
    (2020) Statistical rethinking: a Bayesian course with examples in R and STAN, Ed 2. New York: Chapman and Hall/CRC.
  32. ↵
    1. McShane BB,
    2. Gal D,
    3. Gelman A,
    4. Robert C,
    5. Tackett JL
    (2019) Abandon statistical significance. Am Stat 73:235–245. https://doi.org/10.1080/00031305.2018.1527253
    OpenUrlCrossRef
  33. ↵
    1. Morey RD
    (2018) When the statistical tail wags the scientific dog. Medium. https://medium.com/@richarddmorey/when-the-statistical-tail-wags-the-scientific-dog-d09a9f1a7c63
  34. ↵
    1. Morris TP,
    2. White IR,
    3. Crowther MJ
    (2019) Using simulation studies to evaluate statistical methods. Stat Med 38:2074–2102. https://doi.org/10.1002/sim.8086
    OpenUrlCrossRefPubMed
  35. ↵
    1. Müller K,
    2. Wickham H
    (2018) tibble: Simple data frames [Computer software]. https://CRAN.R-project.org/package=tibble
  36. Nave G, Jung WH, Karlsson Linnér R, Kable JW, Koellinger PD (2019) Are bigger brains smarter? Evidence from a large-scale preregistered study. Psychol Sci 30:43–54. https://doi.org/10.1177/0956797618808470
  37. Nowak LG, Bullier J (1997) The timing of information transfer in the visual system. In: Extrastriate cortex in primates (Rockland KS, Kaas JH, Peters A, eds), pp 205–241. Springer US.
  38. Papadakis M, et al. (2019) Rfast: a collection of efficient and extremely fast R functions. https://CRAN.R-project.org/package=Rfast
  39. Pernet CR, Wilcox RR, Rousselet GA (2013) Robust correlation analyses: false positive and power validation using a new open source matlab toolbox. Front Psychol 3:606. https://doi.org/10.3389/fpsyg.2012.00606
  40. Peters G-J, Crutzen R (2017) Knowing how effective an intervention, treatment, or manipulation is and increasing replication rates: accuracy in parameter estimation as a partial solution to the replication crisis. https://doi.org/10.31234/osf.io/cjsk2
  41. Pietschnig J, Penke L, Wicherts JM, Zeiler M, Voracek M (2015) Meta-analysis of associations between human brain volume and intelligence differences: how strong are they and what do they mean? Neurosci Biobehav Rev 57:411–432. https://doi.org/10.1016/j.neubiorev.2015.09.017
  42. Plonsky L, Oswald FL (2014) How big is “big”? Interpreting effect sizes in L2 research. Lang Learn 64:878–912. https://doi.org/10.1111/lang.12079
  43. R Core Team (2021) R: a language and environment for statistical computing [Computer software]. R Foundation for Statistical Computing. https://www.R-project.org/
  44. Ritchie SJ, et al. (2018) Sex differences in the adult human brain: evidence from 5216 UK Biobank participants. Cereb Cortex 28:2959–2975. https://doi.org/10.1093/cercor/bhy109
  45. Rothman KJ, Greenland S (2018) Planning study size based on precision rather than power. Epidemiology 29:599. https://doi.org/10.1097/EDE.0000000000000876
  46. Rouder JN, Haaf JM (2018) Power, dominance, and constraint: a note on the appeal of different design traditions. Adv Methods Pract Psychol Sci 1:19–26. https://doi.org/10.1177/2515245917745058
  47. Rouder JN, Province JM (2019) Bayesian hierarchical models in psychological science: a tutorial. In: New methods in cognitive psychology (Spieler D, Schumacher E, eds), Ed 1, pp 32–66. New York: Routledge.
  48. Rousselet GA, Pernet CR, Wilcox RR (2017) Beyond differences in means: robust graphical methods to compare two groups in neuroscience. Eur J Neurosci 46:1738–1748. https://doi.org/10.1111/ejn.13610
  49. Rousselet GA, Pernet CR, Wilcox RR (2023) An introduction to the bootstrap: a versatile method to make inferences by using data-driven simulations. Meta-Psychol 7. https://doi.org/10.15626/MP.2019.2058
  50. Rousselet GA (2025) Using cluster-based permutation tests to estimate MEG/EEG onsets: how bad is it? Eur J Neurosci 61:e16618. https://doi.org/10.1111/ejn.16618
  51. Rousselet GA, Wilcox RR (2020) Reaction times and other skewed distributions: problems with the mean and the median. Meta-Psychol 4. https://doi.org/10.15626/MP.2019.1630
  52. Schönbrodt FD, Perugini M (2013) At what sample size do correlations stabilize? J Res Pers 47:609–612. https://doi.org/10.1016/j.jrp.2013.05.009
  53. Shaffer JP (2002) Multiplicity, directional (type III) errors, and the null hypothesis. Psychol Methods 7:356–369. https://doi.org/10.1037/1082-989X.7.3.356
  54. Smaldino PE, McElreath R (2016) The natural selection of bad science. R Soc Open Sci 3:160384. https://doi.org/10.1098/rsos.160384
  55. Spence JR, Stanley DJ (2024) Tempered expectations: a tutorial for calculating and interpreting prediction intervals in the context of replications. Adv Methods Pract Psychol Sci 7:25152459231217932. https://doi.org/10.1177/25152459231217932
  56. Trafimow D, Wang T, Wang C (2018) Means and standard deviations, or locations and scales? That is the question! New Ideas Psychol 50:34–37. https://doi.org/10.1016/j.newideapsych.2018.03.001
  57. Trafimow D, Wang T, Wang C (2019) From a sampling precision perspective, skewness is a friend and not an enemy! Educ Psychol Meas 79:129–150. https://doi.org/10.1177/0013164418764801
  58. Trafimow D, MacDonald JA (2017) Performing inferential statistics prior to data collection. Educ Psychol Meas 77:204–219. https://doi.org/10.1177/0013164416659745
  59. Tukey JW, McLaughlin DH (1963) Less vulnerable confidence and significance procedures for location based on a single sample: trimming/winsorization 1. Sankhyā 25:331–352. http://www.jstor.org/stable/25049278
  60. Urbanek S, Horner J (2019) Cairo: R graphics device using Cairo graphics library for creating high-quality bitmap (PNG, JPEG, TIFF), vector (PDF, SVG, PostScript) and display (X11 and Win32) output. https://CRAN.R-project.org/package=Cairo
  61. Vasishth S, Gelman A (2021) How to embrace variation and accept uncertainty in linguistic and psycholinguistic data analysis. Linguistics 59:1311–1342. https://doi.org/10.1515/ling-2019-0051
  62. Vul E, Harris C, Winkielman P, Pashler H (2009a) Puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition. Perspect Psychol Sci 4:274–290. https://doi.org/10.1111/j.1745-6924.2009.01125.x
  63. Vul E, Harris C, Winkielman P, Pashler H (2009b) Reply to comments on “puzzlingly high correlations in fMRI studies of emotion, personality, and social cognition”. Perspect Psychol Sci 4:319–324. https://doi.org/10.1111/j.1745-6924.2009.01132.x
  64. Wagenmakers E-J (2007) A practical solution to the pervasive problems of p values. Psychon Bull Rev 14:779–804. https://doi.org/10.3758/BF03194105
  65. Wickham H (2016) ggplot2: elegant graphics for data analysis, Ed 2. New York: Springer International Publishing.
  66. Wickham H, François R, Henry L, Müller K (2019) dplyr: a grammar of data manipulation. https://CRAN.R-project.org/package=dplyr
  67. Wilcox RR (2022) Introduction to robust estimation and hypothesis testing, Ed 5. San Diego, CA: Academic Press.
  68. Wilcox RR, Rousselet GA (2023) An updated guide to robust statistical methods in neuroscience. Curr Protoc 3:e719. https://doi.org/10.1002/cpz1.719
  69. Wilcox RR, Rousselet GA (2024) More reasons why replication is a difficult issue. OSF. https://doi.org/10.31219/osf.io/9amhe
  70. Wilke CO (2017) cowplot: streamlined plot theme and plot annotations for ‘ggplot2’ [Computer software]. https://CRAN.R-project.org/package=cowplot
  71. Xie Y (2018) knitr: a general-purpose package for dynamic report generation in R [Computer software]. https://yihui.name/knitr/
  72. Yarkoni T (2009) Big correlations in little studies: inflated fMRI correlations reflect low statistical power - commentary on Vul et al. (2009). Perspect Psychol Sci 4:294–298. https://doi.org/10.1111/j.1745-6924.2009.01127.x
  73. Yarkoni T (2022) The generalizability crisis. Behav Brain Sci 45:e1. https://doi.org/10.1017/S0140525X20001685
  74. Zhang W, Hahn AC, Cai Z, Lee AJ, Holzleitner IJ, DeBruine LM, Jones BC (2018) No evidence that facial width-to-height ratio (fWHR) is associated with women’s sexual desire. PLoS One 13:e0200308. https://doi.org/10.1371/journal.pone.0200308

Synthesis

Reviewing Editor: Catherine Schevon, Columbia University

Decisions are customarily a result of the Reviewing Editor and the peer reviewers coming together and discussing their recommendations until a consensus is reached. When revisions are invited, a fact-based synthesis statement explaining their decision and outlining what is needed to prepare a revision will be listed below. The following reviewer(s) agreed to reveal their identity: Robert Calin-Jageman.

This paper provides a tutorial and a commentary, explaining how to use simulated sampling distributions to build sharper and more nuanced statistical thinking. The paper has sections discussing estimating correlations, proportions correct for performance-based outcomes, and time-based measurements (ERP latency and reaction times) - all three very common types of outcome measures in eNeuro papers. The paper is clear, well-written, and draws out numerous statistical insights that readers of eNeuro will likely find compelling. Although the material included in the manuscript is not novel, this is appropriate given the educational purpose of the article. All the underlying code is available and well documented.

The reviewers agreed that the paper is suitable for publication, provided that the following minor modifications are made:

1. Please ensure that the purpose of the paper, and the fact that it contains no new information, is clear in the abstract.

2. Within the section on reaction times there is a passage on one of the pitfalls of data-driven simulations, drawing on Burns et al. (2025). I found this passage difficult to follow: it comes at the end of a paper already requiring careful attention, and the setup for the question and the resulting simulation to answer it is quite complex. Overall, I'm not sure the space available really allows the reader to absorb the lesson well - this may well be a topic that really needs its own paper or tutorial. Though it is always painful to cut a passage that has required extensive work, I ask the author to strike this section in favor of an exhortation for readers to consult Burns et al. (2025).

3. Please edit the manuscript throughout to soften the tone, in order to make it more suitable to the context of the eNeuro readership. A list of specific suggestions is included below, but please look for other similar instances requiring editing.

• Abstract - "most statistical inferences in psychology"... might want to amend to "psychology and neuroscience" to ensure readers of the abstract don't draw the wrong conclusion that the paper is directed primarily to psychologists

• Statement of significance: "that many researchers believe that nature owes them simple explanations" this seems a bit strong, and I'm also not sure if it captures the issue well (much of the rest of the paper seems to diagnose the issue as lack of knowledge/insight... this statement makes it sound more like a willful corruption of epistemic standards)

• "The typical psychology or neuroscience article reports data from one or a very few Experiments". I'm not sure if this is true in general, but it is not true for eNeuro - some papers report a single experiment, but the majority report a series of interconnected experiments. On the other hand, internal replication is rare for papers in eNeuro, so the point that researchers often leap from a single experiment to a sweeping causal claim would be accurate for this journal (and others). Just suggesting a slight rephrase here so that a reader skeptical of your position doesn't immediately jump on this claim.

• "This important point is discussed in any decent introductory frequentist statistics textbook (Howell, 2013)". - this is a bit tendentious. Perhaps something like "Sampling distributions are foundational to frequentists statistics, though often hard to grasp and so worth exploring in detail through simulations"... or something like that. The reference to Howell should probably be ".e.g. Howell, 2013)' as an example of the coverage, not as a source for the fact that sampling distributions are covered in decent stats textbooks.

• "First, they are skewed, " - just to make this clear, perhaps "First, they are skewed when drawn from a lognormal population..."

• "And that's the main reason to use large sample sizes: to get closer to the truth." .. perhaps "to get (on average) closer to the truth"?

• For "There is nothing new here: inflated effect sizes are a natural outcome of studies with small sample sizes" - perhaps "inaccurate effect sizes are a natural outcome" (with the specific push towards inflation coming by then applying a significance filter, right?)

• "Why 70%? Why {plus minus}0.1? This is not different from any thresholds used in statistics: for instance, what's the rational for p{less than or equal to}0.05, 95% confidence intervals, 80% power, Bayes factor > 3, and the inane trichotomisation of Cohen's d, other than habit?" - this is a bit of a mini-rant, and I'm not sure it helps support the point you want to make. Could just say these are possible points of reference but the full plots allow researchers to think critically about the sample-size accuracy tradeoffs inherent in their designs.

• You mentioned that the default choice of Pearson's r is poor. Perhaps the paragraph that starts with "Of course, the values I used above are completely arbitrary." Could end with something like "In addition to consulting Figure 4A, researchers may also want to explore alternatives to Pearson's r that are more robust (references)"

• I think you could consider cutting from "Not surprisingly..." up to "Indeed, consider", including Figure 4B. Specifically, I think there is so little planning currently in eNeuro papers and so little understanding of planning for power that I am not sure the paragraph and figure are that useful, and you could move right on to demonstrating how adding the significance filter to these sampling distributions pollutes the literature.

• Correlations in the wild - a large, pre-registered study of brain size and IQ found a slightly weaker correlation than even the weak one in the meta-analysis you cite (Nave et al., 2018; r = .19). Just pointing that out as another instance where the meta-analytic effect size is likely inflated. ☹

• "popular analyses, that of proportion correct data." - for this journal, you might add a sentence noting that this type of data often comes up in both human and animal experiments involving memory performance, navigation, and other performance-based outcomes

• "Let's consider" - it took me a bit of time to work through this paragraph. I suggest some slight edits for clarity that you might consider

o "imagine a participant who has a true performance level of 10% correct" (or something like that...that you are defining a true subject-level parameter is just a bit unclear as is)

o "imagine we will measure this participant's performance across 100 trials... what is the range of *actual* performance we might observe... we can find out by simulating thousands of such experiments... " - that is, I am suggesting going just a bit more slowly here, identifying the subject parameter, the number of trials, and the simulation strategy just a bit more clearly so that the reader can take it all in.

• Figure 7C and its description - I believe these are basically simulated prediction intervals. It would be worth a sentence to say that prediction intervals are a useful tool for setting expectations about the variation to expect between experiments, perhaps with a reference?
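
For concreteness, a simulated prediction interval can be sketched in a few lines of R; this is my reading of what Figure 7C shows, the observed proportion of 0.12 and the 100 trials are hypothetical values, and this simple version plugs in the point estimate, so it ignores the uncertainty of that estimate itself:

    # Rough sketch of a simulated prediction interval for a replication:
    # given one observed proportion correct (0.12 is a hypothetical value),
    # what should we expect from a new experiment with the same number of trials?
    set.seed(7)
    ntrials <- 100
    p_hat <- 0.12   # hypothetical estimate from a single experiment
    rep_obs <- rbinom(10000, size = ntrials, prob = p_hat) / ntrials
    quantile(rep_obs, probs = c(0.025, 0.975))  # approximate 95% prediction interval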

• "I would argue that we have a crisis of over-confidence, one that can be tackled not by teaching more stats, but by teaching statistical thinking in the broader sense, but also scientific integrity and humility." - perhaps "as well as scientific integrity and humility"

• "within participant correlations" perhaps "within-participant correlations"

• "I leave you with an example."... I am not sure this paragraph add a lot. We have all seen presentations that we find maddening; I'm not sure this particular anecdote connects that well with the general points you've made. Also, I think the previous two paragraphs are quite good and make an excellent coda for the paper.
