
Multilevel analysis quantifies variation in the experimental effect while optimizing power and preventing false positives

Abstract

Background

In neuroscience, experimental designs in which multiple measurements are collected in the same research object or treatment facility are common. Such designs result in clustered or nested data. When clusters include measurements from different experimental conditions, both the mean of the dependent variable and the effect of the experimental manipulation may vary over clusters. In practice, this type of cluster-related variation is often overlooked. Not accommodating cluster-related variation can result in inferential errors concerning the overall experimental effect.

Results

The exact effect of ignoring the clustered nature of the data depends on the effect of clustering. Using simulation studies we show that cluster-related variation in the experimental effect, if ignored, results in a false positive rate (i.e., Type I error rate) that is appreciably higher (up to ~20–~50 %) than the chosen \(\alpha\)-level (e.g., \(\alpha\) = 0.05). If the effect of clustering is limited to the intercept, the failure to accommodate clustering can result in a loss of statistical power to detect the overall experimental effect. This effect is most pronounced when both the magnitude of the experimental effect and the sample size are small (e.g., ~25 % less power given an experimental effect with effect size d of 0.20, and a sample size of 10 clusters and 5 observations per experimental condition per cluster).

Conclusions

When data are collected in a research design in which observations from the same cluster are obtained under different experimental conditions, multilevel analysis should be used to analyze the data. Multilevel analysis not only ensures correct statistical interpretation of the overall experimental effect, but also provides a valuable test of the generalizability of the experimental effect over (intrinsically) varying settings, and a means to reveal the cause of cluster-related variation in the experimental effect.

Background

Nested data are common in neuroscience, where multiple observations are often collected in the same cell, tissue sample, litter, or treatment facility [1–4]. For example, consider a study of differences between wild type (WT) and knock-out (KO) animals in the number of docked vesicles within presynaptic boutons. As each neuron has multiple presynaptic boutons, one can measure the number of docked vesicles in multiple boutons of every neuron, resulting in multiple measurements within each neuron (Fig. 1a). As the measurements are clustered within neurons, data resulting from this type of experimental design are referred to as clustered or nested data (see Note 1). Such data have a hierarchical, or multilevel, structure. In the present example, the number of docked vesicles within the presynaptic boutons is the level 1 variable, and neuron is the level 2, or clustering, variable. In this research design, which we refer to as design A, all observations from the same cluster belong to the same experimental condition (in our example: genotype). Research design A has received considerable attention in the neuroscience literature, emphasizing that such clustered data are common in neuroscience, and that statistical accommodation of the clustered nature of the data is crucial to avoid false positive results (i.e., inflation of the Type I error rate) [1–6].

Fig. 1

Graphical illustration of nested data in research designs A and B. In design A (a), all observations in a cluster are subject to the same experimental condition. An example of this design is the comparison of WT and KO animals with respect to the number of docked vesicles within presynaptic boutons: bouton measurements are typically clustered within neurons, and all measurements from the same neuron belong to the same experimental condition, i.e., have the same genotype. In this hypothetical example, we assume that a single neuron is sampled from each animal. If multiple neurons are sampled from the same animal, a third “mouse” level is added to the nested structure of the data. In research design B (b), observations from the same cluster are subject to different experimental conditions. An example of this design is the comparison of neurite outgrowth in cells that are treated with growth factor (GF) or left untreated (control). Here, multiple observations from both treated and untreated neurons are typically obtained from, and so clustered within, the same animal

Nested data, however, may arise in designs other than design A. In what we call research design B, observations from the same cluster are subjected to different experimental conditions. Classical examples are studies in which mice from the same litter are randomized over different experimental treatments. Research design B is common in the clinical and preclinical neurosciences [2, 7], but is also employed in the basic neurosciences. Examples include studies on the effect of different pharmacological compounds, recombinant proteins, or siRNAs on cellular or subcellular features, where the experimental treatments are applied to different tissue samples of the same animal (Fig. 1b). Other examples include the comparison of morphological features of animals or tissue samples, where each animal or tissue sample provides multiple measurements on different morphological features. Examples of research design B data in biological neuroscience are given in Table 1.

Table 1 Examples of research design B nested data in biological neuroscience

In neuroscience literature, the discussion of research design B has been limited to the case in which the experimental effect is assumed to be the same for all clusters [2, 4]. This is a strong assumption, and there is often no reason to believe that the experimental manipulation will indeed have exactly the same effect in each cluster. Here we show that even a small amount of variation in the experimental effect across clusters inflates the false positive rate of the experimental effect, if that variation is not accommodated in the statistical model.

The aim of the present paper is to describe the intricacies of research design B, and to explain how these can be accommodated in multilevel analysis (also known as ‘hierarchical modeling’, ‘mixed models’, or ‘random effects models’). In neuroscience, the research question in nested designs is often formulated at the level of the individual observations. However, as a result of the clustering, the individual observations may show dependency, and this dependency needs to be accommodated in the statistical analysis. First, we briefly discuss research design A. Second, we focus specifically on the defining features of research design B, and show how these can be accommodated in multilevel analysis. Third, we demonstrate through simulations that misspecification of the statistical model for data obtained in design B results either in an increased Type I error rate (i.e., spurious effects), or in decreased statistical power to detect the experimental effects. Finally, we discuss the use of cluster-related information to explain part of the variation in the experimental effect, with the aim of increasing statistical power to detect the experimental effect, and facilitating the biological understanding of variation in this effect.

Research design A

In research design A, multiple observations are collected in the same cluster, and only one experimental condition is represented in each cluster (Fig. 1a). We recently emphasized that design A is common in neuroscience research: at least 53 % of research papers published in 5 high profile neuroscience journals concerned data collected in this design [1]. This design has received some attention in the neuroscience literature, focusing specifically on ways to correctly analyze such data [1–4]. Our central message was that multiple measurements per cluster (e.g., neuron or mouse) cannot be considered independent observations, since measurements from the same cluster tend to be more similar to each other than to measurements from different clusters. This results in systematic differences between clusters, i.e., the mean of the dependent variable varies across clusters. Clustering implies that this variation exceeds that arising from random sampling fluctuation of individual observations within a cluster (see Note 2), i.e., the within-cluster variation. Standard statistical techniques, such as regression analysis, the t test, and ANOVA, are unsuited to analyze clustered data, because these techniques rely on the assumption that all observations are independent. Given dependency, they produce underestimated standard errors, and so underestimated p values. The result is (possibly considerable) inflation of the Type I error rate, i.e., the false positive rate (see [1] for an explanation of, and estimates of, this inflation).

There are two ways to handle research design A data. One can average across all observations within each cluster and apply standard techniques to these means, which are independent observations. Alternatively, avoiding the data reduction associated with such averaging, a multilevel model can be used to accommodate the nested structure. In multilevel analysis, the comparison of experimental conditions is conducted on the cluster level means, while retaining the distinction between the variance within clusters (e.g., differences between observations within a mouse) and the variance between clusters (e.g., differences between the mice in cluster level means). See “Box 1” for a description of the statistical multilevel model for design A. Of these two approaches, multilevel analysis is preferable, as it exploits all available information and confers the greatest statistical power [1, 4]. The multilevel model also allows one to obtain the intracluster correlation (ICC), which quantifies the degree to which measurements from the same cluster are more similar to each other than to measurements from different clusters. The ICC ranges between 0 (there is no variation between clusters and thus no dependency) and 1 (observations within clusters are identical and observations over clusters differ: complete dependency, i.e., the value of the observations depends completely on cluster membership; see “Box 1”). The ICC is the standardized version of the variance between clusters, denoted \(\sigma_{u0}^{2}\) and also referred to as the intercept variance (i.e., the variance in the cluster level means).
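As an illustration, here is a minimal sketch, not the authors' own code, of fitting the design A random-intercept model of “Box 1” with the R package lme4 [19] and deriving the ICC from the fitted variance components; the data frame and variable names (dat, vesicles, genotype, neuron) are hypothetical.

  library(lme4)

  # Random-intercept model: genotype (WT/KO) is the cluster-level
  # predictor, neuron is the level 2 (clustering) variable
  m <- lmer(vesicles ~ genotype + (1 | neuron), data = dat)

  # ICC = between-cluster (intercept) variance / total variance
  vc  <- as.data.frame(VarCorr(m))
  icc <- vc$vcov[vc$grp == "neuron"] / sum(vc$vcov)
  icc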

Methods

We use randomly generated (i.e., simulated) datasets to illustrate the effects of cluster-related variation in design B data on the results of various statistical tests. We varied the magnitude of the experimental effect, the amount of cluster-related variation in the intercept and in the experimental effect, and the sample size, and determined how these variables influenced the obtained results. We considered a design with two experimental conditions, which we refer to as the control and the experimental condition. The generated datasets were analyzed using the following four statistical methods: a t test on the individual observations (i.e., modeling the data as shown in Fig. 2a), a paired t test on the experimental condition specific cluster means, a multilevel model on the individual observations that only accommodates cluster-related variation in the intercept (i.e., modeling the data as shown in Fig. 2b), and a multilevel model on the individual observations that accommodates cluster-related variation in both the intercept and the experimental effect (i.e., modeling the data as shown in Fig. 2d). Note that standard statistical methods on summary statistics produce correct parameter estimates only if the (sub)sample sizes are equal over clusters and experimental conditions [8]. An overview of the parameter settings for each simulation study is provided in Table 2; the four analyses are illustrated in the sketch below.
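For concreteness, the four analyses can be specified in R as follows; the lmer syntax is that of lme4 [19], while the data frame and variable names (dat, y, condition, cluster) are our own hypothetical choices.

  library(lme4)

  # (1) t test on the individual observations (ignores clustering)
  t.test(y ~ condition, data = dat)

  # (2) paired t test on the experimental condition specific cluster means
  means <- aggregate(y ~ cluster + condition, data = dat, FUN = mean)
  wide  <- reshape(means, idvar = "cluster", timevar = "condition",
                   direction = "wide")
  t.test(wide$y.1, wide$y.0, paired = TRUE)

  # (3) multilevel model accommodating cluster-related variation in the
  #     intercept only
  m_int <- lmer(y ~ condition + (1 | cluster), data = dat)

  # (4) multilevel model accommodating cluster-related variation in both
  #     the intercept and the experimental effect
  m_slp <- lmer(y ~ condition + (1 + condition | cluster), data = dat)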

Table 2 Parameter settings used to generate the simulated datasets

First, we illustrate the effect on the statistical power to detect the overall experimental effect in the specific case that cluster-related variation in the experimental effect is absent, and variation in the intercept is either absent, or present but not accommodated (i.e., studies 1a and 1b in Table 2). That is, we ask: if data are generated according to Fig. 2a, b, how does the statistical power compare across the four statistical methods (i.e., the t test on the individual observations, the paired t test on summary statistics, and the two types of multilevel analysis)?

Second, we illustrate the effects of the presence of cluster-related variation in the experimental effect, in combination with either absent or present cluster-related variation in the intercept, on the false positive rate of the experimental effect (i.e., studies 2a and 2b in Table 2). That is, we ask: if data are generated such that, overall (i.e., taken over all clusters), the experimental manipulation has no effect, but the data include cluster-related variation in the experimental effect, what is the effect on the false positive rate? We illustrate this both when the data include no cluster-related variation in the intercept (Fig. 2c), and when they do include cluster-related variation in the intercept (Fig. 2d), and compare the false positive rate across the four statistical methods (i.e., the t test on the individual observations, the paired t test on summary statistics, and the two types of multilevel analysis).

For all scenarios, we generated 10,000 datasets. To establish the statistical power in studies 1a and 1b, and the empirical false positive rate in studies 2a and 2b, we counted the number of times that the overall experimental effect was found to be statistically significant given \(\alpha\) = 0.05. The datasets were generated such that the experimental effect is expressed in terms of the effect size d (obtained as the difference between the experimental and control condition means divided by the within-cluster standard deviation \(\sigma_{e}\) [16]), where we considered effects of 0.20, 0.50, and 0.80 to be small, medium, and large, respectively [17]. Condition was dummy coded 0 (control) and 1 (experimental), such that the amount of cluster-related variation in the experimental effect, \(\sigma_{u1}^{2}\), could be interpreted according to the guidelines of Raudenbush and Liu [7]. Accordingly, \(\sigma_{u1}^{2}\) values of 0.05, 0.10, and 0.15 are considered small, medium, and large, respectively.

To understand the amount of variation in the experimental effect, consider a medium experimental effect of d = 0.50. If the variation in the experimental effect is small, i.e., \(\sigma_{u1}^{2}\) = 0.05, this corresponds to a standard deviation of ~0.22. Assuming normally distributed cluster-specific deviations from the overall effect size, \(\beta_{1j}\), ~95 % of the cluster-specific experimental effects would lie between ~0.07 and ~0.93 (i.e., 0.50 − 1.96 × 0.22 and 0.50 + 1.96 × 0.22, respectively). Using the dummy coding 0 and 1 also ensures that the intercept variance equals the cluster-related variation in the control condition when both the intercept and the experimental effect show cluster-related variation. The covariance between the intercept and the experimental effect was set to zero in all simulations.

All simulations were performed in R 2.15.3 [18], and multilevel models were fitted using the R package lme4 [19]. The R-code is available upon request from the corresponding author.
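Since the original simulation code is only available on request, the following is a minimal, unofficial sketch of the data-generating model described above (within-cluster standard deviation \(\sigma_{e}\) fixed at 1, zero intercept-slope covariance); the function name, parameter names, and the use of a likelihood ratio test for the fixed effect are our own choices.

  library(lme4)

  # Generate one design B dataset: N clusters, n observations per
  # experimental condition per cluster, overall effect size d, intercept
  # variance s2_u0, and variance in the experimental effect s2_u1
  # (within-cluster sd = 1, so d is on the scale used in the paper)
  simulate_design_B <- function(N = 10, n = 5, d = 0.50,
                                s2_u0 = 0.50, s2_u1 = 0.05) {
    cluster   <- rep(seq_len(N), each = 2 * n)
    condition <- rep(rep(0:1, each = n), times = N)  # dummy coded 0/1
    u0 <- rnorm(N, 0, sqrt(s2_u0))  # cluster deviations in the intercept
    u1 <- rnorm(N, 0, sqrt(s2_u1))  # cluster deviations in the effect
    y  <- u0[cluster] + (d + u1[cluster]) * condition + rnorm(N * 2 * n)
    data.frame(y, condition, cluster)
  }

  # Proportion of replicates in which the overall experimental effect is
  # significant at alpha = 0.05: empirical power, or, when d = 0, the
  # empirical Type I error rate (small samples may trigger convergence
  # warnings for the random-slope model)
  pvals <- replicate(1000, {
    dat <- simulate_design_B()
    m1  <- lmer(y ~ condition + (1 + condition | cluster), data = dat,
                REML = FALSE)
    m0  <- update(m1, . ~ . - condition)
    anova(m0, m1)$"Pr(>Chisq)"[2]
  })
  mean(pvals < 0.05)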

Results and discussion

Ignoring cluster-related variation can result in interpretational errors

Our simulation results showed that the failure to accommodate the cluster-related variation in either intercept or slope (i.e., in either the mean value of the control condition or the experimental effect) can result in interpretational errors. A general overview of all results is given in Table 3. Below, we discuss the results of studies 1a and 1b (i.e., no variation in the experimental effect) and studies 2a and 2b (i.e., variation in the experimental effect) in detail.

Table 3 Consequences of not accommodating cluster-related variation in research design B

Ignoring variation in the intercept in design B data can decrease statistical power

The results obtained were identical for the multilevel model that only includes variation in the intercept and the multilevel model that includes variation in both the intercept and the experimental effect. We therefore do not differentiate between the two types of multilevel analysis in this section.

For design B data that include no cluster-related variation in the intercept or the experimental effect (i.e., study 1a), a conventional t test on the individual observations is as powerful as multilevel analysis, but multilevel analysis is more powerful than conventional analysis (i.e., a paired t test) on summary statistics (Fig. 3). This loss in statistical power of the analysis on summary statistics is only present when the number of clusters is small (i.e., N = 10).

Fig. 3

Use of conventional analysis methods on design B data can result in a loss of power. Using conventional analysis methods to model design B data that include cluster-related variation in the intercept and no cluster-related variation in the experimental effect (\(\sigma_{u0}^{2}\) > 0 and \(\sigma_{u1}^{2}\) = 0; study 1b) results in a loss of statistical power compared to using a multilevel model. The presented results are identical for the multilevel model that only includes variation in the intercept and the multilevel model that includes variation in both the intercept and the experimental effect. The fitted conventional analysis methods were (a) a t test on individual observations and (b) a paired t test on the experimental condition specific cluster means. The loss in statistical power is overall greatest when both the number of clusters and the effect size d are small and the cluster-related variation in the intercept is considerable. When the cluster-related variation in the intercept and in the experimental effect both equal zero (that is, ICC = \(\sigma_{u1}^{2}\) = 0; study 1a), a t test on individual observations is as powerful as multilevel analysis, but multilevel analysis is more powerful than a paired t test on summary statistics. The actual statistical power of multilevel analysis given \(\sigma_{u1}^{2}\) = 0, d = 0.20 or 0.50, N = 10, and increasing numbers of observations per experimental condition per cluster is given in Fig. 5b, solid line

Analyzing design B data that only include cluster-related variation in the intercept (i.e., study 1b) using conventional statistical analysis (e.g., a t test on individual observations or a paired t test on experimental condition specific cluster means) sometimes results in a loss of statistical power. For the t test on individual observations, the difference in statistical power is greatest when both the number of clusters and the magnitude of the experimental effect are small, and the amount of cluster-related variation in the intercept is large (Fig. 3). For example, given substantial cluster-related variation in the intercept (ICC = 0.50), 10 clusters, and an effect size of 0.20, a t test on individual observations is ~25 % less powerful than multilevel analysis. When the overall experimental effect is medium (i.e., d = 0.50), multilevel analysis only yields more statistical power given substantial cluster-related variation in the intercept (i.e., ICC = 0.50), a small number of clusters, and a small number of observations per experimental condition. For the paired t test on experimental condition specific cluster means, the loss in statistical power relative to multilevel analysis is only present when the number of clusters is small (i.e., N = 10), and does not depend on the amount of cluster-related variation in the intercept.

The occasionally observed increase in the loss of power as a function of the number of observations per experimental condition per cluster arises because multilevel analysis gains power with an increasing number of observations per experimental condition per cluster. The subsequent decrease in the loss of power as this number increases further arises because multilevel analysis approaches the maximum power of 100 %, so that the difference in statistical power between multilevel analysis and the conventional analysis methods becomes smaller. The actual statistical power of multilevel analysis given no cluster-related variation in the experimental effect, an effect size d of 0.20 or 0.50, 10 clusters, and increasing numbers of observations per experimental condition per cluster is provided in Fig. 5b (solid line; note that the ICC does not influence the power of multilevel analysis to detect the overall experimental effect in design B data, and as such does not feature in this figure).

In summary, failing to take into account the hierarchical nature of the data, or using summary statistics, results in a loss of power to detect the experimental effect, especially when both the number of clusters and the overall effect are small. Neuroscience studies often report small effects, and may be underpowered due to small sample size [20]. Multilevel analysis of research design B data can therefore increase statistical power compared to conventional analyses, unless of course the statistical power of the conventional analysis approaches 1.

Ignoring variation in the experimental effect increases the false positive rate

Given clustering with respect to the experimental effect, the use of a statistical model on individual observations that does not accommodate this variation results in an inflated false positive (i.e., Type I error) rate. First, when variation in the intercept is absent (i.e., study 2a), ignoring variation in the experimental effect results in an actual false positive rate as high as ~20–~50 % (Fig. 4a), depending on the number of observations per cluster and the amount of variation in the experimental effect. Specifically, if the overall experimental effect is zero, both a conventional t test and a misspecified multilevel analysis (i.e., one that ignores variation in the experimental effect but does model variation in the intercept) yield similarly inflated Type I error rates (the lines fully overlap in Fig. 4a). Even a very small amount of cluster-related variation in the experimental effect (i.e., \(\sigma_{u1}^{2}\) = 0.025) can result in a Type I error rate of ~20 % if it is not accommodated in the statistical model. In summary, the failure to accommodate cluster-related variation in the experimental effect results in substantial inflation of the Type I error rate, and this inflation is considerable even when the variation in the experimental effect is small.

Fig. 4

Ignoring variation in the experimental effect results in an inflated false positive (i.e., Type I error) rate. Inflation of the Type I error rate already occurs when a small amount of variation in the experimental effect (e.g., \(\sigma_{u1}^{2}\) = 0.025) remains unaccounted for in the statistical model, both when the intercept (i.e., the mean value of the control condition) is invariant over clusters (a; ICC = 0; study 2a), and when the intercept varies substantially over clusters (b; ICC = 0.50; study 2b). In panel a, the lines depicting conventional analysis (i.e., the t test on individual observations) and misspecified multilevel analysis completely overlap. Using a paired t test on the experimental condition specific cluster means results in a correct Type I error rate. In panel b, the lines depicting the paired t test and the correctly specified multilevel analysis completely overlap

Second, when variation is present in both the intercept and the experimental effect (i.e., study 2b), accommodating only the cluster-related variation in the intercept (i.e., a misspecified multilevel analysis), or not accommodating cluster-related variation at all (i.e., a conventional t test), again results in an inflated Type I error rate (Fig. 4b). If the variation in the intercept is large (ICC = 0.50), the Type I error rate increases up to approximately 35 % when using a conventional t test, and up to approximately 50 % when using a multilevel analysis that only accommodates cluster-related variation in the intercept. In summary, the substantial inflation of the Type I error rate that results from unaccommodated cluster-related variation in the experimental effect occurs irrespective of the presence of variation in the intercept.

Accommodating cluster-related variation in the experimental effect, by either using a correctly specified multilevel analysis or using conventional models on summary statistics (i.e., a paired t test on the experimental condition specific cluster means), does result in a correct Type I error rate (i.e., studies 2a and 2b). See “Box 3” for a detailed explanation of why ignoring cluster-related variation in the experimental effect results in an increased false positive rate.
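A related practical question is whether the data themselves show cluster-related variation in the experimental effect. One common approach, our suggestion rather than a procedure from this paper, is a likelihood ratio test comparing the random-intercept model to the random-slope model; variable names follow the sketches above.

  library(lme4)

  m_int <- lmer(y ~ condition + (1 | cluster), data = dat, REML = FALSE)
  m_slp <- lmer(y ~ condition + (1 + condition | cluster), data = dat,
                REML = FALSE)

  # The test is conservative: the null value of the slope variance (zero)
  # lies on the boundary of the parameter space
  anova(m_int, m_slp)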

Conclusions

To draw valid conclusions in a nested experimental design, it is crucial to use the appropriate statistical method. We showed previously that design A data (i.e., nested data that possibly show cluster-related variation in the intercept) are abundant in neuroscience literature, and that proper statistical analysis of such data is crucial to avoid false positives [1].

Here, we showed that for design B data (i.e., nested data that possibly show cluster-related variation both in the intercept and in the experimental effect), correct statistical modeling is also critical to avoid incorrect inference. In design B data, however, the exact consequences of ignoring the dependency depend on the nature of the clustering. If cluster-related variation in the experimental effect is present, not accommodating it results in an inflated false positive rate. That is, in design A data, unaccommodated variation in the intercept inflates the false positive rate, while in design B data it is unaccommodated variation in the experimental effect that inflates the false positive rate. Importantly, this inflation already occurs with a small amount of cluster-related variation in the experimental effect. In addition, if cluster-related variation is limited to the intercept (and absent in the experimental effect), failure to correctly accommodate this variation can result in a loss of statistical power to detect the experimental effect of interest. The loss in statistical power when using conventional analysis methods (i.e., a t test) on individual observations instead of correctly specified multilevel analysis is noteworthy when both the number of clusters and the overall effect are small. In addition, we showed that using standard statistical methods on summary statistics (i.e., a paired t test) does yield a correct false positive rate, but results in a loss of statistical power to detect the experimental effects of interest when the number of clusters is small. Importantly, the use of standard statistical methods on summary statistics only results in correct parameter estimates if the (sub)sample sizes are equal over clusters and experimental conditions (even if the summary statistics are weighted by the sample size of the cluster) [8].

Finally, multilevel analysis can provide valuable insight into the generalizability of the experimental effect over (intrinsically) varying biological settings, and allows one to use cluster-related information to explain part of the variation in the experimental effect. Multilevel analysis thus not only ensures correct statistical interpretation of the results, and hence correct conclusions, but can also provide unique information on the collected research data that cannot be obtained when standard statistical methods are used on either individual observations or summary statistics.
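As a sketch of that last point, a measured cluster-level covariate (here called age, a hypothetical choice) can be entered as a cross-level interaction with condition; the condition:age term then quantifies how the experimental effect changes with the covariate and can absorb part of the slope variance \(\sigma_{u1}^{2}\).

  library(lme4)

  # Cross-level interaction: the cluster covariate moderates the
  # experimental effect; residual slope variance remains in the model
  m_expl <- lmer(y ~ condition * age + (1 + condition | cluster), data = dat)
  summary(m_expl)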

Notes

  1. In the context of randomized complete block designs (RCBD), the term “nested” is used to describe how experimental manipulations or treatments are combined. Specifically, treatments are referred to as nested if the various experimental conditions of treatment B do not appear with every level of treatment A (e.g., conditions B1 and B2 are only observed in combination with A1, while B3 and B4 are only observed with A2) [23]. In the multilevel literature, however, the term nesting is used to describe the data. Specifically, nested data are characterized by a hierarchical, multilevel structure in which individual observations are clustered, or nested, within hierarchically higher organized groups or clusters [10].

  2. Let \(\mu_i\) denote the true mean in cluster i, and \(\sigma\) denote the true within-cluster standard deviation, assumed to be equal for all clusters (hence no subscript). Let \(m_i\) and \(s_i\) denote the estimated mean and standard deviation in cluster i based on \(n_i\) observations. The standard error of the mean \(m_i\) equals s.e.(\(m_i\)) = \(s_i/\sqrt{n_i}\). If all \(\mu_i\) in a given experimental condition (e.g., the WT mice) are equal across clusters, s.e.(\(m_i\)) reflects the variation in cluster means that is solely due to random sampling fluctuation. In that case, the data can be viewed as independent, and can be analyzed using standard statistical models. However, due to systematic differences between the clusters, clustering often gives rise to variation in the \(m_i\) that exceeds this random sampling fluctuation. When data display such (cluster-related) dependency, multilevel analysis is called for. The systematic differences between clusters may be due to known (possibly measured) or unknown factors.

  3. In a previous paper specifically on design A data, we showed that not accommodating variation in the intercept results in an increased false positive rate. To avoid confusion about the effect of not accommodating cluster-related variation in the intercept, we note the following. In design A data, the experimental effect is at the cluster level and thus explains systematic differences between clusters. In this case, variation in the intercept represents variation in the outcome of one of the experimental conditions. When this variation in the outcome is not taken into account, the estimate of the experimental effect is too precise (i.e., its standard error is downward biased), and hence the false positive rate is increased. In design B data, however, the experimental effect is at the level of the individual observations, and thus explains systematic differences within clusters; it is hence determined within each cluster separately. Variation in the intercept (i.e., the mean value of the control condition) here represents fluctuations that do not influence the size of the experimental effect within each cluster. Not accommodating variation in the intercept in research design B thus results in a lower signal to noise ratio, and hence decreased statistical power, instead of a higher false positive rate.

Abbreviations

ICC: intracluster correlation
PinT: power in two-level designs

References

  1. Aarts E, Verhage M, Veenvliet JV, Dolan CV, van der Sluis S. A solution to dependency: using multilevel analysis to accommodate nested data. Nat Neurosci. 2014;17:491–6.

  2. Lazic SE, Essioux L. Improving basic and translational science by accounting for litter-to-litter variation in animal models. BMC Neurosci. 2013;14:37.

  3. Lazic SE. The problem of pseudoreplication in neuroscientific studies: is it affecting your analysis? BMC Neurosci. 2010;11:5.

  4. Galbraith S, Daniel JA, Vissel B. A study of clustered data and approaches to its analysis. J Neurosci. 2010;30:10601–8.

  5. Zorrilla EP. Multiparous species present problems (and possibilities) to developmentalists. Dev Psychobiol. 1997;30:141–50.

  6. Karakosta A, Vassilaki M, Plainis S, Elfadl NH, Tsilimbaris M, Moschandreas J. Choice of analytic approach for eye-specific outcomes: one eye or two? Am J Ophthalmol. 2012;153:571–579.e1.

  7. Raudenbush SW, Liu X. Statistical power and optimal design for multisite randomized trials. Psychol Methods. 2000;5:199–213.

  8. Moerbeek M, van Breukelen GJ, Berger MP. A comparison between traditional methods and multilevel regression for the analysis of multicenter intervention studies. J Clin Epidemiol. 2003;56:341–50.

  9. Senn S. Some controversies in planning and analysing multi-centre trials. Stat Med. 1998;17:1753–65.

  10. Hox JJ. Multilevel analysis: techniques and applications. 2nd ed. New York: Routledge; 2010.

  11. Goldstein H. Multilevel statistical models. 4th ed. West Sussex: Wiley; 2011.

  12. Snijders TA, Bosker RJ. Multilevel analysis: an introduction to basic and advanced multilevel modeling. Thousand Oaks: Sage Publications; 2011.

  13. Maas CJ, Hox JJ. Sufficient sample sizes for multilevel modeling. Methodology (Eur J Res Methods Behav Soc Sci). 2005;1:86–92.

  14. Maas CJ, Hox JJ. Robustness issues in multilevel regression analysis. Stat Neerl. 2004;58:127–37.

  15. Stegmueller D. How many countries for multilevel modeling? A comparison of frequentist and Bayesian approaches. Am J Polit Sci. 2013;57:748–61.

  16. Hedges LV. Effect sizes in cluster-randomized designs. J Educ Behav Stat. 2007;32:341–70.

  17. Cohen J. Statistical power analysis for the behavioral sciences. 2nd ed. Hillsdale: Erlbaum; 1988.

  18. R Core Team. R: a language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2013.

  19. Bates D, Mächler M, Bolker B, Walker S. Fitting linear mixed-effects models using lme4. arXiv preprint arXiv:1406.5823; 2014.

  20. Button KS, Ioannidis JP, Mokrysz C, Nosek BA, Flint J, Robinson ES, Munafò MR. Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci. 2013;14:365–76.

  21. Tabachnick BG, Fidell LS. Using multivariate statistics. Boston: Pearson Education; 2007.

  22. Snijders TA, Bosker RJ. Standard errors and sample sizes for two-level research. J Educ Behav Stat. 1993;18:237–59.

  23. Casella G. Statistical design. New York: Springer Science & Business Media; 2008.

Authors’ contributions

EA planned and carried out the study, performed the simulation analysis and drafted the manuscript. MV, CVD and SvdS provided constructive input. All authors read and approved the final manuscript.

Acknowledgements

This work was funded by The Netherlands Scientific Organization (NWO/MaGW: VIDI-452-12-014 to S.v.d.S.), the European Research Council (Genetics of Mental Illness: ERC-230374 to C.V.D.), and by the NeuroBSIK Mouse Phenomics Consortium (NeuroBSIK BSIK03053 to M.V.). Simulations were run on the Genetic Cluster Computer, which is financially supported by a NWO Medium Investment grant (480-05-003); by the VU University Amsterdam, the Netherlands, and by the Dutch Brain Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests

The authors declare that they have no competing interests.

Author information

Corresponding author

Correspondence to Emmeke Aarts.

Additional files

12868_2015_228_MOESM1_ESM.pdf

Additional file 1. Effect of neurite location (axon/dendrite) on traveling speed of intracellular vesicles: a worked example. An example of multilevel analysis of research design B data, including syntax to perform multilevel analysis in the statistical packages SPSS and R.

12868_2015_228_MOESM2_ESM.pdf

Additional file 2. Calculating the optimal allocation of sample sizes and estimating statistical power to detect the overall experimental effect. Explanation on how to calculate the optimal allocation of sample sizes over clusters and within clusters given the available resources, and explanation on how to estimate power for a balanced (i.e., the number of observations per condition are equal both between conditions and between clusters) 2-level multilevel model without covariates.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Cite this article

Aarts, E., Dolan, C.V., Verhage, M. et al. Multilevel analysis quantifies variation in the experimental effect while optimizing power and preventing false positives. BMC Neurosci 16, 94 (2015). https://doi.org/10.1186/s12868-015-0228-5
