Elsevier

NeuroImage

Volume 108, March 2015, Pages 95-109
Reliability and statistical power analysis of cortical and subcortical FreeSurfer metrics in a large sample of healthy elderly

https://doi.org/10.1016/j.neuroimage.2014.12.035

Highlights

  • We assessed reliability and statistical power of surface-based morphometry.

  • A large sample of healthy elderly was investigated.

  • Several cortical and subcortical measures are reported.

  • We provide a tool that allows researchers to perform their own power analysis.

  • This will enable researchers to design well-powered studies.

Abstract

FreeSurfer is a tool to quantify cortical and subcortical brain anatomy automatically and noninvasively. Previous studies have reported reliability and statistical power analyses in relatively small samples or only selected one aspect of brain anatomy. Here, we investigated reliability and statistical power of cortical thickness, surface area, volume, and the volume of subcortical structures in a large sample (N = 189) of healthy elderly subjects (64+ years). Reliability (intraclass correlation coefficient) of cortical and subcortical parameters is generally high (cortical: ICCs > 0.87, subcortical: ICCs > 0.95). Surface-based smoothing increases reliability of cortical thickness maps, while it decreases reliability of cortical surface area and volume. Nevertheless, statistical power of all measures benefits from smoothing. When aiming to detect a 10% difference between groups, the number of subjects required to test effects with sufficient power over the entire cortex varies between cortical measures (cortical thickness: N = 39, surface area: N = 21, volume: N = 81; 10 mm smoothing, power = 0.8, α = 0.05). For subcortical regions this number is between 16 and 76 subjects, depending on the region. We also demonstrate the advantage of within-subject designs over between-subject designs. Furthermore, we publicly provide a tool that allows researchers to perform a priori power analysis and sensitivity analysis to help evaluate previously published studies and to design future studies with sufficient statistical power.

Introduction

There is a long history of manual morphometry investigating the structural changes occurring with age. This research is based on manually tracing brain regions of interest on magnetic resonance imaging (MRI) data. While this approach has led to tremendous insight regarding age-related changes in the brain, it also heavily relies on multiple intensively trained human raters. Standardizing all aspects of these procedures across labs often is difficult. This might give rise to undesired variations in results (Raz and Rodrigue, 2006). However, over the last decade, surface-based morphometry tools made it possible to non-invasively quantify gray matter in the human brain in a more automated fashion. Software packages such as FreeSurfer or Caret provide measurements of cortical and subcortical gray matter features based on MRI data. Because the algorithms that are implemented in these packages work with minimal user intervention, it became feasible to investigate large samples and apply this approach to a wide variety of research questions. Structural brain measurements (a) change systematically in development and aging (Fjell et al., 2013b, Hogstrom et al., 2013, Salat et al., 2004, Sowell et al., 2003), (b) reflect neuroplasticity in the context of experience and practice (Engvig et al., 2010), (c) can be important predictors of neurodegenerative disease (Dickerson et al., 2013, Gaser et al., 2013), and (d) have a genetic component (Joshi et al., 2011).

In the field of aging, subcortical volumes (Jäncke et al., 2014), as well as a variety of cortical measures are being analyzed, for instance, cortical thickness, surface area or volume (Fjell et al., 2014, Hogstrom et al., 2013, Mills and Tamnes, 2014). Analyzing a variety of parameters is warranted as they provide independent and complementary information about brain anatomy (Meyer et al., 2013, Winkler et al., 2010). However, age-related cortical thinning may render the reconstruction of cortical surfaces less reliable and reduce statistical power in the analysis of interest when studying older samples (e.g., to detect within-person individual change in longitudinal studies of aging). Pertinent studies (e.g., Han et al., 2006, Jovicich et al., 2006, Jovicich et al., 2013, Morey et al., 2010, Schnack et al., 2010, Wonderlick et al., 2009) cannot adequately contribute to this matter as the majority of them investigated test-retest reliability in young samples and relied on rather small sample sizes (often around 5 to 20 subjects). As demonstrated by Shoukri et al. (2004) and recently emphasized by Buchanan et al. (2014), a large sample size is necessary to compute reliability within an acceptable confidence interval. Further limitations of previous studies include the restriction to subcortical or regional cortical analyses, the restriction to only one measure (e.g., cortical thickness), or reporting reliability metrics that are difficult to interpret (for example, absolute difference of cortical thickness, which might correspond to different percent changes in two different regions).

In addition to the topic of reliability, the field of human neuroimaging recently increased its focus on the issue of statistical power (Button et al., 2013, Suckling et al., 2014). A major concern with neuroimaging studies is that relatively small sample sizes are not sufficient to detect relatively small differences between groups or conditions (Yarkoni et al., 2010). Therefore, performing an a priori power analysis to determine the appropriate sample size in the planning phase of a study is recommended. However, in the field of neuroimaging, calculation of statistical power is complicated by the fact that power is not uniform over the entire cortex, which results in different required sample sizes for different brain regions (Pardoe et al., 2013). While there exists a well-described statistical framework for performing statistical power analyses, only a minority of neuroimaging studies perform power analyses beforehand (Button et al., 2013). Recent work provided sample size calculations for cortical thickness studies (Pardoe et al., 2013). Building on this study, we extend this framework by including additional anatomical measures (cortical surface area, volume, and subcortical volume), other types of power analyses (post-hoc and sensitivity), and additional statistical tests (paired sample t-test).
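As an illustration of the a priori and sensitivity analyses mentioned above, the following minimal Python sketch computes a per-group sample size for a two-sided independent-samples t-test and, conversely, the smallest standardized effect detectable with a given per-group sample size. It relies on the normal approximation to the t distribution, so it slightly underestimates the exact sample size. The function names are illustrative assumptions and are not taken from the tool described in this article.

```python
from math import ceil, sqrt
from statistics import NormalDist


def _z_sum(alpha: float, power: float) -> float:
    """Sum of the critical z value (two-sided) and the z value for the target power."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """A priori analysis: subjects per group needed to detect Cohen's d
    (normal approximation, two equally sized independent groups)."""
    return ceil(2 * (_z_sum(alpha, power) / d) ** 2)


def min_detectable_d(n: int, alpha: float = 0.05, power: float = 0.8) -> float:
    """Sensitivity analysis: smallest Cohen's d detectable with n subjects per group."""
    return _z_sum(alpha, power) * sqrt(2 / n)


if __name__ == "__main__":
    print(n_per_group(0.5))                  # medium effect
    print(round(min_detectable_d(64), 3))    # what 64 subjects per group can detect
```

For a medium effect (d = 0.5) the approximation yields 63 subjects per group, one fewer than the 64 obtained from the exact calculation tabulated by Cohen (1992). Because power varies over the cortex, such a calculation would have to be repeated per vertex or region, each with its own effect size.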

The primary objective of the present study is therefore to investigate test-retest reliability and statistical power of cortical and subcortical measures derived with FreeSurfer. In detail, we aim to

  1. conduct our calculations using data from a large sample (N = 189);

  2. use a standardized and easy-to-interpret metric for reliability, namely, the intraclass correlation coefficient (ICC);

  3. assess several cortical measures (cortical thickness, surface area, volume) as well as measures of subcortical volume;

  4. provide vertex-wise information in order to assess regional variability in reliability and statistical power;

  5. investigate the effect of surface-based smoothing kernel size on reliability and statistical power;

  6. publicly provide surface data of reliability and statistical power in order for others to inquire about reliability and statistical power in brain regions they are most interested in;

  7. publicly provide a tool that allows others to perform various types of power analyses (e.g., to determine the required sample size to detect an effect in cortical or subcortical measures before performing an experiment or to determine the sensitivity of a previously published study).
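The ICC named in the second aim can be illustrated with a short Python sketch for test-retest data (one row per subject, one column per session). The excerpt does not state which ICC form was used, so the sketch assumes one common variant, the two-way consistency form often labeled ICC(3,1); the function name and the toy values are hypothetical.

```python
def icc_consistency(data: list[list[float]]) -> float:
    """ICC(3,1): (MS_subjects - MS_error) / (MS_subjects + (k - 1) * MS_error)
    for n subjects (rows) measured in k sessions (columns)."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]

    # Sums of squares of the two-way ANOVA decomposition.
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_subjects = k * sum((m - grand) ** 2 for m in row_means)
    ss_sessions = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_subjects - ss_sessions

    ms_subjects = ss_subjects / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_subjects - ms_error) / (ms_subjects + (k - 1) * ms_error)


if __name__ == "__main__":
    # Hypothetical thickness values (mm) for three subjects scanned twice.
    toy = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]]
    print(icc_consistency(toy))
```

In a vertex-wise analysis the same computation would be applied independently at every vertex, which yields a map of reliability over the cortical surface. A constant offset between sessions does not lower this consistency form of the ICC; an absolute-agreement form would penalize it.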

Note that the results presented here apply to the specific scanner type and acquisition sequence used in the study. Results might deviate for data from different scanner types and other acquisition schemes. Importantly, the power analysis tool we provide also allows researchers to base their power analysis on data previously acquired at their local scanner with their specific sequence. This, therefore, enables researchers to customize power analysis to their situation.
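Basing a power analysis on locally acquired data can be sketched as follows, under the assumption that a percent difference is converted to a standardized effect using the mean and standard deviation of local pilot measurements. The sketch also contrasts a between-subject design with a within-subject design, where the standard deviation of the difference scores depends on the test-retest correlation. All names and values are hypothetical, the normal approximation is used throughout, and equal variances at both time points are assumed; this is not the implementation of the tool described here.

```python
from math import ceil, sqrt
from statistics import NormalDist, mean, stdev


def _z_sum(alpha: float, power: float) -> float:
    """Critical z value (two-sided) plus the z value for the target power."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)


def n_between(values, percent_diff, alpha=0.05, power=0.8) -> int:
    """Subjects per group for an independent-samples comparison."""
    delta = percent_diff / 100 * mean(values)   # raw difference to detect
    d = delta / stdev(values)                   # Cohen's d from pilot data
    return ceil(2 * (_z_sum(alpha, power) / d) ** 2)


def n_within(values, percent_diff, retest_r, alpha=0.05, power=0.8) -> int:
    """Subjects for a paired (repeated-measures) comparison."""
    delta = percent_diff / 100 * mean(values)
    sd_diff = stdev(values) * sqrt(2 * (1 - retest_r))  # SD of difference scores
    return ceil((_z_sum(alpha, power) / (delta / sd_diff)) ** 2)


if __name__ == "__main__":
    # Hypothetical pilot thickness values (mm) for one region at the local scanner.
    pilot_thickness_mm = [2.3, 2.5, 2.4, 2.6, 2.7, 2.5]
    print(n_between(pilot_thickness_mm, percent_diff=2))
    print(n_within(pilot_thickness_mm, percent_diff=2, retest_r=0.9))
```

With these toy values, detecting a 2% difference requires 126 subjects per group in the between-subject design but only 13 subjects in the within-subject design, which illustrates why high test-retest reliability makes repeated-measures designs more efficient.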

Research participants

Data from 189 right-handed older adults (99 female; age: M = 70.4, SD = 5.0, min = 64, max = 87) were taken from the first wave of the LHAB (Longitudinal Healthy Aging Brain) database, which is currently being built at the International Normal Aging and Plasticity Imaging Center (University of Zurich, Switzerland) (Zöllig et al., 2011). Participants were cognitively healthy, right-handed, had no history of neurological or psychiatric disorder, and did not suffer from migraine, diabetes or tinnitus. Their Mini

Results

The results are presented in the order of the list of aims shown in the introduction.

Discussion

In this study, we investigated the reliability and statistical power of cortical and subcortical brain measures computed with FreeSurfer. Furthermore, we publicly provide a tool that enables researchers to perform power analyses. With this tool scientists could, for instance, calculate the sample size necessary to detect a difference in cortical thickness between a disease group and a healthy control group, or calculate the required sample size to detect brain changes in a repeated measures

Acknowledgments

The current analysis incorporates data from the Longitudinal Healthy Aging Brain (LHAB) database project, which is carried out as one of the core projects at the International Normal Aging and Plasticity Imaging Center/INAPIC and the University Research Priority Program “Dynamics of Healthy Aging” of the University of Zurich. The following members of the core INAPIC team were involved in the design, set-up, maintenance and support of the LHAB database: Anne Eschen, Lutz Jäncke, Mike Martin,

References (68)

  • B. Fischl et al. (2002). Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain. Neuron.
  • A.M. Fjell et al. (2013). Critical ages in the life course of the adult brain: nonlinear subcortical aging. Neurobiol. Aging.
  • M.F. Folstein et al. (1975). Mini-mental state. A practical method for grading the cognitive state of patients for the clinician. J. Psychiatr. Res.
  • X. Han et al. (2006). Reliability of MRI-derived measurements of human cerebral cortical thickness: the effects of field strength, scanner upgrade and manufacturer. NeuroImage.
  • C. Hutton et al. (2009). A comparison between voxel-based cortical thickness and voxel-based morphometry in normal aging. NeuroImage.
  • J. Jovicich et al. (2006). Reliability in multi-site structural MRI studies: effects of gradient non-linearity correction on phantom and human data. NeuroImage.
  • J. Jovicich et al. (2009). MRI-derived measurements of human subcortical, ventricular and intracranial brain volumes: reliability effects of scan sessions, acquisition sequences, data analyses, scanner upgrade, scanner vendors and field strengths. NeuroImage.
  • N. Raz et al. (2006). Differential aging of the brain: patterns, cognitive correlates and modifiers. Neurosci. Biobehav. Rev.
  • M. Reuter et al. (2010). Highly accurate inverse consistent registration: a robust approach. NeuroImage.
  • M. Reuter et al. (2012). Within-subject template estimation for unbiased longitudinal image analysis. NeuroImage.
  • M. Reuter et al. (2015). Head motion during MRI acquisition reduces gray matter volume and thickness estimates. NeuroImage.
  • L.M. Rimol et al. (2012). Cortical volume, surface area, and thickness in schizophrenia and bipolar disorder. Biol. Psychiatry.
  • D.H. Salat et al. (2009). Age-associated alterations in cortical gray and white matter signal intensity and gray to white matter contrast. NeuroImage.
  • K.R.A. Van Dijk et al. (2012). The influence of head motion on intrinsic functional connectivity MRI. NeuroImage.
  • A.M. Winkler et al. (2010). Cortical thickness or grey matter volume? The importance of selecting the phenotype for imaging genetics studies. NeuroImage.
  • J.S. Wonderlick et al. (2009). Reliability of MRI-derived cortical and subcortical morphometric measures: effects of pulse sequence, voxel geometry, and parallel imaging. NeuroImage.
  • T. Yarkoni et al. (2010). Cognitive neuroscience 2.0: building a cumulative science of human brain function. Trends Cogn. Sci.
  • M. Aksoy et al. (2011). Real-time optical motion correction for diffusion tensor imaging. Magn. Reson. Med.
  • K.S. Button et al. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci.
  • F. Cardinale et al. (2014). Validation of FreeSurfer-estimated brain cortical thickness: comparison with histologic measurements. Neuroinformatics.
  • J. Cohen (1992). A power primer. Psychol. Bull.
  • G. Cumming (2014). The new statistics: why and how. Psychol. Sci.
  • A.M. Dale et al. (1993). Improved localization of cortical activity by combining EEG and MEG with MRI cortical surface reconstruction: a linear approach. J. Cogn. Neurosci.
  • B.C. Dickerson et al. (2013). Biomarker-based prediction of progression in MCI: comparison of AD signature and hippocampal volume with spinal fluid amyloid-β and tau. Front. Aging Neurosci.