Reliability and statistical power analysis of cortical and subcortical FreeSurfer metrics in a large sample of healthy elderly

doi:10.1016/j.neuroimage.2014.12.035

NeuroImage

Volume 108, March 2015, Pages 95-109

Highlights

• We assessed reliability and statistical power of surface-based morphometry.
• A large sample of healthy elderly was investigated.
• Several cortical and subcortical measures are reported.
• We provide a tool that allows researchers to perform their own power analysis.
• This will enable researchers to design well-powered studies.

Abstract

FreeSurfer is a tool to quantify cortical and subcortical brain anatomy automatically and noninvasively. Previous studies have reported reliability and statistical power analyses in relatively small samples or have examined only one aspect of brain anatomy. Here, we investigated reliability and statistical power of cortical thickness, cortical surface area, cortical volume, and the volume of subcortical structures in a large sample (N = 189) of healthy elderly subjects (64+ years). Reliability (intraclass correlation coefficient, ICC) of cortical and subcortical parameters is generally high (cortical: ICCs > 0.87, subcortical: ICCs > 0.95). Surface-based smoothing increases reliability of cortical thickness maps, while it decreases reliability of cortical surface area and volume. Nevertheless, statistical power of all measures benefits from smoothing. When aiming to detect a 10% difference between groups, the number of subjects required to test effects with sufficient power over the entire cortex varies between cortical measures (cortical thickness: N = 39, surface area: N = 21, volume: N = 81; 10 mm smoothing, power = 0.8, α = 0.05). For subcortical regions this number is between 16 and 76 subjects, depending on the region. We also demonstrate the advantage of within-subject designs over between-subject designs. Furthermore, we publicly provide a tool that allows researchers to perform a priori power analysis and sensitivity analysis to help evaluate previously published studies and to design future studies with sufficient statistical power.

Introduction

There is a long history of manual morphometry investigating the structural changes occurring with age. This research is based on manually tracing brain regions of interest on magnetic resonance imaging (MRI) data. While this approach has led to tremendous insight regarding age-related changes in the brain, it also relies heavily on multiple intensively trained human raters. Standardizing all aspects of these procedures across labs is often difficult, which might give rise to undesired variations in results (Raz and Rodrigue, 2006). However, over the last decade, surface-based morphometry tools have made it possible to noninvasively quantify gray matter in the human brain in a more automated fashion. Software packages such as FreeSurfer¹ or Caret² provide measurements of cortical and subcortical gray matter features based on MRI data. Because the algorithms that are implemented in these packages work with minimal user intervention, it became feasible to investigate large samples and apply this approach to a wide variety of research questions. Structural brain measurements (a) change systematically in development and aging (Fjell et al., 2013b, Hogstrom et al., 2013, Salat et al., 2004, Sowell et al., 2003), (b) reflect neuroplasticity in the context of experience and practice (Engvig et al., 2010), (c) can be important predictors of neurodegenerative disease (Dickerson et al., 2013, Gaser et al., 2013), and (d) have a genetic component (Joshi et al., 2011).

In the field of aging, subcortical volumes (Jäncke et al., 2014), as well as a variety of cortical measures are being analyzed, for instance, cortical thickness, surface area or volume (Fjell et al., 2014, Hogstrom et al., 2013, Mills and Tamnes, 2014). Analyzing a variety of parameters is warranted as they provide independent and complementary information about brain anatomy (Meyer et al., 2013, Winkler et al., 2010). However, age-related cortical thinning may render the reconstruction of cortical surfaces less reliable and reduce statistical power in the analysis of interest when studying older samples (e.g., to detect within-person individual change in longitudinal studies of aging). Pertinent studies (e.g., Han et al., 2006, Jovicich et al., 2006, Jovicich et al., 2013, Morey et al., 2010, Schnack et al., 2010, Wonderlick et al., 2009) cannot adequately contribute to this matter as the majority of them investigated test-retest reliability in young samples and relied on rather small sample sizes (often around 5 to 20 subjects). As demonstrated by Shoukri et al. (2004) and recently emphasized by Buchanan et al. (2014), a large sample size is necessary to compute reliability within an acceptable confidence interval. Further limitations of previous studies include the restriction to subcortical or regional cortical analyses, the restriction to only one measure (e.g., cortical thickness), or reporting reliability metrics that are difficult to interpret (for example, absolute difference of cortical thickness, which might correspond to different percent changes in two different regions).

In addition to the topic of reliability, the field of human neuroimaging recently increased its focus on the issue of statistical power (Button et al., 2013, Suckling et al., 2014). A major concern with neuroimaging studies is that relatively small sample sizes are not sufficient to detect relatively small differences between groups or conditions (Yarkoni et al., 2010). Therefore, performing an a priori power analysis to determine the appropriate sample size in the planning phase of a study is recommended. However, in the field of neuroimaging, calculation of statistical power is complicated by the fact that power is not uniform over the entire cortex, which results in different required sample sizes for different brain regions (Pardoe et al., 2013). While there exists a well-described statistical framework for performing statistical power analyses, only a minority of neuroimaging studies perform power analyses beforehand (Button et al., 2013). Recent work provided sample size calculations for cortical thickness studies (Pardoe et al., 2013). Building on this study, we extend this framework by including additional anatomical measures (cortical surface area, volume, and subcortical volume), other types of power analyses (post-hoc and sensitivity), and additional statistical tests (paired sample t-test).
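The point that power is not uniform over the cortex can be illustrated with a minimal sketch. It assumes a two-sided, two-sample comparison and uses the normal approximation to the sample size formula, which is not necessarily the calculation implemented in the tool described here. The function name `n_per_group` and the per-vertex means and standard deviations are hypothetical values chosen only for illustration.

```python
import math
from statistics import NormalDist


def n_per_group(mean, sd, rel_diff=0.10, alpha=0.05, power=0.80):
    """Approximate subjects per group needed to detect a relative group
    difference (normal approximation, two-sided two-sample test)."""
    d = rel_diff * mean / sd  # standardized effect size (Cohen's d)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)


# Hypothetical (mean thickness in mm, between-subject SD in mm) per vertex.
vertices = {
    "vertex_a": (2.5, 0.25),
    "vertex_b": (2.0, 0.40),
    "vertex_c": (3.0, 0.60),
}

# The same 10% difference needs a different sample size at each vertex,
# because the ratio of SD to mean differs between locations.
required = {name: n_per_group(m, s) for name, (m, s) in vertices.items()}
for name, n in required.items():
    print(name, n)
```

A study that tests the whole cortex would be planned around the largest of these per-vertex values, which is why a single region-specific estimate can understate the required sample size.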

The primary objective of the present study is therefore to investigate test-retest reliability and statistical power of cortical and subcortical measures derived with FreeSurfer. In detail, we aim to

1. conduct our calculations using data from a large sample (N = 189);
2. use a standardized and easy-to-interpret metric for reliability, namely, the intraclass correlation coefficient (ICC);
3. assess several cortical measures (cortical thickness, surface area, volume) as well as measures of subcortical volume;
4. provide vertex-wise information in order to assess regional variability in reliability and statistical power;
5. investigate the effect of surface-based smoothing kernel size on reliability and statistical power;
6. publicly provide surface data of reliability and statistical power in order for others to inquire about reliability and statistical power in brain regions they are most interested in;
7. publicly provide a tool that allows others to perform various types of power analyses (e.g., to determine the required sample size to detect an effect in cortical or subcortical measures before performing an experiment or to determine the sensitivity of a previously published study).
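
As an illustration of the reliability metric named in aim 2, the following minimal sketch computes an intraclass correlation coefficient from repeated measurements. The text above does not state which ICC form was used, so the one-way random-effects form, ICC(1,1), is an assumption made here. The function name `icc_oneway` and the example thickness values are hypothetical.

```python
from statistics import mean


def icc_oneway(scores):
    """One-way random-effects ICC(1,1).

    scores: list of per-subject lists, each holding k repeated measurements.
    """
    n = len(scores)      # number of subjects
    k = len(scores[0])   # number of measurements per subject
    grand = mean(x for row in scores for x in row)
    row_means = [mean(row) for row in scores]
    # Between-subject and within-subject mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, row_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical test-retest thickness values (mm) for four subjects.
test_retest = [[2.4, 2.5], [2.9, 2.8], [2.1, 2.2], [3.0, 3.1]]
print(round(icc_oneway(test_retest), 3))
```

Values close to 1 indicate that most of the observed variance reflects differences between subjects rather than measurement error between sessions.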

Note that the results presented here apply to the specific scanner type and acquisition sequence used in the study. Results might deviate for data from different scanner types and other acquisition schemes. Importantly, the power analysis tool we provide also allows researchers to base their power analysis on data previously acquired at their local scanner with their specific sequence. This, therefore, enables researchers to customize power analysis to their situation.
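
Basing a power analysis on locally acquired data can be sketched as a sensitivity analysis: given pilot measurements and a planned group size, estimate the smallest percent difference that could be detected. This is a minimal sketch under the normal approximation for a two-sided, two-sample test, not the procedure of the tool described here. The function name `min_detectable_percent_diff` and the pilot volumes are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def min_detectable_percent_diff(values, n_per_group, alpha=0.05, power=0.80):
    """Smallest group difference, in percent of the pilot mean, detectable
    with the given group size (normal approximation, two-sample test)."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    d = z * math.sqrt(2 / n_per_group)  # minimum detectable Cohen's d
    return 100 * d * stdev(values) / mean(values)


# Hypothetical subcortical volumes (mm^3) from a local pilot sample.
pilot_volumes = [4000, 4200, 3800, 4100, 3900]
for n in (20, 40, 80):
    print(n, round(min_detectable_percent_diff(pilot_volumes, n), 2))
```

The detectable difference shrinks with the square root of the group size, so quadrupling the sample halves the smallest effect that can be detected.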

Section snippets

Research participants

Data from 189 right-handed older adults (99 female; age: M = 70.4, SD = 5.0, min = 64, max = 87) were taken from the first wave of the LHAB (Longitudinal Healthy Aging Brain) database, which is currently being built at the International Normal Aging and Plasticity Center (University of Zurich, Switzerland) (Zöllig et al., 2011). Participants were cognitively healthy, right-handed, had no history of neurological or psychiatric disorder, and did not suffer from migraine, diabetes or tinnitus. Their Mini

Results

The results are presented in the order of the list of aims shown in the introduction.

Discussion

In this study, we investigated the reliability and statistical power of cortical and subcortical brain measures computed with FreeSurfer. Furthermore, we publicly provide a tool that enables researchers to perform power analyses. With this tool scientists could, for instance, calculate the sample size necessary to detect a difference in cortical thickness between a disease group and a healthy control group, or calculate the required sample size to detect brain changes in a repeated measures

Acknowledgments

The current analysis incorporates data from the Longitudinal Healthy Aging Brain (LHAB) database project, which is carried out as one of the core projects at the International Normal Aging and Plasticity Imaging Center/INAPIC and the University Research Priority Program “Dynamics of Healthy Aging” of the University of Zurich. The following members of the core INAPIC team were involved in the design, set-up, maintenance and support of the LHAB database: Anne Eschen, Lutz Jäncke, Mike Martin,

References (68)

J.L. Bernal-Rusiel et al. Spatiotemporal linear mixed effects modeling for the mass-univariate analysis of longitudinal neuroimage data. NeuroImage (2013)
C.R. Buchanan et al. Test-retest reliability of structural brain networks from diffusion MRI. NeuroImage (2014)
M.J. Clarkson et al. A comparison of voxel and surface based cortical thickness estimation methods. NeuroImage (2011)
B. Couvy-Duchesne et al. Heritability of head motion during resting state functional MRI in 462 healthy twins. NeuroImage (2014)
A.M. Dale et al. Cortical surface-based analysis. I. Segmentation and surface reconstruction. NeuroImage (1999)
C. Destrieux et al. Automatic parcellation of human cortical gyri and sulci using standard anatomical nomenclature. NeuroImage (2010)
B.C. Dickerson et al. Detection of cortical thickness correlates of cognitive performance: reliability across MRI scan sessions, scanners, and field strengths. NeuroImage (2008)
S. Elmer et al. Increased cortical surface area of the left planum temporale in musicians facilitates the categorization of phonetic and temporal speech sounds. Cortex (2013)
A. Engvig et al. Effects of memory training on cortical thickness in the elderly. NeuroImage (2010)
B. Fischl et al. Cortical surface-based analysis. II: Inflation, flattening, and a surface-based coordinate system. NeuroImage (1999)
B. Fischl et al. Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain. Neuron (2002)
A.M. Fjell et al. Critical ages in the life course of the adult brain: nonlinear subcortical aging. Neurobiol. Aging (2013)
M.F. Folstein et al. Mini-mental state. A practical method for grading the cognitive state of patients for the clinician. J. Psychiatr. Res. (1975)
X. Han et al. Reliability of MRI-derived measurements of human cerebral cortical thickness: the effects of field strength, scanner upgrade and manufacturer. NeuroImage (2006)
C. Hutton et al. A comparison between voxel-based cortical thickness and voxel-based morphometry in normal aging. NeuroImage (2009)
J. Jovicich et al. Reliability in multi-site structural MRI studies: effects of gradient non-linearity correction on phantom and human data. NeuroImage (2006)
J. Jovicich et al. MRI-derived measurements of human subcortical, ventricular and intracranial brain volumes: reliability effects of scan sessions, acquisition sequences, data analyses, scanner upgrade, scanner vendors and field strengths. NeuroImage (2009)
N. Raz et al. Differential aging of the brain: patterns, cognitive correlates and modifiers. Neurosci. Biobehav. Rev. (2006)
M. Reuter et al. Highly accurate inverse consistent registration: a robust approach. NeuroImage (2010)
M. Reuter et al. Within-subject template estimation for unbiased longitudinal image analysis. NeuroImage (2012)
M. Reuter et al. Head motion during MRI acquisition reduces gray matter volume and thickness estimates. NeuroImage (2015)
L.M. Rimol et al. Cortical volume, surface area, and thickness in schizophrenia and bipolar disorder. Biol. Psychiatry (2012)
D.H. Salat et al. Age-associated alterations in cortical gray and white matter signal intensity and gray to white matter contrast. NeuroImage (2009)
K.R.A. Van Dijk et al. The influence of head motion on intrinsic functional connectivity MRI. NeuroImage (2012)
A.M. Winkler et al. Cortical thickness or grey matter volume? The importance of selecting the phenotype for imaging genetics studies. NeuroImage (2010)
J.S. Wonderlick et al. Reliability of MRI-derived cortical and subcortical morphometric measures: effects of pulse sequence, voxel geometry, and parallel imaging. NeuroImage (2009)
T. Yarkoni et al. Cognitive neuroscience 2.0: building a cumulative science of human brain function. Trends Cogn. Sci. (2010)
M. Aksoy et al. Real-time optical motion correction for diffusion tensor imaging. Magn. Reson. Med. (2011)
K.S. Button et al. Power failure: why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci. (2013)
F. Cardinale et al. Validation of FreeSurfer-estimated brain cortical thickness: comparison with histologic measurements. Neuroinformatics (2014)
J. Cohen. A power primer. Psychol. Bull. (1992)
G. Cumming. The new statistics: why and how. Psychol. Sci. (2014)
A.M. Dale et al. Improved localization of cortical activity by combining EEG and MEG with MRI cortical surface reconstruction: a linear approach. J. Cogn. Neurosci. (1993)
B.C. Dickerson et al. Biomarker-based prediction of progression in MCI: comparison of AD signature and hippocampal volume with spinal fluid amyloid-β and tau. Front. Aging Neurosci. (2013)
