Valid population inference for information-based imaging: From the second-level t-test to prevalence inference
Introduction
Since the seminal work of Haxby et al. (2001), an increasing number of neuroimaging studies have employed multivariate methods to complement the established mass-univariate approach (Friston et al., 1995) to the analysis of functional magnetic resonance imaging (fMRI) data, a field now known as multivariate pattern analysis (MVPA; Norman et al., 2006). Most MVPA studies use classification (Pereira et al., 2009) to examine activation patterns; the accuracy of a classifier in distinguishing activation patterns associated with different experimental conditions serves as a measure of multivariate effect strength. Since the target of MVPA is not a generally increased or decreased level of activation but the information content of activation patterns (cf. Pereira and Botvinick, 2011), it has also been characterized as information-based imaging and distinguished from traditional activation-based imaging (Kriegeskorte et al., 2006).
Many methodological aspects of MVPA have already been discussed in detail: what kind of classifier to use (Cox and Savoy, 2003, Norman et al., 2006), whether to adapt parametric multivariate statistics instead of classifiers (Allefeld and Haynes, 2014, Nili et al., 2014), how to understand searchlight-based accuracy maps (Etzel et al., 2013), or how classifier weights can be made interpretable (Haufe et al., 2014, Hoyos-Idrobo et al., 2015). By contrast, the topic of population inference based on per-subject measures of information content, i.e. the question of whether an information effect observed in a sample of subjects generalizes to the population from which these subjects were recruited, has not yet received sufficient attention (but see Brodersen et al., 2013).
In univariate analysis of multi-subject fMRI studies, the standard way to achieve population inference is to perform a ‘second-level’ null hypothesis test (Holmes and Friston, 1998). For each subject, a ‘first-level’ contrast (activation difference) is computed, and this contrast enters a second-level analysis, a t-test or an ANOVA. For a simple one-sided t-test vs 0 in particular, a statistically significant result allows one to infer that the experimental manipulation is associated with an increase of activation on average in the population of subjects. This is interpreted as showing that the effect is ‘common’ or ‘stereotypical’ in that population (Penny and Holmes, 2007, p. 156).
With the adoption of information-based imaging, it has become accepted practice to apply the same second-level inferential procedures to the results of first-level multivariate analyses, in particular classification accuracy (see e.g. Haxby et al., 2001, Haynes et al., 2007, Spiridon and Kanwisher, 2002): A classifier is trained on one part of the data and tested on another part, with each part serving as test set once (cross-validation), and classification performance is quantified as an accuracy, the fraction of correctly classified test data points. If, for example, a classifier is applied to data from two experimental conditions between which there is no multivariate difference, it operates at ‘chance level’, i.e. it achieves an average classification accuracy of 50%. At the second level, the accuracies from different subjects are then entered into a one-sided one-sample t-test vs 50%, in order to show that the ability to classify above chance, and therefore the presence of an information effect, is typical in the population the subjects were recruited from.
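To make this standard procedure concrete, the following is a minimal sketch in Python using scikit-learn and SciPy. The synthetic per-subject data stand in for real fMRI activation patterns; all variable names and the data-generating parameters are ours, chosen purely for illustration.

```python
import numpy as np
from scipy import stats
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def subject_accuracy(X, y, n_folds=5):
    """Cross-validated accuracy: each part of the data serves as test set once."""
    return cross_val_score(LinearSVC(dual=False), X, y, cv=n_folds).mean()

# synthetic stand-in for per-subject activation patterns: 16 subjects,
# 80 trials x 50 voxels each, two experimental conditions (labels 0/1)
n_subjects, n_trials, n_voxels = 16, 80, 50
accuracies = []
for _ in range(n_subjects):
    y = np.repeat([0, 1], n_trials // 2)
    X = rng.standard_normal((n_trials, n_voxels)) + 0.3 * y[:, None]  # weak per-voxel effect
    accuracies.append(subject_accuracy(X, y))

# second level: one-sided one-sample t-test of accuracies against chance (50%)
t, p = stats.ttest_1samp(accuracies, popmean=0.5, alternative='greater')
print(f"t = {t:.2f}, one-sided p = {p:.4f}")
```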
In this paper we argue that despite the seemingly analogous statistical procedure, a t-test vs chance level applied to accuracies cannot provide evidence that the corresponding effect is typical in the population. In contrast to other criticisms of this use of the t-test (see below), in our view the problem is not so much that the estimation distribution of cross-validated accuracies is not normal or even symmetric, or that a normal distribution model is generally inadequate for a quantity bounded to the interval [0%, 100%]. Rather, the problem is that, unlike estimated accuracies, the true single-subject accuracy can never be below chance level, because it measures an amount of information. We will show that this restriction changes the meaning of the t-test: It now tests the global null hypothesis (Nichols et al., 2005) that there is no information in any subject in the population. As a consequence, a significant test result only allows us to infer that there are some subjects in which there is an effect, but not that the presence of information generalizes to the population. The argument holds not only for classification accuracy but also for other ‘information-like’ measures.
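The consequence can be illustrated with a small simulation (our own illustration, not the paper's): even when only a minority of subjects in the population carry information, a t-test vs chance rejects far more often than its nominal level, precisely because the true accuracies of the remaining subjects sit at chance rather than below it. The prevalence and effect-size values are arbitrary assumptions for demonstration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_subjects, n_test = 1000, 20, 100
prevalence = 0.2      # assumed: only 20% of the population carry information
acc_effect = 0.75     # assumed true accuracy in those subjects

n_reject = 0
for _ in range(n_sims):
    has_effect = rng.random(n_subjects) < prevalence
    true_acc = np.where(has_effect, acc_effect, 0.5)   # never below chance
    est_acc = rng.binomial(n_test, true_acc) / n_test  # estimated accuracies
    t, p = stats.ttest_1samp(est_acc, 0.5, alternative='greater')
    n_reject += p < 0.05
print(f"rejection rate: {n_reject / n_sims:.2f}")  # far above the nominal 5%
```

A significant result in this setting evidently cannot license the conclusion that above-chance decodability is typical in the population.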
The t-test on accuracies has been criticized before (Brodersen et al., 2013, Stelzer et al., 2013) on the grounds that its distributional assumptions are not fulfilled for cross-validated classification accuracies. Such a distributional error invalidates the calculation of critical values for the t-statistic and can therefore lead to an increased rate of false positives. This problem may be solved by better distribution models (Brodersen et al., 2013) or the use of non-parametric statistics (Stelzer et al., 2013). Our criticism goes significantly beyond that: Not only is the t-test quantitatively wrong, but it effectively tests a null hypothesis that is qualitatively different from the one it tests when applied to univariate statistics, with the consequence that rejection of this null hypothesis no longer supports population inference.
Please note that our criticism pertains specifically to a second-level t-test applied to per-subject classification accuracies or similar measures. It does not apply to the classification of subjects, e.g. into different patient groups in medical applications (Sabuncu, 2014, Sabuncu and Van Leemput, 2012), or to the classification of condition-specific patterns across subjects (Mourao-Miranda et al., 2005). Moreover, it only concerns quantities that measure the information content of data, not related quantities like classifier weights (Gaonkar and Davatzikos, 2013, Gaonkar et al., 2015, Wang et al., 2007; see below).
The organization of the paper is as follows: In the section ‘The problem with the t-test on accuracies’ we detail how a second-level t-test achieves population inference for univariate contrasts. We then explain that MVPA measures are ‘information-like’ and show, both theoretically and using simulations, that for such measures the t-test effectively tests the global null hypothesis that there is no effect in any subject. The section ‘An alternative: information prevalence inference’ reviews possible alternatives to the t-test on accuracies, converging on the idea that population inference for information-based imaging should target the proportion of subjects in the population with an effect. One way to implement such ‘information prevalence inference’ is described in detail in the section ‘Permutation-based information prevalence inference using the minimum statistic’, and the results of its application to real data are compared with those of the t-test. We conclude by discussing a number of questions surrounding the problem of population inference for information-based imaging.
Section snippets
Population inference in univariate fMRI analysis
To see why the t-test on accuracies cannot provide population inference, we briefly recapitulate how standard univariate analysis does achieve it. In a single subject, an activation difference or contrast Δβ is estimated based on the general linear model (GLM; Friston et al., 1995). Because it is obtained from noisy data, the estimate is itself noisy, $\widehat{\Delta\beta} \sim \mathcal{N}(\Delta\beta,\ \sigma^2)$, where $\sigma^2$ denotes the estimation variance of the contrast (cf. Fig. 1a). If several subjects are included in a study, the true …
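For reference, the two-level random-effects structure this snippet alludes to can be written out as follows; this is our reconstruction of the standard summary-statistics model, not a quotation from the paper:

```latex
\begin{align*}
  \text{first level (subject } k\text{):}\quad
    & \widehat{\Delta\beta}_k \sim \mathcal{N}\!\bigl(\Delta\beta_k,\ \sigma_k^2\bigr) \\
  \text{second level (population):}\quad
    & \Delta\beta_k \sim \mathcal{N}\!\bigl(\mu,\ \sigma_{\mathrm{pop}}^2\bigr) \\
  \text{population inference:}\quad
    & H_0\colon \mu \le 0 \ \text{rejected} \;\Rightarrow\; \mu > 0
\end{align*}
```

Because the second level models between-subject variation, rejecting $H_0$ licenses a statement about the population mean $\mu$, not merely about the sampled subjects.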
An alternative: information prevalence inference
In the previous part we established that the second-level t-test applied to accuracies is not able to provide population inference. We now discuss alternative approaches, leading us to the idea that population inference for information-based imaging should target the proportion of people in the population in which there is an information effect.
Within the MVPA literature, there are three alternative proposals. First, Kriegeskorte and Bandettini (2007) recommend applying the methods for …
Permutation-based information prevalence inference using the minimum statistic
In this part we recapitulate the minimum-statistic approach to prevalence inference developed by Friston et al. (1999a), adapt it to be based on permutation statistics, and detail the resulting algorithm. Applied to information-like measures, this method allows us to achieve information prevalence inference, i.e. inference with respect to the proportion of subjects in the population that exhibit an information effect. We demonstrate the method using an example data set; a simplified sketch of the core computation is given below.
The advantage of Friston et al. …
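Under simplifying assumptions (a full set of label permutations per subject, identical per-subject null distributions, independent subjects), the core of the minimum-statistic prevalence computation can be sketched as follows. The function name, array layout, and the closed-form bound are our reconstruction of the logic described above, not the authors' published code.

```python
import numpy as np

def prevalence_inference(acc, alpha=0.05):
    """acc[k, j]: accuracy of subject k under label permutation j;
    column 0 holds the actual (unpermuted) labelling.
    Returns the global-null p-value and a prevalence lower bound gamma0."""
    N, P = acc.shape
    m = acc.min(axis=0)              # minimum statistic across subjects
    p_gn = np.mean(m >= m[0])        # p-value of the global null (includes m[0])
    if p_gn > alpha:
        return p_gn, 0.0             # not even the global null is rejected
    p_u = p_gn ** (1.0 / N)          # implied per-subject uncorrected p-value
    # largest gamma for which the prevalence null
    # [(1 - gamma) * p_u + gamma]^N <= alpha is still rejected
    gamma0 = (alpha ** (1.0 / N) - p_u) / (1.0 - p_u)
    return p_gn, max(gamma0, 0.0)
```

For example, with N = 12 subjects and a global-null p-value of 0.001, this bound evaluates to roughly γ0 ≈ 0.5 at α = 0.05, i.e. an information effect can be inferred to be present in at least about half of the population.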
Discussion
In this paper we have shown that the t-test on accuracies commonly used in MVPA studies is not able to provide population inference, because the true single-subject accuracy a can never be below chance level. This constraint makes the effective null hypothesis of the test the global null hypothesis that there is no effect in any subject in the population, which means that in rejecting that null hypothesis we can only infer that there are some subjects in which there is an effect. This is in …
Acknowledgments
Kai Görgen was supported by the German Research Foundation (DFG grants GRK1589/1 and FK:JA945/3-1).
The authors would like to thank Tom Nichols, Jakob Heinzle, Jörn Diedrichsen, Will Penny, María Herrojo Ruiz, Joram Soch, Martin Hebart, Jo Etzel, Yaroslav Halchenko, and Thomas Christophel for discussions, comments, and hints.
References (72)
- Allefeld, C., Haynes, J.-D. (2014). Searchlight-based multi-voxel pattern analysis of fMRI by cross-validated MANOVA. NeuroImage.
- Brodersen, K.H., et al. (2013). Variational Bayesian mixed-effects inference for classification studies. NeuroImage.
- Cichy, R.M., et al. (2011). Encoding the identity and location of objects in human LOC. NeuroImage.
- Cox, D.D., Savoy, R.L. (2003). Functional magnetic resonance imaging (fMRI) ‘brain reading’: Detecting and classifying distributed patterns of fMRI activity in human visual cortex. NeuroImage.
- Davis, T., et al. (2014). What do differences between multi-voxel and univariate analysis mean? How subject-, voxel-, and trial-level variance impact fMRI analysis. NeuroImage.
- Etzel, J.A., Zacks, J.M., Braver, T.S. (2013). Searchlight analysis: Promise, pitfalls, and potential. NeuroImage.
- Friston, K.J., et al. (1999). Multisubject fMRI studies and conjunction analyses. NeuroImage.
- Friston, K.J., et al. (1999). How many subjects constitute a study? NeuroImage.
- Gaonkar, B., Davatzikos, C. (2013). Analytic estimation of statistical significance maps for support vector machine based multi-variate image analysis and classification. NeuroImage.
- Gaonkar, B., et al., ADNI (2015). Interpreting support vector machine models for multivariate group wise analysis in neuroimaging. Med. Image Anal.
- Haufe, S., et al. (2014). On the interpretation of weight vectors of linear models in multivariate neuroimaging. NeuroImage.
- Haxby, J.V. (2012). Multivariate pattern analysis of fMRI: The early beginnings. NeuroImage.
- Haxby, J.V., et al. (2011). A common, high-dimensional model of the representational space in human ventral temporal cortex. Neuron.
- Haynes, J.-D., Rees, G. (2005). Predicting the stream of consciousness from activity in human visual cortex. Curr. Biol.
- Haynes, J.-D., et al. (2007). Reading hidden intentions in the human brain. Curr. Biol.
- Holmes, A.P., Friston, K.J. (1998). Generalisability, random effects & population inference. NeuroImage.
- Kriegeskorte, N., Bandettini, P. (2007). Analyzing for information, not activation, to exploit high-resolution fMRI. NeuroImage.
- Lazar, N.A., et al. (2002). Combining brains: A survey of methods for statistical pooling of information. NeuroImage.
- Mourão-Miranda, J., et al. (2005). Classifying brain states and determining the discriminating activation patterns: Support vector machine on functional MRI data. NeuroImage.
- Nichols, T., et al. (2005). Valid conjunction inference with the minimum statistic. NeuroImage.
- Noirhomme, Q., et al. (2014). Biased binomial assessment of cross-validated estimation of classification accuracies illustrated in diagnosis predictions. NeuroImage: Clinical.
- Norman, K.A., et al. (2006). Beyond mind-reading: Multi-voxel pattern analysis of fMRI data. Trends Cogn. Sci.
- Olivetti, E., et al. (2012). Bayesian hypothesis testing for pattern discrimination in brain decoding. Pattern Recogn.
- Penny, W.D., Holmes, A.P. (2007). Random effects analysis. In: Statistical Parametric Mapping: The Analysis of Functional Brain Images. Academic Press.
- Pereira, F., Botvinick, M. (2011). Information mapping with pattern classifiers: A comparative study. NeuroImage.
- Pereira, F., et al. (2009). Machine learning classifiers and fMRI: A tutorial overview. NeuroImage.
- Price, C.J., Friston, K.J. (1997). Cognitive conjunction: A new approach to brain activation experiments. NeuroImage.
- Rosenblatt, J.D., et al. (2014). Revisiting multi-subject random effects in fMRI: Advocating prevalence estimation. NeuroImage.
- Soch, J., Haynes, J.-D., Allefeld, C. (2016). How to avoid mismodelling in GLM-based fMRI data analysis: Cross-validated Bayesian model selection. NeuroImage.
- Spiridon, M., Kanwisher, N. (2002). How distributed is visual category information in human occipito-temporal cortex? An fMRI study. Neuron.
- Stelzer, J., Chen, Y., Turner, R. (2013). Statistical inference and multiple testing correction in classification-based multi-voxel pattern analysis (MVPA): Random permutations and cluster size control. NeuroImage.
- Stephan, K.E., et al. (2009). Bayesian model selection for group studies. NeuroImage.
- Todd, M.T., et al. (2013). Confounds in multivariate pattern analysis: Theory and rule representation case study. NeuroImage.
- Wang, Z., et al. (2007). Support vector machine learning-based fMRI data group analysis. NeuroImage.
- Worsley, K.J., Friston, K.J. (2000). A test for a conjunction. Statistics & Probability Letters.
- Wyman, F.J., et al. (1990). A comparison of asymptotic error rate expansions for the sample linear discriminant function. Pattern Recogn.