NeuroImage
Volume 141, 1 November 2016, Pages 378-392

Valid population inference for information-based imaging: From the second-level t-test to prevalence inference

https://doi.org/10.1016/j.neuroimage.2016.07.040

Highlights

  • A second-level t-test applied to accuracies in MVPA does not provide population inference.

  • The same holds for other measures used in information-based imaging.

  • The reason is that the true value of ‘information-like’ measures cannot be below chance level.

  • This is in contrast to the use of the t-test in univariate analysis, which does support generalization.

  • Population inference in MVPA can be achieved by targeting the effect prevalence instead of the mean.

Abstract

In multivariate pattern analysis of neuroimaging data, ‘second-level’ inference is often performed by entering classification accuracies into a t-test vs chance level across subjects. We argue that while the random-effects analysis implemented by the t-test does provide population inference if applied to activation differences, it fails to do so in the case of classification accuracy or other ‘information-like’ measures, because the true value of such measures can never be below chance level. This constraint changes the meaning of the population-level null hypothesis being tested, which becomes equivalent to the global null hypothesis that there is no effect in any subject in the population. Consequently, rejecting it only allows us to infer that there are some subjects in which there is an information effect, but not that the effect generalizes, rendering the test effectively equivalent to a fixed-effects analysis. This conclusion is supported by theoretical arguments as well as simulations. We review possible alternative approaches to population inference for information-based imaging, converging on the idea that it should target not the mean but the prevalence of the effect in the population. One method to do so, ‘permutation-based information prevalence inference using the minimum statistic’, is described in detail and applied to empirical data.

Introduction

Since the seminal work of Haxby et al. (2001), an increasing number of neuroimaging studies have employed multivariate methods to complement the established mass-univariate approach (Friston et al., 1995) to the analysis of functional magnetic resonance imaging (fMRI) data, a field now known as multivariate pattern analysis (MVPA; Norman et al., 2006). Most MVPA studies use classification (Pereira et al., 2009) to examine activation patterns; the accuracy of a classifier in distinguishing activation patterns associated with different experimental conditions serves as a measure of multivariate effect strength. Since the target of MVPA is not a generally increased or decreased level of activation but the information content of activation patterns (cf. Pereira and Botvinick, 2011), it has also been characterized as information-based imaging and distinguished from traditional activation-based imaging (Kriegeskorte et al., 2006).

Many methodological aspects of MVPA have already been discussed in detail: what kind of classifier to use (Cox and Savoy, 2003, Norman et al., 2006), whether to adapt parametric multivariate statistics instead of classifiers (Allefeld and Haynes, 2014, Nili et al., 2014), how to understand searchlight-based accuracy maps (Etzel et al., 2013), or how classifier weights can be made interpretable (Haufe et al., 2014, Hoyos-Idrobo et al., 2015). By contrast, the topic of population inference based on per-subject measures of information content, i.e. the question of whether an information effect observed in a sample of subjects generalizes to the population from which these subjects were recruited, has not yet received sufficient attention (but see Brodersen et al., 2013).

In univariate analysis of multi-subject fMRI studies, the standard way to achieve population inference is to perform a ‘second-level’ null hypothesis test (Holmes and Friston, 1998). For each subject, a ‘first-level’ contrast (activation difference) is computed, and this contrast enters a second-level analysis, a t-test or an ANOVA. Specifically, for a simple one-sided t-test vs 0, reaching statistical significance allows us to infer that the experimental manipulation is associated with an increase of activation on average in the population of subjects. This is interpreted to mean that the effect is ‘common’ or ‘stereotypical’ in that population (Penny and Holmes, 2007, p. 156).
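To make the logic of this random-effects analysis concrete, the following sketch (ours; all parameter values are illustrative assumptions) simulates the two-level model underlying the second-level t-test: true per-subject contrasts scatter around a population mean, estimates add first-level noise, and the one-sample t-test assesses whether the population mean exceeds zero.

    # Sketch (ours, illustrative values): the two-level model behind the
    # univariate second-level t-test.
    import numpy as np
    from scipy.stats import ttest_1samp

    rng = np.random.default_rng(0)
    n_subjects = 16
    mu, sigma2 = 0.3, 0.5      # population mean, between-subject sd
    sigma1 = 0.4               # within-subject estimation noise sd

    true_contrast = rng.normal(mu, sigma2, n_subjects)   # second level
    est_contrast = rng.normal(true_contrast, sigma1)     # first level

    # One-sided one-sample t-test vs 0: rejection supports the
    # population-level claim that the mean contrast mu is above zero.
    t, p = ttest_1samp(est_contrast, 0.0, alternative='greater')
    print(f"t = {t:.2f}, p = {p:.4f}")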

With the adoption of information-based imaging, it has become accepted practice to apply the same second-level inferential procedures to the results of first-level multivariate analyses, in particular classification accuracy (see e.g. Haxby et al., 2001, Haynes et al., 2007, Spiridon and Kanwisher, 2002): A classifier is trained on one part of the data and tested on another part, using each part for testing once (cross-validation), and the classification performance is quantified as an accuracy, the fraction of correctly classified test data points. If, for example, the classifier is applied to two different experimental conditions and there is no multivariate difference in the data between conditions, it operates at ‘chance level’, i.e. it achieves on average a classification accuracy of 50%. At the second level, accuracies from different subjects are then entered into a one-sided one-sample t-test vs 50%, in order to show that the ability to classify above chance, and therefore the presence of an information effect, is typical in the population the subjects were recruited from.
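As a concrete illustration of the first-level step, the following sketch (ours; the simulated data, dimensions, and choice of a linear SVM are assumptions, not the pipeline of any particular study) computes a cross-validated accuracy for two-condition pattern data.

    # Sketch (ours): first-level cross-validated classification accuracy
    # for simulated two-condition pattern data.
    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n_trials, n_voxels = 80, 50
    X = rng.standard_normal((n_trials, n_voxels))  # activation patterns
    y = np.repeat([0, 1], n_trials // 2)           # two conditions

    # 5-fold cross-validation: each part of the data is used for testing
    # exactly once; the accuracy is the fraction of correctly classified
    # test data points, averaged across folds.
    accuracy = cross_val_score(LinearSVC(), X, y, cv=5).mean()
    print(f"cross-validated accuracy: {accuracy:.3f} (chance: 0.5)")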

In this paper we argue that despite the seemingly analogous statistical procedure, a t-test vs chance level applied to accuracies cannot provide evidence that the corresponding effect is typical in the population. In contrast to other criticisms of this use of the t-test (see below), in our view the problem is not so much that the estimation distribution of cross-validated accuracies is not normal or even symmetric, or that a normal distribution model is generally inadequate for a quantity bounded to the interval [0%, 100%]. Rather, the problem is that, unlike an estimated accuracy, the true single-subject accuracy can never be below chance level, because it measures an amount of information. We will show that this restriction changes the meaning of the t-test: it now tests the global null hypothesis (Nichols et al., 2005) that there is no information in any subject in the population. As a consequence, a significant test result only allows us to infer that there are some subjects in whom there is an effect, but not that the presence of information generalizes to the population. The argument holds not only for classification accuracy but also for other ‘information-like’ measures.
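The consequence can be illustrated by a small simulation (our own construction, with assumed parameter values): because true accuracies are bounded below by chance, the population mean exceeds chance as soon as any subject carries an effect, and the t-test rejects even when the effect is present in only a small minority.

    # Simulation sketch (ours, assumed parameters): only 20% of subjects
    # carry an information effect (true accuracy 0.75); the rest are
    # exactly at chance (0.5) -- never below it.
    import numpy as np
    from scipy.stats import ttest_1samp

    rng = np.random.default_rng(0)
    n_subjects, n_test_trials, n_sims = 16, 100, 2000
    prevalence, effect_acc = 0.2, 0.75

    n_reject = 0
    for _ in range(n_sims):
        has_effect = rng.random(n_subjects) < prevalence
        true_acc = np.where(has_effect, effect_acc, 0.5)
        # Estimated accuracies are binomial around the true values.
        est_acc = rng.binomial(n_test_trials, true_acc) / n_test_trials
        _, p = ttest_1samp(est_acc, 0.5, alternative='greater')
        n_reject += p < 0.05

    print(f"rejection rate: {n_reject / n_sims:.2f} (alpha = 0.05)")

Although only a fifth of the population carries the effect under these settings, the rejection rate is well above the nominal 5%, illustrating that a significant result does not license the conclusion that above-chance classification is typical.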

The t-test on accuracies has been criticized before (Brodersen et al., 2013, Stelzer et al., 2013) on the grounds that its distributional assumptions are not fulfilled for cross-validated classification accuracies. Such a distributional error invalidates the calculation of critical values for the t-statistic and can therefore lead to an increased rate of false positives. This problem may be solved by better distribution models (Brodersen et al., 2013) or the use of non-parametric statistics (Stelzer et al., 2013). Our criticism goes significantly beyond that: not only is the t-test quantitatively wrong, it effectively tests a null hypothesis that is qualitatively different from the one tested in univariate analysis, with the consequence that rejecting it no longer supports population inference.

Please note that our criticism pertains specifically to a second-level t-test applied to per-subject classification accuracies or similar measures. It does not apply to the classification of subjects, e.g. into different patient groups in medical applications (Sabuncu, 2014, Sabuncu and Van Leemput, 2012), or to the classification of condition-specific patterns across subjects (Mourao-Miranda et al., 2005). Moreover, it only concerns quantities that measure the information content of data, not related quantities like classifier weights (Gaonkar and Davatzikos, 2013, Gaonkar et al., 2015, Wang et al., 2007; see below).

The organization of the paper is as follows: In the section The problem with the t-test on accuracies we detail how a second-level t-test achieves population inference for univariate contrasts. We then explain that MVPA measures are ‘information-like’ and show, both theoretically and using simulations, that for such measures the t-test effectively tests the global null hypothesis that there is no effect in any subject. The section An alternative: information prevalence inference reviews possible alternatives to the t-test on accuracies, converging on the idea that population inference for information-based imaging should target the proportion of subjects in the population with an effect. One way to implement such an ‘information prevalence inference’ is described in detail in the section Permutation-based information prevalence inference using the minimum statistic, and the results of its application to real data are compared with those of the t-test. We conclude with a discussion of a number of questions surrounding the problem of population inference for information-based imaging.

Section snippets

Population inference in univariate fMRI analysis

To see why the t-test on accuracies cannot provide population inference, we briefly recapitulate how standard univariate analysis does achieve it. In a single subject, an activation difference or contrast $\Delta\beta$ is estimated based on the general linear model (GLM; Friston et al., 1995). Because it is obtained from noisy data, the estimate is itself noisy, $\hat{\Delta\beta} \sim \mathcal{N}(\Delta\beta, \sigma_1^2)$, where $\sigma_1^2$ denotes the estimation variance of the contrast (cf. Fig. 1a). If several subjects are included in a study, the true

An alternative: information prevalence inference

In the previous part we established that the second-level t-test applied to accuracies is not able to provide population inference. We now discuss alternative approaches, leading us to the idea that population inference for information-based imaging should target the proportion of people in the population in which there is an information effect.

Within the MVPA literature, there are three alternative proposals. First, Kriegeskorte and Bandettini (2007) recommend applying the methods for

Permutation-based information prevalence inference using the minimum statistic

In this part we recapitulate the minimum-statistic approach to prevalence inference developed by Friston et al. (1999a), adapt it to be based on permutation statistics, and detail the resulting algorithm. Applied to information-like measures, this method allows us to achieve information prevalence inference, i.e. inference with respect to the proportion of subjects in the population that exhibit an information effect. We demonstrate the method using an example data set.
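The following sketch condenses our reading of the core computation for a single voxel or region; it is not the authors' full algorithm, which additionally handles spatially extended data. The variable names, the Monte Carlo recombination scheme, and the closed-form prevalence bound derived from the mixture inequality p(γ0) = (γ0 + (1 − γ0)·p1)^N are our simplifications and assumptions.

    # Sketch (ours) of single-region permutation-based prevalence
    # inference using the minimum statistic.
    import numpy as np

    def prevalence_inference(perm_acc, n_mc=10000, alpha=0.05, seed=0):
        """perm_acc: array (n_subjects, n_perms) of per-subject
        accuracies; column 0 is the actual (unpermuted) value, the
        remaining columns are first-level permutation values."""
        rng = np.random.default_rng(seed)
        n_sub, n_perm = perm_acc.shape

        # Minimum statistic of the actual data across subjects.
        m_actual = perm_acc[:, 0].min()

        # Monte Carlo estimate of the global-null distribution of the
        # minimum: draw one first-level permutation per subject and take
        # the minimum; the actual permutation is counted once.
        count = 1
        for _ in range(n_mc - 1):
            idx = rng.integers(0, n_perm, size=n_sub)
            count += perm_acc[np.arange(n_sub), idx].min() >= m_actual
        p_global = count / n_mc   # p-value for the global null

        # Largest gamma0 for which the prevalence null 'the effect is
        # present in at most a proportion gamma0 of the population' can
        # be rejected at level alpha, solving
        # (gamma0 + (1 - gamma0) * p1) ** n_sub = alpha
        # with per-subject tail probability p1 = p_global ** (1/n_sub).
        p1 = p_global ** (1.0 / n_sub)
        if p1 >= 1.0:
            return p_global, 0.0
        gamma0 = (alpha ** (1.0 / n_sub) - p1) / (1.0 - p1)
        return p_global, max(gamma0, 0.0)

Under this reading, rejecting the prevalence null for γ0 = 0.5 would license the claim that the information effect is present in the majority of the population.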

The advantage of Friston et

Discussion

In this paper we have shown that the t-test on accuracies commonly used in MVPA studies is not able to provide population inference because the true single-subject accuracy a can never be below chance level. This constraint makes the effective null hypothesis of the test the global null hypothesis that there is no effect in any subject in the population, which means that in rejecting that null hypothesis we can only infer that there are some subjects in whom there is an effect. This is in

Acknowledgments

Kai Görgen was supported by the German Research Foundation (DFG grants GRK1589/1 and FK:JA945/3-1).

The authors would like to thank Tom Nichols, Jakob Heinzle, Jörn Diedrichsen, Will Penny, María Herrojo Ruiz, Joram Soch, Martin Hebart, Jo Etzel, Yaroslav Halchenko, and Thomas Christophel for discussions, comments, and hints.

References (72)

  • S. Haufe et al., On the interpretation of weight vectors of linear models in multivariate neuroimaging, NeuroImage (2014)
  • J. Haxby, Multivariate pattern analysis of fMRI: The early beginnings, NeuroImage (2012)
  • J. Haxby et al., A common, high-dimensional model of the representational space in human ventral temporal cortex, Neuron (2011)
  • J.D. Haynes et al., Predicting the stream of consciousness from activity in human visual cortex, Curr. Biol. (2005)
  • J.D. Haynes et al., Reading hidden intentions in the human brain, Curr. Biol. (2007)
  • A. Holmes et al., Generalisability, random effects & population inference, NeuroImage (1998)
  • N. Kriegeskorte et al., Analyzing for information, not activation, to exploit high-resolution fMRI, NeuroImage (2007)
  • N. Lazar et al., Combining brains: A survey of methods for statistical pooling of information, NeuroImage (2002)
  • J. Mourao-Miranda et al., Classifying brain states and determining the discriminating activation patterns: Support vector machine on functional MRI data, NeuroImage (2005)
  • T. Nichols et al., Valid conjunction inference with the minimum statistic, NeuroImage (2005)
  • Q. Noirhomme et al., Biased binomial assessment of cross-validated estimation of classification accuracies illustrated in diagnosis predictions, NeuroImage: Clinical (2014)
  • K. Norman et al., Beyond mind-reading: Multi-voxel pattern analysis of fMRI data, Trends Cogn. Sci. (2006)
  • E. Olivetti et al., Bayesian hypothesis testing for pattern discrimination in brain decoding, Pattern Recogn. (2012)
  • W. Penny et al., Random effects analysis
  • F. Pereira et al., Information mapping with pattern classifiers: A comparative study, NeuroImage (2011)
  • F. Pereira et al., Machine learning classifiers and fMRI: A tutorial overview, NeuroImage (2009)
  • C. Price et al., Cognitive conjunction: A new approach to brain activation experiments, NeuroImage (1997)
  • J. Rosenblatt et al., Revisiting multi-subject random effects in fMRI: Advocating prevalence estimation, NeuroImage (2014)
  • J. Soch et al., How to avoid mismodelling in GLM-based fMRI data analysis: Cross-validated Bayesian model selection, NeuroImage (2016)
  • M. Spiridon et al., How distributed is visual category information in human occipito-temporal cortex? An fMRI study, Neuron (2002)
  • J. Stelzer et al., Statistical inference and multiple testing correction in classification-based multi-voxel pattern analysis (MVPA): Random permutations and cluster size control, NeuroImage (2013)
  • K. Stephan et al., Bayesian model selection for group studies, NeuroImage (2009)
  • M. Todd et al., Confounds in multivariate pattern analysis: Theory and rule representation case study, NeuroImage (2013)
  • Z. Wang et al., Support vector machine learning-based fMRI data group analysis, NeuroImage (2007)
  • K. Worsley et al., A test for a conjunction, Statistics & Probability Letters (2000)
  • F. Wyman et al., A comparison of asymptotic error rate expansions for the sample linear discriminant function, Pattern Recogn. (1990)