Elsevier

NeuroImage

Volume 166, 1 February 2018, Pages 247-258

Causal cortical dynamics of a predictive enhancement of speech intelligibility

https://doi.org/10.1016/j.neuroimage.2017.10.066

Highlights

  • Cortical entrainment to the speech envelope is modulated by prior knowledge.

  • Prior knowledge enhances delta-band entrainment.

  • Enhanced envelope tracking in left IFG precedes the same effect in left HG.

  • Intelligible speech modulates causal cortico-cortical dynamics between temporal areas.

Abstract

Speech perception may be underpinned by a hierarchical cortical system, which attempts to match “external” incoming sensory inputs with “internal” top-down predictions. Prior knowledge modulates internal predictions of an upcoming stimulus and exerts its effects in temporal and inferior frontal cortex. Here, we used source-space magnetoencephalography (MEG) to study the spatiotemporal dynamics underpinning the integration of prior knowledge in the speech processing network. Prior knowledge was manipulated to i) increase the perceived intelligibility of speech sentences, and ii) dissociate the perceptual effects of changes in speech intelligibility from acoustical differences in speech stimuli. Cortical entrainment to the speech temporal envelope, which accounts for neural activity specifically related to sensory information, was affected by prior knowledge: This effect emerged early (∼50 ms) in left inferior frontal gyrus (IFG) and then (∼100 ms) in Heschl's gyrus (HG), and was sustained until latencies of ∼250 ms. Directed transfer function (DTF) measures were used for estimating direct Granger causal relations between locations of interest. In line with the cortical entrainment result, this analysis indicated that prior knowledge enhanced top-down connections from left IFG to all the left temporal areas of interest – namely HG, superior temporal sulcus (STS), and middle temporal gyrus (MTG). In addition, intelligible speech increased top-down information flow between left STS and left HG, and increased bottom-up flow in higher-order temporal cortex, specifically between STS and MTG. These results are compatible with theories that explain this mechanism as a result of both ascending and descending cortical interactions, such as predictive coding. Altogether, this study provides a detailed view of how, where and when prior knowledge influences continuous speech perception.

Introduction

Humans have the ability to understand speech despite the various sources of noise and degradation that characterise real-world listening environments. Under perceptually adverse listening conditions, the perception of sensory information is aided by additional factors, such as the prior knowledge of the content of the upcoming speech (Obleser, 2014). A major challenge is to understand exactly how, where, and when those predictions influence speech perception (Norris et al., 2016). It is widely accepted that speech comprehension is underpinned by a hierarchical network that is characterized by both bottom-up and top-down signals (Hickok and Poeppel, 2007, Peelle et al., 2010, Gross et al., 2013, Bornkessel-Schlesewsky et al., 2015). In particular, top-down connections may constitute a neural basis for the integration of prior knowledge in the speech processing network (Davis and Johnsrude, 2007, Wild et al., 2012, Lewis and Bastiaansen, 2015).

The ability to disentangle neural activity at distinct processing levels may be crucial to unveil how prior information affects the speech comprehension network. One way to achieve this goal is to focus on the cortical organization of speech processing using neuroimaging technologies with high spatial resolution, such as functional magnetic resonance imaging (fMRI) (Friederici et al., 2010, DeWitt and Rauschecker, 2012, Overath et al., 2015, Tuennerhoff and Noppeney, 2016). Such studies have contributed to the characterization of specific cortical areas in terms of their functional roles in speech comprehension. In particular, a hierarchical organization of temporal areas supporting the perceptual and lexical processing of speech has been identified: Key regions include the superior temporal gyrus (STG) (Humphries et al., 2014) and the superior temporal sulcus (STS) (Overath et al., 2015), which exhibit sensitivity to acoustic and phonetic features of speech. Furthermore, the middle temporal gyrus (MTG) has been implicated in higher-level lexical processing (Lau et al., 2008, Turken and Dronkers, 2011).

The low temporal resolution of fMRI constitutes a major impediment when investigating the fast cortical dynamics of the speech processing network (Gow and Segawa, 2009, Wild et al., 2012). Complementary insights may be provided by electrocorticography (ECoG) and by non-invasive electroencephalography (EEG) and magnetoencephalography (MEG), which are better suited to characterising the precise temporal dynamics required to integrate bottom-up and top-down information. In particular, ECoG studies have identified a role for STG in categorical perception and, specifically, in the processing of phonetic-level features (Chang et al., 2010, Mesgarani et al., 2014). ECoG, EEG and MEG have all been used to show that prior knowledge elicits a top-down influence from inferior frontal gyrus (IFG) to STG (Sohoglu et al., 2012, Leonard et al., 2016, Sohoglu and Davis, 2016). Although these studies provide important insights into the mechanisms of key regions in the speech network, a number of fundamental questions remain unanswered, especially regarding the temporal dynamics and interactions between cortical areas. Previous studies lacked either the spatial resolution (Di Liberto et al., in review), the temporal resolution (Blank and Davis, 2016, Tuennerhoff and Noppeney, 2016), or the cortical coverage (Holdgraf et al., 2016) needed to characterize precise spatiotemporal dynamics between regions in the speech network. Studies with the requisite spatiotemporal resolution (Sohoglu et al., 2012, Sohoglu and Davis, 2016) focused on cortical (de)activation rather than on indexing the representational content that may underlie such responses, i.e., the neural encoding of speech features.

Here, we sought a better understanding of the spatiotemporal cortical dynamics that underpin the integration of prior knowledge within the speech comprehension network. Importantly, this study aimed to investigate these dynamics both in terms of changes in activity in key cortical areas and in terms of the neural encoding of the temporal envelope of speech. To this end, data from a perceptual “pop-out” experiment (Millman et al., 2015) were re-analyzed to isolate the effects of prior knowledge on cortical mechanisms supporting speech intelligibility. In Millman et al. (2015), perceptual “pop-out” (e.g. Davis et al., 2005) was used to change the percept of physically identical tone-carrier vocoded speech sentences (in short, tone-vocoded sentences) from unintelligible to intelligible during MEG data acquisition. The pop-out effect was obtained by preceding the presentation of some of the vocoded sentences with the original, unprocessed version of the stimulus. The pop-out approach dissociates the effects of (top-down) prior knowledge from (bottom-up) changes in sensory information (Sohoglu et al., 2012, Millman et al., 2015, Blank and Davis, 2016, Holdgraf et al., 2016, Sohoglu and Davis, 2016, Di Liberto et al., in review).

In order to assess how prior knowledge affects speech processing within the speech comprehension network, bespoke MEG beamformer-based analyses were used to estimate neural sources in bilateral locations of interest (Millman et al., 2015), corresponding to Heschl's gyrus (HG), STS, MTG, and IFG. These regions have been shown to provide distinct contributions to the speech recognition process and to represent progressively higher levels of the speech perception hierarchy (Davis and Johnsrude, 2003, Scott and Johnsrude, 2003, Hickok and Poeppel, 2007, Peelle et al., 2010, Peelle et al., 2013, Mesgarani et al., 2014, Overath et al., 2015, Leonard et al., 2016, Sohoglu and Davis, 2016, Tuennerhoff and Noppeney, 2016). The neural encoding of speech was estimated using measures of cortical entrainment to the temporal envelope of speech sentences (Lalor et al., 2009, Crosse et al., 2016b). The functional roles and interpretations of the cortical entrainment phenomenon are still debated (Ding and Simon, 2014b) and, crucially, previous research (including an analysis of the same MEG dataset used in the present study) failed to reveal any significant effect of intelligibility on entrainment measures (Millman et al., 2015). Here, we investigated this mechanism by combining more sophisticated measures of the cortical tracking of speech (Lalor et al., 2009, Crosse et al., 2016b), incorporating additional spatial, spectral, and temporal detail. Therefore, the primary goals of this study were to determine whether entrainment to the speech envelope i) is affected by perceptual pop-out, ii) entails sensitivity to the integration of prior knowledge with sensory information, and iii) reflects the consequent change in perceived intelligibility. Secondly, we aimed to investigate the top-down/bottom-up dynamics of the pop-out effect by using measures of cortical entrainment, event-related power, and effective connectivity.

Methods

The present study is based on new analyses of a previously published MEG study on perceptual “pop-out” (Millman et al., 2015).

MEG recordings

Data were collected at the University of York, UK, using a Magnes 3600 whole-head 248-channel magnetometer (formerly 4-D Neuroimaging, Inc., San Diego, CA). The signals were recorded at a sample rate of 678.17 Hz and were low-pass filtered online with a cutoff frequency of 200 Hz.

Before recording, individual facial and scalp landmarks (left and right preauricular points, Cz, nasion, and inion) were spatially coregistered using a Polhemus Fastrak System. The landmark locations in relation to

Coregistration

For the source-space analyses, the landmark locations were matched with the individual participants' anatomical magnetic resonance (MR) scans using a surface-matching technique adapted from Kozinska et al. (2001). T1-weighted MR images were acquired with a GE 3.0-T Signa Excite HDx system (General Electric, Milwaukee, WI) using an eight-channel head coil and a 3-D Fast Spoiled Gradient Recall sequence: repetition time/echo time/flip angle = 8.03 ms/3.07 ms/20°, spatial resolution of

Beamformer-based analyses

For further details on the beamformer-based analysis framework used in this study, please refer to Millman et al. (2015).

In brief, a vectorized, linearly constrained minimum-variance beamformer (Van Veen et al., 1997, Huang et al., 2004) was used to obtain the spatial filters with a multiple-spheres head model (Huang et al., 1999). Given that we expected different patterns of cortical activation in the two different experimental conditions of interest, this procedure was conducted separately

Locations of interest

The aim of this study was to characterize the effects of prior knowledge (and of the consequent enhancement in speech intelligibility) on the activity in, and interaction between, several bilateral key locations in the speech comprehension hierarchy (e.g. Hickok and Poeppel, 2007): These key locations included, as depicted in Fig. 2, HG, STS, [MNI: ±61, -22, 0] (coordinates taken from Overath et al., 2015); posterior MTG, [MNI: ±55, -46, -4] (coordinates taken from Lau et al., 2008); IFG, [MNI:

Frequency bands of interest

Spatial filters from the LOIs were generated using a time window of 2000 ms, including 500 ms prior to stimulus presentation. Broadband (1-45 Hz) data (obtained using 4th order Butterworth filters) from the conditions of interest (Pop-out, Unintelligible) were projected through the spatial filters in the first instance so that all analyses (i.e., power envelope, entrainment and causality) could be carried out using the same spatial filter orientation. Contributions from more specific brain

Event-related power analyses

Event-related fields time-locked to stimulus onset were derived for Pop-out_pre, Pop-out_post, Unintelligible_pre, and Unintelligible_post. Whilst fMRI studies are limited to overall measures of cortical activity over relatively long time windows, the current analysis also investigated the temporal dynamics of the cortical responses to speech. This information is conveyed by means of the cumulative event-related power, where cumulative power at time t is calculated as the sum of the squares for the
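As a minimal sketch of the cumulative power measure mentioned above: the snippet is cut off before it states the exact summation window, so the code below assumes the sum runs from the first sample up to and including time t. The function name `cumulative_power` and the toy values are illustrative and are not taken from the study.

```python
from itertools import accumulate


def cumulative_power(signal):
    """Running sum of squared amplitudes: out[t] = sum(signal[k] ** 2 for k <= t).

    Assumption: the window starts at the first sample of the epoch.
    """
    return list(accumulate(x * x for x in signal))


# Toy event-related field (arbitrary units), purely for illustration.
erf = [1.0, -2.0, 0.5, 3.0]
power = cumulative_power(erf)
print(power)  # → [1.0, 5.0, 5.25, 14.25]
```

Because each step only adds a non-negative term, the resulting curve is monotonically non-decreasing, which makes differences between conditions readable as the latency at which the curves diverge.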

Cortical entrainment analyses

The mapping between stimulus and cortical activity was estimated using a system identification approach based on ridge regression. In particular, the procedure involved identifying a mapping from the source-space MEG signal to the speech envelope that optimized the following linear model:

$$\hat{s}_{\mathrm{loc}}(t) = \sum_{\tau=\tau_i}^{\tau_i+\mathrm{winSize}} r(t+\tau,\mathrm{loc})\, g(\tau,\mathrm{loc}),$$

where $\hat{s}_{\mathrm{loc}}(t)$ is the estimated speech envelope using the MEG signal from a location of interest loc, $r(t+\tau,\mathrm{loc})$ is the MEG response at time lag τi and location loc, and g(τ,l
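The linear model above can be illustrated with a small ridge-regression decoder. This is a sketch under stated assumptions, not the authors' implementation: it handles a single location, pads the edges with zeros, and uses hypothetical names (`lagged_matrix`, `fit_decoder`, `reconstruct`), an arbitrary regularization value, and simulated data in which the response is simply the envelope delayed by three samples plus noise.

```python
import numpy as np


def lagged_matrix(r, lags):
    """Column j holds r(t + lags[j]); samples shifted past the edges are zero."""
    n = len(r)
    X = np.zeros((n, len(lags)))
    for j, lag in enumerate(lags):
        if lag >= 0:
            X[: n - lag, j] = r[lag:]
        else:
            X[-lag:, j] = r[: n + lag]
    return X


def fit_decoder(r, s, lags, lam):
    """Ridge solution g = (X'X + lam * I)^-1 X's for s_hat(t) = sum_tau r(t + tau) g(tau)."""
    X = lagged_matrix(r, lags)
    return np.linalg.solve(X.T @ X + lam * np.eye(len(lags)), X.T @ s)


def reconstruct(r, g, lags):
    """Apply the decoder weights to the lagged response."""
    return lagged_matrix(r, lags) @ g


# Simulated data (assumption): the response lags the envelope by 3 samples.
rng = np.random.default_rng(0)
s = rng.standard_normal(2000)                          # stand-in speech envelope
r = np.roll(s, 3) + 0.1 * rng.standard_normal(2000)    # stand-in source signal
lags = np.arange(0, 6)                                 # tau_i = 0, winSize = 5 samples
g = fit_decoder(r, s, lags, lam=1.0)
s_hat = reconstruct(r, g, lags)
print(int(np.argmax(np.abs(g))), np.corrcoef(s, s_hat)[0, 1])
```

In this toy case the largest decoder weight falls at the lag that matches the simulated delay. In practice the reconstruction accuracy would be evaluated on held-out data rather than on the data used for fitting.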

Network effective connectivity analysis

Brain connectivity measures are used to infer neuronal spatiotemporal interactions which index and predict task-relevant changes in cognitive states and behavior. Whilst methods such as dynamic causal modeling (DCM) require a set of possible hypotheses for the neurobiological system of interest (Stephan et al., 2007), there exist approaches that do not impose such a constraint and that rely on data-driven analyses (Granger, 1969, Ding et al., 2006). Here, an exploratory dynamical framework for
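The abstract names the directed transfer function (DTF) as the connectivity measure. The following sketch shows the textbook form of DTF computed from a multivariate autoregressive (MVAR) model; it is not the authors' pipeline. The function names `fit_mvar` and `dtf`, the model order, the sampling rate, and the simulated two-channel system (channel 0 drives channel 1) are all assumptions made for illustration.

```python
import numpy as np


def fit_mvar(x, order):
    """Least-squares fit of x[t] = sum_k A[k] @ x[t - k - 1] + noise; x is (T, n)."""
    T, n = x.shape
    # Predictors: past samples side by side, most recent lag first.
    past = np.hstack([x[order - k - 1 : T - k - 1] for k in range(order)])
    coef, *_ = np.linalg.lstsq(past, x[order:], rcond=None)
    # Reorder so that A[k][i, j] is the influence of channel j on channel i at lag k + 1.
    return coef.reshape(order, n, n).transpose(0, 2, 1)


def dtf(A, freqs, fs):
    """DTF[f, i, j] = |H_ij(f)| / sqrt(sum_m |H_im(f)|^2), the flow from j to i."""
    order, n, _ = A.shape
    out = np.empty((len(freqs), n, n))
    for idx, f in enumerate(freqs):
        Af = np.eye(n, dtype=complex)
        for k in range(order):
            Af -= A[k] * np.exp(-2j * np.pi * f * (k + 1) / fs)
        H = np.linalg.inv(Af)  # transfer matrix of the fitted model
        out[idx] = np.abs(H) / np.sqrt((np.abs(H) ** 2).sum(axis=1, keepdims=True))
    return out


# Simulated system (assumption): channel 0 drives channel 1, not the reverse.
rng = np.random.default_rng(1)
T = 5000
noise = rng.standard_normal((T, 2))
x = np.zeros((T, 2))
for t in range(1, T):
    x[t, 0] = 0.5 * x[t - 1, 0] + noise[t, 0]
    x[t, 1] = 0.8 * x[t - 1, 0] + 0.2 * x[t - 1, 1] + noise[t, 1]

A = fit_mvar(x, order=2)
d = dtf(A, freqs=np.arange(1, 9), fs=100.0)
print(d[:, 1, 0].mean(), d[:, 0, 1].mean())  # flow 0 -> 1 versus flow 1 -> 0
```

The row-wise normalization means each value expresses the inflow from one channel relative to all inflows into the target channel, so values lie between 0 and 1.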

Statistical analysis

All statistical analyses were conducted using Wilcoxon signed rank tests (paired if possible), except where otherwise stated. All numerical values are reported as mean ± SD. In the cortical power and in the entrainment analyses, permutation-based cluster-size statistics (Groppe et al., 2011, Maris, 2012) with 1,000 repetitions were used to correct for multiple comparisons while keeping in consideration that results for neighboring time points or time windows are not independent. The primary
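The cluster-size correction described above can be sketched as follows. Several choices here are assumptions rather than details from the study: a one-sample t-statistic stands in for the pointwise test, the null distribution is built by randomly flipping the sign of each participant's paired difference, and the threshold, function names, and simulated data are hypothetical. Only the count of 1,000 permutations comes from the text.

```python
import numpy as np


def t_stat(d):
    """One-sample t-statistic per time point; d has shape (subjects, times)."""
    return d.mean(axis=0) / (d.std(axis=0, ddof=1) / np.sqrt(d.shape[0]))


def max_cluster_size(mask):
    """Length of the longest run of consecutive True values."""
    best = run = 0
    for flag in mask:
        run = run + 1 if flag else 0
        best = max(best, run)
    return best


def cluster_size_test(d, thresh=2.0, n_perm=1000, seed=0):
    """Compare the largest observed cluster with a sign-flip permutation null."""
    rng = np.random.default_rng(seed)
    observed = max_cluster_size(np.abs(t_stat(d)) > thresh)
    null = np.empty(n_perm, dtype=int)
    for p in range(n_perm):
        # Flip the sign of each subject's whole time course (paired design).
        signs = rng.choice([-1.0, 1.0], size=(d.shape[0], 1))
        null[p] = max_cluster_size(np.abs(t_stat(d * signs)) > thresh)
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p_value


# Simulated paired differences (assumption): an effect between samples 40 and 59.
rng = np.random.default_rng(42)
d = rng.standard_normal((15, 100))
d[:, 40:60] += 1.5
observed, p_value = cluster_size_test(d)
print(observed, p_value)
```

Flipping the sign of a participant's entire time course preserves the temporal autocorrelation within that participant, which is how this kind of test accounts for the dependence between neighboring time points.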

Behavioral intelligibility ratings

The responses made during the probe trials were analyzed to confirm that the Pop-out sentences were perceived as more intelligible in block 2, i.e., after exposure to the unprocessed speech. The low intelligibility ratings for the Pop-out_pre (mean = 15.7%, SD = 34%), Unintelligible_pre (mean = 11.1%, SD = 17.3%), and Unintelligible_post (mean = 17.8%, SD = 28.9%) sentences indicate that they were perceived as unintelligible. The intelligibility ratings for Pop-out_post (mean = 93.5%, SD = 15.2%)

Discussion

The cortical mechanisms underpinning the integration of prior knowledge with sensory input during continuous speech comprehension are poorly understood. Here, we demonstrated that non-invasive MEG measures are sensitive to the predictive effects of prior knowledge on perceived speech intelligibility. Furthermore, we provided insight into the cortical spatiotemporal dynamics that underlie this process, and the implications for current views of the cortical underpinnings of speech comprehension.

Acknowledgements

This study was supported by an Irish Research Council Government of Ireland Postgraduate Scholarship (GOIPG/2013/1249, 2013-2017) and by a travel grant from Guarantors of Brain (UK registered charity). The authors thank Bahman Nasseroleslami for useful discussions on the connectivity analysis approach.

References (84)

  • H. Park et al.

    Frontal top-down signals increase coupling of auditory low-frequency oscillations to continuous speech in human listeners

    Curr. Biol.

    (2015)
  • D. Poeppel

    The analysis of speech in different temporal integration windows: cerebral lateralization as ‘asymmetric sampling in time’

    Speech Commun.

    (2003)
  • J. Rademacher et al.

    Probabilistic mapping and volume measurement of human primary auditory cortex

    NeuroImage

    (2001)
  • S.K. Scott et al.

    The neuroanatomical and functional organization of speech perception

    Trends Neurosci.

    (2003)
  • J. Tuennerhoff et al.

    When sentences live up to your expectations

    NeuroImage

    (2016)
  • C.J. Wild et al.

    Human auditory cortex is sensitive to the perceived clarity of speech

    NeuroImage

    (2012)
  • L. Zhang et al.

    Task-dependent modulation of regions in the left temporal cortex during auditory sentence comprehension

    Neurosci. Lett.

    (2015)
  • B. Zoefel et al.

    EEG oscillations entrain their phase to high-level features of speech sound

    NeuroImage

    (2016)
  • E. Ahissar et al.

    Speech comprehension is correlated with temporal response patterns recorded from auditory cortex

    Proc. Natl. Acad. Sci.

    (2001)
  • S.J. Aiken et al.

    Human cortical responses to the speech envelope

    Ear Hear

    (2008)
  • L.H. Arnal et al.

    Delta–beta coupled oscillations underlie temporal prediction accuracy

    Cereb. Cortex (New York, NY)

    (2015)
  • L.H. Arnal et al.

    Transitions in neural oscillations reflect prediction errors generated in audiovisual speech

    Nat. Neurosci.

    (2011)
  • H. Blank et al.

    Prediction errors but not sharpened signals simulate multivoxel fMRI patterns during speech perception

    PLoS Biol.

    (2016)
  • K.J. Blinowska

    Review of the methods of determination of directed connectivity from multichannel data

    Med. Biol. Eng. Comput.

    (2011)
  • E.F. Chang et al.

    Categorical speech representation in human superior temporal gyrus

    Nat. Neurosci.

    (2010)
  • A. Clark

    Whatever next? Predictive brains, situated agents, and the future of cognitive science

    Behav. Brain Sci.

    (2013)
  • M.J. Crosse et al.

    Eye can hear clearly now: inverse effectiveness in natural audiovisual speech processing relies on long-term crossmodal temporal integration

    J. Neurosci.

    (2016)
  • M.J. Crosse et al.

    The multivariate temporal response function (mTRF) toolbox: a MATLAB toolbox for relating neural signals to continuous stimuli

    Frontiers Human Neurosci.

    (2016)
  • M.H. Davis et al.

    Hierarchical processing in spoken language comprehension

    J. Neurosci. Official J. Soc. Neurosci.

    (2003)
  • M.H. Davis et al.

    Lexical information drives perceptual learning of distorted speech: evidence from the comprehension of noise-vocoded sentences

    J. Exp. Psychol. General

    (2005)
  • A. Delorme et al.

    EEGLAB, SIFT, NFT, BCILAB, and ERICA: new tools for advanced EEG processing

    Comput. Intell. Neurosci.

    (2011)
  • I. DeWitt et al.

    Phoneme and word recognition in the auditory ventral stream

    Proc. Natl. Acad. Sci. U. S. A.

    (2012)
  • Di Liberto, G.M., Crosse, M.J., Lalor, E.C., Cortical measures of phoneme-level speech encoding correlate with the...
  • M. Ding et al.

    Granger causality: basic theory and application to neuroscience. Handbook of Time Series Analysis: Recent Theoretical Developments and Applications

    (2006)
  • M. Ding et al.

    Short-window spectral analysis of cortical event-related potentials by adaptive multivariate autoregressive modeling: data preprocessing, model validation, and variability assessment

    Biol. Cybern.

    (2000)
  • N. Ding et al.

    Adaptive temporal encoding leads to a background-insensitive cortical representation of speech

    J. Neurosci. Official J. Soc. Neurosci.

    (2013)
  • N. Ding et al.

    Cortical entrainment to continuous speech: functional roles and interpretations

    Frontiers Human Neurosci.

    (2014)
  • H. Dudley

    Remaking speech

    J. Acoust. Soc. Am.

    (1939)
  • A.C. Evans et al.

    3D statistical neuroanatomical models from 305 MRI volumes

  • L. Fontolan et al.

    The contribution of frequency-specific activity to hierarchical information processing in the human auditory cortex

    Nat. Commun.

    (2014)
  • J.R. Foster et al.

    Lip-reading the BKB sentence lists: corrections for list and practice effects

    Br. J. Audiol.

    (1993)