Causal cortical dynamics of a predictive enhancement of speech intelligibility
Introduction
Humans can understand speech despite the various sources of noise and degradation that characterize real-world listening environments. Under perceptually adverse listening conditions, the perception of sensory information is aided by additional factors, such as prior knowledge of the content of the upcoming speech (Obleser, 2014). A major challenge is to understand exactly how, where, and when those predictions influence speech perception (Norris et al., 2016). It is widely accepted that speech comprehension is underpinned by a hierarchical network characterized by both bottom-up and top-down signals (Hickok and Poeppel, 2007, Peelle et al., 2010, Gross et al., 2013, Bornkessel-Schlesewsky et al., 2015). In particular, top-down connections may constitute a neural basis for the integration of prior knowledge in the speech processing network (Davis and Johnsrude, 2007, Wild et al., 2012, Lewis and Bastiaansen, 2015).
The ability to disentangle neural activity at distinct processing levels may be crucial to unveil how prior information affects the speech comprehension network. One way to achieve this goal is to focus on the cortical organization of speech processing using neuroimaging technologies with high spatial resolution, such as functional magnetic resonance imaging (fMRI) (Friederici et al., 2010, DeWitt and Rauschecker, 2012, Overath et al., 2015, Tuennerhoff and Noppeney, 2016). Such studies have contributed to the characterization of specific cortical areas in terms of their functional roles in speech comprehension. In particular, a hierarchical organization of temporal areas supporting the perceptual and lexical processing of speech has been identified: Key regions include the superior temporal gyrus (STG) (Humphries et al., 2014) and the superior temporal sulcus (STS) (Overath et al., 2015), which exhibit sensitivity to acoustic and phonetic features of speech. Furthermore, the middle temporal gyrus (MTG) has been implicated in higher-level lexical processing (Lau et al., 2008, Turken and Dronkers, 2011).
The low temporal resolution of fMRI constitutes a major impediment when investigating the fast cortical dynamics of the speech processing network (Gow and Segawa, 2009, Wild et al., 2012). Complementary insights may be provided by electrocorticography (ECoG) and non-invasive electroencephalography (EEG) and magnetoencephalography (MEG), which are better suited to characterizing the precise temporal dynamics required to integrate bottom-up and top-down information. In particular, ECoG studies have identified a role for STG in categorical perception and, specifically, in the processing of phonetic-level features (Chang et al., 2010, Mesgarani et al., 2014). ECoG, EEG and MEG have all been used to show that prior knowledge elicits a top-down influence from inferior frontal gyrus (IFG) to STG (Sohoglu et al., 2012, Leonard et al., 2016, Sohoglu and Davis, 2016). Although these studies provide important insights into the mechanisms of key regions in the speech network, a number of fundamental questions remain unanswered, especially regarding the temporal dynamics and interactions between cortical areas. Previous studies lacked either the spatial resolution (Di Liberto et al., in review), the temporal resolution (Blank and Davis, 2016, Tuennerhoff and Noppeney, 2016), or the cortical coverage (Holdgraf et al., 2016) to characterize precise spatiotemporal dynamics between regions in the speech network. Studies with the requisite spatiotemporal resolution (Sohoglu et al., 2012, Sohoglu and Davis, 2016) focused on cortical (de)activation rather than indexing the representational content that may underlie such responses, i.e., the neural encoding of speech features.
Here, we sought a better understanding of the spatiotemporal cortical dynamics that underpin the integration of prior knowledge within the speech comprehension network. Importantly, this study aimed to investigate these dynamics both in terms of changes in activity in key cortical areas and in terms of the neural encoding of the temporal envelope of speech. To this end, data from a perceptual “pop-out” experiment (Millman et al., 2015) were re-analyzed to isolate the effects of prior knowledge on cortical mechanisms supporting speech intelligibility. In Millman et al. (2015), perceptual “pop-out” (e.g. Davis et al., 2005) was used to change the percept of physically identical tone-carrier vocoded speech sentences (in short, tone-vocoded sentences) from unintelligible to intelligible during MEG data acquisition. The pop-out effect was obtained by preceding the presentation of some of the vocoded sentences with the original, unprocessed version of the stimulus. The pop-out approach dissociates the effects of (top-down) prior knowledge from (bottom-up) changes in sensory information (Sohoglu et al., 2012, Millman et al., 2015, Blank and Davis, 2016, Holdgraf et al., 2016, Sohoglu and Davis, 2016, Di Liberto et al., in review).
In order to assess how prior knowledge affects speech processing within the speech comprehension network, bespoke MEG beamformer-based analyses were used to estimate neural sources in bilateral locations of interest (Millman et al., 2015), corresponding to Heschl's gyrus (HG), STS, MTG, and IFG. These regions have been shown to provide distinct contributions to the speech recognition process and to represent progressively higher levels of the speech perception hierarchy (Davis and Johnsrude, 2003, Scott and Johnsrude, 2003, Hickok and Poeppel, 2007, Peelle et al., 2010, Peelle et al., 2013, Mesgarani et al., 2014, Overath et al., 2015, Leonard et al., 2016, Sohoglu and Davis, 2016, Tuennerhoff and Noppeney, 2016). The neural encoding of speech was estimated using measures of cortical entrainment to the temporal envelope of speech sentences (Lalor et al., 2009, Crosse et al., 2016b). The functional roles and interpretations of the cortical entrainment phenomenon are still debated (Ding and Simon, 2014b) and, crucially, previous research (including an analysis of the same MEG dataset used in the present study) failed to reveal any significant effect of intelligibility on entrainment measures (Millman et al., 2015). Here, we investigated this mechanism by combining more sophisticated measures of the cortical tracking of speech (Lalor et al., 2009, Crosse et al., 2016b), incorporating additional spatial, spectral, and temporal detail. The primary goals of this study were to determine whether entrainment to the speech envelope i) is affected by perceptual pop-out, ii) is sensitive to the integration of prior knowledge with sensory information, and iii) reflects the consequent change in perceived intelligibility. In addition, we aimed to investigate the top-down/bottom-up dynamics of the pop-out effect by using measures of cortical entrainment, event-related power, and effective connectivity.
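Cortical-entrainment measures of this kind require first extracting the temporal envelope of each sentence. A minimal sketch, assuming the common Hilbert-magnitude approach (the specific envelope-extraction method is not stated in this excerpt):

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def speech_envelope(audio, fs, cutoff=45.0, order=4):
    """Broadband temporal envelope: Hilbert magnitude, low-pass smoothed."""
    env = np.abs(hilbert(audio))             # magnitude of the analytic signal
    b, a = butter(order, cutoff / (fs / 2))  # smoothing low-pass filter
    return filtfilt(b, a, env)               # zero-phase filtering

# toy example: a 4-Hz amplitude-modulated tone standing in for speech
fs = 1000
t = np.arange(0, 1, 1 / fs)
carrier = np.sin(2 * np.pi * 100 * t)
modulator = 0.5 * (1 + np.sin(2 * np.pi * 4 * t))
env = speech_envelope(carrier * modulator, fs)
```

The recovered envelope closely tracks the imposed modulator, which is the property entrainment analyses rely on.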
Section snippets
Methods
The present study is based on new analyses of a previously published MEG study on perceptual “pop-out” (Millman et al., 2015).
MEG recordings
Data were collected at the University of York, UK, using a Magnes 3600 whole-head 248-channel magnetometer (formerly 4-D Neuroimaging, Inc., San Diego, CA). The signals were recorded at a sample rate of 678.17 Hz and were low-pass filtered online with a cutoff frequency of 200 Hz.
Before recording, individual facial and scalp landmarks (left and right preauricular points, Cz, nasion, and inion) were spatially coregistered using a Polhemus Fastrak System. The landmark locations in relation to
Coregistration
For the source-space analyses, the landmark locations were matched with the individual participants' anatomical magnetic resonance (MR) scans using a surface-matching technique adapted from Kozinska et al. (2001). T1-weighted MR images were acquired with a GE 3.0-T Signa Excite HDx system (General Electric, Milwaukee, WI) using an eight-channel head coil and a 3-D Fast Spoiled Gradient Recall sequence: repetition time/echo time/flip angle = 8.03 ms/3.07 ms/20°, spatial resolution of
Beamformer-based analyses
For further details on the beamformer-based analysis framework used in this study, please refer to Millman et al. (2015).
In brief, a vectorized, linearly constrained minimum-variance beamformer (Van Veen et al., 1997, Huang et al., 2004) was used to obtain the spatial filters with a multiple-spheres head model (Huang et al., 1999). Given that we expected different patterns of cortical activation in the two different experimental conditions of interest, this procedure was conducted separately
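The LCMV beamformer referenced above derives, for each source location, sensor weights that pass signals from that location with unit gain while minimizing variance from all other sources. A minimal numpy sketch of the generic vector-LCMV weight computation (not the authors' exact implementation; the regularization choice is an assumption):

```python
import numpy as np

def lcmv_weights(cov, leadfield, reg=0.05):
    """Linearly constrained minimum-variance (LCMV) spatial filter.

    cov: (n_chan, n_chan) sensor covariance; leadfield: (n_chan, n_ori)
    forward solution for one source location. Returns (n_ori, n_chan)
    weights W satisfying the unit-gain constraint W @ leadfield = I.
    """
    n = cov.shape[0]
    c = cov + reg * np.trace(cov) / n * np.eye(n)  # Tikhonov regularization
    c_inv = np.linalg.inv(c)
    gram = leadfield.T @ c_inv @ leadfield         # (n_ori, n_ori)
    return np.linalg.solve(gram, leadfield.T @ c_inv)

# toy example: 10 channels, one source with 3 orientations
rng = np.random.default_rng(0)
L = rng.standard_normal((10, 3))
data = rng.standard_normal((10, 500))
C = data @ data.T / 500
W = lcmv_weights(C, L)
print(np.allclose(W @ L, np.eye(3)))  # True: unit-gain constraint holds
```

Projecting sensor data through `W` yields the source ("virtual electrode") time series used in the subsequent analyses.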
Locations of interest
The aim of this study was to characterize the effects of prior knowledge (and of the consequent enhancement in speech intelligibility) on the activity in, and interaction between, several bilateral key locations in the speech comprehension hierarchy (e.g. Hickok and Poeppel, 2007): These key locations included, as depicted in Fig. 2, HG, STS, [MNI: ±61, -22, 0] (coordinates taken from Overath et al., 2015); posterior MTG, [MNI: ±55, -46, -4] (coordinates taken from Lau et al., 2008); IFG, [MNI:
Frequency bands of interest
Spatial filters from the LOIs were generated using a time window of 2000 ms, including 500 ms prior to stimulus presentation. Broadband (1-45 Hz) data (obtained using 4th order Butterworth filters) from the conditions of interest (Pop-out, Unintelligible) were projected through the spatial filters in the first instance so that all analyses (i.e., power envelope, entrainment and causality) could be carried out using the same spatial filter orientation. Contributions from more specific brain
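The broadband filtering and projection steps can be sketched as follows; the channel count and sample rate come from the recording description, while the spatial-filter weights here are random placeholders standing in for beamformer output:

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 678.17                      # MEG sample rate from the recordings
nyq = fs / 2
b, a = butter(4, [1.0 / nyq, 45.0 / nyq], btype="bandpass")  # 4th-order 1-45 Hz

rng = np.random.default_rng(1)
sensors = rng.standard_normal((248, 2000))   # channels x samples (toy data)
broadband = filtfilt(b, a, sensors, axis=1)  # zero-phase broadband filtering

# project through an LOI spatial filter (hypothetical beamformer weights)
w = rng.standard_normal(248)
virtual_electrode = w @ broadband            # one source-space time series
```

Using the same spatial filter orientation for every analysis, as the text notes, ensures that power, entrainment, and causality measures are computed on identical source signals.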
Event-related power analyses
Event-related fields time-locked to stimulus onset were derived for Pop-outpre, Pop-outpost, Unintelligiblepre, and Unintelligiblepost. Whilst fMRI studies are limited to overall measures of cortical activity over relatively long time windows, the current analysis also investigated the temporal dynamics of the cortical responses to speech. This information is conveyed by means of the cumulative event-related power, where cumulative power at time t is calculated as the sum of the squares for the
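A minimal sketch of the cumulative event-related power measure described above, applied to a synthetic damped oscillation standing in for an evoked field:

```python
import numpy as np

def cumulative_power(erf):
    """Cumulative event-related power: running sum of squared amplitudes.

    erf: (n_samples,) event-related field time-locked to stimulus onset.
    Returns, for each time point t, the sum of squared samples from
    onset up to t.
    """
    return np.cumsum(erf ** 2)

# toy example: a damped 10-Hz oscillation as a stand-in for an evoked response
t = np.linspace(0, 1, 500)
erf = np.exp(-3 * t) * np.sin(2 * np.pi * 10 * t)
cum = cumulative_power(erf)
print(np.all(np.diff(cum) >= 0))  # True: cumulative power never decreases
```

Because squared samples are non-negative, the measure is monotonically non-decreasing, so differences between conditions accumulate over the course of the response.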
Cortical entrainment analyses
The mapping between stimulus and cortical activity was estimated using a system identification approach based on ridge regression. In particular, the procedure involved identifying a mapping from the source-space MEG signal to the speech envelope that optimized the following linear model:

ŝ(t, loc) = Σᵢ g(τᵢ, loc) x(t + τᵢ, loc)

where ŝ(t, loc) is the estimated speech envelope using the MEG signal from a location of interest loc, x(t + τᵢ, loc) is the MEG response at time lag τᵢ and location loc, and
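A minimal sketch of a ridge-regression decoder of this kind, with toy data standing in for the source-space MEG signal; the lag range, regularization parameter, and response delay are illustrative assumptions:

```python
import numpy as np

def lagged_design(x, lags):
    """Design matrix with X[t, j] = x[t + lags[j]] (zero-padded at the edges)."""
    n = len(x)
    X = np.zeros((n, len(lags)))
    for j, lag in enumerate(lags):
        if lag >= 0:
            X[: n - lag, j] = x[lag:]
        else:
            X[-lag:, j] = x[: n + lag]
    return X

def ridge_decoder(x, env, lags, lam=1.0):
    """Fit decoder weights g mapping lagged neural responses to the envelope."""
    X = lagged_design(x, lags)
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ env)

# toy example: the "neural" signal is a delayed noisy copy of the envelope
rng = np.random.default_rng(2)
env = rng.standard_normal(2000)
x = np.roll(env, 5) + 0.1 * rng.standard_normal(2000)  # 5-sample response delay
lags = np.arange(0, 11)          # decoder integrates post-stimulus samples
g = ridge_decoder(x, env, lags, lam=1e-3)
print(lags[np.argmax(np.abs(g))])  # 5: recovers the imposed response delay
```

The decoder's reconstruction accuracy (e.g., the correlation between ŝ and the true envelope on held-out data) is the kind of entrainment measure compared across conditions.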
Network effective connectivity analysis
Brain connectivity measures are used to infer neuronal spatiotemporal interactions which index and predict task-relevant changes in cognitive states and behavior. Whilst methods such as dynamic causal modeling (DCM) require a set of possible hypotheses for the neurobiological system of interest (Stephan et al., 2007), there exist approaches that do not impose such a constraint and that rely on data-driven analyses (Granger, 1969, Ding et al., 2006). Here, an exploratory dynamical framework for
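The study's framework builds on multivariate autoregressive modeling (dDTF); the underlying idea (whether the past of one source improves prediction of another) can be sketched with plain bivariate Granger causality, a deliberate simplification of the actual method:

```python
import numpy as np

def ar_residual_var(target, predictors, order):
    """Least-squares AR fit; returns the residual variance of the target."""
    n = len(target)
    X = np.column_stack([p[order - k - 1 : n - k - 1]
                         for p in predictors for k in range(order)])
    y = target[order:]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.var(y - X @ beta)

def granger(x, y, order=2):
    """Log-ratio Granger causality x -> y (positive: x's past predicts y)."""
    restricted = ar_residual_var(y, [y], order)      # y's own past only
    full = ar_residual_var(y, [y, x], order)         # plus the past of x
    return np.log(restricted / full)

# toy example: y is driven by the past of x, but not vice versa
rng = np.random.default_rng(3)
x = rng.standard_normal(5000)
y = np.zeros(5000)
for t in range(1, 5000):
    y[t] = 0.8 * x[t - 1] + 0.3 * rng.standard_normal()
print(granger(x, y) > granger(y, x))  # True: influence flows x -> y
```

dDTF extends this idea to the frequency domain and to full multichannel models, distinguishing direct from mediated influences.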
Statistical analysis
All statistical analyses were conducted using Wilcoxon signed rank tests (paired if possible), except where otherwise stated. All numerical values are reported as mean ± SD. In the cortical power and entrainment analyses, permutation-based cluster-size statistics (Groppe et al., 2011, Maris, 2012) with 1,000 repetitions were used to correct for multiple comparisons while accounting for the fact that results at neighboring time points or time windows are not independent. The primary
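The permutation-based cluster-size correction can be sketched as follows; this toy version uses a t statistic as the cluster-forming threshold rather than the Wilcoxon statistic used in the study:

```python
import numpy as np

def max_cluster_size(diff, thresh):
    """Largest run of consecutive time points whose |t| exceeds thresh."""
    t = diff.mean(0) / (diff.std(0, ddof=1) / np.sqrt(diff.shape[0]))
    best, run = 0, 0
    for sig in np.abs(t) > thresh:
        run = run + 1 if sig else 0
        best = max(best, run)
    return best

def cluster_perm_test(diff, thresh=2.0, n_perm=1000, seed=0):
    """Sign-flip permutation null distribution for the maximum cluster size.

    diff: (n_subjects, n_times) paired condition differences.
    """
    rng = np.random.default_rng(seed)
    observed = max_cluster_size(diff, thresh)
    null = [max_cluster_size(diff * rng.choice([-1, 1], size=(diff.shape[0], 1)),
                             thresh)
            for _ in range(n_perm)]
    p = np.mean([m >= observed for m in null])
    return observed, p

# toy example: 12 subjects x 100 time points, true effect in samples 40-60
rng = np.random.default_rng(4)
diff = rng.standard_normal((12, 100))
diff[:, 40:60] += 2.5
obs, p = cluster_perm_test(diff)
print(obs, p)
```

Comparing the observed maximum cluster size against this null distribution controls the family-wise error rate across the correlated time points.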
Behavioral intelligibility ratings
The responses made during the probe trials were analyzed to confirm that the Pop-out sentences were perceived as more intelligible in block 2, i.e., after exposure to the unprocessed speech. The low intelligibility ratings for the Pop-outpre (mean = 15.7%, SD = 34%), Unintelligiblepre (mean = 11.1%, SD = 17.3%), and Unintelligiblepost (mean = 17.8%, SD = 28.9%) sentences indicate that they were perceived as unintelligible. The intelligibility ratings for Pop-outpost (mean = 93.5%, SD = 15.2%)
Discussion
The cortical mechanisms underpinning the integration of prior knowledge with sensory input during continuous speech comprehension are poorly understood. Here, we demonstrated that non-invasive MEG measures are sensitive to the predictive effects of prior knowledge on perceived speech intelligibility. Furthermore, we provided insight into the cortical spatiotemporal dynamics that underlie this process, and the implications for current views of the cortical underpinnings of speech comprehension.
Acknowledgements
This study was supported by an Irish Research Council Government of Ireland Postgraduate Scholarship (GOIPG/2013/1249, 2013-2017) and by a travel grant from the Guarantors of Brain (UK registered charity). The authors thank Bahman Nasseroleslami for useful discussions on the connectivity analysis approach.
References (84)
- et al. Neurobiological roots of language in primate audition: common computational properties. Trends Cogn. Sci. (2015)
- et al. Hearing speech sounds: top-down influences on the interface between audition and speech perception. Hear. Res. (2007)
- et al. Low-frequency cortical entrainment to speech reflects phoneme-level processing. Curr. Biol. (2015)
- et al. Robust cortical entrainment to the speech envelope relies on the spectro-temporal fine structure. Neuroimage (2014)
- et al. Articulatory mediation of speech perception: a causal analysis of multi-modal imaging data. Cognition (2009)
- et al. Auditory cortical delta-entrainment interacts with oscillatory power in multiple fronto-parietal networks. NeuroImage (2017)
- et al. Determination of information flow direction among brain structures by a modified directed transfer function (dDTF) method. J. Neurosci. Methods (2003)
- et al. Automatic alignment of EEG/MEG and MRI data sets. Clin. Neurophysiol. (2001)
- et al. A predictive coding framework for rapid neural dynamics during sentence-level language comprehension. Cortex (2015)
- et al. Representations of the temporal envelope of sounds in human auditory cortex: can the results from invasive intracortical “depth” electrode recordings be replicated using non-invasive MEG “virtual electrodes”? NeuroImage (2013)
- Frontal top-down signals increase coupling of auditory low-frequency oscillations to continuous speech in human listeners. Curr. Biol.
- The analysis of speech in different temporal integration windows: cerebral lateralization as ‘asymmetric sampling in time’. Speech Commun.
- Probabilistic mapping and volume measurement of human primary auditory cortex. NeuroImage
- The neuroanatomical and functional organization of speech perception. Trends Neurosci.
- When sentences live up to your expectations. NeuroImage
- Human auditory cortex is sensitive to the perceived clarity of speech. NeuroImage
- Task-dependent modulation of regions in the left temporal cortex during auditory sentence comprehension. Neurosci. Lett.
- EEG oscillations entrain their phase to high-level features of speech sound. NeuroImage
- Speech comprehension is correlated with temporal response patterns recorded from auditory cortex. Proc. Natl. Acad. Sci.
- Human cortical responses to the speech envelope. Ear Hear.
- Delta-beta coupled oscillations underlie temporal prediction accuracy. Cereb. Cortex
- Transitions in neural oscillations reflect prediction errors generated in audiovisual speech. Nat. Neurosci.
- Prediction errors but not sharpened signals simulate multivoxel fMRI patterns during speech perception. PLoS Biol.
- Review of the methods of determination of directed connectivity from multichannel data. Med. Biol. Eng. Comput.
- Categorical speech representation in human superior temporal gyrus. Nat. Neurosci.
- Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behav. Brain Sci.
- Eye can hear clearly now: inverse effectiveness in natural audiovisual speech processing relies on long-term crossmodal temporal integration. J. Neurosci.
- The multivariate temporal response function (mTRF) toolbox: a MATLAB toolbox for relating neural signals to continuous stimuli. Front. Hum. Neurosci.
- Hierarchical processing in spoken language comprehension. J. Neurosci.
- Lexical information drives perceptual learning of distorted speech: evidence from the comprehension of noise-vocoded sentences. J. Exp. Psychol. Gen.
- EEGLAB, SIFT, NFT, BCILAB, and ERICA: new tools for advanced EEG processing. Comput. Intell. Neurosci.
- Phoneme and word recognition in the auditory ventral stream. Proc. Natl. Acad. Sci. U.S.A.
- Granger causality: basic theory and application to neuroscience. In: Handbook of Time Series Analysis: Recent Theoretical Developments and Applications
- Short-window spectral analysis of cortical event-related potentials by adaptive multivariate autoregressive modeling: data preprocessing, model validation, and variability assessment. Biol. Cybern.
- Adaptive temporal encoding leads to a background-insensitive cortical representation of speech. J. Neurosci.
- Cortical entrainment to continuous speech: functional roles and interpretations. Front. Hum. Neurosci.
- Remaking speech. J. Acoust. Soc. Am.
- 3D statistical neuroanatomical models from 305 MRI volumes
- The contribution of frequency-specific activity to hierarchical information processing in the human auditory cortex. Nat. Commun.
- Lip-reading the BKB sentence lists: corrections for list and practice effects. Br. J. Audiol.