Robust cortical entrainment to the speech envelope relies on the spectro-temporal fine structure
Introduction
Normal-hearing listeners exhibit a remarkable ability to understand speech in noisy acoustic environments, even in the absence of visual cues. A number of studies have suggested that the target speech and the listening background are separated in auditory cortex (Ding and Simon, 2012a, Zion Golumbic et al., 2013, Horton et al., 2013, Kerlin et al., 2010, Mesgarani and Chang, 2012, Power et al., 2012). In particular, when a listener attends to a speech stream, auditory cortical activity is reliably entrained to the temporal envelope of that stream, regardless of the listening background. This reliable neural representation of the speech envelope, i.e. the slow temporal modulations below 16 Hz, is a key candidate mechanism underlying robust speech recognition, since temporal envelopes carry important cues for speech recognition (Shannon et al., 1995). It remains unclear, however, how such reliable cortical entrainment to the speech envelope is achieved, since the envelope alone is not an effective cue for segregating speech from noise (Friesen et al., 2001).
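The broadband envelope referred to above (slow modulations below 16 Hz) can be computed in several ways; the sketch below is one common numpy-only approach (FFT-based Hilbert transform followed by a brick-wall low-pass), given purely for illustration and not drawn from the paper's methods. The function name `broadband_envelope` is hypothetical.

```python
import numpy as np

def broadband_envelope(x, fs, cutoff_hz=16.0):
    """Magnitude of the analytic signal, low-passed below cutoff_hz (sketch)."""
    n = len(x)
    # FFT-based Hilbert transform: build the analytic-signal spectrum weights
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    env = np.abs(np.fft.ifft(np.fft.fft(x) * h))
    # brick-wall low-pass: zero all spectral bins above cutoff_hz
    E = np.fft.rfft(env)
    E[np.fft.rfftfreq(n, 1.0 / fs) > cutoff_hz] = 0.0
    return np.fft.irfft(E, n)
```

Applied to an amplitude-modulated tone, the output closely tracks the modulator; real pipelines would typically use a proper FIR/IIR low-pass rather than a brick-wall spectral mask.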
Moreover, even the nature of cortical entrainment to the speech envelope is heavily debated, especially regarding whether it encodes the temporal envelope per se or instead other speech features that are correlated with the speech envelope (Obleser et al., 2012, Peelle et al., 2013). Many speech features, including pitch and spatial cues, are temporally coherent and correlated with the temporal envelope (Shamma et al., 2011). It has therefore been proposed that envelope entrainment in fact reflects a collective neural representation of multiple speech features synchronized to the syllabic and phrasal rhythm of speech (Ding and Simon, 2012a). Because of its collective nature, this representation has been suggested to encode speech as a whole auditory object.
If envelope entrainment indeed reflects an object-level, collective representation of speech features, reliable envelope entrainment in complex auditory scenes is likely to involve an analysis-by-synthesis process (Poeppel et al., 2008, Shamma et al., 2011, Shinn-Cunningham, 2008): In such a process, multiple features of a complex auditory scene are extracted subcortically in the analysis phase and then, based on speech segregation cues such as pitch, features belonging to the same speech stream are grouped into an auditory object in the synthesis phase. In contrast, if envelope entrainment involves only direct neural processing of the envelope, its robustness to noise may arise from more basic processes such as contrast gain control (Ding and Simon, 2013, Rabinowitz et al., 2011).
In this study, we investigate whether noise-robust cortical entrainment to the speech envelope involves merely envelope processing or instead reflects an analysis-by-synthesis process that includes the processing of spectro-temporal fine structure and reflects envelope properties of the re-synthesized auditory object. Here, the spectro-temporal fine structure refers to the acoustic information not included in the broadband envelope of speech (< 16 Hz), including, for example, the acoustic cues responsible for the pitch and formant structure of speech. We degrade the spectro-temporal fine structure of speech or speech–noise mixtures using noise vocoders and use MEG to investigate whether the vocoded stimuli are cortically represented differently from natural speech. If cortical entrainment depends only on the temporal envelope, it will not be affected by degradation of the spectro-temporal fine structure, even in a noisy listening environment. In contrast, if reliable cortical entrainment to speech requires an analysis-by-synthesis process that relies on the spectro-temporal fine structure, it should be severely degraded for vocoded speech.
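A noise vocoder of the kind described here divides the signal into frequency bands, discards each band's fine structure, and retains only the band envelopes, which then modulate noise carriers. A minimal numpy-only sketch is below; the band edges, brick-wall FFT filtering, rectify-and-low-pass envelope, and the function names (`noise_vocode`, `_band`, `_env`) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def _band(sig_fft, mask, n):
    """Inverse-transform one brick-wall frequency band (illustrative filter)."""
    return np.fft.irfft(np.where(mask, sig_fft, 0), n)

def _env(x, fs, cutoff_hz=16.0):
    """Band envelope: full-wave rectification + brick-wall low-pass (sketch)."""
    E = np.fft.rfft(np.abs(x))
    E[np.fft.rfftfreq(len(x), 1.0 / fs) > cutoff_hz] = 0.0
    return np.maximum(np.fft.irfft(E, len(x)), 0.0)

def noise_vocode(x, fs, n_bands=4, fmin=80.0, fmax=None, seed=0):
    """Replace each band's fine structure with noise, keeping band envelopes."""
    n = len(x)
    fmax = 0.9 * fs / 2 if fmax is None else fmax
    edges = np.geomspace(fmin, fmax, n_bands + 1)   # log-spaced band edges
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    X = np.fft.rfft(x)
    N = np.fft.rfft(np.random.default_rng(seed).standard_normal(n))
    out = np.zeros(n)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (freqs >= lo) & (freqs < hi)
        env = _env(_band(X, mask, n), fs)   # speech envelope of this band
        out += env * _band(N, mask, n)      # modulate a band-limited noise carrier
    # match the overall RMS of the input
    return out * np.sqrt(np.mean(x ** 2) / (np.mean(out ** 2) + 1e-12))
```

Fewer bands (e.g. 4 vs. 8) preserve coarser spectral detail, which is the manipulation the study uses to degrade spectro-temporal fine structure while leaving the broadband envelope largely intact.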
Subjects
Twelve normal-hearing, right-handed (Oldfield, 1971) young adults (6 females), aged between 19 and 32 years (mean 23 years), participated in the experiment. Subjects were paid, and the experimental procedures were approved by the University of Maryland institutional review board. Written informed consent was obtained before the experiment.
Stimuli
The stimuli were selected from a narration of the story Alice's Adventures in Wonderland (Chapter One, //librivox.org/alices-adventures-in-wonderland-by-lewis-carroll-4/
Procedure
The stimuli were presented in two orders, each to half of the subjects. In either order, the story continued naturally between stimuli and was repeated twice after the first presentation (3 trials in total). In the progressive order, the first two speech segments were natural speech presented in quiet, followed by 8-band vocoded speech in quiet and then 4-band vocoded speech in quiet. Then, natural speech in noise, 8-band vocoded speech in noise, and 4-band vocoded speech in noise were
Results
MEG responses were recorded from subjects listening to a narrated story presented either in quiet or in spectrally matched stationary noise (3 dB SNR). The speech stimuli were presented either without additional processing, referred to as natural speech, or after being processed by a noise vocoder (4-band or 8-band), referred to as vocoded speech. Noise vocoding reduces the spectral resolution of speech, as is demonstrated by the auditory spectrograms of the stimuli (Fig. 1). The temporal
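The speech–noise mixtures described above used spectrally matched stationary noise at 3 dB SNR. One standard way to construct such noise is to randomize the phases of the speech spectrum while keeping its magnitude, then scale it to the target SNR; the sketch below assumes this phase-randomization approach (the function names `spectrally_matched_noise` and `mix_at_snr` are hypothetical, not from the paper).

```python
import numpy as np

def spectrally_matched_noise(x, seed=0):
    """Stationary noise with the same long-term magnitude spectrum as x."""
    X = np.fft.rfft(x)
    phases = np.random.default_rng(seed).uniform(0, 2 * np.pi, X.shape[0])
    phases[0] = 0.0              # keep the DC bin real
    if len(x) % 2 == 0:
        phases[-1] = 0.0         # keep the Nyquist bin real
    return np.fft.irfft(np.abs(X) * np.exp(1j * phases), len(x))

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so 10*log10(P_speech / P_noise) == snr_db, then add."""
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2)
    scaled = noise * np.sqrt(p_s / (p_n * 10 ** (snr_db / 10.0)))
    return speech + scaled, scaled
```

At 3 dB SNR the speech power is twice the noise power, so the masker is clearly audible yet natural speech remains intelligible.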
Discussion
This study demonstrates that although the cortical entrainment to natural speech is robust to noise, the cortical entrainment to vocoded speech is not. This phenomenon cannot be explained by passive envelope tracking mechanisms, since noise vocoding does not directly affect the stimulus envelope to which cortical activity is entrained. Instead, the results illustrate that the spectro-temporal fine structure, which is degraded for noise-vocoded speech, is critical to segregating speech from
Acknowledgment
This work was supported by NIH grants R01 DC 008342 (J.Z.S.) and R01 DC 004786 (M.C.).
References (40)
- et al. (2007). Denoising based on time-shift PCA. J. Neurosci. Methods.
- et al. (2008). Denoising based on spatial filtering. J. Neurosci. Methods.
- et al. (1990). Derivation of auditory filter shapes from notched-noise data. Hear. Res.
- et al. (2007). Phase patterns of neuronal responses reliably discriminate speech in human auditory cortex. Neuron.
- (2008). Processing of complex sounds in the auditory system. Curr. Opin. Neurobiol.
- (1971). The assessment and analysis of handedness: the Edinburgh inventory. Neuropsychologia.
- et al. (2011). Contrast gain control in auditory cortex. Neuron.
- et al. (2009). Low-frequency neuronal oscillations as instruments of sensory selection. Trends Neurosci.
- et al. (2008). Neuronal oscillations and visual amplification of speech. Trends Cogn. Sci.
- et al. (2011). Temporal coherence and attention in auditory scene analysis. Trends Neurosci.
- Object-based auditory and visual attention. Trends Cogn. Sci.
- Vision as Bayesian inference: analysis by synthesis? Trends Cogn. Sci.
- Mechanisms underlying selective neuronal tracking of attended speech at a "cocktail party". Neuron.
- Auditory M50 and M100 responses to broadband noise: functional implications. Neuroreport.
- Estimating sparse spectro-temporal receptive fields with natural stimuli. Netw. Comput. Neural Syst.
- Neural population coding of sound level adapts to stimulus statistics. Nat. Neurosci.
- Emergence of neural encoding of auditory objects while listening to competing speakers. Proc. Natl. Acad. Sci. U. S. A.
- Neural coding of continuous speech in auditory cortex during monaural and dichotic listening. J. Neurophysiol.
- Adaptive temporal encoding leads to a background-insensitive cortical representation of speech. J. Neurosci.
- Speech recognition in noise as a function of the number of spectral channels: comparison of acoustic hearing and cochlear implants. J. Acoust. Soc. Am.