
NeuroImage

Volume 88, March 2014, Pages 41-46

Robust cortical entrainment to the speech envelope relies on the spectro-temporal fine structure

https://doi.org/10.1016/j.neuroimage.2013.10.054

Highlights

  • Cortical entrainment to vocoded speech is sensitive to background noise.

  • Robust cortical entrainment to speech relies on the spectro-temporal fine structure.

  • Delta-band entrainment predicts individual speech recognition score.

Abstract

Speech recognition is robust to background noise. One underlying neural mechanism is that the auditory system segregates speech from the listening background and encodes it reliably. Such robust internal representation has been demonstrated in auditory cortex by neural activity entrained to the temporal envelope of speech. A paradox, however, then arises, as the spectro-temporal fine structure rather than the temporal envelope is known to be the major cue to segregate target speech from background noise. Does the reliable cortical entrainment in fact reflect a robust internal “synthesis” of the attended speech stream rather than direct tracking of the acoustic envelope? Here, we test this hypothesis by degrading the spectro-temporal fine structure while preserving the temporal envelope using vocoders. Magnetoencephalography (MEG) recordings reveal that cortical entrainment to vocoded speech is severely degraded by background noise, in contrast to the robust entrainment to natural speech. Furthermore, cortical entrainment in the delta-band (1–4 Hz) predicts the speech recognition score at the level of individual listeners. These results demonstrate that reliable cortical entrainment to speech relies on the spectro-temporal fine structure, and suggest that cortical entrainment to the speech envelope is not merely a representation of the speech envelope but a coherent representation of multiscale spectro-temporal features that are synchronized to the syllabic and phrasal rhythms of speech.

Introduction

Normal-hearing listeners exhibit a surprising ability to understand speech in noisy acoustic environments, even in the absence of visual cues. A number of studies have suggested that the target speech and the listening background are separated in auditory cortex (Ding and Simon, 2012a, Zion Golumbic et al., 2013, Horton et al., 2013, Kerlin et al., 2010, Mesgarani and Chang, 2012, Power et al., 2012). In particular, when a listener attends to a speech stream, auditory cortical activity is reliably entrained to the temporal envelope of that stream, regardless of the listening background. This reliable neural representation of the speech envelope, i.e. the slow temporal modulations below 16 Hz, is a key candidate mechanism underlying the robust recognition of speech, since the temporal envelope carries important cues for speech recognition (Shannon et al., 1995). It remains mysterious, however, how such reliable cortical entrainment to the speech envelope is achieved, since the envelope is not an effective cue for segregating speech from noise (Friesen et al., 2001).
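For concreteness, the broadband temporal envelope referred to here can be obtained, in a minimal sketch, as the magnitude of the Hilbert analytic signal low-pass filtered at the 16 Hz cutoff mentioned above. The input file name and filter order are illustrative assumptions, not the paper's exact processing chain.

```python
# Minimal sketch: broadband temporal envelope of speech (< 16 Hz).
# "speech.wav" and the 4th-order Butterworth filter are illustrative
# assumptions, not the paper's exact pipeline.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, filtfilt, hilbert

fs, speech = wavfile.read("speech.wav")           # hypothetical mono recording
speech = speech.astype(np.float64)

envelope = np.abs(hilbert(speech))                # instantaneous amplitude
b, a = butter(4, 16.0 / (fs / 2), btype="low")    # 16 Hz low-pass (cutoff from the text)
envelope = filtfilt(b, a, envelope)               # zero-phase smoothing
```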

Moreover, even the nature of cortical entrainment to the speech envelope is heavily debated, especially whether it encodes the temporal envelope per se or instead other speech features that are correlated with the speech envelope (Obleser et al., 2012, Peelle et al., 2013). Many speech features, including pitch and spatial cues, are temporally coherent and correlated with the temporal envelope (Shamma et al., 2011). Therefore, it has been proposed that envelope entrainment in fact reflects a collective neural representation of multiple speech features that are synchronized to the syllabic and phrasal rhythms of speech (Ding and Simon, 2012a). Because of its collective nature, this representation has been suggested to encode speech as a whole auditory object.

If envelope entrainment indeed reflects an object-level, collective representation of speech features, reliable envelope entrainment in complex auditory scenes is likely to involve an analysis-by-synthesis process (Poeppel et al., 2008, Shamma et al., 2011, Shinn-Cunningham, 2008): In such a process, multiple features of a complex auditory scene are extracted subcortically in the analysis phase and then, based on speech segregation cues such as pitch, features belonging to the same speech stream are grouped into an auditory object in the synthesis phase. In contrast, if envelope entrainment involves only direct neural processing of the envelope, its robustness to noise may arise from more basic processes such as contrast gain control (Ding and Simon, 2013, Rabinowitz et al., 2011).

In this study, we investigate whether noise-robust cortical entrainment to the speech envelope involves merely envelope processing or instead reflects an analysis-by-synthesis process that includes the processing of spectro-temporal fine structure and reflects envelope properties of the re-synthesized auditory object. Here, the spectro-temporal fine structure refers to the acoustic information not included in the broadband envelope of speech (< 16 Hz), including, for example, the acoustic cues responsible for the pitch and formant structure of speech. We degrade the spectro-temporal fine structure of speech or speech–noise mixtures using noise vocoders and investigate whether vocoded stimuli are cortically represented differently from natural speech using MEG. If cortical entrainment only depends on the temporal envelope, it will not be affected by degradation of the spectro-temporal fine structure, even in a noisy listening environment. In contrast, if reliable cortical entrainment to speech requires an analysis-by-synthesis process that relies on the spectro-temporal fine structure, it should be severely degraded for vocoded speech.
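As an illustration of this manipulation, a noise vocoder can be sketched as follows: split the signal into a small number of frequency bands, keep each band's slow envelope, and use it to modulate band-limited noise. The band edges, filter orders, and envelope cutoff below are assumptions for illustration; the paper's exact vocoder parameters are given in its Methods.

```python
# Minimal noise-vocoder sketch. Band edges, filter orders, and the
# 16 Hz envelope cutoff are illustrative assumptions, not the paper's
# exact parameters. Requires fs > 2 * f_hi.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(x, fs, n_bands=4, f_lo=80.0, f_hi=7800.0, env_cut=16.0):
    """Replace each band's spectro-temporal fine structure with
    band-limited noise while preserving its slow envelope."""
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)         # log-spaced bands
    sos_env = butter(4, env_cut / (fs / 2), btype="lowpass", output="sos")
    rng = np.random.default_rng(0)
    out = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo / (fs / 2), hi / (fs / 2)],
                     btype="bandpass", output="sos")
        band = sosfiltfilt(sos, x)
        env = sosfiltfilt(sos_env, np.abs(hilbert(band)))  # slow band envelope
        carrier = sosfiltfilt(sos, rng.standard_normal(len(x)))
        out += np.clip(env, 0.0, None) * carrier           # modulate noise carrier
    return out
```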

Section snippets

Subjects

Twelve normal-hearing, right-handed (Oldfield, 1971) young adults (6 females), aged between 19 and 32 years (mean 23 years), participated in the experiment. Subjects were paid, and the experimental procedures were approved by the University of Maryland institutional review board. Written informed consent was obtained from each subject before the experiment.

Stimuli

The stimuli were selected from a narration of the story Alice's Adventures in Wonderland (Chapter One, https://librivox.org/alices-adventures-in-wonderland-by-lewis-carroll-4/) …

Procedure

The stimuli were presented in two orders, each to half of the subjects. In either order, the story continued naturally between stimuli and was repeated twice after the first presentation (3 trials in total). In the progressive order, the first two speech segments were natural speech presented in quiet, followed by 8-band vocoded speech in quiet and then 4-band vocoded speech in quiet. Then, natural speech in noise, 8-band vocoded speech in noise, and 4-band vocoded speech in noise were …

Results

MEG responses were recorded from subjects listening to a narrated story presented either in quiet or in spectrally matched stationary noise (3 dB SNR). The speech stimuli were presented either without additional processing, referred to as natural speech, or after being processed by a noise vocoder (4-band or 8-band), referred to as vocoded speech. Noise vocoding reduces the spectral resolution of speech, as is demonstrated by the auditory spectrograms of the stimuli (Fig. 1). The temporal …
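For illustration, spectrally matched stationary noise can be constructed by randomizing the phase of the speech's long-term spectrum and then scaling the noise to the 3 dB SNR used here. This is one standard construction, assumed for the sketch below, and not necessarily the paper's exact procedure.

```python
# Sketch: spectrally matched stationary noise via spectral phase
# randomization, mixed with speech at a target SNR (3 dB in the study).
# An assumed standard construction, for illustration only.
import numpy as np

def spectrally_matched_noise(x, rng=None):
    """Noise with the same long-term magnitude spectrum as x."""
    if rng is None:
        rng = np.random.default_rng(0)
    X = np.fft.rfft(x)
    phase = rng.uniform(0.0, 2.0 * np.pi, len(X))
    return np.fft.irfft(np.abs(X) * np.exp(1j * phase), n=len(x))

def mix_at_snr(speech, noise, snr_db=3.0):
    """Scale the noise so the mixture has the requested SNR."""
    gain = np.sqrt(np.mean(speech ** 2) / np.mean(noise ** 2)
                   / 10.0 ** (snr_db / 10.0))
    return speech + gain * noise
```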

Discussion

This study demonstrates that although the cortical entrainment to natural speech is robust to noise, the cortical entrainment to vocoded speech is not. This phenomenon cannot be explained by passive envelope-tracking mechanisms, since noise vocoding does not directly affect the stimulus envelope to which the cortical activity is entrained. Instead, the results illustrate that the spectro-temporal fine structure, which is degraded for noise-vocoded speech, is critical to segregating speech from …
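As a simplified illustration of how envelope entrainment can be quantified (a stand-in for the paper's MEG analysis, which is described in its Methods), one can bandpass both the neural response and the stimulus envelope to the delta band (1–4 Hz) and compute their correlation:

```python
# Simplified entrainment index: correlation between the delta-band
# (1–4 Hz) MEG response and the delta-band stimulus envelope. A
# stand-in for the paper's analysis, for illustration only.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def delta_entrainment(meg, envelope, fs, band=(1.0, 4.0)):
    sos = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)],
                 btype="bandpass", output="sos")
    r = sosfiltfilt(sos, meg)
    e = sosfiltfilt(sos, envelope)
    r = (r - r.mean()) / r.std()
    e = (e - e.mean()) / e.std()
    return float(np.mean(r * e))   # Pearson correlation, in [-1, 1]
```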

Acknowledgment

This work was supported by NIH grants R01 DC008342 (to J.Z.S.) and R01 DC004786 (to M.C.).

References (40)

  • B.G. Shinn-Cunningham

    Object-based auditory and visual attention

    Trends Cogn. Sci.

    (2008)
  • A. Yuille et al.

    Vision as Bayesian inference: analysis by synthesis?

    Trends Cogn. Sci.

    (2006)
  • E.M. Zion Golumbic et al.

    Mechanisms underlying selective neuronal tracking of attended speech at a “cocktail party”

    Neuron

    (2013)
  • M. Chait et al.

    Auditory M50 and M100 responses to broadband noise: functional implications

    Neuroreport

    (2004)
  • S.V. David et al.

    Estimating sparse spectro-temporal receptive fields with natural stimuli

    Netw. Comput. Neural Syst.

    (2007)
  • I. Dean et al.

Neural population coding of sound level adapts to stimulus statistics

    Nat. Neurosci.

    (2005)
  • N. Ding et al.

    Emergence of neural encoding of auditory objects while listening to competing speakers

    Proc. Natl. Acad. Sci. U. S. A.

    (2012)
  • N. Ding et al.

    Neural coding of continuous speech in auditory cortex during monaural and dichotic listening

    J. Neurophysiol.

    (2012)
  • N. Ding et al.

    Adaptive temporal encoding leads to a background-insensitive cortical representation of speech

    J. Neurosci.

    (2013)
  • L.M. Friesen et al.

    Speech recognition in noise as a function of the number of spectral channels: comparison of acoustic hearing and cochlear implants

    J. Acoust. Soc. Am.

    (2001)