Hearing Research

Volume 305, November 2013, Pages 57-73

Review
Representation of speech in human auditory cortex: Is it special?

https://doi.org/10.1016/j.heares.2013.05.013

Highlights

  • Comparable neural responses to speech in monkey A1 and Heschl's gyrus.

  • Entrainment to the amplitude envelope is neither specific to humans nor to speech.

  • VOT is represented by responses time-locked to consonant release and voicing onset.

  • Fundamental frequency of male speakers is represented by phase-locked responses.

  • Place of articulation encoding is based on the frequency selectivity of neurons.

Abstract

Successful categorization of phonemes in speech requires that the brain analyze the acoustic signal along both spectral and temporal dimensions. Neural encoding of the stimulus amplitude envelope is critical for parsing the speech stream into syllabic units. Encoding of voice onset time (VOT) and place of articulation (POA), cues necessary for determining phonemic identity, occurs within shorter time frames. An unresolved question is whether the neural representation of speech is based on processing mechanisms that are unique to humans and shaped by learning and experience, or is based on rules governing general auditory processing that are also present in non-human animals. This question was examined by comparing the neural activity elicited by speech and other complex vocalizations in primary auditory cortex of macaques, who are limited vocal learners, with that in Heschl's gyrus, the putative location of primary auditory cortex in humans. Entrainment to the amplitude envelope is neither specific to humans nor to human speech. VOT is represented by responses time-locked to consonant release and voicing onset in both humans and monkeys. Temporal representation of VOT is observed both for isolated syllables and for syllables embedded in the more naturalistic context of running speech. The fundamental frequency of male speakers is represented by more rapid neural activity phase-locked to the glottal pulsation rate in both humans and monkeys. In both species, the differential representation of stop consonants varying in their POA can be predicted by the relationship between the frequency selectivity of neurons and the onset spectra of the speech sounds. These findings indicate that the neurophysiology of primary auditory cortex is similar in monkeys and humans despite their vastly different experience with human speech, and that Heschl's gyrus is engaged in general auditory, and not language-specific, processing.

This article is part of a Special Issue entitled “Communication Sounds and the Brain: New Directions and Perspectives”.

Introduction

The ease with which speech is perceived underscores the refined operations of a neural network capable of rapidly decoding complex acoustic signals and categorizing them into meaningful phonemic sequences. A number of models have been devised to explain how phonemes are extracted from the continuous stream of speech (e.g., McClelland and Elman, 1986, Church, 1987, Pisoni and Luce, 1987, Stevens, 2002). Common to all these models is the recognition that phonemic perception is a categorization task based on sound profiles derived from a multidimensional space encompassing numerous acoustic features unfolding over time (Holt and Lotto, 2010). Features are all characterized by acoustic parameters that vary along intensity, spectral, and temporal dimensions. Increased intensity, especially in the low to mid-frequency ranges, helps to distinguish vowels from consonants (McClelland and Elman, 1986, Stevens, 2002). Distinct spectral (formant) patterns during these periods of increased intensity promote accurate vowel identification (Hillenbrand et al., 1995).

The temporal dimension of phonemic categorization has received increased attention in recent years. An influential proposal posits that speech perception occurs over several overlapping time scales (e.g., Poeppel et al., 2008, Poeppel et al., 2012, Giraud and Poeppel, 2012). Syllabic analyses occur within a time frame of about 150–300 ms and correlate with the amplitude envelope of speech. Speech comprehension remains high even when sentence fragments are time-reversed in 50 ms bins, and only becomes severely degraded when time-reversals occur at frequencies overlapping those of the speech envelope (Saberi and Perrott, 1999). Furthermore, temporal smearing of the speech envelope significantly degrades sentence intelligibility only at modulation frequencies commensurate with those of the speech envelope (Drullman et al., 1994).
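The local time-reversal manipulation used by Saberi and Perrott (1999) can be sketched in a few lines. The following is a minimal illustration (the function name and parameters are hypothetical; any trailing partial bin is left untouched):

```python
import numpy as np

def locally_reverse(signal, fs, bin_ms):
    """Time-reverse a waveform within consecutive fixed-length bins.

    With bin_ms around 50, fine temporal structure is scrambled while
    the slower amplitude envelope is largely preserved.
    """
    n = int(round(fs * bin_ms / 1000.0))  # samples per bin
    out = np.array(signal, dtype=float, copy=True)
    for start in range(0, len(out) - n + 1, n):
        out[start:start + n] = out[start:start + n][::-1]
    return out
```

Applying the function twice restores the original waveform, since reversal within each bin is its own inverse.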

More refined acoustic feature analyses are performed within shorter temporal windows of integration that vary between about 20 and 80 ms. Segmentation of speech within this range is critical for phonetic feature encoding, especially for shorter-duration consonants. Times at which rapid temporal and spectral changes occur are informationally rich landmarks in the speech waveform (Stevens, 1981, Stevens, 2002). Both the spectra and the formant transition trajectories occurring at these landmarks are crucial for accurate identification of true consonants such as the stops (Kewley-Port, 1983, Walley and Carrell, 1983, Alexander and Kluender, 2009). Voice onset time (VOT), the time between consonant release and the onset of rhythmic vocal cord vibrations, is a classic example of the rapid temporal discontinuities that help to distinguish voiced consonants (e.g., /b/, /d/, and /g/) from their unvoiced counterparts (e.g., /p/, /t/, and /k/) (e.g., Lisker and Abramson, 1964, Faulkner and Rosen, 1999). Indeed, when semantic information is lacking, listeners of time-reversed speech have significant comprehension difficulties at the shorter temporal intervals required for phonetic feature encoding (Kiss et al., 2008).
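The VOT definition above can be made concrete with a toy example in which consonant release and voicing onset are idealized as threshold crossings of two envelopes. All signal parameters here are hypothetical, chosen only to mimic the timing of an unvoiced stop:

```python
import numpy as np

fs = 16000                        # sampling rate (Hz)
t = np.arange(int(0.2 * fs)) / fs

# Idealized envelopes: a 5 ms release burst beginning at 10 ms, and
# voicing energy beginning 40 ms later.
burst_env = ((t >= 0.010) & (t < 0.015)).astype(float)
voicing_env = (t >= 0.050).astype(float)

def onset_time(env, fs, thresh=0.5):
    """Time (s) of the first sample at or above thresh."""
    return np.flatnonzero(env >= thresh)[0] / fs

# VOT = voicing onset minus consonant release, in ms
vot_ms = (onset_time(voicing_env, fs) - onset_time(burst_env, fs)) * 1e3
```

Here `vot_ms` comes out to 40 ms, on the unvoiced side of the typical English voiced/unvoiced category boundary.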

Early stations in the human auditory system are exquisitely tuned to encode speech-related acoustic features. Population brainstem responses accurately represent the intensity, spectrum, and temporal envelope of speech sounds (Chandrasekaran et al., 2009, Anderson and Kraus, 2010). Magnetoencephalographic (MEG) responses reflect consonant place of articulation (POA) within 50 ms after sound onset (Tavabi et al., 2007), and within 100 ms, responses differentiate intelligible versus unintelligible speech (Obleser et al., 2006). Neural responses obtained from intracranial recordings in Heschl's gyrus (HG), the putative location of primary auditory cortex in humans (Hackett et al., 2001), demonstrate categorical-like changes to syllables that vary in their VOT in a manner that parallels perception (Steinschneider et al., 1999, Steinschneider et al., 2005). Spectrotemporal receptive fields derived from single unit activity in HG elicited by one portion of a movie soundtrack dialog can accurately predict response patterns elicited by a different portion of the same dialog (Bitterman et al., 2008). Finally, both MEG responses and responses obtained from invasive recordings within HG have shown that accurate tracking of the speech envelope degrades in parallel with the ability to perceive temporally compressed speech (Ahissar et al., 2001, Nourski et al., 2009; see also Peelle et al., 2013). These observations lend support to the conclusion that “acoustic–phonetic features of the speech signal such as voicing, spectral shape, formants or amplitude modulation are made accessible by the computations of the ascending auditory pathway and primary auditory cortex” (Obleser and Eisner, 2008, p. 16).

An important and unresolved question is whether the representation of acoustic features of speech in the brain is based on neural processing mechanisms that are unique to humans and shaped by learning and experience with an individual's native language. The role of experience in modifying auditory cortical physiology is prominently observed during early development. The mismatch negativity component of the event-related potential becomes restricted to native-language phonemic contrasts by 7½ months of age (Kuhl and Rivera-Gaxiola, 2008), and stronger native language-specific responses at this age predict enhanced language skills at two years of age. The emergence of new event-related potentials that parallel developmental milestones in speech processing provides an additional example of neural circuitry changes derived from language experience (Friederici, 2005). In adults, both the gray matter volume of primary auditory cortex and the amplitude of short-latency auditory evoked potentials generated there are larger in musicians than in musically naïve subjects (Schneider et al., 2002). Recordings from animal models that are complex vocal learners, such as songbirds, also demonstrate pronounced modifications in auditory forebrain processing of sound based on developmental exposure to species-specific vocalizations (e.g., Woolley, 2012). In sum, it remains unclear how "special" or unique human primary auditory cortex is within mammalian physiology with regard to decoding the building blocks of speech.

Here, we examine this question by comparing the neural activity elicited by speech in primary auditory cortex (A1) of macaque monkeys, who are limited vocal learners, with that in HG of humans, who are obviously expert vocal learners (Petkov and Jarvis, 2012). Neural activity from human primary auditory cortex was acquired during intracranial recordings in patients undergoing surgical evaluation for medically intractable epilepsy. Measures included averaged evoked potentials (AEPs) and event-related band power (ERBP) in the high gamma (70–150 Hz) frequency range. Comparable population recordings were performed in the macaques; measures included AEPs, the derived current source density (CSD), and multiunit activity (MUA). The focus of this report is on clarifying the neural representation of acoustic features of speech that vary along both temporal and spectral dimensions. Some of the results summarize previous studies from human and monkey primary auditory cortex; the remainder represent new data that extend those findings. If perceptually relevant features of speech are encoded similarly in humans and monkeys, then it is reasonable to conclude that human primary auditory cortex is not special.
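Conceptually, ERBP summarizes signal power within a frequency band over time. The following numpy-only sketch captures that idea with an ideal FFT band-pass; it is a simplified stand-in, not the actual analysis pipeline of these studies (which typically involves wavelet-based estimates and normalization to a pre-stimulus baseline):

```python
import numpy as np

def band_power(x, fs, f_lo=70.0, f_hi=150.0):
    """Mean squared amplitude of x after an ideal FFT band-pass.

    A simplified stand-in for high-gamma (70-150 Hz) band power;
    real ERBP pipelines use wavelets or filter banks and normalize
    to a pre-stimulus baseline.
    """
    n = len(x)
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    spec[(freqs < f_lo) | (freqs > f_hi)] = 0.0
    return float(np.mean(np.fft.irfft(spec, n) ** 2))
```

A 100 Hz component passes the band while a 20 Hz component is rejected, so recordings dominated by high-gamma activity yield much larger values than recordings dominated by low-frequency activity.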

Section snippets

Subjects

Results presented in this report represent neurophysiological data obtained from multiple male monkeys (Macaca fascicularis) that have been accumulated over many years. During this time, there have been gradual changes in methodology. The reader is referred to the cited publications for methodological details (i.e., Fig. 3, Steinschneider et al., 2003; six subjects; Fig. 8, Steinschneider and Fishman, 2011; four subjects). Methods described here refer to studies involving two monkey subjects

Monkey

Entrainment to the temporal envelope of vocalizations within auditory cortex is specific neither to humans nor to human speech. Fig. 1 demonstrates neural entrainment to the temporal envelope of three monkey vocalizations at a low BF location within A1. The left-hand graph in Fig. 1A depicts the FRF of this site based on responses to pure tones presented at 60 dB SPL. The BF of this site is approximately 400 Hz, with a secondary peak at the 200 Hz subharmonic. FRFs based on responses to tones
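Entrainment analyses like the one above presuppose an estimate of the stimulus amplitude envelope. A common minimal recipe is to full-wave rectify and then low-pass the waveform; in this sketch, the function name and the 30 Hz cutoff are illustrative choices, not the ones used in the study:

```python
import numpy as np

def amplitude_envelope(x, fs, cutoff=30.0):
    """Estimate the amplitude envelope: full-wave rectify, then
    remove components above `cutoff` with an ideal FFT low-pass."""
    rect = np.abs(np.asarray(x, dtype=float))
    spec = np.fft.rfft(rect)
    freqs = np.fft.rfftfreq(len(rect), 1.0 / fs)
    spec[freqs > cutoff] = 0.0
    return np.fft.irfft(spec, len(rect))
```

For a 200 Hz carrier modulated at 4 Hz, the recovered envelope tracks the 4 Hz modulator, which is the quantity that entrained neural responses are compared against.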

Summary and general conclusions

The key finding of this study is that neural representations of fundamental temporal and spectral features of speech by population responses in primary auditory cortex are remarkably similar in monkeys and humans, despite their vastly different experience with human language. Thus, it appears that plasticity-induced language learning does not significantly alter response patterns elicited by the acoustical properties of speech sounds in primary auditory cortex of humans as compared with

Contributors

All authors participated in the writing of this manuscript. Drs. Fishman and Steinschneider collected and analyzed data obtained from monkeys. Drs. Nourski and Steinschneider collected and analyzed data obtained from humans. All authors have approved the final article.

Conflict of interest

All authors state that they have no actual or potential conflict of interest.

Acknowledgments

The authors thank Drs. Joseph C. Arezzo, Charles E. Schroeder and David H. Reser, and Ms. Jeannie Hutagalung for their assistance in the monkey studies, and Drs. Hiroyuki Oya, Hiroto Kawasaki, Ariane Rhone, Christopher Kovach, John F. Brugge, and Matthew A. Howard for their assistance in the human studies. Primate studies supported by the NIH (DC-00657), and human studies supported by NIH (DC04290, UL1RR024979), Hearing Health Foundation, and the Hoover Fund.

References (143)

  • A.M. Liberman et al., On the relation of speech to language, Trends Cogn. Sci. (2000)
  • J.L. McClelland et al., The TRACE model of speech perception, Cognit. Psychol. (1986)
  • P. Müller-Preuss et al., Functional anatomy of the inferior colliculus and the auditory cortex: current source density analyses of click-evoked potentials, Hear. Res. (1984)
  • D.F. Newbury et al., Genetic advances in the study of speech and language disorders, Neuron (2010)
  • E. Oshurkova et al., Click train encoding in primary and non-primary auditory cortex of anesthetized macaque monkeys, Neuroscience (2008)
  • D.B. Pisoni et al., Acoustic–phonetic representations in word recognition, Cognition (1987)
  • E. Ahissar et al., Speech comprehension is correlated with temporal response patterns recorded from auditory cortex, Proc. Natl. Acad. Sci. U. S. A. (2001)
  • J.M. Alexander et al., Spectral tilt change in stop consonant perception, J. Acoust. Soc. Am. (2008)
  • J.M. Alexander et al., Spectral tilt change in stop consonant perception by listeners with hearing impairment, J. Speech Lang. Hear. Res. (2009)
  • S. Anderson et al., Objective neural indices of speech-in-noise perception, Trends Amplif. (2010)
  • D.A. Bendor et al., Dual-pitch processing mechanisms in primate auditory cortex, J. Neurosci. (2012)
  • D.A. Bendor et al., Differential neural coding of acoustic flutter within primate auditory cortex, Nat. Neurosci. (2007)
  • J. Bertoncini et al., Discrimination in neonates of very short CVs, J. Acoust. Soc. Am. (1987)
  • A. Bieser, Processing of twitter-call fundamental frequencies in insula and auditory cortex of squirrel monkeys, Exp. Brain Res. (1998)
  • Y. Bitterman et al., Ultra-fine frequency tuning revealed in single neurons of human auditory cortex, Nature (2008)
  • S.E. Blumstein et al., Acoustic invariance in speech production: evidence from measurements of the spectral characteristics of stop consonants, J. Acoust. Soc. Am. (1979)
  • S.E. Blumstein et al., Perceptual invariance and onset spectra for stop consonant vowel environments, J. Acoust. Soc. Am. (1980)
  • S. Borsky et al., "How to milk a coat": the effects of semantic and acoustic information on phoneme categorization, J. Acoust. Soc. Am. (1998)
  • M. Brosch et al., Stimulus-dependent modulations of correlated high-frequency oscillations in cat visual cortex, Cereb. Cortex (1997)
  • J.F. Brugge et al., Coding of repetitive transients by auditory cortex on Heschl's gyrus, J. Neurophysiol. (2009)
  • R.P. Carlyon et al., Comparing the fundamental frequencies of resolved and unresolved harmonics: evidence for two pitch mechanisms, J. Acoust. Soc. Am. (1994)
  • A.E. Carney et al., Noncategorical perception of stop consonants differing in VOT, J. Acoust. Soc. Am. (1977)
  • S. Chang et al., The role of onsets in perception of stop consonant place of articulation: effects of spectral and temporal discontinuity, J. Acoust. Soc. Am. (1981)
  • E.F. Chang et al., Categorical speech representation in human superior temporal gyrus, Nat. Neurosci. (2010)
  • O. Creutzfeldt et al., Thalamocortical transformation of responses to complex auditory stimuli, Exp. Brain Res. (1980)
  • S.J. Cruikshank et al., Auditory thalamocortical synaptic transmission in vitro, J. Neurophysiol. (2002)
  • S.V. David et al., Task reward structure shapes rapid receptive field plasticity in auditory cortex, Proc. Natl. Acad. Sci. U. S. A. (2012)
  • R.L. Diehl et al., Speech perception, Annu. Rev. Psychol. (2004)
  • N. Ding et al., Emergence of neural encoding of auditory objects while listening to competing speakers, Proc. Natl. Acad. Sci. U. S. A. (2012)
  • R. Drullman et al., Effect of temporal envelope smearing on speech reception, J. Acoust. Soc. Am. (1994)
  • J.J. Eggermont, Neural correlates of gap detection and auditory fusion in cat auditory cortex, Neuroreport (1995)
  • J.J. Eggermont, Representation of spectral and temporal sound features in three cortical fields of the cat. Similarities outweigh differences, J. Neurophysiol. (1998)
  • J.J. Eggermont, Neural responses in primary auditory cortex mimic psychophysical, across-frequency-channel, gap-detection thresholds, J. Neurophysiol. (2000)
  • J.J. Eggermont, Temporal modulation transfer functions in cat primary auditory cortex: separating stimulus effects from neural mechanisms, J. Neurophysiol. (2002)
  • R. Eilers et al., Linguistic experience and phonetic perception in infancy: a crosslinguistic study, Child Dev. (1979)
  • P. Eimas et al., Speech perception in infants, Science (1971)
  • C.T. Engineer et al., Cortical activity patterns predict speech discrimination ability, Nat. Neurosci. (2008)
  • A. Faulkner et al., Contributions of temporal encodings of voicing, voicelessness, fundamental frequency, and amplitude variation to audio-visual and auditory speech perception, J. Acoust. Soc. Am. (1999)
  • Y.I. Fishman et al., Searching for the mismatch negativity in primary auditory cortex of the awake monkey: deviance detection or stimulus specific adaptation?, J. Neurosci. (2012)
  • J.L. Flanagan et al., On the pitch of periodic pulses, J. Acoust. Soc. Am. (1960)