Review
Representation of speech in human auditory cortex: Is it special?
Introduction
The ease with which speech is perceived underscores the refined operations of a neural network capable of rapidly decoding complex acoustic signals and categorizing them into meaningful phonemic sequences. A number of models have been devised to explain how phonemes are extracted from the continuous stream of speech (e.g., McClelland and Elman, 1986; Church, 1987; Pisoni and Luce, 1987; Stevens, 2002). Common to all these models is the recognition that phonemic perception is a categorization task based on sound profiles derived from a multidimensional space encompassing numerous acoustic features unfolding over time (Holt and Lotto, 2010). Each feature is characterized by acoustic parameters that vary along intensity, spectral, and temporal dimensions. Increased intensity, especially in the low to mid-frequency ranges, helps to distinguish vowels from consonants (McClelland and Elman, 1986; Stevens, 2002). Distinct spectral (formant) patterns during these periods of increased intensity promote accurate vowel identification (Hillenbrand et al., 1995).
The temporal dimension of phonemic categorization has received increased attention in recent years. An influential proposal posits that speech perception occurs over several overlapping time scales (e.g., Poeppel et al., 2008; Poeppel et al., 2012; Giraud and Poeppel, 2012). Syllabic analyses occur within a time frame of about 150–300 ms and correlate with the amplitude envelope of speech. Speech comprehension remains high even when sentence fragments are time-reversed in 50 ms bins, and only becomes severely degraded when time-reversals occur at frequencies overlapping those of the speech envelope (Saberi and Perrott, 1999). Furthermore, temporal smearing of the speech envelope leads to significant degradation in the intelligibility of sentences only at frequencies commensurate with the speech envelope (Drullman et al., 1994).
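The local time-reversal manipulation of Saberi and Perrott (1999) can be sketched as follows. This is an illustrative sketch only: the bin length, sampling rate, and toy signal are assumptions for demonstration, not materials from the original study.

```python
def reverse_in_bins(samples, bin_len):
    """Time-reverse a signal locally within consecutive bins of bin_len samples.

    Short bins (e.g., ~50 ms worth of samples) leave intelligibility largely
    intact; bins long enough to disrupt the speech envelope degrade it
    severely (Saberi and Perrott, 1999).
    """
    out = []
    for start in range(0, len(samples), bin_len):
        out.extend(reversed(samples[start:start + bin_len]))
    return out

# Illustrative: at a 16 kHz sampling rate, a 50 ms bin is 800 samples.
signal = list(range(10))
print(reverse_in_bins(signal, 4))  # [3, 2, 1, 0, 7, 6, 5, 4, 9, 8]
```

Note that when `bin_len` equals the signal length, the manipulation reduces to global time reversal, the condition in which comprehension collapses.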
More refined acoustic feature analyses are performed within shorter temporal windows of integration that vary between about 20 and 80 ms. Segmentation of speech within this range is critical for phonetic feature encoding, especially for shorter duration consonants. Times at which rapid temporal and spectral changes occur are informationally rich landmarks in the speech waveform (Stevens, 1981; Stevens, 2002). Both the spectra and formant transition trajectories occurring at these landmarks are crucial for accurate identification of true consonants such as the stops (Kewley-Port, 1983; Walley and Carrell, 1983; Alexander and Kluender, 2009). Voice onset time (VOT), the time between consonant release and the onset of rhythmic vocal cord vibrations, is a classic example of rapid temporal discontinuities that help to distinguish voiced consonants (e.g., /b/, /d/, and /g/) from their unvoiced counterparts (e.g., /p/, /t/, and /k/) (e.g., Lisker and Abramson, 1964; Faulkner and Rosen, 1999). Indeed, when semantic information is lacking, listeners of time-reversed speech have significant comprehension difficulties at the shorter temporal intervals required for phonetic feature encoding (Kiss et al., 2008).
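The categorical nature of VOT perception can be caricatured as a boundary decision. The sketch below is a toy model, not the authors' analysis: the ~25 ms boundary is an illustrative value for English bilabial stops, and real perception is probabilistic near the boundary rather than a hard threshold.

```python
def classify_voicing(vot_ms, boundary_ms=25.0):
    """Toy categorical decision on voice onset time (VOT).

    English listeners tend to label bilabial stops with short VOT as voiced
    /b/ and longer VOT as voiceless /p/. The boundary value here is an
    illustrative assumption; actual category boundaries vary with place of
    articulation, speaking rate, and language.
    """
    return "voiced" if vot_ms < boundary_ms else "voiceless"

print(classify_voicing(10.0))  # voiced
print(classify_voicing(60.0))  # voiceless
```

The point of the caricature is that a continuous acoustic variable (VOT in ms) maps onto a discrete phonemic label, which is what "categorical-like" neural responses are measured against.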
Early stations in the human auditory system are exquisitely tuned to encode speech-related acoustic features. Population brainstem responses accurately represent the intensity, spectrum, and temporal envelope of speech sounds (Chandrasekaran et al., 2009, Anderson and Kraus, 2010). Magnetoencephalographic (MEG) responses reflect consonant place of articulation (POA) within 50 ms after sound onset (Tavabi et al., 2007), and within 100 ms, responses differentiate intelligible versus unintelligible speech (Obleser et al., 2006). Neural responses obtained from intracranial recordings in Heschl's gyrus (HG), the putative location of primary auditory cortex in humans (Hackett et al., 2001), demonstrate categorical-like changes to syllables that vary in their VOT in a manner that parallels perception (Steinschneider et al., 1999, Steinschneider et al., 2005). Spectrotemporal receptive fields derived from single unit activity in HG elicited by one portion of a movie soundtrack dialog can accurately predict response patterns elicited by a different portion of the same dialog (Bitterman et al., 2008). Finally, both MEG responses and responses obtained from invasive recordings within HG have shown that accurate tracking of the speech envelope degrades in parallel with the ability to perceive temporally compressed speech (Ahissar et al., 2001, Nourski et al., 2009; see also Peelle et al., 2013). These observations lend support to the conclusion that “acoustic–phonetic features of the speech signal such as voicing, spectral shape, formants or amplitude modulation are made accessible by the computations of the ascending auditory pathway and primary auditory cortex” (Obleser and Eisner, 2008, p. 16).
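The spectrotemporal receptive field (STRF) prediction reported by Bitterman et al. (2008) rests on a linear model: the predicted response is the stimulus spectrogram convolved with the STRF across frequency and time lag. A minimal sketch follows, assuming NumPy; the spectrogram and kernel here are synthetic stand-ins, not fitted to any data from the cited study.

```python
import numpy as np

def strf_predict(spectrogram, strf):
    """Predict a neural response from a spectrogram under a linear STRF model.

    spectrogram: array of shape (n_freq, n_time)
    strf: array of shape (n_freq, n_lags)
    Returns r with r[t] = sum over f, tau of strf[f, tau] * spectrogram[f, t - tau].
    """
    n_freq, n_time = spectrogram.shape
    _, n_lags = strf.shape
    r = np.zeros(n_time)
    for tau in range(n_lags):
        # shift the spectrogram by tau time bins, weight by the STRF slice at that lag
        r[tau:] += strf[:, tau] @ spectrogram[:, :n_time - tau]
    return r

rng = np.random.default_rng(0)
spec = rng.random((8, 100))      # synthetic 8-band spectrogram, 100 time bins
strf = rng.standard_normal((8, 5))  # synthetic kernel with 5 time lags
print(strf_predict(spec, strf).shape)  # (100,)
```

In practice the STRF is estimated from one stretch of stimulus and response, then validated by predicting the response to a held-out stretch, which is the logic of the Bitterman et al. result.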
An important and unresolved question is whether the representation of acoustic features of speech in the brain is based on neural processing mechanisms that are unique to humans and shaped by learning and experience with an individual's native language. The role of experience in modifying auditory cortical physiology is prominently observed during early development. The appearance of the mismatch negativity component of the event-related potential becomes restricted to native-language phonemic contrasts by 7½ months of age (Kuhl and Rivera-Gaxiola, 2008). Stronger native-language-specific responses predict enhanced language skills at two years of age. The emergence of new event-related potentials that parallel developmental milestones in speech processing provides an additional example of neural circuitry changes derived from language experience (Friederici, 2005). In adults, both gray matter volume of primary auditory cortex and the amplitude of short-latency auditory evoked potentials generated in primary auditory cortex are larger in adult musicians than in musically-naïve subjects (Schneider et al., 2002). Recordings from animal models that are complex vocal learners such as songbirds also demonstrate pronounced modifications that occur in auditory forebrain processing of sound based on developmental exposure to species-specific vocalizations (e.g., Woolley, 2012). In sum, it remains unclear how "special" or unique in mammalian physiology human primary auditory cortex is with regard to decoding the building blocks of speech.
Here, we examine this question by comparing the neural activity elicited by speech in primary auditory cortex (A1) of macaque monkeys, who are limited vocal learners, with that in HG of humans, who are obviously expert vocal learners (Petkov and Jarvis, 2012). Neural activity from human primary auditory cortex was acquired during intracranial recordings in patients undergoing surgical evaluation for medically intractable epilepsy. Measures included averaged evoked potentials (AEPs) and event-related band power (ERBP) in the high gamma (70–150 Hz) frequency range. Comparable population recordings were performed in the macaques. Measures included AEPs, the derived current source density (CSD), and multiunit activity (MUA). The focus of this report will be on clarifying the neural representation of acoustic features of speech that vary along both temporal and spectral dimensions. Some of the results represent a summary of previous studies from human and monkey primary auditory cortex. The remainder of the results represents new data that extend the previous findings. If perceptually-relevant features of speech are encoded similarly in humans and monkeys, then it is reasonable to conclude that human primary auditory cortex is not special.
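High gamma ERBP is conventionally obtained by band-limiting the signal to roughly 70–150 Hz and taking the magnitude of its analytic envelope, expressed relative to a baseline. The following is a simplified FFT-based sketch, assuming NumPy; the brick-wall filter and the synthetic test signal are illustrative choices, not the exact pipeline used for the recordings in this study.

```python
import numpy as np

def high_gamma_envelope(x, fs, lo=70.0, hi=150.0):
    """Band-limited analytic envelope via an FFT brick-wall filter.

    Keeping only positive frequencies in [lo, hi] (doubled) and inverse
    transforming yields the analytic signal of the band-passed data, so its
    magnitude is the instantaneous high gamma envelope.
    """
    n = len(x)
    freqs = np.fft.fftfreq(n, d=1.0 / fs)
    spec = np.fft.fft(x)
    band = (freqs >= lo) & (freqs <= hi)   # positive-frequency band only
    analytic = np.fft.ifft(np.where(band, 2.0 * spec, 0.0))
    return np.abs(analytic)

fs = 1000.0
t = np.arange(0, 1.0, 1.0 / fs)
# unit-amplitude 100 Hz component (inside the band) plus a 10 Hz component (outside it)
x = np.sin(2 * np.pi * 100 * t) + np.sin(2 * np.pi * 10 * t)
env = high_gamma_envelope(x, fs)
print(round(float(env.mean()), 2))  # 1.0: only the 100 Hz component survives
```

In an actual ERBP analysis this envelope would be squared and log-scaled relative to a pre-stimulus baseline, and a finite-impulse-response filter would typically replace the brick-wall spectral mask.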
Subjects
Results presented in this report represent neurophysiological data obtained from multiple male monkeys (Macaca fascicularis) that have been accumulated over many years. During this time, there have been gradual changes in methodology. The reader is referred to the cited publications for methodological details (i.e., Fig. 3, Steinschneider et al., 2003; six subjects; Fig. 8, Steinschneider and Fishman, 2011; four subjects). Methods described here refer to studies involving two monkey subjects
Monkey
Entrainment to the temporal envelope of vocalizations within auditory cortex is specific neither to humans nor to human speech. Fig. 1 demonstrates neural entrainment to the temporal envelope of three monkey vocalizations at a low best frequency (BF) location within A1. The left-hand graph in Fig. 1A depicts the frequency response function (FRF) of this site based on responses to pure tones presented at 60 dB SPL. The BF of this site is approximately 400 Hz, with a secondary peak at the 200 Hz subharmonic. FRFs based on responses to tones
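Envelope entrainment of the kind shown in Fig. 1 can be quantified in several ways; one simple metric is the peak normalized cross-correlation between the stimulus amplitude envelope and the neural response, allowing a short lag for conduction and processing delay. The sketch below assumes NumPy, and the metric, lag range, and synthetic signals are illustrative assumptions rather than the analyses used in the original recordings.

```python
import numpy as np

def entrainment_score(envelope, response, max_lag=20):
    """Peak normalized cross-correlation between a stimulus envelope and a
    neural response, searching over small positive lags (response trailing
    the stimulus). Values near 1 indicate strong envelope-following."""
    env = (envelope - envelope.mean()) / envelope.std()
    resp = (response - response.mean()) / response.std()
    return max(
        float(np.mean(env[:len(env) - lag] * resp[lag:])) if lag else float(np.mean(env * resp))
        for lag in range(max_lag)
    )

rng = np.random.default_rng(1)
env = np.abs(rng.standard_normal(500))          # synthetic stimulus envelope
resp = np.roll(env, 5) + 0.1 * rng.standard_normal(500)  # delayed, noisy copy
print(entrainment_score(env, resp) > 0.9)  # True
```

Frequency-domain measures such as coherence between envelope and response serve the same purpose and are more common in the entrainment literature.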
Summary and general conclusions
The key finding of this study is that neural representations of fundamental temporal and spectral features of speech by population responses in primary auditory cortex are remarkably similar in monkeys and humans, despite their vastly different experience with human language. Thus, it appears that plasticity-induced language learning does not significantly alter response patterns elicited by the acoustical properties of speech sounds in primary auditory cortex of humans as compared with
Contributors
All authors participated in the writing of this manuscript. Drs. Fishman and Steinschneider collected and analyzed data obtained from monkeys. Drs. Nourski and Steinschneider collected and analyzed data obtained from humans. All authors have approved the final article.
Conflict of interest
All authors state that they have no actual or potential conflict of interest.
Acknowledgments
The authors thank Drs. Joseph C. Arezzo, Charles E. Schroeder and David H. Reser, and Ms. Jeannie Hutagalung for their assistance in the monkey studies, and Drs. Hiroyuki Oya, Hiroto Kawasaki, Ariane Rhone, Christopher Kovach, John F. Brugge, and Matthew A. Howard for their assistance in the human studies. Primate studies supported by the NIH (DC-00657), and human studies supported by NIH (DC04290, UL1RR024979), Hearing Health Foundation, and the Hoover Fund.
References (143)
- et al. Response to broadband repetitive stimuli in auditory cortex of the unanesthetized rat. Hear. Res. (2006)
- et al. Context-dependent encoding in the human auditory brainstem relates to hearing speech in noise: implications for developmental dyslexia. Neuron (2009)
- Phonological parsing and lexical retrieval. Cognition (1987)
- et al. Lifelong plasticity in the rat auditory cortex: basic mechanisms and role of sensory experience. Prog. Brain Res. (2011)
- et al. Temporally dynamic frequency tuning of population responses in monkey primary auditory cortex. Hear. Res. (2009)
- Neurophysiological markers of early language acquisition: from syllables to sentences. Trends Cogn. Sci. (2005)
- A temporal sampling framework for developmental dyslexia. Trends Cogn. Sci. (2011)
- et al. Improved optimization for the robust and accurate linear registration and motion correction of brain images. NeuroImage (2002)
- Brain mechanisms in early language acquisition. Neuron (2010)
- et al. The motor theory of speech perception revised. Cognition (1985)
- On the relation of speech to language. Trends Cogn. Sci.
- The TRACE model of speech perception. Cognit. Psychol.
- Functional anatomy of the inferior colliculus and the auditory cortex: current source density analyses of click-evoked potentials. Hear. Res.
- Genetic advances in the study of speech and language disorders. Neuron
- Click train encoding in primary and non-primary auditory cortex of anesthetized macaque monkeys. Neuroscience
- Acoustic–phonetic representations in word recognition. Cognition
- Speech comprehension is correlated with temporal response patterns recorded from auditory cortex. Proc. Natl. Acad. Sci. U. S. A.
- Spectral tilt change in stop consonant perception. J. Acoust. Soc. Am.
- Spectral tilt change in stop consonant perception by listeners with hearing impairment. J. Speech Lang. Hear. Res.
- Objective neural indices of speech-in-noise perception. Trends Amplif.
- Dual-pitch processing mechanisms in primate auditory cortex. J. Neurosci.
- Differential neural coding of acoustic flutter within primate auditory cortex. Nat. Neurosci.
- Discrimination in neonates of very short CVs. J. Acoust. Soc. Am.
- Processing of twitter-call fundamental frequencies in insula and auditory cortex of squirrel monkeys. Exp. Brain Res.
- Ultra-fine frequency tuning revealed in single neurons of human auditory cortex. Nature
- Acoustic invariance in speech production: evidence from measurements of the spectral characteristics of stop consonants. J. Acoust. Soc. Am.
- Perceptual invariance and onset spectra for stop consonant vowel environments. J. Acoust. Soc. Am.
- "How to milk a coat": the effects of semantic and acoustic information on phoneme categorization. J. Acoust. Soc. Am.
- Stimulus-dependent modulations of correlated high-frequency oscillations in cat visual cortex. Cereb. Cortex
- Coding of repetitive transients by auditory cortex on Heschl's gyrus. J. Neurophysiol.
- Comparing the fundamental frequencies of resolved and unresolved harmonics: evidence for two pitch mechanisms. J. Acoust. Soc. Am.
- Noncategorical perception of stop consonants differing in VOT. J. Acoust. Soc. Am.
- The role of onsets in perception of stop consonant place of articulation: effects of spectral and temporal discontinuity. J. Acoust. Soc. Am.
- Categorical speech representation in human superior temporal gyrus. Nat. Neurosci.
- Thalamocortical transformation of responses to complex auditory stimuli. Exp. Brain Res.
- Auditory thalamocortical synaptic transmission in vitro. J. Neurophysiol.
- Task reward structure shapes rapid receptive field plasticity in auditory cortex. Proc. Natl. Acad. Sci. U. S. A.
- Speech perception. Annu. Rev. Psychol.
- Emergence of neural encoding of auditory objects while listening to competing speakers. Proc. Natl. Acad. Sci. U. S. A.
- Effect of temporal envelope smearing on speech reception. J. Acoust. Soc. Am.
- Neural correlates of gap detection and auditory fusion in cat auditory cortex. Neuroreport
- Representation of spectral and temporal sound features in three cortical fields of the cat: similarities outweigh differences. J. Neurophysiol.
- Neural responses in primary auditory cortex mimic psychophysical, across-frequency-channel, gap-detection thresholds. J. Neurophysiol.
- Temporal modulation transfer functions in cat primary auditory cortex: separating stimulus effects from neural mechanisms. J. Neurophysiol.
- Linguistic experience and phonetic perception in infancy: a crosslinguistic study. Child Dev.
- Speech perception in infants. Science
- Cortical activity patterns predict speech discrimination ability. Nat. Neurosci.
- Contributions of temporal encodings of voicing, voicelessness, fundamental frequency, and amplitude variation to audio-visual and auditory speech perception. J. Acoust. Soc. Am.
- Searching for the mismatch negativity in primary auditory cortex of the awake monkey: deviance detection or stimulus specific adaptation? J. Neurosci.
- On the pitch of periodic pulses. J. Acoust. Soc. Am.