
NeuroImage

Volume 31, Issue 3, 1 July 2006, Pages 1284-1296

Locating the initial stages of speech–sound processing in human temporal cortex

https://doi.org/10.1016/j.neuroimage.2006.01.004

Abstract

It is commonly assumed that, in the cochlea and the brainstem, the auditory system processes speech sounds without differentiating them from any other sounds. At some stage, however, it must treat speech sounds and nonspeech sounds differently, since we perceive them as different. The purpose of this study was to use functional MRI to delimit the first location in the auditory pathway that makes this distinction, by identifying regions that are differentially sensitive to the internal structure of speech sounds as opposed to closely matched control sounds. We analyzed data from nine right-handed volunteers who were scanned while listening to natural and synthetic vowels, or to nonspeech stimuli matched to the vowel sounds in terms of their long-term energy and both their spectral and temporal profiles. The vowels produced more activation than the nonspeech sounds in a bilateral region of the superior temporal sulcus, lateral and inferior to the regions of auditory cortex that were activated by both vowels and nonspeech stimuli. The results are compatible with a hierarchical model of primate auditory processing in which early cortical stages respond indiscriminately to speech and nonspeech sounds, and only higher regions, beyond anatomically defined auditory cortex, show selectivity for speech sounds.

Introduction

Processing in the ascending auditory pathway of the primate is largely sequential and hierarchical up to and including auditory cortex with its core, belt and parabelt regions (Rauschecker, 1998, Kaas and Hackett, 2000). Speech is recoded by the auditory nuclei of the brainstem and thalamus before it reaches auditory cortex (Irvine, 1992, Eggermont, 2001, Frisina, 2001), and both the anatomical and physiological literature suggest that the recoding involves processing of acoustic features relevant to speech. However, these nuclei seem to apply general processes to all sounds independent of their source, and there is no indication that speech-specific processing begins before auditory cortex.

Neuroimaging evidence indicates that, in humans, information is processed in a hierarchical manner in auditory cortex, and that this hierarchical organization extends into the multimodal regions beyond auditory cortex (Hall et al., 2001, Hall et al., 2003, Wessinger et al., 2001, Patterson et al., 2002, Scott and Johnsrude, 2003). It is clear that speech sounds have to pass through primary auditory cortex (PAC), which is active in the presence of speech sounds (e.g., Belin et al., 2002, Ahissar et al., 2001). But the corresponding area in nonhuman mammals (A1) is also active in the presence of speech sounds (Steinschneider et al., 2003, Versnel and Shamma, 1998), so this activity appears to represent general auditory processing of acoustic features rather than speech-specific processes. Regions well beyond PAC, in the superior temporal sulcus and middle temporal gyrus, are sensitive to the intelligibility of sentences, particularly in the left hemisphere (Davis and Johnsrude, 2003, Narain et al., 2003, Scott et al., 2000, Giraud et al., 2004). However, intervening between PAC and these middle temporal regions, there are at least three anatomically differentiable regions that appear to be predominantly auditory, and the connectivity of these regions suggests at least three stages of processing (the human homologues of belt and parabelt cortex, and the upper STS region TAa; Kaas and Hackett, 2000, Seltzer and Pandya, 1991). We hypothesize that the earliest stages to show some specialization for the processing of speech will be within these three regions.

A large number of imaging studies have investigated temporal-lobe involvement in speech-specific processing (e.g., Binder et al., 1997, Binder et al., 2000, Binder et al., 2004, Benson et al., 2001, Callan et al., 2004, Crinion et al., 2003, Davis and Johnsrude, 2003, Dehaene-Lambertz et al., 2005, Demonet et al., 1992, Gandour et al., 2003, Giraud and Price, 2001, Giraud et al., 2004, Hugdahl et al., 2003, Jancke et al., 2002, Joanisse and Gati, 2003, Liebenthal et al., 2003, Liebenthal et al., 2005, Mummery et al., 1999, Narain et al., 2003, Poeppel et al., 2004, Rimol et al., 2005, Schlosser et al., 1998, Scott et al., 2000, Specht and Reul, 2003, Thierry et al., 2003, Vouloumanos et al., 2001, Zatorre et al., 1992; see also Belin et al., 2000, Price et al., 2005). However, many of these studies investigated speech at the level of the word, phrase or sentence, and such stimuli would probably have engaged lexical, semantic and syntactic processes in addition to speech-sound processing (e.g., Crinion et al., 2003, Davis and Johnsrude, 2003, Giraud et al., 2004, Narain et al., 2003, Scott et al., 2000, Schlosser et al., 1998, Zatorre et al., 1992). In many studies, the acoustic characteristics of the speech stimuli differed substantially from those of the nonspeech stimuli (e.g., Benson et al., 2001, Binder et al., 2000, Binder et al., 2004, Demonet et al., 1992, Giraud and Price, 2001, Vouloumanos et al., 2001). Many studies included a task that required or encouraged the listener to make an explicit linguistic judgement about the sounds they were hearing (Callan et al., 2004, Jancke et al., 2002, Liebenthal et al., 2003, Binder et al., 2004, Dehaene-Lambertz et al., 2005). But none of these studies has compared the activity elicited by elementary speech sounds (such as vowels) with that elicited by acoustically matched nonspeech sounds while listeners were engaged in a task that encourages attention to the stimuli but does not require or encourage linguistic processing of the sounds.

In this paper, we identify the cortical regions where the processing of speech and nonspeech sounds begins to diverge, by comparing a new class of synthetic vowels with a set of nonspeech controls that are closely matched in terms of the distribution of energy across frequency and over time. The synthetic vowels, with distinctive properties that identify them as linguistically relevant sounds produced by a human vocal tract, are immediately heard as speech, while the nonspeech controls cannot be heard as speech even with deliberate effort. Furthermore, we deliberately chose a task that would not preferentially engage speech processing: listeners simply monitored the intensity of the stimuli.
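To make the matching criterion concrete, the following is a minimal Python sketch of how one might verify that a nonspeech control matches a vowel in its long-term average spectrum and slow temporal envelope. It is illustrative only: the signal arrays, sampling rate and filter settings are assumptions, not the authors' stimulus-generation code.

```python
# Illustrative check that a nonspeech control matches a vowel in its
# long-term average spectrum and slow temporal envelope. The signals here
# are random placeholders; real stimuli would be loaded from audio files.
import numpy as np
from scipy.signal import welch, hilbert, butter, filtfilt

def long_term_spectrum_db(x, fs, nperseg=1024):
    """Welch estimate of the long-term average power spectrum, in dB."""
    f, pxx = welch(x, fs=fs, nperseg=nperseg)
    return f, 10 * np.log10(pxx + 1e-12)

def temporal_envelope(x, fs, cutoff_hz=30.0):
    """Hilbert envelope, low-pass filtered to keep slow modulations."""
    env = np.abs(hilbert(x))
    b, a = butter(4, cutoff_hz / (fs / 2))
    return filtfilt(b, a, env)

fs = 25000                        # assumed sampling rate (Hz)
vowel = np.random.randn(fs)       # placeholder for a synthetic vowel
control = np.random.randn(fs)     # placeholder for its nonspeech control

_, spec_v = long_term_spectrum_db(vowel, fs)
_, spec_c = long_term_spectrum_db(control, fs)
env_corr = np.corrcoef(temporal_envelope(vowel, fs),
                       temporal_envelope(control, fs))[0, 1]
print(f"max long-term spectral deviation: {np.max(np.abs(spec_v - spec_c)):.1f} dB")
print(f"temporal envelope correlation:    {env_corr:.2f}")
```

A well-matched speech/nonspeech pair should show a small spectral deviation and a high envelope correlation, so that any differential brain activation cannot be attributed to gross differences in the distribution of energy over frequency or time.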

Section snippets

Stimulus development

In creating synthetic vowels, we adhered to the following constraints: (1) Their spectra exhibit three to four relatively narrow formants in the frequency region below 4000 Hz. (2) Stimuli are presented in sequences, where the individual elements within each sequence occur at a rate of about three per second. (3) The spectra of successive vowels within a sequence differ, but they all appear to come from the same vocal tract (e.g., they are different vowels spoken by the same person). All three …
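As a rough illustration of constraint (1), the sketch below synthesizes a single vowel by passing a glottal pulse train through a cascade of second-order resonators, one per formant. This is a generic Klatt-style source-filter scheme, not the authors' actual synthesis method; the f0, formant frequencies and bandwidths are assumed values chosen for an /a/-like vowel.

```python
# Generic source-filter vowel synthesis: a glottal pulse train filtered
# through a cascade of second-order resonators, one per formant below
# 4000 Hz. All parameter values are assumptions, not the study's settings.
import numpy as np
from scipy.signal import lfilter

def formant_resonator(x, fs, freq_hz, bw_hz):
    """Apply one second-order IIR resonator (one formant) to signal x."""
    r = np.exp(-np.pi * bw_hz / fs)
    c = 2 * r * np.cos(2 * np.pi * freq_hz / fs)
    b = [1 - c + r ** 2]              # scaled for unity gain at DC
    a = [1, -c, r ** 2]
    return lfilter(b, a, x)

fs = 25000                            # assumed sampling rate (Hz)
f0 = 120.0                            # assumed glottal pulse rate (Hz)
n = int(0.3 * fs)                     # one 300 ms vowel
source = np.zeros(n)
source[::int(fs / f0)] = 1.0          # crude impulse-train glottal source

# Assumed formants (centre frequency, bandwidth) in Hz for an /a/-like vowel
formants = [(700, 80), (1220, 90), (2600, 120), (3300, 150)]
vowel = source
for freq, bw in formants:
    vowel = formant_resonator(vowel, fs, freq, bw)
vowel /= np.max(np.abs(vowel))        # normalize peak amplitude

# Constraint (2) would then be met by concatenating different vowels with
# silent gaps so that elements occur at roughly three per second.
```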

Activation in response to sound

When each of the five sound conditions was contrasted with silence, similar patterns of activation were observed (Fig. 4). The activation was largely confined to the temporal lobes, bilaterally, and there were always two main foci of activation in auditory cortex: one towards the medial end of Heschl's gyrus (HG) and one towards the lateral end of HG. These two foci were highly consistent in their locations across conditions.
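For readers unfamiliar with how such condition-versus-silence maps are obtained, here is a minimal sketch of a first-level GLM contrast using the modern nilearn library. The file name, onsets, durations and TR are hypothetical placeholders, and the original 2006 analysis was of course run with different software and parameters.

```python
# Minimal first-level GLM sketch of a "sound versus silence" contrast,
# using nilearn. All file names, event timings and the TR are hypothetical
# placeholders, not the parameters of the original study.
import pandas as pd
from nilearn.glm.first_level import FirstLevelModel

events = pd.DataFrame({
    "onset":      [0.0, 12.0, 24.0, 36.0],   # seconds (hypothetical)
    "duration":   [8.0, 8.0, 8.0, 8.0],
    "trial_type": ["vowels", "silence", "nonspeech", "silence"],
})

model = FirstLevelModel(t_r=12.0,    # long TR, as in sparse/clustered imaging
                        hrf_model="spm",
                        minimize_memory=True)
model = model.fit("subject01_bold.nii.gz", events=events)

# Voxelwise z-map for the vowels-versus-silence contrast
zmap = model.compute_contrast("vowels - silence", output_type="z_score")
zmap.to_filename("vowels_minus_silence_z.nii.gz")
```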

Main effect of speechlikeness

No significant regions of activation were observed when any of the three …

Discussion

In this imaging study, volunteers heard natural vowels produced by a human speaker and four classes of synthetic sounds that were closely matched to each other in terms of the distribution of energy over frequency and time. The two conditions with speech-like carrier frequencies (formants) were perceived as vowels. The two conditions perceived as nonspeech were produced from these synthesized vowels by simple manipulations in the time and frequency domains. When the speech conditions were …

Summary

This paper has identified regions of the brain that appear to be differentially involved in the processing of those stimulus characteristics that make sequences of vowels unique in the natural environment. With the aid of carefully controlled sets of stimuli, we have identified candidate areas that appear to be specifically involved in processing those properties of the acoustic signal that indicate whether or not the sound will be perceived as speech. The most active of these centers is in the …

Acknowledgments

This work was supported by grants from the Medical Research Council (G9900369, G9901257, G0500221; authors RP and SU). The functional MRI was carried out at the Wolfson Brain Imaging Centre in Cambridge; we thank the radiographers for their assistance with data acquisition and Chris Wood for his help with data preprocessing.

References (89)

  • K. Hugdahl et al., The effects of attention on speech perception: an fMRI study, Brain Lang. (2003)
  • L. Jancke et al., Phonetic perception and the temporal cortex, NeuroImage (2002)
  • M.F. Joanisse et al., Overlapping neural regions for processing rapid temporal cues in speech and nonspeech signals, NeuroImage (2003)
  • J.H. Kaas et al., Auditory processing in primate cerebral cortex, Curr. Opin. Neurobiol. (1999)
  • P. Morosan et al., Human primary auditory cortex: cytoarchitectonic subdivisions and mapping into a spatial reference system, NeuroImage (2001)
  • A.R. Palmer et al., A high-output, high-quality sound system for use in auditory fMRI, NeuroImage (1998)
  • R.D. Patterson et al., The processing of temporal pitch and melody information in auditory cortex, Neuron (2002)
  • D. Poeppel et al., Auditory lexical decision, categorical perception, and FM direction discrimination differentially engage left and right auditory cortex, Neuropsychologia (2004)
  • C. Price et al., Speech-specific auditory processing: where is it?, Trends Cogn. Sci. (2005)
  • J. Rademacher et al., Probabilistic mapping and volume measurement of human primary auditory cortex, NeuroImage (2001)
  • J.P. Rauschecker, Cortical processing of complex sounds, Curr. Opin. Neurobiol. (1998)
  • F. Rivier et al., Cytochrome oxidase, acetylcholinesterase, and NADPH-diaphorase staining in human supratemporal and insular cortex: evidence for multiple auditory areas, NeuroImage (1997)
  • L.M. Rimol et al., Processing of sub-syllabic speech units in the posterior temporal lobe: an fMRI study, NeuroImage (2005)
  • S.K. Scott et al., The neuroanatomical and functional organization of speech perception, Trends Neurosci. (2003)
  • B. Seltzer et al., Afferent cortical connections and architectonics of the superior temporal sulcus and surrounding cortex in the rhesus monkey, Brain Res. (1978)
  • K. Specht et al., Functional segregation of the temporal lobes into highly differentiated subsystems for auditory perception: an auditory rapid event-related fMRI-task, NeuroImage (2003)
  • G. Thierry et al., Hemispheric dissociation in access to the human semantic system, Neuron (2003)
  • K.E. Watkins et al., Seeing and hearing speech excites the motor system involved in speech production, Neuropsychologia (2003)
  • E. Ahissar et al., Speech comprehension is correlated with temporal response patterns recorded from auditory cortex, Proc. Natl. Acad. Sci. U. S. A. (2001)
  • P. Belin et al., Voice-selective areas in human auditory cortex, Nature (2000)
  • J.R. Binder et al., Human brain language areas identified by functional magnetic resonance imaging, J. Neurosci. (1997)
  • J.R. Binder et al., Human temporal lobe activation by speech and nonspeech sounds, Cereb. Cortex (2000)
  • J.R. Binder et al., Neural correlates of sensory and decision processes in auditory object identification, Nat. Neurosci. (2004)
  • M. Brett et al., The problem of functional localization in the human brain, Nat. Rev. Neurosci. (2002)
  • J.F. Brugge et al., Functional connections between auditory cortex on Heschl's gyrus and on the lateral superior temporal gyrus in humans, J. Neurophysiol. (2003)
  • J.T. Crinion et al., Temporal lobe regions engaged during normal speech comprehension, Brain (2003)
  • H.A. David, The Method of Paired Comparisons (1988)
  • M.H. Davis et al., Hierarchical processing in spoken language comprehension, J. Neurosci. (2003)
  • J.F. Demonet et al., The anatomy of phonological and semantic processing in normal subjects, Brain (1992)
  • W.B. Edmister et al., Improved auditory cortex imaging using clustered volume acquisitions, Hum. Brain Mapp. (1999)
  • J. Gandour et al., Neural correlates of segmental and tonal information in speech perception, Hum. Brain Mapp. (2003)
  • A.L. Giraud et al., The constraints functional neuroimaging places on classical models of auditory word processing, J. Cogn. Neurosci. (2001)
  • A.L. Giraud et al., Contributions of sensory input, auditory search and verbal comprehension to cortical activity during speech processing, Cereb. Cortex (2004)
  • T.D. Griffiths et al., Encoding of the temporal regularity of sound in the human brainstem, Nat. Neurosci. (2001)
¹ The first two authors contributed equally to this work.
