Hearing speech sounds: Top-down influences on the interface between audition and speech perception

doi:10.1016/j.heares.2007.01.014

Hearing Research

Volume 229, Issues 1–2, July 2007, Pages 132-147

https://doi.org/10.1016/j.heares.2007.01.014

Abstract

This paper focuses on the cognitive and neural mechanisms of speech perception: the rapid and highly automatic processes by which complex time-varying speech signals are perceived as sequences of meaningful linguistic units. We review four processes that contribute to the perception of speech: perceptual grouping, lexical segmentation, perceptual learning and categorical perception. In each case we present perceptual evidence for highly interactive processes, with top-down information flow driving and constraining interpretations of spoken input. The cognitive and neural underpinnings of these interactive processes appear to depend on two distinct representations of heard speech: an auditory, echoic representation of incoming speech, and a motoric/somatotopic representation of speech as it would be produced. We review the neuroanatomical system supporting these two key properties of speech perception and discuss how this system incorporates interactive processes and parallel echoic and somato-motoric representations, drawing on evidence from functional neuroimaging studies in humans and from comparative anatomical studies. We propose that top-down interactive mechanisms within auditory networks play an important role in explaining the perception of spoken language.

Introduction

You receive an unexpected call on your mobile phone. Despite the background noise on the line, you immediately recognise your colleague’s voice and can hear that she is excited about something. Catching her breath, she tells you that your joint grant application has been approved for funding and that you should meet to celebrate. In the space of a few seconds, this phone conversation has communicated a vital piece of information, conveyed the emotional significance of this news and provided physical information about the talker. While such exciting news is almost certainly not a daily occurrence, the cognitive and neural mechanisms that are at the heart of this scenario are so ubiquitous as to go largely unnoticed in our day-to-day life. We invariably focus on the information being communicated rather than the means by which it is conveyed, even in difficult listening situations.

This paper will focus on the cognitive and neural mechanisms by which a complex time-varying acoustic signal is perceived as sequences of sounds that convey meaning, addressing precisely those stages of processing that occur so rapidly, automatically and effortlessly as to be beneath our notice. We suggest that a complete account of speech perception requires an understanding of both basic auditory and higher-level cognitive processes (see Plomp, 2001, for similar arguments). We will present evidence for an interactive processing system in which bottom-up and top-down processes combine to support speech perception. This interactive account provides mechanisms by which perceptual processing can rapidly change so as to optimally perceive and comprehend speech – including those important mobile-phone calls.

In the first section of the paper we will review behavioural evidence for interactive processes playing a critical role in speech perception. The background provided by these several decades of behavioural evidence must be accounted for by any neural account of speech perception and therefore constitutes the majority of the evidence presented here. Building on this behavioural evidence, the second section of the paper describes two types of representation that are integral to the implementation of an interactive account of speech perception. These multiple, parallel representations of the speech input make distinct contributions to the robustness of speech perception. In the third and final section of the paper we briefly review evidence from the anatomy of the auditory system that is consistent with this computational account, reviewing evidence both for interactive processes, and for multiple perceptual pathways.

Section snippets

Evidence for interactivity in speech perception

In this section, we will discuss four processes that contribute to speech perception: (1) perceptual grouping of speech sounds into a single coherent stream, (2) segmentation of speech into meaningful (lexical) units, (3) perceptual learning mechanisms by which distorted and degraded speech is perceived and comprehended, and (4) mechanisms for perceiving variable forms of speech in a categorical fashion. For each of these four cases we suggest that evidence supports highly interactive processes

Computational requirements for interactive processes in speech perception

We have reviewed four domains in which top-down processes appear to contribute to speech perception: in promoting perceptual grouping, in achieving lexical segmentation, in supporting perceptual learning of distorted speech, and in maintaining categorical perception of speech segments. In this section, we will address the computational implications of such interactions and suggest that: (1) top-down influences act on auditory, echoic representations of incoming speech, and (2) top-down

Towards a neuroanatomical account of speech perception

This section will discuss the neural basis of the two central propositions that we make concerning speech perception: (1) that bidirectional, interactive connectivity allows higher-level constraints to influence ongoing speech perception and support the rapid retuning of perceptual processes, and (2) that parallel processing pathways support both an auditory-echoic record of incoming speech and the mapping of heard speech onto somatomotor representations involved in speech production. In

Concluding remarks

“Whereas elementary functions of a tissue can, by definition, have a precise localization in particular cell groups, there can of course be no question of the localization of complex functional systems in limited areas of the brain or of its cortex.” Luria (1976), p. 30.

In this paper we have proposed a multiple-pathway account of auditory processes that are critically important for a complex and uniquely human function – the comprehension of spoken language. As the quotation from Luria

Acknowledgements

Preparation of this paper was supported by the UK Medical Research Council, and the Canada Research Chairs program. We thank Maggie Kemmner, Sarah Hawkins and two anonymous reviewers for comments on an earlier draft of the paper.

References (176)

J.E. Andruski et al.
The effect of subphonetic differences on lexical access
Cognition
(1994)
J. Barker et al.
Is the sine-wave speech cocktail party worth attending?
Speech Commun.
(1999)
M.R. Brent et al.
Distributional regularity and phonotactic constraints are useful for segmentation
Cognition
(1996)
B.R. Buchsbaum et al.
Human dorsal and ventral auditory streams subserve rehearsal-based and echoic processes during verbal working memory
Neuron
(2005)
P. Cairns et al.
Bootstrapping word boundaries: a bottom-up corpus based approach to speech segmentation
Cogn. Psychol.
(1997)
D.E. Callan et al.
Learning-induced neural plasticity associated with improved identification performance after training of a difficult second-language phonetic contrast
Neuroimage
(2003)
A. Cutler et al.
The predominance of strong initial syllables in the English vocabulary
Comput. Speech Lang.
(1987)
G. Dehaene-Lambertz et al.
Neural correlates of switching from auditory to speech perception
Neuroimage
(2005)
I.T. Diamond et al.
The projection of the auditory cortex upon the diencephalon and brain stem in the cat
Brain Res.
(1969)
J.L. Elman et al.
Cognitive penetration of the mechanisms of perception: Compensation for coarticulation of lexically restored phonemes
J. Mem. Lang.
(1988)

J.A. Fodor et al.

The psychological reality of linguistic segments

J. Verb. Learn. Verb. Behav.

(1965)

A.D. Friederici et al.

Auditory language comprehension: an event-related fMRI study on the processing of syntactic and lexical information

Brain Lang.

(2000)

S. Garrod et al.

Why is conversation so easy?

Trends Cogn. Sci.

(2004)

N. Golestani et al.

Learning new sounds of speech: reallocation of neural substrates

Neuroimage

(2004)

F.H. Guenther et al.

Neural modeling and imaging of the cortical interactions underlying syllable production

Brain Lang.

(2006)

T.A. Hackett et al.

Prefrontal connections of the parabelt auditory cortex in macaque monkeys

Brain Res.

(1999)

T. Hartley et al.

A linguistically constrained model of short term memory for nonwords

J. Mem. Lang.

(1996)

S. Hawkins

Roles and representations of systematic fine phonetic detail in speech understanding

J. Phonetics

(2003)

G. Hickok et al.

Dorsal and ventral streams: a framework for understanding aspects of the functional anatomy of language

Cognition

(2004)

R.F. Huffman et al.

The descending auditory pathway and acousticomotor systems: connections with the inferior colliculus

Brain Res. Brain Res. Rev.

(1990)

J.H. Kaas et al.

Auditory processing in primate cerebral cortex

Curr. Opin. Neurobiol.

(1999)

D. Kersten et al.

Bayesian models of object perception

Curr. Opin. Neurobiol.

(2003)

T. Kraljic et al.

Perceptual learning for speech: Is there a return to normal?

Cogn. Psychol.

(2005)

I. Lehiste

Isochrony reconsidered

J. Phonetics

(1977)

A.M. Liberman et al.

On the relation of speech to language

Trends Cogn. Sci.

(2000)

D.G. MacKay et al.

Relations between word perception and production: New theory and data on the verbal transformation effect

J. Mem. Lang.

(1993)

J.S. Magnuson et al.

Lexical effects on compensation for coarticulation: The ghost of Christmash past

Cogn. Sci.

(2003)

V.A. Mann et al.

Some differences between phonetic and auditory modes of perception

Cognition

(1983)

E. Ahissar et al.

Speech comprehension is correlated with temporal response patterns recorded from auditory cortex

Proc. Natl. Acad. Sci. USA

(2001)

A.D. Baddeley

Working Memory

(1986)

P.J. Bailey et al.

Information in speech: observations on the perception of [s]-stop clusters

J. Exp. Psychol. Hum. Percept. Perform.

(1980)

S.A. Brady et al.

Range effect in the perception of voicing

J. Acoust. Soc. Am.

(1978)

A.S. Bregman

Auditory Scene Analysis

(1990)

M.R. Brent

Towards a unified model of lexical acquisition and lexical access

J. Psycholinguist. Res.

(1997)

J.F. Brugge et al.

Functional connections between auditory cortex on Heschl’s gyrus and on the lateral superior temporal gyrus in humans

J. Neurophysiol.

(2003)

J. Bybee et al.

Alternatives to the combinatorial paradigm of linguistic theory based on domain general principles of human cognition

Linguist. Rev.

(2005)

R.P. Carlyon et al.

Effects of attention and unilateral neglect on auditory stream segregation

J. Exp. Psychol. Hum. Percept. Perform.

(2001)

R.P. Carlyon et al.

The continuity illusion and vowel identification

Acta Acust. Unit. Acust.

(2002)

M.H. Christiansen et al.

Learning to segment speech using multiple cues: a connectionist model

Lang. Cognitive Process.

(1998)

C.M. Clarke et al.

Rapid adaptation to foreign-accented English

J. Acoust. Soc. Am.

(2004)

R.G. Crowder

The purity of auditory memory

Philos. Trans. Royal Soc. Lond. B Biol. Sci.

(1983)

R.G. Crowder et al.

Precategorical acoustic storage

Percept. Psychophys.

(1969)

A. Cutler et al.

The role of strong syllables in segmentation for lexical access

J. Exp. Psychol. Hum. Percept. Perform.

(1988)

J.E. Cutting

Aspects of phonological fusion

J. Exp. Psychol. Hum. Percept. Perform.

(1975)

M.H. Davis

Connectionist modelling of lexical segmentation and vocabulary acquisition

M.H. Davis et al.

Hierarchical processing in spoken language comprehension

J. Neurosci.

(2003)

M.H. Davis et al.

Leading up the lexical garden path: segmentation and ambiguity in spoken word recognition

J. Exp. Psychol. Hum. Percept. Perform.

(2002)

M.H. Davis et al.

Lexical information drives perceptual learning of distorted speech: evidence from the comprehension of noise-vocoded sentences

J. Exp. Psychol. Gen.

(2005)

Davis, M.H., Coleman, M.R., Absalom, A., Rodd, J.M., Johnsrude, I.S., Matta, B., Owen, A.M., Menon, D.K., in...

L.A. de la Mothe et al.

Cortical connections of the auditory cortex in marmoset monkeys: core and medial belt regions

J. Comp. Neurol.

(2006)

Cited by (311)

Contra assertions, feedback improves word recognition: How feedback and lateral inhibition sharpen signals over noise
2024, Cognition
Whether top-down feedback modulates perception has deep implications for cognitive theories. Debate has been vigorous in the domain of spoken word recognition, where competing computational models and agreement on at least one diagnostic experimental paradigm suggest that the debate may eventually be resolvable. Norris and Cutler (2021) revisit arguments against lexical feedback in spoken word recognition models. They also incorrectly claim that recent computational demonstrations that feedback promotes accuracy and speed under noise (Magnuson et al., 2018) were due to the use of the Luce choice rule rather than adding noise to inputs (noise was in fact added directly to inputs). They also claim that feedback cannot improve word recognition because feedback cannot distinguish signal from noise. We have two goals in this paper. First, we correct the record about the simulations of Magnuson et al. (2018). Second, we explain how interactive activation models selectively sharpen signals via joint effects of feedback and lateral inhibition that boost lexically-coherent sublexical patterns over noise. We also review a growing body of behavioral and neural results consistent with feedback and inconsistent with autonomous (non-feedback) architectures, and conclude that parsimony supports feedback. We close by discussing the potential for synergy between autonomous and interactive approaches.
Of words and whistles: Statistical learning operates similarly for identical sounds perceived as speech and non-speech
2024, Cognition
Statistical learning is an ability that allows individuals to effortlessly extract patterns from the environment, such as sound patterns in speech. Some prior evidence suggests that statistical learning operates more robustly for speech compared to non-speech stimuli, supporting the idea that humans are predisposed to learn language. However, any apparent statistical learning advantage for speech could be driven by signal acoustics, rather than the subjective perception per se of sounds as speech. To resolve this issue, the current study assessed whether there is a statistical learning advantage for ambiguous sounds that are subjectively perceived as speech-like compared to the same sounds perceived as non-speech, thereby controlling for acoustic features. We first induced participants to perceive sine-wave speech (SWS)—a degraded form of speech not immediately perceptible as speech—as either speech or non-speech. After this induction phase, participants were exposed to a continuous stream of repeating trisyllabic nonsense words, composed of SWS syllables, and then completed an explicit familiarity rating task and an implicit target detection task to assess learning. Critically, participants showed robust and equivalent performance on both measures, regardless of their subjective speech perception. In contrast, participants who perceived the SWS syllables as more speech-like showed better detection of individual syllables embedded in speech streams. These results suggest that speech perception facilitates processing of individual sounds, but not the ability to extract patterns across sounds. Our findings suggest that statistical learning is not influenced by the perceived linguistic relevance of sounds, and that it may be conceptualized largely as an automatic, stimulus-driven mechanism.
A representation of abstract linguistic categories in the visual system underlies successful lipreading
2023, NeuroImage
There is considerable debate over how visual speech is processed in the absence of sound and whether neural activity supporting lipreading occurs in visual brain areas. Much of the ambiguity stems from a lack of behavioral grounding and neurophysiological analyses that cannot disentangle high-level linguistic and phonetic/energetic contributions from visual speech. To address this, we recorded EEG from human observers as they watched silent videos, half of which were novel and half of which were previously rehearsed with the accompanying audio. We modeled how the EEG responses to novel and rehearsed silent speech reflected the processing of low-level visual features (motion, lip movements) and a higher-level categorical representation of linguistic units, known as visemes. The ability of these visemes to account for the EEG – beyond the motion and lip movements – was significantly enhanced for rehearsed videos in a way that correlated with participants’ trial-by-trial ability to lipread that speech. Source localization of viseme processing showed clear contributions from visual cortex, with no strong evidence for the involvement of auditory areas. We interpret this as support for the idea that the visual system produces its own specialized representation of speech that is (1) well-described by categorical linguistic features, (2) dissociable from lip movements, and (3) predictive of lipreading ability. We also suggest a reinterpretation of previous findings of auditory cortical activation during silent speech that is consistent with hierarchical accounts of visual and audiovisual speech perception.
Multivariate fMRI responses in superior temporal cortex predict visual contributions to, and individual differences in, the intelligibility of noisy speech
2023, NeuroImage
Humans have the unique ability to decode the rapid stream of language elements that constitute speech, even when it is contaminated by noise. Two reliable observations about noisy speech perception are that seeing the face of the talker improves intelligibility and the existence of individual differences in the ability to perceive noisy speech. We introduce a multivariate BOLD fMRI measure that explains both observations. In two independent fMRI studies, clear and noisy speech was presented in visual, auditory and audiovisual formats to thirty-seven participants who rated intelligibility. An event-related design was used to sort noisy speech trials by their intelligibility. Individual-differences multidimensional scaling was applied to fMRI response patterns in superior temporal cortex and the dissimilarity between responses to clear speech and noisy (but intelligible) speech was measured. Neural dissimilarity was less for audiovisual speech than auditory-only speech, corresponding to the greater intelligibility of noisy audiovisual speech. Dissimilarity was less in participants with better noisy speech perception, corresponding to individual differences. These relationships held for both single word and entire sentence stimuli, suggesting that they were driven by intelligibility rather than the specific stimuli tested. A neural measure of perceptual intelligibility may aid in the development of strategies for helping those with impaired speech perception.
The effects of speech masking on neural tracking of acoustic and semantic features of natural speech
2023, Neuropsychologia
Listening environments contain background sounds that mask speech and lead to communication challenges. Sensitivity to slow acoustic fluctuations in speech can help segregate speech from background noise. Semantic context can also facilitate speech perception in noise, for example, by enabling prediction of upcoming words. However, not much is known about how different degrees of background masking affect the neural processing of acoustic and semantic features during naturalistic speech listening. In the current electroencephalography (EEG) study, participants listened to engaging, spoken stories masked at different levels of multi-talker babble to investigate how neural activity in response to acoustic and semantic features changes with acoustic challenges, and how such effects relate to speech intelligibility. The pattern of neural response amplitudes associated with both acoustic and semantic speech features across masking levels was U-shaped, such that amplitudes were largest for moderate masking levels. This U-shape may be due to increased attentional focus when speech comprehension is challenging, but manageable. The latency of the neural responses increased linearly with increasing background masking, and neural latency change associated with acoustic processing most closely mirrored the changes in speech intelligibility. Finally, tracking responses related to semantic dissimilarity remained robust until severe speech masking (−3 dB SNR). The current study reveals that neural responses to acoustic features are highly sensitive to background masking and decreasing speech intelligibility, whereas neural responses to semantic features are relatively robust, suggesting that individuals track the meaning of the story well even in moderate background sound.
Temporal lobe perceptual predictions for speech are instantiated in motor cortex and reconciled by inferior frontal cortex
2023, Cell Reports
Humans use predictions to improve speech perception, especially in noisy environments. Here we use 7-T functional MRI (fMRI) to decode brain representations of written phonological predictions and degraded speech signals in healthy humans and people with selective frontal neurodegeneration (non-fluent variant primary progressive aphasia [nfvPPA]). Multivariate analyses of item-specific patterns of neural activation indicate dissimilar representations of verified and violated predictions in left inferior frontal gyrus, suggestive of processing by distinct neural populations. In contrast, precentral gyrus represents a combination of phonological information and weighted prediction error. In the presence of intact temporal cortex, frontal neurodegeneration results in inflexible predictions. This manifests neurally as a failure to suppress incorrect predictions in anterior superior temporal gyrus and reduced stability of phonological representations in precentral gyrus. We propose a tripartite speech perception network in which inferior frontal gyrus supports prediction reconciliation in echoic memory, and precentral gyrus invokes a motor model to instantiate and refine perceptual predictions for speech.
