Research Article: New Research, Cognition and Behavior

Dynamic Time-Locking Mechanism in the Cortical Representation of Spoken Words

A. Nora, A. Faisal, J. Seol, H. Renvall, E. Formisano and R. Salmelin
eNeuro 8 June 2020, 7 (4) ENEURO.0475-19.2020; DOI: https://doi.org/10.1523/ENEURO.0475-19.2020
Author affiliations:
1. Department of Neuroscience and Biomedical Engineering, and Aalto NeuroImaging, Aalto University, Espoo FI-00076, Finland (A. Nora, A. Faisal, J. Seol, H. Renvall, R. Salmelin)
2. Department of Cognitive Neuroscience, Faculty of Psychology and Neuroscience, Maastricht University, Maastricht 6200 MD, The Netherlands (E. Formisano)
3. Maastricht Center for Systems Biology (MaCSBio), Maastricht University, Maastricht 6200 MD, The Netherlands (E. Formisano)

Figures & Data

Figures

  • Figure 1.

    Different decoding models and their prediction accuracy. A, Stimulus features for a spoken word. The FFT model represents the frequency spectrum of the sound extracted in 128 frequency bands with logarithmically spaced center frequencies. The MPS model represents energy in four spectral scales (wide to narrow) and four temporal modulation rates (slow to fast), averaged over time. The sound spectrogram quantifies the time-evolving frequency content of the sound, extracted in 128 frequency bands and 10-ms time windows using short-term FFT. The amplitude envelope carries the temporal changes without frequency information (shown below the phoneme model). The phoneme sequence is the phoneme annotation of the word for each 10-ms time window. Semantic features were represented by scores on 99 questions (a few example questions shown) and a 300-dimensional vector trained on the co-occurrences of context words (words occurring near the stimulus words) in a large text corpus. B, Model estimation for the regression model (left) and the convolution model (right). A mapping between the cortical MEG responses (here illustrated for one sensor) and each stimulus feature was learned with a kernel regression or kernel convolution model. As illustrated here, the regression model predicts, for example, power at each frequency-rate-scale point of the MPS by multiplying cortical responses at all (or selected) time points with unknown weights (w). The convolution model predicts the amplitude at each time-frequency point of the spectrogram by convolving the time sequence of cortical responses with an unknown spatiotemporal response function (g). Specifically, values at each frequency band of a new spoken word are predicted at each time point t (moving from 0 to the end of the sound) based on MEG responses in the time range from (t – τ2) to (t – τ1), here illustrated for the lag window –τ2 = 100 to –τ1 = 180 ms at time points t. C, Model testing aimed to tell apart two left-out sounds by reconstructing their sound features (here, the spectrogram) and correlating them with the original features. The procedure was repeated for all possible pairings of sounds. Predictive accuracy for the spoken words across all test sound pairs and 16 participants (mean ± SEM) is shown on the right. The regression model was used for the decoding of non-time-varying features (FFT frequency bins/MPS rate-scale-frequency points/semantic question scores and corpus statistics), and the convolution model was used for decoding of time-varying features (spectrogram time-frequency points/phonemes at each time point/amplitude envelope); for a control analysis, the spectrogram was also decoded with the regression model. Predictive accuracy improved markedly when spoken words were modeled with the convolution spectrogram model, formalizing the concept that the neuronal population response closely follows, in time, the unfolding acoustics of the sound. Even better performance was obtained when the spoken words were modeled using both the spectrogram and the phoneme sequence descriptions, and the best performance was obtained using the sounds’ amplitude envelope.
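The estimation and testing procedure summarized in panels B and C can be sketched in code. The sketch below is illustrative only, not the authors' implementation: plain linear ridge regression stands in for the kernel convolution model, MEG and spectrogram samples are assumed to share a 1-sample-per-ms time grid, and the array names (meg, spec) and helper functions are invented for clarity.

```python
# Illustrative sketch only (not the authors' code): linear ridge regression
# stands in for the kernel convolution model; 1-ms sampling and array names
# are assumptions. The full-dimensional problem would in practice be solved
# with kernel methods, as in the paper.
import numpy as np
from numpy.linalg import solve

def lagged_meg(meg, t, lag_min, lag_max):
    """MEG samples lag_min..lag_max ms after stimulus time t, all sensors stacked."""
    return meg[:, t + lag_min:t + lag_max].ravel()

def fit_convolution_model(meg_train, spec_train, lag_min=100, lag_max=180, alpha=1.0):
    """Learn a linear map from the lagged MEG response to each spectrogram column."""
    X, Y = [], []
    for meg, spec in zip(meg_train, spec_train):          # one (MEG, spectrogram) pair per word
        for t in range(min(spec.shape[1], meg.shape[1] - lag_max)):
            X.append(lagged_meg(meg, t, lag_min, lag_max))
            Y.append(spec[:, t])                          # all frequency bands at time t
    X, Y = np.asarray(X), np.asarray(Y)
    # Ridge solution: W = (X'X + alpha*I)^-1 X'Y
    return solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ Y)

def reconstruct_spectrogram(meg, n_freq, n_times, W, lag_min=100, lag_max=180):
    """Predict the spectrogram of a left-out word from its MEG response."""
    pred = np.zeros((n_freq, n_times))
    for t in range(min(n_times, meg.shape[1] - lag_max)):
        pred[:, t] = lagged_meg(meg, t, lag_min, lag_max) @ W
    return pred

def pairwise_test(spec_a, spec_b, pred_a, pred_b):
    """Panel C: the pair counts as correct if the matched original-reconstruction
    correlations exceed the swapped ones."""
    r = lambda x, y: np.corrcoef(x.ravel(), y.ravel())[0, 1]
    return (r(spec_a, pred_a) + r(spec_b, pred_b)) > (r(spec_a, pred_b) + r(spec_b, pred_a))
```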

  • Figure 2.

    Influence of convolution lag and cortical sources for spectrogram and phoneme decoding. A, Spectrogram and phoneme decoding accuracy at different lags between a time point in the stimulus and a time window in the MEG response (average across all 16 participants ± SEM). Note that the lag window does not correspond to timing relative to the stimulus onset in the MEG evoked response. The best predictive accuracy for decoding spectrogram and phoneme sequence of spoken words was reached with a lag of 100–180 ms (significant difference to 180- to 260-ms lag, p = 0.000031 for spectrogram, p = 0.016 for phonemes). B, Cortical sources contributing to decoding of acoustic and phoneme features in spoken words with the convolution model at 100- to 180-ms lag (this time window showed best performance). Color scale denotes average decoding accuracy (>50%) across all 16 participants.
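The lag comparison in panel A amounts to refitting the same reconstruction model for several lag windows and averaging the pairwise decoding accuracy within each window. A minimal sketch, reusing the hypothetical helpers from the Figure 1 sketch and a single train/test split for brevity (the caption does not specify the cross-validation scheme):

```python
# Sketch of a lag-window sweep; reuses the hypothetical fit_convolution_model,
# reconstruct_spectrogram, and pairwise_test helpers from the Figure 1 sketch.
import itertools
import numpy as np

def accuracy_for_lag(meg_train, spec_train, meg_test, spec_test, lag_min, lag_max):
    W = fit_convolution_model(meg_train, spec_train, lag_min, lag_max)
    preds = [reconstruct_spectrogram(m, s.shape[0], s.shape[1], W, lag_min, lag_max)
             for m, s in zip(meg_test, spec_test)]
    hits = [pairwise_test(spec_test[i], spec_test[j], preds[i], preds[j])
            for i, j in itertools.combinations(range(len(spec_test)), 2)]
    return np.mean(hits)

# Usage (with data prepared as in the Figure 1 sketch); the grid includes the
# 100-180 ms and 180-260 ms windows compared in the caption:
#   for lo, hi in [(20, 100), (100, 180), (180, 260), (260, 340)]:
#       print(lo, hi, accuracy_for_lag(meg_tr, spec_tr, meg_te, spec_te, lo, hi))
```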

  • Figure 3.

    Comparison of different acoustic models for environmental sound decoding. A, Predictive accuracy for environmental sounds (blue) across all test sound pairs and 16 participants (mean ± SEM). The regression model was used for the decoding of non-time-varying features (FFT frequency bins/MPS rate-scale-frequency points/semantic features), and the convolution model was used for decoding of time-varying features (spectrogram time-frequency points/amplitude envelope); for a control analysis, the spectrogram was also decoded with the regression model. B, Effect of convolution lag on spectrogram decoding accuracy for environmental sounds (average across all 16 participants ± SEM). C, Cortical sources contributing to decoding of acoustic features in environmental sounds with the MPS regression model at 50–100 ms after stimulus onset (performance was best with this model and time window). Color scale denotes average decoding accuracy (>50%) across all 16 participants.
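For the non-time-varying features (FFT, MPS, semantic features), decoding reduces to a regression from a fixed MEG time window onto one feature vector per sound. A minimal sketch, with dual-form (linear-kernel) ridge regression standing in for the kernel regression used in the paper; the 50-100 ms window is taken from panel C, and the array names and shapes are assumptions:

```python
# Sketch of regression-based decoding of non-time-varying features (e.g., MPS
# rate-scale-frequency points) from a fixed MEG window. Dual-form ridge with a
# linear kernel stands in for the kernel regression; names/shapes are assumed.
import numpy as np
from numpy.linalg import solve

def fit_feature_decoder(meg_epochs, features, t0=50, t1=100, alpha=1.0):
    """meg_epochs: (n_sounds, n_sensors, n_times); features: (n_sounds, n_features)."""
    X = meg_epochs[:, :, t0:t1].reshape(len(meg_epochs), -1)   # one row per sound
    K = X @ X.T                                                # linear kernel between sounds
    dual = solve(K + alpha * np.eye(len(X)), features)         # kernel ridge coefficients
    return X.T @ dual                                          # equivalent primal weights W

def predict_features(meg_epoch, W, t0=50, t1=100):
    """Reconstruct the feature vector of one left-out sound from its MEG window."""
    return meg_epoch[:, t0:t1].ravel() @ W
```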

  • Figure 4.

    Comparison of human nonverbal sounds and spoken words. A, Dissimilarity (1 minus correlation) of the MPS temporal rates across the environmental sounds and spoken words. B, Average temporal modulation rate × frequency representations for human sounds (top) and spoken words (bottom). C, Power (mean ± SEM) at different temporal modulation rates for spoken words (red), human nonverbal sounds (gray) and other categories of environmental sounds (blue). D, Dissimilarity of different frequency bands of the sounds, indicating the degree of correlated temporal modulations (co-modulations), calculated separately for each item, and averaged over spoken words (left), human nonverbal sounds (middle), and other environmental sounds (right).
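Both dissimilarity measures in this figure are 1 minus Pearson correlation, computed either across sounds (panel A) or across frequency bands within a single sound (panel D). A minimal sketch with assumed array shapes:

```python
# Sketch of the 1-minus-correlation dissimilarities in Figure 4 (shapes assumed).
import numpy as np

def dissimilarity_across_sounds(rate_profiles):
    """rate_profiles: (n_sounds, n_rates) MPS temporal-rate vectors, one row per sound.
    Returns the (n_sounds, n_sounds) matrix of 1 - Pearson correlation (panel A)."""
    return 1.0 - np.corrcoef(rate_profiles)

def band_comodulation(spectrogram):
    """spectrogram: (n_freq_bands, n_times) for a single sound.
    1 - correlation between the temporal envelopes of each pair of frequency bands;
    low values indicate strongly co-modulated bands (panel D)."""
    return 1.0 - np.corrcoef(spectrogram)
```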

  • Figure 5.

    Comparison of semantic decoding for spoken words and environmental sounds. A, Time course of regression-model based predictive accuracy in semantic features of environmental sounds and spoken words (16 participants, mean ± SEM). The values on the x-axis represent the starting time of the 50-ms time windows in the MEG responses. Time 0 indicates the onset of the sound stimulus. Decoding of the semantic feature set was performed for successive 50-ms time windows for the whole stimulus duration, for both stimulus types. The time windows with statistically significant decoding (p < 0.01) are indicated by thick horizontal lines above the x-axis. Gray solid line denotes chance level performance (50%). The gray bar represents the time window used for source level decoding. B, Source areas contributing to decoding of semantic features for environmental sounds and spoken words at 650–700 ms from stimulus onset. This time window had significant decoding performance for both classes of sounds.
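The time course in panel A comes from repeating the same feature decoding in successive 50-ms windows. A minimal sketch, reusing the hypothetical regression helpers from the Figure 3 sketch and the pairwise scoring from the Figure 1 sketch; the leave-out scheme is simplified here, whereas a full analysis would exclude the tested sound pair from training:

```python
# Sketch of the sliding-window semantic decoding time course (panel A).
# Reuses the hypothetical fit_feature_decoder/predict_features helpers; a full
# analysis would hold the tested sound pair out of the training set.
import itertools
import numpy as np

def semantic_time_course(meg_epochs, semantics, win=50, step=50):
    """meg_epochs: (n_sounds, n_sensors, n_times); semantics: (n_sounds, n_features)."""
    n_times = meg_epochs.shape[2]
    r = lambda x, y: np.corrcoef(x, y)[0, 1]
    accuracies = []
    for t0 in range(0, n_times - win + 1, step):
        W = fit_feature_decoder(meg_epochs, semantics, t0, t0 + win)
        preds = np.array([predict_features(e, W, t0, t0 + win) for e in meg_epochs])
        hits = [(r(semantics[i], preds[i]) + r(semantics[j], preds[j]))
                > (r(semantics[i], preds[j]) + r(semantics[j], preds[i]))
                for i, j in itertools.combinations(range(len(semantics)), 2)]
        accuracies.append(np.mean(hits))
    return np.array(accuracies)   # one accuracy value per 50-ms window start
```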

Extended Data

  • Extended Data Figure 1-1

    Examples of spoken words (left) and environmental sounds (right), one from each semantic category. For each sound, the signal waveform is depicted on top, with the three models of the sound below: frequency spectrum (FFT), spectrogram, and MPS (temporal rates and spectral scales shown here separately). Note the considerable variability in the spectral content and temporal evolution of the different sounds, especially for the environmental sounds. Download Figure 1-1, EPS file.

  • Audio 1.

    Reconstructed audio files of a selection of spoken words and environmental sounds, based on the convolution model and spectrogram. The original stimulus is presented first, followed by the reconstructed sound; this order of presentation is used to ease the listener’s perception of relevant speech features, but please note that it produces a priming effect. For corresponding spectrograms, see Extended Data Figure 2-1. Download Audio 1, WAV file.

  • Extended Data Figure 1-2

    A, Sensor-level responses to spoken words (left, orange) and environmental sounds (right, blue), in one MEG sensor above the left and right temporal lobes, for one participant. The signals were averaged across 20 presentations of the same sound, from 300 ms before to 2000 ms after stimulus onset, and here also averaged over the different items within each semantic category (for visualization only). Both spoken words and environmental sounds elicited a typical time sequence of activation, with a transient response at about 100 ms that was followed by a more sustained response from about 250 ms onwards and return to baseline after 1000 ms. B, Grand average (n = 16) dSPMs to spoken words (left) and environmental sounds (right) demonstrate that, for both types of stimuli, activation originated mainly in the bilateral temporal regions in the vicinity of the primary auditory cortex, with additional activation in inferior frontal areas, and the left hemisphere was highlighted particularly for spoken words in the later time window. Download Figure 1-2, EPS file.

  • Audio 2.

    Reconstructed audio files of a selection of environmental sounds, based on the convolution model and spectrogram. The original stimulus is presented first, followed by the reconstructed sound. For corresponding spectrograms, see Extended Data Figure 2-1. Download Audio 2, WAV file.

  • Extended Data Figure 1-3

    Table containing a complete list of stimulus items. Download Figure 1-3, DOC file.

  • Extended Data Figure 2-1

    Original and reconstructed spectrograms of selected spoken words (top) and environmental sounds (bottom) based on the convolution model. The corresponding sound files are in Audios 1, 2. Download Figure 2-1, EPS file.

  • Extended Data Figure 3-1

    A, The median (with 25th and 75th percentiles) power spectra (left), as well as spectral scales (middle) and temporal rates (right) pooled across the 128 frequency bins, for speech and environmental stimuli, illustrating the differences between spoken words and environmental sounds. The environmental sounds also show more variability in their spectral scales and temporal rates, whereas spoken words display prominent slow temporal modulations. B, Pair-wise distances (1 minus correlation) of the original stimulus MPS and spectrogram (mean ± SD). C, Leave-one-out reconstruction fidelity, i.e., correlations of reconstructed and original MPSs/spectrograms/amplitude envelopes for spoken words and environmental sounds (mean ± SEM). D, Scatterplot of the pair-wise distances (1 minus correlation) among original (x-axis) and reconstructed (y-axis) spectrograms. Download Figure 3-1, EPS file.
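The reconstruction fidelity in panel C is the correlation between each reconstructed feature representation and its original, summarized across sounds. A minimal sketch with assumed inputs:

```python
# Sketch of leave-one-out reconstruction fidelity (panel C): correlate each
# reconstructed representation with its original and summarize (inputs assumed).
import numpy as np

def reconstruction_fidelity(originals, reconstructions):
    """originals, reconstructions: lists of equally shaped arrays, one per sound."""
    r = np.array([np.corrcoef(o.ravel(), p.ravel())[0, 1]
                  for o, p in zip(originals, reconstructions)])
    return r.mean(), r.std(ddof=1) / np.sqrt(len(r))   # mean and SEM across sounds
```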

  • Extended Data Figure 3-2

    The median (with 25th and 75th percentiles) power spectra (left), as well as spectral scales (middle) and temporal rates (right) pooled across the 128 frequency bins, for speech and different categories of environmental stimuli. The human non-speech sounds are most similar to speech in their spectral scales and temporal rates. Download Figure 3-2, EPS file.

  • Extended Data Figure 5-1

    Illustration of the cortical parcellation template and the top 20 ranking cortical areas (parcels) for the decoding of acoustic (orange) and semantic (red) features for spoken words (left), and acoustic (light blue), and semantic (dark blue) features for environmental sounds (right). The parcel names are listed in the table below. Sources for acoustic decoding of spoken words are based on the convolution model and spectrogram, with the best lag (100–180 ms). Sources for acoustic decoding of environmental sounds are based on the regression model and MPS, at 50–100 ms. For semantic decoding, the regression model at 650–700 ms was used for both classes of sounds. L = left, R = right. Download Figure 5-1, EPS file.


Keywords

  • auditory system
  • magnetoencephalography
  • neural decoding
  • speech processing
