Combining predictive coding and neural oscillations enables online syllable recognition in natural speech

Hovsepyan, Sevada; Olasagasti, Itsaso; Giraud, Anne-Lise

doi:10.1038/s41467-020-16956-5

Download PDF

Article
Open access
Published: 19 June 2020

Combining predictive coding and neural oscillations enables online syllable recognition in natural speech

Nature Communications volume 11, Article number: 3117 (2020) Cite this article

7656 Accesses
37 Citations
70 Altmetric
Metrics details

Subjects

Abstract

On-line comprehension of natural speech requires segmenting the acoustic stream into discrete linguistic elements. This process is argued to rely on theta-gamma oscillation coupling, which can parse syllables and encode them in decipherable neural activity. Speech comprehension also strongly depends on contextual cues that help predicting speech structure and content. To explore the effects of theta-gamma coupling on bottom-up/top-down dynamics during on-line syllable identification, we designed a computational model (Precoss—predictive coding and oscillations for speech) that can recognise syllable sequences in continuous speech. The model uses predictions from internal spectro-temporal representations of syllables and theta oscillations to signal syllable onsets and duration. Syllable recognition is best when theta-gamma coupling is used to temporally align spectro-temporal predictions with the acoustic input. This neurocomputational modelling work demonstrates that the notions of predictive coding and neural oscillations can be brought together to account for on-line dynamic sensory processing.

Speech rhythms and their neural foundations

Article 06 May 2020

David Poeppel & M. Florencia Assaneo

Sequences of Intonation Units form a ~ 1 Hz rhythm

Article Open access 28 September 2020

Maya Inbar, Eitan Grossman & Ayelet N. Landau

Neural dynamics of phoneme sequences reveal position-invariant code for content and order

Article Open access 03 November 2022

Laura Gwilliams, Jean-Remi King, … David Poeppel

Introduction

Neural oscillations are involved in many different cognitive operations^1,2,3, and considering their cross-frequency coupling permits to more closely approach their function, e.g., perception, memory, attention, etc.⁴. In the domain of natural speech recognition, an important role has been assigned to the coupling of theta and gamma oscillations^5,6,7, as it permits to hierarchically coordinate the encoding of phonemes within syllables, without prior knowledge of their duration and temporal occurrence, i.e. in a purely bottom-up online manner⁸.

Natural speech recognition also strongly relies on contextual cues to anticipate the content and temporal structure of the speech signal^9,10,11. Recent studies underline the importance of top-down predictive mechanisms during continuous speech perception and relate them to another range of oscillatory activity, the low-beta band^{12,13,14,15,16,17}. Predictive coding^18,19,20,21, on the other hand, offers a theory of brain function that, in common with Analysis-by-Synthesis^22,23 and the Bayesian Brain hypothesis²⁴, relies on the agent having an internal model of how sensory signals are generated from their underlying hidden causes. Predictive coding also provides a message passing scheme in which top-down predictions and bottom-up prediction errors work together to identify the hidden causes of sensory signals. It therefore incorporates the contextual and prior knowledge which are invoked as critical in speech processing²⁵.

Bottom-up and top-down approaches of speech processing both find support in theoretical studies. A neurocomputational model involving the coupling of realistic theta and gamma excitatory/inhibitory networks was able to pre-process speech in such a way that it could then be correctly decoded⁸. This model aimed at understanding the computational potential of biophysically realistic oscillatory neural processes rather than simply fitting existing data. A radically different model, solely based on predictive coding, could faithfully recognise isolated speech items (such as words, or full sentences when considered as a single speech item)²⁶. Although both approaches intend to describe speech perception, one model focused on the on-line parsing aspect of speech processing, and the other on the recognition of isolated speech segments (no parsing needed). Combining the physiological notion of neural oscillations with the cognitive notion of predictive coding is appealing²⁷ as it could broaden the capacity, improve performance, and enhance the biological realism of neurocomputational models of speech processing. More fundamentally, such an attempt offers the opportunity to explore the possible orchestration between two equally important neuroscientific levels of description, computational/algorithmic for analysis-by-synthesis and algorithmic/implementational for neural oscillations²⁸.

In this study, we addressed whether a predictive coding speech recognition model could benefit from neural oscillation processes. We designed the Precoss neurocomputational model based on the predictive coding framework in which we included theta and gamma oscillatory functions to deal with the continuous nature of natural speech. Although natural sentence comprehension involves many processing steps up to the syntax level, the parsing of syllables and their on-line recognition is a crucial step and a challenging issue, even for current automatic speech recognition (ASR) systems^29,30,31. The specific goal of this modelling work was hence to address whether combining predictive coding and neural oscillations could be advantageous for on-line identification of the syllabic components of natural sentences. Specifically, we examined the possible mechanisms by which theta oscillations can interact with bottom-up and top-down information flows and assessed the effects of this interaction on the efficacy of the syllable decoding process. We show that on-line syllable identification from speech works best when theta-gamma coupling is combined with the internal knowledge about syllables’ spectral structure, and more broadly when continuous inferential processes are informed by dynamic oscillation-based cues.

Results

Precoss architecture and theta-gamma coupling

An important prerequisite of the model is that it must be able to use the temporal information/cues present in continuous speech, to define syllable boundaries. We hypothesised that internal generative models including temporal predictions should benefit from such cues. To address this hypothesis and to account for recurrent processes occurring during speech recognition^13,32,33,34, we used a continuous predictive coding model (described in Methods). Our model explicitly separates what from when, with what referring to the identity of a syllable and its spectral representation (a non-temporal but ordered sequence of spectral vectors), and when to the prediction of the timing and duration of syllables as implemented through periodic/oscillatory processes^35,36. When predictions take two forms: syllable onset as signalled by a theta module, and syllable duration signalled either by exogenous or endogenous theta oscillations that set the duration of a sequence of gamma-timed units (see Methods for details) (Fig. 1).

**Fig. 1: Model of on-line syllable parsing and identification from natural sentences—Precoss.**

Precoss retrieves the sensory signal from internal representations about its causes by inverting a generative model. In this case the sensory input corresponds to the slow amplitude modulation of the speech waveform (see Methods section) and a 6-channel auditory spectrogram³⁷ of a full natural sentence (bottom rows of Fig. 1), which the model internally generates from the four components depicted in Fig. 1: 1/ a theta oscillation, 2/ a slow amplitude modulation unit in a theta-module, 3/ a pool of syllable units (as many syllables as present in the natural input sentence, i.e., from 4 to 25), and 4/ a bank of eight gamma units in a spectrotemporal-module. Together, gamma and syllable units generate top-down predictions about the input spectrogram. Each of the eight gamma units represents a phase in the syllable; they activate sequentially and the whole activation sequence repeats itself. Each syllable unit is hence associated with a sequence of eight vectors (one per gamma unit) with 6 components each (one per frequency channel) (see Supplementary Fig. 1). The acoustic spectrogram of a single syllable is generated by the activation of the corresponding syllable unit over the entire syllable duration. While the syllable unit encodes a specific acoustic pattern, the gamma units temporally deploy the corresponding spectral prediction over the syllable duration. Information about syllable duration is provided by the theta oscillation (dashed arrow in Fig. 1), its instantaneous rate affecting the rate/duration of the gamma sequence. Finally, the accumulated evidence about inferred syllable has to be erased before the next syllable is processed. To achieve this, the last (8th) gamma unit (Fig. 1 upwards blue arrows, Supplementary Fig. 1), which encodes the last part of the syllable, resets all syllable units to a common low activation level, enabling new evidence accumulation.

The model performance depends on whether the gamma sequence aligns with syllable onsets, and whether its duration is consistent with the syllable duration (50–600 ms, mean = 182 ms in the dataset used here, Fig. 2). To meet these essential criteria, the theta oscillator, as modelled using the Ermentrout-Koppel canonical model³⁸, is fed by the dynamics of the slow amplitude modulation of the speech signal inferred from the input, and whenever it reaches a specific, predefined phase, the model generates a Gaussian trigger signalling syllable onset. This setup implies that the operating theta frequency dynamically changes as a function of the inferred slow amplitude fluctuations and can hence estimate syllables duration. This estimation is used to adjust the duration of the gamma sequence (via a hidden variable controlling the rate of the gamma sequence—see Methods section, Eq. (12)).

The model’s estimate about the syllable sequence is provided by the syllable units, which, together with gamma units, generate expected spectrotemporal patterns that are compared with the input spectrogram. The model updates its estimates about the ongoing syllable to minimise the difference between the generated and the actual spectrogram. The activity level increases in those syllable units whose spectrogram is consistent with the sensory input and decreases in the others. In the ideal case, online prediction error minimisation leads to elevated activity in one single syllable unit matching the input syllable.

We quantify the model’s performance by its ability to correctly identify syllables in the input, which is determined by the activity of syllable units within the temporal window between two consecutive activations of the first gamma unit. For each window the model selects the syllable unit that has the highest average activation, the winning syllable. Optimal recognition occurs when the winning syllable and the inferred gamma sequence duration correspond to the real identity and duration of the syllable in the input (see corresponding Methods section and Supplementary Fig. 2 for more details).

Model variants and performance

The model presented above includes a physiologically motivated theta oscillation that is driven by the slow amplitude modulations of the speech waveform and signals information about syllable onset and duration to a gamma component. This theta-gamma coupling achieves the temporal alignment of internally generated predictions with the syllable boundaries detected from the input (variant A in Fig. 3). To assess the relevance of syllable timing based on the slow amplitude modulation, we compared the performance of model A against that of a variant in which theta activity is not modelled by an oscillation but emerges from the self-repetition of the gamma sequence (Model B, Fig. 3). In this alternative version, the duration of the gamma sequence is no longer controlled exogenously by the theta oscillation, but endogenously using a preferred gamma rate that, as the sequence repeats, results in an intrinsic theta rhythm (Eqs. (12) and (15)). As with the theta oscillation, the duration of the gamma sequence has a preferred rate in the theta range that can potentially adapt to variable syllable durations during the inference process, via prediction errors that change the hidden variable responsible for its rate. With this variant, the model allows for testing a theta rhythm arising from the repetition of the gamma sequence rather than from an explicit stimulus driven oscillator. This scheme where theta emerges from endogenously rhythmic modulations of gamma activity is argued in the literature³⁹ as a possible alternative to a separate cortical theta rhythm driving gamma activity⁴⁰. Note that both model versions (A and B) operate with a reset of accumulated evidence by silencing syllable units at the end of each syllable.

**Fig. 3: Precoss model variants and their performance.**

In order to more precisely assess the specific effects of theta-gamma coupling and of the reset of accumulated evidence in syllable units, we run additional simulations in two more variants of models A and B. Variants C and D differed by having no preferred gamma rate (Supplementary Fig. 1). Variants E and F further differed from variants C and D by having no reset of accumulated evidence in syllable units (Fig. 3, left panel, Table 1 in Methods). Among all model variants only A has a true theta-gamma coupling, where gamma activity is determined by the theta module (Fig. 3, red dashed arrow), whereas for B the gamma rate is set endogenously (red dashed circle). In all other cases, the rate of gamma activity can freely adapt to the input. To address the impact of the reset of gamma units by exogenous theta triggers we compared models A, C and E, with B, D and F, respectively.

Table 1 Triggers for each model configuration.

Full size table

Simulations were performed on natural sentences from the TIMIT⁴¹ speech corpus. Bar plots in Fig. 3 show the median performance for each architecture. Although all model variants performed well above chance, there were significant differences across them (p-values for pairwise comparisons and Bonferroni corrected critical α-value are presented in Supplementary Table 1). Relative to models A and B, performance was significantly lower in models E and F (on average by 23%), and C and D (by 15%) indicating that erasing accumulated evidence about the previous syllable before processing a new syllable is a crucial factor for syllable stream encoding in natural speech.

The comparison of variants A and B versus C and D indicates that theta-gamma coupling, whether stimulus-driven (A) or endogenous (B), significantly improved the model’s performance (on average by 8.6%).

The simulations further show that the model performed best when syllable units were reset after completion of each gamma-units sequence (based on internal information about the spectral syllable structure), and when the gamma rate was driven by theta-gamma coupling irrespective of whether it was stimulus- (red dashed arrow in A) or endogenously-timed (red dashed circle in B). The model’s performance with natural sentences hence depended neither on the precise signalling of syllable onsets via a stimulus-driven theta oscillation, nor on the exact mechanism of the theta-gamma coupling. Although this is a surprising finding, the absence of performance difference between stimulus-driven and endogenous theta-gamma coupling reflects that the duration of the syllables in natural, regular speech is very close to the model’s expectations, in which case there would be no advantage for a theta signal driven directly from the input.

To address this point, we run additional simulations on model variants A (theta-gamma coupling through stimulus-driven theta) and B (endogenous theta-gamma coupling via preferred gamma rate) using compressed speech signals (normal speed ×2 and ×3). Behavioural studies^14,42,43 show that comprehension remains almost intact when speech is compressed by 2 but drops for a compression factor of 3. Under such adverse conditions, stimulus-driven theta-gamma coupling might become beneficial for parsing and decoding syllables. Simulation results with compressed speech stimuli are presented in Fig. 4. For the sake of comparison, we display again performance of variants A and B with natural uncompressed sentences. As expected, overall performance dropped with increased compression factor. However, while for compression factor 2, there was still no significant difference between stimulus-driven and endogenous theta-gamma coupling, a significant difference emerged for compression factor 3, (mean difference = 1.74% [0.28, ∞, 95% CI], Cohen’s d = 0.1363, p = 0.0248), indicating that a stimulus-driven theta oscillation driving theta-gamma coupling was more advantageous for the syllable encoding process than an endogenously-set theta rate. These simulations therefore suggest that natural speech can be processed with a relatively fixed endogenous theta oscillator but that a more flexible, stimulus-driven, theta oscillator signalling precise syllable timing to the gamma encoding module, might become essential for decoding speech in adverse conditions such as when speech rate is variable.

**Fig. 4: Performance of model variants with and without stimulus-driven theta oscillator on compressed speech.**

Bayesian information criterion

The ability of the model to accurately recognise syllables in the input sentence does not take into account the variable complexity of the different models compared. We therefore estimated the Bayesian Information Criterion⁴⁴ (BIC) for each model, which quantifies the trade-off between model accuracy and complexity (Fig. 5).

Variant A, the model in which exogenously driven theta oscillation informs gamma rate and resets the accumulated evidence based on the internal knowledge of syllable structure, showed the highest BIC value. While, the comparison based on syllable recognition accuracy (Fig. 3) could not distinguish between variants A and B, the BIC (Supplementary Table 2) shows that variant A results in more confident syllable recognition than the model without stimulus-driven theta oscillation (Variant B).

Overall, these simulations show that two processes determine the performance of online syllable identification in the model. The first and most important one is the reset of accumulated evidence based on the model’s information about the syllable content (here its spectral structure). The second one is the coupling between theta and gamma periodic processes, which ensures that gamma activity is embedded within a theta cycle corresponding to the expected syllable duration. Importantly, the coupling was more efficient when it arose from a theta oscillator driven by the speech acoustics; the model was marginally more resilient to speech rate variations (Fig. 4), and it had the best accuracy versus complexity trade-off (Fig. 5).

Discussion

This work focuses on a necessary operation for speech recognition; the parsing of the continuous speech signal into discrete elements. Precoss was exapted from a speech recognition model²⁶ inspired by birdsong generation⁴⁵. While most simulations described in Yildiz et al.²⁶ were performed on monosyllabic, normalised (in the temporal domain) words (digits), our proposed model extends the approach to natural continuous speech made up of subunits (here syllables) whose order is not known a priori. Since speech is made of linguistic building blocks allowing for (quasi)infinite combinatorial possibilities, this constitutes an important step toward neurobiological realism compared with Yildiz et al.²⁶. The model emulates the combinatorial freedom of language by assuming that syllables present in a given sentence can appear in any order. Given real speech statistics, context can reduce—but not generally eliminate—the uncertainty about which syllable might come next. Our main observation suggests that when uncertainty about the identity of the next syllable remains, which is true in most cases, coordination of bottom-up and top-down information flow is critical for recognition and that neural oscillations can enhance this coordination by signalling syllable temporal structure (onsets and duration).

Syllabification is a non-trivial problem that is classically dealt with by off-line methods, e.g., Mermelstein algorithm⁴⁶. It was shown, however, that a theta-range natural oscillator built from reciprocal connections between excitatory and inhibitory leaky integrate-and-fire neurons could achieve highly accurate on-line syllabification^8,31. The efficiency of such theta oscillator is due to the fact that its intrinsic frequency can adapt, within limits, to that of an external stimulus. In Precoss, we took advantage of the theta oscillation for syllable parsing, using a simplified version of this network including one single canonical theta neuron³⁸, whose function is to signal syllable onset based on the slow amplitude modulation of the speech waveform. This simplification was necessary because integrate and fire neurons do not permit to work in a continuous framework. This simplified theta oscillation implementation allowed us to reach around 53% correct syllable onset detection (theta trigger within 50 ms of a real onset), a relatively modest rate, which however was sufficient to demonstrate the advantage of an oscillatory mechanism that can adapt to the acoustics. We indeed compared this model, which includes a stimulus-informed theta oscillation, with a model where theta-gamma coupling is endogenous, i.e., when it arises from the repetition of the gamma sequence at a predetermined rate that flexibly adapts to the variable syllable lengths during the inference process (Eq. (12), Methods). While both models performed similarly for natural speech up to a compression factor of 2, the model with stimulus informed theta oscillator performed better for a compression factor 3. The advantage of the latter model is that gamma activity precisely aligns with the syllables in the input, allowing to encode syllables of variable length as they occur in continuous natural speech.

A critical feature of the model is the coupling between the theta and gamma modules. Many experimental studies indicate that neural theta and gamma activity interact^2,8,47,48,49 and most likely that theta organises gamma activity to align neural encoding timing with the temporal structure of the auditory sensory signal. In speech processing, cross-frequency coupling is thought to preserve the hierarchy between phonemes, syllables and even higher order linguistic elements such as phrases^{7,50,51,52,53}. Here, by having two alternative implementations for the theta rhythm, we effectively modelled two forms of theta-gamma coupling: one in which the gamma sequence is controlled (onset and duration) by the stimulus-driven theta oscillation, and another one in which the theta rhythm emerges from an endogenous gamma sequence. The observation that model variants A and B, which implement one or the other theta-gamma coupling option perform better than the variants without coupling (C, D, E, F), suggests that the nesting of two temporal timescales facilitates on-line syllable recognition, most presumably by aligning top-down and bottom-up information flows with the syllabic structures in the input.

However, we also observed that theta-gamma coupling was not the most important factor for correct performance. Irrespective of theta-gamma coupling the model required explicit reset of syllable units to function at optimal level (Fig. 3 C, D vs. E, F): the high activity level of the winning syllable unit adversely affected the processing of the next syllable, reflecting that syllable units were integrating information across syllable boundaries. This issue was dealt with by incorporating a reset of the accumulated evidence about the previous syllable, which increased performance by up to 15–20%.

Importantly however, model variants A and B, which combine theta-gamma coupling with the reset of accumulated evidence in syllable units, performed better (by 8–10%) than model variants C and D that only relied on syllable units reset, even when the latter had a flexible gamma network rhythm and could in principle have adapted to the variable syllable length in the input sentences.

The performance measure was based on the syllable unit with the highest average activity within a cycle of the gamma network (Supplementary Fig. 2). This measure does not take into account model complexity or the robustness of the model’s syllable estimate. After accounting for both by considering BIC, we found that the model that best accounted for the actual syllables present in the input was variant A, the model in which both the theta rhythm and the gamma network rhythm are stimulus driven. The log-likelihood value used to calculate the BIC value uses the actual values of syllable units together with their precision, and therefore is more sensitive than the performance measure used in Fig. 3. The larger BIC for variant A suggests that even in the cases in which the identity of the syllable unit with highest average activity within a gamma network cycle is the same in both variants, the precision of recognised syllables may be higher for variant A than B. This also suggests that variant A might be more robust to other input modifications such as background noise.

The absence of a difference in our basic measure of performance between models with stimulus-driven versus endogenous theta-gamma coupling at natural speech rate is a puzzling finding. It suggests that the notion that a theta rhythm drives the gamma network integration window, i.e. that the gamma sequence has a plausible syllable-length duration, is a more important factor than a precise syllable timing. Yet, a possible reason for this observation could be that theta-signalled syllable onsets were too imprecise to provide an optimal gain (as the single neuron theta model only yielded 53% accuracy). To address this point, we ran yet another round of simulations where we compared model performance depending on whether we use the spiking theta oscillator or true syllable onsets to reset the gamma sequence (Fig. 6 model A versus A’).

**Fig. 6: Explicit versus theta signalled onsets.**

Performance was higher for variant A’ (Fig. 6, by 5.44% ± 2.04 (95% CI), Cohen’s d = 0.363, p = 3.63e-7), i.e. when endogenous theta-gamma coupling was driven by true syllable onset information, suggesting that the model can still benefit from accurate syllable onset signalling. Although such an ideal onset detection is non-realistic in a biological system and in natural listening conditions (noise, cocktail-party situations, etc.), this simulation suggests that by improving the way the model handles syllable onset detection the gain provided by exogenous theta-gamma coupling could be further increased. So far, the best performance for on-line syllable boundary detection was obtained using a theta oscillation network of 32 neurons (16 excitatory and 16 inhibitory) bursting together, which made boundary detection more flexible and robust to fluctuations in the speech waveform^8,31. Increasing the number of theta neurons in the model’s theta module would significantly increase the computational load, yet presumably improve the accuracy of the syllable onset detection.

Although top-down control is constitutively present in the predicting coding implementation, our model lacks the notion that top-down processes involve specific discharge rates, and that, as recently demonstrated, they occur preferentially at a low-beta rate range^12,13,17,21. The predictive coding model we present here works in a continuous inferential mode, which is discretized only by virtue of syllable timing. In human auditory cortex, bottom-up gamma activity is modulated at the low-beta rate^13,54, which could offer top-down integration constants that are intermediate between syllables and gamma phonemic-range chunks, and whose advantage could be to smooth the decoding process by providing sequential priors at an intermediate scale between phonemes and syllables. Alternatively, the beta top-down rhythm could be related to expected precision, thus encoding second-order statistics²⁰. Expected precision weighs bottom-up prediction errors, hypothesised to work at gamma rate, and could control their impact on the evidence integration process. When the sensory input corresponding to a new syllable arrives, the large prediction error could decrease the estimated confidence in top-down prediction and boost confidence in bottom-up prediction error. If the relative weight of bottom-up and top-down information is carried by low beta activity, we would then expect an alternation between up- and down-going processes at a theta rhythm, a finding that was experimentally observed¹³. An important generalisation for the model would thus consist of adopting a framework that estimates precision^55,56,57. These proposals remain speculative, and neurocomputational modelling could be one way to address whether the principle of frequency and temporal division of bottom-up and top-down processing is functionally interesting and whether the low-beta rate for top-down flow is optimal or merely a “just so” phenomenon.

Although our goal was not to design a speech processing model that can compete with those used in the domain of automatic speech recognition^58,59 it turns out that the notion of neural oscillations can be relevant for the latter⁸, even at the hardware level⁶⁰. Hyafil and Cernak³¹ demonstrated that a biophysically plausible theta oscillator which can syllabify speech online in a flexible manner makes a speech recognition system more resilient to noise and to variable speech rates. Here, we confirm that a biologically realistic theta oscillation is useful to signal syllable boundaries, mostly when they are out of natural statistics. More importantly, we show that the coupling between two processing timescales at theta and gamma rates can boost on-line syllable recognition in natural speech. It is hence possible that introducing oscillatory mechanisms in ASR could further improve performance and resilience to noise. For instance, using a basic sampling rate in the low-gamma range as implemented in the current model instead of faster ones as commonly used⁶¹ could save computing resources without reducing ASR systems performance. The possible coupling of a syllable parsing oscillatory mechanism, with an encoding process that is both exogenously and endogenously timed could also be useful. Finally, using oscillation-based top-down updating, which could deploy predictions about when speech events are expected to happen, a process that most ASR systems do not yet implement⁶², might also turn out to be advantageous.

This theoretical work shows the interest of extending predictive coding approaches to the domain of neural oscillations. This combination permits both to emulate a neurally plausible interface with the real world that is able to deal with the continuous nature of biological stimuli and the difficulty to parse them into elements with discrete representational value, and to provide internal orchestration of the information flow that supply possible deficiencies of interfacing mechanisms. Overall, this study suggests an alternative perspective of combining neural oscillations with predictive coding: namely the longitudinal effect of theta-gamma coupling on the temporal alignment of top-down/bottom-up informational flows during inferential processes.

Methods

Speech Input

We used 220 recorded English sentences from the TIMIT database⁴¹ for our simulations. The sentences were spoken by 22 different speakers (10 sentences each). Overall, those 220 sentences include 2888 syllables. We used the first 10 sentences of the dataset to adjust parameters of the model (Eq. (7) and precisions, Table 2). Thus, these sentences were not included in the model performance analysis.

Table 2 Precisions.

Full size table

Input to the model consisted of (1) a time-frequency representation of the sound wave and (2) the slow amplitude modulation of the sound waveform. We used a biologically inspired model of the auditory periphery³⁷ to obtain the time-frequency representation, therefore we will refer to it as the auditory spectrogram. The model transforms the auditory signal into 128 logarithmically spaced frequency channels, ranging from 150 Hz up to 7 kHz. We normalised the auditory spectrogram so that the maximum value across all channels is 1 and the minimum 0. After averaging the activity of adjacent channels, we reduced the number of channels to 6, covering the range of frequencies from 150 Hz to 5 kHz. The slow amplitude modulation is calculated following the procedures described in Hyafil et al.⁸ and Hyafil and Cernak³¹. The procedure follows these steps: (1) transform the auditory spectrogram into 32 frequency channels by averaging every 4 channels (from the full 128 channel auditory spectrogram) into 1 channel. (2) convolve with a spectrotemporal filter optimised to signal syllable boundaries. In Hyafil et al.⁸ the output of the spectrotemporal filter, which corresponds to the slow amplitude modulation in the speech waveform, was used to drive a theta oscillator made of integrate and fire model neurons. In Precoss, it is used to drive the theta module that is implemented by a continuous theta neuron.

In summary, each recorded sentence was represented by seven sensory input channels, the slow amplitude modulation (S(t)) plus 6 frequency bands (F_f(t); f = 1, … 6).

Syllabification

The model’s goal is to recognise syllables on-line, which requires defining syllables in the input to subsequently assess the model’s performance. The TIMIT corpus provides phonemic boundaries labelled by professional phoneticians. This information was passed to Tsylb2, a programme that provides candidate syllable boundaries based on the TIMIT phonemic annotation and on English grammar rules⁶³. These boundaries were then used to calculate spectrotemporal patterns ST_fγω (f = 1, … 6, number of frequency channels and γ = 1, … 8, number of gamma units) for each syllable ω of a given sentence. Firstly, each syllable (ω) was divided into 8 equal ΔT temporal chunks. Then, syllable spectrotemporal patterns ST_fγω were calculated by averaging each of the 6 frequency channels (F_f(t)) within each of the eight ΔT windows (γ) for each syllable (ω)). We used 6 frequency channels per syllable, thus, the spectrotemporal patterns are matrices with 6 rows and 8 columns for each syllable (ω)) (Supplementary Fig. 1).

Generative model

We used a predictive coding model to parse and recognise individual syllables from the continuous acoustic waveform of spoken English sentences. The core of the predictive coding framework is a hierarchically structured generative model that represents the internal knowledge about the statistics and structure of the external world. During the inference process, the brain inverts the generative model and tries to infer the hidden causes of the sensory input. To invert the generative model, we used the Dynamic Expectation Maximisation algorithm^55,64, which is based on top-down predictions and bottom-up prediction errors.

We considered a generative model with two hierarchically related levels. At each level i in the hierarchy, dynamics are determined by local hidden states (denoted by x⁽ⁱ⁾) and causal states from the level above (denoted by υ⁽ⁱ⁾). At the same time, each level generates causal states that pass information to the level below (υ^(i-1)). Hidden states are subject to dynamical equations while causal states are defined by static, generally nonlinear transformations of hidden states and causal states. Schematically

Top level (i = 2)

$${\dot x}^{\left( 2 \right)} = {\it{f}}^{\left( 2 \right)}\left( {{\it{x}}^{\left( 2 \right)}} \right) + {\it{\upvarepsilon }}^{\left( 2 \right)}$$

(1)

$${\it{v}}^{(1)}{\it{ = g}}^{(2)}\left( {{\it{x}}^{(2)}} \right)+ \upeta^{(2)}$$

(2)

The dynamics at this level are only determined by hidden states x⁽²⁾. υ⁽¹⁾ is the output to the level below. The hidden states at this level include hidden states for a theta oscillator, the slow amplitude modulation of speech waveform, syllable units, and gamma units (see below for details). ε⁽ⁱ⁾ and η⁽ⁱ⁾ (i = 1, 2) stand for random fluctuations for hidden and causal states respectively (the same notation is used in the next sections); their precision determines how predictions errors are weighted⁵⁵. Causal states passing information to the bottom level include causal states for syllable units, gamma units, and the slow amplitude modulation.

Bottom level (i = 1)

$${\dot x}^{\left( 1 \right)} = {\it{f}}^{\left( 1 \right)}\left( {{\it{x}}^{\left( 1 \right)},{\it{v}}^{\left( 1 \right)}} \right) + {\it{\upvarepsilon }}^{\left( 1 \right)}$$

(3)

$${\it{v}}^{\left( 0 \right)}{\it{ = g}}^{\left( 1 \right)}\left( {{\it{x}}^{\left( 1 \right)}{\it{,v}}^{\left( 1 \right)}} \right) + \upeta^{\left( 1 \right)}$$

(4)

At this level, there are hidden and causal states related to the 6 frequency channels and a causal state for the slow amplitude modulation (which is relayed without modification from the top level).

The output of the bottom level υ⁽⁰⁾ is then compared with the input Z(t): a vector containing the slow amplitude modulation and the reduced 6-channel auditory spectrogram.

$${\it{v}}^{\left( 0 \right)} = {\bf{{Z}}}\left( {\it{t}} \right)$$

(5)

In the following, we write the explicit form of these equations. Supplementary Fig. 3 provides a schematic with all the variables used in the model.

Top level

The top level has two modules; a theta module with slow amplitude modulation and theta oscillator units and a spectrotemporal module with gamma and syllable units.

We have used Ermentrout-Kopell’s canonical model³⁸ for the model’s theta oscillation. The theta model receives as an input the slow amplitude modulation signal in the input. Whenever the theta oscillator reaches a predefined phase, the model generates a gaussian trigger, which constitutes the model’s estimate about syllable onsets (called also theta triggers).

The model tracks the slow amplitude modulation in the input with a perfect integrator:

$$\frac{{{\it{dA}}}}{{{\it{dt}}}} = 0 + {\it{\upvarepsilon }}_A^{\left( 2 \right)}$$

(6)

During inference, A generates an equivalent signal that is compared with the slow amplitude fluctuations in the input (see Supplementary Fig. 3); precision weighted prediction errors in generalised coordinates drive Eq. 6 and result in variable A tracking the slow amplitude modulations.

We modify the tracked amplitude and use it as an input to the canonical theta model.

$$R = 0.25 + 0.21A$$

(7)

The first parameter on the right-hand side is chosen so that theta frequency is around 5 Hz whenever A is 0. The coefficient for A ensures that the range of the resulting rhythm is within the biological theta range (3–8 Hz, tested on the first 10 sentences of the dataset).

The following pair of the equations corresponds to the theta oscillator in the theta module.

$$\begin{array}{*{20}{c}} {\frac{{dq_1}}{{dt}} = - kq_2\left( {1 + R + q_1\left( {R - 1} \right)} \right) + \varepsilon _{{\mathrm{q}}_{\mathrm{1}}}^{\left( 2 \right)}} \\ {\frac{{dq_2}}{{dt}} = kq_1\left( {1 + R + q_1\left( {R - 1} \right)} \right) + \varepsilon _{{\mathrm{q}}_{\mathrm{2}}}^{\left( 2 \right)}} \\ {k = \frac{{2\pi \Omega }}{{1000}}} \end{array}$$

(8)

Where 1000 is the sampling rate and Ω = 5 Hz is the frequency of theta in the absence of input (A = 0). The quantity within brackets of the right-hand side of Eq. (8) stands for the normalized instantaneous rate of the theta oscillation (instantaneous rate = k ∙ s_θ).

$$s_{\uptheta} = 1 + R + q_1\left( {R - 1} \right)$$

(9)

The value of k ∙ s_θ determines the interval between two consecutive theta triggers, hence also the interval between two consecutive syllable onsets. As this interval indirectly estimates the syllable duration between signalled onsets, we have used the value of s_θ to set the preferred rate of the gamma sequence (and by extension gamma sequence duration) as discussed in the next section.

Gamma units are modelled as a stable heteroclinic channel, which results in their sequential activation⁶⁵ (for details see^26,45). The duration of the gamma sequence depends on the hidden variable s (Eq. (12)); which sets the rate of the gamma sequence through κ₂ in Eq. (10) and whose dynamics depends on whether the model variant: 1) has no preferred rate for the gamma sequence 2) has a fixed preferred rate or 3) has a preferred rate determined via the exogenous theta oscillation. For the second case, an endogenous rhythm is associated with the gamma sequence (endogenous theta) whose rate is set to generate a rhythm at 5 Hz (Eq. (15)), therefore corresponding approximately to the average syllable duration in English³⁰ and to the syllabic rate in our data set (Fig. 2).

In the model, gamma units provide processing windows for the syllable encoding process. The active gamma unit determines which part of the syllable is encoded at each moment of time. For example, if the first gamma unit is active, then the first 1/8 part of the spectral content of a syllable is encoded, if the second gamma unit is active then the second 1/8 part is encoded and so on.

The mathematical equations are adapted from Yildiz et al.²⁶.

$$\frac{{{\it{dz}}}}{{{\it{dt}}}} = {\it{\upkappa }}_2\left( {\it{s}} \right)\left[ { - \lambda {\mathbf{z}} - {\it{\uprho S}}\left( {\mathbf{z}} \right) + 1} \right] - {\it{\upbeta }}\left( {{\mathbf{z}} - {\mathbf{z}}_0} \right){\it{T}}_\gamma + {\it{\upvarepsilon }}_z^{\left( 2 \right)}$$

(10)

$$\frac{{{\it{dy}}_{\it{i}}}}{{{\it{dt}}}} = {\it{e}}^{{\it{z}}_{\it{i}}} - {\it{y}}_{\it{i}}\mathop {\sum}\limits_{{\it{j}} = 1}^{{\it{N}}_{\it{\upgamma }}} {{\it{e}}^{{\it{z}}_{\it{j}}}} - {\it{\upbeta }}\left( {{\it{y}}_{\it{i}}{\it{ - y}}_{{\it{i}},0}} \right){\it{T}}_{\it{\upgamma }} + {\it{\upvarepsilon }}_{{\it{y}}_{\it{i}}}^{\left( 2 \right)}$$

(11)

$$\frac{{ds}}{{dt}} = f\left( s \right) + \varepsilon _{\mathrm{s}}^{\left( 2 \right)}$$

(12)

Where

i represents the index of the gamma unit and takes values from 1 to N_γ = 8.
z is a vector of 8 units encoding the amplitude fluctuations of the gamma units, whereas the vector y represents the amplitude of the gamma units scaled to the [0, 1] interval.
z₀ and y₀ represent the reset values of z and y, corresponding to the state when the first gamma unit is active (the start of the gamma sequence).
T_γ stands for the trigger that gamma units receive from the theta module (Table 1).
β= 0.5 is a scaling factor for theta triggers.
S(z) = 1/(1 + e^-z) is applied component-wise.
ρ_ij ≥ 0 is the connectivity matrix, determining the inhibition strength from unit j to i. Its values are:

$${\it{\uprho }}_{{\it{ij}}} = \left\{ \begin{array}{l}0\quad {\it{i}} = {\it{j}}\\ 1.5\quad {\it{j}} = {\it{i}} + 1\\ 0.5\quad {\it{j}} = {\it{i}} - 1\\ 1\quad {\rm{otherwise}}\end{array} \right.$$

(13)

The first term on the right-hand side of both Eqs. (10) and (11) is taken from Yildiz et al. (2013)²⁶. The trigger term introduced on the right-hand side of Eqs. (10) and (11) ensures that irrespective of the current state of the network, the first gamma unit is activated, and the gamma sequence is deployed from the beginning whenever there is a trigger. When the trigger corresponds to a syllable onset, it ensures that the gamma sequence is properly aligned with the input and therefore that the spectrotemporal predictions temporally align with the input.

We have also modified the equations so that the value of κ₂ now is a function of the hidden variable s (Eq. (12)).

$$\begin{array}{l}\kappa _2\left( s \right) = \kappa _0e^{\left( {s - 1} \right)}\\ s\left( {t = 0} \right) = 1\\ \kappa _0 = 0.2625\end{array}$$

(14)

The transformation from s to κ₂, guarantees that the latter stays positive. s was always initialised to s(t = 0) = 1. This initial value leads to κ₂ = κ₀ in Eq. (10), which corresponds to a 25 ms duration for each gamma unit (40 Hz frequency) and an overall duration of 200 ms for the whole gamma sequence. During inference, the value of s (and consequently the value of κ₂) can vary depending on the form of f(s) in Eq. (12).

$$f\left( s \right) = \left\{ {\begin{array}{*{20}{c}} 0 \\ {1 - s} \\ {s_{\uptheta} - s} \end{array}} \right.$$

(15)

f(s) = 0, corresponds to the no preferred gamma rate condition (no theta-gamma coupling); the value of s can change freely. For f(s) = 1-s, s can change its value (and as a consequence the duration of the gamma sequence) but it will have a tendency to decay back to s = 1, that is, the model has a preferred gamma sequence duration of 200 ms; we refer to this as the endogenous theta rhythm and endogenous theta-gamma coupling. Finally, when f(s) = s_θ-s, the preferred rate of the gamma sequence is set by the exogenous theta oscillation, corresponding to the exogenous theta-gamma coupling. For all but model variants A, B and A’ f(s) = 0, variant A features f(s) = s_θ-s (exogenous theta-gamma coupling) and finally, variants B and A’ have endogenous theta-gamma coupling with f(s) = 1-s. A summary of the model variants can be found in Table 1.

Across the simulations, in the case of the exogenous theta-gamma coupling (variant A, Fig. 3) the average value of κ₂ across all sentences was equal to 1.23κ₀ (with standard deviation equal to 0.45κ₀), which corresponds to a gamma sequence duration around 163 ms. This suggests that the stimulus driven theta oscillation adapts to the syllable duration in the input. Even though the theta frequency was tuned to 5 Hz (corresponding to a gamma sequence duration to 200 ms), the resulting average frequency corresponds to the median syllable duration in our dataset (Fig. 2). In case of variant B (endogenous theta), the gamma rate remained fixed at κ₂ = κ₀ (with standard deviation equal to 0.0025κ₀).

The last module of the top-level contains the syllable units; they represent evidence that the associated syllable corresponds to the syllable in the input. The number of syllable units varies from sentence to sentence and corresponds to the number of syllables in the input sentence plus a silent unit which generates silence; e.g. it should be maximally active whenever there is no signal in the input. The equations for syllable units are:

$$\frac{{{\it{d\upomega }}}}{{{\it{dt}}}} = - \left( {{\it{\upomega - \upomega }}_0} \right){\it{T}}_{\it{\upomega }} + {\it{\upvarepsilon }}_{\it{\upomega }}^{\left( 2 \right)}$$

(16)

where omega is a vector with as many components as syllables in the sentence (plus silent unit). T_ω corresponds to triggers (Table 1) that reset the activation level of the syllable units. A trigger drives the activity level of all syllable units towards an equal value ω₀. As we will specify below, triggers originated from the last gamma unit, signalling internal expectations about the end of a syllable; in the case of model variants without the reset of syllable units, the trigger was set to 0. Between triggers, syllable units act as evidence accumulators. The activation level of each unit determines its contribution to the generated auditory spectrogram (Eqs. (18) and (21)).

The causal states of the second level pass information to the bottom level:

$${\it{\upnu }}_{\it{\upgamma }}^{\left( 1 \right)} = {\mathbf{y}} + {\it{\upeta }}_{\it{\upgamma }}^{\left( 2 \right)}$$

(17)

$${\it{\upnu }}_{\it{\upomega }}^{\left( 1 \right)} = \frac{{{\it{e}}^{{\it{ - \upomega }}}}}{{{\sum} {{\it{e}}^{{\it{ - \upomega }}}} }} + {\it{\upeta }}_{\it{\upomega }}^{\left( 2 \right)}$$

(18)

$${\it{\upnu }}_A^{\left( 1 \right)} = {\it{A}} + {\it{\upeta }}_A^{\left( 2 \right)}$$

(19)

Equation (17) corresponds to the 8 scaled gamma units (Eq. (11)); that have sequential activation. Equation (18) corresponds to the syllable units; where we used the softmax function to scale the activity of the syllable units. Since all the syllables in the input are present in the syllable pool, prediction error (the difference between predicted and actual spectrotemporal patterns at the first level) will be minimised when the causal state of the corresponding syllable unit in the model is driven close to 1 while all others are driven close to 0. Finally, Eq. (19) sends information about the current estimate of the slow amplitude modulation.

Bottom level

The bottom level contains variables related to the amplitude fluctuations of the frequency channels as well as the slow amplitude modulation.

The amplitude fluctuations of the frequency channels are modelled with a Hopfield attractor-based neural network⁶⁶. The following equations were adapted from Yildiz et al.²⁶

$$\frac{{{\it{d}}{\mathbf{x}}^{\left( 1 \right)}}}{{{\it{dt}}}} = {\it{\upkappa }}_1\left[ { - {\it{D}}{\mathbf{x}}^{\left( 1 \right)} + {\it{W}}{\mathbf{{tanh}}}\left( {{\mathbf{x}}^{\left( 1 \right)}} \right) + {\it{I}}} \right] + {\it{\upvarepsilon }}^{\left( 1 \right)}$$

(20)

$${\it{I}}_{\it{f}} = \mathop {\sum}\limits_{{\it{\upgamma }} = 1}^8 {\mathop {\sum}\limits_{{\it{\upomega }} = 1}^{{\it{N}}_{{\it{syl}}}} {{\it{\upupsilon }}_{\it{\upgamma }}^{\left( 1 \right)}{\it{\upupsilon }}_{\it{\upomega }}^{\left( 1 \right)}{\it{P}}_{{\it{f\upgamma \upomega }}}} }$$

(21)

x⁽¹⁾ is a vector with 6 components (one per frequency channel), D is a diagonal self-connectivity matrix and W is an asymmetric synaptic connectivity matrix; they were designed so that the Hopfield network has a global attractor whose location depends on vector I²⁶. In Eq. (21), ν_γ⁽¹⁾ and ν_ω⁽¹⁾ are the causal states for the gamma and syllable units from the top level (Eqs. (17) and (18)). P_fγω is defined from the spectrotemporal patterns ST_fγω associated with each syllable as follows:

$${\it{P}}_{{\it{f\upgamma \upomega }}} = \mathop {\sum}\limits_{{\it{i}} = 1}^6 {{\it{D}}_{{\it{fi}}}{\it{ST}}_{{\it{i\upgamma \upomega }}}} - \mathop {\sum}\limits_{{\it{i}} = 1}^6 {{\it{W}}_{{\it{fi}}}{\mathrm{tanh}}\left( {{\it{ST}}_{{\it{i\upgamma \upomega }}}} \right)}$$

(22)

Because the vector I_f determines the global attractor, sequential activation of the gamma units makes the global attractor change continuously over time and generate the pattern corresponding to syllable ‘ω’ when υ⁽¹⁾_ω = 1 and υ⁽¹⁾_{not ω} = 0.

The outputs of this level are the states of the Hopfield network, which predict the activity of the frequency channels in the input, and the causal state associated with the slow amplitude modulation (relayed from the top level):

$$\begin{array}{c}{\it{\upupsilon }}_{\it{f}}^{\left( 0 \right)} = {\mathbf{x}}^{\left( 1 \right)} + {\it{\upeta }}_{\it{f}}^{\left( 1 \right)}\\ {\it{\upupsilon }}_A^{\left( 0 \right)} = {\it{\upupsilon }}_A^{\left( 1 \right)} + {\it{\upeta }}_A^{\left( 1 \right)}\end{array}$$

(23)

These quantities are compared with the slow amplitude modulation (S(t)) and frequency channels (F_f(t)) in the input signal:

$$\begin{array}{l}{\it{\upnu }}_A^{\left( 0 \right)} = {\it{S}}\left( {\it{t}} \right)\\ {\it{\upnu }}_{\it{f}}^{\left( 0 \right)} = {\it{F}}_{\it{f}}\left( {\it{t}} \right)\end{array}$$

(24)

The discrepancy between top-down predictions and sensory input is propagated back in the hierarchy to update hidden state and causal state estimates so that prediction errors at every level of the hierarchy are minimized.

The values of all parameters used in the model, as well as precisions for hidden and causal states for both levels, are presented in Tables 3 and 2 respectively.

Table 3 Parameter values.

Full size table

Resets/triggers and model variants

To ensure that predictions are temporally aligned with the input, the model needs to align the gamma network with syllable onsets. Moreover, ideally, evidence accumulation should be reset before syllable onset. Although both resets could in principle be driven by prediction errors, our basic model also involves explicit resets.

When present, the trigger to reset gamma units (denoted by T_γ in Eqs. (10) and (11)) was driven either by exogenous theta signalled onset triggers (referred to as theta triggers T_θ) or by explicitly provided onsets Tr.

Theta-triggers T_θ: A Gaussian pulse was generated whenever the phase of the exogenous theta oscillation reached a specific phase:

$$\begin{array}{l}{\it{T}}_\theta = {\it{e}}^{{\it{ - }}\frac{{\left( {{\it{q}}_{1{\it{n}}}{\it{ + }}1} \right)^2{\it{ + q}}_{2{\it{n}}}^2}}{{2{\it{\upsigma }}^2}}}{\kern 1pt} \;\;\;\;\;\;\\ {\it{q}}_{1n}{\it{ = }}\frac{{{\it{q}}_1}}{{\sqrt {{\it{q}}_1^2{\it{ + q}}_2^2} }}{\it{,}}\quad {\it{q}}_{2n}{\it{ = }}\frac{{{\it{q}}_2}}{{\sqrt {{\it{q}}_1^2{\it{ + q}}_2^2} }}\end{array}$$

(25)

When present, the reset to syllable units (T_ω, Eq. (16)) was driven by the model’s knowledge about syllable spectral structure (e.g. knowledge that each syllable is a sequence of 8 spectral target points). Since the last (eighth) gamma unit signals the end of the syllable, it can define a trigger that we refer as the internal trigger (denoted as T_int):

$${\it{T}}_{int} = {\it{y}}_8$$

(26)

In summary, to reset gamma units the model uses theta triggers, which are derived from the slow amplitude modulation of the sound waveform, whereas to reset syllable units the model uses internal information about the spectral structure of syllables - T_int. To explore the relative importance of each resetting mechanism for the overall performance of the model, we compared different model variants (Table 1).

Model output

To define the syllables identified by the model, we considered the time average of the causal state (v_ω, Eq. (18)) of each of the syllable units taken within the boundaries defined by the gamma sequence (Supplementary Fig. 2). We have calculated for how long the unit with highest average activation between the internally signalled boundaries corresponds to the syllable in the input at that time. Thus, for each sentence, we calculate the percentage of the total sentence duration which was correctly identified by the model (Supplementary Fig. 2).

Statistical analyses

As described in the previous section, a single number (% correctly identified sentence duration) describes the performance of the model for each sentence. Simulations were run on 210 sentences, and the performance of each model architecture is thus described by a vector of 210 numbers. The non-parametric Wilcoxon signed-rank test for repeated measures was used to compare models’ performance. The Bonferroni correction was used to control for multiple comparisons. Supplementary Table 1 reports p-values for pairwise comparisons for the model variants presented in Fig. 3. Table also includes the adjusted α-value.

For the simulations with compressed speech, we only compared the performance difference between 2 model variants for each condition separately. Each pair was compared using the paired t-test (we only compare performance difference of the 2 model variants (A and B) for a particular compression factor).

For the simulations presented in Figs. 4 and 6, Cohen’s d value was calculated as the difference between the means of each group, divided by the standard deviation calculated from differences in observations.

Finally, to estimate the chance level, the following steps were performed: 1) for each sentence in the database, we mimicked the sequence of model-selected syllables (Supplementary Fig. 2, panel c) by randomly choosing 2) a duration of “selected” syllable from the syllable duration distribution in Figs. 2a and 3) an identity from the syllables of the corresponding sentence. Thus, we have calculated what would be the model’s performance when syllable identity and duration were selected by chance. For each sentence, the procedure was repeated 1000 times and the mean performance value was stored (a chance level for a sentence). Furthermore, the whole procedure was repeated 1000 times on all 220 sentences in the database, and the median value of the resulting distribution was selected as a chance level for the whole dataset, which was equal to 6.9%.

Model comparison

We have calculated the Bayesian information criterion value based on the posterior probability of the true syllable sequence s (Supplementary Fig. 2d) for each sentence and model variant m; p(s | m). This quantity also represents the likelihood of the model given the input (true syllables that the model aims to recover).

Similar to the syllable recognition accuracy presented in Fig. 3, the likelihood value is based on the dynamics of the second level causal states (Supplementary Fig. 2) but it also takes into account the precisions (confidence) of the estimated dynamics of each variable (e.g. dynamics of syllable and gamma units in Supplementary Fig. 2a, b).

To derive the BIC, we first calculated model m likelihood for each sentence i, and each moment of time (for convenience we do not write the explicit time dependence of D and ν_μ on the right-hand side of the equations below):

$$\begin{array}{l}p\left( {{\mathbf{s}}_i\left( t \right)|m} \right) = {\int} {dz\,p\left( {{\mathbf{s}}_i,{\mathbf{z}}|m_i} \right)} \equiv \frac{{\sqrt {\det \,D} }}{{\sqrt {(2\pi )^d} }}{\int} {dz\,\exp \left[ { - \frac{1}{2}\left( {\left( {{\mathbf{\upsilon }} - {\mathbf{\upsilon }}_\mu } \right)^TD\left( {{\mathbf{\upsilon }} - {\mathbf{\upsilon }}_\mu } \right)} \right)} \right]} \\ D \equiv \left[ {\begin{array}{*{20}{c}} A & C \\ {C^T} & B \end{array}} \right],\,\upsilon \equiv \left[ {\begin{array}{*{20}{c}} {\mathbf{s}} \\ {\mathbf{z}} \end{array}} \right]\end{array}$$

(27)

s_i is the vector of syllable units as they appear in the input sentence i at time t (Supplementary Fig. 2d). D is the overall conditional precision matrix, which we separate into the components corresponding to the syllable units A, components corresponding to the other causal states B, and their interaction terms C. ν represents the vector of causal states in the second level and ν_μ the model’s estimated means. It contains the syllable units s (mean values s_μ, Supplementary Fig. 2a) and all other causal states represented here by z (mean values z_μ) for variants B, D, F z represents the dynamics of the gamma units, e.g. Supplementary Fig. 2b and, for variants A, D and E, the gamma units and the slow amplitude modulation.

After integration over z, the likelihood of the model given the true syllables is:

$$p\left( {{\mathbf{s}}_i\left( t \right)|m_i} \right) = \frac{{\sqrt {\det \,D} }}{{\sqrt {\left( {2\pi } \right)^{n_s}} \sqrt {\det B} }}\exp \left[ { - \frac{1}{2}\left( {{\mathbf{s}}_i - {\mathbf{s}}_\mu } \right)^T\left( {A - CB^{ - 1}C^T} \right)\left( {{\mathbf{s}}_i - {\mathbf{s}}_\mu } \right)} \right]$$

(28)

This value was calculated for each moment of time, thus, to get the value for the whole sentence we calculated the average value of log(p(s_i(t)|m)) per syllable and sum across all syllables in the sentence.

$$\log p\left( {{\mathbf{s}}_i|m} \right) \approx \mathop {\sum}\limits_{j = 1}^{N_i} {\mathop {\sum}\limits_k {\Delta t\,\frac{{\log \left( {p\left( {{\mathbf{s}}_i\left( {t_k} \right)|m} \right)} \right.}}{{T_j}}} }$$

(29)

Where T_j is the duration of syllable j in the sentence i, N_i is the number of syllables in the sentence and Δt = 1 ms.

Equations (28) and (29) represent the approximate log-likelihood value for model m given the sentence i. Thus, the value for the whole dataset is:

$$\log p\left( {{\mathbf{s}}|m} \right) = \mathop {\sum}\limits_i {\log p\left( {{\mathbf{s}}_i|m_i} \right)}$$

(30)

This was used to approximate the Bayesian information criterion (BIC), which considers also the number of free variables in the model.

$$BIC\left( m \right) = \log p\left( {{\mathbf{s}}|m} \right) - \frac{{N_p\left( m \right)}}{2}\log N_{{\mathrm{sentences}}}$$

(31)

Where N_p(m) is the number of free variables for model variant m, which includes state precision values (Table 2), number of resets and the presence of gamma rate information. Thus, the most complex variant A would have 14 (precisions) + 2 (reset of gamma units by theta signalled onsets, reset of syllable units by the last gamma unit) + 1 (rate) = 17 parameters. On the other hand, the simplest version F would have only 10 parameters, no precisions values for theta and slow amplitude modulations (−(1 + 3)), values for resets of syllable and gamma units (−2) and no preferred gamma rate value (−1). The number of parameters for each model variant is given in Table 4.

Table 4 Number of free parameters for each model configuration.

Full size table

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

Source data files are provided for Figs. 3, 4 and 6. Subset of the TIMIT dataset⁴¹ used for simulations and the corresponding Tsylb2⁶³ generated syllable boundaries can be found at: https://github.com/sevadah/precoss/tree/master/data_construction. Source data are provided with this paper.

Code availability

Simulations were performed with DEM Toolbox in SPM⁶⁴ using MATLAB 2018b, The MathWorks, Inc., Natick, Massachusetts, United States. Custom code is available at: https://github.com/sevadah/precoss. Source data are provided with this paper.

References

Buzsáki, G. & Draguhn, A. Neuronal olscillations in cortical networks. Science 304, 1926–1929 (2004).
Article ADS PubMed CAS Google Scholar
Lakatos, P., Karmos, G., Mehta, A. D., Ulbert, I. & Schroeder, C. E. Entrainment of neuronal oscillations as a mechanism of attentional selection. Science 320, 110–113 (2008).
Article ADS CAS PubMed Google Scholar
Wang, X. J. Neurophysiological and computational principles of cortical rhythms in cognition. Physiol. Rev. 90, 1195–1268 (2010).
Article PubMed Google Scholar
Hyafil, A., Giraud, A. L., Fontolan, L. & Gutkin, B. Neural cross-frequency coupling: connecting architectures, mechanisms, and functions. Trends Neurosci. 38, 725–740 (2015).
Article CAS PubMed Google Scholar
Canolty, R. T. et al. High gamma power is phase-locked to theta oscillations in human neocortex. Science 313, 1626–1628 (2006).
Article ADS CAS PubMed PubMed Central Google Scholar
Ghitza, O. Linking speech perception and neurophysiology: Speech decoding guided by cascaded oscillators locked to the input rhythm. Front. Psychol. 2, 130 (2011).
Article PubMed PubMed Central Google Scholar
Giraud, A. L. & Poeppel, D. Cortical oscillations and speech processing: emerging computational principles and operations. Nat. Neurosci. 15, 511–517 (2012).
Article CAS PubMed PubMed Central Google Scholar
Hyafil, A., Fontolan, L., Kabdebon, C., Gutkin, B. & Giraud, A. L. Speech encoding by coupled cortical theta and gamma oscillations. Elife 4, 1–45 (2015).
Article Google Scholar
Rimmele, J. M., Zion Golumbic, E., Schröger, E. & Poeppel, D. The effects of selective attention and speech acoustics on neural speech-tracking in a multi-talker scene. Cortex 68, 144–154 (2015).
Article PubMed PubMed Central Google Scholar
Klimovich-Gray, A. et al. Balancing prediction and sensory input in speech comprehension: the spatiotemporal dynamics of word recognition in context. J. Neurosci. 39, 519–527 (2019).
Article CAS PubMed PubMed Central Google Scholar
Donhauser, P. W. & Baillet, S. Two distinct neural timescales for predictive speech processing. Neuron. https://doi.org/10.1016/j.neuron.2019.10.019 (2019).
Chao, Z. C., Takaura, K., Wang, L., Fujii, N. & Dehaene, S. Large-scale cortical networks for hierarchical prediction and prediction error in the primate brain. Neuron 100, 1252–1266.e3 (2018).
Article CAS PubMed Google Scholar
Fontolan, L., Morillon, B., Liegeois-Chauvel, C. & Giraud, A. L. The contribution of frequency-specific activity to hierarchical information processing in the human auditory cortex. Nat. Commun. 5, 4694 (2014).
Article ADS CAS PubMed Google Scholar
Pefkou, M., Arnal, L. H., Fontolan, L. & Giraud, A. L. Θ-Band and Β-band neural activity reflects independent syllable tracking and comprehension of time-compressed speech. J. Neurosci. 37, 7930–7938 (2017).
Article CAS PubMed PubMed Central Google Scholar
Park, H., Ince, R. A. A., Schyns, P. G., Thut, G. & Gross, J. Frontal top-down signals increase coupling of auditory low-frequency oscillations to continuous speech in human listeners. Curr. Biol. 25, 1649–1653 (2015).
Article CAS PubMed PubMed Central Google Scholar
Lewis, A. G., Schoffelen, J. M., Schriefers, H. & Bastlaansen, M. A predictive coding perspective on beta oscillations during sentence-level language comprehension. Front. Hum. Neurosci. 10, 85 (2016).
Article PubMed PubMed Central Google Scholar
Sedley, W. et al. Neural signatures of perceptual inference. Elife 5, e11476 (2016).
Article PubMed PubMed Central Google Scholar
Mumford, D. On the computational architecture of the neocortex - II The role of cortico-cortical loops. Biol. Cybern. 66, 241–251 (1992).
Article CAS PubMed Google Scholar
Rao, R. P. N. & Ballard, D. H. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nat. Neurosci. 2, 79–87 (1999).
Article CAS PubMed Google Scholar
Friston, K. & Kiebel, S. Predictive coding under the free-energy principle. Philos. Trans. R. Soc. B Biol. Sci. 364, 1211–1221 (2009).
Article Google Scholar
Bastos, A. M. et al. Canonical microcircuits for predictive coding. Neuron 76, 695–711 (2012).
Article CAS PubMed PubMed Central Google Scholar
Liberman, A. M., Cooper, F. S., Shankweiler, D. P. & Studdert-Kennedy, M. Perception of the speech code. Psychol. Rev. 74, 431–461 (1967).
Article CAS PubMed Google Scholar
Halle, M. & Stevens, K. Speech recognition: a model and a program for research. IRE Trans. Inf. Theory 8, 155–159 (1962).
Article Google Scholar
Knill, D. C. & Pouget, A. The Bayesian brain: the role of uncertainty in neural coding and computation. Trends Neurosci. 27, 712–719 (2004).
Article CAS PubMed Google Scholar
Poeppel, D., Idsardi, W. J. & Van Wassenhove, V. Speech perception at the interface of neurobiology and linguistics. Philos. Trans. R. Soc. B Biol. Sci. 363, 1071–1086 (2008).
Article Google Scholar
Yildiz, I. B., von Kriegstein, K. & Kiebel, S. J. From birdsong to human speech recognition: Bayesian inference on a hierarchy of nonlinear dynamical systems. PLoS Comput. Biol. 9, e1003219 (2013).
Article CAS PubMed PubMed Central Google Scholar
Giraud, A. L. & Arnal, L. H. Hierarchical predictive information is channeled by asymmetric oscillatory activity. Neuron 100, 1022–1024 (2018).
Article CAS PubMed Google Scholar
Marr, D. C. & Poggio, T. From understanding computation to understanding neural circuitry. Neurosci. Res. Program Bull. 15, 470–488 (1977).
Google Scholar
Kamakshi Prasad, V., Nagarajan, T. & Murthy, H. A. Automatic segmentation of continuous speech using minimum phase group delay functions. Speech Commun. 42, 429–446 (2004).
Article Google Scholar
Greenberg, S. Speaking in shorthand - a syllable-centric perspective for understanding pronunciation variation. Speech Commun. 29, 159–176 (1999).
Article Google Scholar
Hyafil, A. & Cernak, M. Neuromorphic based oscillatory device for incremental syllable boundary detection. in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH vols 2015-Janua 1191–1195 (ISCA, 2015).
Wacongne, C. et al. Evidence for a hierarchy of predictions and prediction errors in human cortex. Proc. Natl. Acad. Sci. USA 108, 20754–20759 (2011).
Article ADS CAS PubMed PubMed Central Google Scholar
Gagnepain, P., Henson, R. N. & Davis, M. H. Temporal predictive codes for spoken words in auditory cortex. Curr. Biol. 22, 615–621 (2012).
Article CAS PubMed PubMed Central Google Scholar
Lewis, A. G. & Bastiaansen, M. A predictive coding framework for rapid neural dynamics during sentence-level language comprehension. Cortex 68, 155–168 (2015).
Article PubMed Google Scholar
Arnal, L. H. & Giraud, A. L. Cortical oscillations and sensory predictions. Trends Cogn. Sci. 16, 390–398 (2012).
Arnal, L. H., Doelling, K. B. & Poeppel, D. Delta-beta coupled oscillations underlie temporal prediction accuracy. Cereb. Cortex 25, 3077–3085 (2015).
Article PubMed Google Scholar
Chi, T., Ru, P. & Shamma, S. A. Multiresolution spectrotemporal analysis of complex sounds. J. Acoust. Soc. Am. 118, 887–906 (2005).
Article ADS PubMed Google Scholar
Ermentrout, G. B. & Kopell, N. Parabolic bursting in an excitable system coupled with a slow oscillation. SIAM J. Appl. Math. 46, 233–253 (1986).
Article MathSciNet MATH Google Scholar
Kösem, A., Basirat, A., Azizi, L. & van Wassenhove, V. High-frequency neural activity predicts word parsing in ambiguous speech streams. J. Neurophysiol. 116, 2497–2512 (2016).
Article PubMed PubMed Central Google Scholar
Schroeder, C. E. & Lakatos, P. Low-frequency neuronal oscillations as instruments of sensory selection. Trends Neurosci. 32, 9–18 (2009).
Article CAS PubMed Google Scholar
Garofolo, J. et al. TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1. Web Download. Philadelphia Linguist. Data Consortium, 1–94 (1993).
Nourski, K. V. et al. Temporal envelope of time-compressed speech represented in the human auditory cortex. J. Neurosci. 29, 15564–15574 (2009).
Article CAS PubMed PubMed Central Google Scholar
Ghitza, O. & Greenberg, S. On the possible role of brain rhythms in speech perception: intelligibility of time-compressed speech with periodic and aperiodic insertions of silence. Phonetica 66, 113–126 (2009).
Article PubMed Google Scholar
Schwarz, G. Estimating the dimension of a model. Ann. Stat. 6, 461–464 (1978).
Article MathSciNet MATH Google Scholar
Yildiz, I. B. & Kiebel, S. J. A hierarchical neuronal model for generation and online recognition of birdsongs. PLoS Comput. Biol. 7, 1–18 (2011).
Article CAS Google Scholar
Mermelstein, P. Automatic segmentation of speech into syllabic units. J. Acoust. Soc. Am. 58, 880–883 (1975).
Article ADS CAS PubMed Google Scholar
Luo, H. & Poeppel, D. Cortical oscillations in auditory perception and speech: Evidence for two temporal windows in human auditory cortex. Front. Psychol. 3, 170 (2012).
Article PubMed PubMed Central Google Scholar
Lisman, J. E. & Jensen, O. The theta-gamma neural code. Neuron 77, 1002–1016 (2013).
Article CAS PubMed PubMed Central Google Scholar
Lam, N. H. L., Schoffelen, J. M., Uddén, J., Hultén, A. & Hagoort, P. Neural activity during sentence processing as reflected in theta, alpha, beta, and gamma oscillations. Neuroimage 142, 43–54 (2016).
Article PubMed Google Scholar
Ding, N. et al. Characterizing neural entrainment to hierarchical linguistic units using electroencephalography (EEG). Front. Hum. Neurosci. 11, 481–490 (2017).
Article PubMed PubMed Central Google Scholar
Martin, A. E. & Doumas, L. A. A. A mechanism for the cortical computation of hierarchical linguistic structure. PLoS Biol. 15, 1–23 (2017).
Article CAS Google Scholar
Venezia, J. H., Thurman, S. M., Richards, V. M. & Hickok, G. Hierarchy of speech-driven spectrotemporal receptive fields in human auditory cortex. Neuroimage 66, 647–666 (2019).
Article Google Scholar
de Heer, W. A., Huth, A. G., Griffiths, T. L., Gallant, J. L. & Theunissen, F. E. The hierarchical cortical organization of human speech processing. J. Neurosci. 37, 6539–6557 (2017).
Article CAS PubMed PubMed Central Google Scholar
Bouton, S. et al. Focal versus distributed temporal cortex activity for speech sound category assignment. Proc. Natl. Acad. Sci. USA 115, E1299–E1308 (2018).
Article CAS PubMed PubMed Central Google Scholar
Friston, K. J., Trujillo-Barreto, N. & Daunizeau, J. DEM: a variational treatment of dynamic systems. Neuroimage 41, 849–885 (2008).
Article CAS PubMed Google Scholar
Friston, K. J. Variational filtering. Neuroimage 41, 747–766 (2008).
Article CAS PubMed Google Scholar
Friston, K., Stephan, K., Li, B. & Daunizeau, J. Generalised filtering. Math. Probl. Eng. 2010, e621670 (2010).
Li, J., Deng, L., Gong, Y. & Haeb-Umbach, R. An overview of noise-robust automatic speech recognition. IEEE Trans. Audio Speech Lang. Process. 22, 745–777 (2014).
Article Google Scholar
Prabhavalkar, R. et al. A Comparison of sequence-to-sequence models for speech recognition. in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH vols 2017-Augus 939–943 (ISCA, 2017).
Romera, M. et al. Vowel recognition with four coupled spin-torque nano-oscillators. Nature 563, 230–234 (2018).
Article ADS CAS PubMed Google Scholar
Hirsch, H. G., Hellwig, K. & Dobler, S. Speech recognition at multiple sampling rates. EUROSPEECH 2001 - SCANDINAVIA - 7th European Conference on Speech Communication and Technology vol. 2001 http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:Speech+Recognition+at+Multiple+Sampling+Rates#0 (2001).
Davis, M. H. & Scharenborg, O. Speech perception by humans and machines. In Speech Perception and Spoken Word Recognition 181–203 (Psychology Press, 2016). https://doi.org/10.4324/9781315772110.
Fisher, W. M. Tsylb2-1.1 Syllabification Software. http://www.nist.gov/speech/tools (1996).
SPM - Statistical Parametric Mapping. https://www.fil.ion.ucl.ac.uk/spm/.
Rabinovich, M. I., Varona, P., Selverston, A. I. & Abarbanel, H. D. I. Dynamical principles in neuroscience. Rev. Mod. Phys. 78, 1213–1265 (2006).
Article ADS Google Scholar
Hopfield, J. J. Neural network and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA. https://doi.org/10.1073/pnas.79.8.2554 (1982).

Download references

Acknowledgements

This work was funded by a grant from the Swiss National Science Foundation (#320030_163040) and the National Centre of Competence in Research - Evolving Language. Authors are thankful Julien Diard and Jaime Delgado for helpful discussions and feedback.

Author information

Authors and Affiliations

Department of Basic Neurosciences, University of Geneva, Biotech Campus, 9 Chemin des Mines, C.P. 87, 1211, Genève, Switzerland
Sevada Hovsepyan, Itsaso Olasagasti & Anne-Lise Giraud

Authors

Sevada Hovsepyan
View author publications
You can also search for this author in PubMed Google Scholar
Itsaso Olasagasti
View author publications
You can also search for this author in PubMed Google Scholar
Anne-Lise Giraud
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

A.L.G. and I.O. designed the study, S.H. carried out the study under I.O. and A.L.G. guidance, S.H., I.O., and A.L.G. wrote the manuscript.

Corresponding author

Correspondence to Sevada Hovsepyan.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Communications thanks Andrea Martin and Jonas Obleser for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Peer Review File

Supplementary Information

Reporting Summary

Source data

Source Data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Hovsepyan, S., Olasagasti, I. & Giraud, AL. Combining predictive coding and neural oscillations enables online syllable recognition in natural speech. Nat Commun 11, 3117 (2020). https://doi.org/10.1038/s41467-020-16956-5

Download citation

Received: 15 January 2019
Accepted: 04 June 2020
Published: 19 June 2020
DOI: https://doi.org/10.1038/s41467-020-16956-5

This article is cited by

Imagined speech can be decoded from low- and cross-frequency intracranial EEG features
- Timothée Proix
- Jaime Delgado Saa
- Anne-Lise Giraud
Nature Communications (2022)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Subjects

Abstract

Similar content being viewed by others

Introduction

Results

Precoss architecture and theta-gamma coupling

Model variants and performance

Bayesian information criterion

Discussion

Methods

Speech Input

Syllabification

Generative model

Top level

Bottom level

Resets/triggers and model variants

Model output

Statistical analyses

Model comparison

Reporting summary

Data availability

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Additional information

Supplementary information

Source data

Rights and permissions

About this article

Cite this article

Share this article

This article is cited by

Comments

Search

Quick links