Abstract
More and more patients worldwide are diagnosed with dementia, which emphasizes the urgent need for early detection markers. In this study, we built on the auditory hypersensitivity theory of a previous study—which postulated that responses to auditory input are enhanced in cognitive decline at both the subcortical and the cortical level—and examined the auditory encoding of natural continuous speech at both neural levels for its potential to indicate cognitive decline. We recruited study participants aged 60 years and older, who were divided into two groups based on the Montreal Cognitive Assessment: one group with low scores (n = 19, participants with signs of cognitive decline) and a control group (n = 25). Participants completed an audiometric assessment, and we then recorded their electroencephalogram while they listened to an audiobook and to click sounds. We derived temporal response functions and evoked potentials from the data and examined response amplitudes for their potential to predict cognitive decline, controlling for hearing ability and age. Contrary to our expectations, no evidence of auditory hypersensitivity was observed in participants with signs of cognitive decline; response amplitudes were comparable in both cognitive groups. Moreover, the combination of response amplitudes showed no predictive value for cognitive decline. These results challenge the proposed hypothesis and emphasize the need for further research to identify reliable auditory markers for the early detection of cognitive decline.
Significance Statement
Research on cognitive decline needs more studies uncovering markers for subtle neural changes in the pre-dementia stage. Neural markers for auditory processing have a potential that has not been widely explored in studies. Here, for the first time, we used natural, continuously spoken language to examine neural processing in two groups of older adults with and without cognitive decline. We quantified how well the brain tracks speech not only at the cortical but also at the subcortical level. In contrast to previous research suggesting that subcortical and cortical auditory responses are enhanced in cognitive decline, we found no group differences. We believe that this study represents an important contribution to the search for markers of cognitive health in old age.
Introduction
Considering the increasing prevalence of dementia worldwide, markers for early detection are of crucial importance. Dementia is preceded by a protracted, preclinical phase before symptoms in cognition and behavior manifest. By the time they become apparent, the disease is often advanced and brain damage is irreversible. Therefore, more subtle neuropathological changes that precede cognitive decline are of particular interest. A promising approach for early detection lies in the neurophysiology of auditory processing (Bidelman et al., 2017). When a sound reaches the ear, it undergoes complex sensory and cognitive processing before meaning can be derived. This pathway involves subcortical relay stations in the brainstem and midbrain and processing stations in the cortex (Ades and Brookhart, 1950). Cognition and hearing are closely linked (Lindenberger and Baltes, 1994) and age-related hearing loss is associated with cognitive decline (Lin and Albert, 2014; Wayne and Johnsrude, 2015; Thomson et al., 2017). Despite this link, studies investigating neurophysiological auditory processing in dementia are scarce and often neglect auditory function. Hearing loss affects auditory processing, with the most evident effects manifesting as reduced and delayed encoding in the brainstem (Verhulst et al., 2016; Presacco et al., 2019), leading to increased cortical representation (Decruy et al., 2020; Fuglsang et al., 2020).
There is one study that suggested a link between cognitive decline and auditory encoding hypersensitivity. Bidelman et al. (2017) showed that adults with signs of cognitive decline exhibited increased activity at the brainstem level, which was subsequently reflected in increased cortical representation. They used electroencephalography (EEG) to examine frequency-following responses (FFRs) and auditory evoked potentials (AEPs). The FFR serves as an indicator of the auditory system’s ability to encode fundamental frequency and harmonics (Skoe and Kraus, 2010) and originates from subcortical structures, influenced by the brainstem and the inferior colliculus in the midbrain (Chandrasekaran and Kraus, 2010). The AEP represents a cortical response to auditory stimuli and has three peaks: a positive deflection (P1) approximately 50 ms after stimulus onset, followed by a negative deflection (N1) at 100 ms and a positive deflection (P2) at 200 ms (Picton and Hink, 1974). P1 and N1 reflect initial sound detection, while P2 reflects higher-level processing of information. Bidelman et al. (2017) reported that adults with signs of cognitive decline showed increased FFR root mean square (RMS) amplitudes and N1–P2 voltage differences, providing a model for predicting decline based on response amplitudes. Despite the promising implications of these findings, it is notable that no subsequent research has yet validated and built on these observations.
Recent methodological advances in neurophysiological auditory processing allow the experimental use of natural continuous speech. Such stimuli have a stronger real-world connection than repetitive sounds, enhancing the ecological validity of results and making experiments more engaging for participants. Speech is characterized by low-frequency envelope modulations. During listening, the cortex synchronizes its activity with the temporal structure of the envelope, essentially “tracking” it (Luo and Poeppel, 2007). To quantitatively assess this speech tracking, linearized stimulus-response models known as temporal response functions (TRFs) are used (Crosse et al., 2016). TRF models are created by regressing a speech-derived feature against the EEG, resulting in a time-lagged model that describes the relationship between the two signals. These models exhibit deflections akin to P1–N1–P2 complexes and allow neurophysiological interpretation (Crosse et al., 2021). TRF models have recently been extended to subcortical processing (Maddox and Lee, 2018; Bachmann et al., 2021; Kulasingham et al., 2024); these subcortical models show a peak approximating wave V in auditory brainstem responses (ABRs), which reflects activity in subcortical structures and occurs approximately 5 ms after sound onset. The advantage of TRF models over conventional potentials lies in their ability to assess the response to speech over a longer listening period.
Building on this framework, we investigated neurophysiological speech processing in participants with and without signs of cognitive decline. We used an auditory EEG paradigm inspired by Bidelman et al. (2017) and Bachmann et al. (2021), which allowed us to simultaneously record subcortical and cortical responses while participants listened to an audiobook, and complemented it with an ABR recording. To control for hearing ability, we performed audiometric tests to ensure that the observed differences were due to cognitive factors and not hearing impairment. Our study was guided by two hypotheses: First, that cognitive decline is associated with encoding hypersensitivity at cortical and subcortical levels, as evidenced by increased TRF and evoked responses. Second, we expected that response peak amplitudes might serve as predictors of cognitive decline, providing novel insight into the neurophysiological markers of cognitive health in older adults.
Materials and Methods
Participants
We invited 45 participants, aged 60 and older, to take part in our study. Recruitment was conducted through an advertisement with Pro Senectute (prosenectute.ch). All participants were native Swiss German speakers, raised monolingually until the age of seven years, right-handed, and retired. Our selection process involved a preliminary telephone interview, followed by a detailed on-site interview for final participants. We confirmed right-handedness using the Annett Hand Preference Questionnaire (Annett, 1970). Exclusion criteria were set to avoid confounding factors such as a professional music career, dyslexia, significant neurological impairments including dementia, depression, severe or asymmetrical hearing loss (defined as average pure-tone hearing threshold exceeding 60 dB hearing level (HL) or an interaural difference exceeding 20 dB HL), and the use of hearing aids. Eventually, we had to exclude one participant due to technical issues. This study was conducted with the approval of the Ethics Committee of the Canton of Zurich (BASEC No. 2019-01400). All participants provided written informed consent and received monetary compensation for their participation. A study session lasted approximately three and a half hours and was conducted at the Linguistic Research Infrastructure (LiRI, liri.uzh.ch).
Montreal Cognitive Assessment-based cognitive grouping
In line with Bidelman et al. (2017), we grouped participants based on cognitive performance as assessed by the Montreal Cognitive Assessment (MoCA; Nasreddine et al., 2005). All experiment instructors had completed training certification prior to data collection (see mocacognition.com/training-certification). MoCA scores range from 0 to 30 points, where higher scores indicate better cognitive function. Using a cutoff score of 26 points, the clinical threshold for mild cognitive impairment (MCI), participants were divided into two groups: participants with signs of cognitive decline (low MoCA group; n = 19, 23.6 ± 2.1 points; 12 women, 7 men) and participants with normal cognitive performance (normal MoCA group; n = 25, 27.4 ± 1.2 points; 14 women, 11 men). The distribution of MoCA scores across groups is depicted in Figure 1A. Age for the low MoCA group was 71.7 ± 6.3 (range: 60–82) years and for the normal MoCA group, 68.6 ± 5.3 (range: 60–83) years. The age difference between the groups was not statistically significant (two-tailed Mann–Whitney U test: U = 311.5, p = 0.081, r = 0.31).
Audiometric testing
We conducted audiometric testing using the Affinity Compact audiometer (Interacoustics, Middelfart, Denmark), equipped with a DD450 circumaural headset (Radioear, New Eagle, PA, USA). Participants sat in a soundproof booth to minimize external interference during the assessments.
Pure-tone audiometry
Pure-tone hearing thresholds were measured for both ears at frequencies of 125, 250, 500, 1,000, 2,000, 4,000, and 8,000 Hz, with individual thresholds averaged by ear presented in Figure 1B. We based the overall hearing ability on the four-frequency pure-tone average (PTA), calculated as the average of thresholds at 500, 1,000, 2,000, and 4,000 Hz (Lin and Reed, 2021). The interaural PTA difference was 4.5 ± 4.4 (range: 0–17.5) dB HL, indicating symmetrical hearing. PTA was then averaged across both ears, with the low MoCA group averaging 20.4 ± 11.5 dB HL and the normal MoCA group averaging 19.6 ± 11.8 dB HL. Most participants had no to mild hearing impairment, although seven had moderate impairment, i.e., a PTA exceeding 34 dB HL (Humes, 2019), two of them in the low MoCA group and five in the normal MoCA group. No significant difference in hearing ability was observed between groups (Mann–Whitney U test: U = 262.5, p = 0.561, r = 0.11). However, PTA correlated with age (Fig. 1C; ρ = 0.5, p = 0.0003, Fisher’s z = 0.57, n = 44).
Speech-in-noise
To complement pure-tone audiometry, we conducted a speech-in-noise (SiN) test, which is more cognitively demanding and relevant for understanding cognitive decline (Moore et al., 2014). Using the SiN test provided by the Affinity software, participants repeated two-syllable compound words embedded in white noise (fixed at 65 dB HL), with the speaker’s level adjusted for comprehension. We determined the SiN perception level as the signal-to-noise ratio (SNR) at which participants correctly repeated a word at least twice out of three repetitions. SNR loss values (higher values indicating poorer perception) were averaged across both ears. The low MoCA group had an average SNR loss of 5.5 ± 2.5 (range: 2–12) dB, and the normal MoCA group 5.9 ± 2.5 (range: 2–14) dB, within the normal to mild loss range (Taylor, 2003). We observed no significant difference in SNR loss between groups (Mann–Whitney U test: U = 216.5, p = 0.612, r = −0.09), and SNR loss was not age-related (ρ = −0.1, p = 0.497, Fisher’s z = 0.11, n = 44).
PTA and SNR loss were correlated (ρ = 0.5, p = 0.0003, Fisher’s z = 0.57, n = 44), indicating that the two measures of hearing ability yield comparable results. Finally, we chose PTA as the control variable for hearing ability in further analyses.
EEG recording
We recorded EEG using the Biosemi ActiveTwo system (Biosemi, Amsterdam, The Netherlands) with 32 scalp electrodes (10–20 system) and four external electrodes, at a sampling rate of 16.384 kHz. We positioned two external electrodes above and below the right eye to measure the electrooculogram (EOG), and two on the left and right mastoids. Throughout the experiment, we monitored and maintained electrode offsets below 20 μV. Participants sat in a soundproof, electromagnetically shielded booth for the recordings. During the listening tasks, we instructed them to focus on a fixation cross displayed on a screen and to minimize movement, especially when the cross was visible. The entire EEG recording session, including preparation and breaks, lasted approximately 90 min. This session comprised four parts: an initial click-ABR measurement, a resting-state measurement, and two speech-related paradigms using audiobook segments and matrix sentences, respectively. The resting-state and matrix-sentence paradigms are not discussed in this paper.
Audio playback and experiment control
To display instructions and play auditory stimuli to participants, we used the Presentation® software (Neurobehavioral Systems, Inc., Berkeley, CA, USA; research resource identifier (RRID): SCR_002521). However, we opted for audio file triggering to achieve the level of temporal precision required for neurophysiological experiments concerned with subcortical processing. Specifically, we encoded the onset of auditory inputs into the stereo wav container. The first channel contained the auditory stimulus—speech and click trains, respectively—while the second channel held pulses coinciding with the analysis-relevant sound onsets in the stimulus. The first channel was connected to the audio system, and the second channel was routed to the EEG system. This integration eliminated the latency between the playback of the stimulus on the sound card and the input of the port code into the EEG system. Speech and click trains were presented bilaterally through electromagnetically shielded ER3 insert earphones (Etymotic Research, Elk Grove Village, IL, USA). The sound card used was an RME Fireface UFX II (RME Audio, Haimhausen, Germany). To validate the timing accuracy, we measured the latency between the input of the trigger into the EEG system and the output at the earphones using an artificial ear. This resulted in a constant latency of 1.07 ms, which we accounted for in the analysis.
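The idea of encoding triggers into the second channel of the stimulus file can be sketched in a few lines of Python. The following is a minimal illustration, not the study’s actual stimulus-preparation script; file names, the pulse duration, and the onset list are placeholders.

```python
import numpy as np
import soundfile as sf

fs = 44_100                                    # stimulus sampling rate (Hz)
speech, _ = sf.read("segment_01.wav")          # hypothetical mono stimulus file
onsets_s = [0.0]                               # analysis-relevant sound onsets in seconds (placeholder)

# Build the trigger channel: a short rectangular pulse at each onset.
trigger = np.zeros_like(speech)
pulse_len = int(0.001 * fs)                    # 1 ms pulse (assumed duration)
for onset in onsets_s:
    start = int(onset * fs)
    trigger[start:start + pulse_len] = 1.0

# Channel 1 is routed to the audio system, channel 2 to the EEG trigger input.
stereo = np.column_stack([speech, trigger])
sf.write("segment_01_with_triggers.wav", stereo, fs)
```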
Click stimuli
Following the procedure and technique outlined by Bachmann et al. (2021), we conducted click-ABR recordings. The click-ABR measurements involved presenting a train of rectangular-shaped clicks, each lasting 80 μs, with alternating polarities, and a repetition rate of 10 Hz. Each participant listened to 3,000 click repetitions, which amounted to a total duration of 5 min. For consistent presentation, we employed the peak equivalent sound pressure level (peSPL) calibration method (Laukli and Burkard, 2015) to set the sound level. Clicks were presented at 93 dB peSPL.
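As an illustration of the stimulus parameters described above, a minimal sketch of such a click train is given below; the sampling rate and the unit amplitude are assumptions, and the actual stimuli were generated and calibrated with the study’s own tooling.

```python
import numpy as np

fs = 44_100                                   # assumed playback sampling rate (Hz)
n_clicks = 3_000                              # 3,000 click repetitions (5 min at 10 Hz)
rate_hz = 10                                  # repetition rate (Hz)
click_len = max(1, round(80e-6 * fs))         # 80 us rectangular click (~4 samples at 44.1 kHz)

train = np.zeros(int(n_clicks * fs / rate_hz))
for i in range(n_clicks):
    start = int(i * fs / rate_hz)
    train[start:start + click_len] = 1.0 if i % 2 == 0 else -1.0   # alternating polarity
```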
Natural speech stimuli and task
We used excerpts from Sylvia Plath’s novel, “The Bell Jar”, identical to those used by Kurthen et al. (2021). Participants listened to 25 segments from the German audiobook version, which featured a professional female speaker with a fundamental frequency (F0) of 160.5 ± 8.9 Hz. The recording was sampled at 44.1 kHz and was organized into content-coherent segments. We used Praat (Boersma and Van Heuven, 2001, RRID: SCR_016564) to limit silent gaps in the recordings, identified with an intensity-based silence threshold.
To sustain participants’ attention, we interspersed the listening sessions with simple comprehension questions related to the content they were hearing. The first question was asked after the first segment to familiarize participants with the task. The remaining questions were asked after every third segment, so we analyzed a total of eight questions (excluding the first one) to assess participants’ attention. These questions followed a forced-choice format, in which participants selected one of four answer choices by pressing a key.
Data preprocessing
Our auditory EEG paradigm enabled the derivation of both a subcortical and a cortical TRF from the same speech and EEG data through different preprocessing steps. Crucial to this paradigm are the high sampling rate and the use of trigger pulses with the highest temporal precision we could achieve with our equipment. To supplement the TRF analyses, we also computed the click-ABR and the AEP derived from speech onsets. We performed the data preprocessing using Python (version 3.11.6) and the MNE-Python package (Gramfort et al., 2013, RRID: SCR_005972). Before processing the speech waves, we removed the silent periods before the trigger pulses so that time zero of the speech wave matched time zero in the EEG signal. Before processing the EEG, we examined the raw data to identify bad electrodes, with an average of 1.5 ± 1.6 bad electrodes per participant. After their removal, the EEG signals were referenced to the linked mastoids, with the exception of six participants with noisy mastoid electrodes, for whom the cap electrodes T7 and T8 were used (Bachmann et al., 2021). We processed speech and EEG signals with equivalent filters whenever possible, and for the subcortical analyses, causal filters were used in the data preprocessing pipeline whenever feasible (Maddox and Lee, 2018; Bachmann et al., 2021). Anti-alias filters were consistently applied at a cutoff frequency of 1/3 of the target sampling rate.
Subcortical data
Our protocol for the subcortical processing pipeline was largely based on Maddox and Lee (2018), Bachmann et al. (2021), and Kulasingham et al. (2024). We extracted speech features from the audiobook segments using the auditory nerve firing rate model by Zilany et al. (2014). This model, based on inner ear physiology, simulates auditory nerve responses to acoustic stimuli and has proven effective in subcortical TRF modeling, outperforming other speech-derived features (Kulasingham et al., 2024; Shan et al., 2024). We adapted the implementation by Shan et al. (2024) with the Python cochlea package (Rudnicki et al., 2015). The process involved upsampling the speech wave to 100 kHz and converting it to a pascal pressure waveform at 70 dB SPL. We then simulated the firing rates of 43 auditory nerve fibers across logarithmically spaced center frequencies from 125 Hz to 16 kHz. The model outputs, representing the nerve rate response, were averaged to produce a single time series, which was then downsampled back to 44.1 kHz. This procedure was performed for both the original and inverted speech waves, following the methodologies described by Maddox and Lee (2018), Shan et al. (2024), and Kulasingham et al. (2024).
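The feature-extraction step can be summarized in the following sketch. It is a simplified illustration: `simulate_anf_rates` is a hypothetical stand-in for the Zilany et al. (2014) model call (our implementation used the cochlea package), and the pressure scaling is reduced to its essentials.

```python
import numpy as np
from scipy.signal import resample_poly

FS_MODEL = 100_000   # sampling rate required by the auditory nerve model (Hz)
FS_AUDIO = 44_100    # stimulus sampling rate (Hz)


def to_pascal(wave, level_db_spl=70.0):
    """Scale a waveform to a pascal pressure waveform at the given level (simplified)."""
    p0 = 20e-6  # reference pressure (Pa)
    rms_target = p0 * 10 ** (level_db_spl / 20)
    return wave / np.sqrt(np.mean(wave ** 2)) * rms_target


def nerve_rate_feature(speech, fs=FS_AUDIO):
    """Average simulated auditory nerve firing rates over 43 log-spaced fibers (sketch)."""
    speech_100k = resample_poly(speech, FS_MODEL, fs)      # upsample to 100 kHz
    pressure = to_pascal(speech_100k)
    cfs = np.geomspace(125, 16_000, 43)                    # center frequencies (Hz)
    rates = np.stack([
        simulate_anf_rates(pressure, FS_MODEL, cf)         # hypothetical model call
        for cf in cfs
    ])
    mean_rate = rates.mean(axis=0)                         # average across fibers
    return resample_poly(mean_rate, fs, FS_MODEL)          # downsample back to 44.1 kHz
```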
In alignment with clinical ABR settings and conventions in subcortical TRF modeling (Bachmann et al., 2021; Kulasingham et al., 2024), we used the EEG signal from the vertex electrode Cz. For three participants with noisy Cz electrodes, we used Pz instead as vertex electrode, as previously done by Bachmann et al. (2020). We subjected speech features and EEG to the following preprocessing steps: to eliminate power line noise, we applied infinite impulse response (IIR) notch filters at all multiples of 50 between 50 and 1,350 Hz to the continuous EEG (zero-phase non-causal Butterworth filter, effective order: 8, all transition bandwidths: 5 Hz). We applied an anti-alias filter to the speech features at 1,365.3 Hz using a one-pass, non-linear phase, causal low-pass finite impulse response (FIR) filter with a Hamming window (0.0194 pass-band ripple, 53 dB stop-band attenuation, upper transition bandwidth: 409.6 Hz, filter length: 357 samples), after which the speech features were resampled to the target rate of 4,096 Hz using the resample function of MNE. The EEG data were filtered from 80 to 1,365.3 Hz using a one-pass, non-linear phase, causal band-pass FIR filter with a Hamming window (0.0194 pass-band ripple, 53 dB stop-band attenuation, lower transition bandwidth: 20 Hz, upper transition bandwidth: 409.6 Hz, filter length: 2,705 samples). We chose the additional lower limit of 80 Hz to reduce cortical contributions (Bidelman et al., 2017; Bachmann et al., 2021). Continuous EEG data were segmented into epochs from −4 to 54 s, in which time point zero corresponds to the onset of speech, and then decimated to 4,096 Hz. For minimal EEG cleaning, we zeroed out 1-s segments around samples exceeding 100 μV, following an approach comparable to Maddox and Lee (2018) and Kulasingham et al. (2024). Initially, we replaced the bad segments with NaNs, which would then be replaced with zeros after feature-target scaling described in “TRF modeling”. Finally, we truncated the first and last second of each speech feature, and cropped the EEG epochs to 1–49 s relative to speech onset, which corresponds to the length of the longest speech segment post-truncation. We performed this step to avoid fitting the model to the neural response elicited by the onset of the sound and to discard possible filtering artifacts (Crosse et al., 2021).
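A condensed sketch of the central subcortical filtering steps with MNE’s filtering functions follows; the data array is a placeholder, and the exact filter parameters (transition bandwidths, filter lengths) reported above are handled in the released scripts.

```python
import numpy as np
import mne

fs_raw = 16_384            # EEG sampling rate (Hz)
target_fs = 4_096          # subcortical analysis rate (Hz)

# Placeholder standing in for the continuous vertex-channel EEG (1 channel x n samples).
eeg = np.random.randn(1, fs_raw * 60)

# Notch out power line noise at all multiples of 50 Hz up to 1,350 Hz (IIR filters).
eeg = mne.filter.notch_filter(eeg, fs_raw, freqs=np.arange(50, 1351, 50), method="iir")

# Band-pass from 80 Hz up to the anti-alias corner at 1/3 of the target rate;
# phase="minimum" requests a one-pass, causal (non-linear-phase) FIR filter.
eeg = mne.filter.filter_data(eeg, fs_raw, l_freq=80, h_freq=target_fs / 3, phase="minimum")
```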
Cortical data
A recent study by Lindboom et al. (2023) showed that responses from models of subcortical auditory processing, such as simulated auditory nerve responses, also improve the performance of cortical stimulus-response models compared to, for example, the envelope or spectrogram of a speech stimulus. Inspired by their work, we decided to also use the nerve rate response derived from the original speech wave as the speech feature for cortical TRF modeling. All filters applied in the cortical processing pipeline were zero-phase, non-causal FIR filters with a Hamming window (0.0194 pass-band ripple, 53 dB stop-band attenuation) with varying pass-band frequencies and transition bandwidths. We subjected the speech features and EEG to the following preprocessing steps: before downsampling, we applied an anti-alias filter at 170.7 Hz (upper transition bandwidth: 51.2 Hz, filter length: 775 samples for the speech features, 1,057 for the EEG due to the difference in sampling rates) to the speech features and the continuous EEG. We segmented the continuous EEG data into epochs from −4 to 54 s relative to speech onset and then decimated them to 512 Hz, while the speech features were resampled to 512 Hz. We created a copy of the epochs instance for independent component analysis (ICA) and high-pass filtered the epochs copy at 1 Hz (lower transition bandwidth: 1 Hz, filter length: 1,691 samples), a step which facilitates ICA decomposition (Klug and Gramann, 2020). We employed the Picard algorithm (Ablin et al., 2018) to perform ICA on the epochs copy with 1,000 iterations.
Within the same preprocessing pipeline, we also obtained AEP responses to the audiobook segment onsets. To this end, we copied the epochs after the final filtering step, cropped them to −300 to 600 ms relative to speech onset, applied a baseline correction from −200 to −50 ms, and then averaged the signals across epochs to obtain the evoked response to 25 speech onsets.
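For orientation, the AEP step corresponds to a few MNE calls, sketched here; `epochs` stands for the preprocessed cortical epochs object from the pipeline above.

```python
# Sketch of the AEP computation from the preprocessed cortical epochs (MNE-Python).
aep = (
    epochs.copy()
    .crop(tmin=-0.3, tmax=0.6)                 # window around speech onset (s)
    .apply_baseline(baseline=(-0.2, -0.05))    # baseline correction window (s)
    .average()                                 # evoked response across the 25 segment onsets
)
```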
Click-ABR data
The EEG signal from the click-ABR recording was subjected to the same subcortical filtering procedure as the EEG from the speech experiment, with the analysis performed on vertex electrode Cz and with the same filter specifications. After filtering, we segmented the EEG into epochs from −10 to 30 ms and applied a baseline correction from −10 to −5 ms. We retained the vertex electrode signals and excluded epochs that exceeded 40 μV, which led to an average exclusion of 117.8 ± 222.6 epochs per participant. We then averaged the signal over epochs to obtain the ABR response.
TRF modeling
Matrix representation of feature-target pairs
For TRF modeling, we stored the speech features and EEG epochs in matrix format, as required by the ReceptiveField class (see “Modeling procedure”). Since the audiobook segments varied in length and we wanted to avoid data loss, we zero-padded shorter segments up to the longest segment length instead of truncating longer segments to a uniform length corresponding to the shortest segment. Specifically, we padded shorter segments with NaN to reach the length of the longest segment, which was 48 s after preprocessing. The NaNs were replaced with zeros after feature-target scaling, as described in “Modeling procedure”. The speech features and the EEG were thus stored in matrices with one row per segment and one column per time sample (with an additional electrode dimension for the cortical EEG).
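A minimal sketch of the padding step is shown below; the segment lengths are illustrative, and the function name is ours rather than from the released scripts.

```python
import numpy as np


def pad_segments(segments, pad_value=np.nan):
    """Stack variable-length 1-D segments into a (n_segments, max_len) matrix, padding with NaN."""
    max_len = max(len(seg) for seg in segments)
    out = np.full((len(segments), max_len), pad_value)
    for i, seg in enumerate(segments):
        out[i, : len(seg)] = seg
    return out


# Example: three speech-feature segments of unequal length.
features = pad_segments([np.ones(100), np.ones(80), np.ones(120)])
print(features.shape)  # (3, 120)
```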
Modeling procedure
To quantify speech-evoked subcortical and cortical responses, we used forward TRF models (Crosse et al., 2016). We built the models using the ReceptiveField class from MNE and cross-validation tools from Scikit-Learn (Pedregosa et al., 2011, RRID: SCR_002577). These models are based on a time-lagged linearized ridge regression with a Laplacian regularization term that imposes a smoothing condition on nearby time samples. The deconvolution equation for the TRF model weights vector w is given below.
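In the standard formulation of the mTRF framework (Crosse et al., 2016)—with S the matrix of time-lagged speech features, r the EEG response, λ the regularization parameter, and M the Laplacian (second-derivative) smoothing matrix—the regularized least-squares solution reads:

\[ \mathbf{w} = \left(\mathbf{S}^{\top}\mathbf{S} + \lambda\,\mathbf{M}\right)^{-1}\mathbf{S}^{\top}\mathbf{r} \]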
Participant-specific tuning involved randomly splitting the data into training and test sets, with 22 segments used for training and three for testing. For feature-target scaling, we z-standardized the EEG responses globally before the split and normalized the speech features using training set statistics. After this scaling step, we replaced all NaNs—both in the padded speech features and in the subcortical EEG 1-s segments marked for zeroing—with zeros. We opted for this approach so that the scaling procedure would not be affected by the zero-padding and zeroing-out of the data. We determined the optimal λ by performing a grid search over λ = 10^n, with n ∈ {−3, −2, …, 9}, using leave-one-out cross-validation. Specifically, each of the 22 segments in the training set was used as a validation segment once. Pearson’s correlation coefficients between predicted and actual EEG responses served as encoding accuracies. We averaged the encoding accuracies across all cross-validation folds and chose the λ that maximized the mean encoding accuracy. A final model was fitted on the training set with the optimal λ and then tested for encoding accuracy on the held-out test set, which did not contribute to the final TRF model. Subcortical TRF models were fitted for time lags between −10 and 30 ms; cortical models spanned −300 to 600 ms. We fitted the subcortical model twice (for the original and the inverted speech features) and averaged the weights and encoding accuracies to obtain a unified subcortical TRF model. This procedure has been shown to reduce the influence of stimulus artifacts, similar to the common alternating-polarity technique used in click-ABR (Maddox and Lee, 2018; Bachmann et al., 2023; Kulasingham et al., 2024; Shan et al., 2024). To assess the encoding accuracies of the cortical models, we averaged the encoding accuracies across all electrodes.
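A compact sketch of this fitting procedure with MNE’s ReceptiveField and TimeDelayingRidge classes is given below. Array shapes, the random data, and the plain leave-one-segment-out loop are illustrative; feature-target scaling, NaN handling, and the subcortical variant are covered in the released scripts.

```python
import numpy as np
from mne.decoding import ReceptiveField, TimeDelayingRidge

sfreq = 512.0                                   # cortical analysis rate (Hz)
# Illustrative arrays: X is (n_times, n_segments, n_features), y is (n_times, n_segments, n_channels).
rng = np.random.default_rng(0)
X = rng.standard_normal((10 * 512, 22, 1))
y = rng.standard_normal((10 * 512, 22, 32))

best_lambda, best_score = None, -np.inf
for exponent in range(-3, 10):                  # grid search over lambda = 10**n
    lam = 10.0 ** exponent
    estimator = TimeDelayingRidge(tmin=-0.3, tmax=0.6, sfreq=sfreq,
                                  alpha=lam, reg_type="laplacian")
    rf = ReceptiveField(tmin=-0.3, tmax=0.6, sfreq=sfreq,
                        estimator=estimator, scoring="corrcoef")
    fold_scores = []
    for val in range(X.shape[1]):               # leave-one-segment-out cross-validation
        train = np.setdiff1d(np.arange(X.shape[1]), val)
        rf.fit(X[:, train, :], y[:, train, :])
        fold_scores.append(rf.score(X[:, [val], :], y[:, [val], :]).mean())
    if np.mean(fold_scores) > best_score:
        best_lambda, best_score = lam, float(np.mean(fold_scores))
```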
Code accessibility and data availability
The Python scripts for data preprocessing described in “Data preprocessing” and for TRF modeling described in “TRF modeling” are available at github.com/elbolt/subCortex-speech. The data used in this study are available upon request from the corresponding author.
Cluster-based response peak extraction
We extracted response peaks from the subcortical and cortical TRF weights, the click-ABR, and the AEP. For subcortical analyses, we focused on wave V, while for cortical analyses, we concentrated on the N1 and P2 peaks. We applied baseline corrections to the TRF weights by subtracting the mean weights between −10 and −5 ms for subcortical models and between −200 and −50 ms for cortical models. These weights were then z-standardized across all participants. Click-ABR and AEP data remained in μV units. We initially examined the global TRF and evoked response waveforms averaged across all participants. Figure 2 shows the neural components we obtained from our data, with the TRF models in the top half and the evoked responses in the bottom half. The global subcortical TRF at vertex electrode Cz showed a peak at 9.3 ms, resembling wave V. The global cortical TRF response across electrodes exhibited a P1–N1–P2-like complex, with an average N1 peak at 91 ms and a P2 peak at 170 ms, most pronounced at frontotemporal electrodes. The lower half of Figure 2 displays the global click-ABR and AEP waveforms. In the click-ABR, we identified a peak at 5.8 ms as wave V. The global AEP showed N1 and P2 peaks at 115 ms and 256 ms, respectively, with a frontotemporal distribution similar to that of the cortical TRF.
Electrode clustering
Before performing the analysis in an electrode cluster, we investigated whether there were electrode-based cortical processing differences between the groups. To this end, we determined the mean amplitude within a 50 ms window around the grand average N1 and P2 peaks (shown in Fig. 2) for each participant and each electrode and calculated permutation tests (Maris and Oostenveld, 2007) to compare the mean amplitudes between the groups. Since this analysis did not reveal any significant clusters (not shown), we decided to base our peak extraction on an auditory-relevant frontotemporal cluster, consisting of electrodes F3, FC1, FC5, FC6, FC2, and F4, as in previous studies (Di Liberto et al., 2015; Hjortkjær et al., 2020; Bachmann et al., 2021; Bachmann, 2022).
Peak extraction
For each participant, we identified a subcortical wave V-like peak between 4 and 11 ms and extracted its mean amplitude using a 3 ms window centered on the peak. For cortical responses, we identified N1-like peaks within 50–120 ms and P2-like peaks within 120–300 ms in the mean response of the frontotemporal cluster and extracted the mean amplitudes within a 50 ms window around these peaks. The click-ABR peak was similarly extracted between 4 and 9 ms, while the AEP N1 and P2 peaks were extracted within 50–150 ms and 150–300 ms windows, respectively.
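The window-based extraction can be sketched as follows; times are in seconds, and the waveform here is random placeholder data rather than an actual TRF or evoked response.

```python
import numpy as np


def extract_peak(waveform, times, search_window, mean_window, polarity=1):
    """Find the largest peak of the given polarity within a latency window and
    return the mean amplitude in a window centered on that peak, plus its latency."""
    mask = (times >= search_window[0]) & (times <= search_window[1])
    idx_in_window = np.flatnonzero(mask)
    peak_idx = idx_in_window[np.argmax(polarity * waveform[mask])]
    half = mean_window / 2
    around = (times >= times[peak_idx] - half) & (times <= times[peak_idx] + half)
    return waveform[around].mean(), times[peak_idx]


# Example: N1-like (negative) peak between 50 and 120 ms with a 50 ms averaging window.
times = np.arange(-0.3, 0.6, 1 / 512)
trf_weights = np.random.randn(times.size)
amplitude, latency = extract_peak(trf_weights, times, (0.05, 0.12), 0.05, polarity=-1)
```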
Statistical analyses
We performed all statistical analyses, including the statistical parameters reported in the Materials and Methods section, using JASP (version 0.18.3; RRID: SCR_015823). Given the limited sample size and the deviation from normality of the data residuals, we used non-parametric tests. In general, we analyzed the data using frequentist approaches, but supplemented the testing of our two hypotheses with an analytical Bayesian approach. This combination ensures robustness under non-normal conditions and allows a nuanced interpretation of the parameter uncertainties. We calculated Spearman’s rank correlation coefficient when examining correlations and Mann–Whitney U tests to compare variables between MoCA groups.
In testing our first hypothesis that peak amplitudes are increased with cognitive decline, we performed an additional Bayesian Mann–Whitney U test to obtain a Bayes factor (BF) for each peak amplitude, assuming a Cauchy prior width of γ = 0.707, the default setting in JASP and a prior that is considered weakly informative (Kruschke, 2015). In addition to the BF, we evaluated the region of practical equivalence (ROPE) to assess the evidence for the absence of an effect (Kruschke, 2018). We defined the ROPE bounds as ±0.1, aligning with Cohen’s criteria for a small effect size. Using a custom Python script, we determined the proportion of the posterior distributions—represented by their samples of the effect size—that fell within the ROPE bounds.
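In essence, this step counts the share of posterior samples that fall within the ±0.1 bounds, as in the following sketch; the posterior samples here are simulated rather than taken from the JASP output.

```python
import numpy as np

rope = (-0.1, 0.1)                                       # region of practical equivalence
posterior_delta = np.random.normal(0.05, 0.2, 10_000)    # placeholder posterior samples of the effect size

in_rope = np.mean((posterior_delta > rope[0]) & (posterior_delta < rope[1]))
print(f"{in_rope:.1%} of the posterior mass lies within the ROPE")
```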
In testing our second hypothesis that peak amplitudes predict signs of cognitive decline, we implemented generalized linear models assuming a Bernoulli distribution and a corresponding logit link function equivalent to logistic regression. Building on the approach of Bidelman et al. (2017), we used the absolute difference between the N1 and P2 peak amplitudes as a predictor in this model. Predictors thus included wave V amplitude, N1–P2 amplitude difference, and their interaction. Additionally, we controlled for PTA and age. The model was configured with the following formula:
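In Wilkinson notation, reconstructed from the predictors named above (the exact JASP syntax may differ slightly):

MoCA group ~ wave V amplitude × N1–P2 difference + PTA + age

A minimal sketch of an equivalent model in Python with statsmodels is shown below; our analyses were run in JASP, and the column names are hypothetical.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical data frame: one row per participant with peak amplitudes and covariates.
df = pd.read_csv("peak_amplitudes.csv")   # assumed columns: group, wave_v, n1_p2_diff, pta, age

# Logistic regression: Bernoulli GLM with a logit link, mirroring the model structure above.
model = smf.glm(
    "group ~ wave_v * n1_p2_diff + pta + age",
    data=df,
    family=sm.families.Binomial(),
).fit()
print(model.summary())
```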
Results
Behavioral performance in audiobook task
Overall, participants performed well on the audiobook questions, answering most of the eight analyzed questions correctly in both the low and the normal MoCA group.
Encoding accuracies
We first examined whether the encoding accuracies of the TRF models differed between the groups. This step was important to ensure that any observed differences in response peaks were not primarily confounded by fluctuations in encoding accuracy. The subcortical models showed lower accuracies than the cortical models, but these were comparable to those reported by Bachmann (2022) for participants in a similar age range. Subcortical and cortical encoding accuracies did not differ between groups (two-tailed Mann–Whitney U test: U = 287, p = 0.248, r = 0.21, and U = 223, p = 0.743, r = −0.06, respectively). Figure 3 shows the subcortical and cortical encoding accuracies plotted by group. As some participants’ subcortical models scored particularly low on the test set, we examined the correlations between subcortical and cortical encoding accuracy to determine whether low accuracy at the subcortical level was also associated with low accuracy at the cortical level. In the low MoCA group, subcortical and cortical accuracies were not associated (ρ = 0.2, p = 0.240, Fisher’s z = 0.17, n = 19). In the normal MoCA group, accuracies at both levels were also not significantly related (ρ = −0.3, p = 0.943, Fisher’s z = −0.34, n = 25). These results suggested that the encoding accuracies were not associated with cognitive grouping.
TRF models
Response latencies and amplitudes between groups
The subcortical TRF models showed visible variability across participants in the time-lagged response weights. We identified wave V-like peaks at a latency of 8.2 ± 2.5 ms in the low MoCA group and at 8.5 ± 2.2 ms in the normal MoCA group (range in both groups: 4–11 ms). Peak latency did not differ between groups (two-tailed Mann–Whitney U test: U = 228, p = 0.830, r = −0.04). The average response weights by group and the amplitudes of the individual wave V-like peaks are shown in Figure 4A. In contrast, the cortical TRF models showed a more consistent pattern of response weights. We identified an N1-like peak at a latency of 89 ± 11 (range: 76–115) ms in the low MoCA group and at 90 ± 15 (range: 52–115) ms in the normal MoCA group. The P2-like peak occurred at 175 ± 19 (range: 146–217) ms in the low MoCA group and at 181 ± 22 (range: 154–248) ms in the normal MoCA group. Neither N1 nor P2 peak latencies differed significantly between groups (two-tailed Mann–Whitney U test: U = 219.5, p = 0.672, r = −0.08, and U = 203.5, p = 0.422, r = −0.14, respectively). The average response weights by group and the individual N1- and P2-like peak amplitudes are shown in Figure 4B.
We then tested whether the amplitudes of the response peaks were increased in participants with signs of cognitive decline. Contrary to our expectation, amplitudes were not increased in the low MoCA group, neither for the wave V-like peak (one-tailed Mann–Whitney U test: U = 199, p = 0.820, r = −0.16) nor for the N1- and P2-like peaks (U = 243, p = 0.453, r = 0.02, and U = 232, p = 0.463, r = 0.02, respectively). Bayesian Mann–Whitney U tests confirmed these results, showing moderate evidence for the null hypothesis with a BF of 0.16.
Predictive potential of TRF weights for group
The TRF-based logistic regression results are summarized in Table 1. Contrary to our expectations, neither wave V amplitude (z = −0.66, p = 0.664), the N1–P2 difference (z = −0.34, p = 0.458), nor their interaction (z = 0.39, p = 0.440) emerged as a significant predictor of MoCA group in the frequentist framework. The control variables PTA (z = 0.95, p = 0.340) and age (z = −1.90, p = 0.058) also showed no significant effects. The Bayesian analysis supported these findings, yielding a BF of 0.55 for wave V amplitude, 0.48 for the N1–P2 difference, and 0.48 for their interaction, all indicating anecdotal evidence for the null hypothesis. For the control variables, with BFs of 0.64 and 1.41, respectively, the evidence was anecdotal for the null hypothesis for PTA and anecdotal against the null hypothesis for age. In the ROPE analysis, the posterior distributions likewise provided no clear evidence for an effect of any predictor.
Evoked responses
Response latencies and amplitudes between groups
In the individual click-ABR waveforms, we identified peaks of wave V with a latency of 5.7 ± 1.2 (range: 4–8.8) ms in the low MoCA group and at 5.5 ± 1.0 (range: 4–8.6) ms in the normal MoCA group. Latencies between groups did not differ significantly (two-tailed Mann–Whitney U test: U = 267, p = 0.241, r = 0.13). Although the AEP was based on the speech onsets of only 25 audiobook segments, all participants showed individual waveforms that allowed unambiguous identification of N1 and P2 peaks. We identified N1 peaks at a latency of 112 ± 20 (range: 52–138) ms in the low MoCA group and at 119 ± 10 (range: 91–138) ms in the normal MoCA group. The P2 peaks occurred at 252 ± 31 (range: 201–295) ms in the low MoCA group and at 251 ± 33 (range: 185–295) ms in the normal MoCA group. Neither N1 nor P2 latencies differed between groups (two-tailed Mann–Whitney U test: U = 185.5, p = 0.898, r = −0.22, and U = 238.5, p = 0.495, r = 0.01, respectively). The peaks of evoked responses by group and the individual peak amplitudes are shown in Figure 5.
Concerning the amplitudes of the response peaks, the same pattern emerged as in the TRF analysis. In the frequentist framework, the amplitudes were not increased in the low MoCA group. This was observed for the wave V peak (one-tailed Mann–Whitney U test: U = 256, p = 0.336, r = 0.08), and similarly for the N1 and P2 peaks (U = 224, p = 0.629, r = −0.06, and U = 273, p = 0.205, r = 0.15). The Bayesian tests confirmed these results. For the wave V peak, the BF was 0.51.
Predictive potential of evoked response peaks for group
The results of the evoked response-based logistic regression analysis are summarized in Table 2. In alignment with the TRF-based analysis, the frequentist approach revealed that the main effects and their interaction did not predict MoCA group, neither wave V amplitude (z = 0.22, p = 0.825), nor the N1–P2 difference (z = −0.36, p = 0.716), nor their interaction (z = −0.58, p = 0.563). Similarly, the control variables PTA (z = 0.61, p = 0.542) and age (z = −1.81, p = 0.071) were not significant. The Bayesian analysis suggested anecdotal evidence against the inclusion of these predictors, with a BF of 0.61 for wave V amplitude, 0.69 for the N1–P2 difference, and 0.63 for their interaction, indicating marginal support for the null hypothesis. In the ROPE analysis, the posterior distributions again provided no clear evidence for an effect of any predictor.
Discussion
The main objective of this study was to investigate the neurophysiological processing of natural continuous speech in older adults with and without signs of cognitive decline at both the subcortical and cortical levels. To this end, we used a novel paradigm based on TRF models that allowed us to examine speech encoding at both levels from the same EEG recording. Contrary to our first hypothesis, we did not observe increased TRF response amplitudes in the group with signs of cognitive decline, neither at the subcortical nor at the cortical level. This result is consistent with our analyses of the ABR to click sounds and the AEP to audiobook onsets. Additionally, response peak amplitudes from subcortical and cortical sources did not reveal any predictive value for cognitive decline, leading us to refute our second hypothesis. Thus, our results did not support the proposed hypothesis of auditory encoding hypersensitivity in the early stages of cognitive decline and could not confirm its predictive value for cognitive decline. Instead, auditory response patterns appeared to be similar in both groups of participants. While the Bayesian approach validated the frequentist analysis, it is important to note that, first, the evidence for the null hypotheses was only anecdotal and, second, the use of weakly informative default priors represents a conservative stance that may limit sensitivity to subtle but meaningful effects.
Our outcomes contrast with the findings of Bidelman et al. (2017), who found that older adults with signs of cognitive decline had increased FFR amplitudes in response to syllable sounds, which was then reflected in increased N1 and P2 AEP amplitudes at the cortical level. In their study, they trained a linear discriminant analysis classifier (Fisher, 1936) to discriminate participants with and without evidence of cognitive decline based on the FFR RMS amplitude in the 50–150 ms window and the N1–P2 AEP voltage difference. They were able to classify participants into the respective groups with above-chance cross-validated accuracy.
Components of auditory processing as predictors of cognitive decline
We hypothesized that TRF models could capture cognitive components of natural speech processing more comprehensively than evoked potentials. This is because TRF models incorporate the entire temporal sequence of stimulus and response, unlike evoked potentials, which average responses over a brief period following sound onset. Subcortical auditory activity is influenced both by bottom-up input signals and by top-down corticofugal connections from the auditory cortex (Chandrasekaran et al., 2014). Considering the hypersensitivity hypothesis in cognitive decline postulated by the findings of Bidelman et al. (2017), we anticipated similar or more pronounced effects in our TRF models, given the enhanced interplay between predictive top-down modulation of neural dynamics and bottom-up sensing in speech tracking (Zion Golumbic et al., 2012). The conclusions we drew from the TRF models bear limitations, particularly regarding the subcortical models, as discussed in “Methodological challenges of the subcortical TRF analysis”. Nevertheless, the analyses of evoked potentials confirmed the conclusions from the TRF models and are also not consistent with the hypersensitivity hypothesis. We would like to note that “hypersensitivity” within the scope of our study refers only to the increased subcortical and cortical responses observed by Bidelman et al. (2017) in individuals with signs of cognitive decline, and not to the cortical hyperactivity in auditory processing discussed mainly in the context of hearing loss (Herrmann and Butler, 2021, for an overview).
Focusing on subcortical processing, it is important to clarify that the click-ABR is not directly analogous to the FFR: we targeted wave V, a key component of subcortical auditory processing. We chose this focus because subcortical TRF models’ wave V-like peaks closely resemble the click-ABR wave V (Maddox and Lee, 2018; Bachmann et al., 2021), primarily originating from subcortical sources. The inferior colliculus in the midbrain is identified as the main generator of wave V (Møller and Jannetta, 1983). A simulation study by Saiz-Alía and Reichenbach (2020) suggested that the subcortical response to natural speech is also predominantly influenced by the inferior colliculus. In contrast, the FFR is influenced by both subcortical and cortical activities (Coffey et al., 2016, 2019), suggesting that cortical processing differences, reflected in N1–P2 complexes, might interact with FFR variations. Our results suggest that neither the amplitudes of the subcortical TRF nor the click-ABR waves are indicative of early cognitive decline. It might be that the FFR has a higher predictive potential for signs of cognitive decline after all. Compared to the click-ABR, which is used clinically primarily to assess auditory neuropathy and hearing loss, the FFR can detect other deficits in auditory processing that have more real-life consequences (Krizman and Kraus, 2019).
At the cortical level, our findings correspond with the conclusions drawn in the review by Morrison et al. (2018) on the use of AEP components as markers of cognitive decline: they reported that the N1 and P2 components did not consistently differ between patients with MCI, Alzheimer’s disease, and healthy controls. Rather, the review proposed that mismatch negativity (MMN)-related N200 and P300 components derived from active and passive oddball paradigms be examined as potential indicators of sensitivity to auditory differentiation and auditory attention, respectively. MMN-related components have also been associated with age-related differences in cortical thickness and surface area (Giroud et al., 2019), and may prove to be more sensitive to cognitive decline than the N1 and P2 components.
Looking ahead, further research is required to identify the most sensitive auditory processing components for cognitive decline. While we modeled acoustic aspects of speech through auditory nerve firing rates, TRF models can also incorporate linguistic features derived from the speech signal, such as word and phoneme rates, and even higher-order linguistic processing such as surprisal, entropy, and frequency at the word or phoneme level (Brodbeck et al., 2018; Weissbart et al., 2020; Gillis et al., 2022, 2023; Brodbeck et al., 2023). The potential of such models has already been explored, for example in differential speech processing in patients with aphasia after a stroke (Kries et al., 2023). There is also potential in the MMN-related components, against the background of which we would like to make an argument for the relevance of the stimulus material. Within the scope of the predictive coding theory of the auditory cortex (Heilbron and Chait, 2018), natural speech, particularly when supplemented with audiovisual elements such as the speaker’s face, as well as context manipulation, could be beneficial. This is supported by Chauvin et al. (2021), who found that, compared with healthy participants, patients with MCI benefited significantly in speech-in-noise comprehension when sentence context was enhanced or when audiovisual speech cues were available. Future research should therefore aim to identify the most effective auditory processing components for detecting cognitive decline, encompassing acoustic, linguistic, and cognitive aspects of speech processing.
Relationship between auditory processing and cognitive decline
While our results did not suggest a link between auditory processing and cognition in the measures we studied, this does not necessarily mean that there is no connection between the two. It is plausible that this link might be more pronounced under adverse listening conditions or during complex linguistic processing or speech-in-noise tasks, as suggested by models like the ease of language understanding model (Rönnberg et al., 2013). This perspective assumes that auditory–cognitive connections may be more prominent in demanding listening environments or in tasks that require higher linguistic processing. Consequently, in a generally healthy group of older individuals, such as our participants, these connections may only manifest in more demanding scenarios. This hypothesis is consistent with the potential importance of components such as the MMN for auditory differentiation and attention (Morrison et al., 2018; Giroud et al., 2019). A promising direction for future research would therefore be to investigate this relationship under more demanding listening conditions. The inclusion of this aspect could lead to a more differentiated understanding of the interplay between auditory processing and cognitive decline.
Participant selection and effects of the MoCA score on cognitive grouping
The selection of participants was a critical factor in our study. We speculate that individuals who choose to take part in research on cognition and hearing loss in old age are likely to represent rather high-functioning members of their age group, even if some of them achieved a low MoCA score. This selection bias affects the generalizability of our results. Factors such as participants’ motivation to take part in cognitive studies, their general health status, and their cognitive resilience play a central role.
In general, the use of the MoCA as a criterion for assigning participants to different cognitive groups poses challenges. While the MoCA is an effective, internationally used screening tool for the detection of potential cognitive decline, it is of course not equivalent to a comprehensive clinical diagnosis. In addition, some participants in our sample scored close to the threshold of 26 points. Nine participants scored 25, placing them in the low MoCA group, while six participants scored 26, placing them in the normal group. However, re-running the statistical pipeline while excluding participants with scores of 25 and 26 did not change the results. The same was true when we excluded the three participants who scored below 22 (Fig. 1A). In addition, we exploratively examined regression models that treated the MoCA score as a continuous variable (not shown in this paper), which also found no significant effects of the predictors on the MoCA score. The range of MoCA scores and the age range of our participants were comparable to those examined in the study by Bidelman et al. (2017). However, we would like to point out that the low MoCA group in their study was on average seven years older than the normal group, whereas in our study the difference was only three years, which could also have influenced the results.
Ideally, the inclusion of participants with clinically diagnosed MCI would have increased the robustness of our study. Furthermore, the study would have benefited from including a comprehensive neurocognitive test battery to provide improved cognitive grouping that was not solely based on the MoCA score. Our decision to use the MoCA was influenced by its ease of use and high validity as a rapid clinical screening tool for MCI (Sala et al., 2020), which is consistent with the methods used by Bidelman et al. (2017). In summary, although our study provides valuable insights into the cognitive functioning of older adults, the scope for interpretation is limited by participant selection factors and MoCA-based cognitive grouping.
Methodological challenges of the subcortical TRF analysis
There are some considerations and limitations of the subcortical models that we would like to point out. To begin with, our TRF models were trained on approximately 17 min of data. While this duration is generally adequate for cortical models (Crosse et al., 2021), its sufficiency for subcortical models is less certain. The main reason for not recording longer was that this paradigm was part of a larger EEG data collection, and we were therefore required to keep the audiobook part short. Kulasingham et al. (2024) emphasized that a minimum of 12 min is necessary to achieve a positive SNR in individual wave V-like peaks derived from TRF models, but they also showed that the SNR improves considerably with increasing recording length. This suggests that a longer recording would have yielded more robust subcortical models in our study. In addition, some of our subcortical TRF models performed poorly on the test set, as evidenced by negative encoding accuracies in some participants (Fig. 3). On the one hand, this could be partly due to our methodological decision to optimize the hyperparameter λ by leave-one-out cross-validation and to train the final model on the entire training set instead of using nested cross-validation (as used by Fuglsang et al., 2020; Bachmann et al., 2021). The rationale was that we did not have enough training data to justify a nested cross-validation approach. On the other hand, our approach might have led to overfitting, which could explain why some models did not generalize well to the test set. Another point of concern is the latency of wave V-like peaks in our subcortical TRF models. Compared with Maddox and Lee (2018), who reported an average peak latency of 6.2 ms in younger participants (19–45 years), our latencies were longer. However, they were comparable to the latency found by Bachmann (2022), who reported an average peak latency of 8.7 ms in older participants (54–80 years) with hearing loss.
In general, subcortical TRF models are not as well established as their cortical counterparts. We adapted our model fitting approach, which uses linearized ridge regression, from methods commonly used in cortical models (e.g., Fuglsang et al., 2020; Bachmann et al., 2021). In contrast, other studies used frequency-domain methods for subcortical TRF analysis (Maddox and Lee, 2018; Kulasingham et al., 2024; Shan et al., 2024). In addition, different speech features, such as rectified speech waves or auditory nerve firing rates, have already been used in subcortical TRF modeling. Determining the most effective method and the minimum data length required for reliable subcortical TRF models is still under investigation and requires further studies, such as those by Kulasingham et al. (2024) and by Bachmann et al. (2023), who showed that subcortical TRF models can be reliably estimated even when speech is presented in the sound field, which then requires a slightly longer recording time. The need for longer recordings for accurate subcortical TRF models would question their practical applicability, especially in clinical settings. What does speak in favor of their practical applicability, and of the TRF framework in general, is the fact that subcortical TRF models can be estimated from the same data as the cortical models. Thus, data acquisition—once the subcortical TRF is established in clinical application—would no longer require a separate paradigm (such as click or syllable paradigms). Overall, however, the application of subcortical TRF models in clinical practice remains an open question that requires further research. Our study contributes to this ongoing research by applying subcortical TRF models to a specific research question and helping to understand their application and limitations.
Conclusion and outlook
The search for early markers of cognitive decline remains a central topic in research on aging and cognition. Our study represents a first step in investigating the subcortical and cortical processing of natural continuous speech in cognitive decline for its clinically relevant potential. Our results did not indicate a significant association between early signs of cognitive decline and differences in auditory processing of natural continuous speech. Further research is essential to investigate the components of auditory cognitive speech processing in the context of cognitive decline. In particular, a better understanding of the relationships between different components of auditory processing, such as subcortical and cortical features, hearing loss, and their associations with cognitive decline, remains essential. In addition, exploring other components of auditory processing in terms of their predictive potential is crucial to expand our knowledge. Investigating these components could pave the way for more comprehensive and accurate methods for early detection and monitoring of cognitive decline. This study emphasizes the need to continue the study of auditory cognitive markers, focusing on a more comprehensive assessment of their interaction and predictive abilities, in order to make progress in the early detection and understanding of cognitive decline.
Footnotes
The authors declare no competing financial interests.
We would like to thank Katarina Kliestenec and Elainne Vibal for their valuable help with data collection and Andrew Clark from LiRI for his support in setting up the EEG experiment. We would also like to thank Florine Bachmann, Joshua Kulasingham, and Jens Hjortkjær for their inputs to this project at CHSCOM22 and ISAAR23. Finally, we would like to thank Fabian Schmidt and the other anonymous reviewer who contributed significantly to the improvement of this paper.
This study was funded by the Swiss National Science Foundation (SNSF, snf.ch, grant number PR00P1_185715 to N.G.). E.B. is a pre-doctoral fellow at International Max Planck Research School on the Life Course (IMPRS LIFE).
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license, which permits unrestricted use, distribution and reproduction in any medium provided that the original work is properly attributed.