Research Article: New Research, Cognition and Behavior

Neural Speech Tracking during Selective Attention: A Spatially Realistic Audiovisual Study

Paz Har-shai Yahav, Eshed Rabinovitch, Adi Korisky, Renana Vaknin Harel, Martin Bleichner and Elana Zion Golumbic
eNeuro 2 June 2025, 12 (6) ENEURO.0132-24.2025; https://doi.org/10.1523/ENEURO.0132-24.2025
Author affiliations: 1The Gonda Center for Multidisciplinary Brain Research, Bar Ilan University, Ramat Gan, 5290002, Israel (Paz Har-shai Yahav, Eshed Rabinovitch, Adi Korisky, Renana Vaknin Harel, Elana Zion Golumbic); 2Department of Psychology, Carl von Ossietzky Universität Oldenburg, Germany (Martin Bleichner)

Abstract

Paying attention to a target talker in multitalker scenarios is associated with more accurate neural tracking of that talker relative to competing non-target speech. This “neural bias” to target speech has largely been demonstrated in experimental setups where target and non-target speech are acoustically controlled and interchangeable. However, in real-life situations this is rarely the case. For example, listeners often look at the talker they are paying attention to while non-target speech is heard (but not seen) from peripheral locations. To enhance the ecological relevance of attention research, here we studied whether neural bias toward target speech is observed in a spatially realistic audiovisual context and how this is affected by switching the identity of the target talker. Group-level results show robust neural bias toward target speech, an effect that persisted and generalized after switching the identity of the target talker. In line with previous studies, this supports the utility of the speech-tracking approach for studying speech processing and attention in spatially realistic settings. However, a more nuanced picture emerges when inspecting the data of individual participants. Although reliable neural speech tracking could be established in most participants, this was not correlated with neural bias or with behavioral performance, and >50% of participants showed similarly robust neural tracking of both target and non-target speech. These results indicate that neural bias toward the target is not a ubiquitous, or necessary, marker of selective attention (at least as measured from scalp-EEG), and suggest that individuals diverge in their internal prioritization among concurrent speech, perhaps reflecting different listening strategies or capabilities under realistic conditions.

  • EEG
  • selective attention
  • spatial
  • speech processing
  • TRF

Significance Statement

This work contributes to ongoing efforts to study the neural mechanisms involved in selective attention to speech under ecologically relevant conditions, emulating the type of speech materials, multisensory experience, and spatial realism of natural environments. Group-level results show that under these more realistic conditions, the hallmark signature of selective attention—namely, the modulation of sensory representation and its robustness to switches in target identity—is conserved, at least at the group level. At the same time, results point to an underlying diversity among participants in how this modulation manifests, raising the possibility that differences in listening strategies, motivation, or personal traits lead to differences in the way that individuals encode and process competing stimuli under ecological conditions.

Introduction

Effectively directing attention to a particular talker, and prioritizing its processing over competing non-target speech, can be challenging. Selective attention to speech has been associated at the neural level with enhanced neural tracking of target speech, compared with non-target speech. This “neural bias” for target speech has been demonstrated in numerous EEG and MEG studies of selective attention to speech (Kerlin et al., 2010; Ding et al., 2012; Zion Golumbic et al., 2013b; O’Sullivan et al., 2015; Fiedler et al., 2019) and mirrors similar effects of selective attention in modulating sensory responses to simpler stimuli (Broadbent, 1958; Treisman, 1969; Hillyard et al., 1973; Näätänen et al., 1992). However, to date, this effect has mostly been studied under conditions that do not fully capture the real-life challenge of attention to speech. For example, the speech materials used in many studies are often composed either of short, context-less utterances (Brungart, 2001; Humes et al., 2017) or of recordings of audiobooks that are highly edited and professionally rehearsed and recorded (Fiedler et al., 2019; Fu et al., 2019), functioning more as “spoken texts” than as natural speech. In contrast, natural speech is continuous, contextual, and produced on the fly, resulting in added disfluencies, pauses, and repetitions (Agmon et al., 2023).

Another non-ecological aspect of many studies is that speech is presented only auditorily, often in a dichotic manner where the audio from different talkers is presented to different ears (Cherry, 1953; Bentin et al., 1995; Aydelott et al., 2015; Brodbeck et al., 2020; Kaufman and Zion Golumbic, 2023; Makov et al., 2023). However, in many real-life situations, listeners also look at the talker that they are paying attention to; hence, target speech is often audiovisual by nature and is emitted from a central location relative to the listener. In contrast, other non-target talkers are—by default—heard but not seen (unless listeners overtly move their head/eyes), and their audio emanates from peripheral spatial locations. Accordingly, under spatially realistic audiovisual conditions, there are stark qualitative differences between the sensory features of target and non-target speech, which likely assist listeners in focusing their attention appropriately (Fleming et al., 2021). Supporting this, it has been shown that having corresponding visual input of a talker facilitates speech processing as well as selective attention (Sumby and Pollack, 1954; Grant and Seitz, 2000; Schwartz et al., 2004; Ahmed et al., 2023a,b; Haider et al., 2024; Karthik et al., 2024; Wikman et al., 2024) and also improves the precision of neural speech tracking (Zion Golumbic et al., 2013a; Crosse et al., 2015; Fu et al., 2019).

The current study is part of ongoing efforts to increase the ecological validity of selective attention research and to advance our understanding of how the brain processes and prioritizes competing speech in the type of circumstances encountered in real life (Freyman et al., 2001; Ross et al., 2007; Tye-Murray et al., 2016; Shavit-Cohen and Zion Golumbic, 2019; Keidser et al., 2020; Uhrig et al., 2022; Brown et al., 2023). We capitalize on the potential of the speech-tracking approach for gaining insight into how the brain encodes and represents concurrent, continuous, and natural speech stimuli (Ding et al., 2012; Mesgarani and Chang, 2012; Zion Golumbic et al., 2013b; Kaufman and Zion Golumbic, 2023). To our knowledge, only a handful of previous studies have measured neural speech tracking in a spatially realistic audiovisual selective attention paradigm (O’Sullivan et al., 2019; Wang et al., 2023). In one such study, O’Sullivan et al. (2019) found that it is possible to determine from the neural signal whether a listener is paying attention to the talker they are looking at or if they are “eavesdropping” and paying attention to a peripheral talker whom they cannot see. These results nicely demonstrate the dissimilarity of the neural representation of audiovisual target speech and concurrent audio-only non-target speech. However, they leave open the question of the degree to which the brain suppresses irrelevant speech and exhibits “neural bias” for preferential encoding of target speech under spatially realistic audiovisual conditions.

There is a long-standing theoretical debate about how selective attention affects the neural representation of non-target stimuli. One possibility is that non-target speech is attenuated at an early sensory level (Broadbent, 1958; Treisman, 1960; Carlyon, 2004; Ding et al., 2018), and the degree to which non-target speech is represented/attenuated is thought to reflect the efficacy of selective attention. Alternatively, target and non-target speech can be co-represented at the sensory level, with selection occurring only at later stages (e.g., the level of linguistic/semantic processing; late-selection; Deutsch and Deutsch, 1963; Murphy et al., 2013). As noted, numerous studies have demonstrated reliable “neural bias” in the sensory representation of concurrent speech in dichotic listening paradigms, showing that the acoustic envelope of target speech is tracked more precisely than that of non-target speech (Kerlin et al., 2010; Zion Golumbic et al., 2013b; O’Sullivan et al., 2015; Fuglsang et al., 2017; Fiedler et al., 2019; Har-Shai Yahav et al., 2024). However, although this effect is robust when averaging across multiple participants, Kaufman and Zion Golumbic (2023) recently showed that it was driven by ∼30% of participants, whereas the majority of participants did not show reliable neural bias but exhibited comparable neural representation of both target and non-target speech in bilateral auditory cortex. This raised the possibility that suppression of non-target speech, at least at the sensory level, may not be a necessary component of selective attention and that differences between individuals may reflect different listening strategies or capability for multiplexed listening. However, that study, like most previous work in the field, used an auditory-only dichotic listening design, which does not capture the spatial realism of real-life audiovisual contexts.

Here we sought to replicate and extend our previous work using a more ecologically valid, spatially realistic audiovisual design and to study the relative representation of target and non-target speech in the brain under these circumstances. We simulated a common real-life situation in which individuals pay attention to a talker whom they can see (in this case, watching a video recording of a lecture) but also hear another talker off to the side, whom they cannot see and are asked to ignore. We presented the audio of both target and non-target speech in a free-field fashion, from their respective spatial locations, rather than through earphones, to ensure realistic spatial propagation of the sound. Importantly, we used unedited recordings of actual lectures delivered by academics for the general public, to preserve the natural properties of the speech. In addition, midway through the experiment we switched between the target and non-target talkers, to study how this affected listeners and to test the generalizability of results across talkers and over time.

We recorded participants’ neural activity using electroencephalography (EEG) and analyzed their neural tracking of target and non-target speech, focusing both on group averages, as is common in the field, and on individual-level data (Ding et al., 2012; Ding and Simon, 2012; Fuglsang et al., 2017; Rosenkranz et al., 2021; Kaufman and Zion Golumbic, 2023). We used data-driven statistics to determine the degree to which each talker is represented in the neural signal as well as the “neural bias” toward target speech. We also tested the reliability of results between the first and second half of the experiment and tested whether switching the identity of the target talker affected the pattern of neural tracking.

Materials and Methods

Participants

We collected data from 24 adult volunteers (16 female, 8 male), ranging in age from 19 to 34 years (M = 23.83, SD = ±3.42). All participants were fluent Hebrew speakers with self-reported normal hearing and no history of psychiatric or neurological disorders. The study was approved by the IRB ethics board of Bar-Ilan University, and participants gave their written informed consent prior to the experiment. Participants were either paid or received course credit for participating in the experiment. Data from one participant were excluded from all analyses due to technical issues during the EEG recording; therefore, all further analyses are reported for N = 23.

Speech stimuli

The stimuli consisted of two 20 min video recordings of a public lecture on popular science topics, one delivered by a male talker and the other by a female talker. Each video recording included the lecturer as well as the slides accompanying the talk. Both talkers gave their approval to use these materials for research purposes. Lecture videos were segmented and edited using the software Filmora (filmora.wondershare.net) and FFMPEG (www.ffmpeg.org). Lectures were cut into 63 segments ranging between 22 and 40 s each (lecture 1: M = 31.83, SD = ±4.2; lecture 2: M = 30.68, SD = 2.24). The varied lengths were necessary to ensure that the segments did not cut off the lecture mid-sentence or mid-thought. We equated the loudness of lecture segments using peak normalization, based on the average min and max dB across all segments (separately for each talker). Then, to equate the perceived loudness of the two talkers (male and female), we performed manual perceptual calibration, based on the feedback of five naive listeners. Loudness adjustment was performed using the software FFMPEG and Audacity (version 3.2.1; www.audacityteam.org). Ultimately, the experiment included 42 segments from the start and end of each lecture (21 each), and we discarded the content from the middle of the lecture (see experimental procedure).

Experimental procedure

The experiment was programmed and presented to participants using the software OpenSesame (Mathôt et al., 2012). Participants were seated in a comfortable chair in a sound-attenuated booth and were instructed to keep as still as possible. Participants viewed a video lecture (target), presented segment by segment, on a computer monitor in front of them, with the lecture audio presented through a loudspeaker placed behind the monitor. They were instructed to pay attention to the video and, after every three segments, were asked three multiple-choice comprehension questions regarding the content of the preceding segments (one question per segment, four possible answers). Participants received feedback regarding the correctness of their responses and indicated via button press when they were ready to continue to the next segment. In addition to the target lecture, audio from an additional non-target lecture was presented through a loudspeaker placed on their left side (Fig. 1A). Segments of non-target speech began 3 s after the onset of the target speech and included a volume ramp-up of 2 s (Fig. 1B). Both loudspeakers were placed at the same distance from the participant’s head (∼95 cm). We chose to present non-target speech only from the left side, rather than counterbalancing across both sides, to ensure a sufficient amount of data for TRF estimation without doubling the experiment length.

Figure 1.

Experimental setup. A, Two lectures were presented simultaneously, with one lecture (target talker) displayed on the screen and its audio emitted through the front loudspeaker. The other lecture (non-target talker) was played audio-only through the left loudspeaker. Participants were instructed to focus their attention on the lecture presented on the screen. Critically, in the middle of the experiment, the stimuli switched: the lecture that had been played from the side as the non-target in the first half was presented as a video in the second half and became the target, whereas the target talker from the first half was presented from a loudspeaker on the side and became the non-target. Participants answered comprehension questions regarding the target lecture after every three trials. B, Single trial illustration. Target speech began at trial onset, and non-target speech began 3 s after onset and included a 2 s volume ramp-up.

In the middle of the experiment, the stimuli were switched: The lecture that had been the non-target became the target lecture and was presented as a video in the second half, whereas the lecture that had been the target was presented as audio-only from a loudspeaker on the left and was the non-target (Fig. 1A). Importantly, different portions of each lecture were presented in each half of the experiment. When a lecture was designated as the target, it started from the beginning to ensure optimal comprehension and continued for 21 consecutive segments. When a lecture was designated as the non-target (and presented only auditorily), the last 21 segments of the lecture were played, also in consecutive order. In this way, segments from each lecture served both as target and non-target speech in different parts of the experiment (thus sharing the talker-specific attributes and general topic), but none of the content was repeated. The order of the starting lecture (male/female talker) was counterbalanced across participants, and participants were not informed in advance about the switch. Audio from on-ear microphones and eye movements were also recorded during the experiment, but their analysis is outside the scope of this study.

EEG data acquisition

EEG was recorded using a 64-channel Active Two system (BioSemi; sampling rate, 2,048 Hz) with Ag-AgCl electrodes placed according to the 10–20 system. Two external electrodes were placed on the mastoids and served as reference channels. Electrooculographic (EOG) signals were simultaneously measured by four additional electrodes, located above and beneath the right eye and on the external side of both eyes.

Behavioral data analysis

Behavioral data consisted of accuracy on the comprehension questions asked about each segment. These values were averaged across trials, separately for each half of the experiment, and for each participant. We used a two-tailed paired t test to evaluate whether accuracy rates differed significantly before and after the talker switch (first and second half).
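As an illustration, a minimal Python sketch of this comparison (not the authors' code; the variable names and per-trial data layout are hypothetical):

import numpy as np
from scipy import stats

# acc_first, acc_second: one array of per-trial accuracies (0/1) per participant,
# for the first and second half of the experiment (hypothetical layout)
def compare_halves(acc_first, acc_second):
    m1 = np.array([np.mean(a) for a in acc_first])   # mean accuracy per participant, first half
    m2 = np.array([np.mean(a) for a in acc_second])  # mean accuracy per participant, second half
    t, p = stats.ttest_rel(m1, m2)                   # two-tailed paired t test
    return t, p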

EEG preprocessing and speech tracking

EEG preprocessing and analysis were performed using the Matlab-based FieldTrip toolbox (Oostenveld et al., 2011), as well as custom-written scripts. Raw data was first rereferenced to the linked left and right mastoids and was bandpass filtered between 0.5 and 40 Hz (fourth-order zero-phase Butterworth filter). Data were then visually inspected and gross artifacts (that were not eye movements) were removed. Independent component analysis (ICA) was performed to identify and remove components associated with horizontal or vertical eye movements as well as heartbeats (identified through visual inspection). Any remaining noisy electrodes that exhibited either extreme high-frequency activity or low-frequency drifts were replaced with the average of their neighbors using an interpolation procedure (ft_channelrepair function in the FieldTrip toolbox). The clean data was then cut into trials, corresponding to portions of the experiment in which a single segment of the lecture was presented. These were divided according to which half of the experiment they were from—before and after switching the target talker (first and second half).
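For illustration, a minimal Python sketch of the zero-phase bandpass step (a sketch only; the study itself used FieldTrip in MATLAB, and the assumed EEG layout here is a channels-by-samples array):

import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_zero_phase(eeg, fs, low=0.5, high=40.0, order=4):
    """eeg: (n_channels, n_samples) array; fs: sampling rate in Hz."""
    # design a Butterworth bandpass and apply it forward and backward (zero phase shift)
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, eeg, axis=-1)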

To estimate neural responses to the speech from the two simultaneous lectures, we performed speech-tracking analysis, using both an encoding and a decoding approach. We estimated multivariate linear Temporal Response Functions (TRFs) using the mTRF MATLAB toolbox (Crosse et al., 2016), which constitutes a linear transfer function describing the relationship between a particular feature of the stimuli (S) and the neural response (R) recorded when hearing it.

Here S was a matrix composed of the broadband envelopes of the two speech stimuli presented in each trial, which were treated as separate regressors in a single multivariate regression model. Envelopes were extracted using an equally spaced filterbank between 100 and 10,000 Hz, with 20 frequency bands, based on Liberman's cochlear frequency map (Liberman, 1982). The narrowband filtered signals were summed across bands after taking the absolute value of the Hilbert transform of each one, resulting in a broadband envelope signal. The R used here was the continuous cleaned EEG data, bandpass filtered again between 1 and 20 Hz (fourth-order zero-phase Butterworth filter), since the speech-tracking response consists mostly of low-frequency modulations. S and R were aligned in time and downsampled to 100 Hz for computational efficiency. The first 5 s of each trial were excluded from data analysis due to the differences in onset and ramp-up period (Fig. 1B). In addition, the first trial and the trial immediately after the talker switch were omitted from data analysis, to avoid confounding effects associated with attentional ambiguity and adjustment to the new target talker. This resulted in a total of 40 analyzed trials (20 per half), each ∼30 s long (see above).
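A minimal Python sketch of this envelope-extraction step (illustration only; the log-spaced band edges below are a simple stand-in for the Liberman cochlear map, and the audio is assumed to be a mono array at an integer sampling rate fs_audio):

import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert, resample_poly

def broadband_envelope(audio, fs_audio, n_bands=20, f_lo=100.0, f_hi=10000.0, fs_out=100):
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)       # approximate cochlear spacing with log-spaced edges
    env = np.zeros_like(audio, dtype=float)
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs_audio, output="sos")
        band = sosfiltfilt(sos, audio)                   # narrowband-filtered signal
        env += np.abs(hilbert(band))                     # sum narrowband Hilbert envelopes across bands
    return resample_poly(env, fs_out, int(fs_audio))     # downsample broadband envelope to the analysis rate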

Univariate and multivariate encoding and decoding models were optimized separately for each half of the experiment. In the encoding approach, linear TRFs are estimated reflecting the neural response at each electrode to each of the two simultaneously presented stimuli, and the predictive power of the model reflects how well it predicts the actual neural response recorded. In the decoding approach, the neural data is used to reconstruct the envelope of each speech stimulus. TRF predictive power values (encoding) and reconstruction accuracies (decoding) were assessed using a leave-one-out cross-validation protocol. In each iteration, all trials except one were used to train the model (train set), which was then used to predict either the neural response at each electrode (encoding) or the two speech envelopes (decoding) in the left-out trial (test set). The predictive power of the encoding model is the Pearson's correlation (r value) between the actual neural response in the left-out trial and the response predicted by the model. The decoding reconstruction accuracy is calculated separately for the two speech stimuli presented in the left-out trial (target and non-target speech) and is the Pearson's correlation (r value) between the reconstructed envelope of each and the actual envelope. The reported TRFs, predictive power values, and reconstruction accuracies are the averages across all iterations.
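A minimal Python sketch of the leave-one-out logic (illustration only; fit_fn and predict_fn are hypothetical stand-ins for whatever model-fitting and prediction routines are used, e.g., TRF estimation):

import numpy as np

def pearson_r(x, y):
    x, y = x - x.mean(), y - y.mean()
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def leave_one_out(pairs, fit_fn, predict_fn):
    """pairs: list of (X, y) tuples, one per trial. For encoding, X is the stimulus
    and y the EEG; for decoding, X is the EEG and y the speech envelope."""
    scores = []
    for i, (X_test, y_test) in enumerate(pairs):
        train = pairs[:i] + pairs[i + 1:]          # all trials except the left-out one
        model = fit_fn(train)
        y_hat = predict_fn(model, X_test)          # predicted EEG (encoding) or envelope (decoding)
        scores.append(pearson_r(np.ravel(y_hat), np.ravel(y_test)))
    return float(np.mean(scores))                  # average r across folds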

Encoding TRFs were calculated over time lags ranging from −150 (pre-stimulus) to 450 ms, and the decoding analysis used time lags of 0–400 ms, as is customary in similar analyses (Crosse et al., 2016). To prevent overfitting of the model, a ridge parameter (λ) was chosen as part of the cross-validation process. This parameter significantly influences the shape and amplitude of the TRF; therefore, rather than choosing a different λ for each participant (which would limit group-level analyses), a common λ value was selected for all participants, namely the one that yielded the highest average predictive power across all channels and participants (see also Har-shai Yahav et al., 2024; Kaufman and Zion Golumbic, 2023). For both the encoding and decoding models, this optimal value was λ = 1,000. Note that decoding results were highly similar for λ's that were optimized separately for each participant (data not shown).
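The study used the mTRF toolbox for this estimation; as an illustration of the underlying operation, here is a minimal Python sketch of a lagged ridge regression (a sketch only, not the toolbox implementation; the lag range and λ mirror the values reported above):

import numpy as np

def lag_matrix(stim, lags):
    """stim: (n_samples,) regressor; lags: sample lags (negative = pre-stimulus)."""
    X = np.zeros((len(stim), len(lags)))
    for j, lag in enumerate(lags):
        if lag >= 0:
            X[lag:, j] = stim[:len(stim) - lag]
        else:
            X[:lag, j] = stim[-lag:]
    return X

def ridge_trf(stim, eeg, lags, lam=1000.0):
    """eeg: (n_samples, n_electrodes). Closed-form ridge: w = (X'X + lam*I)^(-1) X'y."""
    X = lag_matrix(stim, lags)
    XtX = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(XtX, X.T @ eeg)        # (n_lags, n_electrodes) TRF weights

# at 100 Hz, lags from -150 to 450 ms correspond to samples -15..45
lags = np.arange(-15, 46)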

EEG statistical analysis

Group-level statistics

The statistical significance of the predictive power and reconstruction accuracy of the encoding and decoding models was evaluated using permutation tests. For this, we repeated the encoding/decoding analysis procedure on shuffled S-R data, where the speech envelopes presented in one trial (S) were paired with the neural response recorded in a different trial (mismatched R). This procedure was repeated 100 times, yielding a null distribution of predictive power/reconstruction accuracy values that could be obtained by chance. The real data were then compared with this null distribution and were considered statistically significant if they fell within the top 5th percentile. To compare the speech-tracking response across conditions, we conducted a 2 × 2 repeated-measures ANOVA, as well as a Bayesian factor analysis, comparing the predictive power and reconstruction accuracies obtained for target versus non-target speech in each half of the experiment (before vs after the target talker switch; JASP-Team, 2022; version 0.16.3; prior distribution parameters: uniform Cauchy distribution with a scale parameter of r = 0.5, random effects r = 1, scale covariates r = 0.354).
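A minimal Python sketch of this mismatched S-R permutation (illustration only; score_fn is a hypothetical stand-in for the full cross-validated encoding/decoding pipeline described above):

import numpy as np

def sr_permutation_null(envelopes, eeg, score_fn, n_perm=100, seed=0):
    """envelopes, eeg: lists indexed by trial; score_fn(env_list, eeg_list) -> accuracy (r)."""
    rng = np.random.default_rng(seed)
    null = np.empty(n_perm)
    for k in range(n_perm):
        shuffled = rng.permutation(len(eeg))               # mismatch envelopes and EEG across trials
        null[k] = score_fn(envelopes, [eeg[i] for i in shuffled])
    return null

# the real score is considered significant if it exceeds np.percentile(null, 95)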

To test the generalizability of the speech-tracking patterns between the two halves of the experiment, we also tested how well decoders that were trained on data from one half of the experiment (either on target or non-target speech) could be used to accurately predict the stimuli presented in the other half of the experiment based on the neural data.

Individual-level statistics

Statistical analysis of individual-level data focused only on reconstruction accuracies (decoding approach), since this approach integrates responses across all electrodes, yielding a simpler metric for statistical analysis and avoiding multiple comparisons. We conducted a series of permutation tests to obtain data-driven statistics in each individual participant, designed to address different questions, as illustrated in Figure 2.

Figure 2.

Data-driven permutation tests for individual-level statistics. Three permutation tests were designed to assess the statistical significance of different results in individual participants. The black rectangles in all panels show the original data organization on the left and the relabeling for the permutation test on the right. A, S-R permutation test. In each permutation, the pairing between acoustic envelopes (S) and neural responses (R) was shuffled across trials such that the speech envelopes presented in one trial (both target and non-target speech) were paired with the neural response (R) from a different trial. This yields a null distribution of reconstruction accuracies that could be obtained by chance, to which the real data can be compared (right). B, Attention-agnostic permutation test. In each permutation, the target and non-target speech stimuli were randomly relabeled to create attention-agnostic regressors that contain 50% target speech and 50% non-target speech. The reconstruction accuracy for each regressor was estimated, and the difference between them is used to create a null distribution to which the neural-bias index can be compared (right). C, Order-agnostic permutation test. In each permutation, trials were randomly relabeled and separated into two order-agnostic groups consisting of 50% trials from the first half of the experiment and 50% trials from the second half. The reconstruction accuracy for each group of trials was estimated, and the difference between them is used to create a null distribution to which the real data from each half of the experiment can be compared (right).

First, we assessed whether the speech reconstruction accuracies obtained for both target and non-target speech were significantly better than those that could be obtained by chance. To this end, we used S-R permutations, similar to those used for the group-level statistics, in which we shuffled the pairing between acoustic envelopes (S) and neural responses (R; Fig. 2A). Reconstruction values were assessed in 100 permutations of mismatched S-R combinations, yielding a null distribution from which we derived a personalized chance-level value for target and non-target speech for each participant (the top 5th percentile of the null distribution).

Second, we assessed whether the difference in reconstruction accuracy between the target and non-target talker could reliably be attributed to their task relevance (referred to as the “Neural-Bias index”). For this, we followed the procedure introduced by Kaufman and Zion Golumbic (2023) to create an “attention-agnostic” null distribution of neural-bias indices (Fig. 2B). Specifically, for each participant we created 100 permutations in which the two speech stimuli were randomly relabeled so that the stimuli represented in each regressor were 50% target and 50% non-target. Multivariate decoders were trained on this relabeled data, reconstruction accuracy values were estimated for each regressor, and the difference between them was used to create an attention-agnostic distribution of differences between reconstruction accuracies. The real neural-bias index for each participant was normalized (z-scored) relative to this null distribution, and participants with a z-score >1.64 were considered as exhibiting a significant neural bias toward the target speech (p < 0.05, one-tailed). We chose this continuous normalization approach, rather than a binary cutoff value, because it allows us to present the distribution of neural-bias values across participants. We note that the approach used here to assess differences in the speech tracking of target and non-target speech differs from the auditory attention-decoding (AAD) approach used in similar studies to identify which of two speech stimuli belongs to the target talker (Mirkovic et al., 2015; O’Sullivan et al., 2015, 2019; Fuglsang et al., 2017; Teoh and Lalor, 2019). In those studies, a decoder trained only on target speech is used to predict the envelope of non-target speech (and vice versa), which assesses the similarity/differences between the decoders estimated for each stimulus. However, this approach is less appropriate in the current study, where we were interested in assessing how accurately each speech stimulus is represented in the neural signal, even if the spatiotemporal features of their decoders are different (see Extended Data Fig. 6-1 for a direct comparison of these approaches and a discussion of their utility for different purposes).
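A minimal Python sketch of this attention-agnostic permutation and z-scoring (illustration only; train_and_score is a hypothetical stand-in for the full cross-validated multivariate decoding pipeline):

import numpy as np

def neural_bias_z(trials, train_and_score, n_perm=100, seed=0):
    """trials: list of (env_target, env_nontarget, eeg) tuples.
    train_and_score(trials) -> (r_regressor1, r_regressor2), the cross-validated
    reconstruction accuracies for the two regressors."""
    rng = np.random.default_rng(seed)
    r_t, r_n = train_and_score(trials)
    real_diff = r_t - r_n                              # the real neural-bias index
    null = np.empty(n_perm)
    for k in range(n_perm):
        relabeled = []
        for env_t, env_n, eeg in trials:
            if rng.random() < 0.5:                     # swap target/non-target labels on ~half the trials
                env_t, env_n = env_n, env_t
            relabeled.append((env_t, env_n, eeg))
        a, b = train_and_score(relabeled)
        null[k] = a - b                                # attention-agnostic difference
    return (real_diff - null.mean()) / null.std()      # z > 1.64 -> significant bias (p < 0.05, one-tailed)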

The two analyses described above (Fig. 2A,B) were performed using data from the entire experiment, as well as data from each half of the experiment separately. We then performed a third permutation test to assess whether the speech reconstruction accuracies and the Neural-Bias index differed significantly between the two halves of the experiment, i.e., before versus after the talker switch. For this, we conducted an “order-agnostic” permutation test where trials were randomly relabeled so that the data included in each regressor were 50% from the first half of the experiment and 50% from the second half (Fig. 2C). Multivariate decoders were trained on this relabeled data and reconstruction accuracy values were estimated for each regressor, and this procedure was repeated 100 times, yielding a null distribution. A participant was considered as showing a significant difference in the neural tracking of either the target or non-target talker, or a difference in neural bias, if their real data fell in the top 5th percentile of the relevant null distribution.
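A minimal Python sketch of this order-agnostic test (illustration only; metric_fn is a hypothetical stand-in for any of the metrics above, e.g., reconstruction accuracy or the neural-bias index):

import numpy as np

def half_difference_null(trials_first, trials_second, metric_fn, n_perm=100, seed=0):
    """Returns the real first-minus-second difference and its order-agnostic null distribution."""
    rng = np.random.default_rng(seed)
    pooled = list(trials_first) + list(trials_second)
    n1 = len(trials_first)
    real = metric_fn(trials_first) - metric_fn(trials_second)
    null = np.empty(n_perm)
    for k in range(n_perm):
        idx = rng.permutation(len(pooled))
        g1 = [pooled[i] for i in idx[:n1]]             # pseudo "first half" (mixed trials)
        g2 = [pooled[i] for i in idx[n1:]]             # pseudo "second half"
        null[k] = metric_fn(g1) - metric_fn(g2)
    return real, null                                  # significant if real falls in the top 5% of null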

Results

Behavioral data

Accuracy in answering comprehension questions about the content of the lecture was significantly above chance (M = 0.845, SD = ±0.076; t(22) = 32.178, p < 10⁻²⁰). No significant differences in performance were observed between the first half (M = 0.846, SD = ±0.09) and second half (M = 0.856, SD = ±0.1) of the experiment (t(22) = −0.42, p = 0.68), as shown in Figure 3.

Figure 3.

Behavioral results. Averaged accuracy rates across trials and participants, for multiple-choice comprehension questions, separately for the first (green) and second (yellow) half of the experiment. Error bars denote SEM across participants.

Speech-tracking analysis

Group-level results

Figure 4 shows the results of the encoding approach. TRF models were estimated for target and non-target speech, trained on data from the entire experiment and also separately on each half of the experiment (before and after the talker switch). As expected, the predictive power followed the common centro-frontal topography characteristic of auditory responses in EEG (Fig. 4A) and was significant compared with a null distribution (p < 0.01). The TRF for target speech showed three prominent peaks—at 60, 110, and 170 ms—in line with previous studies; these peaks are thought to reflect a cascade of information flow from the primary auditory cortex to associative and higher-order regions (Brodbeck et al., 2018b). The TRF for non-target speech was also robust but showed only a single positive peak, at ∼70 ms, which likely reflects its early sensory encoding, without the two later peaks. Although the TRFs estimated for target and non-target speech are not directly comparable (due to the many sensory differences between them, e.g., location, audiovisual vs audio only), they did differ in the amount of variance they explained in the neural signal (predictive power of univariate models for target vs non-target, averaged across all electrodes; F(22) = 11.5, p = 0.003), and including both regressors in a multivariate model explained significantly more of the variance in the neural signal than either univariate model alone (t test between average predictive power: multivariate vs target only: t = 2.97, p = 0.007; multivariate vs non-target only: t = 5.9, p < 10⁻⁵; Fig. 4B).

Figure 4.

Neural bias: group-level results. A, TRF encoding models across all experimental trials, plotted from electrode “Fz,” separately for target and non-target speech. Shaded highlights denote SEM across participants (top). Topographic distribution of the TRF main peaks, plotted separately for target and non-target speech (bottom). B, Topographic distribution of predictive power values (Pearson's r) of the encoding model, averaged across participants, separately for multivariate (top) and univariate (bottom) analysis. C, TRF encoding models for the first half (green) and second half (yellow) of the experiment, plotted from electrode “Fz,” separately for target and non-target speech. Shaded highlights denote SEM across participants. D, Speech reconstruction accuracies for the first and second half of the experiment, for both target and non-target speech. Error bars denote SEM across participants. Extended Data Figure 4-1 is supporting Figure 4.

Figure 4-1

Group-level spectral analysis. Comparison of the power spectrum of the EEG between the first and second half of the experiment. We calculated the spectral power density using a multitaper fast Fourier transform (FFT; as implemented in FieldTrip), separately for the data from each half of the experiment. Shown in the figure is the power spectrum averaged over 9 centro-parietal electrodes (marked in the subplot on the top left). A clear peak is seen in the low alpha range (7–9 Hz) in both halves of the experiment, which was maximal at centro-parietal electrodes (shown in the topographies). However, a paired t test revealed no significant difference between alpha power in the first vs second half of the experiment [t(22) = 1.25, p = 0.22], which might have been expected as an index of fatigue or reduced attention over time (e.g., Yu et al., 2021).

The TRFs estimated separately in the first and second half of the experiment for target and non-target speech were highly similar in their spatiotemporal properties, and no significant differences between them were found (Fig. 4C). An ANOVA comparing the reconstruction accuracies for target versus non-target speech across both halves of the experiment revealed a main effect of task relevance (target vs non-target: F(1,22) = 28.3, p < 0.001) but no main effect of half (F(1,22) = 0.12, p = 0.73) or interaction between the factors (half × talker: F(1,22) = 0.87, p = 0.87). These results were confirmed using a Bayesian ANOVA, which indicated that the main effect of task relevance could be accepted with high confidence and explains most of the variance (BFinclusion = 317, p = 0.002), whereas there was no reliable difference between the two halves or interaction (BFinclusion = 0.26 and BFinclusion = 0.27, respectively, both ps > 0.7).

Moreover, we found that decoders trained on one half of the experiment generalized well to the other half when tested on stimuli that shared the same task relevance (role—target/non-target) but did not generalize well to stimuli that shared the same talker identity but had different roles in the two halves of the experiment (Fig. 5). Notably, the modulation of reconstruction accuracy by task relevance was preserved even in this cross-over analysis. This suggests that the decoders are largely invariant to talker identity and primarily capture features related to the role of the talker in the given task and/or their sensory properties (in this case, being presented audiovisually from a central location for the target talker).

Figure 5.

Generalizability across talkers and time. Reconstruction accuracies for decoders trained on data from one half of the experiment (either on target or non-target speech) and tested on data from the other half of the experiment, separately for same role decoders (e.g., train on target and test on target) and for same talker identity decoders (e.g., train on male talker, test on male talker).

Besides the speech-tracking analysis, we also conducted another group-level analysis of the data and inspected whether the spectral makeup of the EEG signal, and particularly power in the alpha range (8–12 Hz), differed between the first and second halves of the experiment. These results are shown in Extended Data Figure 4-1; no significant differences were found.

Individual-level results

Statistical analysis of speech tracking in individual participants was three-tiered, assessing the significance of (1) the reconstruction accuracies for target and non-target speech, (2) the neural-bias index, and (3) differences between the two halves of the experiment.

Figure 6A (left panel) shows the reconstruction accuracies for target and non-target speech in individual participants, relative to their personal chance-level value (p = 0.05 cutoff, derived in a data-driven manner using S-R permutation). All but one participant showed significant reconstruction accuracy of the target speech (22/23 participants—95%), and most participants also showed higher than chance reconstruction for the non-target speech (18/23 participants—78%). Moreover, reconstruction accuracies for target and non-target speech were positively correlated, across participants (Pearson's r = 0.43, p = 0.038; Fig. 6A, right panel).

Figure 6.

Speech reconstruction and neural bias in individual participants—full experiment. A, Left, Bar graphs depicting reconstruction accuracy in individual participants for target (black) and non-target (dark gray) speech. Horizontal light gray lines represent the p = 0.05 chance level, derived for each participant based on data-driven S-R permutation. Asterisks indicate participants who also showed significant neural bias to target speech (see panel B). Right, Scatterplot showing reconstruction accuracies for target and non-target speech across all participants. The red line represents the linear regression fit between the two variables, which was significant (Pearson's r = 0.43, p = 0.038). B, Scatterplot showing the average reconstruction accuracy and neural-bias index across participants, which were not significantly correlated. Vertical dashed lines indicate the threshold for significant neural bias (z = 1.64, one-tailed; p < 0.05). C, Scatterplots showing the accuracy on behavioral task versus reconstruction accuracy of target speech (left), non-target speech (middle), and the neural-bias index (right), across all participants. No significant correlations were found. Extended Data Figure 6-1 is supporting Figure 6.

Figure 6-1

Comparison of decoder-testing approaches. Here we compare two approaches for testing the performance of decoders trained on EEG data to reconstruct the envelope of concurrently presented speech. Top: The approach used and reported in the current study, in which two stimulus-specific decoders were trained using a multivariate approach to reconstruct the envelopes of target and non-target speech presented concurrently. The scatterplot shows reconstruction accuracies achieved for both decoders across all participants, when tested on left-out data of the same type (i.e., how well the target decoder can reconstruct left-out target speech, and how well the non-target decoder can reconstruct left-out non-target speech). The gray line reflects the diagonal, and the red line represents the linear regression fit between the two variables, which was statistically significant [data are the same as in Figure 6A]. Bottom: Re-analysis of the same data using the auditory attention-decoding (AAD) approach, in which a decoder is trained only on one stimulus (e.g., on target speech) and is then tested on left-out data of the same stimulus (target) and of the other stimulus (non-target), and the two results are compared for classification purposes. The left panel shows a scatterplot of how well a decoder trained on target speech can reconstruct left-out target speech vs how well it can reconstruct left-out non-target speech, across all participants. The right panel shows the same for a decoder trained on non-target speech. The gray line reflects the diagonal, and the red line represents the linear regression fit between the two variables (dashed line indicates a marginally significant regression). In this analysis, almost all dots fall either below or above the diagonal, clearly showing better reconstruction performance when a decoder is tested on data of the same type that it was trained on. This is in line with multiple studies that propose using this approach for practical applications, such as controlling a neuro-steered hearing device (Henshaw and Ferguson, 2013; O’Sullivan et al., 2015; Kidd, 2017; Geirnaert et al., 2022; Roebben et al., 2024). Given the qualitative sensory differences between target and non-target speech in the spatially realistic audiovisual setup used (e.g., spatial location, audio/audiovisual presentation, etc.), it is not very surprising that decoders trained on these stimuli would differ from each other. However, we posit that this approach is less appropriate in the current study, where the goal was not just to distinguish between the two stimuli, but to test whether target speech is represented more robustly in the neural data than non-target speech, a pattern that is considered a signature of ‘selective attention’, i.e., enhancement of target speech and/or suppression of non-target speech (Kerlin et al., 2010; Ding et al., 2012; Zion Golumbic et al., 2013b; O’Sullivan et al., 2015; Fiedler et al., 2019). For this purpose, we believe that it is more appropriate to optimize decoders for each stimulus separately (thus accounting for their differences in properties, e.g., differences in spatial location or speaker characteristics), and then assess how well each one performs in predicting the stimulus it was trained on (the model’s goodness-of-fit/predictive power/accuracy). Using this approach, if we find that both decoders perform very well, this indicates that both stimuli are represented with similar precision in the neural response. Conversely, finding that the decoder for one stimulus outperforms the other can be interpreted as superior or more detailed neural encoding of that stimulus relative to the other, effects that have been associated with better intelligibility and/or higher levels of attention to that stimulus (Best et al., 2008; Lin and Carlile, 2015; Getzmann et al., 2017; Teoh and Lalor, 2019; Uhrig et al., 2022; Orf et al., 2023).

In Figure 6B, we compare the average reconstruction accuracy of each participant to their “Neural-Bias index,” which reflects the difference in reconstruction accuracy for target vs non-target speech (normalized relative to “attention-agnostic” permutations of the data). When using a cutoff of z > 1.64 (p < 0.05, one-tailed), only 10/23 participants (43%) showed a significant neural-bias index, and if we use a less conservative threshold of z > 1 (p < 0.15, one-tailed), this proportion changes only slightly, to 13/23 participants (56%). Interestingly, the reconstruction accuracies themselves and the neural-bias index were not correlated with each other (Pearson's r = −0.017, p = 0.94), suggesting that these metrics are independent.

We further tested whether performance on the behavioral task (answering comprehension questions) was correlated with the reconstruction accuracy of either speech stimulus or with the neural-bias index; however, none of the brain–behavior correlations were significant (target reconstruction accuracy vs behavior: r = 0.032, p = 0.88; non-target reconstruction accuracy vs behavior: r = −0.088, p = 0.69; neural bias vs behavior: r = 0.14, p = 0.5; Fig. 6C).

Figures 7 and 8 depict the results of the same analyses shown in Figure 6 but conducted separately on data from each half of the experiment. Here, speech reconstruction was significant in a smaller proportion of participants, with 17/23 (74%) showing significant reconstruction of target speech in both the first and second half (although these were not necessarily the same participants) and 10/23 (43%) or 12/23 (52%) participants showing significant reconstruction of non-target speech in the first and second half of the experiment, respectively (Figs. 7A, 8A). Reconstruction accuracies of target and non-target speech were not significantly correlated in the first half of the experiment (Pearson's r = 0.2, p = 0.33) but were in the second half (Pearson's r = 0.4, p = 0.048).

Figure 7.

Speech reconstruction and neural bias in individual participants—first half of experiment. A, Left, Bar graphs depicting reconstruction accuracy in individual participants for target (black) and non-target (dark gray) speech. Horizontal light gray lines represent the p = 0.05 chance level, derived for each participant based on data-driven S-R permutation. Right, Scatterplot showing reconstruction accuracies for target and non-target speech across all participants. B, Scatterplot showing the average reconstruction accuracy and neural-bias index across participants, which were not significantly correlated. Vertical dashed lines indicate the threshold for significant neural bias (z = 1.64, one-tailed; p < 0.05). C, Scatterplots showing the accuracy on behavioral task versus reconstruction accuracy of target speech (left), non-target speech (middle), and the neural-bias index (right), across all participants. No significant correlations were found.

Figure 8.

Speech reconstruction and neural bias in individual participants—second half of experiment. A, Left, Bar graphs depicting reconstruction accuracy in individual participants for target (black) and non-target (dark gray) speech. Horizontal light gray lines represent the p = 0.05 chance level, derived for each participant based on data-driven S-R permutation. Right, Scatterplot showing reconstruction accuracies for target and non-target speech across all participants. The red line represents the linear regression fit between the two variables, which was significant (Pearson's r = 0.4, p = 0.048). B, Scatterplot showing the average reconstruction accuracy and neural-bias index across participants, which were not significantly correlated. Vertical dashed lines indicate the threshold for significant neural bias (z = 1.64, one-tailed; p < 0.05). C, Scatterplots showing the accuracy on behavioral task versus reconstruction accuracy of target speech (left), non-target speech (middle), and the neural-bias index (right), across all participants. No significant correlations were found.

When evaluating the neural-bias index of individual participants, we found that only 7/23 (30%) and 5/23 (21%) showed significantly better reconstruction accuracy for the target versus non-target speech (z > 1.64, p < 0.05, one-tailed; first and second half, respectively). As observed for the full experiment, here too reconstruction accuracy was not correlated with the neural-bias index in either half (first half: r = −0.37, p = 0.08; second half: r = −0.026, p = 0.9; Figs. 7B, 8B), nor were any brain–behavior correlations significant (first half: target reconstruction accuracy vs behavior: r = 0.095, p = 0.67; non-target reconstruction accuracy vs behavior: r = 0.14, p = 0.54; neural bias vs behavior: r = −0.075, p = 0.73; Fig. 7C; second half: target reconstruction accuracy vs behavior: r = 0.05, p = 0.83; non-target reconstruction accuracy vs behavior: r = −0.007, p = 0.97; neural bias vs behavior: r = 0.1, p = 0.64; Fig. 8C).

Last, we compared the speech reconstruction accuracies and neural-bias indices obtained in each half of the experiment but found that none of these measures were significantly correlated between the first and second half of the experiment (neural bias: Pearson's r = −0.005, p = 0.98, target speech: Pearson's r = 0.34, p = 0.11, non-target speech: Pearson's r = 0.17, p = 0.44; Fig. 9). Only a handful of participants showed above-chance differences between the first and second half of the experiment in these metrics, relative to a distribution of order-agnostic permuted data (Fig. 9, red); however, these may represent false positives due to multiple comparisons.

Figure 9.

First versus second half of experiment. Scatterplots showing the neural-bias index (left), target speech (middle), and non-target speech reconstruction accuracies (right) across all participants, in the first versus second half of the experiment. No significant correlations were found for any of the measures. Participants for whom significant differences were found between the two halves (based on an order-agnostic permutation test) are marked in red.

Discussion

Here we studied the neural representation of target and non-target speech in a spatially realistic audiovisual setup, both at the group level and in individual participants. The group-level results are in line with results from previous two-talker experiments that used less ecological paradigms (e.g., dichotic listening, scripted speech materials, etc.), namely, that the acoustic envelope of target speech is represented more robustly than that of non-target speech (Kerlin et al., 2010; Ding et al., 2012; Mesgarani and Chang, 2012; Zion Golumbic et al., 2013b; O’Sullivan et al., 2015; Fuglsang et al., 2017; Brodbeck et al., 2018b; Fiedler et al., 2019; Niesen et al., 2019; Har-shai Yahav and Zion Golumbic, 2021; Kaufman and Zion Golumbic, 2023; Straetmans et al., 2024). We also show that neural tracking is invariant to talker identity and does not change significantly over the course of the experiment (in magnitude or in the spatiotemporal features of the decoder). This supports the robustness of this measure for use in free-field audiovisual contexts, laying the foundation to extend scientific investigation of selective attention to more realistic environments (Parsons, 2015; Risko et al., 2016; Brown et al., 2023; Levy et al., 2024).

However, examination of individual-level results revealed that the neural bias observed at the group level is not seen consistently in all, or even in most, listeners. Fewer than half of the participants showed significant neural bias for target speech, whereas in most participants non-target speech could be reconstructed just as well as target speech from the neural signal. Importantly, reconstruction accuracy and the neural-bias index were not correlated, indicating that variability across participants cannot be trivially explained by poor speech tracking or low SNR. Moreover, neither the reconstruction accuracy of target or non-target speech nor the neural-bias index were significantly correlated with performance, suggesting they carry limited behavioral consequences. These results are similar to those reported in a previous MEG study (Kaufman and Zion Golumbic, 2023), which taken together lead us to suggest that although speech-tracking metrics are useful for studying selective attention at the group level, they may fall short as “neural-markers of selective attention” at the individual level. Below we discuss the potential implications of these results for understanding the different strategies listeners may employ to deal with concurrent speech in realistic multitalker contexts.

Group-level effects: highly robust and generalizable

The group-level TRFs, derived separately for target and non-target speech using an encoding approach, represent spatiotemporal filters that capture the neural responses to each of the two concurrent talkers (Ding and Simon, 2012; Power et al., 2012; Zion Golumbic et al., 2013b; Fuglsang et al., 2017; Fiedler et al., 2019; Kaufman and Zion Golumbic, 2023). They share a similar frontocentral topographical distribution, which is typical of auditory responses; however, they differ in their time course. The TRF for target speech contains three prominent peaks—roughly at 60, 110, and 170 ms—in line with previous studies; these peaks are thought to reflect a cascade of information flow from primary auditory cortex to associative and higher-order regions (Brodbeck et al., 2018b; Chen et al., 2023). Conversely, the TRF for non-target speech showed only a single positive peak, at ∼70 ms, which likely reflects its early sensory encoding, without the two later peaks. Past studies have reported mixed results regarding the temporal similarity of TRFs for target and non-target speech, some showing similar time courses albeit with reduced amplitudes for non-target speech (Kerlin et al., 2010; Ding et al., 2012; Zion Golumbic et al., 2013b; O’Sullivan et al., 2015; Fiedler et al., 2019; Kaufman and Zion Golumbic, 2023) and others showing that the later TRF peaks are not present for non-target speech (Jaeger et al., 2020; Har-Shai Yahav et al., 2024). It is likely that differences in the spatiotemporal characteristics of TRFs for target and non-target speech are affected both by the specific perceptual attributes of the stimuli themselves (e.g., audiovisual vs audio presentation, their different spatial locations and consequent reverberation patterns) and by their task-related role (target vs non-target). In the spatially realistic experimental design used here, these factors are inherently confounded, just as they are under real-life conditions in which listeners look at the talker that they are trying to pay attention to. A previous study by O’Sullivan et al. (2019), later reanalyzed by Ahmed et al. (2023a), attempted to dissect the specific contribution of selective attention versus audiovisual input by including a condition in which participants watched the video of a talker that they had to ignore and attended to a talker whom they could not see. However, it is not clear to what degree such a highly artificial design (which is also extremely cognitively demanding) is representative of the mechanisms that listeners use when processing speech under natural circumstances. Instead, here, rather than trying to assert whether the differences in TRFs are due to “selective attention per se” or to “perceptual differences,” we accept that in real life these often go together. We posit that as selective attention research progresses to more ecologically valid contexts, these factors cannot (and perhaps need not!) be teased apart but rather should be considered as making inherently joint contributions to behavior and neural responses. Arguably, the aspiration to empirically dissociate “pure” effects of perception and attention in ecological contexts may be inadequate, and we believe that one of the great challenges facing our field today is to reconsider and redefine the relationship between these constructs if we strive to truly understand how the brain operates under real-life conditions (Anderson, 2011; Risko et al., 2016; Schotter et al., 2025).

Regarding the comparison between the first and second half of the experiment, we found similar TRFs and reconstruction accuracies, both for target and for non-target speech. This indicates that listeners were highly effective at adapting their neural encoding after the mid-way shift in the identity of the target talker and the topic of the lecture. It also demonstrates the robustness of EEG-based speech-tracking measures and of the neural bias toward target speech when using roughly 10 min of data (at least at the group level). When designing this study, we postulated that neural tracking and/or neural bias to the target speech might be worse in the second half of the experiment, either due to fatigue (Moore et al., 2017; Jaeger et al., 2020) or due to higher cognitive interference from the non-target talker who had previously been the target talker (Johnsrude et al., 2013; Har-Shai Yahav et al., 2024). The finding that the switch did not carry a behavioral or neural processing cost here is testimony to the high adaptability and flexibility of the attentional system, which does not "get stuck" on processing previously relevant features but updates its preferences according to changing task demands (Kiesel et al., 2010; Koch and Lawo, 2014; Agmon et al., 2021; Kaufman and Zion Golumbic, 2023). This result contrasts somewhat with previous findings showing that speech processing is adversely affected by attention switching. For example, a recent study similar to ours (albeit using auditory-only stimuli) found that a previously attended stream interfered more with the behavioral task than a consistently task-irrelevant stream did (Orf et al., 2023). Other studies have also demonstrated reduced speech processing, decreased intelligibility, and impaired recall of specific details resulting from attention switching between talkers (Best et al., 2008; Lin and Carlile, 2015; Getzmann et al., 2017; Teoh and Lalor, 2019; Uhrig et al., 2022). However, in those studies the switches occurred on a per-trial basis, which likely creates more opportunities for confusion relative to the current study, in which the target talker was switched only once. Admittedly, there is much yet to explore regarding attention switching between talkers, particularly under ecological conditions where switches are contextual and often initiated by the listeners themselves. The current findings contribute to these efforts by demonstrating that the neural representation of target and non-target speech is invariant to talker identity and remains stable after a switch in a realistic audiovisual context.

From group averages to individual-level responses

The vast majority of cognitive neuroscience research, particularly research using EEG, relies on averaging data from large samples of participants to obtain group-level results. This has traditionally been motivated by the noisy nature of many EEG metrics, the variability across individuals, and the need to generalize results beyond a specific sample (Luck et al., 2000; Makeig et al., 2004; Luck, 2014). However, there is an increasing desire to derive reliable EEG-based measures from the brains of individuals, to be used, for example, to explain variability in behavioral performance and cognitive capabilities, as biomarkers for clinical symptoms, and to monitor the effectiveness of personalized interventions (O'Sullivan et al., 2017; Bednar and Lalor, 2020; Hadley and Culling, 2022; Geirnaert et al., 2024). Deriving reliable individual-level EEG-based metrics for speech processing and/or for attention has been particularly appealing, given their potential utility for clinical, educational, and technological interventions. The development of speech-tracking methods over the past decade has raised hope that this metric will prove useful for individual-level assessments. Indeed, in the domain of hearing and speech processing, several groups have demonstrated robust correlations between neural speech-tracking metrics and speech intelligibility, for example, in children with dyslexia or in those with hearing impairment (Keshavarzi et al., 2022; Xiu et al., 2022; Van Hirtum et al., 2023). In contexts that require selective attention to a target talker, speech-tracking methods have also been successfully applied to distinguish between the neural representations of target and non-target speech in individual participants and even on a per-trial basis, a classification-based approach that can be effective, for example, for designing neuro-steering devices or hearing aids (Henshaw and Ferguson, 2013; O'Sullivan et al., 2015; Kidd, 2017; Fallahnezhad et al., 2023; see Extended Data Fig. 6-1). However, to date, fewer studies have examined the neural-bias index in individual participants, a metric that offers insights not only into the distinction between target and non-target speech but also into the modulation of their neural representations as a function of task relevance, which is thought to be a signature of top-down selective attention (Hillyard et al., 1973; Hansen and Hillyard, 1983; Woods et al., 1984; Bidet-Caulet et al., 2007; Manting et al., 2020).
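
The classification-based (decoding) approach mentioned above is conceptually simple, and the following minimal sketch conveys its logic: a backward model is trained to reconstruct the target envelope from the EEG, and each trial is then classified according to which of the two competing envelopes the reconstruction correlates with more strongly. All names, lag counts, and regularization values are illustrative assumptions rather than the pipeline used here.

```python
# Minimal sketch of per-trial auditory attention decoding (assumed parameters).
import numpy as np

def lagged_eeg(eeg, n_lags):
    """Stack lagged copies of all EEG channels: [time x (channels * n_lags)]."""
    t, ch = eeg.shape
    X = np.zeros((t, ch * n_lags))
    for lag in range(n_lags):
        X[lag:, lag * ch:(lag + 1) * ch] = eeg[:t - lag]
    return X

def train_decoder(eeg, target_env, n_lags=25, lam=1e3):
    """Ridge-regression backward model mapping lagged EEG to the target envelope."""
    X = lagged_eeg(eeg, n_lags)
    XtX = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(XtX, X.T @ target_env)

def classify_trial(decoder, eeg_trial, env_a, env_b, n_lags=25):
    """Label a trial by which envelope the reconstruction matches better."""
    recon = lagged_eeg(eeg_trial, n_lags) @ decoder
    r_a = np.corrcoef(recon, env_a)[0, 1]
    r_b = np.corrcoef(recon, env_b)[0, 1]
    return "A" if r_a > r_b else "B"

# Simulated usage: EEG that (noisily) encodes envelope A should be classified as "A"
rng = np.random.default_rng(0)
fs = 100
env_a, env_b = rng.random(30 * fs), rng.random(30 * fs)
eeg = np.outer(env_a, rng.standard_normal(32)) + rng.standard_normal((30 * fs, 32))
dec = train_decoder(eeg, env_a)
print(classify_trial(dec, eeg, env_a, env_b))  # expected: "A"
```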

The neural bias in reconstruction accuracies of target and non-target speech for individual participants has been shown qualitatively in several previous papers, but statistical analyses have focused mostly on the group level (Ding et al., 2012; Ding and Simon, 2012; Fuglsang et al., 2017; Rosenkranz et al., 2021). Kaufman and Zion Golumbic (2023) introduced the use of attention-agnostic permutation tests to quantify and statistically evaluate the neural-bias index in individual listeners. Results from that MEG study are similar to those found here: fewer than half of the participants (∼30% in that study) exhibited significant modulation of speech tracking by selective attention, even when using a "permissive" statistical threshold. The fact that group-level averages consistently show a robust difference between target and non-target speech, despite the underlying variability across participants, is explained by the asymmetric distribution of the neural-bias metric: none of the participants showed an "opposite" bias (i.e., more accurate reconstruction of non-target than target speech). This asymmetric distribution also lends further credibility to the neural-bias metric as reliably capturing the relative neural representation of the two speech stimuli.
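
To make the logic of such attention-agnostic permutation testing concrete, the sketch below illustrates one way to compute a neural-bias index for a single listener and compare it against a null distribution obtained by randomly swapping the target/non-target labels across trials. The variable names, the simulated per-trial correlations, and the one-sided test are assumptions for illustration and do not reproduce the published analysis.

```python
# Illustrative single-participant permutation test for the neural-bias index.
import numpy as np

def neural_bias(r_target, r_nontarget):
    """Neural-bias index: mean difference in per-trial reconstruction accuracy."""
    return np.mean(r_target - r_nontarget)

def permutation_test(r_target, r_nontarget, n_perm=10000, seed=0):
    rng = np.random.default_rng(seed)
    observed = neural_bias(r_target, r_nontarget)
    null = np.empty(n_perm)
    for i in range(n_perm):
        swap = rng.random(len(r_target)) < 0.5          # swap labels on ~half the trials
        t = np.where(swap, r_nontarget, r_target)
        n = np.where(swap, r_target, r_nontarget)
        null[i] = neural_bias(t, n)
    p = np.mean(null >= observed)                       # one-sided: bias toward target
    return observed, p

# Example: simulated per-trial correlations for one participant
rng = np.random.default_rng(1)
r_tgt = 0.08 + 0.03 * rng.standard_normal(20)
r_non = 0.05 + 0.03 * rng.standard_normal(20)
print(permutation_test(r_tgt, r_non))
```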

How should we interpret the variability in neural bias across individuals? Here we ruled out several trivial explanations, namely, that the variability is due to poor EEG signal quality, poor speech-tracking abilities, or poor attention to the target: the reconstruction accuracies for target and non-target speech were correlated with each other, but their average was correlated neither with the neural-bias index nor with accuracy in answering questions about the content of target speech. Instead, we offer an alternative, admittedly speculative, interpretation that emphasizes possible variability across individuals in their de facto allocation of processing resources among competing talkers. We know from subjective experience that paying attention solely to one target talker and shutting out other competing stimuli can be extremely difficult. There are numerous demonstrations that both sensory and semantic properties of non-target speech are encoded and processed, indicating that selective attention to one talker does not imply its exclusive representation (Dupoux et al., 2003; Rivenez et al., 2006; Beaman et al., 2007; Parmentier et al., 2018; Vachon et al., 2019; Har-shai Yahav and Zion Golumbic, 2021; Brown et al., 2023; Har-Shai Yahav et al., 2024). Moreover, the fact that individuals can divide their attention reasonably well between two concurrent speech streams if asked to do so indicates that sufficient perceptual and cognitive resources may be available for listeners to apply a multiplexed listening strategy (Vanthornhout et al., 2019; Agmon et al., 2021; Kaufman and Zion Golumbic, 2023). Given this, it is reasonable to assume that even when instructed to pay attention to only one talker, listeners may devote at least some resources to the competing non-target talker as well, either voluntarily, as in divided attention, or involuntarily (Makov et al., 2023). This notion is in line with the load theory of attention, which suggests that rather than attributing attentional selection of target stimuli/features to either "early" or "late" stages of processing, attention should be viewed as the dynamic allocation of available cognitive resources among competing stimuli. This allocation of resources among talkers reflects their prioritization vis-à-vis their relevance to the listener but can also vary as a function of perceptual load, task demands, and listener motivation/internal state, which we propose may underlie some of the variability observed here between participants (Lavie et al., 2004; Wild et al., 2012; Gagné et al., 2017; Murphy et al., 2017; Peelle, 2018). Along these lines, the lack of a correlation between neural bias and performance may suggest that the perceptual and cognitive demands of the current task, which emulate those encountered in many real-life situations, left many listeners with sufficient available resources to corepresent both talkers without behavioral costs. Clearly, this interpretation is speculative and the current data are insufficient for testing its plausibility; however, we offer it as a hypothesis for future studies aimed at better understanding individual differences in prioritization between target and non-target speech under realistic circumstances and at determining whether this variability is explained by specific personal traits, by perceptual or cognitive load, or by other state-related factors (Beaman et al., 2007; Colflesh and Conway, 2007; Forster and Lavie, 2008; Sörqvist and Rönnberg, 2014; Murphy et al., 2017; Lambez et al., 2020).

Another important point to note in this regard is that speech tracking, quantified here as the envelope-following response measured with EEG, captures only a partial neural representation of the speech, primarily reflecting encoding of its acoustic properties in auditory cortex (Ding et al., 2012; Mesgarani and Chang, 2012; Zion Golumbic et al., 2013b; Crosse et al., 2016; Fiedler et al., 2019). Recent work attempting to separate neural tracking of acoustic versus linguistic/semantic features of speech has suggested that selective attention primarily affects the latter (Lachter et al., 2004; Ding et al., 2018; Brodbeck et al., 2018a), although this is not always the case (Parmentier, 2008; Parmentier et al., 2018; Vachon et al., 2019; Har-shai Yahav and Zion Golumbic, 2021). Moreover, studies that have examined neural speech tracking across different brain regions, using source-level MEG data or intracranial EEG (ECoG) recordings, have shown a dissociation between sensory cortical regions, which corepresent concurrent speech, and higher-order regions (e.g., anterior temporal cortex, as well as frontal and parietal regions), where attentional selectivity is more prominent (Zion Golumbic et al., 2013b; Brodbeck et al., 2018a,b; Har-shai Yahav and Zion Golumbic, 2021). Accordingly, had we used a more detailed characterization of the speech stimuli or a more complex nonlinear model, or had we analyzed neural responses stemming from brain regions beyond auditory cortex, we might have found more extensive evidence for neural bias in individual participants. While this is an important limitation to consider from a basic-science perspective, the use of EEG in the current study and our focus on the acoustic envelope of speech have considerable applied value. The motivation to derive personalized neural metrics of selective attention (as opposed to group-based data) has a large practical component, such as providing tools for clinical/educational assessments and interventions. As such, these would likely involve EEG recordings (which are substantially more accessible and affordable than MEG) and, in the case of speech processing, would rely on analyzing speech features that are easy to derive and do not require a tedious annotation process (Agmon et al., 2023). In publishing this data set and emphasizing the variability across individuals, we hope to provide a transparent account of the complexity of using EEG-based speech-tracking data as a "biomarker" of selective attention in individual listeners, and to highlight the need to systematically investigate the factors underlying this variability (be they "cognitive" or "methodological" in nature).
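
For concreteness, the sketch below shows one common way to derive the broadband acoustic envelope on which such envelope-following analyses rely: the absolute value of the analytic signal, low-pass filtered and downsampled to the EEG sampling rate. The cutoff frequency and sampling rates are illustrative assumptions, not the exact preprocessing parameters used in this study.

```python
# Minimal sketch of broadband speech-envelope extraction (assumed parameters).
import numpy as np
from scipy.signal import hilbert, butter, filtfilt, resample_poly

def speech_envelope(audio, fs_audio=16000, fs_eeg=100, lp_cutoff=20.0):
    env = np.abs(hilbert(audio))                        # broadband amplitude envelope
    b, a = butter(4, lp_cutoff / (fs_audio / 2), btype="low")
    env = filtfilt(b, a, env)                           # zero-phase low-pass
    return resample_poly(env, fs_eeg, fs_audio)         # match the EEG sampling rate

# Example on one second of white noise standing in for speech
audio = np.random.default_rng(2).standard_normal(16000)
print(speech_envelope(audio).shape)  # (100,)
```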

Conclusion

In traditional cognitive neuroscience research, there is a desire to manipulate a specific construct (e.g., which stimulus is the target) while controlling for all low-level sensory differences between stimuli. However, as we turn to studying neural operations under increasingly ecological conditions, perfect control becomes less feasible. In the case of selective attention, "targetness" is often accompanied by specific sensory attributes, making targets inherently different from non-targets. Here we studied one such case, in which target speech is audiovisual and non-target speech is peripheral and auditory only. We show that under these more realistic conditions, the hallmark signature of selective attention, namely the modulation of sensory representations and its robustness to switches in target identity, is conserved, at least at the group level. At the same time, our results also point to the underlying diversity among participants in how this modulation manifests, to the degree that in over half of the participants target and non-target speech were represented equally well (albeit in different ways). These results emphasize that there is still much to explore regarding how the brain (or how different brains) treats target and non-target speech when attempting to achieve selective attention. This work calls for more granular investigation of how factors such as task difficulty, perceptual load, listener motivation, and personal traits ultimately affect the neural encoding of competing stimuli under ecological conditions.

Footnotes

  • The authors declare no competing financial interests.

  • This work was supported by Deutsche Forschungsgemeinschaft (1591/2-1) and Israel Ministry of Science (3-16416).

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license, which permits unrestricted use, distribution and reproduction in any medium provided that the original work is properly attributed.

References

  1. Agmon G, Yahav PH-S, Ben-Shachar M, Zion Golumbic E (2021) Attention to speech: mapping distributed and selective attention systems. Cereb Cortex 32:3763–3776. https://doi.org/10.1093/cercor/bhab446
  2. Agmon G, Jaeger M, Tsarfaty R, Bleichner MG, Golumbic EZ (2023) "Um…, it's really difficult to… um… speak fluently": neural tracking of spontaneous speech. Neurobiol Lang 4:435–454. https://doi.org/10.1162/nol_a_00109
  3. Ahmed F, Nidiffer AR, Lalor EC (2023a) The effect of gaze on EEG measures of multisensory integration in a cocktail party scenario. Front Hum Neurosci 17:1283206. https://doi.org/10.3389/fnhum.2023.1283206
  4. Ahmed F, Nidiffer AR, O'Sullivan AE, Zuk NJ, Lalor EC (2023b) The integration of continuous audio and visual speech in a cocktail-party environment depends on attention. Neuroimage 274:120143. https://doi.org/10.1016/j.neuroimage.2023.120143
  5. Anderson B (2011) There is no such thing as attention. Front Psychol 2:10181. https://doi.org/10.3389/fpsyg.2011.00246
  6. Aydelott J, Jamaluddin Z, Nixon Pearce S (2015) Semantic processing of unattended speech in dichotic listening. J Acoust Soc Am 138:964–975. https://doi.org/10.1121/1.4927410
  7. Beaman CP, Bridges AM, Scott SK (2007) From dichotic listening to the irrelevant sound effect: a behavioural and neuroimaging analysis of the processing of unattended speech. Cortex 43:124–134. https://doi.org/10.1016/S0010-9452(08)70450-7
  8. Bednar A, Lalor EC (2020) Where is the cocktail party? Decoding locations of attended and unattended moving sound sources using EEG. Neuroimage 205:116283. https://doi.org/10.1016/j.neuroimage.2019.116283
  9. Bentin S, Kutas M, Hillyard SA (1995) Semantic processing and memory for attended and unattended words in dichotic listening: behavioral and electrophysiological evidence. J Exp Psychol Hum Percept Perform 21:54–67. https://doi.org/10.1037/0096-1523.21.1.54
  10. Best V, Ozmeral EJ, Kopčo N, Shinn-Cunningham BG (2008) Object continuity enhances selective auditory attention. Proc Natl Acad Sci U S A 105:13174–13178. https://doi.org/10.1073/pnas.0803718105
  11. Bidet-Caulet A, Fischer C, Besle J, Aguera PE, Giard MH, Bertrand O (2007) Effects of selective attention on the electrophysiological representation of concurrent sounds in the human auditory cortex. J Neurosci 27:9252–9261. https://doi.org/10.1523/JNEUROSCI.1402-07.2007
  12. Broadbent DE (1958) Perception and communication. London: Pergamon Press.
  13. Brodbeck C, Hong LE, Simon JZ (2018a) Rapid transformation from auditory to linguistic representations of continuous speech. Curr Biol 28:3976–3983.e5. https://doi.org/10.1016/j.cub.2018.10.042
  14. Brodbeck C, Presacco A, Simon JZ (2018b) Neural source dynamics of brain responses to continuous stimuli: speech processing from acoustics to comprehension. Neuroimage 172:162–174. https://doi.org/10.1016/j.neuroimage.2018.01.042
  15. Brodbeck C, Jiao A, Hong LE, Simon JZ (2020) Neural speech restoration at the cocktail party: auditory cortex recovers masked speech of both attended and ignored speakers (Malmierca MS, ed). PLoS Biol 18:e3000883. https://doi.org/10.1371/journal.pbio.3000883
  16. Brown A, Pinto D, Burgart K, Zvilichovsky Y, Zion-Golumbic E (2023) Neurophysiological evidence for semantic processing of irrelevant speech and own-name detection in a virtual café. J Neurosci 43:5045–5056. https://doi.org/10.1523/JNEUROSCI.1731-22.2023
  17. Brungart DS (2001) Evaluation of speech intelligibility with the coordinate response measure. J Acoust Soc Am 109:2276–2279. https://doi.org/10.1121/1.1357812
  18. Carlyon RP (2004) How the brain separates sounds. Trends Cogn Sci 8:465–471. https://doi.org/10.1016/j.tics.2004.08.008
  19. Chen YP, Schmidt F, Keitel A, Rösch S, Hauswald A, Weisz N (2023) Speech intelligibility changes the temporal evolution of neural speech tracking. Neuroimage 268:119894. https://doi.org/10.1016/j.neuroimage.2023.119894
  20. Cherry EC (1953) Some experiments on the recognition of speech, with one and with two ears. J Acoust Soc Am 25:975–979. https://doi.org/10.1121/1.1907229
  21. Colflesh GJH, Conway ARA (2007) Individual differences in working memory capacity and divided attention in dichotic listening. Psychon Bull Rev 14:699–703. https://doi.org/10.3758/BF03196824
  22. Crosse MJ, Butler JS, Lalor EC (2015) Congruent visual speech enhances cortical entrainment to continuous auditory speech in noise-free conditions. J Neurosci 35:14195–14204. https://doi.org/10.1523/JNEUROSCI.1829-15.2015
  23. Crosse MJ, Di Liberto GM, Bednar A, Lalor EC (2016) The multivariate temporal response function (mTRF) toolbox: a MATLAB toolbox for relating neural signals to continuous stimuli. Front Hum Neurosci 10:604. https://doi.org/10.3389/fnhum.2016.00604
  24. Deutsch JA, Deutsch D (1963) Attention: some theoretical considerations. Psychol Rev 70:80–90. https://doi.org/10.1037/h0039515
  25. Ding N, Simon JZ (2012) Neural coding of continuous speech in auditory cortex during monaural and dichotic listening. J Neurophysiol 107:78–89. https://doi.org/10.1152/jn.00297.2011
  26. Ding N, Pan X, Luo C, Su N, Zhang W, Zhang J (2018) Attention is required for knowledge-based sequential grouping: insights from the integration of syllables into words. J Neurosci 38:1178–1188. https://doi.org/10.1523/JNEUROSCI.2606-17.2017
  27. Ding N, Simon JZ (2012) Emergence of neural encoding of auditory objects while listening to competing speakers. Proc Natl Acad Sci U S A 109:11854–11859. https://doi.org/10.1073/pnas.1205381109
  28. Dupoux E, Kouider S, Mehler J (2003) Lexical access without attention? Explorations using dichotic priming. J Exp Psychol Hum Percept Perform 29:172–184. https://doi.org/10.1037/0096-1523.29.1.172
  29. Fallahnezhad T, Pourbakht A, Toufan R (2023) The effect of computer-based auditory training on speech-in-noise perception in adults: a systematic review and meta-analysis. Indian J Otolaryngol Head Neck Surg 75:4198–4211. https://doi.org/10.1007/s12070-023-03920-0
  30. Fiedler L, Wöstmann M, Herbst SK, Obleser J (2019) Late cortical tracking of ignored speech facilitates neural selectivity in acoustically challenging conditions. Neuroimage 186:33–42. https://doi.org/10.1016/j.neuroimage.2018.10.057
  31. Fleming JT, Maddox RK, Shinn-Cunningham BG (2021) Spatial alignment between faces and voices improves selective attention to audio-visual speech. J Acoust Soc Am 150:3085–3100. https://doi.org/10.1121/10.0006415
  32. Forster S, Lavie N (2008) Failures to ignore entirely irrelevant distractors: the role of load. J Exp Psychol Appl 14:73–83. https://doi.org/10.1037/1076-898X.14.1.73
  33. Freyman RL, Balakrishnan U, Helfer KS (2001) Spatial release from informational masking in speech recognition. J Acoust Soc Am 109:2112–2122. https://doi.org/10.1121/1.1354984
  34. Fu Z, Wu X, Chen J (2019) Congruent audiovisual speech enhances auditory attention decoding with EEG. J Neural Eng 16:066033. https://doi.org/10.1088/1741-2552/ab4340
  35. Fuglsang SA, Dau T, Hjortkjær J (2017) Noise-robust cortical tracking of attended speech in real-world acoustic scenes. Neuroimage 156:435–444. https://doi.org/10.1016/j.neuroimage.2017.04.026
  36. Gagné JP, Besser J, Lemke U (2017) Behavioral assessment of listening effort using a dual-task paradigm: a review. Trends Hear 21:1–25. https://doi.org/10.1177/2331216516687287
  37. Geirnaert S, Zink R, Francart T, Bertrand A (2024) Fast, accurate, unsupervised, and time-adaptive EEG-based auditory attention decoding for neuro-steered hearing devices. In: Brain-computer interface research (Guger C, Allison B, Rutkowski TM, Korostenskaja M, eds), pp 29–40. Cham: Springer.
  38. Getzmann S, Jasny J, Falkenstein M (2017) Switching of auditory attention in "cocktail-party" listening: ERP evidence of cueing effects in younger and older adults. Brain Cogn 111:1–12. https://doi.org/10.1016/j.bandc.2016.09.006
  39. Grant KW, Seitz P-F (2000) The use of visible speech cues for improving auditory detection of spoken sentences. J Acoust Soc Am 108:1197–1208. https://doi.org/10.1121/1.1288668
  40. Hadley LV, Culling JF (2022) Timing of head turns to upcoming talkers in triadic conversation: evidence for prediction of turn ends and interruptions. Front Psychol 13:1061582. https://doi.org/10.3389/fpsyg.2022.1061582
  41. Haider CL, Park H, Hauswald A, Weisz N (2024) Neural speech tracking highlights the importance of visual speech in multi-speaker situations. J Cogn Neurosci 36:128–142. https://doi.org/10.1162/jocn_a_02059
  42. Hansen JC, Hillyard SA (1983) Selective attention to multidimensional auditory stimuli. J Exp Psychol Hum Percept Perform 9:1–19. https://doi.org/10.1037/0096-1523.9.1.1
  43. Har-Shai Yahav P, Sharaabi A, Zion Golumbic E (2024) The effect of voice familiarity on attention to speech in a cocktail party scenario. Cereb Cortex 34:bhad475. https://doi.org/10.1093/cercor/bhad475
  44. Har-shai Yahav P, Zion Golumbic E (2021) Linguistic processing of task-irrelevant speech at a cocktail party. Elife 10:e65096. https://doi.org/10.7554/eLife.65096
  45. Henshaw H, Ferguson MA (2013) Efficacy of individual computer-based auditory training for people with hearing loss: a systematic review of the evidence (Snyder J, ed). PLoS One 8:e62836. https://doi.org/10.1371/journal.pone.0062836
  46. Hillyard SA, Hink RF, Schwent VL, Picton TW (1973) Electrical signs of selective attention in the human brain. Science 182:177–180. https://doi.org/10.1126/science.182.4108.177
  47. Humes LE, Kidd GR, Fogerty D (2017) Exploring use of the coordinate response measure in a multitalker babble paradigm. J Speech Lang Hear Res 60:741–754. https://doi.org/10.1044/2016_JSLHR-H-16-0042
  48. Jaeger M, Mirkovic B, Bleichner MG, Debener S (2020) Decoding the attended speaker from EEG using adaptive evaluation intervals captures fluctuations in attentional listening. Front Neurosci 14:510408. https://doi.org/10.3389/fnins.2020.00603
  49. JASP Team (2022) JASP (version 0.16.3) [computer software].
  50. Johnsrude IS, Mackey A, Hakyemez H, Alexander E, Trang HP, Carlyon RP (2013) Swinging at a cocktail party: voice familiarity aids speech perception in the presence of a competing voice. Psychol Sci 24:1995–2004. https://doi.org/10.1177/0956797613482467
  51. Karthik G, Cao CZ, Demidenko MI, Jahn A, Stacey WC, Wasade VS, Brang D (2024) Auditory cortex encodes lipreading information through spatially distributed activity. Curr Biol 34:4021–4032.e5. https://doi.org/10.1016/j.cub.2024.07.073
  52. Kaufman M, Zion Golumbic E (2023) Listening to two speakers: capacity and tradeoffs in neural speech tracking during selective and distributed attention. Neuroimage 270:119984. https://doi.org/10.1016/j.neuroimage.2023.119984
  53. Keidser G, et al. (2020) The quest for ecological validity in hearing science: what it is, why it matters, and how to advance it. Ear Hear 41:5S–19S. https://doi.org/10.1097/AUD.0000000000000944
  54. Kerlin JR, Shahin AJ, Miller LM (2010) Attentional gain control of ongoing cortical speech representations in a cocktail party. J Neurosci 30:620–628. https://doi.org/10.1523/JNEUROSCI.3631-09.2010
  55. Keshavarzi M, Mandke K, Macfarlane A, Parvez L, Gabrielczyk F, Wilson A, Flanagan S, Goswami U (2022) Decoding of speech information using EEG in children with dyslexia: less accurate low-frequency representations of speech, not "noisy" representations. Brain Lang 235:105198. https://doi.org/10.1016/j.bandl.2022.105198
  56. Kidd G (2017) Enhancing auditory selective attention using a visually guided hearing aid. J Speech Lang Hear Res 60:3027. https://doi.org/10.1044/2017_JSLHR-H-17-0071
  57. Kiesel A, Steinhauser M, Wendt M, Falkenstein M, Jost K, Philipp AM, Koch I (2010) Control and interference in task switching: a review. Psychol Bull 136:849–874. https://doi.org/10.1037/a0019842
  58. Koch I, Lawo V (2014) Exploring temporal dissipation of attention settings in auditory task switching. Atten Percept Psychophys 76:73–80. https://doi.org/10.3758/s13414-013-0571-5
  59. Lachter J, Forster KI, Ruthruff E (2004) Forty-five years after Broadbent (1958): still no identification without attention. Psychol Rev 111:880–913. https://doi.org/10.1037/0033-295X.111.4.880
  60. Lambez B, Agmon G, Har-Shai Yahav P, Rassovsky Y, Zion Golumbic E (2020) Paying attention to speech: the role of working memory capacity and professional experience. Atten Percept Psychophys 82:3594–3605. https://doi.org/10.3758/s13414-020-02091-2
  61. Lavie N, Hirst A, de Fockert JW, Viding E (2004) Load theory of selective attention and cognitive control. J Exp Psychol Gen 133:339–354. https://doi.org/10.1037/0096-3445.133.3.339
  62. Levy O, Korisky A, Zvilichovsky Y, Golumbic EZ (2024) The neurophysiological costs of learning in a noisy classroom: an ecological virtual reality study. J Cogn Neurosci 37:300–316. https://doi.org/10.1162/jocn_a_02249
  63. Liberman MC (1982) The cochlear frequency map for the cat: labeling auditory-nerve fibers of known characteristic frequency. J Acoust Soc Am 72:1441–1449. https://doi.org/10.1121/1.388677
  64. Lin G, Carlile S (2015) Costs of switching auditory spatial attention in following conversational turn-taking. Front Neurosci 9:136588. https://doi.org/10.3389/fnins.2015.00124
  65. Luck SJ, Woodman GF, Vogel EK (2000) Event-related potential studies of attention. Trends Cogn Sci 4:432–440. https://doi.org/10.1016/S1364-6613(00)01545-X
  66. Luck SJ (2014) An introduction to the event-related potential technique, Ed 2. Cambridge: MIT Press.
  67. Makeig S, Debener S, Onton J, Delorme A (2004) Mining event-related brain dynamics. Trends Cogn Sci 8:204–210. https://doi.org/10.1016/j.tics.2004.03.008
  68. Makov S, Pinto D, Har-shai Yahav P, Miller LM, Zion Golumbic E (2023) "Unattended, distracting or irrelevant": theoretical implications of terminological choices in auditory selective attention research. Cognition 231:105313. https://doi.org/10.1016/j.cognition.2022.105313
  69. Manting CL, Andersen LM, Gulyas B, Ullén F, Lundqvist D (2020) Attentional modulation of the auditory steady-state response across the cortex. Neuroimage 217:116930. https://doi.org/10.1016/j.neuroimage.2020.116930
  70. Mathôt S, Schreij D, Theeuwes J (2012) OpenSesame: an open-source, graphical experiment builder for the social sciences. Behav Res Methods 44:314–324. https://doi.org/10.3758/s13428-011-0168-7
  71. Mesgarani N, Chang EF (2012) Selective cortical representation of attended speaker in multi-talker speech perception. Nature 485:233–236. https://doi.org/10.1038/nature11020
  72. Mirkovic B, Debener S, Jaeger M, De Vos M (2015) Decoding the attended speech stream with multi-channel EEG: implications for online, daily-life applications. J Neural Eng 12:046007. https://doi.org/10.1088/1741-2560/12/4/046007
  73. Moore TM, Key AP, Thelen A, Hornsby BWY (2017) Neural mechanisms of mental fatigue elicited by sustained auditory processing. Neuropsychologia 106:371. https://doi.org/10.1016/j.neuropsychologia.2017.10.025
  74. Murphy S, Fraenkel N, Dalton P (2013) Perceptual load does not modulate auditory distractor processing. Cognition 129:345–355. https://doi.org/10.1016/j.cognition.2013.07.014
  75. Murphy S, Spence C, Dalton P (2017) Auditory perceptual load: a review. Hear Res 352:40–48. https://doi.org/10.1016/j.heares.2017.02.005
  76. Näätänen R, Teder W, Alho K, Lavikainen J (1992) Auditory attention and selective input modulation: a topographical ERP study. Neuroreport 3:493–496. https://doi.org/10.1097/00001756-199206000-00009
  77. Niesen M, Bourguignon M, Vander Ghinst M, Bertels J, Wens V, Choufani G, Hassid S, Goldman S, De Tiège X (2019) Cortical processing of hierarchical linguistic structures in adverse auditory situations. Front Neurosci 13. https://doi.org/10.3389/conf.fnins.2019.96.00052
  78. Oostenveld R, Fries P, Maris E, Schoffelen J-M (2011) FieldTrip: open source software for advanced analysis of MEG, EEG, and invasive electrophysiological data. Comput Intell Neurosci 2011:1–9. https://doi.org/10.1155/2011/156869
  79. Orf M, Wöstmann M, Hannemann R, Obleser J (2023) Target enhancement but not distractor suppression in auditory neural tracking during continuous speech. iScience 26:106849. https://doi.org/10.1016/j.isci.2023.106849
  80. O'Sullivan AE, Lim CY, Lalor EC (2019) Look at me when I'm talking to you: selective attention at a multisensory cocktail party can be decoded using stimulus reconstruction and alpha power modulations. Eur J Neurosci 50:3282–3295. https://doi.org/10.1111/ejn.14425
  81. O'Sullivan JA, Power AJ, Mesgarani N, Rajaram S, Foxe JJ, Shinn-Cunningham BG, Slaney M, Shamma SA, Lalor EC (2015) Attentional selection in a cocktail party environment can be decoded from single-trial EEG. Cereb Cortex 25:1697–1706. https://doi.org/10.1093/cercor/bht355
  82. O'Sullivan J, Chen Z, Herrero J, McKhann GM, Sheth SA, Mehta AD, Mesgarani N (2017) Neural decoding of attentional selection in multi-speaker environments without access to clean sources. J Neural Eng 14:056001. https://doi.org/10.1088/1741-2552/aa7ab4
  83. Parmentier FBR (2008) Towards a cognitive model of distraction by auditory novelty: the role of involuntary attention capture and semantic processing. Cognition 109:345–362. https://doi.org/10.1016/j.cognition.2008.09.005
  84. Parmentier FBR, Pacheco-Unguetti AP, Valero S (2018) Food words distract the hungry: evidence of involuntary semantic processing of task-irrelevant but biologically-relevant unexpected auditory words. PLoS One 13:1–17. https://doi.org/10.1371/journal.pone.0190644
  85. Parsons TD (2015) Virtual reality for enhanced ecological validity and experimental control in the clinical, affective and social neurosciences. Front Hum Neurosci 9:146520. https://doi.org/10.3389/fnhum.2015.00660
  86. Peelle JE (2018) Listening effort: how the cognitive consequences of acoustic challenge are reflected in brain and behavior. Ear Hear 39:204–214. https://doi.org/10.1097/AUD.0000000000000494
  87. Power AJ, Foxe JJ, Forde EJ, Reilly RB, Lalor EC (2012) At what time is the cocktail party? A late locus of selective attention to natural speech. Eur J Neurosci 35:1497–1503. https://doi.org/10.1111/j.1460-9568.2012.08060.x
  88. Risko EF, Richardson DC, Kingstone A (2016) Breaking the fourth wall of cognitive science. Curr Dir Psychol Sci 25:70–74. https://doi.org/10.1177/0963721415617806
  89. Rivenez M, Darwin CJ, Guillaume A (2006) Processing unattended speech. J Acoust Soc Am 119:4027–4040. https://doi.org/10.1121/1.2190162
  90. Rosenkranz M, Holtze B, Jaeger M, Debener S (2021) EEG-based intersubject correlations reflect selective attention in a competing speaker scenario. Front Neurosci 15:685774. https://doi.org/10.3389/fnins.2021.685774
  91. Ross LA, Saint-Amour D, Leavitt VM, Molholm S, Javitt DC, Foxe JJ (2007) Impaired multisensory processing in schizophrenia: deficits in the visual enhancement of speech comprehension under noisy environmental conditions. Schizophr Res 97:173–183. https://doi.org/10.1016/j.schres.2007.08.008
  92. Schotter E, Payne B, Melcher D (2025) Characterizing the neural underpinnings of attention in the real world via co-registration of eye movements and EEG/MEG: an introduction to the special issue. Atten Percept Psychophys 87:1–4. https://doi.org/10.3758/s13414-025-03017-6
  93. Schwartz JL, Berthommier F, Savariaux C (2004) Seeing to hear better: evidence for early audio-visual interactions in speech identification. Cognition 93:B69–B78. https://doi.org/10.1016/j.cognition.2004.01.006
  94. Shavit-Cohen K, Zion Golumbic E (2019) The dynamics of attention shifts among concurrent speech in a naturalistic multi-speaker virtual environment. Front Hum Neurosci 13:386. https://doi.org/10.3389/fnhum.2019.00386
  95. Sörqvist P, Rönnberg J (2014) Individual differences in distractibility: an update and a model. Psych J 3:42–57. https://doi.org/10.1002/pchj.47
  96. Straetmans L, Adiloglu K, Debener S (2024) Neural speech tracking and auditory attention decoding in everyday life. Front Hum Neurosci 18:1483024. https://doi.org/10.3389/fnhum.2024.1483024
  97. Sumby WH, Pollack I (1954) Visual contribution to speech intelligibility in noise. J Acoust Soc Am 26:212–215. https://doi.org/10.1121/1.1907309
  98. Teoh ES, Lalor EC (2019) EEG decoding of the target speaker in a cocktail party scenario: considerations regarding dynamic switching of talker location. J Neural Eng 16:036017. https://doi.org/10.1088/1741-2552/ab0cf1
  99. Treisman AM (1960) Contextual cues in selective listening. Q J Exp Psychol 12:242–248. https://doi.org/10.1080/17470216008416732
  100. Treisman AM (1969) Strategies and models of selective attention. Psychol Rev 76:282–299. https://doi.org/10.1037/h0027242
  101. Tye-Murray N, Spehar B, Myerson J, Hale S, Sommers M (2016) Lipreading and audiovisual speech recognition across the adult lifespan: implications for audiovisual integration. Psychol Aging 31:380–389. https://doi.org/10.1037/pag0000094
  102. Uhrig S, Perkis A, Möller S, Svensson UP, Behne DM (2022) Effects of spatial speech presentation on listener response strategy for talker-identification. Front Neurosci 15:730744. https://doi.org/10.3389/fnins.2021.730744
  103. Vachon F, Marsh JE, Labonté K (2019) The automaticity of semantic processing revisited: auditory distraction by a categorical deviation. J Exp Psychol Gen 149:1360–1397. https://doi.org/10.1037/xge0000714
  104. Van Hirtum T, Somers B, Dieudonné B, Verschueren E, Wouters J, Francart T (2023) Neural envelope tracking predicts speech intelligibility and hearing aid benefit in children with hearing loss. Hear Res 439:108893. https://doi.org/10.1016/j.heares.2023.108893
  105. Vanthornhout J, Decruy L, Francart T (2019) Effect of task and attention on neural tracking of speech. Front Neurosci 13:977. https://doi.org/10.3389/fnins.2019.00977
  106. Wang B, Xu X, Niu Y, Wu C, Wu X, Chen J (2023) EEG-based auditory attention decoding with audiovisual speech for hearing-impaired listeners. Cereb Cortex 33:10972–10983. https://doi.org/10.1093/cercor/bhad325
  107. Wikman P, Salmela V, Sjöblom E, Leminen M, Laine M, Alho K (2024) Attention to audiovisual speech shapes neural processing through feedback-feedforward loops between different nodes of the speech network. PLoS Biol 22:e3002534. https://doi.org/10.1371/journal.pbio.3002534
  108. Wild CJ, Yusuf A, Wilson DE, Peelle JE, Davis MH, Johnsrude IS (2012) Effortful listening: the processing of degraded speech depends critically on attention. J Neurosci 32:14010–14021. https://doi.org/10.1523/JNEUROSCI.1528-12.2012
  109. Woods DL, Hillyard SA, Hansen JC (1984) Event-related brain potentials reveal similar attentional mechanisms during selective listening and shadowing. J Exp Psychol Hum Percept Perform 10:761–777. https://doi.org/10.1037/0096-1523.10.6.761
  110. Xiu B, Paul BT, Chen JM, Le TN, Lin VY, Dimitrijevic A (2022) Neural responses to naturalistic audiovisual speech are related to listening demand in cochlear implant users. Front Hum Neurosci 16:1043499. https://doi.org/10.3389/fnhum.2022.1043499
  111. Zion Golumbic EM, Cogan GB, Schroeder CE, Poeppel D (2013a) Visual input enhances selective speech envelope tracking in auditory cortex at a "cocktail party". J Neurosci 33:1417–1426. https://doi.org/10.1523/JNEUROSCI.3675-12.2013
  112. Zion Golumbic EM, et al. (2013b) Mechanisms underlying selective neuronal tracking of attended speech at a "cocktail party". Neuron 77:980–991. https://doi.org/10.1016/j.neuron.2012.12.037

Synthesis

Reviewing Editor: Catherine Schevon, Columbia University

Decisions are customarily a result of the Reviewing Editor and the peer reviewers coming together and discussing their recommendations until a consensus is reached. When revisions are invited, a fact-based synthesis statement explaining their decision and outlining what is needed to prepare a revision will be listed below. The following reviewer(s) agreed to reveal their identity: NONE.

Both reviewers provided several critical points regarding the analysis strategy and the interpretation of the results. They need to be carefully considered in a revision. Since both reviewers provide detailed comments, I here provide all comments of the reviewers to allow the authors to reply point-by-point to these comments.

Reviewer 1:

This manuscript presents an investigation of how attention switching impacts the neural encoding of speech in a two-talker environment. The rationale was that switching attention might impact the re-engagement in various ways, from an improved suppression of the unattended stream due to the increased familiarity with the speaker, to an increased difficulty in performing the selective attention task, due to a higher distractibility towards the formerly attended stream.

The idea and data are interesting. However, there are some fundamental flaws in some of the analyses and discussion that undermine the validity of the results if not addressed.

Please find my comments below. Please note that line numbers are missing, hence my comments cannot point to the specific lines.

Major comments

1. The main dependent measure seems to be the comparison between the prediction correlations for the target model and the non-target model. While that effect was found in previous work, it is not the most important one. In fact, markers of the segregation of target and non-target (rather than overall encoding) are more sensitive to selective attention, and that is in fact what is also used for decoding tasks (e.g., O'Sullivan et al., Cereb. Cortex, 2014). Typically, such markers are derived by fitting a target decoding model, and then by correlating the reconstructed envelope to the two possible envelopes. The resulting classification scores (or d-prime) quantify the ability to filter EEG activity reflecting the target speech more than the non-target. Why did the authors use such a suboptimal metric, when a stronger and more direct measurement was available? Please note that comparing target and non-target models that are fit separately assumes that the brain encoding of the target is stronger than for the non-target speech. Given the goals of this study, that is an unnecessary assumption, as the goal should be to determine if target and non-target are segregated (while the weaker encoding would be a byproduct of that segregation).

2. The authors have decided to optimise data from all participants as a group, selecting a common lambda value. While one might have some reasons to do so with encoding models (e.g., Crosse et al., Front. in Hum Neurosci, 2021), I don't see a reason for doing that with decoding models. In fact, that would lead to suboptimal results by definition, while the point of a decoding model is to get the best possible decoding (rather than interpreting the weights). This choice is particularly problematic for the claim about individual subject differences. So, I find this choice to be in contradiction with the intent of looking into the individual participants. That should be done by considering the participants independently, with completely independent models, instead of using a "group-level" choice of the lambda.

3. The study does not control (as far as I could see) for the potential effect of fatigue. That effect could be excluded by including a control condition with the same duration as the typical trials, but without a switch. That is a confound that might mask interesting effects. That issue was timidly mentioned in the discussion, while I think it is a main issue that would have been easy to address. Could the authors comment on that or potentially collect additional data?

4. On that same analysis (before vs. after): There are other issues that hamper the ability to claim anything on that point. Indeed, lack of a significant difference doesn't mean that there is no difference. But my main issue with that n.s. effect is that the authors do not seem to be really trying to get an effect there. First, a stronger metric should be used (see major comment 1). Second, the re-use of the same lecture as speech material would likely reduce any distraction effect of the non-target, as that would be somewhat familiar even in the "before" condition. So, in my view, the current data cannot support that claim due to these several issues.

5. The authors claim (in multiple parts of the paper) that a strong or significant envelope tracking "rules out" the possibility that a low EEG signal quality could explain why most participants did not show target > non-target. One of the factors that might impact that was discussed in the first and second major points. But there is a more general consideration that is crucial here. Envelope tracking measurements can reflect a number of neural sources and processes, and only a few of them might reflect the separation between target and non-target. So, good envelope tracking might mean that the EEG signal is capturing some acoustic- or speech-sensitive neural sources. But it doesn't imply that it captures signals that are sensitive to selective attention. So that argument does not hold.

6. Page 15 (middle paragraph). One of the reasons for looking at EEG results at the group level is to determine if an effect is consistent in a statistical sense. That is often because of the low SNR of EEG signals and consequent large variability across participants. The authors found effects that are significant at the group level, with large variability at the single subject level. They then claim that such a variability reflects different neural encodings or different behaviour across participants. That might be true, but another even more likely explanation is that the data are simply too limited and too noisy to fit reliable single-subject models. This is particularly true in this study, which used less data than in previous work, from what I could see (30 min per participant; so, 15 min per condition?). This is an important issue as it hampers one of the key claims of the study. One suggestion is to run the same analysis on publicly available data (e.g., O'Sullivan, which used ~1 hour of data I think) and to measure how many participants show significance when using 15 min of data, 30 min, 45 min, and 60 min. Would it be reasonable to think that 25% of the participants would show significance with 15 min of data, and that the percentage would increase when using more data? That, or some other control, is necessary to provide some support to the authors' claim.

Other comments

1. There are several unmotivated experimental choices that appear to be afterthoughts. For example, why was the non-target presented only on the left? Why not left and right, maybe in alternation or simultaneously? Why did the authors decide to use an AV target? How does the speech envelope correlate with the visual motion? Could part of the target vs. non-target effect be explained by the presence of the V component, or is that completely uncorrelated to the envelope?

2. It seems that the EEG data was filtered twice. First between 0.5 and 40 Hz, then between 1 and 20 Hz. Could the authors elaborate on that point? Also, what filters were used (type and order)? Were they zero-phase shift? Filtering is a delicate operation that can damage the data.

3. Figure 3A: It would be useful to see that result before and after the switch, and for target and non-target, as in the other panels

4. It wasn't clear to me how, exactly, the encoding and decoding models were fit. Were separate models fit for before and after? That would lead to even smaller training data, which would be problematic. What about Target and Non-Target?

5. It seems that the z-score thresholds were picked arbitrarily, with little justification. Could the authors elaborate on that? That looks important, as a different threshold would change that 25%.

6. Is the data going to be publicly available? I strongly recommend that.

Reviewer 2:

The paper presents an experiment investigating the neural tracking of speech signals in the presence of a competing, differently localised speaker. The target speaker is always presented in front of the listener, with a supporting video, and the competing speaker is presented to the left. The authors use a TRF approach to estimate the neural tracking of both speakers and further use the difference in (envelope) reconstruction accuracy for target and ignored talker as an index of attention bias. They report that attention bias (calculated in this way) is at best a poor predictor of behaviour, varying substantially between participants and within participants over experimental segments.

This is in many ways an intriguing report, but there are some issues that, in my opinion, reduce the confidence we may have in the results as they are currently presented and interpreted.

The paper is overall well written and presented, the introduction covers a satisfactory range of relevant recent work and frames the work adequately (although note comment below relating to alpha-band power). The authors raise an important point regarding the significance of individual differences. However, there are methodological details that are at times insufficiently clearly described and choices pertaining to the experimental design and analyses are in part not sufficiently explained.

In particular, it is unclear how the authors conceive of "attention switch". I believe the paradigm can be used to probe sustained (selective) attention to only one target (the speaker presented with video) in the presence of a lateralised distractor. I found the construct of switching to be over-emphasised, given that there is only one switch of talker, which did not entail a change in locus or modality of attention in order to continue effectively executing the task, and that, furthermore, the trial in which this switch occurs is explicitly not analysed.

I am also somewhat concerned about a circularity of argumentation. It appears that attention is defined as a function of a specific neural correlate (i.e. the attention bias, derived from the TRF). Is this a valid definition of attention? How confident can we be that the measure of attention bias is a measure of attention? It may well be an index that is affected by attention, but given that it is not well predictive of behavioural outcomes, there is an implicit implication that auditory selective attention must therefore not be relevant to behavioural performance in this task. I do not think this is the message the authors are trying to convey; thus, caution is required in describing the nature of the effects.

In sum, the paper presents an interesting study with an intriguing outcome, but the core cognitive construct requires more comprehensive elucidation and the design and analysis require additional explanation and justification.

Main Concerns:

The description of the loudspeaker position follows the description of the presentation of the distractor to the left. In order to make it clear where the target is presented, specify the location of loudspeakers first.

An analysis of audio from on-ear headphones is mentioned - if the listener is concurrently receiving auditory input from headphones alongside the free field, this must be discussed, and any potential interference must be accounted for. Wearing on-ear headphones will have at least one consequence that is of particular relevance in the context of sound localisation - it will eliminate cues for localisation that are generated by the transfer functions of the pinnae. This appears to have the unfortunate consequence of reducing the ecological validity of the free-field auditory presentation.

Justification for the target speaker being presented with video. It is known that neural tracking of speech signals increases (i.e. is altered) in the presence of concurrent visual input. This is briefly mentioned in the discussion, but the potential consequences of one speech stream being supported by visual input and the other not are potentially greater than merely increasing the amplitude of entrainment. The role of, for example, the motor cortex (e.g., Park, Kayser, Thut & Gross, 2016) in mediating auditory entrainment to the speech signal may have consequences for the response function that relates the acoustic and cerebral signals.

This brings me on to a further methodological question: Is TRF the correct approach for assessing neural tracking? The measure provides an insight into the stability of the linear relationship between R and S, but it is not clear to what extent they truly reflect tracking of the envelope. Can the authors dedicate some effort to explaining why they favoured this specific method over others (such as phase-locking index, Gaussian Copula Mutual Information, etc.)? It would also be valuable to know more about the justification for the broadband speech envelope employed.

Other work has focused on lateralised changes to the oscillatory cerebral signal (particularly in the alpha band) as a function of lateralised auditory attention (Wöstmann and colleagues, e.g. Wöstmann, Alavash & Obleser, 2019; Wöstmann, Lim & Obleser, 2017). There is no mention of this research, which seems like an oversight when summarising the current state of the field, even if the TRF method is not directly comparable to time-frequency analysis.

Minor points and requests for clarification:

Specify the interpolation method applied, describe how cardiac artefacts were identified

Please clarify or homogenise: "Trials" and "Sentences"

Please provide the reference for "Liberman's cochlear frequency map"

Provide more detail on the calculation of Bayes factors - report the priors in sufficient detail to allow future replication of the analyses, even if the defaults of the software should change, or in other software.

Author Response

We greatly appreciate the reviewers' feedback and comments on our manuscript, and have made substantial modifications to the text and to some of the analyses to address their concerns. The current version focuses more on the basic effect of selective attention, emphasizing the spatial-realism and impact of using audiovisual speech, as well as the differences between averaging data across the sample vs. recognizing the variability across individuals. As part of our revision, we have also changed the title of the paper to: "Neural Speech-Tracking During Selective Attention: A Spatially Realistic Audiovisual Study", and have added some Extended Data to address Reviewer #1's concern about the decoding approach used here.

Below, please find a detailed response in purple to the specific points raised by the reviewers.

Reviewer 1:

This manuscript presents an investigation on how attention switching impacts the neural encoding of speech in a two-talker environment. The rationale was that switching attention might impact the re-engagement in various ways, from an improved suppression of the unattended stream due to the increased familiarity with the speaker, to an increased difficulty in performing the selective attention task, due to a higher distractibility towards the formerly-attended stream.

The idea and data are interesting. However, there are some fundamental flaws in some of the analyses and discussion that undermine the validity of the results, if not addressed.

Please find my comments below. Please note that line numbers are missing, hence my comments cannot point to the specific lines.

Major comments

1. The main dependent measure seems to be the comparison between the prediction correlations for the target model and the non-target model. While that effect was found in previous work, it is not the most important one. In fact, markers of the segregation of target and non-target (rather than overall encoding) are more sensitive to selective attention, and that is in fact what is also used for decoding tasks (e.g., O'Sullivan et al., Cereb. Cortex, 2014). Typically, such markers are derived by fitting a target decoding model, and then by correlating the reconstructed envelope to the two possible envelopes. The resulting classification scores (or d-prime) quantify the ability to filter EEG activity reflecting the target speech more than the non-target. Why did the authors use such a suboptimal metric, when a stronger and more direct measurement was available?

We are aware of the classification method used by O'Sullivan et al. and others, in which the goal is to identify which of two competing stimuli was the "target" speech in a given trial, based on a model trained specifically on "target" speech (and vice-versa for the "non-target" speech).

This approach does indeed differ in its rationale from the one used here, where we train two separate models on "target" and "non-target" speech (using a multivariate approach), and compare the 'goodness of fit' (predictive power) of each model.

There are advantages and disadvantages to the two approaches, and both have been used extensively in studies of selective attention to speech (Fuglsang et al., 2017; Har-shai Yahav et al., 2023; Kaufman & Zion Golumbic, 2023; Mirkovic et al., 2015; A. E. O'Sullivan et al., 2019; J. A. O'Sullivan et al., 2015; Teoh & Lalor, 2019). Rather than stating that one approach is 'superior', 'optimal' or 'more sensitive' than the other, we believe that they serve different scientific purposes and that the classification approach suggested by the reviewer would not be appropriate for addressing the scientific questions posed here. We explain the rationale for this choice below:

1) The classification approach can be useful if we want to ask: is it possible to distinguish between the neural representations of target and non-target speech? For this purpose, demonstrating that a decoder trained on target speech can decode target speech better than non-target speech would indicate that their neural representations are "not the same". Accordingly (as also shown in the original paper by O'Sullivan et al.), the same pattern of results could be obtained when a decoder is trained on non-target speech and tested on target speech - here too, the decoder would not perform as well. Successful classification - while potentially useful for engineering purposes, such as controlling a neuro-steered hearing aid or other device - can be easily explained by differences in the acoustic, spatial and other perceptual characteristics of target and non-target speech (and in our case, by the fact that target-speech was audiovisual), which would undoubtedly lead to finding decoders with different spatio-temporal characteristics for each stimulus. This is seen nicely when using encoding models - the TRFs derived for target and non-target speech differ in many spatio-temporal properties, so it is not surprising that they would not be successful if used to predict stimuli from the opposite class. Accordingly, successful classification in and of itself should not be interpreted as a "pure" effect of selective attention, as it is heavily influenced by the differences between the stimuli themselves.

2) The scientific question we ask here is slightly different. We want to know whether target speech is represented more robustly in the neural data than non-target speech, a pattern that is considered a signature of 'selective attention' - i.e., enhancement of target speech and/or suppression of non-target speech (Ding et al., 2012; Fiedler et al., 2019; Kerlin et al., 2010; O'Sullivan et al., 2015; Zion Golumbic, Ding, et al., 2013). For this purpose, we believe that it is more appropriate to optimize decoders for each stimulus separately (thus accounting for their differences in properties), and then assess how well each one performs for predicting the stimulus it was trained on (the model's goodness-of-fit/predictive power/accuracy). Using this approach, if we find that both decoders perform very well - this indicates that both stimuli are represented with similar precision in the neural response. Conversely, finding that the decoder for one stimulus outperforms the other can be interpreted as superior or more detailed neural encoding of that stimulus relative to the other, effects that have been associated with better intelligibility and/or higher levels of attention to this stimulus (Best et al., 2008; Getzmann et al., 2017; Lin & Carlile, 2015; Orf et al., 2023; Teoh & Lalor, 2019; Uhrig et al., 2022).
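To illustrate the logic of this comparison, below is a minimal sketch (not the authors' exact pipeline, which used the multivariate decoders and cross-validation described in their Methods): one ridge-regression backward model is trained per speech stream, each model is scored on the envelope it was trained on, and the difference in reconstruction accuracy serves as the neural-bias index. All function names, the lag settings and the regularization value are illustrative assumptions.

```python
# Minimal sketch: train one backward (decoding) model per stream and compare how
# well each reconstructs the envelope it was trained on. Illustrative only.
import numpy as np

def lag_matrix(eeg, n_lags):
    """Stack time-lagged copies of the EEG (time x channels) into a design matrix."""
    n_t, n_ch = eeg.shape
    X = np.zeros((n_t, n_ch * n_lags))
    for lag in range(n_lags):
        X[lag:, lag * n_ch:(lag + 1) * n_ch] = eeg[:n_t - lag]
    return X

def train_decoder(eeg, envelope, n_lags=32, lam=1000.0):
    """Ridge-regression backward model mapping EEG lags to the speech envelope."""
    X = lag_matrix(eeg, n_lags)
    XtX = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(XtX, X.T @ envelope)

def reconstruction_accuracy(eeg, envelope, weights, n_lags=32):
    """Pearson's r between reconstructed and actual envelope (the 'goodness of fit')."""
    rec = lag_matrix(eeg, n_lags) @ weights
    return np.corrcoef(rec, envelope)[0, 1]

# eeg: time x channels; env_target / env_nontarget: time-aligned envelopes
# w_t = train_decoder(eeg_train, env_target_train)
# w_n = train_decoder(eeg_train, env_nontarget_train)
# r_t = reconstruction_accuracy(eeg_test, env_target_test, w_t)
# r_n = reconstruction_accuracy(eeg_test, env_nontarget_test, w_n)
# neural_bias = r_t - r_n   # positive values indicate a bias toward target speech
```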

We have clarified this rationale in the revised manuscript (Methods p. 10; and Discussion p. 23). In addition, we have added an Extended Data analysis (Figure S1) where we demonstrate the results obtained using the current approach vs. the classification approach suggested by the reviewer. As the reviewer anticipated, using a decoder trained on one class of stimuli (target/non-target) to distinguish between the two classes is indeed more 'sensitive'; however, in our opinion it does not teach us much about selective attention - just that these are different stimuli. We hope that by articulating the differences between these two approaches and the inferences that can be derived from each one, our work will contribute to formulating best-practices for using speech-tracking methods to study aspects of selective attention.

Please note that comparing target and non-target models that are fit separately assumes that the brain encoding of the target is stronger than for the non-target speech. Given the goals of this study, that is an unnecessary assumption, as the goal should be to determine if target and non-target are segregated (while the weaker encoding would be a byproduct of that segregation).

Perhaps we misunderstood the reviewer's intention here, but - to clarify - the multivariate model used to train the decoders for target and non-target speech makes no assumption regarding any difference between them. The model is agnostic to any 'sensory/cognitive distinction' between the two stimuli, therefore any differences that emerge in the 'goodness of fit' of the two decoders, can reliably be attributed to differences in their neural coding.

Regarding the goals of the study - from our perspective, this is exactly what we wanted to study: whether there is weaker encoding for non-target speech relative to target speech, and whether this effect is maintained when the role of the two talkers is switched. Whether the two decoders are different (and therefore "segregated") is trivial in our opinion, given the difference in their spatial and audiovisual properties (as elaborated in response to the previous comment).

2. The authors have decided to optimise data from all participants as a group, selecting a common lambda value. While one might have some reasons to do so with encoding models (e.g., Crosse et al., Front. in Hum Neurosci, 2021), I don't see a reason for doing that with decoding models. In fact, that would lead to suboptimal results by definition, while the point of a decoding model is to get the best possible decoding (rather than interpreting the weights). This choice is particularly problematic for the claim about individual subject differences. So, I find this choice to be in contradiction with the intent of looking into the individual participants. That should be done by considering the participants independently, with completely independent models, instead of using a "group-level" choice of the lambda.

Thank you for this comment. We have re-run the decoding analyses and chose the most 'optimal' lambda for each participant. However, in practice in 21/23 participants the best lambda was the same (lambda = 1000) and for the two participants where the optimal lambda was different (lambda = 500), the difference in r-values between the two lambdas was <0.005. Therefore, the use of a group-level lambda here is indistinguishable from selecting personal lambdas.

We have clarified this in the text.
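For readers who want to replicate this check, here is a minimal sketch of per-participant lambda selection via leave-one-trial-out cross-validation, reusing the illustrative train_decoder and reconstruction_accuracy helpers sketched above; the candidate lambda values are assumptions, not the authors' exact grid.

```python
# Minimal sketch of per-participant regularization (lambda) selection.
import numpy as np

def pick_lambda(trials, lambdas=(10, 100, 500, 1000, 5000)):
    """trials: list of (eeg, envelope) pairs for one participant."""
    mean_r = {}
    for lam in lambdas:
        rs = []
        for i, (eeg_test, env_test) in enumerate(trials):
            train = [t for j, t in enumerate(trials) if j != i]   # leave one trial out
            eeg_tr = np.vstack([e for e, _ in train])
            env_tr = np.concatenate([s for _, s in train])
            w = train_decoder(eeg_tr, env_tr, lam=lam)
            rs.append(reconstruction_accuracy(eeg_test, env_test, w))
        mean_r[lam] = np.mean(rs)                                  # average held-out accuracy
    return max(mean_r, key=mean_r.get)
```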

3. The study does not control (as far as I could see) for potential effects of fatigue. That effect could be excluded by including a control condition with the same duration as the typical trials, but without a switch. That is a confound that might mask interesting effects. That issue was timidly mentioned in the discussion, while I think it is a main issue that would have been easy to address. Could the authors comment on that or potentially collect additional data?

Indeed, the issue of fatigue is extremely interesting in the context of selective attention. We agree that if we had found reduced speech tracking/neural bias in the 2nd half of the experiment, this could have been confounded with fatigue. However, since here we did not find any difference between the 1st vs. 2nd half of the experiment, this suggests that fatigue probably did not play a big role here and supports the claim that speech tracking and neural-bias remain robust over time despite the switch in talkers. Unfortunately, in the current design the data is insufficient for reliable comparison of results in smaller portions of the experiment (e.g., in each quarter).

It is quite possible that switching the target-talker mid-way could have mitigated effects of fatigue that otherwise would have occurred, but this is a question for future studies. We now address the issue of fatigue in the Discussion (p. 22).

4. On that same analysis (before vs. after): There are other issues that hamper the ability to claim anything on that point. Indeed, lack of a significant difference doesn't mean that there is no difference. But my main issue with that n.s. effect is that the authors do not seem to be really trying to get an effect there. First, a stronger metric should be used (see major comment 1). Second, the re-use of the same lecture as speech material would likely reduce any distraction effect of the non-target, as that would be somewhat familiar even in the "before" condition. So, in my view, the current data cannot support that claim due to these several issues.

We fully agree that caution needs to be exercised not to over-interpret null-results, and have edited the text to reflect this (Discussion p. 22).

To the reviewer's point regarding the use of a 'stronger metric' please see our response to comment #1, where we argue that the alternative metric suggested would in fact be uninformative (relative to the research questions asked here), given the use of spatial and audiovisual speech in the current experimental setup.

To the reviewer's comment regarding the "re-use of the same lecture" we would like to clarify that none of the speech materials were repeated in the experiment. Rather, we used different segments of the same lecture in part 1 and part 2 of the experiment. We have now clarified this in the Methods section (p. 6).

To the broader concern regarding the reliability of the difference/lack-thereof between the first and second half of the experiment, in the revised manuscript we now include additional statistical analyses to address this point:

• In the group-analysis, we have added a Bayes analysis of the difference between the two parts, indicating that the null result is supported (p.14): "An ANOVA comparing the reconstruction accuracies for target vs. non-target speech across both halves of the experiment, revealed a main effect of task-relevance [target vs. non-target: F(1,22)=28.3, p<0.001] but no main effect of half [F(1,22)=0.12, p=0.73] or interaction between the factors [half x talker: F(1,22)=0.87, p=0.87]. These results were confirmed using a Bayesian ANOVA which indicated that the main effect of task-relevance could be embraced with high confidence and explains most of the variance (BFinclusion = 317, p=0.002) but there was no reliable difference between the two halves or interaction (BFinclusion = 0.26 and BFinclusion = 0.27 respectively, both ps > 0.7)."

• In the individual-level analysis, we have added permutation tests assessing the chance-level of finding changes in decoding accuracy and/or changes in the size/direction of neural-bias between the two parts. This analysis also shows a lack of difference between the two halves even in individual participants (where the speech materials were actually different in the two halves; Methods p.9, Figure 2; Results p.16): "We performed a third permutation test to assess whether speech reconstruction accuracies and the Neural-Bias index differed significantly in the two halves of the experiment, i.e., before vs. after the talker-switch. For this, we conducted an "order-agnostic" permutation test where trials were randomly re-labeled so that the data included in each regressor were 50% from the first half of the experiment and 50% from the second half (Figure 2C). Multivariate decoders were trained on this re-labeled data and reconstruction accuracy values were estimated for each regressor, and this procedure was repeated 100 times, yielding a null-distribution. A participant was considered as showing a significant difference in neural-tracking of either the target or non-target talker, or a difference in neural-bias, if their real data fell in the top 5th percentile of the relevant null-distribution."

We have also added a cross-over test, whereby we assess how well decoders trained on target/non-target speech in one part of the experiment are able to predict stimuli from the other part of the experiment (Figure 5). This analysis showed that decoders trained in one half of the experiment generalized well to the other half when tested on stimuli that shared the same task-relevance (role), but did not generalize well to stimuli that shared the same talker-identity. This suggests that the decoders are largely invariant to talker-identity, and primarily capture features related to the role of the talker in the given task (in this case, being the target talker, presented audio-visually from a central location).
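As an illustration of the cross-over logic, the sketch below trains decoders on one half of the experiment and tests them on the other half, pairing decoders with test envelopes either by task role or by talker identity. It reuses the illustrative helpers from the earlier sketch and is not the authors' implementation; variable names are assumptions.

```python
# Minimal sketch of the cross-over generalization test described above.
def crossover_test(half1, half2):
    """half1/half2: dicts with 'eeg', 'env_target', 'env_nontarget' arrays."""
    w_t = train_decoder(half1['eeg'], half1['env_target'])
    w_n = train_decoder(half1['eeg'], half1['env_nontarget'])
    return {
        # same task role, different talker (talkers switched between halves)
        'same_role': (reconstruction_accuracy(half2['eeg'], half2['env_target'], w_t),
                      reconstruction_accuracy(half2['eeg'], half2['env_nontarget'], w_n)),
        # same talker identity, different role
        'same_talker': (reconstruction_accuracy(half2['eeg'], half2['env_nontarget'], w_t),
                        reconstruction_accuracy(half2['eeg'], half2['env_target'], w_n)),
    }
```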

5. The authors claim (in multiple parts of the paper) that a strong or significant envelope tracking "rules out" the possibility that a low EEG signal quality could explain why most participants did not show target > non-target. One of the factors that might impact that was discussed in the first and second major points. But there is a more general consideration that is crucial here. Envelope tracking measurements can reflect a number of neural sources and processes, and only few of them might reflect the separation between target and non-target. So, good envelope tracking might mean that the EEG signal is capturing some acoustic- or speech-sensitive neural sources. But it doesn't imply that it captures signals that are sensitive to selective attention. So that argument does not hold.

We fully agree with the reviewer that measures of envelope-tracking represent only one, out of many potential mechanisms that the brain likely engages for speech processing, source separation etc. We have toned-down our language in multiple places throughout the manuscript, to clarify that the current results pertain only to this specific measure, which limits more general claims about attention.

See for example Discussion p.26: "Another important point to note in this regard is that speech-tracking as quantified here as the envelope-following response measured using EEG, captures only a partial neural representation of the speech, primarily reflecting encoding of its acoustic properties in auditory".

6. Page 15 (middle paragraph). One of the reasons for looking at EEG results at the group level is to determine if an effect is consistent in a statistical sense. That is often because of the low SNR of EEG signals and consequent large variability across participants. The authors found effects that are significant at the group level, with large variability at the single subject level. They then claim that such a variability reflects different neural encodings or different behaviour across participants. That might be true, but another even more likely explanation is that the data is just too little and too noisy to fit reliable single subject models. This is particularly true in this study, which used less data than in previous work, from what I could see (30 min per participant; so, 15 min per condition?). This is an important issue as it hampers one of the key claims of the study.

One suggestion is to run the same analysis on publicly available data (e.g., O'Sullivan, which used ~1 hour of data I think) and to measure how many participants show significance when using 15 min of data, 30 min, 45 min, and 60 min. Would it be reasonable to think that 25% of the participants would show significance with 15 min of data, and that the percentage would increase when using more data? That, or some other control, is necessary to provide some support to the author's claim.

We agree that the amount of data can critically affect the reliability of results, particularly when working with single-subject data.

In a recent paper, Mesik and Wojtczak (2023) tested how the amount of training data affects the validity of TRF models, and found that approximately 8 minutes of clean data should be sufficient for estimating reliable TRFs in most participants. However, since in that work only single-speaker data was used, we have now conducted several additional analyses on the current data set to estimate the expected chance-level given the amount of data collected in the current design (~30 minutes total, 15 minutes per half). Specifically, we now report the results of within-participant permutation tests looking at the reliability of a) the decoding accuracy for each speaker (in the full experiment and in each half); b) the difference in decoding between target and non-target speech (in the full experiment and in each half); and c) the difference in decoding between the first and second half of the experiment. These analyses are described below and are now included in the revised manuscript (see new Figure 2 for an illustration of these analyses, pasted here).

New Figure 2. Data-driven permutation tests for individual-level statistics. Three permutation tests were designed to assess statistical significance of different results in individual participants. The black rectangles in all panels show the original data organization on the left and the re-labeling for the permutation tests on the right. A. S-R permutation test. In each permutation, the pairing between acoustic envelopes (S) and neural responses (R) was shuffled across trials such that speech-envelopes presented in one trial (both target & non-target speech) were paired with the neural response (R) from a different trial. This yields a null-distribution of reconstruction accuracies that could be obtained by chance, to which the real data can be compared (right). B. Attention-agnostic permutation test. In each permutation, the target and non-target speech stimuli were randomly re-labeled to create attention-agnostic regressors that contain 50% target speech and 50% non-target speech. The reconstruction accuracy for each regressor was estimated and the difference between them is used to create a null-distribution to which the neural-bias index can be compared (right). C. Order-agnostic permutation test. In each permutation, trials were randomly re-labeled and separated into two order-agnostic groups consisting of 50% trials from the first half of the experiment and 50% trials from the second half. The reconstruction accuracy for each group of trials was estimated and the difference between them is used to create a null-distribution to which the real data from each half of the experiment can be compared (right).
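The S-R permutation in panel A can be sketched as follows, assuming equal-length trials and reusing the illustrative decoder helpers from above (in a given permutation a few envelopes may by chance keep their original pairing, which only makes the null slightly conservative). The authors' actual analysis used 100 permutations per participant with their multivariate decoding pipeline; everything named here is illustrative.

```python
# Minimal sketch of the S-R permutation test: mispair envelopes and EEG across
# trials, refit the decoder, and build a null distribution of reconstruction accuracies.
import numpy as np

rng = np.random.default_rng(0)

def sr_permutation_null(eeg_trials, env_trials, n_perm=100):
    """eeg_trials / env_trials: lists of per-trial arrays with matching indices."""
    eeg = np.vstack(eeg_trials)
    null_r = []
    for _ in range(n_perm):
        perm = rng.permutation(len(env_trials))            # shuffle envelope-to-EEG pairing
        env = np.concatenate([env_trials[i] for i in perm])
        w = train_decoder(eeg, env)
        null_r.append(reconstruction_accuracy(eeg, env, w))
    return np.array(null_r)

# The real reconstruction accuracy is significant (p < 0.05, one-tailed) if it exceeds
# the 95th percentile of null_r; equivalently, z = (r_real - null_r.mean()) / null_r.std().
```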

We are also making this data set publicly available, so that other researchers can test our methods or use them to develop alternative approaches.

Results from these analyses demonstrate the following:

a) As shown in the new Figure 6, when analyzing data from the entire experiment, 22 out of 23 participants showed significant decoding values for at least one of the speech-stimuli. This provides data-driven evidence for the robustness of the speech-tracking metric in individual participants, despite the noisier nature of single-subject EEG. When we compare the average decoding accuracy to the neural-bias index in individual participants (revised Figure 6, pasted below), we find no correlation between them. In other words, even participants who had very good decoding accuracies did not necessarily show large neural-bias for target speech. This result is in line with the findings reported by Kaufmann & Zion Golumbic (2022), and in our opinion supports the claim that the lack of a significant neural-bias cannot be "trivially" explained by poor data quality. We now clarify this rationale in the Discussion as well (p. 25).

Revised Figure 6. Speech Reconstruction and Neural-bias in individual participants - Full experiment. A. Left: Bar graphs depicting reconstruction accuracy in individual participants for target (black) and non-target (dark gray) speech. Horizontal light gray lines represent the p=0.05 chance-level, derived for each participant based on data-driven S-R permutation. Asterisks indicate participants who also showed significant neural-bias to target speech (see panel B). Right: Scatter plot showing reconstruction accuracies for target and non-target speech across all participants. The red line represents the linear regression fit between the two variables, which was significant (Pearson's r = 0.43, p = 0.038). B. Scatter plot showing the average reconstruction accuracy and neural-bias index across participants, which were not significantly correlated. Vertical dashed lines indicate the threshold for significant neural-bias (z=1.64, one-tailed; p<0.05). C. Scatter plots showing the accuracy on behavioral task vs. reconstruction accuracy of target speech (left), non-target speech (middle) and the neural-bias index (right), across all participants. No significant correlations were found.

b) We also repeated these same analyses on each half of the experiment separately. As the reviewer anticipated, chance-levels were higher when using only half of the data (range between r = 0.02-0.03) relative to when using the full data set (range between r = 0.015-0.02). Nonetheless, significant decoding (of at least one talker) was still found in the vast majority of participants (19/23 in each part, but not necessarily the same participants; see Figures below), even if this was not as ubiquitous as when using all the data.

Here too there was no correspondence between decoding accuracy and the neural-bias index, further weakening the claim that a lack of neural-bias is due to poor EEG quality. We hope that by conducting data-driven statistical analyses in individual participants and transparently presenting these data, our results will contribute to ongoing efforts to optimize the means for obtaining reliable and interpretable speech tracking data from individuals (including the amount of data needed).

Panel A from revised Figures 7 & 8. Bar graphs depicting reconstruction accuracy in individual participants for target (black) and non-target (dark gray) speech, in each half of the experiment separately. Horizontal light gray lines represent the p=0.05 chance-level, derived for each participant based on data-driven S-R permutation.

Other comments

1. There are several unmotivated experimental choices that appear to be afterthoughts. For example, why was the non-target presented only on the left? Why not left and right, maybe in alternation or simultaneously?

We are happy to clarify. The choice of presenting non-target speech from one side (left) stemmed purely out of our desire to maximize the amount of data in each condition, and avoid the need to double the length of the experiment to accommodate two presentation locations. We chose to keep this location constant for all participants since we assumed that presenting speech from different spatial locations might result in different TRF filters, which would make it even more difficult to compare results between participants. However, we fully agree that future studies should be designed looking specifically into the effect of spatial location per se.

We did not want to use 'simultaneous' presentation from two sides, as this would invalidate the spatial-realism of the setup.

We have now clarified this in the text (p.7).

Why did the authors decide to use an AV target? How does the speech envelope correlate with the visual motion? Could part of the target vs. non-target effect be explained by the presence of the V component, or is that completely uncorrelated to the envelope?

This is another excellent point. Yes, it is highly likely that the visual input contributed to the better tracking of the AV movie, as has been shown in several previous studies (Ahmed, Nidiffer, & Lalor, 2023; Ahmed, Nidiffer, O'Sullivan, et al., 2023; Crosse et al., 2015; Fleming et al., 2021; Fu et al., 2019; Grant & Seitz, 2000; Haider et al., 2024; Karthik et al., 2024; Schwartz et al., 2004; Sumby & Pollack, 1954; Wikman et al., 2024; Zion Golumbic, Cogan, et al., 2013). And yes, looking at the person you are meant to pay attention to IS expected to explain part of the increase in speech-tracking for target speech, although it is impossible to separate the contribution of the "visual motion" from the contribution of "attention".

The choice to use a movie as the target stimulus was motivated by the aspiration to study selective attention to speech under more ecologically relevant contexts. Specifically, this study simulates the case where individuals are asked to pay attention to an actual person speaking, rather than simply a voice without a body. Even though this introduces yet another difference between the target and non-target speech, this is closer to what the brain encounters in real life - where the listener looks at a target speaker, but still may hear audio stemming from other irrelevant locations.

We now discuss this point more in depth in the Introduction (p.3) and Discussion (p.22).

2. It seems that the EEG data was filtered twice. First between 0.5 and 40 Hz, then between 1 and 20 Hz. Could the authors elaborate on that point? Also, what filters were used (type and order)? Were they zero-phase shift? Filtering is a delicate operation that can damage the data.

Yes, the data was indeed filtered twice. The first filter is applied as part of the preprocessing procedure, in order to remove extremely high- and low-frequency noise but retain most activity within the range of "neural" activity. This broad range is mostly important for the ICA procedure, so as to adequately separate between ocular and neural contributions to the recorded signal.

However, since the speech-tracking response itself is less broadband and comprises mostly frequencies present in the speech envelope itself (mostly <10 Hz), a second, narrower filter was applied to improve the TRF model fit.

In both cases we used a 4th order zero-phase Butterworth IIR filter with 1 second of padding, as implemented in the Fieldtrip toolbox. We have added these details to the manuscript.
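For concreteness, here is a minimal scipy-based sketch of the two-stage zero-phase band-pass filtering described above; the authors used the FieldTrip implementation, and the sampling rate, exact filter-design details and padding handling here are assumptions for illustration.

```python
# Minimal sketch of two-stage zero-phase Butterworth band-pass filtering.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def zero_phase_bandpass(data, low, high, fs, order=4, pad_sec=1.0):
    """data: channels x time. Butterworth filter applied forward and backward (zero-phase)."""
    sos = butter(order, [low, high], btype='bandpass', fs=fs, output='sos')
    pad = int(pad_sec * fs)                      # roughly 1 s of padding, as in the text
    return sosfiltfilt(sos, data, axis=-1, padlen=pad)

# fs = 256  # illustrative sampling rate
# eeg_broad = zero_phase_bandpass(raw_eeg, 0.5, 40, fs)   # broad pre-ICA range
# eeg_trf   = zero_phase_bandpass(eeg_broad, 1, 20, fs)   # narrower band for TRF fitting
```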

3. Figure 3A: It would be useful to see that result before and after the switch, and for target and non-target, as in the other panels.

For this, please see revised Figure 4 (previously 3) and the new Figure 5.

4. It wasn't clear to me how the encoding and decoding models were fit exactly. Were separate models fit for before and after? That would lead to even smaller training data, which would be problematic. What about Target and Non-Target?

We have thoroughly revised the methods section to clarify exactly how models were fit. Indeed, for the analysis comparing the 1st vs. 2nd half of the experiment, separate models were fit for each half, and compared to each other. We have also added a new cross-over analysis, where models trained on data from one half of the experiment were tested on data from the other half (see revised Figure 5). As noted in response to previous comments, we have also added new analyses to assess the reliability of these models, despite being trained on half the amount of data.

5. It seems that the z-score thresholds were picked arbitrarily, with little justification. Could the authors elaborate on that? That looks important, as a different threshold would change that 25%.

Yes, we agree that z-score thresholds can be arbitrary, even though they are statistically interpretable in terms of associated p-values. We now report z-values as a continuum (see panel B in revised Figures 6-8), and report the proportion of participants with a significant neural-bias index using both a threshold of z=1.64 (p<0.05 one-tailed) as well as a more lenient z=1.

6. Is the data going to be publicly available? I strongly recommend that.

Yes, we are preparing the data and will make it publicly available once the paper is published.

Reviewer 2:

The paper presents an experiment investigating the neural tracking of speech signals in the presence of a competing, differently localised speaker. The target speaker is always presented in front of the listener, with a supporting video, and the competing speaker is presented to the left. The authors use a TRF approach to estimate the neural tracking of both speakers and further use the difference in (envelope) reconstruction accuracy for target and ignored talker as an index of attention bias. They report that attention bias (calculated in this way) is at best a poor predictor of behaviour, varying substantially between participants and within participants over experimental segments.

This is in many ways an intriguing report, but there are some issues that, in my opinion, reduce the confidence we may have in the results as they are currently presented and interpreted.

The paper is overall well written and presented, the introduction covers a satisfactory range of relevant recent work and frames the work adequately (although note comment below relating to alpha-band power). The authors raise an important point regarding the significance of individual differences. However, there are methodological details that are at times insufficiently clearly described and choices pertaining to the experimental design and analyses are in part not sufficiently explained.

In particular, it is unclear how the authors conceive of "attention switch". I believe the paradigm can be used to probe sustained (selective) attention to only one target (the speaker presented with video) in the presence of a lateralised distractor. I found the construct of switching to be over-emphasised, given that there is only one switch of talker, which did not entail a change in locus or modality of attention in order to continue effectively executing the task, and that, furthermore, the trial in which this switch occurs is explicitly not analysed.

I am also somewhat concerned about a circularity of argumentation. It appears that attention is defined as a function of a specific neural correlate (i.e. the attention bias, derived from the TRF). Is this a valid definition of attention? How confident can we be that the measure of attention bias is a measure of attention? It may well be an index that is affected by attention, but given that it is not well predictive of behavioural outcomes, there is an implicit implication that auditory selective attention must therefore not be relevant to behavioural performance in this task. I do not think this is the message the authors are trying to convey, thus caution is required in describing the nature of the effects.

In sum, the paper presents an interesting study with an intriguing outcome, but the core cognitive construct requires more comprehensive elucidation and the design and analysis require additional explanation and justification.

Main Concerns:

The description of the loudspeaker position follows the description of the presentation of the distractor to the left. In order to make it clear where the target is presented, specify the location of loudspeakers first.

Thank you, we clarified this in the text (p.6).

An analysis of audio from on-ear headphones is mentioned - if the listener is concurrently receiving auditory input from headphones alongside the free field, this must be discussed, and any potential interference must be accounted for. Wearing on-ear headphones will have at least one consequence that is of particular relevance in the context of sound localisation - it will eliminate cues for localisation that are generated by the transfer functions of the pinnae. This appears to have the unfortunate consequence of reducing the ecological validity of the free-field auditory presentation.

We apologize for our mistake - participants were wearing on-ear microphones, not headphones. Audio was presented free field. (The microphones were used to test if we could segregate the audio of the two speakers without access to the 'clean audio', an analysis that was successful but is orthogonal to the current manuscript.)

Justification for the target speaker being presented with video. It is known that neural tracking of speech signals increases (i.e. is altered) in the presence of concurrent visual input. This is briefly mentioned in the discussion, but the potential consequences of one speech stream being supported by visual input and the other not are potentially greater than merely increasing the amplitude of entrainment. The role of, for example, the motor cortex (e.g., Park, Kayser, Thut & Gross, 2016) in mediating auditory entrainment to the speech signal may have consequences for the response function that relates the acoustic and cerebral signals.

Thank you for highlighting this point. Yes, the AV presentation is a key factor (see response to Rev #1 as well). Visual input likely contributed to the enhanced tracking of the AV movie, reflecting both the visual motion itself and selective attention. While this creates a difference between target and non-target speech, the decision to use an AV movie was driven by the goal of simulating more ecologically valid listening conditions.

We now discuss this point more in depth in the Introduction (p.3) and Discussion (p.21): "It is likely that differences in spatio-temporal characteristics of TRFs for target and non-target speech are affected both by the specific perceptual attributes of the stimuli themselves (e.g., audiovisual vs. audio presentation, spatial location) as well as by their task-related role (target vs. non-target). In the spatially-realistic experimental design used here, these factors are inherently confounded, just as they are under real-life conditions in which listeners look at the talker that they are trying to pay attention to. [....] Rather than trying to assert whether the differences in TRF are due to "selective attention per se" or to "perceptual differences", we accept that in real-life these often go together. We posit that as selective attention research progresses to more ecologically valid contexts, these factors cannot (and perhaps need not!) be teased apart, but rather should be considered as making inherently joint contributions to the recorded neural signal."

This brings me on to a further methodological question: Is TRF the correct approach for assessing neural tracking? The measure provides an insight into the stability of the linear relationship between R and S, but it is not clear to what extent they truly reflect tracking of the envelope. Can the authors dedicate some effort to explaining why they favoured this specific method over others (such as phase-locking index, Gaussian Copula Mutual Information, etc.)? It would also be valuable to know more about the justification for the broadband speech envelope employed.

This is a great question, and one that the speech-tracking community has been actively looking into. Although we agree that the linear nature of the TRF might not fully capture all aspects of the neural response to speech, it has been shown in countless studies (in humans and animal models) to be an effective, reliable and simple way to capture basic features of the neural representation of speech and other dynamic auditory stimuli.

Importantly, although the alternative methods mentioned by the reviewer can capture non-linear representations as well, they lack sufficiently detailed information about the nature of the neural response itself, whereas TRFs provide insight into the spatio-temporal features of the neural response and are thus physiologically interpretable.

The functional relevance of TRFs has also been validated in the domain of hearing and speech processing, with correlations demonstrated between neural speech-tracking metrics and speech intelligibility, for example in children with dyslexia or in those with hearing impairment (Keshavarzi et al., 2022; Van Hirtum et al., 2023; Xiu et al., 2022), and of course in the domain of selective attention (Ding et al., 2012; Fiedler et al., 2019; Kerlin et al., 2010; J. A. O'Sullivan et al., 2015; Zion Golumbic, Ding, et al., 2013; Kaufman & Zion Golumbic, 2023). For these reasons, we have chosen to stick with this approach in the current work, although we fully acknowledge that its linear nature is somewhat limiting. We now acknowledge this in the Discussion (p. 25).

Regarding the choice to use the broadband speech envelope: Several previous studies have shown that when performing speech-tracking analysis of scalp-recorded E/MEG data, similar results are obtained if using a bank of narrow-band responses or averaging them into a broadband response. This might not be the case for recordings with higher spatial resolution (e.g. invasive recordings) which can capture more fine-grained spectro-temporal responses; however, for the current purposes the broadband envelope is sufficient.
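A minimal sketch of extracting a broadband envelope by averaging narrow-band (Hilbert) envelopes is given below; the band edges, compression exponent and output rate are illustrative assumptions, whereas the authors used a cochlear filterbank based on Liberman's (1982) frequency map.

```python
# Minimal sketch: filter the audio into bands, take each band's magnitude envelope,
# average across bands, compress, and downsample to the EEG analysis rate.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert, resample_poly

def broadband_envelope(audio, fs, band_edges=(100, 300, 700, 1500, 3000, 6000), fs_out=100):
    envs = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        sos = butter(2, [lo, hi], btype='bandpass', fs=fs, output='sos')
        band = sosfiltfilt(sos, audio)
        envs.append(np.abs(hilbert(band)))       # narrow-band magnitude envelope
    env = np.mean(envs, axis=0) ** 0.3           # power-law compression (illustrative)
    return resample_poly(env, fs_out, int(fs))   # downsample to fs_out
```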

Other work has focused on lateralised changes to the oscillatory cerebral signal (particularly in the alpha band) as a function of lateralised auditory attention (Wöstmann and colleagues, e.g. Wöstmann, Alavash & Obleser, 2019; Wöstmann, Lim & Obleser, 2017). There is no mention of this research, which seems like an oversight when summarising the current state of the field, even if the TRF method is not directly comparable to time-frequency analysis.

We are aware of this work and did look at alpha power, but did not find any interesting effects, nor did we see convincing lateralization (see Figure below). For this reason, we did not include this in the manuscript; however, if the reviewer feels that this is pertinent, we are happy to add these null results to the paper.

Minor points and requests for clarification:

Specify the interpolation method applied.

We used the built-in Fieldtrip toolbox FT_CHANNELREPAIR function, which repairs bad channels in the data by replacing them with the plain average of all neighbors (see Perrin et al., 1989). Neighbors were defined as all channels within a distance of up to 0.15 cm from the bad channel. We have now added this detail to the manuscript (p.7).

Describe how cardiac artefacts were identified.

ICA components capturing cardiac artifacts were identified visually. These components show typical rhythmically recurring spikes reminiscent of EKG recordings, and often show a diagonal topography over the scalp. We have now clarified in the manuscript that identification was done through visual inspection (p.7).
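To make the interpolation explicit, a minimal sketch of the plain-average neighbor repair (analogous to FT_CHANNELREPAIR) is given below; the function name and coordinate handling are illustrative, and the distance threshold simply follows the value stated in the text.

```python
# Minimal sketch: replace a bad channel with the plain average of its neighbors.
import numpy as np

def repair_channel(data, chan_pos, bad_idx, max_dist=0.15):
    """data: channels x time; chan_pos: channels x 3 electrode coordinates."""
    dists = np.linalg.norm(chan_pos - chan_pos[bad_idx], axis=1)
    neighbors = np.where((dists > 0) & (dists <= max_dist))[0]
    data = data.copy()
    data[bad_idx] = data[neighbors].mean(axis=0)    # plain average of neighboring channels
    return data
```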

Please clarify or homogenise: "Trials" and "Sentences"

Perhaps the reviewer meant "segments" rather than "sentences"? We use the term "segment" to refer to a portion of the speech-stimulus, and the term "trial" to refer to the data collected while participants heard a particular "segment" of the lecture.

We have now clarified this (p.7): "The clean data was cut into trials, corresponding to portions of the experiment in which a single segment of the lecture was presented."

Please provide the reference for "Liberman's cochlear frequency map".

We have added the reference to Liberman (1982) to the manuscript.

Provide more detail on the calculation of Bayes factors - report the priors in sufficient detail to allow future replication of the analyses, even if the defaults of the software should change, or in other software.

The Bayesian ANOVA was conducted using JASP (version 0.16.3) with default settings. Fixed effects were modeled using a Cauchy prior distribution with a scale parameter of r=0.5. Random effects were modeled using a Cauchy prior with a scale parameter of r=1, and covariates were modeled using a Cauchy prior with a scale parameter of r=0.354. A uniform prior was applied to the model space, assigning equal prior probabilities to all models. The principle of marginality was respected, ensuring that all lower-order effects were included in models with higher-order interactions. We have now added these details to the manuscript (p.8).
