Research Article: New Research, Cognition and Behavior

Neural Speech Tracking Contribution of Lip Movements Predicts Behavioral Deterioration When the Speaker's Mouth Is Occluded

Patrick Reisinger, Marlies Gillis, Nina Suess, Jonas Vanthornhout, Chandra Leon Haider, Thomas Hartmann, Anne Hauswald, Konrad Schwarz, Tom Francart and Nathan Weisz
eNeuro 16 January 2025, 12 (2) ENEURO.0368-24.2024; https://doi.org/10.1523/ENEURO.0368-24.2024
Author affiliations:

1. Department of Psychology, Centre for Cognitive Neuroscience, Paris-Lodron-University of Salzburg, Salzburg 5020, Austria (Patrick Reisinger, Nina Suess, Chandra Leon Haider, Thomas Hartmann, Anne Hauswald, Nathan Weisz)
2. Experimental Oto-Rhino-Laryngology, Department of Neurosciences, Leuven Brain Institute, KU Leuven, Leuven 3000, Belgium (Marlies Gillis, Jonas Vanthornhout, Tom Francart)
3. MED-EL GmbH, Innsbruck 6020, Austria (Konrad Schwarz)
4. Neuroscience Institute, Christian Doppler University Hospital, Paracelsus Medical University Salzburg, Salzburg 5020, Austria (Nathan Weisz)
Abstract

Observing lip movements of a speaker facilitates speech understanding, especially in challenging listening situations. Converging evidence from neuroscientific studies shows stronger neural responses to audiovisual stimuli than to audio-only stimuli. However, the interindividual variability of this contribution of lip movement information, and its consequences for behavior, are unknown. We analyzed source-localized magnetoencephalographic responses from 29 normal-hearing participants (12 females) listening to audiovisual speech, both with and without the speaker wearing a surgical face mask, and in the presence or absence of a distractor speaker. Using temporal response functions to quantify neural speech tracking, we show that neural responses to lip movements are, in general, enhanced when speech is challenging. After controlling for speech acoustics, we show that lip movements contribute to enhanced neural speech tracking, particularly when a distractor speaker is present. However, the extent of this visual contribution to neural speech tracking varied greatly among participants. Probing its behavioral relevance, we demonstrate that individuals with a higher contribution of lip movements to neural speech tracking show a stronger drop in comprehension and an increase in perceived difficulty when the mouth is occluded by a surgical face mask. In contrast, no effect was found when the mouth was not occluded. We provide novel insights into how the contribution of lip movements to neural speech tracking varies among individuals, and into its behavioral relevance, revealing negative consequences when visual speech is absent. Our results also offer potential implications for objective assessments of audiovisual speech perception.

  • audiovisual speech
  • lip movements
  • MEG
  • neural tracking
  • temporal response functions
  • TRF

Significance Statement

In complex auditory environments, simultaneous conversations pose a challenge to speech comprehension. We investigated, on a neural level, how lip movements aid in such situations and what the behavioral consequences are, especially when lip information is occluded by a face mask. Using magnetoencephalographic responses from participants listening to audiovisual speech, we show that observing lip movements enhances neural speech tracking and that participants who rely more on lip movements show behavioral deterioration when the speaker wears a face mask. Remarkably, this was not the case when the speaker wore no face mask. Our findings reveal interindividual differences in the contribution of lip movements to neural speech tracking, with potential applications in objective assessments of audiovisual speech perception.

Introduction

Face masks are an important tool in preventing the spread of contagious diseases such as COVID-19 (Chu et al., 2020; Suñer et al., 2022). However, as many have subjectively experienced firsthand, the use of face masks also impairs speech perception, and not only by attenuating sound. More importantly, they occlude facial expressions, such as lip movements (Brown et al., 2021; Rahne et al., 2021), that provide visual information for a relevant speech stream. This is particularly critical when speech is challenging, such as in the classic cocktail party situation, where conversations are happening simultaneously (Cherry, 1953). Ideally, visual information is available to aid in such situations, with numerous studies demonstrating that visual speech features enhance the understanding of degraded auditory input (Sumby and Pollack, 1954; Grant and Seitz, 2000; Ross et al., 2007; Remez, 2012). Among visual speech features, lip movements are the most important, playing a crucial role in the perception of challenging speech (Erber, 1975; Peelle and Sommers, 2015). However, substantial interindividual differences in lip-reading performance have been shown in both normal-hearing and hearing-impaired populations (Suess et al., 2022b; for a review see Summerfield et al., 1992). Despite our imperfect lip-reading abilities, the human brain effectively uses lip movements to facilitate the perception of challenging speech, with the neural mechanisms and regions involved still under debate (Ross et al., 2022; Zhang and Du, 2022).

A suitable method for studying the neural responses to audiovisual speech is neural speech tracking (Obleser and Kayser, 2019; Brodbeck and Simon, 2020). This method typically aims to predict the neural response to one or more stimulus features via so-called temporal response functions (TRFs; Crosse et al., 2021). The TRF approach has so far extended our understanding of speech processing from acoustic features (Lalor et al., 2009) to higher-level linguistic features (Broderick et al., 2018; Brodbeck et al., 2018a; Gillis et al., 2021).

Previous studies have shown beneficial effects of visual speech on the representation of speech in the brain. A magnetoencephalography (MEG) study by Park et al. (2016) showed enhanced entrainment between lip movements and speech-related brain areas when congruent audiovisual speech was presented. Other studies have shown that the incorporation of visual speech enhances the ability of the brain to track acoustic speech (Golumbic et al., 2013; Crosse et al., 2015, 2016b). Interestingly, when silent lip movements are presented, the visual cortex also follows the unheard acoustic speech envelope (Hauswald et al., 2018) or unheard spectral fine details (Suess et al., 2022a). Despite these findings, two questions remain unanswered: First, it is unknown to what extent individuals vary in their unique contribution of lip movements to neural speech tracking. Given the aforementioned interindividual differences in lip-reading performance, we hypothesize a high degree of variability in this contribution. Second, it is unknown whether the unique contribution of lip movements is behaviorally relevant, for example, when the lips are occluded by a face mask, as was common during the COVID-19 pandemic. Given the negative impact of face masks on behavioral measures (Rahne et al., 2021; Toscano and Toscano, 2021; Truong et al., 2021), we expect the following relationship: Individuals who show a higher unique contribution of lip movements should also show poorer behavioral outcomes when no lip movements are available, as they are deprived of critical visual information.

Using MEG and an audiovisual speech paradigm with one or two speakers, we probed the unique contribution of lip movements and its behavioral relevance. Utilizing a state-of-the-art neural tracking framework with source-localized TRFs (Fig. 1C), we show that lip movements elicit stronger neural responses when speech is difficult compared with when it is clear. Additionally, we show that the neural tracking of lip movements is enhanced in multispeaker settings. When controlling for acoustic speech features, the obtained unique contribution of lip movements is, in general, enhanced in the multispeaker condition, with substantial interindividual variability. Using Bayesian modeling, we show that acoustic speech tracking is related to behavioral measures. Crucially, we demonstrate that individuals who show a higher unique contribution of lip movements show a stronger drop in comprehension and report a higher subjective difficulty when the mouth is occluded by a surgical face mask. In terms of neural tracking, our results suggest that the unique contribution of lip movements is highly variable across individuals. We also establish a novel link between this unique neural contribution of visual speech and behavior when no lip movement information is available.

Figure 1.

Experimental design, behavioral results, and analysis framework. A, Each block consisted of 10 ∼1 min trials of continuous audiovisual speech by either a female or male speaker (single-speaker condition). In 30% of these 10 trials, a same-sex audio–only distractor speaker was added (multispeaker condition). After every block, two comprehension statements had to be rated as correct or wrong. B, Comprehension was lower in the multispeaker condition than in the single-speaker condition (p = 0.003; rC = 0.64). Subjective difficulty ratings, reported on a five-point Likert scale, were higher in the multispeaker condition (p = 9.00 × 10−6; rC = −0.95). The reported engagement was lower in the multispeaker condition (p = 0.024; rC = 0.62). The black dots represent the mean, and the bars represent the standard error of the mean (SEM). C, Three stimulus features (spectrogram, acoustic onsets, and lip movements) extracted from the audiovisual stimuli are shown for an example sentence. Higher values in the lip area unit represent a wider opening of the mouth and vice versa. Three forward models were calculated: (1) one using only acoustic features, (2) one using only lip movements, and (3) one combining all features. Together with the corresponding source-localized MEG data, the boosting algorithm was used to calculate the models. Exemplary minimum-norm source estimates are shown for a representative participant. The resulting TRFs (a.u.) and neural tracking (expressed as Pearson's r) were analyzed in fROIs, obtained either via the acoustic or lip model of the multispeaker condition. The TRFs and prediction accuracies shown are from a representative participant reflecting the group-level results. To obtain the unique contribution of lip movements, we controlled acoustic features by subtracting the prediction accuracies in an acoustic + lip fROI of the acoustic model from the combined model. The unique contribution of lip movements was expressed as a percentage change. 
*p < 0.05; **p < 0.01; ***p < 0.001.

Materials and Methods

Participants

The data were collected as part of a recent study (Haider et al., 2022), in which 30 native speakers of German participated. One participant was excluded because signal source separation could not be applied to the MEG dataset due to file corruption. This led to a final sample size of 29 participants aged between 22 and 41 years (12 females; Mage = 26.79 years; SDage = 4.87 years). All participants reported normal vision and hearing (thresholds did not exceed 25 dB HL at any frequency from 125 to 8,000 Hz), the latter verified with a standard clinical audiometer (AS608 Basic; Interacoustics A/S). Additional exclusion criteria included nonremovable magnetic objects and any psychiatric or neurologic history. All participants signed an informed consent form and were reimbursed at a rate of 10 € per hour. The experimental protocol was approved by the ethics committee of the Paris-Lodron-University of Salzburg and was conducted in accordance with the Declaration of Helsinki.

Stimuli and experimental design

The experimental procedure was implemented in MATLAB 9.10 (The MathWorks) using custom scripts. Presentation of stimuli and response collection was achieved with the Objective Psychophysics Toolbox (o_ptb; Hartmann and Weisz, 2020), which adds a class-based abstraction layer onto the Psychophysics Toolbox version 3.0.16 (Brainard, 1997; Pelli, 1997; Kleiner et al., 2007). Stimuli and triggers were generated and emitted via the VPixx system (DATAPixx2 display driver, PROPixx DLP LED projector, RESPONSEPixx response box; VPixx Technologies). Videos were back-projected onto a translucent screen with a screen diagonal of 74 cm (∼110 cm in front of the participants), with a refresh rate of 120 Hz and a resolution of 1,920 × 1,080 pixels. Timings were measured with the Black Box ToolKit v2 (The Black Box ToolKit) to ensure accurate stimulus presentation and triggering.

The audiovisual stimuli were excerpts from four German stories, two read aloud by a female speaker and two by a male speaker (female: “Die Schokoladenvilla - Zeit des Schicksals. Die Vorgeschichte zu Band 3” by Maria Nikolai and “Die Federn des Windes” by Manuel Timm; male: “Das Gestüt am See. Charlottes großer Traum” by Paula Mattis and “Gegen den Willen der Väter” by Klaus Tiberius Schmidt). A Sony NEX-FS100 camera (Sony) with a sampling rate of 25 Hz and a RØDE NTG2 microphone (RØDE Microphones) with a sampling rate of 48 kHz were used to record the stimuli. Each of the four stories was recorded twice, once with and once without a surgical face mask (type IIR three-layer disposable medical mask). These eight videos were cut into 10 segments of ∼1 min each (M = 64.29 s; SD = 4.87 s), resulting in 80 videos. In order to rule out sex-specific effects, 40 videos (20 with a female speaker and 20 with a male speaker) were presented to each participant. The speakers’ syllable rates were analyzed using Praat (Boersma, 2001; de Jong and Wempe, 2009) and varied between 3.7 and 4.6 Hz (M = 4.1 Hz). The audio-only distractor speech consisted of prerecorded audiobooks (Schubert et al., 2023), read by either a female or a male speaker. All audio files were normalized using ffmpeg-normalize version 1.19.1 (running on Python 3.9.7) with default options.

Before the experiment, standard clinical audiometry was performed (for details, see above, Participants). The MEG measurement started with a 5 min resting-state recording (not analyzed in this manuscript). Next, the participant's individual hearing threshold was determined in order to adjust the stimulation volume. If the participant reported that the stimulation was not loud enough or comfortable, the volume was manually adjusted to the participant's requirements. Hearing threshold levels ranged from −91.76 dB (RMS) to −68.78 dB (RMS) [M = −80.57 dB (RMS); SD = 4.20 dB (RMS)].

The actual experiment consisted of four stimulation blocks, one for each of the four stories, with two featuring each sex. Each story was presented as a block of 10 ∼1 min trials (ranging from 0.93 to 1.27 min) in a chronological order to preserve the story content (Fig. 1A). In every block, a same-sex audio–only distractor speaker was added to three randomly selected trials, with a 5 s delay and volume equal to the target speaker. The resulting ratio of 30% multispeaker trials and 70% single-speaker trials per block was chosen because of a different data analysis method in Haider et al. (2022). The distractor speech started with a delay of 5 s to give participants time to attend the target speaker. In two randomly selected blocks, the target speaker wore a face mask (only the corresponding behavioral data were used here; see below, Statistical analysis and Bayesian modeling). Two unstandardized correct or wrong statements about semantic content were presented after each trial to assess comprehension and to maintain attention (Fig. 1A). On four occasions in each block, participants also rated subjective difficulty and engagement on a five-point Likert scale (not depicted in Fig. 1A). The participants responded by pressing buttons. The blocks were presented randomly, and the total duration of the experiment was ∼2 h, including preparation.

MEG data acquisition and preprocessing

Before entering the magnetically shielded room, five head position indicator (HPI) coils were applied on the scalp. Electrodes for electrooculography (vertical and horizontal eye movements) and electrocardiography were also applied (recorded data not used here). Fiducial landmarks (nasion and left/right preauricular points), the HPI locations, and ∼300 head shape points were sampled with a Polhemus FASTRAK digitizer (Polhemus).

Magnetic brain activity was recorded with a Neuromag Triux whole-head MEG system (MEGIN Oy, Espoo, Finland) using a sampling rate of 1,000 Hz (hardware filters, 0.1–330 Hz). The signals were acquired from 102 magnetometers and 204 orthogonally placed planar gradiometers at 102 different positions. The system is placed in a standard passive magnetically shielded room (AK3b; Vacuumschmelze).

A signal space separation algorithm (Taulu and Kajola, 2005; Taulu and Simola, 2006) implemented in MaxFilter version 2.2.15, provided by the MEG manufacturer, was used. The algorithm removes external noise from the MEG signal (mainly 16.6 and 50 Hz, plus harmonics) and realigns the data to a common standard head position (to [0 0 40] mm; -trans default MaxFilter parameter) across different blocks, based on the measured head position at the beginning of each block.

Preprocessing of the raw data was done in MATLAB 9.8 using the FieldTrip toolbox (revision f7adf3ab0; Oostenveld et al., 2011). A low-pass filter of 10 Hz (hamming-windowed sinc FIR filter; onepass-zerophase; order, 1320; transition width, 2.5 Hz) was applied, and the data were downsampled to 100 Hz. Afterward, a high-pass filter of 1 Hz (hamming-windowed sinc FIR filter; onepass-zerophase; order, 166; transition width, 2.0 Hz) was applied.
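The filtering and downsampling steps above were run in FieldTrip/MATLAB. As an illustration only, the same sequence can be sketched in Python with SciPy on a synthetic data array (filter orders follow the text; filtfilt is two-pass zero-phase, a close stand-in for the one-pass zero-phase FIR described):

```python
import numpy as np
from scipy.signal import firwin, filtfilt, resample_poly

fs = 1000.0
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 10000))  # toy data: 4 channels, 10 s

# Low-pass at 10 Hz (Hamming-windowed FIR; order ~1320 as in the text)
lp = firwin(1321, 10.0, window="hamming", fs=fs)
x = filtfilt(lp, 1.0, x, axis=-1)

# Downsample 1000 Hz -> 100 Hz
x = resample_poly(x, up=1, down=10, axis=-1)
fs = 100.0

# High-pass at 1 Hz at the new rate (order ~166; odd length for a type-I FIR)
hp = firwin(167, 1.0, window="hamming", pass_zero=False, fs=fs)
x = filtfilt(hp, 1.0, x, axis=-1)
print(x.shape)  # (4, 1000)
```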

Independent component analysis (ICA) was used to remove eye and cardiac artifacts (data were filtered between 1 and 100 Hz; sampling rate, 1,000 Hz) via the infomax algorithm (“runica” implementation in EEGLAB; Bell and Sejnowski, 1995; Delorme and Makeig, 2004) applied to a random block of the main experiment. Prior to the ICA computation, we performed a principal component analysis with 50 components in order to ease the convergence of the ICA algorithm. After visual identification of artifact-related components, an average of 2.38 components per participant were removed (SD = 0.68).
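As a rough, illustrative analogue of this PCA-then-ICA cleaning step: the paper used EEGLAB's infomax ("runica") on 50 principal components of 306-channel MEG; below, scikit-learn's FastICA on toy data stands in, and all dimensions are arbitrary:

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(1)
# Toy "MEG": 50 channels mixing a blink-like sparse source with noise
t = np.arange(10000) / 1000.0
blink = (np.sin(2 * np.pi * 0.3 * t) > 0.99).astype(float)
data = rng.standard_normal((50, t.size)) * 0.1
data += np.outer(rng.standard_normal(50), blink)

# PCA first to ease ICA convergence (the paper used 50 components), then ICA
pca = PCA(n_components=20).fit(data.T)
scores = pca.transform(data.T)
ica = FastICA(n_components=20, random_state=0).fit(scores)
sources = ica.transform(scores)          # (n_times, n_components)

# "Removing" an artifact component: zero it out and back-project
sources[:, 0] = 0.0
cleaned = pca.inverse_transform(ica.inverse_transform(sources)).T
print(cleaned.shape)  # (50, 10000)
```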

The cleaned data were epoched into trials that matched the length of the audiovisual stimuli. To account for an auditory stimulus delay introduced by the tubes of the sound system, the data were shifted by 16.5 ms. In the multispeaker condition, the first 5 s of data were removed to match the onset of the distractor speech. The last eight trials were removed to equalize the data length between the single-speaker and multispeaker conditions. To prepare the data for the following steps, the trials in each condition were concatenated. This resulted in a data length of ∼6 min per condition.

Source localization

Source projection of the data was done with MNE-Python 1.1.0 running on Python 3.9.7 (Gramfort et al., 2013, 2014). A semiautomatic coregistration pipeline was used to coregister the FreeSurfer “fsaverage” template brain (Fischl, 2012) to each participant's head shape. After an initial fit using the three fiducial landmarks, the coregistration was refined with the iterative closest point algorithm (Besl and McKay, 1992). Head shape points that were >5 mm away from the scalp were automatically omitted. The subsequent final fit was visually inspected to confirm its accuracy. This semiautomatic approach performs comparably to manual coregistration pipelines (Houck and Claus, 2020).

A single-layer boundary element model (BEM; Akalin-Acar and Gençer, 2004) was computed to create a BEM solution for the “fsaverage” template brain. Next, a volumetric source space with a grid of 7 mm was defined, containing a total of 5,222 sources (Kulasingham et al., 2020). In order to remove nonrelevant regions and shorten computation times, subcortical structures along the midline were removed, reducing the source space to 3,053 sources (similar to Das et al., 2020). Subsequently, the forward operator (i.e., lead field matrix) was computed using the individual coregistrations, the BEM, and the volume source space.

Afterward, the data were projected to the defined sources using the minimum norm estimate (MNE) method (Hämäläinen and Ilmoniemi, 1994). MNE is known to be biased toward superficial sources, which can be reduced by applying depth weighting with a coefficient between 0.6 and 0.8 (Lin et al., 2006). For creating the MNE inverse operator, depth weighting with a coefficient of 0.8 was used (Brodbeck et al., 2018a). The required noise covariance matrix was estimated from an empty-room MEG recording made close to the participant's measurement date, preprocessed with the same settings as the MEG data of the actual experiment (see above, MEG data acquisition and preprocessing). The MNE inverse operator was then applied to the concatenated MEG data with ℓ2 regularization [signal-to-noise ratio (SNR) = 3 dB, λ² = 1/SNR²] and three free-orientation dipoles placed orthogonally at each source.
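The regularization and depth weighting described here can be illustrated with a toy minimum-norm estimate in NumPy. This is a sketch of the textbook MNE equations on synthetic matrices, not MNE-Python's implementation; the lead field, covariance, and dimensions are made up:

```python
import numpy as np

rng = np.random.default_rng(2)
n_sens, n_src = 30, 100
L = rng.standard_normal((n_sens, n_src))   # toy lead field
C = np.eye(n_sens)                         # noise covariance (whitened)

# Depth weighting: penalize each source by its lead-field norm^(2 * gamma)
gamma = 0.8
R = np.diag(1.0 / np.linalg.norm(L, axis=0) ** (2 * gamma))

# Regularization from SNR = 3: lambda^2 = 1 / SNR^2
snr = 3.0
lam2 = 1.0 / snr ** 2

# Minimum-norm inverse operator M = R L^T (L R L^T + lambda^2 C)^-1
M = R @ L.T @ np.linalg.inv(L @ R @ L.T + lam2 * C)
y = rng.standard_normal((n_sens, 1))       # one toy sensor sample
x_hat = M @ y                              # source estimate
print(x_hat.shape)  # (100, 1)
```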

Extraction of stimulus features

Since the focus of this study is on audiovisual speech, we extracted acoustic (spectrograms and acoustic onsets) and visual (lip movements) speech features from the stimuli (Fig. 1C). The spectrograms of the auditory stimuli were obtained using the Gammatone Filterbank Toolkit 1.0 (Heeris, 2013), with frequency cutoffs at 20 and 5,000 Hz, 256 filter channels, and a window time of 0.01 s. This toolkit computes a spectrogram representation on the basis of a set of gammatone filters which are inspired by the human auditory system (Slaney, 1998). The resulting filter outputs with logarithmic center frequencies were averaged into eight frequency bands (frequencies <100 Hz were omitted; Gillis et al., 2021). Each frequency band was scaled with exponent 0.6 (Biesmans et al., 2017) and downsampled to 100 Hz, which is the same sampling frequency as the preprocessed MEG data.
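As an illustrative sketch of this band decomposition: below, a Butterworth filterbank on synthetic noise stands in for the gammatone filterbank and real speech; band edges and sampling rates are arbitrary, except for the compression exponent 0.6 and the 100 Hz target rate from the text:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert, resample_poly

fs = 8000  # toy audio rate (the study recorded audio at 48 kHz)
rng = np.random.default_rng(3)
audio = rng.standard_normal(fs * 2)  # 2 s of noise as a stand-in for speech

# 8 bands with logarithmically spaced edges (toy range 100-3800 Hz)
edges = np.geomspace(100, 3800, 9)
bands = []
for lo, hi in zip(edges[:-1], edges[1:]):
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    env = np.abs(hilbert(sosfiltfilt(sos, audio)))  # band envelope
    env = env ** 0.6                                # power-law compression
    bands.append(resample_poly(env, up=100, down=fs))  # -> 100 Hz
spectrogram = np.stack(bands)  # (8 bands, n_times at 100 Hz)
print(spectrogram.shape)  # (8, 200)
```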

Acoustic onset representations were calculated for each frequency band of the spectrograms using an auditory edge detection model (Fishbach et al., 2001). The resulting spectrograms of the acoustic onsets are valuable predictors of MEG responses to speech stimuli (Daube et al., 2019; Brodbeck et al., 2020). A delay layer with 10 delays from 3 to 5 ms, a saturation scaling factor of 30, and a receptive field based on the derivative of a Gaussian window (SD = 2 ms) were used (Gillis et al., 2021). Each frequency band was downsampled to 100 Hz.
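A heavily simplified stand-in for such an onset representation is the half-wave-rectified temporal derivative of each band; the actual Fishbach et al. model additionally includes the delay layer, saturation, and Gaussian receptive field described above, which are omitted here:

```python
import numpy as np

def onsets(spectrogram):
    """Half-wave-rectified temporal derivative of each band (simplified
    onset detector; keeps only increases in band energy)."""
    d = np.diff(spectrogram, axis=-1, prepend=spectrogram[..., :1])
    return np.clip(d, 0.0, None)

band = np.array([[0.0, 0.2, 0.9, 0.8, 0.1]])
print(onsets(band))  # rises at samples 1-2 survive, decays are zeroed
```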

The lip movements of every speaker were extracted from the videos with a MATLAB script adapted from Suess et al. (2022a; originally by Park et al., 2016). Within the lip contour, the area as well as the horizontal and vertical axes were calculated. Only the area was used for the analysis, which yields results comparable to using the vertical axis (Park et al., 2016). The lip area signal was upsampled from 25 to 100 Hz using FFT-based interpolation.
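The FFT-based upsampling step can be sketched with scipy.signal.resample (the lip-area values below are a toy sinusoid, not real tracking data):

```python
import numpy as np
from scipy.signal import resample

# Toy lip-area signal at the video rate of 25 Hz (2 s)
lip_area_25 = np.sin(2 * np.pi * 2.0 * np.arange(50) / 25.0)

# FFT-based interpolation from 25 Hz to 100 Hz (factor 4)
lip_area_100 = resample(lip_area_25, num=lip_area_25.size * 4)
print(lip_area_100.shape)  # (200,)
```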

Forward models

A linear forward modeling approach was used to predict the MEG response to the aforementioned stimulus features (Fig. 1C). These approaches are based on the idea that the brain's response to a stimulus is a continuous function in time (Lalor et al., 2006). The boosting algorithm (David et al., 2007), implemented in eelbrain 0.38 (running on Python 3.9.7; Brodbeck et al., 2023), was used to predict MNE source-localized MEG responses to stimulus features (“MNE-boosting”; Brodbeck et al., 2018b). For multiple stimulus features, the linear forward model can be formulated as follows:

ŷ_t = Σ_{i=1}^{n} Σ_{τ=τ_min}^{τ_max} h_{i,τ} · x_{i,t−τ}

For each of the n stimulus features, the algorithm finds an optimal filter kernel h, also known as a TRF. When n > 1, h is referred to as a multivariate TRF (mTRF). The term τ denotes the delays between the predicted brain response ŷ_t and the stimulus feature x (for further details, see Brodbeck et al., 2023). TRFs reflect responses to continuous data instead of averaged responses to discrete events (Crosse et al., 2021). For the estimation of the TRFs, the stimulus features and MEG data were normalized (z-scored), and an integration window from −100 to 600 ms with a kernel basis of 50 ms Hamming windows was defined. To prevent overfitting, early stopping based on the ℓ2 norm was used. By using fourfold nested cross-validation (two training folds, one validation fold, and one test fold), each partition served as a test set once (Brodbeck et al., 2023). TRFs were estimated independently for each of the three free-orientation dipoles at all 3,053 sources (see above, Source localization). The spectrogram and acoustic onset mTRFs were averaged over the frequency dimension. To account for interindividual anatomical differences, TRFs were spatially smoothed with a Gaussian kernel (SD = 5 mm; Kulasingham et al., 2020). The Euclidean vector norm of the smoothed TRFs was taken, resulting in one TRF per source.
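The forward-model sum over features and delays can be made concrete with a toy NumPy implementation (synthetic features and a random TRF; np.roll makes the lagging circular for brevity, which real TRF estimation with boosting does not do):

```python
import numpy as np

rng = np.random.default_rng(4)
n_feat, n_times = 2, 1000          # two features at 100 Hz, 10 s
x = rng.standard_normal((n_feat, n_times))          # stimulus features
lags = np.arange(-10, 60)          # -100 to 590 ms at 100 Hz
h = rng.standard_normal((n_feat, lags.size)) * 0.1  # a toy (m)TRF

# y_hat[t] = sum_i sum_tau h[i, tau] * x[i, t - tau]
y_hat = np.zeros(n_times)
for i in range(n_feat):
    for k, tau in enumerate(lags):
        y_hat += h[i, k] * np.roll(x[i], tau)  # rolled[t] == x[i, t - tau]
print(y_hat.shape)  # (1000,)
```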

To obtain a measure of neural tracking, we correlated the predicted brain response ŷ_t with the measured response; this prediction accuracy was computed as the average dot product of the normalized signals over time (expressed as Pearson's correlation coefficient r). A higher prediction accuracy reflects enhanced neural tracking, meaning that the brain response aligns more closely with the stimulus features (Gillis et al., 2022).

In order to investigate the neural processing of the audiovisual speech features, we calculated three different forward models per condition and participant (see Fig. 1C for the analysis framework). The acoustic model consisted of the two acoustic stimulus features (spectrogram and acoustic onsets) and—also applicable to all other models—the corresponding MNE source-localized MEG data. The lip model contained only the lip movements as a stimulus feature. Additionally, a combined acoustic + lip model was calculated to control for acoustic features in a subsequent analysis.

We defined functional regions of interest (fROIs; Nieto-Castanon et al., 2003) by creating labels based on the 90th percentile of the whole-brain prediction accuracies in the multispeaker condition (similar to Suess et al., 2022a). The multispeaker condition was chosen for extracting the fROIs because it potentially incorporates all included stimulus features, due to its higher demand (Golumbic et al., 2013). This was done separately for the acoustic and lip models to map their unique neural sources (Fig. 1C). According to the “aparc” FreeSurfer parcellation (Desikan et al., 2006), the acoustic fROI mainly involved sources in the temporal, lateral parietal, and posterior frontal lobes. The superior parietal and lateral occipital lobes made up the majority of the lip fROI. To obtain an audiovisual fROI for the acoustic + lip model, we combined the labels of the acoustic and lip fROIs.

For every model, the TRFs in their respective fROI were averaged and, exclusively for Figure 2A, smoothed over time with a 50 ms Hamming window. Grand-average TRF magnitude peaks were detected with scipy version 1.8.0 (running on Python 3.9.7; Virtanen et al., 2020) and visualized as a difference between the multi- and single-speaker conditions. To suppress regression artifacts that typically occur (Crosse et al., 2016a), we visualized TRFs between −50 and 550 ms. Prediction accuracies in the fROIs were Fisher z-transformed and then averaged, and then the z values were back-transformed to Pearson's correlation coefficients (Corey et al., 1998). For the lower panels of each model in Figure 2B, the prediction accuracies of the acoustic and lip models were averaged in their respective fROIs. Figures were created with the built-in plotting functions of eelbrain and seaborn version 0.12.0 (running on Python 3.9.7; Waskom, 2021).
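The Fisher z-averaging of prediction accuracies mentioned above amounts to the following small helper (the r values in the example are made up):

```python
import numpy as np

def average_correlations(rs):
    """Average Pearson r values via Fisher z: transform with arctanh,
    take the mean, and back-transform with tanh (Corey et al., 1998)."""
    return float(np.tanh(np.mean(np.arctanh(np.asarray(rs)))))

print(average_correlations([0.1, 0.3, 0.5]))  # ≈ 0.3093, slightly above
# the arithmetic mean of 0.3 because the z-transform stretches larger r
```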

Figure 2.

Neural responses to audiovisual speech features, neural speech tracking, and the unique contribution of lip movements. A, The three plots show grand-averaged TRFs for the stimulus features in their respective fROIs and the peak magnitude contrasts (multispeaker vs single speaker) between the two conditions in the involved sources. For the acoustic features, TRF magnitudes were generally enhanced when speech was clear, with significant differences ranging from p = 0.004 to p < 0.001 (d = −0.81 to −1.43). In contrast, the TRF to lip movements showed an enhanced magnitude in the multispeaker condition (p = 0.01 to p = 0.0005; d = 0.86–0.91). The shaded areas of the respective conditions represent the SEM. Gray bars indicate the temporal extent of significant differences (p < 0.05) between the two conditions. B, Neural speech tracking is shown for the nonaveraged fROIs (top brain plots) and the averaged fROIs of the acoustic and lip models. Acoustic neural tracking was higher in the single-speaker condition, with significant left- and right-hemispheric differences (both p < 0.001; d from −1.30 to −1.47; averaged, p = 8.76 × 10−9; d = −1.30). Neural tracking of lip movements was higher in the multispeaker condition (p = 0.037; d = 0.51; averaged, p = 0.026; rC = 0.48). In the averaged plots, the black dots represent the mean, and the corresponding bars the SEM, of the respective condition. C, In a combined acoustic and lip fROI, the acoustic model showed higher neural tracking in the single-speaker condition (p = 7.68 × 10−8; d = 1.18). The unique contribution of lip movements was obtained by subtracting the acoustic model from the acoustic + lip model and is expressed as percentage change. Lip movements enhanced neural tracking especially in the multispeaker condition (p = 0.00003; rC = 0.89). Participants showed high interindividual variability: the unique contribution of lip movements reached up to 45.37% in some participants, whereas others showed only a small contribution or none at all. The black dots represent the mean, and the corresponding bars the SEM, of the respective condition. *p < 0.05; **p < 0.01; ***p < 0.001.

To answer the question of whether lip movements enhance neural tracking, the acoustic features (spectrograms and acoustic onsets) need to be controlled for. This is particularly important due to the intercorrelation of audiovisual speech features (Chandrasekaran et al., 2009; Daube et al., 2019). To investigate the unique contribution of lip movements, we used the averaged prediction accuracies in the audiovisual fROI and subtracted the acoustic model from the acoustic + lip model (for a general overview of control approaches, see Gillis et al., 2022). The resulting unique contribution of lip movements was expressed as percentage change (Fig. 2C).
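In code, this subtraction and percentage-change step could look as follows (a minimal sketch with hypothetical names, not the published analysis):

```python
import numpy as np

def unique_contribution_percent(r_acoustic_lip, r_acoustic):
    """Unique contribution of lip movements as percentage change.

    Inputs are averaged prediction accuracies in the audiovisual fROI for
    the combined acoustic + lip model and the acoustic-only baseline.
    """
    r_acoustic_lip = np.asarray(r_acoustic_lip, dtype=float)
    r_acoustic = np.asarray(r_acoustic, dtype=float)
    return 100.0 * (r_acoustic_lip - r_acoustic) / r_acoustic

# A combined model with r = 0.12 against an acoustic baseline of r = 0.10
# corresponds to a 20% unique contribution of lip movements.
```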

Statistical analysis and Bayesian modeling

All frequentist statistical tests were conducted with built-in functions from eelbrain and the statistical package pingouin version 0.5.2 (running on Python 3.9.7; Vallat, 2018). The three behavioral measures (comprehension, difficulty, and engagement; Fig. 1B) were statistically compared between the two conditions (single speaker and multispeaker) using a Wilcoxon signed-rank test, and the matched-pairs rank-biserial correlation rC was reported as the effect size (King et al., 2018).
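The matched-pairs rank-biserial correlation can be derived from the signed-rank sums of the Wilcoxon test; a minimal sketch (not the pingouin implementation; the function name is hypothetical):

```python
import numpy as np
from scipy.stats import rankdata

def rank_biserial(x, y):
    """Matched-pairs rank-biserial correlation r_C (King et al., 2018).

    Zero differences are dropped, absolute differences are ranked (ties
    receive average ranks), and r_C is the normalized difference between
    the rank sums of positive and negative differences; r_C lies in [-1, 1].
    """
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    d = d[d != 0]
    ranks = rankdata(np.abs(d))
    t_pos = ranks[d > 0].sum()
    t_neg = ranks[d < 0].sum()
    return (t_pos - t_neg) / (t_pos + t_neg)
```

An r_C of 1 means every participant's score increased, −1 means every score decreased.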

The TRFs corresponding to the three stimulus features (spectrogram, acoustic onsets, and lip movements; Fig. 2A) were tested for statistical difference between the two conditions using a cluster-based permutation test with threshold-free cluster enhancement (TFCE; dependent-sample t test; 10,000 randomizations; Maris and Oostenveld, 2007; Smith and Nichols, 2009). Due to the previously mentioned TRF regression artifacts, the time window for the test was limited to −50 to 550 ms. Depending on the direction of the cluster, the maximum or minimum t value was reported and Cohen's d of the averaged temporal extent of the cluster was calculated.

We tested the nonaveraged prediction accuracies in the acoustic and lip fROIs (Fig. 2B) with a cluster-based permutation test with TFCE (dependent-sample t test, 10,000 randomizations). According to the cluster's direction, the maximum or minimum t value was reported, and Cohen's d of the cluster's averaged spatial extent was calculated. Additionally, averaged prediction accuracies in the acoustic and lip fROIs were statistically tested with a dependent-sample t test, and Cohen's d was reported as the effect size. In the audiovisual fROI, the prediction accuracies and unique contribution of lip movements (Fig. 2C) were tested with a dependent-sample t test, and Cohen's d was reported as the effect size. If the data were not normally distributed according to a Shapiro–Wilk test, the Wilcoxon signed-rank test was used, and the matched-pairs rank-biserial correlation rC was reported as the effect size. The distribution of the contribution of lip movements was assessed using the bimodality coefficient (Freeman and Dale, 2013).
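The bimodality coefficient (Sarle's BC, as described by Pfister et al., 2013) can be computed from sample skewness and excess kurtosis; a minimal illustrative sketch, not the authors' code:

```python
import numpy as np
from scipy.stats import kurtosis, skew

def bimodality_coefficient(a):
    """Bimodality coefficient BC; values > 0.555 (the BC of a uniform
    distribution) suggest bimodality (Pfister et al., 2013).

    BC = (g^2 + 1) / (k + 3(n-1)^2 / ((n-2)(n-3))), with bias-corrected
    sample skewness g and excess kurtosis k.
    """
    a = np.asarray(a, dtype=float)
    n = a.size
    g = skew(a, bias=False)                    # bias-corrected skewness
    k = kurtosis(a, fisher=True, bias=False)   # bias-corrected excess kurtosis
    return (g**2 + 1) / (k + 3 * (n - 1)**2 / ((n - 2) * (n - 3)))
```

A two-cluster sample yields a BC well above the 0.555 threshold, whereas a roughly normal sample stays below it.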

To investigate if neural tracking is predictive for behavior, we calculated Bayesian multilevel models in R version 4.2.2 (R Core Team, 2022) with the Stan-based package brms version 2.18.4 (Bürkner, 2017; Carpenter et al., 2017). Neural tracking (i.e., the averaged prediction accuracies within the respective fROI) was used to separately predict the three behavioral measures (averaged over the same number of trials for all participants). A random intercept was added for each participant to account for repeated measures (single speaker and multispeaker). The models were fitted independently for the acoustic and lip models (Fig. 3). According to the Wilkinson notation (Wilkinson and Rogers, 1973), the general formula was as follows:

behavioral measure ∼ 1 + neural tracking + (1 | participant).

Figure 3.

Relating behavior to neural speech tracking. Bayesian multilevel models were fitted to predict the behavioral measures with neural speech tracking. A, Higher acoustic neural speech tracking was linked to higher comprehension, lower difficulty ratings and higher engagement ratings. B, No evidence for an effect was observed for the neural tracking of lip movements. Both panels, The shaded areas show the 89% CIs of the respective model. The distributions on the right show the posterior draws of the three models. The black dots represent the mean standardized regression coefficient b of the corresponding model. The corresponding bars show the 89% CI. If zero was not part of the 89% CI, the effect was considered significant (*).

We wanted to test whether the unique contribution of lip movements to neural speech tracking (see above, Forward models) has any behavioral relevance. For this, we also used the behavioral data of the otherwise unanalyzed conditions with a face mask, which comprised the same number of trials for all participants (see above, Stimuli and experimental design). We fitted Bayesian multilevel models with the averaged unique contribution of lip movements to separately predict the behavioral measures when the speaker wore a face mask or not (Fig. 4). The general formula was as follows:

behavioral measure ∼ 1 + unique contribution of lip movements + (1 | participant).

Figure 4.

Relating the unique contribution of lip movements to behavior. The unique contribution of lip movements was used to predict the behavioral measures when the lips were occluded or not. A, When the unique contribution of lip movements was high, comprehension was lower and difficulty was rated higher. No evidence for an effect was observed for the engagement rating. The values of the fitted Bayesian multilevel models are shown with a depiction of the conditions in which the speakers wore a surgical face mask. B, The behavioral measures when the lips were not occluded were not linked to the unique contribution of lip movements. Both panels, The shaded areas show the 89% CIs of the respective model. The distributions on the right show the posterior draws of the three models. The black dots represent the mean standardized regression coefficient b of the corresponding model. The corresponding bars show the 89% CI. If zero was not part of the 89% CI, the effect was considered significant (*).

Before doing so, we fitted control models to show the effect of the conditions on the behavioral measures when the lips were occluded. Additional control models were also fitted to test the effect of the unique contribution of lip movements on the averaged behavioral data without a face mask. In all described models, a random intercept was included for each participant to account for repeated measures (single speaker and multispeaker).

Weakly or noninformative default priors of brms were used, whose influence on the results is negligible (Bürkner, 2017, 2018). For model calculation, all numerical variables were z-scored, and standardized regression coefficients (b) were reported with 89% credible intervals (CIs; i.e., Bayesian uncertainty intervals; McElreath, 2020). In addition, we report posterior probabilities (PPb > 0), with values closer to 100% providing evidence that the effect is greater than zero and values closer to 0% providing evidence that the effect is reversed (i.e., smaller than zero). If the 89% CI for an estimate did not include zero and PPb > 0 was below 5.5% or above 94.5%, the effect was considered statistically significant.
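These posterior summaries (89% equal-tailed CI, PPb > 0, and the significance criterion) are easy to illustrate on a vector of posterior draws. The actual models were fitted with brms in R; this is only a Python sketch with hypothetical names:

```python
import numpy as np

def summarize_posterior(draws, ci=0.89):
    """Summarize posterior draws of a standardized regression coefficient b.

    Returns the posterior mean, an equal-tailed credible interval, the
    posterior probability that b > 0 (in %), and whether the effect meets
    the criterion of zero outside the CI and PP below 5.5% or above 94.5%.
    """
    draws = np.asarray(draws, dtype=float)
    lo, hi = np.quantile(draws, [(1 - ci) / 2, 1 - (1 - ci) / 2])
    pp_gt_0 = 100.0 * np.mean(draws > 0)
    significant = (lo > 0 or hi < 0) and (pp_gt_0 < 5.5 or pp_gt_0 > 94.5)
    return {"b": float(draws.mean()), "ci": (float(lo), float(hi)),
            "pp_gt_0": float(pp_gt_0), "significant": bool(significant)}
```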

All models were fitted with a Student's t distribution, as indicated by graphical posterior predictive checks, Pareto k̂ diagnostics (Vehtari et al., 2022b), and leave-one-out cross-validation via loo version 2.5.1 (Vehtari et al., 2017, 2022a). Common algorithm-agnostic (Vehtari et al., 2021) and algorithm-specific diagnostics (Betancourt, 2018) showed that all Bayesian multilevel models converged: for all relevant parameters, the convergence diagnostic R̂ was <1.01 and effective sample sizes were >400, and there were no divergent transitions. Figures were created with ggplot2 version 3.4.0 (Wickham, 2016) and ggdist version 3.2.0 (Kay, 2022). Unstandardized b's were used for the fitted values of the models in Figures 3 and 4.

Data and code availability

Preprocessed data and code are publicly available at GitHub (https://github.com/reispat/av_speech_mask).

Results

Twenty-nine participants listened to audiobooks with a corresponding video of the speaker and a randomly occurring audio-only distractor. Source-localized MEG responses to acoustic features (spectrogram and acoustic onsets) and lip movements were predicted using forward models (TRFs). We compared the TRFs between the two conditions and evaluated neural tracking of the acoustic features and lip movements. The unique contribution of lip movements was obtained by controlling for acoustic features and was compared between conditions. Using Bayesian multilevel modeling, we predicted the behavioral measures with neural tracking. We also probed the behavioral relevance of the unique contribution of lip movements by predicting the behavioral measures when the lips were occluded with a surgical face mask or not.

Listening situations with multiple speakers are behaviorally more demanding

Participants performed worse in the multispeaker condition (M = 62.93%; SD = 17.34%), compared with the single-speaker condition (M = 73.52%; SD = 9.71%; W = 73.00; p = 0.003; rC = 0.64). In the multispeaker condition, subjective difficulty ratings were higher (M = 3.67; SD = 0.82) than in the single-speaker condition (M = 2.47; SD = 0.71; W = 11.50; p = 9.00 × 10−6; rC = −0.95). Engagement was rated higher in the single-speaker condition (M = 3.91; SD = 0.74) compared with the multispeaker condition (M = 3.72; SD = 0.85; W = 29.00; p = 0.024; rC = 0.62). Overall, behavioral data showed that in the multispeaker condition, participants performed worse, reported the task to be more difficult, and were less motivated (Fig. 1B).

Neural responses to lip movements are enhanced in a multispeaker setting

First, we analyzed the neural responses to acoustic and visual speech features by statistically comparing the corresponding TRFs between the single- and multispeaker conditions within their respective fROIs (Fig. 2A). The spectrogram TRFs showed a significant difference between conditions, with three clusters extending over early (30–110 ms; t = −5.26; p = 0.0001; d = −0.81), middle (160–290 ms; t = −3.78; p = 0.003; d = −1.00), and late (310–470 ms; t = −5.58; p = 0.0001; d = −1.02) time ranges. Grand-average TRF peaks were more pronounced in the single-speaker condition, with two peaks at 70 and 180 ms. While the first peak was also present in the multispeaker condition, the second peak appeared 50 ms earlier than in the single-speaker condition. The latter peak caused the largest differences in the magnitudes of the TRFs, which were most prominent in the right hemisphere of the fROI.

The TRFs to acoustic onsets showed a significant difference between single- and multispeaker speech, with three clusters extending over early (−20 to 80 ms; t = −5.39; p < 0.001; d = −1.10; Fig. 2A), mid (120–140 ms; t = −4.54; p = 0.004; d = −1.43), and mid-late (190–260 ms; t = −6.11; p < 0.001; d = −1.13) time windows. The TRFs showed two peaks at 70 and 190 ms in the single-speaker condition. Similar to the spectrogram TRFs, the first peak in the multispeaker condition occurred at the same time point as in the single-speaker condition, and the second peak occurred 50 ms earlier. The magnitude differences did not vary substantially across peaks and hemispheres.

TRFs to lip movements showed the opposite pattern to the TRFs to acoustic features, with stronger processing in the multispeaker condition. Significant condition differences in the TRFs to lip movements between single- and multispeaker speech were found, with four clusters extending over early (−20 to 70 ms; t = 4.41; p = 0.0005; d = 0.86; Fig. 2A), mid (140–270 ms; t = 3.97; p = 0.001; d = 0.88), mid-late (290–330 ms; t = 3.34; p = 0.01; d = 0.91), and late (420–460 ms; t = 3.90; p = 0.002; d = 0.90) time windows. The latencies of the peaks (160 and 290 ms) were generally later than those of the acoustic TRFs, in line with the longer time a visual stimulus needs to reach the visual system (Thorpe et al., 1996; VanRullen and Thorpe, 2001). In the single-speaker condition, the peaks were delayed by 10 ms compared with the multispeaker condition, and magnitude differences were most prominent in the first peak and the left hemisphere.

Our initial analysis showed that neural responses to acoustic features are stronger when speech is clear. In contrast, neural responses to lip movements were enhanced in a multispeaker environment. The stronger processing of lip movements suggests a greater reliance on the lips of a speaker when speech is harder to understand.

The cocktail party diametrically affects acoustic and visual neural speech tracking

So far, the TRF results indicate a stronger neural response to lip movements and a weaker one to acoustic features when there is more than one simultaneous speaker. We next asked whether neural tracking of audiovisual speech features differs between the single-speaker and multispeaker conditions in their respective fROIs (Fig. 2B). Acoustic neural tracking in the nonaveraged acoustic fROI showed a significant condition difference in the left (t = −8.04; p < 0.001; d = −1.47) and right (t = −9.26; p < 0.001; d = −1.30) hemispheres. Averaged acoustic neural tracking was higher in the single-speaker condition than in the multispeaker condition (t(28) = −8.07; p = 8.76 × 10−9; d = −1.30). Neural tracking of lip movements showed a significant condition difference in the left hemisphere (t = 3.83; p = 0.037; d = 0.51; Fig. 2B), with a focal superior parietal area involved. When averaging over sources, neural tracking was higher in the multispeaker condition than in the single-speaker condition (W = 114.00; p = 0.026; rC = 0.48).

Overall, the results showed that neural tracking was enhanced for acoustic features when speech is clear and higher for lip movements when there are multiple speakers. This is in line with the observed neural responses.

Lip movements enhance neural speech tracking more in multispeaker situations

So far, we have demonstrated that when there are two speakers, lip movements are processed more strongly and lead to higher neural tracking than with one speaker. However, their unique contribution to neural tracking is still unknown due to the intercorrelation of audiovisual speech features (Chandrasekaran et al., 2009; Daube et al., 2019). To address this, we controlled for the acoustic features to obtain the unique contribution of lip movements over and above acoustic speech features. First, the acoustic model was evaluated in the audiovisual fROI (Fig. 2C). Acoustic neural tracking was higher in the single-speaker condition than in the multispeaker condition (t(28) = −7.20; p = 7.68 × 10−8; d = 1.18). The acoustic model served as a baseline; its prediction accuracies were subtracted from those of a combined acoustic + lip model, and the difference was expressed as percentage change. The resulting unique contribution of lip movements was higher in the multispeaker condition than in the single-speaker condition (W = 24.00; p = 0.00003; rC = 0.89). The unique contribution of lip movements showed high interindividual variability and seemed to follow a bimodal distribution (Fig. 2C), which was confirmed by a bimodality coefficient of 0.68 (values >0.555 indicate bimodality; Pfister et al., 2013).

These results strongly indicate that lip movements enhance neural tracking, especially in multitalker speech. However, substantial interindividual variability was observed, with participants showing a unique contribution of lip movements of up to 45.37% in the multispeaker condition, while others showed only a small contribution or no contribution at all. In the next steps, we will probe the behavioral relevance of the unique contribution of lip movements to neural speech tracking by depriving individuals of this source of information.

Only acoustic neural speech tracking predicts behavior

Having established that listening situations with two speakers affect neural tracking of acoustic and visual speech features in a diametrical way, we were further interested if neural tracking is able to predict the behavioral measures. We calculated Bayesian multilevel models to predict the three behavioral measures (comprehension, difficulty, and engagement; Fig. 1B) with the averaged neural tracking of the acoustic and lip models (Fig. 3). In the acoustic model, higher neural tracking was linked to higher comprehension (b = 0.29; 89% CI = [0.07, 0.51]; PPb > 0 = 98.37%; Fig. 3A). Lower neural tracking predicted higher difficulty ratings (b = −0.50; 89% CI = [−0.72, −0.29]; PPb > 0 = 0.01%). When neural tracking was high, the engagement ratings were also higher (b = 0.12; 89% CI = [0.004, 0.24]; PPb > 0 = 95.05%).

Neural tracking of lip movements was not related to comprehension (b = 0.06; 89% CI = [−0.18, 0.28]; PPb > 0 = 65.61%; Fig. 3B). We also observed no evidence for an effect of the difficulty (b = −0.05; 89% CI = [−0.28, 0.18]; PPb > 0 = 35.63%) or engagement (b = 0.09; 89% CI = [−0.08, 0.26]; PPb > 0 = 80.40%) ratings.

These results indicate that acoustic neural speech tracking predicts behavior: The higher the neural speech tracking, the higher the comprehension and engagement ratings. Lower acoustic neural speech tracking was linked to higher difficulty ratings. In contrast, neural speech tracking of lip movements did not predict behavior.

Stronger unique contribution of lip movements predicts behavioral deterioration when lips are occluded

Given the finding that lip movements enhance neural speech tracking (Fig. 2C), we were interested in whether this unique contribution to neural speech tracking is behaviorally relevant. To do so, we also used the behavioral data from the otherwise unanalyzed conditions in which the mouth was occluded by a surgical face mask (see the center of Fig. 4A for example stimuli). Given that critical visual information is missing in these conditions, individuals who show a strong unique contribution of lip movements on a neural level should show poorer behavioral outcomes. An initial analysis showed that the effect of the conditions with a surgical face mask on behavior followed a pattern similar to that with nonoccluded lips (Fig. 1B): Comprehension was worse in the multispeaker condition (b = −0.77; 89% CI = [−1.13, −0.41]; PPb > 0 = 0.07%). Subjective difficulty ratings were also higher in the multispeaker condition (b = −0.77; 89% CI = [−1.13, −0.41]; PPb > 0 = 0.07%). However, there was no effect of the conditions with a surgical face mask on the engagement ratings (b = −0.77; 89% CI = [−1.13, −0.41]; PPb > 0 = 0.07%).

While the effects on a solely behavioral level seem not to differ substantially when the lips are occluded or not, predicting the behavioral measures with the unique contribution of lip movements showed the expected outcome (Fig. 4A): Participants that had a higher unique contribution of lip movements in terms of neural tracking showed a decline in comprehension (b = −0.27; 89% CI = [−0.49, −0.06]; PPb > 0 = 2.21%) and reported the task to be more difficult (b = 0.25; 89% CI = [0.01, 0.51]; PPb > 0 = 95.41%). The engagement ratings did not yield an effect (b = 0.05; 89% CI = [−0.07, 0.18]; PPb > 0 = 76.14%).

Interestingly, we were not able to establish a link between the unique contribution of lip movements to the behavioral data when the lips were not occluded (Fig. 4B). Comprehension (b = −0.05; 89% CI = [−0.28, 0.17]; PPb > 0 = 36.09%), difficulty (b = 0.04; 89% CI = [−0.19, 0.28]; PPb > 0 = 60.86%), and engagement (b = 0.06; 89% CI = [−0.08, 0.19]; PPb > 0 = 76.64%) were not linked to the unique contribution of lip movements.

Taken together, these findings support a behavioral relevance of the unique contribution of lip movements. Individuals who showed a higher unique contribution of lip movements on a neural level performed worse and reported the task to be more difficult when the mouth of the speaker was covered by a surgical face mask.

Discussion

The method of neural speech tracking is widely used to study the neural processing of continuous speech, though primarily with audio-only stimuli (Di Liberto et al., 2015; Keitel et al., 2018; Brodbeck et al., 2018a; Chalas et al., 2022). Recent studies have used audiovisual speech paradigms, but without directly modeling visual speech features and their temporal dynamics (Golumbic et al., 2013; Crosse et al., 2016b). In this study, we first show the temporal dynamics and cortical origins of TRFs obtained from lip movements in an audiovisual setting with one or two speakers. Similar to Brodbeck et al. (2018a), neural responses to acoustic features in the two-speaker paradigm were generally weaker. For both acoustic features, we observed that the second peak occurred 50 ms earlier when there were two speakers. Similar temporal differences in TRFs were also observed in normal-hearing individuals in a selective attention speech paradigm (Kaufman and Zion Golumbic, 2023), as well as in cochlear implant users, where attended speech showed enhanced earlier responses compared with ignored speech (Kraus et al., 2021). The TRFs to lip movements showed an opposite pattern, with an enhanced magnitude in the multispeaker condition (Fig. 2A) and with substantially later peaks compared with the TRFs to acoustic features. This is in line with Bourguignon et al. (2020), who showed initial TRF peaks at 115 and 159 ms from two significant sources, overlapping with our involved parietal and occipital sources (Fig. 1C). However, the TRFs in their work were modeled to lip movements from silent videos, which precludes a comparison between different listening situations. The finding that the peaks of TRFs occur later for lip movements than for auditory features seems counterintuitive, as the visual stimulus usually precedes the auditory stimulus (van Wassenhove et al., 2005). One possible reason could be that visual stimuli require more time to reach the visual system than auditory stimuli (Thorpe et al., 1996; VanRullen and Thorpe, 2001), thus leading to a later neural response when modeled using TRFs. For TRFs to lip movements, we also observed a stronger contribution of left parietal and occipital regions, especially when contrasting the first peak. A possible explanation for this lateralization could lie in asymmetries in the processing of lip movements, for which previous studies showed a left-hemispheric advantage (Campbell et al., 1996; Nicholls and Searle, 2006).

Our findings also strengthen the argument that TRFs to visual speech are quantitatively different from TRFs to acoustic speech (for an analysis based on coherence, see Park et al., 2016). In this study, however, we were not able to completely rule out the contribution of auditory speech to modeled TRFs to lip movements, since an audiovisual paradigm was used. In future studies, a visual-only condition should also be incorporated to further compare the differences between TRFs derived from lip movements in an audiovisual or visual-only condition.

Based on the source-localized neural tracking, we determined fROIs via a data-driven approach—separately for the acoustic features and lip movements (Fig. 1C). The fROIs for the acoustic speech features involved sources along temporal, parietal, and posterior frontal regions, covering regions that are related to speech perception (Franken et al., 2022). Previous studies source-localized TRFs in audio-only settings, though commonly restricting the analysis to temporal regions (Brodbeck et al., 2018a; Kulasingham et al., 2020). The fROIs for the lip movements involved parietal and occipital regions, in line with previous studies that source-localized the neural tracking of lip movements (Hauswald et al., 2018; Bourguignon et al., 2020; Aller et al., 2022). We also observed neural tracking of lip movements in temporal regions (similar to Park et al., 2016) but with less involvement of the primary visual cortex and prominent only in the single-speaker condition. Due to our approach of defining our fROIs based on the multispeaker condition, we minimized the involvement of auditory regions in the lip fROIs.

When analyzing neural speech tracking in the acoustic fROIs, we found a large effect, with enhanced tracking in the single-speaker condition compared with the multispeaker condition (Fig. 2B). Golumbic et al. (2013) showed a large, though not statistically tested, difference in neural tracking between single- and multispeaker speech using phase consistency as a neural tracking measure. We were not able to identify further studies that presented such a statistical contrast, which could be due to the general focus on neural tracking of attended versus unattended speech, especially to decode auditory attention (Mirkovic et al., 2015; O’Sullivan et al., 2015; Schäfer et al., 2018; Ciccarelli et al., 2019; Geirnaert et al., 2021). On a group level, the neural tracking of lip movements showed an enhancement in the multispeaker condition (Fig. 2B). When comparing the involved sources of the corresponding lip fROI, we found a medium effect in the left superior parietal cortex. This is in line with Park et al. (2016), who showed an effect in the left occipital and parietal cortex when comparing two conditions similar to our design (“AV congruent vs All congruent”), although after partializing out auditory-related coherence. A possible explanation for the strong focality of our left superior parietal effect could be the TFCE method used, whose small default step size of 0.1 leads to a downweighting of spatially larger clusters (Smith and Nichols, 2009). When we averaged the neural tracking of lip movements, we observed interesting patterns: some participants showed no meaningful neural tracking (i.e., close to zero or negative correlations) when there was one speaker, but when speech became challenging, their neural tracking reached positive values. Notably, this pattern was reversed for some participants, suggesting that not all of them used the lip movements in the same manner. To investigate this further, eye tracking should be used to identify which face regions participants fixate when attending audiovisual speech (Rennig and Beauchamp, 2018) or to additionally incorporate a recently proposed phenomenon termed “ocular speech tracking” (Gehmacher et al., 2024). In this study, we cannot rule out that during challenging speech, participants fixated the mouth area more strongly, thus contributing to enhanced neural tracking. However, previous eye tracking research has shown that individuals gaze at talking (Gurler et al., 2015) and also nontalking faces (Peterson and Eckstein, 2013; Mehoudar et al., 2014) in a highly individual manner, which is putatively reflected in our findings of high interindividual variability.

We first compared the neural tracking of audiovisual speech between single-speaker and multispeaker conditions in an isolated manner. Due to the aforementioned intercorrelation of audiovisual speech features (Chandrasekaran et al., 2009; Daube et al., 2019), this approach could not rule out any acoustic contributions to the neural tracking of lip movements or vice versa. To reveal the unique contribution of lip movements and to incorporate regions that are part of models of audiovisual speech perception (Bernstein and Liebenthal, 2014) and multisensory integration (Peelle and Sommers, 2015), we combined both fROIs and controlled for acoustic speech features. Within the TRF framework, we show that lip movements enhance acoustic-controlled neural speech tracking (Fig. 2C). A general enhancement was observed for both single- and multispeaker speech, which is in line with behavioral findings that visual speech features enhance intelligibility under clear speech conditions as well (Stacey et al., 2016; Blackburn et al., 2019). When comparing the two conditions, we observed a large effect, showing a higher unique contribution of lip movements in the multispeaker condition. Analogous to behavioral findings in Aller et al. (2022), the unique contribution of lip movements showed high interindividual variability (Fig. 2C) and also followed a bimodal distribution: Some individuals showed a strong unique contribution of lip movements, while others showed only a small unique contribution or none at all. Interestingly, one individual even showed a negative influence when adding lip movements to the acoustic model when there was only one speaker. As soon as speech became challenging, that individual showed a contribution of lip information. 
Previous research on audiovisual speech processing showed that interindividual differences are related to visual attention (Tiippana et al., 2004), availability of attentional resources (Alsius et al., 2005), or individual preference for auditory or visual stimuli (Schwartz, 2010), which are potential factors that could explain our observed differences. Overall, our findings are in line with the beneficial effects of visual speech when listening is challenging (Sumby and Pollack, 1954; Grant and Seitz, 2000; Ross et al., 2007; Remez, 2012).

Using Bayesian multilevel modeling, we established a link between neural speech tracking and behavior. We show that higher acoustic neural tracking is related to higher comprehension (Fig. 3A), a finding also reported in a study that used vocoded speech (Chen et al., 2023). We also show that higher acoustic neural tracking is related to lower difficulty ratings. This is in line with a study that showed a positive relationship between speech intelligibility ratings and acoustic neural tracking, though using speech-in-noise (Ding and Simon, 2013). Higher engagement ratings were associated with higher acoustic neural tracking, in contrast to Schubert et al. (2023), who found no relationship between the two measures. Our findings suggest that enhanced neural speech tracking of acoustic features is related to lower listening effort, to which cognitive demand and motivation are the key contributors (Peelle, 2018). Interpreting the relationship with comprehension performance and lower difficulty ratings as cognitive demand and the relationship with engagement as motivation, our results putatively reflect a neural proxy of listening effort.

We were not able to establish any link between the neural tracking of lip movements and the behavioral measures. It is important to note here that the analyzed neural tracking of lip movements was not yet controlled for speech acoustics (Gillis et al., 2022), which could confound any relationship with behavior. A recent MEG study impressively showed that the neural tracking of acoustic speech features can explain cortical responses to higher-order linguistic features, such as phoneme onsets (Daube et al., 2019). It is important to note that this caveat also applies to the observed relationship between acoustic neural tracking and behavior, and it cannot be ruled out that this relationship is driven by these higher-order features. Further audiovisual speech studies, in which linguistic features are also modeled, are necessary.

The COVID-19 pandemic established the use of face masks on a global scale (Feng et al., 2020). However, it has been demonstrated that covering the mouth has adverse effects on behavioral measures, such as speech perception (Rahne et al., 2021). On a neural level, Haider et al. (2022) showed that surgical face masks impair the neural tracking of acoustic and higher-order segmentational speech features. In a follow-up study, Haider et al. (2024) incorporated lip movements in the analysis and showed that face masks primarily impact speech processing by blocking visual speech rather than by acoustic degradation. Here, we establish a relationship between behavioral measures and the unique contribution of visual speech to neural tracking, which had not yet been shown. When the speaker wore a surgical face mask, individuals who showed a higher unique contribution of lip movements displayed lower comprehension and higher difficulty ratings. Strikingly, no effect was found when the speaker did not wear a surgical face mask. Further studies with larger sample sizes are needed to disentangle the potential influence of experimental conditions on this relationship, e.g., using Bayesian mediation analysis (Yuan and MacKinnon, 2009; Nuijten et al., 2015). Overall, our results suggest that individuals who use lip movements more effectively show behavioral deterioration when visual speech is absent.

The current study provides evidence for substantial interindividual variability in the unique contribution of lip movements to neural speech tracking and for its relationship to behavior. First, we showed that neural responses to lip movements are more pronounced when speech is challenging than when it is clear. Second, lip movements enhance neural speech tracking in brain regions related to audiovisual speech, with high interindividual variability. Finally, we demonstrated that the unique contribution of lip movements is behaviorally relevant: individuals with a higher unique contribution of lip movements showed lower comprehension and rated the task as more difficult when the speaker wore a surgical face mask, whereas this relationship was completely absent when the speaker did not wear one. Our results provide insights into individual differences in the neural tracking of lip movements and offer potential implications for future clinical and audiological settings, where audiovisual speech perception could be assessed objectively, for example in populations where traditional task-based assessments cannot be meaningfully conducted.

Footnotes

  • K.S. is an employee of MED-EL GmbH. All other authors declare no competing financial interests.

  • This research was funded in whole or in part by the Austrian Science Fund (FWF; 10.55776/W1233). For open access purposes, we have applied a CC BY public copyright license to any author-accepted manuscript version arising from this submission. P.R. is supported by the FWF (Doctoral College “Imaging the Mind”; W 1233-B), as are N.S. (“Audiovisual speech entrainment in deafness”; P31230) and C.L.H. (“Impact of face masks on speech comprehension”; P34237). P.R. is also supported by the Austrian Research Promotion Agency (FFG; BRIDGE 1 project “SmartCIs”; 871232). M.G. is supported by a Strategic Basic Research Grant from the Research Foundation Flanders (FWO; Grant No. 1SA0620N). J.V. is supported by a postdoctoral grant from the FWO (Grant No. 1290821). We thank Juliane Schubert for the stimulus recordings and Sarah Danböck for her helpful methodological input.

  • †T.F. and N.W. shared last authorship.

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license, which permits unrestricted use, distribution and reproduction in any medium provided that the original work is properly attributed.

References

  1. Akalin-Acar Z, Gençer NG (2004) An advanced boundary element method (BEM) implementation for the forward problem of electromagnetic source imaging. Phys Med Biol 49:5011. https://doi.org/10.1088/0031-9155/49/21/012
  2. Aller M, Økland HS, MacGregor LJ, Blank H, Davis MH (2022) Differential auditory and visual phase-locking are observed during audio-visual benefit and silent lip-reading for speech perception. J Neurosci 42:6108–6120. https://doi.org/10.1523/JNEUROSCI.2476-21.2022
  3. Alsius A, Navarra J, Campbell R, Soto-Faraco S (2005) Audiovisual integration of speech falters under high attention demands. Curr Biol 15:839–843. https://doi.org/10.1016/j.cub.2005.03.046
  4. Bell AJ, Sejnowski TJ (1995) An information-maximization approach to blind separation and blind deconvolution. Neural Comput 7:1129–1159. https://doi.org/10.1162/neco.1995.7.6.1129
  5. Bernstein LE, Liebenthal E (2014) Neural pathways for visual speech perception. Front Neurosci 8:1–18. https://doi.org/10.3389/fnins.2014.00386
  6. Besl PJ, McKay ND (1992) A method for registration of 3-D shapes. IEEE Trans Pattern Anal Mach Intell 14:239–256. https://doi.org/10.1109/34.121791
  7. Betancourt M (2018) A conceptual introduction to Hamiltonian Monte Carlo. arXiv:1701.02434.
  8. Biesmans W, Das N, Francart T, Bertrand A (2017) Auditory-inspired speech envelope extraction methods for improved EEG-based auditory attention detection in a cocktail party scenario. IEEE Trans Neural Syst Rehabil Eng 25:402–412. https://doi.org/10.1109/TNSRE.2016.2571900
  9. Blackburn CL, Kitterick PT, Jones G, Sumner CJ, Stacey PC (2019) Visual speech benefit in clear and degraded speech depends on the auditory intelligibility of the talker and the number of background talkers. Trends Hear 23:1–14. https://doi.org/10.1177/2331216519837866
  10. Boersma P (2001) Praat, a system for doing phonetics by computer. Glot Int 5:341–345.
  11. Bourguignon M, Baart M, Kapnoula EC, Molinaro N (2020) Lip-reading enables the brain to synthesize auditory features of unknown silent speech. J Neurosci 40:1053–1065. https://doi.org/10.1523/JNEUROSCI.1101-19.2019
  12. Brainard DH (1997) The psychophysics toolbox. Spat Vis 10:433–436. https://doi.org/10.1163/156856897X00357
  13. Brodbeck C, Das P, Gillis M, Kulasingham JP, Bhattasali S, Gaston P, Resnik P, Simon JZ (2023) Eelbrain, a Python toolkit for time-continuous analysis with temporal response functions. Elife 12:e85012. https://doi.org/10.7554/eLife.85012
  14. Brodbeck C, Hong LE, Simon JZ (2018a) Rapid transformation from auditory to linguistic representations of continuous speech. Curr Biol 28:3976–3983.e5. https://doi.org/10.1016/j.cub.2018.10.042
  15. Brodbeck C, Jiao A, Hong LE, Simon JZ (2020) Neural speech restoration at the cocktail party: auditory cortex recovers masked speech of both attended and ignored speakers. PLoS Biol 18:e3000883. https://doi.org/10.1371/journal.pbio.3000883
  16. Brodbeck C, Presacco A, Simon JZ (2018b) Neural source dynamics of brain responses to continuous stimuli: speech processing from acoustics to comprehension. Neuroimage 172:162–174. https://doi.org/10.1016/j.neuroimage.2018.01.042
  17. Brodbeck C, Simon JZ (2020) Continuous speech processing. Curr Opin Physiol 18:25–31. https://doi.org/10.1016/j.cophys.2020.07.014
  18. Broderick MP, Anderson AJ, Di Liberto GM, Crosse MJ, Lalor EC (2018) Electrophysiological correlates of semantic dissimilarity reflect the comprehension of natural, narrative speech. Curr Biol 28:803–809.e3. https://doi.org/10.1016/j.cub.2018.01.080
  19. Brown VA, Van Engen KJ, Peelle JE (2021) Face mask type affects audiovisual speech intelligibility and subjective listening effort in young and older adults. Cogn Res Princ Implic 6:49. https://doi.org/10.1186/s41235-021-00314-0
  20. Bürkner P-C (2017) brms: an R package for Bayesian multilevel models using Stan. J Stat Softw 80:1–28. https://doi.org/10.18637/jss.v080.i01
  21. Bürkner P-C (2018) Advanced Bayesian multilevel modeling with the R package brms. R J 10:395–411. https://doi.org/10.32614/RJ-2018-017
  22. Campbell R, De Gelder B, De Haan E (1996) The lateralization of lip-reading: a second look. Neuropsychologia 34:1235–1240. https://doi.org/10.1016/0028-3932(96)00046-2
  23. Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, Brubaker M, Guo J, Li P, Riddell A (2017) Stan: a probabilistic programming language. J Stat Softw 76:1–32. https://doi.org/10.18637/jss.v076.i01
  24. Chalas N, Daube C, Kluger DS, Abbasi O, Nitsch R, Gross J (2022) Multivariate analysis of speech envelope tracking reveals coupling beyond auditory cortex. Neuroimage 258:119395. https://doi.org/10.1016/j.neuroimage.2022.119395
  25. Chandrasekaran C, Trubanova A, Stillittano S, Caplier A, Ghazanfar AA (2009) The natural statistics of audiovisual speech. PLoS Comput Biol 5:e1000436. https://doi.org/10.1371/journal.pcbi.1000436
  26. Chen Y-P, Schmidt F, Keitel A, Rösch S, Hauswald A, Weisz N (2023) Speech intelligibility changes the temporal evolution of neural speech tracking. Neuroimage 268:119894. https://doi.org/10.1016/j.neuroimage.2023.119894
  27. Cherry EC (1953) Some experiments on the recognition of speech, with one and with two ears. J Acoust Soc Am 25:975–979. https://doi.org/10.1121/1.1907229
  28. Chu DK, et al. (2020) Physical distancing, face masks, and eye protection to prevent person-to-person transmission of SARS-CoV-2 and COVID-19: a systematic review and meta-analysis. Lancet 395:1973–1987. https://doi.org/10.1016/S0140-6736(20)31142-9
  29. Ciccarelli G, Nolan M, Perricone J, Calamia PT, Haro S, O’Sullivan J, Mesgarani N, Quatieri TF, Smalt CJ (2019) Comparison of two-talker attention decoding from EEG with nonlinear neural networks and linear methods. Sci Rep 9:11538. https://doi.org/10.1038/s41598-019-47795-0
  30. Corey DM, Dunlap WP, Burke MJ (1998) Averaging correlations: expected values and bias in combined Pearson rs and Fisher’s z transformations. J Gen Psychol 125:245–261. https://doi.org/10.1080/00221309809595548
  31. Crosse MJ, Butler JS, Lalor EC (2015) Congruent visual speech enhances cortical entrainment to continuous auditory speech in noise-free conditions. J Neurosci 35:14195–14204. https://doi.org/10.1523/JNEUROSCI.1829-15.2015
  32. Crosse MJ, Di Liberto GM, Bednar A, Lalor EC (2016a) The multivariate temporal response function (mTRF) toolbox: a Matlab toolbox for relating neural signals to continuous stimuli. Front Hum Neurosci 10:1–14. https://doi.org/10.3389/fnhum.2016.00604
  33. Crosse MJ, Di Liberto GM, Lalor EC (2016b) Eye can hear clearly now: inverse effectiveness in natural audiovisual speech processing relies on long-term crossmodal temporal integration. J Neurosci 36:9888–9895. https://doi.org/10.1523/JNEUROSCI.1396-16.2016
  34. Crosse MJ, Zuk NJ, Di Liberto GM, Nidiffer AR, Molholm S, Lalor EC (2021) Linear modeling of neurophysiological responses to speech and other continuous stimuli: methodological considerations for applied research. Front Neurosci 15:1350. https://doi.org/10.3389/fnins.2021.705621
  35. Das P, Brodbeck C, Simon JZ, Babadi B (2020) Neuro-current response functions: a unified approach to MEG source analysis under the continuous stimuli paradigm. Neuroimage 211:116528. https://doi.org/10.1016/j.neuroimage.2020.116528
  36. Daube C, Ince RAA, Gross J (2019) Simple acoustic features can explain phoneme-based predictions of cortical responses to speech. Curr Biol 29:1924–1937.e9. https://doi.org/10.1016/j.cub.2019.04.067
  37. David SV, Mesgarani N, Shamma SA (2007) Estimating sparse spectro-temporal receptive fields with natural stimuli. Network 18:191–212. https://doi.org/10.1080/09548980701609235
  38. de Jong NH, Wempe T (2009) Praat script to detect syllable nuclei and measure speech rate automatically. Behav Res Methods 41:385–390. https://doi.org/10.3758/BRM.41.2.385
  39. Delorme A, Makeig S (2004) EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. J Neurosci Methods 134:9–21. https://doi.org/10.1016/j.jneumeth.2003.10.009
  40. Desikan RS, et al. (2006) An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. Neuroimage 31:968–980. https://doi.org/10.1016/j.neuroimage.2006.01.021
  41. Di Liberto GM, O’Sullivan JA, Lalor EC (2015) Low-frequency cortical entrainment to speech reflects phoneme-level processing. Curr Biol 25:2457–2465. https://doi.org/10.1016/j.cub.2015.08.030
  42. Ding N, Simon JZ (2013) Adaptive temporal encoding leads to a background-insensitive cortical representation of speech. J Neurosci 33:5728–5735. https://doi.org/10.1523/JNEUROSCI.5297-12.2013
  43. Erber NP (1975) Auditory-visual perception of speech. J Speech Hear Disord 40:481–492. https://doi.org/10.1044/jshd.4004.481
  44. Feng S, Shen C, Xia N, Song W, Fan M, Cowling BJ (2020) Rational use of face masks in the COVID-19 pandemic. Lancet Respir Med 8:434–436. https://doi.org/10.1016/S2213-2600(20)30134-X
  45. Fischl B (2012) Freesurfer. Neuroimage 62:774–781. https://doi.org/10.1016/j.neuroimage.2012.01.021
  46. Fishbach A, Nelken I, Yeshurun Y (2001) Auditory edge detection: a neural model for physiological and psychoacoustical responses to amplitude transients. J Neurophysiol 85:2303–2323. https://doi.org/10.1152/jn.2001.85.6.2303
  47. Franken MK, Liu BC, Ostry DJ (2022) Towards a somatosensory theory of speech perception. J Neurophysiol 128:1683–1695. https://doi.org/10.1152/jn.00381.2022
  48. Freeman JB, Dale R (2013) Assessing bimodality to detect the presence of a dual cognitive process. Behav Res Methods 45:83–97. https://doi.org/10.3758/s13428-012-0225-x
  49. Gehmacher Q, Schubert J, Schmidt F, Hartmann T, Reisinger P, Rösch S, Schwarz K, Popov T, Chait M, Weisz N (2024) Eye movements track prioritized auditory features in selective attention to natural speech. Nat Commun 15:3692. https://doi.org/10.1038/s41467-024-48126-2
  50. Geirnaert S, Vandecappelle S, Alickovic E, de Cheveigne A, Lalor E, Meyer BT, Miran S, Francart T, Bertrand A (2021) Electroencephalography-based auditory attention decoding: toward neurosteered hearing devices. IEEE Signal Process Mag 38:89–102. https://doi.org/10.1109/MSP.2021.3075932
  51. Gillis M, Van Canneyt J, Francart T, Vanthornhout J (2022) Neural tracking as a diagnostic tool to assess the auditory pathway. Hear Res 426:108607. https://doi.org/10.1016/j.heares.2022.108607
  52. Gillis M, Vanthornhout J, Simon JZ, Francart T, Brodbeck C (2021) Neural markers of speech comprehension: measuring EEG tracking of linguistic speech representations, controlling the speech acoustics. J Neurosci 41:10316–10329. https://doi.org/10.1523/JNEUROSCI.0812-21.2021
  53. Golumbic EZ, Cogan GB, Schroeder CE, Poeppel D (2013) Visual input enhances selective speech envelope tracking in auditory cortex at a “cocktail party”. J Neurosci 33:1417–1426. https://doi.org/10.1523/JNEUROSCI.3675-12.2013
  54. Gramfort A, et al. (2013) MEG and EEG data analysis with MNE-Python. Front Neurosci 7:1–13. https://doi.org/10.3389/fnins.2013.00267
  55. Gramfort A, Luessi M, Larson E, Engemann DA, Strohmeier D, Brodbeck C, Parkkonen L, Hämäläinen MS (2014) MNE software for processing MEG and EEG data. Neuroimage 86:446–460. https://doi.org/10.1016/j.neuroimage.2013.10.027
  56. Grant KW, Seitz P-F (2000) The use of visible speech cues for improving auditory detection of spoken sentences. J Acoust Soc Am 108:1197–1208. https://doi.org/10.1121/1.1288668
  57. Gurler D, Doyle N, Walker E, Magnotti J, Beauchamp M (2015) A link between individual differences in multisensory speech perception and eye movements. Atten Percept Psychophys 77:1333–1341. https://doi.org/10.3758/s13414-014-0821-1
  58. Haider CL, Park H, Hauswald A, Weisz N (2024) Neural speech tracking highlights the importance of visual speech in multi-speaker situations. J Cogn Neurosci 36:128–142. https://doi.org/10.1162/jocn_a_02059
  59. Haider CL, Suess N, Hauswald A, Park H, Weisz N (2022) Masking of the mouth area impairs reconstruction of acoustic speech features and higher-level segmentational features in the presence of a distractor speaker. Neuroimage 252:119044. https://doi.org/10.1016/j.neuroimage.2022.119044
  60. Hämäläinen MS, Ilmoniemi RJ (1994) Interpreting magnetic fields of the brain: minimum norm estimates. Med Biol Eng Comput 32:35–42. https://doi.org/10.1007/BF02512476
  61. Hartmann T, Weisz N (2020) An introduction to the objective psychophysics toolbox (o_ptb). Front Psychol 11:1–10. https://doi.org/10.3389/fpsyg.2020.585437
  62. Hauswald A, Lithari C, Collignon O, Leonardelli E, Weisz N (2018) A visual cortical network for deriving phonological information from intelligible lip movements. Curr Biol 28:1453–1459.e3. https://doi.org/10.1016/j.cub.2018.03.044
  63. Heeris J (2013) Gammatone filterbank toolkit [Computer software]. Available at: https://github.com/detly/gammatone
  64. Houck JM, Claus ED (2020) A comparison of automated and manual co-registration for magnetoencephalography. PLOS One 15:e0232100. https://doi.org/10.1371/journal.pone.0232100
  65. Kaufman M, Zion Golumbic E (2023) Listening to two speakers: capacity and tradeoffs in neural speech tracking during selective and distributed attention. Neuroimage 270:119984. https://doi.org/10.1016/j.neuroimage.2023.119984
  66. Kay M (2022) ggdist: visualizations of distributions and uncertainty [Computer software]. Zenodo.
  67. Keitel A, Gross J, Kayser C (2018) Perceptually relevant speech tracking in auditory and motor cortex reflects distinct linguistic features. PLoS Biol 16:e2004473. https://doi.org/10.1371/journal.pbio.2004473
  68. King BM, Rosopa PJ, Minium EW (2018) Statistical reasoning in the behavioral sciences, Ed 7. Hoboken, NJ: John Wiley & Sons.
  69. Kleiner M, Brainard D, Pelli D (2007) What’s new in psychtoolbox-3? Perception 36:1–16. https://doi.org/10.1177/03010066070360S101
  70. Kraus F, Tune S, Ruhe A, Obleser J, Wöstmann M (2021) Unilateral acoustic degradation delays attentional separation of competing speech. Trends Hear 25:23312165211013242. https://doi.org/10.1177/23312165211013242
  71. Kulasingham JP, Brodbeck C, Presacco A, Kuchinsky SE, Anderson S, Simon JZ (2020) High gamma cortical processing of continuous speech in younger and older listeners. Neuroimage 222:117291. https://doi.org/10.1016/j.neuroimage.2020.117291
  72. Lalor EC, Pearlmutter BA, Reilly RB, McDarby G, Foxe JJ (2006) The VESPA: a method for the rapid estimation of a visual evoked potential. Neuroimage 32:1549–1561. https://doi.org/10.1016/j.neuroimage.2006.05.054
  73. Lalor EC, Power AJ, Reilly RB, Foxe JJ (2009) Resolving precise temporal processing properties of the auditory system using continuous stimuli. J Neurophysiol 102:349–359. https://doi.org/10.1152/jn.90896.2008
  74. Lin F-H, Witzel T, Ahlfors SP, Stufflebeam SM, Belliveau JW, Hämäläinen MS (2006) Assessing and improving the spatial accuracy in MEG source localization by depth-weighted minimum-norm estimates. Neuroimage 31:160–171. https://doi.org/10.1016/j.neuroimage.2005.11.054
  75. Maris E, Oostenveld R (2007) Nonparametric statistical testing of EEG- and MEG-data. J Neurosci Methods 164:177–190. https://doi.org/10.1016/j.jneumeth.2007.03.024
  76. McElreath R (2020) Statistical rethinking: a Bayesian course with examples in R and STAN, Ed 2. Boca Raton, FL: Chapman and Hall/CRC.
  77. Mehoudar E, Arizpe J, Baker CI, Yovel G (2014) Faces in the eye of the beholder: unique and stable eye scanning patterns of individual observers. J Vis 14:6. https://doi.org/10.1167/14.7.6
  78. Mirkovic B, Debener S, Jaeger M, Vos MD (2015) Decoding the attended speech stream with multi-channel EEG: implications for online, daily-life applications. J Neural Eng 12:046007. https://doi.org/10.1088/1741-2560/12/4/046007
  79. Nicholls MER, Searle DA (2006) Asymmetries for the visual expression and perception of speech. Brain Lang 97:322–331. https://doi.org/10.1016/j.bandl.2005.11.007
  80. Nieto-Castanon A, Ghosh SS, Tourville JA, Guenther FH (2003) Region of interest based analysis of functional imaging data. Neuroimage 19:1303–1316. https://doi.org/10.1016/S1053-8119(03)00188-5
  81. Nuijten MB, Wetzels R, Matzke D, Dolan CV, Wagenmakers E-J (2015) A default Bayesian hypothesis test for mediation. Behav Res Methods 47:85–97. https://doi.org/10.3758/s13428-014-0470-2
  82. Obleser J, Kayser C (2019) Neural entrainment and attentional selection in the listening brain. Trends Cogn Sci 23:913–926. https://doi.org/10.1016/j.tics.2019.08.004
  83. Oostenveld R, Fries P, Maris E, Schoffelen J-M (2011) Fieldtrip: open source software for advanced analysis of MEG, EEG, and invasive electrophysiological data. Comput Intell Neurosci 2011:1–9. https://doi.org/10.1155/2011/156869
  84. O’Sullivan JA, Power AJ, Mesgarani N, Rajaram S, Foxe JJ, Shinn-Cunningham BG, Slaney M, Shamma SA, Lalor EC (2015) Attentional selection in a cocktail party environment can be decoded from single-trial EEG. Cereb Cortex 25:1697–1706. https://doi.org/10.1093/cercor/bht355
  85. Park H, Kayser C, Thut G, Gross J (2016) Lip movements entrain the observers’ low-frequency brain oscillations to facilitate speech intelligibility. Elife 5:e14521. https://doi.org/10.7554/eLife.14521
  86. Peelle JE (2018) Listening effort: how the cognitive consequences of acoustic challenge are reflected in brain and behavior. Ear Hear 39:204–214. https://doi.org/10.1097/AUD.0000000000000494
  87. Peelle JE, Sommers MS (2015) Prediction and constraint in audiovisual speech perception. Cortex 68:169–181. https://doi.org/10.1016/j.cortex.2015.03.006
  88. Pelli DG (1997) The VideoToolbox software for visual psychophysics: transforming numbers into movies. Spat Vis 10:437–442. https://doi.org/10.1163/156856897X00366
  89. Peterson MF, Eckstein MP (2013) Individual differences in eye movements during face identification reflect observer-specific optimal points of fixation. Psychol Sci 24:1216–1225. https://doi.org/10.1177/0956797612471684
  90. Pfister R, Schwarz K, Janczyk M, Dale R, Freeman J (2013) Good things peak in pairs: a note on the bimodality coefficient. Front Psychol 4:1–4. https://doi.org/10.3389/fpsyg.2013.00700
  91. Rahne T, Fröhlich L, Plontke S, Wagner L (2021) Influence of surgical and N95 face masks on speech perception and listening effort in noise. PLOS One 16:e0253874. https://doi.org/10.1371/journal.pone.0253874
  92. R Core Team (2022) R: a language and environment for statistical computing [Computer software]. R Foundation for Statistical Computing. Available at: https://www.R-project.org/
  93. Remez RE (2012) Three puzzles of multimodal speech perception. In: Audiovisual speech processing (Vatikiotis-Bateson E, Bailly G, Perrier P, eds), pp 4–20. Cambridge: Cambridge University Press.
  94. Rennig J, Beauchamp MS (2018) Free viewing of talking faces reveals mouth and eye preferring regions of the human superior temporal sulcus. Neuroimage 183:25–36. https://doi.org/10.1016/j.neuroimage.2018.08.008
  95. Ross LA, Molholm S, Butler JS, Bene VAD, Foxe JJ (2022) Neural correlates of multisensory enhancement in audiovisual narrative speech perception: a fMRI investigation. Neuroimage 263:119598. https://doi.org/10.1016/j.neuroimage.2022.119598
  96. Ross LA, Saint-Amour D, Leavitt VM, Javitt DC, Foxe JJ (2007) Do you see what I am saying? Exploring visual enhancement of speech comprehension in noisy environments. Cereb Cortex 17:1147–1153. https://doi.org/10.1093/cercor/bhl024
  97. Schäfer PJ, Corona-Strauss FI, Hannemann R, Hillyard SA, Strauss DJ (2018) Testing the limits of the stimulus reconstruction approach: auditory attention decoding in a four-speaker free field environment. Trends Hear 22:1–12. https://doi.org/10.1177/2331216518816600
  98. Schubert J, Schmidt F, Gehmacher Q, Bresgen A, Weisz N (2023) Cortical speech tracking is related to individual prediction tendencies. Cereb Cortex 33:6608–6619. https://doi.org/10.1093/cercor/bhac528
  99. Schwartz JL (2010) A reanalysis of McGurk data suggests that audiovisual fusion in speech perception is subject-dependent. J Acoust Soc Am 127:1584–1594. https://doi.org/10.1121/1.3293001
  100. Slaney M (1998) Auditory toolbox. Interval Research Corporation, 10(1998), 1194.
  101. Smith SM, Nichols TE (2009) Threshold-free cluster enhancement: addressing problems of smoothing, threshold dependence and localisation in cluster inference. Neuroimage 44:83–98. https://doi.org/10.1016/j.neuroimage.2008.03.061
  102. Stacey PC, Kitterick PT, Morris SD, Sumner CJ (2016) The contribution of visual information to the perception of speech in noise with and without informative temporal fine structure. Hear Res 336:17–28. https://doi.org/10.1016/j.heares.2016.04.002
  103. Suess N, Hauswald A, Reisinger P, Rösch S, Keitel A, Weisz N (2022a) Cortical tracking of formant modulations derived from silently presented lip movements and its decline with age. Cereb Cortex 32:4818–4833. https://doi.org/10.1093/cercor/bhab518
  104. Suess N, Hauswald A, Zehentner V, Depireux J, Herzog G, Rösch S, Weisz N (2022b) Influence of linguistic properties and hearing impairment on visual speech perception skills in the German language. PLOS One 17:e0275585. https://doi.org/10.1371/journal.pone.0275585
  105. Sumby WH, Pollack I (1954) Visual contribution to speech intelligibility in noise. J Acoust Soc Am 26:212–215. https://doi.org/10.1121/1.1907309
  106. Summerfield Q, Bruce V, Cowey A, Ellis AW, Perrett DI (1992) Lipreading and audio-visual speech perception. Philos Trans R Soc Lond B Biol Sci 335:71–78. https://doi.org/10.1098/rstb.1992.0009
  107. Suñer C, Coma E, Ouchi D, Hermosilla E, Baro B, Rodríguez-Arias MÀ, Puig J, Clotet B, Medina M, Mitjà O (2022) Association between two mass-gathering outdoor events and incidence of SARS-CoV-2 infections during the fifth wave of COVID-19 in north-east Spain: a population-based control-matched analysis. Lancet Reg Health Eur 15:100337. https://doi.org/10.1016/j.lanepe.2022.100337
  108. Taulu S, Kajola M (2005) Presentation of electromagnetic multichannel data: the signal space separation method. J Appl Phys 97:124905. https://doi.org/10.1063/1.1935742
  109. Taulu S, Simola J (2006) Spatiotemporal signal space separation method for rejecting nearby interference in MEG measurements. Phys Med Biol 51:1759. https://doi.org/10.1088/0031-9155/51/7/008
  110. Thorpe S, Fize D, Marlot C (1996) Speed of processing in the human visual system. Nature 381:520–522. https://doi.org/10.1038/381520a0
  111. Tiippana K, Andersen TS, Sams M (2004) Visual attention modulates audiovisual speech perception. EJCP 16:457–472. https://doi.org/10.1080/09541440340000268
  112. Toscano JC, Toscano CM (2021) Effects of face masks on speech recognition in multi-talker babble noise. PLOS One 16:e0246842. https://doi.org/10.1371/journal.pone.0246842
  113. Truong TL, Beck SD, Weber A (2021) The impact of face masks on the recall of spoken sentences. J Acoust Soc Am 149:142–144. https://doi.org/10.1121/10.0002951
  114. Vallat R (2018) Pingouin: statistics in Python. J Open Source Softw 3:1026. https://doi.org/10.21105/joss.01026
  115. VanRullen R, Thorpe SJ (2001) The time course of visual processing: from early perception to decision-making. J Cogn Neurosci 13:454–461. https://doi.org/10.1162/08989290152001880
  116. van Wassenhove V, Grant KW, Poeppel D (2005) Visual speech speeds up the neural processing of auditory speech. Proc Natl Acad Sci U S A 102:1181–1186. https://doi.org/10.1073/pnas.0408949102
  117. Vehtari A, Gabry J, Magnusson M, Yao Y, Bürkner P-C, Paananen T, Gelman A (2022a) loo: efficient leave-one-out cross-validation and WAIC for Bayesian models [Computer software]. Available at: https://mc-stan.org/loo/
  118. Vehtari A, Gelman A, Gabry J (2017) Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Stat Comput 27:1413–1432. https://doi.org/10.1007/s11222-016-9696-4
  119. Vehtari A, Gelman A, Simpson D, Carpenter B, Bürkner P-C (2021) Rank-normalization, folding, and localization: an improved R̂ for assessing convergence of MCMC (with discussion). Bayesian Anal 16:667–718. https://doi.org/10.1214/20-BA1221
  120. Vehtari A, Simpson D, Gelman A, Yao Y,
    5. Gabry J
    (2022b) Pareto smoothed importance sampling (arXiv:1507.02646). arXiv.
  121. ↵
    1. Virtanen P, et al.
    (2020) Scipy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods 17:261–272. https://doi.org/10.1038/s41592-019-0686-2 pmid:32015543
    OpenUrlCrossRefPubMed
  122. ↵
    1. Waskom ML
    (2021) Seaborn: statistical data visualization. J Open Source Softw 6:3021. https://doi.org/10.21105/joss.03021
    OpenUrlCrossRef
  123. ↵
    1. Wickham H
    (2016) Ggplot2: elegant graphics for data analysis, Ed 2. Cham: Springer-Verlag New York.
  124. ↵
    1. Wilkinson GN,
    2. Rogers CE
    (1973) Symbolic description of factorial models for analysis of variance. J R Stat Soc Ser C Appl Stat 22:392–399. https://doi.org/10.2307/2346786
    OpenUrlCrossRef
  125. ↵
    1. Yuan Y,
    2. MacKinnon DP
    (2009) Bayesian mediation analysis. Psychol Methods 14:301–322. https://doi.org/10.1037/a0016972 pmid:19968395
    OpenUrlCrossRefPubMed
  126. ↵
    1. Zhang L,
    2. Du Y
    (2022) Lip movements enhance speech representations and effective connectivity in auditory dorsal stream. Neuroimage 257:119311. https://doi.org/10.1016/j.neuroimage.2022.119311
    OpenUrlCrossRefPubMed

Synthesis

Reviewing Editor: Christine Portfors, Washington State University

Decisions are customarily a result of the Reviewing Editor and the peer reviewers coming together and discussing their recommendations until a consensus is reached. When revisions are invited, a fact-based synthesis statement explaining their decision and outlining what is needed to prepare a revision will be listed below. The following reviewer(s) agreed to reveal their identity: Aaron Nidiffer. Note: If this manuscript was transferred from JNeurosci and a decision was made to accept the manuscript without peer review, a brief statement to this effect will instead be what is listed below.

Thank you to the authors for addressing the previous comments completely, either by making meaningful changes to the manuscript or by providing thoughtful reasoning for their decisions on each point raised.

Neural Speech Tracking Contribution of Lip Movements Predicts Behavioral Deterioration When the Speaker's Mouth Is Occluded
Patrick Reisinger, Marlies Gillis, Nina Suess, Jonas Vanthornhout, Chandra Leon Haider, Thomas Hartmann, Anne Hauswald, Konrad Schwarz, Tom Francart, Nathan Weisz
eNeuro 16 January 2025, 12 (2) ENEURO.0368-24.2024; DOI: 10.1523/ENEURO.0368-24.2024

Keywords

  • audiovisual speech
  • lip movements
  • MEG
  • neural tracking
  • temporal response functions
  • TRF

Copyright © 2026 by the Society for Neuroscience.
eNeuro eISSN: 2373-2822

The ideas and opinions expressed in eNeuro do not necessarily reflect those of SfN or the eNeuro Editorial Board. Publication of an advertisement or other product mention in eNeuro should not be construed as an endorsement of the manufacturer’s claims. SfN does not assume any responsibility for any injury and/or damage to persons or property arising from or related to any use of any material contained in eNeuro.