Abstract
In everyday life, ambient sounds can disrupt our concentration, interfere with task performance, and contribute to mental fatigue. Even when not actively attended to, salient or changing sounds in the environment can involuntarily divert attention. Understanding how the brain responds to these real-world auditory distractions is essential for evaluating the cognitive consequences of environmental noise. In this study, we recorded electroencephalography while participants performed different tasks during prolonged exposure to a complex urban soundscape. We identified naturally occurring, acoustically salient events and analyzed the corresponding event-related potentials (ERPs). Auditory spectral novelty reliably elicited a P3a response (250–350 ms), reflecting robust attentional capture by novel environmental sounds. In contrast, the reorienting negativity (RON) window (450–600 ms) showed no consistent modulation, possibly due to the continuous and largely behaviorally irrelevant nature of the soundscape. Performance in a behavioral task was briefly disrupted following novel sounds, underscoring the functional impact of attentional capture. Noise sensitivity, measured via the Weinstein Noise Sensitivity Scale (Weinstein, 1978), was not associated with ERP amplitudes. Together, these findings demonstrate that the P3a component provides a stable neural marker of attentional shifts in naturalistic contexts and highlight the utility of spectral novelty detection as a tool for investigating auditory attention outside the laboratory.
- auditory attention
- auditory distraction
- ecological validity
- EEG
- event-related potentials
- neuroergonomics
- P3a component
- real-world soundscape
- spectral novelty
Significance Statement
Everyday environments are filled with unpredictable sounds that can capture our attention and disrupt performance, yet most research on auditory distraction relies on highly controlled stimuli. Our study bridges this gap by identifying electroencephalography (EEG) responses to naturally occurring acoustic changes in a real-world soundscape. Using a spectral novelty algorithm, we show that the P3a component reliably tracks attentional capture in complex auditory scenes—even when the soundscape is behaviorally irrelevant. This approach not only enhances ecological validity but also demonstrates a practical method for studying auditory attention outside the lab. Our findings highlight the potential for using EEG to understand cognitive functioning in real-life environments such as offices, classrooms, or public spaces.
Introduction
In our increasingly noisy environments (Asdrubali, 2014), managing auditory attention is crucial for cognition, well-being, and brain health. Complex acoustic environments require focusing on relevant sounds while suppressing irrelevant ones (Hillyard et al., 1973; Fritz et al., 2005; Shinn-Cunningham and Best, 2008; Ahveninen et al., 2011; Choi et al., 2013; Schwartz and David, 2018). This demanding process can cause fatigue over time (Saremi et al., 2008). Understanding how the brain copes with persistent noise and individual differences in this process (Kjellberg et al., 1996) is key to mitigating cognitive and societal effects.
Studies on auditory distractibility propose a three-phase model of distraction (Escera et al., 2000; Wetzel and Schröger, 2014; Getzmann et al., 2024). In this model, distraction unfolds in three stages: an initial, automatic detection of change in the acoustic environment, an involuntary shift of attention toward the deviant sound, and a final stage where attention is voluntarily reoriented and internal predictions updated. This framework aligns with early theories of the orienting reflex, describing how unexpected sensory events interrupt behavior by signaling a mismatch with internal models (Sokolov, 1963).
Some distracting sounds are automatically filtered out before reaching awareness (Boutros and Belger, 1999), while others demand effortful suppression, draining cognitive resources, and contributing to fatigue and annoyance linked to health issues (Bidet-Caulet et al., 2007; Basner et al., 2014; Schwartz and David, 2018). However, individuals differ in their susceptibility to distraction: some are more easily captured by novel sounds, while others exhibit stronger cognitive control mechanisms (Kjellberg et al., 1996; Shepherd et al., 2016). Investigating how these differences shape neural responses to irrelevant sounds may explain why some listeners are more affected by noise than others. This variability underscores the importance of studying attention in settings that better reflect everyday auditory complexity.
Traditional event-related potential (ERP) paradigms rely on short, repetitive, and highly controlled auditory stimuli. While instrumental in advancing our understanding of auditory processing (Spong et al., 1965; Hillyard et al., 1973), they offer limited insight into how attention operates in complex environments. Real-world listening involves dynamic, overlapping streams that unfold continuously over time. This raises doubts about how well laboratory findings generalize to real-world listening.
In response, some studies have adopted more ecologically valid designs by embedding naturalistic sounds, like speech or environmental noise, into continuous streams (Straetmans et al., 2021; Rosenkranz et al., 2023). These approaches preserve some real-world acoustics while maintaining electroencephalography (EEG) interpretability. Previous studies demonstrated that neural responses to repeated naturalistic stimuli are shaped by factors like task complexity and personal relevance (Korte et al., 2025). However, such studies still rely on artificial structuring of the soundscape. Fully natural soundscapes, such as a busy street, present another challenge. They are dense, unpredictable, and lack experimenter-controlled stimuli. Though listeners readily notice salient changes (Hicks and McDermott, 2024), identifying these moments objectively is non-trivial. Real-world acoustic events often overlap or are masked by background noise, much like subtle instruments in polyphonic music can be masked by louder ones (Müller, 2021). Salience depends on spectral and contextual factors, not always reflected in the waveform (Lavie, 2005; Müller, 2021).
To address this, we utilize a spectral novelty detection algorithm (Müller, 2021) to identify perceptually salient auditory events directly from the natural soundscape. It eliminates the need for artificially embedded stimuli, allowing examination of neural responses to naturally occurring, contextual sounds in complex acoustic environments. ERPs serve as a key tool for this investigation, capturing time-locked brain responses to discrete events.
The present study tests whether the three-phase model of auditory distraction (Escera et al., 2000; Wetzel and Schröger, 2014), originally developed under controlled lab conditions, can explain neural responses in complex, real-world environments. The model examines ERP components associated with each stage of distraction: the mismatch negativity (MMN), occurring around 100 ms at central and prefrontal sites, typically indexing pre-attentive deviance detection; the P3a, a frontocentral positivity around 300 ms, reflecting involuntary attention capture; and the reorienting negativity (RON), a frontocentral negativity occurring 400–600 ms after stimulus onset, marking the reallocation of attention. As our paradigm does not include a repetitive standard against which deviance can be established, the MMN component cannot be meaningfully assessed. Thus, we focus our analysis on the P3a and RON, which remain informative in this context.
By examining these later ERP components in response to spontaneous acoustic changes, our study evaluates the translatability of the three-phase model to ecologically valid auditory scenes. Our approach aims to extend attentional control models into complex acoustic environments that shape real-world listening.
Methods
Data and code availability
The stimuli, analysis scripts, and data to reproduce the findings of this paper can be found at: https://zenodo.org/records/15182196.
Participants
The present study builds on the dataset reported in Korte et al. (2025). In total, 30 individuals underwent audiometric screening (pure-tone audiometry). Twenty-three participants (13 female, 10 male) met the eligibility criterion of hearing thresholds of 20 dB hearing level (HL) or better at octave frequencies from 250 Hz to 8 kHz and were included in the final sample. Participants were between 21 and 37 years old (mean: 25.57, SD: 3.48), right-handed, had normal or corrected-to-normal vision, and reported no history of neurological, psychiatric, or psychological conditions. All participants provided written informed consent and received monetary compensation for their participation.
Procedure
Prior to EEG data acquisition, participants completed the Weinstein Noise Sensitivity Scale (WNSS; Weinstein, 1978), a 21-item inventory designed to assess individual differences in noise sensitivity. The questionnaire asks participants to rate their agreement with statements related to noise (e.g., “I wouldn’t mind living on a noisy street if the apartment I had was nice”) on a 6-point Likert scale ranging from “strongly disagree” to “strongly agree.” The total WNSS score reflects a participant’s general sensitivity to noise, with higher scores indicating greater susceptibility to noise-related annoyance and distraction.
Afterwards, participants completed six blocks of EEG recordings, each lasting 15–45 min, totaling approximately 3.5 h of recording data. Participants could take self-determined breaks between blocks. The experimental design alternated between passive listening blocks, where participants were instructed to disregard the soundscape, and active listening conditions, where they responded to specific auditory events.
Paradigm
All parts of the paradigm, apart from the transcription task, were presented using the Psychophysics Toolbox extension (Brainard, 1997; Pelli, 1997; Kleiner et al., 2007, Version: 3) on MATLAB 2021b.
Auditory stimuli
For our analysis in this paper, we only included blocks in which the pre-recorded soundscape of a busy city street (see https://www.youtube.com/watch?v=Le_g4s6KloU, accessed 01.07.22) was played. It consisted of a variety of ambient sounds, typical of an urban area (e.g., streetcars, motorcycles, or incomprehensible speech). The street scenario had a total length of 2 h and 21 min, from which we took four segments of 45 min each. These segments had a short overlap, since the original sound file was not long enough to cover three non-overlapping hours. The sequence of segments was randomized across participants.
The soundscape was presented via two free-field loudspeakers (Sirocco S30, Cambridge Audio) positioned at ear level, at a 45° angle to the left and right with a distance of approximately 0.5 m from the participant. Playback volume was calibrated prior to the experiment using a sound level meter placed at head position, with the average sound pressure level (SPL) set to 51 dB(A). To characterize the acoustic properties of each sound file, we computed short-term root mean square (RMS) energy using 50 ms windows (with 50% overlap) and converted this to dB SPL relative to the calibrated reference. Two brief signal artifacts were observed in File 1, where sound levels dropped below 30 dB(A) for isolated frames. These values were excluded from the SPL summary statistics to avoid skewing the results. No such artifacts were present in the other files.
The cleaned analysis revealed highly consistent sound level distributions across the four segments, with mean SPLs around 49.65 dB(A). Minimum values ranged from 30.56 to 32.01 dB(A), and maximum levels from 68.54 to 68.85 dB(A) (Fig. 1 and Table 1). A separate figure illustrates the artifacts in File 1 that were excluded from analysis (Extended Data Fig. 1-1).
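The level computation described above can be illustrated with a minimal Python sketch (the original analysis was run in MATLAB). It assumes that the calibration ties the overall RMS of a file to the 51 dB(A) reference and it applies no A-weighting filter. The function name `short_term_spl` and its parameters are hypothetical and not taken from the published scripts.

```python
import numpy as np

def short_term_spl(signal, fs, ref_db=51.0, win_s=0.050, overlap=0.5):
    """Short-term RMS level in dB, shifted so the overall RMS maps to ref_db.

    Assumption: the calibrated 51 dB(A) average corresponds to the RMS of the
    whole file; no A-weighting is applied in this sketch.
    """
    signal = np.asarray(signal, dtype=float)
    win = int(round(win_s * fs))                 # 50 ms window in samples
    hop = int(round(win * (1.0 - overlap)))      # 50% overlap -> half-window hop
    n_frames = 1 + (len(signal) - win) // hop
    rms = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + win]
        rms[i] = np.sqrt(np.mean(frame ** 2))
    overall_rms = np.sqrt(np.mean(signal ** 2))
    eps = np.finfo(float).tiny                   # avoid log10(0) for silent frames
    return ref_db + 20.0 * np.log10(np.maximum(rms, eps) / overall_rms)
```

Under these assumptions, the cleaning step reported in the text would correspond to keeping only frames at or above 30 dB, for example `cleaned = spl[spl >= 30.0]`, before computing the summary statistics.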
Figure 1. SPL envelope of the urban soundscape (exemplary snippet) used in the experiment. Short-term RMS energy was calculated using 50 ms windows (50% overlap) and converted to dB SPL based on a calibration reference of 51 dB(A). The plot reflects cleaned data, with values below 30 dB(A) excluded to eliminate brief non-acoustic artifacts. For visualization of these excluded artifacts, see Extended Data Figure 1-1.
Figure 1-1
Sound level envelope of File 1 with artifact values included. Two brief dips below 30 dB(A) (highlighted in red) were identified as likely signal artifacts rather than valid acoustic events and were excluded from the cleaned analysis. These artifacts occurred only in File 1 and are shown here for transparency.
Table 1. Cleaned SPL statistics for the four soundscape segments
In the original study, additional auditory stimuli (church bells) were added to the street soundscape. However, for the analysis presented in this paper, only the urban soundscape without additional auditory cues was considered.
Nonauditory task
Participants engaged in one of two nonauditory tasks, depending on the experimental block. The first was a visual search task using detailed hidden-object pictures (HOPs), similar to the well-known “Where’s Waldo?/Where’s Wally?” game. Participants searched for specific objects within complex illustrated scenes and selected them using the mouse. The number of targets was deliberately set high to ensure continuous engagement throughout the block.
The second task was a transcription task that resembled simple office work. Taken from the citizen science project “World Architecture Unlocked” on “Zooniverse,” this task required participants to transcribe handwritten details from architectural photographs (for further information, see https://www.zooniverse.org/projects/courtaulddigital/world-architecture-unlocked/about/research). Participants categorized information such as city names, architects, or building names. No prior architectural knowledge was necessary, but the task was sufficiently complex to require sustained attention. Participants were encouraged to use search engines and online maps to verify and categorize the transcribed information. The task involved reading, typing, using the mouse, and researching, making it a suitable approximation of realistic office-based cognitive tasks, while providing a higher task complexity than the HOP task. In the original experiment, the street soundscape was played during four separate blocks, each combined with either the visual search task or the transcription task. These blocks were not designed for the current analyses but were selected post hoc because they provided extended, ecologically valid exposure to a complex auditory environment while participants were engaged in nonauditory activities.
Experimental blocks
The experiment consisted of three phases, divided into four consecutive blocks, as illustrated in Figure 2. The passive phase A comprised two blocks, while the active phase and the passive phase B each included one block.
Figure 2. Overview of experimental blocks. The order of the blocks was chosen to ensure naivety concerning target sounds in the passive phase A. Gray shaded blocks are not considered in this work and are only displayed for the sake of completeness.
During the experiment, participants were either instructed that the soundscape was irrelevant or that they had to detect a specific target sound (church bell). This manipulation was intended to draw participants’ attention to the overall soundscape. In the passive listening conditions, participants were told that background sounds would be present but were irrelevant to their task and could be ignored. In the active listening condition, participants were required to detect and respond to the target sound by pressing the F4 key. The response key was chosen to avoid interference with the nonauditory task, where the keyboard was also used. All other sounds in the soundscape were not behaviorally relevant and did not require a response.
The block structure and auditory manipulations were part of a broader dataset that has been described in detail in Korte et al. (2025). The present study focuses on a subset of the experimental blocks from that study. The structure for the present study was as follows:
Block 1 (passive listening A, passive phase A): A 45-min sequence of the street soundscape was played while participants performed the HOP task, without responding to the sounds.
Block 2 (passive listening A, passive phase A): Identical to Block 1, but participants performed the transcription task.
Block 3 (active listening, active phase): Identical to Block 2, except participants now responded to the target sounds (church bell) in the street soundscape.
Block 4 (passive listening B, passive phase B): Identical to Block 2. However, participants were instructed to ignore the previously relevant church bell chimes again.
The order of blocks was fixed for all participants to ensure that they remained naive to the target sound during the passive listening phase A and to preserve the mental representation of the target sound in the passive listening B phase. Randomizing the order would have disrupted the intended transition between conditions, particularly in the final block. This design allowed for a consistent progression across participants.
Data acquisition
Description of lab setup
Participants were seated in a soundproof recording booth at a desk equipped with a screen (Samsung, SyncMaster P2470). A keyboard and a mouse were placed on the desk for task input and target response. Event markers for auditory stimuli and task events were generated using the Lab Streaming Layer (LSL) library (see https://github.com/labstreaminglayer/liblsl-Matlab, v1.14.0). Keyboard input was logged using LSL-compatible key capture software (see https://github.com/labstreaminglayer/App-Input, v1.15.0). The Lab Recorder software (see https://github.com/labstreaminglayer/App-LabRecorder, v1.14.0) ensured synchronized data recording of the EEG data, the event markers, and the keyboard capture in .xdf format. Files were organized using the Brain Imaging Data Structure format (Gorgolewski et al., 2016) with the EEG data extension (Pernet et al., 2019).
EEG system
EEG data were collected using a 24-channel EEG cap (EasyCap GmbH) with passive Ag/AgCl electrodes (channel positions: Fp1, Fp2, F7, Fz, F8, FC1, FC2, C3, Cz, C4, T7, T8, CP5, CP1, CPz, CP2, CP6, TP9, TP10, P3, Pz, P4, O1, O2). The mobile cap setup, with fewer electrodes than typical lab systems, was chosen for participant comfort during the extended recording sessions and was well tolerated, even during breaks. A mobile amplifier (SMARTING MOBI, mBrainTrain) was attached to the EEG cap, allowing participants a more natural sitting position compared to a wired EEG system. Gyroscope data from the amplifier were recorded to track head movements. Data were transmitted via Bluetooth to a desktop computer using a BlueSoleil dongle. EEG and gyroscope data were streamed to LSL via the SMARTING Streamer software (v3.4.3; mBrainTrain) and recorded at a sampling rate of 250 Hz using Lab Recorder.
Measurement procedure
Before data collection, electrode sites were cleaned with 70% alcohol and abrasive gel (Abralyt HiCl, Easycap GmbH). Electrode gel was applied to maintain impedances below 10 kΩ and impedances were monitored throughout the session. If signal quality dropped, individual electrodes were re-gelled between blocks. Re-gelling was rare, typically affecting one or two electrodes, and no full cap removal was required. Given the experiment’s length, participants were allowed an extended lunch break, scheduled to avoid interference with experimental manipulation (active phase and passive phase B).
Data analysis
All analyses were conducted in MATLAB 2021b using the EEGLAB toolbox (Delorme and Makeig, 2004; version: 2021.1).
Behavioral data
To investigate whether sound events interfered with participants’ typing behavior, we analyzed the time interval between consecutive keystrokes. Specifically, we examined whether inter-keystroke intervals (IKIs) were longer when a sound onset occurred between two keystrokes, compared to intervals without an intervening sound. The IKI was defined as the time between the first and the second keystroke. If the sound was perceived as distracting, the second keystroke was expected to be delayed, resulting in a longer interval. To ensure comparability, control intervals were selected immediately prior to sound events. Only IKIs between 200 and 600 ms were retained to exclude implausibly short latencies and those unlikely to reflect continuous typing. For each participant and condition, mean IKI values were computed separately for sound and no-sound intervals.
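As an illustration of this pairing logic, the following Python sketch collects inter-keystroke intervals that contain a sound onset together with the interval immediately preceding each of them as a control. It is one possible reading of the description, not the published implementation. The function name `paired_ikis`, the half-open interval convention, and the requirement that the control interval itself contains no sound onset are assumptions.

```python
def paired_ikis(keystrokes, sound_onsets, lo=0.200, hi=0.600):
    """Return (sound_ikis, control_ikis) in seconds.

    keystrokes: sorted keystroke times in seconds.
    sound_onsets: sound onset times in seconds.
    An interval (t0, t1] counts as 'sound' if an onset falls inside it.
    The control is the immediately preceding interval (assumed sound-free).
    Both intervals must lie within [lo, hi] to be retained.
    """
    intervals = list(zip(keystrokes, keystrokes[1:]))

    def has_sound(t0, t1):
        return any(t0 < s <= t1 for s in sound_onsets)

    def valid(t0, t1):
        return lo <= (t1 - t0) <= hi

    sound, control = [], []
    for prev, cur in zip(intervals, intervals[1:]):
        if (has_sound(*cur) and valid(*cur)
                and valid(*prev) and not has_sound(*prev)):
            sound.append(cur[1] - cur[0])
            control.append(prev[1] - prev[0])
    return sound, control
```

Per-participant means of the two returned lists would then correspond to the sound and no-sound values entered into the statistical comparison.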
Audio processing and feature extraction
We were interested in how listeners perceive complex street soundscapes, particularly how auditory events influence attention and neural responses. In an initial test run, two human listeners manually annotated perceptually salient events in the soundscape. Their annotations confirmed that distinct auditory objects were identifiable and corresponded to measurable brain responses when used to time-lock EEG data. However, manual annotation is both time-consuming and highly variable, depending on factors such as headphone use, attentional state, and individual listener differences.
To address this, we applied spectral novelty analysis (Müller, 2021), an algorithmic method for detecting salient sound events based on abrupt changes in spectral content—especially in higher frequencies, where transient sounds are more easily distinguished from background noise. This approach enables reproducible event detection in naturalistic soundscapes without relying on manual annotations or artificial stimuli.
We implemented the method using the open-source MATLAB functions spectral_novelty.m and simp_peak.m (available at github.com/ThorgeHaupt/Audionovelty). Audio recordings of the street scenes (sample rate: 44.1 kHz) were converted to mono by averaging stereo channels. Each signal was transformed into the time-frequency domain using a short-time Fourier transform (Hanning window size: 882 samples, hop size: 441 samples). The resulting magnitude spectrogram was logarithmically compressed (γ = 10) to enhance perceptually relevant spectral variations (Fig. 3, second plot from top, left panel).
Figure 3. Left: example of the spectral novelty decomposition on a 20 s snippet. The raw audio (top plot) is first transformed into the time-frequency domain, from which a spectral change is computed (second plot). Afterwards, this spectral change is normalized and smoothed (third plot). Lastly, spectral peaks are identified based on a fixed threshold, resulting in a binary vector of peaks (bottom plot). Top right: spectral representation of an example novelty event. Bottom right: resulting ERP and topographies of the P3a and RON time windows. Each trace represents one EEG channel.
To highlight spectral changes, a first-order derivative across time frames was computed. Negative values were set to zero, and a local average over a 0.5 s window was subtracted to reduce noise and emphasize meaningful fluctuations (Fig. 3, third plot from top, left panel). The resulting novelty function was normalized and resampled to 100 Hz to match the temporal resolution required for further EEG analysis.
Sound onsets were defined as local maxima in the novelty function that exceeded neighboring values and a fixed threshold of 0.1. This resulted in a binary peak vector marking moments of salient acoustic change (Fig. 3, bottom plot, left panel). These peak markers were used to time-lock EEG analyses to naturally occurring auditory events in the urban soundscape.
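The processing chain described in this section can be sketched in Python as follows. This is a simplified illustration of a spectral-flux style novelty function with the parameters stated in the text (window 882 samples, hop 441 samples, γ = 10, 0.5 s local average, threshold 0.1). It is not a port of spectral_novelty.m or simp_peak.m. The function names, the centered moving average, and the normalization by the maximum are assumptions.

```python
import numpy as np

def spectral_novelty(x, fs=44100, win=882, hop=441, gamma=10.0, avg_s=0.5):
    """Spectral novelty curve of a mono signal; returns (novelty, frame_rate)."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(win)
    n_frames = 1 + (len(x) - win) // hop
    frames = np.stack([x[i * hop : i * hop + win] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))      # magnitude spectrogram
    comp = np.log1p(gamma * mag)                   # logarithmic compression
    diff = np.diff(comp, axis=0)                   # first-order difference over time
    diff[diff < 0] = 0.0                           # keep spectral increases only
    nov = np.concatenate([[0.0], diff.sum(axis=1)])
    frame_rate = fs / hop                          # 44100 / 441 = 100 Hz, so no
                                                   # extra resampling is needed here
    half = int(round(avg_s * frame_rate / 2))
    kernel = np.ones(2 * half + 1) / (2 * half + 1)
    local_avg = np.convolve(nov, kernel, mode="same")
    nov = np.maximum(nov - local_avg, 0.0)         # subtract local average, clip
    peak = nov.max()
    if peak > 0:
        nov = nov / peak                           # normalize to [0, 1]
    return nov, frame_rate

def pick_peaks(nov, threshold=0.1):
    """Binary vector marking local maxima that exceed a fixed threshold."""
    nov = np.asarray(nov, dtype=float)
    peaks = np.zeros(len(nov), dtype=bool)
    peaks[1:-1] = ((nov[1:-1] > nov[:-2]) & (nov[1:-1] > nov[2:])
                   & (nov[1:-1] > threshold))
    return peaks
```

With these settings, each frame index divided by the frame rate gives the start time of the corresponding analysis window in seconds, which is what would be mapped onto the EEG time axis.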
EEG data
The EEG was preprocessed as described in Korte et al. (2025). To ensure clarity, we briefly summarize the processing steps here. The data were first filtered between 1 and 40 Hz (default settings of the pop_eegfiltnew function). Next, bad channels were identified, using the clean_artifacts function of EEGLAB with the default settings for channel_crit_maxbad_time and subsequently stored for later interpolation. The data were then segmented into 1-s windows. Artifact rejection was performed on these windows based on a probability threshold of ±3 SD from the mean, which helped optimize independent component analysis (ICA) training. All data were then combined to compute ICA weights using the runica function in EEGLAB with the extended training mode. The ICA weights were applied to the raw EEG data.
Artifact rejection was performed using the ICLabel algorithm (Pion-Tonachini et al., 2019), where components classified with ≥80% probability as artifacts (e.g., eye blinks, muscle activity, heartbeats) were removed. Additionally, a manual inspection was conducted to account for possible misclassifications, as the ICLabel algorithm is primarily optimized for stationary datasets with minimal movement, whereas our setup allowed participants a degree of mobility. On average, 8 out of 24 components were removed per participant.
Following ICA-based artifact removal, the EEG data were further processed by applying a low-pass filter at 20 Hz and a high-pass filter at 0.5 Hz. Any previously identified bad channels were interpolated.
Events corresponding to the onset of spectral novelty peaks were identified, and their latencies were mapped to the EEG time-series. Epochs were extracted from −0.2 to 0.8 s relative to sound onset. If an epoch extended beyond the available data range, zero-padding was applied to maintain uniform epoch length across trials. Baseline correction was performed using a pre-stimulus interval from −0.2 to 0 s, subtracting the mean baseline activity from each epoch.
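A minimal Python sketch of this epoching step is given below, assuming a continuous EEG array of shape channels × samples at 250 Hz and onset times in seconds. The function name `extract_epochs` is hypothetical, and the order of operations (zero-padding first, then baseline subtraction) is an assumption based on the order in which the steps are described.

```python
import numpy as np

def extract_epochs(eeg, onsets_s, fs=250, tmin=-0.2, tmax=0.8):
    """Cut baseline-corrected epochs; returns (trials, channels, times)."""
    eeg = np.asarray(eeg, dtype=float)
    n_ch, n_samp = eeg.shape
    start_off = int(round(tmin * fs))              # -50 samples at 250 Hz
    n_times = int(round((tmax - tmin) * fs))       # 250 samples per epoch
    n_base = -start_off                            # samples in the -0.2..0 s baseline
    epochs = np.zeros((len(onsets_s), n_ch, n_times))
    for k, onset in enumerate(onsets_s):
        first = int(round(onset * fs)) + start_off
        lo, hi = max(first, 0), min(first + n_times, n_samp)
        if hi > lo:
            # samples outside the recording stay zero (zero-padding)
            epochs[k, :, lo - first : hi - first] = eeg[:, lo:hi]
        baseline = epochs[k, :, :n_base].mean(axis=1, keepdims=True)
        epochs[k] -= baseline                      # subtract mean pre-stimulus activity
    return epochs
```

Note that in this sketch the padded samples no longer equal zero after baseline subtraction, so epochs that touch the recording boundaries may warrant separate inspection.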
To assess differences in neural processing of the soundscapes under different listening conditions, we computed a grand-average ERP and topographic maps for each block. Additionally, we investigated whether the magnitude of the neural response depends on the degree of novelty of a given sound, as would be expected from Downar et al. (2002), who reported that several brain areas are sensitive to stimulus novelty. To this end, epochs were sorted in ascending order of their novelty score and assigned to 20 equally sized bins. Bins were equalized (where necessary) by excluding excess trials randomly (with a fixed random seed for reproducibility). Artifactual epochs were removed after binning, with a probability criterion of ±3 SD from the mean. A 50% overlap between bins was applied to ensure smoother transitions between novelty levels.
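The binning procedure leaves some room for interpretation. The Python sketch below shows one possible reading, in which trials are sorted by novelty score and covered by 20 equally sized windows that each advance by half a window. Excess trials are dropped at random with a fixed seed. The function name `overlapping_bins` and the exact way bin width and step are derived are assumptions, not the published procedure.

```python
import numpy as np

def overlapping_bins(scores, n_bins=20, overlap=0.5, seed=0):
    """Return a list of n_bins index arrays, sorted by ascending score.

    Each bin has width w and starts `step` trials after the previous one,
    with step = w * (1 - overlap). Trials that do not fit are dropped at
    random using a fixed seed.
    """
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    step = int(n // ((n_bins - 1) + 1.0 / (1.0 - overlap)))
    if step == 0:
        raise ValueError("not enough trials for the requested number of bins")
    w = int(round(step / (1.0 - overlap)))
    n_used = step * (n_bins - 1) + w
    rng = np.random.default_rng(seed)
    keep = np.sort(rng.choice(n, size=n_used, replace=False))
    order = keep[np.argsort(scores[keep], kind="stable")]   # ascending novelty
    return [order[b * step : b * step + w] for b in range(n_bins)]
```

Each returned index array can be used to select the epochs of one novelty bin before averaging.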
To investigate ERP responses, we focused on four frontocentral electrodes (Fz, FC1, FC2, Cz), selected based on previous literature emphasizing their sensitivity to components within the three-phase model of auditory distraction (Escera et al., 2000; Wetzel and Schröger, 2014; Getzmann et al., 2024). For each participant and condition, ERPs were averaged across these electrodes and across trials. To enable cross-subject comparison, we further computed grand-average ERPs per novelty bin across participants.
The P3a component was analyzed as the mean amplitude in the 250–350 ms time window post-onset, while the RON component was assessed in the 450–600 ms window. These windows were selected based on visual inspection of peak deflections in the grand-average waveforms and align with typical latencies reported in the literature (Escera et al., 2000; Wetzel and Schröger, 2014; Getzmann et al., 2024). We did not include the MMN in our analysis, as its elicitation typically depends on a structured sequence of frequent standard and infrequent deviant stimuli (Näätänen et al., 1993), which introduce a violation of regular auditory patterns. Since our continuous, real-world street soundscape is highly dynamic by nature, no such pattern of consistent auditory regularities exists. Thus, our soundscape is not suited to investigate the MMN component.
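The mean-amplitude measure can be sketched in Python as follows, assuming epochs of shape trials × channels × times and a matching time axis in seconds. The function name `component_amplitude` and the inclusive window boundaries are assumptions made for illustration.

```python
import numpy as np

FRONTOCENTRAL = ["Fz", "FC1", "FC2", "Cz"]      # channels named in the text
P3A_WINDOW = (0.250, 0.350)                     # seconds post-onset
RON_WINDOW = (0.450, 0.600)

def component_amplitude(epochs, times, ch_names, picks, window):
    """Mean amplitude over trials, selected channels, and a time window."""
    epochs = np.asarray(epochs, dtype=float)
    times = np.asarray(times, dtype=float)
    ch_idx = [ch_names.index(ch) for ch in picks]
    erp = epochs[:, ch_idx, :].mean(axis=(0, 1))          # average trials and channels
    mask = (times >= window[0]) & (times <= window[1])    # inclusive window (assumed)
    return float(erp[mask].mean())
```

For epochs sampled at 250 Hz from −0.2 to 0.8 s, a matching time axis would be `(np.arange(250) - 50) / 250`.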
Statistical analysis
Weinstein noise sensitivity scale
In an exploratory analysis, we examined whether individual differences in noise sensitivity predict neural responses to auditory novelty. We conducted Pearson’s correlation analyses between WNSS scores and individual ERP amplitudes averaged over all conditions and as a mean of the selected frontocentral channels. Specifically, we tested the relationship between WNSS scores and mean ERP amplitudes in two key time windows: the P3a window (250–350 ms) and the RON window (450–600 ms). Shapiro–Wilk tests did not indicate deviations from normality for the WNSS scores or the ERP amplitudes (WNSS: p = 0.451, P3a: p = 0.366, RON: p = 0.160), supporting the use of Pearson’s correlation. Correlations were computed separately for the P3a and RON amplitudes, with significance levels set at p < 0.05.
Behavioral data
Statistical analysis was performed using the Wilcoxon signed-rank test for paired samples to compare typing speed between uninterrupted and interrupted typing within each experimental condition. This non-parametric test was chosen due to deviations from normality in the data distribution, as confirmed by the Shapiro–Wilk test. The Wilcoxon signed-rank test was applied separately for each condition to determine whether the presence of auditory interruptions significantly affected typing speed.
To account for multiple comparisons, p-values were adjusted using the Benjamini–Hochberg false discovery rate (FDR) correction. This method controls the expected proportion of false positives while maintaining statistical power.
To assess whether the magnitude of behavioral disruption (i.e., the difference in IKIs between interrupted and uninterrupted typing) differed between conditions, we conducted a Friedman test. This non-parametric equivalent of a repeated-measures ANOVA is suitable for comparing more than two related samples when the data may not follow a normal distribution. The Friedman test was applied to per-subject difference scores across all three experimental conditions that contained the transcription task.
Statistical analyses were conducted in MATLAB R2021b using the signrank.m function for Wilcoxon tests and the mafdr.m function for FDR correction.
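For readers unfamiliar with the correction step, the Benjamini–Hochberg adjustment can be written in a few lines. The Python sketch below is a generic implementation of the standard step-up procedure, not a reproduction of mafdr.m. The function name `bh_adjust` is hypothetical.

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)                              # ascending p-values
    scaled = p[order] * m / np.arange(1, m + 1)        # p_(i) * m / i
    # enforce monotonicity: adjusted value is the running minimum from the right
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted
```

A comparison is then considered significant when its adjusted p-value falls below the chosen false discovery rate, for example 0.05.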
EEG data
To assess differences in ERP amplitudes across conditions, we conducted Wilcoxon signed-rank tests, a non-parametric paired test, comparing mean ERP amplitudes within the time windows of interest for the P3a and RON.
We performed four pairwise comparisons, motivated by the study’s design:
Passive A + HOP versus passive A + transcription to test whether the type of nonauditory task modulates ERP amplitudes under passive listening conditions.
Passive A + transcription versus active + transcription to examine whether directing attention to the soundscape in the active condition influences ERP amplitudes.
Active + transcription versus passive B + transcription to determine whether ERP amplitudes remain modulated after the active phase or return to passive A levels.
Passive A + HOP versus passive B + transcription to evaluate whether ERP amplitudes in the passive phase B differ from those in the passive phase A.
To correct for multiple comparisons, we applied FDR correction using the Benjamini–Hochberg procedure.
To examine the influence of novelty intensity at the single-trial level, we used linear mixed-effects models (LMMs) with novelty score as a continuous predictor. Separate LMMs were fitted for the P3a and RON time windows. The models included fixed effects for novelty score and condition and a random intercept for subject to account for the non-independence of repeated measurements within each subject (Extended Data Fig. 7-1). We compared two models per time window using a likelihood ratio test (LRT):
Full model: EEG amplitude ∼ novelty score + condition + (1|subject)
Simpler model: EEG amplitude ∼ novelty score + (1|subject)
Model selection was based on the Akaike information criterion (AIC) and the LRT to determine whether including condition improved model fit. The models were implemented using the fitlme.m function in MATLAB.
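The model comparison step can be illustrated independently of the model fitting. The Python sketch below assumes that the maximized log-likelihoods of both models are already available (for example, from the fitting software) and computes the LRT statistic, its p-value, and both AIC values. The function names and the parameter counts of 4 and 7 (fixed effects plus random-intercept variance plus residual variance, with three dummy-coded contrasts for the four conditions) are assumptions for this example. The closed-form chi-square tail used here is valid only for three degrees of freedom, and a non-positive statistic is mapped to p = 1.

```python
import math


def chi2_sf_df3(x):
    """Survival function of the chi-square distribution with 3 degrees of freedom."""
    if x <= 0:
        return 1.0  # the fuller model did not fit better
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def compare_nested(loglik_simple, loglik_full, k_simple=4, k_full=7):
    """Likelihood ratio test and AIC for two nested models differing by 3 parameters."""
    if k_full - k_simple != 3:
        raise ValueError("this sketch only covers a difference of 3 parameters")
    lrt = 2.0 * (loglik_full - loglik_simple)
    return {
        "lrt": lrt,
        "p": chi2_sf_df3(lrt),
        "aic_simple": 2 * k_simple - 2 * loglik_simple,
        "aic_full": 2 * k_full - 2 * loglik_full,
    }


# Hypothetical log-likelihoods, chosen only to show the call.
result = compare_nested(loglik_simple=-5210.4, loglik_full=-5209.1)
print(result)
```

A lower AIC indicates the preferred model; when the LRT is non-significant and the simpler model has the lower AIC, both criteria favor dropping the condition term.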
Results
Overall ERP responses to spectral novelty peaks
In a first step, we investigated the overall ERPs per condition, time-locked to all identified spectral novelty peaks. Figure 4 displays the time-series from −0.2 to 0.8 s relative to peak onsets, with corresponding topographical representations for two distinct time windows: 0.25 to 0.35 s, corresponding to the P3a, and 0.45 to 0.60 s, corresponding to the RON.
Figure 4. ERPs time-locked to spectral novelty peaks across all participants and trials, visualized as butterfly plots (i.e., each trace represents one EEG channel). Gray shaded areas in the time-series plots mark the time windows for the topographies (0.25–0.35 s and 0.45–0.60 s). N refers to the number of trials included in the average. Top left: passive listening A condition while participants engaged in the HOP task. Top right: passive listening A condition while participants engaged in the transcription task. Bottom left: active listening condition while participants engaged in the transcription task. Bottom right: passive listening B condition while participants engaged in the transcription task.
A distinct positive deflection at approximately 300 ms post-onset can be observed across all conditions, with the strongest amplitude around frontocentral electrodes. This pattern is consistent with the expected characteristics of the P3a component. The response is most pronounced in the passive listening A condition with the HOP, followed by the passive listening A condition with the transcription task. The amplitude of this peak appears slightly reduced in the active listening condition and lowest in the passive listening B condition.
In contrast, we did not observe a pronounced negative deflection in the expected RON time window (450–600 ms). While there are slight amplitude variations across conditions, the expected negativity is not clearly present. This suggests that the reorienting process might be weaker or less reliably elicited in the given experimental context. The summary statistics for these analyses are presented in Table 2.
Table 2. Summary statistics for the grand ERPs per condition in time windows of interest and for frontocentral channels (mean of Fz, FC1, FC2, Cz)
Condition effects on ERP amplitudes
P3a time window (250–350 ms)
Wilcoxon signed-rank tests were conducted to compare ERP amplitudes between conditions. None of the pairwise comparisons showed a significant difference in ERP amplitude before or after correction for multiple comparisons (all p-values >0.05). Specifically, comparisons between passive A + HOP and passive A + transcription (W = 165, p = 0.2113, FDR-corrected p = 0.8453), passive A + transcription and active + transcription (W = 127, p = 0.9870), active + transcription and passive B + transcription (W = 126, p = 0.9870), and passive A + HOP and passive B + transcription (W = 133, p = 0.8329, FDR-corrected p = 0.9870) all yielded non-significant results.
These findings indicate that neither task engagement nor listening condition (passive vs active) led to significant changes in the P3a time window.
RON time window (450–600 ms)
To investigate voluntary reorientation processes, we examined ERP amplitudes within the 450–600 ms time window. The Wilcoxon signed-rank tests revealed no significant differences between conditions, even before correction for multiple comparisons (all uncorrected p-values >0.05). Specifically, comparisons between passive A + HOP and passive A + transcription (W = 147, p = 0.5057, FDR-corrected p = 0.6743), passive A + transcription and active + transcription (W = 103, p = 0.4455, FDR-corrected p = 0.6743), active + transcription and passive B + transcription (W = 138, p = 0.7089), and passive A + HOP and passive B + transcription (W = 155, p = 0.3548, FDR-corrected p = 0.6743) all yielded non-significant results.
These findings suggest that no robust effects were observed in the RON time window, regardless of task or listening condition.
EEG responses as a function of spectral novelty
P3a component
We observed a systematic increase in EEG amplitude with higher spectral novelty scores (Fig. 5). This trend is visually apparent in the grand-average ERPs across novelty bins (Fig. 6), where larger P3a amplitudes are observed in bins with higher novelty values, particularly in the passive A + HOP condition (Fig. 7).
Figure 5. Grand-average ERPs across the 20 novelty bins as mean over all conditions. Data were averaged per condition first, then across conditions.
Figure 6. Grand-average ERPs across 20 spectral novelty bins for each listening condition. Each subplot represents the ERP response for a specific spectral novelty bin, with Bin 1 corresponding to the lowest novelty values and Bin 20 to the highest. Gray marked windows correspond to the P3a window (early window, 0.25–0.35 s) and the RON window (late window, 0.45–0.60 s). For individual participants’ data averaged over all conditions, see Extended Data Figure 6-1.
Figure 6-1
Individual ERPs for each participant, averaged across all conditions at selected frontocentral electrodes (Fz, FC1, FC2, and Cz). Shaded regions indicate the time windows of interest: P3a (250–350 ms, light gray) and Reorienting Negativity (RON, 450–600 ms, dark gray).
Figure 7. Development of mean ERP amplitudes across frontocentral electrodes (Fz, FC1, FC2, Cz) for each spectral novelty bin. For individual participants’ data, see Extended Data Figure 7-1.
Figure 7-1
Smoothed mean EEG amplitude in the P3a window across spectral novelty bins and averaged over all conditions, split by median amplitude in the highest novelty bins. Red lines represent participants with amplitudes above the median, blue lines represent participants below the median, and the black line shows the grand average. This figure illustrates the overall trend of increasing EEG amplitude with spectral novelty, with inter-individual variability in the magnitude of this effect.
LMM analysis confirmed a significant effect of novelty score on EEG amplitude (β = 4.53, p < 0.001), supporting the observed trend that higher novelty is associated with increased P3a amplitudes. Including “condition” as a predictor did not improve model fit (χ2(3) = −3.22, p = 1), and no reliable pairwise differences between conditions were observed (p > 0.05 for all but one contrast). The exception (passive A + HOP vs active + transcription) showed a borderline p-value (p = 0.049), but this result should be interpreted with caution, as it emerged from a model that did not outperform the simpler one.
These statistical findings align with the visual representation in Figures 5 and 6, where novelty is the primary factor modulating amplitude, and condition differences are subtle. For an illustration of individual participant trajectories across novelty bins, see Extended Data Figure 7-1, which presents ERP amplitude changes in the P3a time window for each participant, separated by median split. Similarly, the Wilcoxon signed-rank test results confirmed that condition did not significantly influence P3a amplitude.
RON component
In contrast to the P3a window, no strong amplitude changes are evident in the RON time window (Figs. 5, 6). The expected negative deflection for the RON component is not clearly present across conditions. LMM analysis confirmed the absence of an effect of novelty (β = −0.11, p = 0.823) or condition (p > 0.05 for all pairwise comparisons). Additionally, adding condition to the model did not improve fit (χ2(3) = 0.00, p = 1).
These statistical findings are visually supported by Figures 5 and 6, where no pronounced negativity is evident in the 450–600 ms window. Likewise, Wilcoxon signed-rank tests did not reveal significant differences between conditions, further reinforcing that RON amplitudes are not modulated by spectral novelty or condition.
Behavioral data
The analysis of IKIs revealed a consistent increase for interrupted typing compared to uninterrupted typing across all three experimental conditions (Fig. 8). Descriptive statistics showed that in the passive A + transcription condition, the mean IKI was 327.04 ms (SD = 18.17) for uninterrupted typing and increased to 350.03 ms (SD = 24.86) for interrupted typing. In the active + transcription condition, the mean IKI was 327.45 ms (SD = 23.48) for uninterrupted typing and 347.99 ms (SD = 37.77) for interrupted typing. Similarly, in the passive B + transcription condition, the mean IKI increased from 320.88 ms (SD = 14.30) in the uninterrupted state to 357.03 ms (SD = 49.97) in the interrupted state.
Figure 8. Analysis of inter-keystroke intervals for uninterrupted versus interrupted typing across conditions. Gray lines connect corresponding values for each participant. Asterisks indicate significance levels (*p ≤ 0.05; **p ≤ 0.01; ***p ≤ 0.001). Friedman’s test revealed no significant difference between conditions (p = 0.958).
Statistical analysis using the Wilcoxon signed-rank test confirmed that IKIs were significantly larger when typing was interrupted in all conditions. In the passive A + transcription condition, the test yielded W = 38, p = 0.0024 (uncorrected), p = 0.0035 (FDR-adjusted). For the active + transcription condition, the results were W = 42, p = 0.0035 (uncorrected), p = 0.0035 (FDR-adjusted). The effect was most pronounced in the passive B + transcription condition with W = 0, p < 0.0001 (uncorrected), p = 0.0001 (FDR-adjusted). Furthermore, the Friedman test revealed no significant differences in the magnitude of behavioral disruption across conditions, χ2(2) = 0.09, p = 0.958, indicating that the prolonged IKIs following sound events were comparable across all conditions. These findings indicate that the sound events significantly increased IKIs, suggesting a robust disruptive effect of sound events on typing, irrespective of experimental conditions.
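The per-participant disruption score that enters these tests is the difference between mean interrupted and mean uninterrupted IKIs. The Python sketch below shows one way such a score could be derived from keystroke timestamps. The criterion used here, that an interval counts as interrupted when a sound-event onset falls between its two keystrokes, is an assumption made purely for illustration and is not necessarily the criterion applied in the study; the function names are likewise hypothetical.

```python
from statistics import mean


def split_ikis(key_times, event_times):
    """Split inter-keystroke intervals (in ms) into interrupted and uninterrupted.

    key_times: sorted keystroke timestamps in seconds.
    event_times: sound-event onsets in seconds.
    Assumption: an interval is 'interrupted' if an event onset falls inside it.
    """
    interrupted, uninterrupted = [], []
    for start, end in zip(key_times, key_times[1:]):
        iki_ms = (end - start) * 1000.0
        if any(start <= t < end for t in event_times):
            interrupted.append(iki_ms)
        else:
            uninterrupted.append(iki_ms)
    return interrupted, uninterrupted


def disruption_score(key_times, event_times):
    """Mean interrupted IKI minus mean uninterrupted IKI, in ms."""
    interrupted, uninterrupted = split_ikis(key_times, event_times)
    if not interrupted or not uninterrupted:
        raise ValueError("need at least one interval of each kind")
    return mean(interrupted) - mean(uninterrupted)
```

One score per participant and condition would then form a row of the input to the Friedman test, and the paired interrupted and uninterrupted means would enter the Wilcoxon signed-rank tests.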
Weinstein noise sensitivity scale
Participants’ noise sensitivity was assessed using the WNSS (Weinstein, 1978). The WNSS scores were normally distributed (Shapiro–Wilk test: p = 0.451), with a mean score of 3.36 (SD = 0.54, range = 2.43–4.76). These values were previously reported in Korte et al. (2025), which examined the same sample in a different analytical context.
To provide context for these scores, we compared them to the normative value of 3.04 (SD = 0.57). Our sample mean of 3.36 lies about 0.56 normative standard deviations above this value, indicating slightly higher noise sensitivity on average. Because the observed mean falls within one standard deviation of the norm, the noise sensitivity of our sample appears broadly comparable to that of the general population.
To examine whether noise sensitivity was associated with neural responses to auditory novelty, we conducted Pearson’s correlation analyses between WNSS scores and individual ERP amplitudes averaged over all conditions in two key time windows: the P3a window (250–350 ms) and the RON window (450–600 ms).
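For reference, the correlation coefficient used in this analysis can be computed as follows. This Python sketch is an illustration with hypothetical function names, not the study's analysis code. It returns Pearson's r and the associated t statistic with n − 2 degrees of freedom; converting t into a p-value requires the t distribution, which is left to a statistics package.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length with at least 3 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))
```

Here `x` would hold one WNSS score per participant and `y` the corresponding mean ERP amplitude in the window of interest.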
The correlation analysis revealed no significant relationship between WNSS scores and P3a amplitudes (r = −0.006, p = 0.980), indicating that noise sensitivity did not predict the degree of attentional capture by novel sounds. A weak negative trend was observed between WNSS scores and RON amplitudes (r = −0.192, p = 0.393), which would suggest that individuals with higher noise sensitivity may have a less pronounced reorienting response following distraction. However, this trend did not reach statistical significance (Fig. 9). For a visualization of individual ERP waveforms across all participants, see Extended Data Figure 6-1, which depicts ERPs averaged across all conditions for the selected frontocentral channels.
Figure 9. Scatter plots depicting Pearson’s correlation between WNSS scores and mean ERP amplitudes for the P3a (left) and RON (right) components. Each dot represents an individual participant, and red lines indicate the best-fitting linear regression.
Discussion
This study investigated how the brain processes auditory novelty in a real-world soundscape while participants engaged in different tasks and listening conditions. We found that the intensity of spectral novelty significantly modulated EEG responses, particularly in the P3a time window. This highlights that stronger acoustic changes in the environment evoke more pronounced neural responses. Based on previous research, we expected these neural responses to vary with both novelty intensity and listening engagement. Specifically, we anticipated that active listening would enhance ERP amplitudes compared to passive listening. However, the contrast between active and passive listening conditions revealed only subtle differences in ERP amplitudes. That is, ERP amplitudes did not differ significantly depending on whether participants were passively listening or actively attending to the soundscape. EEG responses to sound events remained robust across all conditions. Below, we discuss the implications of these findings and potential explanations for the observed patterns.
Effect of spectral novelty on EEG responses
The results demonstrated that spectral novelty robustly influenced EEG amplitudes, particularly in the P3a time window. This finding aligns with previous research showing that novel auditory events elicit enhanced neural responses, reflecting attentional capture (Escera et al., 2000; Wetzel and Schröger, 2014; Getzmann et al., 2024).
The increased P3a amplitude we observed with higher novelty values supports the idea that spectral novelty acts as a key trigger for involuntary attentional shifts, which aligns with the second phase of the distraction model. Interestingly, we found no strong modulation in the RON time window, suggesting that participants either did not consistently reorient attention away from novel sounds or that this process was not as pronounced in a naturalistic setting. One possible explanation is that the ongoing soundscape did not provide discrete auditory events that necessitated reorienting, unlike traditional lab paradigms with isolated deviant stimuli. In our setting, novel sounds may not have required context updating or behavioral adaptation, making a reorienting response unlikely. Additionally, the background audio was mostly behaviorally irrelevant, so participants may not have engaged in an active reorienting process. The active listening condition is a partial exception: it encouraged attention to the soundscape, so a reorienting response might have been expected there. This highlights the need for future research to explore whether reorientation mechanisms are suppressed when auditory distractions occur within continuous, real-world soundscapes rather than discrete, laboratory-controlled paradigms.
Our approach of using spectral novelty provides new insights into how attention dynamically fluctuates in response to environmental sound changes. The observed increase in P3a amplitude with higher novelty (Figs. 5, 7) suggests that the brain remains highly sensitive to salient changes in the acoustic environment, even when attention is directed elsewhere. This responsiveness likely reflects bottom-up attentional capture by acoustically novel events, consistent with the notion that attention can be involuntarily drawn to unexpected changes in the environment. Furthermore, our findings highlight the robustness of the P3a component, which showed similar amplitude and morphology (i.e., shape, latency) across all conditions. This is in line with previous research (Fallgatter et al., 2000; Korte et al., 2025) and renders the P3a particularly useful for real-world EEG applications, where data might be noisier, and less robust components could be overshadowed by noise. Extended Data Figure 6-1 shows the consistency of this pattern across individual participants. The stability of the P3a across different listening conditions suggests that it may serve as a reliable neural marker for attentional capture in complex auditory environments such as workplaces (Wascher et al., 2023), classrooms (Janssen et al., 2021), or public spaces (Gramann, 2024).
Notably, the use of “P3a” versus “Novelty P3” remains a topic of debate in the literature. Some studies use the terms interchangeably, and factor-analytic work by Simons et al. (2001) suggests that they reflect the same neural process. Others, however, argue for a clearer distinction: for instance, Barry et al. (2016) propose that the Novelty P3 is a temporally and functionally distinct component, occurring after the P3a and P3b, and specifically associated with orienting to novel stimuli. Our data—showing a temporally stable, frontocentral response that scales with novelty but does not clearly differentiate into subcomponents—appear more consistent with a unified P3a/Novelty P3 interpretation. Nonetheless, we acknowledge that future studies with higher temporal resolution and precise source localization may help clarify whether these components are indeed separable, particularly in naturalistic contexts.
Limited condition effects and potential explanations
Contrary to our initial hypothesis, listening mode did not significantly alter ERP responses. While the active listening condition was expected to enhance auditory processing compared to passive listening, we found only subtle differences between conditions (see Fig. 6 for a visual comparison across bins and conditions). One possible explanation is that attentional resource allocation varied dynamically across tasks, but did not create large enough differences to be reflected in ERPs. The HOP task, being cognitively less demanding than transcription, may have allowed participants to allocate more resources to background sound processing (Sörqvist and Rönnberg, 2016). This could have facilitated stronger neural responses to auditory novelty, but may not have yielded sufficiently large neural differences to reach statistical significance.
Additionally, habituation effects may have contributed to the condition pattern, particularly since the passive A + HOP condition was always presented first. This initial exposure to the soundscape may have triggered stronger neural responses, reflecting heightened sensitivity to novel auditory input. While one might expect habituation to continue progressively across all blocks, it is also possible that the largest adjustment occurred early on, during the first encounter with the soundscape. In this view, the strongest novelty-related responses would be limited to the first block, with a relatively stable lower responsiveness in subsequent blocks. Such an early adjustment would be consistent with the deviance detection and attentional shift stages of the three-phase distraction model, which are known to diminish once novelty is no longer perceived as salient.
Another factor to consider is the role of self-generated sounds in the transcription task. Participants’ typing may have introduced competing auditory input that either acoustically masked or perceptually deprioritized the street soundscape. Prior research suggests that self-generated sounds are processed differently from externally generated ones (Martikainen, 2004; Bäß et al., 2008; Saupe et al., 2013) and often involve predictive mechanisms that suppress their neural representation. As a result, the soundscape may have been pushed into the perceptual background, both because it was masked by keystrokes and because attentional resources were directed toward the motor task and its auditory consequences. This may have reduced the salience of the background noise, thereby contributing to the weaker ERP responses in conditions involving transcription. Future studies could explore whether minimizing self-generated auditory input alters neural responses to environmental sounds.
Behavioral impact of novel sounds on task performance
Beyond the neural effects of spectral novelty, our results also revealed a significant behavioral impact. We observed that IKIs increased in response to novel sounds irrespective of experimental conditions, indicating that distraction effects extend beyond electrophysiological responses (Fig. 8). This result aligns with prior research showing that auditory events can momentarily disrupt ongoing cognitive-motor tasks (Conrad et al., 2012). The slowing of typing speed suggests that involuntary attentional capture, as reflected in the P3a component, translated into measurable performance decrements.
Notably, the behavioral disruption was present across conditions, further supporting the robustness of spectral novelty in capturing attention regardless of task engagement. This aligns with findings from workplace distraction studies, where unpredictable background sounds, such as sudden conversations or environmental noises, reduce productivity in cognitively demanding tasks (Kjellberg et al., 1996; Sexton and Helmreich, 2000; Conrad et al., 2012; Sonnleitner et al., 2014). The observed slowing of typing speed indicates that novel sounds impact task performance in the moments directly following the sound event. Whether this effect extends beyond the immediate keystroke remains to be investigated. Future research should explore whether the distraction persists over time or diminishes with continued exposure. Additionally, while our findings point to a general effect of novelty on behavior, it remains unclear whether varying levels of novelty intensity produce graded effects on task performance. Although this analysis was not feasible in the current dataset due to the limited number of keypress instances per participant, it represents a promising avenue for future investigation.
Individual differences in noise sensitivity and attentional modulation
While previous research has highlighted variability in how individuals respond to auditory distraction (Kjellberg et al., 1996; Shepherd et al., 2016), our analysis found no significant correlation between noise sensitivity (WNSS scores) and ERP amplitudes. This suggests that while noise sensitivity may influence subjective experiences of distraction, it does not necessarily translate to differences in early neural responses to background sounds. However, this does not preclude noise sensitivity from affecting later cognitive or behavioral stages of distraction processing.
One possibility is that noise sensitivity exerts its influence beyond early attentional capture, modulating higher-order cognitive and emotional responses to auditory distractions rather than automatic neural responses measured by ERPs. For instance, individuals with higher noise sensitivity may not show stronger P3a responses but may still perceive background noise as more disruptive, leading to greater cognitive fatigue, annoyance, or task disengagement over time.
It is also worth noting that our sample exhibited relatively average noise sensitivity scores, with no extreme outliers. This restricted variability may have limited the ability to detect significant correlations with ERP amplitudes. Future studies should aim to include individuals across a broader range of sensitivity levels to better assess whether more noise-sensitive individuals show distinct neural or behavioral response patterns.
Given this, future research should also explore whether noise sensitivity influences behavioral performance, subjective distraction ratings, or physiological measures (e.g., autonomic responses such as heart rate variability or skin conductance), which may better reflect individual differences in real-world auditory distraction. Additionally, non-linear effects should be considered, as highly noise-sensitive individuals may show disproportionate responses compared to those with lower sensitivity (Kliuchko et al., 2016).
Implications for real-world auditory attention research
Our study highlights the utility of a spectral novelty detection approach for identifying salient auditory events within a continuous, real-world soundscape. Unlike traditional paradigms that rely on pre-defined, isolated stimuli, spectral novelty is computed in relation to the surrounding acoustic context. This means that the algorithm dynamically evaluates whether a sound deviates from the local sound environment. A sound may be classified as novel in one context but not in another, depending on its spectral contrast with the preceding acoustic input. This context sensitivity allows for a more ecologically valid identification of attention-capturing events, as it mirrors the perceptual mechanisms by which human listeners extract meaningful signals from background noise.
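The idea of context-relative novelty can be made concrete with a small example. The Python sketch below is a simplified, generic spectral-novelty computation and is not the implementation used in the study. It assumes a magnitude spectrogram given as a list of frames, applies logarithmic compression, sums the positive frame-to-frame increases across frequency bins, and subtracts a local average so that a change only scores highly when it exceeds what is typical for its immediate surroundings. The parameter names `gamma` and `half_window` and their default values are assumptions.

```python
import math


def spectral_novelty(frames, gamma=1.0, half_window=1):
    """Context-relative spectral novelty for a magnitude spectrogram.

    frames: list of magnitude spectra (one list of non-negative floats per frame).
    Returns one novelty value per frame transition (len(frames) - 1 values).
    """
    # Logarithmic compression reduces the dominance of loud components.
    compressed = [[math.log(1.0 + gamma * m) for m in frame] for frame in frames]

    # Half-wave rectified spectral difference: only energy increases count.
    flux = []
    for prev, curr in zip(compressed, compressed[1:]):
        flux.append(sum(max(0.0, c - p) for p, c in zip(prev, curr)))

    # Subtract the local mean so novelty is judged against the nearby context.
    novelty = []
    for n in range(len(flux)):
        lo = max(0, n - half_window)
        hi = min(len(flux), n + half_window + 1)
        local_mean = sum(flux[lo:hi]) / (hi - lo)
        novelty.append(max(0.0, flux[n] - local_mean))
    return novelty
```

Because of the local-mean subtraction, the same spectral change yields a high value in a steady passage and a lower one in a passage that is already changing rapidly, which is the context sensitivity described above. Peak picking on the resulting curve, which would yield the event onsets, is omitted here.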
By leveraging this context-aware detection of acoustic change, we move beyond discrete stimulus presentations toward a more naturalistic framework for studying auditory attention. The three-phase model of distraction (Escera et al., 2000; Wetzel and Schröger, 2014; Getzmann et al., 2024), typically investigated in tightly controlled laboratory settings, can thus be extended to complex real-world environments. Here, attentional shifts and reorienting responses may be shaped by factors such as habituation, cognitive load, and environmental complexity (Woods and Elmasian, 1986; Lavie, 2005; Gygi and Shafiro, 2011; Brockhoff et al., 2023). The successful application of spectral novelty in ERP research not only enhances ecological validity but also offers a promising tool for investigating dynamic attention in everyday auditory scenes.
Limitations and future directions
While our study provides valuable insights into auditory attention in real-world settings, certain methodological aspects warrant consideration. First, although spectral novelty served as a robust marker of auditory salience, it does not capture other factors such as semantic relevance or emotional valence, which can also strongly influence attention allocation (Kjellberg et al., 1996; Lavie, 2005; Asutay and Västfjäll, 2012; Roye et al., 2013; Holtze et al., 2021; Debnath and Wetzel, 2022). Second, the fixed order of conditions may have introduced habituation effects, as participants were always exposed to the same sequence of listening modes. Counterbalancing condition order in future studies would help disentangle potential order effects from true condition-related differences.
Furthermore, while we observed clear P3a responses to acoustic novelty, the absence of strong condition effects suggests that our task manipulations may not have been sufficiently distinct to drive measurable differences in ERP amplitude. Refining the contrast between passive and active listening may help clarify how task demands shape auditory distraction. Finally, future studies should consider investigating individual variability in noise sensitivity, as subtle differences in attentional engagement may be masked in group-level analyses of P3a and RON components.
Conclusion
In conclusion, our study provides compelling evidence that spectral novelty serves as a reliable and ecologically valid trigger of attentional processing in naturalistic soundscapes. Across a large dataset and diverse listening contexts, we found that higher novelty consistently elicited strong P3a responses, demonstrating robust neural signatures of attentional capture even when participants were engaged in unrelated tasks. Importantly, this neural response was accompanied by measurable behavioral slowing, confirming that these sound events were not only registered by the brain but also disrupted ongoing performance.
These findings highlight the value of spectral novelty detection as a powerful tool for identifying cognitively relevant sound events in real-world environments, moving beyond traditional stimulus designs. The P3a emerged as a particularly stable marker, showing consistent morphology and amplitude across conditions, positioning it as a key component for studying auditory distraction outside the lab.
While listening mode did not strongly influence ERP amplitudes, this likely reflects the adaptive nature of auditory attention rather than a lack of engagement. Similarly, the absence of a correlation with noise sensitivity underscores the idea that neural responses to distraction are more influenced by moment-to-moment context than by trait-level sensitivity.
Altogether, our results underscore that the brain remains highly responsive to acoustic novelty in real-world settings, both neurally and behaviorally, and establish spectral novelty detection as a promising approach for future research on attention, cognition, and distraction in everyday life.
Footnotes
The authors declare no competing financial interests.
We thank Daniel Küppers for his kind help with the typing speed analysis. Furthermore, we thank all members of the Neurophysiology of Everyday Life group for their support and guidance. We would also like to thank Negar Dadkhah and Amrah Gasimli for their annotation of the soundscape and the Friedrich Ebert Foundation, which always provides helpful support to S.K. This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under the Emmy-Noether Program—BL 1591/1-1—Project ID 411333557.
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license, which permits unrestricted use, distribution and reproduction in any medium provided that the original work is properly attributed.