Research Article: New Research, Sensory and Motor Systems

Otoacoustic Emissions Evoked by the Time-Varying Harmonic Structure of Speech

Marina Saiz-Alía, Peter Miller and Tobias Reichenbach
eNeuro 25 February 2021, 8 (2) ENEURO.0428-20.2021; https://doi.org/10.1523/ENEURO.0428-20.2021
Author affiliation (all authors): Department of Bioengineering and Centre for Neurotechnology, Imperial College London, London SW7 2AZ, United Kingdom

Abstract

The human auditory system is exceptional at comprehending an individual speaker even in complex acoustic environments. Because the inner ear, or cochlea, possesses an active mechanism that can be controlled by subsequent neural processing centers through descending nerve fibers, it may already contribute to speech processing. The cochlear activity can be assessed by recording otoacoustic emissions (OAEs), but employing these emissions to assess speech processing in the cochlea is obstructed by the complexity of natural speech. Here, we develop a novel methodology to measure OAEs that are related to the time-varying harmonic structure of speech [speech-distortion-product OAEs (DPOAEs)]. We then employ the method to investigate the effect of selective attention on the speech-DPOAEs. We provide tentative evidence that the speech-DPOAEs are larger when the corresponding speech signal is attended than when it is ignored. Our development of speech-DPOAEs opens up a path to further investigations of the contribution of the cochlea to the processing of complex real-world signals.

Significance Statement

Real-world environments, such as a loud pub or restaurant, are often noisy. The detection of sound occurs in the inner ear, which also possesses an active mechanism to mechanically amplify sound vibrations. Because the active mechanism can be regulated by the nervous system, it may already play a part in analyzing complex acoustic scenes. However, investigations of these questions have been hindered by a lack of experimental tools to assess the inner ear’s activity in relation to speech processing. Here, we develop a method to record otoacoustic emissions (OAEs) that relate to the harmonic structure of speech, a key feature of its many voiced parts. We use this novel tool to provide tentative evidence that the inner ear contributes to selective attention.

Introduction

Humans have a remarkable ability to selectively listen to one of several competing speakers and understand them despite the interfering voices. The cognitive processes involved in these tasks are typically attributed to the auditory cortex that receives its inputs from lower-level neural processing centers. However, extensive feedback loops exist between the lower-level structures and the cortex. In particular, descending efferent nerve fibers transmit information from the auditory cortex to the superior olivary complex and to the cochlear nuclei, and through the olivocochlear bundle from the superior olivary complex to the inner ear (Pickles, 1988; Huffman and Henson, 1990; Winer et al., 1998).

The inner ear, or cochlea, detects sound vibrations by transducing them into electrical signals in the auditory nerve. However, the inner ear also aids the processing of sound. It spatially separates the different frequency components of a complex tone: high frequencies cause maximal vibration near the base of the organ and lower frequencies are detected at progressively more apical locations (Pickles, 1988; Robles and Ruggero, 2001; Reichenbach and Hudspeth, 2014). This spatial frequency decomposition is aided by an active process that mechanically amplifies weak signals, and thereby boosts the frequency selectivity. As a characteristic of the active process, the vibration amplitude at the peak location depends on the sound intensity in a compressively nonlinear manner.

The active process is mediated by the inner ear’s mechanosensitive outer hair cells. These cells are innervated by one of two types of the olivocochlear fibers, the medial ones. Activation of the medial olivocochlear (MOC) fibers can reduce the mechanical amplification provided by the outer hair cells (Guinan, 2006; Lopez-Poveda, 2018). Because each MOC fiber is tuned to a narrow frequency band, and because the innervation of the inner ear by these fibers displays a tonotopic arrangement, the reduction of cochlear amplification can potentially vary with frequency (Liberman and Brown, 1986; Brown, 1989; Lilaonitkul and Guinan, 2012). Computational models of the inner ear and efferent feedback have shown that the efferent feedback can contribute to speech processing through frequency-specific modulation of its mechanical activity (Messing et al., 2009; Clark et al., 2012). Experimental verification of such an effect remains, however, lacking.

The strength of the mechanical amplification in the cochlea, as well as its regulation through efferent feedback, can be assessed through distortion-product otoacoustic emissions (DPOAEs). A by-product of the active process’ compressive nonlinearity, DPOAEs are typically elicited by two pure tones of nearby frequencies f1 and f2, and emerge in particular at the cubic distortion product frequencies 2f1 – f2 and 2f2 – f1. By convention, the two primary frequencies are chosen such that f1 < f2. The cubic distortion product 2f1 – f2 is then below the two primary frequencies and is referred to as the lower-sideband distortion product. Conversely, the upper-sideband distortion product 2f2 – f1 is higher than f1 and f2.
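
For instance, with primary frequencies of $f_1 = 1$ kHz and $f_2 = 1.2$ kHz, the values used for the pure-tone measurements below, the two cubic distortion products fall at

$$2f_1 - f_2 = 0.8~\text{kHz} \quad \text{and} \quad 2f_2 - f_1 = 1.4~\text{kHz}.$$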

Because of efferent feedback, DPOAEs from one ear are suppressed when stimulating the other ear with noise [MOC reflex (MOCR); Guinan, 1996, 2006]. However, an investigation of the role of efferent feedback for speech processing has been hampered by the complexity of speech. As opposed to the pure tones used for DPOAE measurements, speech is a broad-band, time-varying, and non-stationary signal, which obstructs an assessment of cochlear responses.

An important feature of speech is its harmonic structure. Many parts of speech are voiced, that is, they arise from vibration of the vocal folds. The vibration occurs at a fundamental frequency, typically between 100 and 300 Hz, and the resulting speech signal is dominated by that frequency as well as its higher harmonics. Neural activity in subcortical areas phase locks to this harmonic structure (Galbraith et al., 1995; Russo et al., 2004; Skoe and Kraus, 2010), and we have recently shown that these neural responses are modulated by selective attention to speech (Forte et al., 2017; Etard et al., 2019). Importantly, the attentional modulation of the brainstem response only emerged systematically when measuring it in response to running speech, while responses to short stimuli such as single vowels yielded inconclusive results (Galbraith et al., 1998; Choi et al., 2013; Lehmann and Schönwiesner, 2014).

Because of efferent feedback, the cochlear activity related to the harmonic structure of speech may already be modulated to aid its processing. In particular, the cochlear activity at locations that do not correspond to the harmonics of a speech signal might be reduced, which could help to suppress background noise. Computational modeling shows that such location-dependent modulation of cochlear activity, which affects different frequency bands of the incoming sound differently, can indeed aid sound processing (Clark et al., 2012).

To investigate this issue, we develop a method to monitor the inner ear’s activity related to speech through DPOAEs that are matched to the harmonic structure (speech-DPOAEs). We then show that the speech-DPOAEs can be employed to investigate the role of the efferent feedback to the inner ear in selectively attending to one of two continuous speech signals, an ecologically highly relevant scenario.

Materials and Methods

Participants

A total of 24 healthy young volunteers (14 female, 10 male) aged between 18 and 26 years were recruited. All subjects were native English speakers and had no history of hearing or neurologic impairments. The experimental procedures were approved by the local Ethics Committee, and were performed in accordance with all relevant guidelines and regulations. Informed consent was obtained from all participants.

Test environment

All testing was conducted in a sound-proof and semi-anechoic room. A personal computer (PC) controlled the audio presentation and data acquisition. Experiments were automated and instructions were presented to subjects through the PC; when prompted, subjects submitted responses using a keyboard. Sound stimuli were presented at a sampling frequency of 44.1 kHz through a high-performance sound card (RME Fireface 802) and delivered by an extended-bandwidth otoacoustic measurement system (ER10X, Etymotic Research) through probes placed in both ears of a subject. Each probe contains a microphone and three speakers, which allow for the simultaneous presentation of acoustic stimuli and the measurement of the acoustic emissions generated by the inner ear. OAEs were recorded through this system from the right ear, at the same sampling rate.

Because OAEs are very faint signals and easily masked by other sounds, we presented the speech signals to the contralateral and not to the ipsilateral ear (Fig. 1E). We thereby tested whether attention to one of two speakers presented to the left ear would affect cochlear activity contralaterally.

Figure 1.

The waveforms used to elicit and detect speech-DPOAEs. A, B, The spectrogram of the voiced parts of a male speech signal (A) or a female speech signal (B) shows the harmonic structure, with a fundamental frequency and many higher harmonics (note that the colormap represents lower power as dark and higher power as white). C, D, The waveforms used to elicit and to detect the speech-DPOAEs to the male voice (C) and to the female voice (D). A, C, We measure speech-DPOAEs related to the male voice by constructing waveforms $w_9(t)$ (red line) and $w_{11}(t)$ (purple line) that oscillate at the 9th and 11th harmonics of the fundamental frequency of the speech signal, respectively. The lower-sideband speech-DPOAE then emerges at the 7th harmonic and is measured through cross-correlation with the corresponding waveform $w_7(t)$ (dashed red line). B, D, The speech-DPOAEs related to the female voice are elicited by waveforms $w_6(t)$ (red line) and $w_8(t)$ (purple line) that correspond to the 6th and 8th harmonics. The speech-DPOAE is found at the 4th harmonic; we measure it through the waveform $w_4(t)$ (dashed red line). E, In our experiment, we presented subjects with speech stimuli to the left ear. The speech stimuli were either a single voice or two competing voices, a male and a female one. Two waveforms $w_m(t)$ and $w_n(t)$ that were derived from one of the speech stimuli were presented to the right ear. The microphone signal $r(t)$ was recorded from the right ear as well, and the speech-DPOAE was derived from this recording.

Sound intensity was calibrated with an ear simulator (type 4157, Brüel & Kjær). The ER-10X instrument comes with a supply of single-use eartips that fit a variety of ear canal sizes. Precaution was taken to ensure the selection of an appropriate size for each participant and a correct fitting of the probe, as the eartip must seal the ear canal.

Generation and measurement of speech-DPOAEs

We sought to measure DPOAEs related to the temporal fine structure of speech. We thereby used continuous natural speech that we obtained by recording a male and a female speaker reading a story. The male voice had a fundamental frequency of 105 ± 6 Hz (mean ± SD), while the female voice had a fundamental frequency of 172 ± 10 Hz (mean ± SD). For a given voiced part of a speech signal, we therefore computed a fundamental waveform $w_1(t)$, that is, a temporal signal that varied at each time point t at the fundamental frequency $f_0(t)$ of the speech at that moment. The fundamental waveform was obtained by bandpass filtering the speech around the fundamental frequency. The latter was determined using the speech-analysis software package PRAAT (Boersma, 2002). The bandpass filter was a zero-phase, sixth-order IIR filter whose passband edges lay 0.5 SD below and above the mean fundamental frequency. For the non-voiced parts of a speech signal, the fundamental waveform was zero. We confirmed that the filter did not introduce any delays.
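
As an illustration, the extraction of the fundamental waveform can be sketched in a few lines of Python; the realization of the zero-phase, sixth-order filter as a third-order Butterworth bandpass run forward and backward is an assumption, and the input signal is a placeholder:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 44100                    # sampling rate of the stimuli, Hz
f0_mean, f0_sd = 105.0, 6.0   # fundamental frequency of the male voice, Hz (mean, SD)

# Passband edges 0.5 SD below and above the mean fundamental frequency.
# Filtering forward and backward (sosfiltfilt) doubles the filter order and
# cancels the phase delay, yielding a zero-phase, sixth-order bandpass.
edges = [f0_mean - 0.5 * f0_sd, f0_mean + 0.5 * f0_sd]
sos = butter(3, edges, btype="bandpass", fs=fs, output="sos")

speech = np.random.randn(2 * fs)   # placeholder for a 2-s speech recording
w1 = sosfiltfilt(sos, speech)      # fundamental waveform w_1(t)
```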

From the fundamental waveform we then obtained further waveforms whose instantaneous frequencies matched the higher harmonics of the fundamental frequency. In particular, we obtained the waveform $w_n(t)$ that corresponded to the nth harmonic of the fundamental frequency by applying a Hilbert-transform frequency shifter to the fundamental waveform $w_1(t)$ (Wardle, 1998). To this end, we first determined the analytic representation $\tilde{w}_1(t)$ of the fundamental waveform, using the Hilbert transform $\mathcal{H}$:
$$\tilde{w}_1(t) = w_1(t) + i\,\mathcal{H}[w_1(t)]. \tag{1}$$

We then used the analytic representation to obtain a waveform whose instantaneous frequency was n-fold that of the fundamental waveform:
$$w_n(t) = \Re\!\left[\frac{\tilde{w}_1(t)^n}{|\tilde{w}_1(t)|^{\,n-1}}\right]. \tag{2}$$
Writing $\tilde{w}_1(t) = A(t)e^{i\phi(t)}$ shows that $w_n(t) = A(t)\cos[n\phi(t)]$, which retains the envelope of the fundamental waveform while multiplying its instantaneous frequency by n.
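
The frequency shifter of Equations 1 and 2 translates directly into code; the following Python sketch assumes the fundamental waveform has already been computed, and the guard against division by zero in the unvoiced parts, where the fundamental waveform vanishes, is our own implementation detail:

```python
import numpy as np
from scipy.signal import hilbert

def harmonic_waveform(w1: np.ndarray, n: int) -> np.ndarray:
    """Waveform whose instantaneous frequency is n-fold that of w1 (Eqs. 1, 2)."""
    a = hilbert(w1)                     # analytic representation (Eq. 1)
    mag = np.abs(a)
    safe = np.where(mag > 0, mag, 1.0)  # unvoiced parts: w1 = 0, avoid 0/0
    # a = A*exp(i*phi), so a**n / |a|**(n-1) = A*exp(i*n*phi) (Eq. 2)
    return np.real(a**n / safe**(n - 1))
```

For the male voice, for example, the two eliciting waveforms would be harmonic_waveform(w1, 9) and harmonic_waveform(w1, 11), and the expected lower-sideband distortion waveform harmonic_waveform(w1, 7).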

We employed the waveforms that corresponded to the higher harmonics to elicit, as well as to measure, the speech-DPOAEs. We stimulated the ear of a subject with two waveforms $w_m(t)$ and $w_n(t)$ that tracked the mth and nth harmonics of the fundamental frequency, with m smaller but not much below n (Fig. 1E). The instantaneous frequencies of these two waveforms at time t were $m f_0(t)$ and $n f_0(t)$. They therefore elicited cubic distortion at the instantaneous frequencies $(2m - n)f_0(t)$ and $(2n - m)f_0(t)$, which corresponded to the frequency ranges of the waveforms $w_{2m-n}(t)$ and $w_{2n-m}(t)$.

This approach enabled us to assess cochlear activity at frequencies, and therefore at cochlear locations, that corresponded to the harmonic structure of speech. A traditional approach using constant primary frequencies f1 and f2, in contrast, would not have allowed us to track cochlear activity at the temporally varying harmonic structure.

However, regarding the analysis of the speech-DPOAE, the nonstationary nature of speech and of the waveforms that we derived from it meant that we could not employ power spectra to identify the emissions. Instead, we employed a cross-correlation approach in which we compared the microphone recording with the expected cubic distortion waveforms $w_{2m-n}(t)$ and $w_{2n-m}(t)$ that we likewise computed from the speech signals.

In our analysis, we focused on the lower-sideband speech-DPOAE at the instantaneous frequency $(2m - n)f_0(t)$, as the lower-sideband cubic DPOAE is the strongest in human ears (Probst et al., 1991). We measured this speech-DPOAE by cross-correlating the microphone recording $r(t)$ (Fig. 1) with the waveform $w_{2m-n}(t)$. Because the speech-DPOAE could have a phase shift with respect to the waveform $w_{2m-n}(t)$, we interpreted this cross-correlation as the real part of a complex cross-correlation $C(\tau)$ that depended on the delay τ. The imaginary part of this complex cross-correlation was computed as the cross-correlation of the microphone recording with the Hilbert transform of the waveform $w_{2m-n}(t)$. The complete complex cross-correlation thus followed as
$$C(\tau) = \frac{1}{N}\int r(t+\tau)\,\big\{ w_{2m-n}(t) + i\,\mathcal{H}[w_{2m-n}(t)] \big\}\,dt, \tag{3}$$
in which i denotes the imaginary unit and $\mathcal{H}[w_{2m-n}(t)]$ is the Hilbert transform of $w_{2m-n}(t)$. The normalization coefficient N is determined such that the autocorrelations at zero lag equal 1. The delay of the speech-DPOAE could then be obtained from the delay at which the amplitude of $C(\tau)$ peaked, and the phase shift followed from the phase of $C(\tau)$ at that latency. We had previously developed a similar procedure to detect the brainstem response at the fundamental waveform of speech at a particular delay and phase shift (Forte et al., 2017).
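
The detection step can be sketched as follows; the discrete, finite-lag form of Equation 3 and the circular shifts are simplifications, and the input arrays are placeholders:

```python
import numpy as np
from scipy.signal import hilbert

def complex_xcorr(r: np.ndarray, w: np.ndarray, max_lag: int) -> np.ndarray:
    """Complex cross-correlation C(tau) of Eq. 3 for lags -max_lag..max_lag samples."""
    wc = w + 1j * np.imag(hilbert(w))            # w + i*H[w]
    norm = np.sqrt(np.sum(r**2) * np.sum(w**2))  # autocorrelations at zero lag equal 1
    lags = np.arange(-max_lag, max_lag + 1)
    # circular shifts for brevity; edge effects are negligible for long recordings
    return np.array([np.dot(np.roll(r, -lag), wc) for lag in lags]) / norm

fs = 44100
r = np.random.randn(10 * fs)      # placeholder microphone recording r(t)
w = np.random.randn(10 * fs)      # placeholder distortion waveform w_{2m-n}(t)
C = complex_xcorr(r, w, max_lag=int(0.007 * fs))  # delays up to 7 ms
peak = np.argmax(np.abs(C))
delay_ms = (peak - int(0.007 * fs)) / fs * 1000   # latency of the emission
phase = np.angle(C[peak])                         # phase shift at that latency
```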

DPOAEs are strongest for a ratio of the two primary frequencies of ∼1.2 (Abdala, 1996). They are easiest to measure for primary frequencies of ∼1 kHz or higher (Probst et al., 1991). We followed these recommendations and chose harmonics in this range. This additionally allowed us to make comparisons between pure-tone DPOAEs and speech-DPOAEs. For the male voice, we therefore employed the 9th and 11th harmonics, that is, the waveforms $w_9(t)$ and $w_{11}(t)$ (Fig. 1A,C). The lower-sideband speech-DPOAE emerged accordingly in correlation with the waveform $w_7(t)$, and had a frequency of 735 ± 26 Hz (mean ± SD). For the female voice, we used the 6th and 8th harmonics, that is, the waveforms $w_6(t)$ and $w_8(t)$ (Fig. 1B,D). The lower-sideband speech-DPOAE was then at the 4th harmonic ($w_4(t)$), with a frequency of 692 ± 31 Hz (mean ± SD).

The different harmonics of the fundamental frequency that we employed for the male and for the female speech fell into two classes, resolved and unresolved. Because of the cochlea’s logarithmic mapping between spatial location and best frequency, only the lower harmonics are resolved. The upper limit for resolved harmonics is considered to be the 9th (Micheyl and Oxenham, 2007). The harmonics of the fundamental frequency of the female speaker were therefore more resolved than those of the male speaker.

Experimental design

We first measured pure-tone DPOAEs from the subject’s right ear. To this end, we employed primary frequencies of $f_1 = 1$ kHz and $f_2 = 1.2$ kHz that were presented at 60-dB SPL for a duration of 30 s.

We then measured speech-DPOAEs while subjects listened to speech both in isolation and in noise. To this end, we employed different speech segments, all of which lasted 2 min. Some speech segments consisted only of the male or of the female voice, while others had both voices mixed together. The speech-DPOAEs were always measured from the right ear. To increase the signal-to-noise ratio of the recorded OAEs, the speech stimuli were not applied to the right ear as well, but exclusively to the left ear. Because the MOCR pathway is crossed, stimulation of a given ear can also modulate the contralateral inner ear’s activity. Selective attention to one of two speakers heard in a given ear may similarly modulate the activity of the contralateral cochlea, and thus modulate the speech-DPOAEs recorded from that ear. The speech segments were presented at 60-dB SPL, and the waveforms to elicit the speech-DPOAEs at a somewhat lower intensity of 57-dB SPL. The intensities were chosen to avoid evoking the middle-ear muscle reflex, as well as to enable subjects to focus on the speech signals without being distracted by the speech-DPOAE measurement.

We first familiarized the subject with the speech-DPOAE measurement and with the attention task. To this end subjects were presented with a speech stimulus that contained both the male and the female voice. Speech-DPOAEs related to the male speech were simultaneously elicited in the contralateral ear. The subject was instructed to listen to the male speaker and to ignore the female speaker. To verify attention, the subject then answered three comprehension questions regarding the male speech. This was then repeated, but subjects were asked to attend the female speech, while speech-DPOAEs related to the female voice were elicited.

We then measured speech-DPOAEs to speech in isolation. To this end, we employed one speech stimulus that consisted of the male voice, as well as another stimulus that contained the female voice. Each stimulus was followed by three comprehension questions.

The potential influence of selective attention to speech on the speech-DPOAEs was then assessed. We employed 12 segments with competing speech, that is, segments that contained both the male and the female voice. During each segment the subject was asked to attend either the female or the male speaker. Speech-DPOAEs related to either the male or the female speaker were measured from the contralateral ear. The segment was then presented again. The same speech-DPOAEs were measured, but the subject was asked to attend the other speaker. The attended and ignored segments were therefore paired. In this way we obtained three recordings of speech-DPOAEs to the male voice, both when that voice was attended and when it was ignored. Analogously we measured speech-DPOAEs to the female voice for three speech segments, both when the female voice was attended and when it was ignored. After each speech segment the subject answered three comprehension questions. The order of the attentional focus, as well as the order in which speech-DPOAEs to the male and the female voice were measured, was determined randomly per subject.

Preprocessing and analysis of OAEs

The pure-tone DPOAEs were first analyzed using a power spectrum of the microphone recording. The noise floor of the recording was computed from the spectral amplitudes within 30–70 Hz on each side of the DPOAE frequency. A pure-tone distortion product was considered to be significant if the DPOAE amplitude was larger than the 95th percentile of the noise.
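
A minimal sketch of this spectral significance test, on a raw amplitude scale rather than calibrated dB SPL; the recording is a placeholder:

```python
import numpy as np

fs = 44100
mic = np.random.randn(30 * fs)   # placeholder for the 30-s microphone recording
f_dp = 800.0                     # lower sideband 2*f1 - f2 for f1 = 1 kHz, f2 = 1.2 kHz

spec = np.abs(np.fft.rfft(mic)) / len(mic)
freqs = np.fft.rfftfreq(len(mic), d=1 / fs)

# noise floor: spectral amplitudes 30-70 Hz on each side of the DPOAE frequency
noise = spec[(np.abs(freqs - f_dp) >= 30) & (np.abs(freqs - f_dp) <= 70)]
significant = spec[np.argmin(np.abs(freqs - f_dp))] > np.percentile(noise, 95)
```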

Second, the pure-tone DPOAEs were processed in a manner that was comparable to the speech-DPOAEs. For this purpose, a sinusoidal waveform at the frequency of the lower-sideband distortion product of 800 Hz was created. This waveform was then cross-correlated with the microphone recording following Equation 3. The envelope of the complex cross-correlation was smoothed with a moving-average filter of 199 samples. The noise level was computed by following the same procedure but using the unrelated nearby frequency of 900 Hz. A pure-tone DPOAE was considered to be significant if the peak correlation amplitude in the range 0–7 ms was larger than the 95th percentile of the noise.

For computing speech-DPOAEs, the first and last three seconds of each recording were removed to eliminate transient activity. The envelope of the complex cross-correlation (Eq. 3) was smoothed with a moving-average filter of 199 samples. To determine the equipment delay, the recording was also cross-correlated with the eliciting harmonics; the delay of the maximal correlation corresponded to the equipment delay. The recordings were compensated for this equipment delay.

The noise level of the speech-DPOAEs was determined as the 95th percentile of the amplitudes of the complex cross-correlation (Eq. 3) in the temporal regions of −750 to −70 ms and 70 to 750 ms. A speech-DPOAE was considered to be significant if the peak amplitude of the complex cross-correlation (Eq. 3) in the range of delays between 0 and 7 ms was larger than the noise level.
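
In code, this criterion might look as follows, with the cross-correlation amplitude and the lag axis as placeholders:

```python
import numpy as np

fs = 44100
lags_ms = np.arange(-int(0.75 * fs), int(0.75 * fs) + 1) / fs * 1000  # -750..750 ms
raw_amp = np.abs(np.random.randn(lags_ms.size))         # placeholder |C(tau)| (Eq. 3)
amp = np.convolve(raw_amp, np.ones(199) / 199, "same")  # moving average, 199 samples

noise_win = (np.abs(lags_ms) >= 70) & (np.abs(lags_ms) <= 750)
signal_win = (lags_ms >= 0) & (lags_ms <= 7)

noise_level = np.percentile(amp[noise_win], 95)
significant = amp[signal_win].max() > noise_level
```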

Comprehension scores

Speech comprehension was assessed through multiple-choice questions. The questions came in two formats: 60% of them had four possible answers, and the remaining 40% of questions had two possible answers. The comprehension score of a subject was computed as the proportion of correct answers to the questions posed during the selective attention task. Two participants did not score above the chance level.

Attentional modulation of the speech-DPOAEs

We analyzed the effect of selective attention on both the amplitude and the latency of the speech-DPOAEs. We performed this analysis for each of the three pairs of segments for which we recorded speech-DPOAEs related to the male voice, as well as for each of the three pairs of segments for which speech-DPOAEs related to the female voice were measured.

Denote by $a_M^{\mathrm{att}}$ the peak amplitude of the speech-DPOAE related to the male voice when that voice was attended, and by $a_M^{\mathrm{ign}}$ the peak amplitude when that voice was ignored. We then defined the relative attentional modulation of the speech-DPOAE related to the male voice as the difference between the two amplitudes, divided by the average amplitude:
$$A_M = \frac{a_M^{\mathrm{att}} - a_M^{\mathrm{ign}}}{\left(a_M^{\mathrm{att}} + a_M^{\mathrm{ign}}\right)/2}.$$
A positive relative attentional modulation signified a larger speech-DPOAE related to the male voice when it was attended, and a negative value implied a larger response when the male voice was ignored. Analogously, we defined the peak amplitudes $a_F^{\mathrm{att}}$ and $a_F^{\mathrm{ign}}$ of the speech-DPOAE related to the female voice when this voice was attended and ignored, respectively. These amplitudes yielded the relative attentional modulation of the speech-DPOAE related to the female voice, $A_F = \left(a_F^{\mathrm{att}} - a_F^{\mathrm{ign}}\right) / \left[\left(a_F^{\mathrm{att}} + a_F^{\mathrm{ign}}\right)/2\right]$.
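
A worked example with hypothetical peak amplitudes makes the coefficient concrete:

```python
def relative_modulation(a_att: float, a_ign: float) -> float:
    """Difference of attended and ignored peak amplitudes, divided by their mean."""
    return (a_att - a_ign) / ((a_att + a_ign) / 2)

# hypothetical peak amplitudes: attended 3.5e-4, ignored 3.3e-4
print(relative_modulation(3.5e-4, 3.3e-4))   # about 0.059, i.e. a ~6% modulation
```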

The relative attentional modulation of the speech-DPOAE related to the male voice was computed for all three corresponding recordings separately, that is, for the three pairs of recordings in which the participants once attended the corresponding speaker and once ignored them. We then averaged the obtained coefficients to obtain a single value per subject. The same procedure was employed for the relative attentional modulation of the speech-DPOAE related to the female voice.

To test the attentional modulation at the level of individual subjects, we split each of the recording segments into 10 consecutive intervals and computed the corresponding speech-DPOAEs. For each interval, we determined the amplitude of the speech-DPOAE at the latency of the peak amplitude of the corresponding segment. For each pair of intervals, with one interval corresponding to the task of attending the male voice and the other to the task of attending the female voice, we then computed the relative attentional modulation coefficient, either $A_F$ or $A_M$, depending on which speech-DPOAE was measured. We then performed one-sided t tests on all the obtained modulation coefficients $A_F$ from a given subject to determine whether the relative attentional modulation with respect to the female voice was significantly above zero. Analogously, we conducted one-sided t tests on the coefficients $A_M$ to establish their statistical significance at the level of individual subjects.
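
A sketch of the per-subject test, with placeholder coefficients (the `alternative` argument requires scipy 1.6 or later):

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(1)
A_F = rng.normal(0.06, 0.2, size=30)   # placeholder: interval-wise coefficients

# one-sided test: is the attentional modulation significantly above zero?
t_stat, p_value = ttest_1samp(A_F, popmean=0.0, alternative="greater")
```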

To assess the potential modulation of the latency through selective attention, we computed the difference in the latency of the speech-DPOAE when the corresponding speech was attended and when it was ignored. The difference was computed for each of the three corresponding stimuli separately, and the average of the differences was taken subsequently.

Exclusion of subjects and statistical analysis

We performed statistical tests regarding both the amplitudes and the latencies of the speech-DPOAEs, when speech was presented in isolation as well as when participants attended one of two competing speakers.

To assess the attentional modulation, we excluded data from two subjects who had non-significant pure-tone DPOAEs and from another two subjects whose answers to the speech comprehension questions did not exceed chance level.

Regarding the attentional modulation of the speech-DPOAEs related to the male voice, we excluded the data from four further subjects whose corresponding speech-DPOAEs were non-significant. With respect to the attentional modulation of speech-DPOAEs related to the female voice, data from three subjects whose corresponding speech-DPOAEs did not reach significance were excluded.

The amplitudes and latencies of the different speech-DPOAEs, as well as their changes with switching attention, were then checked for normality through the Kolmogorov–Smirnov test. Non-parametric tests were used for hypothesis testing, as the data were not normally distributed and the sample size was small. We further used a bootstrap analysis to estimate the sampling distributions of the mean attentional effects, to compute confidence intervals for estimation inference, and to test the stability of the results. In particular, we used random sampling with replacement for 10,000 resamples and performed bootstrap hypothesis tests for the mean. We used a significance level of 0.05.
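
A minimal version of the bootstrap, with a placeholder sample and a percentile-based one-sided p value as one possible form of the bootstrap hypothesis test:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(0.06, 0.15, size=20)   # placeholder per-subject modulation coefficients

boot_means = np.array([rng.choice(A, size=A.size, replace=True).mean()
                       for _ in range(10000)])
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])  # 95% confidence interval
p_one_sided = np.mean(boot_means <= 0)                    # mean significantly above zero?
```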

Data availability

The speech stimuli, the corresponding harmonic waveforms, as well as the recordings of the pure-tone and speech-DPOAE from all participants are available on figshare (https://doi.org/10.6084/m9.figshare.12738515). The repository also contains an example script for computing the speech-DPOAE as presented in Figure 2E.

Figure 2.

Measurement of speech-DPOAEs. A–D, Complex cross-correlations of the microphone recording of a representative subject with the stimulating waveforms for the male voice [$w_9(t)$ and $w_{11}(t)$], when the probe is placed inside the ear canal (A, C, respectively) and when it is held outside the ear (B, D, respectively). The data from the representative subject show that the complex cross-correlation of each stimulation waveform with the microphone recording peaks at 0 ms (blue: real part, red: imaginary part, black: amplitude). These peaks occur both when the probe is placed inside (A, C) and outside the ear canal (B, D). E, An OAE is measured by computing the complex cross-correlation between the microphone recording and the waveform $w_7(t)$ that corresponds to the lower-sideband distortion. We refer to this emission as a speech-DPOAE. The amplitude peaks at a latency of 2.2 ms (dashed line). F, The speech-DPOAE measured outside the ear canal. When the probe is placed outside the ear canal, the cross-correlation does not show a significant peak, demonstrating that no emission could be detected. G, Individual peak values of speech-DPOAEs for male and female speech in isolation. In most subjects the amplitude of the speech-DPOAE (darker bar) was significantly above the noise floor (lighter superimposed bar). The population average of the speech-DPOAE related to the male voice was significantly larger than that related to the female voice.

Results

We sought to measure OAEs that were related to the harmonic structure of continuous, non-repetitive speech, an ecologically relevant stimulus. We therefore devised a method to measure such speech-DPOAEs by eliciting distortion products from waveforms that tracked particular harmonics of the fundamental frequency of the speech signal (Fig. 1). The distortion then emerged at nearby harmonics.

We first verified the presence of a particular stimulation waveform in the microphone recording by cross-correlating the recording with that waveform. We found that we could thereby indeed measure the two waveforms that were used to stimulate the OAEs: each of them caused a peak in the corresponding cross-correlation, at a delay of 0 ms (Fig. 2A–D). These peaks emerged whether the probe for stimulating the ear and measuring the sound pressure was placed inside or outside the ear canal, since these signals were produced by the probe itself.

The speech-DPOAE was then measured analogously by cross-correlating the obtained microphone recording with the waveform of the harmonic that corresponded to a distortion product. The speech-DPOAE then emerged as a peak in that cross-correlation (Fig. 2E). As a control, this peak disappeared when the probe used to measure the speech-DPOAEs was placed near but outside the ear canal (Fig. 2F).

When subjects listened to a single speaker, we found that we could record significant speech-DPOAEs in all 24 subjects: all recordings except for one that was related to the female voice were significant (Fig. 2G). The amplitude of the speech-DPOAEs was 3.4e-4 ± 8e-5 (population average over male and female voices and standard error of the mean). The speech-DPOAE related to the male voice was, however, significantly larger than that related to the female voice (p = 0.0002; two-tailed two-sample Wilcoxon signed-rank test; Fig. 2G). It also had a larger variance (p = 0.02; Bartlett’s test). The latency of the speech-DPOAEs was 2.3 ± 0.2 ms (population average over male and female voices and standard error of the mean). It did not differ between the male and the female voice (p = 0.8; two-tailed two-sample Wilcoxon signed-rank test).

To compare the speech-DPOAEs to conventional OAEs, we also measured pure-tone DPOAEs (Fig. 3). We first analyzed the recordings by computing the power spectrum, which showed peaks at the DPOAE frequencies (Fig. 3A). Using this type of analysis, we found that the upper-sideband distortion product $2f_2 - f_1$ was measurable in 14 of the 24 subjects, while the lower-sideband distortion product $2f_1 - f_2$ could be detected in all subjects but two. The power spectrum of the upper-sideband distortion product $2f_2 - f_1$ was 2 ± 1-dB SPL/Hz (population average and standard error of the mean), and that of the lower-sideband distortion product $2f_1 - f_2$ reached −11 ± 1-dB SPL/Hz (population average and standard error of the mean).

Figure 3.

Relation of speech-DPOAEs to pure-tone DPOAEs. A, The power spectrum of the microphone recording in response to pure tones for a representative subject. Pure-tone DPOAEs were measured in response to the two primary frequencies $f_1 = 1$ kHz and $f_2 = 1.2$ kHz, and emerged at the cubic distortion frequencies $2f_1 - f_2$ and $2f_2 - f_1$. B, The cross-correlation of a sinusoid at the lower-sideband frequency $2f_1 - f_2$ with the microphone recording of the same subject shows an amplitude of about 5e-4 (upper panel), significantly higher than that obtained when the probe is placed outside the ear canal (lower panel). C, Comparison between the lower-sideband pure-tone DPOAEs analyzed through the two methods presented in A, B. The amplitude of the pure-tone DPOAEs obtained through the cross-correlation method (ordinate) correlated strongly, across subjects, with the amplitude obtained from the power spectrum (abscissa). D, Comparison between the pure-tone DPOAEs obtained through the power spectrum and the peak responses of the speech-DPOAEs. The amplitude of the speech-DPOAEs was strongly correlated, across subjects, with the amplitude of the lower-sideband DPOAE $2f_1 - f_2$ as well.

To relate the DPOAE measurement to the speech-DPOAEs, we also analyzed the lower-sideband distortion product $2f_1 - f_2$ using the cross-correlation method (Fig. 3B). We found that the cross-correlation of sinusoidal oscillations at the distortion frequency $2f_1 - f_2$ with the microphone recording yielded significant results in all but one subject. Moreover, the amplitude of the cross-correlation for this DPOAE was strongly related to its power spectrum across the different subjects (Pearson correlation coefficient r = 0.83, p = 2e-6; Fig. 3C). This strong correlation showed that a DPOAE power spectrum of 10 dB/Hz, for instance, corresponded to a cross-correlation amplitude of approximately 1e-3.

We further investigated the relation between the power spectrum of the distortion product $2f_1 - f_2$ and the amplitude of the speech-DPOAEs across the different participants (Fig. 3D). We found that these two measures exhibited a strong and significant correlation as well (Pearson correlation coefficient r = 0.65, p = 3e-6). As an example, a DPOAE power spectrum of 7-dB SPL/Hz corresponded on average to a speech-DPOAE of an amplitude of 1e-3.

Armed with the ability to monitor cochlear activity related to the harmonic structure of speech through the speech-DPOAEs, we then sought to employ them to investigate speech processing in the cochlea. We focused on an important aspect of speech-in-noise comprehension, namely selective attention to the target voice.

To this end, we presented subjects with both a male and a female voice in one ear, while measuring speech-DPOAEs from the contralateral ear. Subjects were instructed to sometimes attend the male and sometimes the female voice, and were asked comprehension questions regarding the target speech signal. The participants achieved a comprehension score of 80 ± 13% (population mean ± SD), demonstrating that they were able to maintain a high level of attention.

We then analyzed the magnitude of the speech-DPOAEs for each voice and how it was modulated by selective attention. We found that the speech-DPOAEs related to the female voice were larger when the subject attended the female speaker than when that voice was ignored (Fig. 4A). The relative attentional modulation of the speech-DPOAE related to the female voice, $A_F$, was 0.064 and was significantly greater than zero (p = 0.02, two-tailed one-sample Wilcoxon signed-rank test). The statistical significance remained when removing two outliers (p = 0.01, two-tailed one-sample Wilcoxon signed-rank test).

Figure 4.

Attentional modulation of speech-DPOAEs. Individual attentional modulations of the speech-DPOAEs related to the male and female voices (A; diamond markers represent outliers), and bootstrap distributions of the mean relative attentional modulation of the amplitude for the male voice (B) and the female voice (C). A, The relative attentional modulation of the speech-DPOAEs related to the male voice is not significantly different from zero. Speech-DPOAEs related to the female voice are, however, significantly larger when the female voice is attended than when it is ignored. B, C, The bootstrapping procedure confirms that the results are stable, and that the attentional modulation related to the female voice has a large intersubject variability (C).

To test the stability of the results and to derive an additional estimate of the mean relative attentional modulation, we performed a bootstrapping procedure (Fig. 4C). The 95% confidence interval for the mean attentional modulation ranged from 0.004 to 0.13. The bootstrapped one-sided p value of the estimated population mean was 0.03.

However, regarding the speech-DPOAEs related to the male voice, the relative attentional modulation $A_M$ was not significantly different from zero (p = 0.8, two-tailed one-sample Wilcoxon signed-rank test). The removal of the outliers yielded a similar p value (p = 0.7, two-tailed one-sample Wilcoxon signed-rank test). These results were confirmed by the bootstrapping, which yielded a 95% confidence interval for the mean modulation of [−0.15, 0.087] and no significant difference from zero (p = 0.65; Fig. 4B).

At the level of individual subjects, two subjects showed significant relative attentional modulations to the male voice ($A_M$), with p values of 2e-5 and 3e-4. One further subject showed a significant relative attentional modulation to the female voice ($A_F$), with a p value of 0.02. The remaining attentional modulation coefficients were statistically insignificant, with p values between 0.06 and 0.9.

Discussion

We developed a method to measure speech-DPOAEs, namely OAEs that were related to the harmonic structure of a speech signal. These OAEs were elicited by waveforms whose instantaneous frequency corresponded to that of particular harmonics of the fundamental frequency of the voiced parts of speech. They elicited distortion at other harmonics, and we measured these distortions by cross-correlating the microphone recording with the corresponding waveforms.

We compared the speech-DPOAEs to conventional pure-tone DPOAEs, and found that, across the different subjects, the amplitudes of the speech-DPOAEs were correlated to those of the pure-tone DPOAEs. Moreover, analyzing pure-tone DPOAEs in a manner that was comparable to the speech-DPOAEs showed that the amplitude of the speech-DPOAEs was comparable to that of the pure-tone DPOAEs. This suggests that the speech-DPOAEs and pure-tone DPOAEs have indeed a common origin in the cochlea.

Because the fundamental frequency of speech varies over time, the harmonics vary as well. The stimuli that we employed to elicit speech-DPOAEs, as well as the speech-DPOAEs themselves, were therefore not pure tones, but had a broader frequency spectrum. This enabled us to obtain the latency of the speech-DPOAEs, which provides further information on the origin of the emissions.

OAEs have been found to consist of two components, one with a long latency of many cycles and another with a short latency of at most a few cycles (Knight and Kemp, 2001; Robles and Ruggero, 2001; Bergevin et al., 2008). The two components may arise through different mechanisms by which the backward-traveling wave in the cochlea is generated (Shera and Zweig, 1991; Talmadge et al., 1999; Kalluri and Shera, 2001), or through different propagation mechanisms of the OAEs in the cochlea (Ren, 2004; He et al., 2007, 2008; Lutman et al., 2008; Reichenbach et al., 2012). The lower sideband of pure-tone DPOAEs is dominated by the long-latency component, while the upper sideband consists mainly of the short-latency component.

We found a latency of the speech-DPOAEs of only 2.3 ms, corresponding to ∼1.6 cycles. Although the speech-DPOAEs result from the lower-sideband distortion, their short delay reveals that they are dominated by the short-latency component. This deviation from the behavior of pure-tone DPOAEs may reflect the varying frequency of the speech-DPOAEs, which can introduce destructive interference in the long-latency component. Because the phase of the latter changes rapidly with frequency, variation in frequency can indeed lead to significant cancellation effects. The phase of the short-latency component, in contrast, barely depends on frequency, except for the phase changes associated with the varying primary frequencies themselves (Reichenbach and Hudspeth, 2014). The fundamental frequency of speech does not, however, vary greatly, which largely eliminates such cancellation for the short-latency component.

The speech-DPOAE related to the male voice had a larger amplitude than that related to the female voice. This behavior might reflect the different frequency ratios between the waveforms that we used to elicit the different speech-DPOAEs. Pure-tone DPOAEs have been found to be largest when the ratio $f_2/f_1$ of the primary frequencies is ∼1.2 (Probst et al., 1991). The harmonics that we used to elicit the speech-DPOAEs related to the male voice had a frequency ratio of 11/9 ≈ 1.2, while the frequency ratio for the waveforms that yielded the speech-DPOAEs related to the female voice was higher, 8/6 ≈ 1.33. The higher ratio likely led to the smaller amplitude of the speech-DPOAEs related to the female voice.

To investigate whether the speech-DPOAEs that we measured could in fact be used to investigate the effects of speech processing on the cochlea, we employed the speech-DPOAEs to study whether they were affected by selective attention to one of two competing voices. We found that the speech-DPOAEs related to the female voice were larger when the female voice was attended than when it was ignored. The speech-DPOAEs related to the male voice were not affected by attention.

We can speculate that the lack of a significant effect for the male voice might reflect the poorer resolvability of the targeted harmonics. The speech-DPOAEs that were related to the female voice tracked resolved harmonics. Their measurement therefore allowed us to test whether cochlear activity at the locations of the resolved harmonics was modulated by selective attention. In contrast, the speech-DPOAEs for the male voice were related to unresolved harmonics. Our observation of an attentional modulation of the speech-DPOAEs related to the female voice, but not of those related to the male voice, is therefore consistent with an attentional modulation of cochlear activity at the resolved but not the unresolved harmonics of speech. In particular, the cochlea appears to facilitate selective attention to a voice through a larger mechanical response at the locations of the resolved harmonics, but not at the unresolved harmonics. Because only the resolved harmonics can be differentiated in the cochlea, an attentional modulation that aims to reduce background noise can indeed only sensibly operate on the resolved and not on the unresolved harmonics.

The attentional effect on the amplitude of the speech-DPOAE related to the female voice that we observed was only 6.4% on average, corresponding to 0.54 dB. Although this effect is small, it is comparable to the MOC reflex that is elicited by broadband noise and changes DPOAE magnitudes between 0.5 and 2 dB (Chéry-Croze et al., 1993; Timpe-Syverson and Decker, 1999; Sun, 2008). Intermodal attention, such as between attending to an acoustic and to a visual signal, has been found to have similar or smaller effects on the DPOAE amplitude (Wittekindt et al., 2014; Beim et al., 2018).
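
The conversion between the two figures is worth making explicit: treating the 6.4% modulation as an amplitude ratio of 1.064 between the attended and ignored conditions gives

$$20 \log_{10}(1.064) \approx 0.54~\text{dB}.$$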

The bootstrap analysis of the data validated the stability of the attentional modulation effect. Nevertheless, the large confidence interval for the mean attentional effect suggests considerable uncertainty as well as a large interindividual variability, in line with recent observations on the modulation of cochlear function by selective attention to tones (Beim et al., 2018).

The attentional modulation of cochlear activity related to the harmonic structure of speech is reminiscent of the attentional effect on the brainstem response to voiced speech. As we have shown recently, the brainstem response at the fundamental frequency of continuous speech is modulated by selective attention (Forte et al., 2017; Etard et al., 2019; Saiz-Alía et al., 2019). In particular, the response is larger when the voice is attended than when it is ignored. Because of the nonlinearities in the inner ear and in the neural processing, the brainstem response at the fundamental frequency reflects cochlear activity at higher harmonics (Saiz-Alía and Reichenbach, 2020). The attentional modulation of the cochlear activity for which we have provided evidence here may contribute to the attention effect seen in the brainstem response.

DPOAEs and other types of OAEs have been employed previously to investigate how cochlear activity can be affected by selective attention, such as auditory versus visual attention, but have yielded inconclusive results that include both positive (Giard et al., 1994; Maison et al., 2001; de Boer and Thornton, 2007; Walsh et al., 2008; Harkrider and Bowers, 2009; Smith et al., 2012; Srinivasan et al., 2014; Wittekindt et al., 2014) and negative findings (Avan and Bonfils, 1992; Beim et al., 2018, 2019). Potential confounds in these measurements were that some of the stimuli used for eliciting the OAEs were irrelevant to the task, and that it is difficult to design attention tasks in different modalities that are balanced in perceptual load and working memory.

Our method of assessing the attentional modulation of the inner ear’s activity relates directly to natural speech processing. The task of selective attention to speech was naturalistic, with high ecological validity and a high perceptual load. These factors may have contributed to our positive finding regarding a modulation of speech-DPOAEs through selective attention. However, because of the small sample size, the lack of an effect for the male voice, and the large margin of error of the effect, the evidence for attentional modulation that we obtained still needs to be treated with caution. Moreover, the modulation by selective attention was significant at the level of individual subjects in only a few cases. While we believe that the speech-DPOAEs that we introduced open a promising path to studying selective attention to speech in the cochlea, further studies are required to replicate our findings, to firmly establish the attentional modulation, to investigate the impact of the ratio of the primary frequencies on the amplitude of the resulting speech-DPOAEs, and to investigate to which degree resolved versus unresolved harmonics lead to differences in the attentional modulation.

In conclusion, the speech-DPOAEs that we have developed here provide a novel tool to measure inner-ear activity related to the processing of naturalistic speech. In particular, they make it possible to assess aspects of speech processing such as selective attention in a manner that fosters sustained attention of a participant and avoids potential neural adaptation to repeated stimuli. We therefore expect speech-DPOAEs and related complex OAEs to become a useful tool in further exploring how the inner ear contributes to the processing of complex real-world acoustic signals. They may also be relevant for a better understanding and diagnosis of poorly understood hearing impairments such as cochlear neuropathy or speech-in-noise deficits.

Footnotes

  • The authors declare no competing financial interests.

  • Author contributions: M.S.-A. and T.R. designed research; M.S.-A. and P.M. performed research; M.S.-A., P.M., and T.R. analyzed data; M.S.-A. and T.R. wrote the paper.

  • This work was supported by the Royal British Legion Centre for Blast Injury Studies, La Caixa Foundation Grant LCF/BQ/EU15/10350044, and Engineering and Physical Sciences Research Council Grants EP/M026728/1 and EP/R032602/1.

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license, which permits unrestricted use, distribution and reproduction in any medium provided that the original work is properly attributed.

References

  1. Abdala C (1996) Distortion product otoacoustic emission (2f1−f2) amplitude as a function of f2/f1 frequency ratio and primary tone level separation in human adults and neonates. J Acoust Soc Am 100:3726–3740. doi:10.1121/1.417234
  2. Avan P, Bonfils P (1992) Analysis of possible interactions of an attentional task with cochlear micromechanics. Hear Res 57:269–275. doi:10.1016/0378-5955(92)90156-h
  3. Beim JA, Oxenham AJ, Wojtczak M (2018) Examining replicability of an otoacoustic measure of cochlear function during selective attention. J Acoust Soc Am 144:2882–2895. doi:10.1121/1.5079311
  4. Beim JA, Oxenham AJ, Wojtczak M (2019) No effects of attention or visual perceptual load on cochlear function, as measured with stimulus-frequency otoacoustic emissions. J Acoust Soc Am 146:1475–1491. doi:10.1121/1.5123391
  5. Bergevin C, Freeman DM, Saunders JC, Shera CA (2008) Otoacoustic emissions in humans, birds, lizards, and frogs: evidence for multiple generation mechanisms. J Comp Physiol A Neuroethol Sens Neural Behav Physiol 194:665–683. doi:10.1007/s00359-008-0338-y
  6. Boersma P (2002) Praat, a system for doing phonetics by computer. Glot Int 5:341–345.
  7. Brown MC (1989) Morphology and response properties of single olivocochlear fibers in the guinea pig. Hear Res 40:93–109. doi:10.1016/0378-5955(89)90103-2
  8. Chéry-Croze S, Moulin A, Collet L (1993) Effect of contralateral sound stimulation on the distortion product 2f1−f2 in humans: evidence of a frequency specificity. Hear Res 68:53–58. doi:10.1016/0378-5955(93)90064-8
  9. Choi I, Rajaram S, Varghese LA, Shinn-Cunningham BG (2013) Quantifying attentional modulation of auditory-evoked cortical responses from single-trial electroencephalography. Front Hum Neurosci 7:115. doi:10.3389/fnhum.2013.00115
  10. Clark NR, Brown GJ, Jürgens T, Meddis R (2012) A frequency-selective feedback model of auditory efferent suppression and its implications for the recognition of speech in noise. J Acoust Soc Am 132:1535–1541. doi:10.1121/1.4742745
  11. de Boer J, Thornton ARD (2007) Effect of subject task on contralateral suppression of click evoked otoacoustic emissions. Hear Res 233:117–123. doi:10.1016/j.heares.2007.08.002
  12. Etard O, Kegler M, Braiman C, Forte AE, Reichenbach T (2019) Decoding of selective attention to continuous speech from the human auditory brainstem response. Neuroimage 200:1–11. doi:10.1016/j.neuroimage.2019.06.029
  13. Forte AE, Etard O, Reichenbach T (2017) The human auditory brainstem response to running speech reveals a subcortical mechanism for selective attention. Elife 6:e27203. doi:10.7554/eLife.27203
  14. Galbraith GC, Arbagey PW, Branski R, Comerci N, Rector PM (1995) Intelligible speech encoded in the human brain stem frequency-following response. Neuroreport 6:2363–2367. doi:10.1097/00001756-199511270-00021
  15. Galbraith GC, Bhuta SM, Choate AK, Kitahara JM, Mullen TA (1998) Brain stem response to dichotic vowels during attention. Neuroreport 9:1889–1894.
  16. Giard MH, Collet L, Bouchet P, Pernier J (1994) Auditory selective attention in the human cochlea. Brain Res 633:353–356. doi:10.1016/0006-8993(94)91561-x
  17. Guinan JJ Jr (1996) Physiology of olivocochlear efferents. In: The cochlea. New York: Springer.
  18. Guinan JJ Jr (2006) Olivocochlear efferents: anatomy, physiology, function, and the measurement of efferent effects in humans. Ear Hear 27:589–607. doi:10.1097/01.aud.0000240507.83072.e7
  19. Harkrider AW, Bowers CD (2009) Evidence for a cortically mediated release from inhibition in the human cochlea. J Am Acad Audiol 20:208–215. doi:10.3766/jaaa.20.3.7
  20. He W, Nuttall AL, Ren T (2007) Two-tone distortion at different longitudinal locations on the basilar membrane. Hear Res 228:112–122. doi:10.1016/j.heares.2007.01.026
  21. He W, Fridberger A, Porsov E, Grosh K, Ren T (2008) Reverse wave propagation in the cochlea. Proc Natl Acad Sci USA 105:2729–2733. doi:10.1073/pnas.0708103105
  22. Huffman RF, Henson OW (1990) The descending auditory pathway and acousticomotor systems: connections with the inferior colliculus. Brain Res Brain Res Rev 15:295–323. doi:10.1016/0165-0173(90)90005-9
  23. Kalluri R, Shera CA (2001) Distortion-product source unmixing: a test of the two-mechanism model for DPOAE generation. J Acoust Soc Am 109:622–637. doi:10.1121/1.1334597
  24. Knight RD, Kemp DT (2001) Wave and place fixed DPOAE maps of the human ear. J Acoust Soc Am 109:1513–1525. doi:10.1121/1.1354197
  25. Lehmann A, Schönwiesner M (2014) Selective attention modulates human auditory brainstem responses: relative contributions of frequency and spatial cues. PLoS One 9:e85442. doi:10.1371/journal.pone.0085442
  26. Liberman MC, Brown MC (1986) Physiology and anatomy of single olivocochlear neurons in the cat. Hear Res 24:17–36. doi:10.1016/0378-5955(86)90003-1
  27. Lilaonitkul W, Guinan JJ (2012) Frequency tuning of medial-olivocochlear-efferent acoustic reflexes in humans as functions of probe frequency. J Neurophysiol 107:1598–1611. doi:10.1152/jn.00549.2011
  28. Lopez-Poveda EA (2018) Olivocochlear efferents in animals and humans: from anatomy to clinical relevance. Front Neurol 9:197. doi:10.3389/fneur.2018.00197
  29. Lutman ME, Davis AC, Ferguson MA (2008) Epidemiological evidence for the effectiveness of the noise at work regulations, RR669. Sudbury: HSE Books.
  30. Maison S, Micheyl C, Collet L (2001) Influence of focused auditory attention on cochlear activity in humans. Psychophysiology 38:35–40. doi:10.1111/1469-8986.3810035
  31. Messing DP, Delhorne L, Bruckert E, Braida LD, Ghitza O (2009) A non-linear efferent-inspired model of the auditory system; matching human confusions in stationary noise. Speech Commun 51:668–683. doi:10.1016/j.specom.2009.02.002
  32. Micheyl C, Oxenham AJ (2007) Across-frequency pitch discrimination interference between complex tones containing resolved harmonics. J Acoust Soc Am 121:1621–1631. doi:10.1121/1.2431334
  33. Pickles JO (1988) An introduction to the physiology of hearing. London: Academic Press.
  34. Probst R, Lonsbury-Martin BL, Martin GK (1991) A review of otoacoustic emissions. J Acoust Soc Am 89:2027–2067. doi:10.1121/1.400897
  35. Reichenbach T, Hudspeth AJ (2014) The physics of hearing: fluid mechanics and the active process of the inner ear. Rep Prog Phys 77:076601. doi:10.1088/0034-4885/77/7/076601
  36. Reichenbach T, Stefanovic A, Nin F, Hudspeth AJ (2012) Waves on Reissner’s membrane: a mechanism for the propagation of otoacoustic emissions from the cochlea. Cell Rep 1:374–384. doi:10.1016/j.celrep.2012.02.013
  37. Ren T (2004) Reverse propagation of sound in the gerbil cochlea. Nat Neurosci 7:333–334. doi:10.1038/nn1216
  38. Robles L, Ruggero MA (2001) Mechanics of the mammalian cochlea. Physiol Rev 81:1305–1352. doi:10.1152/physrev.2001.81.3.1305
  39. Russo N, Nicol T, Musacchia G, Kraus N (2004) Brainstem responses to speech syllables. Clin Neurophysiol 115:2021–2030. doi:10.1016/j.clinph.2004.04.003
  40. Saiz-Alía M, Reichenbach T (2020) Computational modeling of the auditory brainstem response to continuous speech. J Neural Eng 17:036035. doi:10.1088/1741-2552/ab970d
  41. Saiz-Alía M, Forte AE, Reichenbach T (2019) Individual differences in the attentional modulation of the human auditory brainstem response to speech inform on speech-in-noise deficits. Sci Rep 9:14131. doi:10.1038/s41598-019-50773-1
  42. Shera CA, Zweig G (1991) Reflection of retrograde waves within the cochlea and at the stapes. J Acoust Soc Am 89:1290–1305. doi:10.1121/1.400654
  43. Skoe E, Kraus N (2010) Auditory brain stem response to complex sounds: a tutorial. Ear Hear 31:302–324. doi:10.1097/AUD.0b013e3181cdb272
  44. Smith DW, Aouad RK, Keil A (2012) Cognitive task demands modulate the sensitivity of the human cochlea. Front Psychol 3:30. doi:10.3389/fpsyg.2012.00030
    OpenUrlCrossRefPubMed
  45. ↵
    Srinivasan S, Keil A, Stratis K, Osborne AF, Cerwonka C, Wong J, Rieger BL, Polcz V, Smith DW (2014) Interaural attention modulates outer hair cell function. Eur J Neurosci 40:3785–3792. doi:10.1111/ejn.12746 pmid:25302959
    OpenUrlCrossRefPubMed
  46. ↵
    Sun XM (2008) Contralateral suppression of distortion product otoacoustic emissions and the middle-ear muscle reflex in human ears. Hear Res 237:66–75. doi:10.1016/j.heares.2007.12.004 pmid:18258398
    OpenUrlCrossRefPubMed
  47. ↵
    Talmadge CL, Long GR, Tubis A, Dhar S (1999) Experimental confirmation of the two-source interference model for the fine structure of distortion product otoacoustic emissions. J Acoust Soc Am 105:275–292. doi:10.1121/1.424584 pmid:9921655
    OpenUrlCrossRefPubMed
  48. ↵
    Timpe-Syverson GK, Decker TN (1999) Attention effects on distortion-product otoacoustic emissions with contralateral speech stimuli. J Am Acad Audiol 10:371–378. pmid:10949941
    OpenUrlPubMed
  49. ↵
    Walsh WE, Dougherty B, Reisberg DJ, Applebaum EL, Shah C, O’Donnell P, Richter CP (2008) The importance of auricular prostheses for speech recognition. Arch Fac Plast Surg 10:321–328. doi:10.1001/archfaci.10.5.321
    OpenUrlCrossRefPubMed
  50. ↵
    Wardle S (1998) A Hilbert-transformer frequency shifter for audio. First Workshop on Digital Audio Effects DAFx, pp 25–29.
  51. ↵
    Winer JA, Larue DT, Diehl JJ, Hefti BJ (1998) Auditory cortical projections to the cat inferior colliculus. J Comp Neurol 400:147–174. doi:10.1002/(SICI)1096-9861(19981019)400:2<147::AID-CNE1>3.0.CO;2-9
    OpenUrlCrossRefPubMed
  52. ↵
    Wittekindt A, Kaiser J, Abel C (2014) Attentional modulation of the inner ear: a combined otoacoustic emission and EEG study. J Neurosci 34:9995–10002. doi:10.1523/JNEUROSCI.4861-13.2014 pmid:25057201
    OpenUrlAbstract/FREE Full Text

Synthesis

Reviewing Editor: Christine Portfors, Washington State University

Decisions are customarily a result of the Reviewing Editor and the peer reviewers coming together and discussing their recommendations until a consensus is reached. When revisions are invited, a fact-based synthesis statement explaining their decision and outlining what is needed to prepare a revision will be listed below. The following reviewer(s) agreed to reveal their identity: Alain de Cheveigné.

The reviewers agree that the manuscript presents a novel method of recording cochlear distortion that would be of interest to the community, provided that details are further clarified in a revised manuscript. The manuscript tests an interesting hypothesis: that selective attention-dependent processing extends to the cochlea. However, both reviewers raised significant concerns about the sample size, certain aspects of the results, and the interpretation of the data that need to be addressed. In particular, the reviewers were concerned about the fragility of the experiments; rigor and reproducibility should therefore be addressed. It was also suggested that one route in revising the manuscript may be to refocus it on the novel technique and place less emphasis on testing the specific hypothesis.

Reviewer 1:

The paper is written with obvious care; it targets a very interesting hypothesis (that selective attention-dependent processing extends to the cochlea), and the experiments are ingenious and for the most part seem well executed. However, some aspects need clarifying (see details below).

Isolating the phenomenon of interest, attention-dependent distortion product otoacoustic emissions (DPOAEs) produced by interactions between partials of natural speech in the presence of a second competing voice, is technically challenging. However, the nature of the challenges may not be immediately obvious to the reader: the paper would be improved by adding a discussion of the options and constraints. For example, the reader might wonder why you don’t just measure DPOAEs from speech. Or, having understood why it’s better to measure them from two partials, they might wonder why the speech is presented to the other ear. Answers may be pretty obvious, but the reader will appreciate some gentle guidance.

A critical assumption is that the level of DPOAEs between pairs of tones in one ear depends on attentional state to stimuli in the other ear. It would be nice to know more. Do DPOAEs even depend on whether or not there is contralateral stimulation with speech? On whether the stimulation is single-voice or double voice? If single-voice, on whether the primaries match that voice? It would be nice to have answers if any of your recordings allow them. Negative answers to any of these questions would call into question your main results. Conversely, asking them, and getting positive answers, would make your story more convincing.

Key methodological details are missing, in particular the critical step by which harmonic waveforms w_n(t) are derived from the filtered fundamental. The reference cited (Wardle 1998) mentions ‘frequency shifting’, but this is not what is needed here. Rather, you need scaling by a factor n. I have no idea how you do that. Also, I’m unsure as to whether the phase of w_n(t) is important (for estimation of the phase of the complex cross-correlation function), and if so, in what way the procedure ensures a proper value.
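For what it’s worth, one way to obtain such a scaling is via the analytic signal: extract the instantaneous phase of the filtered fundamental and multiply it by n. A minimal Python sketch, offered as a guess at what might be intended rather than a description of the authors’ procedure (the function name is illustrative):

```python
import numpy as np
from scipy.signal import hilbert

def synthesize_harmonic(fundamental, n):
    """Frequency-scale a band-limited fundamental waveform by an integer
    factor n, so the output tracks the n-th harmonic of the time-varying
    F0. This is phase *scaling*, not the fixed-offset frequency
    *shifting* of a Hilbert-transformer shifter (Wardle, 1998).

    Hypothetical sketch; not the authors' code."""
    analytic = hilbert(fundamental)            # analytic signal f(t) + i*H[f](t)
    envelope = np.abs(analytic)                # instantaneous amplitude
    phase = np.unwrap(np.angle(analytic))      # unwrapped instantaneous phase
    return envelope * np.cos(n * phase)        # carrier at n times the instantaneous frequency
```

If the phase of w_n(t) matters, note that this construction fixes it to n times the fundamental’s phase; any other convention would need to be imposed explicitly.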

Details:

l19: ‘speech-DPOAE’ might be better than ‘speech-OAE’

l65: ‘cubic distortions’ --> ‘cubic distortion frequencies’. Readers familiar with distortion products (probably not that many) might be comfortable with the formulas, but for others it might make things easier to write ‘f1 - (f2-f1)’ and ‘f2 + (f2-f1)’ and paraphrase in English (e.g. ‘cubic distortion products consist in a component at (f2-f1) below the lower primary and a component at (f2-f1) above the higher primary’).
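For reference, with primaries $f_1 < f_2$, the two cubic distortion products sit at

$$2f_1 - f_2 = f_1 - (f_2 - f_1) \quad\text{and}\quad 2f_2 - f_1 = f_2 + (f_2 - f_1),$$

that is, at a distance $(f_2 - f_1)$ below the lower primary and $(f_2 - f_1)$ above the higher primary, respectively.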

l85: not sure what is meant by ‘harmonic structure is modulated’

l99 and following: you first say that sound stimuli are presented to both ears, and then that speech is presented to the left ear. This is confusing because you haven’t introduced the DPOAE-inducing stimulus. Either do so earlier, or delay mentioning the speech signal to later.

l104: give more information about the acoustics. How are the probes constructed, were they sealed in the ear canal, etc.

l106: describe the speech stimulus in one place (i.e. move stuff from l140 etc. here).

Fig. 1: A, B: a bit confusing because it is common for high power to be represented as black rather than white. Harmonics appear aliased and don’t quite fit the colored lines. Panels C, D are very confusing (Are these cross-correlation functions? If so, the abscissa should be lag. What are we supposed to see?). Caption: describe each panel before telling us what is worth noting.

l111: this raises several potential issues. One is how does the filter track the time varying F0? Another is the latency of the filter (or at least its smoothing properties): might these induce a mismatch? A third is the fact that the filter is (presumably) time varying. A filter is normally assumed to be time-invariant, things are less straightforward if it varies. Perhaps add a sentence or two to reassure the reader. Would it make any sense to plot the cross-correlation between speech and filtered components?
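To make the concern concrete, one standard way to build such a tracking filter is heterodyning: mix the F0 component down to 0 Hz using the running phase of the F0 track, low-pass filter there, and mix back up. The sketch below is an assumption for illustration, not necessarily what the authors did:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def track_fundamental(speech, f0_track, fs, bw=40.0):
    """Extract a time-varying fundamental component by heterodyning.
    The low-pass filter is time-invariant in the demodulated domain,
    which sidesteps the usual caveats about time-varying filters, and
    filtfilt makes it zero-phase (no latency-induced mismatch).

    f0_track: instantaneous F0 estimate in Hz, one value per sample.
    Hypothetical sketch; not the authors' code."""
    phase = 2 * np.pi * np.cumsum(f0_track) / fs     # running phase of the F0 track
    demod = speech * np.exp(-1j * phase)             # F0 component now sits near 0 Hz
    b, a = butter(4, bw / (fs / 2))                  # low-pass with cutoff bw Hz
    base = filtfilt(b, a, demod.real) + 1j * filtfilt(b, a, demod.imag)
    return 2 * np.real(base * np.exp(1j * phase))    # remodulate to the time-varying F0
```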

l118: Tell us more about how the frequency shifter works. Do you really mean ‘transformer’?

l119: this paragraph is unclear. Not clear what is meant by ‘generate as well as measure’. Perhaps ‘generate’ --> ‘elicit’? l124 seems to imply that w_{2m-n}(t) is a distortion product waveform, but it seems instead that it is a harmonic of rank 2m-n that you synthesize so as to be able to reveal the distortion product by cross-correlation. Why that is a good idea is a bit mysterious. Elaborate!

l125 and elsewhere: consider replacing ‘sideband’ by ‘partial’ or ‘distortion product’. Sideband is widely (mis)used in the DPOAE literature, but its real meaning is the spectral consequence of modulating a sinusoidal carrier, which doesn’t fit this situation.

l128: Again, it is not clear to me why it is important to consider the relation between w_{2m-n}(t) and the DPOAE (phase and delay). What does it tell us?

Eq1: actual calculations are on samples. Would it be possible to convert Eq to involve sampled quantities (e.g. \sum instead of \int) without making things less clear?
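For example, assuming (for illustration only) that Eq. 1 is a normalized complex cross-correlation between the ear-canal recording $r(t)$ and the synthesized waveform $w_{2m-n}(t)$ over a window of $N$ samples, a sampled counterpart might read

$$C[\tau] \;=\; \frac{1}{N}\sum_{k=0}^{N-1} r[k]\, w^{*}_{2m-n}[k+\tau],$$

with $\tau$ an integer lag in samples and the normalization chosen to match whatever scaling the continuous form uses.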

l135: Why must C(0) be 1? It isn’t in Fig. 2 C & D (if those plots represent cross-correlation). Why would the imaginary part of C be zero at \tau=0?

l136: ‘amplitude’ --> ‘magnitude’? It would help if you gave a few more details.

l143: ‘best’ --> ‘easiest’?

l153: The boundary between resolved and unresolved depends on the criterion and is not the same in every study. It’s not all or nothing.

l174: ‘in quiet’ --> ‘in isolation’

l178: why 12?

l182: why 3? Reorganize the description.

l196: I don’t see how this is comparable to speech OAEs. What phase do you attribute to the 800 Hz tone?

l200: ‘peak amplitude’ --> ‘peak correlation’ (or ‘peak correlation magnitude’)?

l204, ‘To...’: perhaps move to the description of the equipment

l210: 95% of 95% of the noise?

l223, ‘amplitude’: do you mean peak value of correlation? If so, use that term.

l233: what are the ‘three corresponding stimuli’?

Fig2 Caption: describe each panel before telling us what is worth noting. The meaning of ‘saturated’ is obscure. Omit or define in the Methods.

l270: it’s unclear what to expect from placing the probe outside the ear canal, since you give no details in Methods. Are the transducer and microphone still coupled when the probe is outside the ear canal?

l280: units?

Fig3: C: axes have same label. DPOAE is a phenomenon, not a quantity. Units? ‘dB SPL’-->’dB SPL / Hz’ presumably. Caption: describe each panel before telling us what is worth noting.

l288: Is power density the right concept? For a line spectrum the density is infinite at each line, which means in practice that values depend on the size of the analysis window.

l308: then-->them

l311: Do speech-OAEs differ between with speech vs without speech in the other ear?

l357: this can’t be true as stated because phase is proportional to frequency for any sinusoid. I believe what you mean to say is that phase effects due to frequency-dependent phase variations affect a short latency response less than a long latency response. However, that too is incorrect: what counts is not the latency, but the duration of the impulse response, which you don’t discuss here. Please rewrite.

l362: why didn’t you use the same harmonics for male and female? Is there a reason to believe that higher frequencies would significantly reduce DPOAE amplitudes?

l371: Strictly speaking, a discussion of the difference between a condition that yields a significant effect and one that does not is not well founded. You need to test whether the between-condition difference of effect size is significant (interaction). Perhaps formulate as: “We can speculate that the lack of significant effects for male voices might reflect poor resolvability of the target harmonics...”
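Concretely, the interaction could be assessed with, e.g., a sign-flip permutation test on per-subject contrasts; a hypothetical Python sketch (the input arrays and their names are illustrative, not the authors’ data structures):

```python
import numpy as np

def interaction_permutation_test(female_diff, male_diff, n_perm=10000, seed=0):
    """Test whether the attention effect (attended minus ignored
    speech-DPOAE amplitude, one value per subject) differs between the
    female- and male-voice conditions, by randomly flipping the sign of
    each subject's between-voice contrast under the null of no interaction."""
    rng = np.random.default_rng(seed)
    contrast = np.asarray(female_diff) - np.asarray(male_diff)
    observed = contrast.mean()
    flips = rng.choice([-1.0, 1.0], size=(n_perm, contrast.size))
    null = (flips * contrast).mean(axis=1)            # null distribution of the mean contrast
    return (np.abs(null) >= np.abs(observed)).mean()  # two-sided permutation p-value
```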

l373: what does "attentional modulation of the resolved harmonics" mean?

l378, ‘the cochlea appears’: is this based on the present data or from the literature?

l387: what does ‘intermodal attention’ mean?

l389: in what way is this related to ‘temporal fine structure’? (or perhaps: in what way could it not be?) Reformulate to make the reasoning explicit.

l406: ‘related’ --> ‘relates’ or ‘is related’ ?

l410: ‘reduced’ --> ‘small’. ‘difference across conditions’ --> ‘lack of effect for male’

I’m a bit concerned by your statement that significance is ‘marginal’. On the one hand, this degree of self-scepticism is refreshing and commendable. On the other, it does seem to invoke the spectre of yet another non-replicable effect. Ideally you should recruit a new cohort of subjects and replicate, but that is perhaps difficult given the pandemic. My feeling is that the risk is worth taking if the methodology is solid and well explained.

References are incomplete (they lack volume and page numbers). Spell out all words unless perhaps standard journal abbreviations. Check journal style guide.

Reviewer 2:

The contents of the manuscript can be divided into two parts. The first part is the presentation of a novel method of recording cochlear distortion using analogues of harmonics of the fundamental frequency of human voice. The second part includes the use of this method to evaluate attentional modulation of cochlear distortion.

The premise of the paper is a solution to the heretofore unresolved question of the contribution of the cochlea to understanding speech in the presence of background or distracting signals. Many have used otoacoustic emissions in general, and efferent modulation of otoacoustic emission amplitude in particular, to examine whether central modulation of cochlear amplification is part of the solution enabling the understanding of speech in the presence of distracting signals. The authors claim that previous work has been ambiguous because cochlear emissions have been recorded using signals that do not resemble human speech.

The results suggest that cochlear distortion can be recorded using the novel method presented in the manuscript, that this distortion is correlated in magnitude to that measured using the traditional steady sinusoidal signals, and attending to the female but not the male voice leads to an increase in the amplitude of cochlear distortion. This last finding is explained by suggesting that the harmonics of the female voice perhaps engender a modulation effect as they are in the range of resolved harmonics, whereas the harmonics used for the male voice were not.

The authors do caution the reader about their limited sample size. Unfortunately, this is not enough in the opinion of this reviewer. Measures of efferent modulation of cochlear distortion are notoriously variable. For instance, Oxenham et al. have recently demonstrated that the direction of an effect observed in one sample of ears can be completely reversed in the next sample. At the very least, some attempt to demonstrate the stability of these results even within this subject pool would be needed. For example, could some form of bootstrapping be used to establish the reproducibility of the results?
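For example, a subject-level bootstrap of the attention effect would show how often the observed direction of the effect survives resampling within the present cohort; a minimal sketch (the input array is hypothetical):

```python
import numpy as np

def bootstrap_attention_effect(effect_per_subject, n_boot=10000, seed=0):
    """Resample subjects with replacement to gauge the stability of the
    group-level attention effect (attended minus ignored speech-DPOAE
    amplitude, one value per subject) within the present cohort."""
    rng = np.random.default_rng(seed)
    effects = np.asarray(effect_per_subject, dtype=float)
    idx = rng.integers(0, effects.size, size=(n_boot, effects.size))
    means = effects[idx].mean(axis=1)        # bootstrap distribution of the group mean
    ci = np.percentile(means, [2.5, 97.5])   # 95% percentile confidence interval
    return ci, (means > 0).mean()            # CI and fraction of resamples with a positive effect
```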

The claim that measuring cochlear distortion using a speech-like signal leads to substantially different outcomes can only be made if a direct comparison with results obtained using the traditional methods is presented within subjects.

Finally, the claim that the lack of attentional effects in the male voice was because of unresolved harmonics could be easily verified by lowering the order of harmonics and repeating the experiment.

Overall it is a well written manuscript with the novel method presented adding value to the literature. If the authors are willing to recast the manuscript to focus on the methods and attenuate the claims about attentional modulation of cochlear distortion, the paper would be accurate and useful.
