Introduction

A brief foray into the literature on audiovisual (AV) speech perception reveals a common rhetorical approach, in which authors begin with the general claim that visual information influences speech perception and cite two effects as evidence: the McGurk effect (McGurk & MacDonald, 1976) and the intelligibility benefit garnered by listeners when they can see speakers’ faces (Sumby & Pollack, 1954). (For examples of papers that begin this way, see Altieri et al., 2011; Anderson et al., 2009; Colin et al., 2005; Grant et al., 1998; Magnotti et al., 2015; Massaro et al., 1993; Nahorna et al., 2012; Norrix et al., 2007; Ronquest et al., 2010; Rosenblum et al., 1997; Ross et al., 2007; Saalasti et al., 2011; Sams et al., 1998; Sekiyama, 1997; Sekiyama et al., 2003; Strand et al., 2014; van Wassenhove et al., 2007.) Both effects have been replicated many times and unquestionably show the influence of visual input on speech perception.

It is often assumed, then, that these two phenomena arise from a common audiovisual integration mechanism. As a result, the McGurk effect (i.e., auditory misperception of a spoken syllable when it is presented with incongruent visual information) has often been used as a measure of auditory-visual integration in speech perception. Van Wassenhove et al. (2007), for example, define AV speech integration as having occurred when “a unitary integrated percept emerges as the result of the integration of clearly differing auditory and visual informational content,” and therefore use the McGurk illusion to “quantify the degree of integration that has taken place” (p. 598). Alsius et al. (2007) similarly define the degree of AV integration as “the prevalence of the McGurk effect” (p. 400). Moreover, a number of studies investigating AV speech perception in clinical populations (e.g., individuals with schizophrenia (Pearl et al., 2009), children with amblyopia (Burgmeier et al., 2015), and individuals with Asperger syndrome (Saalasti et al., 2011)) have used the McGurk effect as their primary dependent measure of AV speech processing.

Despite the popularity of this approach, it remains to be convincingly demonstrated that an individual’s susceptibility to the McGurk illusion relates to their ability to take advantage of visual information during everyday speech processing. There is evidence that McGurk susceptibility relates (weakly) to lip-reading ability under some task and scoring conditions (Strand et al., 2014), with better lip readers being slightly more susceptible to the McGurk effect. There is also evidence linking lip-reading ability to AV speech perception (Grant et al., 1998). The connection between McGurk susceptibility and AV speech perception was directly investigated in one study of older adults with acquired hearing loss (Grant & Seitz, 1998), with equivocal results: there was a correlation between McGurk susceptibility and visual enhancement for sentence recognition (r=.46), but McGurk susceptibility did not contribute significantly to a regression model predicting visual enhancement. The relationship between McGurk susceptibility and the use of visual information during speech perception, therefore, remains unclear. Here we present a within-subjects study of young adults with normal hearing in which we assess McGurk susceptibility and AV sentence recognition across a range of noise levels and types. If susceptibility to the McGurk effect reflects an AV integration process that is relevant to everyday speech comprehension in noise, then we expect listeners who are more susceptible to the illusion to show greater speech intelligibility gains when visual information is available for sentence recognition. If, on the other hand, different mechanisms mediate the use of auditory and visual information in the McGurk task and during everyday speech perception, then no such relationship is predicted. Such a finding would cast doubt on the utility of the McGurk task as a measure of AV speech perception.

Method

Participants

Thirty-nine healthy young adults (18–29 years; mean age = 21.03 years) were recruited from the Austin, TX, USA community. All participants were native speakers of American English and reported no history of speech, language, or hearing problems. Their hearing was screened to ensure thresholds ≤25 dB hearing level (HL) at 1,000, 2,000, and 4,000 Hz for each ear, and their vision was normal or corrected-to-normal. Participants were compensated in accordance with a protocol approved by the University of Texas Institutional Review Board.

McGurk task

Stimuli.

The stimuli were identical to those in Experiment 2 of Mallick et al. (2015). They consisted of two types of AV syllables: McGurk incongruent syllables (auditory /ba/ + visual /ga/) and congruent syllables (/ba/, /da/, and /ga/). The McGurk syllables were created using video recordings from eight native English speakers (four females, four males). A different female speaker recorded the three congruent syllables.

Procedure.

The task was administered using E-Prime 2.0 software (Schneider et al., 2002). Auditory stimuli were presented binaurally at a comfortable level using Sennheiser HD-280 Pro headphones, and visual stimuli were presented on a computer screen. A fixation cross was displayed for 500 ms prior to each stimulus. Following Mallick et al. (2015), participants were instructed to report the syllable they heard in each trial from the set /ba/, /da/, /ga/, and /tha/. The 11 stimuli (eight McGurk and three congruent) were each presented ten times. The presentation of these 110 stimuli was randomized and self-paced.

McGurk susceptibility.

Responses to the McGurk incongruent stimuli were used to measure listeners’ susceptibility to the McGurk effect. As in Mallick et al. (2015), responses of either /da/ or /tha/ were coded as McGurk fusion percepts.
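
For concreteness, a susceptibility score amounts to the proportion of fusion responses across a participant's 80 McGurk trials. The sketch below (in R, with simulated responses and hypothetical names) is illustrative only and is not the scoring code used in the study.

```r
# Illustrative scoring sketch with simulated data (not the study's actual code).
# One participant's responses to the 80 McGurk (incongruent) trials:
set.seed(1)
mcgurk_trials <- data.frame(response = sample(c("ba", "da", "ga", "tha"), 80, replace = TRUE))

fusion_responses <- c("da", "tha")  # responses coded as McGurk fusion percepts
susceptibility <- mean(mcgurk_trials$response %in% fusion_responses)  # proportion of fusion percepts
```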

Speech perception in noise task

Target speech stimuli.

A young adult male speaker of American English produced 80 simple sentences, each containing four keywords (e.g., The gray mouse ate the cheese) (Van Engen et al., 2012). Sentences were used for this task (rather than syllables) because our interest is in the relationship between McGurk susceptibility and the processing of running speech.

Maskers.

Two maskers, equated for RMS amplitude, were generated to create speech-in-noise stimuli: speech-shaped noise (SSN) filtered to match the long-term average spectrum of the target speech and two-talker babble consisting of two male voices. The two maskers were included to assess listeners’ ability to take advantage of visual cues in different types of challenging listening environments. SSN renders portions of the target speech signal inaudible to listeners (i.e., energetic masking), while two-talker babble can also interfere with target speech identification by creating confusion and/or distraction not accounted for by the physical properties of the speech and noise (i.e., informational masking). Visual information has been shown to be more helpful to listeners when the masker is composed of other voices (Helfer & Freyman, 2005).

Mixing targets and maskers.

The audio was detached from the video recording of each sentence and equalized for RMS amplitude using Praat (Boersma & Weenink, 2010). Each audio clip was mixed with each masker at five levels to create stimuli with the following signal-to-noise ratios (SNRs): −4 dB, −8 dB, −12 dB, −16 dB, and −20 dB. Each noise clip was 1 s longer than its corresponding sentence so that 500 ms of noise could be played before and after each target. These mixed audio clips served as the stimuli for the audio-only (AO) condition. The audio files were also reattached to the corresponding videos to create the stimuli for the AV condition. In total, there were 400 AO files and 400 corresponding AV files with the SSN masker (80 sentences × 5 SNRs), and 400 AO files and 400 corresponding AV files with the two-talker babble masker (80 sentences × 5 SNRs).
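
The stimuli themselves were prepared in Praat; purely as an illustration of the mixing step, the sketch below (in R, with hypothetical waveform vectors) shows one common way to scale a masker against an RMS-equalized target to reach a given SNR. The 500 ms of masker padding before and after each sentence is omitted for brevity.

```r
# Illustrative SNR-mixing sketch (the actual stimuli were created in Praat).
# 'target' and 'masker' are hypothetical numeric waveforms of equal length,
# already equated for RMS amplitude.
rms <- function(x) sqrt(mean(x^2))

mix_at_snr <- function(target, masker, snr_db) {
  # Scale the masker so that 20 * log10(rms(target) / rms(scaled masker)) = snr_db
  masker_scale <- rms(target) / (rms(masker) * 10^(snr_db / 20))
  target + masker * masker_scale
}

# Toy example at the five SNRs used in the experiment
target <- sin(2 * pi * 440 * seq(0, 1, by = 1 / 16000))
masker <- rnorm(length(target))
mixes  <- lapply(c(-4, -8, -12, -16, -20), function(snr) mix_at_snr(target, masker, snr))
```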

Design and procedure.

Masker type (SSN and two-talker babble), modality (AO and AV), and SNR (−4 dB, −8 dB, −12 dB, −16 dB, and −20 dB) were manipulated within subjects. Four target sentences were presented in each of the 20 resulting conditions (2 maskers × 2 modalities × 5 SNRs), for a total of 80 trials. Trials were randomized for each participant, and no sentence was repeated for a given participant.

Stimuli were presented to listeners at a comfortable level. Participants were instructed that they would be listening to AO and AV sentences in noise and were told that the target sentences would begin a half second after the noise onset. Participants initiated each stimulus presentation using the keyboard and were given unlimited time to respond. If they were unable to understand a sentence, they were asked to report any intelligible words and/or make their best guess; if they did not understand anything, they were told to type ‘X.’ For AO trials, a centered crosshair was presented on the screen during the audio stimulus; for AV trials, a full-screen video of the speaker was presented with the audio. Responses were scored by the number of keywords identified correctly. Homophones and obvious spelling errors were scored as correct; words with added or deleted morphemes were scored as incorrect.

Results

McGurk susceptibility

Figure 1 shows the distribution of McGurk susceptibility scores. As in Mallick et al. (2015), scores ranged from 0 % to 100 % and were skewed to the right.

Fig. 1 Distribution of McGurk susceptibility scores

Keyword identification in noise

Average keyword intelligibility across SNRs is shown in Fig. 2. The full set of identification data was first analyzed to determine whether McGurk susceptibility predicted keyword identification under any of the test conditions. Statistical analyses were performed using the lme4 package (version 1.1-12; Bates et al., 2015) in R (2016). The response to each keyword was categorized as correct or incorrect and analyzed using a mixed logit model for binomially distributed outcomes. Because comparing the noise types to one another was not of primary interest, separate analyses were conducted for the two noise types to simplify model interpretation. For each analysis, modality, SNR, McGurk scores, and their two- and three-way interactions were entered as fixed factors. Modality was deviation-coded (i.e., −0.5 and 0.5), so that its coefficient represents the “main effect” of modality (i.e., its partial effect when all other predictors are zero). Continuous predictors (SNR and McGurk scores) were centered and scaled. The models were fit with the maximal random effects structure justified by the experimental design (Barr et al., 2013). The model outputs for the fixed effects are shown in Tables 1 and 2.
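
As a rough sketch of the model structure described above (not the exact analysis code), the following R example uses simulated data with hypothetical column names; the random-effects specification shown is one plausible maximal structure rather than necessarily the precise one fitted.

```r
library(lme4)

# Simulated data with hypothetical column names (illustration only)
set.seed(1)
d <- expand.grid(subject = factor(1:10), sentence = factor(1:20),
                 modality = c("AO", "AV"), snr = c(-4, -8, -12, -16, -20))
d$mcgurk  <- ave(runif(nrow(d)), d$subject)   # per-subject McGurk susceptibility (simulated)
d$correct <- rbinom(nrow(d), 1, 0.5)          # simulated keyword accuracy (0/1)

d$modality_dev <- ifelse(d$modality == "AV", 0.5, -0.5)  # deviation coding
d$snr_c    <- as.numeric(scale(d$snr))                   # centered and scaled predictors
d$mcgurk_c <- as.numeric(scale(d$mcgurk))

# Mixed logit model with 2- and 3-way interactions as fixed effects and
# a plausible maximal random-effects structure
m <- glmer(correct ~ modality_dev * snr_c * mcgurk_c +
             (1 + modality_dev * snr_c | subject) +
             (1 + modality_dev | sentence),
           data = d, family = binomial)
summary(m)
```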

Fig. 2 Proportion of keywords identified across SNRs in speech-shaped noise (left) and two-talker babble (right). Error bars represent standard error

Table 1 Speech-shaped noise
Table 2 Two-talker babble

For both types of noise, modality, SNR, and their interaction were the only statistically significant predictors of keyword identification in noise. Neither McGurk susceptibility nor its interactions with SNR or modality significantly predicted keyword identification. In SSN, however, McGurk scores and their interaction with modality approached statistical significance (p=.10 and p=.07, respectively). As shown in Fig. 3, these trends reflect a negative association between McGurk susceptibility and keyword identification, particularly in the AV conditions.

Fig. 3 Relationship between McGurk susceptibility scores (scaled) and proportion of keywords identified in speech-shaped noise

Visual enhancement

One way researchers quantify individuals’ ability to use visual information during speech identification tasks is by calculating visual enhancement (VE), which takes the difference between a listener’s performance in AV and AO conditions and normalizes it by the proportion of improvement available given their AO performance (Grant & Seitz, 1998; Grant et al., 1998; Sommers et al., 2005):

$$ \mathrm{VE} = \frac{\mathrm{AV} - \mathrm{AO}}{1 - \mathrm{AO}} $$

Because AO performance must be below ceiling to calculate VE (i.e., there must be room for improvement), VE was calculated for the SNRs of −8 dB and below. (At least 10 % of the subjects were at ceiling in AO at −4 dB). For each listener, VE was calculated separately for two-talker babble and SSN. A positive VE score indicates that a listener identified more words in the AV condition than in the AO condition; the maximum VE score of 1 indicates that the listener identified all of the keywords in the AV condition.
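
For illustration with hypothetical scores: a listener who identified 40 % of keywords in the AO condition and 70 % in the AV condition would receive VE = (0.70 − 0.40)/(1 − 0.40) = 0.50; that is, they recovered half of the improvement that was available to them given their AO performance.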

Results are shown in Fig. 4. One outlier is not displayed because it would have required a significant extension of the y-axis. (In SSN at −8 dB, the individual identified ~94 % of AO words, but only 50 % of the AV words, resulting in a VE score of −7.) All other data are shown. Although it is not the focus of this study, it is worth noting that the VE data do not follow the principle of inverse effectiveness (Holmes, 2009), which predicts greater multisensory integration in more difficult conditions. If anything, the data for SSN suggest the opposite pattern: greater VE at easier SNRs. (See also Van Engen et al., 2014; Tye-Murray et al., 2010; and Ross et al., 2007, for other cases where this principle does not capture behavior in AV speech perception.)

Fig. 4 Visual enhancement (VE) scores by condition. The boxes extend from the 25th percentile to the 75th percentile, with the dark line indicating the median. Whiskers extend to data points that are within 1.5 times the interquartile range. Data outside that range are denoted by open circles. Note that one outlier is not displayed (a VE score of −7 in SSN at −8 dB)

Figure 5 displays the data for each condition with McGurk susceptibility scores plotted against VE scores. Due to the skewness of the McGurk data (see Fig. 1), Kendall’s tau was used to assess the relationship between individuals’ rate of McGurk responses and their VE scores. As shown in Table 3, there was no statistically significant association in any condition (p-values ranged from .06 to .85; tau values ranged from −.21 to .06).
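
As a minimal sketch of this correlation test in R, using hypothetical per-subject scores for a single condition (the actual values are not reproduced here):

```r
# Hypothetical per-subject scores for one condition (illustration only)
mcgurk_scores <- c(0.00, 0.10, 0.25, 0.40, 0.55, 0.70, 0.85, 1.00)
ve_scores     <- c(0.30, 0.45, 0.20, 0.35, 0.15, 0.25, 0.10, 0.05)

# Kendall's tau is rank-based, so it is robust to the skew in the McGurk scores
cor.test(mcgurk_scores, ve_scores, method = "kendall")
```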

Fig. 5 Visual enhancement plotted against McGurk susceptibility for each listening condition. Top row: two-talker babble conditions. Bottom row: speech-shaped noise conditions. Linear trendlines are included, with shading representing standard error. For the sake of axis consistency, one data point for SSN at −8 dB is not displayed (McGurk score=.86; VE score = −7), although it is included in the calculation of the trendline

Table 3 Results of correlation tests (Kendall's tau) for the relationship between McGurk susceptibility and visual enhancement (VE)

In SSN at −8 dB and in two-talker babble at −16 dB, this correlation came closest to statistical significance (p=.190 and p=.063, respectively), with negative tau values, indicating that any relationship between VE and McGurk susceptibility is one in which greater susceptibility to the illusion is associated with less VE. In both the analysis of keyword identification and the analysis of VE, then, no association between McGurk susceptibility and speech perception in noise was significant at the p < .05 level, and the trends that approached significance suggested that greater susceptibility to the illusion was associated with lower rates of keyword identification and less visual enhancement.

Discussion

The results of this experiment showed no statistically significant relationship between an individual’s susceptibility to the McGurk effect and their ability to understand speech in noise (with or without visual information). Importantly, the VE analyses also showed no significant relationship between an individual’s susceptibility to the McGurk illusion and their ability to use visual cues in order to improve on audio-only speech perception. Research that is fundamentally interested in understanding audiovisual integration as it relates to a listener’s ability to understand connected speech, therefore, should not assume that susceptibility to the McGurk effect can be used as a valid measure of AV speech processing.

Although the McGurk effect and the benefit of visual information both show that visual information affects auditory processing, there are several critical differences between these phenomena. One noteworthy difference is the congruency of the auditory and visual information in these situations. In the McGurk effect, auditory identification of individual consonants is altered by conflicting visual information; visual information can be thought of, therefore, as detrimental to correct auditory perception. In AO versus AV speech in noise, congruent visual information facilitates auditory perception. At least one recent neuroimaging study supports the hypothesis that different neural mechanisms may mediate the integration of congruent versus incongruent visual information with auditory signals. Using functional magnetic resonance imaging (fMRI), Erickson et al. (2014) showed that distinct posterior superior temporal regions are involved in processing congruent AV speech and incongruent AV speech when compared to unimodal speech (acoustic-only and visual-only). Left posterior superior temporal sulcus (pSTS) was recruited during congruent bimodal AV speech. In contrast, left posterior superior temporal gyrus (pSTG) was recruited when processing McGurk speech, suggesting that left pSTG may be necessary when there is a discrepancy between auditory and visual cues. It may be that left pSTG is involved in the generation of the fused percept that can arise from conflicting cues.

Another critical difference between McGurk tasks and sentence-in-noise tasks is the amount of linguistic context available to listeners. The top-down and bottom-up processes involved in identifying speech vary significantly with listeners’ access to lexical, syntactic, and semantic context (Mattys et al., 2005), and the availability of rhythmic information in running speech allows for neural entrainment and predictive processing that is not possible when identifying isolated syllables (Peelle & Davis, 2012; Peelle & Sommers, 2015). In keeping with these observations, previous studies have failed to show a relationship between AV integration measures derived from consonant versus sentence recognition (Grant & Seitz, 1998; Sommers et al., 2005). Grant and Seitz (1998) showed no correlation between the two in older adults with hearing impairment, and Sommers et al. (2005) found that, while VE for word and sentence identification were related to one another for young adults, VE for consonant identification was not related to either. Given that other researchers have shown significant relationships between consonant identification and words or sentences in unimodal conditions (Grant et al., 1998; Humes et al., 1994; Sommers et al. 2005), these results suggest that the lack of relationship between VE for consonants and VE for words and sentences results from differences in the mechanisms mediating the integration of auditory and visual inputs for these different types of speech materials.

The results reported here serve two purposes: First, as a caution against the assumption that the McGurk effect can be used as an assay of audiovisual integration for speech in challenging listening conditions; and second, to set the scene for future work investigating the potentially different mechanisms supporting the integration of auditory and visual information across different types of speech materials.