It is well established that speech production critically involves sensory systems. Evidence for this assertion comes primarily from various demonstrations of the modulatory or disruptive effect of altered auditory feedback on speech production, including late-onset deafness (Waldstein, 1989), delayed feedback (Stuart, Kalinowski, Rastatter, & Lynch, 2002; Yates, 1963), pitch-shifted feedback (Burnett, Senner, & Larson, 1997), and formant-shifted feedback (J. F. Houde & Jordan, 1998), and from similar demonstrations in the somatosensory domain (Tremblay, Shiller, & Ostry, 2003).

It has been hypothesized that forward predictive coding is a mechanism for this involvement (Guenther, Hampson, & Johnson, 1998; Hickok, 2012a; Hickok, Houde, & Rong, 2011; J. Houde & Nagarajan, 2011; Perkell, 2012). Forward predictive coding (see also related terms such as forward model, internal model, efference copy, and corollary discharge) refers to the idea that motor plans or commands lead to predictions of the sensory consequences of those plans/commands via motor-to-sensory neural projections, which in turn serve to facilitate error detection and correction in motor control (Desmurget & Grafton, 2000; Kawato, 1999; Wolpert, 1997; Wolpert, Ghahramani, & Jordan, 1995).
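
The logic of this mechanism can be made concrete with a toy computation. The sketch below is purely illustrative: the mapping functions, gain, and numbers are invented for exposition and are not a model of the actual neural computation.

# Toy illustration of forward predictive coding in a feedback control loop.
# A motor command is issued; an internal forward model predicts its sensory
# consequence; the prediction is compared with actual feedback, and any
# mismatch (prediction error) can drive a corrective adjustment.

def forward_model(motor_command):
    # Hypothetical internal mapping from an articulatory command to a
    # predicted acoustic consequence (e.g., a formant value in Hz).
    return 500.0 + 120.0 * motor_command

def plant(motor_command, perturbation=0.0):
    # The actual vocal tract ("plant"); a feedback perturbation (e.g., an
    # experimentally imposed formant shift) can be added here.
    return 500.0 + 120.0 * motor_command + perturbation

command = 2.0
predicted = forward_model(command)                  # expected feedback
actual = plant(command, perturbation=30.0)          # altered feedback
error = actual - predicted                          # mismatch signal
corrected_command = command - 0.5 * error / 120.0   # partial compensation

print(f"prediction error: {error:.1f} Hz, corrected command: {corrected_command:.2f}")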

Neural evidence for the existence of a forward predictive coding mechanism in speech motor control has come from experiments showing either a suppression of the auditory response, measured using magnetoencephalography (MEG), to self-produced speech compared to externally delivered speech (Aliu, Houde, & Nagarajan, 2009; Curio, Neuloh, Numminen, Jousmaki, & Hari, 2000; Heinks-Maldonado, Nagarajan, & Houde, 2006; J. F. Houde, Nagarajan, Sekihara, & Merzenich, 2002; Numminen, Salmelin, & Hari, 1999) or an enhanced response, measured using functional magnetic resonance imaging (fMRI), to self-produced altered versus unaltered speech (Tourville, Reilly, & Guenther, 2008). A complication in interpreting the suppression data is that it is difficult to precisely control the acoustic stimulus in self- versus externally produced speech. This is less of a concern in experiments that compare unaltered (predicted) versus altered (unpredicted) speech, but the observed localization of this effect did not fall squarely in auditory cortex; rather, it fell in what appears to be an auditory-motor integration area, area Spt (Hickok, 2012a; Hickok, Houde, & Rong, 2011). This does not challenge the claim for the existence of a forward predictive code, but it does raise questions about the role of auditory cortex proper in this system.

Here we use a very simple fMRI paradigm to provide straightforward evidence for the existence of a forward predictive coding mechanism in speech production and to home in on the level(s) of motor planning that give(s) rise to it. We asked participants to read a list of words and either to imagine speaking them without overtly articulating or to overtly articulate the words without phonating (see Fig. 1). Thus, both conditions are matched for acoustic input (i.e., no speech input). Previous behavioral research has shown that these two tasks engage different levels of linguistic/motor planning: imagined speech engages lexical-level processes but not lower level phonological processes, whereas silently articulated speech engages both levels of processing (Oppenheim & Dell, 2008, 2010). We reasoned that engaging motor-phonological processes should generate a forward prediction of the acoustic consequences of the executed (silent) speech, whereas engaging lexical-level processes should not. If true, we should see differential activation in auditory cortex for silently articulated speech compared to imagined speech, despite the fact that neither condition involves any speech input.

Fig. 1

Example of a single trial. Subjects were presented with a tongue twister sequence that remained on screen for 3 seconds, followed by a cue to either articulate or imagine the sequence. They then recited each word in sync with the visual metronome. (Color figure online)

Method

Subjects

Twenty-four participants (15 females) between 18 and 40 years of age were recruited from the University of California, Irvine (UCI) community and received monetary compensation for their time. The volunteers were right-handed, native English speakers with normal or corrected-to-normal vision, no known history of neurological disease, and no other contraindications for MRI. Informed consent was obtained from each participant prior to the study in accordance with guidelines from the UCI Institutional Review Board, which approved this study. Four subjects were omitted from data analysis: two had excessive motion (>3° of movement), one reported excessive errors on the task (>50% error rate), and one was excluded because the response box failed to log responses.

Stimuli and task

fMRI was used to monitor blood oxygen level dependent (BOLD) signal changes elicited by reciting tongue twisters. Tongue twisters increase the probability of speech errors and therefore the load on error detection systems. The experiment was closely modeled on previous behavioral studies using these stimuli (Oppenheim & Dell, 2008, 2010): participants were scanned while they recited a set of four words (e.g., lean reed reef leach) in sync with a visual metronome. Two speech production conditions were included, one in which speech was articulated without phonating (silent articulation) and one in which speech production was imagined without articulation (imagined). On each trial, a tongue twister phrase was visually presented on screen for 3 seconds, and then subjects were cued either to silently articulate the sequence or to imagine saying the sequence without mouth movements (see Fig. 1). The cue was a cartoon face that remained on screen for 500 ms, with a red arrow pointing either to the head or to the lips. An arrow pointing to the head cued participants to imagine saying the words, and an arrow pointing to the lips cued them to silently articulate the words. A red fixation point appeared on screen 500 ms after cue offset; it served as the visual metronome and flashed at a rate of 2/s. Participants recited one word per flash in sync with the metronome. Following Oppenheim and Dell (2010), after recitation, participants indicated with a button press whether or not they had correctly produced the sequence.
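
For concreteness, the within-trial event sequence can be laid out as a simple schedule. This is a minimal sketch of the timing described above; the function and variable names are our own and are not part of the Cogent presentation script used in the study.

# Sketch of the event schedule for a single trial: 3-s tongue twister display,
# 500-ms cue (articulate or imagine), 500-ms gap, then a red fixation flashing
# at 2 Hz that paces the four words, followed by the accuracy button press.

def trial_schedule(words, cue):
    events = []
    t = 0.0
    events.append((t, 3.0, "display tongue twister: " + " ".join(words)))
    t += 3.0
    events.append((t, 0.5, f"cue ({cue}): cartoon face with arrow"))
    t += 0.5 + 0.5                      # 500-ms cue plus 500-ms gap
    for word in words:                  # metronome flashes at 2/s
        events.append((t, 0.5, f"fixation flash; recite '{word}'"))
        t += 0.5
    events.append((t, None, "button press: correct / error"))
    return events

for onset, dur, label in trial_schedule(["lean", "reed", "reef", "leach"], "articulate"):
    print(f"{onset:4.1f} s  {label}")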

A total of 32 sets of tongue twisters were used in the study. Although not of primary interest here, the word lists varied in lexicality. Lexical bias refers to the tendency for word errors to create a real word instead of a nonword (e.g., a slip from the target reef to leaf is more likely than a slip from the target wreath to leath, because leath is a nonword). The stimuli were designed so that if an error occurred on the third word of each sequence, the outcome would be either a real-word error (lean would induce the error leaf instead of reef) or a nonword error (lean would induce the error leath instead of wreath). The specific metrics of the stimuli are described elsewhere (Oppenheim & Dell, 2010).

A single trial was 8 seconds in length, and there were 36 trials in each session. Each session consisted of 16 silent articulation trials and 16 imagining trials, presented in random order along with four rest (fixation) trials, and the experiment consisted of eight such sessions. The study started with a short practice session of 10 trials to familiarize subjects with the task; subjects were scanned during the practice session to acclimatize them to the fMRI environment. The study ended with a high-resolution structural scan, and the entire experiment was 1 hour in length. Stimulus presentation and timing were controlled using Cogent software (http://www.vislab.ucl.ac.uk/cogent_2000.php) implemented in MATLAB 7.1 (MathWorks, Inc., USA) running on a dual-core IBM ThinkPad laptop.
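
The session composition described above can be illustrated with a short randomization sketch. This is an illustration under our own naming assumptions, not the actual Cogent/MATLAB code used in the study.

import random

# One session: 16 silent-articulation trials, 16 imagining trials, and
# 4 rest (fixation) trials, presented in random order (36 trials total).
def build_session(seed=None):
    trials = ["articulate"] * 16 + ["imagine"] * 16 + ["rest"] * 4
    rng = random.Random(seed)
    rng.shuffle(trials)
    return trials

session = build_session(seed=1)
assert len(session) == 36
print(session[:10])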

Imaging

MR images were obtained on a Philips Achieva 3T scanner (Philips Medical Systems, Andover, MA) fitted with an eight-channel RF receiver head coil at the Research Imaging Center scanning facility at the University of California, Irvine. Images during the experimental sessions were collected using Fast Echo EPI (SENSE reduction factor = 2.0, matrix = 112 × 112, TR = 2.0 s, TE = 25 ms, voxel size = 2.5 × 2.5 × 2.5 mm). A total of 1,152 echo planar imaging (EPI) volumes were collected over the eight sessions, with 41 slices providing whole-brain coverage. After the functional scans, a high-resolution T1-weighted anatomical image was acquired with an MPRAGE pulse sequence in the axial plane (matrix = 256 × 256, TR = 8 ms, TE = 3.6 ms, flip angle = 8°, voxel size = 1 × 1 × 1 mm).

Data analysis

Data preprocessing and analyses were performed using AFNI software (Cox, 1996). First, motion correction was performed by creating a mean image from all of the volumes in the experiment and then realigning all volumes to that mean image using a six-parameter rigid-body model (Cox & Jesmanowicz, 1999). The images were then high-pass filtered at 0.008 Hz and spatially smoothed with an isotropic 8-mm full width at half maximum (FWHM) Gaussian kernel. The anatomical image for each subject was coregistered to his or her mean EPI image. Data analysis proceeded in two steps: first, multiple regression analysis was performed at the single-subject level and parameter estimates for events of interest were obtained; these data were then transformed into standardized space for group-level analysis.
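
The study used AFNI for these steps. Purely as an illustration of the filtering and smoothing operations (temporal high-pass filtering at 0.008 Hz and 8-mm FWHM spatial smoothing), a roughly equivalent sketch in Python with nilearn might look like the following; the file names are placeholders, and the realignment step itself is not shown.

# Illustrative preprocessing sketch using nilearn (the study itself used AFNI).
# High-pass filter the voxel time series at 0.008 Hz and apply an isotropic
# 8-mm FWHM Gaussian smoothing kernel.
from nilearn import image

func_img = image.load_img("sub01_run01_bold.nii.gz")   # placeholder file name

filtered = image.clean_img(
    func_img,
    detrend=False,
    standardize=False,
    high_pass=0.008,   # Hz, as in the study
    t_r=2.0,           # repetition time in seconds
)
smoothed = image.smooth_img(filtered, fwhm=8)           # 8-mm FWHM kernel
smoothed.to_filename("sub01_run01_bold_preproc.nii.gz")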

First-level analysis was performed on the time course of each voxel’s BOLD response for each subject using AFNI software (Cox, 1996). Regression analysis was performed using AFNI’s 3dDeconvolve function, with regressors created by convolving the predictor variables representing the time course of stimulus presentation with a gamma variate function. A total of 15 regressors were entered into the analysis. The first eight modeled the experimental trial types, crossing task (articulation, imagining), onset similarity (similar, dissimilar), and error bias (nonword, word): articulation/similar onset/nonword bias; articulation/similar onset/word bias; articulation/dissimilar onset/nonword bias; articulation/dissimilar onset/word bias; imagining/similar onset/nonword bias; imagining/similar onset/word bias; imagining/dissimilar onset/nonword bias; and imagining/dissimilar onset/word bias. The ninth regressor corresponded to all of the trials in which subjects reported making an error; thus, the regressors modeling the speech production conditions included only trials in which subjects reported accurately reproducing the sequence of words. An additional six regressors representing the motion parameters determined during the realignment stage of preprocessing were entered into the model. Parameter estimates for the events of interest were obtained, and statistical maps were created.
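
The core of this step, building regressors by convolving stimulus onset time courses with a gamma variate hemodynamic response function and estimating parameter values by least squares, can be sketched as follows. This is an illustration with invented onsets and generic HRF parameters, not the actual 3dDeconvolve call or AFNI's exact gamma variate.

import numpy as np
from scipy.stats import gamma

TR = 2.0        # repetition time in seconds
n_vols = 144    # volumes per session (1,152 volumes over eight sessions)

# Gamma variate HRF sampled at the TR (shape/scale chosen for illustration).
t = np.arange(0, 32, TR)
hrf = gamma.pdf(t, a=6, scale=1.0)
hrf /= hrf.sum()

def make_regressor(onsets_sec, dur_sec=2.0):
    # Boxcar time course for one trial type, convolved with the HRF.
    boxcar = np.zeros(n_vols)
    for onset in onsets_sec:
        start = int(round(onset / TR))
        stop = start + int(round(dur_sec / TR))
        boxcar[start:stop] = 1.0
    return np.convolve(boxcar, hrf)[:n_vols]

# Invented onsets standing in for two of the eight trial-type regressors.
X = np.column_stack([
    make_regressor([8, 40, 96]),     # e.g., articulation trials
    make_regressor([24, 64, 112]),   # e.g., imagining trials
    np.ones(n_vols),                 # constant term
])

y = np.random.randn(n_vols)          # stand-in for one voxel's BOLD time series
betas, *_ = np.linalg.lstsq(X, y, rcond=None)
print("parameter estimates:", betas)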

For group-level analysis, the statistical maps for each participant were transformed into standardized space (Talairach & Tournoux, 1988) using a Talairach template and resampled to 2 × 2 × 2 mm voxels. We performed t tests to examine group differences between silent articulation and imagining the word lists. We also examined the lexicality and phonemic similarity effects. Group-level activation maps were created, and a corrected significance level was set at p < .05. This threshold was determined using 3dFWHMx and 3dClustSim (Cox, 1996) to estimate the smoothness of the noise and then to derive the minimum cluster size which, in combination with a voxelwise threshold of p < .001, corrects for multiple testing.
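
The group comparison of the two conditions amounts to a voxelwise paired t test across the 20 subjects' parameter-estimate maps. The schematic below uses simulated data in place of the actual beta maps; the cluster-extent correction itself was done with AFNI's 3dFWHMx/3dClustSim and is not reproduced here.

import numpy as np
from scipy.stats import ttest_rel

n_subjects, n_voxels = 20, 10000   # illustrative sizes

# Stand-ins for per-subject parameter-estimate maps (articulation, imagining),
# flattened to one row of voxels per subject.
betas_articulate = np.random.randn(n_subjects, n_voxels) + 0.1
betas_imagine = np.random.randn(n_subjects, n_voxels)

t_vals, p_vals = ttest_rel(betas_articulate, betas_imagine, axis=0)

# Voxelwise threshold of p < .001 (two-tailed); in the study this was then
# combined with a minimum cluster size from 3dClustSim to reach a corrected
# threshold of p < .05.
suprathreshold = p_vals < 0.001
print(f"{suprathreshold.sum()} voxels exceed the uncorrected threshold")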

Results

Silent articulation and imagining speech

First, we examined the neural regions activated by silent articulation and by imagining separately. This was done to ensure that regions previously implicated in speech production were engaged during the two tasks and to examine differences in activation patterns between the two tasks. As expected, silent articulation engaged a wide network of regions previously implicated in speech production. We found significant activation in inferior and middle frontal gyri in the left hemisphere, bilateral precentral gyrus, bilateral inferior parietal cortex including angular gyrus and supramarginal gyrus, insula, basal ganglia, and cerebellum. We also found significant activity in left superior temporal gyrus. Imagining word lists activated a similar network of areas, such as left inferior and middle frontal cortex, bilateral precentral gyrus, bilateral inferior parietal cortex, and cerebellum. Figure 2 illustrates the activation patterns associated with the two tasks, and areas significantly activated for each task are reported in Table 1.

Fig. 2

Silent articulation and imagining word lists. Group activation map (N = 20) overlaid on a template brain illustrating regions significantly activated in the Articulation > Baseline and Imagining > Baseline contrasts (p < .05, corrected). (Color figure online)

Table 1 Summary of brain regions significantly activated in the Articulation > Imagining contrast (p < .05, corrected)

Articulation versus imagining

A contrast of silent articulation with imagining word lists yielded greater activity in bilateral superior and middle temporal gyri, precentral gyrus, anterior cingulate, bilateral parahippocampal gyrus, cerebellum, and basal ganglia. Each of these areas was more active during silent articulation than during imagined speech; the reverse contrast, Imagining > Articulation, did not yield any significant regions. Particularly interesting are the large clusters of activation in bilateral auditory cortex, given that both tasks involved silent recitation and subjects received no overt speech feedback in the experiment. This shows that engaging the motor articulators changes the activation pattern in auditory cortex relative to imagining/thinking about speech. Figure 3 shows the regions more activated by articulation, and Table 1 lists the Talairach coordinates of the significant clusters.

Fig. 3

Evidence of predictive coding. Group activation map (N = 20) overlaid on a template brain illustrating regions significantly activated in the contrast Articulation > Imagining (p < .05, corrected). There was greater activation in bilateral auditory cortex when subjects were silently articulating word lists compared to imagining word lists. (Color figure online)

Lexicality effect

Although not the main focus of our study, the lexicality effect was examined by comparing activation associated with reciting word lists biased to produce real-word errors versus nonword errors. No lexicality effect was observed at a threshold corrected for multiple comparisons. However, at a lowered threshold (p < .001, uncorrected), this contrast revealed a lexicality effect in a region previously implicated in lexical-level processes, the posterior middle temporal gyrus (pMTG) in the left hemisphere (peak coordinate [-53 -59 6]). That is, when the word list was biased to produce nonword errors (e.g., leath) rather than word errors (e.g., leaf), greater activation was observed in pMTG, and this effect was observed on error-free trials (i.e., error trials were modeled separately). What makes this finding potentially interesting is that the difference in activation cannot be attributed to lexical differences in the words presented to the subject or to lexical differences in what was spoken; in both conditions, subjects viewed the same set of words (although in different combinations) and spoke real words. Instead, what drives the activation difference is the potential for a word versus a nonword error that was not overtly committed. If this effect replicates in future studies, it would suggest that the system detected the distinction internally, which could only be the case if an internal error was in fact committed and then corrected prior to accurate output. Figure 4 illustrates the lexicality effect, which should be investigated further in future work.

Fig. 4

Lexicality effect. Group activation map (N = 20) overlaid on a template brain illustrating regions significantly activated in the Nonword > Word contrast (p < .001, uncorrected). The analysis includes only those trials in which subjects reported correctly reciting the tongue twister sequence; error trials were omitted. (Color figure online)

Discussion

We observed a substantial effect of speech production condition, with the silent articulation condition generating more activation than the imagined speech condition in a wide network of brain regions. Many of these differences are unsurprising, such as the greater activity in primary sensorimotor cortex (predominantly in the right hemisphere, however), the cerebellum, and subcortical nuclei, all of which play a role in overt movement control. Most interestingly, we also found robust activation differences (Silent Articulation > Imagined Speech) in the superior temporal gyri bilaterally, including portions of auditory cortex, despite the fact that there was no difference in auditory input between the two conditions. One possible explanation of this effect is that it reflects a mismatch error signal: articulatory plans result in an internal forward prediction of the acoustic consequences of speech articulation, which then fail to arrive. Mismatch error signals have been reported previously under conditions of altered auditory or somatosensory feedback (Golfinopoulos et al., 2011; Tourville et al., 2008), and similar inferences regarding forward prediction of sensory consequences in speech have been used to explain differences in the auditory response to self- versus other-generated speech (Heinks-Maldonado et al., 2006; J. F. Houde et al., 2002; Ventura, Nagarajan, & Houde, 2009). If overt articulation (i.e., the actual execution of motor speech plans) results in stronger forward predictions than imagined speech, this could explain the observed activation in auditory cortex in the articulation versus imagined speech contrast. This, then, would reflect a lower level of feedback control than the higher level circuit revealed by the lexicality manipulation, consistent with recent hierarchical models of feedback control in speech and motor control generally (Diedrichsen, Shadmehr, & Ivry, 2010; Grafton & Hamilton, 2007; Hickok, 2012a).

Of course, a forward predictive mechanism is not the only possible explanation for our findings. One might argue more broadly, for example, that auditory imagery is evoked more strongly during actual articulation than during imagined speech, leading to the observed activation difference. However, one might then ask why this should be the case: why is the auditory system so compelled to image the acoustic correlates of articulated speech? The best available answer to this question is that there are strong computational arguments from motor control that such “imagery” serves an important function in speech production.

Second, we found a possible brain region that is sensitive to the well-established lexical bias effect in speech production (although the effect fell just below our corrected threshold). When participants were biased to make nonword errors, there was enhanced activation in posterior middle temporal gyrus (pMTG), a region previously implicated in lexical-semantic processing. Neuroimaging and neuropsychological studies have demonstrated this region’s involvement in semantic tasks (Binder, Desai, Graves, & Conant, 2009; Binder et al., 1997; Rodd, Davis, & Johnsrude, 2005) and in word retrieval and naming (DeLeon et al., 2007; Hillis et al., 2001). Conditional on future replication, we interpret this pattern as evidence for a neural network for detecting and correcting word-level errors prior to overt production, broadly consistent with recent proposals regarding hierarchically organized internal feedback control circuits for speech production (Hickok, 2012a, 2012b). The pMTG was not the only region to show a differential response to word- versus nonword-biased lists, suggesting that it is part of a larger network. One of these activations implicates the cerebellum, a structure widely believed to play a role in internal models for motor control (Diedrichsen et al., 2010; Kawato, 1999), a fact that is broadly consistent with our interpretation of the data. Future work should investigate this further.

The main finding in this study is that auditory cortex activity is modulated as a function of silent motor speech articulation. We hypothesized that if forward predictions are generated at multiple levels of the speech motor control hierarchy, we should find more activity in some portions of auditory cortex during silently articulated speech than during imagined speech, even though neither condition involves any auditory input, because silent articulation engages additional levels of the hierarchy that generate such predictions. In line with our predictions, we found greater activation in several regions of auditory cortex during silent articulation compared with imagined speech. We suggest that these activations reflect forward predictions arising from additional levels of the perceptual/motor hierarchy that are involved in monitoring the intended speech output.