Neural Signal to Violations of Abstract Rules Using Speech-Like Stimuli

Abstract
As the evidence of predictive processes playing a role in a wide variety of cognitive domains increases, the brain as a predictive machine becomes a central idea in neuroscience. In auditory processing, a considerable amount of progress has been made using variations of the Oddball design, but most of the existing work seems restricted to predictions based on physical features or conditional rules linking successive stimuli. To characterize the predictive capacity of the brain for abstract rules, we present here two experiments that use speech-like stimuli to overcome limitations and avoid common confounds. Pseudowords were presented in isolation, intermixed with infrequent deviants that contained unexpected phoneme sequences. As hypothesized, the occurrence of unexpected sequences of phonemes reliably elicited an early prediction error signal. These prediction error signals did not seem to be modulated by attentional manipulations due to different task instructions, suggesting that the predictions are deployed even when the task at hand does not volitionally involve error detection. In contrast, the number of syllables congruent with a standard pseudoword presented before the point of deviance exerted a strong modulation. Prediction error amplitude doubled when two congruent syllables were presented instead of one, despite keeping local transitional probabilities constant. This suggests that auditory predictions can be built integrating information beyond the immediate past. In sum, the results presented here further contribute to the understanding of the predictive capabilities of the human auditory system when facing complex stimuli and abstract rules.


Introduction
In recent years, the study of predictive processes has drawn increasing attention in neuroscience. In this context, Predictive Coding has emerged as a popular theory, which states that the brain constructs a hierarchy of predictions of incoming stimuli at multiple levels of processing (Friston, 2005, 2009, 2010; Bubic et al., 2010; Hobson and Friston, 2012). This proposal has received mounting empirical support (Wacongne et al., 2011; Den Ouden et al., 2012; Phillips et al., 2015, 2016). Many experiments in the study of predictive coding are variations of the Oddball design (Squires et al., 1975; Heilbron and Chait, 2018), where frequent acoustic stimuli establish predictable sequences, which are at times violated. Compared with designs using tones, the use of speech-like stimuli offers a number of advantages. Within speech, abstract rules are ubiquitous, making it possible to test abstract predictions that go beyond physical stimulus features and local transitional probabilities. These properties make speech processing an excellent testbed for studying the brain's responses to the establishment and violation of abstract rules.
Speech perception requires the fast extraction of meaning from a complex auditory signal (Boudewyn et al., 2015; Kleinschmidt and Jaeger, 2015), and the generation of predictions might be an efficient solution to achieve fast and accurate comprehension (Kleinschmidt and Jaeger, 2015; Hauk, 2016). Although the proposal that predictive processes play a role in speech processing has been criticized (Norris et al., 2000; Van Petten and Luka, 2012; Huettig and Mani, 2016), evidence suggests that predictions are deployed at several speech processing levels (Lewis and Bastiaansen, 2015). At the syntactic level, listeners' knowledge influences sentence parsing (Farmer et al., 2006; Wilson and Garnsey, 2009; Traxler, 2014; Baart and Samuel, 2015). Lexico-semantic processing can be facilitated by contextual predictability (Van Petten et al., 1999; Schuster et al., 2016).
As the generation of predictions seems to be a prevalent brain computation (Friston, 2009, 2010), we propose that phonological predictions are generated during speech perception in the absence of semantic and syntactic information. To test this hypothesis, we performed two EEG experiments with an Oddball design. The use of speech stimuli allowed us to test for predictions based on an abstract rule that goes beyond local transitional probabilities.
Pseudowords were presented in a context that did not contain syntactic or semantic information. We expected that the presentation of deviants, constructed using the same phonemes as standard pseudowords but in an unexpected sequence, would elicit an early prediction error signal like the mismatch negativity (MMN; Friston, 2005; Näätänen et al., 2007; Garrido et al., 2009; Wacongne et al., 2011; Chennu et al., 2013; Winkler and Schröger, 2015). The presence of this prediction error signal would imply that listeners' brains generate predictions about incoming phonemes within a pseudoword.
We propose that abstract predictions are deployed regardless of the task at hand. To test this, experiments 1 and 2 differed with respect to the instructions given to the participants. While in experiment 1 participants were instructed to count the occurrence of deviants, in experiment 2 they were required to learn all pseudowords. We expected that an early prediction error signal would be present in both experiments, implying that predictions are deployed even if the task at hand does not require error detection and independently of the rule-learning strategy.
Finally, to test whether these predictions are constructed using information beyond local transitional probabilities, we tested whether the amplitude of the prediction error would be modulated by the number of phonemes presented before the point of deviance. We expected to find a larger prediction error (higher amplitudes) when longer sequences of phonemes that are congruent with a standard pseudoword are presented. This modulation would not occur if predictions were made based solely on local transitional probabilities between phonemes.
Taken together, these experiments allowed us to study the predictive capabilities of the brain networks underlying the extraction of abstract rules.

Materials and Methods
Stimuli set, unprocessed data and processing scripts can be found at https://osf.io/tuvy6/.

Participants
Participants were self-reported right-handed native Italian speakers recruited from the city of Trieste with no auditory or language-related problems. Participants signed informed consent and received a monetary compensation of 15€. Thirty participants (10 male, 20 female, mean age 22.86 ± 3.42 years) took part in experiment 1, and 29 participants (9 male and 20 female, mean age 23.24 ± 3.52 years) took part in experiment 2. After data preprocessing, participants contributing <30 clean EEG trials per condition were excluded from analysis (one participant excluded from each experiment). The remaining participants had sufficient trials to be included in a single-subject statistical analysis and all contributed similarly to the group variance. Additionally, one participant was excluded from experiment 1 due to poor behavioral performance. Therefore, 28 participants (10 male, 18 female, mean age 23.25 ± 3.23 years) from experiment 1 and 28 participants (8 male and 20 female, mean age 23.10 ± 3.51 years) from experiment 2 were included in the final analyses.

Stimuli
Six pseudowords divided in three sets of two pseudowords each were used as stimuli. We applied a series of constraints in the construction of our stimuli to ensure that the resulting pseudowords would resemble real Italian words. First, we consulted the phonItalia lexical database (Goslin et al., 2014) to identify syllable candidates composed of one consonant followed by one vowel (i.e., two phonemes each). To exclude monosyllabic words and onomatopoeias, we removed syllables with a token frequency above the 70th percentile. Next, to keep syllables that could take any position within a word, we removed syllables with initial, middle, or final position token frequencies either below the 20th percentile or above the 90th percentile. This selection procedure allowed us to identify 24 syllable candidates that are not monosyllabic words (in Italian) and have an even frequency distribution across positions within a word.
Using these syllable candidates, we constructed two trisyllabic pseudowords that contained no vowel or consonant repetitions. Additionally, no syllables were repeated between these two pseudowords. Hereafter, these pseudowords will be referred to as STD (i.e., standard) pseudowords. Taking these STD pseudowords as a base, we constructed two different types of deviant pseudowords. The first deviant type, to which we will refer as XYY, consisted of the 1st syllable of one STD pseudoword and the 2nd and 3rd syllables of the other STD pseudoword. The second type of deviant, to which we will refer as XXY, consisted of the 1st and 2nd syllables of one STD pseudoword and the 3rd syllable of the other STD pseudoword. Finally, two additional pseudowords with an XYX structure were constructed, only to be used as NEW pseudowords in a forced choice test at the end of experiment 2. None of these deviant pseudowords contained either consonant or vowel repetitions.
Audio files of the two STD pseudowords were generated using the MBROLA speech synthesizer (Dutoit et al., 1996) and the Italian female diphone database it4. Consonant and vowel durations were set to 150 and 175 ms, respectively; hence, each syllable lasted 325 ms and each pseudoword lasted 975 ms. Once the two STD pseudowords were produced, deviants were constructed by cross-splicing (i.e., cutting and replacing sound segments) the audio of the STD pseudowords.
In natural speech, phonemes are co-articulated (i.e., the sound of each phoneme is influenced by the preceding and the forthcoming phoneme). Hence, using cross-splicing to generate the deviant pseudowords could result in sharp transitions that would sound unnatural. Because of this, we took measures to obtain a natural rendering of our stimuli (Steinberg et al., 2012). For the first and last syllable positions, the vowels of both STD pseudowords had similar first and second formants. As one STD pseudoword had the vowel "o" in the first syllable, the other STD pseudoword had the vowel "u" at the same position. In the case of the third syllable, while one STD pseudoword used the vowel "i," the other one used the vowel "e." In the case of the second syllable, both STD pseudowords had "a" as the vowel (Fig. 1A). For each syllable position, the consonants of both STD pseudowords had the same mode of articulation. Finally, the point of cutting was set close to zero amplitude. These measures had the effect of reducing the difference between both STD pseudowords at the points of syllable transitions so that, when cross-spliced to construct the deviant pseudowords, these would not contain sharp transitions.
The final set consisted of two STD, two XYY deviants, two XXY deviants, and two NEW pseudowords (Fig. 1B). All pseudowords were checked by a linguist who was a native Italian speaker to ensure that they sounded like plausible but not real Italian words.
While previous work in the literature has shown that the generation of predictions can serve word processing, phonemes in these experiments were either omitted (Bendixen et al., 2014), or replaced either by other phonemes (Cornell et al., 2013; Politzer-Ahles et al., 2016; Schluter et al., 2017) or by a non-linguistic sound (Kashino, 2006; Groppe et al., 2010). Because of this, changes in low-level auditory features might have contributed to the recorded signals. In the case of our stimulus set, any difference in the EEG recording found between the STD condition and the deviant conditions could not be attributed to differences in instantaneous low-level features. Instead, they could in principle only be attributed to the violation of the abstract rule learnt during the experiment (Paavilainen, 2013), according to which, given a syllable Xn, the next syllable of the word should be Xn+1.
Note that in the case of the stimuli used here, the only feature that defined a pseudoword as deviant was that following the syllable Xn, instead of the usual syllable Xn+1, the syllable Yn+1 (which belongs to a different STD pseudoword) was presented. Additionally, as the overall frequency of presentation of all syllables used to construct the stimuli was the same, this design avoids a common confound between expectation and frequency of presentation (Heilbron and Chait, 2018).

Experimental design
Participants were requested to minimize movement throughout the experiment, except during breaks between blocks. No particular instructions were given with respect to when to blink, as eye blink artefacts can be removed using independent component analysis (ICA; Delorme and Makeig, 2004; Chaumon et al., 2015).
Experiments followed an Oddball design, divided in 13 blocks with an average duration of 3.3 min each. During each block, a total of 98 pseudowords were presented, with an interstimulus interval that varied between 900 and 1300 ms. During the first block, only STD pseudowords were presented. Subsequently, participants completed 12 blocks composed of 84% STD pseudowords, 8% XYY deviant pseudowords, and 8% XXY deviant pseudowords. Within each block, pseudoword order was pseudo-random. A minimum of two and a maximum of four STD pseudowords were presented between deviants, and no deviants were presented more than two times consecutively (Fig. 1C).
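The ordering constraints above can be illustrated with a minimal sketch. It is not the presentation code used in the experiments; the function name `make_block` and its parameters are hypothetical. It assumes 8 deviants of each type per 98-trial block (roughly 8% each), reads "no more than two times consecutively" as applying to the same deviant type in the sequence of deviants, and places any STD trials not needed between deviants at the start and end of the block.

```python
import random

def make_block(n_trials=98, n_per_deviant=8, min_gap=2, max_gap=4, seed=None):
    """Return one pseudo-random block of condition labels (illustrative only)."""
    rng = random.Random(seed)
    deviants = ["XYY"] * n_per_deviant + ["XXY"] * n_per_deviant

    # Reshuffle until no deviant type appears three times in a row
    # (assumed reading of the "two times consecutively" constraint).
    while True:
        rng.shuffle(deviants)
        triple = any(
            deviants[i] == deviants[i + 1] == deviants[i + 2]
            for i in range(len(deviants) - 2)
        )
        if not triple:
            break

    # Between 2 and 4 STD trials separate successive deviants.
    gaps = [rng.randint(min_gap, max_gap) for _ in range(len(deviants) - 1)]
    spare = n_trials - len(deviants) - sum(gaps)
    if spare < 0:
        raise ValueError("block too short for the requested constraints")

    # Assumption: leftover STD trials are split between block start and end.
    lead = rng.randint(0, spare)
    block = ["STD"] * lead
    for i, dev in enumerate(deviants):
        block.append(dev)
        if i < len(gaps):
            block.extend(["STD"] * gaps[i])
    block.extend(["STD"] * (spare - lead))
    return block

block = make_block(seed=1)
print(len(block), block.count("STD"), block.count("XYY"), block.count("XXY"))
```

With these defaults, a block contains 82 STD trials and 16 deviants, which matches the reported proportions after rounding.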
In experiment 1, participants were instructed to learn all made up "words" (i.e., pseudowords) in block one, and from block 2 onwards count the occurrence of "mistaken words" (i.e., deviant pseudowords) and write down the number of mistaken words during the pauses between blocks.
In experiment 2, participants were not informed about the presence of deviants and were simply instructed to learn all made up words (i.e., pseudowords). To ensure that participants would pay attention during the experiment, they were informed that they would be subject to a test after the word learning task. After listening to the blocks of pseudowords, behavioral performance was assessed by means of a forced choice test. On each trial, participants heard two pseudowords in sequence and were requested to choose the one that most likely was presented during the experiment. Participants completed four trials for each of six contrasts between conditions, for a total of 24 trials, presented in pseudorandom order (only one repetition of contrast type was allowed). The contrasts between conditions were "STD versus XYY," "STD versus XXY," "XYY versus XXY," "STD versus NEW," "XYY versus NEW," and "XXY versus NEW." Participants reported their answers verbally and the experimenter entered them through the keyboard. Order of presentation of pseudowords within each trial was counterbalanced.

Figure 1. A, Scatter plot of 1st and 2nd formant of each vowel. B, Stimulus set in IPA notation. Deviant pseudowords were produced by cross-splicing the two STD pseudowords either at the end of the first syllable (XYY) or at the end of the second syllable (XXY). Two additional NEW pseudowords with an XYX structure were used only in a forced choice test at the end of experiment 2. C, In both experiments, stimuli were presented in 13 blocks separated by 20 s. Within each block, pseudowords were presented with an interstimulus interval between 900 and 1300 ms. The first block consisted solely of STD pseudowords. Subsequent blocks were composed of 84% STD pseudowords, 8% XYY deviant pseudowords, and 8% XXY deviant pseudowords. Pseudoword order was pseudo-random. A minimum of two and a maximum of four STD pseudowords were presented between deviants, and no deviants were presented more than two times consecutively.

Data acquisition setup
EEG data were collected using a 128 passive electrode system (Geodesic EEG System 300, Electrical Geodesics, Inc.) referenced to the vertex. The EEG signal was bandpass filtered by hardware between 0.1 and 100 Hz, and digitized at 250 Hz. Electrode impedance was kept below 100 kΩ (equivalent to 10-kΩ standard amplifiers; Johnson et al., 2001). Participants were tested in a soundproof Faraday cage while sitting on a chair in front of a 19-inch LCD monitor. Sound was delivered via a loudspeaker located behind the monitor, at a comfortable sound intensity of ~60 dB. Experiments were programmed in MATLAB (MathWorks, Inc., RRID: SCR_001622) using the Psychophysics Toolbox extensions (Brainard, 1997; Pelli, 1997; RRID: SCR_002881). Pseudoword onset was marked on the EEG data by sending both a digital input signal (DIN) and a TCP/IP mark.

EEG data preprocessing
EEG data preprocessing was performed in MATLAB using custom code and the EEGLAB toolbox (Delorme and Makeig, 2004; RRID: SCR_007292). After being imported to EEGLAB, the data of each subject were bandpass filtered (0.1-30 Hz). As the anti-aliasing filter of the EGI 300 Amp introduces a delay of 36 ms, latencies of all events were corrected. The entire learning block, and the first six trials of each block, were excluded from analysis. Data were segmented into 1848-ms-long epochs starting 300 ms before pseudoword onset. Bad channels were rejected using the three available methods of EEGLAB's pop_rejchan function. The kurtosis threshold was set to 4, the joint probability threshold was set to 4, and abnormal spectra were checked between 1 and 30 Hz, with a threshold of 3 (Delorme and Makeig, 2004). Following this automatic cleaning, additional channels were rejected by visual inspection of continuous data and spectra. ICA was used to remove eye blinks (Delorme and Makeig, 2004; Chaumon et al., 2015). Next, data were re-referenced to the average of all electrodes and baseline corrected using the 300 ms before pseudoword onset. We then performed trial rejection by eliminating trials containing extreme values (±200 μV) and improbable trials (EEGLAB pop_jointprob, threshold of 4 for both Single Channel and All Channels). Finally, missing channels were interpolated (EEGLAB pop_interp, "spherical").
Only after this cleaning procedure were the data divided into conditions. Given that STD pseudowords were presented far more frequently than deviant pseudowords, the datasets of each condition were pruned by randomly discarding trials to obtain exactly the same number of trials per condition. For example, if after trial rejection a participant had 763 STD trials, 76 XYY trials, and 68 XXY trials, then 68 randomly picked trials per condition were kept and the rest were discarded. Participants contributing <30 clean EEG trials per condition were excluded from analysis (one participant was excluded from each experiment applying this criterion). After this, the mean number of trials per participant and condition was 70.18 ± 16.57 (minimum = 35) for experiment 1 and 82.50 ± 13.76 (minimum = 41) for experiment 2. For each condition, the mean of all trials of each subject was calculated and saved into a final dataset. The result of preprocessing was one dataset per condition, containing the mean of each subject.
Deviant conditions differed from each other with respect to the number of syllables presented before the point of deviance. To make the comparison of the deviant conditions possible, we re-segmented the trials of both deviant conditions so that the points of deviance would be aligned. The resulting epochs had a length of 1224 ms, starting 325 ms before the point of deviance. Additionally, as the processing of a pseudoword has an intrinsic temporal dynamic, we removed this confounding factor by subtracting the activation elicited by the STD condition from each deviant condition.
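The trial-count equalization described above can be sketched as follows. This is an illustrative Python version, not the MATLAB code used in the study; the function name `equalize_trials` and the use of integer indices as stand-ins for trials are assumptions made for the example.

```python
import random

def equalize_trials(trials_by_condition, seed=None):
    """Randomly discard trials so every condition keeps the same number.

    trials_by_condition maps a condition label to a list of trials.
    The smallest condition sets the number of trials kept everywhere.
    """
    rng = random.Random(seed)
    n_keep = min(len(trials) for trials in trials_by_condition.values())
    pruned = {}
    for cond, trials in trials_by_condition.items():
        # Sample without replacement, then restore the original trial order.
        keep_idx = sorted(rng.sample(range(len(trials)), n_keep))
        pruned[cond] = [trials[i] for i in keep_idx]
    return pruned

# Worked example from the text: 763 STD, 76 XYY, and 68 XXY clean trials.
counts = {"STD": 763, "XYY": 76, "XXY": 68}
trials = {cond: list(range(n)) for cond, n in counts.items()}
pruned = equalize_trials(trials, seed=0)
print({cond: len(kept) for cond, kept in pruned.items()})
# {'STD': 68, 'XYY': 68, 'XXY': 68}
```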
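As a rough illustration of the re-segmentation and subtraction steps, the sketch below cuts a 1224-ms window starting 325 ms before the point of deviance out of an epoch that begins 300 ms before word onset, at the 250-Hz sampling rate. The function names are hypothetical, and the sketch assumes that the STD response is cut over the same window relative to word onset before subtraction; the original processing scripts may differ in these details.

```python
def realign(epoch, deviance_ms, fs=250, epoch_start_ms=-300,
            pre_ms=325, length_ms=1224):
    """Cut a window aligned to the point of deviance from one channel's epoch.

    epoch is a list of samples whose first sample lies epoch_start_ms
    relative to word onset; deviance_ms is relative to word onset.
    """
    start = round((deviance_ms - pre_ms - epoch_start_ms) * fs / 1000)
    n_samples = round(length_ms * fs / 1000)
    if start < 0 or start + n_samples > len(epoch):
        raise ValueError("requested window falls outside the epoch")
    return epoch[start:start + n_samples]

def difference_wave(deviant_epoch, std_epoch, deviance_ms, **kwargs):
    """Deviant minus STD over the same window (assumed alignment)."""
    dev = realign(deviant_epoch, deviance_ms, **kwargs)
    std = realign(std_epoch, deviance_ms, **kwargs)
    return [d - s for d, s in zip(dev, std)]

# Points of deviance: end of the 1st syllable (XYY) or 2nd syllable (XXY).
DEVIANCE_MS = {"XYY": 325, "XXY": 650}
```

With an 1848-ms epoch (462 samples), the XXY window ends at the last sample of the original epoch, which is consistent with the reported 1224-ms length.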

EEG regions of interest (ROIs)
Statistical analyses of EEG data were restricted to two predefined spatiotemporal ROIs. The first one was a fronto-central ROI composed of 13 electrodes that spanned a 325-ms time window starting at the point of deviance of each deviant condition. With respect to word onset, this window spanned from 325 to 650 ms for the XYY condition, and from 650 to 975 ms for the XXY condition. This ROI coincided with the region where an early prediction error response like the MMN could be expected (Duncan et al., 2009; Bendixen et al., 2012; Wacongne et al., 2012; Lecaignard et al., 2015). The second one was a parietal ROI composed of 21 electrodes that extended temporally from 200 ms after the point of deviance of each deviant condition to the end of the epoch. With respect to word onset, this window started at 525 ms for the XYY condition, and at 850 ms for the XXY condition. This ROI corresponded to the region where a P3b response would be expected (Comerchero and Polich, 1999; Polich, 2007; Duncan et al., 2009). As this component is strongly modulated by top-down attention (Sergent et al., 2005; Bekinschtein et al., 2009; Pegado et al., 2010; Dehaene and Changeux, 2011), it was used to test whether the attentional manipulation between experiments 1 and 2 was successful.

Statistical analysis
EEG group-level contrasts between conditions were performed using a nonparametric clustering method, introduced first by Bullmore et al. (1999) and implemented in the FieldTrip toolbox for EEG/MEG analysis (Oostenveld et al., 2011; RRID: SCR_004849). This method offers a straightforward and intuitive solution to the multiple comparisons problem. It relies on the fact that EEG data have a spatiotemporal structure. A true effect should not be isolated but should instead spread over different electrodes and over time. Instead of assessing differences between conditions in a point-by-point fashion, which would lead to a very large number of comparisons, this method groups together adjacent spatiotemporal points.
The procedure is as follows. For every point in time and space, the EEG signal of two conditions is statistically compared. In our case, we used a nonparametric permutation t test for this step. The t values of adjacent spatiotemporal points with p < 0.05 are clustered together and a cluster-level statistic is calculated by summing the t values within a cluster. Once these candidate clusters have been defined, their probability of occurrence under the null hypothesis of no difference between conditions is assessed using a nonparametric permutation test. In this test, conditions are shuffled and cluster-level t values are calculated as before. This step is repeated 5000 times, and on each iteration, the most extreme cluster-level t value is retained. This makes it possible to construct a histogram of expected cluster-level t values under the null hypothesis of no difference between the conditions. Cluster-level p values are calculated as the proportion of expected t values under the null hypothesis that are more extreme than the observed t value. For further details, see Maris and Oostenveld (2007).
Additionally, to corroborate that the results found at the group level were robust and not driven by outliers, we performed a test at the participant level. For each individual participant, the mean amplitude over the time of the detected group-level cluster was calculated, and the conditions of interest were submitted to a paired t test to obtain a t value. Next, the t values from all participants were converted to 1 if they showed a difference between conditions in the same direction as the group-level cluster, or to 0 otherwise. A one-tailed binomial test was performed on these transformed t values, with equal or lower likelihood as the null hypothesis. The logic of this analysis is that if an effect is true at the group level, then the majority of participants should show a difference between conditions in the same direction. Note that the test used is one-tailed because the hypothesis to test is directional.
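The logic of this procedure can be sketched in a simplified form. The code below is not the FieldTrip implementation and makes several simplifying assumptions that depart from the analysis reported here: it clusters over time only (a single channel or ROI average), it takes per-participant difference waves as input and shuffles conditions by randomly flipping their sign, and it thresholds point-level t values with a fixed critical value `t_crit` supplied by the caller (for example, about 2.05 for 28 participants, two-tailed) instead of a point-level permutation test. All function names are hypothetical.

```python
import math
import random

def t_values(diffs):
    """One-sample t value at each time point; diffs[participant][time]."""
    n = len(diffs)
    out = []
    for samples in zip(*diffs):
        mean = sum(samples) / n
        var = sum((x - mean) ** 2 for x in samples) / (n - 1)
        out.append(mean / math.sqrt(var / n) if var > 0 else 0.0)
    return out

def cluster_sums(tvals, t_crit):
    """Sum t values over runs of adjacent supra-threshold, same-sign points."""
    clusters, current, sign = [], 0.0, 0
    for t in tvals:
        s = (t > t_crit) - (t < -t_crit)  # +1, -1, or 0 (below threshold)
        if s != 0 and s == sign:
            current += t
        else:
            if sign != 0:
                clusters.append(current)
            current, sign = (t, s) if s != 0 else (0.0, 0)
    if sign != 0:
        clusters.append(current)
    return clusters

def cluster_permutation_test(diffs, t_crit, n_perm=5000, seed=None):
    """Return (cluster statistic, p value) for each observed cluster."""
    rng = random.Random(seed)
    observed = cluster_sums(t_values(diffs), t_crit)
    null_max = []
    for _ in range(n_perm):
        flipped = []
        for subj in diffs:
            sign = rng.choice((-1, 1))  # swap condition labels for this participant
            flipped.append([sign * x for x in subj])
        sums = cluster_sums(t_values(flipped), t_crit)
        # Keep the most extreme cluster statistic of this permutation.
        null_max.append(max((abs(s) for s in sums), default=0.0))
    return [
        (obs, sum(m >= abs(obs) for m in null_max) / n_perm)
        for obs in observed
    ]
```

Retaining only the most extreme cluster statistic on each permutation is what controls the family-wise error rate across all candidate clusters.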
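A minimal sketch of this participant-level check is given below, assuming that the per-participant t values have already been computed. The function names are hypothetical, and the one-tailed p value is the probability of observing at least as many participants in the group direction under a binomial distribution with p = 0.5.

```python
from math import comb

def binomial_sign_test(n_same_direction, n_participants, p_null=0.5):
    """One-tailed P(X >= n_same_direction) for X ~ Binomial(n, p_null)."""
    return sum(
        comb(n_participants, k) * p_null ** k * (1 - p_null) ** (n_participants - k)
        for k in range(n_same_direction, n_participants + 1)
    )

def sign_test_from_t(t_values, group_direction):
    """Code each t value as 1 (same sign as the group effect) or 0, then test.

    group_direction is +1 or -1, the sign of the group-level cluster.
    """
    hits = sum(1 for t in t_values if t * group_direction > 0)
    return hits, binomial_sign_test(hits, len(t_values))
```

For a negative-going group cluster such as an MMN-like response, `group_direction` would be -1, so participants with negative t values count as showing the effect.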

Results
Given that deviant conditions differed in the time point at which a pseudoword could be identified as a deviant (325 and 650 ms from pseudoword onset for XYY and XXY conditions, respectively), instead of defining time 0 as onset of stimulus presentation, we will use the time point of deviance of each condition as such. In other words, all times reported are with respect to the point of deviance. Furthermore, comparisons across deviants and experiments were performed on the difference wave between STD and deviant, and with all trials re-segmented to align the point of deviance, as described in Materials and Methods.

Behavioral results
In experiment 1, participants were requested to count the occurrence of mistaken words (i.e., deviant pseudowords) on each block. On average, participants reported 15.22 (out of 16 presented) deviant pseudowords per block (SD = 2.56). For each participant, we checked the number of blocks with a deviant count further than 2 SD from the mean. While most of the participants reported a deviant count within these limits for all the blocks, three participants had one block with a lower count, and one participant had all 12 blocks outside this limit. This participant reported a mean of only 3.58 deviants per block and was therefore excluded from the analysis. After excluding this participant and one other participant who contributed <30 clean EEG trials per condition, the mean number of deviants reported per block increased to 15.62 (SD = 1.41). This performance is close to ceiling (16).
Note that asking participants to mentally count the occurrence of deviants does not allow us to determine with certainty either the occurrence of false alarms or the detection rate for each deviant condition. Despite this, given that the mean count of deviants was close to the actual number of deviants presented, we can conclude that in experiment 1 participants were able to perform the task with high accuracy for both deviant conditions.
Contrary to experiment 1, during experiment 2, participants were not aware of the presence of deviant pseudowords. Despite this, at the end of the experiment, they were requested to perform a forced choice test in which each stimulus condition was contrasted against the others and against new pseudowords not presented during the blocks. The mean preference in each contrast was calculated for each participant and a one-sample t test was performed at the group level to test against the null hypothesis of no difference from chance (i.e., 50%). Results were corrected for multiple comparisons using the Bonferroni-Holm method.
These behavioral results allowed us to corroborate that participants paid attention during the blocks of pseudowords. They also indicate that in experiment 2, despite the fact that the instructions provided did not explicitly distinguish between standard and deviant pseudowords, participants displayed a preference for STD pseudowords over both deviant pseudoword types. Although both deviant types had the same probability of occurrence, XXY deviants could be distinguished from NEW pseudowords, whereas XYY deviants could not. Taken together, these behavioral results suggest that participants were sensitive to the frequency of occurrence of the different pseudowords.
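For reference, the Bonferroni-Holm step-down correction applied to the six contrasts can be sketched as below. The function name and the p values in the usage lines are hypothetical placeholders, not results from this study.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return Holm-adjusted p values and rejection decisions, in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p value is multiplied by m, the next by m - 1, and so on.
        adj = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, adj)
        adjusted[i] = running_max
    return adjusted, [p <= alpha for p in adjusted]

# Hypothetical p values for the six forced-choice contrasts.
example_p = [0.001, 0.004, 0.20, 0.03, 0.45, 0.012]
adjusted_p, significant = holm_bonferroni(example_p)
```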

EEG evidence of abstract rule extraction via phonological predictions
To test whether phonological predictions are deployed during speech perception in the absence of semantic and syntactic information, we used clustering (see Materials and Methods) to compare each deviant condition against the STD condition, focusing the analysis on the fronto-central ROI, where the presentation of a deviant pseudoword was expected to elicit an early prediction error signal.
In experiment 1, XYY deviants elicited such a response, peaking in amplitude at 155 ms (t (27)  The results of experiment 1 show that the presentation of a deviant pseudoword, composed of an unexpected sequence of syllables, elicited prediction error signals. Since in experiment 1 participants were instructed to count mistaken (i.e., deviant) pseudowords, we sought to replicate these results under conditions more akin to natural speech perception. Experiment 2, while using the same stimuli and Oddball design as experiment 1, differed with respect to the instructions given to the participants. In experiment 2, participants were asked to learn all pseudowords, without being informed of the presence of deviants.
Taken together, the results of experiments 1 and 2 show that the presentation of deviants composed of an unexpected sequence of syllables triggers an early prediction error signal. The presence of this error signal indicates that a prediction about the forthcoming syllables had been made, even when the context did not contain any syntactic or semantic information.

Neural signals to violations of abstract rules under different instructions
To test whether predictions are deployed regardless of the task at hand, experiments 1 and 2 used the same stimuli and design, but differed in the instructions given to the participants. While in experiment 1 participants were requested to count the occurrence of deviants, in experiment 2, they were not informed about the presence of deviants and were instead requested to learn all pseudowords. Despite this difference, as we reported at the beginning of this section, the presentation of deviant pseudowords elicited an early prediction error signal in both experiments.
To confirm that the change in instructions successfully induced a different attention allocation between experiments, we analyzed the signal recorded at the parietal ROI. If the attentional manipulation was successful, the presentation of a deviant pseudoword should elicit a P3b response only in experiment 1, where deviant detection was relevant for the task at hand (Bekinschtein et al., 2009).
In experiment 1, our analysis of the parietal ROI revealed that both deviant types elicited the expected P3b response. In the case of the XYY deviant, the P3b response started at 251 ms and reached 50% of its area under the curve at 743 ms (t (27)  Next, to further confirm that the attentional manipulation between experiments was successful, we contrasted the recorded signals across experiments using clustering analysis. We expected to find higher amplitudes in experiment 1, due to the presence of the P3b elicited by the deviants. We were able to confirm this for both deviants (XYY: t(54) = 875.00, p = 0.0002, g = 1.41 [0.80, 1.99]; XXY: t(54) = 734.07, p = 0.0002, g = 1.26 [0.66, 1.82]). Analyses were performed on the difference between STD and deviant conditions. These results confirm that the top-down attention paid to deviants was indeed different between experiments.
Having confirmed that the attentional manipulation between experiments was successful, and considering that an early prediction error signal was nonetheless registered in both experiments, we decided to test whether the prediction error signals recorded across experiments were indeed equivalent. As our hypothesis stated that there would be no difference in prediction error amplitude across experiments (i.e., a null hypothesis), a Bayesian independent samples t test (Bayes factor; Rouder et al., 2009) was used for these comparisons. This test measures the relative evidence for the null and alternative hypotheses, making it possible to assess evidence in favor of the null (Leppink et al., 2017). Tests were performed using a Cauchy prior with a scale value of r = 1.
We compared the amplitude of the early prediction error signals registered over the fronto-central ROI, elicited by each deviant condition across experiments, by taking the mean amplitude in a 44-ms time window (equal to the duration of the shortest cluster) centered at the peak of the detected negativity. For both deviant types, the Bayes factor showed only anecdotal evidence in favor of no difference between experiments (XYY deviants: BF01 = 2.48, g = 0.32 [-0.20, 0.85]; XXY deviants: BF01 = 1.14, g = 0.48 [-0.06, 1.01]). Analyses were performed on the difference between STD and deviant conditions. Taken together, these results suggest that even if the task at hand does not explicitly involve deviance detection, phonological predictions are proactively deployed. However, it should be noted that the results with respect to the modulation of the early prediction error by top-down attention are inconclusive.

Predictions beyond local transitional probabilities
The prediction error signals described above could reflect violations of predictions based on local transitional probabilities, or alternatively these predictions could be constructed by considering information in a longer cognitive time window. To shed light on this issue, we contrasted conditions where deviance occurred at different time points within a pseudoword. The logic behind this comparison is that if predictions are built not solely on the basis of local transitional probabilities, an increase in the number of syllables presented before the point of deviance would elicit higher amplitude prediction error signals. In XXY, the second syllable lends further evidence that the pseudoword is about to be completed, but then this prediction is violated in the last syllable, while in XYY, the prediction is broken earlier.
It remained possible that small discrepancies in the number of STD trials presented before the deviants of each condition might be in part driving these effects. To rule out this possible confound, we fitted linear mixed effects models (using the lme4 package in RStudio; Bates et al., 2015; RStudio Team, 2016) to predict single trial prediction error amplitude using deviant type and number of preceding STD trials (STD count) as fixed factors, and including participant as a random factor [PE ~ Dev + STD_count + (1 + Dev | participant); R² experiment 1 = 0.0124, R² experiment 2 = 0.0116]. An effect of deviant type was found in both experiments (experiment 1: B = -0.94, t(3929) = -3.705, p = 0.00021; experiment 2: B = -0.67, t(4619) = -3.530, p = 0.00042). In contrast, no effects of STD count were found (experiment 1: B = -0.007, t(3929) = -0.092, p = 0.92; experiment 2: B = 0.059, t(4619) = 1.172, p = 0.24). These results rule out the possibility that a substantial part of the difference in prediction error amplitude between deviant conditions would be driven by a difference in mean STD count preceding the deviants.
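For readers less familiar with lme4 syntax, the model formula above corresponds to the following equation for trial i of participant j. The symbol names are ours, and the coding of the deviant-type predictor (which deviant serves as the reference level) is an assumption, as it is not stated in the text.

```latex
\begin{aligned}
\mathrm{PE}_{ij} &= \beta_0 + \beta_1\,\mathrm{Dev}_{ij} + \beta_2\,\mathrm{STDcount}_{ij}
                   + u_{0j} + u_{1j}\,\mathrm{Dev}_{ij} + \varepsilon_{ij},\\
(u_{0j},\, u_{1j})^{\top} &\sim \mathcal{N}(\mathbf{0},\, \Sigma), \qquad
\varepsilon_{ij} \sim \mathcal{N}(0,\, \sigma^{2}).
\end{aligned}
```

Here \beta_1 and \beta_2 are the fixed effects of deviant type and STD count, while u_{0j} and u_{1j} are the per-participant random intercept and random slope for deviant type, which are allowed to correlate through \Sigma.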

Discussion
As we argued in the Introduction, the experimental designs typically used to study prediction in auditory processing share a number of limitations. The majority of the experimental designs used are variations of the Oddball paradigm (Heilbron and Chait, 2018). In most of these experimental designs, what defines a particular stimulus as deviant is the disruption of an established physical feature such as pitch, duration, intensity, side of stimulation or the presence of a gap (Näätänen et al., 2007). This limitation applies to the classical Oddball paradigm, optimum-1 (Näätänen et al., 2004), omission (Yabe et al., 1997), and roving-standard (Garrido et al., 2008) designs.
While these designs define standard and deviant stimuli on the basis of their physical features, other designs explore the sensitivity of the predictive system to higher order regularities or abstract rules that define the relationship between successive stimuli. For example, Paavilainen et al. (2007) presented their participants with sequences of sinusoidal tone pips for which the duration varied randomly between short (50 ms) and long (150 ms). Importantly, the duration of each tone predicted the pitch of the next one, which could be either low (1000 Hz) or high (1500 Hz). The authors found that the violation of this arbitrary abstract rule, linking the duration of a tone with the pitch of the next, elicited an early error signal (MMN response). Other examples of paradigms that test for prediction of higher order regularities are the unexpected repetition (Wacongne et al., 2012) and repetition versus expectation (Todorovic and de Lange, 2012) designs (for a review of abstract rule designs, see Paavilainen, 2013).
Abstract rule designs have given support to predictive coding by showing that putative early prediction error signals, like the MMN response, cannot be fully explained by simple adaptation to standard stimuli (and lack of adaptation to deviant stimuli). But in all the designs mentioned above, the rules established relationships only between consecutive stimuli. Therefore, these experimental designs only allow the study of the sensitivity of the predictive system to local transitional probabilities.
To the best of our knowledge, only two paradigms allow testing violations of an abstract rule beyond local transitional probabilities. In the local/global paradigm (Bekinschtein et al., 2009), tones are presented in groups of five. This makes it possible to establish regularities both locally (transitional probabilities between tones within groups) and globally (changes between groups, which can only be tracked over a time range of seconds). In the RAND-REG designs (Barascud et al., 2016), tones are presented in succession at multiple possible pitches, switching between randomness and regular patterns. In these experiments, detecting a regular pattern requires considering several consecutive tones (one full cycle plus four tones according to an ideal observer model). While the local/global and RAND-REG designs make it possible to study predictions that integrate information beyond adjacent stimuli, they use tone stimuli that are far less complex than naturally occurring sounds.
As evidence suggests that the generation of predictions might be one of the strategies that the speech processing system uses to parse the speech signal (Hickok, 2012; Boudewyn et al., 2015; Kleinschmidt and Jaeger, 2015; Norris et al., 2015; Hauk, 2016), and given that abstract rules and long range dependencies are ubiquitous in language, one way to overcome the limitations of the experimental designs described above is to use speech-like stimuli.
In the context of speech processing, it has been shown that listeners tend to hallucinate the presence of phonemes replaced by tones. The strength of this illusion depends on how informative the preceding context is about the missing phoneme (Kashino, 2006; Groppe et al., 2010). Similarly, when a phoneme is omitted from a word (Bendixen et al., 2014), this can elicit an MMN (Näätänen et al., 2007), which is a marker of violation of expectations (Friston, 2005; Winkler and Schröger, 2015), but only if the context in which the phoneme omission occurs contains semantic information that makes the omitted phoneme predictable. Phoneme replacements can also elicit an MMN response when the replacement violates a phonotactic rule of the language of the listener (Dehaene-Lambertz et al., 2000; Sun et al., 2015; Ylinen et al., 2016). Furthermore, and particularly framed in the context of predictive coding, it has been shown that the amplitude of the MMN response elicited by phoneme replacement is modulated by the availability of phonological evidence (i.e., degree of feature specification) of the preceding standard words before the presentation of a deviant (Scharinger et al., 2012a,b, 2016).
The studies described in the previous paragraph have provided compelling evidence of the role that predictions play in speech processing, but besides using speech as complex auditory stimuli, they incorporate in their designs other linguistic factors such as syntax, semantic information, and phonotactics. We proposed that phonological predictions might be generated within words, even in the absence of these additional sources of information. To test this, we performed two EEG Oddball experiments in which only phonological information was available to generate phonological predictions. Importantly, the deviant pseudowords used in these experiments were constructed by cross-splicing standard pseudowords. Therefore, each phoneme in a deviant pseudoword was acoustically identical to a phoneme in a standard pseudoword. The only feature that defined a pseudoword as deviant was that following the syllable Xn, instead of the usual syllable Xn+1, the syllable Yn+1, which belongs to a different pseudoword, was presented. In this way, the ERP responses registered in these experiments could not be elicited by low frequency of occurrence of a given sound, or a change in instantaneous low level auditory features, but by the violation of an abstract rule (Paavilainen, 2013). As the stimuli did not contain consecutive phoneme repetitions, the registered responses cannot be explained by stimulus specific adaptation. Additionally, this stimulus design avoids a common confound between repetition and expectation (Todorovic and de Lange, 2012; Heilbron and Chait, 2018).
In both of the experiments presented here, the occurrence of an unexpected sequence of phonemes reliably elicited an early prediction error signal, compatible with an MMN response (Näätänen, 2000; Näätänen et al., 2007). This ERP is a well-established prediction error signal that can be interpreted as the result of comparing a prediction with the actual bottom-up input (Friston, 2005; Garrido et al., 2009; Wacongne et al., 2011; Winkler and Czigler, 2012; Chennu et al., 2013; Paavilainen, 2013). The presence of this early prediction error signal, elicited by the presentation of an unexpected sequence of phonemes, can be considered as evidence that a prediction about the forthcoming phonemes had been made.
Experiments 1 and 2 differed in the instructions given to the participants. While in experiment 1 participants were instructed to count the occurrence of mistaken words (i.e., deviants), in experiment 2 they were not informed about the occurrence of deviants and were simply instructed to learn all the pseudowords. The aim was to induce in experiment 2 an attentional state that more closely resembles the one held during natural speech processing.
To confirm the effects of this attentional manipulation, we tested for the presence of a P3b component in both experiments. While clustering analysis detected clear P3b components in experiment 1, only smaller positivities were detected in experiment 2. This suggests that participants noted the difference in frequency of occurrence between STD and deviant pseudowords, even when they were not instructed to detect deviants. In line with this, the behavioral results from experiment 2 show that participants preferred STD pseudowords over both deviant pseudoword types.
Despite this, when contrasting the signals recorded between experiments, we could verify that the amplitude in the P3b time window was roughly four times higher in experiment 1. As the P3b component is an index of top-down attention (Sergent et al., 2005; Bekinschtein et al., 2009; Dehaene and Changeux, 2011; Faugeras et al., 2011; Chennu and Bekinschtein, 2012; Strauss et al., 2015), this difference indicates that the degree to which top-down attention was deployed was different between experiments.
Despite the difference in instructions and in concomitant top-down attention between experiments, unexpected sequences of phonemes reliably elicited an early prediction error signal. This suggests that phonological predictions can be deployed even if the task at hand does not require detecting abnormalities in the speech stream. Given that the results of our Bayesian analysis comparing the amplitude of prediction error across experiments were inconclusive, the modulatory role that top-down attention might exert on these predictions remains an open question. As the attention allocation held by the participants during experiment 2 closely resembles the one used for natural speech processing, these results imply that the language comprehension system proactively anticipates incoming phonemes within individual words.
One way in which these phonological predictions could be implemented is by extracting the local transitional probabilities between adjacent syllables (Endress and Mehler, 2009; Koelsch, 2016). Our data indicate that this is unlikely, as we found that the amplitude of prediction error signals was modulated by the number of syllables presented before the point of deviance. When two congruent syllables were presented before the point of deviance (XXY), the amplitudes were higher than when only one congruent syllable was presented (XYY). As the local transitional probabilities between X1 and X2 were the same as between X2 and X3 (0.92), this increase in amplitude indicates that the information used to generate predictions was not restricted to consecutive syllables. Instead, prediction strength was modulated by integrating information from several past phonemes.
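To make the notion of local transitional probability concrete, the sketch below estimates forward transitional probabilities between adjacent syllables from a list of pseudowords. The trial counts are hypothetical: they were chosen only so that both transitions equal 0.92, and they do not reproduce the actual design. The syllable labels and function name are likewise placeholders.

```python
from collections import Counter

def transitional_probabilities(words):
    """Estimate P(next syllable | current syllable) within words.

    `words` is a list of syllable tuples; transitions across word
    boundaries are ignored.
    """
    pair_counts = Counter()
    first_counts = Counter()
    for word in words:
        for current, following in zip(word, word[1:]):
            pair_counts[(current, following)] += 1
            first_counts[current] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}

# Hypothetical counts (not the real design): standards, XYY and XXY deviants.
stream = (
    [("X1", "X2", "X3")] * 529   # standard pseudoword
    + [("X1", "Y2", "Y3")] * 50  # XYY deviant: deviance at the 2nd syllable
    + [("X1", "X2", "Y3")] * 46  # XXY deviant: deviance at the 3rd syllable
)
tp = transitional_probabilities(stream)
```

In this toy stream, `tp[("X1", "X2")]` and `tp[("X2", "X3")]` are equal, so a predictor relying only on adjacent-syllable statistics would be expected to produce equally strong predictions at both positions.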
It has been shown that the number of phonological features differing between standards and deviants can modulate the amplitude of the MMN response (Cornell et al., 2013; Scharinger et al., 2016; Schluter et al., 2017). The difference in prediction error amplitude between deviant conditions could therefore, in principle, be explained by this factor. Following Mioni (1993) and Kramer (2009), who propose that in Italian affricates do not constitute a separate class of manner of articulation, the phonological features that change from STD to deviant in our stimulus set are the following. Syllables in the 2nd position (XYY deviant) differ in their consonant voicing, place of articulation and manner of articulation. Syllables in the 3rd position (XXY deviant) differ in their consonant voicing and place of articulation, and in their vowel height (Mioni, 1993; Kramer, 2009; Paoli, 2016). While it should be noted that whether all these phonological features have a neural representation is in itself an open debate (Hestvik and Durvasula, 2016; Politzer-Ahles et al., 2016; Schluter et al., 2016, 2017), in the case of our stimulus set, the number of phonological features that change for each deviant condition is the same, so feature count is unlikely to account for the amplitude difference.
Finally, when the point of deviance is reached, more time has elapsed from pseudoword onset in the case of XXY deviants, compared to XYY deviants. This difference in time from pseudoword onset could contribute to the difference in MMN amplitude, but we find this improbable. Behavioral gating experiments (Tyler, 1984) and MEG experiments (Brodbeck et al., 2018) have shown that between 50 and 100 ms from word onset are enough to generate a prediction regarding the initial phoneme of a word. In the case of XYY deviants, the point of deviance is reached 325 ms after pseudoword onset, which is more than three times the suggested minimum time for prediction generation. Therefore, the difference in elapsed time before deviance between conditions is unlikely to contribute to the observed difference in prediction error amplitude.
One tentative interpretation for the difference in prediction error amplitude between deviant conditions is that, as language processing is characterized by extensive communication across representational levels (Davis and Johnsrude, 2007;Kuperberg and Jaeger, 2016), a lexical level of processing could be involved. Specifically, when a phoneme of a word is perceived, this could be used to pre-activate that word's lexical representation, with consecutive phonemes reinforcing the prediction of congruent words.
Taken together, our results suggest that even when no higher-level linguistic information such as syntax and semantics is present, the human auditory system can use phonological information from several past phonemes to generate predictions about forthcoming phonemes. In the experiments presented here, participants were exposed to new pseudowords that were learned in a period of minutes. This implies a formidable capacity of the auditory system to learn sequences of phonemes composing new words and generate predictions within those words. This capacity might play a fundamental role in the difficult task of mapping a complex, variable and noisy signal such as speech into meaning. Moreover, the experiments presented here use stimuli and abstract rules that are more complex and ecologically valid than the ones routinely used in the study of auditory prediction, allowing us to show that the auditory system can proactively generate predictions.