Introduction

Lexical databases are a vital resource for the study of language, providing increasingly comprehensive information on the representations and distributions of words in spoken and written language, as well as behavioral measures of recognition (e.g., Balota et al., 2007). This information plays a fundamental role in the design, control, or interpretation of psycholinguistic experiments, and it is an indispensable component for the modeling of word recognition. As such, it could be argued that the development and widespread adoption of these databases has been one of the key supporting factors behind our current understanding of language processing, especially in areas such as lexical access and word recognition.

Lexical databases have been developed for a range of languages, although English is perhaps by far the best served in this respect. Estimates of written word frequency have long been available (Kučera & Francis, 1967; Thorndike & Lorge, 1944) and extended with phonological representations in databases such as the MRC Psycholinguistic database (Coltheart, 1981; Wilson, 1988). Additional resources also provide information on ratings of age of acquisition or the imageability of words (e.g., Bird, Franklin, & Howard, 2001; Gilhooly & Logie, 1980), and behavioral data, such as reaction times for words in naming and lexical decision tasks (e.g., Balota et al., 2007; Keuleers, Lacey, Rastle, & Brysbaert, 2012). Studies in, and of, French and Dutch have also benefited from a rich history and wide coverage of lexical databases (BruLex, Content, Mousty, & Radeau, 1990; BDLex, Pérennou & de Calmes, 1987; de Calmès & Pérennou, 1998; Lexique, New, Pallier, Brysbaert, & Ferrand, 2004; New, Pallier, Ferrand, & Matos, 2001; CELEX, Baayen, Piepenbrock, & van Rijn, 1993) and recent behavioral measures (Ferrand et al., 2010; Keuleers, Diependaele, & Brysbaert, 2010). After English, French, and Dutch languages, lexical database coverage for other occidental languages becomes relatively sparse, with German described in the CELEX lexicon and phonological transcriptions and other information available for Spanish (LexEsp, Sebastián-Gallés, Martí, Carreiras, & Cuetos, 2000) and Greek (IPLR, Protopapas, Tzakosta, Chalamandaris, & Tsiakoulis, 2012)).

For Italian, we are aware of four freely accessible lexical databases. LEXVAR (Barca, Burani, & Arduino, 2002) provides naming latencies and psycholinguistic variables such as age of acquisition, imageability, adult and child frequency measures, and orthographic neighborhood size for 626 simple nouns. Colfis (Bertinetto et al., 2005; Laudanna, Thornton, Brown, Burani, & Marconi, 1995) has estimates of written frequency of use, derived lemmas, and syntactic part-of-speech tags for over 180,000 word forms. Syllables PD/DSS is a database of 2,719 orthographic syllables, provided with positional token frequency estimates derived from over 11 million word occurrences. Finally, a database by De Mauro, Mancini, Vedovelli, and Voghera (1993) provides frequency estimates for words across a 500,000 word corpus of spoken Italian. Unfortunately, none of these lexica provide phonological transcriptions of Italian words,Footnote 1 meaning that there is no large-scale database that covers the spoken forms and associated phonological variables for this language. It is highly possible that the lack of this type of database stems from the perception that Italian orthography is highly transparent (e.g., Maraschio, 1993), with a relatively simple bi-univocal mapping between grapheme and phoneme that could make word-level phonological transcription largely redundant. However, while Italian can be classified as being toward the extreme end of orthographic transparency, many of the relationships between orthography and phonology are not simple one-to-one mappings. These can require more complex rules that can take account of wider phonological or orthographic contexts (see Burani, Barca, & Ellis, 2006). Moreover, some phonological contrasts are not represented in the orthography, meaning that translation between representations can be a laborious process.

One example of a complex mapping rule relates to velar plosive and affricate sounds, which are both represented in the orthography by “g”and “c” in combination with other characters. The velar plosive /g/ is realized by the letter “g” if followed by the vowels “o,” “a,” or “u,” but by the bigram “gh” if followed by the vowels “i” and “e.” In contrast, the affricate /ʤ/ is realized by the letter “g” if followed by the vowels “i” and “e,” but by the bigram “gi” if followed by the vowels /a,o,u/ (thus, /ge/ > “ghe,” /gi/ > “ghi,” /ʤe/ > “ge,” /ʤi/ > “gi;” /go/ > “go,” /ga/ > “ga,” /gu/ > “gu,” but /ʤa/ > “gia,” /ʤo/ > “gio,” /ʤu/ > “giu”). The same rules hold for the unvoiced counterparts of these segments (/k/ and /c/). Some palatal sounds are also represented in the orthography by more than one letter (e.g., fricative /S/ > “sci,” nasal /N/ > “gn,” lateral /L/ > “gli”; but see affricate /Z/ > “z”). These phonemes, moreover, are always geminated in Italian, but the orthography represents them as a singleton. The Italian phonology has a large number of geminate consonants (e.g., 19 % of consonants by frequency type are geminate), and germination is a contrastive feature for the majority of consonants (e.g., pala [spade] vs. palla [ball]; poro [pore] vs. porro [leek]). The phonemes listed above, however, are present only in geminate form. Therefore, the orthography does not represent what would amount to redundant information (e.g., azione > az.zjo.ne, agnello > aN.Nel.lo, aglio > aL.Lo). Another example is the grapheme “h,” which has no phonological counterpart but is still contrastive in orthography (e.g., hanno [They have] vs. anno [year]). Conversely, the phonological contrast in openness between /e/ and /ɛ/ and /o/ and /ɔ/ in standard ItalianFootnote 2 can be lexically distinctive (e.g., /pɛska/ [peach] vs. /peska/ [fishing]) in stressed syllables, but these phoneme pairs are represented by the single graphemes “e’ and “o,” respectively. While stress can provide a cue to vowel aperture, with “e” or “o” usually corresponding to open vowels in stressed syllables (e.g., fra.tɛl.lo [brother] and fɔ.to [photo]), the frequent exceptions (e.g., in.so’r.ge.re [to rebel]) mean that this cue is indicative at best, requiring that phonological vowel aperture is established on an item-by-item basis.

The types of irregularities described above mean that Italian orthography does not provide a sufficiently accurate representation of the Italian phonology for many applications, from robust control of psycholinguistic stimuli, to statistical examinations of cross-linguistic contrast, to analyses of frequency effects in children and in language-impaired populations (e.g., aphasic patients, children with specific language impairments). In this article, we present an open-access lexical database designed to fill this gap, by providing phonological transcriptions across a wide range of Italian word forms, as well as a range of derived psycholinguistic variables, such as phonological neighborhood measures, plus statistical summaries of phoneme and syllable use. This article describes the methodology behind the construction of this database, describes the information provided in the lexical and derived databases, and provides statistical summaries of the data held within them. We will also present an example of the usefulness to this database by applying a study designed to examine aphasics' phonological errors. Another example of how the statistics derived by the database can be used to inform our understanding of language processing and its universal basis is presented in Romani, Galluzzi, and Goslin (2013).

Methodology

The basis for this lexicon was Colfis (Bertinetto et al., 2005; Laudanna et al., 1995), a database of written Italian word forms derived from 3,798,275 textual occurrences from a corpus of newspapers (1,836,119), magazines (1,306,653), and books (655,503) published between 1992 and 1994. This originally consisted of 188,792 word forms, each with fields describing their part-of-speech tag and the frequency of occurrence across the three textual sources. Using these Colfis word forms, we made an initial screening to remove all entries that contained non-alphabetic characters apart from the apostrophe. This resulted in the removal of 44,376 phrases (such as “in giro”) and 1,266 nonwords (such as “-se-“) and minor corrections to 2,294 word forms (e.g., changing the entry “canaletto (m)” to “canaletto”). The remaining word forms were then subjected to further manual screening, resulting in the removal of an additional 5,939 nonwords (such as “fndo”) and 17,211 imported words (such as “Dorothy”). It should be noted that not all imported words were removed in this screening process; any considered to be in current usage (such as “film” or “Marx”) remain in the database.

At the end of the screening process, exactly 120,000 word forms remained (63.56 % of the original Colfis word forms) as candidates for phonological transcription. The first stage of the process was implemented using the phonological transcription module from the Italian Festival text to speech system (Cosi, Gretter, & Tesser, 2000). This generated a phoneme string for each of the word forms, with additional markers for syllable boundaries and primary syllable stress. These representations were then converted from Festival’s SAMPA phonemic alphabet to a custom alphabet in which each of the 29 individual Italian phonemes labeled in the lexicon could be presented by a single standard text character, as described later in Table 2. It is worth noting that this transcription does not make a distinction between the alveo-palatal fricatives /s/ and /z/. This is because these phonemes are not used in a contrastive fashion in Italian, and differences in their distribution are a matter of regional preferences or an allophonic variation dependent on context. For example, the unvoiced allophone /s/ is used before voiceless consonants (as in “scarpa”), while the voiced allophone /z/ is used before voiced consonants (as in “sgravio”). Since our aim was to provide a phonological and not a phonetic description of Italian words, we transcribed both allophones with the same symbol (/s/; see later sections for more details). The placement of syllable boundaries was then modified where necessary to conform to Italian-specific syllabification rules based upon those created by Laporte (1993) for French. These rules dictate minimal syllable onsets, such that the syllable boundary should be placed before the last segment of an intervocalic consonant cluster that is not a glide (see Goslin & Frauenfelder, 2001, for a comparison of syllabification algorithms). This means that intervocalic syllable onsets would consist of a single consonant by default, such as in /vOl.ta/ (“volta”), /as.ta/ (pole)*. Exceptions, however, involve obstruent segments that are immediately followed by a liquid (e.g., /pl/), since these clusters are treated as tautosyllabic. Moreover, if there is a glide immediately preceding the vowel, the onset is extended to include another consonant, if one is available, such as in /stO.rja/ (“storia”) or /GraZ.Zje/ (“grazie”). Finally, both exceptions can combine to produce an onset consisting of an obstruent, liquid, and glide, such as in /is.trja/ (“istria”).

Each of the generated phonological representations (and syllable stress and boundary markers) was then manually checked by the second author, with additional random spot-checking from the final author, both of whom are native Italian speakers. Any disagreements were settled by discussion. The transcription was intended to conform to a standard Italian pronunciation that is generally uncontroversial, apart from some alternations between /e/–/ε/ and /o/–/ɔ/, which are subject to regional variations. Even in these cases, representations are intended to approximate a “standard” pronunciation, although both of these native Italian linguists have the regional accent of Rome, which may color their judgments. Multiple redundant checking meant that each phonological representation was verified at least twice. It was found that 28,168 representations required some form of manual correction (30.67 % of the lexicon).

An evaluation of the reliability of the phonological representations was made via blind phonemic transcription of 500 word forms selected at random from the database. These were hand transcribed using the phonetic alphabet adopted by phonItalia by a native Italian speaker that was independent of the development of the lexicon. Point-to-point agreement was calculated between each of the 2,917 phonemes representing those 500 words in the database and the independent transcription. Phonemic insertions or deletions made by the independent transcriber not found in the lexicon were also counted as errors. This comparison revealed phonemic agreement of 98.35 %, with a kappa of 0.983. It should be noted that the majority of the disagreements (28 of a total of 48) were due to differences in the marking of vowel aperture (/e/–/ε/ or /o/–/ɔ/), likely due to regional differences in the representations used by the original phonItalia linguists (Rome) and that of the independent transcriber (Florence).

Lexical statistics

As was described in the previous section, this new lexicon provides phonological representations for 120,000 Italian word forms, along with associated syllable boundary and stress markers. While the Colfis database provides frequency, part-of-speech tags, and the lemma for each word form (a description of original Colfis fields is provided on phonItalia.org), phonItalia augments this information with a range of additional fields that provide information related to both the phonological and orthographic representations of the words.

Additional orthographic fields include the consonant vowel structure of the word, the number of homographs of that word, and the uniqueness point—that is, the letter at which the orthographic representation becomes unique. Since the uniqueness point lists a value of zero if the representation never becomes unique, an additional field is also included that lists the uniqueness point minus one (OrthUniqM1). For nonunique words, this field will have a value of the length of the word and, thus, avoids the potential skewing in summarizing statistics that could result from the zero values of the uniqueness point field. All of these fields have also been reproduced for the phonological representation of the words, with a number of further additions. For the phonological vowel consonant structure, consonants that are in geminate pairs are given the representation “G” rather than “C.” For example, /kap.pot.to/ is /CVG.GVG.GV/. Other fields have been added that relate to syllabic information, listing the number of syllables in the word, the position of the stressed syllable, and a phonological representation that includes syllable boundary markers (denoted by “.”).

Each word is also provided with estimations of both orthographic and phonological neighborhood; these have been estimated using measures of Colheart’s N (Coltheart, Davelaar, Jonasson, & Besner, 1977) and Levenshtein distance. Coltheart’s N is calculated as the number of lexical character sequences that can be constructed by changing a single character of the current entry while the position and identity of the remaining characters remain unchanged. All neighboring lexical entries that are homographs were grouped and counted as a single neighbor. The Levenshtein distance is the number of single insertions, deletions, or substitutions required to change from one character string to another. To calculate this value, the Levenshtein distance between the orthographic/phonological representations of the current entry and all other unique orthographic/phonological entries in the lexicon are calculated. The reported orthographic/phonological Levenshtein distance (OLD/PLD) 20 is the mean of the 20 smallest distances found. Additional fields related to these metrics include estimates of the total frequency of neighbors and also estimates of the number and frequency of those with higher or lower frequency than the target word. Finally, the main phonItalia database also provides mean and summed frequencies of the orthographic and phonological bigrams contained within each word (individual character bigram and biphone statistics are also made available in a separate derived database described below).

For all fields that required calculation based upon estimate of frequency of use (such as Phon_N_MFreq, mean logFootnote 3 frequency of words in the phonological neighborhood), we based this upon the Colfis total frequency estimate field fqTot. All of the new data fields included in phonItalia are shown in Table 1, along with a summary of the global statistics for numeric fields calculated across the entire lexicon.

Table 1 Summary of phonItalia main database fields and summarizing statistics (where appropriate)

Derived sublexical statistics

The provision of phonological word forms within this lexicon allows for the first comprehensive estimation of the relative frequency of occurrence of Italian phones, syllables, and other phonological representations. These have been calculated across all word forms within the lexicon to produce both nonpositional and positional type and token frequency measures. Type frequency measures (identified by the fields TypeF) refer to the number of times a particular unit (phoneme, syllable, etc.) occurs within the words of lexicon, with each word counted once. Token frequency (identified by the fields TokenF, with the natural log of this value found in the field LnTokenF) refers to the number of times a unit occurs in the words of the language taking into account the frequency of the words. Thus, phoneme occurrences are multiplied by the frequency of the words in which they occur and then summed. All token frequencies are calculated using total lexical frequency measure from Colfis (field name fqTot). Multiple instances of a unit within a word are additive, so the type count for /p/ would be incremented twice for the word /prO.prjo/ (“proprio”), and the token count increased by twice its lexical frequency (2 * 2,408). Estimates for phone frequency are provided both overall and relative to syllabic position (see below for more details). In addition, overall frequency data for different types of multiconsonantal syllable onsets are provided (e.g., the frequency of onsets like, /p/, /pr/, /pl/, or /str/). Syllable frequencies are provided overall and according to word position. Character bigram and biphone frequency statistics have also been calculated across the lexicon, with frequency estimates provided relative to word and (for biphones) syllable position. This information is provided in a number of additional databases separate to the main lexicon, the contents of which are summarized in the following sections. As with the main lexicon, all these additional databases are available from the lexicon Web site in Excel and tab-delimited text format. The source code and program used to generate these derived statistics (as well as update statistics in the main word forms database, such as bigram frequency or uniqueness points) are also available in from the database Web site phonItalia.org.

Phone statistics

This database provides the frequency of occurrence for all 29 Italian phones used within this lexicon. Overall phonemic frequencies of use are summarized in Table 2, with the database also providing statistics for phones relative to specific syllabic positions. These fields are as follows:

  • Single Onset provides statistics for phones found in a single-consonant syllable onset—for example, the phone /n/ in the word /a.E.ro.pla.no/.

  • Onset /Cc/ for phones found in the first consonant of a double-consonant syllable onset—for example, /p/ in /a.E.ro.pla.no/.

  • Onset /cC/ for phones in the second consonant of a double-consonant syllable onset—for example, /l/ in /a.E.ro.pla.no/.

  • Onset /Ccc/ for phones in the first consonant of a triple-consonant syllable onset—for example, /G/ in /Gan.Gljo /.

  • Onset /cCc/ for phones in the second consonant of a triple-consonant syllable onset—for example, /l/ in /Gan.Gljo/.

  • Onset /ccC/ for phones in the third consonant of a triple-consonant syllable onset—for example, /j/ in /Gan.Gljo/.

  • Nucleus for phones that form the nucleus of a syllable—for example, /o/ is twice found as a nucleus in /a.E.ro.pla.no/.

  • Single coda provides statistics for phones found in a single-consonant syllable coda—for example, /n/ in the word /lan.ce/.

  • 1st coda for phones in the first consonant of a syllable coda (greater than one consonant in length)—for example, /l/ in /film film/.

  • 2nd coda for phones in the second consonant of a syllable coda (greater than one consonant in length)—for example, /m/ in /film/. There are very few of these cases in Italian.

  • Geminate provides statistics on phones that are found in geminate position in a word—for example, /g/ in the word /mag.go.re/. Table 3 provides a summary of the relative frequency of consonant occurrence when geminate (e.g., /n/ in /dOn.na/ “donna”) or nongeminate (e.g., /n/ in /pun.to/ “punto”).

Table 2 Summary of phone frequency of occurrences and the proportion of total frequency across the lexicon, ordered by type frequency
Table 3 Summary of relative geminate and nongeminate frequencies for consonants

Syllable statistics

This database contains calculations of the frequency of use for the 3,626 unique syllables found within the lexicon. An observation worth noting is that phonological syllables appear to be far more numerousFootnote 4 (33 % more types) in Italian than do orthographic syllables, with only 2,719 listed in PD/DPSS Syllables (Stella & Job, 2001). This serves to highlight the degree of ambiguity between the Italian orthography and phonological representations. A summary of the distribution of phonological syllabic frequency by syllable length is shown in Table 4, with a similar summary of syllable stress as a factor of length in Table 5. As in the phone database, type and token frequencies are provided for all occurrences, irrespective of their word position, with additional statistics for occurrences in specific word position, as follows:

Table 4 Summary of the frequency of use for syllables according to length
Table 5 Distribution of syllable stress by type frequency and token frequency (in parentheses) according to the number of syllables in each word
  • MonoSyll provides frequency information for syllables that occur in monosyllabic words.

  • Initial is the field that describes syllables that occur word initially in multisyllabic words—for example, /ti/ in /ti.fa.no/.

  • Medial provides frequency information for syllables from multisyllabic words that are not in either word-initial or word-final position—for example, /ti/ in /ul.ti.mo/.

  • Final gives frequency information for syllables found in multisyllabic words that are word final—for example, /ti/ in /van.ti/.

A subset of this syllabic frequency information, containing the 100 most frequent syllables, is listed in Appendix Table 8, ordered by token frequency. In addition to the overall syllabic data, each syllable in the database is also provided with additional fields with the frequency of occurrence for the corresponding phone sequence irrespective of syllable boundaries. The previous syllable fields only include frequencies for phone sequences that respected syllable boundaries, such as the syllable /par/ in the word /par.ti.ta/. In the following n-Gram type sequence frequency statistics, the token and frequency calculations also include occurrences of the same phone sequences that cross syllable boundaries, such as /par/ in the word /pre.pa.ra/.

  • PhonSeq_Total gives the frequency of occurrence for the phone sequence of the syllable in the lexicon irrespective of syllable boundaries.

  • PhonSeq_Word_Initial is similar to PhonSeq_Total but includes the statistics only for words where the syllable phone sequence is found word initially. For example, statistics for the syllable /tar/ would include an occurrence for the word /ta.ra.re/, but not in /kon.ta.re/.

Syllable onsets and codas

To complement the previously described syllabary, separate databases are also made available that describe each of the 132 syllable onsets and 58 syllable codas, summarized by length in Table 6. In these databases, the type and token frequencies of each particular onset or coda are provided. The onset and coda databases also list a blank entry that has been included to provide statistics for the occurrence of syllables with an empty onset (e.g., the syllable /ar/) or coda (e.g., the syllable /si/). As in the syllabary, these statistics are provided for all occurrences irrespective of word position, plus those found in particular word position, as described below.

Table 6 Summary of the frequency of use for syllable onsets and codas according to length
  • Total gives statistics for syllable onsets or codas found in any word position.

  • Word Initial gives statistics for syllable onsets found in word-initial position—for example, /t/ in /ti.fa.no/.

  • Word Medial provides statistics for both syllable onsets and codas that are medial to the word—for example, the onset /d/ or the coda /n/ in /mon.do/.

  • Word Final provides statistics only for syllable codas that are found in word final position.

  • Geminate is a subset of the word medial statistics, and is limited to syllable onsets or codas that are geminate—for example, the onset and coda /l/ in /al.lo/.

For clarity, syllable onsets and codas have also been split into their constituent consonants, with each consonant held in separate fields.

  • Number of phones is the number of phones in the syllable onset or coda.

    • 1st phone is the 1st (leftmost) phone in the syllable onset or coda—for example, /p/ in the onset /pl/ or /l/ in the coda /lm/.

    • 2nd phone is the 2nd phone in syllable onset or coda—for example, /l/ in the onset /pl/ or /m/ in the coda /lm/.

    • 3rd phone is the 3rd phone in syllable onset or coda; this would be blank in the example of /pl/ or would be /s/ in the coda /rks/ from “Marx”.

    • 4th phone is the 4th phone in syllable onset (this field is missing in the coda database).

Character bigram and biphone statistics

Two separate databases provide statistics covering 577 biphones and 478 character bigrams calculated across the lexicon. This information is provided for all occurrences, but additional statistics are provided for occurrences relative to word position, with biphones also having statistics for occurrences relative to syllable position.

  • Word Initial gives the statistics of bigrams that occur in word-initial position—for example, the biphone /ko/ in /kon.trad.det.te/ or the character bigram “se” in “sempre.”

  • Word Medial has statistics for bigrams that occur word medially—for example, the biphone /on/ in /kon.trad.det.te/ or the character bigram “mp” in “sempre.”

  • Word Final gives frequency information for bigrams that occur word finally—for example, the biphone /te/ in /kon.trad.det.te/ or the character bigram “re” in “sempre.”

  • Syllable Onset gives frequency statistics for biphones that are found in syllable-initial position—for example, /tr/ in /kon.trad.det.te/. This would include all occurrences in which the first and second phones of the biphone and syllable were shared.

  • Syllable Medial provides statistics for biphones found in syllable-medial position—for example, /ra/ in /kon.trad.det.te/. This would include all occurrences where neither the first nor second phone of the biphone coincided with the initial or final phone of a syllable.

  • Syllable Final gives frequency statistics for biphones that are found in syllable-final position—for example, /et/ in /kon.trad.det.te/. This would include all occurrences in which the final and penultimate phones of the bigram and a syllable were shared.

  • Cross Syllable biphones are those that cross syllable boundaries—for example, /nt/ in /kon.trad.det.te/. In this case, the first phone of the biphone must consist of the final phone of the syllable preceding the boundary, and the second phone the first phone of the syllable that precedes the boundary.

Orthographic character statistics

This database contains calculations of the frequency of use for 27 orthographic characters used in the word forms of the lexicon, including the apostrophe, irrespective of word position.

Application of lexical statistics to analyses of aphasic errors

Analyses of speech errors have played a very important role in constraining models of speech production, and they are a crucial tool for diagnosing the level of impairment in patients suffering from language difficulties following a stroke (aphasia). While analyses of the relationships between word frequency and errors are routinely used as a diagnostic tool, analyses of the influence of phoneme frequency have been very limited in their scope.

Early studies by Blumstein (1973, 1978) found no difference in frequency effects between small groups (n = ~ 6) of Broca, Wernicke's, and conduction aphasics. However, a larger study by MacNeilage (1982) contrasted 20 English-speaking nonfluent aphasics (with possible apraxic difficulties) with 10 fluent aphasics. He found that target error rates were greater in low- than in high-frequency phonemes (frequency correlated with percentage of errors), but only in the nonfluent group. In contrast, the incidence of intruding segments was found to increase with phoneme frequency across both groups, an effect also found by Robson, Pring, Marshall, and Chiat (2003) in a fluent patient with jargon aphasia. Goldrick and Rapp (2007) also reported contrastive effects, with an effect of frequency in a patient with a postlexical locus, but not in a patient with lexical phonological impairment.

An examination of the limited evidence from these studies suggests that it may only be apraxic patients, those with articulatory difficulties, who have greater difficulties in computing the articulatory programs associated with low-frequency phonemes. This hypothesis would predict an inverse relationship between articulatory complexity and phoneme frequency, with high-frequency phonemes being easier to articulate. For other patients, phonological errors do not appear to be due to difficulties in computing articulatory programs, but they occur because of confusion in lexical representations or difficulties in selecting the right phonemes for a word. For these patients, frequency will not affect the ability to produce target phonemes, although more frequent phonemes may still be selected erroneously over the actual targets.

In our study, we examine whether the relationship between phoneme frequency measures from phonItalia and the distribution of production impairments can be used to distinguish between these different of types of aphasic patients.

Method

Patients

Two patient subgroups were selected from a patient pool of 24 patients, all of whom had confirmed diagnosis of aphasia. Of these, 22 had suffered from left-hemisphere stroke, one from right CVA, and one from close head injury. All had been selected due to the high number of phonological errors they exhibited across a range of speech production tasks, an absence of peripheral dysartric difficulties (e.g., systematically distorted speech), and relatively good phonological discrimination abilities. Further details of this particular set of patients can be found from previous studies (see Romani & Galluzzi, 2005; Romani, Galluzzi, Bureca, & Olson, 2011; Romani, Galluzzi, & Olson, 2011). Subgroups were selected on the basis of particularly high or low rates of phonetic errors. The 11 members of the phonological-apraxic (ph-apraxic) group were selected because they made more than 10 % of phonetic errors, while the nine phonological-selection (ph-selection) patients made fewer than 5 % phonetic errors.

Task and analyses

Patients were asked to repeat 773 words, with a phonemic transcription made of their repetitions. Analyses were limited to phoneme substitution errors. Following the procedure of MacNeilage (1982), we examined the correlation between the percentages of times a phoneme was substituted in error (replaced rates) and its token frequency from phonItalia. We also conducted a separate analysis of the correlation between the number of times each phoneme type was used instead of targets in the substitution errors (replacing numbers) and its token frequency count. Phonemes /N/, /L/, /S/, and /z/ were removed from the analyses, since these segments are always geminate, which could have reduced error rates. Deletion and insertion errors were not included in the analyses. Patients generally avoid the production of phonotactically illegal sequences and/or difficult sequences of vowels, and for this reason, only a limited set of consonants can be deleted (sonorants in certain syllabic positions; see Romani, Galluzzi, Bureca, et al., 2011, for an explanation). This limits the potential scope of analyses on deletion and insertion errors.

Results and discussion

A summary of the results can be seen in Table 7. It was found that there was a significant negative correlation between the percentage of substitution errors and phoneme frequency in the ph-apraxic patients, r = −.50, p < .05, but no significant correlation in the ph-selection patients, r = −.22, p = .36. An examination of the relationship between the number of times a phoneme was used as a replacement and its frequency revealed significant positive correlations in both the ph-apraxic, r = .55, p < .05, and the ph-selection, r = .87, p < .001, patient groups. We also conducted linear regression analyses with frequency and patient group as predictors of rate of errors on the different phonemes and number of times different phonemes were used as replacements. For rate of errors, we found a marginally significant interaction between frequency and group, F(1, 33) = 3.93, p = .056. Individual analyses showed that frequency was significant for the apraxic groups, F(1, 17) = 5.26, p = .036, but not for the phonological group, F(1, 17) = 0.85, p = .37. The linear regression predicting the number of times different phonemes were used as replacements showed no significant interaction between frequency and group, F(1, 33) = 2.01, p = .17, but there was a significant main effect of frequency, F(1, 34) = 13.6, p < .001.

Table 7 Substitution errors made by phonological-apraxic and phonological-selection aphasic patients

The error rate results support our original diagnostic division between patients where phonological errors are motivated either by difficulties with the articulatory production of the phonemes (in the ph-apraxic group) or by difficulties in the selection of the right phonemes (in the ph-selection group). Moreover, it also points toward a relationship between phoneme frequency and articulatory complexity. Frequency influenced rate of substitutions only in the ph-apraxic group. It is possible that, in this group, errors on the low-frequency segments are more likely because, generally, these are the segments more difficult to articulate. These results are consistent with those of an earlier study (MacNeilage, 1982) and also with findings of the effects of syllable frequency in patients with apraxia of speech (Aichert & Ziegler, 2004; Staiger & Ziegler, 2008), but not in patients with more central phonological difficulties (Wilshire & Nespoulous, 2003; but see also Laganaro, 2008, for inconsistent results). These findings lend support to studies showing how phonological complexity and frequency can be used to selectively identify and characterize apraxic patients (Romani & Galluzzi, 2005; Romani, Galluzzi, Bureca, et al., 2011; Romani, Granà, & Semenza, 1996). Both analyses of frequency and complexity highlight important differences between types of patients that are not well recognized in the literature but that can have important implications for diagnosis and rehabilitation (see Blumstein, 1973, 1978).

Our results also revealed a significant positive correlation between the frequency of phonemes and how many times they are used as replacing phonemes across both patient groups. This result is an apparent contrast with the results of a recent study where we show that articulatory complexity does not influence which phonemes are used as replacement in the phonological group (Galluzzi, Bureca, Guariglia & Romani, 2013). It is possible, however, that, although strongly related, frequency and articulatory complexity of phonemes are partially independent variables. Thus, for patients without articulatory difficulties, frequency is a stronger variable than complexity in informing choice among alternatives and, therefore, in determining which phonemes are used as replacements. Similarly, in Romani et al., (2013), we found that complexity and frequency were strongly correlated when predicting age of acquisition in Italian children, indicating that within-language phoneme frequency is influenced by articulatory complexity. However, it must be noted that data from the latter study also point to other factors, independent of complexity, that influence the distribution of phoneme frequency.

Conclusion

The primary aim of this project was to produce a lexical database for Italian that would include the phonological transcriptions of word forms. This database includes a comprehensive set of common psycholinguistic variables to cover both the spoken and written modalities. The first use of this resource has been to produce a set of derived databases that include frequency-of-use statistics for Italian across a range of units, including both phonemic and syllabic units. All of these databases are open-access, available from the Web site phonItalia.org formatted in Excel, and tab-separated text format, freely distributed under a creative commons license. This resource will be of utility across a wide range of research, from the design or analysis of psycholinguistic experiments with Italian stimuli and in natural language processing and cross-linguistic applications. It is hoped that the distribution of this database under an open-access license will encourage further extensions or changes to the databases in the future. Finally, we have shown how important conclusions can be derived from applications of some our derived statistics. In particular, we demonstrated that analyses of phoneme frequency (as well as word frequency) on speech errors can provide important cues as to the locus of an individual’s language impairment.