
Speech Communication

Volume 51, Issue 7, July 2009, Pages 622-629

Do ‘Dominant Frequencies’ explain the listener’s response to formant and spectrum shape variations?

https://doi.org/10.1016/j.specom.2008.12.003

Abstract

Psychoacoustic experimentation shows that formant frequency shifts can give rise to more significant changes in phonetic vowel timbre than differences in overall level, bandwidth, spectral tilt, and formant amplitudes. Carlson and Granström’s perceptual and computational findings suggest that, in addition to spectral representations, the human ear uses temporal information on formant periodicities (‘Dominant Frequencies’) in building vowel timbre percepts. The availability of such temporal coding in the cat’s auditory nerve fibers has been demonstrated in numerous physiological investigations undertaken during recent decades. In this paper we explore, and provide further support for, the Dominant Frequency hypothesis using KONVERT, a computational auditory model. KONVERT produces auditory excitation patterns for vowels by performing a critical-band analysis; it also simulates phase locking in auditory neurons and outputs DF histograms. The modeling supports the assumption that listeners judge phonetic distance among vowels on the basis of formant frequency differences as determined primarily by a time-based analysis. However, when instructed to judge psychophysical distance among vowels, they can also use spectral differences such as formant bandwidth, formant amplitudes and spectral tilt. Although there has been considerable debate among psychoacousticians about the functional role of phase locking in monaural hearing, the present research suggests that detailed temporal information may nonetheless play a significant role in speech perception.

Section snippets

Acoustic bases of vowel percepts

Traditionally, vowel quality has been specified acoustically in terms of the first, second and third formant frequencies. F1 and F2 are the main determinants of vowel color. For back vowels the contribution of F3 is negligible, but including it makes it possible to characterize the F2–F3 proximity of retroflexion and front rounded vowels. Higher formants seem more linked to individual voice characteristics.
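The formant-based specification above can be illustrated by treating each vowel as a point in (F1, F2, F3) space and comparing vowels by Euclidean distance. A minimal sketch follows; the formant values are illustrative textbook-style figures for American English vowels, not measurements from this study:

```python
import math

# Illustrative (F1, F2, F3) values in Hz; hypothetical examples,
# not data from the paper.
VOWELS_HZ = {
    "i": (270, 2290, 3010),
    "a": (730, 1090, 2440),
    "u": (300, 870, 2240),
}

def formant_distance(v1, v2, formants=VOWELS_HZ):
    """Euclidean distance between two vowels in F1-F2-F3 space (Hz)."""
    return math.dist(formants[v1], formants[v2])

# /i/ and /u/ are separated mainly along F2; /a/ and /u/ mainly along F1.
print(formant_distance("i", "u"))
print(formant_distance("a", "u"))
```

A raw Hz metric like this overweights the higher formants; perceptually motivated models typically compute such distances on an auditory (e.g. Bark) scale instead.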

The formant-based approach is validated by, among other things, the success of formant

Listener responses to formant and spectrum variations

The remarkable thing about all these amplitude variations is that, although they can be drastic and can be perceived, they seem to leave the timbre/vowel quality component of the stimulus virtually unaltered.

For instance, in trying to recreate a recorded utterance by means of high-fidelity copy synthesis, phoneticians have noted that the percept of a vowel’s “phonetic quality” (that is, its “timbre”, transmitted in parallel with voice quality and channel characteristics) can be astonishingly

Phase locking

Comparing the auditory system to an analog or simulated spectrograph highlights a significant difference between biological and current technological sound analysis. Whereas conventional speech spectrography averages the temporal output from the analysis filters, the auditory system takes the process further, making ingenious use of this information. Simplifying, we can say that the ear’s operation is similar to that of a filter followed by a zero-crossing counter.

Suppose we examine the output of a
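The “filter followed by a zero-crossing counter” idea can be sketched computationally: band-pass one channel, then estimate the frequency dominating that channel from its zero-crossing rate. The filter design details below (Butterworth order, band edges, sampling rate) are illustrative assumptions, not a description of any particular auditory model:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 16000  # sampling rate in Hz (assumed for this sketch)

def channel_dominant_frequency(signal, low_hz, high_hz, fs=FS):
    """Band-pass the signal, then estimate the dominant frequency
    from the zero-crossing count (two crossings per period)."""
    sos = butter(4, [low_hz, high_hz], btype="band", fs=fs, output="sos")
    y = sosfiltfilt(sos, signal)
    crossings = np.sum(np.abs(np.diff(np.sign(y))) > 0)
    duration = len(signal) / fs
    return crossings / (2.0 * duration)

# A pulse train with F0 = 100 Hz has harmonics at multiples of 100 Hz;
# a channel centred near 500 Hz "locks" to the 500 Hz harmonic.
t = np.arange(0, 0.5, 1.0 / FS)
pulses = np.zeros_like(t)
pulses[:: FS // 100] = 1.0
print(channel_dominant_frequency(pulses, 450, 550))  # close to 500 Hz
```

The zero-crossing count is a crude stand-in for phase locking, but it captures the key point: the channel reports the frequency of the strong harmonic it contains, not its own centre frequency.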

DOMIN: an auditory spectrograph

Most public-domain software tools for speech analysis do not incorporate time-place representations. The “auditory” spectrograph described by Carlson and Granström (1982) is a pioneering exception: spectrograms were produced using critical-band analysis, and filter outputs were displayed in phon/Bark units. Included in the model was a Dominant Frequency (DF) representation showing the number of channels dominated by (read: ‘phase locked to’) a certain frequency plotted against
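The Bark axis of such phon/Bark displays maps frequency onto critical-band rate. A minimal sketch, using Traunmüller’s (1990) approximation of the Bark scale as an assumption on our part (the original DOMIN implementation may use a different mapping):

```python
def hz_to_bark(f_hz):
    """Traunmüller's approximation: critical-band rate z in Bark."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

# The ~24 critical bands cover roughly 20 Hz to 15.5 kHz; note how the
# scale compresses the higher frequencies.
for f in (100, 500, 1000, 2000, 4000):
    print(f, round(hz_to_bark(f), 2))
```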

Formants

Fig. 4 combines two KONVERT representations: a DF histogram and a sone/Bark pattern. The vowel is [ε]-like, computed with F1 = 500, F2 = 2000, F3 = 2700, F4 = 3300 Hz and F0 = 100 Hz. The x-axis represents frequency (in Hz). The number of channels is read along the left ordinate, and the sone/Bark values along the second y-axis (right). Ten bins per Bark were used.
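The sone/Bark half of such a display can be sketched by summing harmonic power inside 1-Bark-wide bands and converting to sones. Everything below is a simplifying assumption rather than KONVERT’s actual processing: the vowel synthesis (resonance-shaped harmonic amplitudes), the band layout, and the loudness shortcut of treating dB SPL as phons:

```python
import math

def hz_to_bark(f):
    # Traunmüller's approximation of the Bark scale.
    return 26.81 * f / (1960.0 + f) - 0.53

F0 = 100.0
formants = (500.0, 2000.0, 2700.0)
# Harmonic amplitudes shaped by simple resonance peaks at the formants
# (a crude stand-in for a real vowel spectrum).
harmonics = {k * F0: sum(1.0 / (1.0 + ((k * F0 - fm) / 150.0) ** 2)
                         for fm in formants)
             for k in range(1, 40)}

def band_sones(z_lo, z_hi):
    """Total power of harmonics falling in [z_lo, z_hi) Bark, in sones."""
    power = sum(a * a for f, a in harmonics.items()
                if z_lo <= hz_to_bark(f) < z_hi)
    if power == 0:
        return 0.0
    level_db = 10 * math.log10(power) + 60.0  # arbitrary reference level
    return 2 ** ((level_db - 40.0) / 10.0)    # phon -> sone shortcut

pattern = [band_sones(z, z + 1) for z in range(0, 24)]
# Bands containing F1 (~4.9 Bark) and F2 (~13 Bark) stand out above the
# spectral trough between them (~9-10 Bark).
print(round(pattern[4], 2), round(pattern[13], 2), round(pattern[9], 2))
```

The DF histogram adds to this a count of how many channels phase-lock to each frequency, which, as noted below, concentrates at the same spectral peaks.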

It is immediately clear that the frequencies with the largest number of channels are located at the peaks in the spectrum curve. If

Concluding Remarks

Building on the work of Carlson and Granström, we demonstrated how a realistic auditory model of vowel processing (KONVERT) can represent information about both whole spectra and formant patterns (i.e., F1, F2 and F3). Whole spectra are represented as the output of a critical-band filterbank (i.e., excitation patterns), whereas F0 and formants (as carried by their strong harmonics) are captured by Dominant Frequency histograms that model the effects of phase locking in auditory neurons.

Apart

References (27)

  • R.L. Diehl et al.

    On explaining certain male–female differences in the phonetic realization of vowel categories

    J. Phonetics

    (1996)
  • P.F. Assmann et al.

    Modeling the perception of concurrent vowels: vowels with the same fundamental frequency

    J. Acoust. Soc. Am.

    (1989)
  • R.A.W. Bladon et al.

    Modeling the judgment of vowel quality differences

    J. Acoust. Soc. Am.

    (1981)
  • Blomberg, M., Carlson, R., Elenius, K., Granstrom, B., 1984. Auditory models in isolated word recognition. In:...
  • Carlson, R., Granström, B., 1976. Detectability of changes of level and spectral slope in vowels. In: STL-QPSR, vol....
  • Carlson, R., Granström, B., 1979. Model predictions of vowel dissimilarity. In: STL-QPSR, vol. 20(3–4). Royal Institute...
  • R. Carlson et al.

    Towards an auditory spectrograph

    (1982)
  • Carlson, R., Granström, B., Fant, G., 1970. Some studies concerning perception of isolated vowels. In: STL-QPSR, vol....
  • R. Carlson et al.

    Two-formant models, pitch and vowel perception

  • Carlson, R., Granström, B., Klatt, D.H., 1979. Vowel perception: the relative perceptual salience of selected acoustic...
  • B. Delgutte et al.

    Speech coding in the auditory nerve I: vowel-like sounds

    J. Acoust. Soc. Am.

    (1984)
  • G. Fant

    Acoustic Theory of Speech Production

    (1960)
  • Fant, G., 1972. Vocal tract wall effects, losses, and resonance bandwidths. In: STL-QPSR, vol. 2–3. Royal Institute of...