Sebastian Heid and Sarah Hawkins
Formant-based synthetic speech typically changes abruptly between “purely”
periodic and purely aperiodic or silent segments. In contrast, changes in
natural speech at these segment boundaries are often more gradual,
resulting in “transitional” patterns of mixed excitation and/or various
patterns of decaying periodicity. Some patterns vary systematically with
prosodic and segmental context. We hypothesize that these properties
contribute to robust, natural-sounding synthesis, by increasing the
acoustic-perceptual coherence of the signal, and because the systematic,
prosodic- and segment-dependent variation enhances cues to word
recognition. To test this hypothesis, we use PROCSY
(http://kiri.ling.cam.ac.uk/procsy/), which combines copy-synthesis with
rule-based quasi-articulatory synthesis to drive HLsyn. Two types of
variation are
examined: (1) mixed periodic and aperiodic excitation at boundaries
between vowels and voiceless fricatives; (2) the waveshape and waveform
amplitude envelope at vowel-stop boundaries. The natural speech comes from
the ProSynth database of a Southern British English man’s speech
(http://www.york.ac.uk/~lang19/). Following trends in his speech, the
excitation types that PROCSY produces at the boundaries of interest depend
partly on the prosodic structure in which the boundary occurs. For example,
there is often a short stretch of mixed periodicity and frication noise at
the boundary in a vowel-fricative sequence, but not in a fricative-vowel
sequence, for which the transition from frication to periodicity is more
abrupt. However, the duration of mixed excitation in vowel-fricative
sequences depends on the height and stress of the surrounding vowels.
Similarly, the acoustic fine-detail of the waveform in the vicinity of
oral closure for stops varies with the voicing of the stop and its
segmental and prosodic context. The HLsyn parameters of glottal and oral
constriction areas, oral volume, wall compliance, and subglottal pressure
are used to reproduce these details. Thus the variation introduced
confines the incidence of abrupt voicing cessation to the appropriate
contexts, and systematically varies less abrupt changes in ways that
reflect segmental and prosodic structure. We will report the results of
perceptual tests conducted on utterances with such pure versus
transitional excitation patterns in the vicinity of segment boundaries.
Preliminary tests suggest that appropriate use of transitional patterns
improves naturalness; we predict that it will also improve
intelligibility. As part of ProSynth synthesis research, this work is
mainly significant for speech perception and synthesis, but it also has
potential applications in articulatory research.
Funded by EPSRC #GR/L53069