Richard
Richard Ogden
rao1@york.ac.uk
http://www.york.ac.uk/~rao1/
---------- Forwarded message ----------
Date: Thu, 15 Oct 1998 17:45:12 +0100 (BST)
From: Paul Carter <pgc104@york.ac.uk>
To: abstract@trill.Linguistics.Berkeley.EDU
Cc: RA Ogden <rao1@york.ac.uk>, J Local <lang4@york.ac.uk>
Subject: ICPhS abstract
"Temporal interpretation/Timing in ProSynth, a prosodic speech synthesis
system."
ProSynth* is an approach to speech synthesis which takes a rich linguistic
structure as central to the generation of natural-sounding speech.
ProSynth uses syntactic and phonological parses to model the fine
acoustic-phonetic detail of real speech, segmentally, temporally and
intonationally. Phonetic parameters are related to phonological structure
via a one-step phonetic interpretation making use of hierarchically
encoded linguistic knowledge.
We describe the model of temporal interpretation/timing employed in
ProSynth in generating polysyllabic utterances, and the phonological
structures used to drive this. The primary timing unit is the syllable.
Two mechanisms are used: (1) syllables are joined by overlaying one on
another; (2) syllables are temporally compressed to produce the correct
rhythmical effects.
At syllable boundaries, we use a phonological model of structure sharing
(ambisyllabicity) which is sensitive to metrical and morphological
structure, so that syllables in different phonological domains can be
differently overlaid. Boundaries that typically need to be differentiated
are agglutinative (typically semantically transparent) morpheme boundaries
vs. more fusional (frequently less semantically transparent) morpheme
boundaries, word junctions, and foot boundaries. Contrast, for example,
`mistake' with `mistime', which are rhythmically different, and have
different degrees of aspiration in the medial plosives. This is modelled
by different amounts of structure sharing between the syllables, and
consequently different degrees of temporal overlay between the syllables.
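
As a rough illustration of the overlay mechanism described above, the
following Python sketch places each syllable on a timeline so that it
starts before the previous one ends, by an amount that depends on the
boundary type. The names (Syllable, OVERLAP_MS, overlay), the durations
and the overlap values are all hypothetical and are not taken from
ProSynth; the assignment of `mistake' to the fusional type and `mistime'
to the agglutinative type is likewise an assumption made for the example.

```python
from dataclasses import dataclass

# Hypothetical overlap (ms) per boundary type. Illustrative values only,
# not ProSynth data: more structure sharing is modelled as more overlap.
OVERLAP_MS = {
    "fusional": 60.0,
    "agglutinative": 20.0,
    "word": 10.0,
    "foot": 0.0,
}


@dataclass
class Syllable:
    label: str
    duration_ms: float


def overlay(syllables, boundaries):
    """Return (label, start_ms, end_ms) per syllable.

    boundaries[i] names the boundary type between syllables[i] and
    syllables[i + 1]; each later syllable starts before the previous
    one ends by the overlap assigned to that boundary type.
    """
    if len(boundaries) != max(len(syllables) - 1, 0):
        raise ValueError("need exactly one boundary between each pair")
    timeline = []
    start = 0.0
    for i, syl in enumerate(syllables):
        if i > 0:
            previous_end = timeline[-1][2]
            start = previous_end - OVERLAP_MS[boundaries[i - 1]]
        timeline.append((syl.label, start, start + syl.duration_ms))
    return timeline


# Assumed classification for the example pair (not stated in the text).
mistake = overlay([Syllable("mis", 200.0), Syllable("take", 300.0)],
                  ["fusional"])
mistime = overlay([Syllable("mis", 200.0), Syllable("time", 300.0)],
                  ["agglutinative"])
```

With these placeholder values the two words come out with different
total durations and different onset times for the second syllable, which
is the kind of rhythmical difference the overlay model is meant to
capture.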
Syllables are part of a strictly layered, headed, prosodic hierarchy,
whose constituents are Syllable >> Foot >> Accent Group >> Intonational
Phrase. Within this hierarchy, we employ a model of temporal compression
(`squish') which makes extensive reference to prosodic structure. Syllable
strength and weight are part of the structural description, and
information about position within feet (initial, medial or final) is also
available. The rhythmical differences between e.g. disyllabic feet
containing heavy-light vs. light-light syllables can therefore be
modelled: different degrees of squish are used for the second syllable,
depending on the weight of the first syllable. One side-effect of such
temporal control is that the different vowel qualities of the second
syllable in pairs such as `whinny' (light first syllable, monophthongal
second) vs. `windy' (heavy first syllable, diphthongal second) are
produced. Tempo effects whose domain is over larger prosodic units can
also be captured. A database of values for temporal compression is being
constructed on the basis of a natural speech database set up to contain
specific linguistic constructions. The resulting knowledge is being used
to drive a diphone synthesiser (MBROLA) as well as a formant synthesiser.
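
The dependence of squish on syllable weight can be sketched, under
stated assumptions, as a table lookup. In the Python fragment below, the
names (Syllable, SQUISH_AFTER, squish_disyllabic_foot) are hypothetical,
and the compression factors are placeholders: neither their size nor
their direction is taken from the ProSynth database, which the abstract
describes as still under construction.

```python
from dataclasses import dataclass


@dataclass
class Syllable:
    label: str
    weight: str          # "heavy" or "light"
    duration_ms: float   # duration before compression


# Placeholder compression factors for the second syllable of a
# disyllabic foot, keyed by the weight of the first syllable.
# These numbers are invented for illustration only.
SQUISH_AFTER = {"heavy": 0.8, "light": 0.9}


def squish_disyllabic_foot(first, second):
    """Return the two durations (ms) after compressing the second syllable.

    The first syllable is left unchanged here; only the second is
    scaled, by a factor chosen from the weight of the first.
    """
    factor = SQUISH_AFTER[first.weight]
    return [first.duration_ms, second.duration_ms * factor]


# The weights of the first syllables follow the text (`whinny' light,
# `windy' heavy); the durations are made up.
whinny = squish_disyllabic_foot(Syllable("whi", "light", 180.0),
                                Syllable("nny", "light", 200.0))
windy = squish_disyllabic_foot(Syllable("win", "heavy", 250.0),
                               Syllable("dy", "light", 200.0))
```

A fuller model would presumably also key the lookup on syllable strength
and on position within the foot (initial, medial or final), which the
structural description makes available.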
(*) Funded by the UK's EPSRC, Grant #GR/L51829.
Authors: Richard Ogden, Paul Carter, John Local
Affiliation: University of York
Author to contact: Richard Ogden
Full Postal Address: Department of Language & Linguistic Science,
University of York,
Heslington,
York. YO10 5DD
UK.
E-mail Address: rao1@york.ac.uk, pgc104@york.ac.uk,
lang4@york.ac.uk
Phone Number: +44 1904 432658
Fax number: +44 1904 432673
Word Count: 400
Subject Area: J. Phonology
Presentation Preference: Lecture