Other authors: Alex Chengyu Fang and Jill House

Category of Submission
1st choice session: (I) Prosody prediction and control
2nd choice session: (K) System design and testing

"Hierarchical representations for synthesis using XML"
Mark Huckvale, Alex Chengyu Fang and Jill House
Department of Phonetics and Linguistics
University College London
Gower Street, London WC1E 6BT, UK
This paper describes work being undertaken within a new speech synthesis
research project 'ProSynth', involving the University of York, the
University of Cambridge and University College London. Focussing initially
on British English, we take an integrated prosodic view of the
representation and processing of utterances for synthesis, building on work
done in the 'YorkTalk' synthesis system. Each utterance is represented in
the form of twin hierarchies, expressing its syntactic and prosodic
structure. These hierarchies are cross-linked during construction such
that phonetic interpretation can be performed in a largely declarative
manner. The synthesis system is designed to be compatible with both
parametric and concatenative signal generation.
The paper outlines the different sources of linguistic representation and
process exploited in the creation of the linguistic description of an
utterance and explains how the XML mark-up language provides a convenient
formalism.
The UCL Survey parser (Fang, 1996b,c) generates detailed syntactic and
grammatical information about an utterance. It builds on the lexical
subcategorisations provided by the AUTASYS tagger (Fang, 1996a). A lexicon
supplies word class information, and for each lexical item a prosodic
structure is generated from phonemic transcription by a syllabic parser.
For each utterance in the training corpus, and for each utterance to be
synthesised, the syntactic and grammatical information is linked via the
word sequence to the prosodic hierarchy. The prosodic hierarchy, composed
from the lexical entries, integrates the syllabic constituents into a
hierarchy of feet, accent groups and intonation phrases. The cross-links
between prosodic structure and syntactic structure can be used to control
prosodic phrasing and intonation. Phonological constraints on the
composition of the prosodic structure create coarticulatory
interdependencies which can extend over differently sized domains in the
hierarchy. The phonetic realisation of the structure can then be performed
in a declarative manner by exploiting the overall context in which each
syllable component appears.
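
As an illustration only, the idea of interpreting each syllable from its
position in the enclosing prosodic domains can be sketched in Python. This
is a minimal sketch and not the ProSynth implementation: the node labels
(IP, AG, FOOT, SYL), the two rules and their scaling factors are all
assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """One constituent in a hypothetical prosodic hierarchy."""
    kind: str                                   # "IP", "AG", "FOOT" or "SYL"
    label: str = ""
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child: "Node") -> "Node":
        """Attach a child constituent and return it."""
        child.parent = self
        self.children.append(child)
        return child


def context(syl: Node) -> dict:
    """For each enclosing domain, say whether the syllable is first/last in it."""
    ctx = {}
    node, initial, final = syl, True, True
    while node.parent is not None:
        siblings = node.parent.children
        # A syllable is domain-initial (or final) only if it is initial
        # (or final) at every level up to and including that domain.
        initial = initial and siblings[0] is node
        final = final and siblings[-1] is node
        ctx[node.parent.kind] = {"initial": initial, "final": final}
        node = node.parent
    return ctx


# Declarative rule table: (condition on context, duration factor).
# Both rules and both factors are invented for illustration.
RULES = [
    (lambda c: c["FOOT"]["initial"], 1.2),   # foot-initial syllable
    (lambda c: c["IP"]["final"], 1.5),       # intonation-phrase-final syllable
]


def duration_scale(syl: Node) -> float:
    """Multiply together the factors of every rule whose condition holds."""
    c = context(syl)
    scale = 1.0
    for condition, factor in RULES:
        if condition(c):
            scale *= factor
    return scale


# Hypothetical structure for "going home": one IP, one AG, two feet.
ip = Node("IP")
ag = ip.add(Node("AG"))
foot1 = ag.add(Node("FOOT"))
foot2 = ag.add(Node("FOOT"))
go = foot1.add(Node("SYL", "go"))
ing = foot1.add(Node("SYL", "ing"))
home = foot2.add(Node("SYL", "home"))

for syl in (go, ing, home):
    print(syl.label, round(duration_scale(syl), 2))
```

The rule table contains no procedural ordering: each rule inspects only
the structural context of a syllable, so rules can be added or removed
without changing the traversal code.
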
We have found the text mark-up language XML particularly powerful in
supporting the linguistic representation and processing in our research.
Currently XML is used to represent the syntactic hierarchy for an
utterance, the prosodic structure with cross-links to the syntactic
hierarchy, and details of the phonetic interpretation. XML annotations to
our training corpus facilitate searching, while in the lexicon XML
representations allow us to store partially constructed prosodic structures
for words. It is easy to validate the inputs and outputs of our processing
programs, since the XML they use must conform to a Document Type Definition
(DTD). In summary, XML provides a simple, yet powerful and machine-friendly
representation, well-suited to the design and operation of non-linear
speech synthesis systems.
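
To make the cross-linking concrete, the following Python sketch parses a
small XML fragment holding both hierarchies and follows the links between
them. The element names, the id and wref attributes and the syntactic
labels are hypothetical and do not reproduce the ProSynth DTD. Python's
standard xml.etree.ElementTree module does not validate against a DTD, so
the hand-written link check below merely stands in for the kind of
consistency that DTD validation (for example with ID/IDREF attributes)
would enforce.

```python
import xml.etree.ElementTree as ET

# Hypothetical mark-up for "going home": words carry ids in the syntactic
# hierarchy, and syllables in the prosodic hierarchy point back to them.
DOC = """
<UTT>
  <SYNTAX>
    <CLAUSE>
      <VP><WORD id="w1" pos="V">going</WORD></VP>
      <AVP><WORD id="w2" pos="ADV">home</WORD></AVP>
    </CLAUSE>
  </SYNTAX>
  <PROSODY>
    <IP>
      <AG>
        <FOOT><SYL wref="w1">go</SYL><SYL wref="w1">ing</SYL></FOOT>
        <FOOT><SYL wref="w2">home</SYL></FOOT>
      </AG>
    </IP>
  </PROSODY>
</UTT>
"""

root = ET.fromstring(DOC.strip())

# Index the syntactic words by id so that cross-links can be resolved.
words = {w.get("id"): w for w in root.iter("WORD")}


def dangling_links(utt):
    """Return the wref values that do not match any WORD id."""
    ids = {w.get("id") for w in utt.iter("WORD")}
    return [s.get("wref") for s in utt.iter("SYL") if s.get("wref") not in ids]


def syllables_of(utt, word_id):
    """Return the SYL elements linked to one word, in document order."""
    return utt.findall(f".//SYL[@wref='{word_id}']")


# Walk the prosodic hierarchy and report each syllable's word class.
for syl in root.iter("SYL"):
    word = words[syl.get("wref")]
    print(syl.text, "->", word.text, word.get("pos"))
```

The same attribute-based queries can serve for corpus searching, for
instance retrieving every syllable that belongs to a word of a given
class.
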
References
Fang, C.Y. 1996a. AUTASYS: Automatic Tagging and Cross-Tagset Mapping. In
Comparing English World Wide: The International Corpus of English, ed. by
S. Greenbaum. Oxford: Oxford University Press. pp 110-124.
Fang, C.Y. 1996b. The Survey Parser: Design and Development. In Comparing
English World Wide: The International Corpus of English, ed. by S.
Greenbaum. Oxford: Oxford University Press. pp 142-160.
Fang, C.Y. 1996c. Automatically Generalising a Wide-Coverage Formal
Grammar. In Synchronic Corpus Linguistics, ed. by C. Percy, C. Meyer, and
I. Lancashire. Amsterdam and Atlanta: Rodopi. pp 131-146.