Draft ESCA Workshop abstract for comment

Mark Huckvale (mark@phonetics.ucl.ac.uk)
Fri, 24 Apr 1998 09:52:39 +0100

Draft Abstract for ESCA Speech Synthesis Workshop Version 0.2

“Hierarchical representations for synthesis using XML”

Mark Huckvale & Alex Chengyu Fang
Phonetics and Linguistics
University College London

This paper describes work being undertaken within a new English-language
speech synthesis research project, ‘ProSynth’, involving the University
of York, the University of Cambridge and University College London. The
project takes an integrated prosodic view of how linguistic information
is represented and processed for synthesis, building on the work done in
the ‘YorkTalk’ synthesis system. It exploits rich sources of syntactic,
phrasal, lexical and prosodic information in the composition of multiple
inter-related hierarchical linguistic structures for each utterance.
Synthesis is then concerned with the phonetic interpretation of this
structure and is compatible with both parametric and concatenative
signal generation.

The paper outlines the different sources of linguistic representation
and processing exploited in creating the linguistic description of an
utterance to be synthesised, and explains how the XML mark-up language
provides a convenient formalism for representation.

For parsing, the UCL Survey parser (Fang, yyyy) provides highly detailed
syntactic and phrasal information about an utterance by interrogating a
corpus of parsed texts. The lexicon contains word class information and
prosodic structures for each lexical item, the latter being generated
from phonemic transcription by a syllabic parser. For each utterance in
the training corpus and for each utterance to be synthesised, the
syntactic and phrasal information is linked via the word sequence to the
prosodic hierarchy. The prosodic hierarchy is composed from the lexical
entries and integrates the syllabic constituents into a hierarchy of
feet, accent-groups and intonation phrases. The cross-links between
syllable structure and syntactic structure can be used to control
prosodic phrasing and accent type. Constraints on the phonological
composition of the structure take the place of phonological realisation
rules. Phonetic realisation is largely declarative, using the context in
which each syllable component appears.
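
As an illustration only, the step of integrating syllabic constituents
into feet might be sketched as below. This is not the ProSynth
implementation: the class names, the `stressed` flag (assumed to come
from the lexical entry) and the rule that each stressed syllable opens a
new foot are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Syllable:
    text: str       # illustrative label for the syllable
    stressed: bool  # assumed to be supplied by the lexicon entry


@dataclass
class Foot:
    syllables: List[Syllable] = field(default_factory=list)


def group_into_feet(syllables: List[Syllable]) -> List[Foot]:
    """Group syllables into feet (assumed rule, for illustration).

    Each stressed syllable opens a new foot. Unstressed syllables attach
    to the current foot; any that precede the first stressed syllable are
    collected into an initial foot with no stressed head.
    """
    feet: List[Foot] = []
    for syl in syllables:
        if syl.stressed or not feet:
            feet.append(Foot())
        feet[-1].syllables.append(syl)
    return feet


# Hypothetical input: "the synthesis system", syllabified by hand.
utterance = [
    Syllable("the", False),
    Syllable("syn", True),
    Syllable("the", False),
    Syllable("sis", False),
    Syllable("sys", True),
    Syllable("tem", False),
]

feet = group_into_feet(utterance)
foot_labels = [[s.text for s in foot.syllables] for foot in feet]
```

Accent-groups and intonation phrases could be built over the feet in the
same way, given whatever grouping criteria the project adopts.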

We have found the text mark-up language XML particularly powerful in
supporting the linguistic representation and processing in our research.
Currently XML is being used to represent the syntactic hierarchy for an
utterance, the prosodic structure with cross-links to the syntactic
hierarchy, and details of the phonetic interpretation. We believe XML
can also support the necessary processing components of a synthesis
system: the parsing of unknown utterances by reference to a database of
parsed text; the representation of lexical entries; the
inter-relationships between the phonological structures in the utterance
and entries in the parse tree, lexicon and signal; and searching for
structures in a corpus. XML provides a simple machine-readable and
syntactically-verifiable representation, well-suited to the design and
operation of non-linear speech synthesis systems.
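
To make the idea of cross-linked hierarchies concrete, here is a minimal
sketch using Python's standard xml.etree.ElementTree module. The element
and attribute names (`syntax`, `prosody`, `ip`, `ag`, `foot`, `syl`,
`wordref`, `strength`) are hypothetical and are not the project's actual
mark-up; the example only shows how one document could hold a syntactic
tree and a prosodic tree, link them through word identifiers, and be
searched for structures.

```python
import xml.etree.ElementTree as ET

# Hypothetical mark-up: all tag and attribute names are illustrative.
DOC = """
<utterance>
  <syntax>
    <np id="np1">
      <word id="w1">synthesis</word>
    </np>
    <vp id="vp1">
      <word id="w2">works</word>
    </vp>
  </syntax>
  <prosody>
    <ip>
      <ag>
        <foot>
          <syl wordref="w1" strength="strong">syn</syl>
          <syl wordref="w1" strength="weak">the</syl>
          <syl wordref="w1" strength="weak">sis</syl>
        </foot>
        <foot>
          <syl wordref="w2" strength="strong">works</syl>
        </foot>
      </ag>
    </ip>
  </prosody>
</utterance>
"""

root = ET.fromstring(DOC)

# Index words by id so that syllables can follow their cross-links.
words_by_id = {w.get("id"): w for w in root.iter("word")}

# ElementTree has no parent pointers, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def phrase_of_syllable(syl):
    """Return the tag of the syntactic phrase containing the syllable's word."""
    word = words_by_id[syl.get("wordref")]
    return parent_of[word].tag


# Search the prosodic hierarchy for strong syllables, then follow each
# cross-link into the syntactic hierarchy.
strong = root.findall(".//syl[@strength='strong']")
links = [
    (s.text, words_by_id[s.get("wordref")].text, phrase_of_syllable(s))
    for s in strong
]
```

Here `links` pairs each strong syllable with its word and the syntactic
phrase that contains that word. A validating parser with declared
ID/IDREF attributes could check such links automatically, which
ElementTree does not do.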

(about 440 words)