Mark Huckvale (mark@phonetics.ucl.ac.uk)
Wed, 29 Sep 1999 15:23:08 +0100
Jill asked me to have a go at rewriting the first paragraph of the paper,
taking into account recent work in corpus-based methods of synthesis.
I think it says much the same things but in a more contemporary context.
Mark
----------------------------------------------------------------------
Despite continued engineering advances in text-to-speech (TTS) systems,
speech synthesised by rule has yet to make a significant impact as an
output channel for information systems. A constant complaint is the
perceived 'unnatural' quality of the synthetic speech: that the speech does
not sound as if it could have been produced by a human speaker. Such
problems persist despite improvements in textual analysis, pronunciation
and signal generation. For example, although the use of a large corpus of
recorded speech for polyphone concatenation has produced signals containing
sections with highly natural voice quality, utterances still exhibit
disfluencies, broken rhythm and lack of coherence. Contemporary synthetic
speech still suffers from dull and repetitive intonation, from poorly
modelled coarticulation, and from unexpressive prosody. These failings arise
from the poverty of the linguistic representation underlying the utterance
to be produced, as well as a fundamental lack of attention to the fine
detail in human production - fine detail that listeners expect and also
utilise when listening in noise.