Questions for Mark
2. Does what I've said fit in with your submissions?
2. Please provide references for submissions when available.
3. SAM test: I'd be grateful for the official reference to SAM, if you
have it easily to hand. I'll find it if not.
4. I can't do semantically anomalous sentences of SAM because we're only
doing phrases and SAM's sentences, as I remember, were quite long. Were
there phrases as well? Or do you think just the principle of semantically
anomalous material is close enough to the spirit? (I suspect not.)
Questions for everyone
1. Comments welcome.
2. Anyone who wants to share authorship welcome.
The form below is taken off the web.
Sarah
------------------------------------------------------------------------
ESCA synthesis Workshop
26-29 November 1998
Jenolan Caves, Blue Mountains, Australia

Title : Testing speech intelligibility, naturalness, and robustness using
PSOLA-manipulated speech and HLsyn

First Author : Sebastian Heid
Email : heid@phonetik.uni-muenchen.de
Affiliation : Phonetics Laboratory, Department of Linguistics,
University of Cambridge, U.K.
Other authors : Sarah Hawkins

Category of Submission
1st choice session : B (assessment)
2nd choice session : H (phonetics and phonology)

Abstract (400 words approx): -- official instruction. I have 501 words at
present, so should lose 100.
This paper describes a system for assessing the quality of speech produced
by a device-independent, linguistically-motivated text-to-speech system
that is currently under development. The linguistic model is described in
other papers submitted to this workshop (Huckvale and Fang xx) and
ICSLP-98 (xx refs). The overall aim is natural-sounding speech that is
perceptually coherent because it reproduces the perceptually-significant
lawful variation of natural speech. This paper describes (a) a range of
tests that assess intelligibility and naturalness under both good and poor
listening conditions and/or conditions of high cognitive load, and (b) how
sounds are produced so the model can be tested before the system is fully
operational.
(a) Existing intelligibility protocols are used where possible, supplemented
by some new methods. Standard tests include SAM consortium recommendations
(xxref): mainly simple CV, VC and VCV syllables, and semantically
anomalous phrases. Psycholinguistic methods test more subtle aspects of
intelligibility and naturalness: these include lexical decision tasks,
tasks that load short-term and/or long-term memory e.g. when words must be
remembered, and tasks that increase cognitive load in other ways, e.g.
when listeners must do two things at once. In this "high cognitive load"
group we distinguish tasks that are largely cognitive e.g. phoneme
monitoring while identifying the words heard, from those that involve
hand-eye coordination e.g. tracking a target with a mouse. Each test can
be done in good or poor listening conditions, the latter allowing the
robustness of the speech to be assessed. Poor conditions use natural noise
e.g. fluctuating noise levels with multiple sound sources typical of
cafeterias, and the usually slower variations found inside a car. One
practical consideration is that the tests must tap a range of processes
yet not take long per listener. Another is that individuals or groups of
listeners can be tested.
(b) There are two types of sound output. PSOLA-manipulated natural speech
is used to model intonation and timing. This PSOLA-manipulated speech,
with known segment labels and durations, provides the input for formant
synthesis using HLsyn (xxref Sensimetrics or a Stevens paper?), which
allows spectral manipulation. HLsyn modeling has two distinct components.
Vowels [xx check manual: or all periodic portions? ] are reproduced from
the PSOLA-manipulated speech using standard copy-synthesis to extract
values for the (essentially Klatt) parameters of the serial branch.
Obstruents are synthesized by rule from the segment label and its
duration, from which values are calculated for the 10 or so [xx check]
higher-level parameters of HLsyn (pseudo-articulatory parameters that
drive the much larger number of acoustic parameters in the underlying
Klatt-type synthesizer). The strengths of this method are (a) that it uses
simple HLsyn input to capture the notoriously difficult-to-copy
parallel-branch excitation of a Klatt synthesizer, and (b) that HLsyn
parameters automatically produce complex and/or subtle acoustic properties
that accompany consonantal closures, especially at segment boundaries.
These properties are hard to produce "by hand" and thus absent in most
formant synthesis, yet they provide some of the lawful variability we
hypothesize contributes to perceptual coherence, and hence to more robust
and natural-sounding synthetic speech.
NOTE: The subject field of your e-mail submission MUST contain the
keywords "submit abstract"
end of text
Sarah
______________________________________________________________________
Dr. Sarah Hawkins Email: sh110@cam.ac.uk
Dept. of Linguistics Phone: +44 1223 33 50 52
University of Cambridge Fax: +44 1223 33 50 53
Sidgwick Avenue or +44 1223 33 50 62
Cambridge CB3 9DA
United Kingdom