Richard Ogden (rao1@york.ac.uk)
Tue, 24 Aug 1999 17:02:57 +0100 (BST)
Here is the CSL paper after today's work.
I've organised it according to Jill's suggestions (mostly: left out one or
two section headings, though I think the content is there). I hope I've
renumbered everything accordingly...
I've redone the introduction a bit. I had threatened to remove some of
the early text on perception, but ended up leaving it in; just
redistributed a bit between "intro" and "motivation" (Sec. 2). Jill had
some objections, the core of which were that there wasn't a clear enough
link with the phonology. I hope that that's a bit clearer now.
Mostly I think the paper's OK. I like Jill's structure, and I think it
makes things join together better. *THANKS JILL!*
There are some serious gaps and I need you to decide what we're doing
about one or two of them.
1. No UCL perception text.
Either dump all perception text (I think it looks unbalanced at the
moment) or include what we've got -- which, the last I heard from Sarah,
is pretty good for UCL. So my preference would be for the text to be
written up, sent and included.
2. We go on about "variability"
Is "variability" what we model? To me that sounds like something that
changes, e.g. from one utterance to the next. I wonder if we mean (boring
word) "detail". I'd appreciate some motherly guidance on this.
3. Temporal modelling
--needs reworking/expanding a bit. I'll do this this week.
If you can give me feedback on what's there tomorrow (Weds) that would be
very helpful. I'll send another (almost final) version out Thurs/Fri this
week, if you can all co-operate with the nasty deadline!
Warning: I have zillions of other things that have to get done, and this
week is set aside for CSL. Beyond next week, *no promises*!
Richard
Richard Ogden
rao1@york.ac.uk
http://www.york.ac.uk/~rao1/
{\rtf1\mac\deff2 {\fonttbl{\f0\fswiss Chicago;}{\f2\froman New York;}{\f3\fswiss Geneva;}{\f4\fmodern Monaco;}{\f5\fscript Venice;}{\f6\fdecor London;}{\f7\fdecor Athens;}{\f12\fnil Los Angeles;}{\f13\fnil Zapf Dingbats;}{\f14\fnil Bookman;}
{\f15\fnil N Helvetica Narrow;}{\f16\fnil Palatino;}{\f18\fnil Zapf Chancery;}{\f20\froman Times;}{\f21\fswiss Helvetica;}{\f22\fmodern Courier;}{\f23\ftech Symbol;}{\f33\fnil Avant Garde;}{\f34\fnil New Century Schlbk;}{\f134\fnil Saransk;}
{\f237\fnil Petersburg;}{\f2017\fnil IPAPhon;}{\f2713\fnil IPAserif Lund1;}{\f9839\fnil Espy Serif;}{\f9840\fnil Espy Sans;}{\f9841\fnil Espy Serif Bold;}{\f9842\fnil Espy Sans Bold;}{\f10565\fnil M Times New Roman Expt;}
{\f12407\fnil SILDoulosIPA-Regular;}{\f12605\fnil SILSophiaIPA-Regular;}{\f13505\fnil SILManuscriptIPA-Regular;}}{\colortbl\red0\green0\blue0;\red0\green0\blue255;\red0\green255\blue255;\red0\green255\blue0;\red255\green0\blue255;\red255\green0\blue0;
\red255\green255\blue0;\red255\green255\blue255;}{\stylesheet{\s243\qj\sl-240\tqc\tx4967\tqr\tx9935 \f20\fs20 \sbasedon0\snext243 footer;}{\s244\qj\sl-240\tqc\tx4967\tqr\tx9935 \f20\fs20 \sbasedon0\snext244 header;}{\s252\qj\sb240\sa60\keepn \b\i\f20
\sbasedon0\snext0 heading 4;}{\s253\qj\sb240\sa60\keepn \b\f20 \sbasedon0\snext0 heading 3;}{\s254\qj\sb360\keepn \b\i\f21 \sbasedon0\snext0 heading 2;}{\s255\qj\sb360\keepn \b\f21\fs28 \sbasedon0\snext0 heading 1;}{\qj\sb240 \f20 \sbasedon222\snext0
Normal;}{\s1\qj\sb120\sa120\sl360 \f65535 \sbasedon222\snext1 Abstract;}{\s2\qc\sb180\sl-280 \b\f20 \sbasedon222\snext2 AbstractHeading;}{\s3\li288\ri288\sb140\sl-219 \f20\fs18 \sbasedon222\snext3 Address;}{\s4\qc\sb180\sl-219 \f20\fs22
\sbasedon222\snext4 Affiliation;}{\s5\qc\sb180\sl-219 \i\f20\fs22 \sbasedon222\snext5 Author;}{\s6\qj\sb120\sa120\sl360 \f65535 \sbasedon222\snext6 Body;}{\s7\qc\sb120\sa240\sl360 \f65535 \sbasedon0\snext0 caption;}{\s8\qc\sl219 \f20\fs18
\sbasedon222\snext8 CellBody;}{\s9\qc\sl219 \b\f20\fs18 \sbasedon222\snext9 CellHeading;}{\s10\qc\sb180\sl-280\keepn \b\f20 \sbasedon222\snext10 Head1;}{\s11\fi-562\li562\sb180\sl-280\keepn\tx566 \b\f20 \sbasedon222\snext11 Head2;}{
\s12\qj\fi-283\li572\ri561\sb140\sl-220\tx566 \f65535\fs18 \sbasedon222\snext12 Item;}{\s13\qj\fi-283\li572\ri561\sb140\sl-220\tx560 \f65535\fs18 \sbasedon222\snext13 NumItem;}{\s14\qc \f20\fs8 \sbasedon4\snext14 bugfix;}{
\s15\qj\fi-284\li556\sb120\sl-219\tx560 \f65535\fs18 \sbasedon222\snext15 Reference;}{\s16\qj\sl-280 \f21 \sbasedon222\snext16 RTF_Defaults;}{\s17\qj\sl219 \f20\fs18 \sbasedon222\snext17 TableTitle;}{\s18\qc\sl-340 \b\f20\fs28 \sbasedon0\snext18 Title;}{
\s19\qc\sl280 \f20 \sbasedon222\snext19 CellFooting;}{\s20\qj\sb240 \f65535 \sbasedon0\snext20 Document Map;}{\s21\qj\fi-720\li720 \f65535 \sbasedon0\snext21 Indent;}{\s22\qj \f65535\fs20 \sbasedon0\snext22 Plain Text;}{\s23\qj\fi360 \f20\fs18
\sbasedon0\snext23 Normal Indent;}}{\info{\title INSTRUCTIONS FOR ICSLP96 AUTHORS}{\author Richard Ogden}}\paperw11880\paperh16820\margl1151\margr1151\margt1582\margb2098\widowctrl\ftnbj \sectd \sbkodd\linemod0\headery709\footery709\cols1\colsx288
{\header \pard\plain \qj \f20 \par
}{\footer \pard\plain \qj\tqc\tx4800\tqr\tx9520 \f20 CSL paper\tab {\field{\*\fldinst date \\@ "MMMM d, yyyy"}}\tab \chpgn \par
}\pard\plain \s18\qc\sl-340 \b\f20\fs28 ProSynth: An Integrated Prosodic Approach to Device-Independent, Natural-Sounding Speech Synthesis\par
\pard\plain \s5\qc\sb180\sl-219 \i\f20\fs22 Richard Ogden{\fs14\up11 ***}, Sarah Hawkins{\fs14\up11 *}, Jill House{\fs14\up11 **}, Mark Huckvale{\fs14\up11 **}, John Local{\fs14\up11 ***}{\plain \f20\fs22 , }Paul Carter{\fs14\up11 ***}, Jana Dankovicov\'87
{\fs14\up11 **}, Sebastian Heid{\fs14\up11 *}\par
\pard\plain \s4\qc\sb180\sl-219 \f20\fs22 {\fs14\up11 *} University of Cambridge, {\fs14\up11 **} University College, London, {\fs14\up11 ***} University of York\par
\pard \s4\qc\sb180\sl-219 \par
\pard\plain \s14\qc \f20\fs8 \sect \sectd \sbknone\linemod0\headery709\footery709\cols1\colsx289 \pard\plain \s2\qc\sb180\sl-280 \b\f20 ABSTRACT{\fs18 \par
}\pard\plain \s1\qj\sb120\sa120 \f65535 {\f20
This paper outlines ProSynth, an approach to speech synthesis which takes a rich linguistic structure as central to the generation of natural-sounding speech. We start from the assumption that the speech signal is informationally rich, and that this acoust
ic richness reflects linguistic structural richness and underlies the percept of naturalness. Naturalness achieved by structural richness produces a perceptually robust signal that is intelligible
in adverse listening conditions. ProSynth uses syntactic and phonological parses to model the fine acoustic-phonetic detail of real speech, segmentally, temporally and intonationally. [[In this paper, we present the results of some preliminary tests to eva
luate the effects of modelling timing, intonation and fine spectral detail.]]\par
}\pard\plain \s255\qj\sb360\keepn \b\f21\fs28 \sect \sectd \sbknone\linemod0\headery709\footery709\cols1\colsx289 \pard\plain \s255\qj\sb360\keepn \b\f21\fs28 1. Introduction\par
\pard\plain \s254\qj\sb360\keepn \b\i\f21 Background\par
\pard\plain \qj\sb240 \f20 Speech synthesis by rule (text-to-speech, TTS) has restricted uses because it sounds unnatural and is often difficult to understand. Despite recent impro
vements in grammatical analysis and in deriving correct pronunciations for irregularly-spelled words, there remains a more fundamental problem, that of the inherent incoherence of the synthesized acoustic signal. This typically lacks the subtle systematic
variability of natural speech that underlies the perceptual coherence of syllables and their constituents, and the longer phrases of which they form part. Intonation is often dull and repetitive, timing and rhythm are poor, and modifications that word boun
daries undergo in connected speech are poorly modelled. Much of this incoherence arises because many modern TTS systems encode linguistic knowledge in ways which are not in tune with current understanding of human speech and language processes.\par
\pard \qj\sb240
Segmental intelligibility data illustrate the scale of the problem. When heard in noise, most synthetic speech loses intelligibility much faster than natural speech: natural speech is about 15% less intelligible at a 0 dB s/n ratio than in quiet, whereas for isolated words/syllables, Pratt (1986) reported that typical synthetic speech drops by 35%-50%. We can expect similar results today. Concatenated natural speech avoids those problems related solely to voice quality and local segment boundaries, but suffers just as much from poor models of timing, intonation, and systematic variability in segmental quality that is dependent on word and rhythmical structure. Even when the grammatical analysis is right, one string of words can sound good, while another with the same grammatical pattern does not. \par
\pard \qj\sb240 ProSynth is an integrated {\i prosodic} (i.e. structure-based) approach to speech synthesis. At its core is a phonological model which allows structurally important distinctions to be made, even when the phonetic effect of these distinctions is subtle. The phonological model in ProSynth draws together insights from current phonology, and makes it easier to model phonetic and perceptual effects. Recent research in computational phonology (e.g. Bird 1995) combines highly structured linguistic representations (more technically, signs) with a declarative, computationally tractable formalism. Recent research in phonetics (e.g. Simpson 1992, Hawkins & Slater 1994, Manuel 1995, Zsiga
1995) shows that speech is rich in non-phonemic information which contributes to its naturalness and robustness (Hawkins 1995). Other work (Local 1992 a & b, 1995a & b, Ogden 1992, Local & Ogden 1997)
has shown that it is possible to combine phonological with phonetic knowledge by means of a process known as phonetic interpretation: the assignment of phonetic parameters to pieces of phonological structure.
All these strands of work have contributed to the phonological model which ProSynth uses. By mimicking as far as possible the spectral, temporal and intonational detail which is observable in natural speech, we aim to improve the
intelligibility of synthetic speech. \par
This paper has the following structure. Section 2 outlines the motivation for the ProSynth model
. Section 3 describes the linguistic model we use to represent the information necessary for modelling the kinds of phonetic effects described in Section 2. Section 4 sets out how the model described in Section 3 is implemented, and how segmental, temporal and intonational detail are modelled. ((We also present the results of some perceptual tests.))\par
\pard\plain \s255\qj\sb360\keepn \b\f21\fs28 \sect \sectd \sbknone\linemod0\headery709\footery709\cols1\colsx289 \pard\plain \s255\qj\sb360\keepn \b\f21\fs28 2.\tab Motivation: the quest for perceptual coherence\par
\pard\plain \qj\sb240 \f20 {\b Possible cuts enclosed in [[ ]]. Tell me what you think.}\par
Interdependencies between grammatical, prosodic and segmental parameters are well known to phoneticians and to everyone who has synthesized speech. When these components are developed for synthesis in separate modules, the apparent convenience is offset by
the need to capture the interdependencies, which often leads to problems of rule ordering and rule proliferation to correct effects of earlier rules. Much of the robustness of natural speech is lost by neglecting systematic subphonem
ic variability, a neglect that results partly from an inappropriate emphasis on phoneme strings rather than on linguistic structure.\par
ProSynth models more phonetic detail than is standard in synthetic speech: for example, secondary resonance effects, timing and rhythm, and f0 alignment, all of which directly reflect phonological structure. This is
consistent with the view that the signal will be more robust when it includes the patterns of systematic phonetic variability found in natural speech. {\b [[}This view is based on the argument that it is t
he informational richness of natural speech that makes it such an effective communicative medium.
By informational richness, we mean that the acoustic fine detail of the time-varying speech signal reflects multidimensional properties of both vocal-tract dynamics and linguistic structure.{\b ]]} The well-known \ldblquote redundancy\rdblquote
of the speech signal, whereby a phone can be signalled by a number of more-or-less co-occurring acoustic properties, contributes some of this richness, but in our view, other less well-documented
properties are just as important. These properties can be roughly divided into two groups: those that make the speech signal sound as if it comes from a single talker, and those that reflect linguistic structure\emdash i.e.
those that make it sound as if the talker is using a consistent accent and style of speech. \par
A speech signal sounds as if it comes from a single talker when its properties reflect details of vocal-tract dynamics. This type of systematic variability contributes to the fundamental acoustic coherence of the speech signal, and hence to its perceptual coherence. Listeners associate these time-varying properties with human speech, so that when they bear the right relationships to one another, the perceptual system groups them together into an internally coherent auditory stream (cf. Bregman 199xx, Remez 19xx). A wide range o
f properties seems to contribute to perceptual coherence. The influence of some, like patterns of formant frequencies, is widely acknowledged (cf. Remez and Rubin 19xx {\i Science}
paper). Others are known to be important but are not always well understood; examples are the amplitude envelope which governs some segmental distinctions (cf. Rosen and Howell 19xx) and also perceptions of rhythm and of \lquote integration\rquote
between stop bursts and following vowels (van Tasell, Soli et al 19xx); and correlations between the mode of glottal excitation and the behaviour of the upper articulators, especially at abrupt segment boundaries (Gobl and NiChasaide 19xx).\par
A speech signal sounds as if the talker is using a consistent accent and style of speech when all the phonetic details are
right. This requires producing often small distinctions that reflect different combinations of linguistic properties. As an example, take the words {\i mistakes} and {\i mistimes}. The /t/ of {\i mistimes} is aspirated whereas that of {\i mistakes} is not
. The two words also have quite different rhythms: the first syllable of {\i mistimes} has a heavier beat than that of {\i mistakes}
, even though the words begin with the same four phonemes. The spectrograms of the two words in Figure xx confirm the differences in aspiration of the /t/s, and also show that the /m/, /I/ and /s/ have quite different durations in the two words, consistent with the perceived rhythmic difference. These differences arise because the morphology of the words differs: {\i mis} is a removable prefix in {\i mistimes}, but in {\i mistakes}
it is part of the word stem. These morphological differences are reflected in the syllable structure, as shown on the right of the Figure. In {\i mistimes}
, /s/ is the coda of syllable 1, and /t/ is the onset of syllable 2. So the /s/ is relatively short, the /t/ closure is long, and the /t/ is aspirated. Conversely, the /s/ and /t/ in {\i mistakes}
are ambisyllabic, which means that they form both the coda of syllable 1 and the onset of syllable 2. In an onset /st/, the /t/ is always unaspirated (cf. {\i step, stop, start}). The differences in the /m/ and the /I/ arise because {\i mist}
is a phonologically heavy syllable whereas {\i mis}
is phonologically light, and both syllables are metrically weak. So, in these metrically weak syllables, differences in morphology create differences in syllabification and phonological weight, and these appear as differences in duration and
aspiration across all four initial segments.\par
\par
\par
\par
\pard \qj\li720\sb240 Legend to Figure xx. Left: spectrograms of the words {\i mistimes} (top) and {\i mistakes }(bottom) spoken by a British English woman in the sentence {\i I\rquote d be surprised if Tess _______ it} with main stress on {\i Tess}
. Right: syllabic structures of each word.\par
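The {\i mistimes}/{\i mistakes} contrast above can be sketched as data. This is only an illustration of the analysis given in the text: the toy syllabifier covers just these two items, and the aspiration rule is the one the text states (an onset /t/ is aspirated unless it follows onset /s/).

```python
# Sketch of the mis+times vs. mis+takes contrast described in the text.
# Syllables are (onset, nucleus, coda) triples; "st" in both the coda of
# syllable 1 and the onset of syllable 2 encodes ambisyllabicity.

def syllabify(word):
    """Toy syllabifier covering only the two example words (illustrative)."""
    if word == "mistimes":          # mis- is a prefix: /s/ closes syllable 1
        return [("m", "I", "s"), ("t", "aI", "mz")]
    if word == "mistakes":          # /st/ is ambisyllabic: shared by both syllables
        return [("m", "I", "st"), ("st", "eI", "ks")]
    raise ValueError(word)

def t_aspirated(syllables):
    """/t/ is aspirated only when it is not preceded by /s/ in its onset."""
    for onset, _nucleus, _coda in syllables:
        if "t" in onset:
            return not onset.startswith("s")   # onset /st/ -> unaspirated
    return False

print(t_aspirated(syllabify("mistimes")))   # True:  /t/ alone in onset 2
print(t_aspirated(syllabify("mistakes")))   # False: onset /st/, as in "step"
```

The point of the sketch is that aspiration falls out of position in structure, not of the phoneme string, which is identical for the first four segments of both words.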
\pard\plain \s254\qj\sb360\keepn \b\i\f21 2.1 Modelling systematic variability\par
\pard\plain \qj\sb240 \f20 {\b I\rquote m not sure why we\rquote re calling it \ldblquote variability\rdblquote , since we\rquote re not modelling variability but detail. Am I missing something here?}\par
Some types of systematic variability may contribute both perceptual coherence and information about linguistic structure. So-called resonance effects (Kelly and Local 1989) provide one example. Resonance effects associated with /r/, for example, manifest a
coustically as lowered formant frequencies, and can spread
over several syllables, but the factors that determine whether and how far they will spread include syllable stress, the number of consonants in the onset of the syllable, vowel quality, and the number of syllables in the foot (Slater and Hawkins 199x, Tun
ley 1999). The formant lowering probably reflects slow movements of the tongue body as it accommodates to the complex requirements of the English approximant /r/.\par
On the one hand, including this type of information in synthetic speech makes it sound more natural in a subtle way that is hard to describe in phonetic terms but seems to make the signal \ldblquote fit together\rdblquote better\emdash
in other words, it seems to make it more coherent. On the other hand, the fact that the temporal extent of rhotic resonance effects depends on linguistic structure means not
only that cues to the identity of a single phoneme can be distributed across a number of acoustic segments (sometimes several syllables), but also that aspects of the linguistic structure of the affected syllable(s) can also be subtly signalled. \par
Listeners can use this type of distributed acoustic information to identify naturally-spoken words (Marslen-Wilson and Warren 199x; other wmw refs (Gaskell?); Hawkins and Nguyen submitted-labphon), and when it is included in synthetic speech it can increas
e phoneme intelligibility in noise by 10-15% or more (Slater and Hawkins, Tunley). Natural-sounding, systematic variation of this type may be especially influential in adverse listening conditions or when cognitive loads are high (c
f. Pisoni in van Santen book, Pisoni and Duffy 19xx. sh check these refs.) because it is distributed, thus increasing the redundancy of the signal. However, Heid and Hawkins (1999 -ICPhS) found similar increases in phoneme intelligibility simply by manipul
ating the excitation type at fricative-vowel and vowel-fricative boundaries and in the closure periods of voiced stops; these improvements to naturalness were quite local. Thus, although only some of the factors mentioned above have been shown to influence
perception, on the basis of our own and others\rquote
recent work (Slater and Hawkins, Tunley, Heid/Hawkins-ICPhS 1999; Pisoni in van Santen book, Pisoni and Duffy 19xx, Kwong and Stevens 1999), we suggest that most of those whose perceptual contribution has no
t yet been tested would prove to enhance perception in at least some circumstances, as developed below. [xxThis para is not great but will have to do for now.]\par
In summary, natural speech is robust because it contains many phonetic details at the spectral, temporal and intonational levels, which form a coherent whole and which are the exponents of an underlying rich linguistic structure. In ProSynth, we attempt
to model declaratively both linguistic structural richness and phonetic richness. The structures we use to represent phonological information are hierarchically organised, and contain information distributed across them. In the subsequent sections, we set out how the phonological model is organised, and how we interpret it phonetically.\par
\pard\plain \s255\qj\sb360\keepn \b\f21\fs28 \sect \sectd \sbknone\linemod0\headery709\footery709\cols1\colsx289 \pard\plain \s255\qj\sb360\keepn \b\f21\fs28 3.\tab ProSynth: a linguistic model\par
\pard\plain \s254\qj\sb360\keepn \b\i\f21 Overview\par
\pard\plain \qj\sb240 \f20 ProSynth uses a phonological model which encodes phonological information in a hierarchical fashion using structures based on attribute-value pairs. Each phonological unit occurs in a complete metrical context
. This context is a prosodic hierarchy with phonological contrasts available at all levels. The complex interacting levels of rules present in traditional layered systems are replaced in ProSynth by a one-step
phonetic interpretation function operating on the entire context, which makes rule-ordering unnecessary. Whereas conventional synthesis systems use
a relatively poor structure and complex, interacting rules, ProSynth uses instead a rich structure and applies simple rules of phonetic interpretation which are highly structure-bound. Systematic phonetic variability is thus
constrained by position in structure. The basis of phonetic interpretation is not the segment, but phonological features at places in structure. These principles have been successfully demonstrated in YorkTalk (Local & Ogden 1997; Local 1992) for structures of up to three feet; we extend them to a wider variety of phonological domains.\par
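The one-step, order-free character of phonetic interpretation can be sketched as follows. This is our own minimal illustration, not ProSynth's actual rule formalism: the feature names ([segment], [onset_has_s], [rhyme_voice]) are invented for the example, and each rule is a pure function from a node-in-context to phonetic parameter values, so no rule ever rewrites another rule's output and ordering is irrelevant.

```python
# Illustrative sketch (not ProSynth's actual formalism): each rule reads the
# structure and emits phonetic parameters; rules never feed each other.

def interpret_onset_t(node):
    """Aspiration of an onset /t/ read straight off the structure."""
    aspirated = node["segment"] == "t" and not node["onset_has_s"]
    return {"aspiration": aspirated}

def interpret_rhyme_voice(node):
    """[voice] is a property of the whole Rhyme, not of one terminal."""
    return {"voicing": "voiced" if node["rhyme_voice"] else "voiceless"}

RULES = [interpret_onset_t, interpret_rhyme_voice]

def interpret(node):
    # One pass over all rules: since each rule only reads the structure and
    # writes its own parameters, the merge is independent of rule order.
    params = {}
    for rule in RULES:
        params.update(rule(node))
    return params

node = {"segment": "t", "onset_has_s": True, "rhyme_voice": False}
print(interpret(node))  # {'aspiration': False, 'voicing': 'voiceless'}
```

Reordering RULES leaves the output unchanged, which is the sense in which rule-ordering becomes unnecessary.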
\pard\plain \s254\qj\sb360\keepn \b\i\f21 3.1 The Prosodic Hierarchy\par
\pard\plain \qj\sb240 \f20 The phonological structure is organised as a prosodic hierarchy, with phonological information distributed across the structure. The knowledge is formally represented as a Directed Acyclic Graph (DAG), a generalisation of a tree. Tree-shaped graph structures are commonly used in phonological analysis; ours differs in the important addition of ambisyllabicity. Formally, ambisyllabicity is represented as re-entrant nodes at the terminal level: i.e. a terminal node
(a consonant or vowel) may simultaneously be the daughter of two syllable nodes. Phonological attribute-value pairs are distributed around the entire prosodic hierarchy rather than at just the terminal nodes (or even associated to just terminal nodes)
, as in many phonological theories. Attributes at any level in the hierarchy may be accessed for use in phonetic interpretation.\par
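The re-entrancy just described can be made concrete with a small sketch (our own illustration, using a hypothetical Node class rather than ProSynth's actual representation): an ambisyllabic consonant is a single terminal object that is simultaneously a daughter of two syllable nodes, so the "tree" is really a DAG.

```python
# Sketch of re-entrant terminal nodes: one shared object, two mothers.

class Node:
    def __init__(self, label, **attrs):
        self.label = label       # e.g. "Syll", or a segment symbol
        self.attrs = attrs       # attribute-value pairs at this node
        self.daughters = []

# A single shared terminal for an ambisyllabic /s/ (hypothetical example).
s = Node("s", ambisyllabic=True)
syll1 = Node("Syll1")
syll2 = Node("Syll2")
syll1.daughters.append(s)   # /s/ as coda of syllable 1 ...
syll2.daughters.append(s)   # ... and as onset of syllable 2: the SAME object

# Re-entrancy means object identity, not equality of two copies:
print(syll1.daughters[0] is syll2.daughters[0])  # True
```

Because the two syllables share one object, any attribute written to the ambisyllabic consonant is automatically visible from both of its mothers.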
Text is parsed into a prosodic hierarchy which has units at the following levels: syllable constituents (Onset, Rhyme, Nucleus, Coda); Syllable; Foot; Accent Group (AG); Intonational Phrase (IP).
Our prosodic hierarchy, building on House & Hawkins (1995) and Local & Ogden (1997), is a head\_driven and strictly layered (Selkirk 1984) structure.{\plain }
Each unit is dominated by a unit at the next highest level (the Strict Layer Hypothesis; Selkirk 1984). This produces a linguistically well-motivated and computationally tractable hierarchy which accords with the representational requirements of
our implementation in XML. Constituents at each level have a set of possible attributes, and relationships between units at the same level are determined by the principle of headedness. Structure-sharing is explicitly recognized through ambisyllabicity.
\par
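The strictly layered hierarchy can be illustrated as an XML fragment built with Python's standard library. The element and attribute names here are our own invention for the example, not ProSynth's actual markup; the point is that every unit nests under exactly one unit of the next level up, and that features like [voice] sit on non-terminal nodes such as the Rhyme.

```python
# Illustrative strictly layered prosodic hierarchy in XML
# (element/attribute names are hypothetical, not ProSynth's actual DTD).
import xml.etree.ElementTree as ET

ip = ET.Element("IP")
ag = ET.SubElement(ip, "AG", strong="yes")
foot = ET.SubElement(ag, "Foot", head="yes")
syll = ET.SubElement(foot, "Syll", weight="heavy", strength="strong")
ET.SubElement(syll, "Onset")
rhyme = ET.SubElement(syll, "Rhyme", voice="yes")   # [voice] on the Rhyme
ET.SubElement(rhyme, "Nucleus")
ET.SubElement(rhyme, "Coda")

print(ET.tostring(ip, encoding="unicode"))
```

Because the layering is strict, a path expression like IP/AG/Foot/Syll/Rhyme always has the same depth, which is what makes the representation computationally convenient.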
Fig. 1 shows a partial phonological structure for the phrase \ldblquote Come with a bloom\rdblquote . Note that phonological information is spread around the structure
. For example, the feature [voice] is treated as a property of the Rhyme as a whole, and not of just one of the terminal nodes headed by the Rhyme. Timing information is also included: in the diagram below, the [start] of the IP is the same as the [start]
of the Onset of the first syllable of the utterance, and the [end] of the IP is the same as the [end] of the Coda of the last syllable, as indicated by the tags {\f13 \'c0} and {\f13 \'c1}.
The value for [ambisyllabic] is shown for two consonants: note that for the [ambisyllabic: +] consonant /{\f12407 D}/, the terminal node is re-entrant.\par
\pard\plain \s7\qc\sb120\sl360 \f65535 {\plain [Fig. 1: tree diagram; embedded binary picture data omitted]}{\f20 \par
}\pard \s7\qc\sb120\sa240\sl360 {\f20 Fig. 1. Partial tree structure of the utterance: \ldblquote Come with a bloom\rdblquote . See text for details.\par
}\pard\plain \qj\sb240 \f20 There is no separate level of {\i phonological word} within the hierarchy. Such a unit does not sit happily in a strictly layered structure, because the boundaries of prosodic constituents like AG and Foot may well occur in the middle of a lexical item. Conversely, word boundaries may occur in the middle of a Foot/AG. For example, in the phrase \ldblquote phonetics and phonology\rdblquote there are two feet (and potentially two AGs): [-netics and phon-], and [-nology]. Both begin in the middle of a word, and the first contains word boundaries. Lexico-grammatical information may nonetheless be highly relevant to phonetic interpretation and must not be discarded. The computational representation of our prosodic structure allows us to get round this problem: word\_level and syntactic\_level information is hyper\_linked into the prosodic hierarchy. In this way lexical boundaries and the grammatical functions of words can be used to inform phonetic interpretation. \par
\pard\plain \s254\qj\sb360\keepn \b\i\f21 3.2 Units of Structure and their Attributes\par
\pard\plain \qj\sb240 \f20 {\b Note to authors: I think the easiest thing to do is to use the convention [attribute: value], and where it\rquote s a boolean choice, use [\'b1], which is more familiar, rather than Y/N, which is less familiar.} {\b
Features are usually written in [like this], so I think square brackets are fine. RAO.}\par
Input text is parsed to head-driven syntactic and phonological hierarchical structures. The phonological parse allots material to places in the prosodic hierarchy and is supplemented with links to the syntactic parse. The lexicon itself is in the form of a
partially parsed representation. Phonetic interpretation may be sen
sitive to information at any level, so that it is possible to distinguish, for instance, a plosive in the onset of a weak foot-final syllable from an onset plosive in a weak foot-medial syllable. \par
{\b Headedness}: When a unit branches into sub-constituents, one of these constituents is its Head. If the leftmost constituent is the head, the constituent is said to be left-headed.
If the rightmost constituent is the head, the structure is right-headed. Thus, IPs are right-headed, since the rightmost constituent AG is the head of the IP. AGs and Feet are left-headed. Properties of a head are shared by the nodes i
t dominates [11]. Therefore a [heavy:+] syllable has a [heavy:+] rhyme; the syllable-level resonance features [grave:\'b1] and [round:\'b1] can also be shared by nodes they dominate: this is how some aspects of coarticulation are modelled.
In Fig. XX, headedness is indicated by vertical lines, as opposed to slanting ones. Phonetic interpretation proceeds head-first and is therefore determined in a structurally principled fashion without resort to extrinsic ordering.\par
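The head-sharing just described can be sketched as a small tree walk. This is an illustrative sketch only, not the ProSynth implementation; the function and node names are ours. Attributes stated on a mother node are inherited by its head daughter, so a [heavy:+] syllable yields a [heavy:+] rhyme, and the rhyme in turn passes the attribute to its own head.

```python
# Hypothetical sketch of head-sharing in a strictly layered tree:
# attributes on a mother node are inherited by its head daughter,
# recursively, so a [heavy:+] syllable has a [heavy:+] rhyme.

def make_node(label, head=None, children=(), **attrs):
    return {"label": label, "head": head, "children": list(children),
            "attrs": dict(attrs)}

def share_head_features(node):
    """Copy the mother's attributes onto its head daughter, recursively."""
    for child in node["children"]:
        if child["label"] == node["head"]:
            for k, v in node["attrs"].items():
                child["attrs"].setdefault(k, v)
        share_head_features(child)
    return node

rhyme = make_node("RHYME", head="NUC",
                  children=[make_node("NUC"), make_node("CODA")])
syl = make_node("SYL", head="RHYME",
                children=[make_node("ONSET"), rhyme], heavy=True)
share_head_features(syl)
# the rhyme, as head of the syllable, is now also [heavy:+],
# and so is the nucleus, as head of the rhyme; the onset is not
```

The non-head daughter (the onset) is untouched, which is why onset material does not share rhyme-level attributes such as [voice] in the scheme described below.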
{\b Intonational Phrase (IP)}: The IP, the domain of a well-formed, coherent intonation contour, contains one or more AGs; minimally it must include a strong AG. The rightmost AG\emdash traditionally the intonational nucleus\emdash
is the head of the IP. It is the largest prosodic domain recognised in the current implementation of our model.\par
{\b Accent Groups (AG)}
: AGs are made up of one or more Feet, which are primarily units of timing. An accented syllable is a stressed syllable associated with a pitch accent; an AG is a unit of intonation initiated by such a syllable, and incorporating any following unaccented s
yllables. The head of the AG is the leftmost heavy foot. A weak foot is also a weak, headless AG. \par
AG attributes include [headedness], pitch accent specifications, and positional information within the IP.\par
{\b Feet}: All syllables are organised into Feet, which are primarily rhythmic units. Types of feet can be differentiated using attributes of [weight], [strength] and [headedness]. A foot is left-headed, with a [strong:+] syllabl
e at its head, and includes any [strong:-] syllables to the right. Any phrase-initial, weak syllables are grouped into a weak, headless foot, sometimes referred to as a \ldblquote degenerate\rdblquote foot. Degenerate feet are
always [light]. Thus when an IP begins with one or more weak, unaccented syllables, we maintain the strictly layered structure by organising them into [light] feet which are in turn contained within similarly [light] (or degenerate) AGs. Consistent
with the declarative formalism, attributes of the Foot are shared with its constituents, so that a syllable with the values [head:+, strong:+] is stressed.\par
{\b Syllables:} The Syllable contains the constituents Onset and Rhyme. The rhyme branches into Nucleus and Coda. Nuclei, onsets and codas can all branch. The syllable is right-headed, the rhyme left-headed. Attributes of the syllable are [weight:
heavy/light], and [strength: strong/weak]: these are necessary for the correct assignment of temporal compression (\'a4XX). Foot-initial Syllables are strong.\par
Weight is defined with regard to the subconstituents of the Rhyme. A Syllable is heavy if its Nucleus attribute [length] has the value [long] (in segmental terms, if it contains a long vowel or a diphthong). A Syllable i
s also heavy if its coda has more than one constituent, as in /rent/, /ask/, /taks/.\par
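The weight definition above reduces to a one-line predicate; the sketch below is for exposition only (the function name is ours, not ProSynth's): a syllable is heavy iff its nucleus is [long] or its coda has more than one constituent.

```python
# Illustrative sketch of the weight definition: a syllable is [heavy]
# if its Nucleus is [long] or its Coda has more than one constituent.

def is_heavy(nucleus_long, coda_constituents):
    """nucleus_long: bool; coda_constituents: number of coda segments."""
    return nucleus_long or coda_constituents > 1

is_heavy(False, 1)   # e.g. /lVv/ of 'loving': short vowel, one coda C -> light
is_heavy(False, 2)   # /rent/: short vowel, branching coda -> heavy
is_heavy(True, 1)    # 'bloom': long vowel -> heavy
```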
There is not a direct relationship between syllable strength and syllable weight. Strong syllables need not be heavy. In {\i loving}, /{\f12407 l\'c3v}/ has a [short] Nucleus, and the coda has only one constituent (corresponding to /{\f12407 v}
/), yet it is the strong syllable in the Foot. Similarly, weak syllables need not be light. In {\i amazement}, the final Syllable has a branching Coda (i.e. more than one constituent) and therefore is [heavy] but [weak]. ProSynth does not make
use of extrametricality: all phonological material must be dominated by an appropriate node in structure.\par
{\b Phonological features:} We use binary features, with each {\i attribute} having a {\i value}, where the {\i value} slot can also be filled by another attribute-value pair. To our set of conventional features we add the features [rhotic:\'b1
], to allow us to mimic the long-domain resonance effects of /r/ [5, 8], and [ambisyllabic:\'b1] for ambisyllabic constituents (\'a4XX). Not all features are stated at the terminal nodes in the hierarchy: [voice:\'b1
], for instance, is a property of the rhyme as a whole in order to model durational and resonance effects.\par
{\b Ambisyllabicity}: Constituents which are shared between syllables are marked [ambisyllabic:+
]. Ambisyllabicity makes it easier to model coarticulation [4] and is an essential piece of knowledge in the overlaying of syllables to produce polysyllabic utterances. It is also used to predict properties such as plosive aspiration in intervocalic cluste
rs (\'a4XX).\par
Constituents are [ambisyllabic:+] wherever this does not result in a breach of syllable structure constraints. {\i Loving} comprises two Syllables, /{\f12407 l\'c3v}/ and /{\f12407 vIN}/, since /{\f12407 v}
/ is both a legitimate Coda for the first Syllable, and a legitimate Onset for the second. {\i Loveless} has no ambisyllabicity, since /{\f12407 vl}/ is neither a legitimate Onset nor a legitimate Coda. Clusters may be entirely ambisyllabic, as in {\i
risky} (/{\f12407 rIsk}/+/{\f12407 ski}/), where /{\f12407 sk}/ is a good Coda and Onset cluster; partially ambisyllabic (i.e. one consonant is [ambisyllabic:+], and one is [ambisyllabic:-]), as in {\i selfish} /{\f12407 sElf}/+/{\f12407 fIS}
/), or non-ambisyllabic as in {\i risk them} (/{\f12407 rIsk}/+/{\f12407 D\'abm}/).{\b \par
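The ambisyllabicity test described above can be sketched as a phonotactic check. The cluster inventories below are tiny stand-ins for illustration, not a grammar of English; only the logic (legitimate Coda for the first syllable and legitimate Onset for the second) reflects the text.

```python
# Hypothetical sketch of the ambisyllabicity test: a medial cluster is
# shared between syllables wherever it is both a legitimate Coda for
# the first syllable and a legitimate Onset for the second. These sets
# are toy stand-ins, not a full phonotactic grammar.

LEGAL_ONSETS = {"v", "l", "sk", "s", "k", "f"}
LEGAL_CODAS = {"v", "sk", "s", "k", "f", "lf"}

def ambisyllabic(cluster):
    return cluster in LEGAL_CODAS and cluster in LEGAL_ONSETS

ambisyllabic("v")    # 'loving':   /v/ is a good coda and onset -> shared
ambisyllabic("vl")   # 'loveless': /vl/ is neither              -> not shared
ambisyllabic("sk")   # 'risky':    /sk/ is a good coda and onset -> shared
```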
}\pard\plain \s255\qj\sb360\keepn \b\f21\fs28 \sect \sectd \sbknone\linemod0\headery709\footery709\cols1\colsx289 \pard\plain \s255\qj\sb360\keepn \b\f21\fs28 4. Implementation\par
\pard\plain \qj\sb240 \f20 In this section, we describe the structure of ProSynth in more detail. We describe the database used for the spectral, temporal and intonational modelling; the use of XML for representation; and then we set out in more detail
what effects we model, and how, at the spectral, temporal and intonational levels.\par
\pard\plain \s254\qj\sb360\keepn \b\i\f21 4.1 Database\par
\pard\plain \qj\sb240\tx0 \f20
Analysis for modelling is based on a core speech database of over 450 utterances, recorded by a single male speaker of southern British English. Database speech files have been exhaustively labelled to identify segmental and prosodic constituent boundaries
, using careful hand\_correction of an automated procedure. F0 contours, calculated from a simultaneously recorded Laryngograph signal, can be displayed time\_aligned with constituent boundaries.\par
\pard \qj\sb240 The database has been designed to exemplify a subset of possible structures, within which we can predict that we will find interesting examples of systematic variability. Each utterance consists of one IP, and up to two AGs.
The foot-types within the AG are varied, according to the weight of the head syllable, the number and type of consonants in the onset and rhyme, whether the medial consonants are ambisyllabic, and the vowel length. There are also phrases containing se
gments whose secondary resonance is expected to spread, and some which we expect to block the spreading of such effects.\par
The database thus provides us with material for analysis of the spectral, temporal and intonational phenomena we aim to synthesise. We are currently expanding it to cover more types of IP.\par
\pard\plain \s254\qj\sb360\keepn \b\i\f21 4.2 Architecture\par
\pard\plain \qj\sb240 \f20 ProSynth builds on the knowledge gained in YorkTalk (refs.), and uses
an open computational architecture for synthesis. There is a clear separation between the computational engine and the computational representations of data and knowledge. The overall architecture is shown in Fig. XX. \par
\pard \qc\sb240\keepn {{\pict\macpict\picw426\pich156
082affffffff009b01a91101a00082a0008c01000affffffff009b01a9600013006d002300bf0000005a68010e005a68005a005a6800b4005a600031006d004100bf005a005a6800b4005a22001a006d001e22001a00be001ea0008da0008c60001300e2002301340000005a68010e005a68005a005a6800b4005a60003100
e200410134005a005a6800b4005a22001a00e2001e22001a0133001ea0008da10096000c010000000200000000000000a1009a0008fffd00000010000001000a00280084003400a72c000800140554696d65730300140d000a2e0004000001002b8531074c657869636f6ea00097a10096000c010000000200000000000000
a1009a000800030000001a000001000a002200f4003a012a28002b00f50c4465636c617261746976650d2a0c096b6e6f776c65646765a0009701000affffffff009b01a909000000000000000031005b006d008000bf09ffffffffffffffff3809000000000000000031005b00e20080013d09ffffffffffffffff38a10096
000c010000000200000000000000a1009a0008fffc0000001a000001000a0067007c007300b2280070007d0b436f6d706f736974696f6ea00097a10096000c010000000200000000000000a1009a0008fffc0000001b000001000a006700f40073012c29780e496e746572707265746174696f6ea0009701000affffffff00
9b01a90900000000000000000b001b001b4100010157002f01a909ffffffffffffffff480900000000000000004100370157006501a909ffffffffffffffff4809000000000000000041006d0157009b01a909ffffffffffffffff48a10096000c020000000200000000000000a1009a0008000800000015000001000a0005
01680029019628000e016a074d42524f4c410d2b050c08646970686f6e650d280026016d0973796e746865736973a00097a10096000c020000000200000000000000a1009a000800080000000c000001000a003b015b005f01a32b051e06484c73796e20280050015c1371756173692d6172746963756c61746f72790d2b11
0c0973796e746865736973a00097a10096000c020000000200000000000000a1009a0008000800000010000001000a007101630095019d2b021e0950726f736f6479200d28008601680c6d616e6970756c617465640d2b0a0c06537065656368a0009701000affffffff009b01a9070000000022007f00010000a000a0a100
a400020d0801000a0000000000000000070001000122005b000100242300002348002300002300ca23000023b812230000a000a301000affffffff009b01a92300242348002300ca23b812a000a1a10096000c010000000200000000000000a1009a0008000300000016000001000a00580014007000422800610015084d61
726b6564200d2a0c0474657874a00097a0008c01000affffffff009b01a90700000000220070005e0000a000a0a100a400020e0371001e0069005e0070006d006d006d0070005e006d005e0069005e006d006d01000a000000000000000022006d006df1032300002300fd2300002300fc230000230f0423000084000a0000
000000000000a000a301000affffffff009b01a984000a0000000000000000a000a1070001000122006d00491500a0008da0008c070000000022004c008d0000a000a0a100a400020e0371001e004c008d005b0094005b0091004c008d004c0091004c0094005b009101000a000000000000000022005b0091fcf123000023
040023000023030023000023fd0f23000084000a0000000000000000a000a301000affffffff009b01a984000a0000000000000000a000a107000100012200400091000ca0008da0008c070000000022004c010b0000a000a0a100a400020e0371001e004c010b005b0112005b010f004c010b004c010f004c0112005b010f
01000a000000000000000022005b010ffcf123000023040023000023030023000023fd0f23000084000a0000000000000000a000a301000affffffff009b01a984000a0000000000000000a000a10700010001220040010f000ca0008da0008c070000000022007000d30000a000a0a100a400020e0371001e006900d30070
00e2006d00e2007000d3006d00d3006900d3006d00e201000a000000000000000022006d00e2f1032300002300fd2300002300fc230000230f0423000084000a0000000000000000a000a301000affffffff009b01a984000a0000000000000000a000a1070001000122006d00be1500a0008da0008c070000000022002a01
550000a000a0a100a400020e0371001e001c014f002a0157001c0157002a0155002901520028014f001c015701000a000000000000000022001c0157fe0e23000023fdff23000023fdff2300002308f423000084000a0000000000000000a000a301000affffffff009b01a984000a0000000000000000a000a10700010001
22006d013c16bca0008da0008c070000000022005e014f0000a000a0a100a400020e0371001e0052014a005e015700520157005e014f005b014d0059014a0052015701000a00000000000000002200520157f80c23000023fefd23000023fdfe230000230df923000084000a0000000000000000a000a301000affffffff00
9b01a984000a0000000000000000a000a1070001000122006d013c11eea0008da0008c0700000000220080014a0000a000a0a100a400020e0371001e007b014a00880157008801570080014a007e014d007b014f0088015701000a00000000000000002200880157f3f82300002303fe2300002302fd23000023080d230000
84000a0000000000000000a000a301000affffffff009b01a984000a0000000000000000a000a1070001000122006d013c1111a0008da00083ff}}\par
\pard \qc\sb240 Fig. XX: ProSynth synthesis architecture.\par
\pard \qj\sb240 Text marked for the type and placement of accents is input to the system, and a pronunciation lexicon is used to construct a strictly layered metrical structure for
each intonational phrase in turn. The overall utterance is then represented as a hierarchy, described in more detail in Section XX.\par
The interpreted structure is converted to a parametric form depending on the signal generation method. The phonetic descriptions and timing can be used to select diphones and express their durations and pitch contours for output with the MBROLA system
(Dutoit et al ref). The phonetic details can also be used to augment copy-synthesis parameters for the HLsyn quasi-articulatory formant synthesiser (Heid & Hawkins ref., Jenolan Caves
.). The timings and pitch information have also been used to manipulate the prosody of natural speech using PSOLA (Hamon et al. ref).\par
\pard\plain \s254\qj\sb360\keepn \b\i\f21 4.3 Linguistic Representation and Modelling\par
\pard\plain \qj\sb240 \f20 The Extensible Markup Language (XML) is an extremely simple dialect of SGML (Standard Generalised Markup Language), the goal of which is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. XML is a standard proposed by the World Wide Web Consortium for industry\endash specific mark\endash up supporting: vendor\endash neutral data exchange, media\endash independent publishing, collaborative authoring, the processing of documents by intelligent agents and other metadata applications [Ref1]. \par
We have chosen to use XML as the external data representation for our phonological structures in ProSynth. The features of XML which make it ideal for this application are: storage of hierarchical information expressed in
nodes with attributes; a standard text\endash based format suitable for networking; a strict and formal syntax; facilities for the expression of linkage between parts of the structure; and readily\endash available software support. \par
In the ProSynth system, the input word sequence is converted to an XML representation which then passes through a number of stages representing phonetic interpretation. A declarative knowledge representation is used to encode knowledge of phonetic interpre
tation and to drive transformation of the XML data structures. Finally, special purpose code translates the XML structures into parameter tables for signal generation. \par
In ProSynth, XML is used to encode the following: \par
{\b Word Sequences}: The text input to the synthesis system needs to be marked\endash
up in a number of ways. Importantly, it is assumed that the division into prosodic phrases and the assignment of accent types to those phrases has already been performed. This information is added to the text using a simple mark\endash up of Intonational
Phrases and Accent Groups (Section XX). \par
{\b Lexical Pronunciations}: 
The lexicon maps word forms to syllable sequences. Each possible pronunciation of a word form has its own entry comprising: SYLSEQ (i.e. syllable sequence), SYL, ONSET, RHYME, NUC, ACODA, CODA, VOC and CNS nodes. Information present in the input mark
\endash up, possibly derived from syntactic analysis, selects the appropriate pronunciation for each word form. \par
{\b Prosodic Structure}: Each composed utterance comprising a single intonational phrase is stored in a hierarchy of: UTT, WORDSEQ, WORD, IP, AG, FOOT, SYL, ONSET, RHYME, NUC, CODA, ACODA, VOC and CNS nodes. Syllables are cross\endash
linked to the word nodes using linking attributes. This allows for phonetic interpretation rules to be sensitive to the grammatical function of a word as well as to the position of the syllable in the word. \par
{\b Database Annotation}: Our database has been manually annotated and a prosodic structure complete with timing information has been constructed for each phrase. This annotation is stored in XML using the same f
ormat as for synthesis. Tools for searching this database help us in generating knowledge for interpretation. \par
An interesting characteristic of our prosodic structure is the use of ambisyllabic consonants (discussed in more detail in Section XX). This allows one or more consonants to be in the Coda of one syllable and in the O
nset position of the next syllable. Examples are the medial consonants in \ldblquote pity\rdblquote or \ldblquote tasty\rdblquote . To achieve ambisyllabicity in XML it is necessary to duplicate and link nodes, since XML rigidly enforces a strict hierarchy of components. \par
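The duplicate-and-link idea can be sketched concretely. The AMBI attribute below is the one used in our XML encoding; the LINK attribute and the toy structure for the medial /t/ of "pity" are ours, purely for illustration.

```python
import xml.etree.ElementTree as ET

# Minimal sketch of representing the ambisyllabic /t/ of "pity": the
# consonant node appears once under the first syllable's Coda and once
# under the second syllable's Onset, tied together by a linking
# attribute (LINK is our invented name, for illustration only).

doc = ET.fromstring("""
<WORD>
  <SYL><ONSET><CNS AMBI="N">p</CNS></ONSET>
       <RHYME><NUC>I</NUC>
              <CODA><CNS AMBI="Y" LINK="c1">t</CNS></CODA></RHYME></SYL>
  <SYL><ONSET><CNS AMBI="Y" LINK="c1">t</CNS></ONSET>
       <RHYME><NUC>i</NUC></RHYME></SYL>
</WORD>
""")

# both copies carry the same link, so tools can treat them as one segment
shared = [c for c in doc.iter("CNS") if c.get("AMBI") == "Y"]
```

Search tools can then collapse the two copies back into a single shared segment by matching on the link, while generic XML software still sees a strict tree.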
An extract of a prosodic structure expressed in XML is shown in Figure XX, taken from the phrase \ldblquote Come with a bloom\rdblquote (see Fig. XX for another representation of this information).
(In the XML representations, Y/N are used in place of +/-.)\par
{\f22\fs18 <FOOT DUR="1" START="0.5561" STOP="1.0883">\par
}\pard \qj {\f22\fs18 \par
<SYL DUR="1" FPOS="1" RFPOS="1" RWPOS="1" START="0.5561" STOP="1.0883"\par
STRENGTH="STRONG" WEIGHT="HEAVY" WPOS="1" WREF="WORD4">\par
\par
}\pard \qj\li720 {\f22\fs18 <ONSET DUR="1" START="0.5561" STOP="0.7341" STRENGTH="STRONG">\par
<CNS AMBI="N" CNSCMP="N" CNSGRV="Y" CNT="N" DUR="1" NAS="N" RELEASE="0.6565" RHO="N" SON="N" START="0.5561" STOP="0.6670" STR="N" VOCGRV="N" VOCHEIGHT="CLOSE" VOCRND="N" VOI="Y">b</CNS>\par
<CNS AMBI="N" CNSCMP="N" CNSGRV="N" CNT="Y" DUR="1" NAS="N" RHO="N" SON="Y" START="0.6670" STOP="0.7341" STR="N" VOCGRV="N" VOCHEIGHT="CLOSE" VOCRND="N"\par
VOI="Y">l</CNS>\par
</ONSET>\par
}\pard \qj {\f22\fs18 \par
}\pard \qj\li720 {\f22\fs18 <RHYME CHECKED="Y" DUR="1" START="0.7341" STOP="1.0883" STRENGTH="STRONG"\par
VOI="Y" WEIGHT="HEAVY">\par
}\pard \qj\li1440 {\f22\fs18 \par
<NUC CHECKED="Y" DUR="1" LONG="Y" START="0.7341" STOP="0.9126" STRENGTH="STRONG" VOI="Y" WEIGHT="HEAVY">\par
<VOC DUR="1" FXGRD="-251.2" FXMID="126.7" GRV="Y" HEIGHT="CLOSE" RND="Y" START="0.7341" STOP="0.8234">u</VOC>\par
<VOC DUR="1" FXGRD="-171.1" FXMID="105.4" GRV="Y" HEIGHT="CLOSE" RND="Y" START="0.8234" STOP="0.9126">u</VOC>\par
</NUC>\par
\par
<CODA DUR="1" START="0.9126" STOP="1.0883" VOI="Y">\par
<CNS AMBI="N" CNSCMP="N" CNSGRV="Y" CNT="N" DUR="1" NAS="Y" RHO="N" SON="Y" START="0.9126" STOP="1.0883" STR="N" VOCGRV="Y" VOCHEIGHT="CLOSE" VOCRND="Y"\par
VOI="Y">m</CNS>\par
</CODA>\par
}\pard \qj\li720 {\f22\fs18 </RHYME>\par
}\pard \qj {\f22\fs18 </SYL>\par
</FOOT>\par
}\pard \qc\sb240 Fig 2. Partial XML representation of utterance: \ldblquote with a bloom\rdblquote .\par
\pard\plain \s254\qj\sb360\keepn \b\i\f21 4.4 Knowledge Representation\par
\pard\plain \qj\sb240 \f20 In ProSynth knowledge for phonetic interpretation is expressed in a declarative form that operates on the prosodic structure. This means firstly that the knowledge is expressed as unordered rules, and secondly
that it operates solely by manipulating the attributes on the XML encoded phonological structure. To encode such knowledge a representational language called ProXML was developed in which it is easy to express the hierarchical contexts which drive processi
ng and to make the appropriate changes to attributes. The ProXML language is read by an interpreter PRX written in C which takes XML on its input and produces XML on its output. ProXML is a very simple language modelled on both C and Cascading Style Sheets
(see [Ref2] for more information). A ProXML script consists of functions which are named after each element type in the XML file (each node type) and which are triggered by the presence of a node of that type in the input. When a function is called to pro
cess a node, a context is supplied centred on that node so that reference to parent, child and sibling nodes is easy to express. \par
Figure XX shows a simple example of a ProXML script to adjust syllable durations for strong syllables in a disyllabic word whose second and final syllable is weak. If the first syllable is heavy, the rule is dependent on
the length of the vowel. In this example, the DUR attribute on SYL nodes is set as a function of the phonological attributes found on that node and on others in the hierarchy. Note that the rules modify the duration
attribute (*= means scale existing value) rather than set it to a specific value. In this way, the declarative aspect of the rule is maintained. The compression factors in the script are computed from regression tree data
taken from a database of natural speech.\par
\pard \qj\li1440\sb240 {\f22\fs18 SYL \{\par
}\pard \qj\li1440 {\f22\fs18 if ((:STRENGTH=="STRONG")&&(:WPOS=="1")&&(:RWPOS=="2")\par
    &&(../SYL[2]:WEIGHT=="LIGHT"))\par
  if (:WEIGHT=="HEAVY")\par
    if (./RHYME/NUC:LONG=="Y")\par
      :DUR *= 1.0884;\par
    else\par
      :DUR *= 1.1420;\par
  else\par
    :DUR *= 0.8274;\par
\}}{\f22\fs18 \par
}\pard \qc\sb240 Fig. X: Example ProXML script, which modifies syllable durations dependent on the syllable level and nucleus level attributes.\par
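The effect of the script in Fig. X can be mimicked in a few lines of ordinary code. This is a sketch for exposition only: the real rule is interpreted by PRX over the full XML structure, whereas here the disyllabic condition is checked directly on a toy input. The compression factors are those from the script; the helper function and the toy XML are ours.

```python
import xml.etree.ElementTree as ET

# Sketch of the Fig. X rule outside ProXML: scale DUR on the first,
# strong syllable of a disyllabic word whose second syllable is light.
# Factors come from the ProXML script; the input XML is a toy example.

def adjust_first_syllable(word):
    syls = word.findall("SYL")
    if len(syls) != 2 or syls[1].get("WEIGHT") != "LIGHT":
        return
    s = syls[0]
    if s.get("STRENGTH") != "STRONG":
        return
    if s.get("WEIGHT") == "HEAVY":
        # heavy first syllable: the factor depends on vowel length
        factor = 1.0884 if s.find("RHYME/NUC").get("LONG") == "Y" else 1.1420
    else:
        factor = 0.8274
    # *= semantics: scale the existing value, never set it outright
    s.set("DUR", str(float(s.get("DUR")) * factor))

word = ET.fromstring(
    '<WORD>'
    '<SYL DUR="1" STRENGTH="STRONG" WEIGHT="HEAVY">'
    '<RHYME><NUC LONG="Y"/></RHYME></SYL>'
    '<SYL DUR="1" STRENGTH="WEAK" WEIGHT="LIGHT"/>'
    '</WORD>')
adjust_first_syllable(word)
```

Because the rule only ever rescales the existing DUR value, rules of this kind can be applied in any order, which is what preserves the declarative character of the knowledge base.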
\pard\plain \s255\qj\sb360\keepn \b\f21\fs28 \sect \sectd \sbknone\linemod0\headery709\footery709\cols1\colsx289 \pard\plain \s255\qj\sb360\keepn \b\f21\fs28 5. Modelling (phonetic interpretation)\par
\pard\plain \qj\sb240 \f20 This section describes more details of phonetic interpretation in ProSynth, focussing on temporal relations, intonation, and spectral detail. Our assumption is that there are close relationships between each
of these aspects of speech. For example, once timing relations are accurately modelled, some of the spectral details (such as longer-domain resonance effects) can also be modelled as a by-product of the temporal modelling, when the output system is HLsy
n (or any formant synthesizer). This particular trade-off between duration and spectral shape is not of course available to concatenative synthesis, but the knowledge it reflects could influence [be applied to?] unit selection. [????]\par
\pard\plain \s254\qj\sb360\keepn \b\i\f21 5.1\tab Spectral detail\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 5.1.1\tab Segmental identity\par
\pard\plain \qj\sb240 \f20
Whichever type of synthesis output system is used, the immediate input comes from the XML file. For concatenative synthesis, we currently use the MBROLA system, with sound segments chosen in the standard way from the MBROLA inventory for British English. [
xx correct/add details?] For formant synthesis, we use HLsyn driven by {\scaps procsy, }
which is part copy-synthesizer from labelled speech files, and part rule-driven from information in the XML file. Most formant trajectories for vowels and approximants are copy-synthesized, while obstruent consonants and some other sounds are produced by r
ule. {\scaps Procsy} is described in detail by Heid and Hawkins (under review). At the time of writing, efforts to make {\scaps procsy} entirely rule-driven have just begun.\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 5.1.2.\tab Fine-tuning spectral shape\par
\pard\plain \qj\sb240 \f20 In concatenative synthesis, the task of fine-tunin
g spectral shape is achieved by selecting appropriate units. ProSynth as yet makes no attempt to improve upon the standard MBROLA unit selection, but ultimately our work should have applications in unit selection inasmuch as it should increase our understa
nding of how factors such as long-domain resonance effects and grammatical dependencies influence spectral variability.\par
When the parameters are set to appropriate values, HLsyn itself does much local fine-tuning of spectral shape automatically. In comparis
on with standard formant synthesizers, it is relatively straightforward to produce complex acoustic changes at segment boundaries that closely mimic those of natural speech. Most notably, HLsyn produces natural-sounding, perceptually-robust transitions bet
ween adjacent segments that differ in excitation type, such as the transition between vowels and voiced or voiceless stops or fricatives. This attribute of HLsyn means that some of the immediate appeal of concatenative synthesis\emdash
natural-sounding, perceptually-robust transitions between adjacent segments, together with a pleasant voice quality\emdash is also available in formant synthesis at little computational cost.\par
Although these types of acoustic fine-detail are relatively easily achievable using HLsyn, they have to be programmed to occur in only the right contexts. {\scaps Procsy }
provides the rules that do this. Some of the systematic variation is programmed by reference to the structure of the prosodic hierarchy, and some in the traditional way by reference to linear s
egmental context. Examples of prosodically-dependent rules include stress-dependent variations in the waveform amplitude envelope, and stress-dependent differences in excitation type in certain CVC sequences. For example, in Southern British English, the f
irst CVC of {\i today} and {\i to disappoint }are spectrally very different from those of {\i turtle }and {\i tiddler}, as are the {\i tit} sequences in {\i attitude }and {\i titter}
. Examples of rules that rely mainly on local segmental context include coarticulation of nasality and the am
ount of voicing in the closure of voiced stops. These sorts of properties, though in need of more work, are reasonably well understood and most are relatively straightforward to implement to a satisfactory standard.\par
More challenging, because more subtle and less well understood, is the temporal extent of long-domain coarticulatory processes such as the resonance effects discussed in the Introduction, which are known to be perceptually salient. For example, Tunley (199
9) has shown that in SSBE, /r/-colouring
varies with vowel height and the number of consonants in the syllable onset, and spreads for at least two syllables on either side of the conditioning consonant, as long as those syllables are unstressed and especially if they are in feet of 3 or more syl
lables. Thus, whereas strong /r/-colouring might be expected to be found throughout a phrase like {\i The tapestry bikini}, it would be expected to be weak and confined only to {\i bad} and {\i rap} in a phrase like {\i The bad rap artist}
(in a non-rhotic accent). Work by West (1999) is broadly supportive of these observations.\par
It is not yet known, however, what limits the spread of rhotic resonance effects. Some of our current efforts are directed towards answering this question. For example, when an /r/ occurs in a context that is susceptible to /r/-colouring, such as the last
syllable of {\i tapestry}, is the resonance effect blocked by the next stressed syllable, or can it spread through into unstressed syllables of the adjacent foot? Just as low vowels show less susceptibilit
y than high vowels, are some consonants (for example, velar stops) more likely to affect the spread of resonance effects than others? The way that resonance effects are modelled in ProSynth will depend to a large extent on the answers to these question
s. For example, if rhotic resonance effects are restricted to unstressed syllables in the foot or feet immediately adjacent to the conditioning /r/, then the feature [rhotic] can be an attribute of the foot in the prosodic tree. If however these effects pa
ss through stressed syllables into the next feet, then they might have to be modelled as an attribute of a level higher than the foot. (Preliminary evidence suggests we should not rule out that possibility.) Finally, if some segments block the spread of re
sonance effects, even in unstressed syllables, then either the domain of the [rhotic] feature may be best placed below the foot, or else the acoustic realisation of the feature must also take account of the segmental context in a relatively complicated way
. In essence, we are asking to what extent rhotic resonance effects are part of the phonology of SBE, and to what extent they can be regarded as a phonetic consequence of, for example, vowel-to-vowel coarticulation. [{\b
John, Richard et al: This seems a possible place to put this phonology-phonetics point, but the more I (Sarah) think about it, the more unhappy I am with it. Secretly, I think I am a Browman-Goldstein type who sees no dividing line between phonol and phone
t., and/or a Keating type who says it\rquote s all controlled. But that\rquote s because, as you know, I am no phonologist. My problem is: the v-to-v coartic doesn\rquote t HAVE to happen\emdash
the lang/accent ALLOWS it to happen. So is that phonol or phonet? And, ultimately, does it matter which?? Answers may, if you wi
sh, direct me to your Linguistics 101 handouts. Another answer may be to re-write the preceding para so it makes some of the points, but in a less theoretical way, which may be more appropriate for CSL. Opinions, please. Tx) }
This is not an easy question to
answer: the variation with vowel height, for example, may reflect a process that is in the phonology of SBE, but is nevertheless manifested to different degrees for independent articulatory-acoustic reasons. If that were the case, then a formant synthesiz
er would have to deal with the acoustic differences between different vowels, even though the basic control would reside in the phonological structure. These issues are currently being investigated.\par
The temporal extent of systematic spectral variat
ion due to coarticulatory processes is modelled using two intersecting principles. One reflects how much a given allophone blocks the influence of neighbouring sounds, and is like coarticulation resistance [12]. The other principle reflects resonance effec
ts, or how far coarticulatory effects spread. The extent of resonance effects depends on a range of factors including syllabic weight, stress, accent, position in the foot, vowel height, and featural properties of other segments in the domain of potent
ial influence. For example, intervening bilabials let lingual resonance effects spread to more distant syllables, whereas other lingual consonants may block their spread; similarly, resonance effects usually spread through unstressed but not stressed sylla
bles.{\i \par
}\pard\plain \s254\qj\sb360\keepn \b\i\f21 5.2\tab Temporal modelling\par
\pard\plain \qj\sb240 \f20 {\b This section will be expanded (this week) so it\rquote s about the same length as the other sections; I\rquote ll use parts of the ICPhS paper. Alternatively, CAM and UCL could reduce their sections. (Prefer first option.)}
\par
Timing relations in ProSynth are handled primarily in terms of (1) temporal compression and (2) syllable overlap. Like spectral detail, temporal effects are treated as an aspect of th
e phonetic interpretation of phonological representations. Linguistic information necessary for temporal interpretation includes a grammar of syllable and word joins, using ambisyllabicity and an appropriate feature system. Such details as formant transiti
on times, and inherent durational differences between close and open vowels, are handled in the statements of phonetic exponency pertaining to each bundle of features at a given place in structure. \par
{\b A model of temporal compression} allows the statement of relationships between constituents (primarily syllables) at different places in metrical structure [3], using a knowledge database. For instance, the syllable /man/ in the words {\i man}, {\i
manage}, {\i manager} and in the utterance \ldblquote {\i She\rquote s a bank manager}\rdblquote has different degrees of temporal compression which can be related to the metrical structure as a whole.
The timing model works top-down, i.e. from the highest unit in the hierarchy to the lowest. This reflects the assumption that the IP, AG, Foot and Syllable are all levels of timing
, and that details of lower-level differences (such as segment type) can be overlaid on details of higher-level differences (such as syllable weight and strength; the strength and weight of an adjacent syllable; etc.). The top-down model also has the effec
t of constraining search spaces. For instance, if the distinction between heavy and light
is relevant to the temporal interpretation of a syllable, then the temporal characteristics of the Onset of that syllable are sensitive to this fact, so that Onsets in heavy syllables and in light syllables have different durational properties.\par
The model of temporal compression is being constructed on the basis of the metrical structures of natural speech in a database (Section XX), although originally it was constructed on the basis of impressionistic listening.
The labelled waveforms and their XML-parsed description files are searched according to relevant feature information (eg. syllable weight and strength), and a Classification and Regression Tree model is used to generalise across this da
ta and generate duration statistics for feature bundles at given places in the phonological structure. The duration model can be used to drive MBROLA, since it predicts the durations of acoustic segments.\par
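The generalisation step can be sketched as follows. This is a minimal illustration, not the ProSynth implementation: the real system generalises with a Classification and Regression Tree over the XML-parsed database, while here a simple per-bundle mean stands in, and all feature values and durations are invented.

```python
# Minimal sketch: pooling durations per feature bundle at a place in
# structure. Feature values and durations (ms) are invented; the actual
# system uses a Classification and Regression Tree.
from collections import defaultdict

data = [
    (("heavy", "strong", "onset"), 92),
    (("heavy", "strong", "onset"), 88),
    (("light", "weak", "onset"), 61),
    (("light", "weak", "onset"), 59),
    (("heavy", "strong", "rhyme"), 205),
    (("light", "weak", "rhyme"), 118),
]

def duration_stats(observations):
    """Mean duration (ms) per feature bundle."""
    acc = defaultdict(lambda: [0.0, 0])
    for bundle, dur in observations:
        acc[bundle][0] += dur
        acc[bundle][1] += 1
    return {bundle: total / n for bundle, (total, n) in acc.items()}

stats = duration_stats(data)
# Onsets in heavy and light syllables come out with different durations,
# as the top-down model predicts: 90.0 ms vs. 60.0 ms here.
```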
{\b Syllable overlap:} A second approach to timing is based not on durations between acoustic events, but on a non-segmental model of temporal interpretation (Local & Ogden ref., Ogden, Local & Carter ref.). According to this model,
higher-level constituents in the hierarchy are compressed, and their daughter nodes are compressed in the same way. The temporal interpretation of ambisyllabicity is the degree of overlap that exists between syllables, so an intervocalic consonant (
typically ambisyllabic) has duration properties inherited from both the syllables it is in.\par
Syllable{\i\fs20\dn4 n} can be overlaid on Syllable{\i\fs20\dn4 n-1} by setting its start point to be before that of Syllable{\i\fs20\dn4 n-1}. By overlaying syllables to varying degrees and making reference to ambisyllabicity
, it is possible to lengthen or shorten intervocalic consonants systematically. There are morphologically bound differences which can be modelled in this way, provided that the phonological structure is sensitive to them. For instance, the Latinate prefix
{\i in-} is fully overlaid with the stem to which it attaches, giving a short nasal in {\i innocuous}, while the Germanic prefix {\i un-} is not overlaid to the same degree, giving a long nasal in {\i unknowing}. Rhythmical differences in pairs like {\i
recite} and {\i re-site} can likewise be treated as differences in phonological structure and consequent differences in the temporal interpretation of those structures.\par
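The overlap idea can be sketched in a few lines. This is a toy illustration under invented durations, not the ProSynth statement of temporal exponency: the coda portion of one syllable and the onset portion of the next share the ambisyllabic consonant, so greater overlap yields a shorter consonant.

```python
# Minimal sketch (invented durations, ms): the temporal interpretation
# of ambisyllabicity as syllable overlap. The more the two syllables
# overlap, the shorter the intervocalic consonant they share.
def intervocalic_duration(coda_ms, onset_ms, overlap_ms):
    """Duration of an ambisyllabic consonant under a given overlap."""
    return coda_ms + onset_ms - overlap_ms

# Fully overlaid Latinate prefix ('innocuous'): short nasal.
short_n = intervocalic_duration(80, 80, overlap_ms=80)    # 80 ms
# Barely overlaid Germanic prefix ('unknowing'): long nasal.
long_n = intervocalic_duration(80, 80, overlap_ms=20)     # 140 ms
```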
\pard\plain \s254\qj\sb360\keepn \b\i\f21 5.3\tab Intonational modelling\par
\pard\plain \qj\sb240 \f20 We assume, in common with most theories of intonation, that the highly variable F0 contours encountered in n
atural speech can be analysed into component parts and classified according to a finite set of possible pitch melodies, which need to be defined phonologically.
There is, then, a dimension of paradigmatic choice in modelling intonation: the overall pitch pattern selected for an IP is not itself predictable from structure but is determined by discourse factors.
Once that discourse-based selection has been made, then a pitch accent specification can be assigned to each of the AGs within the IP. The pattern for an IP is thus composed of the pitch accents assigned to AGs, and of boundary tones associated wit
h the edges of the IP domain. For example,
IP attributes will tell us (i) about position in discourse (initial, medial, final), (ii) about speech act function (declarative, interrogative, imperative ...), and (iii) about linguistic focus. The information in (i) is relevant to pitch range and
will be interpreted in terms of F0 scaling and boundary tone. Information in (ii) is used in determining the choice
of pitch accents for the component AGs, whereas (iii) determines nuclear accent placement, and hence the AG structure itself, since the nucleus must be located on the final AG of an IP (IPs being right-headed).
By default, AGs are co-terminous with headed, heavy Feet (those beginning with stressed syllables), so that the intonation nucleus falls on the final such Foot; in context the focus may shift to an earlier Foot position, thus creating an AG constituent c
ontaining more than one Foot. In this case, since AGs are left-headed, the first Foot within the AG is the head of that AG and the domain for the nuclear pitch contour. {\i (Examples available if required.)}\par
A discourse-final declarative IP, then, consisting of two well-formed (non-degenerate) AGs, would typically be assigned a relatively high accent in AG1, a falling nuclear pitch movement in AG2 and a low final boundary tone (equivalent to H* H
*L L% in ToBI-style notation).\par
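The default assignment just described can be sketched as a toy rule set. This is an illustration only, not the ProSynth grammar; the non-declarative branch is a placeholder, and the function name and defaults are our own.

```python
# Minimal sketch of the default contour assignment described above:
# given the number of AGs in an IP and its discourse attributes, assign
# a pitch accent to each AG and a final boundary tone (ToBI-style).
def assign_contour(n_ags, speech_act="declarative", position="final"):
    accents = ["H*"] * (n_ags - 1)        # pre-nuclear accents
    if speech_act == "declarative":
        accents.append("H*L")             # falling nuclear accent on final AG
        boundary = "L%" if position == "final" else "H%"
    else:
        accents.append("H*")              # placeholder for other speech acts
        boundary = "H%"
    return accents, boundary

# A discourse-final declarative IP with two AGs yields H* H*L L%,
# matching the example in the text.
accents, boundary = assign_contour(2)
```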
The interpretation of the selected pitch contour in terms of F0 is, like other phonetic parameters, structure-dependent. Precise alignment of contour turning-points is constrained by the properties of units at lower levels in the hierarchy.
In our model, described in more detail in (ICPhS paper 1999 ref), nuclear pitch accents are defined in terms of a template based on a sequence of contour turning-points. These templates are in turn based on a set of essential parameters derived by automat
ic means from the Laryngograph recording used to calculate the F0 trace, and checked u
sing informal listening tests to ensure that there was perceptual equivalence between natural F0 contours and those constructed by linking the target points we identified. For example, for a falling (H*L) pitch accent we identify three crucial contour t
urning-points: Peak ONset (PON)
, Peak OFfset (POF) and Level ONset (LON). In other words, we recognise that the \ldblquote peak\rdblquote  associated with H* accents is often manifested as a plateau, with its own duration, rather than as a single peak: PON and POF represent the start and end of such a plateau, with POF therefore denoting the beginning of the F0 fall. LON occurs at the end of the fall, and is the point from which the low tone spreads till the end of voicing in the AG (cf. \ldblquote phrase accent\rdblquote  (ref)). \par
***{\i Include suitable F0 plot as illustration, + ICPhS diagram with following procedural explanation***} \par
Firstly, the location of the key syllable components was established using the manual annotations. Then the peak F0 value in the accented syllable was found. The onset (PON) and the offset (POF) of the peak were then located by finding the range of times around the peak where the F0 value was within 4% of the peak value (approximating to a range of perceptual equality). The schematic representation below illustra
tes the search for PON and POF.\par
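The search can also be sketched in code. This is a minimal version of the procedure just described, with invented times and F0 values: find the F0 peak, then widen outwards while F0 stays within 4% of the peak value (treated as perceptually equal).

```python
# Sketch of the PON/POF search: locate the F0 maximum, then extend the
# plateau left and right while F0 remains within 4% of the peak.
def find_plateau(times, f0, tolerance=0.04):
    """Return (PON, POF): onset and offset of the F0 peak plateau."""
    peak_i = max(range(len(f0)), key=lambda i: f0[i])
    floor = f0[peak_i] * (1 - tolerance)
    i = peak_i
    while i > 0 and f0[i - 1] >= floor:
        i -= 1
    j = peak_i
    while j < len(f0) - 1 and f0[j + 1] >= floor:
        j += 1
    return times[i], times[j]

times = [0, 10, 20, 30, 40, 50, 60]          # ms (invented)
f0 = [180, 200, 238, 240, 236, 210, 170]     # Hz (invented)
pon, pof = find_plateau(times, f0)
# The peak is a plateau from 20 to 40 ms: 238 and 236 Hz lie within
# 4% of the 240 Hz maximum, so PON = 20 and POF = 40.
```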
The template turning-points are specified as attributes of the leftmost Foot (=head) within the AG. Our statistical analysis of the database suggests that the timing of all these points varies systematically with aspects of the structure of this
Foot, such as its length in terms of number of component syllables, and characteristics of the onset and rhyme of the accented syllable at its head. Many earlier studies of F0 alignment relate e.g. H* \ldblquote peak\rdblquote  timing to this accented syllable, rather than
to the Foot (various refs). Our early results suggest that we can cut down on some of the variability by treating the Foot as the primary domain for our template.\par
\pard \qj\sb240 The patterns of alignment across structures which we observe for our single speaker model ar
e consistent with those reported in the literature (see House & Wichmann 1996, Wichmann and House 1999 for a summary). We claim that successful modelling of the F0 values for this speaker, integrated with the same speaker\rquote s timing and spectral properties, enhances the coherence of the synthesised output. Acoustic-phonetic coherence will be further enhanced by incorporating microprosodic perturbations of the F0 contour (Silverman, ...), clearly observable for e.g. obstruent consonants on our database.\par
\pard\plain \s255\qj\sb360\keepn \b\f21\fs28 \sect \sectd \sbknone\linemod0\headery709\footery709\cols1\colsx289 \pard\plain \s255\qj\sb360\keepn \b\f21\fs28 6.\tab Perceptual testing/experiments\par
\pard\plain \qj\sb240 \f20 {\b Urgent decisions needed on this. Include or exclude? If include (in }{\b\i any}{\b form), UCL must provide me with text immediately please. If you don\rquote
t, I think the paper looks a bit funny with only two experiments reported.}\par
[[This section will be expanded with the experimental results from respective sites. STILL LOTS OF WORK TO DO HERE ON JOINING THINGS UP BETTER. WILL WAIT TILL I GET UCL TEXT, THEN TAKE OUT COMMONALITIES AND PUT IN 6.1]]\par
\pard\plain \s254\qj\sb360\keepn \b\i\f21 6.1\tab Conditions shared by all experiments\par
\pard\plain \qj\sb240 \f20 [[This section will contain information relevant to all the experiments.]]\par
\pard\plain \s254\qj\sb360\keepn \b\i\f21 6.2\tab F0\par
\pard\plain \qj\sb240 \f20 [[Emphasises the innovation in our testing of intonation; something about lack of good standard models for testing intonation.]]\par
\pard\plain \s254\qj\sb360\keepn \b\i\f21 6.3\tab Timing\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.3.1.\~Hypothesis\par
\pard\plain \qj\sb240 \f20 The hypothe
sis we are testing in ProSynth is that having hierarchically organised, prosodically structured linguistic information should make it possible to produce more natural-sounding synthetic speech which is also more robust under difficult listening conditions.
As an initial test of our hypotheses about temporal structure and its relation to prosodic structure, an experiment has been conducted to test whether the categories set out in Section 2 make a significant difference to listeners\rquote
ability to interpret syn
thetic speech. If the timings predicted by ProSynth for structural positions are perceptually important, listeners should be more successful at interpreting synthetic speech when the timing appropriate for structure is used than in the case where the timin
g is inappropriate for the linguistic structures set up.\par
\pard \qj\fi357\sb240 The data consist of phrases from the database of natural English generated by MBROLA [11] synthesis using timings of two different kinds: (1)\~the segment durations predicted by the ProSynth model taking into account all the linguistic structure outlined in Section 2; and (2)\~the segment durations predicted by ProSynth for a different linguistic structure. If the linguistic structure makes no significant difference, then (1) and (2) should be perceived equally well (or badly). If temporal interpretation is sensitive to linguistic structure in the way that we have suggested, then the results for (1) should be better than the results for (2).\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.3.2.\~Data\par
\pard\plain \qj\sb240 \f20 12 groups of structures to be compared on structural linguistic grounds were established (e.g. \ldblquote light ambisyllabic short initial syllable\rdblquote  versus \ldblquote light nonambisyllabic short initial syllable\rdblquote ). Each group has two members (e.g. {\i
robber}/{\i rob them} and {\i loving}/{\i loveless}
). For each phrase, two synthetic stimuli were generated: one with the predicted ProSynth timings for that structure, and another one with the timings for the other member of the pair. Files were produced with timing information from the natural-speech utt
erances, and an approximation to f0 of th
e speech in the database. The timing information for the final foot was then replaced with timing from the ProSynth model. This produced the \lquote correct\rquote  timings. In order to produce the \lquote broken\rquote  timings, timing information for the rhyme of the strong syllable in this final foot was swapped within the group, so that, for example, the durations for {\i ob} in {\i robber} were replaced with the durations for {\i ob} in {\i rob them} and vice versa.\par
The stimuli have segment labels ultimately from the label files from the database, f0 information from the recordings in the database, and timing information partly from natural speech and partly from the ProSynth model.\par
As an example, consider the pair {\i (he\rquote s a) robber} and {\i (to) rob them}. The durations (in ms.) for {\i robber} and {\i rob them} are:\par
\pard \qj\li720\sb240 {\f12407 \'81\tab }120\tab {\f12407 \'81}\tab 110\par
\pard \qj\li720 {\f12407 b\tab }65\tab {\f12407 b}\tab 85\par
{\f12407 \'ab\tab }150\tab {\f12407 D}\tab 60\par
{\f12407 \tab }\tab {\f12407 \'ab}\tab 120\par
{\f12407 \tab }\tab {\f12407 m}\tab 135\par
\pard \qj\sb240 Stimuli with these durations are compared with stimuli with the durations swapped round:\par
\pard \qj\li720\sb240 {\f12407 \'81\tab }110\tab {\f12407 \'81\tab }120\par
\pard \qj\li720 {\f12407 b\tab }85\tab {\f12407 b\tab }65\par
{\f12407 \'ab\tab }150\tab {\f12407 D\tab }60\par
{\f12407 \tab }\tab {\f12407 \'ab\tab }120\par
{\f12407 \tab }\tab {\f12407 m}\tab 135\par
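The swap can be sketched with the figures above. This is an illustration only, using rough SAMPA-like labels of our own (Q for the vowel of {\i rob}, D for the {\i th}, @ for schwa): the durations of the shared rhyme {\i ob} (vowel plus /b/) are exchanged between the two members of the pair, and the remaining segments are untouched.

```python
# Sketch of the stimulus construction, using the durations (ms) from the
# tables above. Labels are rough SAMPA-like stand-ins, not the project's
# actual transcription.
robber = [("Q", 120), ("b", 65), ("@", 150)]
rob_them = [("Q", 110), ("b", 85), ("D", 60), ("@", 120), ("m", 135)]

def swap_rhyme(a, b, n=2):
    """Swap the durations of the first n segments between two phrases."""
    a2 = [(seg, b[i][1]) for i, (seg, _) in enumerate(a[:n])] + a[n:]
    b2 = [(seg, a[i][1]) for i, (seg, _) in enumerate(b[:n])] + b[n:]
    return a2, b2

broken_robber, broken_rob_them = swap_rhyme(robber, rob_them)
# broken_robber now carries 110 and 85 ms for 'o' and 'b', exactly as in
# the second table above.
```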
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.3.3.\~Experimental design.\par
\pard\plain \qj\sb240 \f20 22 subjects heard every phrase once at comforta
ble listening levels over headphones, presented by a Tucker-Davis DD1 digital-to-analogue interface. The signal-to-noise ratio was -5 dB. The noise was cafeteria noise, i.e. different background noises such as voices and laughter. Subjects were instructed to tra
nscribe what they heard using normal English spelling, and were given as much time as they needed. When they were ready, they pressed a key and the next stimulus was played.\par
Each subject heard half of the phrases as generated with the ProSynth model, and half with the timings switched. The subjects heard six practice items before hearing the test items, but were not informed of this.\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.3.4.\~Results\par
\pard\plain \qj\sb240 \f20
The phoneme recognition rate for the correct timings from the ProSynth model is 77.5%, and for the switched timings it is 74.2%. Although this is only a small improvement, it is nevertheless significant using a one-tailed correlated t-test (t(21) = 2.21, p
< 0.02).\par
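The test reported above can be sketched as follows. The per-subject scores here are invented for illustration (they are not the experiment's data); the point is only to show how the one-tailed correlated t statistic is formed from per-subject differences.

```python
# Illustrative paired (correlated) t statistic from per-subject
# differences between the two timing conditions. Scores are invented.
from math import sqrt
from statistics import mean, stdev

def paired_t(correct, switched):
    """Paired t statistic and degrees of freedom for two score lists."""
    diffs = [a - b for a, b in zip(correct, switched)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs))), len(diffs) - 1

# Hypothetical phoneme recognition rates (%) for six subjects:
correct = [78, 80, 75, 77, 79, 76]
switched = [74, 77, 73, 75, 74, 74]
t, df = paired_t(correct, switched)
# A positive t supports the one-tailed hypothesis that the correct
# timings are recognised better; a t table or CDF then gives p.
```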
Examples of the stimuli and further details of the results of the experiments (including updates) are available on the world wide web [12].\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.3.5.\~Discussion\par
\pard\plain \qj\sb240 \f20
The results show a significant effect of linguistic structure on intelligibility. The results are for the whole phrase, including parts which were not switched round: excluding these parts may improve the results. The MBROLA diphone synthesis models durational effects, but not the segmental effects predicted by our model and described in more detail in Section 3: for example, the synthesis produces aspirated plosives in words like {\i roast}[{\f12407 H}]{\i ing} where our model predicts non-aspiration. It also uses only a small diphone database. The rather low phoneme recognition rates may be due to problems with the quality of the synthesis, or to the cognitive load imposed by high levels of background noise. Further statistical analysis will group the data according to foot-type, and future experiments will use a formant synthesiser.\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.3.6.\~Future work\par
\pard\plain \qj\sb240 \f20 Future work will concentrate on refining the temporal model so that it generates durations which approximate those
of our natural speech model as well as possible. The work will be checked by more perceptual experiments, including presenting the synthetic stimuli under listening conditions that impose a high cognitive load, such as having the subjects perform some othe
r task while listening to synthesis.\par
\pard\plain \s254\qj\sb360\keepn \b\i\f21 6.4\tab Segmental boundaries\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.4.1. Material. \par
\pard\plain \qj\sb240 \f20 18 phrases from the database were copy-synthesized into HLsyn using {\scaps procsy} [15], at a sampling rate of 11.025 kHz, and hand-edited to a good standard of intelligibility, as judged by a number
of listeners. In 10 phrases, the sound of interest was a voiceless fricative: at the onset of a stressed syllable\emdash {\i in a }{\i\ul f}{\i ield}; unstressed onset\emdash {\i it\rquote s }{\i\ul s}{\i urreal}; coda of an unstressed syllable\emdash {
\i to di}{\i\ul s}{\i robe}; between unstressed syllables\emdash {\i di}{\i\ul s}{\i appoint}; coda of a final stressed syllable\emdash {\i on the roo}{\i\ul f}{\i , his ri}{\i\ul ff}{\i , a my}{\i\ul th}{\i , at a lo}{\i\ul ss}{\i , to cla}{\i\ul sh}
; and both unstressed and stressed onsets\emdash {\i\ul f}{\i ul}{\i\ul f}{\i illed.} The other 8 items had voiced stops as the focus: in the coda of a final stressed syllable\emdash {\i it\rquote s mislai}{\i\ul d}{\i , he\rquote s a ro}{\i\ul gue}{\i
, he was ro}{\i\ul bb}{\i ed}; stressed onset\emdash {\i in the }{\i\ul b}{\i and}; unstressed onset\emdash {\i the }{\i\ul d}{\i elay, to }{\i\ul b}{\i e wronged}; unstressed and final post-stress contexts\emdash {\i to }{\i\ul d}{\i eri}{\i\ul de}
; and in the onset and coda of a stressed syllable\emdash {\i he }{\i\ul b}{\i e}{\i\ul gg}{\i ed.\par
}The sound of interest was synthesized with the \ldblquote right\rdblquote type of excitation pattern. From each right version, a \ldblquote wrong\rdblquote
one was made by substituting a type or duration of excitation that was inappropriate for the context. Changes were systematic; no attempt was made to copy the exact details of the natural version of each phrase
, as our aim was to test the perceptual salience of the type of change, with a view to incorporating it in a synthesis-by-rule system.\par
At FV boundaries, the right version had simple excitation (an abrupt transition between aperiodic and periodic excitation), and the wrong version had mixed periodic and aperiodic excitation. VF boundaries had the opposite pattern: wrong versions had no mix
ed excitation. See Fig. 1. Right versions were expected to be more intelligible than wrong versions of fricatives.\par
Each stop had one of two types of wrong voicing: longer-than-normal voicing for {\i\ul b}{\i and} and{\i }{\i\ul b}{\i e}{\i\ul gg}{\i ed}
(see Fig. 2) whose onset stops normally have a short proportion of voicing in the closure; and unnaturally short voicing in the closures of the other six words. The wrong versions of {\i\ul b}{\i and} and{\i }{\i\ul b}{\i e}{\i\ul gg}{\i ed}
were classed as hyper-speech and expected to be more intelligible than the right versions. The other 6 were expected to be less intelligible in noise if naturalness and intelligibility co-varied.\par
<FIG MISSING>\par
Figure 1. Spectrograms of part of /{\scaps\f12407 is}/ in {\i disappoint}. Left: natural; mid: synthetic \ldblquote right\rdblquote version; right: synthetic \ldblquote wrong\rdblquote version.\par
<FIG MISSING>\par
<FIG MISSING>\par
<FIG MISSING>\par
Figure 2. Waveforms showing the region around the closure of /b/ in {\i he begged}. Upper panel: natural speech; middle: \ldblquote right\rdblquote synthetic version; lower: hyper-speech synthetic version.\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.4.2. Subjects. \par
\pard\plain \qj\sb240 \f20 The 22 subjects were Cambridge University students, all native speakers of British English with no known speech or hearing problems and less than 30 years old.\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.4.3. Procedure. \par
\pard\plain \qj\sb240 \f20
The 18 experimental items were mixed with randomly-varying cafeteria noise at an average s/n ratio of -4 dB relative to the maximum amplitude of the phrase. They were presented to listeners over high-quality headphones at a comfortable listening level, using a Tucker-Davis DD1 D-to-A system driven from a PC. Listeners were tested individually in a sound-treated room. They pressed a key to hear each item, and wrote down what they heard. Each listener heard each phrase onc
e: half the phrases in the right version, half wrong or hyper-speech. The order of items was randomized for each listener separately, and, because the noise was variable, it too was randomized separately for each listener. Five practice items preceded each
test.\par
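The noise mixing can be sketched in a few lines. The convention is assumed from the text (s/n taken relative to the phrase's maximum amplitude); function and parameter names are ours, not the experimental software's.

```python
# Sketch: scale the noise so that the s/n ratio, taken relative to the
# phrase's maximum amplitude, reaches a target such as the -4 dB used
# in this experiment. Convention and names are assumptions.
def noise_gain(signal_peak, noise_rms, snr_db):
    """Gain to apply to the noise to reach the requested SNR in dB."""
    target_noise_level = signal_peak / (10 ** (snr_db / 20))
    return target_noise_level / noise_rms

# At -4 dB the scaled noise level ends up above the signal peak:
g = noise_gain(signal_peak=1.0, noise_rms=0.25, snr_db=-4)
mixed_noise_level = 0.25 * g      # about 1.58 times the signal peak
```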
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.4.4. Results\par
\pard\plain \qj\sb240 \f20 Responses were scored for number of phonemes correct. Wrong insertions in otherwise correct responses counted as errors. There were two analyses, one on all phonemes in the phrase, the other on just three\emdash
the manipulated phoneme and
the two adjacent to it. Table 6 shows results for 16 phrases, i.e. excluding the two hyper-speech phrases. Responses were significantly better (p < 0.02) for the right versions in the 3-phone analysis, and achieved a significance level of 0.063 in the whole-p
hrase analysis.\par
\par
\trowd \trqc\trgaph107\trleft-107 \clbrdrt\brdrs \clbrdrl\brdrs \clshdng0\cellx1129\clbrdrt\brdrs \clbrdrl\brdrs \clbrdrr\brdrs \clshdng0\cellx4531\clbrdrt\brdrs \clbrdrr\brdrs \clshdng0\cellx6232\pard \qj\keepn\intbl context\cell version of phrase\cell
t(21) p (1-tail)\cell \pard \intbl \row \trowd \trqc\trgaph107\trleft-107 \clbrdrl\brdrs \clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx1127\clbrdrl\brdrs \clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx2828\clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx4529
\clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx6230\pard \qj\keepn\intbl \cell \ldblquote right\rdblquote \cell \ldblquote wrong\rdblquote \cell \cell \pard \intbl \row \trowd \trqc\trgaph107\trleft-107 \clbrdrt\brdrs \clbrdrl\brdrs \clbrdrb\brdrdot
\clbrdrr\brdrs \clshdng0\cellx1127\clbrdrt\brdrs \clbrdrl\brdrs \clbrdrb\brdrdot \clbrdrr\brdrs \clshdng0\cellx2828\clbrdrt\brdrs \clbrdrb\brdrdot \clbrdrr\brdrs \clshdng0\cellx4529\clbrdrt\brdrs \clbrdrb\brdrdot \clbrdrr\brdrs \clshdng0\cellx6230\pard
\qj\sb240\keepn\intbl 3 phones\cell 69\cell 61\cell 2.35 0.015\cell \pard \intbl \row \trowd \trqc\trgaph107\trleft-107 \clbrdrl\brdrs \clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx1127\clbrdrl\brdrs \clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx2828
\clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx4529\clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx6230\pard \qj\sb240\intbl entire phrase\cell 72\cell 68\cell 1.59 0.063\cell \pard \intbl \row \pard \qj\sb240 Table {\*\bkmkstart perc_data_\bkmkcoll32 }
{\*\bkmkend perc_data_}6. Percentage correctly identified phonemes in 16 phrases.\par
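The scoring described above can be sketched as follows. The exact scheme is assumed (the text says only that insertions in otherwise correct responses counted as errors); the alignment here uses a generic longest-matching-subsequence approach, and the transcription labels are our own rough stand-ins.

```python
# Sketch of a phonemes-correct score: count matched phonemes between
# target and response, then deduct insertions as errors. The scoring
# details are assumptions, not the experiment's actual procedure.
from difflib import SequenceMatcher

def phonemes_correct(target, response):
    matcher = SequenceMatcher(a=target, b=response)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    insertions = len(response) - matched
    return max(matched - insertions, 0)

target = list("rQb@")        # 'robber' in a rough SAMPA-like form
response = list("rQbl@")     # all phonemes right, but /l/ inserted
score = phonemes_correct(target, response)   # 4 matched - 1 insertion = 3
```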
Responses to the hyper-speech words differed: 84% vs. 89% correct for normal vs. hyper-speech {\i begged}; 85% vs. 76% correct for normal vs. hyper-speech {\i band} (3-phone analysis). Hyper-speech {\i in the} {\i band} was often misheard as {\i
in the van}. This lexical effect is an obvious consequence of enhanced periodicity in the /b/ closure of {\i band}.\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.4.5. Discussion\par
\pard\plain \qj\sb240 \f20 We have shown for one speaker of Southern British English that linguistic structure influences the type of excitation at the boundaries between voiceless fricatives and vowels, as well as the duration
of periodic excitation in the closures of voiced stops. Most FV boundaries are simple, whereas most VF boundaries are mixed. Within these broad patterns, syllable stress, vowel height, and final vs. non-final position within the phrase all influence the i
ncidence and/or duration of mixed excitation. We interpret these data as indicating that the principal determinant of mixed excitation is asynchrony in coordinating glottal and upper articulator movement. Timing relationships seem to be tighter at
FV than VF boundaries, and there can be considerable latitude in the timing of VF boundaries when the fricative is a phrase-final coda.\par
Our findings for voiced stops were as expected, if one assumes that the main determinants of the duration of low-frequency periodicity in the closure interval are aerodynamic. One interesting pattern is that voicing in the closures of prestressed onset sto
ps is short both in absolute terms and relative to the total duration of the closure.\par
We further showed that phoneme id
entification is better when the pattern of excitation at segment boundaries is appropriate for the structural context. Considering that only one acoustic boundary, i.e. one edge of one phone or diphone, was manipulated in most of the phrases, and that there
are relatively few data points, the significance levels achieved testify to the importance of synthesizing edges that are appropriate to the context. It is encouraging that differences were still fairly reliable in the whole-phrase analysis under these ci
rcumstances, since we would expect more response variability over the whole phrase.\par
If local changes in excitation type at segment boundaries enhance intelligibility significantly, then systematic attention to boundary details throughout the whole of a synthetic utterance will presumably enhance its robustness in noise considerably. Howev
er, it is a truism that at times the speech style that is most appropriate to the situation is not necessarily the most natural one. The two instances of hyper-speech are a
case in point. By increasing the duration of closure voicing in stressed onset stops, we imitated what people do to enhance intelligibility in adverse conditions such as noise or telephone bandwidths. But this manipulation risked making the /b/s sound lik
e /v/s, effectively widening the neighborhood of {\i band} to include {\i van.} Since {\i in the van} is as likely as {\i in the band}, contextual cues could not help, and {\i band}\rquote s intelligibility fell. {\i Begged}\rquote
s intelligibility may have risen because there were no obvious lexic
al competitors, and because we also enhanced the voicing in the syllable coda, thus making a more extreme hyper-speech style, and, perhaps crucially, a more consistent one. These issues need more work.\par
The perceptual data do not distinguish whether the \ldblquote right\rdblquote
versions are more intelligible because the manipulations enhance the acoustic and perceptual coherence of the signal at the boundary, or because they provide information about linguistic structure. The two possibilities are not mutually exclus
ive in any case. The data do suggest, however, that one reason for the appeal of diphone synthesis is not just that segment boundaries sound more natural, but that their naturalness may make them easier to understand, at least in noise. It thus seems worth
incorporating fine phonetic detail at segment boundaries into formant synthesis. It is relatively easy to produce these details with HLsyn, on which {\scaps procsy} is based.\par
\pard \qj\sb240 \sect \sectd \sbknone\linemod0\headery709\footery709\cols1\colsx289 \pard\plain \s255\qj\sb360\keepn \b\f21\fs28 7. Future work\par
\pard\plain \qj\sb240 \f20 {\b Yes, this section is dreadful\emdash a real construction site. Please be patient!}\par
Work is in progress [15] to automatically copy-synthesize database items into parameters for HLsyn, a Klatt-like formant synthesizer that synthesizes obstruents by means of pseudo-articulatory parameters. This method allows for easy production of utterance
s whose parameters can then be edited. Utterances can be altered to either conform to rules of the model, or to break such rules, thus allowing the perceptual salience of particular aspects of phonological structure to be assessed. Tests will as
sess speech intelligibility when listeners have competing tasks involving combinations of auditory vs. nonauditory modalities, and linguistic vs. nonlinguistic behaviour.\par
A statistical model based on our hypotheses about relevant phonological factors for temporal interpretation will be constructed from the database, leading to a fuller non-segmental model of temporal compression. Temporal, intonational and segmental details
will be stated as the phonetic exponents of the phonological structure.{\ul \par
}\pard\plain \s255\qj\sb360\keepn \b\f21\fs28 \sect \sectd \sbknone\linemod0\headery709\footery709\cols1\colsx289 \pard\plain \s255\qj\sb360\keepn \b\f21\fs28 8. REFERENCES\par
\pard\plain \qj\sb240 \f20 {\b I need you all to compile a set of references. I\rquote ll send out details of how to do it later this week; but for now please keep in mind that it\rquote ll need to be done very soon and check up any you aren\rquote
t sure about.}\par
\pard\plain \s15\qj\fi-284\li556\sb120\sl-219\tx560 \f65535\fs18 {\f20 1.\tab Hawkins, S. \ldblquote Arguments for a nonsegmental view of speech perception.\rdblquote }{\i\f20 Proc. ICPhS XIII}{\f20 , Stockholm. Vol. 3, 18-25, 1995.\par
2.\tab House, J. & Hawkins, S., \ldblquote An integrated phonological-phonetic model for text-to-speech synthesis\rdblquote , }{\i\f20 Proc. ICPhS XIII}{\f20 , Stockholm, Vol. 2, 326-329, 1995.\par
3.\tab Local, J.K. & Ogden R. \ldblquote A model of timing for nonsegmental phonological structure.\rdblquote In Jan P.H. van Santen, R W. Sproat, J. P. Olive & J. Hirschberg (eds.) }{\i\f20 Progress in Speech Synthesis}{\f20
. Springer, New York. 109-122, 1997.\par
4.\tab Local, J.K. \ldblquote Modelling assimilation in a non-segmental rule-free phonology.\rdblquote In G J Docherty & D R Ladd (eds): }{\i\f20 Papers in Laboratory Phonology II}{\f20 . Cambridge: CUP, 190-223, 1992.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 5.\tab Kelly, J. & Local, J. }{\i\f20 Doing Phonology.}{\f20 Manchester: University Press, 1989.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 6.\tab Hawkins, S., & Nguyen, N. \ldblquote Effects on word recognition of syllable-onset cues to syllable-coda voicing\rdblquote , }{\i\f20 LabPhon VI}{\f20 , York, 2-4 July 1998.\par
7.\tab Hawkins, S. & Slater, A. \ldblquote Spread of CV and V-to-V coarticulation in British English: implications for the intelligibility of synthetic speech.\rdblquote }{\i\f20 ICSLP}{\f20 94, 1: 57-60, 1994.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 8.\tab Tunley, A. \ldblquote Metrical influences on /r/-colouring in English\rdblquote , }{\i\f20 LabPhon VI}{\f20 , York, 2-4 July 1998.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 9.\tab Fixmer, E. and Hawkins, S. \ldblquote The influence of quality of information on the McGurk effect.\rdblquote Presented at Australian Workshop on Auditory-Visual Speech Processing, 1998.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 10.\tab Selkirk, E. O., }{\i\f20 Phonology and Syntax}{\f20 , MIT Press, Cambridge MA, 1984.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 11.\tab Broe, M. \ldblquote A unification-based approach to Prosodic Analysis.\rdblquote }{\i\f20 Edinburgh Working Papers in Cognitive Science}{\f20 \~7, 27-44, 1991.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 12.\tab Bladon, R.A.W. & Al-Bamerni, A. \ldblquote Coarticulation resistance in English /l/.\rdblquote }{\i\f20 J. Phon}{\f20 4: 137-150, 1976.\par
13.\tab http://www.w3.org/TR/1998/REC-xml-19980210\par
14.\tab http://www.ltg.ed.ac.uk/\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 15.\tab Heid, S. & Hawkins, S. \ldblquote Automatic parameter-estimation for high-quality formant synthesis using HLSyn.\rdblquote Presented }{\i\f20 at 3rd ESCA Workshop on Speech Synthesis}{\f20
, Jenolan Caves, Australia, 1998.\par
}\pard\plain \qj\sb240 \f20 [Ref1] http://www.w3.org/XML/\par
[Ref2] http://www.phon.ucl.ac.uk/project/prosynth.htm \par
[Ref3] Klatt, D., (1979) "Synthesis by rule of segmental durations in English sentences", Frontiers of Speech Communication Research, ed B.Lindblom & S.\'85hman, Academic Press.\par
}