CSL paper (3)


Richard Ogden (rao1@york.ac.uk)
Wed, 7 Jul 1999 16:40:01 +0100 (BST)


Enclosed is the CSL paper as it stands at the end of today. I've made
major amendments to the figures (including Mark's ProXML example, which
I've replaced with one of our timing examples), and I've included a
version of a diagram Jana sent me this morning. I've rewritten York's bit
on temporal interpretation and made most of the amendments Sarah
suggested.

I'm putting this on hold now till you send me any outstanding text, or
send me comments on things that you don't think are right. Here's how I
imagine the schedule:

till July 20th (approx): I will add any text you send me, and send you
updates after major changes only; I'll send an update before I leave.

July 22-Aug 9 (maybe a day or two later): I'm away.

till end of August: any more text; more work on joining sections together;
bibliographical details and cross-references; typos etc. get fixed

end of August: we should submit this paper!

Richard

Richard Ogden
rao1@york.ac.uk
http://www.york.ac.uk/~rao1/

ProSynth: An Integrated Prosodic Approach to Device-Independent, Natural-Sounding Speech Synthesis
Paul Carter***, Jana Dankovicová**, Sarah Hawkins*, Sebastian Heid*, Jill House**, Mark Huckvale**, John Local***, Richard Ogden***

* University of Cambridge, ** University College London, *** University of York
ABSTRACT
This paper outlines ProSynth, an approach to speech synthesis which takes a rich linguistic structure as central to the generation of natural-sounding speech. We start from the assumption that the speech signal is informationally rich, and that this acoustic richness reflects linguistic structural richness and underlies the percept of naturalness. Naturalness achieved by structural richness produces a perceptually robust signal, intelligible in adverse listening conditions. ProSynth uses syntactic and phonological parses to model the fine acoustic-phonetic detail of real speech, segmentally, temporally and intonationally. [[In this paper, we present the results of some preliminary tests to evaluate the effects of modelling timing, intonation and fine spectral detail.]]
1. Introduction
Speech synthesis by rule (text-to-speech, TTS) has restricted uses because it sounds unnatural and is often difficult to understand. Despite recent improvements in grammatical analysis and in deriving correct pronunciations for irregularly-spelled words, there remains a more fundamental problem: the inherent incoherence of the synthesized acoustic signal. It typically lacks the subtle systematic variability of natural speech that underlies the perceptual coherence of syllables and their constituents, and of the longer phrases of which they form part. Intonation is often dull and repetitive, timing and rhythm are poor, and the modifications that word boundaries undergo in connected speech are poorly modelled. Much of this incoherence arises because many modern TTS systems encode linguistic knowledge in ways which are not in tune with current understanding of human speech and language processes.

Segmental intelligibility data illustrate the scale of the problem. When heard in noise, most synthetic speech loses intelligibility much faster than natural speech: natural speech is about 15% less intelligible at a 0 dB signal-to-noise ratio than in quiet, whereas for isolated words/syllables, Pratt (1986) reported that typical synthetic speech drops by 35%-50%. We can expect similar results today. Concatenated natural speech avoids those problems related solely to voice quality and local segment boundaries, but suffers just as much from poor models of timing, intonation, and systematic variability in segmental quality that is dependent on word and stress structure. Even when the grammatical analysis is right, one string of words can sound good, while another with the same grammatical pattern does not.

Interdependencies between grammatical, prosodic and segmental parameters are well known to phoneticians and to everyone who has synthesized speech. When these components are developed for synthesis in separate modules, the apparent convenience is offset by the need to capture the interdependencies, which often leads to problems of rule ordering and rule proliferation to correct effects of earlier rules. Much of the robustness of natural speech is lost by neglecting systematic subphonemic variability, a neglect that results partly from an inappropriate emphasis on phoneme strings rather than on linguistic structure. Recent research in computational phonology (e.g. Bird 1995, Dirksen & Coleman forthcoming) combines highly structured linguistic representations (more technically, signs) with a declarative, computationally tractable formalism. Recent research in phonetics (e.g. Simpson 1992, Manuel et al. 1992, Hawkins & Slater 1994, Manuel 1995, Zsiga 1995) shows that speech is rich in non-phonemic information which contributes to its naturalness and robustness (Hawkins 1995). Work at York (Local 1992a & b, 1994, 1995a & b, Local & Fletcher 1991a & b, Ogden 1992) has shown it is possible to combine phonological with phonetic knowledge by means of a process known as phonetic interpretation: the assignment of phonetic parameters to pieces of phonological structure. Listening tests have shown that the synthetic speech generated by YorkTalk is interpreted and misinterpreted by listeners in ways that are very like those found for natural speech (Local 1993).
ProSynth, an integrated prosodic approach to speech synthesis, explores the viability of a phonological model that addresses phonetic weaknesses found in current concatenative and formant-based text-to-speech (TTS) systems, in which the speech often sounds unnatural because the rhythm, intonation and fine phonetic detail reflecting coarticulatory patterns are poor. Building on [1, 2, 3, 4], ProSynth integrates and extends existing knowledge to produce the core of a new model of computational phonology and phonetic interpretation which will deliver high-quality speech synthesis. Key objectives are: (1) demonstration of selected parts of a TTS system constructed on linguistically-motivated, declarative computational principles; (2) a system-independent description of the linguistic structures developed; (3) perceptual test results using criteria of naturalness and robustness. As an initial test of the viability of our approach, we use a set of representative linguistic structures applied to Southern British English.
2. Phonetic detail and perceptual coherence
More acoustic-phonetic fine detail is included in ProSynth than is standard in synthetic speech, consistent with the view that the signal will be more robust when it includes the patterns of systematic phonetic variability found in natural speech. This view is based on the argument that it is the informational richness of natural speech that makes it such an effective communicative medium. By informational richness, we mean that the acoustic fine detail of the time-varying speech signal reflects multidimensional properties of both vocal-tract dynamics and linguistic structure. The well-known "redundancy" of the speech signal, whereby a phone can be signalled by a number of more-or-less co-occurring acoustic properties, contributes some of this richness, but in our view, other less well-documented properties are just as important. These properties can be roughly divided into two groups: those that make the speech signal sound as if it comes from a single talker, and those that reflect linguistic structure, i.e. those that make it sound as if the talker is using a consistent accent and style of speech.
A speech signal sounds as if it comes from a single talker when its properties reflect details of vocal-tract dynamics. This type of systematic variability contributes to the fundamental acoustic coherence of the speech signal, and hence to its perceptual coherence. By perceptual coherence, then, we mean that the speech signal sounds as if it comes from a single talker because its properties reflect details of vocal-tract dynamics. Listeners associate these time-varying properties with human speech, so that when they bear the right relationships to one another, the perceptual system groups them together into an internally coherent auditory stream (cf. Bregman 199xx, Remez 19xx). A wide range of properties seems to contribute to perceptual coherence. The influence of some, like patterns of formant frequencies, is widely acknowledged (cf. the Remez and Rubin 19xx Science paper). Others are known to be important but are not always well understood; examples are the amplitude envelope, which governs some segmental distinctions (cf. Rosen and Howell 19xx) and also perceptions of rhythm and of 'integration' between stop bursts and following vowels (Van Tasell, Soli et al. 19xx); and correlations between the mode of glottal excitation and the behaviour of the upper articulators, especially at abrupt segment boundaries (Gobl and Ní Chasaide 19xx).

A speech signal sounds as if the talker is using a consistent accent and style of speech when the allophonic variation is right. This requires producing often small distinctions that reflect different combinations of linguistic properties. As an example, take the words mistakes and mistimes. Most people have no difficulty hearing that the /t/ of mistimes is aspirated whereas that of mistakes is not. The two words also have quite different rhythms: the first syllable of mistimes has a heavier beat than that of mistakes, even though the words begin with the same four phonemes. The spectrograms of the two words in Figure xx confirm the differences in aspiration of the /t/s, and also show that the /m/, /I/, and /s/ have quite different durations in the two words, consistent with the perceived rhythmic difference. These differences arise because the morphology of the words differs: mis is a removable prefix in mistimes, but in mistakes it is part of the word stem. These morphological differences are reflected in the syllable structure, as shown on the right of the Figure. In mistimes, /s/ is the coda of syllable 1, and /t/ is the onset of syllable 2. So the /s/ is relatively short, the /t/ closure is long, and the /t/ is aspirated. Conversely, the /s/ and /t/ in mistakes are ambisyllabic, which means that they form both the coda of syllable 1 and the onset of syllable 2. In an onset /st/, the /t/ is always unaspirated (cf. step, stop, start). The differences in the /m/ and the /I/ arise because mist is a phonologically heavy syllable whereas mis is phonologically light, and both syllables are metrically weak. So, in these metrically weak syllables, differences in morphology create differences in syllabification and phonological weight, and these appear as differences in duration or aspiration across all four initial segments.
Legend to Figure xx. Left: spectrograms of the words mistimes (top) and mistakes (bottom) spoken by a British English woman in the sentence "I'd be surprised if Tess _______ it" with main stress on Tess. Right: syllabic structures of each word.

Some types of systematic variability may contribute both perceptual coherence and information about linguistic structure. So-called resonance effects (Kelly and Local 1989) provide one example. Resonance effects associated with /r/, for example, manifest acoustically as lowered formant frequencies, and can spread over several syllables, but the factors that determine whether and how far they will spread include syllable stress, the number of consonants in the onset of the syllable, vowel quality, and the number of syllables in the foot (Slater and Hawkins 199x, Tunley 1999). The formant lowering probably reflects slow movements of the tongue body as it accommodates to the complex requirements of the English approximant /r/. On the one hand, including this type of information in synthetic speech makes it sound more natural in a subtle way that is hard to describe in phonetic terms but seems to make the signal "fit together" better; in other words, it seems to make it more coherent. On the other hand, the fact that the temporal extent of rhotic resonance effects depends on linguistic structure means not only that cues to the identity of a single phoneme can be distributed across a number of acoustic segments (sometimes several syllables), but also that aspects of the linguistic structure of the affected syllable(s) can be subtly signalled.

Listeners can use this type of distributed acoustic information to identify naturally-spoken words (Marslen-Wilson and Warren 199x; other WMW refs (Gaskell?); Hawkins and Nguyen, submitted to LabPhon), and when it is included in synthetic speech it can increase phoneme intelligibility in noise by 10-15% or more (Slater and Hawkins, Tunley). Natural-sounding, systematic variation of this type may be especially influential in adverse listening conditions or when cognitive loads are high (cf. Pisoni in the van Santen book, Pisoni and Duffy 19xx; [sh: check these refs]) because it is distributed, thus increasing the redundancy of the signal. However, Heid and Hawkins (1999, ICPhS) found similar increases in phoneme intelligibility simply by manipulating the excitation type at fricative-vowel and vowel-fricative boundaries and in the closure periods of voiced stops; these improvements to naturalness were quite local. Thus, although only some of the factors mentioned above have been shown to influence perception, on the basis of our own and others' recent work (Slater and Hawkins, Tunley, Heid & Hawkins ICPhS 1999; Pisoni in the van Santen book, Pisoni and Duffy 19xx, Kwong and Stevens 1999), we suggest that most of those whose perceptual contribution has not yet been tested would prove to enhance perception in at least some circumstances, as developed below. [xxThis para is not great but will have to do for now.]
In summary, natural speech is robust because it contains many phonetic details at the spectral, temporal and intonational levels, which form a coherent whole and which are the exponents of an underlying rich linguistic structure. In ProSynth, we attempt to model declaratively both linguistic structural richness and phonetic richness.
3. Structure of ProSynth
ProSynth builds on the knowledge gained in YorkTalk (refs.), and uses an open computational architecture for synthesis. There is a clear separation between the computational engine and the computational representations of data and knowledge. The overall architecture is shown in Fig. XX.
[Figure: block diagram of the ProSynth architecture. Marked text passes through Composition (consulting a Lexicon) and Interpretation (consulting Declarative knowledge); the interpreted structure feeds three output routes: MBROLA diphone synthesis, HLsyn quasi-articulatory synthesis, and prosody-manipulated speech.]
Fig. XX: ProSynth synthesis architecture.
Text marked for the type and placement of accents is input to the system, and a pronunciation lexicon is used to construct a strictly layered metrical structure for each intonational phrase in turn. The overall utterance is then represented as a hierarchy, described in more detail in Section XX.
The interpreted structure is converted to a parametric form depending on the signal generation method. The phonetic descriptions and timing can be used to select diphones and express their durations and pitch contours for output with the MBROLA system (Dutoit et al. ref). The phonetic details can also be used to augment copy-synthesis parameters for the HLsyn quasi-articulatory formant synthesiser (Heid & Hawkins ref., Jenolan Caves). The timings and pitch information have also been used to manipulate the prosody of natural speech using PSOLA (Hamon et al. ref).
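To make the MBROLA route concrete, the sketch below writes a .pho input file, in which each line carries a SAMPA phone, a duration in milliseconds, and optional (position %, F0 Hz) pitch targets. It is a minimal illustration in Python, not ProSynth code; the durations and F0 values are derived from the "with a bloom" structure in Fig. 2 below, except for the final F0 target, which is an assumption.

# Minimal sketch (not ProSynth code): writing an MBROLA .pho file.
# One line per phone: SAMPA symbol, duration in ms, then optional
# (position-within-phone %, F0 Hz) pitch target pairs.
segments = [
    ("b", 111, []),                       # 0.5561-0.6670 s in Fig. 2
    ("l", 67, []),
    ("u:", 179, [(0, 127), (100, 105)]),  # F0 from the FXMID values
    ("m", 176, [(100, 95)]),              # final F0 target assumed
]

def write_pho(segments, path):
    with open(path, "w") as f:
        for phone, dur, pitch in segments:
            targets = " ".join(f"{pos} {f0}" for pos, f0 in pitch)
            f.write(f"{phone} {dur} {targets}".rstrip() + "\n")

write_pho(segments, "bloom.pho")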
3.1 Linguistic Representation and Processing
The Extensible Markup Language (XML) is an extremely simple dialect of SGML (Standard Generalized Markup Language), the goal of which is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. XML is a standard for industry-specific mark-up proposed by the World Wide Web Consortium, intended for vendor-neutral data exchange, media-independent publishing, collaborative authoring, and the processing of documents by intelligent agents and other metadata applications [Ref1].
We have chosen to use XML as the external data representation for our phonological structures in ProSynth. The features of XML which make it ideal for this application are: storage of hierarchical information expressed in nodes with attributes; a standard text-based format suitable for networking; a strict and formal syntax; facilities for the expression of linkage between parts of the structure; and readily available software support.

In the ProSynth system, the input word sequence is converted to an XML representation which then passes through a number of stages representing phonetic interpretation. A declarative knowledge representation is used to encode knowledge of phonetic interpretation and to drive transformation of the XML data structures. Finally, special-purpose code translates the XML structures into parameter tables for signal generation.
In ProSynth, XML is used to encode the following:
Word Sequences: The text input to the synthesis system needs to be marked up in a number of ways. Importantly, it is assumed that the division into prosodic phrases and the assignment of accent types to those phrases has already been performed. This information is added to the text using a simple mark-up of Intonational Phrases and Accent Groups (Section XX).
Lexical Pronunciations: The lexicon maps word forms to syllable sequences. Each possible pronunciation of a word form has its own entry comprising SYLSEQ (i.e. syllable sequence), SYL, ONSET, RHYME, NUC, ACODA, CODA, VOC and CNS nodes. Information present in the input mark-up, possibly derived from syntactic analysis, selects the appropriate pronunciation for each word form.
Prosodic Structure: Each composed utterance comprising a single intonational phrase is stored in a hierarchy of UTT, WORDSEQ, WORD, IP, AG, FOOT, SYL, ONSET, RHYME, NUC, CODA, ACODA, VOC and CNS nodes. Syllables are cross-linked to the word nodes using linking attributes. This allows phonetic interpretation rules to be sensitive to the grammatical function of a word as well as to the position of the syllable in the word.
Database Annotation: A database has been constructed containing tokens of relevant linguistic structures, for the purpose of analysing the temporal, intonational and spectral phenomena we wish to replicate in synthesis. It has been manually annotated, and a prosodic structure complete with timing information has been constructed for each phrase. This annotation is stored in XML using the same format as for synthesis. Tools for searching this database help us in generating knowledge for interpretation.
An interesting characteristic of our prosodic structure is the use of ambisyllabic consonants (discussed in more detail in Section XX). This allows one or more consonants to be in the Coda of one syllable and in the Onset position of the next syllable. Examples are the medial consonants in "pity" or "tasty". To achieve ambisyllabicity in XML it is necessary to duplicate and link nodes, since XML rigidly enforces a strict hierarchy of components (a sketch of such duplication and linking follows Fig. 2 below).
An extract of a prosodic structure expressed in XML is shown in Figure XX, taken from the phrase "with a bloom" (see Fig. XX for another representation of this information).
<FOOT DUR="1" START="0.5561" STOP="1.0883">
  <SYL DUR="1" FPOS="1" RFPOS="1" RWPOS="1" START="0.5561" STOP="1.0883"
       STRENGTH="STRONG" WEIGHT="HEAVY" WPOS="1" WREF="WORD4">
    <ONSET DUR="1" START="0.5561" STOP="0.7341" STRENGTH="STRONG">
      <CNS AMBI="N" CNSCMP="N" CNSGRV="Y" CNT="N" DUR="1" NAS="N"
           RELEASE="0.6565" RHO="N" SON="N" START="0.5561" STOP="0.6670"
           STR="N" VOCGRV="N" VOCHEIGHT="CLOSE" VOCRND="N" VOI="Y">b</CNS>
      <CNS AMBI="N" CNSCMP="N" CNSGRV="N" CNT="Y" DUR="1" NAS="N" RHO="N"
           SON="Y" START="0.6670" STOP="0.7341" STR="N" VOCGRV="N"
           VOCHEIGHT="CLOSE" VOCRND="N" VOI="Y">l</CNS>
    </ONSET>
    <RHYME CHECKED="Y" DUR="1" START="0.7341" STOP="1.0883"
           STRENGTH="STRONG" VOI="Y" WEIGHT="HEAVY">
      <NUC CHECKED="Y" DUR="1" LONG="Y" START="0.7341" STOP="0.9126"
           STRENGTH="STRONG" VOI="Y" WEIGHT="HEAVY">
        <VOC DUR="1" FXGRD="-251.2" FXMID="126.7" GRV="Y" HEIGHT="CLOSE"
             RND="Y" START="0.7341" STOP="0.8234">u</VOC>
        <VOC DUR="1" FXGRD="-171.1" FXMID="105.4" GRV="Y" HEIGHT="CLOSE"
             RND="Y" START="0.8234" STOP="0.9126">u</VOC>
      </NUC>
      <CODA DUR="1" START="0.9126" STOP="1.0883" VOI="Y">
        <CNS AMBI="N" CNSCMP="N" CNSGRV="Y" CNT="N" DUR="1" NAS="Y" RHO="N"
             SON="Y" START="0.9126" STOP="1.0883" STR="N" VOCGRV="Y"
             VOCHEIGHT="CLOSE" VOCRND="Y" VOI="Y">m</CNS>
      </CODA>
    </RHYME>
  </SYL>
</FOOT>

Fig. 2. Partial XML representation of the utterance "with a bloom".
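Because XML enforces a strict hierarchy, an ambisyllabic consonant cannot literally be a child of two syllables: it is serialised as two linked element copies. The Python sketch below shows one way such duplication and linking might be expressed; the ID/LINK attribute names are hypothetical illustrations, not the actual ProSynth schema, which flags the feature with AMBI (cf. the AMBI="N" values above).

# Sketch: serialising the ambisyllabic /v/ of "loving" as duplicated,
# cross-linked CNS nodes.  ID/LINK are hypothetical attribute names.
import xml.etree.ElementTree as ET

syl1 = ET.Element("SYL")
coda = ET.SubElement(syl1, "CODA")
ET.SubElement(coda, "CNS", {"AMBI": "Y", "ID": "cns1"}).text = "v"

syl2 = ET.Element("SYL")
onset = ET.SubElement(syl2, "ONSET")
ET.SubElement(onset, "CNS", {"AMBI": "Y", "LINK": "cns1"}).text = "v"

# A consumer treats nodes related by ID/LINK as one re-entrant
# (shared) phonological object.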
3.2 Knowledge Representation
In ProSynth, knowledge for phonetic interpretation is expressed in a declarative form that operates on the prosodic structure. This means firstly that the knowledge is expressed as unordered rules, and secondly that it operates solely by manipulating the attributes on the XML-encoded phonological structure. To encode such knowledge, a representational language called ProXML was developed, in which it is easy to express the hierarchical contexts which drive processing and to make the appropriate changes to attributes. The ProXML language is read by an interpreter, PRX, written in C, which takes XML as its input and produces XML as its output. ProXML is a very simple language modelled on both C and Cascading Style Sheets (see [Ref2] for more information). A ProXML script consists of functions which are named after each element type in the XML file (each node type) and which are triggered by the presence of a node of that type in the input. When a function is called to process a node, a context is supplied centred on that node, so that reference to parent, child and sibling nodes is easy to express.
Figure XX shows a simple example of a ProXML script which adjusts the duration of the strong syllable in a disyllabic word whose second and final syllable is weak. If the first syllable is heavy, the rule is further dependent on the length of the vowel. In this example, the DUR attribute on SYL nodes is set as a function of the phonological attributes found on that node and on others in the hierarchy. Note that the rules modify the duration attribute (*= means "scale the existing value") rather than setting it to a specific value; in this way, the declarative character of the rule is maintained. The compression factors in the script are computed from regression-tree data taken from a database of natural speech.
SYL {
  if ((:STRENGTH=="STRONG")&&(:WPOS=="1")&&(:RWPOS=="2")
      &&(../SYL[2]:WEIGHT=="LIGHT"))
    if (:WEIGHT=="HEAVY")
      if (./RHYME/NUC:LONG=="Y")
        :DUR *= 1.0884;
      else
        :DUR *= 1.1420;
    else
      :DUR *= 0.8274;
}
Fig. X: Example ProXML script, which modifies syllable durations depending on syllable-level and nucleus-level attributes.
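For readers unfamiliar with ProXML, the rule can be paraphrased procedurally. The Python sketch below walks an ElementTree version of the structure in Fig. 2 and assumes, for simplicity, that SYL nodes are sibling children of a word-level element (the real structure cross-links syllables to words); it illustrates the rule's semantics and is not the PRX interpreter.

# Sketch: the rule of Fig. X paraphrased.  Mirrors the ProXML
# ":DUR *= factor" semantics: scale the existing value, never
# overwrite it, so rule applications stay order-independent.
def adjust_syllable_durations(word):
    syls = word.findall("SYL")
    for syl in syls:
        a = syl.attrib
        if (a.get("STRENGTH") == "STRONG" and a.get("WPOS") == "1"
                and a.get("RWPOS") == "2" and len(syls) > 1
                and syls[1].get("WEIGHT") == "LIGHT"):
            if a.get("WEIGHT") == "HEAVY":
                nuc = syl.find("RHYME/NUC")   # ./RHYME/NUC in ProXML
                factor = 1.0884 if nuc.get("LONG") == "Y" else 1.1420
            else:
                factor = 0.8274
            a["DUR"] = str(float(a["DUR"]) * factor)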
4. The Phonological Model
In this section, we describe the phonological model used in ProSynth and show how the linguistic knowledge the system encodes is vital for modelling 'segmental', temporal and intonational fine detail.
4.1 Overview
Central to ProSynth is a model which encodes phonological information in a hierarchical fashion using structures based on attribute-value pairs. A declarative framework based on constraint satisfaction identifies for each phonological unit a complete metrical context. This context is a prosodic hierarchy with phonological contrasts available at all levels. The complex interacting levels of rules present in traditional layered systems are replaced in ProSynth by a single phonetic interpretation function operating on the entire context, which makes rule-ordering unnecessary. Phonetics is related to phonology via a one-step phonetic interpretation function which makes use of as much linguistic knowledge as necessary. Systematic phonetic variability is constrained by position in structure. The basis of phonetic interpretation is not the segment, but phonological features at places in structure. These principles have been successfully demonstrated in YorkTalk (Local & Ogden 1997; Local 1992) for structures of up to three feet. We thus extend the principle successfully demonstrated in [3, 4] to a wider variety of phonological domains.
4.2 The Prosodic Hierarchy
Our phonological structure is organised as a prosodic hierarchy, with phonological information distributed across the structure. The knowledge is formally represented as a Directed Acyclic Graph (DAG) which is constrained so that re-entrant nodes are found only at the terminal level. Graph structures in the form of trees are commonly used in phonological analysis; ours differ in the important addition of ambisyllabicity. Phonological attributes and their associated values are distributed around the entire prosodic hierarchy, rather than confined to the terminal nodes as in many phonological theories. Attributes at any level in the hierarchy may be accessed for use in phonetic interpretation.
Text is parsed into a prosodic hierarchy which has units at the following levels: syllable constituents (Onset, Rhyme, Nucleus, Coda); Syllable; Foot; Accent Group (AG); Intonational Phrase (IP). Our prosodic hierarchy, building on House & Hawkins (1995) and Local & Ogden (1997), is a head-driven and strictly layered (Selkirk 1984) structure. Each unit is dominated by a unit at the next highest level (Strict Layer Hypothesis [10]). This produces a linguistically well-motivated and computationally tractable hierarchy. Constituents at each level have a set of possible attributes, and relationships between units at the same level are determined by the principle of headedness. Structure-sharing is explicitly recognized through ambisyllabicity.
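The layering can be summarised as a set of node types, each dominating only units at the next level down. The following Python sketch is a deliberately simplified rendering of that shape, with a generic attribute dictionary standing in for the much richer attribute sets of Fig. 2.

# Sketch: the strictly layered hierarchy as plain data types.
# Attributes live on every node, not just on terminals.
from dataclasses import dataclass, field

@dataclass
class Node:
    attrs: dict = field(default_factory=dict)   # e.g. {"WEIGHT": "HEAVY"}
    children: list = field(default_factory=list)

class Onset(Node): pass
class Nucleus(Node): pass
class Coda(Node): pass
class Rhyme(Node): pass        # children: Nucleus (+ optional Coda)
class Syllable(Node): pass     # children: optional Onset, then Rhyme
class Foot(Node): pass         # children: Syllables; left-headed
class AccentGroup(Node): pass  # children: Feet
class IP(Node): pass           # children: AccentGroups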
The richness of the hierarchy comes from the information stored within structural nodes in the form of attributes and parameter values. Attributes of the IP, for example, include discourse information which will determine the choice of intonation pattern. The IP consists of one or more Accent Groups (AGs), which in turn include as attributes specifications for the individual pitch accents making up the intonation contour.
Fig. XX shows a partial phonological structure for the phrase "with a bloom". Note that phonological information is spread around the structure. For example, the feature [voice] is treated as a property of the Rhyme as a whole, and not of just one of the terminal nodes headed by the Rhyme. Timing information is also included: in the diagram below, the [start] of the IP is the same as the [start] of the Onset of the first syllable of the utterance, and the [end] of the IP is the same as the [end] of the Coda of the last syllable, as indicated by the tags [1] and [2]. The value for [ambisyllabic] is shown for two consonants: note that for the [ambisyllabic: +] consonant /ð/, the terminal node is re-entrant.
[Figure: tree diagram of "with a bloom": an IP dominating AGs, Feet and Syllables, each Syllable branching into Onset and Rhyme (Rhyme into Nucleus and Coda), with terminal IPA symbols and attributes such as [strength: strong], [weight: heavy], [checked +], [voice +], [ambisyllabic: +]/[ambisyllabic: -], and shared [start]/[end] tags marked at various nodes.]
Fig. 1. Partial tree structure of the utterance "with a bloom". See text for details.
There is no separate level of phonological word within our hierarchy. Such a unit does not sit happily in a strictly layered structure: the boundaries of prosodic constituents like AG and Foot may well occur in the middle of a lexical item. Conversely, word boundaries may occur in the middle of a Foot/AG. Lexico-grammatical information may nonetheless be highly relevant to phonetic interpretation and must not be discarded. The computational representation of our prosodic structure allows us to get round this problem: word-level and syntactic-level information is hyper-linked into the prosodic hierarchy. In this way lexical boundaries and the grammatical functions of words can be used to inform phonetic interpretation.
4.3 Units of Structure and their Attributes
Input text is parsed to head-driven syntactic and phonological hierarchical structures. The phonological parse allots material to places in the prosodic hierarchy and is supplemented with links to the syntactic parse. The lexicon itself is in the form of a partially parsed representation. Phonetic interpretation may be sensitive to information at any level, so that it is possible to distinguish, for instance, a plosive in the onset of a weak foot-final syllable from an onset plosive in a weak foot-medial syllable.
Headedness: When a unit branches into sub-constituents, one of these constituents is its Head. If the leftmost constituent is the head, the constituent is said to be left-headed. Feet are left-headed. If the rightmost constituent is the head, the structure is right-headed. Properties of a head are shared by the nodes it dominates [11]. Therefore a [+heavy] syllable has a [+heavy] rhyme; the syllable-level resonance features [±grave] and [±round] can also be shared by nodes they dominate: this is how some aspects of coarticulation are modelled. In Fig. XX, headedness is indicated by vertical lines, as opposed to slanting ones. Phonetic interpretation proceeds head-first and is therefore determined in a structurally principled fashion without resort to extrinsic ordering.
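The two mechanisms just described, downward sharing of a head's features and head-first traversal, can be sketched together as follows, reusing the Node type from the sketch in Section 4.2; the SHARED feature set and the HEAD attribute are illustrative assumptions, not the system's actual encoding.

# Sketch: share selected features downward; visit heads first.
SHARED = {"WEIGHT", "GRV", "RND"}   # illustrative feature set

def interpret(node, inherited=None):
    # Features inherited from dominating nodes; local values win.
    node.attrs = {**(inherited or {}), **node.attrs}
    down = {k: v for k, v in node.attrs.items() if k in SHARED}
    # Head-first: process the head daughter before its sisters.
    for child in sorted(node.children,
                        key=lambda c: c.attrs.get("HEAD") != "Y"):
        interpret(child, down)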
Intonational Phrase (IP): The IP, the domain of a well-formed, coherent intonation contour, contains one or more AGs; minimally it must include a strong AG. The rightmost AG (traditionally the intonational nucleus) is the head of the IP. It is the largest prosodic domain recognised in the current implementation of our model.
Accent Groups (AG): AGs are made up of one or more Feet, which are primarily units of timing. An accented syllable is a stressed syllable associated with a pitch accent; an AG is a unit of intonation initiated by such a syllable, and incorporating any following unaccented syllables. The head of the AG is the leftmost heavy foot. A weak foot is also a weak, headless AG. AG attributes include [headedness], pitch accent specifications, and positional information within the IP.
Feet: All syllables are organised into Feet, which are primarily rhythmic units. Types of feet can be differentiated using the attributes [weight], [strength] and [headedness]. A foot is left-headed, with a [+strong] syllable at its head, and includes any [-strong] syllables to the right. Any phrase-initial weak syllables are grouped into a weak, headless foot, sometimes referred to as a "degenerate" foot. Degenerate feet are always [light]. Thus when an IP begins with one or more weak, unaccented syllables, we maintain the strictly layered structure by organising them into [light] feet which are in turn contained within similarly [light] (or degenerate) AGs. Consistent with the declarative formalism, attributes of the Foot are shared with its constituents, so that a syllable with the values [+head, +strong] is stressed.
Syllables: The Syllable contains the constituents Onset and Rhyme. The Rhyme branches into Nucleus and Coda. Nuclei, onsets and codas can all branch. The syllable is right-headed, the rhyme left-headed. Attributes of the syllable are [weight] (values heavy/light) and [strength] (values strong/weak): these are necessary for the correct assignment of temporal compression (§XX). Foot-initial syllables are strong.
Weight is defined with regard to the subconstituents of the Rhyme. A Syllable is heavy if its Nucleus attribute [length] has the value long (in segmental terms, if it contains a long vowel or a diphthong). A Syllable is also heavy if its coda has more than one constituent. For example, the syllable of bloom is heavy because its nucleus is long, and mist is heavy because its coda /st/ branches; mis, with a short vowel and at most one coda consonant, is light.
There is not a direct relationship between syllable strength and syllable weight. Strong syllables need not be heavy. In loving, /lʌv/ has a [short] Nucleus, and the coda has only one constituent (corresponding to /v/), yet it is the strong syllable in the Foot. Similarly, weak syllables need not be light. In amazement, the final Syllable has a branching Coda (i.e. more than one constituent) and is therefore [heavy] but [weak]. ProSynth does not make use of extrametricality: all phonological material must be dominated by an appropriate node in structure.
Phonological features: We use binary features, with each attribute having a value, where the value slot can also be filled by another attribute-value pair. To our set of conventional features we add the features [±rhotic], to allow us to mimic the long-domain resonance effects of /r/ [5, 8], and [±ambisyllabic] for ambisyllabic constituents (see below). Not all features are stated at the terminal nodes in the hierarchy: [±voice], for instance, is a property of the rhyme as a whole, in order to model durational and resonance effects.
Ambisyllabicity: Constituents which are shared between syllables are marked [+ambisyllabic]. Ambisyllabicity makes it easier to model coarticulation [4] and is an essential piece of knowledge in the overlaying of syllables to produce polysyllabic utterances. It is also used to predict properties such as plosive aspiration in intervocalic clusters (§XX).
Constituents are [+ambisyllabic] wherever this does not result in a breach of syllable structure constraints. Loving comprises two Syllables, /lʌv/ and /vɪŋ/, since /v/ is both a legitimate Coda for the first Syllable and a legitimate Onset for the second. Loveless has no ambisyllabicity, since /vl/ is neither a legitimate Onset nor a legitimate Coda. Clusters may be entirely ambisyllabic, as in risky (/rɪsk/ + /ski/), where /sk/ is a good Coda and Onset cluster; partially ambisyllabic (i.e. one consonant is [+ambisyllabic] and one is [-ambisyllabic]), as in selfish (/sɛlf/ + /fɪʃ/); or non-ambisyllabic, as in risk them (/rɪsk/ + /ðəm/).
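The legality test lends itself to a direct statement: take the longest stretch of material that the two syllabifications share while remaining both a legal coda and a legal onset. A sketch under obvious simplifications follows; the phonotactic inventories are tiny illustrative fragments (in one-character SAMPA), not the phonotactics of Southern British English.

# Sketch: deciding ambisyllabicity from phonotactic legality.
LEGAL_ONSETS = {("v",), ("s", "k"), ("f",)}
LEGAL_CODAS = {("v",), ("s", "k"), ("f",), ("l", "f")}

def ambisyllabic_overlap(syl1, syl2):
    # Longest shared stretch that is a legal coda of syl1 and a
    # legal onset of syl2; () means no ambisyllabicity.
    best = ()
    for n in range(1, min(len(syl1), len(syl2)) + 1):
        cand = tuple(syl1[-n:])
        if (cand == tuple(syl2[:n])
                and cand in LEGAL_CODAS and cand in LEGAL_ONSETS):
            best = cand
    return best

print(ambisyllabic_overlap(list("lVv"), list("vIN")))   # ('v',): loving
print(ambisyllabic_overlap(list("rIsk"), list("ski")))  # ('s','k'): risky
print(ambisyllabic_overlap(list("sElf"), list("fIS")))  # ('f',): selfish
print(ambisyllabic_overlap(list("rIsk"), list("D@m")))  # (): risk them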
5. Phonetic interpretation
This section describes more details of phonetic interpretation in ProSynth, focussing on temporal relations, intonation, and spectral detail. Our assumption is that there are close relationships between each of these aspects of speech. For example, once timing relations are accurately modelled, some of the spectral details (such as longer-domain resonance effects) can also be modelled as a by-product of the temporal modelling, when the output system is HLsyn (or any formant synthesizer). This particular trade-off between duration and spectral shape is not of course available to concatenative synthesis, but the knowledge it reflects could influence [be applied to?] unit selection. [????]
5.1 Temporal detail
Timing relations in ProSynth are handled primarily in terms of (1) temporal compression and (2) syllable overlap. Like spectral detail, temporal effects are treated as an aspect of the phonetic interpretation of phonological representations. Linguistic information necessary for temporal interpretation includes a grammar of syllable and word joins, using ambisyllabicity and an appropriate feature system. Such details as formant transition times, and inherent durational differences between close and open vowels, are handled in the statements of phonetic exponency pertaining to each bundle of features at a given place in structure.
A model of temporal compression allows the statement of relationships between constituents (primarily syllables) at different places in metrical structure [3], using a knowledge database. For instance, the syllable /man/ in the words man, manage, manager and in the utterance "She's a bank manager" has different degrees of temporal compression, which can be related to the metrical structure as a whole.
The timing model works top-down, i.e. from the highest unit in the hierarchy to the lowest. This reflects the assumption that the IP, AG, Foot and Syllable are all levels of timing, and that details of lower-level differences (such as segment type) can be overlaid on details of higher-level differences (such as syllable weight and strength; the strength and weight of an adjacent syllable; etc.). The top-down model also has the effect of constraining search spaces. For instance, if the distinction between heavy and light is relevant to the temporal interpretation of a syllable, then the temporal characteristics of the Onset of that syllable are sensitive to this fact, so that Onsets in heavy syllables and in light syllables have different durational properties.
The model of temporal compression is being constructed on the basis of the metrical structures of natural speech in a database (Section XX), although originally it was constructed on the basis of impressionistic listening. The labelled waveforms and their XML-parsed description files are searched according to relevant feature information (e.g. syllable weight and strength), and a Classification and Regression Tree model is used to generalise across these data and generate duration statistics for feature bundles at given places in the phonological structure. The duration model can be used to drive MBROLA, since it predicts the durations of acoustic segments.
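The top-down idea can be sketched as follows, again using the Node type from Section 4.2: every node contributes a compression factor determined by its features, and the product of the factors on the path from IP to terminal scales the terminal's duration. The factor lookup is a hypothetical stand-in for the regression-tree statistics; only the 0.8274 value is borrowed from the script in Fig. X.

# Sketch: top-down temporal interpretation.  Factors accumulate
# from the IP downward, so lower-level durational detail is
# overlaid on higher-level decisions.
def compression_factor(node):
    # Hypothetical stand-in for the CART-derived lookup.
    a = node.attrs
    if a.get("STRENGTH") == "STRONG" and a.get("WEIGHT") == "LIGHT":
        return 0.8274    # value borrowed from the Fig. X script
    return 1.0

def interpret_timing(node, factor=1.0):
    factor *= compression_factor(node)
    if not node.children:    # terminal: realise the scaled duration
        node.attrs["DUR"] = float(node.attrs.get("DUR", 1.0)) * factor
    for child in node.children:
        interpret_timing(child, factor)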
Syllable overlap: Another model of timing is based not on durations between acoustic events, but on a non-segmental model of temporal interpretation (Local & Ogden ref., Ogden, Local & Carter ref.). According to this model, higher-level constituents in the hierarchy are compressed, and their daughter nodes are compressed in the same way. The temporal interpretation of ambisyllabicity is the degree of overlap that exists between syllables, so an intervocalic consonant (typically ambisyllabic) has duration properties inherited from both of the syllables it belongs to.
Syllable(n) can be overlaid on Syllable(n-1) by setting its start point before the end point of Syllable(n-1). By overlaying syllables to varying degrees and making reference to ambisyllabicity, it is possible to lengthen or shorten intervocalic consonants systematically. There are morphologically bound differences which can be modelled in this way, provided that the phonological structure is sensitive to them. For instance, the Latinate prefix in- is fully overlaid with the stem to which it attaches, giving a short nasal in innocuous, while the Germanic prefix un- is not overlaid to the same degree, giving a long nasal in unknowing. Rhythmical differences in pairs like recite and re-site can likewise be treated as differences in phonological structure and consequent differences in the temporal interpretation of those structures.
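A minimal sketch of the overlap computation follows; the durations and overlap values are invented for illustration, with the overlap standing in for what ProSynth derives from ambisyllabicity and from structural contrasts like in-/un- above.

# Sketch: overlaying Syllable(n) on Syllable(n-1) by moving its
# start point back before its predecessor's end point.
def place_syllables(durations, overlaps):
    # durations: syllable durations in ms; overlaps[i]: how far (ms)
    # syllable i starts before syllable i-1 ends.  Returns (start,
    # end) times for each syllable.
    times, prev_end = [], 0.0
    for dur, lap in zip(durations, overlaps):
        start = max(0.0, prev_end - lap)
        prev_end = start + dur
        times.append((start, prev_end))
    return times

# Heavily overlaid prefix ("in-" + stem): short surface nasal.
print(place_syllables([180, 220], [0, 90]))
# Weakly overlaid prefix ("un-" + stem): long surface nasal.
print(place_syllables([180, 220], [0, 20]))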
5.2 Intonational detail
5.2.1. What
5.2.2. How
5.3 Spectral detail
[[Outline: 1. Spectral shape: what sound is it. 2. Fine-tune this so it fits in with the rest. Copy synthesis.]]
The temporal extent of systematic spectral variation due to coarticulatory processes is modelled using two intersecting principles. One reflects how much a given allophone blocks the influence of neighbouring sounds, and is like coarticulation resistance [12]. The other principle reflects resonance effects, or how far coarticulatory effects spread. The extent of resonance effects depends on a range of factors including syllabic weight, stress, accent, position in the foot, vowel height, and featural properties of other segments in the domain of potential influence. For example, intervening bilabials let lingual resonance effects spread to more distant syllables, whereas other lingual consonants may block their spread; similarly, resonance effects usually spread through unstressed but not stressed syllables.
5.3.1. What
5.3.2. How
6. Perceptual testing/experiments
[[This section will be expanded with the experimental results from respective sites. STILL LOTS OF WORK TO DO HERE ON JOINING THINGS UP BETTER. WILL WAIT TILL I GET UCL TEXT, THEN TAKE OUT COMMONALITIES AND PUT IN 6.1]]
6.1 Conditions shared by all experiments
\pard\plain \qj\sb240\sl360 \f20 [[This section will contain information relevant to all the experiments.]]\par
\pard\plain \s254\qj\sb360\keepn \b\i\f21 6.2\tab f0\par
\pard\plain \qj\sb240\sl360 \f20 [[Emphasises the innovation in our testing of intonation; something about lack of good standard models for testing intonation.]]\par
\pard\plain \s254\qj\sb360\keepn \b\i\f21 6.3\tab Timing\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.3.1.\tab Hypothesis\par
\pard\plain \qj\sb240\sl360 \f20
The hypothesis we are testing in ProSynth is that hierarchically organised, prosodically structured linguistic information should make it possible to produce more natural-sounding synthetic speech which is also more robust under difficult listening conditions. As an initial test of our hypotheses about temporal structure and its relation to prosodic structure, an experiment was conducted to test whether the categories set out in Section 2 make a significant difference to listeners\rquote  ability to interpret synthetic speech. If the timings ProSynth predicts for structural positions are perceptually important, listeners should be more successful at interpreting synthetic speech when the timing is appropriate for the structure than when it is inappropriate for the linguistic structures set up.\par
\pard \qj\fi357\sb240\sl360 The data consist of phrases from the database of natural English, generated with MBROLA [11] synthesis using timings of two different kinds: (1)\~the segment durations predicted by the ProSynth model taking into account all the linguistic structure outlined in Section 2; (2)\~the segment durations predicted by ProSynth for a different linguistic structure. If the linguistic structure makes no significant difference, then (1) and (2) should be perceived equally well (or equally badly). If temporal interpretation is sensitive to linguistic structure in the way that we have suggested, then the results for (1) should be better than the results for (2).\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.3.2.\tab Data\par
\pard\plain \qj\sb240\sl360 \f20
Twelve groups of structures to be compared on structural linguistic grounds were established (e.g. \ldblquote light ambisyllabic short initial syllable\rdblquote  versus \ldblquote light nonambisyllabic short initial syllable\rdblquote ). Each group has two members (e.g. {\i robber}/{\i rob them} and {\i loving}/{\i loveless}). For each phrase, two synthetic stimuli were generated: one with the predicted ProSynth timings for that structure, and another with the timings for the other member of the pair. Files were produced with timing information from the natural-speech utterances and an approximation to the f0 of the speech in the database. The timing information for the final foot was then replaced with timing from the ProSynth model. This produced the \ldblquote correct\rdblquote  timings. To produce the \ldblquote broken\rdblquote  timings, the timing information for the rhyme of the strong syllable in this final foot was swapped within the group, so that, for example, the durations for {\i ob} in {\i robber} were replaced with the durations for {\i ob} in {\i rob them} and vice versa (a sketch of this swap follows the duration tables below).\par
The stimuli have segment labels ultimately from the label files of the database, f0 information from the recordings in the database, and timing information partly from natural speech and partly from the ProSynth model.\par
As an example, consider the pair {\i (he\rquote s a) robber} and {\i (to) rob them}. The durations (in ms) for {\i robber} and {\i rob them} are:\par
\pard \qj\li720\sb240 {\f12407 \'81\tab }120\tab {\f12407 \'81}\tab 110\par
\pard \qj\li720 {\f12407 b\tab }65\tab {\f12407 b}\tab 85\par
{\f12407 \'ab\tab }150\tab {\f12407 D}\tab 60\par
{\f12407 \tab }\tab {\f12407 \'ab}\tab 120\par
{\f12407 \tab }\tab {\f12407 m}\tab 135\par
\pard \qj\sb240\sl360 Stimuli with these durations are compared with stimuli with the durations swapped round:\par
\pard \qj\li720\sb240 {\f12407 \'81\tab }110\tab {\f12407 \'81\tab }120\par
\pard \qj\li720 {\f12407 b\tab }85\tab {\f12407 b\tab }65\par
{\f12407 \'ab\tab }150\tab {\f12407 D\tab }60\par
{\f12407 \tab }\tab {\f12407 \'ab\tab }120\par
{\f12407 \tab }\tab {\f12407 m}\tab 135\par
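The stimulus construction can be sketched as follows, using SAMPA-style symbols and the durations from the tables above; the rhyme size and the output format are simplifying assumptions, not the actual pipeline:\par

# Sketch: swap the durations of the strong syllable's rhyme between the
# members of a pair, and emit MBROLA-style "phone duration" lines
# (f0 targets omitted for brevity).
robber = [("Q", 120), ("b", 65), ("@", 150)]
rob_them = [("Q", 110), ("b", 85), ("D", 60), ("@", 120), ("m", 135)]

RHYME = 2  # assume vowel + ambisyllabic consonant form the rhyme here

def swap_rhymes(a, b, n=RHYME):
    # exchange the durations (not the phone labels) of the first n segments
    a2 = [(ph, b[i][1]) for i, (ph, _) in enumerate(a[:n])] + a[n:]
    b2 = [(ph, a[i][1]) for i, (ph, _) in enumerate(b[:n])] + b[n:]
    return a2, b2

broken_robber, broken_rob_them = swap_rhymes(robber, rob_them)
for ph, dur in broken_robber:
    print(ph, dur)           # Q 110 / b 85 / @ 150, as in the second table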
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.3.3.\tab Experimental design\par
\pard\plain \qj\sb240\sl360 \f20
Twenty-two subjects heard every phrase once at a comfortable listening level over headphones, presented via a Tucker-Davis DD1 digital-to-analogue interface. The signal-to-noise ratio was -5\~dB. The noise was cafeteria noise, i.e. mixed background sounds such as voices and laughter. Subjects were instructed to transcribe what they heard using normal English spelling, and were given as much time as they needed. When they were ready, they pressed a key and the next stimulus was played.\par
Each subject heard half of the phrases as generated with the ProSynth model, and half with the timings switched. The subjects heard six practice items before hearing the test items, but were not informed of this.\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.3.4.\tab Results\par
\pard\plain \qj\sb240\sl360 \f20 The phoneme recognition rate for the correct timings from the ProSynth model is 77.5%, and for the switched timings it is 74.2%. Although this is only a small improvement, it is nevertheless significant using a
one-tailed correlated t-test (t(21) = 2.21, p < 0.02).\par
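The test can be reproduced in outline as follows. The per-subject scores are simulated (only the means are reported above), and scipy.stats.ttest_rel with its alternative argument (SciPy 1.6+) is assumed:\par

# Sketch of a one-tailed correlated (paired) t-test over 22 subjects.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated per-subject percent-correct scores; the reported means are
# 77.5% (correct timings) and 74.2% (switched timings).
correct = rng.normal(77.5, 6.0, size=22)
switched = correct - rng.normal(3.3, 4.0, size=22)

t, p = stats.ttest_rel(correct, switched, alternative="greater")
print(f"t(21) = {t:.2f}, one-tailed p = {p:.3f}")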
Examples of the stimuli and further details of the results of the experiments (including updates) are available on the World Wide Web [12].\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.3.5.\tab Discussion\par
\pard\plain \qj\sb240\sl360 \f20
The results show a significant effect of linguistic structure on intelligibility. The results are for the whole phrase, including parts which were not switched round; excluding these parts might strengthen the effect. The MBROLA diphone synthesis models durational effects, but not the segmental effects predicted by our model and described in more detail in Section 3: for example, the synthesis produces aspirated plosives in words like {\i roast}[{\f12407 H}]{\i ing}, where our model predicts non-aspiration. It also uses only a small diphone database. The rather low phoneme recognition rates may be due to the problematic quality of the synthesis, or to the cognitive load imposed by high levels of background noise. Further statistical analysis will group the data according to foot type, and future experiments will use a formant synthesiser.\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.3.6.\tab Future work\par
\pard\plain \qj\sb240\sl360 \f20
Future work will concentrate on refining the temporal model so that it generates durations which approximate those of our natural speech model as closely as possible. The model will be checked by further perceptual experiments, including presenting the synthetic stimuli under listening conditions that impose a high cognitive load, such as having the subjects perform another task while listening to the synthesis.\par
\pard\plain \s254\qj\sb360\keepn \b\i\f21 6.4\tab Segmental boundaries\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.4.1.\tab Material\par
\pard\plain \qj\sb240\sl360 \f20 Eighteen phrases from the database were copy-synthesized into HLsyn using {\scaps procsy} [15] at an 11.025 kHz sample rate, and hand-edited to a good standard of intelligibility, as judged by a number of listeners. In 10 phrases, the sound of interest was a voiceless fricative: at the onset of a stressed syllable\emdash {\i in a }{\i\ul f}{\i ield}
; unstressed onset\emdash {\i it\rquote s }{\i\ul s}{\i urreal}; coda of an unstressed syllable\emdash {\i to di}{\i\ul s}{\i robe}; between unstressed syllables\emdash {\i di}{\i\ul s}{\i appoint}; coda of a final stressed syllable\emdash {\i on the roo}
{\i\ul f}{\i , his ri}{\i\ul ff}{\i , a my}{\i\ul th}{\i , at a lo}{\i\ul ss}{\i , to cla}{\i\ul sh}; and both unstressed and stressed onsets\emdash {\i\ul f}{\i ul}{\i\ul f}{\i illed.}
 The other 8 items had voiced stops as the focus: in the coda of a final stressed syllable\emdash {\i it\rquote s mislai}{\i\ul d}{\i , he\rquote s a ro}{\i\ul gue}{\i , he was ro}{\i\ul bb}{\i ed}; stressed onset\emdash {\i in the }{\i\ul b}{\i and}
; unstressed onset\emdash {\i the }{\i\ul d}{\i elay, to }{\i\ul b}{\i e wronged}; unstressed and final post-stress contexts\emdash {\i to }{\i\ul d}{\i eri}{\i\ul de}; and in the onset and coda of a stressed syllable\emdash {\i he }{\i\ul b}{\i e}{\i\ul
gg}{\i ed.\par
}The sound of interest was synthesized with the \ldblquote right\rdblquote  type of excitation pattern. From each right version, a \ldblquote wrong\rdblquote  one was made by substituting a type or duration of excitation that was inappropriate for the context. Changes were systematic; no attempt was made to copy the exact details of the natural version of each phrase, as our aim was to test the perceptual salience of the type of change, with a view to incorporating it in a synthesis-by-rule system.\par
At FV boundaries, the right version had simple excitation (an abrupt transition between aperiodic and periodic excitation), and the wrong version had mixed periodic and aperiodic excit
ation. VF boundaries had the opposite pattern: wrong versions had no mixed excitation. See Fig. 1. Right versions were expected to be more intelligible than wrong versions of fricatives.\par
Each stop had one of two types of wrong voicing: longer-than-normal voicing for {\i\ul b}{\i and} and {\i\ul b}{\i e}{\i\ul gg}{\i ed} (see Fig. 2), whose onset stops normally have a short proportion of voicing in the closure; and unnaturally short voicing in the closures of the other six words. The wrong versions of {\i\ul b}{\i and} and {\i\ul b}{\i e}{\i\ul gg}{\i ed} were classed as hyper-speech and expected to be more intelligible than the right versions. The other six were expected to be less intelligible in noise if naturalness and intelligibility co-varied.\par
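The kind of manipulation involved can be sketched numerically (numpy only; all parameters are illustrative and do not reproduce the PROCSY/HLsyn settings):\par

# Sketch: a "simple" FV boundary switches abruptly from aperiodic to
# periodic excitation; a "mixed" one overlaps the two sources briefly.
import numpy as np

SR = 11025                                  # sample rate of the stimuli
n = int(0.2 * SR)                           # 200 ms of signal
t = np.arange(n) / SR
noise = np.random.default_rng(1).normal(0, 0.3, n)       # frication source
voiced = np.sign(np.sin(2 * np.pi * 120 * t)) * 0.5      # crude 120 Hz pulses

edge = n // 2
simple = np.where(np.arange(n) < edge, noise, voiced)    # abrupt transition

ramp_len = int(0.03 * SR)                   # 30 ms of mixed excitation
ramp = np.clip((np.arange(n) - (edge - ramp_len // 2)) / ramp_len, 0.0, 1.0)
mixed = (1.0 - ramp) * noise + ramp * voiced             # overlapping sources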
<FIG MISSING>\par
Figure 1. Spectrograms of part of /{\scaps\f12407 is}/ in {\i disappoint}. Left: natural; mid: synthetic \ldblquote right\rdblquote version; right: synthetic \ldblquote wrong\rdblquote version.\par
<FIG MISSING>\par
<FIG MISSING>\par
<FIG MISSING>\par
Figure 2. Waveforms showing the region around the closure of /b/ in {\i he begged}. Upper panel: natural speech; middle: \ldblquote right\rdblquote synthetic version; lower: hyper-speech synthetic version.\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.4.2.\tab Subjects\par
\pard\plain \qj\sb240\sl360 \f20 The 22 subjects were Cambridge University students, all native speakers of British English under 30 years old, with no known speech or hearing problems.\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.4.3.\tab Procedure\par
\pard\plain \qj\sb240\sl360 \f20
The 18 experimental items were mixed with randomly-varying cafeteria noise at an average s/n ratio of -4\~dB relative to the maximum amplitude of the phrase. They were presented to listeners over high-quality headphones at a comfortable listening level, using a Tucker-Davis DD1 D-to-A system driven from a PC. Listeners were tested individually in a sound-treated room. They pressed a key to hear each item, and wrote down what they heard. Each listener heard each phrase once: half the phrases in the right version, half in the wrong or hyper-speech version. The order of items was randomized separately for each listener, and, because the noise was variable, it too was randomized separately for each listener. Five practice items preceded each test.\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.4.4.\tab{ Results\par
}\pard\plain \qj\sb240\sl360 \f20 Responses were scored for the number of phonemes correct; wrong insertions in otherwise correct responses counted as errors (a scoring sketch follows Table 6). There were two analyses: one on all phonemes in the phrase, the other on just three\emdash the manipulated phoneme and the two adjacent to it. Table 6 shows results for 16 phrases, i.e. excluding the two hyper-speech phrases. Responses were significantly better (p < 0.02) for the right versions in the 3-phone analysis, and reached a significance level of 0.063 in the whole-phrase analysis.\par
\par
\trowd \trqc\trgaph107\trleft-107 \clbrdrt\brdrs \clbrdrl\brdrs \clshdng0\cellx1129\clbrdrt\brdrs \clbrdrl\brdrs \clbrdrr\brdrs \clshdng0\cellx4531\clbrdrt\brdrs \clbrdrr\brdrs \clshdng0\cellx6232\pard \qj\sl360\intbl context\cell version of phrase\cell
 t(21) p (1-tail)\cell \pard \intbl \row \trowd \trqc\trgaph107\trleft-107 \clbrdrl\brdrs \clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx1127\clbrdrl\brdrs \clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx2828\clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx4529
\clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx6230\pard \qj\sl360\intbl \cell \ldblquote right\rdblquote \cell \ldblquote wrong\rdblquote \cell \cell \pard \intbl \row \trowd \trqc\trgaph107\trleft-107 \clbrdrt\brdrs \clbrdrl\brdrs \clbrdrb\brdrdot
\clbrdrr\brdrs \clshdng0\cellx1127\clbrdrt\brdrs \clbrdrl\brdrs \clbrdrb\brdrdot \clbrdrr\brdrs \clshdng0\cellx2828\clbrdrt\brdrs \clbrdrb\brdrdot \clbrdrr\brdrs \clshdng0\cellx4529\clbrdrt\brdrs \clbrdrb\brdrdot \clbrdrr\brdrs \clshdng0\cellx6230\pard
\qj\sb240\sl360\intbl 3 phones\cell 69\cell 61\cell 2.35 0.015\cell \pard \intbl \row \trowd \trqc\trgaph107\trleft-107 \clbrdrl\brdrs \clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx1127\clbrdrl\brdrs \clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx2828
\clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx4529\clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx6230\pard \qj\sb240\sl360\intbl entire phrase\cell 72\cell 68\cell 1.59 0.063\cell \pard \intbl \row \pard \qj\sb240\sl360 Table {\*\bkmkstart perc_data
\bkmkcoll32 }6{\*\bkmkend perc_data}. Percentage correctly identified phonemes in 16 phrases.\par
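The scoring rule can be approximated with a standard alignment; the exact hand-scoring procedure is not specified above, so the insertion penalty here is an assumption:\par

# Sketch: count phonemes correct by aligning response to target with a
# longest-common-subsequence DP; unmatched response phones (insertions)
# are penalised, on the assumption that insertions count as errors.
def phonemes_correct(target, response):
    n, m = len(target), len(response)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + (target[i - 1] == response[j - 1])
            dp[i][j] = max(match, dp[i - 1][j], dp[i][j - 1])
    matches = dp[n][m]
    insertions = m - matches      # response phones with no target match
    return max(matches - insertions, 0)

print(phonemes_correct(list("rQb@"), list("rQbz@")))   # one insertion -> 3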
Responses to the hyper-speech words differed: 84% vs. 89% correct for normal vs. hyper-speech {\i begged}; 85% vs. 76% correct for normal vs. hyper-speech {\i band} (3-phone analysis). Hyper-speech {\i in the} {\i band} was often misheard as {\i
in the van}. This lexical effect is an obvious consequence of enhanced periodicity in the /b/ closure of {\i band}.\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.4.5.\tab Discussion\par
\pard\plain \qj\sb240\sl360 \f20 We have shown for one speaker of Southern British English that linguistic structure influences the type of excitation at the boundaries between voiceless fricatives and vowels, as well as the duration of periodic excitation in the closures of voiced stops. Most FV boundaries are simple, whereas most VF boundaries are mixed. Within these broad patterns, syllable stress, vowel height, and final vs. non-final position within the phrase all influence the incidence and/or duration of mixed excitation. We interpret these data as indicating that the principal determinant of mixed excitation is asynchrony in coordinating glottal and upper articulator movements. Timing relationships seem to be tighter at FV than at VF boundaries, and there can be considerable latitude in the timing of VF boundaries when the fricative is a phrase-final coda.\par
Our findings for voiced stops were as expected, if one assumes that the main determinants of the duration of low-frequency periodicity in the closure interval are aerodynamic. One interesting pattern is that voicing in the closures of prestressed onset stops is short both in absolute terms and relative to the total duration of the closure.\par

We further showed that phoneme identification is better when the pattern of excitation at segment boundaries is appropriate for the structural context. Considering that only one acoustic boundary, i.e. one edge of one phone or diphone, was manipulated in most of the phrases, and that there are relatively few data points, the significance levels achieved testify to the importance of synthesizing edges that are appropriate to the context. It is encouraging that differences were still fairly reliable in the whole-phrase analysis under these circumstances, since we would expect more response variability over the whole phrase.\par

If local changes in excitation type at segment boundaries enhance intelligibility significantly, then systematic attention to boundary details throughout the whole of a synthetic utterance will presumably enhance its robustness in noise considerably. However, the speech style that is most appropriate to the situation is not necessarily the most natural one. The two instances of hyper-speech are a case in point. By increasing the duration of closure voicing in stressed onset stops, we imitated what people do to enhance intelligibility in adverse conditions such as noise or telephone bandwidths. But this manipulation risked making the /b/s sound like /v/s, effectively widening the lexical neighborhood of {\i band} to include {\i van}. Since {\i in the van} is as likely as {\i in the band}, contextual cues could not help, and {\i band}\rquote s intelligibility fell. {\i Begged}\rquote s intelligibility may have risen because there were no obvious lexical competitors, and because we also enhanced the voicing in the syllable coda, producing a more extreme hyper-speech style and, perhaps crucially, a more consistent one. These issues need more work.\par
The perceptual data do not distinguish whether the \ldblquote right\rdblquote  versions are more intelligible because the manipulations enhance the acoustic and perceptual coherence of the signal at the boundary, or because they provide information about linguistic structure. The two possibilities are not mutually exclusive in any case. The data do suggest, however, that part of the appeal of diphone synthesis is not just that segment boundaries sound more natural, but that their naturalness may make them easier to understand, at least in noise. It thus seems worth incorporating fine phonetic detail at segment boundaries into formant synthesis. It is relatively easy to produce these details with HLsyn, on which {\scaps procsy} is based.\par
\pard\plain \s255\qj\sb360\keepn \b\f21\fs28 \sect \sectd \sbknone\linemod0\headery709\footery709\cols1\colsx289 \pard\plain \s255\qj\sb360\keepn \b\f21\fs28 7. Future work\par
\pard\plain \qj\sb240\sl360 \f20
Work is in progress [15] to automatically copy-synthesize database items into parameters for HLsyn, a Klatt-like formant synthesizer that synthesizes obstruents by means of pseudo-articulatory parameters. This method makes it easy to produce utterances whose parameters can then be edited. Utterances can be altered either to conform to the rules of the model or to break them, allowing the perceptual salience of particular aspects of phonological structure to be assessed. Tests will assess speech intelligibility when listeners have competing tasks involving combinations of auditory vs. nonauditory modalities, and linguistic vs. nonlinguistic behaviour.\par

A statistical model based on our hypotheses about relevant phonological factors for temporal interpretation will be constructed from the database, leading to a fuller non-segmental model of temporal compression. Temporal, intonational and segmental details
 will be stated as the phonetic exponents of the phonological structure.{\ul \par
}\pard\plain \s255\qj\sb360\keepn \b\f21\fs28 \sect \sectd \sbknone\linemod0\headery709\footery709\cols1\colsx289 \pard\plain \s255\qj\sb360\keepn \b\f21\fs28 8. REFERENCES\par
\pard\plain \s15\qj\fi-284\li556\sb120\sl-219\tx560 \f65535\fs18 {\f20 1.\tab Hawkins, S. \ldblquote Arguments for a nonsegmental view of speech perception.\rdblquote  }{\i\f20 Proc. ICPhS XIII}{\f20 , Stockholm, Vol. 3, 18-25, 1995.\par
2.\tab House, J. & Hawkins, S., \ldblquote An integrated phonological-phonetic model for text-to-speech synthesis\rdblquote , }{\i\f20 Proc. ICPhS XIII}{\f20 , Stockholm, Vol. 2, 326-329, 1995.\par
3.\tab Local, J.K. & Ogden, R. \ldblquote A model of timing for nonsegmental phonological structure.\rdblquote  In J. P. H. van Santen, R. W. Sproat, J. P. Olive & J. Hirschberg (eds.), }{\i\f20 Progress in Speech Synthesis}{\f20 , Springer, New York, 109-122, 1997.\par
4.\tab Local, J.K. \ldblquote Modelling assimilation in a non-segmental rule-free phonology.\rdblquote  In G. J. Docherty & D. R. Ladd (eds.), }{\i\f20 Papers in Laboratory Phonology II}{\f20 . Cambridge: CUP, 190-223, 1992.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 5.\tab Kelly, J. & Local, J. }{\i\f20 Doing Phonology.}{\f20  Manchester: Manchester University Press, 1989.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 6.\tab Hawkins, S., & Nguyen, N. \ldblquote Effects on word recognition of syllable-onset cues to syllable-coda voicing\rdblquote , }{\i\f20 LabPhon VI}{\f20 , York, 2-4 July 1998.\par
7.\tab Hawkins, S. & Slater, A. \ldblquote Spread of CV and V-to-V coarticulation in British English: implications for the intelligibility of synthetic speech.\rdblquote  }{\i\f20 ICSLP}{\f20  94, 1: 57-60, 1994.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 8.\tab Tunley, A. \ldblquote Metrical influences on /r/-colouring in English\rdblquote , }{\i\f20 LabPhon VI}{\f20 , York, 2-4 July 1998.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 9.\tab Fixmer, E. & Hawkins, S. \ldblquote The influence of quality of information on the McGurk effect.\rdblquote  Presented at the Australian Workshop on Auditory-Visual Speech Processing, 1998.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 10.\tab Selkirk, E. O., }{\i\f20 Phonology and Syntax}{\f20 , MIT Press, Cambridge MA, 1984.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 11.\tab Broe, M. \ldblquote A unification-based approach to Prosodic Analysis.\rdblquote }{\i\f20 Edinburgh Working Papers in Cognitive Science}{\f20 \~7, 27-44, 1991.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 12.\tab Bladon, R.A.W. & Al-Bamerni, A. \ldblquote Coarticulation resistance in English /l/.\rdblquote  }{\i\f20 J. Phon.}{\f20  4: 137-150, 1976.\par
13.\tab http://www.w3.org/TR/1998/REC-xml-19980210\par
14.\tab http://www.ltg.ed.ac.uk/\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 15.\tab Heid, S. & Hawkins, S. \ldblquote Automatic parameter-estimation for high-quality formant synthesis using HLSyn.\rdblquote  Presented at the }{\i\f20 3rd ESCA Workshop on Speech Synthesis}{\f20 , Jenolan Caves, Australia, 1998.\par
}\pard\plain \qj\sb240\sl360 \f20 [Ref1] http://www.w3.org/XML/\par
[Ref2] http://www.phon.ucl.ac.uk/project/prosynth.htm \par
[Ref3] Klatt, D. \ldblquote Synthesis by rule of segmental durations in English sentences.\rdblquote  In B. Lindblom & S. \'85hman (eds.), Frontiers of Speech Communication Research, Academic Press, 1979.\par
}


