CSL paper (2)


Richard Ogden (rao1@york.ac.uk)
Tue, 6 Jul 1999 17:16:46 +0100 (BST)


Enclosed is the CSL paper as it now stands (rtf format). I can't work on
it any more today, my mind is mush.

Feedback on general structural issues would be welcome: is the paper more
coherent than it was?
Jill is sending me a picture or two tomorrow.

Leave typo comments out for now please: we'll deal with dots and commas
last. But some of the paper might be repetitious. I removed some of the
repetition, but suspect it's not all gone. If you find any problems from
sequencing of material, that'd be helpful.

There are lots of gaps to be filled in the latter sections of the paper.
Most notably in the section called "Phonetic interpretation" -- this is a
working title for the section that's going to contain material from each
of the three sites on temporal/spectral/intonational stuff. Sarah reminded
me (very helpfully) that this is a position paper: so give us some whats
and hows and don't get hung up on the detail. (Says he, whose section is a
miserable failure in this respect!!) We all have something to write here.
I suggest no more than 1 page A4 from each place.

UCL, please write up your test results so we can include them! It'll look
weird if they're not in, and yours are more robust than ours or Cam's. If
the reviewers think they're too crappy to include, they will say so! With
results from all of us, this paper will look really great.

I'll work on this tomorrow afternoon; applications for research leave to be
taken within the next two years have to be submitted tomorrow, so that's got
to get done first.

Richard

Richard Ogden
rao1@york.ac.uk
http://www.york.ac.uk/~rao1/

ProSynth: An Integrated Prosodic Approach to Device-Independent, Natural-Sounding Speech Synthesis

Paul Carter***, Jana Dankovičová**, Sarah Hawkins*, Sebastian Heid*, Jill House**, Mark Huckvale**, John Local***, Richard Ogden***

* University of Cambridge, ** University College London, *** University of York

ABSTRACT
This paper outlines ProSynth, an approach to speech synthesis which takes a rich linguistic structure as central to the generation of natural-sounding speech. We start from the assumption that the speech signal is informationally rich, and that this acoustic richness reflects linguistic structural richness and underlies the percept of naturalness. Naturalness achieved through structural richness produces a perceptually robust signal that remains intelligible in adverse listening conditions. ProSynth uses syntactic and phonological parses to model the fine acoustic-phonetic detail of real speech segmentally, temporally and intonationally. [[In this paper, we present the results of some preliminary tests to evaluate the effects of modelling timing, intonation and fine acoustic-phonetic detail.]]
1. Introduction

Speech synthesis by rule (text-to-speech, TTS) has restricted uses because it sounds unnatural and is often difficult to understand. Despite recent improvements in grammatical analysis and in deriving correct pronunciations for irregularly-spelled words, there remains a more fundamental problem: the inherent incoherence of the synthesized acoustic signal. This typically lacks the subtle systematic variability of natural speech that underlies the perceptual coherence of syllables and their constituents, and of the longer phrases of which they form part. Intonation is often dull and repetitive, timing and rhythm are poor, and the modifications that word boundaries undergo in connected speech are poorly modelled. Much of this incoherence arises because many modern TTS systems encode linguistic knowledge in ways which are not in tune with current understanding of human speech and language processes.

Segmental intelligibility data illustrate the scale of the problem. When heard in noise, most synthetic speech loses intelligibility much faster than natural speech: natural speech is about 15% less intelligible at a 0 dB S/N ratio than in quiet, whereas for isolated words/syllables, Pratt (1986) reported that typical synthetic speech drops by 35%-50%. We can expect similar results today. Concatenated natural speech avoids those problems related solely to voice quality and local segment boundaries, but suffers just as much from poor models of timing, intonation, and systematic variability in segmental quality that is dependent on word and stress structure. Even when the grammatical analysis is right, one string of words can sound good, while another with the same grammatical pattern does not.

Interdependencies between grammatical, prosodic and segmental parameters are well known to phoneticians and to everyone who has synthesized speech. When these components are developed for synthesis in separate modules, the apparent convenience is offset by the need to capture the interdependencies, which often leads to problems of rule ordering and rule proliferation to correct the effects of earlier rules. Much of the robustness of natural speech is lost by neglecting systematic subphonemic variability, a neglect that results partly from an inappropriate emphasis on phoneme strings rather than on linguistic structure. Recent research in computational phonology (e.g. Bird 1995, Dirksen & Coleman forthcoming) combines highly structured linguistic representations (more technically, signs) with a declarative, computationally tractable formalism. Recent research in phonetics (e.g. Simpson 1992, Manuel et al. 1992, Hawkins & Slater 1994, Manuel 1995, Zsiga 1995) shows that speech is rich in non-phonemic information which contributes to its naturalness and robustness (Hawkins 1995). Work at York (Local 1992a & b, 1994, 1995a & b, Local & Fletcher 1991a & b, Ogden 1992) has shown it is possible to combine phonological with phonetic knowledge by means of a process known as phonetic interpretation: the assignment of phonetic parameters to pieces of phonological structure. Listening tests have shown that the synthetic speech generated by YorkTalk is interpreted and misinterpreted by listeners in ways that are very like those found for natural speech (Local 1993).
ProSynth, an integrated prosodic approach to speech synthesis, explores the viability of a phonological model that addresses phonetic weaknesses found in current concatenative and formant-based text-to-speech (TTS) systems, in which the speech often sounds unnatural because the rhythm, intonation and fine phonetic detail reflecting coarticulatory patterns are poor. Building on [1, 2, 3, 4], ProSynth integrates and extends existing knowledge to produce the core of a new model of computational phonology and phonetic interpretation which will deliver high-quality speech synthesis. Key objectives are: (1) demonstration of selected parts of a TTS system constructed on linguistically-motivated, declarative computational principles; (2) a system-independent description of the linguistic structures developed; (3) perceptual test results using criteria of naturalness and robustness. As an initial test of the viability of our approach, we use a set of representative linguistic structures applied to Southern British English.
2. Phonetic detail and perceptual coherence

More acoustic-phonetic fine detail is included in ProSynth than is standard in synthetic speech, consistent with the view that the signal will be more robust when it includes the patterns of systematic phonetic variability found in natural speech. This view is based on the argument that it is the informational richness of natural speech that makes it such an effective communicative medium. By informational richness, we mean that the acoustic fine detail of the time-varying speech signal reflects multidimensional properties of both vocal-tract dynamics and linguistic structure. The well-known "redundancy" of the speech signal, whereby a phone can be signalled by a number of more-or-less co-occurring acoustic properties, contributes some of this richness, but in our view other, less well-documented properties are just as important. These properties can be roughly divided into two groups: those that make the speech signal sound as if it comes from a single talker, and those that reflect linguistic structure -- i.e. those that make it sound as if the talker is using a consistent accent and style of speech.

A speech signal sounds as if it comes from a single talker when its properties reflect details of vocal-tract dynamics. This type of systematic variability contributes to the fundamental acoustic coherence of the speech signal, and hence to its perceptual coherence. Listeners associate these time-varying properties with human speech, so that when they bear the right relationships to one another, the perceptual system groups them together into an internally coherent auditory stream (cf. Bregman 199xx, Remez 19xx). A wide range of properties seems to contribute to perceptual coherence. The influence of some, like patterns of formant frequencies, is widely acknowledged (cf. Remez and Rubin 19xx Science paper). Others are known to be important but are not always well understood; examples are the amplitude envelope, which governs some segmental distinctions (cf. Rosen and Howell 19xx) and also perceptions of rhythm and of 'integration' between stop bursts and following vowels (van Tasell, Soli et al. 19xx); and correlations between the mode of glottal excitation and the behaviour of the upper articulators, especially at abrupt segment boundaries (Gobl and Ní Chasaide 19xx).

A speech signal sounds as if the talker is using a consistent accent and style of speech when the allophonic variation is right. This requires producing often small distinctions that reflect different combinations of linguistic properties. As an example, take the words mistakes and mistimes. Most people have no difficulty hearing that the /t/ of mistimes is aspirated whereas that of mistakes is not. The two words also have quite different rhythms: the first syllable of mistimes has a heavier beat than that of mistakes, even though the words begin with the same four phonemes. The spectrograms of the two words in Figure xx confirm the differences in aspiration of the /t/s, and also show that the /m/, /I/ and /s/ have quite different durations in the two words, consistent with the perceived rhythmic difference. These differences arise because the morphology of the words differs: mis- is a removable prefix in mistimes, but in mistakes it is part of the word stem. These morphological differences are reflected in the syllable structure, as shown on the right of the Figure. In mistimes, /s/ is the coda of syllable 1, and /t/ is the onset of syllable 2. So the /s/ is relatively short, the /t/ closure is long, and the /t/ is aspirated. Conversely, the /s/ and /t/ in mistakes are ambisyllabic, which means that they form both the coda of syllable 1 and the onset of syllable 2. In an onset /st/, the /t/ is always unaspirated (cf. step, stop, start). The differences in the /m/ and the /I/ arise because mist is a phonologically heavy syllable whereas mis is phonologically light, and both syllables are metrically weak. So, in these metrically weak syllables, differences in morphology create differences in syllabification and phonological weight, and these appear as differences in duration or aspiration across all four initial segments.
Legend to Figure xx. Left: spectrograms of the words mistimes (top) and mistakes (bottom), spoken by a British English woman in the sentence "I'd be surprised if Tess _______ it", with main stress on Tess. Right: syllabic structures of each word.

Some types of systematic variability may contribute both perceptual coherence and information about linguistic structure. So-called resonance effects (Kelly and Local 1989) provide one example. Resonance effects associated with /r/, for example, manifest acoustically as lowered formant frequencies, and can spread over several syllables, but the factors that determine whether and how far they will spread include syllable stress, the number of consonants in the onset of the syllable, vowel quality, and the number of syllables in the foot (Slater and Hawkins 199x, Tunley 1999). The formant lowering probably reflects slow movements of the tongue body as it accommodates to the complex requirements of the English approximant /r/. On the one hand, including this type of information in synthetic speech makes it sound more natural in a subtle way that is hard to describe in phonetic terms but seems to make the signal "fit together" better -- in other words, it seems to make it more coherent. On the other hand, the fact that the temporal extent of rhotic resonance effects depends on linguistic structure means not only that cues to the identity of a single phoneme can be distributed across a number of acoustic segments (sometimes several syllables), but also that aspects of the linguistic structure of the affected syllable(s) can be subtly signalled.

Listeners can use this type of distributed acoustic information to identify naturally-spoken words (Marslen-Wilson and Warren 199x; other wmw refs (Gaskell?); Hawkins and Nguyen submitted-labphon), and when it is included in synthetic speech it can increase phoneme intelligibility in noise by 10-15% or more (Slater and Hawkins, Tunley). Natural-sounding, systematic variation of this type may be especially influential in adverse listening conditions or when cognitive loads are high (cf. Pisoni in van Santen book, Pisoni and Duffy 19xx; sh check these refs.), because it is distributed, thus increasing the redundancy of the signal. However, Heid and Hawkins (1999, ICPhS) found similar increases in phoneme intelligibility simply by manipulating the excitation type at fricative-vowel and vowel-fricative boundaries and in the closure periods of voiced stops; these improvements to naturalness were quite local. Thus, although only some of the factors mentioned above have been shown to influence perception, on the basis of our own and others' recent work (Slater and Hawkins, Tunley, Heid/Hawkins-ICPhS 1999; Pisoni in van Santen book, Pisoni and Duffy 19xx, Kwong and Stevens 1999), we suggest that most of those whose perceptual contribution has not yet been tested would prove to enhance perception in at least some circumstances, as developed below. [xx This para is not great but will have to do for now.]
[[I THINK THE CONCLUSION OF THIS SECTION NEEDS TO SUMMARISE BY SAYING THAT THE DETAIL OF SPEECH REFLECTS LINGUISTIC STRUCTURE AT A NUMBER OF LEVELS, AND PROSYNTH AIMS TO CAPTURE THESE THROUGH THE REPRESENTATIONS IT USES.]]
3. Structure of ProSynth

[[This section is designed to give an idea of the architecture of the system. It basically tells you about XML and its uses, then ProXML. I've reduced the material on the database. While that's an important part of what we've done, it's not part of the synthesis system per se: it's part of the analysis before synthesis. I'm not sure it's all that relevant in this paper; it certainly is in e.g. the final project report though.]]

[[THIS TEXT IS ADAPTED FROM MARK'S EUROSPEECH PAPER:]]
ProSynth builds on the knowledge gained in YorkTalk (refs.), and uses an open computational architecture for synthesis. There is a clear separation between the computational engine and the computational representations of data and knowledge. The overall architecture is shown in Fig. XX.
[Figure: block diagram of the ProSynth architecture, showing marked text, a lexicon and declarative knowledge feeding composition and interpretation, with outputs to MBROLA diphone synthesis, HLsyn articulatory synthesis, and prosody-manipulated speech.]

Fig. XX: ProSynth synthesis architecture.
Text marked for the type and placement of accents is input to the system, and a pronunciation lexicon is used to construct a strictly layered metrical structure for each intonational phrase in turn. The overall utterance is then represented as a hierarchy, described in more detail in Section XX.

The interpreted structure is converted to a parametric form depending on the signal generation method. The phonetic descriptions and timing can be used to select diphones and express their durations and pitch contours for output with the MBROLA system (Dutoit et al ref). The phonetic details can also be used to augment copy-synthesis parameters for the HLsyn quasi-articulatory formant synthesiser (Heid & Hawkins ref., Jenolan Caves). The timings and pitch information have also been used to manipulate the prosody of natural speech using PSOLA (Hamon et al. ref).
3.1 Linguistic Representation and Processing

The Extensible Markup Language (XML) is an extremely simple dialect of SGML (Standard Generalised Markup Language), the goal of which is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. XML is a standard proposed by the World Wide Web Consortium for industry-specific mark-up supporting vendor-neutral data exchange, media-independent publishing, collaborative authoring, and the processing of documents by intelligent agents and other metadata applications [Ref1].
We have chosen to use XML as the external data representation for our phonological structures in ProSynth. The features of XML which make it ideal for this application are: storage of hierarchical information expressed in nodes with attributes; a standard text-based format suitable for networking; a strict and formal syntax; facilities for the expression of linkage between parts of the structure; and readily available software support.

In the ProSynth system, the input word sequence is converted to an XML representation which then passes through a number of stages representing phonetic interpretation. A declarative knowledge representation is used to encode knowledge of phonetic interpretation and to drive transformation of the XML data structures. Finally, special purpose code translates the XML structures into parameter tables for signal generation.

In ProSynth, XML is used to encode the following:
Word Sequences. The text input to the synthesis system needs to be marked up in a number of ways. Importantly, it is assumed that the division into prosodic phrases and the assignment of accent types to those phrases have already been performed. This information is added to the text using a simple mark-up of Intonational Phrases and Accent Groups (sketched below, after this list).
Lexical Pronunciations. The lexicon maps word forms to syllable sequences. Each possible pronunciation of a word form has its own entry comprising SYLSEQ (i.e. syllable sequence), SYL, ONSET, RHYME, NUC, ACODA, CODA, VOC and CNS nodes. Information present in the input mark-up, possibly derived from syntactic analysis, selects the appropriate pronunciation for each word form.
Prosodic Structure. Each composed utterance comprising a single intonational phrase is stored in a hierarchy of UTT, WORDSEQ, WORD, IP, AG, FOOT, SYL, ONSET, RHYME, NUC, CODA, ACODA, VOC and CNS nodes. Syllables are cross-linked to the word nodes using linking attributes. This allows phonetic interpretation rules to be sensitive to the grammatical function of a word as well as to the position of the syllable in the word.
Database Annotation. A database has been constructed containing tokens of relevant linguistic structures, for the purpose of analysing the temporal, intonational and spectral phenomena we wish to replicate in synthesis. It has been manually annotated, and a prosodic structure complete with timing information has been constructed for each phrase. This annotation is stored in XML using the same format as for synthesis. Tools for searching this database help us in generating knowledge for interpretation.
An interesting characteristic of our prosodic structure is the use of ambisyllabic consonants. This allows one or more consonants to be in the Coda of one syllable and in the Onset position of the next syllable. Examples are the medial consonants in "pity" or "tasty". To achieve ambisyllabicity in XML it is necessary to duplicate and link nodes, since XML rigidly enforces a strict hierarchy of components.
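Schematically, the duplication and linking might look like this (the AMBI attribute appears in Fig. 2; the ID/REF linking attributes here are hypothetical stand-ins for the linking mechanism):

<SYL>
  ... <CODA><CNS AMBI="Y" ID="c1">v</CNS></CODA>
</SYL>
<SYL>
  <ONSET><CNS AMBI="Y" REF="c1">v</CNS></ONSET> ...
</SYL>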
A small extract of a prosodic structure expressed in XML is shown in Fig. 2.
<AG HEAD="Y" START="0.5011" STOP="0.9727" STRENGTH="STRONG" WEIGHT="HEAVY">
  <FOOT HEAD="Y" START="0.5011" STOP="0.9727" STRENGTH="STRONG" WEIGHT="HEAVY">
    <SYL FPOS="1" RFPOS="1" RWPOS="1" START="0.5011" STOP="0.9727" STRENGTH="STRONG"
         WEIGHT="HEAVY" WPOS="1" WREF="WORD3">
      <ONSET START="0.5011" STOP="0.6516" STRENGTH="WEAK">
        <CNS AMBI="N" CNSCMP="N" CNSGRV="N" CNT="Y" NAS="N" RHO="N" SON="Y" START="0.5011"
             STOP="0.6516" STR="N" VOCGRV="N" VOCHEIGHT="CLOSE" VOCRND="N" VOI="Y">l</CNS>
      </ONSET>
      <RHYME CHECKED="N" START="0.6516" STOP="0.9727" STRENGTH="WEAK" VOI="N" WEIGHT="HEAVY">
        <NUC CHECKED="N" LONG="Y" START="0.6516" STOP="0.9727" STRENGTH="WEAK" VOI="N"
             WEIGHT="HEAVY">
          <VOC GRV="Y" HEIGHT="OPEN" RND="N" START="0.6516" STOP="0.8620">a</VOC>
          <VOC GRV="N" HEIGHT="CLOSE" RND="N" START="0.8620" STOP="0.9727">I</VOC>
        </NUC>
      </RHYME>
    </SYL>
  </FOOT>
</AG>

Fig 2. Partial XML representation of the utterance "it's a lie".
[[PERHAPS THE PROXML SHOULD BE TAKEN FROM YORK'S RATHER THAN KLATT'S TIMING MODEL?]]
3.2 Knowledge Representation

In ProSynth, knowledge for phonetic interpretation is expressed in a declarative form that operates on the prosodic structure. This means firstly that the knowledge is expressed as unordered rules, and secondly that it operates solely by manipulating the attributes on the XML-encoded phonological structure. To encode such knowledge, a representational language called ProXML was developed, in which it is easy to express the hierarchical contexts which drive processing and to make the appropriate changes to attributes. ProXML is read by an interpreter, PRX, written in C, which takes XML as its input and produces XML as its output. ProXML is a very simple language modelled on both C and Cascading Style Sheets (see [Ref2] for more information). A ProXML script consists of functions which are named after each element type in the XML file (each node type) and which are triggered by the presence of a node of that type in the input. When a function is called to process a node, a context is supplied centred on that node, so that reference to parent, child and sibling nodes is easy to express.
A simple example of a ProXML script to adjust syllable durations according to syllable-level attributes is shown in Fig. X. It is loosely based on Klatt's duration rules [Ref3]. In this example, the DUR attribute on SYL nodes is set as a function of the phonological attributes found on each SYL node. Note that the rules modify the duration attribute (*= means scale the existing value) rather than set it to a specific value. In this way, the declarative aspect of the rule is maintained.
/* Syllable durations */
SYL {
 if (:STRENGTH=="STRONG") {
  :DUR *= 1.6002;
  if (:WEIGHT=="HEAVY") {
   :DUR *= 1.0409;
  } else {
   :DUR *= 0.9333;
  }
 } else {
  :DUR *= 0.6529;
  if (:WEIGHT=="HEAVY") {
   :DUR *= 0.9611;
  } else {
   :DUR *= 1.0124;
  }
 }
}
Fig. X: Example ProXML script, which modifies syllable durations dependent on syllable-level attributes.
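Reading the script: a strong, heavy syllable has its duration scaled by 1.6002 × 1.0409, i.e. by about 1.67, while a weak, light syllable is scaled by 0.6529 × 1.0124, i.e. by about 0.66. Because every rule rescales DUR rather than overwriting it, the same factors apply whatever values other (unordered) rules contribute.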
4. The Phonological Model

In this section, we describe the phonological model used in ProSynth and show how the linguistic knowledge the system encodes is vital for modelling 'segmental', temporal and intonational fine detail.
4.1 Overview

Central to ProSynth is a model which encodes phonological information in a hierarchical fashion using structures based on attribute-value pairs. A declarative framework based on constraint satisfaction identifies for each phonological unit a complete metrical context. This context is a prosodic hierarchy with phonological contrasts available at all levels. The complex interacting levels of rules present in traditional layered systems are replaced in ProSynth by a single phonetic interpretation function operating on the entire context, which makes rule-ordering unnecessary. Phonetics is related to phonology via a one-step phonetic interpretation function which makes use of as much linguistic knowledge as necessary. Systematic phonetic variability is constrained by position in structure. The basis of phonetic interpretation is not the segment, but phonological features at places in structure. These principles have been successfully demonstrated in YorkTalk (Local & Ogden 1997; Local 1992) for structures of up to three feet. We thus extend the principles successfully demonstrated in [3, 4] to larger phonological domains.
4.2 The Prosodic Hierarchy

Our phonological structure is organised as a prosodic hierarchy, with phonological information distributed across the structure. The knowledge is formally represented as a Directed Acyclic Graph (DAG) which is constrained so that re-entrant nodes are found only at the terminal level. Graph structures in the form of trees are commonly used in phonological analysis; ours differ only in the important addition of ambisyllabicity. Phonological attributes and their associated values are distributed around the entire prosodic hierarchy rather than residing only at the terminal nodes, as in many phonological theories. Attributes at any level in the hierarchy may be accessed for use in phonetic interpretation.

Text is parsed into a prosodic hierarchy which has units at the following levels: syllable constituents (Onset, Rhyme, Nucleus, Coda); Syllable; Foot; Accent Group (AG); Intonational Phrase (IP).
[Figure: partial tree for "It's a lie": an IP dominating AG and Foot nodes, with Syllable nodes (feature bundles for /Its/, /s@l/ and /laI/) branching into Onset, Rhyme and Nucleus. Vertical lines indicate headedness.]

Fig. 1. Partial tree structure of the utterance "it's a lie". Indices (such as ➀) relate to the XML structure in Fig. 2.
Our prosodic hierarchy, building on House & Hawkins (1995) and Local & Ogden (1997), is a head-driven and strictly layered (Selkirk 1984) structure. Each unit is dominated by a unit at the next highest level (Strict Layer Hypothesis [10]). This produces a linguistically well-motivated and computationally tractable hierarchy. Constituents at each level have a set of possible attributes, and relationships between units at the same level are determined by the principle of headedness. Structure-sharing is explicitly recognized through ambisyllabicity.
(Fig. e.g. as in ICPhS paper, but expand to show syllabic constituents properly. Could be more elaborate, to include degenerate Foot/AG, and attributes on selected nodes.)

                    IP

           AG               AG

           F         F       F

           S    S    S   S   S
           Fin (d)a  be (tt)er  one

Figure 1. Supra-syllabic tree structure for "Find a better one".
The richness of the hierarchy comes from the information stored within structural nodes in the form of attributes and parameter values. Attributes of the IP, for example, include discourse information which will determine the choice of intonation pattern. The IP consists of one or more Accent Groups (AGs), which in turn include as attributes specifications for the individual pitch accents making up the intonation contour.

There is no separate level of phonological word within our hierarchy. Such a unit does not sit happily in a strictly layered structure -- the boundaries of prosodic constituents like AG and Foot may well occur in the middle of a lexical item. Conversely, word boundaries may occur in the middle of a Foot/AG. Lexico-grammatical information may nonetheless be highly relevant to phonetic interpretation and must not be discarded. The computational representation of our prosodic structure allows us to get round this problem: word-level and syntactic-level information is hyper-linked into the prosodic hierarchy. In this way lexical boundaries and the grammatical functions of words can be used to inform phonetic interpretation.
4.3 Units of Structure and their Attributes

[[WOULD THIS SECTION BENEFIT FROM SOME MORE PICTURES?]]

Input text is parsed to head-driven syntactic and phonological hierarchical structures. The phonological parse allots material to places in the prosodic hierarchy and is supplemented with links to the syntactic parse. The lexicon itself is in the form of a partially parsed representation. Phonetic interpretation may be sensitive to information at any level, so that it is possible to distinguish, for instance, a plosive in the onset of a weak foot-final syllable from an onset plosive in a weak foot-medial syllable.
Headedness: When a unit branches into sub-constituents, one of these constituents is its Head. If the leftmost constituent is the head, the structure is said to be left-headed; if the rightmost constituent is the head, it is right-headed. Feet are left-headed. Properties of a head are shared by the nodes it dominates [11]. Therefore a [+heavy] syllable has a [+heavy] rhyme; the syllable-level resonance features [±grave] and [±round] can also be shared by the nodes they dominate: this is how coarticulation is modelled. Phonetic interpretation proceeds head-first and is therefore determined in a structurally principled fashion without resort to extrinsic ordering.
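Fig. 2 above shows this sharing directly: the syllable's WEIGHT="HEAVY" is also present on the RHYME node it dominates. Schematically:

<SYL STRENGTH="STRONG" WEIGHT="HEAVY">
  <RHYME WEIGHT="HEAVY"> ... </RHYME>
</SYL>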
Intonational Phrase (IP): The IP, the domain of a well-formed, coherent intonation contour, contains one or more AGs; minimally it must include a strong AG. The rightmost AG -- traditionally the intonational nucleus -- is the head of the IP. It is the largest prosodic domain recognised in the current implementation of our model.
Accent Groups (AG): AGs are made up of one or more Feet, which are primarily units of timing. An accented syllable is a stressed syllable associated with a pitch accent; an AG is a unit of intonation initiated by such a syllable, and incorporating any following unaccented syllables. The head of the AG is the leftmost heavy foot. A weak foot is also a weak, headless AG.

AG attributes include [headedness], pitch accent specifications, and positional information within the IP.
Feet: All syllables are organised into Feet, which are primarily rhythmic units. The Foot is left-headed, with a [+strong] syllable at its head, and includes any [-strong] syllables to the right. Types of feet can be differentiated using attributes of [weight], [strength] and [headedness]. A syllable with the values [+head, +strong] is stressed. Any phrase-initial weak syllables are grouped into a weak, headless foot: when an IP begins with one or more weak, unaccented syllables, we maintain our strictly layered structure by organising them into "degenerate" ([light]) feet, which are in turn contained within similarly [light] AGs.
Syllables: The Syllable contains the constituents Onset and Rhyme. The Rhyme branches into Nucleus and Coda. Nuclei, Onsets and Codas can all branch. The Syllable is right-headed, the Rhyme left-headed. Attributes of the Syllable are [weight] (values heavy/light) and [strength] (values strong/weak): these are necessary for the correct assignment of temporal compression (§XX). Foot-initial Syllables are strong.

Weight is defined with regard to the subconstituents of the Rhyme. A Syllable is heavy if its Nucleus attribute [length] has the value long (in segmental terms, if it contains a long vowel or a diphthong). A Syllable is also heavy if its Coda has more than one constituent. EXAMPLES

There is no direct relationship between syllable strength and syllable weight. Strong syllables need not be heavy. In loving, /lʌv/ has a SHORT Nucleus, and the Coda has only one constituent (corresponding to /v/), yet it is the strong syllable in the Foot. Similarly, weak syllables need not be light. In amazement, the final Syllable has a branching Coda (i.e. more than one constituent) and is therefore HEAVY but WEAK. ProSynth does not make use of extrametricality.
Phonological features: We use binary features, with each attribute having a value, where the value slot can also be filled by another attribute-value pair. To our set of conventional features we add the features [±rhotic], to allow us to mimic the long-domain resonance effects of /r/ [5, 8], and [±ambisyllabic] for ambisyllabic constituents (see below). Not all features are stated at the terminal nodes in the hierarchy: [±voice], for instance, is a property of the Rhyme as a whole, in order to model durational and resonance effects.
Ambisyllabicity: Constituents which are shared between syllables are marked [+ambisyllabic]. Ambisyllabicity makes it easier to model coarticulation [4] and is an essential piece of knowledge in the overlaying of syllables to produce polysyllabic utterances. It is also used to predict properties such as plosive aspiration in intervocalic clusters (§XX).

Constituents are [+ambisyllabic] wherever this does not result in a breach of syllable structure constraints. Loving comprises two Syllables, /lʌv/ and /vɪŋ/, since /v/ is both a legitimate Coda for the first Syllable and a legitimate Onset for the second. Loveless has no ambisyllabicity, since /vl/ is neither a legitimate Onset nor a legitimate Coda. Clusters may be entirely ambisyllabic, as in risky (/rɪsk/+/ski/), where /sk/ is a good Coda and Onset cluster; partially ambisyllabic (i.e. one consonant is [+ambisyllabic] and one is [-ambisyllabic]), as in selfish (/sɛlf/+/fɪʃ/); or non-ambisyllabic, as in risk them (/rɪsk/+/ðəm/).
5. Phonetic interpretation

In this section, we describe more details of phonetic interpretation in ProSynth, focussing on temporal relations, intonation, and spectral detail. Our assumption is that there are close relationships between each of these aspects of speech, and that once, for example, timing relations are accurately modelled (for example using HLsyn), some of the spectral details (such as longer-domain resonance effects) will also be modelled as a by-product of the temporal modelling.
5.1 Temporal detail

[[RAO: I'M REALLY NOT HAPPY WITH THIS TEXT, BECAUSE I AM NOT SURE HOW WELL IT SAYS WHAT WE'RE DOING IN THE TEMPORAL MODEL THAT'S EVOLVING. ON THE OTHER HAND, THE EXPERIMENT REPORTED LATER IN THE PAPER MAKES NO SENSE WITHOUT THIS SECTION BEING MORE YORKTALK-LIKE.]]
Timing relations in ProSynth are handled primarily in terms of (1) temporal compression and (2) syllable overlap. Like spectral detail, temporal effects are treated as an aspect of the phonetic interpretation of phonological representations. Linguistic information necessary for temporal interpretation includes a grammar of syllable and word joins, using ambisyllabicity and an appropriate feature system. Such details as formant transition times, and inherent durational differences between close and open vowels, are handled in the statements of phonetic exponency pertaining to each bundle of features at a given place in structure.
A model of temporal compression allows the statement of relationships between constituents (primarily syllables) at different places in metrical structure [3], using a knowledge database. For instance, the syllable /man/ in the words man, manage, manager and in the utterance "She's a bank manager" has different degrees of temporal compression which can be related to the metrical structure as a whole. The timing model works top-down, i.e. from the highest unit in the hierarchy to the lowest. This reflects the assumption that the IP, AG, Foot and Syllable are all levels of timing, and that details of lower-level differences (such as segment type) can be overlaid on details of higher-level differences (such as syllable weight and strength, the strength and weight of an adjacent syllable, etc.). The top-down model also has the effect of constraining search spaces. For instance, if the distinction between heavy and light is relevant to the temporal interpretation of a syllable, then the temporal characteristics of the Onset of that syllable are sensitive to this fact, so that Onsets in heavy syllables and in light syllables have different durational properties.
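In ProXML terms, this sensitivity might be sketched as follows (illustrative only: the parent-access notation SYL: is invented here, since no syntax for context reference is fixed above, and the scaling factors are arbitrary):

/* Sketch: onset duration sensitive to the weight of the containing syllable */
ONSET {
 if (SYL:WEIGHT=="HEAVY") {
  :DUR *= 0.95;   /* arbitrary illustrative factors */
 } else {
  :DUR *= 1.05;
 }
}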
Syllable overlap: Syllable(n) can be overlaid on Syllable(n-1) by setting its start point to be before the end point of Syllable(n-1). By overlaying syllables to varying degrees and making reference to ambisyllabicity, it is possible to lengthen or shorten intervocalic consonants systematically. There are morphologically bound differences which can be modelled in this way, provided that the phonological structure is sensitive to them. For instance, the Latinate prefix in- is fully overlaid with the stem to which it attaches, giving a short nasal in innocuous, while the Germanic prefix un- is not overlaid to the same degree, giving a long nasal in unknowing. Differences in aspiration in pairs like mistake and mis-take can likewise be treated as differences in phonological structure and consequent differences in the temporal interpretation of those structures.
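In the XML representation, overlap shows up directly in the timing attributes. A sketch for the ambisyllabic /v/ of loving, with invented times (attribute names as in Fig. 2):

<SYL START="0.100" STOP="0.420">
  ... <CODA><CNS AMBI="Y" START="0.330" STOP="0.420">v</CNS></CODA>
</SYL>
<SYL START="0.330" STOP="0.700">
  <ONSET><CNS AMBI="Y" START="0.330" STOP="0.420">v</CNS></ONSET> ...
</SYL>

The second syllable's START precedes the first syllable's STOP, and the shared /v/ carries identical times in both positions.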
5.2 Intonational detail

5.2.1. What
5.2.2. How
5.3 Spectral detail

1. Spectral shape: what sound is it.
2. Fine-tune this so it fits in with the rest.
Copy synthesis.

The temporal extent of systematic spectral variation due to coarticulatory processes is modelled using two intersecting principles. One reflects how much a given allophone blocks the influence of neighbouring sounds, and is akin to coarticulation resistance [12]. The other reflects resonance effects, or how far coarticulatory effects spread. The extent of resonance effects depends on a range of factors, including syllabic weight, stress, accent, position in the foot, vowel height, and featural properties of other segments in the domain of potential influence. For example, intervening bilabials let lingual resonance effects spread to more distant syllables, whereas other lingual consonants may block their spread; similarly, resonance effects usually spread through unstressed but not stressed syllables.
5.3.1. What
5.3.2. How
6. Perceptual testing/experiments

[[This section will be expanded with the experimental results from the respective sites. STILL LOTS OF WORK TO DO HERE ON JOINING THINGS UP BETTER. WILL WAIT TILL I GET UCL TEXT, THEN TAKE OUT COMMONALITIES AND PUT IN 6.1]]
6.1 Conditions shared by all experiments

[[This section will contain information relevant to all the experiments.]]
6.2 f0

[[Emphasises the innovation in our testing of intonation; something about the lack of good standard models for testing intonation.]]
6.3 Timing

6.3.1. Hypothesis
The hypothesis we are testing in ProSynth is that having hierarchically organised, prosodically structured linguistic information should make it possible to produce more natural-sounding synthetic speech which is also more robust under difficult listening conditions. As an initial test of our hypotheses about temporal structure and its relation to prosodic structure, an experiment was conducted to test whether the categories set out in Section 2 make a significant difference to listeners' ability to interpret synthetic speech. If the timings predicted by ProSynth for structural positions are perceptually important, listeners should be more successful at interpreting synthetic speech when the timing appropriate for the structure is used than when the timing is inappropriate for the linguistic structures set up.
The data consist of phrases from the database of natural English, generated by MBROLA [11] synthesis using timings of two different kinds: (1) the segment durations predicted by the ProSynth model, taking into account all the linguistic structure outlined in Section 2; (2) the segment durations predicted by ProSynth for a different linguistic structure. If the linguistic structure makes no significant difference, then (1) and (2) should be perceived equally well (or badly). If temporal interpretation is sensitive to linguistic structure in the way that we have suggested, then the results for (1) should be better than the results for (2).
6.3.2. Data
Twelve groups of structures to be compared on structural linguistic grounds were established (e.g. "light ambisyllabic short initial syllable" versus "light nonambisyllabic short initial syllable"). Each group has two members (e.g. robber/rob them and loving/loveless). For each phrase, two synthetic stimuli were generated: one with the predicted ProSynth timings for that structure, and another with the timings for the other member of the pair. Files were produced with timing information from the natural-speech utterances, and an approximation to the f0 of the speech in the database. The timing information for the final foot was then replaced with timing from the ProSynth model. This produced the 'correct' timings. To produce the 'broken' timings, timing information for the rhyme of the strong syllable in this final foot was swapped within the group: for example, the durations for ob in robber were replaced with the durations for ob in rob them, and vice versa.

The stimuli thus have segment labels from the label files of the database, f0 information from the recordings in the database, and timing information partly from natural speech and partly from the ProSynth model.

As an example, consider the pair (he's a) robber and (to) rob them. The durations (in ms) for robber and rob them are:
robber           rob them
ɒ   120          ɒ   110
b    65          b    85
ə   150          ð    60
                 ə   120
                 m   135
Stimuli with these durations are compared with stimuli with the durations swapped round:
robber           rob them
ɒ   110          ɒ   120
b    85          b    65
ə   150          ð    60
                 ə   120
                 m   135
6.3.3. Experimental design
Twenty-two subjects heard every phrase once at comfortable listening levels over headphones, presented via a Tucker-Davis DD1 digital-to-analogue interface. The signal-to-noise ratio was -5 dB. The noise was cafeteria noise, i.e. a mixture of background sounds such as voices and laughter. Subjects were instructed to transcribe what they heard using normal English spelling, and were given as much time as they needed. When they were ready, they pressed a key and the next stimulus was played.

Each subject heard half of the phrases as generated with the ProSynth model, and half with the timings switched. The subjects heard six practice items before hearing the test items, but were not informed of this.
6.3.4. Results
The phoneme recognition rate for the correct timings from the ProSynth model is 77.5%, and for the switched timings it is 74.2%. Although this is only a small improvement, it is nevertheless significant on a one-tailed correlated t-test (t(21) = 2.21, p < 0.02).

Examples of the stimuli and further details of the results of the experiments (including updates) are available on the World Wide Web [12].
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 5.5.\~Discussion\par
\pard\plain \qj\sb240\sl360 \f20
The results show a significant effect of linguistically motivated timing on intelligibility. The scores are for the whole phrase, including parts which were not switched round; excluding these parts may sharpen the effect. The MBROLA diphone synthesis models durational effects, but not the segmental effects predicted by our model and described in more detail in Section 3: for example, the synthesis produces aspirated plosives in words like {\i roast}[{\f12407 H}]{\i ing} where our model predicts non-aspiration. It also uses only a small diphone database. The rather low phoneme recognition rates may be due to the problematic quality of the synthesis, or to the cognitive load imposed by the high level of background noise. Further statistical analysis will group the data according to foot type, and future experiments will use a formant synthesiser.\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 5.6.\~Future work\par
\pard\plain \qj\sb240\sl360 \f20
Future work will concentrate on refining the temporal model so that it generates durations which approximate those of our natural speech model as closely as possible. The model will be validated by further perceptual experiments, including presenting the synthetic stimuli under listening conditions that impose a high cognitive load, such as having subjects perform another task while listening to the synthesis.\par
\pard\plain \s254\qj\sb360\keepn \b\i\f21 6.4.\~Segmental boundaries\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.4.1. Material. \par
\pard\plain \qj\sb240\sl360 \f20 Eighteen phrases from the database were copy-synthesized into HLsyn using {\scaps procsy} [15], at a sampling rate of 11.025 kHz, and hand-edited to a good standard of intelligibility, as judged by a number of listeners. In 10 phrases, the sound of interest was a voiceless fricative: at the onset of a stressed syllable\emdash {\i in a }{\i\ul f}{\i ield}
; unstressed onset\emdash {\i it\rquote s }{\i\ul s}{\i urreal}; coda of an unstressed syllable\emdash {\i to di}{\i\ul s}{\i robe}; between unstressed syllables\emdash {\i di}{\i\ul s}{\i appoint}; coda of a final stressed syllable\emdash {\i on the roo}
{\i\ul f}{\i , his ri}{\i\ul ff}{\i , a my}{\i\ul th}{\i , at a lo}{\i\ul ss}{\i , to cla}{\i\ul sh}; and both unstressed and stressed onsets\emdash {\i\ul f}{\i ul}{\i\ul f}{\i illed.}
 The other 8 items had voiced stops as the focus: in the coda of a final stressed syllable\emdash {\i it\rquote s mislai}{\i\ul d}{\i , he\rquote s a ro}{\i\ul gue}{\i , he was ro}{\i\ul bb}{\i ed}; stressed onset\emdash {\i in the }{\i\ul b}{\i and}
; unstressed onset\emdash {\i the }{\i\ul d}{\i elay, to }{\i\ul b}{\i e wronged}; unstressed and final post-stress contexts\emdash {\i to }{\i\ul d}{\i eri}{\i\ul de}; and in the onset and coda of a stressed syllable\emdash {\i he }{\i\ul b}{\i e}{\i\ul
gg}{\i ed.\par
}The sound of interest was synthesized with the \ldblquote right\rdblquote type of excitation pattern. From each right version, a \ldblquote wrong\rdblquote one was made by substituting a type or duration of excitation that was inappropriate for the context. Changes were systematic; no attempt was made to copy the exact details of the natural version of each phrase, as our aim was to test the perceptual salience of the type of change, with a view to incorporating it into a synthesis-by-rule system.\par
At FV boundaries, the right version had simple excitation (an abrupt transition between aperiodic and periodic excitation), and the wrong version had mixed periodic and aperiodic excitation. VF boundaries had the opposite pattern: wrong versions had no mixed excitation. See Fig. 1. Right versions of fricatives were expected to be more intelligible than wrong versions.\par
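As a toy illustration of the simple/mixed distinction (nothing below corresponds to actual HLsyn parameters; the pulse-train source and linear crossfade are deliberate simplifications):

```python
import numpy as np

def fv_source(n, boundary, overlap, sr=11025, f0=120, seed=0):
    """Toy excitation for a fricative-vowel (FV) sequence: aperiodic
    (noise) before `boundary` samples, periodic (pulse train) after.
    overlap=0 gives a 'simple' abrupt transition; overlap>0 crossfades
    the sources, i.e. 'mixed' periodic-plus-aperiodic excitation."""
    rng = np.random.default_rng(seed)
    noise = 0.3 * rng.standard_normal(n)
    periodic = np.zeros(n)
    periodic[::sr // f0] = 1.0              # crude glottal pulse train
    ramp = (np.arange(n) - (boundary - overlap / 2)) / max(overlap, 1)
    fade = np.clip(ramp, 0.0, 1.0)          # 0 = noise only, 1 = periodic only
    return (1.0 - fade) * noise + fade * periodic

simple_fv = fv_source(4000, boundary=2000, overlap=0)    # abrupt ("right" FV)
mixed_fv  = fv_source(4000, boundary=2000, overlap=600)  # mixed ("wrong" FV)
```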
Each stop had one of two types of wrong voicing: longer-than-normal voicing for {\i\ul b}{\i and} and {\i\ul b}{\i e}{\i\ul gg}{\i ed} (see Fig. 2), whose onset stops normally have a short proportion of voicing in the closure; and unnaturally short voicing in the closures of the other six words. The wrong versions of {\i\ul b}{\i and} and {\i\ul b}{\i e}{\i\ul gg}{\i ed} were classed as hyper-speech and were expected to be more intelligible than the right versions. The other six were expected to be less intelligible in noise if naturalness and intelligibility co-varied.\par
<FIG MISSING>\par
Figure 1. Spectrograms of part of /{\scaps\f12407 is}/ in {\i disappoint}. Left: natural; mid: synthetic \ldblquote right\rdblquote version; right: synthetic \ldblquote wrong\rdblquote version.\par
<FIG MISSING>\par
<FIG MISSING>\par
<FIG MISSING>\par
Figure 2. Waveforms showing the region around the closure of /b/ in {\i he begged}. Upper panel: natural speech; middle: \ldblquote right\rdblquote synthetic version; lower: hyper-speech synthetic version.\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.4.2. Subjects. \par
\pard\plain \qj\sb240\sl360 \f20 The 22 subjects were Cambridge University students, all native speakers of British English under 30 years of age, with no known speech or hearing problems.\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.4.3. Procedure. \par
\pard\plain \qj\sb240\sl360 \f20
The 18 experimental items were mixed with randomly-varying cafeteria noise at an average s/n ratio of -4 dB relative to the maximum amplitude of the phrase. They were presented to listeners at a comfortable listening level over high-quality headphones, using a Tucker-Davis DD1 D-to-A system driven from a PC. Listeners were tested individually in a sound-treated room. They pressed a key to hear each item, and wrote down what they heard. Each listener heard each phrase once: half the phrases in the right version, half in the wrong or hyper-speech version. The order of items was randomized separately for each listener, and, because the noise was variable, the noise too was randomized separately for each listener. Five practice items preceded each test.\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.4.4.{ Results\par
}\pard\plain \qj\sb240\sl360 \f20 Responses were scored for number of phonemes correct. Wrong insertions in otherwise correct responses counted as errors. There were two analyses, one on all phonemes in the phrase, the other on just three\emdash the manipulated phoneme and the two adjacent to it. Table 6 shows results for 16 phrases, i.e. excluding the two hyper-speech phrases. Responses were significantly better (p < 0.02) for the right versions in the 3-phone analysis, and reached p = 0.063 in the whole-phrase analysis.\par
\par
\trowd \trqc\trgaph107\trleft-107 \clbrdrt\brdrs \clbrdrl\brdrs \clshdng0\cellx1129\clbrdrt\brdrs \clbrdrl\brdrs \clbrdrr\brdrs \clshdng0\cellx4531\clbrdrt\brdrs \clbrdrr\brdrs \clshdng0\cellx6232\pard \qj\sl360\intbl context\cell version of phrase\cell
 t(21) p (1-tail)\cell \pard \intbl \row \trowd \trqc\trgaph107\trleft-107 \clbrdrl\brdrs \clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx1127\clbrdrl\brdrs \clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx2828\clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx4529
\clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx6230\pard \qj\sl360\intbl \cell \ldblquote right\rdblquote \cell \ldblquote wrong\rdblquote \cell \cell \pard \intbl \row \trowd \trqc\trgaph107\trleft-107 \clbrdrt\brdrs \clbrdrl\brdrs \clbrdrb\brdrdot
\clbrdrr\brdrs \clshdng0\cellx1127\clbrdrt\brdrs \clbrdrl\brdrs \clbrdrb\brdrdot \clbrdrr\brdrs \clshdng0\cellx2828\clbrdrt\brdrs \clbrdrb\brdrdot \clbrdrr\brdrs \clshdng0\cellx4529\clbrdrt\brdrs \clbrdrb\brdrdot \clbrdrr\brdrs \clshdng0\cellx6230\pard
\qj\sb240\sl360\intbl 3 phones\cell 69\cell 61\cell 2.35 0.015\cell \pard \intbl \row \trowd \trqc\trgaph107\trleft-107 \clbrdrl\brdrs \clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx1127\clbrdrl\brdrs \clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx2828
\clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx4529\clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx6230\pard \qj\sb240\sl360\intbl entire phrase\cell 72\cell 68\cell 1.59 0.063\cell \pard \intbl \row \pard \qj\sb240\sl360 Table {\*\bkmkstart perc_data
\bkmkcoll32 }6{\*\bkmkend perc_data}. Percentage of correctly identified phonemes in 16 phrases.\par
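One reasonable reconstruction of the scoring rule above (phonemes correct, with wrong insertions penalized) uses a common-subsequence alignment; the exact alignment procedure used in the study is not stated, so this is an assumption.

```python
from difflib import SequenceMatcher

def score_phonemes(target, response):
    """target, response: lists of phoneme symbols (written responses
    would first be converted to phonemes). Returns (n_correct,
    n_insertions): phonemes matched via a common-subsequence alignment,
    and leftover response phonemes, which count as errors."""
    matcher = SequenceMatcher(a=target, b=response, autojunk=False)
    n_correct = sum(block.size for block in matcher.get_matching_blocks())
    n_insertions = len(response) - n_correct
    return n_correct, n_insertions

# e.g. target "rob them" vs. a response heard as "rob then":
print(score_phonemes(list("rQbDEm"), list("rQbDEn")))  # (5, 1)
```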
Responses to the hyper-speech words differed: 84% vs. 89% correct for normal vs. hyper-speech {\i begged}; 85% vs. 76% correct for normal vs. hyper-speech {\i band} (3-phone analysis). Hyper-speech {\i in the} {\i band} was often misheard as {\i
in the van}. This lexical effect is an obvious consequence of enhanced periodicity in the /b/ closure of {\i band}.\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.4.5. Discussion\par
\pard\plain \qj\sb240\sl360 \f20 We have shown for one speaker of Southern British English that linguistic structure influences the type of excitation at the boundaries between voiceless fricatives and vowels, as well as the duration of periodic excitation in the closures of voiced stops. Most FV boundaries are simple, whereas most VF boundaries are mixed. Within these broad patterns, syllable stress, vowel height, and final vs. non-final position within the phrase all influence the incidence and/or duration of mixed excitation. We interpret these data as indicating that the principal determinant of mixed excitation is asynchrony in the coordination of glottal and upper-articulator movement. Timing relationships seem to be tighter at FV than at VF boundaries, and there can be considerable latitude in the timing of VF boundaries when the fricative is a phrase-final coda.\par
Our findings for voiced stops were as expected, if one assumes that the main determinants of the duration of low-frequency periodicity in the closure interval are aerodynamic. One interesting pattern is that voicing in the closures of onset stops before stressed vowels is short both in absolute terms and relative to the total duration of the closure.\par

We further showed that phoneme identification is better when the pattern of excitation at segment boundaries is appropriate for the structural context. Considering that only one acoustic boundary, i.e. one edge of one phone or diphone, was manipulated in most of the phrases, and that there are relatively few data points, the significance levels achieved testify to the importance of synthesizing edges that are appropriate to the context. It is encouraging that differences were still fairly reliable in the whole-phrase analysis under these circumstances, since we would expect more response variability over the whole phrase.\par

If local changes in excitation type at segment boundaries enhance intelligibility significantly, then systematic attention to boundary details throughout the whole of a synthetic utterance will presumably enhance its robustness in noise considerably. However, it is a truism that the speech style most appropriate to the situation is not necessarily the most natural one. The two instances of hyper-speech are a case in point. By increasing the duration of closure voicing in stressed onset stops, we imitated what people do to enhance intelligibility in adverse conditions such as noise or telephone bandwidths. But this manipulation risked making the /b/s sound like /v/s, effectively widening the neighbourhood of {\i band} to include {\i van.} Since {\i in the van} is as likely as {\i in the band}, contextual cues could not help, and {\i band}\rquote s intelligibility fell. {\i Begged}\rquote s intelligibility may have risen because there were no obvious lexical competitors, and because we also enhanced the voicing in the syllable coda, thus producing a more extreme hyper-speech style, and, perhaps crucially, a more consistent one. These issues need more work.\par
The perceptual data do not distinguish whether the \ldblquote right\rdblquote versions are more intelligible because the manipulations enhance the acoustic and perceptual coherence of the signal at the boundary, or because they provide information about linguistic structure; the two possibilities are not mutually exclusive in any case. The data do suggest, however, that part of the appeal of diphone synthesis may be not just that segment boundaries sound more natural, but that their naturalness makes them easier to understand, at least in noise. It thus seems worth incorporating fine phonetic detail at segment boundaries into formant synthesis. Such details are relatively easy to produce with HLsyn, on which {\scaps procsy} is based.\par
\pard\plain \s255\qj\sb360\keepn \b\f21\fs28 7. Future work\par
\pard\plain \qj\sb240\sl360 \f20
Work is in progress [15] to copy-synthesize database items automatically into parameters for HLsyn, a Klatt-like formant synthesizer that synthesizes obstruents by means of pseudo-articulatory parameters. This method allows easy production of utterances whose parameters can then be edited. Utterances can be altered either to conform to the rules of the model or to break them, thus allowing the perceptual salience of particular aspects of phonological structure to be assessed. Tests will assess speech intelligibility when listeners have competing tasks involving combinations of auditory vs. nonauditory modalities, and linguistic vs. nonlinguistic behaviour.\par

A statistical model based on our hypotheses about the phonological factors relevant to temporal interpretation will be constructed from the database, leading to a fuller non-segmental model of temporal compression. Temporal, intonational and segmental details will be stated as the phonetic exponents of the phonological structure.\par
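A minimal sketch of the kind of statistical duration model envisaged, fitting log segment duration as an additive function of binary phonological factors so that each factor acts multiplicatively. The factor set and the least-squares fit are placeholders for illustration, not the project's actual design.

```python
import numpy as np

def fit_duration_model(factors, dur_ms):
    """factors: (n_segments, n_factors) 0/1 matrix, e.g. columns for
    'in stressed syllable', 'in phrase-final foot', 'ambisyllabic';
    dur_ms: observed durations in ms. Fitting log duration means each
    factor contributes a multiplicative compression or stretch."""
    X = np.column_stack([np.ones(len(factors)), factors])  # add intercept
    coef, *_ = np.linalg.lstsq(X, np.log(dur_ms), rcond=None)
    return coef

def predict_ms(coef, factor_row):
    """Predicted duration for one segment's factor values."""
    return float(np.exp(coef[0] + coef[1:] @ factor_row))
```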
\pard\plain \s255\qj\sb360\keepn \b\f21\fs28 8. REFERENCES\par
\pard\plain \s15\qj\fi-284\li556\sb120\sl-219\tx560 \f65535\fs18 {\f20 1.\tab Hawkins, S. \ldblquote Arguments for a nonsegmental view of speech perception.\rdblquote }{\i\f20 Proc. ICPhS XIII}{\f20 , Stockholm. Vol. 3, 18-25, 1995.\par
2.\tab House, J. & Hawkins, S., \ldblquote An integrated phonological-phonetic model for text-to-speech synthesis\rdblquote , }{\i\f20 Proc. ICPhS XIII}{\f20 , Stockholm, Vol. 2, 326-329, 1995.\par
3.\tab Local, J.K. & Ogden, R. \ldblquote A model of timing for nonsegmental phonological structure.\rdblquote In J. P. H. van Santen, R. W. Sproat, J. P. Olive & J. Hirschberg (eds.) }{\i\f20 Progress in Speech Synthesis}{\f20 . Springer, New York, 109-122, 1997.\par
4.\tab Local, J.K. \ldblquote Modelling assimilation in a non-segmental rule-free phonology.\rdblquote In G J Docherty & D R Ladd (eds): }{\i\f20 Papers in Laboratory Phonology II}{\f20 . Cambridge: CUP, 190-223, 1992.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 5.\tab Kelly, J. & Local, J. }{\i\f20 Doing Phonology.}{\f20 Manchester: Manchester University Press, 1989.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 6.\tab Hawkins, S., & Nguyen, N. \ldblquote Effects on word recognition of syllable-onset cues to syllable-coda voicing\rdblquote , }{\i\f20 LabPhon VI}{\f20 , York, 2-4 July 1998.\par
7.\tab Hawkins, S. & Slater, A. \ldblquote Spread of CV and V-to-V coarticulation in British English: implications for the intelligibility of synthetic speech.\rdblquote }{\i\f20 ICSLP}{\f20 94, 1: 57-60, 1994.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 8.\tab Tunley, A. \ldblquote Metrical influences on /r/-colouring in English\rdblquote , }{\i\f20 LabPhon VI}{\f20 , York, 2-4 July 1998.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 9.\tab Fixmer, E. and Hawkins, S. \ldblquote The influence of quality of information on the McGurk effect.\rdblquote Presented at Australian Workshop on Auditory-Visual Speech Processing, 1998.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 10.\tab Selkirk, E. O., }{\i\f20 Phonology and Syntax}{\f20 , MIT Press, Cambridge MA, 1984.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 11.\tab Broe, M. \ldblquote A unification-based approach to Prosodic Analysis.\rdblquote }{\i\f20 Edinburgh Working Papers in Cognitive Science}{\f20 \~7, 27-44, 1991.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 12.\tab Bladon, R.A.W. & Al-Bamerni, A. \ldblquote Coarticulation resistance in English /l/.\rdblquote }{\i\f20 J. Phon}{\f20 4: 137-150, 1976.\par
13.\tab http://www.w3.org/TR/1998/REC-xml-19980210\par
14.\tab http://www.ltg.ed.ac.uk/\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 15.\tab Heid, S. & Hawkins, S. \ldblquote Automatic parameter-estimation for high-quality formant synthesis using HLsyn.\rdblquote Presented at the }{\i\f20 3rd ESCA Workshop on Speech Synthesis}{\f20 , Jenolan Caves, Australia, 1998.\par
}\pard\plain \qj\sb240\sl360 \f20 [Ref1] http://www.w3.org/XML/\par
[Ref2] http://www.phon.ucl.ac.uk/project/prosynth.htm\par
[Ref3] Klatt, D. \ldblquote Synthesis by rule of segmental durations in English sentences\rdblquote , in B. Lindblom & S. \'85hman (eds.), Frontiers of Speech Communication Research, Academic Press, 1979.\par
}


