Richard Ogden (rao1@york.ac.uk)
Thu, 2 Sep 1999 16:24:53 +0100 (BST)
Attached is csl6.rtf
It's got the intro/section 2 that Sarah and I have been swea(r/t)ing over,
and some comments to *all* authors (ie. everyone that receives this mail!)
at the top.
It's the last version I'll be sending till next week. The weather is
lovely and I'm taking tomorrow off to go walking in the Dales.
The text is looking good, in my opinion, but there's still a lot of work
to do of the fiddly, editorial type. So please help me:
• read the text in full and send me any comments, but *do not* edit the
text yourself. More instructions are in the paper itself.
• send me *.ps files of any outstanding figures (Sarah has about 4-5, Jill
has one or two)
• send me references if possible with the ingredients in the right order,
but don't bother formatting them, I'll do that
• outstanding text on tests remains to be done (next week, I think).
Would you mind also letting me know your schedule over the next couple of
weeks? We did say that this would be finished by the end of August, and it
wasn't, because we weren't around. I'd like to know so I can plan the
things I do in the next couple of weeks. Ta ever so.
Richard
Richard Ogden
rao1@york.ac.uk
http://www.york.ac.uk/~rao1/
ProSynth: An Integrated Prosodic Approach to Device-Independent, Natural-Sounding Speech Synthesis

Richard Ogden***, Sarah Hawkins*, Jill House**, Mark Huckvale**, John Local***, Paul Carter***, Jana Dankovicová**, Sebastian Heid*

* University of Cambridge, ** University College London, *** University of York
Instructions to *all* authors:

Please make sure you read through *the whole* text. If there are any corrections, please tell me where they are. DO NOT edit the text and send it back to me yourself: this will cause chaos with multiple versions flying around!

I've numbered the paragraphs so you can tell me where I have to go in the text. These numbers will be removed later.

Please do a search through the text for "ref" and "XX" and "[". These indicate places in the text where work remains to be done. Let me know if any of it is yours, and tell me what needs to go in instead. If there's a reference like [1] to an item in the bibliography, tell me what text (author, date) goes in there instead. I'll do all the cross-referencing myself at a later stage.

Please submit any remaining figures and accompanying text separately.

Occasional comments to co-authors appear in [[double square brackets]].
ABSTRACT

This paper outlines ProSynth, an approach to speech synthesis which takes a rich linguistic structure as central to the generation of natural-sounding speech. We start from the assumption that the speech signal is informationally rich, and that this acoustic richness reflects linguistic structural richness and underlies the percept of naturalness. Naturalness achieved by structural richness produces a perceptually robust signal that is intelligible in adverse listening conditions. ProSynth uses syntactic and phonological parses to model the fine acoustic-phonetic detail of real speech, segmentally, temporally and intonationally. Results of preliminary tests to evaluate the effects of modelling timing, intonation and fine spectral detail are summarised.
1. Introduction

1. Speech synthesis by rule (text-to-speech, TTS) has restricted uses because it sounds unnatural and is often difficult to understand. Despite recent improvements in grammatical analysis and in deriving correct pronunciations for irregularly-spelled words, there remains a more fundamental problem, that of the inherent incoherence of the synthesized acoustic signal. Synthetic speech typically lacks the subtle systematic variability of natural speech that underlies the perceptual coherence of syllables and their constituents and of the longer phrases of which they form part. Intonation is often dull and repetitive, timing and rhythm are poor, and the modifications that word boundaries undergo in connected speech are poorly modelled. Much of this incoherence arises because many modern TTS systems encode linguistic knowledge in ways which are not in tune with current understanding of human speech and language processes.
2. Segmental intelligibility data illustrate the scale of the problem. When heard in noise, most synthetic speech loses intelligibility much faster than natural speech: natural speech is about 15% less intelligible at 0 dB s/n ratio than in quiet, whereas for isolated words or syllables, Pratt (1986) reported that typical synthetic speech drops by 35%-50%. We can expect similar results today. Concatenated natural speech avoids those problems related solely to voice quality and local segment boundaries, but suffers just as much from poor models of timing, intonation, and systematic variability in segmental quality that is dependent on word and rhythmical structure. Even when the grammatical analysis is right, one string of words can sound good, while another with the same grammatical pattern does not.
3. ProSynth is an integrated *prosodic* (i.e. structure-based) approach to speech synthesis. At its core is a phonological model which allows structurally important distinctions to be made, even when the phonetic effect of these distinctions is subtle. The phonological model in ProSynth draws together insights from current phonology, and makes it easier to model phonetic and perceptual effects. Recent research in computational phonology (eg. Bird 1995) combines highly structured linguistic representations (more technically, signs) with a declarative, computationally tractable formalism. Recent research in phonetics (eg. Simpson 1992, Hawkins & Slater 1994, Manuel 1995, Zsiga 1995) shows that speech is rich in non-phonemic information which contributes to its naturalness and robustness. Other work (Local 1992a & b, 1995a & b, Ogden 1992, Local & Ogden 1997) has shown that it is possible to combine phonological with phonetic knowledge by means of a process known as phonetic interpretation: the assignment of phonetic parameters to pieces of phonological structure. All these strands of work have contributed to the phonological model which ProSynth uses. By mimicking as far as possible the spectral, temporal and intonational detail observable in natural speech, we aim to improve the intelligibility of synthetic speech.
4. This paper has the following structure. Section 2 outlines the motivation for the ProSynth model. Section 3 describes the linguistic model we use to represent the information necessary for modelling the kinds of phonetic effects described in Section 2. Section 4 sets out how the model described in Section 3 is implemented, and how segmental, temporal and intonational detail are modelled. [[Section 5 presents results of some preliminary perceptual tests.]]
2. Motivation
1. Interdependencies between grammatical, prosodic and segmental parameters are well known to phoneticians and to everyone who has synthesized speech. When these components are developed for synthesis in separate modules, the apparent convenience is offset by the need to capture the interdependencies, which often leads to problems of rule ordering and rule proliferation to correct effects of earlier rules. In our view, much of the robustness of natural speech is lost by neglecting systematic subphonemic detail, a neglect that results partly from an inappropriate emphasis on phoneme strings rather than on linguistic structure. Fine phonetic detail, also called systematic, or lawful, variation (or variability, cf. Elman and McClelland 1986), contributes to making the time-varying speech signal an effective communicative medium because it reflects multidimensional properties of both vocal-tract dynamics and linguistic structure.
2. Accordingly, ProSynth models more phonetic detail than is standard in synthetic speech. Such detail includes secondary resonance effects, timing and rhythm, and f0 alignment. The aim is to create a signal that sounds natural because it seems to come from a single talker and provides rich phonetic information about the linguistic structure of the utterance. The well-known "redundancy" of the speech signal, whereby a phone can be signalled by a number of more-or-less co-occurring acoustic properties, contributes some of this detail, but in our view other, less well-documented properties are just as important. As implied above, they can be roughly divided into two groups: those that make the speech signal sound as if it comes from a single talker, and those that reflect linguistic structure for a given accent.
3. A speech signal sounds as if it comes from a single talker when it is perceptually coherent, meaning that its properties reflect details of vocal-tract dynamics. To be heard as speech, time-varying acoustic properties must bear the right relationships to one another. When they do, the perceptual system groups them together into an internally coherent auditory stream (Bregman 199xx) or more abstract entity (cf. Remez 19xx). A wide range of properties seems to contribute to perceptual coherence. The influence of some, like patterns of formant frequencies, is widely acknowledged (cf. Remez and Rubin 19xx *Science* paper). Others are known to be important but are not always well understood; examples are the amplitude envelope, which governs some segmental distinctions (cf. Rosen and Howell 19xx) and also perceptions of rhythm and of 'integration' between stop bursts and following vowels (van Tasell, Soli et al 19xx); and correlations between the mode of glottal excitation and the behaviour of the upper articulators, especially at abrupt segment boundaries (Gobl and Ní Chasaide 19xx).
4. A speech signal will not sound as if the talker is using a consistent accent and style of speech unless all the systematic phonetic details are right. This requires producing often small distinctions that reflect different combinations of linguistic properties. As an example, take the words *mistakes* and *mistimes*, whose spectrograms are shown at the left-hand side of Figure XX. The beginnings of these two words are phonetically different in a number of ways, even though the first four phonemes are the same. The /t/ of *mistimes* is aspirated and has a longer closure, whereas the one in *mistakes* is not aspirated and is shorter. The /s/ of *mistimes* is shorter, and the /m/ and /ɪ/ are longer, which is heard as a rhythmic difference: the first syllable of *mistimes* has a heavier beat than that of *mistakes*.
5. These phonetic differences arise because the morphological structure of the words differs: *mistimes* contains the morphemes *mis*+*time*, which each have a separate meaning, and the meaning of *mistimes* is straightforwardly related to the meaning of each of the two morphemes. But the meaning of *mistakes* is not obviously related to the meaning of its constituent morphemes. This morphological difference is reflected phonologically in the syllable structure, as shown on the right of Figure XX. In *mistimes*, /s/ is the coda of syllable 1, and /t/ is the onset of syllable 2. Conversely, the /s/ and /t/ in *mistakes* belong to both syllables, forming both the coda of syllable 1 and the onset of syllable 2. In an onset /st/, the /t/ is always unaspirated (cf. step, stop, start). The durational differences in the /m/ and the /ɪ/ arise because the morphologically-conditioned differences in syllable structure result in *mist* being a rhythmically heavy syllable whereas *mis* is rhythmically light, while both syllables are metrically weak (i.e. unstressed). So the morphological differences between the words are reflected in structural phonological differences; and these in turn have implications for the phonetic detail of the utterances, despite the segmental similarities between the words.
INSERT FIGURE XX ABOUT HERE

Legend to Figure XX. Left: spectrograms of the words *mistimes* (top) and *mistakes* (bottom) spoken by a British English woman in the sentence *I'd be surprised if Tess ____ it* with main stress on *Tess*. Right: syllabic structures of each word.
6. Some types of systematic fine detail may contribute both perceptual coherence and information about linguistic structure. So-called resonance effects (Kelly and Local 1989) provide one example. Resonance effects associated with /r/, for example, manifest acoustically as lowered formant frequencies, and can spread over several syllables, but the factors that determine whether and how far they will spread include syllable stress, the number of consonants in the onset of the syllable, vowel quality, and the number of syllables in the foot (Tunley 1999).
7. On the one hand, including this type of fine phonetic detail (or systematic variation) in synthetic speech makes it sound more natural in a subtle way that is hard to describe in phonetic terms but seems to make the signal "fit together" better; in other words, it seems to make it more coherent. On the other hand, the fact that the temporal extent of rhotic resonance effects depends on linguistic structure means not only that cues to the identity of a single phoneme can be distributed across a number of acoustic segments (sometimes several syllables), but also that aspects of the linguistic structure of the affected syllable(s) can be subtly signalled.
8. Listeners can use distributed acoustic information to identify naturally-spoken words (Marslen-Wilson and Warren 199x; other wmw refs (Gaskell?); Hawkins and Nguyen in press), and when such information is included in synthetic speech it can increase phoneme intelligibility in noise by 10-15% or more (Hawkins and Slater 1994, Tunley 1999). Both classical and recent experiments (xxref Repp, Öhman, Strange, Heid and Hawkins 1999; Pisoni in van Santen book, Pisoni and Duffy 19xx, [[sh check these refs]] Kwong and Stevens 1999) suggest that most systematically varying properties will enhance perception in at least some circumstances. Natural-sounding, systematic variation of this type may be especially influential in adverse listening conditions or when cognitive loads are high.
9. In summary, ProSynth is based on the philosophy that natural speech is robust because it contains many phonetic details at the spectral, temporal and intonational levels. These details vary systematically to form a perceptually coherent whole and are the product of the phonetic interpretation of a rich linguistic structure. In ProSynth, we attempt to model declaratively the richness both of linguistic structure and of the acoustic-phonetic signal which results from its interpretation (Pierrehumbert 1990). The next sections set out how the phonological model is organised, and how we interpret it phonetically.
3. ProSynth: a linguistic model

Overview
1. ProSynth uses a phonological model which encodes phonological information in a hierarchical fashion using structures based on attribute-value pairs. Each phonological unit occurs in a complete metrical context. This context is a prosodic hierarchy with phonological contrasts available at all levels. The complex interacting levels of rules present in traditional layered systems are replaced in ProSynth by a one-step phonetic interpretation function operating on the entire context, which makes rule-ordering unnecessary. Whereas conventional synthesis systems use a relatively poor structure and complex, interacting rules, ProSynth instead uses a rich structure and applies simple rules of phonetic interpretation which are highly structure-bound. Systematic phonetic variability is thus constrained by position in structure. The basis of phonetic interpretation is not the segment, but phonological features at places in structure. These principles have been successfully demonstrated in YorkTalk (Local & Ogden 1997; Local 1992XX) for structures of up to three feet. We thus extend the principle successfully demonstrated in [3, 4] to a wider variety of phonological domains.
3.1 The Prosodic Hierarchy
1. The phonological structure is organised as a prosodic hierarchy, with phonological information distributed across the structure. The knowledge is formally represented as a Directed Acyclic Graph (DAG), a kind of tree structure. Graph structures in the form of trees are commonly used in phonological analysis; ours differs in the important addition of ambisyllabicity. Formally, ambisyllabicity is represented as re-entrant nodes at the terminal level: i.e. a terminal node (a consonant or vowel) may simultaneously be the daughter of two syllable nodes. Phonological attribute-value pairs are distributed around the entire prosodic hierarchy, rather than associated with just the terminal nodes as in many phonological theories. Attributes at any level in the hierarchy may be accessed for use in phonetic interpretation.
2. Text is parsed into a prosodic hierarchy which has units at the following levels: syllable constituents (Onset, Rhyme, Nucleus, Coda); Syllable; Foot; Accent Group (AG); Intonational Phrase (IP). Our prosodic hierarchy, building on House & Hawkins (1995) and Local & Ogden (1997), is a head-driven and strictly layered (Selkirk 1984) structure. Each unit is dominated by a unit at the next highest level (Strict Layer Hypothesis [10]). This produces a linguistically well-motivated and computationally tractable hierarchy which accords with the representational requirements of our implementation in XML. Constituents at each level have a set of possible attributes, and relationships between units at the same level are determined by the principle of headedness. Structure-sharing is explicitly recognized through ambisyllabicity.
3. Fig. XX shows a partial phonological structure for the phrase "Come with a bloom". Note that phonological information is spread around the structure. For example, the feature [voice] is treated as a property of the Rhyme as a whole, and not of just one of the terminal nodes headed by the Rhyme. Timing information is also included: in the diagram below, the [start] of the IP is the same as the [start] of the Onset of the first syllable of the utterance, and the [end] of the IP is the same as the [end] of the Coda of the last syllable, as indicated by the tags ① and ②. The value for [ambisyllabic] is shown for two consonants: note that for the [ambisyllabic: +] consonant /ð/, the terminal node is re-entrant.
[Figure: tree diagram of the prosodic hierarchy for "Come with a bloom", showing IP, AG, Foot, Syllable, Rhyme, Nucleus and Coda nodes, with attributes such as [strength: strong], [weight: heavy], [checked +], [voice +], [ambisyllabic: +/-], and shared [start]/[end] tags.]
Fig. 1. Partial tree structure of the utterance "Come with a bloom". See text for details.
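To make the re-entrancy concrete, the following sketch (illustrative Python, not part of ProSynth; all names are invented) shows how a single terminal node can be the daughter of two syllable nodes:

    # Minimal sketch of re-entrant terminal nodes (illustrative only, not ProSynth code).
    class Node:
        def __init__(self, label, **attrs):
            self.label = label        # e.g. "SYL", "CODA", "CNS"
            self.attrs = attrs        # phonological attribute-value pairs
            self.children = []

    # "loving": /v/ is ambisyllabic, so ONE object is the daughter of both
    # the Coda of syllable 1 and the Onset of syllable 2 (a re-entrant node).
    v = Node("CNS", ambisyllabic=True)
    coda1, onset2 = Node("CODA"), Node("ONSET")
    coda1.children.append(v)
    onset2.children.append(v)
    assert coda1.children[0] is onset2.children[0]   # same node: a DAG, not a strict tree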
4. There is no separate level of *phonological word* within the hierarchy. Such a unit does not sit happily in a strictly layered structure, because the boundaries of prosodic constituents like AG and Foot may well occur in the middle of a lexical item. Conversely, word boundaries may occur in the middle of a Foot/AG. For example, in the phrase "phonetics and phonology" there are two feet (and potentially two AGs): [-netics and phon-] and [-nology]. Both begin in the middle of a word, and the first contains word boundaries. Lexico-grammatical information may nonetheless be highly relevant to phonetic interpretation and is not discarded. The computational representation of our prosodic structure allows us to get round this problem: word-level and syntactic-level information is hyperlinked into the prosodic hierarchy. In this way lexical boundaries and the grammatical functions of words can be used to inform phonetic interpretation.
3.2 Units of Structure and their Attributes
1. Input text is parsed to head-driven syntactic and phonological hierarchical structures. The phonological parse allots material to places in the prosodic hierarchy and is supplemented with links to the syntactic parse. The lexicon itself is in the form of a partially parsed representation. Phonetic interpretation may be sensitive to information at any level, so that it is possible to distinguish, for instance, a plosive in the onset of a weak foot-final syllable from an onset plosive in a weak foot-medial syllable.
2. Headedness: When a unit branches into sub-constituents, one of these constituents is its Head. If the leftmost constituent is the head, the constituent is said to be left-headed; if the rightmost constituent is the head, the structure is right-headed. Thus, IPs are right-headed, since the rightmost constituent AG is the head of the IP. AGs and Feet are left-headed. Properties of a head are shared by the nodes it dominates [11]. Therefore a [heavy:+] syllable has a [heavy:+] rhyme; the syllable-level resonance features [grave:±] and [round:±] can also be shared by the nodes they dominate: this is how some aspects of coarticulation are modelled. In Fig. XX, headedness is indicated by vertical lines, as opposed to slanting ones. Phonetic interpretation proceeds head-first and is therefore determined in a structurally principled fashion without resort to extrinsic ordering.
3. Intonational Phrase (IP): The IP, the domain of a well-formed, coherent intonation contour, contains one or more AGs; minimally it must include a strong AG. The rightmost AG (traditionally the intonational nucleus) is the head of the IP. It is the largest prosodic domain recognised in the current implementation of our model.
4. Accent Groups (AG): AGs are made up of one or more Feet, which are primarily units of timing. An accented syllable is a stressed syllable associated with a pitch accent; an AG is a unit of intonation initiated by such a syllable, and incorporating any following unaccented syllables. The head of the AG is the leftmost heavy foot. A weak foot is also a weak, headless AG.
5. AG attributes include [headedness], pitch accent specifications, and positional information within the IP.
6. Feet: All syllables are organised into Feet, which are primarily rhythmic units. Types of feet can be differentiated using attributes of [weight], [strength] and [headedness]. A foot is left-headed, with a [strong:+] syllable at its head, and includes any [strong:-] syllables to the right. Any phrase-initial weak syllables are grouped into a weak, headless foot, sometimes referred to as a "degenerate" foot. Degenerate feet are always [light]. Thus when an IP begins with one or more weak, unaccented syllables, we maintain the strictly layered structure by organising them into [light] feet which are in turn contained within similarly [light] (or degenerate) AGs. Consistent with the declarative formalism, attributes of the Foot are shared with its constituents, so that a syllable with the values [head:+, strong:+] is stressed.
7. Syllables: The Syllable contains the constituents Onset and Rhyme. The Rhyme branches into Nucleus and Coda. Nuclei, onsets and codas can all branch. The syllable is right-headed, the rhyme left-headed. Attributes of the syllable are [weight: heavy/light] and [strength: strong/weak]: these are necessary for the correct assignment of temporal compression (§XX). Foot-initial Syllables are strong.
8. Weight is defined with regard to the subconstituents of the Rhyme. A Syllable is heavy if its Nucleus attribute [length] has the value [long] (in segmental terms, if it contains a long vowel or a diphthong). A Syllable is also heavy if its Coda has more than one constituent, as in /rent/, /ask/, /taks/.
9. There is not a direct relationship between syllable strength and syllable weight. Strong syllables need not be heavy. In *loving*, /lʌv/ has a [short] Nucleus, and the coda has only one constituent (corresponding to /v/), yet it is the strong syllable in the Foot. Similarly, weak syllables need not be light. In *amazement*, the final Syllable has a branching Coda (i.e. more than one constituent) and is therefore [heavy] but [weak]. ProSynth does not make use of extrametricality: all phonological material must be dominated by an appropriate node in structure.
10. Phonological features: We use binary features represented as <attribute, value> pairs, where the *value* slot can also be filled by another attribute-value pair. To our set of conventional features we add the features [rhotic:±], to allow us to mimic the long-domain resonance effects of /r/ [5, 8], and [ambisyllabic:±] for ambisyllabic constituents (§XX). Not all features are stated at the terminal nodes in the hierarchy: [voice:±], for instance, is a property of the rhyme as a whole, in order to model durational and resonance effects.
11. Ambisyllabicity: Constituents which are shared between syllables are marked [ambisyllabic:+]. Ambisyllabicity makes it easier to model coarticulation [4] and is an essential piece of knowledge in the overlaying of syllables to produce polysyllabic utterances. It is also used to predict properties such as plosive aspiration in intervocalic clusters (§XX).
12. Constituents are [ambisyllabic:+] wherever this does not result in a breach of syllable structure constraints. *Loving* comprises two Syllables, /lʌv/ and /vɪŋ/, since /v/ is both a legitimate Coda for the first Syllable and a legitimate Onset for the second. *Loveless* has no ambisyllabicity, since /vl/ is neither a legitimate Onset nor a legitimate Coda. Clusters may be entirely ambisyllabic, as in *risky* (/rɪsk/+/ski/), where /sk/ is a good Coda and Onset cluster; partially ambisyllabic (i.e. one consonant is [ambisyllabic:+], and one is [ambisyllabic:-]), as in *selfish* (/sɛlf/+/fɪʃ/); or non-ambisyllabic, as in *risk them* (/rɪsk/+/ðəm/).
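The licensing test can be sketched as follows (illustrative Python; the onset and coda inventories shown are tiny stand-ins, not a full phonotactics of English):

    # Sketch of the ambisyllabicity test above (illustrative only).
    # A medial cluster is ambisyllabic where it is both a legal Coda and a legal Onset.
    LEGAL_ONSETS = {"v", "sk", "f"}     # stand-in phonotactic tables,
    LEGAL_CODAS = {"v", "sk", "lf"}     # not a full inventory of English

    def ambisyllabic(cluster: str) -> bool:
        return cluster in LEGAL_CODAS and cluster in LEGAL_ONSETS

    print(ambisyllabic("v"))     # True:  "loving"   -> /v/ shared by both syllables
    print(ambisyllabic("sk"))    # True:  "risky"    -> /sk/ wholly ambisyllabic
    print(ambisyllabic("vl"))    # False: "loveless" -> no ambisyllabicity
    # "selfish": /lf/ is a legal Coda but not a legal Onset, so only /f/ is shared.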
4. Implementation
In this section, we describe the structure of ProSynth in more detail. We describe the database used for the spectral, temporal and intonational modelling; the use of XML for representation; and then we set out in more detail what effects we model, and how, at the spectral, temporal and intonational levels.
4.1 Database
1. Analysis for modelling is based on a core speech database of over 450 utterances, recorded by a single male speaker of southern British English. Database speech files have been exhaustively labelled to identify segmental and prosodic constituent boundaries, using careful hand-correction of an automated procedure. F0 contours, calculated from a simultaneously recorded Laryngograph signal, can be displayed time-aligned with constituent boundaries.
2. The database has been designed to exemplify a subset of possible structures, within which we can predict that we will find interesting examples of systematic variability. Each utterance consists of one IP, and up to two AGs. The foot-types within the AG are varied according to the weight of the head syllable, the number and type of consonants in the onset and rhyme, whether the medial consonants are ambisyllabic, and the vowel length. There are also phrases containing segments whose secondary resonance is expected to spread, and some which we expect to block the spreading of such effects.
3. The database thus provides us with material for analysis of the spectral, temporal and intonational phenomena we aim to synthesise. We are currently expanding it to cover more types of IP.
4.2 Architecture
1. ProSynth builds on the knowledge gained in YorkTalk (refs.), and uses an open computational architecture for synthesis. There is a clear separation between the computational engine and the computational representations of data and knowledge. The overall architecture is shown in Fig. XX.
[Figure: ProSynth architecture diagram. Marked text is composed, using the Lexicon, into a prosodic structure; Interpretation applies declarative knowledge; output goes to MBROLA diphone synthesis, HLsyn quasi-articulatory synthesis, or prosody-manipulated speech.]
Fig. XX: ProSynth synthesis architecture.
2. Text marked for the type and placement of accents is input to the system, and a pronunciation lexicon is used to construct a strictly layered metrical structure for each intonational phrase in turn. The overall utterance is then represented as a hierarchy, described in more detail in Section XX.
3. The interpreted structure is converted to a parametric form depending on the signal generation method. The phonetic descriptions and timing can be used to select diphones and express their durations and pitch contours for output with the MBROLA system (Dutoit et al ref). The phonetic details can also be used to augment copy-synthesis parameters for the HLsyn quasi-articulatory formant synthesiser (Heid & Hawkins ref., Jenolan Caves.). The timings and pitch information have also been used to manipulate the prosody of natural speech using PSOLA (Hamon et al. ref).
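As an illustration of the conversion step, the sketch below turns one terminal node from the Fig. 2 XML into a line of MBROLA's standard .pho input format (phone, duration in ms, optional (percent, Hz) pitch points). It is illustrative only, assumes the attribute names of Fig. 2, and is not the ProSynth converter:

    # Minimal sketch: one Fig. 2 terminal node -> one MBROLA .pho line (illustrative only).
    def to_pho(phone: str, start: float, stop: float, fxmid: float | None = None) -> str:
        dur_ms = round((stop - start) * 1000)       # START/STOP are in seconds
        if fxmid is None:
            return f"{phone} {dur_ms}"              # no pitch point on this segment
        return f"{phone} {dur_ms} 50 {fxmid:.0f}"   # one pitch point at the midpoint

    print(to_pho("b", 0.5561, 0.6670))                # -> "b 111"
    print(to_pho("u", 0.7341, 0.8234, fxmid=126.7))   # -> "u 89 50 127"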
4.3 Linguistic Representation and Modelling
1. The Extensible Markup Language (XML) is an extremely simple dialect of SGML (Standard Generalised Markup Language), the goal of which is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. XML is a standard proposed by the World Wide Web Consortium for industry-specific mark-up: vendor-neutral data exchange, media-independent publishing, collaborative authoring, the processing of documents by intelligent agents and other metadata applications [Ref1].
2. We have chosen to use XML as the external data representation for our phonological structures in ProSynth. The features of XML which make it ideal for this application are: storage of hierarchical information expressed in nodes with attributes; a standard text-based format suitable for networking; a strict and formal syntax; facilities for the expression of linkage between parts of the structure; and readily-available software support.
3. In the ProSynth system, the input word sequence is converted to an XML representation which then passes through a number of stages representing phonetic interpretation. A declarative knowledge representation is used to encode knowledge of phonetic interpretation and to drive transformation of the XML data structures. Finally, special-purpose code translates the XML structures into parameter tables for signal generation.
4. In ProSynth, XML is used to encode the following:
5. Word Sequences: The text input to the synthesis system needs to be marked up in a number of ways. Importantly, it is assumed that the division into prosodic phrases and the assignment of accent types to those phrases has already been performed. This information is added to the text using a simple mark-up of Intonational Phrases and Accent Groups (Section XX).
6. Lexical Pronunciations: The lexicon maps word forms to syllable sequences. Each possible pronunciation of a word form has its own entry comprising SYLSEQ (i.e. syllable sequence), SYL, ONSET, RHYME, NUC, ACODA, CODA, VOC and CNS nodes. Information present in the input mark-up, possibly derived from syntactic analysis, selects the appropriate pronunciation for each word form.
7. Prosodic Structure: Each composed utterance comprising a single intonational phrase is stored in a hierarchy of UTT, WORDSEQ, WORD, IP, AG, FOOT, SYL, ONSET, RHYME, NUC, CODA, ACODA, VOC and CNS nodes. Syllables are cross-linked to the word nodes using linking attributes. This allows phonetic interpretation rules to be sensitive to the grammatical function of a word as well as to the position of the syllable in the word.
8. Database Annotation: Our database has been manually annotated, and a prosodic structure complete with timing information has been constructed for each phrase. This annotation is stored in XML using the same format as for synthesis. Tools for searching this database help us in generating knowledge for interpretation.
9. An interesting characteristic of our prosodic structure is the use of ambisyllabic consonants (discussed in more detail in Section XX). This allows one or more consonants to be in the Coda of one syllable and in the Onset position of the next syllable. Examples are the medial consonants in "pity" or "tasty". To achieve ambisyllabicity in XML it is necessary to duplicate and link nodes, since XML rigidly enforces a strict hierarchy of components.
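One way the duplication and linking might be expressed (an illustrative sketch only: the ID/LINK attribute names are hypothetical, though AMBI appears in Fig. 2):

    # Illustrative sketch: the ambisyllabic /t/ of "pity" as two linked CNS nodes.
    # The ID/LINK attribute names are hypothetical; AMBI="Y" follows Fig. 2.
    import xml.etree.ElementTree as ET

    coda, onset = ET.Element("CODA"), ET.Element("ONSET")
    coda_t = ET.SubElement(coda, "CNS", AMBI="Y", ID="cns1")
    coda_t.text = "t"
    onset_t = ET.SubElement(onset, "CNS", AMBI="Y", LINK="cns1")
    onset_t.text = "t"

    print(ET.tostring(coda).decode())    # <CODA><CNS AMBI="Y" ID="cns1">t</CNS></CODA>
    print(ET.tostring(onset).decode())   # <ONSET><CNS AMBI="Y" LINK="cns1">t</CNS></ONSET>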
10. An extract of a prosodic structure expressed in XML is shown in Figure XX, taken from the phrase "Come with a bloom" (see Fig. XX for another representation of this information). (In the XML representations, Y/N are used in place of +/-.)
    <FOOT DUR="1" START="0.5561" STOP="1.0883">
      <SYL DUR="1" FPOS="1" RFPOS="1" RWPOS="1" START="0.5561" STOP="1.0883"
           STRENGTH="STRONG" WEIGHT="HEAVY" WPOS="1" WREF="WORD4">
        <ONSET DUR="1" START="0.5561" STOP="0.7341" STRENGTH="STRONG">
          <CNS AMBI="N" CNSCMP="N" CNSGRV="Y" CNT="N" DUR="1" NAS="N" RELEASE="0.6565"
               RHO="N" SON="N" START="0.5561" STOP="0.6670" STR="N" VOCGRV="N"
               VOCHEIGHT="CLOSE" VOCRND="N" VOI="Y">b</CNS>
          <CNS AMBI="N" CNSCMP="N" CNSGRV="N" CNT="Y" DUR="1" NAS="N" RHO="N" SON="Y"
               START="0.6670" STOP="0.7341" STR="N" VOCGRV="N" VOCHEIGHT="CLOSE"
               VOCRND="N" VOI="Y">l</CNS>
        </ONSET>
        <RHYME CHECKED="Y" DUR="1" START="0.7341" STOP="1.0883" STRENGTH="STRONG"
               VOI="Y" WEIGHT="HEAVY">
          <NUC CHECKED="Y" DUR="1" LONG="Y" START="0.7341" STOP="0.9126"
               STRENGTH="STRONG" VOI="Y" WEIGHT="HEAVY">
            <VOC DUR="1" FXGRD="-251.2" FXMID="126.7" GRV="Y" HEIGHT="CLOSE" RND="Y"
                 START="0.7341" STOP="0.8234">u</VOC>
            <VOC DUR="1" FXGRD="-171.1" FXMID="105.4" GRV="Y" HEIGHT="CLOSE" RND="Y"
                 START="0.8234" STOP="0.9126">u</VOC>
          </NUC>
          <CODA DUR="1" START="0.9126" STOP="1.0883" VOI="Y">
            <CNS AMBI="N" CNSCMP="N" CNSGRV="Y" CNT="N" DUR="1" NAS="Y" RHO="N" SON="Y"
                 START="0.9126" STOP="1.0883" STR="N" VOCGRV="Y" VOCHEIGHT="CLOSE"
                 VOCRND="Y" VOI="Y">m</CNS>
          </CODA>
        </RHYME>
      </SYL>
    </FOOT>
Fig. 2. Partial XML representation of the utterance "with a bloom".
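Because the representation is plain XML, standard tooling can traverse it. For example (a sketch only, assuming the Fig. 2 extract is saved to a file named bloom.xml; not ProSynth code):

    # Sketch: walking the Fig. 2 structure with standard XML tooling (illustrative only).
    import xml.etree.ElementTree as ET

    foot = ET.parse("bloom.xml").getroot()   # assumes Fig. 2 is saved as bloom.xml
    for syl in foot.iter("SYL"):
        print(syl.get("STRENGTH"), syl.get("WEIGHT"))     # STRONG HEAVY
        # Attributes at any level are available to interpretation, e.g.
        # [voice] stated on the RHYME rather than on the terminal nodes:
        for rhyme in syl.iter("RHYME"):
            print("rhyme voiced:", rhyme.get("VOI"))      # Y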
4.4 Knowledge Representation
1. In ProSynth, knowledge for phonetic interpretation is expressed in a declarative form that operates on the prosodic structure. This means firstly that the knowledge is expressed as unordered rules, and secondly that it operates solely by manipulating the attributes on the XML-encoded phonological structure. To encode such knowledge a representational language called ProXML was developed, in which it is easy to express the hierarchical contexts which drive processing and to make the appropriate changes to attributes. The ProXML language is read by an interpreter, PRX, written in C, which takes XML on its input and produces XML on its output. ProXML is a very simple language modelled on both C and Cascading Style Sheets (see [Ref2] for more information). A ProXML script consists of functions which are named after each element type in the XML file (each node type) and which are triggered by the presence of a node of that type in the input. When a function is called to process a node, a context is supplied centred on that node, so that reference to parent, child and sibling nodes is easy to express.
2. Figure XX shows a simple example of a ProXML script to adjust syllable durations for strong syllables in a disyllabic word whose second and final syllable is weak. If the first syllable is heavy, the rule is dependent on the length of the vowel. In this example, the DUR attribute on SYL nodes is set as a function of the phonological attributes found on that node and on others in the hierarchy. Note that the rules modify the duration attribute (*= means scale the existing value) rather than setting it to a specific value; in this way, the declarative aspect of the rule is maintained. The compression factors in the script are computed from regression tree data taken from a database of natural speech (see Section 5.2).
    SYL {
        if ((:STRENGTH=="STRONG")&&(:WPOS=="1")&&(:RWPOS=="2")
            &&(../SYL[2]:WEIGHT=="LIGHT"))
            if (:WEIGHT=="HEAVY")
                if (./RHYME/NUC:LONG=="Y")
                    :DUR *= 1.0884;
                else
                    :DUR *= 1.1420;
            else
                :DUR *= 0.8274;
    }
Fig. X: Example ProXML script, which modifies syllable durations dependent on syllable-level and nucleus-level attributes.
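Read procedurally, the rule amounts to the following rough Python paraphrase (illustrative only, using the attribute names of Fig. 2):

    # Rough paraphrase of the ProXML rule above (illustrative only).
    # syl and next_syl hold the XML attributes of the two SYL nodes.
    def scale_duration(syl: dict, next_syl: dict, nuc_long: bool) -> None:
        if (syl["STRENGTH"] == "STRONG" and syl["WPOS"] == "1"
                and syl["RWPOS"] == "2" and next_syl["WEIGHT"] == "LIGHT"):
            if syl["WEIGHT"] == "HEAVY":
                factor = 1.0884 if nuc_long else 1.1420
            else:
                factor = 0.8274
            syl["DUR"] = float(syl["DUR"]) * factor   # *= scales the existing value

    syl = {"STRENGTH": "STRONG", "WPOS": "1", "RWPOS": "2", "WEIGHT": "HEAVY", "DUR": "1"}
    scale_duration(syl, {"WEIGHT": "LIGHT"}, nuc_long=True)
    print(syl["DUR"])   # 1.0884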
5. Modelling (phonetic interpretation)
This section describes more details of phonetic interpretation in ProSynth, focussing on temporal relations, intonation, and spectral detail. Our assumption is that there are close relationships between each of these aspects of speech. For example, once timing relations are accurately modelled, some of the spectral details (such as longer-domain resonance effects) can also be modelled as a by-product of the temporal modelling, when the output system is HLsyn (or any formant synthesizer). This particular trade-off between duration and spectral shape is not of course available to concatenative synthesis, but the knowledge it reflects could influence [be applied to?] unit selection. [????]
5.1 Spectral detail

5.1.1 Segmental identity
Whichever type of synthesis output system is used, the immediate input comes from the XML file. For concatenative synthesis, we currently use the MBROLA system, with sound segments chosen in the standard way from the MBROLA inventory for British English. For formant synthesis, we use HLsyn driven by PROCSY, which is part copy-synthesizer from labelled speech files, and part rule-driven from information in the XML file. Most formant trajectories for vowels and approximants are copy-synthesized, while obstruent consonants and some other sounds are produced by rule. PROCSY is described in detail by Heid and Hawkins (under review). At the time of writing, efforts to make PROCSY entirely rule-driven have just begun.
5.1.2 Fine-tuning spectral shape
1. In concatenative synthesis, the task of fine-tuning spectral shape is achieved by selecting appropriate units. ProSynth as yet makes no attempt to improve upon the standard MBROLA unit selection, but ultimately our work should have applications in unit selection, inasmuch as it should increase our understanding of how factors such as long-domain resonance effects and grammatical dependencies influence spectral variability.
2. When the parameters are set to appropriate values, HLsyn itself does much local fine-tuning of spectral shape automatically. In comparison with standard formant synthesizers, it is relatively straightforward to produce complex acoustic changes at segment boundaries that closely mimic those of natural speech. Most notably, HLsyn produces natural-sounding, perceptually-robust transitions between adjacent segments that differ in excitation type, such as the transition between vowels and voiced or voiceless stops or fricatives. This attribute of HLsyn means that some of the immediate appeal of concatenative synthesis (natural-sounding, perceptually-robust transitions between adjacent segments, together with a pleasant voice quality) is also available in formant synthesis at little computational cost.
3. Although these types of acoustic fine detail are relatively easily achievable using HLsyn, they have to be programmed to occur in only the right contexts. PROCSY provides the rules that do this. Some of the systematic variation is programmed by reference to the structure of the prosodic hierarchy, and some in the traditional way by reference to linear segmental context. Examples of prosodically-dependent rules include stress-dependent variations in the waveform amplitude envelope, and stress-dependent differences in excitation type in certain CVC sequences. For example, in Southern British English, the first CVCs of *today* and *to disappoint* are spectrally very different from those of *turtle* and *tiddler*, as are the *tit* sequences in *attitude* and *titter*. Examples of rules that rely mainly on local segmental context include coarticulation of nasality and the amount of voicing in the closure of voiced stops. These sorts of properties, though in need of more work, are reasonably well understood, and most are relatively straightforward to implement to a satisfactory standard.
4. More challenging, because more subtle and less well understood, is the temporal extent of long-domain coarticulatory processes such as the resonance effects discussed in Section 2, which are known to be perceptually salient. For example, Tunley (1999) has shown that in SSBE, /r/-colouring varies with vowel height and the number of consonants in the syllable onset, and spreads for at least two syllables on either side of the conditioning consonant, as long as those syllables are unstressed, and especially if they are in feet of 3 or more syllables. Thus, whereas strong /r/-colouring might be expected to be found throughout a phrase like *The tapestry bikini*, it would be expected to be weak and confined only to *bad* and *rap* in a phrase like *The bad rap artist* (in a non-rhotic accent). Work by West (1999) is broadly supportive of these observations.
\pard \qj\sb240 5.\tab It is not yet known, however, what limits the spread of rhotic resonance effects. Some of our current efforts are directed towards answering this question. For example, when an /r/ occurs in a context that is susceptible to
/r/-colouring, such as the last syllable of {\i tapestry}
, is the resonance effect blocked by the next stressed syllable, or can it spread through into unstressed syllables of the adjacent foot? Just as low vowels show less susceptibility than high vowels, are some consonants (for example, velar stops) more like
ly to affect the the spread of resonance effects than others? The way that resonance effects are modelled in ProSynth will depend to a large extent on the answers to the
se questions. For example, if rhotic resonance effects are restricted to unstressed syllables in the foot or feet immediately adjacent to the conditioning /r/, then the feature [rhotic] can be an atrribute of the foot in the prosodic tree. If however these
effects pass through stressed syllables into the next feet, then they might have to be modelled as an attribute of a level higher than the foot. (Preliminary evidence suggests we should not rule out that possibility.) Finally, if some segments block the s
pread of resonance effects, even in unstressed syllables, then either the domain of the [rhotic] feature may be best placed below the foot, or else the acoustic realisation of the feature must also take account of the segmental context in a relatively comp
licated way. In essence, we are asking to what extent rhotic resonance effects are part of the phonology of SBE, and to what extent they can be regarded as a phonetic consequence of, for example, vowel-to-vowel coarticulation. [{\b
John, Richard et al: This se
ems a possible place to put this phonology-phonetics point, but the more I (Sarah) think about it, the more unhappy I am with it. Secretly, I think I am a Browman-Goldstein type who sees no dividing line between phonol and phonet., and/or a Keating type wh
o says it\rquote s all controlled. But that\rquote s because, as you know, I am no phonologist. My problem is: the v-to-v coartic doesn\rquote t HAVE to happen\emdash
the lang/accent ALLOWS it to happen. So is that phonol or phonet? And, ultimately, does it matter which?? Answers may
, if you wish, direct me to your Linguistics 101 handouts. Another answer may be to re-write the preceding para so it makes some of the points, but in a less theoretical way, which may be more appropriate for CSL. Opinions, please. Tx }{\b\i
RAO responds: I think this is too sophisticated and incomplete a discussion/piece of work to bother putting in here. It\rquote s a can of worms we don\rquote t need to open. I\rquote d either leave this text as it is, or ditch it.}{\b ] }
This is not an easy question to answer: the variation with vowel height, for example, may reflect a process that is in the phonology of SBE, but is nevertheless manifested to different degrees for independent articulatory-acoustic reasons. If that were the
case, then a formant synthesizer would have to deal with the acoustic differences between different vowels, even though the basic control would reside in the phonological structure. These issues are currently being investigated.\par
\pard \qj\sb240 6.\tab
The temporal extent of systematic spectral variation due to coarticulatory processes is modelled using two intersecting principles. One reflects how much a given allophone blocks the influence of neighbouring sounds, and is like coarticulation resistance [
12]. The other principle reflects resonance ef
fects, or how far coarticulatory effects spread. The extent of resonance effects depends on a range of factors including syllabic weight, stress, accent, position in the foot, vowel height, and featural properties of other segments in the domain of potential influence. For example, intervening bilabials let lingual resonance effects spread to more distant syllables, whereas other lingual consonants may block their spread; similarly, resonance effects usually spread through unstressed but not stressed sy
llables.{\i \par
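As a rough illustration of how the two principles intersect, the sketch below (Python; the attenuation scheme and all numbers are invented, not ProSynth's) scales the resonance effect reaching a target syllable by the coarticulation resistance of the intervening material:

    # Hypothetical sketch: the resonance effect arriving at a target is the
    # source strength attenuated by each intervening allophone's resistance
    # (0 = transparent, 1 = fully blocking).
    def resonance_at_target(source_strength, resistances):
        effect = source_strength
        for r in resistances:
            effect *= (1.0 - r)
        return effect

    print(resonance_at_target(1.0, [0.1]))  # across a bilabial: 0.9
    print(resonance_at_target(1.0, [0.8]))  # across a lingual consonant: 0.2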
}\pard\plain \s254\qj\sb360\keepn \b\i\f21 5.2\tab Temporal modelling\par
\pard\plain \qj\sb240 \f20 {\b Still (unfortunately) some work to do on this. The problem is that there\rquote s a notional but relatively complete (YorkTalk) model of timing\emdash also used for expt. described later!\emdash and a partial (ProSynth) one.
}\par
\pard \qj\sb240 1.\tab One of the goals of temporal modelling is to model English rhythms accurately. Our model is foot-based and for any given syllable takes into account (1)\~its strength, (2)\~its weight, (3)\~its place in the foot, and (4)\~
the strength and weight of adjacent syllables. Information about word boundaries is also available, allowing (eg.) word-finality to influence the temporal interpretation of any syllable.\par
\pard \qj\sb240 2.\tab Abercrombie (ref.) describes two rhythms which are important for disyllabic words in the variety of English we are modelling: (1)\~short-long: {\i happy funny city}, (2)\~equal-equal: {\i hamper funding seedy}
. The words with short-long rhythm have a light first syllable, while the words with equal-equal rhythm have a heavy first syllable. The second syllable vowels in the two sets are durationally different. Taking the vowels in the database as a whole,
and looking specifically at utterance-final disyllabic feet with short vowels in the first syllable, it is found that the duration of the vowel of both the first and the second syllable is sensitive to the weight of the first syllable (Table X). The
duration of a second syllable vowel after a heavy first syllable is 23% greater than after a light first syllable.\par
\pard \qj\sb240 \par
\trowd \trgaph80\trleft-80 \clbrdrt\brdrs \clbrdrl\brdrs \clbrdrb\brdrdb \clshdng0\cellx2680\clbrdrt\brdrs \clbrdrb\brdrdb \clshdng0\cellx5440\clbrdrt\brdrs \clbrdrb\brdrdb \clbrdrr\brdrs \clshdng0\cellx8200\pard \qc\sb240\intbl Weight of 1st syll\cell
Duration of 1st syll (ms)\cell Duration of 2nd syll (ms)\cell \pard \intbl \row \trowd \trgaph80\trleft-80 \clbrdrt\brdrdb \clbrdrl\brdrs \clshdng0\cellx2680\clbrdrt\brdrdb \clshdng0\cellx5440\clbrdrt\brdrdb \clbrdrr\brdrs \clshdng0\cellx8200\pard \qc\sb240\intbl
heavy\cell 381.2\cell 329.6\cell \pard \intbl \row \trowd \trgaph80\trleft-80 \clbrdrl\brdrs \clbrdrb\brdrs \clshdng0\cellx2680\clbrdrb\brdrs \clshdng0\cellx5440\clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx8200\pard \qc\sb240\intbl light\cell 276.2\cell
268.4\cell \pard \intbl \row \pard \qj\sb240 3.\tab As well as durational differences, there are also qualitative differences in the second-syllable vowels. The words of type (1) have
diphthongised vowels, while the words of type (2) have monophthongal vowels. The implication of these results is that when the second syllable of words like these is phonetically interpreted, it is necessary to have information available about the streng
th and weight of the preceding syllable. Similar, but more complex, statements must also be made for longer prosodic chunks.\par
\pard \qj\sb240 4.\tab As well as rhythmic properties, there are \lquote segmental\rquote durational effects which relate to smaller stretches of speech but which (perh
aps paradoxically) reflect higher levels of linguistic organisation. For example, Keating, Fougeron & Cho (LabPhon ref.) and Fougeron & Keating (JASA ref) have shown that the duration of various segment types is sensitive to at least three levels of structu
re in the prosodic hierarchy.
Such observations provide further evidence that the accurate modelling of durations depends on having a rich phonological structure and that phonetic interpretation should access information from that structure. In other words
, temporal phonetic interpretation is reliant on the informational richness which is encoded in the phonological structure.\par
\pard \qj\sb240 5.\tab The temporal interpretation model is based on a CART (Classification and Regression Tree) analysis of the database, taking into account the phonological features in the prosodic hierarchy.
CART analysis is succinctly described by van Santen (ref.):\par
\pard \qj\li720\sb240 6.\tab CART-based methods construct a tree by making binary splits on factors so as to minimise the variance of the durations in the two corresponding subsets. When a CART tree encounters a bundle of features not obser
ved in the database, it can still find a path in the tree that up to some point matches the new feature bundle. This means that if nothing in the database matches the required pattern exactly, a near approximation will be found.\par
\pard \qr\sb240 van Santen ref.\par
\pard \qj\sb240 7.\tab The labelled waveforms of the database and their XML-parsed description files are searched according to relevant feature information (eg. syllable weight and strength), and a CART model is use
d to generalise across these data and generate duration statistics for feature bundles at given places in the phonological structure. The resulting duration model can be used to drive MBROLA diphone synthesis, since it predicts the
durations of acoustic segments.\par
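As an illustration only (the feature coding and durations are invented, and scikit-learn's DecisionTreeRegressor merely stands in for whatever CART implementation is actually used), the procedure can be sketched as:

    # Sketch of CART-based duration modelling: binary splits on factors,
    # minimising the variance of durations in the resulting subsets.
    from sklearn.tree import DecisionTreeRegressor

    # One row per segment: [strong syllable?, heavy syllable?, foot-final?,
    # word-final?] -- toy stand-ins for the feature bundles in the XML parses.
    X = [[1, 1, 0, 0],
         [1, 0, 0, 0],
         [0, 0, 1, 1],
         [0, 1, 1, 0]]
    y = [150, 110, 95, 120]  # observed segment durations in ms (invented)

    tree = DecisionTreeRegressor().fit(X, y)

    # A feature bundle not seen in the database still follows a path in the
    # tree as far as it matches, giving a near approximation.
    print(tree.predict([[1, 1, 1, 0]]))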
\pard \qj\sb240 8.\tab The analysis model works top-down\emdash that is, it factors out first the effects of IP, then of AG, and so on, down the tree to the features at the terminal level.
This reflects the assumption that the IP, AG, Foot and Syllable are all levels of timing,
and that details of lower-level differences (such as segment type) can be overlaid on details of higher-level differences (such as syllable weight and strength; the strength and weight of an adjacent syllable; etc.). The top-down model also has the effect
of constraining search spaces. ??? EXAMPLE. The resulting timing model is such
that each node in the hierarchy has a multiplicative compression factor associated with it. The fact that it is a multiplicative model means that the order in which the statements of temporal interpretation are applied
is irrelevant. It also makes the model compositional. \par
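A minimal sketch of the multiplicative scheme (node names and factor values invented for illustration):

    # Each node in the hierarchy contributes a compression factor; because
    # the factors multiply, the order of application is irrelevant, which
    # also makes the model compositional.
    factors = {"IP": 0.95, "AG": 1.00, "Foot": 0.90, "Syllable": 1.10}

    def interpret(base_ms, factors):
        d = base_ms
        for f in factors.values():
            d *= f
        return d

    forward = interpret(100.0, factors)
    backward = interpret(100.0, dict(reversed(list(factors.items()))))
    assert forward == backward  # order of the temporal statements is irrelevant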
\pard \qj\sb240 9.\tab As an example, consider the interpretation of /p/ in {\i happy}. In order to interpret the /p/ accurately, the model refers to (at least) the following pieces of information:\par
\pard \qj\li720\sb240 \bullet \~/p/ is located in a Rhyme whose Nucleus contains a short open vowel\par
\pard \qj\li720 \bullet \~/p/ is [ambisyllabic:+] and is in the Coda of a [strong:+], [light:+] syllable and in the Onset of a weak syllable\par
\pard \qj\sb240 10.\tab Each of these facts\emdash along with other, higher-level ones\emdash affects the temporal interpretation of the /p/ in {\i happy}. Other bundles of phonological features are interpreted in the same structure-bound way.\par
\pard \qj\sb240 {\b Not sure about content and location of these next two paras. Awaiting advice from JKL and PGC!}\par
\pard \qj\sb240 11.\tab Another way to interpret timing is based on a non-segmental model of temporal interpretation (Local & Ogden ref., Ogden, Local & Carter ref.). According to this model, higher-level constituents in the hierarchy are compressed, and their daughter
nodes are compressed in the same way. The temporal interpretation of ambisyllabicity is the degree of overlap that exists between syllables, so an intervocalic consonant (typically ambisyllabic) has duration properties inherited from both the syllables it
is in.\par
\pard \qj\sb240 12.\tab Syllable{\i\fs20\dn4 n} can be overlaid on Syllable{\i\fs20\dn4 n-1} by setting its start point to be before the end point of Syllable{\i\fs20\dn4 n-1}
. By overlaying syllables to varying degrees and making reference to ambisyllabicity, it is possible to lengthen or shorten intervocalic consonants
systematically. There are morphologically bound differences which can be modelled in this way, provided that the phonological structure is sensitive to them. For instance, the Latinate prefix {\i in-}
is fully overlaid with the stem to which it attaches, giving a short nasal in {\i innocuous}, while the Germanic prefix {\i un-} is not overlaid to the same degree, giving a long nasal in {\i unknowing}. Rhythmical differences in pairs like {\i recite}
and {\i re-site} can likewise be treated as differences in phonological structure and consequent differences in the temporal interpretation of those structures.\par
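The principle can be sketched as follows (Python; the base duration and overlap values are invented, and this is an illustration of the idea rather than the YorkTalk implementation):

    # An ambisyllabic consonant is shared by two overlapping syllables;
    # the greater the overlay, the shorter the consonant.
    def nasal_duration(base_ms, overlap_fraction):
        return base_ms * (1.0 - overlap_fraction)

    # Latinate "in-" is fully overlaid with its stem; Germanic "un-" is not.
    print(nasal_duration(140.0, 0.6))  # innocuous-type: short nasal (56 ms)
    print(nasal_duration(140.0, 0.1))  # unknowing-type: long nasal (126 ms)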
\pard\plain \s254\qj\sb360\keepn \b\i\f21 5.3\tab Intonational modelling\par
\pard\plain \qj\sb240 \f20 1.\tab
We assume, in common with most theories of intonation, that the highly variable F0 contours encountered in natural speech can be analysed into component parts and classified according to a finite set of possible pitch melodies, which need to be defined pho
nologically. There is, then, a dimension of paradigmatic choice in modelling intonation: the overall pitch pattern selected for an IP is not itself predictable from structure but is determined by discourse fact
ors. Once that discourse-based selection has been made, then a pitch accent specification can be assigned to each of the AGs within the IP. The pattern for an IP is thus composed of the pitch accents assigned to AGs, and of boundary tones associated with t
he edges of the IP domain. For example, IP attributes will tell us (i) about position in discourse (initial, medial, final), (ii) about speech act function (declarative, interrogative, imperative), and (iii) about linguistic focus. The information in (i)
is relevant to pitch range and will be interpreted in terms of F0 scaling and boundary tone. Information in (ii) is used in determining the choice of pitch accents for the component AGs, whereas (iii) determines nuclear accent placement, and hence the AG s
tructure itself, since the nucleus must be located on the final AG of an IP (IPs being right-headed). By default, AGs are co-terminous with headed, heavy Feet (those beginning with stressed syllables), so that the intonation nucleus falls on the final suc
h Foot; in context the focus may shift to an earlier Foot position, thus creating an AG constituent containing more than one Foot.
In this case, since AGs are left-headed, the first Foot within the AG is the head of that AG and the domain for the nuclear pitch contour. {\b (Examples available if required.) YES PLEASE! -RAO}\par
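A sketch of how such a selection might be expressed (the attribute names and default choices below are our own illustration, not the implemented rule set):

    # Discourse attributes of the IP select the boundary tone and the pitch
    # accents; the nucleus goes to the final AG, since IPs are right-headed.
    def assign_intonation(ip):
        boundary = "L%" if ip["discourse_position"] == "final" else "H%"
        nuclear = "H*L" if ip["speech_act"] == "declarative" else "L*H"
        for ag in ip["accent_groups"][:-1]:
            ag["accent"] = "H*"  # prenuclear accents
        ip["accent_groups"][-1]["accent"] = nuclear
        return boundary

    ip = {"discourse_position": "final", "speech_act": "declarative",
          "accent_groups": [{"name": "AG1"}, {"name": "AG2"}]}
    print(assign_intonation(ip), ip["accent_groups"])
    # -> L%, with H* on AG1 and H*L on AG2, i.e. the H* H*L L% pattern below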
\pard \qj\sb240 2.\tab A discourse-final declarative IP, then, consisting of two well-formed (non-degenerate) AGs, would typically be assigned a relatively high accent in AG1, a falling nuclear pitch movement in AG2 and a low fin
al boundary tone (equivalent to H* H*L L% in ToBI-style notation).\par
\pard \qj\sb240 3.\tab
The interpretation of the selected pitch contour in terms of F0 is, like other phonetic parameters, structure-dependent. Precise alignment of contour turning-points is constrained by the properties of units at lower levels in the hierarchy. In our model, d
escribed in more detail in (ICPhS paper 1999 ref), nuclear pitch accents are defined in terms of a template based on a sequence of contour turning-points. These templates are in turn bas
ed on a set of essential parameters derived by automatic means from the Laryngograph recording used to calculate the F0 trace, and checked using informal listening tests to ensure that there was perceptual equivalence between natural F0 contours and those
constructed by linking the target points we identified. For example, for a falling (H*L) pitch accent we identify three crucial contour turning-points: Peak ONset (PON), Peak OFfset (POF) and Level ONset (LON). In other words, we recognise that the \ldblquote peak\rdblquote associated with H* accents is often manifested as a plateau, with its own duration, rather than as a single peak: PON and POF represent the start and end of such a plateau, with POF therefore denoting the beginning of the F0 fall. LON occurs at the end of the fall, and is the point from which the low tone spreads till the end of voicing in the AG (cf. \ldblquote phrase accent\rdblquote (ref)). \par
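To show how the three turning-points define a contour, the sketch below renders a falling accent by linear interpolation between them (all times and F0 values are invented; the real templates are derived from the database as described):

    import numpy as np

    # PON-POF: the peak plateau; POF-LON: the fall; after LON the low tone
    # spreads to the end of voicing in the AG.
    pon, pof, lon, end_voicing = 0.10, 0.18, 0.32, 0.55  # times in s
    t = np.linspace(0.0, end_voicing, 200)
    f0 = np.interp(t, [0.0, pon, pof, lon, end_voicing],
                   [160.0, 220.0, 220.0, 130.0, 130.0])  # Hz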
\pard \qj\sb240 {\b ***Include suitable F0 plot as illustration, + ICPhS diagram with following procedural explanation*** \par
}\pard \qj\sb240 4.\tab Firstly, the location of the key syllable components was established using the manual annotations.
Then the peak F0 value in the accented syllable was found. The onset (PON) and the offset (POF) of the peak were then located by finding the range of times around the peak within which the F0 value stayed within 4% of the peak (approximating a range of perceptual equality). The schematic representation below illustrates the search for PON and POF.\par
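In code, the search might look like this (our reconstruction; the 4% tolerance is from the text, everything else is illustrative):

    import numpy as np

    def peak_plateau(times, f0, tol=0.04):
        # Find the F0 peak, then walk outwards in both directions while
        # F0 stays within 4% of the peak value.
        i = int(np.nanargmax(f0))
        lo, hi = i, i
        while lo > 0 and f0[lo - 1] >= f0[i] * (1.0 - tol):
            lo -= 1
        while hi < len(f0) - 1 and f0[hi + 1] >= f0[i] * (1.0 - tol):
            hi += 1
        return times[lo], times[hi]  # (PON, POF)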
\pard \qj\sb240 5.\tab The template turning-points are specified as attributes of the leftmost Foot (=head) within the AG. Our statist
ical analysis of the database suggests that the timing of all these points varies systematically with aspects of the structure of this Foot, such as its length in terms of number of component syllables, and characteristics of the onset and rhyme of the acc
ented syllable at its head. Many earlier studies of F0 alignment relate e.g. H* \ldblquote peak\rdblquote timing to this accented syllable, rather than to the Foot (various refs). Our early results suggest that we can cut down on some of the variability by treating the Foo
t as the primary domain for our template.\par
\pard \qj\sb240 6.\tab The patterns of alignment across structures which we observe for our single speaker model are consistent with those reported in the literature (see House & Wichmann 1996, Wichmann and House 1999 for summary).
We claim that successful modelling of the F0 values for this speaker, integrated with the same speaker\rquote s timing and spectral properties, enhances the coherence of the synthesised output. Acoustic-phonetic coherence will be further enhanced by incorporating
microprosodic perturbations of the F0 contour (Silverman, {\b what\rquote s this??->}Y), clearly observable for e.g. obstruent consonants on our database.\par
\pard\plain \s255\qj\sb360\keepn \b\f21\fs28 \sect \sectd \sbknone\linemod0\linex0\headery709\footery709\cols1\colsx289 \pard\plain \s255\qj\sb360\keepn \b\f21\fs28 6.\tab Perceptual testing/experiments\par
\pard\plain \qj\sb240 \f20 {\b Urgent decisions needed on this. Include or exclude? If include (in }{\b\i any}{\b form), UCL must provide me with text immediately please. If you don\rquote
t, I think the paper looks a bit funny with only two experiments reported.}\par
[[This section will be expanded with the experimental results from respective sites. STILL LOTS OF WORK TO DO HERE ON JOINING THINGS UP BETTER. WILL WAIT TILL I GET UCL TEXT, THEN TAKE OUT COMMONALITIES AND PUT IN 6.1]]\par
\pard\plain \s254\qj\sb360\keepn \b\i\f21 6.1\tab conditions shared by all experiments\par
\pard\plain \qj\sb240 \f20 [[This section will contain information relevant to all the experiments.]]\par
\pard\plain \s254\qj\sb360\keepn \b\i\f21 6.2\tab f0\par
\pard\plain \qj\sb240 \f20 [[Emphasises the innovation in our testing of intonation; something about lack of good standard models for testing intonation.]]\par
\pard\plain \s254\qj\sb360\keepn \b\i\f21 6.3\tab timing\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.3.1.\~Hypothesis\par
\pard\plain \qj\sb240 \f20 The hypothesis we are testing in ProSynth is that having hierarchically organised, prosodically structured linguistic information should make it pos
sible to produce more natural-sounding synthetic speech which is also more robust under difficult listening conditions. As an initial test of our hypotheses about temporal structure and its relation to prosodic structure, an experiment has been conducted t
o test whether the categories set out in Section 2 make a significant difference to listeners\rquote
ability to interpret synthetic speech. If the timings predicted by ProSynth for structural positions are perceptually important, listeners should be more success
ful at interpreting synthetic speech when the timing appropriate for the structure is used than when the timing is inappropriate for the linguistic structures set up.\par
\pard \qj\fi357\sb240 The data consist of phrases from the database of natural English generated by MBROLA (Dutoit et al. ref) synthesis using timings of two different kinds: (1)\~
the segment durations predicted by the ProSynth model taking into account all the linguistic structure outlined in Section 2; (2)\~the segment durations predicted by ProSynth for a different linguistic structure. If the linguistic structure makes no significant difference, then (1) and (2) should be perceived equally well (or badly). If temporal interpretation is sensitive to linguistic structure in the way that we have suggested, then
the results for (1) should be better than the results for (2).\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.3.2.\~Data\par
\pard\plain \qj\sb240 \f20 12 groups of structures to be compared on structural linguistic grounds were established (eg "light ambisyllabic short initial syllable" versus "light nonambisyllabic short initial syllable"). Each group has two members (eg {\i
robber}/{\i rob them} and {\i loving}/{\i loveless}
). For each phrase, two synthetic stimuli were generated: one with the predicted ProSynth timings for that structure, and another one with the timings for the other member of the pair. Files were produced with timing information from the natural-speech utt
erances, and an approximation to the f0 of the speech in the database. The timing information for the final foot was then replaced with timing from the ProSynth model. This produced
the 'correct' timings. In order to produce the 'broken' timings, timing information for the rhyme of the strong syllable in this final foot was swapped within the group so that, for example, the durations for {\i ob} in {\i robber}
were replaced with the durations for {\i ob} in {\i rob them} and vice versa.\par
\pard \qj\sb240 The stimuli have segment labels ultimately from the label files from the database, f0 information from the recordings in the database, and timing information partly from natural speech and partly from the ProSynth model.\par
\pard \qj\sb240 As an example, consider the pair {\i (he\rquote s a) robber} and {\i (to) rob them}. The durations (in ms.) for {\i robber} and {\i rob them} are:\par
\pard \qj\li720\sb240 ɒ\tab 120\tab ɒ\tab 110\par
\pard \qj\li720 b\tab 65\tab b\tab 85\par
ə\tab 150\tab ð\tab 60\par
\tab \tab ə\tab 120\par
\tab \tab m\tab 135\par
\pard \qj\sb240 Stimuli with these durations are compared with stimuli with the durations swapped round:\par
\pard \qj\li720\sb240 ɒ\tab 110\tab ɒ\tab 120\par
\pard \qj\li720 b\tab 85\tab b\tab 65\par
ə\tab 150\tab ð\tab 60\par
\tab \tab ə\tab 120\par
\tab \tab m\tab 135\par
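For illustration, the swap can be expressed over MBROLA input, where each .pho line is a SAMPA phone followed by its duration in ms (the script below is our sketch, not the original tooling; SAMPA Q, D and @ correspond to the symbols in the tables above):

    # Durations from the example above; only the rhyme of the strong
    # syllable (vowel + /b/) is exchanged between the members of the pair.
    robber = [("Q", 120), ("b", 65), ("@", 150)]
    rob_them = [("Q", 110), ("b", 85), ("D", 60), ("@", 120), ("m", 135)]

    def swap_rhyme(a, b, n=2):
        a2 = [(p, d) for (p, _), (_, d) in zip(a[:n], b[:n])] + a[n:]
        b2 = [(p, d) for (p, _), (_, d) in zip(b[:n], a[:n])] + b[n:]
        return a2, b2

    broken_robber, broken_rob_them = swap_rhyme(robber, rob_them)
    for phone, dur in broken_robber:
        print(phone, dur)  # .pho lines: "Q 110", "b 85", "@ 150"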
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.3.3.\~Experimental design.\par
\pard\plain \qj\sb240 \f20 22 subjects heard every phrase once at comfortable listening levels over headphones, presented via a Tucker-Davis DD1 digital-to-analogue interface. The signal-to-noise ratio was -5 dB. The noise was cafeteria noise, i.e. mixed background sounds such as voices and laughter. Subjects were instructed to transcribe what they heard using normal English spelling, and were given as much time as they needed. When they were ready, they pressed a k
ey and the next stimulus was played.\par
\pard \qj\sb240 Each subject heard half of the phrases as generated with the ProSynth model, and half with the timings switched. The subjects heard six practice items before hearing the test items, but were not informed of this.\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.3.4.\~Results\par
\pard\plain \qj\sb240 \f20
The phoneme recognition rate for the correct timings from the ProSynth model is 77.5%, and for the switched timings it is 74.2%. Although this is only a small improvement, it is nevertheless significant using a one-tailed correlated t-test (t(21) = 2.21, p
< 0.02).\par
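For reference, the test is a paired comparison across the 22 subjects; with scipy it might be computed along these lines (the per-subject scores below are invented placeholders):

    from scipy import stats

    correct = [0.78, 0.80, 0.75, 0.77]   # hypothetical per-subject rates
    switched = [0.74, 0.76, 0.73, 0.74]

    t, p_two_sided = stats.ttest_rel(correct, switched)
    p_one_tailed = p_two_sided / 2       # directional hypothesis
    print(t, p_one_tailed)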
\pard \qj\sb240 Examples of the stimuli and further details of the results of the experiments (including updates) are available on the world wide web [Ref2].\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.3.5.\~Discussion\par
\pard\plain \qj\sb240 \f20 The results show a significant effect of linguistic structure on intelligibility. The results are for the whole phrase, including parts which were not switched round: excluding those parts may strengthen the effect. The MBROLA diphone synthesis models durational effects, but not the segmental effects predicted by our model and described in more detail in Section 3: for example, the synthesis produces aspirated plosives in words like {\i roast}[ʰ]{\i ing} where our model predicts non-aspiration. It also uses only a small diphone database. The rather low phoneme recognition rates may be due to problematic synthesis quality, or to the cognitive load imposed by high levels of background noise. Further statistical analysis will group the data according to foot-type, and future experiments will use a formant synthesiser.\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.3.6.\~Future work\par
\pard\plain \qj\sb240 \f20
Future work will concentrate on refining the temporal model so that it generates durations which approximate those of our natural speech model as well as possible. The work will be checked by more perceptual experiments, including presenting the synthe
tic stimuli under listening conditions that impose a high cognitive load, such as having the subjects perform some other task while listening to the synthesis.\par
\pard\plain \s254\qj\sb360\keepn \b\i\f21 6.4\tab segmental boundaries\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.4.1. Material. \par
\pard\plain \qj\sb240 \f20 18 phrases from the database were copy-synthesized into HLsyn using {\scaps procsy}
[15], at a sampling rate of 11.025 kHz, and hand-edited to a good standard of intelligibility, as judged by a number of listeners. In 10 phrases, the sound of interest was a voiceless fricative: at the onset of a stressed syllable\emdash {\i in a }{\i\ul f}{\i ield}
; unstressed onset\emdash {\i it\rquote s }{\i\ul s}{\i urreal}; coda of an unstressed syllable\emdash {\i to di}{\i\ul s}{\i robe}; between unstressed syllables\emdash {\i di}{\i\ul s}{\i appoint}; coda of a final stressed syllable\emdash {\i on the roo}
{\i\ul f}{\i , his ri}{\i\ul ff}{\i , a my}{\i\ul th}{\i , at a lo}{\i\ul ss}{\i , to cla}{\i\ul sh}; and both unstressed and stressed onsets\emdash {\i\ul f}{\i ul}{\i\ul f}{\i illed.}
The other 8 items had voiced stops as the focus: in the coda of a final stressed syllable\emdash {\i it\rquote s mislai}{\i\ul d}{\i , he\rquote s a ro}{\i\ul gue}{\i , he was ro}{\i\ul bb}{\i ed}; stressed onset\emdash {\i in the }{\i\ul b}{\i and}
; unstressed onset\emdash {\i the }{\i\ul d}{\i elay, to }{\i\ul b}{\i e wronged}; unstressed and final post-stress contexts\emdash {\i to }{\i\ul d}{\i eri}{\i\ul de}; and in the onset and coda of a stressed syllable\emdash {\i he }{\i\ul b}{\i e}{\i\ul
gg}{\i ed.\par
}\pard \qj\sb240 The sound of interest was synthesized with the \ldblquote right\rdblquote type of excitation pattern. From each right version, a \ldblquote wrong\rdblquote
one was made by substituting a type or duration of excitation that was inappropriate for the context. Changes were systematic; no attempt
was made to copy the exact details of the natural version of each phrase, as our aim was to test the perceptual salience of the type of change, with a view to incorporating it in a synthesis-by-rule system.\par
\pard \qj\sb240 At
FV boundaries, the right version had simple excitation (an abrupt transition between aperiodic and periodic excitation), and the wrong version had mixed periodic and aperiodic excitation. VF boundaries had the opposite pattern: wrong versions had no mixed
excitation. See Fig. 1. Right versions were expected to be more intelligible than wrong versions of fricatives.\par
\pard \qj\sb240 Each stop had one of two types of wrong voicing: longer-than-normal voicing for {\i\ul b}{\i and} and{\i }{\i\ul b}{\i e}{\i\ul gg}{\i ed} (see Fig. 2) whose onset stops normally have a
short proportion of voicing in the closure; and unnaturally short voicing in the closures of the other six words. The wrong versions of {\i\ul b}{\i and} and{\i }{\i\ul b}{\i e}{\i\ul gg}{\i ed}
were classed as hyper-speech and expected to be more intelligible than the right versions. The other 6 were expected to be less intelligible in noise if naturalness and intelligibility co-varied.\par
\pard \qj\sb240 <FIG MISSING>\par
\pard \qj\sb240 Figure 1. Spectrograms of part of /{\scaps\f12407 is}/ in {\i disappoint}. Left: natural; mid: synthetic \ldblquote right\rdblquote version; right: synthetic \ldblquote wrong\rdblquote version.\par
\pard \qj\sb240 <FIG MISSING>\par
<FIG MISSING>\par
<FIG MISSING>\par
\pard \qj\sb240 Figure 2. Waveforms showing the region around the closure of /b/ in {\i he begged}. Upper panel: natural speech; middle: \ldblquote right\rdblquote synthetic version; lower: hyper-speech synthetic version.\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.4.2. Subjects. \par
\pard\plain \qj\sb240 \f20 The 22 subjects were Cambridge University students, all native speakers of British English under 30 years old, with no known speech or hearing problems.\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.4.3. Procedure. \par
\pard\plain \qj\sb240 \f20 The 18 experimental items were mixed with randomly-varying cafeteria noise at an average s/n ratio of -4
dB relative to the maximum amplitude of the phrase. They were presented to listeners over high-quality headphones at a comfortable listening level, using a Tucker-Davis DD1 D-to-A system driven from a PC. Listeners were tested individually in a sound-
treated room. They pressed a key to hear each item, and wrote down what they heard. Each listener heard each phrase once: half the phrases in the right version, half wrong or hyper-speech. The order of items was randomized for each listener separately, and
, because the noise was variable, it too was randomized separately for each listener. Five practice items preceded each test.\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.4.4. Results\par
\pard\plain \qj\sb240 \f20 Responses were scored for number of phonemes correct. Wrong insertions in otherwise correct responses counted as errors. There were two analyses, one on all phonemes in the phrase, the other on just three\emdash
the manipulated phoneme and the two adjacent to it. Table 6 shows results for 16 phrases, i.e. excluding the two hyper-speech phrases. Responses were significantly bett
er (p < 0.02) for the right versions in the 3-phone analysis, and achieved a significance level of 0.063 in the whole-phrase analysis.\par
\pard \qj\sb240 \par
\trowd \trqc\trgaph107\trleft-107 \clbrdrt\brdrs \clbrdrl\brdrs \clshdng0\cellx1129\clbrdrt\brdrs \clbrdrl\brdrs \clbrdrr\brdrs \clshdng0\cellx4531\clbrdrt\brdrs \clbrdrr\brdrs \clshdng0\cellx6232\pard \qj\keepn\intbl context\cell \pard \qj\keepn\intbl
version of phrase\cell \pard \qj\keepn\intbl t(21) p (1-tail)\cell \pard \intbl \row \trowd \trqc\trgaph107\trleft-107 \clbrdrl\brdrs \clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx1127\clbrdrl\brdrs \clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx2828
\clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx4529\clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx6230\pard \qj\keepn\intbl \cell \pard \qj\keepn\intbl \ldblquote right\rdblquote \cell \ldblquote wrong\rdblquote \cell \cell \pard \intbl \row \trowd
\trqc\trgaph107\trleft-107 \clbrdrt\brdrs \clbrdrl\brdrs \clbrdrb\brdrdot \clbrdrr\brdrs \clshdng0\cellx1127\clbrdrt\brdrs \clbrdrl\brdrs \clbrdrb\brdrdot \clbrdrr\brdrs \clshdng0\cellx2828\clbrdrt\brdrs \clbrdrb\brdrdot \clbrdrr\brdrs \clshdng0\cellx4529
\clbrdrt\brdrs \clbrdrb\brdrdot \clbrdrr\brdrs \clshdng0\cellx6230\pard \qj\sb240\keepn\intbl 3 phones\cell \pard \qj\sb240\keepn\intbl 69\cell 61\cell 2.35 0.015\cell \pard \intbl \row \trowd \trqc\trgaph107\trleft-107 \clbrdrl\brdrs \clbrdrb\brdrs
\clbrdrr\brdrs \clshdng0\cellx1127\clbrdrl\brdrs \clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx2828\clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx4529\clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx6230\pard \qj\sb240\intbl entire phrase\cell \pard
\qj\sb240\intbl 72\cell 68\cell 1.59 0.063\cell \pard \intbl \row \pard \qj\sb240 Table {\*\bkmkstart perc_data_\bkmkcoll32 }{\*\bkmkend perc_data_}6. Percentage of correctly identified phonemes in 16 phrases.\par
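The scoring of responses can be sketched as follows (our reconstruction: the alignment via difflib and the exact treatment of insertions are assumptions, not the original scoring script):

    import difflib

    def phonemes_correct(target, response):
        # Count matched phonemes; insertions beyond the target's length in
        # an otherwise correct response are penalised as errors.
        sm = difflib.SequenceMatcher(a=target, b=response)
        matched = sum(b.size for b in sm.get_matching_blocks())
        extra = max(len(response) - len(target), 0)
        return max(matched - extra, 0)

    # "band" heard as "vand": 3 of 4 phonemes correct (SAMPA symbols);
    # heard as "bands": the wrong insertion costs one point, also 3.
    print(phonemes_correct(list("b{nd"), list("v{nd")))   # -> 3
    print(phonemes_correct(list("b{nd"), list("b{nds")))  # -> 3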
\pard \qj\sb240 Responses to the hyper-speech words differed: 84% vs. 89% correct for normal vs. hyper-speech {\i begged}; 85% vs. 76% correct for normal vs. hyper-speech {\i band} (3-phone analysis). Hyper-speech {\i in the} {\i band}
was often misheard as {\i in the van}. This lexical effect is an obvious consequence of enhanced periodicity in the /b/ closure of {\i band}.\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 6.4.5. Discussion\par
\pard\plain \qj\sb240 \f20
We have shown for one speaker of Southern British English that linguistic structure influences the type of excitation at the boundaries between voiceless fricatives and vowels, as well as the duration of periodic excitation in the closures of voiced stops.
Most FV boundaries are simple, whereas most VF boundaries are mixed. Within thes
e broad patterns, syllable stress, vowel height, and final vs. non-final position within the phrase all influence the incidence and/or duration of mixed excitation. We interpret these data as indicating that the principal determinant of mixed excitation is asynchrony in coordinating glottal and upper articulator movement. Timing relationships seem to be tighter at FV than at VF boundaries, and there can be considerable latitude in the timing of VF boundaries when the fricative is a phrase-final coda.
\par
\pard \qj\sb240
Our findings for voiced stops were as expected, if one assumes that the main determinants of the duration of low-frequency periodicity in the closure interval are aerodynamic. One interesting pattern is that voicing in the closures of prestressed onset sto
ps is short both in absolute terms and relative to the total duration of the closure.\par
\pard \qj\sb240 We further showed that phoneme identification is better when the pattern of excitation at segment boundaries is appropriate for the structural context. Considering that only one acoustic boundary, i.e. one edge of one phone or diphone, was manipulated in most of the phrases, and that there are relatively few data points, the significance levels achieved testify to the importance of synthesizing edges that are appropriate to
the context. It is encouraging that differences were still fairly reliable in the whole-phrase analysis under these circumstances, since we would expect more response variability over the whole phrase.\par
\pard \qj\sb240 If local changes in excitation type at segment bounda
ries enhance intelligibility significantly, then systematic attention to boundary details throughout the whole of a synthetic utterance will presumably enhance its robustness in noise considerably. However, the speech style that is most appropriate to the situation is not always the most natural one. The two instances of hyper-speech are a case in point. By increasing the duration of closure voicing in stressed onset stops, we imitated what people do to enhance intelligibility in adverse conditions such as noise or telephone bandwidths. But this manipulation risked making the /b/s sound like /v/s, effectively widening the lexical neighbourhood of {\i band} to include {\i van}. Since {\i in the van} is as likely as {\i in the band}
, contextual cues could not help, and {\i band}\rquote s intelligibility fell. {\i Begged}\rquote
s intelligibility may have risen because there were no obvious lexical competitors, and because we also enhanced the voicing in the syllable coda, thus making a more extreme hyper-speech style, and, perhap
s crucially, a more consistent one. These issues need more work.\par
\pard \qj\sb240 The perceptual data do not distinguish whether the \ldblquote right\rdblquote
versions are more intelligible because the manipulations enhance the acoustic and perceptual coherence of the signal at the boundary, or because they provide information about linguistic st
ructure. The two possibilities are not mutually exclusive in any case. The data do suggest, however, that one reason for the appeal of diphone synthesis is not just that segment boundaries so
und more natural, but that their naturalness may make them easier to understand, at least in noise. It thus seems worth incorporating fine phonetic detail at segment boundaries into formant synthesis. It is relatively easy to produce these details with HLs
yn, on which {\scaps procsy} is based.\par
\sect \sectd \sbknone\linemod0\linex0\headery709\footery709\cols1\colsx289 \pard\plain \s255\qj\sb360\keepn \b\f21\fs28 7. Conclusion\par
\pard\plain \qj\sb240 \f20 {\b Yes, this section is dreadful\emdash a real construction site. Please be patient!}\par
\pard \qj\sb240 This needs work. I suggest just a couple of paras. Ideas for what to put in here gratefully received. My own thoughts:\par
\bullet \~\ldblquote informational richness\rdblquote is about (1)\~the speech signal containing systematic information that signals (2)\~complex linguistic structure.\par
\pard \qj\sb240 \bullet \~repeat that having properly structured linguistic knowledge has something essential to offer speech synthesis, at temporal, spectral and intonational levels of modelling. We\rquote re suggesting an integrated, structure-based
(i.e. prosodic) model.\par
\pard \qj\sb240 \bullet \~perhaps an indication of where we go next.{\ul \par
}\pard\plain \s255\qj\sb360\keepn \b\f21\fs28 \sect \sectd \sbknone\linemod0\linex0\headery709\footery709\cols1\colsx289 \pard\plain \s255\qj\sb360\keepn \b\f21\fs28 8. REFERENCES\par
\pard\plain \qj\sb240 \f20 {\b I need you all to compile a set of references. Below is a list of things referred to by name in the text (+ refs pasted in from the scrappy bibliography that\rquote s been building up during the writing), so we\rquote
ve got an incomplete checklist. Some have numbers in text eg. [3]. or [Ref1]. Check this please! \par
}\pard \qj\sb240 {\b SUGGESTION: we need to make sure ICSLP ref. is in there, along with all ICPhS papers relevant to ProSynth. If you can find somewhere in the text where a ref. to such work is suitable, let me know.\par
}\pard \qj \par
Abercrombie (ref.) \par
Bird (1995)\par
Bregman 199xx\par
Dutoit et al ref\par
Elman and McClelland 1986\par
Fougeron & Keating (JASA ref) \par
Gobl and Ní Chasaide 19xx\par
Hamon et al.\par
Hawkins & Slater 1994\par
\pard \qj\li720 {Hawkins, S. & Slater, A. \ldblquote Spread of CV and V-to-V coarticulation in British English: implications for the intelligibility of synthetic speech.\rdblquote }{\i ICSLP}{ 94, 1: 57-60, 1994.}\par
\pard \qj Hawkins and Nguyen in press {\b \par
Hawkins & Nguyen LabPhon: please double check in text.}\par
\pard \qj\li720 {Hawkins, S., & Nguyen, N. \ldblquote Effects on word recognition of syllable-onset cues to syllable-coda voicing\rdblquote , }{\i LabPhon VI}{, York, 2-4 July 1998.}\par
\pard \qj Heid & Hawkins ref., Jenolan Caves\par
\pard \qj\li720 {Heid, S. & Hawkins, S. \ldblquote Automatic parameter-estimation for high-quality formant synthesis using HLSyn.\rdblquote Presented }{\i at 3rd ESCA Workshop on Speech Synthesis}{, Jenolan Caves, Australia, 1998.}\par
\pard \qj Heid and Hawkins (under review)\par
\pard \qj Heid and Hawkins 1999\par
House & Hawkins (1995)\par
\pard \qj\li720 {House, J. & Hawkins, S., \ldblquote An integrated phonological-phonetic model for text-to-speech synthesis\rdblquote , }{\i Proc. ICPhS XIII}{, Stockholm, Vol. 2, 326-329, 1995.}\par
\pard \qj House & Wichmann 1996\par
Keating, Fougeron & Cho (LabPhon ref.)\par
Kelly and Local 1989\par
\pard \qj\li720 {Kelly, J. & Local, J. }{\i Doing Phonology.}{ Manchester: Manchester University Press, 1989.}\par
\pard \qj Kwong and Stevens 1999\par
Local & Ogden (1997)\par
\pard \qj\li720 {Local, J.K. & Ogden R. \ldblquote A model of timing for nonsegmental phonological structure.\rdblquote In Jan P.H. van Santen, R W. Sproat, J. P. Olive & J. Hirschberg (eds.) }{\i Progress in Speech Synthesis}{
. Springer, New York. 109-122, 1997.}\par
\pard \qj Local (1992a)\par
\pard \qj\li720 {Local, J.K. \ldblquote Modelling assimilation in a non-segmental rule-free phonology.\rdblquote In G J Docherty & D R Ladd (eds): }{\i Papers in Laboratory Phonology II}{. Cambridge: CUP, 190-223, 1992.}\par
\pard \qj Local (1992b)\par
Local (1995a)\par
Local (1995b)\par
Manuel (1995)\par
Marslen-Wilson and Warren 199x\par
Ogden (1992)\par
Ogden, Local & Carter ref.\par
\'85hman\par
other wmw refs (Gaskell?)\par
\pard \qj Pierrehumbert 1990\par
\pard \qj Pisoni and Duffy 19xx\par
Pisoni in van Santen book\par
Pratt (1986)\par
Remez 19xx\par
Remez and Rubin 19xx {\i Science} paper\par
Repp\par
Rosen and Howell 19xx\par
Selkirk 1984\par
\pard \qj\li720 {Selkirk, E. O., }{\i Phonology and Syntax}{, MIT Press, Cambridge MA, 1984.}\par
\pard \qj Silverman, Y\par
Simpson 1992\par
Strange\par
Tunley 1999\par
van Santen ref.\par
van Tasell, Soli et al 19xx\par
West (1999) \par
Wichmann and House 1999\par
Zsiga (1995)\par
\pard \qj \par
\pard\plain \s15\qj\fi-284\li556\sb120\sl-219\tx560 \fs18 {\f20 1.\tab Hawkins, S. \ldblquote Arguments for a nonsegmental view of speech perception.\rdblquote }{\i\f20 Proc. ICPhS XIII}{\f20 , Stockholm. Vol. 3, 18-25, 1995.\par
2.\tab House, J. & Hawkins, S., \ldblquote An integrated phonological-phonetic model for text-to-speech synthesis\rdblquote , }{\i\f20 Proc. ICPhS XIII}{\f20 , Stockholm, Vol. 2, 326-329, 1995.\par
3.\tab Local, J.K. & Ogden R. \ldblquote A model of timing for nonsegmental phonological structure.\rdblquote In Jan P.H. van Santen, R W. Sproat, J. P. Olive & J. Hirschberg (eds.) }{\i\f20 Progress in Speech Synthesis}{\f20
. Springer, New York. 109-122, 1997.\par
4.\tab Local, J.K. \ldblquote Modelling assimilation in a non-segmental rule-free phonology.\rdblquote In G J Docherty & D R Ladd (eds): }{\i\f20 Papers in Laboratory Phonology II}{\f20 . Cambridge: CUP, 190-223, 1992.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 5.\tab Kelly, J. & Local, J. }{\i\f20 Doing Phonology.}{\f20 Manchester: Manchester University Press, 1989.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 6.\tab Hawkins, S., & Nguyen, N. \ldblquote Effects on word recognition of syllable-onset cues to syllable-coda voicing\rdblquote , }{\i\f20 LabPhon VI}{\f20 , York, 2-4 July 1998.\par
7.\tab Hawkins, S. & Slater, A. \ldblquote Spread of CV and V-to-V coarticulation in British English: implications for the intelligibility of synthetic speech.\rdblquote }{\i\f20 ICSLP}{\f20 94, 1: 57-60, 1994.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 8.\tab Tunley, A. \ldblquote Metrical influences on /r/-colouring in English\rdblquote , }{\i\f20 LabPhon VI}{\f20 , York, 2-4 July 1998.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 9.\tab Fixmer, E. and Hawkins, S. \ldblquote The influence of quality of information on the McGurk effect.\rdblquote Presented at Australian Workshop on Auditory-Visual Speech Processing, 1998.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 10.\tab Selkirk, E. O., }{\i\f20 Phonology and Syntax}{\f20 , MIT Press, Cambridge MA, 1984.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 11.\tab Broe, M. \ldblquote A unification-based approach to Prosodic Analysis.\rdblquote }{\i\f20 Edinburgh Working Papers in Cognitive Science}{\f20 \~7, 27-44, 1991.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 12.\tab Bladon, R.A.W. & Al-Bamerni, A. \ldblquote Coarticulation resistance in English /l/.\rdblquote }{\i\f20 J. Phon}{\f20 4: 137-150, 1976.\par
13.\tab http://www.w3.org/TR/1998/REC-xml-19980210\par
14.\tab http://www.ltg.ed.ac.uk/\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 15.\tab Heid, S. & Hawkins, S. \ldblquote Automatic parameter-estimation for high-quality formant synthesis using HLSyn.\rdblquote Presented }{\i\f20 at 3rd ESCA Workshop on Speech Synthesis}{\f20
, Jenolan Caves, Australia, 1998.\par
}\pard\plain \qj\sb240 \f20 [Ref1] http://www.w3.org/XML/\par
[Ref2] http://www.phon.ucl.ac.uk/project/prosynth.htm \par
\pard \qj\sb240 [Ref3] Klatt, D., (1979) "Synthesis by rule of segmental durations in English sentences", Frontiers of Speech Communication Research, ed. B. Lindblom & S. \'85hman, Academic Press.\par
}