Richard Ogden (rao1@york.ac.uk)
Tue, 7 Sep 1999 16:54:51 +0100 (BST)
{\rtf1\mac\deff2 {\fonttbl{\f0\fswiss Chicago;}{\f2\froman New York;}{\f3\fswiss Geneva;}{\f4\fmodern Monaco;}{\f5\fscript Venice;}{\f6\fdecor London;}{\f7\fdecor Athens;}{\f12\fnil Los Angeles;}{\f13\fnil Zapf Dingbats;}{\f14\fnil Bookman;}
{\f15\fnil N Helvetica Narrow;}{\f16\fnil Palatino;}{\f18\fnil Zapf Chancery;}{\f20\froman Times;}{\f21\fswiss Helvetica;}{\f22\fmodern Courier;}{\f23\ftech Symbol;}{\f33\fnil Avant Garde;}{\f34\fnil New Century Schlbk;}{\f134\fnil Saransk;}
{\f237\fnil Petersburg;}{\f2017\fnil IPAPhon;}{\f2713\fnil IPAserif Lund1;}{\f9839\fnil Espy Serif;}{\f9840\fnil Espy Sans;}{\f9841\fnil Espy Serif Bold;}{\f9842\fnil Espy Sans Bold;}{\f10565\fnil M Times New Roman Expt;}
{\f12407\fnil SILDoulosIPA-Regular;}{\f12605\fnil SILSophiaIPA-Regular;}{\f13505\fnil SILManuscriptIPA-Regular;}}{\colortbl\red0\green0\blue0;\red0\green0\blue255;\red0\green255\blue255;\red0\green255\blue0;\red255\green0\blue255;\red255\green0\blue0;
\red255\green255\blue0;\red255\green255\blue255;}{\stylesheet{\s243\qj\sl-240\tqc\tx4967\tqr\tx9935 \f20\fs20 \sbasedon0\snext243 footer;}{\s244\qj\sl-240\tqc\tx4967\tqr\tx9935 \f20\fs20 \sbasedon0\snext244 header;}{\s245\qj\sb240 \f20\fs18\up6
\sbasedon0\snext0 footnote reference;}{\s246\qj\sb240 \f20\fs20 \sbasedon0\snext246 footnote text;}{\s252\qj\sb240\sa60\keepn \b\i\f20 \sbasedon0\snext0 heading 4;}{\s253\qj\sb240\sa60\keepn \b\f20 \sbasedon0\snext0 heading 3;}{\s254\qj\sb360\keepn
\b\i\f21 \sbasedon0\snext0 heading 2;}{\s255\qj\sb360\keepn \b\f21\fs28 \sbasedon0\snext0 heading 1;}{\qj\sb240 \f20 \sbasedon222\snext0 Normal;}{\s1\qj\sb120\sa120\sl360 \sbasedon222\snext1 Abstract;}{\s2\qc\sb180\sl-280 \b\f20 \sbasedon222\snext2
AbstractHeading;}{\s3\li288\ri288\sb140\sl-219 \f20\fs18 \sbasedon222\snext3 Address;}{\s4\qc\sb180\sl-219 \f20\fs22 \sbasedon222\snext4 Affiliation;}{\s5\qc\sb180\sl-219 \i\f20\fs22 \sbasedon222\snext5 Author;}{\s6\qj\sb120\sa120\sl360
\sbasedon222\snext6 Body;}{\s7\qc\sb120\sa240\sl360 \sbasedon0\snext0 caption;}{\s8\qc\sl219 \f20\fs18 \sbasedon222\snext8 CellBody;}{\s9\qc\sl219 \b\f20\fs18 \sbasedon222\snext9 CellHeading;}{\s10\qc\sb180\sl-280\keepn \b\f20 \sbasedon222\snext10 Head1;}
{\s11\fi-562\li562\sb180\sl-280\keepn\tx566 \b\f20 \sbasedon222\snext11 Head2;}{\s12\qj\fi-283\li572\ri561\sb140\sl-220\tx566 \fs18 \sbasedon222\snext12 Item;}{\s13\qj\fi-283\li572\ri561\sb140\sl-220\tx560 \fs18 \sbasedon222\snext13 NumItem;}{\s14\qc
\f20\fs8 \sbasedon4\snext14 bugfix;}{\s15\qj\fi-284\li556\sb120\sl-219\tx560 \fs18 \sbasedon222\snext15 Reference;}{\s16\qj\sl-280 \f21 \sbasedon222\snext16 RTF_Defaults;}{\s17\qj\sl219 \f20\fs18 \sbasedon222\snext17 TableTitle;}{\s18\qc\sl-340
\b\f20\fs28 \sbasedon0\snext18 Title;}{\s19\qc\sl280 \f20 \sbasedon222\snext19 CellFooting;}{\s20\qj\sb240 \sbasedon0\snext20 Document Map;}{\s21\qj\fi-720\li720 \sbasedon0\snext21 Indent;}{\s22\qj \fs20 \sbasedon0\snext22 Plain Text;}{\s23\qj\fi360
\f20\fs18 \sbasedon0\snext23 Normal Indent;}}{\info{\title INSTRUCTIONS FOR ICSLP96 AUTHORS}{\author Richard Ogden}}\paperw11880\paperh16820\margl1151\margr1151\margt1582\margb2098\widowctrl\ftnbj \sectd
\sbkodd\linemod0\linex0\headery709\footery709\cols1\colsx288 {\header \pard\plain \qj \f20 \par
}{\footer \pard\plain \qj\tqc\tx4800\tqr\tx9520 \f20 CSL paper\tab {{\field{\*\fldinst date \\@ "MMMM"}}} {{\field{\*\fldinst date \\@ "d"}}}, {{\field{\*\fldinst date \\@ "yyyy"}}}\tab {\chpgn }\par
}\pard\plain \s18\qc\sl-340 \b\f20\fs28 ProSynth: An Integrated Prosodic Approach to Device-Independent, Natural-Sounding Speech Synthesis\par
\pard\plain \s5\qc\sb180\sl-219 \i\f20\fs22 Richard Ogden{\fs14\up11 *}{\plain \f20\fs22 ,}{\fs14\up11 }Sarah Hawkins{\fs14\up11 **}, Jill House{\fs14\up11 ***}, Mark Huckvale{\fs14\up11 ***}, John Local{\fs14\up11 *}{\plain \f20\fs22 , }Paul Carter{
\fs14\up11 *}, Jana Dankovicov\'87{\fs14\up11 ***}, Sebastian Heid{\fs14\up11 **}\par
\pard\plain \s4\qc\sb180\sl-219 \f20\fs22 {\fs14\up11 *} University of York,{\fs14\up11 **} University of Cambridge, {\fs14\up11 ***} University College, London\par
\pard\plain \qj\sb240 \f20 {\fs28 \par
}\pard\plain \s14\qc \f20\fs8 \sect \sectd \sbknone\linemod0\linex0\headery709\footery709\cols1\colsx289 \pard\plain \s2\qc\sb180\sl-280 \b\f20 ABSTRACT{\fs18 \par
}\pard\plain \s1\qj\sb120\sa120 {\f20
This paper outlines ProSynth, an approach to speech synthesis which takes a rich linguistic structure as central to the generation of natural-sounding speech. We start from the assumption that the speech signal is informationally rich, and that this acoustic richness reflects linguistic structural richness and underlies the percept of naturalness. Naturalness achieved through structural richness produces a perceptually robust signal that is intelligible in adverse listening conditions. ProSynth uses syntactic and phonological parses to model the fine acoustic-phonetic detail of real speech, segmentally, temporally and intonationally. Preliminary tests to evaluate the effects of modelling fine spectral detail, timing, and intonation suggest that the approach increases intelligibility.}{\b\f20 \par
}\pard\plain \s255\qj\sb360\keepn \b\f21\fs28 \sect \sectd \sbknone\linemod0\linex0\headery709\footery709\cols1\colsx289 \pard\plain \s255\qj\sb360\keepn \b\f21\fs28 1. Introduction\par
\pard\plain \qj\sb240 \f20 1.\tab Speech synthesis by rule (text-to-speech, TTS) has restricted uses because it sounds unnatural and is often difficult to understand. Despite recent impr
ovements in grammatical analysis and in deriving correct pronunciations for irregularly-spelled words, there remains a more fundamental problem, that of the inherent incoherence of the synthesized acoustic signal. Synthetic speech typically lacks the subtl
e systematic variation of natural speech that underlies the perceptual coherence of syllables and their constituents and the longer phrases of which they form part. Intonation is often dull and repetitive, timing and rhythm are poor, and modifications that
word boundaries undergo in connected speech are poorly modelled. Much of this incoherence arises because many modern TTS systems encode linguistic knowledge in ways which are not in tune with current understanding of human speech and language processes.
\par
\pard \qj\sb240 2.\tab
Segmental intelligibility data illustrate the scale of the problem. When heard in noise, most synthetic speech loses intelligibility much faster than natural speech: natural speech is about 15% less intelligible at 0 dB s/n ratio than in quiet, whereas f
or isolated words or syllables, Pratt (1986) reported that typical synthetic speech drops by 35%-50%. We can expect similar results today. Concatenated natural speech avoids those problems related solely to voice quality and local segment boundaries, but s
uffers just as much from poor models of timing, intonation, and systematic variation in segmental quality that is dependent on word and rhythmical structure. Even when the grammatical analysis is right, one string of words can sound good, while another wit
h the same grammatical pattern does not. \par
\pard \qj\sb240 3.\tab ProSynth is an integrated {\i prosodic}
(i.e. structure-based) approach to speech synthesis. At its core is a phonological model which allows for structurally important distinctions to be made, even when the phonetic effect of these distinctions is subtle. The phonological model in ProSynth dra
ws together insights from current phonology, and makes it easier to model phonetic and perceptual effects. Recent research in computational phonology (e.g. Bird 1995) combines highly structured linguistic representations (more technically, signs) with a declarative, computationally tractable formalism. Recent research in phonetics (e.g. Simpson 1992, Hawkins & Slater 1994, Manuel 1995, Zsiga 1995) shows that speech is rich in non-phonemic information which contributes to its naturalness and robustness. Other work (Local 1992a & b, 1995a & b, Ogden 1992, Local & Ogden 1997) has shown that it is possible to combine phonological with phonetic knowledge by means of a process known as phonetic interpretation: the assignment of phonetic parameters to pieces of phonological structure. All these strands of work have contributed to the phonological model which ProSynth uses. By mimicking as far as possible the systematic spectral, temporal and intonational detail which is observable in natural speech, we aim to improve the intelligibility of synthetic speech. \par
\pard \qj\sb240 4.\tab This paper has the following structure. Section 2 outlines the motivation for the ProSynth model. Section 3 describes the linguistic
model used to represent the information necessary for modelling the kinds of phonetic effects described in Section 2. Section 4 sets out how the model described in Section 3 is implemented, and how segmental, temporal and intonational detail are modelled.
\par
\pard\plain \s255\qj\sb360\keepn \b\f21\fs28 \sect \sectd \sbknone\linemod0\linex0\headery709\footery709\cols1\colsx289 \pard\plain \s255\qj\sb360\keepn \b\f21\fs28 2.\tab Motivation\par
\pard\plain \qj\sb240 \f20 1.\tab
Interdependencies between grammatical, prosodic and segmental parameters are well known to phoneticians and to everyone who has synthesized speech. When these components are developed for synthesis in separate modules, the apparent conve
nience is offset by the need to capture the interdependencies, which often leads to problems of rule ordering and rule proliferation to correct effects of earlier rules. In our view, much of the robustness of natural speech is lost by neglecting systematic
subphonemic detail, a neglect that results partly from an inappropriate emphasis on phoneme strings rather than on linguistic structure. Fine phonetic detail, also called systematic, or lawful, variation (or variability, cf. Elman and McClelland 1986), co
ntributes to making the time-varying speech signal an effective communicative medium because it reflects multidimensional properties of both vocal-tract dynamics and linguistic structure.\par
\pard \qj\sb240 2.\tab
Accordingly, ProSynth models more phonetic detail than is standard in synthetic speech. Such detail includes secondary resonance effects, timing and rhythm, and f0 alignment. The aim is to create a signal that sounds natural because it seems to come fro
m a single talker and provides rich phonetic information about the linguistic structure of the utterance. The well-known \ldblquote redundancy\rdblquote
of the speech signal, whereby a phone can be signalled by a number of more-or-less co-occurring acoustic properties, contributes some of this detail, but in our view, other less well-documen
ted properties are just as important. As implied above, they can be roughly divided into two groups: those that make the speech signal sound as if it comes from a single talker, and those that reflect linguistic structure for a given accent.\par
\pard \qj\sb240 3.\tab A speech si
gnal sounds as if it comes from a single talker when it is perceptually coherent, meaning that its properties reflect details of vocal-tract dynamics. To be heard as speech, time-varying acoustic properties must bear the right relationships to one another.
When they do, the perceptual system groups them together into an internally coherent auditory stream (Bregman 1990) or more abstract entity{\b }
(cf. Remez 19xx). A wide range of properties seems to contribute to perceptual coherence. The influence of some, li
ke patterns of formant frequencies, is widely acknowledged (cf. Remez, Pisoni & Carrell 1981). Others are known to be important but are not always well understood; examples are the amplitude envelope which governs some segmental distinctions (cf. Rosen and
Howell 1987) and also perceptions of rhythm and of \lquote integration\rquote
between stop bursts and following vowels (van Tasell, Soli et al 19xx); and correlations between the mode of glottal excitation and the behaviour of the upper articulators, especially at abrupt segment boundaries (Gobl and N\'92 Chasaide 19xx).\par
\pard \qj\sb240 4.\tab
A speech signal will not sound as if the talker is using a consistent accent and style of speech unless all the systematic phonetic details are right. This requires producing often small distinctions that reflect different combinations of linguistic pro
perties. As an example, take the words {\i mistakes} and {\i mistimes}, whose spectrograms are shown at the left hand side of Figure XX. The beginnings of these two words are phonetically different in a number of wa
ys, even though the first four phonemes are the same. The /t/ of {\i mistimes} is aspirated and has a longer closure, whereas the one in {\i mistakes} is not aspirated and has a shorter closure. The /s/ of {\i mistimes}
is shorter, and its /m/ and /I/ are longer, which is heard as a rhythmic difference: the first syllable of {\i mistimes} has a heavier beat than that of {\i mistakes}. \par
\pard \qj\sb240 5.\tab These phonetic differences arise because the morphological structure of the words differs: {\i mistimes} contains the morphemes {\i mis}+{\i time}, which each have a separate meaning; and the meaning of {\i mistimes}
is straightforwardly related to the meaning of each of the two morphemes. But the meaning of {\i mistakes}
is not obviously related to the meaning of its constituent morphemes. This morphological difference is reflected phonologically in the syllable structure, as shown on the right of Figure XX. In {\i mistimes}
, /s/ is the coda of syllable 1, and /t/ is the onset of syllable 2. Conversely, the /s/ and /t/ in {\i mistakes} belong to both syllables and form both the coda
of syllable 1 and the onset of syllable 2. In an onset /st/, the /t/ is always unaspirated (cf. {\i step}, {\i stop}, {\i start}
). The durational differences in the /m/ and the /I/ arise because the morphologically-conditioned differences in syllable structure result in {\i mist} being a rhythmically heavy syllable whereas {\i mis}
is rhythmically light, while both syllables are metrically weak (i.e. unstressed). So the morphological differences between the words are reflected in structural phonological differences; and these in t
urn have implications for the phonetic detail of the utterances, despite the segmental similarities between the words.\par
\pard \qj\sb240 INSERT FIGURE XX ABOUT HERE\par
\par
\pard \qj\li720\sb240 Legend to Figure xx. Left: spectrograms of the words {\i mistimes} (top) and {\i mistakes }(bottom) spoken by a British English woman in the sentence {\i I\rquote d be surprised if Tess ____ it} with main stress on {\i Tess}
. Right: syllabic structures of each word.\par
\pard \qj\sb240 \par
\pard \qj\sb240 6.\tab Some types of systematic fine detail may contribute both perceptual coherence and information about linguistic structure
. So-called resonance effects (Kelly and Local 1989) provide one example. Resonance effects associated with /r/, for example, manifest acoustically as lowered formant frequencies, and can spread over several syllables, but the factors that determine whethe
r and how far they will spread include syllable stress, the number of consonants in the onset of the syllable, vowel quality, and the number of syllables in the foot (Tunley 1999).\par
\pard \qj\sb240 7.\tab On the one hand, including this type of fine phonetic detail (or systematic variation) in synthetic speech makes it sound more natural in a subtle way that is hard to describe in phonetic terms but seems to make the signal
\ldblquote fit together\rdblquote better\emdash in other words, it seems to make it more coherent. On the other hand, the fact that
the temporal extent of rhotic resonance effects depends on linguistic structure means not only that cues to the identity of a single phoneme can be distributed across a number of acoustic segments (sometimes several syllables), but also that aspects of th
e linguistic structure of the affected syllable(s) can be subtly signalled.\par
\pard \qj\sb240 8.\tab Listeners can use distributed{\b }
acoustic information to identify naturally-spoken words (Marslen-Wilson and Warren 199xx; other wmw refs (Gaskell?); Hawkins and Nguyen in press), and when such information is included in synthetic speech it can increase phoneme intelligibility in noise by
10-15% or more (Hawkins and Slater 1994, Tunley 1999). Both classical and recent experiments ((xxref Repp, \'85hman, Strange, Heid and Hawkins 1999; Pisoni in van Santen book, Pisoni and Duffy 19xx,{\b [[sh check these refs]]}
Kwong and Stevens 1999) suggest that most systematically varying properties will enhance perception in at least some circumstances. Natural-sounding, systematic variation of this type may be especially influential in adverse listening conditions or when c
ognitive loads are high. \par
\pard \qj\sb240 9.\tab In summary, ProSynth is based on the philosophy that natural speech is robust because it contains many phonetic details at the spectral, temporal and into
national levels. These details vary systematically to form a perceptually coherent whole and are the product of the phonetic interpretation of a rich linguistic structure.{\b }
In ProSynth, we attempt to model declaratively the richness of both linguistic structure and of the acoustic-phonetic signal which results from its interpretation (Pierrehumbert 1990). The next sections set out how the phonological model is organised, and
how we interpret it phonetically.\par
\pard\plain \s255\qj\sb360\keepn \b\f21\fs28 \sect \sectd \sbknone\linemod0\linex0\headery709\footery709\cols1\colsx289 \pard\plain \s255\qj\sb360\keepn \b\f21\fs28 3.\tab ProSynth: a linguistic model\par
\pard\plain \s254\qj\sb360\keepn \b\i\f21 Overview\par
\pard\plain \qj\sb240 \f20 1.\tab ProSy
nth uses a phonological model which encodes phonological information in a hierarchical fashion using structures based on attribute-value pairs. Each phonological unit occurs in a complete metrical context. This context is a prosodic hierarchy with phonolog
ical contrasts available at all levels. The prosodic hierarchy is described in Section 3.1. The complex interacting levels of rules present in traditional layered systems are replaced in ProSynth by a one-step phonetic interpretation function operating on
the entire context, which makes rule-ordering unnecessary. Whereas conventional synthesis systems use a relatively poor structure and complex, interacting rules, ProSynth uses instead a rich structure and applies simple rules of phonetic interpretation whi
ch are highly structure-bound. Systematic phonetic variation is thus constrained by position in structure. The basis of phonetic interpretation is not the segment, but phonological features at places in structure. We thus extend the principles successfully
demonstrated in (Local & Ogden 1997; Local 1992XX) to a wider variety of phonological domains and phonetic details. The details of the units of structure and their attributes are set out in Section 3.2.\par
\pard\plain \s254\qj\sb360\keepn \b\i\f21 3.1 The Prosodic Hierarchy\par
\pard\plain \qj\sb240 \f20 1.\tab
The phonological structure is organised as a prosodic hierarchy, with phonological information distributed across the structure. The knowledge is formally represented as a Directed Acyclic Graph (DAG), a kind of tree structure. Graph structures in the form of trees are commonly used in phonological analysis; ours differs chiefly in the important addition of ambisyllabicity. \par
\pard \qj\sb240 2.\tab
Text is parsed into a prosodic hierarchy which has units at the following levels: syllable constituents (Onset, Rhyme, Nucleus, Coda); Syllable; Foot; Accent Group (AG); Intonational Phrase (IP). The prosodic hierarchy, building on House & Hawkins (1995
) and Local & Ogden (1997), is a head\_driven (refs) and strictly layered structure. Each unit is dominated by a unit at the next highest level (Strict Layer Hypothesi
s, Selkirk 1984). This produces a linguistically well-motivated and computationally tractable hierarchy which accords with the representational requirements of its implementation in XML (Section XX). Constituents at each level have a set of possible attrib
utes, and relationships between units at the same level are determined by the principle of headedness. Structure-sharing is explicitly recognized through ambisyllabicity. \par
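The DAG representation with ambisyllabicity described above can be sketched in code. The following is an illustrative Python sketch of our own (not the ProSynth implementation); the class and variable names are hypothetical. The point it makes is that an [ambisyllabic: +] consonant is a single re-entrant terminal shared by two syllables, so the structure is a tree everywhere except at such shared nodes.

```python
# Hypothetical sketch: a prosodic hierarchy as a Directed Acyclic Graph.
# Ambisyllabic consonants are modelled by letting two syllables share
# one terminal node, so the graph is a tree except at shared terminals.

class Node:
    def __init__(self, level, label="", **attrs):
        self.level = level        # e.g. "Syll", "Onset", "Coda", "terminal"
        self.label = label
        self.attrs = dict(attrs)  # attribute-value pairs at this node
        self.children = []

    def add(self, child):
        self.children.append(child)
        return child

# "mistakes": /s/ and /t/ are ambisyllabic, so the coda of syllable 1
# and the onset of syllable 2 point at the very same terminal nodes.
s = Node("terminal", "s", ambisyllabic=True)
t = Node("terminal", "t", ambisyllabic=True)
syll1, syll2 = Node("Syll"), Node("Syll")
coda1 = syll1.add(Node("Coda"))
onset2 = syll2.add(Node("Onset"))
coda1.add(s); coda1.add(t)    # the same objects appear under both
onset2.add(s); onset2.add(t)  # syllables: a re-entrant DAG, not a tree

shared = coda1.children[0] is onset2.children[0]
```

Object identity (`is`) here stands in for the re-entrant terminal node of Fig. XX.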
\pard \qj\sb240 3.\tab Fig. XX shows a partial phonological structure for the phrase \ldblquote Come with a bloom\rdblquote
. Note that phonological information is spread around the structure. For example, the feature [voice] is treated as a property of the Rhyme as a whole, and not of just one of the terminal nodes headed by the Rhyme. Timing information is also included: in t
he Fig. XX, the [start] of the IP is the same as the [start] of the Onset of the first syllable of the utterance, and the [end] of the IP is the same as the [end] of the Coda of the last syllable, as indicated by the tags {\f13 \'c0} and {\f13 \'c1}
. The value for [ambisyllabic] is shown for two consonants: note that for the [ambisyllabic: +] consonant /{\f12407 D}/, the terminal node is re-entrant.\par
\pard\plain \s7\qc\sb120\sl360 {\f20\fs20 {\pict\macpict\picw370\pich266
064800000000010a0172001102ff0c00ffffffff000000000000000001720000010a000000000000001e0001000a00000000010a0172002c000c00150948656c76657469636100030015000d000b002e0004000000000028000a0113024950000028002e011002414700002a2404466f6f740001000a001001160022011800
22fffe011700360001000a0034011600460118000900b7087c00b7086400220022011700360001000a00580116006a0118002200460117003600a100640010474449310001ffffffff0000000000000001000a00000000010a017200030000000d0000002a240453796c6c002a24025268002a24024e750001000a007c0116
008e01180022006a011700360001000a00a0011600b201180022008e011700360001000a00000000010a017200291302436f0001000a00a0011700b201290022008e010536360001000a00000000010a0172002800be00f7014f0001000a007c00fc00b2011700200046013200e800e100a100640010474449310001ffffff
ff0100000000000001000a00000000010a01720028007600920453796c6c00002a24025268002a24024e750001000a007c0098008e009a0022006a009900360001000a00a0009800b2009a0022008e009900360001000a00000000010a017200291302436f0001000a00a0009900b200ab0022008e008736360001000a0000
0000010a0172002800be0079014f0001000a007c007e00b200990020004600b400e800630001000a00000000010a017200280076004a0453796c6c00002a24025268002a24024e750001000a007c0050008e00520022006a005100360001000a00a0005000b200520022008e005100360001000a00000000010a0172002913
02436f0001000a00a0005100b200630022008e003f36360001000a00000000010a0172002800be0031014f0001000a007c003600b2005100200046006c00e8001b0001000a00000000010a01720028007600da0453796c6c00002a24025268002a24024e750001000a007c00e0008e00e20022006a00e100360001000a00a0
00e000b200e20022008e00e100360001000a00000000010a0172002800be00c1014f0001000a007c00c600b200e10020004600fc00e800ab0001000a00000000010a01720028002e004a02414700002a2404466f6f740001000a003400500046005200220022005100360001000a00580050006a0052002200460051003600
01000a00580051006a0099002000460009007c00e10001000a00580051006a00e100200046ffc1007c01710001000a00100051002201170020fffe01dd0034ff8b0001000a00c4005000d60052002200b2005100360001000a00c4003500d60037002200b2003600360001000a00c4006200d60064002200b2006300360001
000a00c4007d00d6007f002200b2007e00360001000a00c4009800d6009a002200b2009900360001000a00c400aa00d600ac002200b200ab00360001000a00c400ab00d600c6002200b200e1af360001000a00c400e000d600e2002200b200e100360001000a00c400fc00d60105002200b200f31b360001000a00c4011600
d60118002200b2011700360001000a00c4012800d6012a002200b2012900360001000a00000000010a017200030015000d000b002800e20032016b00291b010000002912016d0000293801490000291101440000293601ab000001000a00c400f300d600fc002200b20105e5360001000a00000000010a017200030000000d
0000002912016200002914016c000029100275d9002912016d00002800e2007a017700030015000d000b0028006d012b135b737472656e6774683a207374726f6e675d20002a0c0f5b7765696768743a2068656176795d00002a180c5b636865636b6564202b5d20002a0c095b766f696365202b5d00002b0921055b656e64
3a00291501e700002907015d00002801060092115b616d626973796c6c616269633a202b5d00280106004a115b616d626973796c6c616269633a202d5d0001000a00e8006200fa0064002200d6006300360001000a00e800aa00fa00ac002200d600ab00360001000a00000000010a017200030000000d00000028000a012b
095b73746172743a202000030015000d000b00291c01cb00002908025d2000280016012b055b656e643a00291501e700002907015d00002800be0002085b73746172743a200000291a01cb00002908015d0000ff}}{\f20 \par
}\pard \s7\qc\sb120\sa240\sl360 {\f20 Fig. 1. Partial tree structure of the utterance: \ldblquote Come with a bloom\rdblquote . Selected phonological attributes are shown for some constituents. See text for details.\par
}\pard\plain \qj\sb240 \f20 4.\tab
There is no separate level of phonological word within the hierarchy. Such a unit does not sit happily in a strictly layered structure, because the boundaries of prosodic constituents like AG and Foot may well occur in the middle of
a lexical item. Conversely, word boundaries may occur in the middle of a Foot/AG. For example, in the phrase {\i maths department}
there are two feet: [maths de-], and [-partment]. The second begins in the middle of a word, and the first contains a word boundary. \par
\pard \qj\sb240 The computational representation of the prosodic structure allows us to get round this problem: word\_level and syntactic\_level information is hyper\_
linked into the prosodic hierarchy. Phonetic interpretation may be sensitive to information at any leve
l, so that it is possible to distinguish, for instance, a plosive in the onset of a weak word-final syllable from an onset plosive in a weak word-medial syllable. In this way lexical boundaries and the grammatical categories of words can be used to inform
phonetic interpretation. \par
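The hyperlinking of word-level information into the prosodic hierarchy can be sketched as follows. This is our own simplified illustration, not ProSynth code: the syllable and foot divisions for {\i maths department} follow the text above, but the index-based linkage and the helper names are hypothetical.

```python
# Sketch: word-level information is not nested inside the strictly
# layered prosodic tree, but linked to it, so phonetic interpretation
# can still ask whether a syllable is word-final or word-medial.

prosodic = [
    # foot membership for each syllable of "maths department":
    # feet are [maths de-] and [-partment], crossing the word boundary
    {"foot": 0, "syll": "maths"},
    {"foot": 0, "syll": "de"},
    {"foot": 1, "syll": "part"},
    {"foot": 1, "syll": "ment"},
]

# A separate word tier, hyperlinked by syllable index rather than nesting.
word_of = {0: "maths", 1: "department", 2: "department", 3: "department"}

def word_final(i):
    """A syllable is word-final if the next syllable starts a new word."""
    return i + 1 >= len(prosodic) or word_of[i + 1] != word_of[i]

boundaries = [word_final(i) for i in range(len(prosodic))]
# "maths" ends a word inside foot 0; "de" and "part" are word-medial.
```

Because the word tier is consulted by link rather than by dominance, the Strict Layer Hypothesis is undisturbed even though feet and words cross-cut each other.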
\pard\plain \s254\qj\sb360\keepn \b\i\f21 3.2 Units of Structure and their Attributes\par
\pard\plain \qj\sb240 \f20 1.\tab Input text is parsed into both a syntactic and a phonological structure. The phonological parse allots material to places in the prosodic hierarchy and is supplemented with links to t
he syntactic parse. The lexicon itself is in the form of a partially parsed representation. This section describes in more detail the units of structure\emdash in particular super-syllabic constituents\emdash and their attributes.\par
\pard \qj\sb240 {\b 2.\tab Phonological features:} Features are represented as <{\i attribute}, {\i value}> pairs, where the {\i value} slot can also be filled by another attribute-value pair.{\fs18\up6 \chftn {\footnote \pard\plain \s246\qj\sb240
\f20\fs20 {\fs18\up6 \chftn } Note on <{\i attribute}, {\i value}> pairs. Where the {\i value} is not boolean, as in [weight: heavy/light], we abbreviate this to e.g. [light] where it is convenient to do so in the text.}}
To the set of conventional features are added the features [rhotic:\'b1], to allow us to mimic the long-domain resonance effects of /r/ [5, 8], and [ambisyllabic:\'b1] for ambisyllabic constituents (\'a4XX). Phonological <{\i attribute}, {\i value}
> pairs are distributed around the entire prosodic hierarchy rather than at just the terminal nodes (or even associated to just terminal nodes), as in many phonological theories. [voice:\'b1
], for instance, is a property of the rhyme as a whole in order to model durational and resonance effects. Attributes at any level in the hierarchy may be accessed for use in phonetic interpretation.\par
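The distribution of <{\i attribute}, {\i value}> pairs over non-terminal nodes can be illustrated with a short sketch. This is our own illustrative Python (the function names and dict encoding are assumptions, not ProSynth's representation): [voice] is stated once on the rhyme, and a query at a terminal climbs the structure until a dominating node supplies the value.

```python
# Sketch: phonological features as attribute-value pairs attached to
# nodes anywhere in the hierarchy, with lookup along the dominance path.

def make_node(level, attrs=None, children=()):
    node = {"level": level, "attrs": attrs or {},
            "children": list(children), "parent": None}
    for c in node["children"]:
        c["parent"] = node
    return node

def lookup(node, attribute):
    """Find an attribute on this node or on any dominating node."""
    while node is not None:
        if attribute in node["attrs"]:
            return node["attrs"][attribute]
        node = node["parent"]
    return None

nucleus = make_node("Nucleus", {"length": "short"})
coda = make_node("Coda", {"place": "alveolar"})
rhyme = make_node("Rhyme", {"voice": "+"}, [nucleus, coda])

voicing = lookup(coda, "voice")  # stated on the Rhyme, visible at the Coda
```

Stating [voice] once at the rhyme and inheriting it downwards is what lets one attribute condition both durational and resonance effects without duplication at the terminals.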
\pard \qj\sb240 {\b 3.\tab Headedness}: When a unit branches
into sub-constituents, one of these constituents is its Head. If the leftmost constituent is the head, the constituent is said to be left-headed. If the rightmost constituent is the head, the structure is right-headed. AGs and Feet are left-headed. Propert
ies of a head are shared by the nodes it dominates (Broe ref., Ogden 1998). Therefore a [heavy] syllable has a [heavy] rhyme; the syllable-level resonance features [grave:\'b1] and [round:\'b1
] can also be shared by nodes they dominate: this is how some aspects of coarticulation are modelled. \par
\pard \qj\sb240 The feature [head:\'b1
] is used to mark headedness. A constituent with the feature [head:+] is the head of the superordinate constituent it belongs to. In Fig. XX, headedness is indicated by vertical lines, as opposed to slanting ones. Phonetic interpretatio
n proceeds head-first and is therefore determined in a structurally principled fashion without resort to extrinsic ordering. \par
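Head-first phonetic interpretation can be sketched as a traversal. The following is a deliberately minimal Python illustration of our own (the dict encoding and function name are assumptions): each branching constituent visits its [head:+] child before its non-head siblings, so interpretation order falls out of structure rather than from extrinsic rule ordering.

```python
# Sketch: head-first traversal of a constituent structure. The order
# in which terminals are interpreted is determined by headedness alone.

def interpret(node, visited=None):
    """Return terminal labels in head-first order."""
    if visited is None:
        visited = []
    children = node.get("children", [])
    if not children:
        visited.append(node["label"])
        return visited
    head = [c for c in children if c.get("head")]
    rest = [c for c in children if not c.get("head")]
    for child in head + rest:  # the head is interpreted first
        interpret(child, visited)
    return visited

# A syllable is right-headed (Rhyme is head); a rhyme is left-headed
# (Nucleus is head), so the nucleus is interpreted before onset or coda.
syllable = {"children": [
    {"label": "onset"},
    {"head": True, "children": [
        {"head": True, "label": "nucleus"},
        {"label": "coda"}]}]}

order = interpret(syllable)
```

Interpreting the nucleus first, then coda, then onset mirrors the claim above that onset and coda inherit vocalic features from the nucleus, the head of the syllable.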
\pard \qj\sb240 {\b 4.\tab Intonational Phrase (IP)}: The IP, the domain of a well-formed, coherent intonation contour, contains one or more AGs; minimally it must include a strong AG. The head of the IP is the rightmost AG\emdash
traditionally the intonational nucleus. The IP is the largest prosodic domain recognised in the current implementation of the ProSynth model. The attributes of IP are (1)\~position in discourse, (2)\~speech act function, (3)\~
focus. (1) and (2) together determine f0 scaling and boundary tones, whereas (3)\~determines intonational nucleus placement, using information from the syntax or the lexicon as a default when other discourse information is unknown. Work on this aspect of the model is ongoing.\par
{\b 5.\tab Accent Groups (AG)}
: AGs are units of intonation. They immediately dominate one or more feet. The head of the AG is the leftmost [heavy] foot, and is associated with an intonational pitch accent. AG attributes include [weight: heavy/light], number of component feet, position
within the IP and Pitch Accent specifications. Only [heavy] AGs can have Pitch Accents assigned to them. When an IP begins with one or more unaccented syllables, we maintain the strictly layered structure by analysing them as constituting a [light] or
\ldblquote degenerate\rdblquote AG, which in turn contains a [light] foot. Degenerate AGs have no head, cannot carry Pitch Accents, and can only occur as the first AG in an IP.\par
\pard \qj\sb240 {\b 6.\tab Feet}: All syllables are organised into Feet, which are units of rhythm. Types of Feet are differentiated using attributes of [weight: heavy/light], [strength: strong/weak], [head:\'b1
] and number of component syllables. Feet with the attribute [hea
d:+] are assigned Pitch Accents (see above). The attribute [weight] distinguishes between fully-formed ([heavy]) and degenerate ([light]) feet. A degenerate foot (which must be [light]) cannot act as a site for rhythmic stress\emdash
it is also [weak]. A [strong]
foot is associated with a rhythmically stressed position. The leftmost syllable within a foot acts as its head, so the syllable at the head of a [strong] foot, itself [strong], is stressed. A [weak] foot cannot carry stress. However, [strong] syllables ma
y occur inside [weak] feet; for example, the fourth syllable {\i known} in the phrase {\i in the well-known maths department} is [strong], but is dominated by a rhythmically [weak] Foot. \par
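The foot-level stress facts just described can be restated as a small sketch. This is our own hedged illustration, not ProSynth code; the footing of the example phrase is assumed from the text ({\i [maths de-]} and {\i [-partment]}, with {\i known} in a [weak] foot), and the function name is hypothetical.

```python
# Sketch: the leftmost syllable heads a foot, and a syllable is
# rhythmically stressed iff it heads a [strong] foot. A [strong]
# syllable inside a [weak] foot (like "known") is not stressed.

def stressed_syllables(feet):
    """feet: list of (strength, syllables); return the stressed heads."""
    out = []
    for strength, sylls in feet:
        if strength == "strong" and sylls:
            out.append(sylls[0])  # head = leftmost syllable of the foot
    return out

# Partial footing of "well-known maths department" (assumed):
phrase = [("strong", ["well"]),
          ("weak", ["known"]),
          ("strong", ["maths", "de"]),
          ("strong", ["part", "ment"])]

stressed = stressed_syllables(phrase)
```
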
\pard \qj\sb240 {\b 7.\tab Syllables:}
The Syllable contains the constituents onset and rhyme. The rhyme branches into nucleus and coda. Nuclei, onsets and codas can all branch. Onsets and codas contain consonants, while nuclei contain vowels. Both onsets and codas contain vocalic features whi
ch are inherited from the nucleus, which is the head of the syllable. This allows for the accurate modelling of coarticulation (Coleman 1992, Local 1992, Ogden 1992).\par
\pard \qj\sb240
Syllables are right-headed, rhymes left-headed. Attributes of the syllable include [weight: heavy/light], and [strength: strong/weak]: these are necessary for the correct assignment of temporal compression (Section 5.2). Foot-initial syllables are strong.
\par
\pard \qj\sb240 8.\tab Weight is defined with regard to the subconstituents of the rhyme. A syllable is [heavy] if its nucleus attribute [length] has the value [long] (in seg
mental terms, if it contains a long vowel or a diphthong). A syllable is also [heavy] if its coda has more than one constituent, as in /rent/, /ask/, /taks/. Other syllables are [light]. In polymorphemic syllables such as {\i cat+s}
, the weight of the syllable is determined according to the stem, and the suffix is treated as a syllable appendix.\par
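The weight-assignment rule in point 8 can be expressed compactly; the following is a sketch (a hypothetical helper, not part of ProSynth) for monomorphemic syllables:

```python
# Weight assignment as in point 8: a syllable is [heavy] if its nucleus
# has [length: long] (a long vowel or diphthong), or if its coda has
# more than one constituent; otherwise it is [light].

def syllable_weight(nucleus_long: bool, coda_constituents: int) -> str:
    """Return 'heavy' or 'light' for a monomorphemic syllable."""
    if nucleus_long or coda_constituents > 1:
        return "heavy"
    return "light"

# /rent/, /ask/, /taks/: short nucleus but branching coda -> heavy
print(syllable_weight(False, 2))   # heavy
# first syllable of 'loving': short nucleus, one coda consonant -> light
print(syllable_weight(False, 1))   # light
```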
\pard \qj\sb240 9.\tab There is not a direct relationship between syllable strength and syllable weight. Strong syllables need not be heavy. In {\i loving}, /{\f12407 l\'c3v}/ has a [short] nucleus, and the
coda has only one constituent (corresponding to /{\f12407 v}/), yet it is the strong syllable in the foot. Similarly, weak syllables need not be light. In {\i department}
, the final syllable has a branching coda (i.e. more than one constituent) and therefore is [heavy] but [weak]. ProSynth does not use extrametricality: all phonological material must be dominated by an appropriate node in structure.\par
\pard \qj\sb240 Fig. XX illustrates the partial metrical structure for the syllable, foot, AG and IP nodes for the phrase {\i in the well-known maths department}, along with low-level syntactic tags.\par
\pard \qj\sb240 \par
\trowd \trgaph80\trleft-80 \clmgf\clbrdrt\brdrs \clbrdrl\brdrs \clshdng0\cellx4708\clmrg\clbrdrt\brdrs \clshdng0\cellx4708\clmrg\clbrdrt\brdrs \clshdng0\cellx4708\clmrg\clbrdrt\brdrs \clshdng0\cellx4708\clmgf\clbrdrt\brdrs \clshdng0\cellx9496\clmrg
\clbrdrt\brdrs \clshdng0\cellx9496\clmrg\clbrdrt\brdrs \clshdng0\cellx9496\clmrg\clbrdrt\brdrs \clbrdrr\brdrs \clshdng0\cellx9496\pard \qj\sb240\keepn\intbl \cell \pard \qj\sb240\keepn\intbl \cell \cell \cell \pard \qj\sb240\keepn\intbl IP\cell \pard
\qj\sb240\keepn\intbl \cell \pard \qj\sb240\intbl \cell \cell \pard \intbl \row \trowd \trgaph80\trleft-80 \clmgf\clbrdrt\brdrs \clbrdrl\brdrs \clbrdrb\brdrs \clshdng0\cellx2314\clmrg\clbrdrt\brdrs \clbrdrb\brdrs \clshdng0\cellx2314\clmgf\clbrdrt\brdrs
\clbrdrl\brdrs \clbrdrb\brdrs \clshdng0\cellx4708\clmrg\clbrdrt\brdrs \clbrdrb\brdrs \clshdng0\cellx4708\clmgf\clbrdrt\brdrs \clbrdrl\brdrs \clbrdrb\brdrs \clshdng0\cellx9496\clmrg\clbrdrt\brdrs \clbrdrb\brdrs \clshdng0\cellx9496\clmrg\clbrdrt\brdrs
\clbrdrb\brdrs \clshdng0\cellx9496\clmrg\clbrdrt\brdrs \clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx9496\pard \qj\sb240\keepn\intbl AG\par
[POS: 0]\par
\pard \qj\keepn\intbl [feet: 1]\par
[weight: light]\par
[head:-]\cell \pard \qj\sb240\keepn\intbl \cell \pard \qj\sb240\keepn\intbl AG\par
[POS: 1]\par
\pard \qj\keepn\intbl [feet: 2]\par
[weight: heavy]\par
[head:-]\cell \pard \qj\sb240\keepn\intbl \cell \pard \qj\sb240\keepn\intbl AG\par
[POS: 2]\par
\pard \qj\keepn\intbl [feet: 2]\par
[weight: heavy]\par
[head:+]\cell \pard \qj\sb240\keepn\intbl \cell \pard \qj\sb240\intbl \cell \cell \pard \intbl \row \trowd \trgaph80\trleft-80 \clmgf\clbrdrt\brdrs \clbrdrl\brdrs \clbrdrb\brdrs \clshdng0\cellx2314\clmrg\clbrdrt\brdrs \clbrdrb\brdrs \clshdng0\cellx2314
\clbrdrt\brdrs \clbrdrl\brdrs \clbrdrb\brdrs \clshdng0\cellx3511\clbrdrt\brdrs \clbrdrl\brdrs \clbrdrb\brdrs \clshdng0\cellx4708\clmgf\clbrdrt\brdrs \clbrdrl\brdrs \clbrdrb\brdrs \clshdng0\cellx7102\clmrg\clbrdrt\brdrs \clbrdrb\brdrs \clshdng0\cellx7102
\clmgf\clbrdrt\brdrs \clbrdrl\brdrs \clbrdrb\brdrs \clshdng0\cellx9496\clmrg\clbrdrt\brdrs \clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx9496\pard \qj\sb240\keepn\intbl F\par
[head:-]\par
\pard \qj\keepn\intbl [weak]\par
[light]\cell \pard \qj\sb240\keepn\intbl \cell \pard \qj\sb240\keepn\intbl F\par
[head:+]\par
\pard \qj\keepn\intbl [strong]\par
[heavy]\cell \pard \qj\sb240\keepn\intbl F\par
[head:-]\par
\pard \qj\keepn\intbl [weak]\par
[heavy]\cell \pard \qj\sb240\keepn\intbl F\par
[head:+]\par
\pard \qj\keepn\intbl [strong]\par
[heavy]\cell \pard \qj\sb240\keepn\intbl \cell \pard \qj\sb240\keepn\intbl F\par
[head:-]\par
\pard \qj\keepn\intbl [strong]\par
[heavy]\cell \pard \qj\sb240\keepn\intbl \cell \pard \intbl \row \trowd \trgaph80\trleft-80 \clbrdrt\brdrs \clbrdrl\brdrs \clbrdrb\brdrs \clshdng0\cellx1117\clbrdrt\brdrs \clbrdrl\brdrs \clbrdrb\brdrs \clshdng0\cellx2314\clbrdrt\brdrs \clbrdrl\brdrs
\clbrdrb\brdrs \clshdng0\cellx3511\clbrdrt\brdrs \clbrdrl\brdrs \clbrdrb\brdrs \clshdng0\cellx4708\clbrdrt\brdrs \clbrdrl\brdrs \clbrdrb\brdrs \clshdng0\cellx5905\clbrdrt\brdrs \clbrdrl\brdrs \clbrdrb\brdrs \clshdng0\cellx7102\clbrdrt\brdrs \clbrdrl
\brdrs \clbrdrb\brdrs \clshdng0\cellx8299\clbrdrt\brdrs \clbrdrl\brdrs \clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx9496\pard \qj\sb240\keepn\intbl S\par
[head:-]\par
\pard \qj\keepn\intbl [weak]\par
[light]\cell \pard \qj\sb240\keepn\intbl S\par
[head:-]\par
\pard \qj\keepn\intbl [weak]\par
[light]\cell \pard \qj\sb240\keepn\intbl S\par
[head:+]\par
\pard \qj\keepn\intbl [strong]\par
[light]\cell \pard \qj\sb240\keepn\intbl S\par
[head:+]\par
\pard \qj\keepn\intbl [strong]\par
[heavy]\cell \pard \qj\sb240\keepn\intbl S\par
[head:+]\par
\pard \qj\keepn\intbl [strong]\par
[light]\cell \pard \qj\sb240\keepn\intbl S\par
[head:-]\par
\pard \qj\keepn\intbl [weak]\par
[light]\cell \pard \qj\sb240\keepn\intbl S\par
[head:+]\par
\pard \qj\keepn\intbl [strong]\par
[heavy]\cell \pard \qj\sb240\keepn\intbl S\par
[head:-]\par
\pard \qj\keepn\intbl [weak]\par
[heavy]\cell \pard \intbl \row \pard \sb240\keepn\intbl {\i in\cell }\pard \sb240\keepn\intbl {\i the\cell well-\cell known\cell maths\cell de-\cell }\pard \sb240\keepn\intbl {\i -part-\cell }\pard \sb240\keepn\intbl {\i -ment\cell }\pard \intbl \row
\trowd \trgaph80\trleft-80 \clbrdrl\brdrs \clbrdrb\brdrs \clshdng0\cellx1117\clbrdrl\brdrs \clbrdrb\brdrs \clshdng0\cellx2314\clmgf\clbrdrl\brdrs \clbrdrb\brdrs \clshdng0\cellx4708\clmrg\clbrdrb\brdrs \clshdng0\cellx4708\clbrdrl\brdrs \clbrdrb\brdrs
\clshdng0\cellx5905\clmgf\clbrdrl\brdrdot \clbrdrb\brdrs \clshdng0\cellx9496\clmrg\clbrdrb\brdrs \clshdng0\cellx9496\clmrg\clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx9496\pard \qj\sb240\keepn\intbl Prep.\cell \pard \qj\sb240\keepn\intbl Det.\cell \pard
\qj\sb240\keepn\intbl Adj.\cell \pard \qj\sb240\keepn\intbl \cell \pard \qj\sb240\keepn\intbl N(N\cell \pard \qj\sb240\keepn\intbl N)\cell \pard \qj\sb240\intbl \cell \cell \pard \intbl \row \pard \qc\sb240
Fig. XX: Partial metrical and syntactic structure of {\i in the well-known maths department}. {\b \par
}\pard \qj\sb240 {\b 11.\tab Ambisyllabicity}
: Ambisyllabicity means that a consonant can belong simultaneously to two adjacent syllables. Formally, ambisyllabicity is represented as re-entrant nodes at the terminal level: i.e. a consonant may be ultimately dominated by two syllable nodes at once, by being in the coda of one syllable and in the onset of the next. Constituents which are shared between syllables are marked [ambisyllabic:+]. Ambisyllabicity makes it easier to model coarticulation [4] and is essential for establishing the correct temporal relations between adjacent syllables. It is also used to predict spectral properties such as plosive aspiration in intervocalic clusters (\'a4XX).\par
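The sharing condition in points 11-12 can be sketched as follows; the legal-onset and legal-coda sets here are illustrative stand-ins, not ProSynth's actual phonotactic constraints:

```python
# Toy sketch of the ambisyllabicity condition: a medial consonant or
# cluster may be shared between syllables only if it is both a
# legitimate coda and a legitimate onset. The sets below are
# illustrative, not the real constraint system.

LEGAL_ONSETS = {"v", "s", "sk", "l", "m"}
LEGAL_CODAS = {"v", "s", "sk", "m", "n"}

def ambisyllabic(cluster: str) -> bool:
    return cluster in LEGAL_CODAS and cluster in LEGAL_ONSETS

print(ambisyllabic("v"))    # True:  'loving'   -> /v/ shared
print(ambisyllabic("vl"))   # False: 'loveless' -> no sharing
print(ambisyllabic("sk"))   # True:  'risky'    -> /sk/ fully shared
```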
\pard \qj\sb240 12.\tab Constituents are [ambisyllabic:+] wherever this does not result in a breach of syllable structure constraints. {\i Loving} comprises two Syllables, /{\f12407 l\'8bv}/ and /{\f12407 vIN}/, since /{\f12407 v}
/ is both a legitimate coda for the first syllable, and a legitimate onset for the second. {\i Loveless} has no ambisyllabicity, since /{\f12407 vl}/ is neither a legitimate onset nor a legitimate coda. Clusters may be entirely ambisyllabic, as in {\i
risky} (/{\f12407 rIsk}/+/{\f12407 ski}/), where /{\f12407 sk}/ is a legitimate coda and onset cluster; partially ambisyllabic (i.e. one consonant is [ambisyllabic:+], and one is [ambisyllabic:-]), as in {\i selfish} (/{\f12407 sElf}/+/{\f12407 fIS}
/), or not ambisyllabic as in {\i risk them} (/{\f12407 rIsk}/+/{\f12407 D\'abm}/).{\b \par
}\pard\plain \s255\qj\sb360\keepn \b\f21\fs28 \sect \sectd \sbknone\linemod0\linex0\headery709\footery709\cols1\colsx289 \pard\plain \s255\qj\sb360\keepn \b\f21\fs28 4. Implementation\par
\pard\plain \qj\sb240 \f20
This section describes how the phonological model presented in the previous section has been implemented computationally. It covers the database used for the spectral, temporal and intonational modelling, the architecture, and the use of XML to represent
linguistic structure.\par
\pard\plain \s254\qj\sb360\keepn \b\i\f21 4.1 Database\par
\pard\plain \qj\sb240\tx0 \f20 1.\tab Analysis for modelling is based on a speech database of over 600 utterances, recorded by a single male speaker of southern British English. Database speech files have been e
xhaustively labelled to identify segmental and prosodic constituent boundaries, using hand\_correction of an automated procedure. F0 contours, calculated from a simultaneously recorded Laryngograph signal, can be displayed time\_
aligned with constituent boundaries.\par
\pard \qj\sb240 2.\tab The database comprises a subset of possible linguistic structures, each with a number of exemplars which together offer a wide range of the types of systematic variation of interest. Each utterance consists of one IP, and up to {\b
three} AGs. The fo
ot-types within the AG are varied, according to the weight of the head syllable, the number and type of consonants in the onset and rhyme, the syllabic affiliation of intervocalic consonants, and vowel length. There are also phrases containing segments who
se secondary resonance is expected to spread, and some which are expected to block the spreading of such effects.\par
\pard \qj\sb240 3.\tab The database thus provides us with material for analysis of the spectral, temporal and intonational phenomena we aim to synthesise. \par
\pard\plain \s254\qj\sb360\keepn \b\i\f21 4.2 Architecture\par
\pard\plain \qj\sb240 \f20 1.\tab
ProSynth uses an open computational architecture for synthesis. There is a clear separation between the computational engine and the computational representations of data and knowledge. The overall architecture is shown in Fig. XX. \par
\pard \qc\sb240\keepn {\fs20 {\pict\macpict\picw426\pich156
065000000000009c01aa001102ff0c00ffffffff000000000000000001aa0000009c00000000000000a100640010474449310001ffffffff000000000000001e0001000a00000000009c01aa00600014006e002400c00000005a0068010e005a0068005a005a006800b4005a00600032006e004200c0005a005a006800b400
5a0001000a001b006d0039006f0022fffd006e005a0001000a001b00be003900c0000900b7088800b708700022fffd00bf005a00a100640010474449310001ffffffff01000000000000a100640010474449310001ffffffff0000000000000001000a00000000009c01aa0060001400e3002401350000005a0068010e005a
0068005a005a006800b4005a0060003200e300420135005a005a006800b4005a0001000a001b00e2003900e40022fffd00e3005a0001000a001b0133003901350022fffd0134005a00a100640010474449310001ffffffff0100000000000001000a00000000009c01aa002c000c00150948656c7665746963610003001500
0d000b002e000400000000002b8632074c657869636f6e0028002c00f60c4465636c617261746976652000002a0c096b6e6f776c6564676500001aff00ff00ff00001bff00ff00ff000009ffffffffffffffff0031005c006e008100c0001a0000000000000038001aff00ff00ff000031005c00e30081013e001a00000000
0000003800280071007e0b436f6d706f736974696f6e0029780e496e746572707265746174696f6e001aff00ff00ff00000b001b001b004100020158003001aa001a0000000000000048001aff00ff00ff00004100380158006601aa001a0000000000000048001aff00ff00ff000041006e0158009c01aa001a0000000000
0000480028000f016b074d42524f4c4120002b050c08646970686f6e65200000280027016e0973796e746865736973002b051e06484c73796e200000280051015d1371756173692d6172746963756c61746f727920002b110c0973796e746865736973002b021e0950726f736f647920200028008701690c6d616e6970756c
617465642000002b0a0c065370656563680000700022004a00020080004a005c0002008000020080004a004a004a005c0002005c0002002800620016084d61726b6564202000002a0c047465787400a100640010474449310001ffffffff00000000000000710022006a005f0071006e006e006e0071005f006e005f006a00
5f006e006e006e006e0001000a006d004a006f005f0022006e00353f0000a100640010474449310001ffffffff01000000000000a100640010474449310001ffffffff0000000000000001000a00000000009c01aa00710022004d008e005c0095005c0092004d008e004d0092004d0095005c0092005c00920001000a0041
0091004d0093002200350092002400a100640010474449310001ffffffff01000000000000a100640010474449310001ffffffff0000000000000001000a00000000009c01aa00710022004d010c005c0113005c0110004d010c004d0110004d0113005c0110005c01100001000a0041010f004d0111002200350110002400
a100640010474449310001ffffffff01000000000000a100640010474449310001ffffffff0000000000000001000a00000000009c01aa00710022006a00d4007100e3006e00e3007100d4006e00d4006a00d4006e00e3006e00e30001000a006d00bf006f00d40022006e00aa3f0000a100640010474449310001ffffffff
01000000000000a100640010474449310001ffffffff0000000000000001000a00000000009c01aa00710022001d0150002b0158001d0158002b0156002a015300290150001d0158001d01580001000a002a013d006e0153002000b20127ffe6016900a100640010474449310001ffffffff01000000000000a10064001047
4449310001ffffffff0000000000000001000a00000000009c01aa007100220053014b005f015800530158005f0150005c014e005a014b00530158005301580001000a005c013d006e014e00220080012c33ca00a100640010474449310001ffffffff01000000000000a100640010474449310001ffffffff000000000000
0001000a00000000009c01aa00710022007c014b00890158008901580081014b007f014e007c015000890158008901580001000a006e013d007f014e0022005d012c333300a100640010474449310001ffffffff01000000000000ff}}\par
\pard \qc\sb240 Fig. XX: ProSynth synthesis architecture.\par
\pard \qj\sb240 2.\tab Text marked for the type and placement of accents is input to the system, and a pronunciation lexicon is used to construct the strictly layered metrical structure for each intonational phrase in turn. The overall utter
ance is then represented as a hierarchy, as described in more detail in Section 3.\par
\pard \qj\sb240 3.\tab
The interpreted structure is converted to a parametric form depending on the signal generation method. The phonetic descriptions and timing can be used to select diphones and express their durations and pitch contours for output with the MBROLA system (Dutoit et al ref). The phonetic details can also be used to augment copy-synthesis parameters for the HLsyn quasi-articulatory formant synthesiser (Heid & Hawkins 1999, in press)
. The timings and pitch information have also been used to manipulate the prosody of natural speech using PSOLA (Hamon et al. ref).\par
\pard\plain \s254\qj\sb360\keepn \b\i\f21 4.3 Linguistic Representation and Modelling\par
\pard\plain \qj\sb240 \f20 1.\tab
The Extensible Markup Language (XML) is an extremely simple dialect of SGML (Standard Generalised Markup Language), the goal of which is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. XML is a standard proposed by the World Wide Web Consortium for industry\endash specific mark\endash up in a number of applications, such as: vendor\endash neutral data exchange, media\endash independent publishing, collaborative authoring, the processing of documents by intelligent agents, and other metadata applications [Ref1]. \par
\pard \qj\sb240 2.\tab
XML is used as the external data representation for the phonological structures in ProSynth. The features of XML which make it ideal for this application are: storage of hierarchical information expressed in nodes with attributes; a standard text\endash
based format suitable for networking; a strict and formal syntax; facilities for the expression of linkage between parts of the structure; and readily\endash available software support. \par
3.\tab
In the ProSynth system, the input word sequence is converted to an XML representation which then passes through a number of stages representing phonetic interpretation. A declarative knowledge representation is used to encode knowledge of phonetic inter
pretation and to drive transformation of the XML data structures. Finally, special purpose code translates the XML structures into parameter tables for signal generation. \par
\pard \qj\sb240 4.\tab XML is used to encode the following in ProSynth: \par
\pard \qj\sb240 {\b 5.\tab Word Sequences}: The text input to the synthesis system needs to be marked\endash
up in a number of ways. Importantly, it is assumed that the division into prosodic phrases and the assignment of accent types to those phrases has already been performed. This information is added to the text using a simple mark\endash
up of Intonational Phrases and Accent Groups (Section XX). \par
\pard \qj\sb240 {\b 6.\tab Lexical Pronunciations}: The lexic
on maps word forms to syllable sequences. Each possible pronunciation of a word form has its own entry comprising: SYLSEQ (i.e. syllable sequence), SYL, ONSET, RHYME, NUC, ACODA, CODA, VOC and CNS nodes. Information present in the input mark\endash
up, possibly derived from syntactic analysis, selects the appropriate pronunciation for each word form. \par
{\b 7.\tab Prosodic Structure}: Each composed utterance comprising a single intonational phrase is stored in a hierarchy of: UTT, WORDSEQ, WORD, IP, AG, FOOT, SYL, ONSET, RHYME, NUC, CODA, ACODA, VOC and CNS nodes. Syllables are cross\endash
linked to the word nodes using linking attributes. This allows for phonetic interpretation rules to be sensitive to the grammatical function of a word as well as to the position of the syllable in the word. \par
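The cross-linking described above can be sketched with Python's standard ElementTree. The WREF attribute follows the listing in Fig. XX; the ID attribute on WORD and the fragment itself are hypothetical simplifications:

```python
# Sketch of resolving a syllable-to-word cross-link. WREF appears in the
# XML listing below; the WORD ID attribute here is an assumed stand-in.
import xml.etree.ElementTree as ET

utt = ET.fromstring(
    '<UTT>'
    '<WORDSEQ><WORD ID="WORD4" POS="N">bloom</WORD></WORDSEQ>'
    '<IP><AG><FOOT><SYL WREF="WORD4" WPOS="1"/></FOOT></AG></IP>'
    '</UTT>')

# Index words by ID, then follow the syllable's WREF link, so that
# interpretation rules can see the word's grammatical function.
words = {w.get("ID"): w for w in utt.iter("WORD")}
syl = utt.find(".//SYL")
word = words[syl.get("WREF")]
print(word.text, word.get("POS"))  # bloom N
```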
\pard \qj\sb240 {\b 8.\tab Database Annotation}
: The ProSynth database has been manually annotated and a prosodic structure complete with timing information has been constructed for each phrase. This annotation is stored in XML using the same format as for synthesis.
Tools for searching this database help us in generating knowledge for interpretation. \par
\pard \qj\sb240 9.\tab
As described in Section XX, ambisyllabicity is a particular case of re-entrancy in a DAG. Since XML rigidly enforces a strict hierarchy of components, it is necessary to duplicate and link nodes in order to represent ambisyllabicity in XML. \par
10.\tab An extract of a prosodic structure expressed in XML is shown in Figure XX, taken from the phrase \ldblquote Come with a bloom\rdblquote (see Fig. XX for another representation of this information).
(In the XML representations, Y/N are used in place of the +/- used elsewhere in the text.)\par
\pard \qj\sb240 {\f22\fs18 <FOOT DUR="1" START="0.5561" STOP="1.0883">\par
}\pard \qj {\f22\fs18 \par
<SYL DUR="1" FPOS="1" RFPOS="1" RWPOS="1" START="0.5561" STOP="1.0883"\par
STRENGTH="STRONG" WEIGHT="HEAVY" WPOS="1" WREF="WORD4">\par
\par
}\pard \qj\li720 {\f22\fs18 <ONSET DUR="1" START="0.5561" STOP="0.7341" STRENGTH="STRONG">\par
}\pard \qj\li720 {\f22\fs18 <CNS AMBI="N" CNSCMP="N" CNSGRV="Y" CNT="N" DUR="1" NAS="N" RELEASE="0.6565" RHO="N" SON="N" START="0.5561" STOP="0.6670" STR="N" VOCGRV="N" VOCHEIGHT="CLOSE" VOCRND="N" VOI="Y">b</CNS>\par
}\pard \qj\li720 {\f22\fs18 <CNS AMBI="N" CNSCMP="N" CNSGRV="N" CNT="Y" DUR="1" NAS="N" RHO="N" SON="Y" START="0.6670" STOP="0.7341" STR="N" VOCGRV="N" VOCHEIGHT="CLOSE" VOCRND="N"\par
}\pard \qj\li720 {\f22\fs18 VOI="Y">l</CNS>\par
</ONSET>\par
}\pard \qj {\f22\fs18 \par
}\pard \qj\li720 {\f22\fs18 <RHYME CHECKED="Y" DUR="1" START="0.7341" STOP="1.0883" STRENGTH="STRONG"\par
VOI="Y" WEIGHT="HEAVY">\par
}\pard \qj\li1440 {\f22\fs18 \par
}\pard \qj\li1440 {\f22\fs18 <NUC CHECKED="Y" DUR="1" LONG="Y" START="0.7341" STOP="0.9126" STRENGTH="STRONG" VOI="Y" WEIGHT="HEAVY">\par
<VOC DUR="1" FXGRD="-251.2" FXMID="126.7" GRV="Y" HEIGHT="CLOSE" RND="Y" START="0.7341" STOP="0.8234">u</VOC>\par
<VOC DUR="1" FXGRD="-171.1" FXMID="105.4" GRV="Y" HEIGHT="CLOSE" RND="Y" START="0.8234" STOP="0.9126">u</VOC>\par
}\pard \qj\li1440 {\f22\fs18 </NUC>\par
\par
<CODA DUR="1" START="0.9126" STOP="1.0883" VOI="Y">\par
}\pard \qj\li1440 {\f22\fs18 <CNS AMBI="N" CNSCMP="N" CNSGRV="Y" CNT="N" DUR="1" NAS="Y" RHO="N" SON="Y" START="0.9126" STOP="1.0883" STR="N" VOCGRV="Y" VOCHEIGHT="CLOSE" VOCRND="Y"\par
}\pard \qj\li1440 {\f22\fs18 VOI="Y">m</CNS>\par
</CODA>\par
}\pard \qj\li720 {\f22\fs18 </RHYME>\par
}\pard \qj {\f22\fs18 </SYL>\par
</FOOT>\par
}\pard \qc\sb240 Fig XX. Partial XML representation of utterance: \ldblquote Come with a bloom\rdblquote , as represented in Fig. XX.\par
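A structure like the one in the figure can be read with Python's standard ElementTree; the attribute names (START, STOP, STRENGTH, WEIGHT) follow the listing above, and the fragment here abbreviates it to the onset:

```python
# Reading an abbreviated version of the XML fragment above and deriving
# consonant durations in milliseconds from the START/STOP times.
import xml.etree.ElementTree as ET

fragment = """
<FOOT DUR="1" START="0.5561" STOP="1.0883">
  <SYL DUR="1" START="0.5561" STOP="1.0883" STRENGTH="STRONG" WEIGHT="HEAVY">
    <ONSET DUR="1" START="0.5561" STOP="0.7341" STRENGTH="STRONG">
      <CNS AMBI="N" START="0.5561" STOP="0.6670" VOI="Y">b</CNS>
      <CNS AMBI="N" START="0.6670" STOP="0.7341" VOI="Y">l</CNS>
    </ONSET>
  </SYL>
</FOOT>
"""

root = ET.fromstring(fragment)
for cns in root.iter("CNS"):
    dur_ms = (float(cns.get("STOP")) - float(cns.get("START"))) * 1000
    print(cns.text, round(dur_ms, 1))
# b 110.9
# l 67.1
```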
\pard\plain \s254\qj\sb360\keepn \b\i\f21 4.4 Knowledge Representation\par
\pard\plain \qj\sb240 \f20 1.\tab Knowledge for phonetic interp
retation is expressed in a declarative form that operates on the prosodic structure. This means firstly that the knowledge is expressed as unordered rules, and secondly that it operates solely by manipulating the attributes on the XML encoded phonological
structure. To encode such knowledge a representational language called ProXML was developed in which it is easy to express the hierarchical contexts which drive processing and to make the appropriate changes to attributes. The ProXML language is read by an
interpreter PRX written in C which takes XML on its input and produces XML on its output. ProXML is a very simple language modelled on both C and Cascading Style Sheets (see [Ref2] for more information). A ProXML script consists of functions which are nam
ed after each element type in the XML file (each node type) and which are triggered by the presence of a node of that type in the input. When a function is called to process a node, a context is supplied centered on that node so that reference to parent, c
hild and sibling nodes is easy to express. \par
\pard \qj\sb240 2.\tab
Figure XX shows a simple example of a ProXML script to adjust syllable durations for strong syllables in a disyllabic word whose second and final syllable is weak. If the first syllable is heavy, the rule is dependent on the length of the vowel. In this
example, the DUR attribute on SYL nodes is set as a function of the phonological attributes found on that node and on others in the hierarchy. Note that the rules modify the duration attribute (*= means scale ex
isting value) rather than set it to a specific value. In this way, the declarative aspect of the rule is maintained. The compression factors in the script are computed from regression tree data taken from the ProSynth database (see Section XX).\par
\pard \qj\li1440\sb240 {\f22\fs18 SYL \{\par
}\pard \qj\li1440 {\f22\fs18 if ((:STRENGTH=="STRONG")&&(:WPOS=="1")&&(:RWPOS=="2")\par
&&(../SYL[2]:WEIGHT=="LIGHT"))\par
if (:WEIGHT=="HEAVY")\par
if (./RHYME/NUC:LONG=="Y")\par
:DUR *= 1.0884;\par
else\par
:DUR *= 1.1420;\par
else\par
:DUR *= 0.8274;\par
\}\par
}\pard \qc\sb240 Fig. X: Example ProXML script, which modifies syllable durations dependent on the syllable level and nucleus level attributes.\par
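For comparison, the same rule can be written as a Python function over the XML attributes directly, using the standard ElementTree; the compression factors are those given in the ProXML script, while the test fragment is illustrative:

```python
# Python analogue of the ProXML rule in Fig. X: scale DUR on a strong
# first syllable of a disyllabic word whose second syllable is light.
# Factors 1.0884, 1.1420 and 0.8274 are taken from the script above.
import xml.etree.ElementTree as ET

def adjust_syl_durations(word):
    syls = word.findall("SYL")
    for syl in syls:
        if (syl.get("STRENGTH") == "STRONG" and syl.get("WPOS") == "1"
                and syl.get("RWPOS") == "2"
                and len(syls) > 1 and syls[1].get("WEIGHT") == "LIGHT"):
            if syl.get("WEIGHT") == "HEAVY":
                nuc = syl.find("./RHYME/NUC")
                long_v = nuc is not None and nuc.get("LONG") == "Y"
                factor = 1.0884 if long_v else 1.1420
            else:
                factor = 0.8274
            # '*=': scale the existing value, keeping the rule declarative.
            syl.set("DUR", str(float(syl.get("DUR")) * factor))

word = ET.fromstring(
    '<WORD><SYL DUR="1" STRENGTH="STRONG" WPOS="1" RWPOS="2" WEIGHT="HEAVY">'
    '<RHYME><NUC LONG="Y"/></RHYME></SYL>'
    '<SYL DUR="1" STRENGTH="WEAK" WPOS="2" RWPOS="1" WEIGHT="LIGHT"/></WORD>')
adjust_syl_durations(word)
print(word.find("SYL").get("DUR"))  # 1.0884
```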
\pard\plain \s255\qj\sb360\keepn \b\f21\fs28 \sect \sectd \sbknone\linemod0\linex0\headery709\footery709\cols1\colsx289 \pard\plain \s255\qj\sb360\keepn \b\f21\fs28 5. Modelling\par
\pard\plain \s254\qj\sb360\keepn \b\i\f21 5.1\tab Spectral detail\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 5.1.1\tab Segmental identity\par
\pard\plain \qj\sb240 \f20
Whichever type of synthesis output system is used, the immediate input comes from the XML file. For concatenative synthesis, we currently use the MBROLA system, with sound segments chosen in the standard way from the MBROLA inventory for British English. F
or formant synthesis, we use HLsyn driven by {\scaps procsy}, which is in part copy-synthesis from labelled speech files, and in part rule-driven from information in the XML file. Most formant trajectories for vowels and approximants are copy-synthesized, while obstruent consonants and some other sounds are produced by rule. {\scaps Procsy}
is described in detail by Heid and Hawkins (1999, in press).\par
\pard\plain \s253\qj\sb240\sa60\keepn \b\f20 5.1.2.\tab Fine-tuning spectral shape\par
\pard\plain \qj\sb240 \f20 1.\tab In concatenative synthesis, the task of fine-tuning spectral shape is achieved by selecting appropriate units. ProSynth as yet makes no attempt to improve upon the standard MBROLA unit
selection, but ultimately our work should have applications in unit selection inasmuch as it should increase our understanding of how factors such as long-domain resonance effects and grammatical dependencies influence spectral variation.\par
\pard \qj\sb240 2.\tab
When the parameters are set to appropriate values, HLsyn itself does much local fine-tuning of spectral shape automatically. In comparison with standard formant synthesizers, it is relatively straightforward to produce complex acoustic changes at segmen
t boundaries tha
t closely mimic those of natural speech. Most notably, HLsyn produces natural-sounding, perceptually-robust transitions between adjacent segments that differ in excitation type, such as the transition between vowels and voiced or voiceless stops or fricati
ves. This attribute of HLsyn means that some of the immediate appeal of concatenative synthesis\emdash natural-sounding, perceptually-robust transitions between adjacent segments, together with a pleasant voice quality\emdash
is also available in formant synthesis at little computational cost.\par
\pard \qj\sb240 3.\tab Although these types of acoustic fine-detail are relatively easily achievable using HLsyn, they have to be programmed to occur in only the right contexts. {\scaps Procsy }
provides the rules that do this. Some of the systematic variation is programmed by reference to the structure of the prosodic hierarchy, and some in the traditional way by reference to linear segmental context. Examples of prosodically-dependent rules incl
ude stress-dependent variations in the waveform amplitude envelope, and stress-dependent differences in excitation type in certain CVC sequences. For example, in Southern British English (SBE), the first CVC sequences of {\i today} and {\i to disappoint }
are spectrally very different from those of {\i turtle }and {\i tiddler}, as are the {\i tit} sequences in {\i attitude }and {\i titter}
. Examples of rules that rely mainly on local segmental context include coarticulation of nasality and the amount of voicing in the closure of voiced stops. These sorts of properties, though in need of more work, are reasonably well
understood and most are relatively straightforward to implement to a satisfactory standard.\par
\pard \qj\sb240 4.\tab
More challenging, because more subtle and less well understood, is the temporal extent of perceptually salient long-domain coarticulatory processes such as the resonance effects discussed in Section 2. For example, in SBE, /r/-colouring varies with vowe
l height and the number of consonants in the syllable onset, and spreads for at least two syllables on either side of the conditioning consonant, as long as those s
yllables are unstressed, and especially if they are in feet of 3 or more syllables (Tunley 1999). Thus, whereas strong /r/-colouring might be expected to be found throughout a phrase like {\i The tapestry bikini}
, it would be expected to be weak and confined only to {\i bad} and {\i rap} in a phrase like {\i The bad rap artist} (in a non-rhotic accent). Work by West (1999) is broadly supportive of these observations.\par
\pard \qj\sb240 5.\tab It is not yet known, however, what limits the spread of rhotic resonance effects. Some of our current efforts
are directed towards answering this question. For example, when an /r/ occurs in a context that is susceptible to /r/-colouring, such as the last syllable of {\i tapestry}
, is the resonance effect blocked by the next stressed syllable, or can it spread through into unstressed syllables of the adjacent foot? Just as low vowels show less susceptibility than high vowels, are some consonants (for example, velar stops) more like
ly to affect the spread of resonance effects than others? The way that resonance effe
cts are modelled in ProSynth will depend to a large extent on the answers to these questions. For example, if rhotic resonance effects are restricted to unstressed syllables in the foot or feet immediately adjacent to the conditioning /r/, then the feature
[rhotic] can be an attribute of the foot in the prosodic tree. If however these effects pass through stressed syllables into the next feet, then they might have to be modelled as an attribute of a level higher than the foot. (Preliminary evidence suggests
we should not rule out that possibility.) Finally, if some segments block the spread of resonance effects, even in unstressed syllables, then either the domain of the [rhotic] feature may be best placed below the foot, or else the acoustic realisation of
the feature must also take account of the segmental context in a relatively complicated way. \par
\pard \qj\sb240 6.\tab The temporal extent of systematic spectral variation due to coarticulatory processes is modelled using two intersecting principles. One reflects how much a giv
en allophone blocks the influence of neighbouring sounds, and is like coarticulation resistance (Bladon & Al-Bamerni 1976). The other principle reflects resonance effects, or how far coarticulatory effects spread. The extent of resonance effects depends on
a range of factors including syllabic weight, stress, accent, position in the foot, vowel height, and featural properties of other segments in the domain of potential influence. For example, intervening bilabials let lingual resonance effects spread t
o more distant syllables, whereas other lingual consonants may block their spread; similarly, resonance effects usually spread through unstressed but not stressed syllables.{\i \par
}\pard\plain \s254\qj\sb360\keepn \b\i\f21 5.2\tab Temporal modelling\par
\pard\plain \qj\sb240 \f20 1.\tab One of the goals of temporal modelling is to model English rhythms accurately. The ProSynth timing model is foot-based and for any given syllable takes into account (1)\~its strength, (2)\~its weight, (3)\~
its place in the foot, and (4)\~the strength and weight of adjacent syllables. Information about word boundaries is also available, allowing (e.g.) word-finality to influence the temporal interpretation of any syllable.\par
\pard \qj\sb240 2.\tab Abercrombie (ref.) describes two rhythms which are important for disyllabic words in the variety of English being modelled: (1)\~short-long: {\i happy, funny, city}; (2)\~equal-equal: {\i hamper, funding, seedy}
. The words with short-long rhythm have a light first syllable, while the words with equal-equal rhythm have a heavy first syllable. The second syllable vowels in the two sets are durationally different. Taking th
e vowels in the database as a whole, and looking specifically at utterance-final disyllabic feet with short vowels in the first syllable, it is found that the duration of the vowel of both the first and the second syllable is sensitive to the weight of the
first syllable (Table X). The duration of a second syllable after a heavy first syllable is 23% greater than after a light first syllable.\par
\pard \qj\sb240 \par
\trowd \trgaph80\trleft-80 \clbrdrt\brdrs \clbrdrl\brdrs \clbrdrb\brdrdb \clshdng0\cellx2680\clbrdrt\brdrs \clbrdrb\brdrdb \clshdng0\cellx5440\clbrdrt\brdrs \clbrdrb\brdrdb \clbrdrr\brdrs \clshdng0\cellx8200\pard \qc\sb240\intbl Weight of 1st syll\cell
Duration of 1st syll (ms)\cell Duration of 2nd syll (ms)\cell \pard \intbl \row \trowd \trgaph80\trleft-80 \clbrdrt\brdrdb \clbrdrl\brdrs \clshdng0\cellx2680\clbrdrt\brdrdb \clshdng0\cellx5440\clbrdrt\brdrdb \clbrdrr\brdrs \clshdng0\cellx8200\pard \qc\sb240\intbl
heavy\cell 381.2\cell 329.6\cell \pard \intbl \row \trowd \trgaph80\trleft-80 \clbrdrl\brdrs \clbrdrb\brdrs \clshdng0\cellx2680\clbrdrb\brdrs \clshdng0\cellx5440\clbrdrb\brdrs \clbrdrr\brdrs \clshdng0\cellx8200\pard \qc\sb240\intbl light\cell 276.2\cell
268.4\cell \pard \intbl \row \pard \qj\sb240 3.\tab As well as d
urational differences, there are also qualitative differences in the second-syllable vowels. The words of type (1) have diphthongised vowels, while the words of type (2) have monophthongal vowels. The implication of these results is that when the second sy
llable of words like these is phonetically interpreted, it is necessary to have information available about the strength and weight of the preceding syllable. Similar, but more complex, statements must also be made for longer feet.\par
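The 23% figure quoted in point 2 can be checked against the second-syllable means in Table X (durations assumed to be in milliseconds):

```python
# Arithmetic check of the 23% increase quoted above, using the Table X
# second-syllable means (ms, as listed in the table).
heavy_first = 329.6   # 2nd-syllable duration after a heavy 1st syllable
light_first = 268.4   # 2nd-syllable duration after a light 1st syllable

increase = (heavy_first - light_first) / light_first * 100
print(round(increase))  # 23
```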
\pard \qj\sb240 4.\tab As well as rhythmic properties, there are \lquote segmental\rquote
durational effects which relate to smaller stretches of speech but which (perhaps paradoxically) reflect higher levels of linguistic organisation. For example, Keating, Fougeron & Cho (LabPhon ref.) and Fougeron & Keating (J
ASA ref) have shown that the duration of various segment types is sensitive to at least three levels of structure in the prosodic hierarchy. Such observations provide further evidence that the accurate modelling of durations depends on having a rich phonol
ogical structure and that phonetic interpretation should access information from that structure. In other words, temporal phonetic interpretation is reliant on the informational richness which is encoded in the phonological structure.\par
\pard \qj\sb240 5.\tab
The temporal interpretation model is based on a CART (Classification and Regression Tree) analysis of the database, taking into account the phonological features in the prosodic hierarchy. CART analysis is succinctly described by van Santen (ref.):\par
\pard \qj\li720\sb240 6.\tab CART-based methods con
struct a tree by making binary splits on factors so as to minimise the variance of the durations in the two corresponding subsets. When a CART tree encounters a bundle of features not observed in the database, it can still find a path in the tree that up t
o some point matches the new feature bundle. This means that if nothing in the database matches the required pattern exactly, a near approximation will be found.\par
\pard \qr\sb240 van Santen ref.\par
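van Santen's description can be illustrated with a single greedy split in Python; the toy durations and feature names below are invented for illustration, not drawn from the ProSynth database:

```python
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Toy database: (feature bundle, vowel duration in ms). Values invented.
DATA = [
    ({"heavy", "strong"}, 380), ({"heavy", "weak"}, 330),
    ({"light", "strong"}, 275), ({"light", "weak"}, 270),
    ({"heavy", "strong"}, 385), ({"light", "weak"}, 265),
]

def best_split(items):
    """One CART step: choose the binary feature question that minimises
    the pooled variance of the durations in the two resulting subsets."""
    feats = {f for bundle, _ in items for f in bundle}
    def cost(f):
        yes = [d for b, d in items if f in b]
        no = [d for b, d in items if f not in b]
        if not yes or not no:          # a split must produce two subsets
            return float("inf")
        return len(yes) * variance(yes) + len(no) * variance(no)
    return min(feats, key=cost)

# Splitting on syllable weight explains most of the durational variance here.
print(best_split(DATA))
```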
\pard \qj\sb240 7.\tab The labelled waveforms of the database and their XML-parsed description
files are searched according to relevant feature information (e.g. syllable weight and strength), and a CART model is used to generalise across these data and generate duration statistics for feature bundles at given places in the phonological structure. The
resulting duration model can be used to drive MBROLA diphone synthesis, since it predicts the durations of acoustic segments.\par
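The hand-off to MBROLA is essentially a list of predicted segment durations; a minimal sketch of rendering such predictions as MBROLA .pho input lines (the phone names, durations and pitch target here are invented for illustration):

```python
# Predicted segment durations (ms) for a word; values are illustrative.
segments = [("h", 60), ("a", 120), ("p", 95), ("i", 180)]

def to_pho(segs, f0=120):
    """Render segments as MBROLA .pho lines: phone, duration (ms),
    then an optional (position-%, F0-Hz) pitch target."""
    return "\n".join(f"{ph} {dur} 50 {f0}" for ph, dur in segs)

print(to_pho(segments))
```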
\pard \qj\sb240 8.\tab The analysis model works top-down\emdash that is, it factors out first the effects of IP, then of AG, and so on, down the tree to the
features at the terminal level. This reflects the assumption that the IP, AG, Foot and Syllable are all levels of timing, and that details of lower-level differences (such as segment type) can be overlaid on details of higher-level differences (such as sy
llable weight and strength; the strength and weight of an adjacent syllable; etc.). The top-down model also has the effect of constraining search spaces. {\b EXAMPLE}
. The resulting timing model is such that each node in the hierarchy has a multiplicative compr
ession factor associated with it. An example of this has already been provided in Fig. XX. The fact that it is a multiplicative model means that the order in which the statements of temporal interpretation are applied is irrelevant. It also makes the model
compositional. \par
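The multiplicative scheme can be sketched as follows; the node factors are invented, and the point demonstrated is that because multiplication commutes, the order in which the statements of temporal interpretation are applied cannot affect the result:

```python
from math import prod
from itertools import permutations

# Hypothetical compression factors for the nodes dominating one segment.
factors = {"IP": 0.95, "AG": 1.10, "Foot": 0.90, "Syllable": 1.05}

base_ms = 100.0
duration = base_ms * prod(factors.values())

# Compositionality: every order of application gives the same duration.
for order in permutations(factors):
    d = base_ms
    for node in order:
        d *= factors[node]
    assert abs(d - duration) < 1e-9

print(f"{duration:.1f} ms")  # ~98.8 ms
```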
\pard \qj\sb240 9.\tab As an example, consider the interpretation of /p/ in {\i happy}. In order to interpret the /p/ accurately, the model refers to (at least) the following pieces of information:\par
\pard \qj\li720\sb240 \bullet \~/p/ is located in a Rhyme whose Nucleus contains a short open vowel\par
\pard \qj\li720 \bullet \~/p/ is [ambisyllabic:+] and is in the Coda of a [strong], [light] syllable and in the Onset of a weak syllable\par
\pard \qj\sb240 10.\tab Each of these facts\emdash along with other, higher-level ones\emdash affects the temporal interpretation of the /p/ in {\i happy}. Other bundles of phonological features are interpreted in the same structure-bound way.\par
\pard \qj\sb240 11.\tab This method of timing assumes that segment durations, as measured from the database, are in fact what a duration model must replicate. However, another way to look at the speech signal i
s to consider segments as an artifact of the temporal overlaying of phonetic parameters. This view of timing has been explored in earlier work, such as Coleman 1992, Local 1992, Ogden 1992 and Local & Ogden 1997. According to this model, higher-level const
ituents in the hierarchy are compressed, and their daughter nodes are compressed in the same way. The temporal interpretation of ambisyllabicity is the degree of overlap that exists between syllables, so an intervocalic consonant (typically ambisyllabic) h
as duration properties inherited from both the syllables it is in.\par
\pard \qj\sb240 12.\tab The temporal consequences of ambisyllabicity can be modelled by overlaying Syllable{\i\fs20\dn4 n} on to Syllable{\i\fs20\dn4 n-1}, thus setting its start point to be before the end of Syllable{\i\fs20\dn4 n-1}
. By overlaying syllables to varying degrees and making reference to ambisyllabicity, it is possible to lengthen or shorten intervocalic consonants systematically. There are morphologically related differences which can be modelled in this way, provided th
at the phonological structure is sensitive to them; the {\i mistakes} and {\i mistimes} example discussed in Section XX is one such instance. The Latinate prefix {\i in-}
is fully overlaid with the stem to which it attaches and is [ambisyllabic:+], giving a short nasal in {\i innocuous}, while the roughly synonymous Germanic prefix {\i un-}
is not overlaid to the same degree and is [ambisyllabic:-], giving a long nasal in {\i unknowing}. Future work will focus on integrating the segment-based and the more syllable-based approaches in the model.\par
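The overlaying of syllables described above can be sketched with explicit start and end times (durations and overlap values are invented); a larger overlap, as with the fully overlaid Latinate prefix, yields a shorter net stretch of intervocalic material:

```python
def overlay(syll_durs, overlaps):
    """Place each syllable so it starts `overlap` ms before the end of the
    previous one; returns (start, end) times. overlaps[i] is the overlap
    between syllable i-1 and syllable i (0 means no overlay)."""
    spans, t = [], 0.0
    for dur, lap in zip(syll_durs, overlaps):
        start = t - lap
        spans.append((start, start + dur))
        t = start + dur
    return spans

# Large overlap (cf. Latinate "in-"): utterance ends earlier, short nasal.
print(overlay([200, 250], [0, 80]))
# Small overlap (cf. Germanic "un-"): little sharing, long nasal.
print(overlay([200, 250], [0, 10]))
```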
\pard\plain \s254\qj\sb360\keepn \b\i\f21 5.3\tab Intonational modelling\par
\pard\plain \qj\sb240 \f20 1.\tab
We assume, in common with most theories of intonation, that the highly variable F0 contours encountered in natural speech can be analysed into component parts and classified according to a finite set of possible pitch melodies, which need to be defined
phonologically. There is, then, a dimension of paradigmatic choice in modelling intonation: the overall pitch pattern selected for an IP is not itself predictable from structure but is determined by discourse factors. Once that discourse-
based selection has been made, then a pitch accent specification can be assigned to each of the AGs within the IP. The pattern for an IP is thus composed of the pitch accents assigned to AGs, and of boundary tones associated with the edges of the IP domain
. For example, IP attributes will tell us (i) about position in discourse (initial, medial, final), (ii) about speech act function (declarative, interrogative, imperative), and (iii) about linguistic focus. The information in (i) is relevant to pitch range
and will be interpreted in terms of F0 scaling and boundary tone. Information in (ii) is used in determining the choice of pitch accents for the component AGs, whereas (iii) determines nuclear accent placement, and hence the AG structure itself, since the
nucleus must be located on the final AG of an IP (IPs being right-headed). By default, AGs are co-terminous with headed, heavy Feet (those beginning with stressed syllables), so that the intonation nucleus falls on the final such Foot; in context the focu
s may shift to an earlier Foot position, thus creating an AG constituent containing more than one Foot. In this case, since AGs are left-headed, the first Foot within the AG is the head of that AG and the domain for the nuclear pitch contour. {\b
(Examples available if required.) YES PLEASE! -RAO}\par
\pard \qj\sb240 2.\tab A discourse-final declarative IP, then, consisting of two well-formed (non-degenerate) AGs, would typically be assigned a relatively high accent in AG1, a falling nuclear pitch movement in AG2 and a low final bounda
ry tone (equivalent to H* H*L L% in ToBI-style notation).\par
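The default assignment just described can be sketched as a lookup; only the discourse-final declarative case from the text is filled in, and the function name and interface are invented:

```python
def assign_tones(position, act, n_ags):
    """Sketch of default pitch-accent assignment for an IP, returning a
    ToBI-style string. Only one discourse/speech-act case is covered."""
    if position == "final" and act == "declarative":
        accents = ["H*"] * (n_ags - 1) + ["H*L"]  # nuclear fall on last AG
        return " ".join(accents) + " L%"          # low final boundary tone
    raise NotImplementedError("other discourse/speech-act combinations")

print(assign_tones("final", "declarative", 2))  # H* H*L L%
```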
\pard \qj\sb240 3.\tab
The interpretation of the selected pitch contour in terms of F0 is, like other phonetic parameters, structure-dependent. Precise alignment of contour turning-points is constrained by the properties of units at lower levels in the hierarchy. In the ProSy
nth model, described in more detail in (ICPhS paper 1999 ref), nuclear pitch accents are defined in terms of a template based on a sequence of contour turning-points. These templates are in turn bas
ed on a set of essential parameters derived by automatic means from the Laryngograph recording used to calculate the F0 trace, and checked using informal listening tests to ensure that there was perceptual equivalence between natural F0 contours and those
constructed by linking the target points identified. For example, for a falling (H*L) pitch accent three crucial contour turning-points are identified: Peak ONset (PON), Peak OFfset (POF) and Level ONset (LON). In other words, we recognise that the {\i
peak} as
sociated with H* accents is often manifested as a plateau, with its own duration, rather than as a single peak: PON and POF represent the start and end of such a plateau, with POF therefore denoting the beginning of the F0 fall. LON occurs at the end of th
e fall, and is the point from which the low tone spreads till the end of voicing in the AG (cf. {\i phrase accent} (ref)). \par
\pard \qj\sb240 {\b ***Include suitable F0 plot as illustration, + ICPhS diagram with following procedural explanation*** \par
}\pard \qj\sb240 4.\tab Firstly, the location of the key
syllable components was established using the manual annotations. Then the peak F0 value in the accented syllable was found. The onset (PON) and the offset (POF) of the peak were then located by finding the range of times around the peak where the F0 value
was within 4% (approximating to a range for perceptual equality). The schematic representation below illustrates the search for PON and POF.\par
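The search for PON and POF can be sketched directly: locate the F0 maximum, then widen the span leftwards and rightwards while F0 remains within 4% of the peak value (the sampled F0 track below is invented):

```python
def peak_plateau(f0, tol=0.04):
    """Return (PON, POF) sample indices: the span around the F0 maximum
    within which F0 stays within `tol` (4%) of the peak value."""
    peak_i = max(range(len(f0)), key=lambda i: f0[i])
    floor = f0[peak_i] * (1 - tol)
    pon = peak_i
    while pon > 0 and f0[pon - 1] >= floor:    # widen leftwards
        pon -= 1
    pof = peak_i
    while pof < len(f0) - 1 and f0[pof + 1] >= floor:  # widen rightwards
        pof += 1
    return pon, pof

# Invented F0 samples (Hz) over an accented syllable:
f0 = [110, 118, 126, 129, 130, 129, 127, 120, 105, 95]
print(peak_plateau(f0))  # (2, 6): the near-level peak plateau
```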
\pard \qj\sb240 5.\tab The template turning-points are specified as attributes of the leftmost Foot (=head) within the AG. Statistical
analysis of the database suggests that the timing of all these points varies systematically with aspects of the structure of this Foot, such as its length in terms of number of component syllables, and characteristics of the onset and rhyme of the accented
syllable at its head. Many earlier studies of F0 alignment relate e.g. H* {\i peak}
timing to this accented syllable, rather than to the Foot (various refs). Early results suggest that it is possible to cut down on some of the variability by treating the Foot as the primary domain for the template.\par
\pard \qj\sb240 6.\tab
The patterns of alignment across structures which are observed for the single speaker model are consistent with those reported in the literature (see House & Wichmann 1996, Wichmann and House 1999 for summary). We claim that successful modelling of the
F0 values for this speaker, integrated with the same speaker\rquote s timing and spectral properties, enhances the coherence of the synthesised output. Acoustic-phonetic coherence will be further enhanced by incorporating m
icroprosodic perturbations of the F0 contour (Silverman XX), clearly observable for e.g. obstruent consonants on the ProSynth database.\par
\pard\plain \s255\qj\sb360\keepn \b\f21\fs28 \sect \sectd \sbknone\linemod0\linex0\headery709\footery709\cols1\colsx289 \pard\plain \s255\qj\sb360\keepn \b\f21\fs28 6.\tab Perceptual testing/experiments\par
\pard\plain \qj\sb240 \f20 {\b Construction site. Thank you Sarah. Everyone else: the telephone number of the best florist in York is 613044.\par
}\sect \sectd \sbknone\linemod0\linex0\headery709\footery709\cols1\colsx289 \pard\plain \s255\qj\sb360\keepn \b\f21\fs28 7. Conclusion\par
\pard\plain \qj\sb240 \f20 {\b I still haven\rquote t done anything on this. Depends on exptl results. No one\rquote s said anything about ideas below: are they OK?}\par
This needs work. I suggest just a couple of paras. Ideas for what to put in here gratefully received. My own thoughts:\par
\bullet \~\ldblquote informational richness\rdblquote is about (1)\~the speech signal containing systematic information that signals (2)\~complex linguistic structure.\par
\pard \qj\sb240 \bullet \~repeat that having properly structured linguistic knowledge has something essential to offer speech synthesis, at temporal, spectral and intonational levels of modelling. We\rquote
re suggesting an integrated, structure-based (i.e. prosodic) model.\par
\pard \qj\sb240 \bullet \~perhaps an indication of where we go next.{\ul \par
}\pard\plain \s255\qj\sb360\keepn \b\f21\fs28 \sect \sectd \sbknone\linemod0\linex0\headery709\footery709\cols1\colsx289 \pard\plain \s255\qj\sb360\keepn \b\f21\fs28 8. REFERENCES\par
\pard\plain \qj\sb240 \f20 {\b Still waiting for you all to do this.\par
}\pard \qj \par
Abercrombie (ref.) \par
Bird (1995)\par
Bregman 199xx\par
Dutoit et al ref\par
Elman and McClelland 1986\par
Fougeron & Keating (JASA ref) \par
Gobl and NiChasaide 19xx\par
Hamon et al.\par
Hawkins & Slater 1994\par
\pard \qj\li720 Hawkins, S. & Slater, A. \ldblquote Spread of CV and V-to-V coarticulation in British English: implications for the intelligibility of synthetic speech.\rdblquote {\i ICSLP} 94, 1: 57-60, 1994.\par
\pard \qj Hawkins and Nguyen in press {\b \par
Hawkins & Nguyen LabPhon: please double check in text.}\par
\pard \qj\li720 Hawkins, S., & Nguyen, N. \ldblquote Effects on word recognition of syllable-onset cues to syllable-coda voicing\rdblquote , {\i LabPhon VI}, York, 2-4 July 1998.\par
\pard \qj Heid & Hawkins ref., Jenolan Caves\par
\pard \qj\li720 Heid, S. & Hawkins, S. \ldblquote Automatic parameter-estimation for high-quality formant synthesis using HLSyn.\rdblquote Presented {\i at 3rd ESCA Workshop on Speech Synthesis}, Jenolan Caves, Australia, 1998.\par
\pard \qj Heid and Hawkins (under review)\par
Heid and Hawkins 1999\par
House & Hawkins (1995)\par
\pard \qj\li720 House, J. & Hawkins, S., \ldblquote An integrated phonological-phonetic model for text-to-speech synthesis\rdblquote , {\i Proc. ICPhS XIII}, Stockholm, Vol. 2, 326-329, 1995.\par
\pard \qj House & Wichmann 1996\par
Keating, Fougeron & Cho (LabPhon ref.)\par
Kelly and Local 1989\par
\pard \qj\li720 Kelly, J. & Local, J. {\i Doing Phonology.} Manchester: University Press, 1989.\par
\pard \qj Kwong and Stevens 1999\par
Local & Ogden (1997)\par
\pard \qj\li720 Local, J.K. & Ogden R. \ldblquote A model of timing for nonsegmental phonological structure.\rdblquote In Jan P.H. van Santen, R W. Sproat, J. P. Olive & J. Hirschberg (eds.) {\i Progress in Speech Synthesis}
. Springer, New York. 109-122, 1997.\par
\pard \qj Local (1992a)\par
\pard \qj\li720 Local, J.K. \ldblquote Modelling assimilation in a non-segmental rule-free phonology.\rdblquote In G J Docherty & D R Ladd (eds): {\i Papers in Laboratory Phonology II}. Cambridge: CUP, 190-223, 1992.\par
\pard \qj Local (1992b)\par
Local (1995a)\par
Local (1995b)\par
Manuel (1995)\par
Marslen-Wilson and Warren 199x\par
Ogden (1992)\par
Ogden, Local & Carter ref.\par
\'85hman\par
other wmw refs (Gaskell?)\par
Pierrehumbert 1990\par
Pisoni and Duffy 19xx\par
Pisoni in van Santen book\par
Pratt (1986)\par
Remez 19xx\par
Remez and Rubin 19xx {\i Science} paper\par
Repp\par
Rosen and Howell 19xx\par
Selkirk 1984\par
\pard \qj\li720 Selkirk, E. O., {\i Phonology and Syntax}, MIT Press, Cambridge MA, 1984.\par
\pard \qj Silverman, Y\par
Simpson 1992\par
Strange\par
Tunley 1999\par
van Santen ref.\par
van Tasell, Soli et al 19xx\par
West (1999) \par
Wichmann and House 1999\par
Zsiga (1995)\par
\par
\pard\plain \s15\qj\fi-284\li556\sb120\sl-219\tx560 \fs18 {\f20 1.\tab Hawkins, S. \ldblquote Arguments for a nonsegmental view of speech perception.\rdblquote }{\i\f20 Proc. ICPhS XIII}{\f20 , Stockholm. Vol. 3, 18-25, 1995.\par
2.\tab House, J. & Hawkins, S., \ldblquote An integrated phonological-phonetic model for text-to-speech synthesis\rdblquote , }{\i\f20 Proc. ICPhS XIII}{\f20 , Stockholm, Vol. 2, 326-329, 1995.\par
3.\tab Local, J.K. & Ogden R. \ldblquote A model of timing for nonsegmental phonological structure.\rdblquote In Jan P.H. van Santen, R W. Sproat, J. P. Olive & J. Hirschberg (eds.) }{\i\f20 Progress in Speech Synthesis}{\f20
. Springer, New York. 109-122, 1997.\par
4.\tab Local, J.K. \ldblquote Modelling assimilation in a non-segmental rule-free phonology.\rdblquote In G J Docherty & D R Ladd (eds): }{\i\f20 Papers in Laboratory Phonology II}{\f20 . Cambridge: CUP, 190-223, 1992.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 5.\tab Kelly, J. & Local, J. }{\i\f20 Doing Phonology.}{\f20 Manchester: University Press, 1989.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 6.\tab Hawkins, S., & Nguyen, N. \ldblquote Effects on word recognition of syllable-onset cues to syllable-coda voicing\rdblquote , }{\i\f20 LabPhon VI}{\f20 , York, 2-4 July 1998.\par
7.\tab Hawkins, S. & Slater, A. \ldblquote Spread of CV and V-to-V coarticulation in British English: implications for the intelligibility of synthetic speech.\rdblquote }{\i\f20 ICSLP}{\f20 94, 1: 57-60, 1994.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 8.\tab Tunley, A. \ldblquote Metrical influences on /r/-colouring in English\rdblquote , }{\i\f20 LabPhon VI}{\f20 , York, 2-4 July 1998.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 9.\tab Fixmer, E. and Hawkins, S. \ldblquote The influence of quality of information on the McGurk effect.\rdblquote Presented at Australian Workshop on Auditory-Visual Speech Processing, 1998.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 10.\tab Selkirk, E. O., }{\i\f20 Phonology and Syntax}{\f20 , MIT Press, Cambridge MA, 1984.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 11.\tab Broe, M. \ldblquote A unification-based approach to Prosodic Analysis.\rdblquote }{\i\f20 Edinburgh Working Papers in Cognitive Science}{\f20 \~7, 27-44, 1991.\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 12.\tab Bladon, R.A.W. & Al-Bamerni, A. \ldblquote Coarticulation resistance in English /l/.\rdblquote }{\i\f20 J. Phon}{\f20 4: 137-150, 1976.\par
13.\tab http://www.w3.org/TR/1998/REC-xml-19980210\par
14.\tab http://www.ltg.ed.ac.uk/\par
}\pard \s15\qj\fi-284\li556\sb120\sl-219\tx560 {\f20 15.\tab Heid, S. & Hawkins, S. \ldblquote Automatic parameter-estimation for high-quality formant synthesis using HLSyn.\rdblquote Presented }{\i\f20 at 3rd ESCA Workshop on Speech Synthesis}{\f20
, Jenolan Caves, Australia, 1998.\par
}\pard\plain \qj\sb240 \f20 [Ref1] http://www.w3.org/XML/\par
[Ref2] http://www.phon.ucl.ac.uk/project/prosynth.htm \par
\pard \qj\sb240 [Ref3] Klatt, D., (1979) "Synthesis by rule of segmental durations in English sentences", Frontiers of Speech Communication Research, ed B.Lindblom & S.\'85hman, Academic Press.\par
}
This archive was generated by hypermail 2.0b3 on Tue Sep 07 1999 - 16:56:18 BST