INVESTIGATING PEAK TIMING IN NATURALLY-OCCURRING SPEECH.

Department of Phonetics and Linguistics

INVESTIGATING PEAK TIMING IN NATURALLY-OCCURRING SPEECH: FROM SEGMENTAL CONSTRAINTS TO DISCOURSE STRUCTURE

Jill HOUSE and Anne WICHMANN¹

¹Anne Wichmann is currently at University of Central Lancashire, Preston

Abstract
This paper examines the alignment of fundamental frequency (f0) contours in speech with the segmental string. We first give an overview of past studies in this area, and then test the insights gained from these studies against a sample of naturally-occurring monologue. Our analysis of f0 peaks in sentence-initial position reveals effects of segmental and prosodic context consistent with earlier studies. In addition we observe further constraints on the timing of these accents, apparently related to topic structure. Topic-initiality appears to delay the f0 peak, sometimes to a position beyond the boundaries of the syllable itself. We discuss the theoretical implications for characterising the domain of tonal association.

The timing of f0 peaks in relation to accented syllables: what do we know?

1. Background
Since the early '80s, in parallel with the explosion of research into intonational phonology, there have been a number of studies investigating the alignment of fundamental frequency (f0) contours to the segmental string. Often these studies have been motivated by speech synthesis, where a better understanding of alignment should lead to algorithms which produce more natural-sounding synthetic speech. Other studies have had a more phonological intention -- for example, to establish a category distinction between tonal configurations. Languages for which detailed studies have been reported include American English, British English, German, Swedish, Dutch and Mexican Spanish². Insofar as there is comparability between such studies, it appears that many aspects of the phonetic realisation of f0 patterns may be considered to be language-independent.

²For example, American English: Steele (1986), Pierrehumbert & Steele (1989), Silverman & Pierrehumbert (1990), van Santen & Hirschberg (1994); British English: Silverman (1987), House (1989); German: Kohler (1983, 1987, 1990); Swedish: Bruce (1983, 1986, 1990); Dutch: Caspers & van Heuven (1993), Rietveld & Gussenhoven (1995); Mexican Spanish: Prieto et al (1995).

A number of factors have been identified which contribute systematically to a) the relative height of f0 points (typically f0 peaks within pitch accents) and b) the position of these points in time, in relation to some tone-bearing unit (typically identified as the accented syllable), in defined contexts. An underlying assumption has to be that we can reliably identify instances of a particular phonological f0 pattern, such as a specified pitch accent, and then examine its realisation in context. Some of the factors below bear directly on both the frequency and time domains; this paper is concerned primarily with the timing domain.

2. Factors known to influence the timing of tonal peaks
These were usefully summarised in Bruce (1990:107) as follows:

1. Tonal composition (phonological analysis of pitch accents -- whether analysed as mono- or bitonal, linked or unlinked tones, targets or gestures -- can influence the results of a phonetic analysis)
2. Prosodic context
   a. Boundaries (word, phrase, utterance, etc)
   b. Rhythmical organisation (rhythmical grouping, e.g. stress clash)
   c. Focus (prefocal, focal, postfocal position)
   d. Tonal environment (tonal interaction within and between successive pitch accents, e.g. tonal crowding)
   e. Pitch range (local or global, e.g. differences in degree of overall emphasis due to degree of involvement)
   f. Global intonation (e.g. absence/presence of downdrift due to interrogative/declarative structure)
3. Segmental context (e.g. differences in intrinsic vowel length)
4. Speaking rate (fast, normal, slow tempo).

To this list we would suggest adding the influence of higher-level prosodic organisation, generally referred to as "discourse intonation"; implicitly this falls under 2(a, f) above, as part of the wider prosodic context.

We can relate this list to the three levels of prosodic organisation identified by Silverman (1987) as jointly determining the shape of f0 contours:

(i) the lowest tier: segmental perturbations, or "microprosody";

(ii) the intermediate tier: "sentence intonation" -- the choice of tones which in sequence constitute the intonational phrase;

(iii) the highest level: some level of paragraph or topic organisation.

In (iii), Silverman recognises and begins to address the "discourse" issue.

3. Experimental evidence

3.1 Objectives and techniques
A number of experimental techniques have been employed to investigate matters of alignment, involving both production and perception studies. First of all, if we wish to investigate how particular f0 patterns, defined phonologically, are realised over different segmental contexts, we have to be confident that we have defined and identified our phonological categories correctly. We have to distinguish between shifts in alignment which cross category boundaries, and those which represent some kind of "allotonic" variation. It has often been observed that on the one hand, configurations which are formally very similar may be perceived as functionally quite different; or on the other hand, that patterns which look very different from each other are classed together on grounds of function by listeners. Section 4 summarises some studies which focus on the first of these problem areas, the issue of categorical distinctions; Section 5 discusses studies concerned with the second issue, allotonic variation within categories, in some cases exploring the interaction between contextual factors and category boundaries.

Production studies in this area typically involve one or more speaker subjects generating data by reading material specified by the researchers, controlled so as to elicit the particular prosodic patterns under scrutiny, under defined conditions. Measurements are then taken from the data, and analysed for systematic effects. Perception studies usually involve listener subjects categorising stimuli in which specified parameters have been controlled by the investigators, using resynthesised natural or synthetic speech. The usual disadvantages of unnatural laboratory speech must be outweighed at least in part by the need to eliminate unwanted variables. Much work has now been done to identify the key variables, but the picture remains incomplete.

Large speech corpora are becoming increasingly available. We propose that it may be valuable to use corpora to undertake complementary studies of what actually happens in natural speech that has not been controlled for its prosodic content. When we measure the timing and alignment of f0 contours, can we account for our findings in terms of the factors we have identified in laboratory studies? And if not, why not? What additional factors must be taken into consideration?

3.2 Measurement
The issue of measurement turns out to be far from trivial. Key questions include:

(i) What are the appropriate reference points in the segmental string? Should one be relating the alignment of f0 points to (say):

- the onset of the whole syllable?

- the onset of the vowel?

- the onset of voicing?

(ii) In identifying peaks (H) and troughs (L) in the f0 contour itself, in many cases the physical evidence is not clearcut. For example, the high or low value may be sustained over a period of time, and analysts must make a principled decision: should they choose a point at one or other edge of this sustained frequency, or select a mid-point?
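The choice raised in (ii) can be made explicit in a measurement routine. The following is a minimal sketch only, assuming that f0 has already been extracted as a list of frame times and frequency values; the function name `plateau_peak` and the tolerance of 2 Hz are hypothetical choices for illustration and are not taken from any of the studies cited. It returns the left edge, mid-point and right edge of a sustained peak, so that the analyst can adopt one of them consistently.

```python
from typing import List, Tuple


def plateau_peak(times_ms: List[float], f0_hz: List[float],
                 tolerance_hz: float = 2.0) -> Tuple[float, float, float]:
    """Return (left_edge, mid_point, right_edge) times of the f0 peak region.

    The peak region is the contiguous run of frames around the f0 maximum
    whose values lie within `tolerance_hz` of that maximum. The tolerance
    is an illustrative assumption, not an established standard.
    """
    if not times_ms or len(times_ms) != len(f0_hz):
        raise ValueError("times_ms and f0_hz must be non-empty and equal in length")

    # Index of the (first) frame carrying the maximum f0 value.
    peak_index = max(range(len(f0_hz)), key=f0_hz.__getitem__)
    threshold = f0_hz[peak_index] - tolerance_hz

    # Extend the region leftwards and rightwards while f0 stays near the maximum.
    left = peak_index
    while left > 0 and f0_hz[left - 1] >= threshold:
        left -= 1
    right = peak_index
    while right < len(f0_hz) - 1 and f0_hz[right + 1] >= threshold:
        right += 1

    mid_point = (times_ms[left] + times_ms[right]) / 2.0
    return times_ms[left], mid_point, times_ms[right]


# Invented contour with a short plateau around the maximum.
times = [0, 10, 20, 30, 40, 50, 60]
contour = [180.0, 195.0, 209.0, 210.0, 209.5, 200.0, 185.0]
print(plateau_peak(times, contour))  # (20, 30.0, 40)
```

Reporting all three values, rather than a single peak time, makes it easier to check afterwards how far a result depends on the edge-versus-mid-point decision.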

Clearly, researchers will want to make their choices about measurement so as to maximise the predictive power of their data. Studies are generally internally consistent, but differences between them, though small, may be significant enough to obscure comparability.
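To illustrate why the choice of reference point affects comparability, the sketch below computes the delay of a single f0 peak from each of the three candidate reference points listed in (i). The class `Landmarks`, the function `peak_delays` and all numerical values are hypothetical and serve only as an example; they assume that the landmarks have been hand-labelled in milliseconds.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Landmarks:
    """Hand-labelled segmental landmarks for one accented syllable (in ms)."""
    syllable_onset_ms: float
    vowel_onset_ms: float
    voicing_onset_ms: float


def peak_delays(landmarks: Landmarks, peak_ms: float) -> Dict[str, float]:
    """Delay of the f0 peak from each candidate reference point.

    A negative value means that the peak precedes the reference point.
    """
    return {
        "syllable_onset": peak_ms - landmarks.syllable_onset_ms,
        "vowel_onset": peak_ms - landmarks.vowel_onset_ms,
        "voicing_onset": peak_ms - landmarks.voicing_onset_ms,
    }


# Invented token: the same peak yields three different "peak delay" values.
token = Landmarks(syllable_onset_ms=100.0, vowel_onset_ms=180.0, voicing_onset_ms=195.0)
print(peak_delays(token, peak_ms=260.0))
# {'syllable_onset': 160.0, 'vowel_onset': 80.0, 'voicing_onset': 65.0}
```

Two studies reporting "peak delay" for this token could differ by 80 ms or more purely through their choice of reference point, which is the comparability problem at issue.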

4. Phonological category distinctions
It is no easy task to establish category boundaries between tonal configurations which are clearly differentiated in function at the extremes of a continuum. An example is when a constant f0 gesture is varied in the time domain -- an f0 peak or trough is shifted in time in relation to the accented syllable, giving rise to a different semantic interpretation at each end of the continuum. The question is whether this change in interpretation arises gradually or suddenly: is there a sharp category boundary at some point in the continuum?

Classic categorical perception techniques may be employed (labelling, discrimination), using resynthesised natural speech and time-shifting the f0 contour over an invariant segmental string. In proposing a labelling task, researchers must be confident that they can put some kind of semantic gloss on the proposed categories. This has been done notably for German by Kohler (1987), who found a strong category boundary between falling nuclear accents where the f0 peak preceded the stressed syllable, giving an early fall (= "established"), and those where the peak occurred at least 60 ms into the vowel of the stressed syllable, a late fall (= "new").
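In a labelling task of this kind, the category boundary is commonly estimated as the point on the stimulus continuum at which responses cross 50%. The sketch below shows one simple way of locating that point by linear interpolation between adjacent stimulus steps. It is an illustration only: the function name `crossover_point` is hypothetical, the response proportions are invented rather than taken from Kohler's data, and a fitted psychometric function would normally be preferred for real analysis.

```python
from typing import Optional, Sequence


def crossover_point(peak_positions_ms: Sequence[float],
                    proportion_late: Sequence[float],
                    criterion: float = 0.5) -> Optional[float]:
    """Estimate where the labelling function crosses `criterion`.

    `peak_positions_ms` are the stimulus steps (peak position relative to
    some reference point); `proportion_late` gives, for each step, the
    proportion of responses assigned to one of the two categories.
    Returns None if the responses never cross the criterion.
    """
    if len(peak_positions_ms) != len(proportion_late):
        raise ValueError("both sequences must have the same length")

    points = list(zip(peak_positions_ms, proportion_late))
    for (x0, p0), (x1, p1) in zip(points, points[1:]):
        if p0 == criterion:
            return float(x0)
        if (p0 - criterion) * (p1 - criterion) < 0:
            # Linear interpolation between the two steps that straddle the criterion.
            return x0 + (criterion - p0) * (x1 - x0) / (p1 - p0)
    if points and points[-1][1] == criterion:
        return float(points[-1][0])
    return None


# Invented identification data for a six-step continuum.
steps = [-60, -30, 0, 30, 60, 90]
responses = [0.0, 0.05, 0.2, 0.8, 0.95, 1.0]
boundary = crossover_point(steps, responses)  # close to 15 ms
```

A sharp rise in the response proportions around the estimated boundary, as in the invented data, is what a categorical interpretation would predict; a shallow, roughly linear rise would favour a gradient one.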

Changes in the timing of f0 gestures will normally be accompanied by other changes, such as to the durations of the underlying segments. Changes to the shape of f0 contours, such as the steepness of the falling or rising gestures, may be crucial in signalling different tonal categories, even where the alignment of an f0 peak may be constant. This issue is addressed by Kohler in his later perceptual work (Kohler 1990). In his experiments on "macro" intonation, stimulus utterances which are segmentally identical are differently interpreted depending on whether listeners hear stress on the prefix or stem of the verb form: "umlagern". Kohler finds that f0, segment duration and contour shape act as competing cues; in interpreting the results he differentiates between f0 as a cue to word-stress and its role in signalling a particular intonation pattern.

The target f0 patterns in Kohler's experiments were always realised in nuclear rather than pre-nuclear position, with syllables therefore subject to phrase-final lengthening. In his comments on this study, Silverman (1990) challenges the view that f0 has a direct stress-signalling function, preferring simply to assign early, mid and late peak contours to different pitch accents, while proposing that there are multiple cues to stress, of which f0 is only one. In a separate production experiment Silverman demonstrates that there are consistent durational differences between stressed and unstressed syllables in both pre-nuclear and nuclear position -- but discrepancies between speakers about how these durational differences are manifested (whether stress makes syllables longer or shorter).

Pierrehumbert & Steele (1989) investigate a different f0 configuration for American English: the "rise-fall-rise"; they propose a different phonological specification for the pattern with f0 peak aligned in the middle of the stressed vowel (L+H* L H%), and with the alignment late in the syllable (L*+H L H%). Again, the issue is whether this is a gradient or categorical distinction. Rather than adopting a classic categorical perception procedure, the authors set up a perception/production task, whereby subjects mimic renditions of an all-sonorant sentence in which the f0 peak has been systematically shifted in equal steps between early in the stressed syllable and late in the following, unstressed syllable. If subjects were able to mimic these stimuli fairly accurately, this would support a gradient interpretation; but if their productions clustered in a bimodal fashion, this would support the perception of two contrasting categories. In the event, four of the five subjects showed this bimodal effect, though one showed a centering tendency in his responses, suggesting that he only had a single category in his system.
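The logic of the mimicry task can be expressed as a comparison of two simple models of the relation between stimulus and response peak positions: a straight line (gradient interpretation) and a two-level step function (two categories). The sketch below is not the analysis used by Pierrehumbert & Steele; it is a hypothetical illustration, with invented function names and data, that compares the residual sum of squares of the two models.

```python
import math
from statistics import mean
from typing import Sequence


def rss_linear(stimuli: Sequence[float], responses: Sequence[float]) -> float:
    """Residual sum of squares of a least-squares line (gradient model)."""
    mx, my = mean(stimuli), mean(responses)
    sxx = sum((x - mx) ** 2 for x in stimuli)
    slope = sum((x - mx) * (y - my) for x, y in zip(stimuli, responses)) / sxx
    intercept = my - slope * mx
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(stimuli, responses))


def rss_step(stimuli: Sequence[float], responses: Sequence[float]) -> float:
    """Residual sum of squares of the best two-level step function.

    Every boundary between distinct stimulus values is tried; responses on
    each side are modelled by their own mean (two-category model).
    """
    pairs = sorted(zip(stimuli, responses))
    best = math.inf
    for cut in range(1, len(pairs)):
        if pairs[cut][0] == pairs[cut - 1][0]:
            continue  # the boundary must fall between distinct stimulus values
        left = [y for _, y in pairs[:cut]]
        right = [y for _, y in pairs[cut:]]
        rss = (sum((y - mean(left)) ** 2 for y in left)
               + sum((y - mean(right)) ** 2 for y in right))
        best = min(best, rss)
    return best


# Invented data: stimulus peak positions (ms) and two kinds of mimicry response.
stimuli = [0, 20, 40, 60, 80, 100, 120, 140]
gradient_speaker = [0, 20, 40, 60, 80, 100, 120, 140]
bimodal_speaker = [30, 32, 29, 31, 110, 112, 109, 111]

print(rss_linear(stimuli, gradient_speaker) < rss_step(stimuli, gradient_speaker))  # True
print(rss_step(stimuli, bimodal_speaker) < rss_linear(stimuli, bimodal_speaker))    # True
```

The two models have different numbers of parameters, so in practice the comparison would need a proper model-selection criterion; the sketch is intended only to make the contrast between the two predictions concrete.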

In the studies reported above, the reference point for f0 alignment was taken as the start of the vowel in the stressed syllable, though the choice of anchor point is relatively unimportant when the segmental content is kept constant. The location of f0 peaks in the resynthesised stimuli was controlled by the researchers, but equivalent peaks had to be identified by Pierrehumbert & Steele in analysing their production data. The sharp peak characteristic of the pattern under investigation simplified the task of locating the peak, though they found considerable problems in locating the preceding L tone consistently. A reliable crossover point between tonal categories may be found for a particular segmental string (Kohler 1987), but we cannot assume that the absolute value of this timing point will be the same in other segmental and prosodic contexts. The contribution of these other factors is what largely motivates the studies described below.

5. Tones in context
For a given inventory of accent types, however defined and characterised, it is important to investigate their phonetic realisation over different segmental and prosodic contexts, and to identify and quantify the factors contributing systematically to variability. An adequate model of intonation for synthesis, for example, will have to incorporate the most salient conditioning factors.

Nuclear and non-nuclear accents
Different phonological models make different predictions about the status and behaviour of nuclear and non-nuclear accents: the British school of intonation traditionally treats the two types of accent as separate categories, while in the Dutch (IPO³) school (e.g. 't Hart et al, 1990), accent configurations are described in terms of a phrase-length pattern. An autosegmental model (e.g. Pierrehumbert 1980), on the other hand, claims that the same tonal inventory is used on both nuclear and pre-nuclear accents; the contour shape of those in nuclear position is simply modified by phrase accents and boundary tones. Some of the studies discussed below look at accents in both nuclear and pre-nuclear position, others at one or the other.

³Institute for Perception Research, Eindhoven, The Netherlands

Segmental context
Systematically varying the segmental context will involve systematically varying the composition of syllable onsets and rhymes in the target domain; there are well-known intrinsic and derived differences in the duration of segments from different classes, when incorporated in different syllabic constituents, but their interaction with the timing of the f0 contour is less well documented.

Prosodic context
Characterising the prosodic context is a complex business. We have to look both at rhythmic organisation and at tonal specifications in the vicinity of the target. Assuming that the target for investigation is an accented syllable, this can appear in a variety of positions within a prominence hierarchy, which coincides with intonational structure at the level of the intonational phrase (IP). In Germanic languages, at least, we can assume that the target syllable will be rhythmically stressed, and initiate a stress group, or (stress) foot. This foot may be monosyllabic or polysyllabic (the stressed syllable + some number of unstressed syllables). If monosyllabic and nuclear, it will be by definition phrase-final, with a phrase boundary immediately to its right, and will be subject to phrase-final lengthening. If monosyllabic and pre-nuclear, it will have another stressed syllable in the immediate right context; if this following stressed syllable is also accented, i.e. associated with a pitch accent, then this will mean that two pitch accents are immediately adjacent to each other -- a potential "clash". In a polysyllabic foot, the accented syllable will be separated from the next tone-bearing element (accented syllable or boundary tone) by whatever interval in time is occupied by any intervening unstressed syllables. The target accented syllable will be considerably shorter in a polysyllabic foot than if it were a monosyllable. A further complication may arise from the presence or absence of word boundaries within the foot, and from whether the unstressed syllables are strong or weak -- both factors which will affect rhythmic organisation.

All the above factors are related to the right context, but the left context may also be important. Pre-nuclear accents themselves may be the first accent in a phrase (= intonational onsets) or phrase-medial. If they are in phrase-onset position, in a model which recognises boundary tones, they will be considered to have a boundary tone either immediately to the left or with a few unstressed syllables intervening.
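The right-context cases distinguished above can be summarised as a small classification. The sketch below is a hypothetical illustration: the class `AccentContext`, the function `right_context` and the field names are invented here, and the sketch deliberately ignores word boundaries, strong versus weak unstressed syllables and the left context.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class AccentContext:
    """Minimal description of the foot initiated by an accented syllable."""
    syllables_in_foot: int                 # 1 = monosyllabic foot
    is_nuclear: bool                       # True if the accent is the nucleus of the IP
    next_stress_is_accented: bool = False  # relevant to pre-nuclear accents only


def right_context(ctx: AccentContext) -> Dict[str, Union[bool, int]]:
    """Derive the right-context properties described in the text."""
    if ctx.syllables_in_foot < 1:
        raise ValueError("a foot contains at least its stressed syllable")
    monosyllabic = ctx.syllables_in_foot == 1
    return {
        # Monosyllabic and nuclear: phrase boundary immediately to the right.
        "phrase_final": monosyllabic and ctx.is_nuclear,
        # Monosyllabic and pre-nuclear, followed by another accented syllable.
        "potential_clash": (monosyllabic and not ctx.is_nuclear
                            and ctx.next_stress_is_accented),
        # Unstressed syllables separating the accent from the next tonal element.
        "intervening_unstressed": ctx.syllables_in_foot - 1,
    }


print(right_context(AccentContext(syllables_in_foot=1, is_nuclear=False,
                                  next_stress_is_accented=True)))
# {'phrase_final': False, 'potential_clash': True, 'intervening_unstressed': 0}
```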

The position of an accent within topic structure is arguably part of its prosodic context, though in a more loosely defined way than with regard to its position in the IP. Alternatively, one could consider the discourse structure to be a separate, higher level of organisation. There are well-known effects in the frequency domain from topic-initial (extra high frequency) and topic-final (extra low frequency) positions, but the contribution of the discourse structure to f0 timing has been little studied (but see Swerts et al, 1994).

Finally, one must recognise that adjustments will be needed when speaker tempo is changed; if segment/syllable durations are extended in slow speech or compressed in fast, then this must have an effect on the timing of the f0 contour.

Some of the contexts described above have been studied systematically in the studies reported below. But the picture is lamentably incomplete, and predictive models are only partially successful.

5.1 Segmental context
The effects of "microprosody" -- vowel-intrinsic f0 and consonantal perturbations -- appear to be salient in the frequency domain, and act as important cues to segmental identity (Kohler 1990, Silverman 1987, 1990). Silverman (1987) demonstrated that listeners were skilled at factoring out these effects when interpreting intonation. Their contribution to time-alignment is harder to assess; in a visual analysis of an f0 signal, consonantal perturbations may obscure the location of nearby f0 peaks (House 1989). This type of segment-based microprosody is not further discussed in detail here.

Segment classes are important to f0 timing in other ways. Vowels are associated with different intrinsic durations, which will affect the amount of voicing available to an f0 gesture. Sonorant consonants behave differently from voiced obstruents in their availability to carry the f0 contour. The duration and therefore the slope of the contour will thus depend on the segmental composition of the syllable. Syllable-final voiceless obstruents (in English, at least) markedly shorten the durations of preceding sonorants in the syllable rhyme, and thus may have an indirect effect on contour alignment.

In a pilot production experiment, House (1989) looked at realisations by a single speaker, trained in the British intonation tradition, of three nuclear tones (high fall, fall-rise, low rise) over monosyllables with systematically varied segmental composition. All contained either a short vowel or (long) diphthong, with varied onset and coda types. The starting points for the different f0 gestures (beginning of fall in the fall and fall-rise; beginning of rise in the rise) were typically located in different parts of the syllable rhyme (earliest for the fall, latest for the rise) when these words were spoken in isolation; these differences appeared to be phonological. These different starting points also meant that the f0 movements regularly had different durations from each other. However, the mapping of the f0 gestures to the syllables varied systematically according to the composition of the syllable rhyme: movements were steeper, more compressed, over intrinsically short vowels and those "clipped" by following voiceless consonants, while the shallowest slopes were found over long vowels followed by sonorant consonants, especially nasals. At the same time, the variability in the duration of the f0 movement was much less than that of the segments themselves, so the two duration measures could not be considered to be in direct proportion. The domain identified in this study as being the relevant one for contour realisation was the voiced rhyme, starting from the beginning of voicing in the vowel and continuing to the end of periodicity in the syllable.

Van Santen & Hirschberg (1994) include segment classes as a variable in their study of nuclear accents in American English, using a much larger, single-speaker database, with target syllables produced phrase-finally in carrier sentences. Distortions in the temporal domain are described as "non-linear rightward stretching of the contour as the durations of onset, vowel nucleus, and coda increase". The relationship between peak timing and syllable composition is not a simple one, whether syllable- or vowel-onset is used as a reference point. The authors claim that the timing of anchor-points such as the peak can be "predicted from segment duration in conjunction with coda class; no information is needed about onset class or vowel height". These findings are consistent with House's observations. In calculating peak position they use a weighted combination of onset duration and "s-rhyme" (sonorant rhyme) duration, plus a constant. The "s-rhyme" is defined as starting at the beginning of the last sonorant consonant in an onset cluster, and continuing through any sonorant segments in the rhyme. Onset duration has the largest effects on the timing of the early part of the contour, s-rhyme duration on the later parts. An apparent justification for incorporating onsets into the equation is their discovery of an important anchor-point which occurs 70ms before target syllable onset, and whose value is 20% of peak value. The other stable anchor-points were the peak itself, and the end-point of the s-rhyme, where there was very little variation in f0 value.
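The form of the calculation described above, a weighted combination of onset duration and s-rhyme duration plus a constant, with weights depending on coda class, can be sketched as follows. The weights and constants below are invented purely for illustration and are not the values estimated by van Santen & Hirschberg; the coda-class labels and the assumption that the predicted time is measured from the start of the syllable are likewise hypothetical.

```python
from typing import Dict, Tuple

# (onset weight, s-rhyme weight, constant in ms) for each coda class.
# HYPOTHETICAL values for illustration only; not taken from the study.
ILLUSTRATIVE_WEIGHTS: Dict[str, Tuple[float, float, float]] = {
    "sonorant": (0.5, 0.25, 10.0),
    "voiceless": (0.5, 0.125, 10.0),
}


def predict_peak_time_ms(onset_ms: float, s_rhyme_ms: float, coda_class: str,
                         weights: Dict[str, Tuple[float, float, float]] = ILLUSTRATIVE_WEIGHTS
                         ) -> float:
    """Predicted peak time as a weighted sum of durations plus a constant.

    The result is taken here to be measured from the start of the
    syllable, which is an assumption of this sketch.
    """
    onset_weight, rhyme_weight, constant_ms = weights[coda_class]
    return onset_weight * onset_ms + rhyme_weight * s_rhyme_ms + constant_ms


print(predict_peak_time_ms(onset_ms=80.0, s_rhyme_ms=200.0, coda_class="sonorant"))   # 100.0
print(predict_peak_time_ms(onset_ms=80.0, s_rhyme_ms=200.0, coda_class="voiceless"))  # 75.0
```

Note that neither onset class nor vowel height appears as an input, in keeping with the authors' claim that only segment durations and coda class are needed.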

5.2 Prosodic context and tempo
Steele (1986; reported in more detail in Silverman & Pierrehumbert 1990) investigated nuclear H* accent contours in American English, in systematically varied prosodic positions and at normal and fast speaking rates. Her results suggested that f0 peaks occurred at a proportional rather than a fixed interval into the vowel. The influence of vowel length was confirmed by House's 1989 observations. The f0 peak delay (relative to the vowel) was shortened by intrinsically shorter vowels and by a faster speaking rate; however, when vowels were shortened by the presence of unstressed syllables in the foot (forming a nuclear "tail"), there was no corresponding shortening of peak delay: on the contrary, the peak tended to occur later in the stressed syllable when this was not phrase-final. Phrase-final position was thus associated both with longer duration and an earlier f0 peak. House's (1989) study of nuclear tone realisations included a set of sentences in which a subset of the target nuclear monosyllables were embedded in sentences where the nucleus was followed by one or more unstressed syllables in a tail. Analysis of these suggested a consistent shift rightwards of the f0 anchor-points to a position late in the nuclear syllable, though there were too few data points for statistical verification of the results. Steele controlled for any segmentally-induced variability by using an all-sonorant string.
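The contrast between a fixed and a proportional interval can be checked on any set of measured tokens by comparing the variability of absolute peak delay with that of peak delay expressed as a proportion of vowel duration. The following is a minimal sketch under that assumption; the function names and the token values are invented for illustration and do not reproduce Steele's data or method.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Population standard deviation divided by the mean."""
    return pstdev(values) / mean(values)


def compare_alignment_measures(tokens: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Return (CV of absolute delay, CV of proportional delay).

    Each token is (vowel_duration_ms, peak_delay_ms), with the delay
    measured from vowel onset. A smaller CV for the proportional measure
    is what a "proportional interval" account would lead one to expect.
    """
    absolute = [delay for _, delay in tokens]
    proportional = [delay / vowel for vowel, delay in tokens]
    return coefficient_of_variation(absolute), coefficient_of_variation(proportional)


# Invented tokens: vowel duration varies, the peak always falls 40% into the vowel.
tokens = [(100.0, 40.0), (150.0, 60.0), (200.0, 80.0)]
cv_absolute, cv_proportional = compare_alignment_measures(tokens)
print(cv_proportional < cv_absolute)  # True
```

As the surrounding discussion makes clear, such a comparison is only meaningful within a single prosodic context, since shortening caused by a following tail does not shorten the peak delay in the same way.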

In a follow-up to Steele's study, Silverman & Pierrehumbert (1990) report a substantial investigation of pre-nuclear H* accents in American English. As well as wanting to establish what factors determine f0 alignment in these syllables, they were interested to see if there was any evidence for the phonological homogeneity of pre-nuclear and nuclear accents, or whether they should be treated differently, as in the British tradition (Silverman 1987 had observed that nuclear accents in British English tended to have a consistently earlier f0 alignment than pre-nuclear ones). The results have been interpreted to support the theory of phonological homogeneity.

Their two speaker subjects produced a series of repetitions of all-sonorant two-accent phrases, using H* on the pre-nuclear accent and H+L* on the second (followed by L L%). There were three speaking rates (slow, normal, fast), and the two accents were separated by 0-4 unstressed syllables. The position of the word boundary between the accents was also varied (e.g. "MAma LEMM" vs "MA le MANN"), so that the effects of word boundaries on rhythmic organisation within feet containing the same number of syllables could be studied. The target (pre-nuclear) syllables were also phrase-initial, though this aspect of the context is not discussed. The H*-associated syllables could thus be analysed as acquiring certain binary contextual properties: plus or minus "stress clash" (adjacent to another accented syllable); plus or minus word-final (adjacent to word boundary). A word-boundary effect had been found in German by Kohler (1983) on both the degree of syllable compression and the way f0 was realised: a falling contour would typically be spread over the whole of the remainder of the word, making the fall steeper over words with fewer syllables.

In relative terms, Silverman & Pierrehumbert found that f0 timing varied across the three tempi and in relation to the right prosodic context in ways which closely echoed the nuclear syllables in Steele's study; in absolute terms, however, f0 peaks were earlier in nuclear than non-nuclear syllables, confirming Silverman's earlier observations. A simple change in tempo meant earlier peak alignment in fast speech, where syllables were shorter, and later alignment in slow speech where they were longer. However, a "lengthening" prosodic context to the right shifted f0 alignment to an earlier point in the syllable. These contexts were either a word boundary on its own (MA/MOM le MANN), or a word boundary in combination with stress clash (MA/MOM LEMM), which increased the leftward shift. The absolute difference reported for Steele's nuclear syllables, where peaks were even earlier, could be attributed to the even stronger lengthening effect of the phrase boundary, and/or the upcoming phrase accent. The authors favour an interpretation of their results based on the "sonority profile" of the syllable, recognising that the closing gesture of such a profile is extended by prosodic lengthening. In calculating an equation which gives the best fit to the data, they find that peak placement in proportion to syllable rhyme length is most regular in patterning. They found no evidence for invariance in the rise-time leading up to the peak, and no need for a modification to the underlying phonological representation to account for alignment differences. There was, on the other hand, evidence of some "tonal repulsion" or "gestural overlap" factor (in the stress clash condition).
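The kind of comparison reported here, peak placement as a proportion of rhyme length in each combination of the binary contextual properties, can be sketched as below. The class `Token`, the function `mean_relative_delay_by_context` and the numerical values are hypothetical; the values were chosen only to mimic the direction of the reported effects and are not Silverman & Pierrehumbert's measurements.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Token:
    """One measured H* token with its binary contextual properties."""
    stress_clash: bool
    word_final: bool
    peak_delay_ms: float  # measured from the start of the rhyme (assumption)
    rhyme_ms: float


def mean_relative_delay_by_context(tokens: Sequence[Token]) -> Dict[Tuple[bool, bool], float]:
    """Mean peak delay as a proportion of rhyme duration, per context.

    Keys are (stress_clash, word_final) pairs.
    """
    groups: Dict[Tuple[bool, bool], List[float]] = defaultdict(list)
    for token in tokens:
        groups[(token.stress_clash, token.word_final)].append(
            token.peak_delay_ms / token.rhyme_ms)
    return {context: mean(values) for context, values in groups.items()}


# Invented tokens: earlier relative peaks before a word boundary, earlier
# still when the word boundary combines with stress clash.
tokens = [
    Token(stress_clash=False, word_final=False, peak_delay_ms=120.0, rhyme_ms=200.0),
    Token(stress_clash=False, word_final=False, peak_delay_ms=130.0, rhyme_ms=200.0),
    Token(stress_clash=False, word_final=True, peak_delay_ms=100.0, rhyme_ms=200.0),
    Token(stress_clash=True, word_final=True, peak_delay_ms=80.0, rhyme_ms=200.0),
]
for context, value in sorted(mean_relative_delay_by_context(tokens).items()):
    print(context, round(value, 3))
```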

A few unwanted variables crept into the study, e.g. from the difficulties of locating the boundaries of lateral consonants, or the best point to choose as f0 peak; the tonal sequence chosen (H* H+L*) cannot have helped, since it tends to generate a plateau at the H level. In commenting on the results, Bruce (1990) queries whether the f0 peak is necessarily the best anchor-point to choose, and suggests analysing the alignment of low (L) anchor-points preceding or following. Analysis of a complete f0 gesture might suggest a certain amount of gesture invariance. He cites evidence from Swedish which shows that the starred tone within word accents is the point most reliably anchored in the stressed syllable; other f0 points, such as the focal H accent, may not be critically timed to specific syllables but will be adjusted in alignment according to the segmental material available.

Another extensive production-based study, using two speaker subjects, is reported for H* accents in Mexican Spanish in Prieto et al (1995). Earlier observations had suggested that in phrase-medial position, the f0 peak was regularly displaced to the right, often into the syllable following the accented one. A possible characterisation of these accents as L*+H rather than H* was apparently rejected because the former was not part of the Spanish inventory (Hirschberg, p.c.). Prosodic contexts are systematically varied: the H* accents appear in initial, medial or final position within all-sonorant words (e.g. NUmero vs nuMEro vs numeRO); the words may in turn precede a major (IP) or minor (ip) prosodic boundary, or be phrase-medial; the inter-stress interval (in syllables) is varied. The study specifically aims to look at the complete accent gesture rather than simply f0 peaks.

In this study, peak delay was measured from the beginning of the whole syllable, so in calculating peak position, syllable onset duration was inevitably a significant and fairly constant factor (indeed the authors achieved a higher prediction rate by using raw rather than relative peak delay). This measure was perhaps justified by the discovery that the L point from which the precursive rising gesture began was always anchored in the syllable onset. However, it means that results cannot be compared directly with those of Silverman & Pierrehumbert, who measured peak delay in relation to the sonorant rhyme. The data provided evidence that longer syllables were associated with later peaks, except that there were certain prosodic contexts to the right which could shift the peak leftwards. Upcoming IP and ip boundaries had some effect, but there were large inter-speaker differences. There was a clear effect from the position of the accented syllable within the word: the peak retracted as accented syllables approached the end of the word. In checking whether word-boundary effects may have contributed to this, further inconsistencies were found between the speakers. There was apparently no increased peak retraction before a stress clash -- but again, the speakers showed different clash-resolving strategies. Both speakers had a weak but consistent tendency to increase peak delay in line with an increase in the number of unstressed syllables between accents. Since the preceding L was stably anchored in the onset, while peak position varied, there was no invariant rise time.

In their study of the "effects of time pressure" on Dutch rising and falling accent gestures, Caspers & van Heuven (1993) set up three different kinds of time pressure, and used themselves as speaker-subjects. Six different accent patterns were specified, over syllables containing either a long or a short vowel, with varied onset/coda consonants, and produced at both a normal and a fast rate. The first type of time pressure was fast speech. In this condition, the duration of both rising and falling gestures was compressed by both speakers. The second type of pressure was using a short rather than a long vowel; here, the only consistent effects (both speakers) were an increase in the size of the f0 excursion for both rises and falls, and a steeper slope for the rise. The third time-pressure condition was in cases of a clash of f0 gestures -- where a rise was followed immediately by a fall on the same syllable (the "pointed hat" pattern). For both speakers, the duration of the rise was shorter and its slope steeper, while the fall was relatively unaffected, suggesting it was not under pressure from anything in the right context. As far as alignment was concerned, the start-point of the rise was consistently anchored in the syllable onset. Its end-point was very variable when there was a fall in the right context, providing evidence against 't Hart's (1990) claim that the rising gesture was completed 50 ms into the vowel. The picture was much less clear for the fall, where shape was the most constant feature. The best anchor-point was in relation to vowel (rather than syllable or voicing) onset, but there was no fixed point of synchrony in segmental structure, and there were large effects when under pressure from a preceding rise.
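The three properties of an f0 gesture discussed here (duration, excursion size and slope) are related in a simple way, which the sketch below makes explicit. It assumes that a gesture is described by the time and frequency of its start-point and end-point, and it expresses excursion in semitones using the standard conversion 12 × log2(f_end / f_start). The choice of semitones, the class name `F0Gesture` and the example values are assumptions of this illustration; the units used by Caspers & van Heuven are not stated above.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class F0Gesture:
    """A rising or falling f0 movement defined by its two end-points."""
    start_ms: float
    end_ms: float
    start_hz: float
    end_hz: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms

    @property
    def excursion_st(self) -> float:
        """Excursion size in semitones (positive for rises, negative for falls)."""
        return 12.0 * math.log2(self.end_hz / self.start_hz)

    @property
    def slope_st_per_s(self) -> float:
        """Rate of f0 change in semitones per second."""
        return self.excursion_st * 1000.0 / self.duration_ms


# Invented example: the same one-octave rise at two durations.
normal_rise = F0Gesture(start_ms=0.0, end_ms=100.0, start_hz=100.0, end_hz=200.0)
compressed_rise = F0Gesture(start_ms=0.0, end_ms=50.0, start_hz=100.0, end_hz=200.0)
print(normal_rise.excursion_st, normal_rise.slope_st_per_s)          # 12.0 120.0
print(compressed_rise.excursion_st, compressed_rise.slope_st_per_s)  # 12.0 240.0
```

If the excursion is held constant, shortening the gesture necessarily steepens it; a steeper slope may also come from a larger excursion, which is why the three measures are best reported separately.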

5.3 Interacting effects of phonological category and syllable structure on alignment
A perception experiment using synthetic speech is described in Rietveld & Gussenhoven (1995). In the light of evidence from studies such as those described above, showing systematic variation in alignment depending on segmental and prosodic factors, the authors carried out a study using a synthesis system which aligned target values for accents at a fixed proportion into the vowel. The study was designed to test whether the perception of category boundaries was distorted by such a relatively crude algorithm. The boundary to be tested was that distinguishing two versions of the "flat hat": with and without "downstep", alternatively characterised as H* H*L L% ("late fall") or H* !H*L L% ("early fall"), a contrast not unlike that attested for German in Kohler (1987). At the same time, the researchers wished to test whether alignment should be calculated with reference to syllable constituency or a perceptual event -- the P-centre.

A carrier sentence placed the target syllable -- carrying the final accent -- in penultimate position in the phrase; the vowel was held constant, but consonants in onset and coda varied in number and voicing. P-centres were calculated (they are predicted to be later for voiced codas, earlier for long onsets). The intonation over the early part of the sentence was constant, but the position of the final fall was shifted in regular steps in relation to the target syllable. Listener subjects identified the contours as either (I) "quiet, low-pitched" (downstepped) or (II) "more emphatic, high-pitched" (non-downstepped). The crossover point in perception did indeed vary according to the syllable structure; for instance, it usually occurred earlier when there was a voiceless coda, unless this was counter-balanced by a long onset. Calculating alignment in terms of P-centres was found to be a valid strategy, but had less predictive power than a combination of onset duration, coda voicing and voiced onset duration. The timing of the H* target had to be related to the availability of sonorant segments in the syllable (including consonants in onsets and codas). Since the experiment used synthetic speech, results need to be interpreted with appropriate caution if they are to be applied to natural speech.

6. Implications for phonological structure
A striking and consistent finding which emerges from the experimental literature, and which seems to be fairly robust across the languages studied, is that f0 timing in "laboratory speech" correlates with segmental duration in two quite different ways:

(i) Intrinsic differences between segments, such as long vs short vowels, will lead to straightforward adjustments to the f0 alignment: the longer the vowel, the later the peak. This adjustment also appears to apply when segmental durations are affected by syllable structure (e.g. pre-fortis clipping), or when speakers change the length of segments by increasing or decreasing the overall tempo of their speech.

(ii) Lengthening effects due to the prosodic context, such as upcoming word or phrase boundaries, will conspire to push peaks leftwards: so effectively, the longer the vowel when lengthened under these conditions, the earlier the peak. Stress clash (or more accurately, accent clash) compounds this leftward push, presumably to increase the separation between the accent gestures.

An alternative way to look at the effects of "prosodic lengthening" is to take the fully long (e.g. phrase-final) realisations as the unmarked case, and to treat durational variants as examples of shortening. A shortening effect arises from rhythmic organisation, at least in some languages, whereby stressed syllables at the head of polysyllabic feet are substantially shorter than those without trailing syllables. When syllables are rhythmically clipped in this way, accents tend to be aligned late, sometimes with their peaks outside the stressed syllable altogether.

When this occurs, it is a problem for phonological theories which insist on a strict association between a pitch accent and a stressed syllable, whether the domain for phonetic realisation -- the tone-bearing unit (TBU) -- is defined, for example, as the whole syllable, the vowel, or the sonorant rhyme. British nuclear tones and American starred tones are both supposed to show such a close association. We would like to propose that this association may be more properly made with a larger structural domain, namely the foot. In phonetic realisation, there will be a strong chance of f0 peaks landing on the stressed syllable, but even if they do not, it will be the stressed syllable, with all its other rhythmic cues, which is perceived as prominent. A stressed syllable next to a phrase boundary is of course a monosyllabic foot in its own right. The suggestion about association with the foot is not new -- Pierrehumbert & Beckman (1988) describe accent as "a foot-level property that is attracted to the head syllable" (see also Pierrehumbert 1980) -- but the assumption has nonetheless remained that it is the stressed syllable which is the TBU.

Further work is needed to test the foot hypothesis, to establish how much tonal material has to be assigned to the foot, and to determine exactly what the best definition of a foot should be in this context. F0 behaviour may provide evidence to support a particular definition of the foot -- whether a foot is allowed to include word boundaries, in the Abercrombian tradition; how strong syllables which are not perceived as being rhythmically stressed are organised; whether a foot should be seen as embedded within the structure of the (phonological) word, allowing certain weak syllables to remain unorganised at foot level. The evidence reported above shows that there are differences in f0 alignment between polysyllabic stress groups containing word boundaries and those which are word-internal, but whether this simply reflects the durational differences is a matter for empirical investigation.

Current work by Möbius (1994, 1995) on modelling f0 for synthesis, using an adaptation of Fujisaki's (1979) model, assumes that the curve associated with an accent is a property of the whole "accent group", defined as the accented syllable + any trailing unaccented ones. In the examples he gives, it appears that the accent group is coterminous with the stress foot, since all stressed syllables receive an accent. The accent command generates a contour which, while blind to the segmental content of the accent group, gives acceptable results in synthesis. A version of his model, incorporating the findings of van Santen & Hirschberg -- so that alignment depends on the durational/segmental makeup of the whole foot, not just the stressed syllable -- has been implemented in text-to-speech synthesis; good results are reported in terms of naturalness for American English, German and Spanish (van Santen, p.c.).

There are limitations to Möbius's approach -- there would seem to be only one kind of accent; sentence mode determines phrase final and declination conditions; and, as Ladd (1995) points out, the system is not able to model certain phonetic shapes such as the flat hat pattern without treating what are clearly two accents as a single accent. Nonetheless, Möbius's use of a unit similar to a foot as the domain for realisation of an accent gesture fits in with our interpretation of experimental results reported above.

7. The discourse issue
So far this paper has mainly been concerned with the constraints of low-level structures (segments, words, feet, tone groups) on intonation. It is however well known that intonation is also affected by a higher level "discourse" structure. The information contained in a text is not simply expressed by a sequence of sentences, but by sentences grouped together around a topic or sub-topic to make up a meaningful unit, often referred to as a discourse unit. In written texts the boundaries of such topical units are often highlighted by typographic means such as paragraph divisions, headings, sub-headings etc.

In speech, both the internal coherence of a discourse unit and the demarcation of its boundaries can be indicated prosodically. The prosodic features most commonly associated with the transitions between discourse units are low boundary tones, long pauses and high pitch reset. These observations were made, for example, in conversational data by Brown and Yule (1983), who use the term "paratone" to indicate a prosodic domain comparable to a paragraph in writing.

7.1 F0 peak timing and "finality"
Swerts (1994) (cf. also Swerts et al 1994), in a study of the prosodic correlates of finality, observes that in synthesised utterances the timing of the f0 peak (H*) on the nuclear accent affects the degree of perceived finality. A single utterance in Dutch (een gele driehoek: "a yellow triangle") was synthesised with two different contours from the Dutch (IPO) system, a "flat hat" and a "pointed hat", and with two different timings: an early fall beginning 20ms before vowel onset, and a late fall beginning 80ms after vowel onset. The results of the perception experiment showed that the early falls constituted stronger finality cues than the late falls. This is consistent, as Swerts points out, with the autosegmental model of Dutch intonation proposed by Gussenhoven (1988), which has two accent-lending falls (the subject of the study reported in section 5.3 above): downstepped (early) and non-downstepped (late), as opposed to the IPO system, which has only one ('t Hart et al 1990). The meaning ascribed by Swerts to the two falls (degrees of "finality") supports the intuitions of Rietveld and Gussenhoven in their perception of the two falls: "The downstepped contour sounds more as if it were meant as a definitive contribution to the discourse, and does not seem intended to draw the listener into a further discussion or evaluation of that contribution" (1995:377). This perception is what one would expect in an interactive context, although it is an inferred meaning, rather than a "literal" meaning. The effect of different degrees of finality in a monologue, or reading aloud, would probably not be perceived as interaction management but as reflecting the structure of the text.

Swerts's study was based on one isolated utterance, and claims that the observed effects are necessarily related to discourse structure are therefore too strong. However, the perception of different degrees of finality in the performance of a single utterance suggests that there is at least a potential for this to be exploited for discoursal effect. A similar observation has been made in relation to the height of nuclear falls (Wichmann 1991b), namely that the starting point of a fall affects the degree of perceived finality. There is clearly a relationship between late and early falls and high and low falls, but this has yet to be explored fully. Categorical or gradient distinctions have been claimed for both sets of patterns. This is still a matter for debate.

As we have seen, most experimental studies of peak timing have focussed on the shape of accents in nuclear position, though a few more recently have explicitly compared these with pre-nuclear, or phrase-medial, accents. A consistent observation, confirmed by these studies, is that the peaks in pre-nuclear accents tend to occur later in the accented syllable (if not outside it altogether) compared with those in nuclear accents. The conclusion drawn is not that nuclear and pre-nuclear accents differ intrinsically in their properties, but that the difference in alignment can be explained by the strong effects of prosodic lengthening reported in nuclear position, together with pressure from domain-edge tones which need to be realised.

Inevitably, in collecting controlled data for laboratory experiments, discourse structure is somewhat limited. Implicit in the above studies is an assumption that pre-nuclear accents must be a homogeneous class -- the argument was whether they differed from nuclei, not whether they differed among themselves. But in natural speech, pre-nuclear accents may be phrase-initial or phrase-medial, and there is ample evidence from the many studies of accent scaling that position within the phrase has an important effect in the frequency domain. However the intonational phrasing is phonologised (not at issue here), the first accent in a phrase, described as the intonational "onset" in the British tradition, but abbreviated henceforth to IO, to avoid confusion with syllable onsets, is typically observed to have a higher f0 than subsequent ones in a defined domain. When text material is organised into spoken "paragraphs", according to discourse topic, this effect is enhanced: the start of a new topic appears to be signalled by extra high frequency on the first IO. Are there also consistent generalisations to be made about the timing of f0 within non-final accents according to the position of the accent in the phrase and in the discourse?

7.2 F0 peak timing and "initiality"
Wichmann (1991a) investigated the intonational correlates of topic shift in a complete 5-minute broadcast news summary, part of the Spoken English Corpus (SEC: Knowles et al forthcoming). The results showed that, as expected, each news item was marked prosodically by a high pitch reset. Further observations also showed a clear correlation in the data between topic-initial IOs and a very late alignment of the extra-high f0 peak, which occurred at the end of, or even beyond, the stressed syllable. This pattern, a clear rise in f0 to a late peak, was rare in the text except in this topic-initial position.

There were ten such topic-initial IOs in the news broadcast. The prosodically transcribed SEC text, auditorily based, assigns a high level tone to seven of them; of the remainder, one is marked as high falling, and the remaining two, in nuclear as well as IO position in their respective phrases, as fall-rises. The same late f0 peak is nonetheless found on all of them, suggesting an interaction between topic structure and peak timing, regardless of contour.

Naturally-occurring speech is of course not controlled for segmental and prosodic environment. Claims for a discoursal effect on peak timing cannot be made without taking into account other timing constraints. A reanalysis of the IO accents in this text is therefore required, to see first of all whether segmental and low-level prosodic constraints on peak timing, as observed in experimental work to date, are reflected in natural data; secondly to see whether there is an additional, discoursal, effect contributing to the late f0 alignment.

8. Looking for evidence in "real" data for reported constraints on peak timing

8.1 Procedure
The text analysed was that described above: a news summary from the SEC. The IO was taken to be the first accented syllable of each major tone-group as marked in the prosodic transcription of the corpus text. For each IO the timing of the f0 peak was calculated as a percentage of the total duration of the accented syllable. In some cases the peak associated with the accented syllable⁴ actually occurred later in the foot, giving values greater than 100%.

⁴ Syllables were measured consistently as far as possible (word-medial intervocalic consonants were excluded), but given the unconstrained nature of the data some segmental sequences, such as the geminate nasals in "Prime Minister", made the precise location of boundaries difficult.
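The peak-position measure described above can be sketched as follows. This is a minimal illustration only: the function name and the use of millisecond timestamps are our assumptions, not part of the original procedure, which reports only the resulting percentages.

```python
def peak_position_percent(syllable_start_ms, syllable_end_ms, peak_ms):
    """Return the f0 peak position as a percentage of the accented
    syllable's total duration (hypothetical helper, times in ms).

    Values above 100 mean the peak falls after the end of the syllable,
    i.e. later in the foot.
    """
    duration = syllable_end_ms - syllable_start_ms
    if duration <= 0:
        raise ValueError("syllable must have positive duration")
    return 100.0 * (peak_ms - syllable_start_ms) / duration


# Invented timings, for illustration only (not corpus measurements).
mid_syllable = peak_position_percent(0, 200, 100)   # peak halfway through
beyond_syllable = peak_position_percent(0, 200, 220)  # peak after syllable end
```

A peak located 20ms after the end of a 200ms syllable thus yields a value above 100%, which is how the late peaks discussed below are expressed.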

Each accented syllable was examined for its segmental makeup and prosodic environment. All factors, other than tempo, identified by past research as operating a leftward pull or rightward push on the timing of the peak were taken into account. Those exerting a leftward pull were taken to be:

short vowel

number of consonants in the syllable onset

presence of a voiced segment in the syllable onset

stress clash

upcoming word boundary

upcoming intonation phrase boundary;

those exerting a rightward push were taken to be:

long vowel or diphthong

sonorant syllable coda

polysyllabic foot.

For the purposes of this study the speaking speed was assumed to be constant.

The subsequent weighting of each syllable followed the practice of Rietveld and Gussenhoven (1995), who implemented their experimental results in their synthesis system by assigning equal value to each of the segmental effects they identified (the peak was moved 15ms to the left for each consonant in the syllable onset, an additional 15ms for a voiced segment, and moved 15ms to the right in the presence of a sonorant coda). This principle of equal weighting was extended here to the prosodic context. Thus for each leftward pulling factor a unitary negative value was assigned, and for each rightward pushing factor a unitary positive value was assigned. The final weighting was the sum of these values (e.g. three leftward pulling factors and one rightward pushing factor give a final weighting of -2: -3 + 1 = -2). The resulting value is referred to below as the "accumulated timing constraint" (ATC).
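The weighting scheme can be sketched as below. The class and field names are hypothetical labels for the factors listed above. One point is an assumption on our part: each onset consonant is counted as a separate leftward unit, by analogy with the per-consonant shift in Rietveld and Gussenhoven's synthesis rule; the text does not state whether the original count was per consonant or per factor.

```python
from dataclasses import dataclass


@dataclass
class IOSyllable:
    """Hypothetical record of the factors listed in section 8.1."""
    long_vowel: bool          # long vowel or diphthong; False = short vowel
    onset_consonants: int     # number of consonants in the syllable onset
    voiced_onset: bool        # voiced segment present in the onset
    stress_clash: bool
    word_boundary: bool       # upcoming word boundary
    ip_boundary: bool         # upcoming intonation phrase boundary
    sonorant_coda: bool
    polysyllabic_foot: bool


def accumulated_timing_constraint(s):
    """Sum unitary weights: -1 per leftward factor, +1 per rightward factor."""
    leftward = (
        (0 if s.long_vowel else 1)   # short vowel pulls left
        + s.onset_consonants         # ASSUMPTION: one unit per consonant
        + int(s.voiced_onset)
        + int(s.stress_clash)
        + int(s.word_boundary)
        + int(s.ip_boundary)
    )
    rightward = (
        int(s.long_vowel)
        + int(s.sonorant_coda)
        + int(s.polysyllabic_foot)
    )
    return rightward - leftward


# Invented syllable with three leftward factors (short vowel, one onset
# consonant, upcoming word boundary) and one rightward factor (sonorant coda).
example = IOSyllable(
    long_vowel=False, onset_consonants=1, voiced_onset=False,
    stress_clash=False, word_boundary=True, ip_boundary=False,
    sonorant_coda=True, polysyllabic_foot=False,
)
example_atc = accumulated_timing_constraint(example)
```

The invented example reproduces the arithmetic given in the text: three leftward units and one rightward unit sum to an ATC of -2.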

The prosodic transcription of the SEC uses a version of the British system, assigning to each accented syllable a tonetic stress mark indicating the tone (rise, fall etc) associated with it. No formal distinction is made between nuclear and non-nuclear accents. The IOs concerned here were for the most part marked with a high level tone, but also included fall, fall-rise and rise. Those marked with a rise were excluded from the present analysis for two reasons: firstly, a rising tone has a late f0 peak by definition, and secondly, the other tones (level, fall, fall-rise) are all associated with an H* accent in other (autosegmental) models of intonation and are therefore assumed to be more closely related phonologically.

8.2 Evidence for segmental and prosodic constraints
The IOs (a total of 42) were divided into three categories: those with a weighting of -2 or lower, those with a weighting of -1 or 0, and those with a weighting of +1 (there were none higher than +1). For each of these categories the mean peak position was calculated.

Table 1

ATC (accumulated timing constraint)    Mean peak position (% into IO syllable)
-2 or lower                            57.8
-1/0                                   95.9
+1                                     94.1

Table 1 shows the mean peak position for each category, expressed as a percentage of the total duration of the accented syllable. It appears from this that the f0 peak occurred much earlier in those IO syllables with an ATC of -2 or lower than in those with an ATC of -1, 0 or +1. The difference between the other two categories is negligible, reflecting perhaps the absence of extreme positive weightings. (The lowest ATC was -5, while the highest was +1.) Alternatively it could suggest that a rightward push is generally weaker than a leftward pull.
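The grouping behind Table 1 can be sketched as follows. The function names are hypothetical, and the observations are invented purely to show the computation; the per-syllable measurements from the corpus are not reproduced in this paper.

```python
from collections import defaultdict
from statistics import mean


def atc_category(atc):
    """Map an ATC value to one of the three categories used in Table 1."""
    if atc <= -2:
        return "-2 or lower"
    if atc <= 0:
        return "-1/0"
    return "+1"  # no value above +1 occurred in the data


def mean_peak_by_category(observations):
    """Average peak position (% into syllable) within each ATC category.

    `observations` is an iterable of (atc, peak_percent) pairs.
    """
    groups = defaultdict(list)
    for atc, peak_percent in observations:
        groups[atc_category(atc)].append(peak_percent)
    return {category: mean(values) for category, values in groups.items()}


# INVENTED observations, for illustration only (not the SEC measurements).
toy_data = [(-3, 50.0), (-2, 60.0), (0, 90.0), (-1, 100.0), (1, 95.0)]
toy_means = mean_peak_by_category(toy_data)
```

With the real measurements in place of the toy data, the same grouping would yield the three means reported in the table.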

8.3 Evidence for discoursal constraints
A further categorisation of IO syllables was undertaken in order to test the hypothesis that discourse structure might also exert an effect on the timing of the f0 peak, namely a further rightward push. The IOs were divided according to whether they occurred at the beginning of a news item or not (categorised here as "topic-initial" and "topic-medial"). See Table 2.

Table 2

ATC            Mean peak position (% into IO syllable)
               topic-initial    topic-medial
-2 or lower    80.7             50.2
-1/0           103.4            91.8
+1             106              90.2

It is clear from these results that while the effects of segmental and low-level prosodic context remain, i.e. that the timing of the f0 peak is much earlier in a syllable with an ATC of -2 or lower than in others, those IOs which are topic-initial tend to have a consistently later f0 peak than those which are topic-medial. The effect is again most marked in those IOs which have a strong negative weighting, but we note that in the absence of negative weighting the f0 peak is still delayed, to the extent of occurring outside the accented syllable.
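The size of the topic-initial delay in each category can be read off Table 2 by simple subtraction. The sketch below uses the means reported in the table; the variable and function names are ours, and the differences are descriptive only, since no significance test is reported.

```python
# Mean peak positions (% into IO syllable) as reported in Table 2:
# category -> (topic-initial, topic-medial)
TABLE_2 = {
    "-2 or lower": (80.7, 50.2),
    "-1/0": (103.4, 91.8),
    "+1": (106.0, 90.2),
}


def topic_initial_delay(table):
    """Difference (topic-initial minus topic-medial) per ATC category,
    in percentage points of syllable duration, rounded to one decimal."""
    return {
        category: round(initial - medial, 1)
        for category, (initial, medial) in table.items()
    }


delays = topic_initial_delay(TABLE_2)
# Category in which topic-initiality delays the peak most.
largest = max(delays, key=delays.get)
```

On these figures the delay is about 30.5 percentage points for syllables with an ATC of -2 or lower, against 11.6 and 15.8 for the other two categories, consistent with the observation that the effect is most marked under strong negative weighting.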

9. Discussion

9.1 Segmental effects and prosodic context
This analysis has shown first of all clear confirmation in unconstrained natural data of those segmental and prosodic influences on peak timing observed in experimental studies, particularly those constraints assumed to exert a leftward pull. Further studies might usefully investigate the relative strength of effect of segmental and prosodic constraints, and indeed whether the rather crude unitary measure used here to assess the overall weighting should be refined.

9.2 Discourse effects
In addition the results suggest that the topic structure of a text has an effect on peak timing, exerting a strong rightward push even to the extent of causing the peak to occur beyond the stressed syllable itself. Since a tendency has already been established for topic-initial IOs to be markedly higher in pitch than those which are topic-medial, it could be thought that the delayed peak may simply be a function of increased height, on the assumption that the greater the step up in pitch, the longer it may take for a speaker to reach the target. The text itself contains evidence that this is not necessarily the case. The very first utterance in the broadcast (omitted from the analysis since it is meta-textual -- "Now it's one o'clock.." -- and not part of the news) begins with a very high IO in which the f0 peak nonetheless occurs at the beginning of the syllable. This speaker at least is therefore physically able to combine a high onset with an early peak.

9.3 Phonological implications
The results described above also have implications for the categorising of tonal contours. The f0 contour which results from a delayed peak on an onset perceived as "level" is physically very similar to that on an onset marked as rising. It is interesting to speculate on the factors influencing the transcribers' choice of phonological categories here. It may be that they intuitively factor out the discoursal effect on peak timing in what is phonologically classed as a level tone (or high head), but it is not clear what additional acoustic cues distinguish this perceptually from an onset which they categorise as a rising tone.

Finally we would like to suggest that the observations reported earlier on the differences between nuclear and non-nuclear accent timing lend themselves to re-interpretation. The greater perceived "finality" of an early peak in a nucleus appears to have its counterpart in the discoursally greater "initiality" of a late peak in an intonational onset. Although we have no corresponding perceptual evidence, we would suggest that timing differences which have been ascribed in the past to the distinction between nuclear and non-nuclear contours, might more appropriately be ascribed to the effects of initiality and finality.

References
Brown, G. & G. Yule (1983) Discourse Analysis, Cambridge: CUP

Bruce, G. (1983) "Accentuation and timing in Swedish", Folia Linguistica 17 (1-2), 221-238

Bruce, G. (1986) "How floating is focal accent?". Paper presented at Nordic Prosody IV, Middelfart, Denmark, June 1986

Bruce, G. (1990) "Alignment and composition of tonal accents: comments on Silverman & Pierrehumbert's paper", Kingston & Beckman (eds), Papers in Laboratory Phonology I, Cambridge: CUP, 107-114

Caspers, J. & V. van Heuven (1993) "Effects of time pressure on the phonetic realization of the Dutch accent-lending pitch rise and fall", Phonetica 50, 161-171

Fujisaki, H., K. Hirose, K. Ohta (1979) "Acoustic features of the fundamental frequency contours of declarative sentences in Japanese", Annual Bulletin of the Research Institute for Logopedics and Phoniatrics (Tokyo), vol 13, 163-172

Hart, J. 't, R. Collier, A. Cohen (1990) A Perceptual Study of Intonation, Cambridge: CUP

House, J. (1989) "Syllable structure constraints on f0 timing", poster presentation, LabPhon II, Edinburgh

Kohler, K. (1983) "Prosodic boundary signals in German", Phonetica 40, 89-134

Kohler, K. (1987) "Categorical pitch perception", Proc. XIth ICPhS, Tallinn, vol 5, 331-333

Kohler, K. (1990) "Macro & micro f0 in the synthesis of intonation", Kingston & Beckman (eds), Papers in Laboratory Phonology I, Cambridge: CUP, 115-138

Knowles, G., L. Taylor, B.J. Williams (forthcoming) The Spoken English Corpus, London: Longman

Ladd, D.R. (1995) "'Linear' and 'overlay' descriptions: an autosegmental-metrical middle way", Proc. XIIIth ICPhS vol.2, Stockholm, 116-123

Möbius, B. (1994) "A quantitative model of German intonation and its application to speech synthesis", Proc. 2nd ESCA/IEEE Workshop on Speech Synthesis, New Paltz, 139-142

Möbius, B. (1995) "Components of a quantitative model of German intonation", Proc. XIIIth ICPhS vol.2, Stockholm, 108-115

Pierrehumbert, J. (1980) "The phonology and phonetics of English intonation", PhD dissertation, MIT, IULC (1987)

Pierrehumbert, J. & M. Beckman (1988) Japanese Tone Structure, Linguistic Inquiry monograph 15, Cambridge, Mass: MIT Press

Pierrehumbert, J. & S. Steele (1989) "Categories of tonal alignment in English", Phonetica 46, 181-196

Prieto, P., J. van Santen & J. Hirschberg (1995) "Tonal alignment patterns in Spanish", J Phon 23, 429-451

Rietveld, T. & C. Gussenhoven (1995) "Aligning pitch targets in speech synthesis: effects of syllable structure", J Phon 23, 375-385

Santen, J. van & J. Hirschberg (1994) "Segmental effects on timing and height of pitch contours", Proc. ICSLP 94, Yokohama, 719-722

Silverman, K. (1987) "The structure and processing of fundamental frequency contours", PhD dissertation, University of Cambridge

Silverman, K. (1990) "The separation of prosodies: comments on Kohler's paper", Kingston & Beckman (eds), Papers in Laboratory Phonology I, Cambridge: CUP, 139-151

Silverman, K. & J. Pierrehumbert (1990) "The timing of prenuclear high accents in English", Kingston & Beckman (eds), Papers in Laboratory Phonology I, Cambridge: CUP, 72-106

Steele, S. (1986) "Nuclear accent f0 peak location: effects of rate, vowel, and number of following syllables", JASA Supplement 1, 80, s51

Swerts, M. (1994) "Prosodic features of discourse units", PhD dissertation, Technical University Eindhoven

Swerts, M., D.G. Bouwhuis, R. Collier (1994) "Melodic cues to perceived 'finality' of utterances", Journal of the Acoustical Society of America 96(4), 2064-2075

Wichmann, A. (1991a) "Beginnings, middles and ends: intonation in text and discourse", PhD dissertation, Lancaster University

Wichmann, A. (1991b) "Falls: variability and perceptual effects", Proc. XIIth ICPhS vol 5, Aix-en-Provence, 194-197

Wichmann, A. (1992) "F0 peak position as a cue to text structure", oral presentation, 13th ICAME, Nijmegen
