I thought you might be interested in this document about prosody
from Alex Monaghan. The idea is to help set up the work plan for
COST 258. If you have any comments you would like me to carry to
the meeting on May 18th, then let me know.
Mark
-------------------------------------------------------------
COST 258 - Improvements in Naturalness of Synthetic Speech
State-of-the-Art Summary of European Synthetic Prosody
Alex Monaghan
May 1998
This document summarises contributions from eighteen research groups
across Europe. The motivations, methods and manpower of these groups
vary greatly, and it is thus difficult to represent all their work
satisfactorily in a concise summary. I have therefore concentrated on
points of consensus, and I have attempted to include the major
exceptions to any consensus. An extensive list of references is
provided at the end, sorted by research group, for those requiring more
detail on individual research groups.
The summary starts with a brief overview, followed by a section
devoted to three different aspects of prosody (pitch, timing and
intensity). Next comes a summary of the various methodologies for
synthetic prosody research in Europe, followed by an indication of the
applications (existing or envisaged) of this research. Finally, I
outline some key issues for future work, many of which should be
addressed by COST 258.
I am indebted to the seventeen colleagues who provided summaries of
their own work, and I have deliberately stuck very closely to the text
of those summaries in many cases. There is important European work on
synthetic prosody which is still missing from this summary, and we
should attempt to remedy that as a matter of urgency.
OVERVIEW
In contrast to US or Japanese work on synthetic prosody, European
research has no standard approach or theory. In fact, there are
generally more European schools of thought on modelling prosody than
there are European languages whose prosody has been modelled. We have
representatives of the linguistic, psycho-acoustic and stochastic
approaches, and within each of these approaches we have phoneticians,
phonologists, syntacticians, pragmaticists, mathematicians and
engineers. Nevertheless, certain trends and commonalities emerge.
Firstly, the modelling of fundamental frequency is still the goal of the
majority of prosody research. Duration is gaining recognition as a major
problem for synthetic speech, but amplitude continues to attract very
little attention in synthesis research. Most workers acknowledge the
importance of interactions between these three aspects of prosody, but
as yet very few have devoted significant effort to investigating such
interactions.
Secondly, synthesis methodologies show a strong tendency towards
stochastic approaches. Many countries which have not been at the
forefront of international speech synthesis research have recently
produced speech databases and are attempting to develop synthesis
systems from these. Methodological details vary from neural nets trained
on automatically aligned data to rule-based classifiers based on
hand-labelled corpora. In addition, these stochastic approaches tend to
concentrate on the acoustic phonetic level of prosodic description,
examining phenomena such as average duration and F0 by phoneme or
syllable type, lengths of pause between different lexical classes,
classes of pause between sentences of different lengths, and constancy
of prosodic characteristics within and across speakers. These are all
phenomena which can be measured without any labelling other than
phonemic transcription and part-of-speech tagging.
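Such corpus measurements are straightforward to compute. The following sketch (with an invented miniature corpus; a real database would supply the aligned labels) illustrates the kind of per-phoneme statistics involved:

```python
from statistics import mean

# Invented miniature corpus: (phoneme, duration in ms, mean F0 in Hz),
# as would come from a phonemically transcribed and aligned database.
corpus = [
    ("a", 110, 121.0), ("a", 95, 118.5), ("t", 60, 0.0),
    ("t", 72, 0.0), ("n", 80, 110.0), ("a", 130, 125.0),
]

def stats_by_phoneme(tokens):
    """Group tokens by phoneme label; report mean duration and mean F0."""
    groups = {}
    for phon, dur, f0 in tokens:
        groups.setdefault(phon, []).append((dur, f0))
    return {p: (mean(d for d, _ in g), mean(f for _, f in g))
            for p, g in groups.items()}

print(stats_by_phoneme(corpus))
```

The same grouping generalises directly to syllable types, pause contexts or part-of-speech classes.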
Ironically, there is also widespread acknowledgement that structural and
functional categories are the major determinants of prosody, and that
therefore synthetic prosody requires knowledge of syntax, semantics,
pragmatics, and even emotional factors. None of these is easily
included in spoken corpora, and they therefore tend to be ignored in
practice by stochastic research. Compared with US research, European work seems
generally to avoid the more abstract levels of prosody, although there
are of course exceptions.
The applications of European work on synthetic prosody range from R&D
tools (classifiers, phoneme-to-speech systems, mark-up languages),
through simple TTS systems and limited-domain CSS applications, to
fully-fledged unrestricted text input and multimedia output systems, IR
front ends, and talking document browsers. For some European languages,
even simple applications have not yet been fully developed: for others,
the challenge is to improve or extend existing technology to include new
modalities, more complex input, and more intelligent or natural-sounding
output.
The major obstacles to progress in most cases seem to me to be twofold:
- what is the information which synthetic prosody should convey?
- what are the phonetic correlates which will convey it?
For the less ambitious applications, such as tools and restricted text
input systems, it is important to ascertain which levels of analysis
should be performed and what prosodic labels can reliably be generated.
The objective is often to avoid assigning the wrong label, rather than
to try to assign the right one: if in doubt, make sure the prosody is
neutral and allow the user to decide on an interpretation. For the more
advanced applications, such as "intelligent" interfaces and rich-text
processors, the problem is often to decide which aspects of the
available information should be conveyed by prosodic means, and how the
phonetic correlates chosen to convey those aspects are related to the
phonetic correlates of the document or discourse as a whole: when faced
with a text which contains italics, bolding, underlining, capitalisation,
and various levels of sectioning, what are the hierarchic relations
between these different formattings and can they all be encoded in the
prosody of a spoken version?
Interestingly, many of the perceived problems at the segmental level,
which are generally regarded as signal processing or coding issues, may
be resolved if these twin obstacles to naturalistic prosody can be
surmounted.
PITCH, TIMING & INTENSITY
-------------------------
As stated above, the majority of European work on prosody has
concentrated on pitch, with timing a close second and intensity a poor
third. Other aspects of prosody, such as voice quality and spectral
tilt, have been almost completely ignored for synthesis purposes.
With only one exception (University of Helsinki), all the institutions
that expressed an interest in prosody also have an interest in the synthesis
of pitch contours. Three institutions (University of Joensuu, IKP Bonn,
and IPO Eindhoven) have concentrated entirely on pitch. All others
report results or work in progress on pitch and timing. Three
institutions (University of Helsinki, Czech Academy of Sciences, and the
Institute of Phonetics in Prague) make significant reference to
intensity.
Pitch
-----
Research on pitch (fundamental frequency or abstract intonation
contours) is mainly at a very concrete level. The "J. Stefan" Institute
in Slovenia is a typical case, concentrating on "the microprosody
parameters for synthesis purposes, especially ... modelling of the intra
word F0 contour." The two Czech laboratories take a similar stochastic
corpus-based approach, as does GAPS (Madrid). The next level of
abstraction is to split the pitch contour into local and global
components: here, the Fujisaki model is the commonest approach (J.
Stefan, LAIP, IKP), although there is a home-grown alternative developed
at Aix-en-Provence.
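For reference, the Fujisaki model superimposes phrase and accent components on a speaker baseline in the log-F0 domain. A minimal sketch, with purely illustrative parameter values rather than fitted ones:

```python
import math

def Gp(t, alpha=2.0):
    """Phrase component: impulse response of a second-order system."""
    return alpha ** 2 * t * math.exp(-alpha * t) if t >= 0 else 0.0

def Ga(t, beta=20.0, gamma=0.9):
    """Accent component: step response, clipped at a ceiling gamma."""
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma) if t >= 0 else 0.0

def ln_f0(t, Fb=80.0, phrases=((0.0, 0.5),), accents=((0.3, 0.6, 0.4),)):
    """ln F0(t) = ln Fb + phrase commands + accent commands.

    phrases: (onset time T0, amplitude Ap); accents: (T1, T2, amplitude Aa).
    """
    y = math.log(Fb)
    for T0, Ap in phrases:
        y += Ap * Gp(t - T0)
    for T1, T2, Aa in accents:
        y += Aa * (Ga(t - T1) - Ga(t - T2))
    return y

# An F0 contour sampled every 10 ms over one second:
contour = [math.exp(ln_f0(i / 100.0)) for i in range(101)]
```

The declining phrase component and localised accent humps together give the familiar declination-plus-accents shape.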
IKP has recently moved from the Fujisaki model to a "Maximum Based
Description" model. This model uses temporal alignment of pitch maxima
and scaling of those maxima within a speaker-specific pitch range,
together with sinusoidal modelling of accompanying rises and falls, to
produce a smooth contour whose minima are not directly specified. The
approach is similar to the Edinburgh model developed by Ladd, Monaghan
and Taylor for the phonetic description of synthetic pitch contours.
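A minimal sketch of this kind of maximum-based scheme shows how smooth minima can emerge from specified maxima alone; the half-cosine interpolation and parameter values here are illustrative assumptions, not IKP's actual formulation:

```python
import math

def mbd_contour(maxima, fmin=90.0, step=0.01, end=1.0):
    """Generate a smooth F0 contour from specified pitch maxima alone.

    maxima: list of (time in s, F0 in Hz) peaks; rises and falls are
    modelled as half-cosines, so the minima between peaks emerge rather
    than being specified directly.
    """
    n = round(end / step)
    anchors = [(0.0, fmin)] + sorted(maxima) + [(end, fmin)]
    contour = []
    for i in range(n + 1):
        t = i * step
        for (t0, f0), (t1, f1) in zip(anchors, anchors[1:]):
            if t0 <= t <= t1:
                frac = 0.0 if t1 == t0 else (t - t0) / (t1 - t0)
                w = (1 - math.cos(math.pi * frac)) / 2  # smooth half-cosine
                contour.append(f0 + (f1 - f0) * w)
                break
    return contour

# One accent peak at 0.5 s: the contour rises to 140 Hz and falls back.
c = mbd_contour([(0.5, 140.0)])
```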
The next level of abstraction, from actual pitch contours to typical or
stylised patterns, has also been investigated at Aix-en-Provence. The
best-known work on stylisation is of course that of IPO for Dutch and
English, and IPO has also taken stylisation one step further by
standardising slopes, excursion sizes, durations and so forth.
Only one institution seems to have experimented with all of these
approaches. The J. Stefan Institute in Slovenia lists Fujisaki, INTSINT,
IPO and stochastic modelling among its techniques. It is also unique in
mentioning the work of Pierrehumbert, which brings me to phonological
modelling.
Workers at KTH, IPO and Dublin have all developed phonological approaches to
intonation synthesis which model the pitch contour as a sequence of pitch
accents and boundaries. These approaches have been applied mainly to
Germanic languages, and have had considerable success in both laboratory
and commercial synthesis systems. The phonological frameworks employed
are based on the work of Bruce, 't Hart and colleagues, and Ladd
respectively. A fourth approach, that of Pierrehumbert and colleagues,
has been employed by various European institutions. The assumptions
underlying all these approaches are that the pitch contour realises a
small number of phonological events, aligned with key elements at the
segmental level, and that these phonological events are themselves the
(partial) realisation of a linguistic structure which encodes syntactic
and semantic relations between words and phrases at both the utterance
level and the discourse level.
Important outputs of this work include:
- classifications of pitch accents and boundaries (major, minor;
declarative, interrogative; etc.)
- rules for assigning pitch accents and boundaries to text or
other inputs
- mappings from accents and boundaries to acoustic correlates,
particularly fundamental frequency
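The third kind of output can be illustrated with a toy mapping from phonological labels to F0 targets; the label inventory and target values below are invented for exposition, since real mappings are derived from analysis of labelled speech:

```python
# Invented label inventory and target values, for illustration only.
F0_TARGETS = {
    "H*": {"peak_hz": 140, "align": "stressed-vowel onset"},   # high accent
    "L*": {"peak_hz": 85,  "align": "stressed-vowel onset"},   # low accent
    "H%": {"peak_hz": 130, "align": "phrase-final syllable"},  # rising boundary
    "L%": {"peak_hz": 75,  "align": "phrase-final syllable"},  # falling boundary
}

def targets(labels):
    """Map a phonological label sequence to phonetic F0 target values."""
    return [(lab, F0_TARGETS[lab]["peak_hz"]) for lab in labels]

print(targets(["H*", "L*", "L%"]))
```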
One problem with phonological work related to synthesis is that it has
generally aimed at specifying a "neutral" prosodic realisation of each
utterance. The rules were mainly intended for implementation in TTS
systems, and therefore had to handle a wide range of input with a small
amount of linguistic information to go on: it was thus safer in most
cases to produce a bland, rather monotonous prosody than to attempt to
assign more expressive prosody and risk introducing major errors. This
has led to the situation identified by LAIP and ETH, where their systems
(LAIPTTS and SVOX) can produce acceptable pitch contours for some
sentence types (declaratives, yes/no questions) but not for others, and
where the prosody for isolated utterances is much more acceptable than
that for longer texts and dialogues.
The problem of specifying pitch contours in larger contexts has been
addressed by projects at KTH, IPO, Dublin and elsewhere, but in most cases
the results are still quite inconclusive. The mappings from text to
prosody in larger units are dependent on many unpredictable factors
(speaking style, speaker's attitude, hearer's knowledge, and the
relation between speaker and hearer, to name but a few). In dialogue
systems, where the message to be uttered is generated automatically and
much more information is consequently available, the level of linguistic
complexity is currently very limited and does not give much scope for
prosodic variation. This issue will be returned to in the discussion of
applications below.
Timing
------
Work on this aspect of prosody includes the specification of segmental
duration, duration of larger units, pause length and speech rate.
Approaches to segmental duration are exclusively stochastic. They
include neural net models (University of Helsinki, Czech Academy of
Sciences, ICP Grenoble), inductive learning (J. Stefan Institute), and
statistical modelling (LAIP, Aix). The Aix approach is interesting, in
that it uses simple DTW techniques to align a natural signal with a
sequence of units from a diphone database: the best alignment is
assumed to be the one where the diphone midpoints match the phone
boundaries in the original.
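The alignment idea can be illustrated with a standard dynamic-time-warping cost computation; this is a generic textbook DTW over one-dimensional time sequences, not the actual Aix implementation:

```python
def dtw_cost(a, b):
    """Cumulative cost of the best dynamic-time-warping alignment
    of two 1-D sequences."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local mismatch between time points
            D[i][j] = d + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# Hypothetical diphone midpoints vs. phone boundaries (seconds):
midpoints  = [0.05, 0.14, 0.26, 0.40]
boundaries = [0.06, 0.15, 0.25, 0.41]
print(dtw_cost(midpoints, boundaries))
```

The lower the cumulative cost, the better the candidate alignment of diphone midpoints to the phone boundaries of the original signal.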
Some researchers (LAIP, Prague Institute of Phonetics, Aix, ICP)
incorporate rules at the syllable level, based particularly on Campbell's
work. The University of Helsinki is unusual in referring to the word
level rather than syllables or feet. The Prague Institute of Phonetics
refers to three levels of rhythmic unit, and is the only group to
mention such an extensive hierarchy although workers in Helsinki intend
to investigate phrase-level and utterance-level timing phenomena.
Several workers have investigated the length of pauses between units.
Most others express their intention to investigate pause duration during
COST 258. For Slovene, it is reported that "pause duration is almost
independent of the duration of the intonation unit before the pause",
and seems to depend on speech rate and on whether the speaker breathes
during the pause: there is no mention of what determines the speaker's
choice of when to breathe. KTH investigated pausing and other phrasing
markers in Swedish, based on analyses of the linguistic and information
structure of spontaneous dialogues: the findings included a set of
phrasing markers corresponding to a range of phonetic realisations
including pausing and pre-boundary lengthening. Colleagues in Prague
note that segmental duration in Czech seems to be related to boundary
type in a similar way, and workers in Aix suggest a four-way
classification of segmental duration to allow for boundary and other
effects: again, this is similar to suggestions by Campbell and
colleagues.
Speech rate is mentioned by several groups as an important factor and an
area of future research, but only the Prague Institute of Phonetics
claims to have developed rules for different rates and styles of
synthesis.
Intensity
---------
The importance of intensity, particularly its interactions with pitch
and timing, is widely acknowledged. Little work has been devoted to it
so far, with the exception of the two Czech institutions who have both
incorporated control of intensity into their TTS rules.
Languages
---------
Some of the different approaches and results above may be due to the
languages studied. These include Czech, Dutch, English, Finnish, French,
German, Slovene, Spanish and Swedish. In Finnish, for example, it is
claimed that pitch does not play a significant linguistic role. In
French and Spanish, the syllable is generally considered to be a much
more important timing unit than in English or Dutch.
There are, however, several important methodological differences which
are independent of the language under consideration. The next section
looks at some of the methodologies and the assumptions on which they are
based.
METHODOLOGIES
The commonest methodologies in European prosody research are the purely
stochastic corpus-based and the linguistic knowledge-based approaches.
The former is typified by the work of ICP or Helsinki, and the latter by
IPO or KTH. These methodologies differ essentially on whether the goal
of the research is simply to model certain acoustic events which occur
in speech (the stochastic approach) or to discover the contributions to
prosody of various non-acoustic variables such as linguistic structure,
information content and speaker characteristics (the knowledge-based
approach). This is nothing new, nor is it unique to Europe. There are,
however, some new and unique approaches both within and outside these
established camps which deserve a mention here.
Research at ICP, for example, differs from the standard stochastic
approach in that prosody is seen as "a direct encoding of meaning via
prototypical prosodic patterns". This assumes that no linguistic
representations mediate between the cognitive/semantic and acoustic
levels. The ICP approach is based on a corpus with transcription of
P-Centres, and has been applied to short sentences with varying
syntactic structures. Based on syntactic class (presumably a cognitive
factor) and attitude (e.g. assertion, exclamation, suspicious irony), a
neural net model is trained to produce prototypical durations and pitch
contours for each syllable.
Research at both ETH and Joensuu derives more from the knowledge-based
approach but is essentially concerned with assessing and combining the
various proposals from that approach rather than with applying or
developing a particular set of proposals. Research at Joensuu was noted
above as being unusually eclectic, and concentrates on assessing the
performance of different theoretical frameworks in predicting prosody.
ETH has similar concerns, namely to determine a set of symbolic markers
which are sufficient to control the prosody generator of a TTS system
and allow default prosody to be generated in the absence of such
markers. Both the evaluation of competing prosodic theories and the
compilation of a complete and coherent set of prosodic markers have
important implications for the development of speech synthesis mark-up
languages, which is discussed in the section on applications below.
LAIP and IKP both have a perceptual or psychoacoustic flavour to their
work. In the case of LAIP, this is because they have not found
linguistic factors to be adequate predictors of prosodic control:
speed and memory are important considerations for LAIPTTS, and complex
linguistic analysis is perhaps not worth the computational overheads.
For a neutral reading style, LAIP has found that perceptual and
performance-related prosody is an adequate substitute for linguistic
knowledge: evenly-spaced pauses, rhythmic alternations in stress and
speech rate, and an assumption of uniform salience of information lead
to an acceptable level of coherence and "fluency". However, these
measures are inadequate for predicting prosodic realisations in "the
semantically punctuated reading of a greater variety of linguistic
structures and dialogues", where the assumption of uniform salience does
not hold true.
Research at IKP has concentrated on the notion of "prominence", a
psycholinguistic measure of the degree of perceived salience of a
syllable and consequently of the word or larger unit in which that
syllable is the most prominent. IKP proposes a model where each syllable
is an ordered pair of segmental content and prominence value. In the
case of boundaries, the ordered pair is of boundary type (e.g. rise,
fall) and prominence value. These prominence values are presumably
assigned on the basis of linguistic and information structure, and
encode hierarchic and salience relations, allowing listeners to
reconstruct a prominence hierarchy and thus decode those relations.
The IKP theory assumes that listeners judge the prosody of speech not as
a set of independent perceptions of pitch, timing, intensity and so
forth, but as a single perception of prominence for each syllable:
synthetic speech should therefore attempt to model prominence as an
explicit synthesis parameter. "When a synthetic utterance is judged
according to the perceived prominence of its syllables, these judgements
should reflect the prominence values [assigned by the system]. It is the
task of the phonetic prosody control, namely duration, F0, intensity and
reductions, to allow the appropriate perception of the system parameter."
So far, there are no concrete proposals regarding how a prominence value
of, say, 20 is to be realised in the synthetic waveform.
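The ordered-pair representation itself is easy to sketch as a data structure; the prominence scale and example values below are invented, and IKP's actual inventory may differ:

```python
from typing import NamedTuple

class Syllable(NamedTuple):
    segments: str    # segmental content of the syllable
    prominence: int  # perceived salience (the scale here is invented)

class Boundary(NamedTuple):
    btype: str       # e.g. "rise" or "fall"
    prominence: int

# "tomorrow" with a nuclear accent on the second syllable, then a fall:
utterance = [
    Syllable("to", 8), Syllable("mor", 20), Syllable("row", 5),
    Boundary("fall", 12),
]

def most_prominent(units):
    """The nuclear syllable is the one with the highest prominence value."""
    sylls = [u for u in units if isinstance(u, Syllable)]
    return max(sylls, key=lambda s: s.prominence)

print(most_prominent(utterance))
```

A phonetic component would then have to map each prominence value onto duration, F0, intensity and reduction jointly, which is exactly the open problem noted above.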
APPLICATIONS
By far the commonest application of European synthetic prosody research
is in TTS systems, mainly laboratory systems but with one or two
commercial systems. Work oriented towards TTS includes KTH, IPO, LAIP,
ETH, Czech Academy of Sciences, Prague Institute of Phonetics, and
British Telecom.
Other applications include announcement systems (Dublin), dialogue
systems (KTH, IPO, BT, Dublin), and document browsers (Dublin). Some
institutions have concentrated on producing tools for prosody research
(Joensuu, Aix, UCL) or on developing and testing theories of prosody
using synthesis as an experimental or assessment methodology.
Current TTS applications typically handle unrestricted text in a robust
but dull fashion. As mentioned above, they produce acceptable prosody
for most isolated sentences and "neutral" text, but other genres (email,
stories, specialist texts, ...) rapidly reveal the shallowness of the
systems' processing. There are currently two approaches to this problem:
the development of dialogue systems which exhibit a deeper understanding
of such texts, and the treatment of rich-text input from which prosodic
information is more easily extracted.
Dialogue systems predict appropriate prosody in their synthesised output
by analysing the preceding discourse and deducing the contribution which
each synthesised utterance should make to the dialogue: is it commenting
on the current topic, introducing a new topic, contradicting or
confirming some proposition, or closing the current dialogue? Lexical,
syntactic and prosodic choices can be made accordingly. There are two
levels of prosodic analysis involved in such systems: the extraction of
the prosodically-relevant information from the context, and the mapping
from that information to phonetic or phonological specifications.
Extracting the relevant syntactic, semantic, pragmatic and other
information from free text is not currently possible. Limited domain
systems have been developed in Edinburgh and Dublin, but these systems
generally only synthesise a very limited range of prosodic phenomena
since that is all that is required by their input. The relation between
a speaker's intended contribution to a dialogue and the linguistic
choices which the speaker makes to realise that contribution is only
poorly understood: the incorporation of more varied and expressive
prosody into dialogue systems will require progress in the fields of NLP
and HCI amongst others.
More work has been done on the relation between linguistic information
and dialogue prosody. IPO has recently embarked on research into "pitch
range phenomena, and the interaction between the thematic structure of
the discourse and turn-taking." Research at Dublin is refining the
mappings from discourse factors to accent placement which were first
developed at Edinburgh in the BRIDGE spoken dialogue generation system.
Work at KTH has produced "a system whereby markers inserted in the text
can generate prosodic patterns based on those we observe in our analyses
of dialogues", but as yet these markers cannot be automatically deduced.
The practice of annotating the input to speech synthesis systems has led
to the development of speech synthesis mark-up languages at Edinburgh
and elsewhere. The type of mark-up ranges from control sequences which
directly alter the phonetic characteristics of the output, through more
generic markers such as <emphasis> or <question>, to document formatting
commands such as section headings. With such an unconstrained set of
possible markers, there is a danger that mark-up will not be coherent or
that only trained personnel will be able to use the markers effectively.
One option is to make use of a set of markers which is already used for
document preparation. Researchers in Dublin are working on prosodic
rules to translate common document formats (LaTeX, HTML, RTF, etc.) into
spoken output for a document browser, with interfaces to a number of
commercial synthesisers. BT are developing a multi-modal approach,
whereby speech can be synthesised from a range of different inputs and
combined with static or moving images: this seems relatively
unproblematic, given appropriate input.
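The middle level of mark-up can be sketched as a simple translation table from generic markers to prosodic settings; the marker names follow the examples in the text, but the settings attached to them are invented for illustration:

```python
import re

# Marker names follow the examples in the text; the prosodic settings
# are invented placeholders, not any particular system's values.
MARKER_PROSODY = {
    "emphasis": {"f0_scale": 1.3, "rate_scale": 0.9},
    "question": {"final_boundary": "rise"},
    "heading":  {"pre_pause_ms": 600, "f0_scale": 1.15},
}

def prosody_for(tagged_text):
    """Return (plain text, prosodic settings) for each <marker>...</marker> span."""
    spans = []
    for m in re.finditer(r"<(\w+)>(.*?)</\1>", tagged_text):
        tag, text = m.group(1), m.group(2)
        spans.append((text, MARKER_PROSODY.get(tag, {})))
    return spans

print(prosody_for("<heading>Results</heading> and <emphasis>very</emphasis> good"))
```

Keeping the marker set small and document-oriented, as suggested above, is what makes such a table feasible for untrained users.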
FUTURE WORK
In this section I have noted the intentions of affiliated institutions
to continue or re-direct their own research efforts, as well as two
suggestions for collaborative research within COST 258. The bulk of this
final section is devoted to a proposal-cum-overview submitted by UCL,
which relates to most of the issues mentioned above.
Several institutions have noted their intention to extend their prosodic
rules, either to new aspects of prosody (e.g. timing and intensity) or
to new classes of output (interrogatives, emotional speech, dialogue,
and so forth).
ETH has suggested that the standardisation of prosodic mark-up and the
development of a suitable prosody generator could be an important
objective for COST 258. A similar suggestion from Joensuu is to "work
on an interchange (or intermediate or super-set) coding standard for
intonation models which facilitates contrastive evaluation and
conversion", building on their assessment of various models to date.
IPO proposes to investigate segmental reduction as a function of
emphasis or prominence, and to improve the modelling of the voice source
waveform to produce different voice qualities. In addition, they propose
an investigation of word-level effects: "What we mean by this is
that in natural speech phonemes seem to pattern into words almost
automatically, while this is not true or anyway much less so of
synthetic speech. As a result, on-line word recognition in connected
speech seems to remain far below what would be expected on the basis of
intelligibility performance for isolated words." This raises the issue
of the relation between prosodic factors and intelligibility, which I
would like to pursue, but first I will insert an outline proposal from
UCL regarding future work for COST 258.
UCL Proposal on Prosody
-----------------------
1. Rhythm
Problem: synthetic speech has too rigid and regular a rhythm, which
makes it difficult to follow, uninteresting and monotonous.
Solution: create stronger rhythmic structure for utterances. There
are three parts to this: (i) better phrasing (see below), (ii) creating
co-ordination of rhythm across constituents of a phrase, and (iii)
linguistically-motivated adjustments to tempo across phrases. In
addition, rhythmic changes give rise to consequent changes in segmental
realisation, and these need to be investigated and modelled.
2. Phrasing
Problem: synthetic speech is often inappropriately phrased. The
incorrect use of phrasing destroys the meaningfulness of utterances
and disturbs the listener.
Solution: recognise that utterances constitute units of inter-related
meaning. The object of the phrasing component should be to bind
together meaning units. This will largely follow phrase-level
analysis, but be sensitive to the semantic roles the phrases play. Limited
improvements may still be possible from better syntactic parsing, but
since prosodic and syntactic phrasing do not necessarily coincide, it
may be that such an analysis is impossible from text given our current
understanding of computational linguistics. In this case we should
look to a semantic/pragmatic level of text mark-up.
3. Focus & Deaccenting
Problem: text-to-speech synthesis typically produces a neutral
declarative reading of text which is confusing in that all information
is given equal accentuation. Better results may be achieved in
applications (e.g. dialogue) where a discourse model is
implemented.
Solution: without prior discourse-based mark-up it may be difficult
to decide which components of a phrase need to be accented. It
may be easier to decide which components should be deaccented, on
the basis of an overlap between the current phrase and previous
ones. A better identification of noun compounds can lead to some
improvements in English synthesis.
4. Pitch patterns
Problem: (i) salient pitch events (e.g. accents) may be wrongly
located in the text, usually as a result of errors in 2 and 3 above; (ii)
systems use an impoverished inventory of possible pitch patterns (to
play safe), making the speech monotonous and difficult to listen to;
(iii) phonetic realisation of fundamental frequency patterns may be
unnatural in its alignment to the text and to syllable constituents.
Solution: (i) relies on discourse knowledge; (ii) modelling allowable
intonational variability in unmarked discourse contexts could reduce
the mechanical repetition effect; (iii) we need to relate F0 contour alignment
systematically to higher-level units of structure, and to gain a better
understanding of the low-level interaction between F0 and segments.
Discussion 1
------------
Many of the problems and solutions referred to by UCL have been known
for some time (e.g. Monaghan 1991), but may not have been discussed in
relation to all the languages involved in COST 258. The emphasis seems
to be on deeper linguistic analysis (semantics, discourse structure) and
on investigation of the correlation between those levels of analysis and
acoustic realisations. There are also several simpler measures
(constrained random variation of pitch contours, imposition of rhythmic
structure) which have been proposed before for English and French
respectively. The UCL proposals for prosody seem to be a very good
starting point for discussion of work within COST 258.
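The first of these simpler measures, constrained random variation, amounts to little more than bounded perturbation of F0 targets. A sketch, assuming a multiplicative bound of plus or minus 5%:

```python
import random

def vary_contour(targets_hz, max_dev=0.05, rng=None):
    """Constrained random variation: scale each F0 target by a factor
    drawn uniformly from [1 - max_dev, 1 + max_dev]."""
    rng = rng or random.Random(0)  # fixed seed keeps the output reproducible
    return [f * rng.uniform(1 - max_dev, 1 + max_dev) for f in targets_hz]

# Successive renderings of the same accent targets differ slightly,
# which is intended to reduce the mechanical repetition effect:
targets = [100.0, 140.0, 95.0]
print(vary_contour(targets))
```

The bound keeps the perturbation below anything that could change the phonological identity of an accent, which is what makes the variation "constrained".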
UCL also submitted a similar proposal on segmental quality, which I
include here because it seems to me that most of the segmental problems
are related to prosody issues and might be solved by prosodic
improvements alone.
UCL Proposal on Segmental Quality
---------------------------------
1. Segmental timing
Problem: the timing of syllable units in synthetic speech often seems
to be disjointed. Inappropriate durations disturb the listener and
tend to destroy a fluent rhythm to the utterance.
Solution: relate segmental timing to the larger-scale prosody of the
utterance and the syllable. This could be computed within a
conventional segment-based solution to timing prediction according
to the phonetic context. However, the more sophisticated the
linguistic description, the less frequently such contexts are observed in
training data (so-called lop-sided sparsity). A better solution is to
use a rule-based system within an appropriate metrical structure. In
this way regularities can be modelled at different levels in the
hierarchy, making best use of limited data.
2. Coarticulation on different scales
Problem: diphone and polyphone concatenation systems are
inevitably limited in the extent to which neighbouring syllables can
affect one another. However, the incorrect coarticulation of units
can lead to disturbing changes in perceived quality. Concatenation
cannot be extended to larger units or more sophisticated tagging
because of the sparseness of training data.
Solution: use a metrical framework to express the coarticulatory
interactions over larger domains. Identify the phonetic properties
which can be attributed to units of structure rather than individual
segments. Recognise the existence of a dominance hierarchy between
segment classes.
3. Quality changes due to prosody
Problem: current synthetic speech systems use prosodic
manipulation of concatenated units without any consequent
modifications to segmental quality. However, as rate is changed
vowel quality is affected, and the domain of influence of shared
coarticulatory features changes. Also, as pitch is changed voice
quality is affected.
Solution: research models of how segmental quality and voice
quality change as a function of rate and pitch.
4. Parametric description of speaker character and speaking style
Problem: applications of synthetic speech require a variety of
speakers and speaking styles. Current systems are extremely limited
in their range of speakers and styles.
Solution: research the construction of parametric models of speaker
characteristics or of speaking styles.
Discussion 2
------------
The segmental problems mentioned by UCL depend mainly on prosodic
structure or suprasegmental settings (VQ, rate, etc.). Existing work on
prosody, and proposals such as those by IPO above, may resolve many of
these problems. Certainly there is a need for segmental and prosodic
research to be integrated.
REFERENCES
IKP, Bonn, Germany
------------------
Möbius, B. (1993): Ein quantitatives Modell der deutschen Intonation -
Analyse und Synthese von Grundfrequenzverläufen. Tübingen: Niemeyer
Ladd, D.R.; Verhoeven, J.; Jacobs, K. (1994) "Influence of adjacent
pitch accents on each other's perceived prominence: two contradictory
effects." Journal of Phonetics 22
Terken, J. (1991) "Fundamental frequency and perceived prominence of
accented syllables." J. Acoust. Soc. Am. 87, 1768-1776
University of Helsinki, Finland
-------------------------------
[1] Karjalainen M. & Altosaar T. An object-oriented database for speech
processing. In Proceedings of the European Conference on Speech
Technology, 1993.
[2] Vainio M. & Altosaar T. Pitch, Loudness, and Segmental Duration
Correlates: Towards a Model for the Phonetic Aspects of
Finnish Prosody. In Proceedings of ICSLP '96 (H.T. Bunnell & W.
Idsardi, eds.), Vol. 3, 2052-2055, Philadelphia, 1996
[3] Vainio, M., Altosaar, T., Karjalainen, M. & Aulanko, R. Modeling Finnish
microprosody for speech synthesis. To appear in the ESCA Workshop on
Intonation in Athens, Greece, September 1997.
[4] Vainio, M., & Altosaar, T. Pitch, Loudness, and Segmental Duration
Correlates in Finnish Prosody. To appear in the proceedings of the
Nordic Prosody XIII, August 1996, Joensuu, Finland
LAIP, Lausanne, Switzerland
---------------------------
Bachenko J., Fitzpatrick E. (1990). A computational grammar of
discourse-neutral prosodic phrasing in English, Computational
Linguistics, 16. 155-170.
Bailly, G. (1983). Contribution à la détermination automatique de la
prosodie du français parlé à partir d'une analyse syntaxique.
Etablissement d'un modèle de génération. Thèse d'ingénieur, Institut
National Polytechnique de Grenoble.
Barbosa, P. A. (1994). Caractérisation et génération automatique de la
structuration rythmique du français. Thèse de Doctorat. U.R.A. CNRS
n°368 - INPG/ENSERG, Université Stendhal, Grenoble.
Beaugendre, F. (1994). Une étude perceptive de l'intonation du français.
Thèse de Doctorat en Sciences de l'Université Paris XI. LIMSI n°94-25.
Campbell, W.N. (1992). Syllable-based segmental duration. In G. Bailly et
al. (Eds.), Talking Machines. Theories, Models, and Designs (pp. 211-224).
Elsevier Science Publishers.
Delais, E. (1994). Prédiction de la variabilité dans la distribution des
accents et les découpages prosodiques en français. 20èmes Journées
d'Etude sur la Parole (pp. 379-384). Trégastel.
Delais-Roussarie, E. (1995). Pour une approche parallèle de la structure
prosodique: étude de l'organisation prosodique et rythmique de la phrase
française. Thèse de Doctorat, Université de Toulouse-Le Mirail.
Di Cristo, A. & Hirst, D. (1994). Rythme syllabique, rythme mélodique et
représentation hiérarchique de la prosodie du français. Travaux de
l'Institut de Phonétique d'Aix, 15. 13-24.
Jun, S-A. & Fougeron, C. (1995). The accentual phrase and the prosodic
structure of French. XIIIème Congrès International des Sciences
Phonétiques, 2 (pp. 722-725). Stockholm.
Keller, E., & Zellner, B. (1996). A timing model for fast French. York
Papers in Linguistics, 17, University of York. 53-75
Keller, E., Zellner, B., & Werner, S. (1997). Improvements in Prosodic
Processing for Speech Synthesis. Proceedings of Speech Technology in the
Public Telephone Network: Where are we today? September 1997 Rhodes
(Greece).
Keller, E., Zellner, B., Werner, S., and Blanchoud, N. (1993). The
prediction of prosodic timing: Rules for final syllable lengthening in
French. Proceedings ESCA Workshop on Prosody, September 27-29. Lund,
Sweden. 212-215.
Liberman, M., & Prince A. (1977). On stress and linguistic rhythm.
Linguistic Inquiry, 8, 249-336.
Martin, Ph. (1987). Structure rythmique de la phrase française. Statut
théorique et données expérimentales. Proceedings des 16e Journées
d'Etude sur la Parole (pp. 255-257). Hammamet.
Mertens, P. (1993). Intonational grouping, boundaries and syntactic
structure in French. Proceedings ESCA Workshop on Prosody, September
27-29. Lund, Sweden. 156-159.
Mertens, Piet. (1987). L'intonation du français. De la description
linguistique à la reconnaissance automatique. Thèse doctorale,
Katholieke Universiteit Leuven.
Pasdeloup, V. (1992). Durée intersyllabique dans le groupe accentuel en
français. Actes des 19èmes Journées d'Etudes sur la Parole (pp. 531-536).
Bruxelles.
Selkirk, E.O. (1984). Phonology and syntax: The relation between sound and
structure. MIT Press, Cambridge, MA.
van Santen, J.P.H. (1993). Timing in text-to-speech systems. Proceedings
of the 3rd European conference on speech communication and technology
(pp. 1397-1404). Berlin.
Wang, M. Q. & Hirschberg, J. (1992). Automatic Classification of
Intonational Phrase Boundaries. Computer Speech and Language, 6. 175-196.
Werner, S. (1996). Intonation Modelling for Speech Synthesis in French.
Communication au colloque Choix de technologies et structure générale
d'un système de synthèse de la parole. Séminaire DEA "Synthèse de la
parole", Universités Paris VII et Paris III, Paris, février 1996.
Zellner, B. (1996). Structures temporelles et structures prosodiques en
français lu. Revue Française de Linguistique Appliquée: La communication
parlée. 1. (pp. 7-23). Paris.
Zellner, B. (1997). Improving Speech Fluency in French through
Psycholinguistic Principles. 14th CALICO Annual Symposium, ISBN
1-890127-01-9. New York.
Zellner, B. (to appear). Fluidité en synthèse de la parole. Revue
d'Etudes de lettres, Université de Lausanne.
Institute of Phonetics, Prague, Czech Republic
----------------------------------------------
Dohalská-Zichová, M. - Duběda, T.: Rôle des changements de la durée et
de l'intensité dans la synthèse du tchèque, in: Proceedings XXIes
Journées d'étude sur la parole, Avignon, 1996, pp. 375-378
Dohalská-Zichová, M. - Mejvaldová, J.: Où sont les limites
phonostylistiques du tchèque synthétique, in: Proceedings XVIe Congrès
International des Linguistes, Paris, 1997 (in print)
Dohalská-Zichová, M. - Hedánek, J.: Comparing Some Parameters of
Continuously Accelerated Natural Speech Signal, in: Speech Processing,
H.-W. Wodarz (ed.), Forum Phoneticum 63, Frankfurt a. Main, 1997, pp.
13-22
Dohalská-Zichová, M. - Mejvaldová, J.: Generation of Duration, Intensity
and F0 Contours in Short Czech Synthetic Sentences, 7th Czech-German
Workshop - Speech Processing, IRE AS CR, Prague, 1997, R. Vích (ed.)
(in print)
Janíková, J.: The melodic contrast of the yes-no-question and the
unfinished utterance in three-syllable stress groups, diploma work,
Institute of Phonetics, Prague, 1997
Palková, Z. - Ptáček, M.: Modelling prosody in TTS diphone synthesis in
Czech, in: Speech Processing, H.-W. Wodarz (ed.), Forum Phoneticum 63,
pp. 59-77, Frankfurt a. Main, 1997
Palková, Z. - Ptáček, M.: Prosody Modifications in Text, in: Speech
Processing, 6th Czech-German Workshop, Prague, 1996, (Abstract) R. Vích
(ed.) pp. 32-34
Palková, Z.: Modelling Intonation in Czech: Neutral vs. marked TTS F0
patterns, in: Intonation: Theory, Models and Applications, proceedings
of an ESCA Workshop, A. Botinis, G. Kouroupetroglou, G. Carayannis
(eds.), pp. 267-270, Athens, 1997 (in print)
Palková, Z.: Modelling emphatic prominence in TTS, abstract, 7th
Czech-German Workshop - Speech Processing, IRE AS CR, Prague, 1997, R.
Vích (ed.) (in print)
Ptáček, M. - Horák, P.: Czech Diphone Synthesis with the New Diphone
Inventory, abstract, 7th Czech-German Workshop - Speech Processing, IRE
AS CR, Prague, 1997, R. Vích (ed.) (in print)
Vích, R. - Přibyl, J. - Ptáček, M.: Cepstrales Sprachsynthesesystem für
die Tschechische Sprache, in: Studientexte zur Sprachkommunikation, Heft
14 - Elektronische Sprachsignalverarbeitung, K. Fellbaum (ed.),
Brandenburgische Universität Cottbus, 1997
KTH, Stockholm, Sweden
----------------------
Ayers G, Bruce G, Granström B, Gustafson K, Horne M, House D & Touati P
(1995). "Modelling intonation in dialogue." In: Elenius K & Branderud P,
eds, Proc of XIII Intl Congress of Phonetic Sciences (ICPhS 95), Aug 1995,
Stockholm, 2, pp. 278-281.
Bruce, G., Granström, B. (1990). "Modelling Swedish prosody in
text-to-speech: phrasing". In K. Wiik & I. Raimo (eds.) Nordic Prosody V,
pp. 26-35. Phonetics Department, Turku University.
Bruce, G., B. Granström, K. Gustafson & D. House. (1991), "Prosodic
phrasing in Swedish", Working Papers 38, pp. 5-17. Department of
Linguistics and Phonetics, Lund University.
Bruce, G., B. Granström, K. Gustafson & D. House. (1993). "Interaction of
F0 and duration in the perception of prosodic phrasing in Swedish." Proc.
Nordic Prosody VI, ed. B. Granström and L. Nord, Stockholm: Almqvist &
Wiksell International, pp. 7-22.
Bruce, G., B. Granström, K. Gustafson & D. House. (1993), "Phrasing
strategies in prosodic parsing and speech synthesis." Proc. EUROSPEECH
'93, Berlin, 21-23 September, 1993.
Bruce G, Granström B, Gustafson K, House D and Touati P (1994). "Modelling
Swedish prosody in a dialogue framework." Proceedings ICSLP 94, pp.
1099-1102, Yokohama.
Bruce, G., Granström, B., Gustafson, K., House, D. & Touati, P. (1994).
"Preliminary report from the project, 'Prosodic Segmentation and
Structuring of Dialogue'." In FONETIK '94, Working papers from the 8th
Swedish Phonetics Conference, May 24-26, Lund, Sweden, pp. 34-37.
Bruce G, Granström B, Filipsson M, Gustafson K, Horne M, House D, Lastow B
and Touati P (1995). "Speech synthesis in spoken dialogue research." In:
Proceedings EUROSPEECH 95, pp. 1169-1172 (Madrid).
Bruce G, Granström B, Gustafson K, Horne M, House D & Touati P (1995).
"Towards an enhanced prosodic model adapted to dialogue applications." In:
Dalsgaard P et al., eds, Proc of ESCA Workshop on Spoken Dialogue Systems,
May-June 1995, Vigsø, Denmark, pp. 201-204.
Bruce G, Filipsson M, Frid J, Granström B, Gustafson K, Horne M, House D,
Lastow B and Touati P. (1996). "Developing the modelling of Swedish
prosody in spontaneous dialogue." Proc. ICSLP 96 pp. 370-373
(Philadelphia).
Bruce G & Granström B (1996). "Prosodic modelling in Swedish speech
synthesis." In: Fant G, Hirose K & Kiritani S, eds. Analysis, Perception
and Processing of Spoken Language. Festschrift for Hiroya Fujisaki.
Amsterdam, The Netherlands: Elsevier Science B.V., pp. 62-73.
Bruce G, Frid J, Granström B, Gustafson K, Horne M & House D. "Prosodic
segmentation and structuring of dialogue." (To be publ in Proc Nordisk
Prosodi VII, Aug 1996, Joensuu, Finland).
Bruce G, Frid J, Granström B, Gustafson K, Horne M & House D (1996).
"Prosodic segmentation and structuring of dialogue." TMH-QPSR, KTH,
3/1996, pp. 1-6.
Bruce G, Granström B, Gustafson K, Horne M, House D & Frid J (1996). "The
Swedish intonation model in interactive perspective." In: Proc of Fonetik
96, Swedish Phonetics Conference, Nässlingen, May 1996. TMH-QPSR, KTH,
2/1996, pp. 19-24.
Bruce G, Filipsson M, Frid J, Granström B, Gustafson K, Horne M & House D
(1997). "Global features in the modelling of intonation in spontaneous
Swedish." In: Botinis A, Kouroupetroglou G & Carayannis G, eds. Proc of
ESCA workshop on Intonation: Theory, Models and Applications, Athens,
Sept 1997; pp. 59-62.
Bruce G, Filipsson M, Frid J, Granström B, Gustafson K, Horne M & House D
(1997). "Modelling intonation in spontaneous speech." In: Bannert R,
Heldner M, Sullivan K & Wretling P, eds., Proc of Fonetik -97, Dept of
Phonetics, Umeå Univ., Phonum 4, pp. 173-174.
Bruce G, Granström B, Gustafson K, Horne M, House D & Touati P (1997).
"On the analysis of prosody in interaction." In: Sagisaka Y, Campbell N &
Higuchi N, eds. Computing Prosody. Computational Models for Processing
Spontaneous Speech. New York: Springer, 1997, pp. 43-59.
Bruce G, Filipsson M, Frid J, Granström B, Gustafson K, Horne M & House D
(1997). "Text-to-intonation in spontaneous Swedish." In: Kokkinakis G,
Fakotakis N & Dermatas E, eds., Proc of Eurospeech '97, 5th European
Conference on Speech Communication and Technology, Rhodes, Greece. 1, pp.
215-218.
Institute of Phonetics, Aix-en-Provence, France
-----------------------------------------------
Astésano, C.; Espesser, R.; Hirst, D.J. & Llisterri, J. 1997.
Stylisation automatique de la fréquence fondamentale : une évaluation
multilingue. 4e Congrès Français d'Acoustique, 14-18 avril 1997,
Marseille, 441-443.
Campbell, W.N. 1992. Multi-level Timing in Speech, PhD Thesis,
University of Sussex.
Campione, E., Flachaire, E., Hirst, D.J. & Véronis, J. 1997. Stylisation
and symbolic coding of F0, a quantitative approach. ESCA Tutorial and
Research Workshop on Intonation, 18-20 September, Athens.
Chan, D., Fourcin, A., Gibbon, D., Granström, B., Huckvale, M.,
Kokkinas, G., Kvale, L., Lamel, L., Lindberg, L., Moreno, A.,
Mouropoulos, J., Senia, F., Trancoso, I., Veld, C., Zeiliger, J. 1995.
EUROM: a spoken language resource for the EU. Proceedings of the 4th
European Conference on Speech Communication and Speech Technology,
Eurospeech '95, Madrid, vol. 1, 867-880.
Courtois, F., Di Cristo, Ph., Lagrue, B., Véronis, J. 1997. Un modèle
stochastique des contours intonatifs en français pour la synthèse à
partir des textes. 4ème Congrès Français d'Acoustique, Marseille, avril
1997, 373-376.
Dalsgaard, P. Andersen, O. & Barry, W. 1991. "Multi-lingual alignment
using acoustic-phonetic features derived by neural-network technique."
ICASSP-91, 197-200.
Di Cristo, A.; Di Cristo, P.; Véronis, J. 1997. A metrical model of
rhythm and intonation for French text-to-speech. ESCA Workshop on
Intonation: Theory, Models and Applications. Athens, September 1997.
Di Cristo, Ph. & Hirst, D.J. 1997. Un procédé d'alignement automatique
de transcriptions phonétiques sans apprentissage préalable. 4e Congrès
Français d'Acoustique, 14-18 April, Marseille, 425-428.
Dutoit, T. 1997. An Introduction to Text-to-Speech Synthesis. Kluwer
Academic Publishers, Dordrecht.
Malfrère, F. & Dutoit, T. 1997. High quality speech synthesis for
phonetic speech segmentation, EuroSpeech 97, Rhodes.
Hirst, Daniel & Di Cristo, Albert (eds) 1997. Intonation Systems: a
Survey of Twenty Languages. Cambridge University Press, Cambridge [in
press].
Hirst, D.J.; A. Di Cristo, M. Le Besnerais, Z. Najim, P. Nicolas, P.
Roméas (1993) Multi-lingual modelling of intonation patterns.
Proceedings ESCA Workshop on Prosody. Lund, September 1993, 204-207.
Hirst, D.J., Di Cristo, A. & Espesser, R. 1997. Levels of representation
and levels of analysis for the description of intonation systems. In M.
Horne (ed) Prosody: Theory and Experiment. Kluwer Academic Publishers,
Dordrecht. [in press].
Mora, E., Hirst, D. and Di Cristo, A. Intonation features as a form of
dialectal distinction in Venezuelan Spanish. ESCA Workshop on
Intonation: Theory, Models and Applications. Athens, September 1997.
Talkin & C. Wightman (1994) The aligner. Proceedings ICASSP 1994.
Véronis, J., Hirst, D.J., Espesser, R., Ide, N. 1994. NL and speech in
the MULTEXT project. AAAI '94 Workshop on Integration of Natural
Language and Speech, 72-78.
Vorsterman, A., Martens, J.P. & Van Coile, B. (1996) Automatic
segmentation and labelling of multi-lingual speech data. Speech
Communication 19, 271-293.
Dublin
------
W. N. Campbell, S. D. Isard, A. Monaghan & J. Verhoeven: `Duration, Pitch
and Diphones in the CSTR TTS System.' In the proceedings of ICSLP 1990,
Kobe, Japan, November 1990, pp. 825-828.
M. Delaney & A. Monaghan: `SAMBA: Prosody in an Airline Announcement
Generation System.' In the proceedings of ESCA Workshop on Intonation,
Athens, September 1997 (ISBN 960-695-000-X).
A. Monaghan: `Intonation Accent Placement in a Concept-to-Dialogue System.'
Proceedings of AAAI/ESCA/IEEE Conference on Speech Synthesis, New York,
September 1994, pp. 171-174.
A. Monaghan: `What Determines Accentuation?' Journal of Pragmatics 19, pp.
559-584 (1993)
A. Monaghan: `Heuristic Strategies for Higher-Level Analysis of
Unrestricted Text.' In G. Bailly & C. Benoit (eds) 1992, Talking Machines,
pp. 143-161. Amsterdam: Elsevier.
A. Monaghan: Intonation in a Text-to-Speech Conversion System. PhD thesis,
University of Edinburgh, 1991.
A. Monaghan: `Rhythm & Stress Shift in Speech Synthesis.' Computer Speech
and Language 4 (1) 1990, pp. 71-78.
A. Monaghan: `Phonological Domains for Intonation in Speech Synthesis.'
In the proceedings of Eurospeech 1989, vol. 1 pp. 502-505.
A. Monaghan: `Generating Intonation in the Absence of Essential
Information.' In Ainsworth & Holmes (eds), Speech 88: Proceedings of the
7th FASE Symposium, pp. 1249-1256, 1988.
A. Monaghan & D. R. Ladd: `Manipulating Synthetic Intonation for Speaker
Characterisation.' ICASSP 1991, vol. 1 pp. 453-456.