Please find attached a txt-document containing a draft proposal about
the treatment of speaking styles and the preparations needed for
the COST258 spring meeting in Lausanne. To those who feel that there is
much to say and improve about the proposals, I'd like to say that
I fully agree and that this is precisely the aim of the memo: to elicit
comments and suggestions as to how to proceed. So if you have any
comments, most notably on the range of speaking styles to be covered,
the purpose of the corpus and the amount of speech to be recorded,
please send them at your earliest convenience.
Also, I would appreciate it if you could circulate it to colleagues
at your site who are not receiving this message but who were involved
in the previous COST meeting, as well as to others who might have
relevant comments.
Best regards,
Jacques Terken
====================================================================
COST258: SPEAKING STYLES FOR SPEECH SYNTHESIS
ID: /terken/cost258/s-styles.txt
DE: version 19.01.1999
1. BACKGROUND
At the COST258 meeting at Vigo, November 1998, it was decided to set up
an activity with the aim of creating a small common database of speaking
styles, to be exploited for testing automatic segmentation techniques
and for exploring differences in speaking styles. The aim of this
document is to initiate the preparations for setting up this database.
Preparations concern the definition of the speaking styles to be
included in the database, and of the recording techniques, including
the selection of speakers. Ideally, the remarks in this document will
inspire some discussion, on the basis of which more sophisticated
proposals can be made and agreement can be achieved.
Issues to be considered in making a decision are among other things
what we want to achieve by investigating stylistic variation, and how
this aim links to the goals of COST258.
In general, most laboratories have been working on one or only a few
speaking styles. This becomes clear as soon as a new kind of material
or setting is tested that requires another speaking style: the
available speaking style turns out to be inappropriate for the new
material or setting, and no principled way is available of generating
the appropriate formal properties. Ideally, a theory of communication
would relate
the factors relevant to variation in speaking style and the formal
properties embodying particular speaking styles, specify to what
extent the formal characteristics serve a functional, a conventional
and an esthetic purpose, and make it possible to generate a particular
speaking style given the parameter values for the relevant communicative
dimensions as input.
2. SPEAKING STYLES
Although only three papers in the Proceedings of ICSLP98 mention the
notion of speaking style, many authors refer to the differences in
speaking styles in their papers. For instance, S. Furui mentioned the
effect of speaking styles on automatic speech recognition performance
during a panel session at ICSLP98. However, authors usually do not
define the notion of speaking style except by exemplars, and they give
different examples:
- Furui (1998) takes a taxonomy derived from F. Juang, AT&T, and
mentions the following styles: isolated words; connected speech; read
speech; fluent speech; spontaneous speech.
- Bladon et al. link the notion of speaking style to the casual-formal
dimension.
- The Handbook of Standards and Resources for Spoken Language Systems
(1998) mentions the notion of "speaking style" several times. On p18
and p99 speaking styles range from read speech to several kinds of
spontaneous speech. On p85 and p191 speaking styles are linked directly
to observable properties such as speaking rate and voice height.
- Other authors link the notion of speaking styles to yet other
dimensions.
Given this lack of clarity, we need first to address the question of
what we mean by speaking style.
2.1 Elements of a definition of "speaking styles"
Since I found it difficult to determine what the notion of "speaking
styles" means precisely, I thought it might help to look at the notion
of "style" in another domain and define the notion of "speaking style"
by analogy.
In the domain of furniture, a style, e.g. the Victorian style, consists
of a set of formal, i.e., observable characteristics by which experts
may identify a particular piece of furniture as belonging to a particular
period and distinct from pieces belonging to different periods (i.e.,
"formal" is used here in the sense of "concerning the observable form").
The explanation for the style is in the ideas of the designer: the style
embodies a set of ideas of the designer about the way things should look.
Generalizing this observation, we might say that a style contains a
descriptive aspect ("what are the formal characteristics") and a normative
aspect (the ideas underlying the choice of these formal characteristics).
When we apply these considerations to the notion of style in speech,
we may say that a speaking style consists of a set of formal
characteristics of speech in a particular communicative situation.
The descriptive aspect concerns the observable properties that cause
different samples of speech to be perceived as distinct.
The normative aspect concerns the appropriateness of the manner of
speaking to the communicative situation (i.e., a particular speaking style
may be appropriate in one situation but completely inappropriate in
another one).
Summarizing, a speaking style consists of a set of formal characteristics
by which we may identify particular speaking behaviour as tuned to a
particular communicative situation. The communicative situation to which
the speaker tunes his speech and by virtue of which these formal
characteristics will differ, then, may be characterized in terms of at
least three dimensions: the task, the speaker and the surrounding
situation.
2.2 Relevant dimensions
2.2.1 The task
There are obvious differences between different manners of speaking
induced by the kind of materials that have to be spoken. A non-exhaustive
inventory will contain among other things the following items:
- materials to be spelled out
- isolated words
- numerals
- tabular information
- lists/enumerations
- sentences
- texts
- communication type: monologue/dialogue
Texts do not constitute a homogeneous group, but can be subdivided in
many ways (prose, poetry; e-mail messages, help information, instructions,
results of database queries, ...; ...)
Furthermore, the manner of speaking is dependent on the way the content
is determined:
- spontaneous
- rehearsed
- reading aloud
- verbatim recall
- ...
and on the rhetorical effect:
- convince/persuade
- inform
- enchant
- hypnotize
- enacted
- ...
2.2.2 The speaker
The manner of speaking is also influenced by characteristics of the
speaker, most notably by emotion and attitude:
- expressive/committed/involved
- dull
- friendly
- polite/formal
- casual
- sad
- angry
- solemn
- uncertain
- ...
by the authority of the speaker:
- does the speaker speak for himself, on his own authority,
or is he quoting someone else?
and by his habits vis-a-vis particular communicative situations.
2.2.3 The situation
Situational variables concern the presence of loud noise, the need for
confidentiality, the size of the audience and the room, the relation
between the speaker and the audience (expert, superior), etc. These lead to
different speaking modes:
- normal speech
- whispering
- Lombard speech
2.3 Summarizing
We have identified three clusters of factors which may induce particular
speaking styles: task variables, speaker-related variables and situational
factors. The task variables subdivide into materials, planning mode and
effect. The speaker variables have to do with psychological factors
(emotion, attitude) and social factors (power relations between speaker
and audience). The situational factors concern the effect of the situation
on the manner of speaking.
3. How to choose: Relevance
As is evident from section 2, speaking style is a multi-dimensional
concept, and particular speaking styles constitute as many points
in this multi-dimensional space. Obviously, large parts of the space
remain empty as there are dependencies between the different dimensions
(e.g., a list of numerals will very improbably be realized in an
enthusiastic or solemn manner). This is precisely why we are able to decide
that a particular set of formal characteristics is not appropriate in a
particular communicative situation.
The question is which points in the space should be recorded and
analysed given the limited resources available within COST258. It
was agreed that the aim of recording a corpus of speaking styles
would be to use it as material for testing automatic segmentation
techniques, and for exploring differences in speaking styles. I
interpret this in such a way that we want to test and improve the
performance of automatic segmentation techniques precisely for those
speaking styles we want to study. Thus, we first choose the speaking
styles to be studied, and then look at the performance of automatic
segmentation techniques for those materials.
Given the vast number of possible choices, a reasonable proposal
would appear to be to take into consideration the needs and interests of
COST, i.e., of the telecommunication domain. That is, from each of the
materials, speaker and situation dimensions we might choose those
values which are relevant to the telecommunication domain. I guess
this would rule out studying poetry, for instance.
With respect to the exploration of differences in speaking styles,
since telecom services will concern providing all sorts of information
to the user (directory assistance, e-mail reading, weather forecast,
financial information), and these information services will often
involve some kind of dialogue, it seems reasonable to take this as
inspiration for the choice of materials.
Thus, we might for instance select the following items from the
different dimensions:
Task: Materials:
- spelling, isolated words, numerals, enumerations, read text,
spontaneous monologue, dialogue (this would also cover the "content
production" aspect, since spontaneous speech (both monologue and
dialogue) and reading aloud (text) are both included).
Speaker: attitude/emotion:
- polite/formal/neutral, casual, friendly
Situation:
- normal, Lombard
Notes:
- The three classification factors Task, Speaker and Situation are not
supposed to be orthogonal. Thus, materials items such as spelling,
isolated words, numerals, enumerations, read text might be read only
in a neutral speaking style, and spontaneous monologue would most
likely lend itself best to a somewhat casual speaking style (one
might for instance ask the speaker to talk about his most recent
vacation or the like).
- This approach would not involve explicit manipulation of speaking
rate and voice height.
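
As a rough illustration of the notes above, the selected values can be
treated as a small grid from which implausible combinations are removed.
The sketch below is only an assumption-laden example, not an agreed
procedure: the rule set in is_plausible() is hypothetical and merely
encodes the first note (read-type materials in a neutral style only,
spontaneous monologue in a casual style), "neutral" stands in for
"polite/formal/neutral", and dialogue is left unconstrained.

```python
from itertools import product

# Values taken from the selection proposed above.
MATERIALS = [
    "spelling", "isolated words", "numerals", "enumerations",
    "read text", "spontaneous monologue", "dialogue",
]
ATTITUDES = ["neutral", "casual", "friendly"]  # "neutral" = polite/formal/neutral
SITUATIONS = ["normal", "Lombard"]

# Assumption: these materials are read in a neutral speaking style only.
NEUTRAL_ONLY = {
    "spelling", "isolated words", "numerals", "enumerations", "read text",
}


def is_plausible(material, attitude, situation):
    """Hypothetical compatibility rule between the three dimensions.

    The situation is accepted as-is; this sketch places no constraint on it.
    """
    if material in NEUTRAL_ONLY:
        return attitude == "neutral"
    if material == "spontaneous monologue":
        # Assumption: restricted to the casual style for simplicity.
        return attitude == "casual"
    return True  # dialogue: left unconstrained in this sketch


def candidate_styles():
    """Return all (material, attitude, situation) points that pass the rule."""
    grid = product(MATERIALS, ATTITUDES, SITUATIONS)
    return [point for point in grid if is_plausible(*point)]


if __name__ == "__main__":
    full = len(MATERIALS) * len(ATTITUDES) * len(SITUATIONS)
    kept = candidate_styles()
    print(f"{len(kept)} of {full} grid points remain")
    for point in kept:
        print(point)
```

Under these assumed rules fewer than half of the grid points remain to be
recorded; a different rule set would of course give a different count.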
Alternatively, we might find the link to the telecom domain too
constraining, and rather aim for a broader coverage, so as to take
a first step towards a general characterization of stylistic variation
in speech. However, this would seem rather ambitious, given the
resources available for COST258.
4. Further issues
4.1 Choice of speaker
Many laboratories use professional speakers. Budgetary considerations
will rule out the option of bringing professional speakers to Lausanne.
However, realizing the appropriate speaking styles under conditions of
reading aloud in a studio will be hard, and that's precisely the reason
why professional speakers are often preferred. So, we should decide
either to use non-professional but trained speakers, or to make recordings
in advance and transfer the recordings to Lausanne for inclusion on
the CD-ROM.
4.2 Recording procedures: requirements
- microphone
- EGG recordings?
4.3 Texts
[ to be specified ]
4.4 Miscellaneous
The original idea was to have 5 minutes of speech of a single speaker
recorded for each language. We should consider whether this is enough given the
variety of speaking styles that might be of interest. After all, 5 minutes
of speech covering a variety of speaking styles will quite likely
not be very helpful in giving quantitative answers to questions of interest.
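
To make the concern concrete, the arithmetic can be sketched as follows.
The style counts used here are purely hypothetical (no number of styles
has been agreed yet), and an even split of the 5 minutes over the styles
is assumed for simplicity.

```python
TOTAL_SECONDS = 5 * 60  # the original idea: 5 minutes of speech per language


def seconds_per_style(n_styles, total_seconds=TOTAL_SECONDS):
    """Average recording time per style, assuming an even split."""
    if n_styles <= 0:
        raise ValueError("n_styles must be positive")
    return total_seconds / n_styles


# Hypothetical numbers of speaking styles, for illustration only.
for n in (1, 5, 10, 20):
    print(f"{n:2d} styles -> {seconds_per_style(n):5.1f} s per style")
```

With 10 styles, for instance, only 30 seconds of speech would be available
per style and language.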
5. Time schedule and practical issues
5.1 Proposed time schedule
Febr 7       Agreed definition of concept of "speaking styles"
             as a multidimensional concept and of the dimensions
             concerned
Febr 7       Decision concerning speaker background
Febr 14      Specification of technical requirements
Febr 14      Choice of speaker
March 1      Choice of speaking styles
March 21     Specification of text materials
April 8-10   Workshop Lausanne
5.2 Practical issues
To begin with, I propose that COST258 participants react to the proposals
contained in this memo, especially concerning the choice of speaking styles
and the choice of speaker. The latter is also important with respect to
practical arrangements. With respect to the choice of speaking styles,
COST258 participants are requested in particular to indicate whether the
proposal to link the choice of materials types to the telecommunication
domain is accepted and to add to, modify and make more concrete the list
of materials types.