PALS1004 Introduction to Speech Science

9. Style and Emotion

Key Concepts

Paralinguistics is the study of how speech is spoken
Speaking style research studies variation due to demands of intelligibility, familiarity and formality
Emotional speech varies at least in terms of valence and activation
Research in emotional speech is particularly difficult since it is hard to manipulate genuine emotions
Physiological changes in an individual can also affect speech in predictable ways

Learning Objectives

At the end of this topic the student should be able to:

give an account of the notion of speaking styles
provide examples of different speaking styles and suggest ways in which speech production is adapted to different styles
give examples of the ways in which the emotional state of the speaker affects how an utterance is spoken
give an account of the difficulties in studying the effect of emotions on speech
provide examples of the ways in which health and physiological state affect speech

Topics

Paralinguistics

We have seen how changes in the segmental and suprasegmental aspects of speech can change the meaning of an utterance or change the speaker's intentions behind an utterance. Such changes are called linguistic variation, since they are about how language is used to convey information. But the same utterance (with the same meaning) can also be spoken in different ways depending on the communicative situation or on the emotional state of the speaker. These changes convey to the listener information about the speaker's feelings or attitude rather than information about the meaning of the utterance. These changes are called paralinguistic variation. It is interesting to note that listeners are quite good at extracting the speaker's attitude and emotional state from the speech signal, which suggests that paralinguistic information is encoded in a systematic manner.

There is also information in the speech signal that tells you about the person who is speaking: information about their identity, physical and physiological state. This is sometimes called extralinguistic variation.

Speaking Styles

It is a common experience that we can tell from a speech recording some information about the context or environment in which it was spoken. Disregarding the actual words used, we may be able to distinguish, for example:

Speech read from text
Speech directed at a child
Speech directed at a foreigner or a person with poor hearing
Speech spoken in a noisy place
Conversational speech between friends
A job interview
A politician's electoral speech
A sports commentary
A lecture

The study of how the communicative context affects the paralinguistic character of speech is called speaking style research.

Research into the paralinguistics of speaking styles is much less well developed than research into the linguistic aspects of speech. It is still not clear, for example, whether the identifiably different speaking styles listed above are points in some common speaking style space, or are arbitrary idiosyncratic styles having no relationship with one another. It has been suggested that each style can be positioned in a three-dimensional style space having dimensions of intelligibility, familiarity and social stratum (Eskenazi, 1993).

Recent research has looked at how speech varies in "clarity" according to the communicative context. The idea is that speech itself can be placed on an intelligibility dimension: sometimes speech is poorly articulated and hard to understand (hypo-speech) and sometimes it is well articulated and easy to understand (hyper-speech). Hypo-speech is sometimes called "talker-oriented" in that it arises in a situation where the speaker is only concerned with minimising the effort they put into speaking, while hyper-speech is called "listener-oriented" since it arises in a situation where the speaker is concerned with whether the listener is getting the message.

Where speech is placed along this hypo-hyper dimension can be seen to be influenced by features of the communicative context, as shown in the diagram below:

Speech is towards the hypo end of the dimension when you are having a conversation with a friend, the message content is predictable, the acoustic environment is good or the information communicated is not very important. Alternatively, speech is towards the hyper end of the dimension when you are trying to be clear to a stranger, when the message content is novel, the acoustic environment is poor or the information important.

How does speech itself change along the hypo-hyper dimension? To increase the clarity of speech, speakers tend to use strategies such as:

Speaking more slowly
Articulating segments more carefully
Raising vocal intensity (by increased lung pressure or by using modal phonation)
Raising pitch (often an inevitable by-product of increased lung pressure)
Increasing pitch variation

Which strategies are used to clarify speech can also vary according to further details of the communicative context. In a noisy place one might speak more loudly, but to a child one might speak more slowly.

Emotion

Another paralinguistic aspect of speech is the effect of emotion: listeners can often recognise the emotional state of the speaker from the way in which an utterance is produced. The main effects of emotional state seem to be on prosody: pitch level, pitch range, pitch dynamics, intensity, speaking rate, voice quality. However, similar changes can occur for different emotions, so it is not always easy to identify the emotion from the sound.
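The prosodic parameters listed above can be quantified once a pitch contour and a waveform are available. The sketch below is one possible illustration rather than a method prescribed by this course. It assumes that an F0 contour has already been extracted as a list of values in Hz (voiced frames only) and that waveform samples are scaled relative to an arbitrary reference amplitude of 1.0. The function names pitch_summary and rms_level_db, and all the numbers, are made up for the example.

```python
import math
from statistics import mean


def pitch_summary(f0_hz):
    """Return (pitch level in Hz, pitch range in semitones) for an F0 contour.

    f0_hz is assumed to hold F0 values for voiced frames only.
    """
    level = mean(f0_hz)
    # A ratio of 2:1 in frequency is one octave, i.e. 12 semitones.
    range_semitones = 12 * math.log2(max(f0_hz) / min(f0_hz))
    return level, range_semitones


def rms_level_db(samples, reference=1.0):
    """Return the RMS level of a block of samples in dB relative to `reference`."""
    rms = math.sqrt(mean([s * s for s in samples]))
    return 20 * math.log10(rms / reference)


# Hypothetical contours for the same sentence spoken in two ways.
neutral = [110.0, 118.0, 125.0, 115.0, 105.0]
excited = [140.0, 180.0, 230.0, 190.0, 150.0]

for name, contour in [("neutral", neutral), ("excited", excited)]:
    level, st = pitch_summary(contour)
    print(f"{name}: level {level:.1f} Hz, range {st:.1f} semitones")
```

Expressing pitch range in semitones rather than in Hz makes ranges comparable across speakers with different pitch levels: a contour running from 100 Hz to 200 Hz spans 12 semitones, and so does one running from 200 Hz to 400 Hz.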

Speech emotion analysis refers to the use of various methods to analyze vocal behavior as a marker of affect (e.g., emotions, moods, and stress), focusing on the nonverbal aspects of speech. The basic assumption is that there is a set of objectively measurable voice parameters that reflect the affective state a person is currently experiencing (or expressing for strategic purposes in social interaction). This assumption appears reasonable given that most affective states involve physiological reactions (e.g., changes in the autonomic and somatic nervous systems), which in turn modify different aspects of the voice production process. For example, the sympathetic arousal associated with an anger state often produces changes in respiration and an increase in muscle tension, which influence the vibration of the vocal folds and vocal tract shape, affecting the acoustic characteristics of the speech, which in turn can be used by the listener to infer the respective state (Scherer, 1986). Speech emotion analysis is complicated by the fact that vocal expression is an evolutionarily old nonverbal affect signaling system coded in an iconic and continuous fashion, which carries emotion and meshes with verbal messages that are coded in an arbitrary and categorical fashion. Voice researchers still debate the extent to which verbal and nonverbal aspects can be neatly separated. However, that there is some degree of independence is illustrated by the fact that people can perceive mixed messages in speech utterances – that is, that the words convey one thing, but that the nonverbal cues convey something quite different. [Juslin & Scherer, Scholarpedia]

As with speaking style, it is hard to know whether we should categorise emotions into basic types (the “big six”: anger, fear, sadness, joy, surprise and disgust) or whether to position emotions in some n-dimensional space. In so far as speech is concerned, we find that the same kinds of speech changes occur in a number of different emotions (increased energy or increased pitch for example), so that the space of emotional states that influence speech is probably limited. Cowie (2001, 2003) proposes just two dimensions, valence and activation. Valence corresponds to the positive or negative aspect of the emotion, while activation relates to “the strength of the person’s disposition to take some action rather than none”. There is some experimental evidence that two dimensions are necessary, even if they are not sufficient.
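One way to see why two dimensions might be necessary but not sufficient is to place a few emotion categories in a toy valence-activation space. The placements below are rough assumptions made for this sketch (signs only, +1 or -1) and are not values taken from Cowie. The names EMOTIONS, same_on and confusable are likewise invented here.

```python
# Toy valence-activation space. Each emotion is placed by sign only:
# (valence, activation), with +1 = positive/high and -1 = negative/low.
# These placements are illustrative assumptions, not published values.
EMOTIONS = {
    "anger": (-1, +1),
    "fear": (-1, +1),
    "joy": (+1, +1),
    "sadness": (-1, -1),
}


def same_on(dimension, a, b):
    """True if emotions a and b agree on a dimension (0 = valence, 1 = activation)."""
    return EMOTIONS[a][dimension] == EMOTIONS[b][dimension]


def confusable(a, b):
    """True if this two-dimensional space cannot tell emotions a and b apart."""
    return EMOTIONS[a] == EMOTIONS[b]


# Anger and joy share high activation, so activation alone cannot separate them
print(same_on(1, "anger", "joy"))   # True
# but they differ in valence, so the second dimension is needed.
print(same_on(0, "anger", "joy"))   # False
# Anger and fear land in the same quadrant, so two dimensions are not enough here.
print(confusable("anger", "fear"))  # True
```

In this toy space a single activation dimension would merge anger with joy, while even both dimensions together leave anger and fear indistinguishable.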

The study of emotional speech is fraught with methodological and ethical problems: (i) it is difficult to get recordings of genuine emotions since they occur in natural settings which are unlikely to be recorded (and even then we can’t be sure of the actual emotion felt by the speaker), (ii) it is considered unethical to actually make speakers "afraid" or "panicking" just so we can investigate their speech, (iii) acted speech may caricature rather than express genuine emotions (and actors vary in how well they express emotions), (iv) speakers vary in how they express the same emotion in speech, and (v) speakers expressing an emotional “type” will also differ in the degree of emotional "arousal".

Acoustic properties of emotional speech

Most research into emotional speech has focussed on how emotional expressions affect the prosody and voice source characteristics of speech. For example this table from Scherer, 2003:

Note that similar patterns arise for different emotions - mean F0 rises for both Anger and Joy. Also note that changes are correlated across acoustic features - vocal intensity and vocal pitch rise or fall together. These are the kind of facts which support the idea that emotional speech should be described within a simpler "space" of behaviours, such as the valence-activation space suggested by Cowie.

Physiology

The speech produced by a person will be affected by changes in their physical state or their health. For example we notice if a speaker is out of breath, or has a cold. We can also often get a sense of the age of a person from their voice.

We'll look at four areas where researchers have studied the effect of physiological changes on the voice: Stress, Fatigue, Intoxication and Age.

Stress

Although there is no universally agreed definition of "stress", it has been defined as the physiological response of an individual to external stressors subject to psychological evaluation. Physiological response here refers to the invocation of the 'fight or flight' mechanisms of our nervous system, by which adrenalin levels are increased, cardiovascular output is increased, senses are sharpened, pupils are dilated and so on. Stressors include direct physical effects on the body (e.g. acceleration, heat), physiological effects (e.g. drugs, dehydration, fatigue, disease), perceptual effects (e.g. noise, poor communication channel) or cognitive effects (e.g. perceptual load, cognitive load, emotion). Psychological evaluation allows for the fact that an individual's reaction to these stressors may vary according to the individual's evaluation of their importance.

This focus on the physiological response to stressors makes sense with regard to the assessment of stress through characteristics of the voice. It is to be expected that speech, as a neuromuscular performance, will be affected by the physiological state of the individual. For example increased respiration might increase sub-glottal pressure and hence affect the voice fundamental frequency and spectral slope of the voice spectrum. Increased muscle tension might affect vocal fold vibration or the supra-laryngeal articulation of vowels and consonants. Increased cognitive activity might affect speaking rate, pauses or speaking errors.

Fatigue

Various speech parameters have been observed to vary systematically with increasing fatigue. Changes in pitch height, pitch variation, speaking rate, pause frequency and length, and spectral slope have been reported. Vogel et al (2010) reported an increase in the total time taken to read a passage, and an increase in pause duration, after their subjects had been kept awake for more than 22 hours.
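Timing measures of this kind are straightforward to compute once a recording has been segmented into stretches of speech and silence. The sketch below is a hedged illustration only: it assumes the segmentation is available as a list of (label, start, end) tuples in seconds, and the function name reading_measures and the two example segmentations are hypothetical, not data from Vogel et al.

```python
def reading_measures(intervals):
    """Return (total time, number of pauses, mean pause duration) in seconds.

    intervals is assumed to be a time-ordered list of (label, start, end)
    tuples, where label is either "speech" or "pause".
    """
    total_time = intervals[-1][2] - intervals[0][1]
    pauses = [end - start for label, start, end in intervals if label == "pause"]
    mean_pause = sum(pauses) / len(pauses) if pauses else 0.0
    return total_time, len(pauses), mean_pause


# Hypothetical segmentations of the same short passage read twice.
rested = [("speech", 0.0, 2.0), ("pause", 2.0, 2.5), ("speech", 2.5, 5.0)]
fatigued = [("speech", 0.0, 2.25), ("pause", 2.25, 3.25), ("speech", 3.25, 6.0)]

for name, segmentation in [("rested", rested), ("fatigued", fatigued)]:
    total, n_pauses, mean_pause = reading_measures(segmentation)
    print(f"{name}: total {total:.2f} s, {n_pauses} pause(s), mean pause {mean_pause:.2f} s")
```

Comparing these measures across recordings of the same speaker reading the same passage avoids confounding fatigue effects with differences between speakers or texts.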

The figure below (from Baykaner et al, 2015) shows the results of an experiment to predict time awake from changes in the speaking voice. In this experiment, the speakers were kept awake for 60 hours (two and a half days) and changes to the voice could be used to identify quite accurately whether the speaker had slept in the previous 24 hours. Each point in the graph is a recording of the subject reading from a novel.

Intoxication

Alcohol and other intoxicants have been seen to affect speech. Hollien et al (2001) report increases in fundamental frequency, increases in time to complete the task and increases in disfluencies with increasing alcohol intoxication. The graph below shows mean changes in fundamental frequency for men and women in the study as a function of breath alcohol concentration.

Interestingly, Hollien et al report much variability in how individuals respond to the same level of alcohol intoxication. A significant minority of speakers showed no measurable effects of alcohol intoxication on their speech.

Over the long term, repeated intoxication can have permanent effects on the voice. Alcohol, in particular, causes dehydration of the vocal folds and makes them more susceptible to organic damage.

Age

A speaker's voice changes as they get older, to the extent that we can estimate the age of a speaker fairly well from their voice. The figure below (from Huckvale & Webb, 2015) shows the predicted ages of 52 speakers made by 36 listeners. The mean absolute error of age prediction was about 10 years; that is, we can often estimate a speaker's age to within a decade just by hearing their voice.
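The mean absolute error is the average size of the difference between predicted and true age, ignoring whether the prediction was too high or too low. As a minimal sketch of the calculation, using invented ages rather than the data from Huckvale & Webb (the function name mean_absolute_error is also just chosen for this example):

```python
def mean_absolute_error(predicted, actual):
    """Average of |predicted - actual| over paired lists of ages in years."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical listener estimates and true ages for three speakers.
predicted_ages = [30, 45, 60]
true_ages = [25, 55, 60]

print(mean_absolute_error(predicted_ages, true_ages))  # 5.0
```

Because the errors are taken as absolute values, an overestimate of 5 years and an underestimate of 10 years do not cancel out; both count towards the average.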

Readings

Essential

R. Cowie, "Describing the emotional states expressed in speech", ISCA ITRW Speech and Emotion, Newcastle, Northern Ireland, 2001. On Moodle.

Background

K. Scherer, "Vocal communication of emotion: A review of research paradigms", Speech Communication 40 (2003) 227-256.
M. Eskenazi, "Trends in speaking styles research", Proc. EuroSpeech 1993, Berlin, 501-509.
H. Hollien, G. DeJong, C. Martin, R. Schwartz, K. Liljegren, "Effects of ethanol intoxication on speech suprasegmentals", J. Acoust. Soc. Am. 110 (2001) 3198-3206.

Reflections

You can improve your learning by reflecting on your understanding. Come to the tutorial prepared to discuss the items below.

Suggest two utterances which vary in paralinguistic terms but not in linguistic terms.
What are the defining characteristics of "child-directed speech"?
How is "read speech" different from "spontaneous speech"?
Give an everyday example of extreme hypo-speech.
Give an everyday example of extreme hyper-speech.
Why are the dimensions of valence and activation more useful to phonetic research than just a list of emotional categories?
What are your experiences of the effects of fatigue or of alcohol intoxication on speech?
What problems are there in making authentic recordings of emotional speech or of intoxicated speech?

Last modified: 10:43 09-Mar-2018.