2. Principles of Phonetics
Learning Objectives
At the end of this topic the student should be able to:
- explain how speech is overlaid on biological systems used for other purposes.
- contrast a parametric account of speech articulation with a segmental account.
- explain the selection procedure behind the symbols found in the IPA alphabet.
- describe at least one way in which segments have been described in terms of more elementary components.
- contrast segmental with suprasegmental features of speech.
- contrast linguistic with paralinguistic features of speech.
Topics
- Biological basis of speech
We tend to use the term vocal tract as a name for the anatomical elements of the human body that produce speech. But strictly speaking there is no such thing, since there is no part of the apparatus used to produce speech which is only used for that purpose.
Physiologically, speech is an overlaid function, or to be more precise, a group of overlaid functions. It gets what service it can out of organs and functions, nervous and muscular, that have come into being and are maintained for very different ends than its own. [Sapir, 1921].
| Sub-system | Biological function | Linguistic function |
| --- | --- | --- |
| Diaphragm, lungs, rib-cage | Respiration | Source of air-flow used to generate sounds |
| Larynx | Acting as a valve, coughing | Phonation |
| Velum | Breathing through the nose, acting as a valve | Production of nasal & nasalised sounds |
| Tongue | Eating, swallowing | Changing acoustics of oral cavity, creating constrictions that generate turbulence, blocking and releasing air-flow causing bursts |
| Teeth | Biting, chewing | Provide sharp edges for creating turbulence |
| Lips | Feeding, acting as valve, facial expression | Changing acoustics of oral cavity, creating constriction that generates turbulence, blocking and releasing air-flow causing bursts |
Do humans have a biological specialisation for speech?
An interesting question is whether humans (compared to nonhuman primates) have any anatomical or physiological specialisation for speech. It turns out that there is still no convincing evidence for this. One past theory claimed that the human larynx sits lower in the throat than in related animals, creating a more flexible acoustic cavity between larynx and lips. However, more recent studies have shown that other animals, including chimpanzees, show laryngeal development similar to that of human infants.
If humans have any biological specialisation for speech then it is more likely to be found at the neuro-physiological level. The human variant of the FOXP2 gene seems to be necessary for the neurophysiological control of the coordinated movements required for speech (among other things). And while there are many differences between the structure of human and nonhuman primate brains, these differences have yet to be uniquely associated with language.
- Parametric and segmental accounts of speech
Folk wisdom has it that we use 72 different muscles in speaking. The exact number may be in doubt, but there are certainly a lot; see this list of muscles used in speech production.
When we seek to describe the planning and execution of speech we tend to focus on the movements of the articulators rather than the muscle movements behind them. This immediately reduces the problem to one of describing the position and movement of the vocal folds, the soft-palate, the back, front, sides and tip of the tongue, the lips and the jaw. That is, to the movement of 10-12 objects rather than 72.
Speaking requires co-ordinated movement of the articulators to achieve a desired acoustic result. The articulators move asynchronously but in a choreographed way. Their motion is rapid, precise and fluid, as can be seen on x-ray films of speaking:
If we track the movement of any single articulator we see that it takes a smoothly-changing continuum of positions. If we attempt to track the position of all the articulators for a given utterance then we obtain a parametric description of the speech. Consider this simplified diagram of some of the articulator movements over time in the production of a single word [image source: Laver 1994]:
Even this schematised account hints at the complexity of parametric analysis. While parametric accounts of speaking may provide a more authentic description of articulation (see reading by Tench (1978)), the possible variations in articulator position, movement and timing make these formulations rather difficult to construct, analyse and exploit (imagine a diagram like this for every word in a dictionary). We need to think about how we can best simplify a parametric description into a more convenient formalism.
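To make the idea concrete, here is a minimal Python sketch of how a parametric account of an utterance might be stored as a set of sampled articulator tracks. The parameter names, sample rate and values are invented for illustration and are not taken from any real measurement:

```python
# A parametric account: each articulator is represented by a continuously
# sampled track. The parameter names, sample rate and values below are
# invented for illustration only.

SAMPLE_RATE_HZ = 100   # one sample of each parameter every 10 ms

utterance = {
    "velum_opening":   [0.0, 0.1, 0.6, 0.9, 0.7, 0.2, 0.0],   # 0 = closed, 1 = fully open
    "lip_aperture_mm": [2.0, 1.0, 0.0, 0.0, 4.0, 8.0, 6.0],
    "tongue_tip_mm":   [10.0, 12.0, 14.0, 13.0, 9.0, 6.0, 5.0],
    "f0_hz":           [120.0, 118.0, 0.0, 0.0, 130.0, 125.0, 122.0],  # 0 = no voicing
}

# Every parameter has a value at every instant; nothing in this record says
# where one "speech sound" ends and the next begins.
duration_s = len(utterance["velum_opening"]) / SAMPLE_RATE_HZ
print(f"{len(utterance)} parameters sampled over {duration_s:.2f} s")
```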
There are a number of possible simplification steps:
- Discretise the position of each articulator into a small number of levels, so that the soft palate, for example, is described only as either open or closed, or the lip shape only as either spread or rounded.
- Discretise the timing of each articulator's movement, so that instead of a motion that varies continuously with time we obtain a piecewise-stationary account.
- Synchronise the timing of changes across all articulators, so that the same time intervals describe the piecewise-stationary sections of every articulator.
- Categorise the intervals using a small inventory of possible combinations of articulator positions.
- Symbolise the categories using a set of characters and diacritics.
The input to the process is a parametric account; the output is a symbolic, segmental account that we call a transcription.
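As a rough illustration of how such a pipeline might work, the following Python sketch applies the discretisation, interval-finding and symbolisation steps to a single invented articulator track (the synchronisation step is omitted because only one articulator is shown); all thresholds and labels are made up:

```python
# A toy version of the simplification pipeline described above, applied to a
# single invented articulator track.

track = [0.05, 0.1, 0.8, 0.9, 0.85, 0.15, 0.1]   # e.g. velum opening, sampled over time

# 1. Discretise position into a small number of levels.
levels = ["closed" if x < 0.5 else "open" for x in track]

# 2. Discretise timing: collapse runs of identical levels into intervals.
intervals = []
for i, level in enumerate(levels):
    if not intervals or intervals[-1][0] != level:
        intervals.append([level, i, i])          # [level, start_sample, end_sample]
    else:
        intervals[-1][2] = i

# 4.-5. Categorise and symbolise each interval (trivially, in this toy case).
symbols = {"closed": "oral", "open": "nasal"}
transcription = [symbols[level] for level, start, end in intervals]

print(intervals)        # piecewise-stationary account
print(transcription)    # ['oral', 'nasal', 'oral']
```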
An important question at this point is: what information is lost in the process of transcription? Or, to put it another way, what fidelity is required in our transcription such that no useful articulator movement is lost? Clearly we could use very fine levels of position, or many brief instants of time, or a large inventory of categories and symbols to achieve high fidelity, but this would itself add much complexity.
Different criteria could be used to decide how accurately we need to specify articulator positions and articulatory time intervals.
- Articulatory criterion: choose enough levels and intervals such that each articulator position can be specified to within a given tolerance, say 1 mm (see the sketch after this list).
- Acoustic criterion: choose enough levels and intervals such that the sound generated by the articulation can be specified to within a given tolerance, say one decibel (1 dB).
- Auditory criterion: choose enough levels and intervals such that the sound perceived by an average listener can be specified to within a given auditory tolerance, say the just-noticeable difference (JND) for loudness, pitch or timbre.
- Phonological criterion: choose enough levels and intervals such that all articulations that act contrastively (i.e. give rise to different meanings in any world language) are assigned different categories and symbols.
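To illustrate just the first of these, here is a toy Python sketch of the articulatory criterion: positions are quantised onto a grid coarse enough that every position lies within an assumed 1 mm tolerance of a grid level. The positions themselves are invented:

```python
# A toy illustration of the articulatory criterion: quantise an articulator
# position (in mm) onto a grid whose spacing guarantees a 1 mm tolerance.

TOLERANCE_MM = 1.0
STEP_MM = 2 * TOLERANCE_MM      # grid spacing that keeps every point within tolerance

def quantise(position_mm):
    """Snap a continuous position to the nearest grid level."""
    return round(position_mm / STEP_MM) * STEP_MM

for pos in [3.2, 4.9, 10.7]:    # invented tongue-tip heights in mm
    q = quantise(pos)
    print(f"{pos} mm -> {q} mm (error {abs(pos - q):.1f} mm)")
```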
You should not be surprised to hear that it is the last of these that has been adopted in practice. This key idea, that the fidelity of transcription should be just good enough to capture articulatory changes that lead to changes in meaning (in some language), is one of the principles underlying the alphabet of the International Phonetic Association.
The IPA is intended to be a set of symbols for representing all the possible sounds of the world's languages. The representation of these sounds uses a set of phonetic categories which describe how each sound is made. These categories define a number of natural classes of sounds that operate in phonological rules and historical sound changes. The symbols of the IPA are shorthand ways of indicating certain intersections of these categories. Thus [p] is a shorthand way of designating the intersection of the categories voiceless, bilabial, and plosive; [m] is the intersection of the categories voiced, bilabial, and nasal; and so on. The sounds that are represented by the symbols are primarily those that serve to distinguish one word from another in a language. [ Handbook of the IPA, 1999 ]
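The idea that a symbol is shorthand for an intersection of categories can be expressed directly as sets. The following Python sketch uses only a handful of symbols and category names as an illustration; it is not a complete or official encoding of the IPA chart:

```python
# A few IPA consonant symbols expressed as intersections (sets) of phonetic
# categories. Only a small, illustrative selection is included.

SYMBOLS = {
    "p": {"voiceless", "bilabial", "plosive"},
    "b": {"voiced",    "bilabial", "plosive"},
    "m": {"voiced",    "bilabial", "nasal"},
    "s": {"voiceless", "alveolar", "fricative"},
}

def find_symbol(categories):
    """Return the symbol whose category set matches exactly, if any."""
    for symbol, cats in SYMBOLS.items():
        if cats == set(categories):
            return symbol
    return None

print(find_symbol(["voiced", "bilabial", "nasal"]))               # m
# Natural classes fall out of shared categories:
print([s for s, cats in SYMBOLS.items() if "bilabial" in cats])   # ['p', 'b', 'm']
```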
Relaxing the rules
It is interesting to speculate about what might happen if we had used different criteria for generating a segmental transcription from a parametric analysis. If we allow time intervals to operate asynchronously across the articulators (to remove the requirement that all articulators make a step change together) then we get a polysystemic account of phonetics, one in which planning is concerned with each articulator independently, and where phenomena can be accounted for on different timescales. The most famous practitioner of such ideas was J.R. Firth (who worked at UCL and SOAS) with his prosodic phonology. More recently there have been attempts to derive an acoustic phonology, where speech is chunked into automatically-derived sound signal categories on the basis of acoustic similarity - you might like to think about what problems might arise with this idea.
- Linear segmental units
In current terminology the units of phonetic transcription (i.e. the objects that are represented by symbols and diacritics in the IPA alphabet) are called segments or phones. The units of segmental phonological transcription are called phonemes. We might say that a phone is a particular type of articulated speech sound, which in turn might be found as a realisation of an abstract (underlying) phoneme. The set of phones that are used to realise a given phoneme are called the allophones of the phoneme.
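As a concrete, if simplified, example of the phoneme/allophone relationship, the sketch below lists some commonly cited allophones of English /t/ together with contexts in which they tend to occur; the selection of allophones and contexts is deliberately reduced for illustration:

```python
# A sketch of the phoneme/allophone distinction for English /t/. The choice
# of allophones and contexts is simplified and intended only as an example.

allophones = {
    "/t/": {
        "[tʰ]": "syllable-initial before a stressed vowel (e.g. 'top')",
        "[t]":  "after /s/ (e.g. 'stop')",
        "[ʔ]":  "sometimes syllable-finally (e.g. 'cat') in some accents",
    }
}

# Each phone in the inner dictionary is a realisation (allophone) of the
# single abstract phoneme /t/; substituting one allophone for another does
# not change which word is heard, only how it sounds.
for phone, context in allophones["/t/"].items():
    print(f"{phone}: {context}")
```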
In the IPA chart, consonant phones are described on a system based on voice, place and manner (VPM), whereas vowel phones are described using the vowel quadrilateral. While the chart contains many possible phones, there are of course far fewer in any given language. This leads to the idea that a more economical description of phonemes is possible by focussing on only the specific phonetic contrasts that function in any given language. A number of such sub-segmental feature representations have been developed over the years.
Perhaps the most famous sub-segmental feature set is that described by Jakobson, Fant and Halle (1952), who described the contrasts among the English phonemes using only 9 binary features: (1) vocalic/non-vocalic, (2) consonantal/non-consonantal, (3) compact/diffuse, (4) grave/acute, (5) flat/plain, (6) nasal/oral, (7) tense/lax, (8) continuant/interrupted, and (9) strident/mellow. See figure below. Interestingly, the authors attempted to justify the choice of features on acoustic as well as articulatory grounds. A related set of features was presented in Chomsky and Halle's The Sound Pattern of English (1968), but with more of an emphasis on the articulation of the segments.
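A small fragment of such a binary feature table can be written down directly. The feature values below are illustrative rather than a copy of Jakobson, Fant and Halle's published analysis, but they show how contrasts between phonemes reduce to differences on a few features:

```python
# An illustrative fragment of a binary feature table in the spirit of
# Jakobson, Fant and Halle (1952). Values are invented for this example:
# +1 and -1 mark the two poles of each feature.

features = ["vocalic", "consonantal", "nasal", "continuant"]

phonemes = {
    "/m/": {"vocalic": -1, "consonantal": +1, "nasal": +1, "continuant": -1},
    "/b/": {"vocalic": -1, "consonantal": +1, "nasal": -1, "continuant": -1},
    "/a/": {"vocalic": +1, "consonantal": -1, "nasal": -1, "continuant": +1},
}

def contrast(p1, p2):
    """List the features on which two phonemes differ."""
    return [f for f in features if phonemes[p1][f] != phonemes[p2][f]]

print(contrast("/m/", "/b/"))   # ['nasal']
print(contrast("/b/", "/a/"))   # ['vocalic', 'consonantal', 'continuant']
```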
In terms of units larger than segments, we say that segments are grouped into syllables (typically a vowel surrounded by zero or more consonants) and that words are made up of one or more syllables. A sequence of words makes up an utterance, and a sequence of utterances makes up a dialogue turn.
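This hierarchy of units can likewise be written down as nested lists, as in the sketch below (the words, syllabification and transcription are invented for the example):

```python
# Segments grouped into syllables, syllables into words, words into an
# utterance. The example words and their syllabification are illustrative.

utterance = [                      # an utterance is a sequence of words
    [["h", "ə"], ["l", "əʊ"]],     # "hello": two syllables, each a list of segments
    [["w", "ɜː", "l", "d"]],       # "world": one syllable
]

n_words = len(utterance)
n_syllables = sum(len(word) for word in utterance)
n_segments = sum(len(syl) for word in utterance for syl in word)
print(n_words, n_syllables, n_segments)   # 2 3 8
```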
- Suprasegmental units
Although the phoneme sequence is (by definition) sufficient to identify words, the speaker also has freedom in how the segments in the sequence are timed, how well they are articulated, and what character they have in voice quality, voice pitch or loudness. Since these aspects operate across multiple segments (over domains such as syllables, feet, intonation phrases, turns or topics), these aspects are called suprasegmental.
Suprasegmental units fall under three headings (see Laver 1994, Ch.4):
- Settings: these are preferences the speaker shows for a particular articulatory state. For example, a speaker might adopt a closed jaw position, a partially-lowered velum, or a breathy voice quality. One way of thinking of a setting is as a kind of default or average position of the articulators to which the speaker returns after executing the articulator movements required for the segmental string.
- Stress patterns: these refer to the relative salience of regions of an utterance to listeners. Speakers can make some regions more salient by speaking them more loudly, more carefully, more slowly or with a change in pitch; alternatively speakers can make regions less salient by speaking them more softly, more casually, more quickly or with little change in pitch. We notice in English, for example, that words have patterns of strong and weak syllables, leading to a 'galloping' rhythm, something that is less obvious in, say, Italian.
- Tone and intonation: these refer to changes in pitch which are associated with syllables, words or utterances. Languages with lexical tone use pitch movements to differentiate words, while intonational languages use pitch movements to signal utterance-level features such as sentence function and focus.
- Linguistics and Paralinguistics
There is more to speech than a message encoded in phonological and grammatical units. The segmental and suprasegmental phonetic elements that are used to encode language in spoken form are called the linguistic aspects of Phonetics. Aspects such as 'tone of voice' or 'speaking style', on the other hand, are phonetic and communicative but not linguistic; we call these the paralinguistic aspects of Phonetics. Paralinguistic phonetic phenomena are often implemented using suprasegmental elements such as speaking rate, articulatory quality, voice quality and average pitch height, and are used to communicate information about affect, attitude or emotional state. These features also play a role in the co-ordination of conversation among multiple speakers.
Readings
Essential
- The Principles of the International Phonetic Association, in Handbook of the International Phonetic Association, Cambridge University Press, 1999. [PDF on Moodle].
- Paul Tench (1978). On introducing parametric phonetics. Journal of the International Phonetic Association, 8, pp 34-46.
Background
- John Laver, Chapter 4 The phonetic analysis of speech, in Principles of Phonetics, Cambridge University Press, 1994 [available in library]. A more detailed account of the principles behind phonetic transcription than presented here. Recommended.
Laboratory Activities
The lab session involves two activities that explore the relationship between parametric and segmental accounts of speech production.
- Generating a parametric analysis from a segmental transcription
You will be provided with a slow-motion x-ray film of a speaker saying the utterance
She has put blood on her two clean yellow shoes
On the form provided, sketch the change in position of the following articulators through the utterance, aligned to the given segmental transcription: (i) jaw height, (ii) velum open/close, (iii) height of back of tongue, (iv) height of tip of tongue, (v) lip opening.
- Generating a segmental transcription from a parametric analysis
You will be provided with an audio recording of a sentence made with simultaneous tracking of the articulators (captured using electromagnetic articulography (EMA)).
On the printout provided, align a broad phonetic transcription against the signal. You can use the computers to play back regions of the signal to help with your transcription and alignment. How does your transcription of the utterance fit with the recorded movement of the articulators?
Application of the week
This week's example application of phonetics is the personal speech dictation system. These systems allow you to dictate documents into your computer. Typically they are designed to operate in quiet office environments with a good quality microphone. Usually you need to be personally enrolled into the system by reading a short text; the system uses that to determine the particular characteristics of your voice. Dictation systems also scan the documents that you have on your computer; they use these to establish your typical vocabulary and your frequent word combinations. Dictation systems are particularly useful for people who work with their eyes or hands busy on other tasks (such as map-makers or dentists) or for people suffering from repetitive strain injury to their wrists.
Personal dictation systems are available for all major computing platforms in the most common world languages. The Microsoft speech recognition system is included free in Windows Vista, 7 and 8.
Language of the Week
This week's language is Japanese, as spoken by an educated speaker from Tokyo. [ Source material ].
Reflections
You can improve your learning by reflecting on your understanding. Come to the tutorial prepared to discuss the items below.
- Learn how to introduce yourself in Japanese.
- Listen carefully to the Japanese audio. Identify the speech sounds not present in SSBE. Identify the speech sounds which exist in SSBE but seem to have a different character in Japanese.
- Japanese is said to have voiceless vowels; can you find some?
- What harm is there in calling the human speaking apparatus the vocal tract?
- How could we tell whether some aspect of human anatomy or physiology is specialised for language?
- What differences are there between a parametric description of an utterance and a segmental description? What are the disadvantages of parametric accounts?
- Put into words the process by which the symbols on the IPA chart are chosen. Why are diacritical marks useful?
- Contrast the terms "phoneme", "allophone" and "phone".
- Why might describing phonemes in terms of more elementary features be considered more 'economical'?
- Identify utterances that differ in meaning only because of suprasegmental features.
- Should Phonetics be concerned with paralinguistic phenomena? Should Phonology?
Hajime mashite. (How do you do.)
Konnichi wa, watashi no namae wa <your name> desu. (Hello, my name is <your name>.)
Yoroshiku onegaishimasu. (Pleased to meet you.)