From: Computational Linguistics for Speech and Handwriting Recognition, AISB Workshop, Leeds University, April 1994.

Purpose: the Missing Link in Speech and Handwriting Recognition

Mark Huckvale

Department of Phonetics and Linguistics
University College London
Gower Street
London WC1E 6BT
M.Huckvale@ucl.ac.uk

1. Introduction

Since speaking and writing are both forms of linguistic communication, and since both share the idiosyncrasies of human production, fertile analogies may be drawn between the two in the fields of generation, perception and the encoding of messages. Both forms of communication exhibit enormous variability, contextual effects, and speaker/writer dependence. Thus it is not surprising that methods for automatic speech recognition and automatic writing recognition have many common aspects, both in terms of their architectures and their performance. This paper discusses the similarities of contemporary speech and writing recognition systems in terms of their organisation and weaknesses.

The paper first argues for a much deeper analogy between speaking and writing than is normally considered - specifically to common aspects of phonological realisation - and this discussion leads synergistically to ways in which speech recognition might benefit from writing recognition ideas and vice versa.

The paper then discusses three areas of recognition: (i) the use of delayed decision making in utterance recognition, (ii) the use of linear segment-in-context models for analysing variability, and (iii) the domain of contextual influences.

The paper concludes by tying these themes together in a plea to study how the speech and writing material we attempt to recognise actually functions as communication.

2. Analogies

We note some points of contact between speaking and writing:

On-line and off-line recognition: Both speech recognisers and writing recognisers operate on the basis of an acquired signal: for speaking we cannot normally acquire the articulation and so must rely on the sound pressure signal; while for writing we have the choice of recording the pen-tip movement (on-line) or its visual consequences (off-line). Interestingly, the use of the pen-tip signal rather than the image makes an enormous difference to recognition accuracy in the writer-dependent (cf. speaker-dependent) case (Nouboud & Plamondon 1990, Taxt & Olafsdottir 1990). If the analogy holds, speech recognition from articulograph input should give good speaker-dependent recognition, since speakers might be expected to use a single system of articulation for phonological events. But, just as in handwriting, on-line recognition would probably not perform so well in the speaker-independent case.

Normalisation of Signal: For further processing, the sound signal, the pen-tip signal or the visual image needs to be normalised: for speech this is done by considering the short-time amplitude spectrum instead of the waveform; in writing, by normalising for size, rotation and line width. The choice of normalised representation can be associated with the process of self-monitoring in production. The short-time amplitude spectrum is more useful than the waveform because the ear is insensitive to waveform shape, and thus the production goal is to get the spectral content correct rather than the waveform. Similarly, writing production aims to get the shape correct rather than the size, rotation or line width.
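
To make the speech half of this concrete, here is a minimal sketch in Python/NumPy - an illustration of the principle only, not of any particular recogniser - which computes a short-time amplitude spectrum and confirms that two waveforms differing only in shape (phase) map to essentially the same normalised representation:

    import numpy as np

    def short_time_amplitude_spectrum(signal, frame_len=256, hop=128):
        # Magnitude of the DFT of successive windowed frames; the phase
        # (i.e. the waveform shape) is discarded, keeping only the
        # spectral content to which the ear is sensitive.
        window = np.hanning(frame_len)
        frames = []
        for start in range(0, len(signal) - frame_len + 1, hop):
            frames.append(np.abs(np.fft.rfft(signal[start:start + frame_len] * window)))
        return np.array(frames)

    t = np.arange(4096) / 8000.0
    a = np.sin(2 * np.pi * 440 * t)        # a 440 Hz tone
    b = np.sin(2 * np.pi * 440 * t + 2.0)  # same tone, different waveform shape
    A, B = short_time_amplitude_spectrum(a), short_time_amplitude_spectrum(b)
    print(np.max(np.abs(A - B)) / np.max(A))  # tiny: the representation ignores shape

The writing-domain analogue normalises away size, rotation and line width in the same spirit.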

Realisation of segmented underlying form: In both speech and writing, co-ordinated and exquisitely-timed muscular control is effected to produce an articulation. A common 'phonological' view is that speech and writing are the realisations of a linear segmented underlying representation of the message: phonemes or letters. The realisation of such abstract units by articulation transforms them from discrete features, values and positions into a continuous physical signal. The result is that information about the underlying form is smeared through the signal, so that a single measurement of the signal has been influenced by a number of underlying units. Conversely, information about the identity of a given underlying unit is obtained from a range of times.

Recognition of underlying units: The physical processes of realisation in both speech and writing shape the generated form of the abstract unit according to context, speaker/writer, environment and occasion. A recogniser must model the nature of the variability of these realisations to be able to determine the likelihood that a given signal could have a given underlying transcription. The use of phone-in-context and letter-in-context models for recognition shows how current systems make a direct link between physical shape and a very limited phonological context.
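
To see how narrow this contextual window is, the following sketch relabels a linear phone string as phone-in-context units (the 'l-c+r' label notation is borrowed from common HMM toolkit practice, and the phone set and 'sil' boundary marker are assumptions for illustration):

    def phones_in_context(phones, boundary="sil"):
        # Each unit is relabelled with its immediate left and right
        # neighbours - the only context these models ever see.
        padded = [boundary] + list(phones) + [boundary]
        return [padded[i - 1] + "-" + padded[i] + "+" + padded[i + 1]
                for i in range(1, len(padded) - 1)]

    print(phones_in_context(["b", "e", "d"]))
    # ['sil-b+e', 'b-e+d', 'e-d+sil']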

Contextual variants: As well as associating letters with phonemes - abstract entities used to differentiate lexical items - we might also associate contextual variants as allographs/allophones, or realisations as graphs/phones. We might push this one step further by considering the pattern elements in both a letter and a spectrographic representation of a phone. If speech perception results hold analogously, the perception of a letter comes from a combination of perceptions of these pattern elements, and trading relations exist by which the influence of one cue is balanced against another.

Theories of perception: The equivalent to the motor theory of speech perception - which maintains that we overcome the inherent variability in the sound signal by a mental reference to the means of production - would be a motor theory of writing perception, in which readers inferred an underlying pen manipulation which was more invariant than the visual form. Perhaps in this form the motor theory is seriously undermined.

Supra-segmental similarities: There are also interesting analogies above the level of the segment: volume with size, syllables with letter groups, tone groups with lines, rhythm with spacing. Note, however, that for both domains these are just the properties of the signal which are normalised out of the input at an early stage in recognition. The normal justification for this is that supra-segmental properties do not predict segmental shape; however, see section 5.

Modelling techniques: Contemporary recognition systems in both areas use neural-net and HMM models of linear phonological units in context, trained on large quantities of reference material. Both use language models to provide sequential constraints in recognition, although writing systems tend to use letter-sequence models while speech systems use only word-sequence models. The only justification for this difference is that writing recognition requires an unlimited vocabulary.
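
A letter-sequence model of the kind used in writing systems can be sketched in a few lines. This toy bigram estimator (the '#' edge marker and the tiny vocabulary are invented for illustration) shows the kind of sequential constraint such a model contributes:

    from collections import defaultdict

    def train_letter_bigrams(words):
        # Count letter successions, with '#' marking word edges, and
        # normalise to conditional probabilities P(next | current).
        counts = defaultdict(lambda: defaultdict(int))
        for w in words:
            for a, b in zip("#" + w, w + "#"):
                counts[a][b] += 1
        return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
                for a, nxt in counts.items()}

    probs = train_letter_bigrams(["course", "cause", "coarse", "curse"])
    print(probs["c"].get("o", 0.0), probs["c"].get("x", 0.0))  # 0.5 0.0

Whatever the signal evidence says, the model scores 'co...' as far more plausible than 'cx...'.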

From these points of contact we see that it is quite justifiable to relate the nature of the recognition problem and the architecture for recognition across speech and writing.

3. Delayed decision making

The single most important advance in automatic speech recognition is often said to be the introduction of dynamic programming (DP) - a particularly efficient method for solving some tricky graph-search problems. The introduction of DP not only allowed non-linear methods of time-alignment of signals and fast methods for HMM recognition, but heralded a conceptual change in recognition architecture from bottom-up to top-down. No longer were speech recognition systems constructed as a series of transformational tasks in which sounds were converted to phones, phones grouped into words, and words into sentences. Instead, syntactic, lexical and phonological information came to be 'compiled' into a complete production model for every allowable sentence. Recognition then proceeded by finding the single best input to this model that matched an unknown signal; and this was only feasible with DP.
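
The DP idea can be shown in a minimal time-alignment (DTW) sketch - real systems operate on sequences of spectral feature vectors rather than scalars, but the recursion is the same:

    import numpy as np

    def dtw_distance(x, y):
        # D[i, j]: cost of the best alignment of x[:i] with y[:j].
        # Only O(n*m) cells are computed, although the number of
        # distinct alignments considered is exponential.
        n, m = len(x), len(y)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(x[i - 1] - y[j - 1])        # local distance
                D[i, j] = cost + min(D[i - 1, j],      # stretch x
                                     D[i, j - 1],      # stretch y
                                     D[i - 1, j - 1])  # step both
        return D[n, m]

    # The same pattern at two different 'speaking rates' aligns at no cost.
    print(dtw_distance([1, 2, 3, 2, 1], [1, 1, 2, 2, 3, 3, 2, 1]))  # 0.0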

Given this approach in speech, it is therefore quite surprising that writing recognition systems still put so much weight on the bottom-up recognition of letters. Take the example in Figure 1: here the word 'course' cannot possibly be recognised on a letter-by-letter basis. It may be recognised only when placed in a much larger context - of about two words on either side. While it is clear how this particular realisation of the word 'course' arose - how the 'ou' letters have been formed, how the 'rs' letters have been coalesced - this is post-hoc rationalisation once the word identity has been established. Similarly in continuous speech recognition systems: the recognised phone sequence is established after the best-fitting sentence is found - not as a preparatory step.

Figure 1. Difficult example for linear segment model


This concept of 'delayed decision making' has profoundly affected our views of the speech recognition problem. Early decision making has been shown to have serious consequences for recognition performance, since given signal events always have more than one interpretation. To pass up alternatives for resolution at a higher level invites a problem of combinatorial explosion: too many competing hypotheses without information adequate for disambiguation. The delaying of decisions, combined with a statistical framework which allows probabilities arising from observations to be combined with probabilities arising from sequential constraints, produces a much more effective result - but at the expense of losing the exact association between data and interpretation. A writing recognition system built on these principles could not indicate which part of the writing corresponds to which recognised letter. This may or may not be appropriate for writing recognition applications.
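
The combination of observation probabilities with sequential constraints, with every decision deferred until the whole utterance has been seen, is exactly what Viterbi decoding over an HMM provides. A sketch, assuming log-domain scores and arbitrary toy dimensions:

    import numpy as np

    def viterbi(obs_logp, trans_logp, init_logp):
        # obs_logp[t, s]: log P(observation at time t | state s)
        # trans_logp[s, s2]: log P(s2 follows s) - the sequential constraint
        # No state is committed to until the final traceback.
        T, S = obs_logp.shape
        score = init_logp + obs_logp[0]
        back = np.zeros((T, S), dtype=int)
        for t in range(1, T):
            cand = score[:, None] + trans_logp   # score every extension
            back[t] = np.argmax(cand, axis=0)
            score = cand[back[t], np.arange(S)] + obs_logp[t]
        path = [int(np.argmax(score))]           # the single, final decision
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]

    rng = np.random.default_rng(0)
    obs = np.log(rng.dirichlet(np.ones(3), size=8))  # 8 frames, 3 states
    flat = np.log(np.full(3, 1.0 / 3.0))
    print(viterbi(obs, np.tile(flat, (3, 1)), flat))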

4. Use of segments to model physical realisation

In both speaking and writing we are concerned with linguistic communication - an encoding of a message using the richness of human language. An important element of that communication is the separation of the meaning of the message from the patterns used to encode it. Neither phonemes nor letters have meaning of themselves, and hence neither do the sounds or graphical shapes.

The phonological regularities apparent in both pronunciation and spelling demonstrate that one aspect of the arbitrary mapping between meaning and production is between lexical entries and phonemes; that the mapping arises in the mental lexicon. That this seems obvious demonstrates the pervasiveness of the linear segmented view of pronunciation and spelling. For it needn't be so: the arbitrary mapping could be directly from meaning to sound or shape. The word 'bed' may map to the sound "bed", and only for phoneticians to the transcription /bed/. The phonological regularities observed in production could arise from some limited perceptual capability rather than from the structure of the lexicon (maybe we can only see letters and hear phonemes). However, the view that production is controlled by linear segments leads to our current recognition systems, which are predicated on phone/letter models influenced by phone/letter context.

That such a model is inadequate is widely known, although very few efforts have been made to extend the recognition architecture to richer models of the meaning-to-sound mapping. There are two aspects to the improvements required: firstly to recognise the influence of the production system on the nature of variability; secondly to recognise the importance of supra-segmental information.

The influence of the production system is that the sound/writing stream is not a sequence of discrete events localised in time/space. In phonetics we are happy to talk about events such as 'bilabial closure' or 'lateral release' as if they can be isolated in the signal. In writing we talk of graphs being made up from a number of 'strokes' - pen-tip gestures. In neither case is there good evidence that the signal can be decomposed into such events. In the data that we collect, these elemental constructs - if they ever existed at all - have merged into a stream that is continuous in time and value. What we appear to be doing is interpreting the signal with respect to a model of production which imposes a discrete view: 'if this is the data, then it must have come from this individual event'. While this is a perfectly reasonable approach to form a descriptive model of the data, it shouldn't be pushed into service as a model for the production mechanism itself without further evidence. Thus a phonemic analysis of pronunciation doesn't imply that phonemes are used in the control of the articulators. Similarly, a letter-by-letter analysis of handwriting may not be a good model of manipulator movement (cf. Figure 1).

The deficiencies of a linear view in recognition are that we miss regularities in variability that are shared by a number of phones/letters, while at the same time limiting contexts narrowly to immediate neighbours. Huckvale (1993a, 1993b) describes how a tiered phonological representation goes some way to address these issues.

The second aspect - the importance of supra-segmental information - is that the speaker/writer uses prosody to direct the listener/reader in how to set about decoding the utterance: how to break the input up into manageable units, how to separate old information from new, and how (probably) to disambiguate between different syllabifications and word-boundary possibilities. What is important for contemporary recognition systems is how the supra-segmental aspects affect the segmental realisation.

5. Information loading

The relationship between 'communicative load' and quality of production is well known. Lieberman (1963) showed that words are given longer and more intelligible pronunciations when they occur in contexts which do not predict them ('the word which you are about to hear is nine').

Thus one reason we have difficulty identifying phones/letters is that the production mechanism appears to have an ambivalent attitude towards presenting clear realisations: letters/phones are clear where they need to be, according to the decoding scheme the speaker assumes of the listener. Otherwise the production is merely sufficient to do the communicative job reasonably reliably. The consequence is that the transcription model of the signal fails to bridge the gap between signal and lexicon. Sufficient information for discrimination of the lexical entries is present in the signal given the context, but a phone sequence misrepresents its phonetic content. By forcing the input to be a sequence of phones/letters we force the representation to be in error; the consequences of that error are then spread throughout the recognition system at all levels: more word candidates, more word classes, more syntactic constituents, more interpretations.

The word 'course' can be identified in Figure 1 because we don't first process signals into transcription and then perform lexical access. Instead lexical possibilities constrain the information that needs to be extracted. Writing shows us that we might consider a recognition system for speech in which phonology is used to describe the organisation of the lexicon, and hence how choices between words at different junctures in a sentence are the actual arbiters of phonetic measurement (Huckvale, 1990).
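
As a purely illustrative sketch of this idea - the function and its candidate set are hypothetical, not a proposal from the cited work - given the word candidates that context leaves open, only the positions at which they differ need careful measurement:

    def discriminating_positions(candidates):
        # Positions where the remaining word candidates actually
        # differ - the only places where careful measurement of the
        # signal pays off; '_' pads shorter candidates.
        length = max(len(w) for w in candidates)
        padded = [w.ljust(length, "_") for w in candidates]
        return [i for i in range(length)
                if len({w[i] for w in padded}) > 1]

    # With the field narrowed to two candidates, only two letter
    # positions carry any discriminative load:
    print(discriminating_positions(["course", "corpse"]))  # [2, 3]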

6. Discussion

In this paper I have tried firstly to show that it is legitimate to discuss together the problems of speech recognition and writing recognition; in section 2, I have outlined some of the many points of contact between the production and the recognition of spoken and written utterances.

Secondly I have introduced three observations common to speech recognition and writing recognition: (i) that decisions must not be made too early, (ii) that linear segmentation is a rather crude implementation of phonology for recognition, and (iii) that the realisation of segments depends on their communicative load not just on neighbouring segments.

I believe that there is an important single lesson that can be drawn from these observations. The delaying of early decisions is useful because it does not make explicit a low-level feature/segment representation - whatever scheme is chosen (HMMs of phones, for example), the modelling is weak enough to allow the influence of higher-level knowledge. The signal is not recognised as a segment sequence - rather, knowledge of production variation is modelled with segments. What is recognised is always the whole utterance. But if segments are being used simply as models of pronunciation variation, then they are doing quite a bad job. The variation of a phone/letter depends on a wider context than simply its neighbours, and on the characteristics of the production system. If we seek a model of variability then we need to consider prosody and production. But prosody is not some arbitrary interference on the segment string - it is a demonstration of the underlying organisation of the message; it provides essential cues to the recovery of meaning.

The key to all this is to observe that utterances have been produced for a reason. The difficulty we have in recognising segment sequences arises because we ignore this. Somehow the conception has arisen that variation in realisation due to the production system or to prosody is some kind of 'noise' that hides the true segmental string - whereas these aspects are just the opposite: vital clues as to how the utterance should be interpreted. Our modelling of productions should take into account the communicative function at that stage in the interaction with the machine, not just how a segment depends on its neighbours. To model communicative function means that we need to study why, as well as how, utterances are produced. This knowledge can be incorporated into recognition just as word syntax is used currently.

The missing link in existing speech and writing recognition systems is just the concept that the speaker/writer wants to help the listener/reader. Producers of utterances know they must obey pragmatic rules of quality and relevance for communication to be possible. These rules, far from perverting the pure segmental model of production, enhance an inherently variable system with a larger scale structure directly linked to the utterance intent.

7. Acknowledgements

The author is grateful to Wendy Holmes for constructive criticisms of an earlier draft of this paper.

8. References

M.A. Huckvale (1990), The exploitation of speech knowledge in neural nets for recognition, Speech Communication, p. 1.

M.A. Huckvale (1992), Illustrating Speech: Analogies between speaking and writing, in Speech, Hearing and Language - Work in Progress 6, Phonetics and Linguistics, University College London.

M.A. Huckvale (1993a), Tiered segmentation of speech, in Speech, Hearing and Language - Work in Progress 7, Phonetics and Linguistics, University College London.

M.A. Huckvale (1993b), The benefits of tiered segmentation of speech for the recognition of phonetic properties, Eurospeech-93, Berlin.

P. Lieberman (1963), Some effects of semantic and grammatical context on the production and perception of speech, Language and Speech 6, pp. 172-175.

F. Nouboud, R. Plamondon (1990), On-line recognition of handprinted characters: survey and beta tests, Pattern Recognition 23, pp. 1031-1044.

T. Taxt, J.B. Olafsdottir (1990), Recognition of handwritten symbols, Pattern Recognition 23, pp. 1155-1166.