THE CONTRIBUTION OF TEMPORALLY-CODED ACOUSTIC SPEECH PATTERNS TO AUDIO-VISUAL SPEECH PERCEPTION IN NORMALLY HEARING AND PROFOUNDLY HEARING-IMPAIRED LISTENERS

Andrew Faulkner and Stuart Rosen.

Department of Phonetics and Linguistics, University College London

Wolfson House, 4 Stephenson Way, London NW1 2HE. UK. E-mail a.faulkner@ucl.ac.uk

In: "Proceedings of the Workshop on the Auditory Basis of Speech Perception". ESCA, June 1996, pp 261-264

ABSTRACT

Studies simulating hearing impairment in normally hearing subjects have investigated auditory and audio-visual consonant identification with acoustic signals representing isolated and combined temporal speech pattern elements. These elements comprise the temporal patterning of both periodic laryngeal excitation and aperiodic voiceless excitation, voice fundamental frequency, and the speech amplitude envelope.

In consonant identification, the principal auditory contributions to audio-visual speech perception came from the on-and-off patterning of silence, periodic and aperiodic excitation. Variations in amplitude envelope and fundamental frequency provided little further information. In audio-visual sentence recognition, however, speech amplitude information did provide significant information beyond that from the temporal patterning and fundamental frequency of laryngeal excitation.

A speech analysing hearing aid, the SiVo-II aid, has been employed to implement these codings. It employs an artificial neural-net classifier trained to extract laryngeal excitation information from speech in noise, and also extracts speech amplitude envelope. In a group of profoundly hearing-impaired listeners who derive little lipreading support from amplified speech, encoded speech pattern elements show a significant advantage, especially in noise.

INTRODUCTION

The perceptual role of temporal structure in speech has until recently [1,2] been rather neglected in comparison to spectral structure. Temporal speech information is likely to have especial significance for hearing impaired listeners, whose frequency resolution is impaired or, where the hearing loss is profound, may be completely absent [3].

Temporal information can be considered to have several component elements. We have adopted a classification of temporal components based on the presence or absence of acoustic excitation, the periodicity or aperiodicity of excitation, the variation in frequency of periodic (laryngeal) excitation, and the amplitude variation of periodically and aperiodically excited speech. A further temporal aspect, not investigated here, is the instantaneous frequency of the speech pressure waveform, which contains information related, at least in principle, to vocal tract resonances.

The contributions of different temporal information elements to segmental consonant identification are of inherent theoretical interest. One application of such an understanding is in the design of hearing aids intended to make optimal use of limited auditory abilities [e.g. 4].

Spectral speech information is broadly correlated with visible information from lipreading. However, the patterning in time of laryngeal excitation, silence and aperiodic excitation is invisible, as is the frequency of larynx vibration. The amplitude variation of voiced speech components is correlated with both visible factors (degree of constriction) and invisible factors (e.g. opening of the nasal tract).

CONSONANT IDENTIFICATION

The studies of segmental perception presented here address several specific issues.

Does the variation of amplitude envelope contribute to segmental information beyond the gross timing of periodic and aperiodic excitation? This relates to the important current practical issue of whether extreme compression in hearing aids removes important segmental information.

While fundamental frequency microstructure is known to be related to manner of articulation through the effects on vocal fold vibration frequency of changes in the acoustic impedance of the vocal tract, can this information be used in consonant identification?

Experiment 1

This first experiment concerned the contributions to consonant identification of the gross timing, amplitude variation and fundamental frequency variation of voiced components of speech.

Methods

Nine conditions were tested: lipreading alone (L), plus four sound conditions, each presented with (L+) and without lipreading:

V - A fixed-frequency, fixed-amplitude signal indicating vocal fold vibration.

V(A) - as V but with added amplitude envelope, derived from the original speech.

Fx - A fixed-amplitude signal whose periodicity followed the speaker's fundamental frequency.

Fx(A) - as for Fx, but with an amplitude envelope added.

Four normal-hearing native speakers of British English took part. Speech materials comprised each of the 24 English consonants placed between two tokens of the vowel /a/. Five video-recorded lists from a female speaker were used, each list consisting of two tokens of each consonant.

Fundamental frequency and the duration of laryngeal excitation were derived from an electro-laryngograph signal. The on-off pattern of voicing was represented by a gated pulse signal of constant frequency. Where fundamental frequency was represented, the pulse rate was controlled cycle-by-cycle. The pulse signal was low-pass filtered at 400 Hz. Amplitude envelope information was derived by full-wave rectifying the 3-kHz low-pass filtered speech, and smoothing with a 30 Hz low-pass filter. The envelope was multiplied against the appropriate pulse train. All signals were recorded for testing purposes, and presented free-field using a loudspeaker.
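The envelope derivation described above can be sketched as follows. This is a minimal illustration only, not the processing chain actually used: the filter type is not stated in the text, so a first-order recursive low-pass filter is assumed, the 3 kHz pre-filter is omitted, and the function names, sample rate and test signal are all hypothetical.

```python
import math

def one_pole_lowpass(signal, cutoff_hz, sample_rate_hz):
    """First-order recursive low-pass filter (an assumed filter type)."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate_hz)
    state, out = 0.0, []
    for x in signal:
        state += alpha * (x - state)
        out.append(state)
    return out

def amplitude_envelope(speech, sample_rate_hz, smooth_hz=30.0):
    """Full-wave rectify the input, then smooth it with a low-pass filter."""
    rectified = [abs(x) for x in speech]
    return one_pole_lowpass(rectified, smooth_hz, sample_rate_hz)

def impose_envelope(carrier, envelope):
    """Multiply a fixed-amplitude carrier (e.g. a pulse train) by the envelope."""
    return [c * e for c, e in zip(carrier, envelope)]

# Hypothetical test signal: 0.5 s of a 100 Hz tone followed by 0.5 s of silence.
FS = 8000
tone = [math.sin(2.0 * math.pi * 100.0 * n / FS) for n in range(FS // 2)]
speech_like = tone + [0.0] * (FS // 2)
envelope = amplitude_envelope(speech_like, FS)

# Hypothetical fixed-frequency carrier: one unit pulse every 10 ms.
carrier = [1.0 if n % 80 == 0 else 0.0 for n in range(FS)]
modulated = impose_envelope(carrier, envelope)
```

With this toy input the envelope settles near the mean of the rectified tone while the tone is present and decays towards zero in the silent half, so the pulses in `modulated` carry the amplitude contour of the input.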

Analysis

Each session was analyzed separately by constructing a confusion matrix from which overall proportion correct scores were derived, together with unconditional information transfer measures for:

voicing: voiced vs. voiceless

place: bilabial vs. labiodental vs. dental vs. alveolar vs. palatal vs. velar vs. pharyngeal

manner: plosive vs. affricate vs. fricative vs. nasal vs. glide

voice/manner: a voicing/manner feature, closely related to so-called envelope features [2], with classes: voiced plosive, voiceless plosive, voiceless fricative, voiced fricative, and sonorant (nasals + glides).
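The feature measures listed above can be illustrated with a short sketch. The formula is not given in the text, so this assumes the usual relative transmitted information (mutual information between stimulus and response, divided by stimulus entropy), computed separately for each feature after pooling the consonant confusion matrix into feature classes. The function names and the toy confusion counts are hypothetical.

```python
import math
from collections import defaultdict

def collapse_by_feature(confusions, feature_of):
    """Pool a consonant confusion matrix into feature classes."""
    pooled = defaultdict(float)
    for (stimulus, response), count in confusions.items():
        pooled[(feature_of[stimulus], feature_of[response])] += count
    return dict(pooled)

def relative_information_transfer(confusions):
    """Transmitted information divided by stimulus entropy (0 to 1)."""
    total = sum(confusions.values())
    stim_totals, resp_totals = defaultdict(float), defaultdict(float)
    for (s, r), count in confusions.items():
        stim_totals[s] += count
        resp_totals[r] += count
    transmitted = 0.0
    for (s, r), count in confusions.items():
        if count > 0:
            transmitted += (count / total) * math.log2(
                count * total / (stim_totals[s] * resp_totals[r]))
    stimulus_entropy = -sum((n / total) * math.log2(n / total)
                            for n in stim_totals.values() if n > 0)
    return transmitted / stimulus_entropy if stimulus_entropy > 0 else 0.0

# Toy counts (not data from the paper): keys are (stimulus, response).
toy_confusions = {("p", "p"): 5, ("p", "s"): 5, ("b", "b"): 5,
                  ("b", "z"): 5, ("s", "s"): 10, ("z", "z"): 10}
voicing_of = {"p": "voiceless", "s": "voiceless", "b": "voiced", "z": "voiced"}

voicing_transfer = relative_information_transfer(
    collapse_by_feature(toy_confusions, voicing_of))
consonant_transfer = relative_information_transfer(toy_confusions)
```

In the toy matrix every error stays within the correct voicing class, so voicing is transmitted perfectly even though the consonants themselves are often confused. This is the sense in which a feature score can be high while overall proportion correct stays low.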

To allow for learning, only the last 6 of the 10 sessions run for each condition were analyzed. Statistical claims are made on the basis of an ANOVA including an observer x condition interaction (which was often significant), and Tukey's Studentized Range Test (p < 0.05).

Results

Table 1 shows mean performance as a function of condition. Values with a common symbol in the same column (*, #, @) are indistinguishable statistically. Although more information tends to lead to better performance, neither fundamental frequency nor envelope increases performance very much compared to on-off voicing. Fx information does result in a significant increase in manner information compared to the simple duration of voicing, but this does not affect overall performance. That Fx variations aid consonant identification little has already been shown [5], but the small effects of amplitude envelope variation come as a surprise.

TABLE 1

                                     feature
condition      correct    voice/manner    voicing    manner
V              # 13       # @ 45          @ 68       # 28
V(A)           # 14       # @ 48          @ 69       # 30
Fx             # 18       @ 52            @ 75       # 35
Fx(A)          # 17       # @ 50          @ 72       # 33
L              54         # 43            15         60
L + V          * 79       * 71            * 92       @ 67
L + V(A)       * 83       * 76            * 93       @ * 73
L + Fx         * 83       * 77            * 94       * 75
L + Fx(A)      * 85       * 78            * 95       * 77

Experiment 2

Experiment 2 was primarily concerned with the role of voiceless frication and the amplitude envelope of voiceless speech. There were 3 different acoustic signals, presented with and without lipreading, a total of 6 conditions. Apart from Fx(A) used in Experiment 1, the other sound signals were:

Fx(A)+Nz - as for Fx(A) above, with a band of fixed-level noise present during periods of voiceless excitation.

Fx(A)+Nz(A) - as above, but with an amplitude envelope on the noise as well.

Methods

Five new observers took part, following the same procedures as in Experiment 1. Voiced speech was processed as in Experiment 1. Voiceless excitation was detected by a spectral balance circuit comparing the amount of energy above and below 3 kHz in the speech signal. However, voiceless excitation was only represented in the absence of voicing. Voiceless excitation was represented by white noise that was then mixed with the voicing pulses, and low-pass filtered at 400 Hz. Amplitude envelope (for both voiced and voiceless speech) was derived by full-wave rectifying the broad-band speech signal and smoothing the result using a 30 Hz low-pass filter.
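The decision logic described above, in which voicing takes priority over voiceless excitation, might be sketched as below. This is an illustrative assumption rather than a description of the actual circuit: the balance threshold, the energy floor and the function name are hypothetical, and the band energies are taken as already measured.

```python
def excitation_label(energy_below_3k, energy_above_3k, voiced,
                     balance_threshold=1.0, energy_floor=1e-6):
    """Label one analysis frame as 'voiced', 'voiceless' or 'silence'.

    balance_threshold and energy_floor are hypothetical parameters.
    """
    # Voicing takes priority: voiceless excitation is only represented
    # in the absence of voicing.
    if voiced:
        return "voiced"
    # Spectral balance: enough energy, and more above 3 kHz than below.
    if (energy_above_3k > energy_floor
            and energy_above_3k > balance_threshold * energy_below_3k):
        return "voiceless"
    return "silence"

# Hypothetical frames: (energy below 3 kHz, energy above 3 kHz, voiced?)
frames = [(0.80, 0.05, True),    # vowel
          (0.02, 0.40, False),   # voiceless fricative
          (0.30, 0.50, True),    # voiced fricative: voicing wins
          (0.0, 0.0, False)]     # silence
labels = [excitation_label(low, high, v) for low, high, v in frames]
```

Under this rule a voiced fricative is coded as voiced only, which matches the statement that noise was presented only when voicing was absent.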

Results

Again, more information tended to lead to better performance. The addition of voiceless information almost always led to significantly improved performance (except for the place feature). As in Experiment 1, the addition of amplitude envelope never caused significant increments for the features analysed.

TABLE 2

                                           feature
condition            correct    voice/manner    voicing    manner
Fx(A)                # 19       # 47            62         # 32
Fx(A) + Nz           @ 24       @ 60            # 72       @ 45
Fx(A) + Nz(A)        @ 27       @ 61            @ # 79     @ 45
L + Fx(A)            68         68              @ 82       61
L + Fx(A) + Nz       * 76       * 82            * 92       * 73
L + Fx(A) + Nz(A)    * 75       * 79            @ * 88     * 72

Experiment 3

Experiment 3 focused primarily on the overall role of envelope. It included one condition that had not been used previously, Fx+Nz, in which the gross timing of both voiced and voiceless excitation was represented without amplitude envelope information. Three new observers took part.

Results

The results (Table 3) lead to the same conclusions as the previous two experiments. Variations in envelope beyond a simple binary indication of amplitude never led to statistically significant increments in performance. But the addition of voiceless information often did, especially for voicing and for other features in conjunction with lipreading.

TABLE 3

                                                 feature
condition            correct    voice/manner    voicing    manner    place
Fx                   # 9        # 33            # 33       # 26      # 21
Fx + Nz              # 14       # 39            @ 50       # 28      # 22
Fx(A) + Nz(A)        # 14       # 39            @ 48       # 28      # 22
L                    @ 41       # 33            5          @ 48      @ 77
L + Fx               62         58              @ 60       56        @ * 79
L + Fx + Nz          * 72       * 69            * 73       * 67      @ * 81
L + Fx(A) + Nz(A)    * 74       * 73            * 75       * 70      * 82

EXPERIMENT 4 - SENTENCE PERCEPTION

It is clear that the bulk of temporal segmental information is contained in the on-and-off patterning of silence, periodicity and aperiodicity. In connected speech, other factors are of course involved. Fundamental frequency variation has been shown to contribute to audio-visual speech perception both in connected discourse tracking [6] and sentences [7]. Amplitude envelope variation is also known to contribute to the lipreading of connected discourse [8] and sentences [9]. In view of our above findings, we sought here to examine the additional information from amplitude variation in sentence perception using the same speech processing methods. We also wanted to compare the contribution of voiceless excitation information between consonant and sentence level speech perception.

Methods

Six normally hearing subjects took part. The auditory supplements employed were Fx, Fx(A) and Fx(A) + Nz(A). Four conditions were used, audio-visual presentation with each of these supplements, and also unaided lipreading (L).

Speech materials were taken from an audio-visual recording of the BKB sentences for which normative data have been established [10]. Each of the 21 sentence lists comprises 16 sentences, each with four or five "key" content words that are scored for correctness. Speech processing was as in experiments 2 and 3 above.

Each subject first received an unscored practice sentence list in each of the four conditions. They subsequently received five test sessions comprising one test list in each of the four conditions.

Results

The group results are shown in Figure 1. Because some subjects approached the upper bounds of the test scores in some audio-visual conditions, an arcsine transformation was applied to the data prior to an analysis of variance. Comparisons between conditions were again made using Tukey Studentized range tests.
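The arcsine transformation mentioned above can be sketched as follows. The exact form used is not stated in the text, so the common variance-stabilising form 2·arcsin(√p) is assumed here, and the list size of 50 key words is an illustrative value only.

```python
import math

def arcsine_transform(n_correct, n_items):
    """Map a proportion-correct score to 2*asin(sqrt(p)) (assumed form)."""
    p = n_correct / n_items
    return 2.0 * math.asin(math.sqrt(p))

N_ITEMS = 50  # hypothetical number of key words in one test list

# The same 5-word difference, near the middle and near the ceiling of the scale.
mid_scale_step = arcsine_transform(30, N_ITEMS) - arcsine_transform(25, N_ITEMS)
ceiling_step = arcsine_transform(50, N_ITEMS) - arcsine_transform(45, N_ITEMS)
```

The transform stretches differences close to the upper bound, so that `ceiling_step` is larger than `mid_scale_step`. This is why it is commonly applied before an analysis of variance when some scores approach the maximum.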

Figure 1. Mean number of key words identified by condition in experiment 4. The error bars are simple 95% confidence intervals for each mean.


As expected, all three of the audio-visual conditions showed significantly higher scores than the visual only condition. Scores in conditions L+Fx(A) and L+Fx(A)+Nz(A) did not differ significantly, while both showed a significantly higher score than condition L+Fx. There was a significant practice effect; F(4,20) = 10.55, p<0.001, but no significant interaction of condition and practice; F(12,60) = 1.82.

Discussion

In showing a significant contribution of amplitude envelope variation, the results of experiment 4 are consistent with other results in the literature. Since we have found no significant contribution of amplitude variation at the level of segmental (consonant) perception, we attribute the effect of amplitude variation here to supra-segmental factors.

A contribution of voiceless excitation information is not apparent here, despite it being consistently significant in consonant identification. Presumably the contribution it can make in consonantal manner perception and the enhancement of voicing contrasts is made less significant here by the availability of syntactic and lexical context.

RESULTS FROM PROFOUNDLY HEARING IMPAIRED LISTENERS

For hearing impaired listeners, the use of simple encodings of temporal speech elements has several potential advantages. First, simple patterns of temporal information can readily be encoded using acoustic signals that match the analytic capacity and frequency/intensity ranges of the profoundly impaired ear [5]. Second, we can expect speech analysis methods to become available that extract speech information in levels of noise that cause extreme difficulty for profoundly hearing impaired people. In the case of voice fundamental frequency information, this has already been demonstrated by a multi-layer perceptron classifier [11]. We have developed a wearable hearing aid whose digital signal processing carries out this analysis, and have made evaluations of its benefit for profoundly impaired listeners.

The SiVo-II aid [12] used in these studies provided voice fundamental frequency information, encoded as a frequency- and amplitude-modulated sinusoid. It was compared to well-fitted conventional hearing aids in tests of audio-visual consonant and sentence lipreading in quiet and noise. The profoundly hearing impaired subjects were selected on the criterion of showing limited benefit in aided lipreading from conventional hearing aids.
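As an illustration of what a frequency- and amplitude-modulated sinusoid involves, the sketch below synthesises one from per-sample frequency and amplitude tracks using a running phase. It is not the SiVo-II implementation: the function name, sample rate and tracks are hypothetical.

```python
import math

def fm_am_sinusoid(f0_track_hz, amplitude_track, sample_rate_hz):
    """Sinusoid whose frequency and amplitude follow per-sample tracks."""
    phase, out = 0.0, []
    for f0, amplitude in zip(f0_track_hz, amplitude_track):
        out.append(amplitude * math.sin(phase))
        # Accumulating phase keeps the waveform continuous as f0 changes.
        phase += 2.0 * math.pi * f0 / sample_rate_hz
    return out

FS = 8000
# Hypothetical tracks: 100 Hz at full amplitude, then 200 Hz at half amplitude.
f0_track = [100.0] * 80 + [200.0] * 80
amplitude_track = [1.0] * 80 + [0.5] * 80
signal = fm_am_sinusoid(f0_track, amplitude_track, FS)
```

Accumulating phase, rather than computing sin(2πft) afresh at each sample, avoids discontinuities in the waveform when the fundamental frequency changes from one cycle to the next.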

Amplitude envelope information may not be expected to enhance consonant identification. In noise, however, the fundamental frequency extractor may on occasion respond to noise rather than speech, and it is then expected to be helpful to preserve the intensity relations between responses to voiced speech and to the (generally lower-level) background noise.

Consonant identification

Eleven subjects took part in a comparison of the SiVo-II and conventional aids with speech in quiet and in a background of speech-spectrum shaped noise at signal-to-noise ratios of +10 and +5 dB (measured from the peak levels of speech and noise). Results are shown in Figure 2.

Figure 2. Percentage correct scores in audio-visual consonant identification. The box and whisker plot shows the median, interquartile range and range excluding outliers. The white boxes represent scores with a conventional hearing aid, and black boxes scores using the SiVo-II.

An ANOVA showed a significant main effect of hearing aid; F(1,10) = 9.84, p = 0.01. The main component of this advantage was from a higher transmission of voicing information; F(1,10) = 12.66, p = 0.005. At the higher noise levels, performance with the conventional aid was comparable to the same subjects' unaided lipreading performance.

Sentence lip-reading

A similar comparison using the BKB audio-visual sentence materials in signal-to-noise ratios of 0 and 5 dB in seven profoundly hearing impaired subjects has shown a non-significant trend towards higher scores with the SiVo-II aid; F(1,6) = 4.35, p = 0.08.

CONCLUSIONS

The role of amplitude envelope in consonant identification has been clarified, and it appears that the gross timing of both periodic and aperiodic speech excitation is of much greater significance than are variations in speech amplitude or voice fundamental frequency at this segmental level. Both amplitude and fundamental frequency variation are clearly of significance in speech perception at the sentence level.

In profoundly hearing impaired subjects who cannot make effective use of amplifying hearing aids, the use of noise-resistant speech pattern analysis can lead to significantly improved perception. Although the real-time analysis methods used in these studies are imperfect, the combination of a simplified signal and the rejection of noise offers more speech information to some listeners than they can extract from the full speech signal.

ACKNOWLEDGEMENTS

The assistance of Kirsti Reeve, Deborah Vickers, Kerensa Smith, and Athena Euthymiades in carrying out these experiments is gratefully acknowledged. Supported by the Medical Research Council (UK), a Wellcome Trust Vacation Scholarship, CEC TIDE projects 206 (STRIDE) and 1217 (OSCAR), and Northwestern University.

REFERENCES

[1] S. Rosen. "Temporal information in speech: acoustic, auditory and linguistic aspects". Phil. Trans. Royal Soc. London B, vol. 336, 1992, pp. 367-373.

[2] D. J. Van Tasell et al. "Temporal cues for consonant recognition: Training, talker generalization, and use in the evaluation of cochlear implants". J. Acoust. Soc. Am., vol. 92, 1992, pp. 1247-1257.

[3] A. Faulkner, S. Rosen and B. C. J. Moore. "Residual frequency selectivity in the profoundly hearing impaired listener". Br. J. Audiol., vol. 24, 1990, pp. 381-392.

[4] A. Faulkner et al. "Speech pattern hearing aids for the profoundly hearing-impaired: Speech perception and auditory abilities". J. Acoust. Soc. Am., vol. 91, 1992, pp. 2136-2155.

[5] S. Rosen et al. "Lipreading with fundamental frequency information". Proc. Inst. Acoust. Autumn Conf., 1979, pp. 5-8.

[6] S. Rosen et al. "Lipreading connected discourse with fundamental frequency information". Brit. Soc. Audiol. Newsletter (Summer), 1980, pp. 42-43.

[7] R. S. Waldstein and A. Boothroyd. "Speechreading enhancement using a sinusoidal substitute for voice fundamental frequency". Speech Communication, vol. 14, 1994, pp. 303-312.

[8] K. W. Grant et al. "The contribution of fundamental frequency, amplitude envelope, and voicing duration cues to speechreading in normal-hearing subjects". J. Acoust. Soc. Am., vol. 77, 1985, pp. 671-677.

[9] M. Breeuwer and R. Plomp. "Speechreading supplemented by auditorily presented speech parameters". J. Acoust. Soc. Am., vol. 79, 1985, pp. 481-499.

[10] J. R. Foster et al. "Lip-reading the BKB sentence lists: corrections for list and practice effects". Br. J. Audiol., vol. 27, 1993, pp. 233-246.

[11] J. R. Walliker and I. S. Howard. "The implementation of a real time speech fundamental period algorithm using multi-layer perceptrons". Speech Communication, vol. 9, 1990, pp. 63-71.

[12] J. R. Walliker et al. "Speech analytic hearing aids for the profoundly deaf". In B. Granström, S. Hunnicut and K-E. Spens (Eds.) Speech and Language Technology for Disabled Persons. ESCA/ETRW, Stockholm, 1993, pp. 35-38.