Department of Phonetics and Linguistics

PERFORMANCE OF A REAL-TIME MULTI-LAYER PERCEPTRON ALGORITHM FOR THE EXTRACTION OF VOICING AND VOICE FUNDAMENTAL FREQUENCY INFORMATION

Andrew FAULKNER and Jianing WEI

Abstract
The SiVo-3 hearing aid provides voicing, fundamental frequency, voiceless excitation, and amplitude envelope information as a supplement to lip-reading for profoundly hearing-impaired people. The extraction of voicing and voice fundamental frequency is performed by a real-time multi-layer perceptron that is trained to locate, in noisy speech, acoustic events associated with the instant of laryngeal closure. Training makes use of reference data from a laryngographic electrode recording. The performance of a 'C' language off-line prototype algorithm and of a real-time version for the TI TMS320C50 processor has been compared to standard Cepstral and SIFT algorithms. For both quiet and noisy speech, the off-line 'C' language MLP showed superior detection of voicing compared to the reference methods. The accuracy of the off-line MLP in fundamental frequency estimation was similar to that of SIFT. For speech in quiet, the Cepstral method was substantially more accurate than the other methods in fundamental frequency estimation. However, the fundamental frequency estimation accuracy of the Cepstral method was more affected by noise than that of the MLP and SIFT algorithms, so that at a +10 dB signal-to-noise ratio the Cepstral method performed similarly to the others.

The real-time MLP was compared to the same two reference methods for speech in quiet, with similar results to those from the off-line algorithm. It is concluded that the MLP method provides acceptable performance given the requirements of this hearing aid application, where voicing detection is of primary importance and errors of fundamental frequency estimation are relatively innocuous given the limited frequency discrimination shown by profoundly impaired listeners.

1. Introduction
The SiVo-3 hearing aid, a major component of the TIDE project OSCAR, features a speech-analytic mode designed to provide lipreading support for profoundly hearing-impaired listeners. This mode provides simple acoustic signals conveying those speech elements that are known to be valuable supplements to lipreading for this population (e.g. Faulkner et al., 1993). The selected elements are as follows.

Periodic speech is represented by a sinusoid within the comfortable hearing range that follows the extracted fundamental frequency. Aperiodic speech is represented by a noise with a spectrum shaped to match the comfortable listening level between 125 and 2000 Hz. Speech amplitude envelope, after 2:1 logarithmic compression, is used to amplitude modulate both sinusoid and noise signals, with maximum and minimum levels limited by threshold and maximal comfortable levels.
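As an illustration of how these three cues combine into the output signal, here is a minimal C sketch. It is not the SiVo-3 firmware: all names and parameters are illustrative, and the square-root form of the 2:1 logarithmic compression (halving the dB range of the envelope) is our reading of the description above.

    /* Minimal sketch of the output stage described above.  Illustrative
       only: names and parameter handling are assumptions, not SiVo-3 code. */
    #include <math.h>

    /* 2:1 logarithmic compression: halving the dB range of the envelope
       corresponds to a square root in the linear amplitude domain. */
    static double compress_2to1(double env, double ref)
    {
        return ref * sqrt(env / ref);
    }

    /* One output sample: a sinusoid at the extracted Fx when voiced,
       spectrally shaped noise when voiceless; both are amplitude-modulated
       by the compressed envelope, limited to the user's dynamic range. */
    double output_sample(int voiced, double phase, double shaped_noise,
                         double env, double min_level, double max_level)
    {
        double carrier = voiced ? sin(phase) : shaped_noise;
        double amp = compress_2to1(env, max_level);
        if (amp < min_level) amp = min_level;
        if (amp > max_level) amp = max_level;
        return amp * carrier;
    }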

The present report describes evaluations of both an off-line 'C' prototype and a real-time version of the algorithm employed by the SiVo-3 aid for the extraction of the voicing pattern and voice fundamental frequency. The real-time SiVo-3 algorithm is also compared with a previous algorithm of somewhat different structure that was used in the preceding SiVo-2 aid. One reason for the choice of a multi-layer perceptron (MLP) for the SiVo-3 is that the computations used by an MLP can be performed very efficiently by a fixed-point digital signal-processor chip. A second reason comes from the initial demonstration by Howard and Huckvale (1988) that an MLP can be rather effective for the extraction of voice fundamental frequency from noisy speech. The use of an MLP was expected to be a significant advance, particularly in respect of noise resistance, over previous real-time peak-picking methods (Howard, 1989) that had been used both in the original SiVo aid and also in the Nucleus cochlear implant speech processor.

1.1 Performance targets
In the audio-visual perception of consonants, the gross timing of the voicing pattern has been shown to be the principal source of auditory information (Faulkner, Rosen and Reeve). It is therefore expected that the accuracy of the voicing pattern cues from the SiVo-3 will be of major importance. An accurate representation of fundamental frequency (Fx) is highly desirable, since supra-segmental intonation information also contributes significantly to audio-visual perception of connected speech. Another important aspect of performance in this application is that the extracted fundamental frequency information can represent the irregularity of vocal fold vibration, in order to give the user feedback on their own voice control (Ball et al., 1990). For this purpose, a time-based analysis of fundamental frequency that operates cycle-by-cycle is highly desirable.

2. Algorithm structure
The approach presented here employs a multi-layer perceptron (MLP) neural network pattern classifier (Rumelhart & McClelland, 1988) which carries out a non-linear mapping of the input data set to the output unit of the network. The MLP classifiers discussed here depend on the use of supervised training, by which the neural net is provided with labelled reference data that specifies the correct classification of the network output for each time-frame of the training data, whereby one frame per period (at the moment of larynx closure) is designated as voiced and all other frames as unvoiced. This allows the algorithm to give cycle-by-cycle estimates of Fx.

The algorithm consists of three parts: 1) the pre-processor, 2) the main extractor, and 3) a post-processor. The evaluations here cover two related MLP networks that differ in both the pre-processing and the network structure.

The input data to the MLP classifier is a number of waveform samples from the pre-processor. The analysis window needs to be long enough to allow the classifier to "see" one or more periods of speech. The two MLPs used here have analysis windows of 30.5 ms (MLPa) and 20.5 ms (MLPb).

The MLP is trained with the objective of making its final output unit produce a high output at the time of vocal fold closure in the speech input, and a low output at other times. The algorithm is, therefore, estimating the time at which each vocal fold closure occurs, from which each successive fundamental period is then estimated.

2.1 MLPa Pre-processor
MLPa is the designation of the MLP used in the SiVo-3 aid. The pre-processing stage principally performs data reduction of the speech signal. The speech input is low-pass filtered at 900 Hz, using a 4th order IIR low-pass filter. The signal is then down-sampled to a 2 kHz sample rate to match the cycle rate of the MLP. Each cycle of the MLP has as its input data the speech waveform samples from 61 successive 0.5 ms frames.
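The chain just described - filter, decimate, window - can be sketched in C as below. This is an illustration under stated assumptions: two cascaded biquad sections stand in for the 4th-order 900 Hz IIR filter (coefficients omitted), and a decimation factor of 5 maps a 10 kHz input rate onto the 2 kHz MLP cycle rate.

    /* Sketch of the MLPa pre-processor: low-pass filter, decimate to 2 kHz,
       and keep a sliding 61-frame (30.5 ms) window for the MLP input.
       Filter coefficients are placeholders, not the SiVo-3 design values. */
    #include <string.h>

    #define NTAPS 61                          /* 61 frames of 0.5 ms */

    typedef struct { double b0, b1, b2, a1, a2, z1, z2; } Biquad;

    static double biquad_run(Biquad *s, double x)  /* direct form II transposed */
    {
        double y = s->b0 * x + s->z1;
        s->z1 = s->b1 * x - s->a1 * y + s->z2;
        s->z2 = s->b2 * x - s->a2 * y;
        return y;
    }

    /* Called once per input sample; every decim-th filtered sample becomes
       a new 0.5 ms frame in the MLP input window. */
    void preprocess(Biquad lpf[2], double x, int *phase, int decim,
                    double window[NTAPS])
    {
        double y = biquad_run(&lpf[1], biquad_run(&lpf[0], x));
        if (++*phase == decim) {              /* e.g. 10 kHz -> 2 kHz: decim = 5 */
            *phase = 0;
            memmove(window, window + 1, (NTAPS - 1) * sizeof window[0]);
            window[NTAPS - 1] = y;            /* newest frame at the end */
        }
    }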

2.2 MLPb Pre-processor
The second MLP considered here, MLPb, is a predecessor to MLPa. It is similar to the algorithm described by Walliker and Howard (1990) and to that used in the earlier SiVo-2 device as evaluated in the TIDE project TP133/206 STRIDE. The input data for the MLPb algorithm is sampled both in time and frequency from the speech signal. The speech is first processed through 6 band-pass filters, and then the output of each filter is down-sampled to a 2 kHz sample rate. Forty-one successive frames from each of the six filters provide the input data to the MLP. MLPb differs from the MLP used in the SiVo-2 aid in the use of a post-processing buffer.

2.3 MLP structures
The main fundamental frequency extractor is an MLP pattern classifier containing an input layer, two hidden layers, and an output layer. Theoretically, a one-hidden-layer MLP network is capable of forming an arbitrarily close approximation to any non-linear decision boundary for any classification task (Makhoul et al., 1989). In practice, a small two-hidden-layer network can be used where a one-hidden-layer network would require an unrealistically large number of nodes (Chester, 1990). The input layer's size is determined by the number of inputs. MLPa has 61 input units compared to the 246 input units of MLPb, and hence it is possible, with a similar level of computational effort, to use more units in the intermediate layers of MLPa. The MLPa algorithm has a 61-20-6-1 structure, while MLPb has a structure of 246-6-6-1 units. Both structures are fully connected.
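As a concrete illustration of why such a network is cheap to run on a fixed-point DSP, the forward pass reduces to nested multiply-accumulate loops. The C sketch below shows the 61-20-6-1 case; the sigmoid non-linearity and the flat weight layout are illustrative assumptions, not details taken from the SiVo-3 code.

    /* Sketch of a fully connected 61-20-6-1 forward pass with sigmoid
       units.  Each layer is a multiply-accumulate loop of the kind that
       maps directly onto DSP hardware. */
    #include <math.h>

    static double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }

    /* One layer: out[j] = sigmoid(bias[j] + sum_i w[j*nin + i] * in[i]) */
    static void layer(const double *w, const double *bias,
                      const double *in, int nin, double *out, int nout)
    {
        for (int j = 0; j < nout; j++) {
            double acc = bias[j];
            for (int i = 0; i < nin; i++)
                acc += w[j * nin + i] * in[i];
            out[j] = sigmoid(acc);
        }
    }

    /* 61 inputs -> 20 -> 6 -> 1 output */
    double mlpa_forward(const double in[61],
                        const double *w1, const double *b1,
                        const double *w2, const double *b2,
                        const double *w3, const double *b3)
    {
        double h1[20], h2[6], out;
        layer(w1, b1, in, 61, h1, 20);
        layer(w2, b2, h1, 20, h2, 6);
        layer(w3, b3, h2, 6, &out, 1);
        return out;   /* high near a vocal fold closure, low elsewhere */
    }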

2.4 Training reference data
The reference target data is derived from the laryngograph (Lx) signal (Fourcin & Abberton, 1971) which directly represents the events in the closed phase of vocal fold vibration. Other work has been reported using the raw Lx signal as the reference data to train an MLP classifier for larynx excitation (Denzler et al., 1993). Some disadvantages of using the unprocessed Lx signal are that 1) the DC offset of the Lx signal can be significant, and 2) the amplitude range of the Lx signal varies widely from speaker to speaker. In this study, the Lx signal was processed to make it more suitable for MLP training.

2.4.1 Lx processing for MLPb
The Lx-based training data for the earlier MLPb algorithm was a discontinuous unit-pulse train, with the value of one at each estimated time of vocal fold closure, and value zero elsewhere. The discontinuous nature of this training reference is likely to be a disadvantage, since there is no information in the reference signal to indicate frames that are close to, but not exactly synchronous with, a vocal fold closure. Hence, an MLP output pulse in the frame adjacent to the target frame cannot be treated differently from an output pulse many frames away from the target frame in respect of its contribution to the error during training. The reference data was generated by automatic labelling of the Lx signal to identify instants of larynx closure. As no perfect automatic labelling method was available, subsequent hand-correction was required.

2.4.2 Lx processing for MLPa
For MLPa, the reference signal was a continuous one. The Lx signal was filtered using a zero-phase 8th-order 20 Hz IIR high-pass filter to remove DC. It was then half-wave rectified and normalised cycle-by-cycle to the range 0.0 to 1.0. By applying this processed Lx signal as the reference for the MLP classifier, we avoided hand-labelling, which is time-consuming and can be inaccurate (Barnard, 1991). The continuously varying reference signal contains information that may be relevant to the training process at every sample value of each vocal fold cycle.
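The rectification and normalisation steps can be sketched in C as follows. This is an off-line illustration: the Lx input is assumed to have been zero-phase high-pass filtered already, and treating each run of non-zero rectified samples as one larynx cycle is our simplification.

    /* Sketch of the MLPa training-reference preparation: half-wave rectify
       the (already high-pass filtered) Lx signal, then normalise each
       cycle to a peak of 1.0.  Cycle segmentation here is simplified. */
    void make_reference(double *lx, int n)
    {
        for (int i = 0; i < n; i++)
            if (lx[i] < 0.0) lx[i] = 0.0;          /* half-wave rectify */

        int start = 0;
        while (start < n) {
            while (start < n && lx[start] == 0.0)  /* skip zero regions */
                start++;
            int end = start;
            double peak = 0.0;
            while (end < n && lx[end] > 0.0) {     /* one closure cycle */
                if (lx[end] > peak) peak = lx[end];
                end++;
            }
            for (int i = start; i < end; i++)
                lx[i] /= peak;                     /* scale to 0.0 - 1.0 */
            start = end;
        }
    }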

2.4.3 Recording of training data
The training data were recorded in a non-acoustically treated room at the Department of Phonetics and Linguistics, University College London, with a 50 cm speaker to microphone spacing. The data consisted of two signals, the speech pressure waveform and the laryngograph signal. Eight speakers (four male and four female) each read two passages "Arthur the Rat" and "The Rainbow Passage". A speech spectrum shaped noise was added to part of the speech waveform training data, at speech-to-noise ratios (S/N) of 20, 15 and 10 dB. S/N ratio was determined from the maximum speech rms level measured in a 500 ms window and the noise rms level. In order to provide training data that represented a range of speech input levels, parts of the noisy and quiet training data were attenuated by 0, 6 or 12 dB.

2.5 Training method
The network training used a standard back propagation method (Rumelhart et al., 1986). From the Generalised Delta Rule, the current weight change in the MLP network, $\Delta W_{ij}(n)$, is proportional to the gradient of the error $E$ for the input/output pattern,

$$\Delta W_{ij}(n) \propto -\eta \frac{\partial E}{\partial W_{ij}},$$

where $\eta$ is a positive constant known as the "learning rate". Training can be speeded up by increasing $\eta$; however, oscillations eventually arise. The gradient descent method can be enhanced by introducing a "momentum term" derived from the previous weight change:

$$\Delta W_{ij}(n) \propto -\eta \frac{\partial E}{\partial W_{ij}} + \alpha\,\Delta W_{ij}(n-1),$$

where the damping value $\alpha$ acts to optimise the learning behaviour. It not only offsets oscillations but also increases the effective learning rate. However, if $\alpha$ is too large, it will dominate the weight update and the updating direction can deviate far from the steepest gradient. Here an adaptive method for choosing $\eta$ and $\alpha$ was employed, in which the weights of the MLP are adapted by changing the learning parameters as a function of the direction of adaptation in the parameter space (Chan & Fallside, 1987). This method attempts to eradicate two specific weaknesses of the fixed-coefficient learning algorithm: oscillations across the walls of the ravine of the error surface, and the long search caused by the shallow gradient of the error surface in the weight space.
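In code, the momentum update of the second equation is a one-line modification of plain gradient descent. The C sketch below uses fixed eta and alpha for clarity; as noted above, the actual training adapted both, following Chan & Fallside (1987).

    /* Sketch of the momentum update: dW(n) = -eta*dE/dW + alpha*dW(n-1).
       Fixed eta and alpha shown; the training reported here adapted both. */
    void update_weights(double *w, const double *grad, double *dw_prev,
                        int n, double eta, double alpha)
    {
        for (int i = 0; i < n; i++) {
            double dw = -eta * grad[i] + alpha * dw_prev[i];
            w[i] += dw;
            dw_prev[i] = dw;    /* remembered as the next momentum term */
        }
    }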

The standard gradient descent adaptation method modifies the MLP weights after all the training data samples have been presented to the network (batch learning). However, our experience has been that it is faster to update the weights after the presentation of a subset of training data (on-line learning).

The process of computing the gradient and adjusting the weights is repeated until a minimum is found. In practice, it is difficult to automate location of the error minimum. Here, the method of cross-validation was used to monitor generalisation performance during learning. The method of cross-validation works by splitting the data into two sets: a training set that is used to train the network, and a test set that is used to measure the generalisation performance of the classifier. During training, the performance of the classifier on the training data will improve, but its performance on the test data will only improve to a point, beyond which its performance on other data will start to deteriorate. This is the point where the network starts to over-learn, and it is here that the learning process should be terminated.
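A minimal sketch of this early-stopping loop is given below; the training and validation routines are supplied by the caller, and the patience constant is an assumption for illustration, not a value from this study.

    /* Sketch of cross-validation early stopping: train while the held-out
       error keeps improving; stop once it has failed to improve for
       PATIENCE passes.  The callbacks and PATIENCE are illustrative. */
    #define PATIENCE 10

    void train_with_early_stopping(void (*train_pass)(void),
                                   double (*heldout_error)(void),
                                   void (*save_best_weights)(void))
    {
        double best = 1e30;
        int since_best = 0;
        while (since_best < PATIENCE) {
            train_pass();                    /* on-line weight updates */
            double e = heldout_error();      /* error on the test set */
            if (e < best) { best = e; save_best_weights(); since_best = 0; }
            else since_best++;               /* network beginning to over-learn */
        }
    }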

A number of training runs were made with different random initial coefficients for the MLP as a precaution against finding local minima in the error space. The run giving the lowest overall error from the training data was selected to determine the MLP coefficients that were ultimately used.

2.6 Post-Processor
The post-processing stage performs error correction and smoothing of the raw estimates from the main extractor. The post-processor outputs a fundamental period marker for each estimated fundamental period. The history of several previous output periods needs to be stored in order to correct the errors and smooth the output from the main extractor. A 20.5 ms buffer was introduced which could hold 40 samples of previous MLP output and the present output. This is shown in Figure 1.

Figure 1: Buffer Post-Processor.

Each incoming MLP output is stored at the beginning of the buffer and all the other values in the buffer are shifted along by one frame. When the value at the end of the buffer is both a local peak and exceeds a pre-determined primary threshold, the processor will look along the buffer for a second peak. If there is a second peak within the buffer, the time difference between the first and second peaks is calculated. If this fundamental period Tx is the first period after a silent or a voiceless speech segment, it is immediately accepted as a valid output. If it occurs within 30 ms of a previously detected period of voiced excitation, this Tx value is compared with the estimated previous fundamental period Te, where Te is the average of the previous two fundamental periods, (Tx-1 + Tx-2)/2. If Tx is within 20% of Te, then Tx is accepted as a valid output. If Tx is not within 20% of Te, then the processor will look for a second peak that is within 20% of Te. If a peak has been found within this range that exceeds a lower secondary threshold, then it is used to calculate the present Tx. Otherwise, the previously detected second peak is used to calculate Tx.
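The following C sketch is a compact reading of the decision logic just described. It is not the SiVo-3 implementation: the local-peak tests are simplified, and the 30 ms voicing-continuity condition is folded into the te argument.

    /* Compact sketch of the buffer post-processor.  buf[BUFLEN-1] is the
       oldest frame (the decision point); te is the expected period Te in
       frames, or 0 if no voicing was detected in the previous 30 ms.
       Returns a period estimate in 0.5 ms frames, or 0 for no new period. */
    #include <stdlib.h>

    #define BUFLEN 41                       /* 20.5 ms of 0.5 ms frames */

    int postprocess(const double buf[BUFLEN], double t_primary,
                    double t_secondary, int te)
    {
        int old = BUFLEN - 1;
        if (buf[old] < t_primary || buf[old] < buf[old - 1])
            return 0;                       /* not a qualifying first peak */

        int second = 0;                     /* strongest later peak */
        for (int i = 1; i < old; i++)
            if (buf[i] > buf[i - 1] && buf[i] >= buf[i + 1] &&
                (second == 0 || buf[i] > buf[second]))
                second = i;
        if (second == 0)
            return 0;

        int tx = old - second;              /* candidate period, in frames */
        if (te == 0 || abs(tx - te) <= te / 5)
            return tx;                      /* voicing onset, or within 20% of Te */

        /* otherwise accept a weaker peak within 20% of Te if it exceeds
           the secondary threshold */
        for (int i = 1; i < old; i++) {
            int t = old - i;
            if (abs(t - te) <= te / 5 && buf[i] >= t_secondary &&
                buf[i] > buf[i - 1] && buf[i] >= buf[i + 1])
                return t;
        }
        return tx;                          /* fall back to the first choice */
    }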

Each Tx estimate is used to initiate the generation of a single cycle of a cosine signal. The sequence of cosine cycles forms a signal that is matched to the auditory abilities of profoundly hearing impaired listeners and represents both the fundamental frequency pattern and the on-off voicing pattern of the speech input. The MLP fundamental frequency extractor and the buffer processor introduce a delay of 35 ms (MLPa) or 30 ms (MLPb) between the input and output.
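The synthesis step itself is simple: each accepted Tx triggers one cosine cycle of matching period, as in the hedged C sketch below (the sample-rate handling is illustrative).

    /* Each accepted Tx estimate triggers one cycle of a cosine of the
       matching period; fs is the output sample rate.  Illustrative only. */
    int emit_cycle(double *out, double tx_seconds, double fs)
    {
        int n = (int)(tx_seconds * fs + 0.5);   /* samples in one period */
        for (int i = 0; i < n; i++)
            out[i] = __builtin_cos(2.0 * 3.14159265358979 * i / n);
        return n;
    }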

In the operation of the SiVo-3, each period estimate is used to control the generation of a single period of a cosine wave. This introduces a further delay of Tx (in the range 0.5 to 30 ms). Studies of aided lip-reading in normal listeners have shown that if the time delay of a fundamental frequency signal compared to the original speech is within 40 ms, there is no effect on performance, and that delays up to 80 ms produce only slight and non-significant decreases in performance (McGrath & Summerfield, 1985).

3. Evaluation 1. UNIX 'C' implementation of MLPa: performance in quiet and noise
This first study concerns the performance of an off-line 'C' language implementation of MLPa that was developed prior to the real-time version in TMS320C50 assembly language.

3.1 Comparison methods
In this section, the MLPa algorithm is compared to two standard methods. In each case, the implementations were those of the ILS signal processing software library (ILS, 1986).

The chosen methods were the "cepstrum" technique (Noll, 1967), and the "Simplified Inverse Filter Tracking" or SIFT algorithm (Markel, 1972). While more recent algorithms exist that may give superior performance (e.g. Secrest and Doddington, 1983), these standard Cepstral and SIFT methods have the merit of being widely used and documented. The ILS Cepstral analysis was applied with the default settings for a 10 kHz sample rate; the analysis window was 32 ms in duration. The default parameters for the ILS SIFT analysis were also used. Here the effective analysis window was 50 ms, with smoothing of the Fx estimates applied over four successive windows.

3.2 Test data
The test data, like the training data, contain both speech and Laryngograph waveforms. The laryngograph signal was used to derive the reference fundamental frequency for comparison with the estimates from each algorithm. Four speakers were recorded, two male and two female. They were different speakers to those recorded for the training data. Each read a standard speech passage, "The North Wind and the Sun".

The test data included noisy speech produced in the same way as the noisy training speech data, with S/N of 15 dB and 10 dB. The test speech data had a sampling rate of 10 kHz.

The fundamental frequency and fundamental period estimates from the algorithms were all expressed in terms of fundamental frequency. The Fx sample rate was 200 Hz for the MLPa output, and 100 Hz for the other methods. The Fx data from the MLP algorithm were resampled to 100 Hz before comparisons were made.

3.3 Voicing classification performance
One basic parameter of fundamental frequency extraction algorithms, especially in this SiVo-3 application, is the voiced-unvoiced decision. This decision has two error components, where voiced frames are wrongly classified as unvoiced, and where unvoiced frames are incorrectly classified as voiced. These two errors for the three algorithms are shown in Figure 2 through Figure 4.

Figure 2: Voicing classification errors in quiet

Figure 3: Voicing errors at 15 dB S/N

Figure 4: Voicing errors at 10 dB S/N

The voicing errors in each case show a tendency for more false alarms as the signal-to-noise ratio deteriorates; this is probably an appropriate bias in a hearing aid to supplement lipreading, where the user may be able to ignore false alarms that occur during (visible) pauses in speech.

An overall measure that combines both sources of error in voicing classification is the signal detectability index d' (Peterson et al., 1954). Voicing detection performance measured by d' is shown in Figure 5. This measure makes clear that the MLPa algorithm is the most effective for voicing classification, especially in quiet. The SIFT algorithm is poorer as a result of the higher number of unvoiced frames that it classifies as voiced, while the Cepstral analysis is poorer than the MLP in missing a higher number of voiced frames.
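For reference, d' is derived in the standard way from the hit rate (the proportion of voiced frames detected as voiced) and the false-alarm rate (the proportion of unvoiced frames labelled voiced):

$$d' = \Phi^{-1}(P_{hit}) - \Phi^{-1}(P_{FA}),$$

where $\Phi^{-1}$ is the inverse of the standard normal distribution function. Unlike either error rate alone, d' is unaffected by a simple bias towards voiced or unvoiced decisions.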

Figure 5: Voicing detection accuracy measured by d'

3.4 Accuracy of fundamental frequency extraction
For all samples of the reference Fx pattern that were classified as voiced according to the laryngograph reference, Fx errors were computed for each algorithm. Errors of greater than one octave are designated gross errors. The proportion of frames where gross errors occurred is shown in Figure 6 through Figure 8.
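Expressed as a formula, a frame counts as a gross error when the estimate lies more than a factor of two above or below the reference:

$$\left| \log_2 \left( \hat{F}_x / F_x^{ref} \right) \right| > 1.$$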

Figure 6: Proportion of frames of voiced speech showing gross Fx errors for quiet speech

Figure 7: Proportion of frames showing gross Fx errors: S/N = 15 dB

Figure 8: Proportion of frames showing gross Fx errors: S/N = 10 dB

All of the methods show relatively low rates of gross Fx error. The rate is markedly higher for MLPa than for either the Cepstral or the SIFT method, but never exceeds 5%.

The distributions of Fx errors of one octave and less are shown in Figure 9 through Figure 11.

Figure 9: Distribution of Fx errors in quiet. Solid line - Cepstrum; dashed line - MLPa; dotted line - SIFT.

Figure 10: Distribution of Fx errors: 15 dB S/N

Figure 11: Distribution of Fx errors: 10 dB S/N

The distributions of errors of one octave and less show that the ILS Cepstral method is always more accurate than either the SIFT or the MLPa algorithm in the precision of Fx estimation. However, the error distribution from the Cepstral method is much more affected by the presence of noise than are those of the other methods, and at the poorest signal-to-noise ratio, the distribution is much more similar to that from the SIFT and MLPa algorithms. The SIFT and MLPa algorithms show very similar distributions of Fx error, except that the SIFT Fx estimates tend to be between one and two semitones too low. Such a constant error would not be important for an auditory supplement to lipreading.

3.5 Conclusions
The off-line prototype MLP has been shown to give superior performance in voicing classification to either reference method, both in quiet and in noise. The Fx estimation of the OSCAR MLP is less accurate than that of the Cepstral analysis, both in terms of gross and less extreme Fx errors, but this difference is relatively small at poorer signal-to-noise ratios. The OSCAR MLP is more accurate than the SIFT method in respect of less extreme Fx errors. The gross Fx error rate of the OSCAR MLP is larger than that of the two reference algorithms, but is still relatively low.

Since in this application the accuracy of voicing detection is of primary significance, and the accuracy of Fx estimation secondary, the results confirm that the MLP is likely to be appropriate for the SiVo application.

4. Evaluation 2. Real-time MLP running on SiVo-3 speech processor: performance in quiet
To evaluate the MLP algorithms running in real-time on the SiVo-3 device, simultaneous recordings were made (Alesis A-DAT digital tape recorder) of the following input, output, and reference signals:

  1. Speech - from the microphone of a SiVo-3 unit
  2. The waveform from a Laryngograph®
  3. The output signal from the MLPa algorithm running on a SiVo-3 processor
  4. The output signal from the MLPb algorithm running on a second SiVo-3

Recordings were made in a quiet room with a microphone distance of about 1 m. The microphone was that built into the case of the SiVo-3 unit. Recordings were taken from four female and four male talkers who had not been used for the MLP training data.

Evaluations are based on the voicing and fundamental frequency patterns sampled in 5 ms frames, compared to the voicing and fundamental frequency pattern derived from the Laryngograph waveform. Comparison data were obtained from the same speech samples using the Cepstral and SIFT fundamental frequency analysis algorithms provided in the ILS signal processing package. These were the same as the ILS reference methods used in the earlier evaluation of the Unix 'C' implementation of MLPa.

For each sample recording, the speech and laryngograph waveforms, and the outputs of the MLPa and MLPb algorithms after post-processing, were digitally acquired using a four-channel PCLx card and software from Laryngograph Ltd. The data were then copied to a Sun Unix computer system and processed using the Speech Filing System (SFS) software (Huckvale, 1996) to extract voicing and fundamental frequency patterns. Unvoiced frames were assigned a fundamental frequency of zero Hz.

These data were finally analysed using SPSS for Windows to make the following comparisons for each sample recording:

  1. Cross-correlation of voicing and fundamental frequency patterns (Fx) from each algorithm compared to those extracted from the laryngograph signal to verify that the algorithm outputs had been optimally time-aligned with the reference data.
  2. Computation of the numbers of 5 ms frames of speech classified as voiced or unvoiced compared to classifications based on the laryngograph waveform
  3. Scatter-plots and correlations of the fundamental frequency values estimated from each algorithm compared to estimates from the laryngograph signal
  4. For each frame that was classified as voiced from the laryngograph reference, the ratio of the fundamental frequency estimated by each algorithm to that estimated from the laryngograph reference.

4.1 Cross-correlations of estimated Fx with reference
Cross-correlation indicated that the time-alignment of the MLP output signals was correct - the delays involved in the MLP and post-processing buffer were 35 ms for MLPa and 30 ms for MLPb. The ILS algorithms produced correctly time-aligned output without the need for correction.

4.2 Voiced/unvoiced classification
Voicing classification performance has been measured in two ways. Firstly, by the cross-correlation of the voiced/unvoiced classification frame-by-frame of the algorithm output with the reference from the Laryngograph. Secondly, by the signal detection statistic d'. Correlations and d' were compared between algorithms by pairwise t tests using a significance criterion of p=0.05.

The correlation comparisons are shown in Figure 12. Across the eight talkers, the correlation with the reference voicing pattern was not significantly different between MLPa and MLPb. The correlations from MLPa were not significantly different from those from the Cepstral analysis, but MLPb was significantly less accurate than the Cepstral analysis. The correlations from the SIFT analysis were significantly lower than those from all of the other three algorithms.

Figure 12: Correlation of frame-by-frame voicing classification with Lx reference. The box and whisker plot shows the median value as a bar, the 75% range as the box, and the extreme values as the whiskers.

Figure 13: Voicing detection performance measured by d'

The d' measure (see Figure 13) showed no substantial differences between the MLP algorithms. Voicing detection by the Cepstral analysis was significantly higher than for the MLP methods, and voicing detection by SIFT was significantly worse than for any of the other methods.

4.3 Scatter-plots of estimated and reference Fx


Figure 14: Scatter-plots of reference Fx and estimated Fx from the Cepstral and SIFT algorithms. The panels show data from the male and female speakers for each of the two algorithms. The solid line is a linear regression fit for the data. The squared correlation for each fit is shown to the right of each panel.


Figure 15: Scatter-plots of reference Fx and estimated Fx from MLPa and MLPb. The panels show data from the male and female speakers for each of the two algorithms. The solid line is a linear regression fit for the data. The squared correlation for each fit is shown to the right of each panel.

Scatter-plots of reference compared to estimated Fx are shown in Figure 14 and Figure 15. Perfect performance would result in a scatter-plot where all points lay on a straight line of unit slope. None of the methods approaches this level of accuracy. The Cepstral analysis provides the most accurate result, and the SIFT method the poorest. MLPa is rather better than MLPb. Both MLPs show a tendency to overestimate higher fundamental frequencies for male speakers. The squared correlations shown in the figures represent a measure of goodness of Fx estimation that includes the effect of voicing error (with Fx arbitrarily valued at zero for unvoiced frames). This measure is generally higher for the female speakers than for the male speakers. The algorithms rank as Cepstrum > MLPa > MLPb > SIFT on this measure.

4.4 Distribution of Fx errors
Fx errors are shown here by histograms displaying the ratio between the estimated Fx value from each algorithm and Fx from the laryngograph reference. Frames of data where the Lx reference is unvoiced are not included, but frames where the algorithms fail to detect voicing are included.


Figure 16: Distribution of Fx estimates. The four panels show data for male and female speakers from the ILS Cepstral and SIFT algorithms. The data are for all frames where the Lx reference was voiced, and include frames that the algorithms classified as unvoiced, for which the ratio is zero. Each bar represents a range of 6.25%. The algorithm is indicated on the abscissa, and the speaker sex above each panel. The mean and standard deviation of the displayed distribution are shown to the right of each panel, together with the total number of frames analysed.


Figure 17: Distribution of Fx estimates. The four panels show data for male and female speakers from the MLPa and MLPb algorithms. The data are for all frames where the Lx reference was voiced, and include frames that the algorithms classified as unvoiced, for which the ratio is zero. Each bar represents a range of 6.25%. The algorithm is indicated on the abscissa, and the speaker sex above each panel. The mean and standard deviation of the displayed distribution are shown to the right of each panel, together with the total number of frames analysed.

The histograms in Figure 16 and Figure 17 reflect both the spread of estimated Fx about its true value, and the proportion of voiced frames classified as unvoiced. In precision of Fx estimation, the Cepstral method is clearly superior. The precision of Fx estimates from MLPa is reasonable, with some 70% of frames estimated with an error of 6.25% or less for male speech, and around 50% with this degree of error for female speech. The broader Fx distribution from MLPa with female speech reflects the inherent limit of 0.5 ms to the precision of its Tx estimates. Female Fx for these speakers is typically between 200 and 300 Hz, but extends well beyond 400 Hz on occasion. A Tx precision of 0.5 ms translates to an Fx precision of 10% at 200 Hz and 20% at 400 Hz.
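The arithmetic behind these figures follows from $F_x = 1/T_x$: for a fixed timing quantisation $\Delta T_x$, the relative frequency error is approximately $F_x \, \Delta T_x$, giving $200\ \mathrm{Hz} \times 0.5\ \mathrm{ms} = 10\%$ and $400\ \mathrm{Hz} \times 0.5\ \mathrm{ms} = 20\%$.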

For male speech, MLPa is clearly superior to MLPb in regard to this measure.

4.5 Conclusions of Evaluation 2.
The MLPa algorithm is, on the evidence of the correlation of the estimated voicing pattern with the Lx reference and the accuracy of Fx estimation, superior to MLPb. This is most evident for male speech. In the accuracy of voicing detection, MLPa approaches the performance of the Cepstral reference, and both MLPs are superior to the SIFT method.

4.6 Comparisons between Evaluations 1 and 2
Comparable measures are only available for voicing classification, since fundamental frequency estimation errors were processed in different ways in the two evaluations. It must be noted that the speech materials were also different between the two studies. The d' measure of voicing classification from MLPa is rather similar for speech in quiet in both evaluations; the overall d' in evaluation 1 was around 2.5, while for the real-time algorithm, d' was in the range 2.0 to 2.5. For the Cepstral method, d' for voicing classification was higher for the recordings of evaluation 2 (2.5 to 3.0) than for the recordings used in evaluation 1, where in quiet, d' was 1.8. This may reflect that the recordings used in evaluation 1 were inherently less quiet than those used in evaluation 2. It seems, therefore, that the different ranking of the Cepstral method and MLPa for voicing classification is likely to be due to differences in the recorded materials rather than to a significant difference in performance between the prototype off-line and real-time versions of MLPa.

5. Overall conclusions
Both these evaluations, covering a UNIX and a real-time TMS320C50 implementation of the OSCAR MLP fundamental frequency extractor, indicate that its performance is likely to be appropriate for the SiVo-3 application. Voicing detection by the MLPa algorithm is similar to that of the Cepstral analysis, and much superior to that of the SIFT algorithm. The results indicate that MLPa is, as intended, able to maintain much of its performance in levels of noise that are problematic for profoundly hearing impaired listeners. Evaluation 2 makes clear that MLPa is superior to the preceding MLPb on all criteria. Only the MLP algorithm is able to provide cycle-by-cycle Fx estimates, which gives it a further advantage over the other methods in this hearing aid application.

5.1 Significance of Fx and voicing errors
The significance of the errors of Fx estimation and voicing classification of these algorithms for a device that supplements lipreading can be assessed by reference to user-based assessments of the analysis method in the OSCAR field trials (van Son et al., 1997). Here, tests in profoundly hearing impaired users of SiVo-3 compared an ideal extraction of the voicing pattern and Fx, based on a recorded laryngograph signal, to the use of MLPa to extract this information from speech in quiet. Speech perception using the analytic processing of MLPa did not differ significantly from the condition in which the laryngograph signal was used to control the voicing and fundamental frequency information from the SiVo-3 aid. This suggests that profoundly impaired listeners are not highly sensitive to such errors. It has been suggested that high accuracy may be necessary in Fx-based lipreading aids for optimal performance (Hnath-Chisolm and Boothroyd, 1992), but those results were based on normally-hearing listeners, who appear to be much more sensitive to such errors than are profoundly-impaired listeners.

5.2 User results for speech in noise
A second outcome of the OSCAR user trials has been the finding that speech perception using the analytic processing of MLPa is little affected by the presence of noise at signal-to-noise ratios down to 5 dB. This is consistent with the objective finding here that the voicing classification and Fx extraction performance of MLPa is little affected by noise.

Acknowledgements
Supported by the TIDE project OSCAR (TP1217) with funding from CEC DGXII (C5). We are grateful to John Walliker and Mark Huckvale for assistance in the development and training of the MLPs, and to both John Walliker and David Howells for their contribution to the real-time code of the SiVo-2 and SiVo-3. We are also grateful to Roel Smits for constructive comments on an earlier draft MS.

References
Ball, V., Faulkner, A. and Fourcin, A. J. (1990). "The effects of two different speech coding strategies on voice fundamental frequency control in deafened adults," Br. J. Audiol., 24, 393-409.

Barnard, E., Cole, R. A., Vea, M. P. and Alleva, F. A. (1991). "Pitch detection with a neural-net classifier," IEEE Trans. Acoustics, Speech, Signal Processing, 39(2), 298-307.

Chan, L. W. and Fallside, F. (1987). "An adaptive training algorithm for back propagation networks," Computer Speech and Language, 205-218.

Chester, D. L. (1990). "Why two hidden layers are better than one," Proceedings of the International Joint Conference on Neural Networks, Vol. 1, 265-268, Erlbaum.

Denzler, J., Kompe, R., Kiebling, A., Niemann, H. and Noth, E. (1993). "Going back to the source: inverse filtering of the speech signal with ANNs," Proceedings of Eurospeech, 111-114, Berlin.

Fourcin, A. J. and Abberton, E. (1971). "First applications of a new laryngograph," Medical and Biological Illustration, 21, 172-182.

Hess, W. (1983). Pitch Determination of Speech Signals. Springer-Verlag, Berlin.

Hnath-Chisolm, T. and Boothroyd, A. (1992). "Speechreading enhancement by voice fundamental frequency: The effects of F0 contour distortions," J. Sp. Hear. Res., 35, 1160-1168.

Howard, D. M. (1989). "Peak-picking fundamental frequency estimation for hearing prostheses," J. Acoust. Soc. Am., 86, 902-910.

Howard, I. S. and Huckvale, M. A. (1988). "Speech fundamental period extraction using a trainable pattern classifier," Proc. Speech '88: 7th FASE Symposium (Inst. Acoust., Edinburgh), 129-136.

ILS (1986). Interactive Laboratory System. Signal Technology Inc.

Makhoul, J., El-Jaroudi, A. and Schwartz, R. (1989). "Formation of disconnected decision regions with a single hidden layer," Proceedings of the International Joint Conference on Neural Networks, Vol. 1, 455-460.

Markel, J. D. (1972). "The SIFT algorithm for fundamental frequency estimation," IEEE Trans. AU-20, 367-377.

McGrath, M. and Summerfield, Q. (1985). "Intermodal timing relations and audio-visual speech recognition by normal-hearing adults," J. Acoust. Soc. Am., 77, 678-685.

Noll, A. M. (1967). "Cepstrum pitch determination," J. Acoust. Soc. Am., 41, 293-309.

Peterson, W. W., Birdsall, T. G. and Fox, W. C. (1954). "The theory of signal detectability," Inst. Radio Engineers Transactions, PGIT-4, 171-212.

Rumelhart, D. E., Hinton, G. and Williams, R. (1986). "Learning representations by back-propagating errors," Nature, 323, 533-536.

Rumelhart, D., McClelland, J. and the PDP Research Group (1988). Parallel Distributed Processing. The MIT Press, Cambridge, Massachusetts.

Secrest, B. G. and Doddington, G. R. (1983). "An integrated pitch tracking algorithm for speech systems," ICASSP 83, 1352-1355, Boston.

van Son, N., Beijk, C. and Faulkner, A. (1997). "Final report of Workpackage E: Field trials," in preparation. Ref. OSCAR-UCL-1997-WPE.

Walliker, J. R. and Howard, I. S. (1990). "The implementation of a real time speech fundamental period algorithm using multi-layer perceptrons," Speech Communication, 9, 63-71.

© Andrew Faulkner and Jianing Wei

