Department of Phonetics and Linguistics

A TOOL FOR THE PRODUCTION OF COPY SYNTHESES FROM NATURAL TOKENS

Andrew M. SIMPSON

Abstract
An X-Window graphical user interface for the Klatt Cascade-Parallel Formant Synthesiser [1] is described. It includes facilities for initial synthesiser parameter estimation, for editing time-varying parameter values, and for aural, spectral, and spectrographic comparison of target and copy-synthesised stimuli.

Copy Synthesis
High-quality copy synthesis allows the creation of speech-like stimuli which are sufficiently natural to ensure that listeners process them in the speech mode. The technique allows the precise specification of stimuli, which can then be altered to investigate the relative contribution of different acoustic cues in encoding phonetic contrasts. It has been used extensively within our department to develop copy-synthesised stimuli for speech pattern audiometry.

Formant synthesis produces speech-like stimuli by passing a source waveform through a complex filter whose resonances can be adjusted to model those of the vocal tract. The nature of the source can be varied (e.g. periodicity, voice fundamental frequency, open quotient, jitter), as can the characteristics of the filter (e.g. formant frequencies, bandwidths, and amplitudes, and the presence of anti-resonances). This flexibility allows the complex spectro-temporal variation that occurs in natural speech to be closely modelled.
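
The source-filter arrangement described above can be illustrated with a minimal sketch. It uses the two-pole digital resonator difference equation given by Klatt [1] and chains a few resonators in cascade over an impulse-train source. It is not the code used in the tool: the function names, the impulse-train source, and the formant values in the usage lines are illustrative assumptions only.

```python
import math


def resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz):
    """Return (a, b, c) for y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / sample_rate_hz
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c  # normalises the gain at 0 Hz to 1
    return a, b, c


def resonate(samples, freq_hz, bandwidth_hz, sample_rate_hz):
    """Filter samples through one formant resonator."""
    a, b, c = resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz)
    y1 = y2 = 0.0  # the two previous output samples
    out = []
    for x in samples:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0_hz, duration_s, sample_rate_hz):
    """A crude periodic source: one unit impulse per pitch period."""
    period = round(sample_rate_hz / f0_hz)
    n = round(duration_s * sample_rate_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def cascade(samples, formants, sample_rate_hz):
    """Pass samples through resonators in series; formants is [(F, BW), ...]."""
    for freq_hz, bandwidth_hz in formants:
        samples = resonate(samples, freq_hz, bandwidth_hz, sample_rate_hz)
    return samples


# Illustrative values only (not taken from the paper).
source = impulse_train(100.0, 0.05, 10000.0)
vowel_like = cascade(source, [(700.0, 60.0), (1200.0, 90.0), (2600.0, 150.0)], 10000.0)
```

In a full synthesiser the coefficients are recomputed whenever the formant frequency or bandwidth parameters change, which is what makes the parameter trajectories discussed below audible.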

However, the specification of formant synthesiser parameter values is a complex and laborious process. Even with good initial estimates of the voice fundamental frequency contour and the formant trajectories, much work still has to be undertaken to model the variation in formant amplitudes and bandwidths before a synthetic copy will sound natural.

The task of refining synthesiser parameter values would be easier if one could visualise how synthesiser parameters co-vary, edit them easily, and readily assess the effects of any manipulations, both aurally and by using more objective analysis methods.

This tool provides such facilities within an integrated package which includes an implementation of the Klatt Cascade-Parallel Formant Synthesiser. The tool enables users to calculate initial parameter values from the target stimulus, to edit such values easily, to see and hear both the natural target and its synthetic copy, and to perform spectral and spectrographic analysis of both to ensure closeness of match.

Initial Parameter Value Estimation
The tool includes the facility to calculate the voice fundamental frequency contour from the target natural token using cepstral analysis. Additionally, it is possible to specify formant frequency trajectories by tracing each formant's path onto a spectrogram, as shown in Figure 1.
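
As a rough illustration of cepstral fundamental frequency estimation, the sketch below computes the real cepstrum of a single windowed frame and picks the strongest peak within a plausible range of pitch periods. It is a generic textbook version, not the SFS routine used by the tool. The function names, the Hamming window, the naive DFT, and the 70-400 Hz search range are all assumptions, and no voicing decision is made.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (adequate for one short frame)."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    twiddle = [cmath.exp(sign * 2j * math.pi * k / n) for k in range(n)]
    out = []
    for k in range(n):
        acc = sum(v * twiddle[(k * i) % n] for i, v in enumerate(values))
        out.append(acc / n if inverse else acc)
    return out


def real_cepstrum(frame):
    """Inverse DFT of the log magnitude spectrum of a Hamming-windowed frame."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    # A small floor avoids log(0) in empty spectral bins.
    log_magnitude = [math.log(abs(x) + 1e-12) for x in dft(windowed)]
    return [c.real for c in dft(log_magnitude, inverse=True)]


def cepstral_f0(frame, sample_rate_hz, f0_min_hz=70.0, f0_max_hz=400.0):
    """Estimate F0 from the largest cepstral peak in the allowed period range."""
    cepstrum = real_cepstrum(frame)
    q_lo = int(sample_rate_hz / f0_max_hz)  # shortest period, in samples
    q_hi = min(int(sample_rate_hz / f0_min_hz), len(frame) // 2)
    peak = max(range(q_lo, q_hi + 1), key=lambda q: cepstrum[q])
    return sample_rate_hz / peak
```

Applying `cepstral_f0` to successive overlapping frames of the target token would yield a contour of the kind described above, which could then be refined by hand in the parameter editor.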

Parameter Editing
Time-varying parameters are edited using a 'canvas' onto which each parameter's trajectory can be drawn using a mouse-controlled cursor. Any number of parameters can be displayed simultaneously, allowing users to co-ordinate the values of parameters which vary time-synchronously, for example the formant amplitudes at the onset of voicing after plosive release, or at a plosive's release burst. Each parameter is displayed in a different colour so that it can be distinguished from the others. Parameters can be cycled through three states (edited, displayed only, and not displayed), allowing groups of parameters to be displayed and edited together. Figure 2 illustrates this.

Parameter values can be specified with great accuracy because the precise value at the cursor position is displayed. The parameter canvas is time-aligned with both the target and copy-synthesised waveforms, and vertical cursors marking the point in time being considered are displayed in all windows. This allows the two waveforms to be aligned and the effect of parameter changes on the synthetic waveform to be monitored. A parameter's trajectory can either be specified by drawing a line between start and end points, or be drawn free-hand. To facilitate inspection and editing of parameter values over very brief regions of the stimuli, a zoom facility displays a region in greater detail, so that very brief acoustic cues such as release bursts can be inspected and specified accurately. Figure 3 shows how the zoom facility has been used to display the burst and first few cycles after release of an intervocalic voiced plosive.
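
Specifying a trajectory by a line between start and end points amounts to linear interpolation over the frames in between. The sketch below shows one way this might be done on a per-frame list of values. The function name and the list-of-frames representation are assumptions made for illustration, not a description of the tool's internals.

```python
def draw_line(track, start_frame, start_value, end_frame, end_value):
    """Overwrite track[start_frame..end_frame] with a straight-line trajectory.

    track holds one value per synthesiser frame; the end points may be given
    in either order, as they might be when dragged with a mouse.
    """
    if start_frame > end_frame:
        start_frame, end_frame = end_frame, start_frame
        start_value, end_value = end_value, start_value
    if start_frame < 0 or end_frame >= len(track):
        raise IndexError("line end points lie outside the stimulus")
    span = end_frame - start_frame
    for frame in range(start_frame, end_frame + 1):
        # A zero-length line simply sets the single frame to the end value.
        fraction = (frame - start_frame) / span if span else 1.0
        track[frame] = start_value + (end_value - start_value) * fraction
    return track


# Example: a first-formant transition rising from 100 to 500 over frames 2 to 6.
f1_track = draw_line([0.0] * 10, 2, 100.0, 6, 500.0)
```

Free-hand drawing can be treated as a sequence of such short line segments between successive cursor positions, so that no frame is left unset when the cursor moves quickly.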

Comparisons
Aural comparison of complete or partial target and copy-synthesised waveforms is possible simply by marking the extent of the desired region using mouse-controlled cursors and then clicking on the appropriate waveform. It is also possible to calculate the amplitude spectrum of corresponding regions of both waveforms, thus providing more detailed information about their degree of similarity. This is illustrated in Figure 3, where the release burst spectra of a natural and synthetic intervocalic voiced velar plosive are compared. Although the strong burst at around 2 kHz has been modelled well, there are discrepancies around the secondary peak at 4 kHz and in the region below about 200 Hz. Such comparisons are useful for comparing short or relatively unchanging regions of the signal, or for making gross spectral comparisons.
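
The kind of spectral comparison described above can be sketched as follows: compute the amplitude spectrum of the same region of each waveform and inspect the difference in dB at each frequency. This is a generic illustration rather than the SFS analysis the tool uses. The Hann window, the naive DFT, and all function names are assumptions.

```python
import cmath
import math


def amplitude_spectrum_db(segment, sample_rate_hz):
    """Return [(frequency_hz, level_db), ...] for bins from 0 Hz to Nyquist."""
    n = len(segment)
    windowed = [s * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, s in enumerate(segment)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        magnitude = max(abs(acc) / n, 1e-12)  # floor avoids log10(0)
        spectrum.append((k * sample_rate_hz / n, 20.0 * math.log10(magnitude)))
    return spectrum


def spectral_difference(target, copy, sample_rate_hz):
    """Per-bin level of the copy minus the target, in dB."""
    if len(target) != len(copy):
        raise ValueError("regions must contain the same number of samples")
    target_db = amplitude_spectrum_db(target, sample_rate_hz)
    copy_db = amplitude_spectrum_db(copy, sample_rate_hz)
    return [(freq, c - t) for (freq, t), (_, c) in zip(target_db, copy_db)]


def largest_mismatches(differences, count=3):
    """The frequencies at which the copy deviates most from the target."""
    return sorted(differences, key=lambda item: abs(item[1]), reverse=True)[:count]
```

A negative difference at a given frequency indicates that the copy has too little energy there, which would suggest raising the relevant formant amplitude or adjusting its bandwidth.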

To assess how closely the complex spectro-temporal variation seen in speech has been modelled, a spectrogram calculation facility is provided which can display either natural or synthetic waveforms, or both simultaneously, aligned in time to facilitate comparison. (See Figure 4.)

The tool supports the Microsoft RIFF audio format (.wav) and the SFS [3] format (.sfs). The synthesiser parameters are stored as an ASCII text file in which each parameter's value for each successive frame of the stimulus is stored as a number.
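
Because the parameters are held in plain ASCII, they can be read and written with very little code. The sketch below assumes a layout that is not specified here: one frame per line, with whitespace-separated numbers in a fixed parameter order. Both the layout and the parameter names used are hypothetical, so the tool's actual files should be checked before relying on it.

```python
def parse_parameter_frames(text, names):
    """Parse frames from text; ASSUMES one frame per line, fixed column order."""
    frames = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != len(names):
            raise ValueError(
                f"line {line_number}: expected {len(names)} values, got {len(fields)}")
        frames.append({name: float(field) for name, field in zip(names, fields)})
    return frames


def parameter_track(frames, name):
    """Extract one parameter's trajectory across all frames."""
    return [frame[name] for frame in frames]


def format_parameter_frames(frames, names):
    """Write frames back out in the same assumed one-line-per-frame layout."""
    lines = [" ".join(format(frame[name], "g") for name in names) for frame in frames]
    return "\n".join(lines) + "\n"


# Hypothetical column names and values, for illustration only.
example_names = ["F0", "F1", "B1"]
example_frames = parse_parameter_frames("120 700 60\n118 710 60\n", example_names)
```

Keeping the file human-readable in this way means that parameter values can also be inspected or adjusted with an ordinary text editor or script.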

Portability
The tool has been written in C using SUIT [2] and requires a UNIX platform with X Windows. Versions currently exist for Sun SPARC and Linux/XFree86 platforms. Contact the author for more details and for information about how to obtain the tool.

Acknowledgements
The formant synthesiser is an implementation of the Klatt Cascade-Parallel Formant Speech Synthesiser by Jon Iles (j.p.iles@cs.bham.ac.uk) and Nick Ing-Simmons (nicki@lobby.ti.com). The spectral, spectrographic, and pitch-extraction routines are from SFS [3].

References
[1] Klatt, D.H. (1980), "Software for a cascade/parallel formant synthesizer", Journal of the Acoustical Society of America, vol. 67(3), pp. 971-995.

[2] SUIT, The Simple User Interface Toolkit, University of Virginia (suit@uvacs.cs.Virginia.EDU).

[3] The Speech Filing System, Dept of Phonetics and Linguistics, University College London (sfs@phon.ucl.ac.uk).

Figures

Figure 1 - Specify formant trajectories by overlaying them on a spectrogram.

Figure 2 - Simultaneously display and edit covarying parameters.

Figure 3 - Compare spectra of corresponding regions of the stimuli.

Figure 4 - Compare original and synthetic stimuli using a spectrogram.

© 1996 Andrew Simpson


Back to SHL 9 Contents
