Research in Speech Synthesis

Activities
Staff members
Research projects
Publications
Speech data
Software tools
Facilities

If you are interested in doing graduate research in speech synthesis in the department, please contact Mark Huckvale.

Activities

Recent research work in speech synthesis has been concentrated on the modelling of prosody and the development of synthesis software tools.

In our prosody work we have been concerned with how to represent the prosodic structure of text in a hierarchical phonological representation. This has involved studying prosodic phrasing (breaking text into intonational phrases) and the categorisation and assignment of pitch accents within the phrase. Our work on the phonetic interpretation of these structures involves modelling of fundamental frequency contours and predicting the durations of syllabic constituents as a function of the segmental content and the phrase context.

In our tools work we have made extensive use of XML to mark up linguistic representations, particularly hierarchical phonological representations and their phonetic interpretation. The ProSynth tools convert text to a hierarchical phonological form expressed in XML. Scripts interpret this form by fleshing out durations, fundamental frequency and segmental quality in context. The scripting language ProXML is designed to make such knowledge easy to state declaratively.
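As a rough illustration of this approach, the sketch below holds a small hierarchical phonological structure in XML and "interprets" it by attaching a duration to each syllable according to its segmental content, its stress and its position in the phrase. It is written in Python rather than ProXML. The element and attribute names (utt, ip, foot, syl, segs, stress, dur) and the rule constants are assumptions made for this example only; they do not reproduce the ProSynth document type or its duration model.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for "big cities": an utterance holds intonational
# phrases (ip), which hold feet, which hold syllables. The names and the
# transcription are illustrative, not the actual ProSynth schema.
UTTERANCE = """<utt>
  <ip>
    <foot><syl stress="strong" segs="b I g"/></foot>
    <foot>
      <syl stress="strong" segs="s I"/>
      <syl stress="weak" segs="t I z"/>
    </foot>
  </ip>
</utt>"""

# Illustrative rule constants (assumed values, not measured ones).
BASE_MS = 80.0             # nominal duration per segment
STRONG_FACTOR = 1.2        # lengthening for stressed syllables
PHRASE_FINAL_FACTOR = 1.5  # lengthening for the last syllable of a phrase


def assign_durations(root):
    """Write a 'dur' attribute (in ms) on every syllable, in place."""
    for ip in root.iter("ip"):
        syllables = list(ip.iter("syl"))
        for index, syl in enumerate(syllables):
            # Segmental content: one nominal duration per segment.
            duration = BASE_MS * len(syl.get("segs").split())
            # Phrase context: stress and phrase-final position.
            if syl.get("stress") == "strong":
                duration *= STRONG_FACTOR
            if index == len(syllables) - 1:
                duration *= PHRASE_FINAL_FACTOR
            syl.set("dur", f"{duration:.0f}")
    return root


root = assign_durations(ET.fromstring(UTTERANCE))
print([syl.get("dur") for syl in root.iter("syl")])
# prints ['288', '192', '360']
```

Because the result is still an XML tree, later stages (for example a fundamental frequency rule) could read the same structure and add further attributes in the same way.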

Future work will look at how intonation is used to express the information structure of sentences. This will be of use in concept-to-speech systems, in which the text to be spoken is itself generated by the computer system.

Staff

The Department of Phonetics and Linguistics at UCL has been involved in speech synthesis for over 30 years. Currently we have three staff members who take an active interest in speech synthesis:

Mark Huckvale (Senior Lecturer in Speech Sciences)
Is an engineer and signal processing expert who has worked on PSOLA time-domain synthesis, as well as in the implementation of formant synthesizers. He has built a poly-lectal (multi-accent) synthesis system for English. He has also been involved in the construction of a database of read speech for the study of English prosody, with the aim of constructing statistical models of rhythm and intonation linked to a richer syntactical description of utterances. His current work on the ProSynth project (see below) relates to the tools and database infrastructure.

Jill House (Senior Lecturer in Phonetics)
Is a phonetician who has worked on models of intonation for speech synthesis systems. As consultant (1989-94) to the Swedish synthesis company Infovox she designed and implemented a revised prosodic model for their British English text-to-speech system, and participated in similar work for French. She also worked on the ESPRIT dialogue project SunDial, with the aim of appropriately adapting the intonation of system-generated dialogue responses to the information state of the dialogue manager. Ongoing work includes the development of an integrated phonological/phonetic model in the ProSynth project (see below) with Sarah Hawkins (University of Cambridge) and John Local (University of York).

Valerie Hazan (Reader in Speech Sciences)
Is interested in human perception of fine-detail acoustic cues in natural speech signals and in individual differences in the perception of synthetic speech. This work has led to a number of important applications for synthetic speech: in an audiometer system for assessing human speech processing performance; and in a project where cues are artificially enhanced to improve intelligibility. This last project has applications in foreign language learning and in prostheses for the hearing impaired as well as in synthetic speech.

Other members of the Department have worked in the domain in the past:

John Maidment and Michael Ashby are phoneticians who have built components of speech synthesis systems.

John Wells is the author of the Longman Pronunciation Dictionary, which is available in machine-readable form.

The Department has links with the Survey of English Usage within the English Department at UCL, where work is undertaken on corpus analysis and parsing.

Research projects

Recent funded research projects on synthesis (in reverse chronological order):

1997-2001 COST258 Naturalness of synthetic speech
Mark Huckvale
Funded by European Union Co-operation in Science and Technology Directorate

1997-2000 & 2000-2001 Integrated Prosodic Approach to Speech Synthesis
Sarah Hawkins (Cambridge), Jill House (UCL), Mark Huckvale (UCL), John Local (York), Richard Ogden (York)
Funded by EPSRC.

1993-96 Speech Pattern Audiometry and Training
Valerie Hazan, Ginny Wilson, Adrian Fourcin, Laryngograph Ltd.
Funded by DTI SMART award.

1993-96 Cue-enhancement in natural and synthetic speech
Valerie Hazan and Adrian Fourcin
Funded by SERC

1989-92 Individual strategies in synthetic speech perception
Valerie Hazan and Adrian Fourcin
Funded by SERC

1989-92 Modelling dynamics of vocal fold vibration for synthesis
David Howard, Jill House, Sarah Palmer
Funded by SERC

1985-89 Pitch accent model of intonation
Jill House and Mike Johnson
Funded by the Speech Research Unit, DRA Malvern

Recent student projects in speech synthesis have included:

2002 Intelligibility of a spelling-regular English accent.
Margaret Shaw and Mark Huckvale
This study is looking at whether mispronunciations that are logically connected to the spelling are less disruptive to intelligibility than random mispronunciations. See the Regular English pronunciation project web site.

2002 Development of a scale of acceptable rhythm.
Robert Gray and Mark Huckvale
This study is looking at whether it is possible to create a perceptual scale of rhythmic quality, and whether there are objective properties of timing that correlate with listener preferences.

2000 Performance of the ITU test for speech output system assessment.
Yolanda Vazquez-Alvarez and Mark Huckvale
This study looked at the ITU P.85 assessment standard and checked its reliability and sensitivity. Our results were published at ICSLP 2002.

Publications

Recent published papers in the area (in reverse chronological order):

Huckvale, M., (2002) "Speech Synthesis, Speech Simulation and Speech Science", Proc. International Conference on Spoken Language Processing, Denver, 2002, pp1261-1264.
Vazquez-Alvarez, Y., Huckvale, M., (2002) "The Reliability of the ITU-P.85 Standard for the Evaluation of Text-to-Speech Systems", Proc. International Conference on Spoken Language Processing, Denver, 2002, pp329-332.
Huckvale, M. (2002) "The Use and Potential of Extensible Mark-Up (XML) in Speech Generation", in Keller et al, Improvements in Synthetic Speech, Wiley, 2002.
Chung, H., Huckvale, M., (2001) "Linguistic factors affecting timing in Korean with application to speech synthesis", in Proc. EuroSpeech 2001, Aalborg, Denmark, Vol 2, pp815-818.
Ogden, R., Hawkins, S., House, J., Huckvale, M., Local, J., Carter, P., Dankovicova, J., Heid, S. (2000). ProSynth: an integrated prosodic approach to device-independent natural-sounding speech synthesis. Computer Speech and Language, 14, 177-210.
Hawkins, S., Heid, S., House, J., Huckvale, M. (2000), "Assessment of Naturalness in the ProSynth Speech Synthesis Project", IEE Workshop on Speech Synthesis, London, May.
Huckvale, M. (1999) Representation and processing of linguistic structures for an all-prosodic synthesis system using XML, Proc. EuroSpeech 99, Hungary.
House, J., Dankovicova, J., Huckvale, M., (1999), "Intonation modelling in ProSynth: an integrated prosodic approach to speech synthesis", Int. Congr. Phonetic Sciences.
Hawkins, S., House, J., Huckvale, M., Local, J., Ogden, R., (1998), "ProSynth: an integrated prosodic approach to device-independent natural-sounding speech synthesis", Proc. Int. Conf. Spoken Language Processing, Sydney.
Huckvale, M., and Fang, A., (1996) PROSICE: a spoken language database for prosody research, in International Corpus of English, ed Sydney Greenbaum, OUP.
House, J. and Hawkins, S. (1996) An integrated phonological-phonetic model for text-to-speech synthesis, Proc. XIIIth ICPhS, 2, 326-329.
Benoit, C., Grice, M. and Hazan, V. (1996) The SUS test: a method for the assessment of text-to-speech synthesis intelligibility using Semantically Unpredictable Sentences. Speech Communication, 18 (4).
Simpson, A. and Hazan, V. (1995) Enhancing the perceptual salience of information-rich regions of natural intervocalic consonants. Proceedings of Eurospeech 95, Madrid.
Hazan, V. and Shi, B. (1993) Individual variability in the perception of synthetic speech. Proceedings of Eurospeech 1993, Berlin.
Palmer, S.K. and House, J. (1992) Dynamic voice source changes in natural and synthetic speech, Proc. ICSLP 92, Banff, Alberta, 129-132.
House, J. and Youd, N. (1992) Evaluating the prosody of synthesized utterances within a dialogue system, Proc. ICSLP 92, Banff, Alberta, 1175-1178.
Hazan, V. and Shi, B. (1992) Listener variability in the perception of natural and synthetic speech. Speech, Hearing and Language Work in Progress UCL, 6, 75-88.
House, J. and Youd, N. (1991) Synthesising intonation in a dialogue context, Speech, Hearing and Language: Work in Progress 5, UCL, 75-90.
House, J. and Johnson, M. (1987) Enlivening the intonation in text-to-speech synthesis: an "accent-unit" model, Proc. XIth ICPhS, Tallinn, Estonia, 134-137 (SRU project)

Speech data

The following data sets are available to research workers in speech synthesis at a nominal charge:

PROSICE - Extended read texts for study of English prosody
Currently four read texts, each 15 minutes in duration (about 2000 words each), with word-level annotations.

PROSYNTH - Read sentences for study of English intonation
Over 700 short read sentences exploring systematic variation in stress pattern and segmental context for simple declarative intonation.

If you are interested in these, please contact Mark Huckvale.

Software tools

The following software tools are available to research workers in speech synthesis free of charge:

Speech Filing System
Our general-purpose speech analysis toolkit. Contains tools for formant synthesis, diphone synthesis and semi-automatic annotation. Refer to the Speech Filing System Home Page.

ProSynth Synthesis Shell for Windows

This is a Windows application that can be used to build and evaluate speech synthesis systems. You can download it from the ProSynth project pages. It uses XML to encode all intermediate representations between text and sound and comes complete with a basic set of scripts for synthesis of simple sentences (requires the MBROLA diphone synthesis subsystem).

Facilities

The Department has very good facilities for conducting research in speech synthesis, including:

an anechoic chamber for recording very high quality speech signals.
facilities for multi-channel recording, including laryngographic analysis.
networked Sun SPARC workstations and PCs running in-house (SFS) and commercial signal processing software (ESPS).
software for PSOLA and formant synthesis.
the Bonn Open-source Synthesis System (BOSS).
software for conducting perceptual experiments.
a number of quietened rooms for subjective tests by listeners.