Overview
In my research I take the view that speech technology is a tool to help us
better understand how humans use spoken language to communicate, rather than
an end in itself. My work on speech synthesis has focussed on issues such as
how to exploit recent phonological models of English in synthesis, how to use
synthetic speech to test the relative intelligibility of accents, and how to
teach an articulatory synthesizer to imitate speech. What is important is what
we can learn about how humans plan, perceive and acquire speech through the
manipulation and generation of speech signals using a computer.
You can read more about my analysis of the current state of speech synthesis research in
Huckvale (2002).
On this page I summarise some of my previous and ongoing research work in
speech synthesis. You will find information about projects, papers and software. Elsewhere on the web site you can read about other research work in speech synthesis in the department.
Current Projects
Teaching an articulatory synthesizer to imitate speech
This work is being undertaken with Ian Howard, of the Sobell Department of the Institute of Neurology at UCL.
The idea behind the work is that the control of an articulatory synthesizer
is too difficult to be programmed from information gained from the study of human
speech production. In other words, we should not expect to be able to write
a computer program that converts phonological representations to articulatory
control parameters through the use of rules obtained from studying how humans do it.
If you think about it, such a task is even more difficult than that facing an infant learner. The fact that infants can imitate speech without knowing any articulatory rules
tells us that the task should be soluble using only (i) an articulatory synthesizer, (ii) auditory analysis, and (iii) general learning principles.
So far in this work we have been investigating neural network models for the mapping between
auditory representations and articulatory representations. These models are not trained using
privileged information about articulator use but by exploring the space of output possibilities
available to the synthesizer through babbling. You can read about some of the early work in
the paper by Howard & Huckvale (2004) and you can hear some babbling and some imitated phrases on
Ian's web site.
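As a rough illustration of the approach, the sketch below trains a generic network on babbled pairs and then uses it to imitate a heard target. The "synthesizer" here is just a stand-in nonlinear function and the network an off-the-shelf multi-layer perceptron, so this is illustrative only and not the model reported in Howard & Huckvale (2004).

```python
# Minimal sketch of learning an auditory-to-articulatory (inverse) mapping
# by babbling. The "synthesizer" is a stand-in nonlinear function, not a
# real articulatory synthesizer, and the network is a generic MLP.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

def synthesize(artic):
    """Toy forward model: articulatory parameters -> auditory features."""
    # A fixed nonlinearity standing in for vocal-tract acoustics.
    W = np.sin(np.arange(artic.shape[-1] * 8)).reshape(artic.shape[-1], 8)
    return np.tanh(artic @ W)

# 1. Babble: sample random articulatory gestures and listen to the result.
artic_babble = rng.uniform(-1.0, 1.0, size=(5000, 6))   # 6 control parameters
audio_babble = synthesize(artic_babble)                  # 8 auditory features

# 2. Learn the inverse mapping (auditory -> articulatory) from babbled pairs.
inverse_model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000)
inverse_model.fit(audio_babble, artic_babble)

# 3. Imitate: analyse a heard target, predict articulation, re-synthesize.
target_artic = rng.uniform(-1.0, 1.0, size=(1, 6))
target_audio = synthesize(target_artic)
imitated_artic = inverse_model.predict(target_audio)
imitated_audio = synthesize(imitated_artic)

print("auditory match error:", np.abs(imitated_audio - target_audio).mean())
```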

Wideband spectrograms for sung input utterance "I'm half crazy, all for the love of you", taken from the song Daisy, for a male speaker (A), and re-synthesised outputs generated by the direct (B) and distal retrained (C) imitation system.
More information coming soon ...
Previous Projects
ProSynth
This work was done in collaboration with Jill House and others.
ProSynth was a joint project with the University of Cambridge and the University of York funded by the EPSRC. Its focus was on the exploitation of non-linear hierarchical prosodic phonological structures in speech synthesis. The UCL part of the project concerned the intonation module and the underlying computational infrastructure. See Hawkins et al (2000) for an overview.
In our intonation work we were concerned with how to represent the pitch accents used in reading a text within a hierarchical phonological representation. This involved studying prosodic phrasing (breaking text into intonational phrases) and the categorisation and assignment of pitch accents within the phrase. Our work on the phonetic interpretation of these structures involved modelling fundamental frequency contours and predicting the durations of syllabic constituents as a function of the segmental content and the phrase context.
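Purely as an illustration of target-based F0 modelling, the toy sketch below places pitch targets on the accented syllables of an invented phrase and interpolates between them. It is not the ProSynth intonation model, and the accent labels, timings and target values are made up.

```python
# Illustrative only: generate a fundamental frequency contour by placing
# high/low targets on accented syllables and interpolating between them.
# This is a toy stand-in, not the ProSynth intonation module.
import numpy as np

# (syllable, start_time_s, accent) -- hypothetical phrase annotation
syllables = [("the", 0.00, None), ("CAT", 0.15, "H*"),
             ("sat", 0.40, None), ("on", 0.55, None),
             ("the", 0.65, None), ("MAT", 0.75, "H+L*")]

# Map pitch accents to simple F0 targets in Hz (illustrative values).
targets = [(0.0, 120.0)]                      # phrase-initial reference
for syl, t, accent in syllables:
    if accent == "H*":
        targets.append((t + 0.05, 180.0))     # high target on the accent
    elif accent == "H+L*":
        targets.append((t, 170.0))            # leading high
        targets.append((t + 0.10, 110.0))     # low on the starred syllable
targets.append((1.0, 100.0))                  # phrase-final fall

times, hz = zip(*targets)
frame_times = np.arange(0.0, 1.0, 0.01)       # 10 ms frames
f0 = np.interp(frame_times, times, hz)        # piecewise-linear contour
print(np.round(f0[::10], 1))
```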

The ProSynth Windows application runs the all-prosodic synthesis
system on any PC
In our computational infrastructure work we made extensive use of XML to mark up linguistic representations: particularly hierarchical phonological representations and their phonetic interpretation. The ProSynth tools converted text to a hierarchical phonological form expressed in XML. Scripts then interpreted this form by fleshing out durations, fundamental frequency and segmental quality in context. The scripting language ProXML was designed to make such knowledge easy to state in a declarative form. See Huckvale (1999).
The use of XML for mark-up within ProSynth predated much of the later work on the design of speech
synthesis mark-up languages for text. A commentary on these languages is available in Huckvale (2001).
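To give a flavour of what such mark-up might look like, here is a schematic example built with Python's standard library. The element and attribute names are invented for illustration and do not reproduce the actual ProSynth document type or the ProXML scripting language.

```python
# Schematic illustration of a hierarchical phonological structure in XML.
# Element and attribute names are invented for this example; they are not
# the actual ProSynth document type or ProXML scripting interface.
import xml.etree.ElementTree as ET

doc = """
<utterance>
  <phrase type="intonational" accent="H*">
    <word text="hello">
      <syllable stress="0">
        <onset><phone name="h"/></onset>
        <rhyme><nucleus><phone name="@"/></nucleus></rhyme>
      </syllable>
      <syllable stress="1">
        <onset><phone name="l"/></onset>
        <rhyme><nucleus><phone name="@U"/></nucleus></rhyme>
      </syllable>
    </word>
  </phrase>
</utterance>
"""

root = ET.fromstring(doc)

# A script can then "flesh out" phonetic detail in context, e.g. assign a
# default duration to every phone, lengthened phrase-finally.
for phrase in root.iter("phrase"):
    phones = list(phrase.iter("phone"))
    for i, phone in enumerate(phones):
        dur = 80 if i < len(phones) - 1 else 120   # ms, illustrative values
        phone.set("dur", str(dur))

print(ET.tostring(root, encoding="unicode"))
```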
Quality evaluation
This work was done in collaboration with Yolanda Vazquez-Alvarez.
An evaluation of the reliability of the ITU-T P.85
recommended standard for the evaluation of voice output
systems was conducted using six English TTS systems. The
P.85 standard is based on mean-opinion-score judgements of a
listening panel on a number of rating scales. The study looked
at how the ranking of the six systems on the scales varied
across four different text genres and across two listening
sessions. Rankings were also compared with a much simpler
pair-comparison test across genres and listening sessions. For
the ITU test a large degree of correlation was found across
scales, implying that these were not really testing different
aspects of the systems. Results were surprisingly similar
across sessions, implying that listeners were indeed making
real judgements. In comparison, the pair-comparison test gave
(almost) identical rankings of the systems with far less variability,
making statistically significant comparisons between systems
possible, even across genres.
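The sketch below illustrates the two kinds of analysis involved, using fabricated ratings rather than the study's data: mean-opinion-score rankings of hypothetical systems on each rating scale, and the rank correlation between scales.

```python
# Toy sketch of the analyses described above: compute mean opinion scores
# per system on each rating scale, rank the systems, and measure how well
# the scales agree (Spearman rank correlation). All ratings are fabricated.
import numpy as np
from scipy.stats import spearmanr

systems = ["TTS-A", "TTS-B", "TTS-C", "TTS-D", "TTS-E", "TTS-F"]
scales = ["overall quality", "listening effort", "pronunciation"]

rng = np.random.default_rng(1)
# ratings[listener, system, scale] on a 1-5 opinion scale (invented data)
ratings = rng.integers(1, 6, size=(20, len(systems), len(scales)))

mos = ratings.mean(axis=0)            # mean opinion score per system and scale

for j, scale in enumerate(scales):
    order = ", ".join(systems[i] for i in (-mos[:, j]).argsort())
    print(f"{scale:18s} ranking: {order}")

# High correlation between scales suggests they measure much the same thing.
rho, _ = spearmanr(mos[:, 0], mos[:, 1])
print("Spearman rho between first two scales:", round(rho, 2))
```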


Soundjudge is a program to run the ITU-T P.85 assessment
method for speech evaluation
Korean timing model
This work was done in collaboration with Hyunsong Chung.
The work studied the phonetic and
phonological factors affecting the rhythm and timing of spoken
Korean. Stepwise construction of a CART model was used to
uncover the contribution and relative importance of phrasal,
syllabic, and segmental contexts. The model was trained on a
corpus of 671 read sentences, yielding 42,000 segments each
annotated with 69 linguistic features. On reserved test data,
the best model showed a correlation coefficient of 0.73 with an
RMS prediction error of 26 ms. Analysis of the classification
tree during and after construction showed that phrasal structure
had the greatest influence on segmental duration. Strong
lengthening effects were shown for the first and last syllable in
the accentual phrase. Syllable structure and the manner
features of surrounding segments had smaller effects on
segmental duration. See Chung & Huckvale (2001) for details.
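As a schematic illustration of this kind of CART duration modelling, the sketch below fits a regression tree to fabricated data with a handful of invented context features; it is not the model of Chung & Huckvale (2001).

```python
# Schematic CART duration model: predict segment duration (ms) from a few
# linguistic context features. Data and feature names are fabricated; the
# actual study used 69 features over ~42,000 Korean segments.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n = 5000
X = np.column_stack([
    rng.integers(0, 2, n),     # phrase-final syllable? (strong lengthening)
    rng.integers(0, 2, n),     # phrase-initial syllable?
    rng.integers(0, 4, n),     # syllable structure class (V, CV, CVC, ...)
    rng.integers(0, 3, n),     # manner class of the following segment
])
# Fabricated "true" durations with phrase-boundary lengthening plus noise.
y = 70 + 40 * X[:, 0] + 20 * X[:, 1] + 5 * X[:, 2] + rng.normal(0, 15, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
tree = DecisionTreeRegressor(max_depth=6, min_samples_leaf=50).fit(X_tr, y_tr)

pred = tree.predict(X_te)
corr = np.corrcoef(pred, y_te)[0, 1]
rmse = np.sqrt(np.mean((pred - y_te) ** 2))
print(f"correlation {corr:.2f}, RMS error {rmse:.1f} ms")
# Feature importances indicate which contexts dominate duration prediction.
print("importances:", np.round(tree.feature_importances_, 2))
```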
Software
- ProSynth Deliverables
- From the ProSynth deliverables web page you can download various speech synthesis data sets, and also the ProSynth Windows application.
- VTDEMO
- VTDemo is an interactive Windows PC program for demonstrating how the quality of different speech sounds can be explained by changes in the shape of the vocal tract. With VTDemo you can move the articulators in a 2D simulation of the vocal tract cavity and hear in real-time the consequences on the sound produced.
- PhonWeb: Spoken Phonetic Transcription
- This is a web-based system for replaying SAMPA-coded English phonemic transcription using
a diphone synthesis method.
- Speech Filing System (SFS)
- SFS is a set of tools for speech research which also incorporates many elements
relating to speech synthesis. These include a diphone synthesis by rule program, a formant
synthesis by rule program, and a software formant synthesizer.
Recent Articles
Hawkins, S., House, J., Huckvale, M., Local, J., Ogden, R., "ProSynth: An integrated prosodic approach to device-independent, natural-sounding speech synthesis", Proc. ICSLP, Sydney, 1998. Download PDF.
Huckvale, M., "Representation and processing of linguistic structures for an all-prosodic synthesis system using XML", Proc. EuroSpeech 99, Budapest, Hungary, 1999. Download PDF.
Ogden, R., Hawkins, S., House, J., Huckvale, M., Local, J., Carter, P., Dankovicova, J., Heid, S., "ProSynth: An integrated prosodic approach to device-independent, natural-sounding speech synthesis", Computer Speech and Language, 14 (2000), 177-210. Read at Idea Library.
Hawkins, S., Heid, S., House, J., Huckvale, M., "Assessment of naturalness in the ProSynth speech synthesis project", IEE Workshop on Speech Synthesis, London, May 2000. Download PDF.
Huckvale, M., "The use and potential of extensible mark-up (XML) in speech generation", in Keller et al., Improvements in Speech Synthesis, Wiley, 2001. [ISBN: 0471499854] Available at Amazon.com.
Chung, H., Huckvale, M., "Linguistic factors affecting timing in Korean with application to speech synthesis", Proc. EuroSpeech 2001, Aalborg, Denmark, Vol. 2, pp. 815-818. Download PDF.
Huckvale, M., "Speech synthesis, speech simulation and speech science", Proc. International Conference on Spoken Language Processing (ICSLP), Denver, 2002, pp. 1261-1264. Download PDF.
Vazquez-Alvarez, Y., Huckvale, M., "The reliability of the ITU-T P.85 standard for the evaluation of text-to-speech systems", Proc. International Conference on Spoken Language Processing (ICSLP), Denver, 2002, pp. 329-332. Download PDF.
Howard, I., Huckvale, M., "Learning to control an articulatory synthesizer through imitation of natural speech", Summer School on Cognitive and Physical Models of Speech Production, Perception and Perception-Production Interaction, Lubin, Germany, September 2004. Web site.