Please find below a discussion paper on hierarchical annotation.
I think some parts of it are interpretable as is, but I will
present the ideas at our meeting too.
Mark
----------------------------------------------------
Proposals for a Hierarchical Annotation System for Synthesis
Mark Huckvale
University College London
1 October 1997
This document discusses how a complex set of hierarchical
annotations can be constructed, manipulated and stored for the
purposes of speech signal description and synthetic speech
generation.
1 Separate annotation formalism from content
Propose to use a formalism for annotation that is
independent of the units and feature system used at each
level of description. The annotation system should be seen
as 'mark-up' of the true signal or word-string rather than
as a representation in its own right. A mark-up system
based on SGML is proposed - this way the syntax of the
annotations can be defined separately from their
linguistic content.
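As a rough illustration, a word string marked up in SGML/XML style
might look like the fragment built below. The element names (utt,
word) and the id scheme are invented for this sketch; the real unit
inventory and feature system would be defined per level via a DTD.

```python
# Sketch only: mark-up of a word string in SGML/XML style.
# Element names and ids are invented, not a proposed inventory.
import xml.etree.ElementTree as ET

utt = ET.Element("utt", id="u1")
for i, w in enumerate(["the", "cat", "sat"], start=1):
    word = ET.SubElement(utt, "word", id="u1.w%d" % i)
    word.text = w

markup = ET.tostring(utt, encoding="unicode")
print(markup)
```

The point is that the syntax (elements, attributes) is generic, while
the linguistic content lives entirely in the labels and features.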
2 Common annotation formalism for all levels as far as
practicable
Apart from perhaps signals and F0 contours, where more
conventional array-based representations may be
appropriate, it is suggested to use the annotation
formalism for all levels: pragmatic, semantic, syntactic,
prosodic and phonetic. At each level a suitable unit
inventory and feature system is defined. The benefit is
that common tools can be created for all levels of
description, and that the tools are independent of
developments in the feature system or the choice of levels.
3 Interleaving hierarchies represented using 'stand-off'
annotation
Since the levels will not form a simple hierarchy, with
units forming one-to-many and many-to-one mappings across
levels, it is necessary to store the annotation levels in
separate files with the mark-up indicating the linkage.
This allows, for example, prosodic phrasing to intersect
syntactic phrasing, or accent units to intersect word
units. The unification of the context for a particular
unit can be performed by software interrogating the
separate files. This would allow a linguistic rule to
access the context and features for a unit at the current
and all higher levels in the system.
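A minimal sketch of the stand-off idea, with the two annotation
'files' shown as in-memory structures. The refs attribute carrying
the linkage, and the unit names, are invented for illustration.

```python
# Sketch of stand-off annotation: prosodic units refer to word ids
# held in a separate level, rather than containing the words.
words = {
    "w1": {"form": "the"}, "w2": {"form": "cat"}, "w3": {"form": "sat"},
}
# Prosodic phrases mark up the word level by reference, not by copy,
# so prosodic phrasing is free to intersect other phrasings.
prosodic = [
    {"id": "p1", "type": "phrase", "refs": ["w1", "w2"]},
    {"id": "p2", "type": "phrase", "refs": ["w3"]},
]

def context(word_id):
    """Unify a word with every higher-level unit that links to it."""
    return [p["id"] for p in prosodic if word_id in p["refs"]]

print(context("w2"))  # -> ['p1']
```

A linguistic rule would call something like context() to collect the
features of a unit at the current and all higher levels.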
4 Annotation system links to acoustic parameters
The SGML format annotation files also link to the acoustic
data: signal samples, F0 contours, spectral properties and
synthesis parameters. This linkage would use time as an
absolute reference, allowing programs to investigate the
relationship between annotations and signal properties.
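Time-based linkage can be sketched as below; the sample rate, the
segment label and its times are invented for the example.

```python
# Sketch: time as the absolute reference between an annotation and
# the signal. Sample rate and segment times are invented.
SAMPLE_RATE = 16000  # Hz, assumed

def time_to_sample(t_seconds):
    """Map an annotation time to a sample index in the waveform."""
    return round(t_seconds * SAMPLE_RATE)

seg = {"label": "ae", "start": 0.215, "end": 0.290}
lo, hi = time_to_sample(seg["start"]), time_to_sample(seg["end"])
# samples[lo:hi] would then select the waveform region for this
# segment, and similarly for F0 frames or spectral parameters.
print(lo, hi)  # -> 3440 4640
```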
5 Database annotation is the same as synthesis annotation
Essentially the same formalism should be used to describe
reference recorded material as is used in the synthesis
system. The task of each synthesis component is to
recreate the annotation at its level. This has the benefit
that a standard set of tools can be used both to display
database annotations and generated annotations and to
compare the two.
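Because both sides share one formalism, a comparison tool can be as
simple as aligning labels at a level. A toy sketch follows; the flat
label lists and pairwise alignment are invented simplifications, since
a real tool would walk the shared annotation structure.

```python
# Toy sketch: compare reference (database) and generated labels at
# one level of description, flagging mismatches.
def compare(ref, gen):
    """Pair up labels and flag mismatches; pad the shorter side."""
    n = max(len(ref), len(gen))
    ref = ref + [None] * (n - len(ref))
    gen = gen + [None] * (n - len(gen))
    return [(r, g, r == g) for r, g in zip(ref, gen)]

print(compare(["dh", "ax", "k", "ae", "t"], ["dh", "ax", "k", "a", "t"]))
```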
6 Software support library and tools
Should aim to re-use SGML tools wherever possible, e.g. the
XML project and the system developed by Henry Thompson in
Edinburgh. Library facilities should be created which hide
the low-level text and file formats. Tools for creation
and editing are already under development.
7 Phonetic labelling allows access to sub-segment structure
It may well be impractical to provide any more detailed
annotation than phone-sized segmentation and transcription
alignment for most of the speech database. This may be a
problem if access to finer temporal detail is required. A
solution might be to define sub-phone segmentation in an
algorithmic way: the time of each sub-component is
established by an algorithm that segments on the basis of,
for example, spectral similarity.
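One such algorithm could split a phone at the frame where adjacent
spectra differ most. The sketch below uses Euclidean distance on toy
frame vectors; the frames, features and distance measure are all
invented placeholders for whatever the database actually provides.

```python
# Sketch of algorithmic sub-phone segmentation: split a phone at
# the frame where adjacent spectra differ most.
def split_point(frames):
    """Index of the largest spectral jump between adjacent frames."""
    dists = [
        sum((a - b) ** 2 for a, b in zip(frames[i], frames[i + 1])) ** 0.5
        for i in range(len(frames) - 1)
    ]
    return dists.index(max(dists)) + 1

# Four toy 3-channel "spectra": a clear change between frames 1 and 2.
frames = [[1.0, 1.0, 1.0], [1.1, 1.0, 0.9], [5.0, 4.0, 3.0], [5.1, 4.1, 3.0]]
print(split_point(frames))  # -> 2
```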
8 Stylisation of F0 contours
A method for F0 stylisation is probably required to link
the intonational model to the signal - since a link to
absolute F0 values on a 10ms frame basis would not be
productive. The method of quadratic splines developed by
Daniel Hirst may be worth investigating.
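As a first step towards stylisation, a 10ms-frame F0 track can be
reduced to a small set of target points (endpoints plus local turning
points). A full method such as Hirst's quadratic splines would then
interpolate smoothly between targets; that step is not reproduced
here, and the track values below are invented.

```python
# Hedged sketch of F0 target selection for stylisation: keep the
# endpoints and every frame where the slope changes sign. This is
# not Hirst's algorithm itself, only the target-point idea.
def targets(f0):
    """Indices of endpoints and local turning points in an F0 track."""
    idx = [0]
    for i in range(1, len(f0) - 1):
        if (f0[i] - f0[i - 1]) * (f0[i + 1] - f0[i]) < 0:
            idx.append(i)  # slope changes sign: a turning point
    idx.append(len(f0) - 1)
    return idx

track = [100, 110, 120, 115, 108, 112, 118, 117]  # Hz, invented
print(targets(track))  # -> [0, 2, 4, 6, 7]
```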
9 Annotation and signal generation modules
Modules are required for at least:
Syntactic parse
Prosodic parse
Temporal model
Intonational model
Segmental model
F0 generation
Resynthesis
Since we probably need to establish a strict sequence for
the application of modules, it may be necessary to
sub-divide modules to ensure cross-level information is
available at the right time as synthesis proceeds.
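The strict sequencing could be sketched as a simple pipeline. The
module names follow the list above; the placeholder bodies (each just
records that it ran) are invented stand-ins for the real components.

```python
# Sketch of a strict module sequence; each module consumes the
# annotation produced so far and extends it.
PIPELINE = [
    "syntactic_parse", "prosodic_parse", "temporal_model",
    "intonational_model", "segmental_model", "f0_generation",
    "resynthesis",
]
# Invented placeholder bodies: each module records its own name.
MODULES = {name: (lambda ann, n=name: ann + [n]) for name in PIPELINE}

def run(annotation, modules):
    """Apply modules in strict order over a growing annotation."""
    for name in modules:
        annotation = MODULES[name](annotation)
    return annotation

print(run([], PIPELINE))
```

Sub-dividing a module amounts to replacing one entry in the sequence
with two, without disturbing the rest of the pipeline.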
10 Tools
We will need tools for at least:
Transcription alignment
Annotation file generation
F0 modelling
Resynthesis
Display/Edit of hierarchical structures
Import from Laureate/Festival
Export to Laureate/Festival/MBROLA/HLSYN