Speech Processing by Computer

LAB 8

SIGNAL GENERATION FOR SYNTHESIS

In this lab session we will experiment with three different signal generation methods and compare the results. We will use methods for prosody manipulation to change the pitch and timing of natural speech. We will use a diphone concatenation system to produce synthetic copies of a natural utterance. We will use a formant synthesis by rule system to produce a second set of synthetic copies. We will then compare the different versions by listening to them.

Control file format

The prosody manipulation program repros, the MBROLA diphone synthesis system mbrsynth and the formant synthesis system phosynth all use the same control file format. Control files are plain text files that can be created with Windows Notepad. The format is as follows:

• each line describes one phonetic segment

• a line has three parts separated by spaces:

(i) the name of the phone

(ii) the duration of the phone in milliseconds

(iii) the required Fx contour through the segment

• the phone names are in SAMPA format:

Consonants	p, b, t, d, k, g, tS, dZ, f, v, T, D, s, z, S, Z, h, m, n, N, l, r, w, j, 5 (=dark-l)
Vowels	i:, I, e, {, V, A:, Q, O:, U, u:, @, 3:, aI, OI, eI, @U, aU, e@, I@, U@
Silence	_

• the Fx contour is specified as a list of pairs of numbers, each pair consisting of

(i) the position within the segment, as a percentage of its duration, at which the Fx is to be set

(ii) the Fx value at that point in Hz.

Thus the following control file will say "Mark":

_ 100

m 125 0 125

A: 500 0 120 100 100

k 250

_ 100

Read this as: silence for 100 ms; [m] lasts 125 ms, starting at 125 Hz; [A:] lasts 500 ms, starting at 120 Hz and ending at 100 Hz; [k] lasts 250 ms; end with 100 ms of silence.
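The control file format described above can be read with a few lines of Python. The following is only a minimal sketch, not part of the SFS tools: the function names parse_control_file and absolute_fx_points are made up for illustration, and it assumes the file contains nothing but segment lines in the format given above.

```python
def parse_control_file(text):
    """Parse control-file text into a list of (phone, duration_ms, fx_points)."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        phone = fields[0]
        duration_ms = float(fields[1])
        numbers = [float(x) for x in fields[2:]]
        if len(numbers) % 2 != 0:
            raise ValueError(f"odd number of Fx values in line: {line!r}")
        # Pair up (percent of segment duration, Fx in Hz).
        fx_points = list(zip(numbers[0::2], numbers[1::2]))
        segments.append((phone, duration_ms, fx_points))
    return segments


def absolute_fx_points(segments):
    """Convert per-segment (percent, Hz) pairs to utterance-level (time_ms, Hz) pairs."""
    points = []
    start_ms = 0.0
    for _phone, duration_ms, fx_points in segments:
        for percent, hz in fx_points:
            points.append((start_ms + duration_ms * percent / 100.0, hz))
        start_ms += duration_ms
    return points


# The "Mark" example from this handout.
MARK = """\
_ 100
m 125 0 125
A: 500 0 120 100 100
k 250
_ 100
"""

segments = parse_control_file(MARK)
total_ms = sum(duration for _phone, duration, _fx in segments)
print(total_ms)                      # total utterance length in ms
print(absolute_fx_points(segments))  # Fx targets on the utterance time axis
```

For the "Mark" file, the segment durations add up to 1075 ms, and the three Fx targets fall at 100 ms, 225 ms and 725 ms from the start of the utterance.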

1. Record a natural utterance

a. Design a short sentence (4 or 5 words at most) with a mix of segments and an interesting prosody

b. Record Speech and Lx at 16,000 samples/sec

c. Generate Tx and Fx

d. Save to file SENT.SFS

2. Annotate natural utterance

a. add SAMPA annotations to SENT.SFS

b. remember to annotate silence as ‘_’

c. let Mark check your annotations for correctness

3. Generate .PHO control files

a. choose Tools/Annotations/Export/Export as MBROLA

b. save control file as NN.PHO

c. open NN.PHO in Notepad

d. change the durations of segments to the values specified in the Table of Phoneme Durations at the end of this sheet.

e. save as SN.PHO

f. open NN.PHO again

g. change the pitch contour: delete all the Fx specifications, then put “0 150” on the first segment and “100 100” on the last segment. This will cause a uniform fall in pitch from 150 Hz to 100 Hz over the utterance.

h. save as NS.PHO

i. open SN.PHO

j. change the pitch contour again as in step g

k. save as SS.PHO
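The edits in step 3 are done by hand in this lab, but they can also be sketched as a short Python script, which may help when checking your files. This is only an illustration under stated assumptions: the function names are hypothetical, the control file is assumed to contain only segment lines, and TABLE_MS holds just the few table entries needed for the "Mark" example.

```python
# A few entries from the Table of Phoneme Durations (ms); extend as needed.
TABLE_MS = {"_": 100, "m": 110, "A:": 220, "k": 140}


def parse(text):
    """Split each line into (phone, duration, list of Fx fields), all kept as strings."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if fields:
            segments.append((fields[0], fields[1], fields[2:]))
    return segments


def fmt(segments):
    """Write segments back out, one per line."""
    lines = [" ".join([phone, dur] + fx) for phone, dur, fx in segments]
    return "\n".join(lines) + "\n"


def synthetic_durations(segments, table):
    """Step d: replace every duration with the table value for that phone."""
    return [(phone, str(table[phone]), fx) for phone, _dur, fx in segments]


def synthetic_pitch(segments, start_hz=150, end_hz=100):
    """Step g: drop all Fx targets, then set one at each end of the utterance."""
    out = [(phone, dur, []) for phone, dur, _fx in segments]
    phone, dur, _ = out[0]
    out[0] = (phone, dur, ["0", str(start_hz)])
    phone, dur, fx = out[-1]
    out[-1] = (phone, dur, fx + ["100", str(end_hz)])
    return out


NN = "_ 100\nm 125 0 125\nA: 500 0 120 100 100\nk 250\n_ 100\n"

nn = parse(NN)
SN = fmt(synthetic_durations(nn, TABLE_MS))                   # table durations, natural pitch
NS = fmt(synthetic_pitch(nn))                                 # natural durations, falling pitch
SS = fmt(synthetic_pitch(synthetic_durations(nn, TABLE_MS)))  # both changed
print(SS)
```

In the file names, the first letter refers to the durations and the second to the pitch (N = natural, S = synthetic).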

4. Generate variants of natural utterance

a. select item 1.01 and choose Tools/Speech/Process/Prosody Change. Use the SN.PHO control file

b. select item 1.01 and choose Tools/Speech/Process/Prosody Change. Use the NS.PHO control file

c. select item 1.01 and choose Tools/Speech/Process/Prosody Change. Use the SS.PHO control file

5. Generate diphone versions

a. choose Tools/Generate/MBROLA synthesis. Use the NN.PHO control file and the “en1” database of diphones.

b. choose Tools/Generate/MBROLA synthesis. Use the SN.PHO control file and the “en1” database of diphones.

c. choose Tools/Generate/MBROLA synthesis. Use the NS.PHO control file and the “en1” database of diphones.

d. choose Tools/Generate/MBROLA synthesis. Use the SS.PHO control file and the “en1” database of diphones.

6. Generate formant versions

a. choose Tools/Generate/Synthesis by rule. Use the NN.PHO control file.

b. choose Tools/Synthesis Data/Synthesize speech to make a speech signal.

c. choose Tools/Generate/Synthesis by rule. Use the SN.PHO control file.

d. choose Tools/Synthesis Data/Synthesize speech to make a speech signal.

e. choose Tools/Generate/Synthesis by rule. Use the NS.PHO control file.

f. choose Tools/Synthesis Data/Synthesize speech to make a speech signal.

g. choose Tools/Generate/Synthesis by rule. Use the SS.PHO control file.

h. choose Tools/Synthesis Data/Synthesize speech to make a speech signal.

7. Compare versions

a. you should have 12 different speech items:

1.01	Natural Speech	Natural Durations	Natural Pitch
1.02		Synthetic Durations	Natural Pitch
1.03		Natural Durations	Synthetic Pitch
1.04		Synthetic Durations	Synthetic Pitch
1.05	Diphone Synthesis	Natural Durations	Natural Pitch
1.06		Synthetic Durations	Natural Pitch
1.07		Natural Durations	Synthetic Pitch
1.08		Synthetic Durations	Synthetic Pitch
1.09	Formant Synthesis	Natural Durations	Natural Pitch
1.10		Synthetic Durations	Natural Pitch
1.11		Natural Durations	Synthetic Pitch
1.12		Synthetic Durations	Synthetic Pitch

b. which is more important, natural durations or natural pitch? Compare items 1.02 & 1.03, 1.06 & 1.07, 1.10 & 1.11.

c. does good prosody compensate for poor voice quality? Compare items 1.05 & 1.04, 1.09 & 1.08.
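To keep track of which item came from which control file, the list of 12 items can be rebuilt as a lookup. This is a small sketch that assumes the items were generated in exactly the order of steps 4 to 6, so that the item numbers match the list in step 7a; the variable names are illustrative only.

```python
from itertools import product

METHODS = ["Natural Speech", "Diphone Synthesis", "Formant Synthesis"]
CONTROL_FILES = ["NN.PHO", "SN.PHO", "NS.PHO", "SS.PHO"]
LABEL = {"N": "Natural", "S": "Synthetic"}

items = {}
for index, (method, pho) in enumerate(product(METHODS, CONTROL_FILES), start=1):
    # First letter of the file name = durations, second letter = pitch.
    items[f"1.{index:02d}"] = (
        method,
        f"{LABEL[pho[0]]} Durations",
        f"{LABEL[pho[1]]} Pitch",
        pho,
    )

for number, description in items.items():
    print(number, *description, sep="\t")
```

Note that item 1.01 is the original recording; NN.PHO is listed against it only because it describes the same durations and pitch.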

Table of Phoneme Durations in milliseconds

&	140	3:	240	5	80	@	90
@U	230	A:	220	D	100	I	100
I@	210	N	130	O:	210	OI	240
Q	140	S	180	T	160	U	100
U@	230	V	155	Z	70	_	100
aI	220	aU	240	b	115	d	75
dZ	170	e	125	e@	270	eI	230
f	130	g	90	h	160	i:	140
j	110	k	140	l	80	l~	80
m	110	n	130	p	130	r	80
s	125	t	130	tS	210	u:	155
v	85	w	80	z	140	{	140
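If you want to look durations up by program rather than by eye, the table above can be loaded into a Python dictionary. This is a minimal sketch: the name load_durations is hypothetical, and the table text is simply copied from this sheet with the columns separated by spaces.

```python
TABLE_TEXT = """\
& 140 3: 240 5 80 @ 90
@U 230 A: 220 D 100 I 100
I@ 210 N 130 O: 210 OI 240
Q 140 S 180 T 160 U 100
U@ 230 V 155 Z 70 _ 100
aI 220 aU 240 b 115 d 75
dZ 170 e 125 e@ 270 eI 230
f 130 g 90 h 160 i: 140
j 110 k 140 l 80 l~ 80
m 110 n 130 p 130 r 80
s 125 t 130 tS 210 u: 155
v 85 w 80 z 140 { 140
"""


def load_durations(text):
    """Read alternating phone / duration tokens into a {phone: ms} dictionary."""
    tokens = text.split()
    return {phone: int(ms) for phone, ms in zip(tokens[0::2], tokens[1::2])}


durations = load_durations(TABLE_TEXT)

# Table durations for the segments of "Mark", including the silences.
mark = ["_", "m", "A:", "k", "_"]
print(sum(durations[phone] for phone in mark))
```

With the table durations, the "Mark" example would last 670 ms in total, compared with 1075 ms for the durations given in the control file example.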
