UCL Department of Phonetics and Linguistics

Introduction to Computer Programming with MATLAB

Lecture 10: Speech Signal Analysis

Objectives

 

By the end of the session you should:

q       know one way to estimate the fundamental frequency of a section of speech signal from its spectrum

q       know one way to estimate the fundamental frequency of a section of speech signal from its waveform

q       know one way to estimate the formant frequencies from a section of speech signal

q       appreciate some of the problems of fundamental frequency and formant frequency estimation

Outline

 

1. Fundamental frequency estimation – frequency domain

 

The general problem of fundamental frequency estimation is to take a portion of signal and to find the dominant frequency of repetition.  Difficulties arise from (i) that not all signals are periodic, (ii) those that are periodic may be changing in fundamental frequency over the time of interest, (iii) signals may be contaminated with noise, even with periodic signals of other fundamental frequencies, (iv) signals that are periodic with interval T are also periodic with interval 2T, 3T etc, so we need to find the smallest periodic interval or the highest fundamental frequency; and (v) even signals of constant fundamental frequency may be changing in other ways over the interval of interest.

 

A reliable way of obtaining an estimate of the dominant fundamental frequency for long, clean, stationary speech signals is to use the cepstrum.  The cepstrum is a Fourier analysis of the logarithmic amplitude spectrum of the signal.  If the log amplitude spectrum contains many regularly spaced harmonics, then the Fourier analysis of the spectrum will show a peak corresponding to the spacing between the harmonics: i.e. the fundamental frequency.  Effectively we are treating the signal spectrum as another signal, then looking for periodicity in the spectrum itself.

 

The cepstrum is so-called because it turns the spectrum inside-out.  The x-axis of the cepstrum has units of quefrency, and peaks in the cepstrum (which relate to periodicities in  the spectrum) are called rahmonics.

 

To obtain an estimate of the fundamental frequency from the cepstrum we look for a peak in the quefrency region corresponding to typical speech fundamental frequencies.  Here is an example:

 

% get a section of vowel

[x,fs]=wavread('six.wav',[24120 25930]);

ms1=fs/1000;                 % maximum speech Fx at 1000Hz

ms20=fs/50;                  % minimum speech Fx at 50Hz

%

% plot waveform

t=(0:length(x)-1)/fs;        % times of sampling instants

subplot(3,1,1);

plot(t,x);

legend('Waveform');

xlabel('Time (s)');

ylabel('Amplitude');

%

% do fourier transform of windowed signal

Y=fft(x.*hamming(length(x)));

%

% plot spectrum of bottom 5000Hz

hz5000=5000*length(Y)/fs;

f=(0:hz5000)*fs/length(Y);

subplot(3,1,2);

plot(f,20*log10(abs(Y(1:length(f)))+eps));

legend('Spectrum');

xlabel('Frequency (Hz)');

ylabel('Magnitude (dB)');

%

% cepstrum is DFT of log spectrum

C=fft(log(abs(Y)+eps));

%

% plot between 1ms (=1000Hz) and 20ms (=50Hz)

q=(ms1:ms20)/fs;

subplot(3,1,3);

plot(q,abs(C(ms1:ms20)));

legend('Cepstrum');

xlabel('Quefrency (s)');

ylabel('Amplitude');

 

 

To search for the index of the peak in the cepstrum between 1 and 20ms, and then convert back to hertz, use:

 

[c,fx]=max(abs(C(ms1:ms20)));

fprintf('Fx=%gHz\n',fs/(ms1+fx-1));

 

The cepstrum works best when the fundamental frequency is not changing too rapidly, when the fundamental frequency is not too high and when the signal is relatively noise-free.  A disadvantage of the cepstrum is that it involves computationally-expensive frequency-domain processing.

 

2. Fundamental frequency estimation – time domain

 

The cepstrum looks for periodicity in the log spectrum of the signal, whereas our perception of pitch is more strongly related to periodicity in the waveform itself.  A means to estimate fundamental frequency from the waveform directly is to use autocorrelation.  The autocorrelation function for a section of signal shows how well the waveform shape correlates with itself at a range of different delays.  We expect a periodic signal to correlate well with itself at very short delays and at delays corresponding to multiples of pitch periods.

 

This code plots the autocorrelation function for a section of speech signal:

 

% get a section of vowel

[x,fs]=wavread('six.wav',[24120 25930]);

ms20=fs/50;                 % minimum speech Fx at 50Hz

%

% plot waveform

t=(0:length(x)-1)/fs;        % times of sampling instants

subplot(2,1,1);

plot(t,x);

legend('Waveform');

xlabel('Time (s)');

ylabel('Amplitude');

%

% calculate autocorrelation

r=xcorr(x,ms20,'coeff');   

%

% plot autocorrelation

d=(-ms20:ms20)/fs;          % times of delays

subplot(2,1,2);

plot(d,r);

legend('Autocorrelation');

xlabel('Delay (s)');

ylabel('Correlation coeff.');

 

You can see that the autocorrelation function peaks at zero delay and at delays corresponding to ± 1 period,  ± 2 periods, etc.  We can estimate the fundamental frequency by looking for a peak in the delay interval corresponding to the normal pitch range in speech, say 2ms(=500Hz) and 20ms (=50Hz).  For example:

 

ms2=fs/500                 % maximum speech Fx at 500Hz

ms20=fs/50                 % minimum speech Fx at 50Hz

% just look at region corresponding to positive delays

r=r(ms20+1:2*ms20+1)

[rmax,tx]=max(r(ms2:ms20))

fprintf('rmax=%g Fx=%gHz\n',rmax,fs/(ms2+tx-1));

 

 

 

The autocorrelation approach works best when the signal is of low, regular pitch and when the spectral content of the signal is not changing too rapidly.  The autocorrelation method is prone to pitch halving errors where a delay of two pitch periods is chosen by mistake.  It can also be influenced by periodicity in the signal caused by formant resonances, particularly for female voices where F1 can be lower in frequency than Fx.

 

3. Formant frequency estimation

 

Estimation of formant frequencies is generally more difficult than estimation of fundamental frequency.  The problem is that formant frequencies are properties of the vocal tract system and need to be inferred from the speech signal rather than just measured.  The spectral shape of the vocal tract excitation strongly influences the observed spectral envelope, such that we cannot guarantee that all vocal tract resonances will cause peaks in the observed spectral envelope, nor that all peaks in the spectral envelope are caused by vocal tract resonances.

 

The dominant method of formant frequency estimation is based on modelling the speech signal as if it were generated by a particular kind of source and filter:

 

 

This type of analysis is called source-filter separation, and in the case of formant frequency estimation, we are interested only in the modelled system and the frequencies of its resonances.  To find the best matching system we use a method of analysis called Linear Prediction.  Linear prediction models the signal as if it were generated by a signal of minimum energy being passed through a purely-recursive IIR filter.

 

We will demonstrate the idea by using LPC to find the best IIR filter from a section of speech signal and then plotting the filter's frequency response.

 

% get a section of vowel

[x,fs]=wavread('six.wav',[24120 25930]);

% resample to 10,000Hz (optional)

x=resample(x,10000,fs);

fs=10000;

%

% plot waveform

t=(0:length(x)-1)/fs;        % times of sampling instants

subplot(2,1,1);

plot(t,x);

legend('Waveform');

xlabel('Time (s)');

ylabel('Amplitude');

%

% get Linear prediction filter

ncoeff=2+fs/1000;           % rule of thumb for formant estimation

a=lpc(x,ncoeff);

%

% plot frequency response

[h,f]=freqz(1,a,512,fs);

subplot(2,1,2);

plot(f,20*log10(abs(h)+eps));

legend('LP Filter');

xlabel('Frequency (Hz)');

ylabel('Gain (dB)');

 

To find the formant frequencies from the filter, we need to find the locations of the resonances that make up the filter.  This involves treating the filter coefficients as a polynomial and solving for the roots of the polynomial.  Details are outside the scope of this course (see readings).  Here is the code to find estimated formant frequencies from the LP filter:

 

% find frequencies by root-solving

r=roots(a);                  % find roots of polynomial a

r=r(imag(r)>0.01);           % only look for roots >0Hz up to fs/2

ffreq=sort(atan2(imag(r),real(r))*fs/(2*pi));

                             % convert to Hz and sort

for i=1:length(ffreq)

    fprintf('Formant %d Frequency %.1f\n',i,ffreq(i));

end

 

This produces the output:

 

Formant 1 Frequency 289.2

Formant 2 Frequency 2310.6

Formant 3 Frequency 2770.9

Formant 4 Frequency 3489.9

Formant 5 Frequency 4294.8

Reading

 

Unit 8 from the Introduction to Digital Signal Processing course at http://www.phon.ucl.ac.uk/courses/spsci/dsp/ provides more detail about linear prediction.

 

A mathematical treatment can be found in: J. Makhoul. Linear prediction: A tutorial review. Proceedings of the IEEE, 63(4):561-580, April 1975.

 

A good set of lecture notes on the mathematics of speech signal processing can be found at:

http://mi.eng.cam.ac.uk/~ajr/SA95/SpeechAnalysis.html

Exercises

 

For these exercises, use the editor window to enter your code, and save your answers to files under your account on the central server. When you save the files, give them the file extension of ".m".  Run your programs from the command window.

1.      Write a program (ex101.m) that asks the user for a filename (WAV file), and plots a fundamental frequency track against time for the whole signal using the cepstrum method.  Use a window size of 30ms and an output frame rate of 100 Fx samples/second.

2.      Modify the previous exercise to use the autocorrelation method (ex102.m).  In this program set the fundamental frequency value to zero when the peak of the autocorrelation function is less than 0.5.  This acts as a crude voicing decision.

3.      Write a program (ex103.m) that asks the user for a filename (WAV file), and produces a scatter plot of formant peaks against time.  Use a window size of 30ms and an output frame rate of 100 per second.

4.      (Homework) Write a GUI program which asks you to say a vowel and plots the location of the vowel on a “Vowel Quadrilateral” diagram (effectively F1 vs F2).