This document provides a tutorial introduction to the use of SFS for the orthographic and phonetic transcription of a speech recording, including tools for automatic alignment of phonetic transcription to the signal. This tutorial refers to versions 4.6 and later of SFS.
You can use the SFSWin program to record directly from the audio input signal on your computer. Only do this if you know that your audio input is of good quality, since many PCs have rather poor quality audio inputs. In particular, microphone inputs on PCs are commonly very noisy.
To acquire a signal using SFSWin, choose File|New, then Item|Record. See Figure 1.1. Choose a suitable sampling rate; at least 16000 samples/sec is recommended. It is usually not necessary to choose a rate faster than 22050 samples/sec for speech signals.
If you choose to acquire your recording into a file using some other program, or if it is already in an audio file, choose Item|Import|Speech rather than Item|Record to load the recording into SFS. If the file is recorded in plain PCM format in a WAV audio file, you can also just open the file with File|Open. In this latter case, you will be offered the choice to "Copy contents" or "Link to file". See Figure 1.2. If you choose copy, the contents of the audio recording are copied into the SFS file. If you choose link, the SFS file simply "points" to the WAV file so that it may be processed by SFS programs, but the audio is not copied; this means that if the WAV file is deleted, SFS will report an error.
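Before using File|Open on a WAV file, it can help to confirm that it really is plain PCM; compressed WAV files need Item|Import instead. Here is a quick check using only the Python standard library (a sketch for checking your files; SFS itself performs its own format detection):

```python
import wave

def describe_wav(path):
    """Return (channels, sample_rate, sample_width_bytes, n_frames) for a
    plain PCM WAV file. wave.open raises wave.Error for non-PCM or
    malformed files, which tells you to use Item|Import instead."""
    with wave.open(path, "rb") as w:
        return (w.getnchannels(), w.getframerate(),
                w.getsampwidth(), w.getnframes())
```

For example, a mono 16-bit recording at 16000 samples/sec reports as `(1, 16000, 2, n_frames)`.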
If the audio recording has significant amounts of background noise, you may want to try cleaning the recording using Tools|Speech|Process|Signal enhancement. See Figure 1.3. The default setting is "100% spectral subtraction", which subtracts 100% of the quietest spectral slice from every frame. This is a fairly conservative level of enhancement; you can try values greater than 100% for more aggressive enhancement, but at the risk of introducing artifacts.
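The idea behind spectral subtraction can be shown with a toy sketch (this is only an illustration of the principle, not SFS's implementation): take the quietest frame as the noise estimate and subtract it, bin by bin, from every frame, flooring at zero.

```python
def spectral_subtract(frames, factor=1.0):
    """frames: list of magnitude spectra (lists of floats), one per frame.
    Returns enhanced frames with factor * noise estimate removed."""
    # Noise estimate: the spectrum of the quietest frame overall.
    quietest = min(frames, key=sum)
    return [[max(m - factor * n, 0.0) for m, n in zip(frame, quietest)]
            for frame in frames]

# Example: the first frame is noise only, so it is used as the estimate.
noisy = [[0.2, 0.1, 0.1],      # quiet frame, taken as the noise estimate
         [1.0, 0.6, 0.3],      # speech frame
         [0.9, 0.5, 0.2]]
clean = spectral_subtract(noisy)            # 100% subtraction
aggressive = spectral_subtract(noisy, 1.5)  # 150%: stronger, more artifacts
```

With `factor` above 1.0 more of each frame is removed, which is why aggressive settings risk eating into the speech itself.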
It is also suggested at this stage that you standardise the level of the recording. You can do this with Tools|Speech|Process|Waveform preparation, choosing the option "Automatic gain control (16-bit)".
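The idea behind level standardisation is simple peak normalisation, sketched below (an assumption about the principle; SFS's actual gain-control algorithm may differ): scale the waveform so its peak uses most of the 16-bit range.

```python
def normalise_16bit(samples, headroom=0.95):
    """samples: list of ints (16-bit audio). Scale so the peak reaches
    headroom * 32767, leaving a little margin against clipping."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return samples[:]          # silence: nothing to scale
    gain = headroom * 32767 / peak
    return [int(round(s * gain)) for s in samples]

quiet = [0, 1000, -2000, 500]      # peak only 2000 of a possible 32767
loud = normalise_16bit(quiet)      # peak now close to full scale
```

Standardising the level before annotation also helps the energy-based chunking in the next step, since its thresholds then behave consistently across recordings.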
If your audio recording is longer than a single sentence, you will almost certainly gain from first chunking the signal into regions of about one sentence in length. Chunking involves adding a set of annotations which delimit sections of the signal, and it makes the recording much easier to navigate, transcribe and replay in manageable pieces.
An easy way to chunk the signal is to automatically detect pauses using the "npoint" program. This takes a speech signal as input and creates a set of annotations which mark the beginning and end of each region where someone is speaking. It is a simple and robust procedure based on energy in the signal. To use this, select the speech item and choose Tools|Speech|Annotate|Find multiple endpoints. See Figure 1.4. If you know the number of spoken chunks in the file (it may be a recording of a list of words, for example), enter the number using the "Number of utterances to find" option; otherwise choose the "Auto count utterances" option. Put "chunk" (or similar) as the label stem for the annotations.
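A minimal sketch of energy-based endpoint detection, in the spirit of npoint (not its actual algorithm; the frame length and threshold here are illustrative values): mark a frame as speech when its energy exceeds a threshold, then report each run of speech frames as one chunk.

```python
def find_chunks(samples, rate, frame_len=160, threshold=1e5):
    """Return (start_sec, end_sec) pairs for regions whose frame energy
    exceeds the threshold."""
    chunks, start = [], None
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        energy = sum(s * s for s in samples[i:i + frame_len])
        if energy > threshold and start is None:
            start = i                          # speech begins
        elif energy <= threshold and start is not None:
            chunks.append((start / rate, i / rate))
            start = None                       # speech ends
    if start is not None:                      # speech runs to end of file
        chunks.append((start / rate, len(samples) / rate))
    return chunks

# 1 s silence, 1 s of "speech", 1 s silence, at 8000 samples/sec
signal = [0] * 8000 + [500] * 8000 + [0] * 8000
print(find_chunks(signal, 8000))   # → [(1.0, 2.0)]
```

Real speech needs a threshold adapted to the noise floor, which is one reason npoint benefits from the level standardisation done earlier.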
If you view the results of the chunking you will see that each spoken region has been labelled with the stem plus a two-digit number ("chunk01", "chunk02", and so on), while the pauses will be labelled with "/". See Figure 1.5.
If the chunking has not worked properly, or if you want to chunk the signal by hand, you can use the manual annotation facility in "eswin". To do this, select the signal you want to annotate and choose Item|Display to start the eswin program. Then choose eswin menu option Annotation|Create/Edit Annotations, and enter either a set name of "chunks" to create a new set of annotations, or enter "endpoints" to edit the set of annotations produced by npoint.
When eswin is ready to edit annotations you will see a new region at the bottom of the screen where your annotations will appear. To add a new annotation, position the left cursor at the time where you want the annotation to appear. Then press the [A] key on the keyboard once. You will see the string "L=>" appear in the status bar. Type your annotation and press [RETURN]. The annotation should appear at the position of the cursor. See Figure 1.6.
To move an annotation with the mouse, position the mouse cursor on the annotation line within the bottom annotation box. You will see that the mouse cursor changes shape into a double-headed arrow. Press the left mouse button and drag the annotation left or right to its new location. This is also an easy way to correct chunk endpoints found automatically by npoint.
Finally, to hear if the chunking has worked properly, you can listen to the chunked recording using the SFS program "wordplay". This program is not on the SFSWin menus, so to run it, choose Tools|Run program/script then enter "wordplay -SB" in the "Program/script name" box. See Figure 1.7. This will replay each chunk in turn, separating the chunks with a short beep.
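Conceptually, what "wordplay -SB" plays back can be sketched as follows (an illustration only, not the real program): the annotated chunks concatenated in order, with a short beep inserted between them.

```python
import math

def beep(rate=16000, freq=1000, dur=0.1):
    """Short sine-wave beep as a list of 16-bit samples."""
    return [int(10000 * math.sin(2 * math.pi * freq * t / rate))
            for t in range(int(rate * dur))]

def sequence_chunks(samples, chunks, rate=16000):
    """chunks: (start_sec, end_sec) pairs. Build the sample sequence of
    the chunks in order, separated by beeps."""
    out = []
    for n, (start, end) in enumerate(chunks):
        if n > 0:
            out += beep(rate)                  # beep between chunks only
        out += samples[int(start * rate):int(end * rate)]
    return out
```

Listening to this sequence makes gross chunking errors (a missed word, two words merged into one chunk) immediately obvious.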
Assuming that your recording has been chunked into sentence-sized regions, the process of orthographic transcription is now just the process of replacing the "chunk" labels with the real spoken text. The result will be a new annotation item in the file, in which each annotation contains the orthographic transcription of a chunk of signal.
You can edit annotation labels using the "eswin" display program, but it is not very easy: you have to overwrite each annotation label with the transcription. A much easier way is to use the "anedit" annotation label editor program, which lets you edit the labels of annotations without affecting their timing. The program also allows you to listen to the annotated regions while you type new labels or correct old ones. To run anedit, select a speech item and the annotation item containing the chunks, and choose Tools|Annotations|Edit Labels. Since you are mapping one set of annotations into another, change the "output" annotation type to "orthography". See Figure 2.1.
The row of buttons in the middle of the annotation editor window controls the set of annotations.
The row of buttons at the bottom of the annotation editor window controls the replay of the speech signal.
The "Auto" replay feature causes the current annotated region to be replayed each time you change to a different annotation.
To use anedit for entering orthographic transcription, first check that the "Auto" replay feature is enabled and that you are positioned at the first chunk of speech. Replay this with the "Current" button, select and overwrite the old label with the text that was spoken, then press the [RETURN] key. Two things should happen: first, you move on to the next chunk in the file; second, that chunk of signal is replayed. You can now proceed through the file, entering a text transcription and pressing [RETURN] to move on to the next chunk. If you need to hear the signal again, use the buttons at the bottom of the screen. Every so often I suggest you save your transcription back to the file with the "Save" button; this ensures you will not lose a lot of work should something go wrong.
One word of warning: at present SFS is limited to annotations that are less than 250 characters long, and anedit prevents you from entering longer labels. There is, however, no limit to the number of labels.
It is worth thinking about some conventions about how you enter transcription. For example, should you start utterances with capital letters, or terminate them with full stops? Should you use punctuation? Should you use abbreviations and digits? Do you mark non-speech sounds like breath sounds, lip smacks or coughs?
Here is one convention that you might follow, which has the advantage that it is also maximally compatible with SFS tools.
For pause regions you can label them with a special symbol of your own (e.g. "[pause]"), leave them annotated as "/", or use the SAMPA symbol for pause, "...".
Once you have a chunked and transcribed recording you can distribute your transcription as a "clickable script" using the VoiScript program (available for free download from http://www.phon.ucl.ac.uk/resource/voiscript/). The VoiScript program will display your transcription and replay parts of it in response to mouse clicks on the transcription itself. This makes it a very convenient vehicle for others to listen to your recording and study your transcription.
VoiScript takes as input a WAV file of the audio recording and an HTML file containing the transcription coded as links to parts of the audio. Technical details can be found on the VoiScript web site. To save your recording as a WAV file, choose Tools|Speech|Export|Make WAV file, and enter a suitable folder and name for the file. The following SML script can be used to create a basic HTML file compatible with VoiScript:
/* anscript.sml - convert annotation item to a VoiScript HTML file */
/* takes as input file.sfs and outputs HTML assuming audio is in file.wav */
main {
    string basename
    var i,num

    i=index("\.",$filename);
    if (i) basename=$filename:1:i-1 else basename=$filename;

    print "<html><body><h1>",basename,"</h1>\n";
    num=numberof(".");
    for (i=1;i<=num;i=i+1) if (compare(matchn(".",i),"/")!=0) {
        print "<a name=chunk",i:1
        print " href='",basename,".wav#",timen(".",i):1:4
        print "#",(timen(".",i)+lengthn(".",i)):1:4,"'>"
        print matchn(".",i),"</a>\n"
    }
    print "</body></html>\n"
}
Copy and paste this script into a file "anscript.sml". Then select the annotation item you want to base the output on and choose Tools|Run SML script. Enter "anscript.sml" as the SML script filename, and give the name of the output HTML file as the output listing filename, placing it in the same directory as the WAV audio file.
If you now open the output HTML file within the VoiScript program, you will be able to read and replay parts of the transcription on demand. See Figure 2.2.
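For reference, the same HTML can also be generated outside SFS. This Python sketch follows the link format produced by the SML script (the function name and the tuple layout of the annotations are assumptions for illustration): each non-pause annotation becomes a link of the form `<a href='name.wav#start#end'>label</a>`.

```python
def annotations_to_html(basename, annotations):
    """annotations: list of (time_sec, length_sec, label) tuples.
    Pause annotations labelled "/" are skipped, as in the SML script."""
    lines = ["<html><body><h1>%s</h1>" % basename]
    for i, (time, length, label) in enumerate(annotations, start=1):
        if label == "/":
            continue                     # skip pause regions
        lines.append("<a name=chunk%d href='%s.wav#%.4f#%.4f'>%s</a>"
                     % (i, basename, time, time + length, label))
    lines.append("</body></html>")
    return "\n".join(lines)

html = annotations_to_html("demo", [(0.0, 1.5, "hello world"),
                                    (1.5, 0.4, "/")])
```

This can be handy if you keep transcriptions in other tools but still want to browse them with VoiScript.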
Mark Huckvale, University College London, June 2004