Our work in this area was carried out in
incremental stages. First, as a precursor to training studies, it was necessary
to gain an overview of listeners’ sensitivity to visual cues for sound
contrasts that do not occur in their first language. This was done first for a
wide range of English consonant and vowel contrasts that differ in their
phonological status in Spanish (see Study 1). The next two studies (Studies 2
and 3) focused on a narrower range of English consonant contrasts that have a
different phonemic status in the native languages of L2 learners of English from
two different language backgrounds. The results of the various studies are
summarized below.
Study 1: Investigation of the effect of auditory and
visual cues in L2 speech perception for a range of consonant contrasts [reported
in AVSP2001 + SHL papers]
36 Spanish learners of English were tested on their
identification of 16 consonants and 9 vowels of British English presented in
auditory, visual and audiovisual modalities. For consonants, both L2 learners
and controls showed significant improvements in the audiovisual condition, with
larger effects for syllable-final consonants. The patterns of errors by L2
learners were strongly predictable from our knowledge of the relation between
the phoneme inventories of Spanish and English. Consonant confusions which were
language-dependent – mostly errors in voicing and manner – were not reduced by
the addition of visual cues. However, consonant confusions that were common to
both listener groups and which related to acoustic-phonetic sound
characteristics did show improvements. It was therefore concluded that visual
features have different weights when cueing phonemic and allophonic
distinctions.
Study 2: Effect of auditory and visual cues in L2
perception for contrasts that differ in phonological status between L1 and L2.
[reported in ICSLP 2002 paper]
This study targeted sound contrasts that have a
different phonemic status in the listeners' L1 and L2. To evaluate the
influence of the ‘clarity’ of visual information, two sound contrasts were
tested that differed in the degree of information that could be gleaned from
the visual channel: the highly visible /b/-/v/ contrast and the less
visually salient /l/-/r/ contrast. To evaluate the effect of language background
on use of visual cues, learners with different L1 backgrounds (Spanish,
Japanese, Korean) were tested.
In Experiment 1, stimuli containing the English
sounds /b/ and /v/ were presented to 32 Spanish learners of English and to 47
Japanese learners of English in audio, visual and audiovisual modalities. This
contrast is visually highly salient for native listeners, who achieved 94%
correct identification of /v/ in a lipreading-alone condition.
In Experiment 2, 115 Japanese-L1 and 52 Korean-L1
learners of English were tested on their perception of the /l/-/r/ contrast in
audio, visual and audiovisual conditions. This contrast is less visually
salient for native English speakers, who achieved 79% correct identification in
a lipreading-alone condition.
From these two studies, we can conclude that, prior
to intensive training, listeners show little sensitivity to visual cues for
phonemic contrasts that do not occur in their native language. This was found
for L2 learners with different L1s tested on two contrasts differing in visual
distinctiveness. There was some evidence of a weak effect of visual salience:
performance was significantly higher in the AV than in the A condition for the
/b/-/v/ contrast, but no AV benefit was found for the /l/-/r/ contrast. However,
there is strong evidence of individual differences in the use of visual cues. For both
contrasts, individual listeners achieving high scores in the ‘audio alone’
condition also tended to achieve high scores in the ‘lipreading alone’ condition.
This is therefore evidence that, once the phonemic contrast is acquired,
listeners become sensitive to visual as well as auditory cues to the
distinction.
Study 3: Effect of auditory vs audiovisual
cues in intensive training of /b/-/v/ and /l/-/r/ contrasts.
The next step was to see whether intensive training
could help focus the listeners’ attention on both the auditory and visual cues
to the contrast. We also investigated whether listeners improved their
perception more when trained with tokens presented audiovisually than with the
same tokens presented auditorily.
Two major training studies were run in Japan: one
training the /l/-/r/ contrast, the other training the /b/-/v/ contrast. These
studies were run in collaboration with Dr Masaki Taniguchi from the University
of Kochi and Dr Midori Iba from Konan University. Additional testing was done
in the Department of Phonetics and Linguistics and in collaboration with the
Bell Language School in London.
The following format was used in these training
studies. We designed the training programme using the CSLU toolkit software of
our collaborators at CSLU, and used a wide range of audiovisual training
materials recorded and processed at UCL. The toolkit made it easy to
integrate a conversational agent into the training programme. We were
thus able to include some interactivity between the learner and ‘Baldi’, the
‘artificial teacher’, who gave instructions and provided feedback to learners on
their level of performance during training. This was an important component in
maintaining learners’ motivation over the course of the 13 training/testing sessions.
The training programme was based on the High
Variability Phonetic Training (HVPT) approach (Logan et al.), which emphasizes the
importance of introducing stimulus variability in the training process and of
giving immediate feedback on performance. The training consisted of 13 sessions
of approximately 40 minutes each: a pretest in which learners were tested on the
perception of the key consonants in a set of nonsense words in A, V and AV
modalities; 10 training sessions in which they heard a wide range of real-word
stimuli produced by five different speakers and were given immediate feedback after
each presentation; a posttest (identical to the pretest); and a generalization session
with new stimuli. All the training was conducted on laptops, with stimuli
presented via headphones. Some sessions were supervised but the training
programme was designed to be run by the learners without a teacher or
experimenter present.
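The 13-session structure described above can be sketched as a simple schedule. This is only an illustrative outline, not the CSLU toolkit implementation: the names `Session` and `build_schedule` are hypothetical, and the modalities used in the generalization session are left empty because the report does not state them.

```python
from dataclasses import dataclass

TEST_MODALITIES = ("A", "V", "AV")  # audio, visual, audiovisual


@dataclass(frozen=True)
class Session:
    number: int
    kind: str          # "pretest", "training", "posttest" or "generalization"
    modalities: tuple  # modalities presented; empty if not stated in the report


def build_schedule(training_modality, n_training=10):
    """Return the session list for one learner (hypothetical helper).

    training_modality is "A" or "AV", depending on the training group.
    """
    sessions = [Session(1, "pretest", TEST_MODALITIES)]
    for i in range(n_training):
        sessions.append(Session(2 + i, "training", (training_modality,)))
    sessions.append(Session(n_training + 2, "posttest", TEST_MODALITIES))
    # Modalities of the generalization session are not given in the report.
    sessions.append(Session(n_training + 3, "generalization", ()))
    return sessions


schedule = build_schedule("AV")
```

With the default of 10 training sessions, the schedule contains the pretest, the training block, the posttest and the generalization session, for 13 sessions in total.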
In Experiment 1, 62 Japanese-L1 subjects were
trained on their perception of the /l/-/r/ contrast. Eighteen listeners were
trained using audio stimuli only, 25 using the same stimuli presented
audiovisually with a natural face, and 19 in an audiovisual condition where the same
audio stimuli were carefully synchronized with an artificial face (the ‘Baldi’
conversational agent). Overall, the effect of training condition was not
significant: identification increased on average by 14.9% both for the group
trained with auditory stimuli and for the group trained with natural AV stimuli,
and by 10% for the group trained with auditory stimuli synchronized to an
artificial face. Most importantly, there
was a training condition by modality interaction showing that those trained
using auditory stimuli improved their auditory perception of the sounds to a
greater extent than those trained audiovisually, but that the audiovisual
trainees improved their sensitivity to visual cues (in the visual condition) to a
much greater extent than those trained auditorily.
The listeners’ natural sensitivity to visual cues is
one factor that might account for the limited effect of AV training. Learners
who are at chance in their use of visual cues may not have been able
to use the additional information provided in AV training. We therefore
estimated the factor ‘visual awareness’ on the basis of pretest performance
in the video-alone condition (listeners scoring above 55.6% correct were classed
as ‘visually aware’). Within the subgroup of 17 listeners with ‘visual awareness’,
post-test performance for those trained auditorily and those trained audiovisually
was at a similar level. This suggests that even those learners who did initially
make use of visual cues did not benefit more from training with audio-visual
stimuli.
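The ‘visual awareness’ split can be illustrated with a short sketch. The 55.6% cut-off is taken from the text; the function names, listener identifiers and scores below are invented purely for illustration and are not data from the study.

```python
VISUAL_AWARENESS_THRESHOLD = 55.6  # percent correct in the video-alone pretest


def is_visually_aware(pretest_video_pct):
    """True if the pretest video-alone score exceeds the cut-off."""
    return pretest_video_pct > VISUAL_AWARENESS_THRESHOLD


def split_by_awareness(pretest_scores):
    """Split a {listener_id: percent_correct} mapping into two sorted lists."""
    aware = sorted(k for k, v in pretest_scores.items() if is_visually_aware(v))
    unaware = sorted(k for k, v in pretest_scores.items() if not is_visually_aware(v))
    return aware, unaware


# Hypothetical listeners and scores, for illustration only.
example_scores = {"S01": 50.0, "S02": 61.1, "S03": 55.6, "S04": 72.2}
aware, unaware = split_by_awareness(example_scores)
```

Because the criterion is a strict inequality, a listener scoring exactly 55.6% falls into the group without ‘visual awareness’.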

Figure 1: Difference in correct identification of
/l/-/r/ consonants pre-post training in the three test conditions for the two
learner groups who undertook training with either auditory or audiovisual
stimuli.
For the /b/-/v/ contrast, 39 Japanese learners of
English were trained: 21 in the audio condition and 18 in the audiovisual (natural
face) condition. In this study, the group trained with audiovisual stimuli
improved significantly more than the group trained with auditory stimuli.
Perception of lipread stimuli improved less than perception of auditory or AV
stimuli for both groups.

Figure 2: Difference in correct identification of
/b/-/v/ consonants pre-post training in the three test conditions for the two
learner groups who undertook training with either auditory or audiovisual
stimuli. Scores are converted to d′ (d-prime).
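For readers unfamiliar with the d′ measure used in Figure 2, the sketch below shows one common way of converting identification counts to d′ as z(hit rate) minus z(false-alarm rate). The report does not state how d′ was computed, so the mapping of responses to hits and false alarms and the log-linear correction for extreme rates are assumptions, and the function names are hypothetical.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """d' = z(hit rate) - z(false-alarm rate).

    Assumed mapping for a /b/-/v/ identification task: a "hit" is a /v/
    response to a /v/ stimulus, a "false alarm" is a /v/ response to a /b/
    stimulus. The log-linear correction (add 0.5 to each count, 1 to each
    total) is an assumption; it keeps rates of 0 or 1 finite.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)


def d_prime_gain(pre_counts, post_counts):
    """Pre-to-post change in d' for one listener and one test condition."""
    return d_prime(*post_counts) - d_prime(*pre_counts)
```

A listener responding at chance (equal hit and false-alarm rates) has a d′ of zero, and a positive pre-to-post gain indicates improved sensitivity to the contrast independently of any bias towards one response.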
In summary, we showed that our training technique
led to a significant improvement in the perception of non-native phonemic
contrasts over a relatively short training period. Audiovisual training was
more effective than auditory training only for a sound contrast that is
highly visible for native listeners.