Data indicating the relative effectiveness of the acoustic and visual (speechreading) enhancement of speech contrasts in L2 learning

Our work in this area was carried out in incremental stages. First, as a precursor to training studies, it was necessary to gain an overview of listeners’ sensitivity to visual cues for sound contrasts that do not occur in their first language. This was initially done for a wide range of English consonant and vowel contrasts that differ in their phonological status in Spanish (see Study 1). The next two studies (Studies 2 and 3) focused on a narrower range of English consonant contrasts that have a different phonemic status in the native languages of L2 learners of English from two different language backgrounds. The results of the various studies are summarized below.

 

Study 1: Investigation of the effect of auditory and visual cues in L2 speech perception for a range of consonant contrasts [reported in AVSP2001 + SHL papers]

36 Spanish learners of English were tested on their identification of 16 consonants and 9 vowels of British English presented in auditory, visual and audiovisual modalities. For consonants, both L2 learners and controls showed significant improvements in the audiovisual condition, with larger effects for syllable-final consonants. The patterns of errors made by L2 learners were strongly predictable from the relation between the phoneme inventories of Spanish and English. Consonant confusions that were language-dependent – mostly errors in voicing and manner – were not reduced by the addition of visual cues. However, consonant confusions that were common to both listener groups and that related to acoustic-phonetic sound characteristics did show improvements. It was therefore concluded that visual features have different weights when cueing phonemic and allophonic distinctions.

 

Study 2: Effect of auditory and visual cues in L2 perception for contrasts that differ in phonological status between L1 and L2 [reported in ICSLP 2002 paper]

This study targeted sound contrasts that have a different phonemic status in the listeners' L1 and L2. To evaluate the influence of the ‘clarity’ of visual information, two sound contrasts were tested that differed in the amount of information that could be gleaned from the visual channel: the highly visible /v/-/b/ contrast and the less visually salient /l/-/r/ contrast. To evaluate the effect of language background on the use of visual cues, learners with different L1 backgrounds (Spanish, Japanese, Korean) were tested.

In Experiment 1, stimuli containing the English sounds /b/ and /v/ were presented to 32 Spanish learners of English and to 47 Japanese learners of English in audio, visual and audiovisual modalities. This contrast is visually highly salient for native listeners, who achieve 94% correct identification of /v/ in a lipreading-alone condition.

In Experiment 2, 115 Japanese-L1 and 52 Korean-L1 learners of English were tested on their perception of the /l/-/r/ contrast in audio, visual and audiovisual conditions. This contrast is less visually salient for native English speakers, who achieved 79% correct identification in a lipreading-alone condition.

From these two experiments, we can conclude that, prior to intensive training, listeners show little sensitivity to visual cues for phonemic contrasts that do not occur in their native language. This was found for L2 learners with different L1s tested on two contrasts differing in visual distinctiveness. There was some evidence of a weak effect of visual salience: performance was significantly higher in the AV than in the A condition for the /b/-/v/ contrast, but no AV benefit was found for the /l/-/r/ contrast. However, it must be noted that there is strong evidence of individual differences in the use of visual cues. For both contrasts, individual listeners achieving high scores in the ‘audio alone’ condition also tended to achieve high scores in the ‘lipreading alone’ condition. This suggests that, once the phonemic contrast is acquired, listeners become sensitive to visual as well as auditory cues to the distinction.

 

Study 3: Effect of auditory vs. audiovisual cues in intensive training of /b/-/v/ and /l/-/r/ contrasts

The next step was to see whether intensive training could help focus the listeners’ attention on both the auditory and visual cues to the contrast. We also investigated whether listeners improved their perception more when trained with tokens presented audiovisually than with the same tokens presented auditorily.

Two major training studies were run in Japan: one training the /l/-/r/ contrast, the other training the /b/-/v/ contrast. These studies were run in collaboration with Dr Masaki Taniguchi from the University of Kochi and Dr Midori Iba from Konan University. Additional testing was done in the Department of Phonetics and Linguistics and in collaboration with the Bell Language School in London.

The following format was used in these training studies. We designed the training programme using the CSLU Toolkit software from our collaborators at CSLU, and used a wide range of audiovisual training materials recorded and processed at UCL. The Toolkit enabled us to easily integrate a conversational agent into the training programme. We were thus able to include some interactivity between the learner and ‘Baldi’, the ‘artificial teacher’, who gave instructions and gave the learner feedback on his/her level of performance during training. This was an important component in maintaining learners’ motivation over the course of the 13 training/testing sessions.

The training programme was based on the High Variability Phonetic Training (HVPT) approach (Logan et al.), which advocates introducing stimulus variability in the training process and giving immediate feedback on performance. The programme consisted of 13 sessions of approximately 40 minutes each: a pretest in which learners were tested on the perception of the key consonants in a set of nonsense words in A, V and AV modalities; 10 training sessions in which they heard a wide range of real-word stimuli produced by five different speakers and were given immediate feedback after each presentation; a posttest (identical to the pretest); and a generalization session with new stimuli. All the training was conducted on laptops, with stimuli presented via headphones. Some sessions were supervised, but the training programme was designed to be run by the learners without a teacher or experimenter present.

In Experiment 1, 62 Japanese-L1 subjects were trained on their perception of the /l/-/r/ contrast. Eighteen listeners were trained using audio stimuli only, 25 using the same stimuli presented audiovisually with a natural face, and 19 in an audiovisual condition where the same audio stimuli were carefully synchronized with an artificial face (the ‘Baldi’ conversational agent). Overall, the effect of training condition was not significant: identification increased on average by 14.9% both for the group trained with auditory stimuli and for the group trained with natural AV stimuli, and by 10% for the group trained with auditory stimuli synchronized to an artificial face. Most importantly, there was a training condition by modality interaction: those trained using auditory stimuli improved their auditory perception of the sounds to a greater extent than those trained audiovisually, whereas the audiovisual trainees improved their sensitivity to visual cues (in the visual condition) to a much greater extent than those trained auditorily.

The listeners’ natural sensitivity to visual cues is one factor that might account for the limited effect of AV training. Learners who are at chance in their use of visual cues may not have been able to use the additional information provided in AV training. We therefore estimated the factor ‘visual awareness’ on the basis of pretest performance in the video-alone condition (listeners scoring above 55.6% correct were classed as ‘visually aware’). Within the subgroup of 17 listeners with ‘visual awareness’, posttest performance was at a similar level for those trained auditorily and those trained audiovisually. This suggests that even those learners who did initially make use of visual cues did not benefit more from training with audiovisual stimuli.
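As an illustration only, the ‘visual awareness’ grouping described above can be sketched as a simple threshold split on pretest video-alone scores. The 55.6% cut-off comes from the text; the function name, listener IDs and scores are hypothetical and are not data from the study.

```python
# Cut-off from the text: listeners scoring above 55.6% correct in the
# video-alone pretest are classed as 'visually aware'.
VISUAL_AWARE_THRESHOLD = 55.6  # percent correct


def split_by_visual_awareness(pretest_video_scores):
    """Split listeners into (aware, not_aware) lists of IDs.

    pretest_video_scores maps a listener ID to the percent-correct score
    in the video-alone pretest. The comparison is strict (> threshold).
    """
    aware = sorted(
        listener for listener, score in pretest_video_scores.items()
        if score > VISUAL_AWARE_THRESHOLD
    )
    not_aware = sorted(
        listener for listener, score in pretest_video_scores.items()
        if score <= VISUAL_AWARE_THRESHOLD
    )
    return aware, not_aware


# Hypothetical scores, for illustration only.
example_scores = {"s01": 61.1, "s02": 50.0, "s03": 55.6, "s04": 72.2}
aware_ids, not_aware_ids = split_by_visual_awareness(example_scores)
```

Posttest performance could then be compared across training conditions within the ‘aware’ subgroup only.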

 

Figure 1: Pre- to post-training difference in correct identification of /l/-/r/ consonants in the three test conditions for the two learner groups who undertook training with either auditory or audiovisual stimuli.

 

 

For the /b/-/v/ contrast, 39 Japanese learners of English were trained: 21 in the audio condition and 18 in the audiovisual (natural face) condition. In this study, the group trained with audiovisual stimuli improved significantly more than the group trained with auditory stimuli. Perception of lipread stimuli improved less than perception of auditory or AV stimuli for both groups.

 

 

Figure 2: Pre- to post-training difference in correct identification of /b/-/v/ consonants in the three test conditions for the two learner groups who undertook training with either auditory or audiovisual stimuli. Scores are converted to d-prime (d′).
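For readers unfamiliar with d-prime, the following is a minimal sketch of one common way to convert identification counts into d′ for a two-category task. It assumes that one category (here /b/) is treated as the ‘signal’ and that a log-linear correction is applied so that perfect scores do not give infinite values; the report does not state which procedure was actually used, and the function name and counts are hypothetical.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Return d' = z(hit rate) - z(false-alarm rate).

    hits / misses: responses to /b/ stimuli ("b" vs "v" answers).
    false_alarms / correct_rejections: responses to /v/ stimuli
    ("b" vs "v" answers).
    A log-linear correction (add 0.5 to each count, 1 to each total)
    keeps both rates strictly between 0 and 1.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)


# Hypothetical pretest and posttest counts for one listener.
pre = d_prime(hits=30, misses=20, false_alarms=22, correct_rejections=28)
post = d_prime(hits=42, misses=8, false_alarms=10, correct_rejections=40)
improvement = post - pre  # positive when sensitivity increased
```

Unlike percent correct, d′ separates sensitivity to the contrast from any bias towards one response category, which is one reason it is often preferred for two-alternative identification data.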

 

In summary, we showed that our training technique led to a significant improvement in the perception of non-native phonemic contrasts over a relatively short training period. Audiovisual training was more effective than auditory training only for a sound contrast that is highly visible for native listeners.

 
