SCRIBE - Spoken Corpus of British English

The SCRIBE project was a one-year pilot project that investigated the construction of a corpus of spoken British English. The project ran in 1989/90 and was funded by the UK Department of Trade and Industry and the UK Science and Engineering Research Council (DTI/IED Project 1643). Partners in the project were University College London, Cambridge University, Edinburgh University, the Speech Research Unit, and the National Physical Laboratory. Although the project only ran for one year, and further funding was not provided to build a substantial corpus, a prototype corpus was collected and partially annotated. This page describes the current status of the outputs of the project and allows a sample of annotated recordings to be downloaded.

Overview of pilot corpus

The material consists of a mixture of read speech and spontaneous speech. The read speech material consists of a two-minute continuous passage and sentences selected from two sets: 200 'phonetically rich' sentences (SET-A) and 460 'phonetically compact' sentences (SET-B). The 'phonetically rich' sentences were designed at CSTR to be phonetically balanced. The 'phonetically compact' sentences were based on a British version of the MIT compact sentences (as in TIMIT), expanded to include relevant RP contrasts (the set contains at least one example of every possible triphone in British English). The passage was designed at UCL to contain accent-sensitive material. The spontaneous speech material was collected in a constrained 'free speech' situation in which a talker gave a verbal description of a picture.

The recordings were divided between a 'many talker' set and a 'few talker' set. In the 'many talker' set, each speaker recorded ten sentences from the 'phonetically rich' sentences and ten sentences from the 'phonetically compact' sentences. Each speaker in the 'many talker' set also recorded the accent diagnostic passage. In the 'few talker' set, each speaker recorded 100 sentences from the 'phonetically rich' set and 100 from the 'phonetically compact' set.

Speakers were recruited from four 'dialect areas': South East (DR1), Glasgow (DR2), Leeds (DR3) and Birmingham (DR4). The aim was to recruit 5 male and 5 female speakers from each dialect area for the few-talker sub-corpus, and 20 male and 20 female speakers from each dialect area for the many-talker sub-corpus. In practice, these recruitment targets were not fully met.
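To give a sense of the scale that was planned, the recruitment targets and per-speaker recording loads quoted above can be multiplied out. The short Python sketch below does only that arithmetic. The function and variable names are illustrative, the figures are the design targets rather than what was actually recorded, and the few-talker passage count is set to zero only because the description above does not mention a passage for that set.

```python
# Planned (target) size of the SCRIBE pilot corpus, using only the figures
# quoted in the overview. These are targets, not the numbers achieved.
DIALECT_AREAS = {
    "DR1": "South East",
    "DR2": "Glasgow",
    "DR3": "Leeds",
    "DR4": "Birmingham",
}


def planned_totals(speakers_per_sex, set_a_per_speaker, set_b_per_speaker,
                   passages_per_speaker):
    """Multiply out the design: areas x 2 sexes x speakers x recordings."""
    speakers = len(DIALECT_AREAS) * 2 * speakers_per_sex
    return {
        "speakers": speakers,
        "sentences": speakers * (set_a_per_speaker + set_b_per_speaker),
        "passages": speakers * passages_per_speaker,
    }


# Few-talker set: 5 male + 5 female per area, 100 SET-A + 100 SET-B sentences.
# Assumption: no passage, since none is mentioned for this set.
few_talker = planned_totals(5, 100, 100, 0)

# Many-talker set: 20 male + 20 female per area, 10 SET-A + 10 SET-B
# sentences, plus the accent diagnostic passage.
many_talker = planned_totals(20, 10, 10, 1)

print("few-talker target: ", few_talker)
print("many-talker target:", many_talker)
```

On these targets the few-talker set would have had 40 speakers and 8,000 sentence recordings, and the many-talker set 160 speakers, 3,200 sentence recordings and 160 passage readings.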

The original aim of the project was to release the corpus as a collection of audio recordings with just orthographic transcription, but with a small percentage to be phonetically annotated in the style of the TIMIT corpus.

Status of the corpus

The available audio recordings and annotations were released on eleven CD-ROMs (labelled SCRIBE_0 to SCRIBE_11) in April 1992. These were originally distributed by the Speech Group at the National Physical Laboratory. After that group was closed down, the disks were passed to the MOD Speech Research Unit at Malvern, which in turn passed them on to a private contractor (who kept them in his garage). The Speech Research Unit itself became part of 2020Speech Ltd in 2000. The current availability of the CD-ROMs is unknown. At UCL we have one complete set, which is labelled "Copyright © University of Cambridge, University of Edinburgh and University College London".

The main documentation on the CD-ROMs has been collated into a SCRIBE Manual which can be viewed online - see below.

Investigation of the annotated components of the corpus has revealed a number of file labelling and annotation alignment errors. Mark Huckvale has put a lot of effort into correcting these, and the corrected annotated sub-component of the corpus is now available for download. This is only a small subset of the entire corpus. It is made available on the understanding that ownership remains with the original producers, and that this material may not be sold or used in commercial products or services.

Downloads

  • The SCRIBE manual (400kB): an HTML-formatted collection of the SCRIBE documentation found on the CD-ROMs.
  • Sample of many-talker recordings (43MB): an archive of the audio and label files for all the many-talker recordings that were manually annotated. Contains six male speakers and one female speaker, with annotated passage, Set-A and Set-B sentences. View list of files.
  • Sample of few-talker recordings (27MB): an archive of the manually annotated audio and label files for a single speaker. Contains 200 annotated sentences. View list of files.