UCL Phonetics and Linguistics

Enhanced Language Modelling through improved lexical,
grammatical and syntactical labelling

A research project funded under the EPSRC Human Factors programme


Administrative Details

Grant Period:

April 1998 - March 2001

Grant Award:

£172,000

Grant Reference:

GR/L81406

Investigators:

Mark Huckvale
Alex Fang

Overview

Language Modelling

A statistical language model describes probabilistically the constraints on word order found in language: typical word sequences are assigned high probabilities, while atypical ones are assigned low probabilities. Such models may be evaluated by measuring the probability they predict for unseen test utterances: models that assign a high average word probability (equivalently, low novelty, low entropy or low perplexity) are considered superior. Perplexity is the measure most commonly used to express the ‘goodness’ of such a model.
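To make the evaluation measure concrete, here is a minimal sketch (our illustration, not part of the original project materials) of how perplexity is computed from the per-word probabilities a model assigns to an unseen test utterance; the probability values are hypothetical:

```python
import math

def perplexity(word_probs):
    """Perplexity of a test sequence given the per-word probabilities
    assigned by a model: PP = 2 ** (-(1/N) * sum(log2 p))."""
    n = len(word_probs)
    total_log2 = sum(math.log2(p) for p in word_probs)
    return 2 ** (-total_log2 / n)

# Hypothetical per-word probabilities for a short test utterance.
probs = [0.2, 0.05, 0.1, 0.3]
print(f"perplexity = {perplexity(probs):.1f}")
```

A model that assigned higher probabilities to the same words would yield a lower perplexity, which is the sense in which low perplexity indicates a better model.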
The most widely used statistical model of language is the trigram model, in which an estimate of the likelihood of a word is made solely on the identity of the preceding two words in the utterance. The strengths of the trigram model come from its success at capturing local constraints, the ease with which it may be constructed from text corpora, and its computational efficiency in use. Such a model has been at the heart of speech recognition systems since the pioneering work at IBM in the 1980s. Trigram models have also been applied in optical character recognition and in machine translation.
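As an illustration of the basic idea (a sketch over a toy example of our own, not project code), a trigram model can be trained by counting, with the probability of a word estimated from the counts of its two-word history:

```python
from collections import defaultdict

def train_trigram(tokens):
    """Collect trigram counts and counts of their two-word histories."""
    tri_counts = defaultdict(int)
    bi_counts = defaultdict(int)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        tri_counts[(w1, w2, w3)] += 1
        bi_counts[(w1, w2)] += 1
    return tri_counts, bi_counts

def trigram_prob(tri_counts, bi_counts, w1, w2, w3):
    """Maximum-likelihood estimate P(w3 | w1, w2) = C(w1 w2 w3) / C(w1 w2)."""
    history = bi_counts.get((w1, w2), 0)
    return tri_counts.get((w1, w2, w3), 0) / history if history else 0.0

# Hypothetical toy corpus.
tokens = "the cat sat on the mat and the cat ate the fish".split()
tri, bi = train_trigram(tokens)
print(trigram_prob(tri, bi, "the", "cat", "sat"))  # 0.5 on this toy data
```

The maximum-likelihood estimate is exactly what the smoothing methods discussed below have to repair, since any trigram unseen in training receives probability zero.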

Criticism of current language modelling

Empirical evidence shows that the predicted likelihoods of trigrams that occur 5 or more times in the training data are well estimated (Church & Gale, 1991), but that the likelihoods of rare or unseen trigrams are not. This problem has mostly been addressed through the development of mathematical methods for estimating the likelihoods of unseen events: via interpolation between trigram, bigram, unigram and uniform distributions (e.g. Bahl et al, 1983), or via discounting the probabilities arising from rare events (e.g. Katz, 1987). While such smoothing leads to improvements on average, it does so by destroying structure in particular cases: for example, blending n-grams of different sizes through interpolation raises the likelihood both of unseen data (good) and of principled gaps (bad). However, our greatest criticism is of the treatment of words as empty symbols, devoid of linguistic function. Words form classes of meaning and function which also affect their collocational probabilities. Indeed, this observation has led to a number of studies of language models based on part-of-speech classifications, or the use of part-of-speech models for smoothing word models (e.g. Niesler & Woodland, 1996). However, a model based on a few dozen equivalence classes goes to the other extreme of simplicity: the tag sets used were not designed for language modelling purposes.
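The interpolation approach referred to above can be illustrated with a minimal sketch (hypothetical weights and probabilities, not the project's implementation): lower-order and uniform estimates are blended with the trigram estimate so that no event receives zero probability:

```python
def interpolated_prob(p_tri, p_bi, p_uni, vocab_size,
                      lambdas=(0.5, 0.3, 0.15, 0.05)):
    """Linear interpolation of trigram, bigram, unigram and uniform
    estimates (cf. Bahl et al, 1983). The lambda weights here are
    hypothetical; in practice they are estimated on held-out data."""
    l3, l2, l1, l0 = lambdas
    return l3 * p_tri + l2 * p_bi + l1 * p_uni + l0 / vocab_size

# An unseen trigram (maximum-likelihood estimate zero) still receives
# probability mass from the lower-order components, whether the gap in
# the training data was accidental or principled.
print(interpolated_prob(p_tri=0.0, p_bi=0.01, p_uni=0.001, vocab_size=20000))
```

The final comment makes the criticism above explicit: the same mechanism that rescues genuinely unseen but possible sequences also assigns probability to sequences the grammar rules out.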
A second criticism of the trigram model is that it fails to capture dependencies longer than the two-word history. The use of equivalence classes can help here, by choosing longer n-grams when data permits (Bahl et al, 1989), but again the real issue is how the local context functions linguistically within the utterance. Recent approaches to this problem have included the use of an NLP parser ‘state’ as conditioning information (Goddeau and Zue, 1992) and the exploitation of so-called ‘link’ grammars (Lafferty et al, 1992). We believe the exploitation of utterance roles is a key development area in language modelling, and it is an important part of our proposal. This view is supported by Salim Roukos of IBM, who states in a review of the state of the art in language modelling (1996): "A concerted research effort to explore structure-based language model[s] may be the key for a significant progress in language modeling".

Our project

We acknowledge a continuing need for statistical language modelling of the n-gram form. We appreciate that mathematical methods for smoothing observed counts will always be required. On the other hand, expertise in descriptive linguistics, tagging and parsing could transform the effectiveness of this all-too-simple basic idea.
We give three areas where the application of linguistic knowledge of the kind employed in our current and previous NLP activities could lead to improvements on the basic trigram approach:
(i) the incorporation of a robust morphological analysis, whereby different word forms can be recognised as belonging to the same lemma, and the extension of this to idioms and common collocations;
(ii) the use of a wordclass analysis that groups words into equivalence classes appropriate for statistical language modelling, for example on verb transitivity or on noun semantic class;
(iii) the capture of longer-distance dependencies of word association within the trigram approach, through the use of syntactic analysis at the phrase level as supplied from information used in the Survey parser.
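To make area (ii) concrete, the following sketch (our illustration with hypothetical classes and probabilities, not the project's implementation) shows one standard form of class-based estimate, in which a word is predicted from its equivalence class and the class from the classes of the preceding words:

```python
def class_trigram_prob(word, cls, class_history,
                       p_word_given_class, p_class_given_history):
    """Class-based estimate P(w | history) ~ P(w | c(w)) * P(c(w) | c(history)),
    where c(.) maps words to equivalence classes such as wordclass tags."""
    return (p_word_given_class.get((word, cls), 0.0)
            * p_class_given_history.get((cls,) + tuple(class_history), 0.0))

# Hypothetical distributions over a tiny tag set that distinguishes
# transitive from intransitive verbs.
p_w_c = {("ate", "VERB_TRANS"): 0.02, ("slept", "VERB_INTRANS"): 0.03}
p_c_h = {("VERB_TRANS", "DET", "NOUN"): 0.4,
         ("VERB_INTRANS", "DET", "NOUN"): 0.3}

# P("ate" | "the cat") estimated via the class sequence DET NOUN -> VERB_TRANS.
print(class_trigram_prob("ate", "VERB_TRANS", ["DET", "NOUN"], p_w_c, p_c_h))
```

Morphological analysis (area i) plays a complementary role in such a scheme, since mapping inflected forms onto a shared lemma pools their counts and so improves the estimates for rare word forms.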

Objectives

Progress

April 1998

Project starts

July 1998

Computer System installed. British National Corpus and other corpora installed.
Using CMU language modelling toolkit. Using Abbot large-vocabulary recogniser to generate lattices.
Investigation into methods for selecting lexica of various sizes.

October 1998

Reference recordings for word accuracy evaluations complete: 100 random sentences from LOB corpus.
Dispersion-based approach to reduce out-of-vocabulary (OOV) rates (see the sketch after this list).
Morphological analyser prototype.
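For illustration only (the word lists and corpus here are hypothetical, and this is not the project's selection method), out-of-vocabulary rates for candidate lexica of different sizes can be measured as follows:

```python
from collections import Counter

def oov_rate(test_tokens, lexicon):
    """Fraction of test tokens not covered by the lexicon."""
    missing = sum(1 for w in test_tokens if w not in lexicon)
    return missing / len(test_tokens)

def frequency_lexicon(train_tokens, size):
    """Select the `size` most frequent word forms from training text.
    A dispersion-based selection would additionally weight how evenly
    a word is spread across the texts of the corpus, rather than
    relying on raw frequency alone."""
    return {w for w, _ in Counter(train_tokens).most_common(size)}

# Hypothetical training and test material.
train = "the cat sat on the mat and the dog sat on the rug".split()
test = "the dog sat on the sofa".split()
for size in (3, 5, 8):
    lexicon = frequency_lexicon(train, size)
    print(size, f"{oov_rate(test, lexicon):.2f}")
```

Enlarging the lexicon lowers the OOV rate but enlarges the recognition search space, which is why selection methods beyond raw frequency are of interest.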

January 1999

UCL language modelling tools version 1: lattice decoder using ARPA format language models.
Morphological analysis of BNC complete.
Split BNC into training and testing data.
Language models generated from BNC for different lexica.
Phonetically-sensitive morphological analysis work started: morphological decomposition performed in a way that is compatible with conventional acoustic modelling of word pronunciations.

Recent Progress

Feb 99: Second set of reference recordings: 100 random sentences from BNC corpus. Baseline word lattice and recognition statistics obtained with Abbot from BNC test data.

March 99: (1) Models generated from morphologically analysed BNC training data, (2) Pronunciation dictionaries of various sizes produced for morphological word models, (3) Word lattice and recognition statistics obtained from morphological word models with Abbot.


Related Issues

If you would like to know more about the project, or have ideas for original contributions that you could make to the project, please contact Mark Huckvale.


Author: Mark Huckvale. Last Changed: 1 March 1999