
About 37% of word-tokens are nouns

Richard Hudson

last changed 27 September 2007

Bibliographical information

Language 70, pp 331-339, 1994

Introductory paragraph

The title of this note is a generalization that may strike readers as a joke. However, it turns out to be true (with some systematic variations) of any reasonably large body of written English, and can be matched by other generalizations about other word-classes, genres, and languages. I am as reluctant as any linguist could be to believe it; after all, the choice of word-classes in a text depends on a myriad of variable influences, from the message conveyed to the style of the author. Moreover, linguists have talked about ‘nominal’ and ‘verbal’ styles for some time (since Wells 1960), implying that the relative balance between nouns and verbs is a major source of variation among texts. More recent work on large corpora would appear to support the expectation of major differences, as shown by the counts for major word-classes in the Brown and LOB corpora (e.g. Johansson & Hofland 1989:16). And yet the facts turn out to be otherwise and (in my opinion) far more interesting because they cry out for an explanation.

Table 1 shows some basic figures for the word-classes in the Brown and LOB corpora, based on the reported figures for grammatical ‘word-tags’ in Francis & Kučera 1982 and Johansson & Hofland 1989. These overall figures are remarkably similar, though the differences between them are still statistically significant. Even small percentage differences have to be taken very seriously when one is dealing with tens or hundreds of thousands of cases (chi-square = 1,531, a difference which is virtually impossible simply by chance). But even if we can’t ignore the differences, the similarities are sufficiently striking to suggest an underlying constancy. So far as I know, this particular constancy has not been noted before in published discussions of corpus statistics.
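
The point about sample size can be illustrated with a minimal sketch. The counts below are invented purely for illustration (they are not the Brown or LOB figures), and the helper names `chi_square` and `scaled` are hypothetical. The sketch computes the Pearson chi-square statistic for a two-by-two table in which nouns make up 37% of one sample and 38% of the other, first with 1,000 tokens per sample and then with a million.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if word-class were independent of corpus
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat


def scaled(table, factor):
    """Multiply every cell by the same factor, keeping the percentages fixed."""
    return [[cell * factor for cell in row] for row in table]


# HYPOTHETICAL counts, not corpus data: [nouns, other tokens] per sample.
# Sample A is 37% nouns, sample B is 38% nouns.
small = [[370, 630], [380, 620]]
large = scaled(small, 1000)  # same percentages, a million tokens per sample

small_stat = chi_square(small)
large_stat = chi_square(large)

# 3.841 is the standard 5% critical value of chi-square with 1 degree of freedom.
CRITICAL_5_PERCENT_DF1 = 3.841

print(f"1,000 tokens each:     chi-square = {small_stat:.3f}")
print(f"1,000,000 tokens each: chi-square = {large_stat:.3f}")
print("small sample significant:", small_stat > CRITICAL_5_PERCENT_DF1)
print("large sample significant:", large_stat > CRITICAL_5_PERCENT_DF1)
```

Because the statistic grows in proportion to the sample size when the percentages are held fixed, the same one-point difference that is unremarkable in a thousand tokens becomes overwhelmingly significant in a million.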