PRuler - Pronunciation Rule Development Tool

The PRuler tool is designed to aid the development of letter-to-sound rules for English. The program allows sets of pronunciation rules to be compiled and tested against a database of common English words.

PRuler versions
Program: Vs1.2
Rule file: Vs1.01
Word file: Vs1.1

Contents:
¤Overview
¤Operation
¤Rule Format
¤SAMPA
¤REP Project
¤Feedback
¤Copyright

Internet links:
¤Regular English
¤UCL Phonetics & Linguistics


Overview

The PRuler program aids the development of sets of "Letter-to-Sound" rules for English. These are rules for the conversion of English spelling to English phonological pronunciation. For example, using such rules we can establish that the pronunciation of the word spelled "cat" will be /k{t/ (all pronunciation is specified in SAMPA format). Each rule maps a sequence of letters into a sequence of phonological symbols. PRuler manages sets of such rules and allows the user to evaluate how the rules work on a standard body of words and to evaluate the performance of the rule set against current English pronunciation.


Operation

Menu Operation

File menu
New: Starts a new empty set of rules.
Open: Opens an existing file containing a set of rules saved by the program.
Save: Saves the current set of rules back into its original file, overwriting the old contents.
Save As: Saves the current set of rules into a new file.
Save As HTML: Saves a report of the current set of rules into a file suitable for displaying on a web page (in the Hypertext Markup Language, HTML).
Properties: Edits the title, version, author and comments associated with a set of rules. These properties are saved with the rules and reported in the HTML format.
Print: Prints a report of the current set of rules.
Print Preview: Displays a preview of the printed report on the screen.
Print Setup: Allows you to set options on the printer.
Exit: Exits the program. If any rules have been changed you will be asked if you want to save them.

Rule menu
New: Displays the rule editing dialog box for a new rule to be typed in. The rule will be inserted in the rule list below the current cursor position (first selected rule) in the list.
Edit: Displays the rule editing dialog box for the first currently selected rule in the rule list. Changes to the rule will overwrite the current entry in the list.
Delete: Deletes the first currently selected rule in the rule list.
Apply All: Applies all rules to all words, updating all statistics and records. This is normally performed automatically after any rule change.
Transcribe Word: Displays the transcribe word dialog box so that rules can be tested against words typed in by the user. If the word is also in the standard list, then each firing rule is also marked against whether it delivers standard pronunciation.
Rule Statistics: Displays statistics on phoneme use in the standard and the generated dictionary.
Redundancy Check: Tests whether each rule is actually required, by temporarily deleting it and seeing if the final dictionary changes as a consequence. Can take a few minutes to run.

View menu
Display All: Displays all words from the standard list in the word list.
Display Standard: Displays only words from the standard list that are assigned standard pronunciation by the current set of rules.
Display Nonstandard: Displays only words from the standard list that are not assigned standard pronunciation by the current set of rules.
Sort by Word: Sorts the word list alphabetically by word spelling.
Sort by Text Probability: Sorts the word list by decreasing word frequency in written text material.
Sort by Speech Probability: Sorts the word list by decreasing word frequency in transcribed spoken text material.
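
As a rough illustration of the Redundancy Check option in the Rule menu, the loop below removes each rule in turn and compares the regenerated dictionary with the original. This is only a sketch and not PRuler's own code: the names redundant_rules and toy_transcribe, and the simple (letter, phoneme) rule pairs, are hypothetical stand-ins for the real rule machine.

```python
def redundant_rules(rules, words, transcribe):
    """Return the rules whose removal leaves every transcription unchanged."""
    baseline = {word: transcribe(rules, word) for word in words}
    redundant = []
    for i, rule in enumerate(rules):
        trial = rules[:i] + rules[i + 1:]   # rule set with one rule left out
        if all(transcribe(trial, word) == baseline[word] for word in words):
            redundant.append(rule)
    return redundant


def toy_transcribe(rules, word):
    """Hypothetical stand-in: the first rule listed for a letter wins."""
    out = []
    for letter in word:
        for source, target in rules:
            if source == letter:
                out.append(target)
                break
        else:
            out.append(letter)              # no rule fired for this letter
    return "".join(out)


# Invented toy rule set: the second rule for "C" can never fire.
rules = [("C", "k"), ("A", "{"), ("T", "t"), ("C", "s")]
words = ["CAT", "ACT"]
print(redundant_rules(rules, words, toy_transcribe))
```

Every rule needs one full pass over the word list, which is consistent with the check taking a few minutes. Each rule is tested against the otherwise complete rule set, so two rules that are both reported as redundant cannot necessarily both be deleted.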

Rule List

The rule list window shows a summary of each rule in the set. See the Rule Format section below for details of the format and how the rules work. The first entry in the list is always "[All Rules]". You can select one or more rules by clicking on them so that they become highlighted. To select more than one rule, hold down the [Ctrl] key while clicking with the mouse. You can edit an existing rule by double-clicking on it. You can delete a rule by selecting it and choosing the Rule/Delete menu option. You can insert a new rule below the currently selected rule by choosing the Rule/New menu option.

Alongside each rule in the list is a percentage performance figure. This is calculated from the standard word list and the table of text frequencies. The percentage expresses the (frequency-weighted) fraction of text words covered by the rule which have standard pronunciation.
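
The performance figure can be pictured with a small calculation. The sketch below assumes that each word covered by a rule is available as a record with a text frequency, a standard pronunciation and a rule-generated pronunciation; the function name, the field names and the frequencies are invented for illustration and are not taken from the PRuler word file.

```python
def rule_performance(covered_words):
    """Frequency-weighted percentage of covered words pronounced as standard."""
    total = sum(w["text_freq"] for w in covered_words)
    if total == 0:
        return None                          # the rule fires on no words
    correct = sum(w["text_freq"] for w in covered_words
                  if w["generated"] == w["standard"])
    return 100.0 * correct / total


# Invented figures for a hypothetical rule mapping C to /k/.
fired = [
    {"word": "CAT", "text_freq": 30, "standard": "k{t", "generated": "k{t"},
    {"word": "CELL", "text_freq": 10, "standard": "sel", "generated": "kel"},
]
print(rule_performance(fired))               # 30 of 40 weighted tokens: 75.0
```

Because the figure is weighted by frequency, one very common word with a non-standard result lowers the percentage more than several rare ones.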

When one or more rules are selected, they act as a "filter" on the word list. Only words which cause the selected rule or rules to fire are displayed. You can choose menu options to control whether you see all words that fire the rule, or only words for which the rule generates a standard pronunciation, or only words for which the rule generates a non-standard pronunciation.

Word List

The word list window shows words drawn from the standard table of English words. If no selections are made in the rule list window, or if "[All Rules]" is selected in the rule list window, then all words in the standard table may be displayed. Menu options control whether all words are displayed, or only those with standard pronunciations, or only those with non-standard pronunciations.

The word list actually shows the morphologically decomposed word, then the standard pronunciation, then (if non-standard) the rule pronunciation.

Single words in the word list may be selected by clicking with the mouse. When a word is selected, then the rules that are fired by this word are automatically highlighted in the rule list window. This in turn causes the list of words to be reduced to all those that share the same list of rules.


Rule Format and Application

Rule Machine

To understand the pronunciation rule format, it is useful to have a mental picture of the machine which applies rules to spelling to generate pronunciation. The figure below should be helpful in visualising this process:

The input to the process is the input letter sequence of the word. In fact this is commonly the morphologically analysed spelling as described below. Each rule in the rule list is taken in turn and in order. The centre part of the rule specifies which letters are to be mapped to a sequence of phonetic units. The left and right parts of the rules constrain the application of the rule to particular contexts. For the rule to fire, the left and right contexts must match exactly. The left and right contexts can be specified in terms of adjacent letters, adjacent phonetic units, or in terms of special meta-characters standing for 'consonant', 'vowel', or 'boundary'. For example the rule:

    A / B \ C => d
    

should be read as: the letter B should be changed to the phoneme d whenever it is found between the letters A and C. While the rule

    f \ S \ ) => s
    

should be read as: the letter S should be changed to the phoneme s when it is found immediately after the phoneme f and immediately before a word boundary.

A rule can convert multiple input letters into multiple output phonemes. For example, the rule

    # / TION \ # => S@n
    

should be read as: the letter string TION occurring between morphological boundary markers should be changed to the phoneme string S@n.
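
One way to picture the rule machine is as a loop over a working sequence in which letters are progressively replaced by phonemes. The sketch below is an interpretation of the description above, not PRuler's implementation. The Rule class, the token representation and the function names are hypothetical, the vowel letters are assumed to be A, E, I, O and U, and meta-characters in phoneme contexts are left out for brevity.

```python
from dataclasses import dataclass

VOWEL_LETTERS = set("AEIOU")   # assumption: Y counts as a consonant letter
BOUNDARIES = set("+-()")       # boundary symbols matched by "#"


@dataclass
class Rule:
    left: list            # context symbols immediately before the match
    left_is_phone: bool   # True if the left context refers to phonemes
    match: str            # letters to be converted (at least one)
    right: list           # context symbols immediately after the match
    right_is_phone: bool  # True if the right context refers to phonemes
    output: list          # phoneme symbols that replace the letters


def to_tokens(spelling):
    """Working tokens are (kind, symbol) pairs; "L" marks a letter."""
    return [("L", ch) for ch in spelling]


def symbol_matches(pattern, token, want_phone):
    kind, symbol = token
    if (kind == "P") != want_phone:
        return False               # letter context never matches a phoneme
    if pattern == symbol:
        return True
    if want_phone:
        return False               # phoneme meta-characters are not sketched
    if pattern == "#":
        return symbol in BOUNDARIES
    if pattern == ".":
        return symbol in VOWEL_LETTERS
    if pattern == "^":
        return symbol.isalpha() and symbol not in VOWEL_LETTERS
    return False


def fires_at(rule, tokens, i):
    end = i + len(rule.match)
    start = i - len(rule.left)
    stop = end + len(rule.right)
    if start < 0 or stop > len(tokens):
        return False
    if tokens[i:end] != to_tokens(rule.match):
        return False
    left_ok = all(symbol_matches(p, t, rule.left_is_phone)
                  for p, t in zip(rule.left, tokens[start:i]))
    right_ok = all(symbol_matches(p, t, rule.right_is_phone)
                   for p, t in zip(rule.right, tokens[end:stop]))
    return left_ok and right_ok


def apply_rules(rules, spelling):
    tokens = to_tokens(spelling)
    for rule in rules:                       # rules are taken in list order
        i = 0
        while i < len(tokens):
            if fires_at(rule, tokens, i):
                new = [("P", p) for p in rule.output]
                tokens[i:i + len(rule.match)] = new
                i += len(new)
            else:
                i += 1
    return tokens


def phonemes(tokens):
    return [symbol for kind, symbol in tokens if kind == "P"]


# The first example rule: B becomes d between the letters A and C.
abc_rule = Rule(["A"], False, "B", ["C"], False, ["d"])
print(apply_rules([abc_rule], "ABC"))

# A toy chain: F becomes f everywhere, then S becomes s after the phoneme f
# and before the word-end boundary. "(AFS)" is an invented test string.
chain = [Rule([], False, "F", [], False, ["f"]),
         Rule(["f"], True, "S", [")"], False, ["s"])]
print(phonemes(apply_rules(chain, "(AFS)")))
```

Applying each rule across the whole word before moving to the next rule is what lets a later rule refer to phonemes produced by an earlier one, on either side of the letters it matches.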

The dialog box for editing rules should now make sense:

Rule Format Definition

The grammar of a rule is as follows:

    RULE ::= LEFT-CONTEXT MATCH RIGHT-CONTEXT "=" OUTPUT
    MATCH ::= LETTER { LETTER }
    OUTPUT ::= { PHONEME }
    LEFT-CONTEXT ::= LEFT-ALPHA-CONTEXT "/" | LEFT-PHONE-CONTEXT "\"
    RIGHT-CONTEXT ::= "\" RIGHT-ALPHA-CONTEXT | "/" RIGHT-PHONE-CONTEXT
    LEFT-ALPHA-CONTEXT ::= { LETTER | META-ALPHA }
    RIGHT-ALPHA-CONTEXT ::= { LETTER | META-ALPHA }
    LEFT-PHONE-CONTEXT ::= { PHONEME | META-PHONE }
    RIGHT-PHONE-CONTEXT ::= { PHONEME | META-PHONE }
    LETTER ::= "A"|"B"|"C"|"D"|"E"|"F"|"G"|"H"|"I"|"J"|"K"|"L"|"M"|
               "N"|"O"|"P"|"Q"|"R"|"S"|"T"|"U"|"V"|"W"|"X"|"Y"|"Z"|
               "-"|"+"|"("|")"|"'"
    PHONEME ::= "p"|"t"|"k"|"b"|"d"|"g"|"f"|"v"|"T"|"D"|"s"|"z"|"S"|"Z"|"h"|
                "tS"|"dZ"|"l"|"r"|"w"|"j"|"m"|"n"|"N"|
                "i"|"I"|"e"|"{"|"V"|"A"|"Q"|"O"|"U"|"u"|"3"|"@"|
                "eI"|"aI"|"OI"|"@U"|"aU"|"e@"|"I@"|"U@"|
                "R"|"-s"|"-d"
    META-ALPHA ::= "^"|"#"|"."
    META-PHONE ::= "^"|"."
    

The meanings of the special symbols are as follows:

    Symbol   Meaning
    +        Morphological prefix boundary
    -        Morphological suffix boundary
    (        Word start boundary
    )        Word end boundary
    ^        Match consonant
    .        Match vowel
    #        Match any boundary symbol
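
To make the grammar concrete, here is a minimal parser sketch for a single rule written as text. It is an assumption-laden illustration rather than the program's own parser: parse_rule and split_symbols are hypothetical names, spaces are assumed to be insignificant, and both "=" (as in the grammar) and "=>" (as in the examples) are accepted before the output.

```python
LETTERS = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ-+()'")
PHONEMES = {
    "p", "t", "k", "b", "d", "g", "f", "v", "T", "D", "s", "z", "S", "Z", "h",
    "tS", "dZ", "l", "r", "w", "j", "m", "n", "N",
    "i", "I", "e", "{", "V", "A", "Q", "O", "U", "u", "3", "@",
    "eI", "aI", "OI", "@U", "aU", "e@", "I@", "U@",
    "R", "-s", "-d",
}
META_ALPHA = {"^", "#", "."}
META_PHONE = {"^", "."}


def split_symbols(text, phone):
    """Split a context or output string into its individual symbols."""
    allowed = (PHONEMES | META_PHONE) if phone else (LETTERS | META_ALPHA)
    text = text.replace(" ", "")
    symbols = []
    i = 0
    while i < len(text):
        # Try a two-character symbol first so "tS" is not read as "t" + "S".
        for size in (2, 1):
            piece = text[i:i + size]
            if piece in allowed:
                break
        else:
            raise ValueError(f"unknown symbol at {text[i:]!r}")
        symbols.append(piece)
        i += len(piece)
    return symbols


def parse_rule(text):
    lhs, equals, rhs = text.partition("=")
    if not equals:
        raise ValueError("rule has no '=' before the output")
    rhs = rhs.lstrip(">")                     # tolerate the "=>" spelling
    seps = [i for i, ch in enumerate(lhs) if ch in "/\\"]
    if len(seps) != 2:
        raise ValueError("expected exactly two context separators")
    a, b = seps
    left_is_phone = lhs[a] == "\\"            # LEFT-PHONE-CONTEXT "\"
    right_is_phone = lhs[b] == "/"            # "/" RIGHT-PHONE-CONTEXT
    match = lhs[a + 1:b].replace(" ", "")
    if not match or any(ch not in LETTERS for ch in match):
        raise ValueError("MATCH must be one or more letters")
    return {
        "left": split_symbols(lhs[:a], left_is_phone),
        "left_is_phone": left_is_phone,
        "match": match,
        "right": split_symbols(lhs[b + 1:], right_is_phone),
        "right_is_phone": right_is_phone,
        "output": split_symbols(rhs, True),
    }


print(parse_rule(r"A / B \ C => d"))
print(parse_rule(r"f \ S \ ) => s"))
```

The separator chosen on each side decides whether that context is read as letters or as phonemes, which is why the same character can be a letter in one rule and a phoneme in another.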

The special phoneme symbol /R/ is equivalent to an /r/ only when it is followed by a vowel. It can be used to simplify rules for /r/ and to indicate the possibility of a linking /r/.

The special phoneme symbol /-s/ stands for the plural marker. It is mapped to /s/, /z/ or to /Iz/ as appropriate.

The special phoneme symbol /-d/ stands for the past tense marker. It is mapped to /t/, /d/ or to /Id/ as appropriate.
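
The text says only that /-s/ and /-d/ are mapped "as appropriate". The sketch below fills that in with the usual textbook conditions for the regular English plural and past tense (the /I/ form after sibilants or after /t, d/, the voiceless form after other voiceless consonants, the voiced form elsewhere), and it drops /R/ when no vowel follows. Those conditions, the symbol sets and the name realise are assumptions about what PRuler does, not a description of its code.

```python
SIBILANTS = {"s", "z", "S", "Z", "tS", "dZ"}
VOICELESS = {"p", "t", "k", "f", "T", "s", "S", "tS", "h"}
VOWELS = {"i", "I", "e", "{", "V", "A", "Q", "O", "U", "u", "3", "@",
          "eI", "aI", "OI", "@U", "aU", "e@", "I@", "U@"}


def realise(symbols):
    """Replace the special symbols R, -s and -d by ordinary phonemes."""
    out = []
    for i, symbol in enumerate(symbols):
        previous = out[-1] if out else None
        following = symbols[i + 1] if i + 1 < len(symbols) else None
        if symbol == "-s":
            if previous in SIBILANTS:
                out += ["I", "z"]
            elif previous in VOICELESS:
                out.append("s")
            else:
                out.append("z")
        elif symbol == "-d":
            if previous in {"t", "d"}:
                out += ["I", "d"]
            elif previous in VOICELESS:
                out.append("t")
            else:
                out.append("d")
        elif symbol == "R":
            if following in VOWELS:
                out.append("r")     # assumed silent when no vowel follows
        else:
            out.append(symbol)
    return out


print(realise(["k", "{", "t", "-s"]))        # "cats"
print(realise(["h", "O", "s", "-s"]))        # "horses"
print(realise(["s", "t", "A", "R", "-s"]))   # "stars"
```

Linking /r/ across a word boundary is outside this sketch, since it only looks at the symbols of a single word.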

Morphological Analysis

Many pronunciation rules are easier to state if it is assumed that the spelling is first analysed into morphological components. A very simple system of morphological analysis has been designed for the PRuler program and standard word database. In this system, a limited number of prefixes are stripped from words and marked with a "+" symbol, and a limited number of suffixes are stripped from words and marked with a "-" symbol. Root forms are respelled only in the few cases where the addition of the suffix would have removed a final "E" or "Y".

Here are some examples:

    REDO -> RE+DO
    LETTERS -> LETTER-S
    MERELY -> MERE-LY
    FLIES -> FLY-S
    HAVING -> HAVE-ING
    BECOMING -> BE+COME-ING
    BEHAVING -> BEHAVE-ING
    HOLY -> HOLY
    WHOLLY -> WHOLE-LY
    

A definitive list of affixes will be produced in due course. Morphological markers are also useful to indicate compound words, e.g. FIRE-MAN.
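
The decomposed spellings can be taken apart mechanically once the "+" and "-" markers are present. The sketch below shows one way to do that and to add the word boundary symbols; split_morphemes and with_word_boundaries are hypothetical helpers, and it is an assumption that the rule machine sees the word wrapped in "(" and ")" in exactly this way.

```python
def split_morphemes(decomposed):
    """Split a decomposed spelling such as BE+COME-ING into its parts."""
    *prefixes, rest = decomposed.split("+")   # parts before each "+"
    stem, *suffixes = rest.split("-")         # parts after each "-"
    return {"prefixes": prefixes, "stem": stem, "suffixes": suffixes}


def with_word_boundaries(decomposed):
    """Add the word start and word end boundary symbols."""
    return "(" + decomposed + ")"


for form in ["RE+DO", "FLY-S", "BE+COME-ING", "HOLY"]:
    print(with_word_boundaries(form), split_morphemes(form))
```

Because compounds use the same "-" marker, a form such as FIRE-MAN would report MAN under "suffixes" here; telling the two cases apart would need the affix list mentioned above.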

The challenge is to decide on a morphological system that is simple to learn and yet relevant to the design of pronunciation rules. A particular problem is posed by Latinate affixes, which are not so readily separable. At the moment we have decided not to mark these; they are both difficult to understand and have complex pronunciation rules. For example, it does not seem useful to divide RELATION into RELATE-TION.

Stress Assignment

The assignment of a stress pattern to lexical items in English is known to be very complicated, and the project has not yet addressed how this should be achieved. In particular, we have not yet accommodated the changes to vowel quality that occur when syllables become unstressed. As a result, the program will mark a pronunciation as incorrect even when the only error is that a full vowel is indicated for an unstressed syllable (for example "about" as /eIbaUt/), so the accuracy of the rule set is currently underestimated. This problem will be addressed later.


SAMPA - Phonetic alphabet for English

The SAMPA alphabet is a machine-readable phonetic alphabet developed by John Wells and others. More information can be found at the SAMPA web site.

Consonants

The standard English consonant system is traditionally considered to comprise 17 obstruents (6 plosives, 2 affricates and 9 fricatives) and 7 sonorants (3 nasals, 2 liquids and 2 semivowel glides).

With the exception of the fricative /h/, the obstruents are usually classified in pairs as "voiceless" and "voiced", although the presence or absence of periodicity in the signal resulting from laryngeal vibration is not a reliable feature distinguishing the two classes. They are better considered "fortis" (strong) and "lenis" (weak), with duration of constriction and intensity of the noise component signalling the distinction.

The six plosives are p b t d k g:

    Symbol   Word           Transcription
    p        pin            pIn
    b        bin            bIn
    t        tin            tIn
    d        din            dIn
    k        kin            kIn
    g        give           gIv

The "lenis" stops are most reliably voiced intervocalically; aspiration duration following the release in the fortis stops varies considerably with context, being practically absent following /s/, and varying with degree of stress syllable-initially.

The two phonemic affricates are tS and dZ:

    tS       chin           tSIn
    dZ       gin            dZIn

As with the lenis stop consonants, /dZ/ is most reliably voiced between vowels.

There are nine fricatives, f v T D s z S Z h:

    f        fin            fIn
    v        vim            vIm
    T        thin           TIn
    D        this           DIs
    s        sin            sIn
    z        zing           zIN
    S        shin           SIn
    Z        measure        "meZ@
    h        hit            hIt

Intervocalically the lenis fricatives are usually fully voiced, and they are often weakened to approximants (fricationless continuants) in unstressed position.

The sonorants are three nasals m n N, two liquids r l, and two sonorant glides w j:

    m        mock           mQk
    n        knock          nQk
    N        thing          TIN
    r        wrong          rQN
    l        long           lQN
    w        wasp           wQsp
    j        yacht          jQt
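
The consonant inventory above can be held in a small lookup table, which is convenient when a rule set needs to ask whether a symbol is an obstruent or a sonorant. The layout and the helper names below are illustrative assumptions; only the symbols and their classes come from the tables above.

```python
CONSONANT_CLASSES = {
    "plosive": ["p", "b", "t", "d", "k", "g"],
    "affricate": ["tS", "dZ"],
    "fricative": ["f", "v", "T", "D", "s", "z", "S", "Z", "h"],
    "nasal": ["m", "n", "N"],
    "liquid": ["r", "l"],
    "glide": ["w", "j"],
}
OBSTRUENT_CLASSES = {"plosive", "affricate", "fricative"}


def manner(symbol):
    """Return the class of a consonant symbol, or None if it is not listed."""
    for name, members in CONSONANT_CLASSES.items():
        if symbol in members:
            return name
    return None


def is_obstruent(symbol):
    return manner(symbol) in OBSTRUENT_CLASSES


obstruents = [s for members in CONSONANT_CLASSES.values()
              for s in members if is_obstruent(s)]
sonorants = [s for members in CONSONANT_CLASSES.values()
             for s in members if not is_obstruent(s)]
print(len(obstruents), len(sonorants))       # 17 obstruents, 7 sonorants
```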

Vowels

The English vowels fall into two classes, traditionally known as "short" and "long" but, owing to the contextual effect on duration of following "fortis" and "lenis" consonants (traditional "long" vowels preceding fortis consonants can be shorter than "short" vowels preceding lenis consonants), they are better described as "checked" (not occurring in a stressed syllable without a following consonant) and "free".

The checked vowels are I e { Q V U:

    I        pit            pIt
    e        pet            pet
    {        pat            p{t
    Q        pot            pQt
    V        cut            kVt
    U        put            pUt

There is a short central vowel, normally unstressed:

    @        another        @"nVD@

The free vowels comprise monophthongs and diphthongs, although no hard and fast line can be drawn between these categories. They can be placed in three groups according to their final quality: i eI aI OI, u @U aU, 3 A O I@ e@ U@. They are exemplified as follows:

    i        ease           iz
    eI       raise          reIz
    aI       rise           raIz
    OI       noise          nOIz

    u        lose           luz
    @U       nose           n@Uz
    aU       rouse          raUz

    3        furs           f3z
    A        stars          stAz
    O        cause          kOz
    I@       fears          fI@z
    e@       stairs         ste@z
    U@       cures          kjU@z

Origins of SAMPA

SAMPA (Speech Assessment Methods Phonetic Alphabet) is a machine-readable phonetic alphabet. It was originally developed under the ESPRIT project 1541, SAM (Speech Assessment Methods) in 1987-89 by an international group of phoneticians, and was applied in the first instance to the European Communities languages Danish, Dutch, English, French, German, and Italian (by 1989); later to Norwegian and Swedish (by 1992); and subsequently to Greek, Portuguese, and Spanish (1993). Under the BABEL project, it has now been extended to Bulgarian, Estonian, Hungarian, Polish, and Romanian (1996). Under the aegis of COCOSDA it is hoped to extend it to cover many other languages (and in principle all languages). Recent additions: Croatian, Russian, Slovenian.


Regular English Pronunciation Project

The Regular English Pronunciation Project aims to develop a new accent for English which is more logically connected to English spelling. The strategy is to develop a set of rules that will make the pronunciation of English more regular and which will make English easier for second-language learners. See the REP Project web site for more information.


Feedback

Please send suggestions for improvements and reports of program faults to M.Huckvale@ucl.ac.uk.


Copyright

Prior to release, all data associated with the Regular English project remains the intellectual property of Mark Huckvale (© 2002 Mark Huckvale University College London).


© 2002 Mark Huckvale University College London April 2002