PRuler - Pronunciation Rule Development Tool

The PRuler tool is designed to aid the development of letter-to-sound rules for English. The program allows sets of pronunciation rules to be compiled and tested against a database of common English words.
Overview

The PRuler program aids the development of sets of "Letter-to-Sound" rules for English. These are rules for the conversion of English spelling to English phonological pronunciation. For example, using such rules we can establish that the pronunciation of the word spelled "cat" will be /k{t/ (all pronunciations are specified in SAMPA format). Each rule maps a sequence of letters into a sequence of phonological symbols. PRuler manages sets of such rules and allows the user to evaluate how the rules work on a standard body of words and to evaluate the performance of the rule set against current English pronunciation.
Operation

Menu Operation

Rule List

The rule list window shows a summary of each rule in the set. See the Rule Format section below for details of the format and how the rules work. The first entry in the list is always "[All Rules]". You can select one or more rules by clicking on them so that they become highlighted. To select more than one rule, hold down the [Ctrl] key while clicking with the mouse. You can edit an existing rule by double-clicking on it. You can delete a rule by selecting it and choosing the Rule/Delete menu option. You can insert a new rule below the currently selected rule by choosing the Rule/New menu option.

Alongside each rule in the list is a percentage performance figure. This is calculated from the standard word list and table of text frequencies. The percentage expresses the (frequency-weighted) fraction of text words covered by the rule which have standard pronunciation.

When one or more rules are selected, they act as a "filter" on the word list: only words which cause the selected rules to fire are displayed. Menu options control whether you see all words that fire the rule, only words for which the rule generates a standard pronunciation, or only words for which the rule generates a non-standard pronunciation.

Word List

The word list window shows words drawn from the standard table of English words. If no selections are made in the rule list window, or if "[All Rules]" is selected in the rule list window, then all words in the standard table may be displayed. Menu options control whether all words are displayed, or only those with standard or with non-standard pronunciations. The word list actually shows the morphologically decomposed word, then the standard pronunciation, then (if non-standard) the rule pronunciation.

Single words in the word list may be selected by clicking with the mouse. When a word is selected, the rules that are fired by this word are automatically highlighted in the rule list window. This in turn causes the list of words to be reduced to all those that share the same list of rules.

Rule Format and Application

Rule Machine

To understand the pronunciation rule format, it is useful to have a mental picture of the machine which applies rules to spelling to generate pronunciation. The figure below should be helpful in visualising this process:

[Figure: the rule machine]

The input to the process is the input letter sequence of the word. In fact this is commonly the morphologically analysed spelling as described below. Each rule in the rule list is taken in turn and in order. The centre part of the rule specifies which letters are to be mapped to a sequence of phonetic units. The left and right parts of the rule constrain the application of the rule to particular contexts. For the rule to fire, the left and right contexts must match exactly. The left and right contexts can be specified in terms of adjacent letters, adjacent phonetic units, or in terms of special meta-characters standing for 'consonant', 'vowel', or 'boundary'.

For example the rule:

    A / B \ C => d

should be read as: the letter B should be changed to the phoneme d whenever it is found between the letters A and C. While the rule

    f \ S \ ) => s

should be read as: the letter S should be changed to the phoneme s when it is found immediately after the phoneme f and immediately before a word boundary. A rule can convert multiple input letters into multiple output phonemes. For example, the rule

    # \ TION \ # => S@n

should be read as: the letter string TION occurring between morphological boundary markers should be changed to the phoneme string S@n.

The dialog box for editing rules should now make sense:

[Figure: the rule editing dialog box]

Rule Format Definition

The grammar of a rule is as follows:
RULE ::= LEFT-CONTEXT MATCH RIGHT-CONTEXT "=" OUTPUT
MATCH ::= LETTER { LETTER }
OUTPUT ::= { PHONEME }
LEFT-CONTEXT ::= LEFT-ALPHA-CONTEXT "/" | LEFT-PHONE-CONTEXT "\"
RIGHT-CONTEXT ::= "\" RIGHT-ALPHA-CONTEXT | "/" RIGHT-PHONE-CONTEXT
LEFT-ALPHA-CONTEXT ::= { LETTER | META-ALPHA }
RIGHT-ALPHA-CONTEXT ::= { LETTER | META-ALPHA }
LEFT-PHONE-CONTEXT ::= { PHONEME | META-PHONE }
RIGHT-PHONE-CONTEXT ::= { PHONEME | META-PHONE }
LETTER ::= "A"|"B"|"C"|"D"|"E"|"F"|"G"|"H"|"I"|"J"|"K"|"L"|"M"|
"N"|"O"|"P"|"Q"|"R"|"S"|"T"|"U"|"V"|"W"|"X"|"Y"|"Z"|
"-"|"+"|"("|")"|"'"
PHONEME ::= "p"|"t"|"k"|"b"|"d"|"g"|"f"|"v"|"T"|"D"|"s"|"z"|"S"|"Z"|"h"|
"tS"|"dZ"|"l"|"r"|"w"|"j"|"m"|"n"|"N"|
"i"|"I"|"e"|"{"|"V"|"A"|"Q"|"O"|"U"|"u"|"3"|"@"|
"eI"|"aI"|"OI"|"@U"|"aU"|"e@"|"I@"|"U@"|
"R"|"-s"|"-d"
META-ALPHA ::= "^"|"#"|"."
META-PHONE ::= "^"|"."
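As an illustration of the grammar above, the following Python sketch splits a rule string into its component parts. It is not taken from PRuler itself: the names `Rule` and `parse_rule` are hypothetical, and the sketch assumes the separator conventions stated in the grammar ("/" after a letter left-context, "\" after a phoneme left-context, "\" before a letter right-context, "/" before a phoneme right-context). Because the grammar writes "=" while the worked examples write "=>", both are accepted here.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    left: str             # left context (may be empty)
    left_phonemic: bool   # True if the left context is matched against phonemes
    match: str            # letters rewritten by the rule
    right: str            # right context (may be empty)
    right_phonemic: bool  # True if the right context is matched against phonemes
    output: str           # phoneme string produced (may be empty)


# LEFT sep MATCH sep RIGHT => OUTPUT, where each sep is "/" or "\".
RULE_RE = re.compile(
    r"^\s*(?P<left>[^/\\]*?)\s*(?P<lsep>[/\\])"
    r"\s*(?P<match>[^/\\=]+?)\s*(?P<rsep>[/\\])"
    r"\s*(?P<right>[^=]*?)\s*=>?\s*(?P<output>.*?)\s*$"
)


def parse_rule(text: str) -> Rule:
    """Split one rule string into contexts, match letters and output."""
    m = RULE_RE.match(text)
    if m is None:
        raise ValueError(f"not a valid rule: {text!r}")
    return Rule(
        left=m["left"],
        left_phonemic=(m["lsep"] == "\\"),   # "\" follows a phoneme context
        match=m["match"],
        right=m["right"],
        right_phonemic=(m["rsep"] == "/"),   # "/" precedes a phoneme context
        output=m["output"],
    )
```

For instance, `parse_rule(r"f \ S \ ) => s")` gives a rule whose left context `f` is phonemic, whose match is `S`, and whose right context `)` is a letter context. The sketch only separates the fields; it does not check that each symbol belongs to the LETTER, PHONEME or META sets.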
The meanings of the special symbols are as follows:
The special phoneme symbol /R/ is equivalent to an /r/ only when it is followed by a vowel. It can be used to simplify rules for /r/ and to indicate the possibility of linking-/r/.

The special phoneme symbol /-s/ stands for the plural marker. It is mapped to /s/, /z/ or to /Iz/ as appropriate.

The special phoneme symbol /-d/ stands for the past tense marker. It is mapped to /t/, /d/ or to /Id/ as appropriate.

Morphological Analysis

Many pronunciation rules are easier to state if it is assumed that the spelling is first analysed into morphological components. A very simple system of morphological analysis has been designed for the PRuler program and standard word database. In this system, a limited number of prefixes are stripped from words and indicated with a "+" symbol; also a limited number of suffixes are stripped from words and indicated with the "-" symbol. There is a very small amount of respelling of root forms of words where the addition of the suffix would have removed a final "E" or "Y". Here are some examples:

    REDO     -> RE+DO
    LETTERS  -> LETTER-S
    MERELY   -> MERE-LY
    FLIES    -> FLY-S
    HAVING   -> HAVE-ING
    BECOMING -> BE+COME-ING
    BEHAVING -> BEHAVE-ING
    HOLY     -> HOLY
    WHOLLY   -> WHOLE-LY

A definitive list of affixes will be produced in due course. Morphological markers are also useful to indicate compound words, e.g. FIRE-MAN.

The challenge is to decide on a morphological system that is simple to learn and yet relevant to the design of pronunciation rules. A particular problem is posed by Latinate affixes, which are not so readily separable. At the moment we have decided not to mark these; they are both difficult to understand and have complex pronunciation rules. For example, it does not seem useful to divide RELATION into RELATE-TION.

Stress Assignment

The assignment of a stress pattern to lexical items in English is known to be very complicated. So far the project has not addressed how this should be achieved. In particular we have not yet accommodated changes to vowel quality that occur when syllables become unstressed. The program will therefore mark a pronunciation as incorrect even when the only error is that a full vowel is indicated for an unstressed syllable (for example "about" as /eIbaUt/). This means that the accuracy of the rule set is currently underestimated. This problem will be addressed later.

SAMPA - Phonetic alphabet for English

The SAMPA alphabet is a machine-readable phonetic alphabet developed by John Wells and others. More information can be found at the SAMPA web site.

Consonants

The standard English consonant system is traditionally considered to comprise 17 obstruents (6 plosives, 2 affricates and 9 fricatives) and 7 sonorants (3 nasals, 2 liquids and 2 semivowel glides).
With the exception of the fricative /
The six plosives are
Symbol Word Transcription
p pin pIn
b bin bIn
t tin tIn
d din dIn
k kin kIn
g give gIv
The "lenis" stops are most reliably voiced intervocalically; aspiration duration following the release in the fortis stops varies considerably with context, being practically absent following /
The two phonemic affricates are
tS chin tSIn
dZ gin dZIn
As with the lenis stop consonants, /
There are nine fricatives,
f fin fIn
v vim vIm
T thin TIn
D this DIs
s sin sIn
z zing zIN
S shin SIn
Z measure "meZ@
h hit hIt
Intervocalically the lenis fricatives are usually fully voiced, and they are often weakened to approximants (fricationless continuants) in unstressed position.
The sonorants are three nasals, two liquids and two semivowel glides
m mock mQk
n knock nQk
N thing TIN
r wrong rQN
l long lQN
w wasp wQsp
j yacht jQt
Vowels

The English vowels fall into two classes, traditionally known as "short" and "long" but, owing to the contextual effect on duration of following "fortis" and "lenis" consonants (traditional "long" vowels preceding fortis consonants can be shorter than "short" vowels preceding lenis consonants), they are better described as "checked" (not occurring in a stressed syllable without a following consonant) and "free".
The checked vowels are
I pit pIt
e pet pet
{ pat p{t
Q pot pQt
V cut kVt
U put pUt
There is a short central vowel, normally unstressed:
@ another @"nVD@
The free vowels comprise monophthongs and diphthongs, although no hard and fast line can be drawn between these categories. They can be placed in three groups according to their final quality:
i ease iz
eI raise reIz
aI rise raIz
OI noise nOIz
u lose luz
@U nose n@Uz
aU rouse raUz
3 furs f3z
A stars stAz
O cause kOz
I@ fears fI@z
e@ stairs ste@z
U@ cures kjU@z
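Because several of the symbols tabulated above are two characters long (for example tS, dZ, eI, @U), a transcription string has to be split with some care. The Python sketch below shows one possible approach, greedy longest-match segmentation, using the symbol inventory from the PHONEME grammar plus the stress mark `"` that appears in the transcriptions. The function name `tokenize_sampa` is hypothetical, and the greedy strategy is an assumption: it will always read "tS" as the affricate rather than as /t/ followed by /S/.

```python
# Symbol inventory from the PHONEME grammar, plus '"' (primary stress mark).
SYMBOLS = [
    "p", "t", "k", "b", "d", "g", "f", "v", "T", "D", "s", "z", "S", "Z", "h",
    "tS", "dZ", "l", "r", "w", "j", "m", "n", "N",
    "i", "I", "e", "{", "V", "A", "Q", "O", "U", "u", "3", "@",
    "eI", "aI", "OI", "@U", "aU", "e@", "I@", "U@",
    '"',
]

# Try longer symbols first so that "tS" wins over "t" followed by "S".
_BY_LENGTH = sorted(SYMBOLS, key=len, reverse=True)


def tokenize_sampa(transcription):
    """Split a SAMPA string into symbols by greedy longest match."""
    tokens = []
    pos = 0
    while pos < len(transcription):
        for sym in _BY_LENGTH:
            if transcription.startswith(sym, pos):
                tokens.append(sym)
                pos += len(sym)
                break
        else:
            # No symbol in the inventory starts at this position.
            raise ValueError(
                f"unknown symbol at position {pos} in {transcription!r}"
            )
    return tokens
```

With this sketch, "tSIn" splits into tS, I, n and "kjU@z" into k, j, U@, z. The special symbols /R/, /-s/ and /-d/ used in rule outputs are deliberately left out, since they do not appear in finished transcriptions.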
Origins of SAMPA

SAMPA (Speech Assessment Methods Phonetic Alphabet) is a machine-readable phonetic alphabet. It was originally developed under the ESPRIT project 1541, SAM (Speech Assessment Methods) in 1987-89 by an international group of phoneticians, and was applied in the first instance to the European Communities languages Danish, Dutch, English, French, German, and Italian (by 1989); later to Norwegian and Swedish (by 1992); and subsequently to Greek, Portuguese, and Spanish (1993). Under the BABEL project, it has now been extended to Bulgarian, Estonian, Hungarian, Polish, and Romanian (1996). Under the aegis of COCOSDA it is hoped to extend it to cover many other languages (and in principle all languages). Recent additions: Croatian, Russian, Slovenian.

Regular English Pronunciation Project

The Regular English Pronunciation Project aims to develop a new accent for English which is more logically connected to English spelling. The strategy is to develop a set of rules that will make the pronunciation of English more regular and which will make English easier for second-language learners. See the REP Project web site for more information.

Feedback

Please send suggestions for improvements and reports of program faults to M.Huckvale@ucl.ac.uk.

Copyright

Prior to release, all data associated with the Regular English project remains the intellectual property of Mark Huckvale (© 2002 Mark Huckvale University College London).
© 2002 Mark Huckvale University College London, April 2002