A Tutorial Introduction to SALT

6. Speech Synthesis Mark-Up Language

The way a SALT <PROMPT> object converts text to speech can be controlled with the Speech Synthesis Mark-up Language (SSML). SSML is an XML-based markup language for controlling the voice, pitch, rate, volume, pronunciation, and other characteristics of the output speech. A basic introduction is given below; full details can be found in the W3C Speech Synthesis Markup Language Specification Version 1.0 or in the SASDK documentation.

6.1 Test application

The following page lets you type SSML-marked text into a text box and hear it spoken when you press the Speak button.

    <html xmlns:salt="http://www.saltforum.org/2002/SALT">
    <object id="speech-add-in" CLASSID="clsid:33cbfc53-a7de-491a-90f3-0e782a7e347a">
    </object>
    <?import namespace="salt" implementation="#speech-add-in"/>
    <salt:prompt id="prompter" xmlns="http://www.w3.org/2001/10/synthesis">
    </salt:prompt>
    <body>
      <h2>SALT: Test SSML</h2>
      <textarea id="iptext" rows=5 cols=50>
      Type some <emphasis>marked-up</emphasis> text here.
      </textarea><br>
      <input type="button" name="speak" value="Speak" onClick="dospeak()">
    </body>
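    <script>
    // Speak the textarea contents, wrapped in the SSML <speak> root element.
    function dospeak()
    {
      var pfield=document.getElementById("iptext");     // text entry box
      var pprompt=document.getElementById("prompter");  // SALT prompt object
      pprompt.Start("<speak>"+pfield.value+"</speak>");
    }
    </script>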
    </html>
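
Because dospeak() always adds its own <speak>..</speak> wrapper, pasting a fragment that already carries a <speak> root element produces nested roots. A minimal sketch of a guard follows; wrapInSpeak is a hypothetical helper name chosen for illustration and is not part of SALT or the SASDK.

```javascript
// Hypothetical helper (not part of SALT): wrap a fragment in <speak>
// unless it already starts with its own <speak> root element.
function wrapInSpeak(text) {
  var trimmed = text.replace(/^\s+|\s+$/g, "");
  // Matches "<speak>" or "<speak attr=...>" at the very start,
  // but not other names that merely begin with "speak".
  if (/^<speak[\s>]/.test(trimmed)) {
    return trimmed;
  }
  return "<speak>" + trimmed + "</speak>";
}

console.log(wrapInSpeak("Type some <emphasis>marked-up</emphasis> text here."));
```

With this in place, the call in dospeak() could become pprompt.Start(wrapInSpeak(pfield.value)).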
    

6.2 SSML Elements

Element                          Description
<speak>..</speak>                Encloses all text marked up with SSML.
<paragraph>..</paragraph>        Marks the paragraph structure of the document.
<sentence>..</sentence>          Marks the sentence structure of the document.
<say-as type=type>..</say-as>    Gives finer control over how text is to be spoken (see below).
<phoneme ph=string>..</phoneme>  Provides a pronunciation for text (see below).
<sub alias=text>..</sub>         Provides a substitute form for text.
<voice args>..</voice>           Sets the preferred speaker type for the voice.
<emphasis>..</emphasis>          Speaks the enclosed text with more emphasis (greater stress).
<break size=size />              Causes a short break or pause at this point. Sizes are: none, small, medium, large.
<prosody args>..</prosody>       Controls pitch, rate and volume of the speech (see below).
<audio src=URL />                Inserts audio from a file at this point.
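
When marked-up text is assembled in script, a small helper can catch a misspelt break size before the string reaches the prompt. The sketch below assumes the four sizes listed in the table are the only ones accepted; breakTag and BREAK_SIZES are hypothetical names, not SASDK functions.

```javascript
// Sizes listed in the element table for <break size=size />.
var BREAK_SIZES = ["none", "small", "medium", "large"];

// Hypothetical helper: build a <break/> element, rejecting unknown sizes.
function breakTag(size) {
  if (BREAK_SIZES.indexOf(size) === -1) {
    throw new Error("Unknown break size: " + size);
  }
  return '<break size="' + size + '"/>';
}

var ssml = "Please hold." + breakTag("large") +
           "<emphasis>Thank you</emphasis> for waiting.";
console.log(ssml);
```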

<say-as> element

The <say-as> element marks up text that needs to be spoken in a particular way. Its "type" attribute can be set to any of these values:

    type=acronym
    Speak as if text was an acronym.
    type=spell-out
    Spell out individual letters.
    type=number
    Speak as a number.
    type=date
    Speak as a date.
    type=time
    Speak as a time.
    type=currency
    Speak as a currency amount.
    type=telephone
    Speak as a telephone number.
    type=net
    Speak as an e-mail or web address.
    type=name
    Speak as a person's name.

Here are some examples (which you can cut and paste into the test application):

    The year is <say-as type="date">1930</say-as>.
    The total is <say-as type="currency">$20.45</say-as>.
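
If the values come from script rather than being typed by hand, the element can be generated. The following is a sketch only: sayAs, escapeXml and SAY_AS_TYPES are hypothetical names, and the list of types is simply copied from the values above rather than taken from the SASDK.

```javascript
// Type values listed above for the <say-as> element.
var SAY_AS_TYPES = ["acronym", "spell-out", "number", "date", "time",
                    "currency", "telephone", "net", "name"];

// Escape characters that would otherwise be read as markup.
function escapeXml(s) {
  return String(s).replace(/&/g, "&amp;")
                  .replace(/</g, "&lt;")
                  .replace(/>/g, "&gt;");
}

// Hypothetical helper: wrap text in <say-as type="...">, rejecting unknown types.
function sayAs(type, text) {
  if (SAY_AS_TYPES.indexOf(type) === -1) {
    throw new Error("Unknown say-as type: " + type);
  }
  return '<say-as type="' + type + '">' + escapeXml(text) + "</say-as>";
}

console.log("The total is " + sayAs("currency", "$20.45") + ".");
```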
    

<phoneme> element

The <phoneme> element provides a pronunciation for a piece of text. For the Microsoft Speech add-in, the phonetic transcription has to be in the Microsoft SAPI Phonetic Alphabet format, and you must indicate this by including the alphabet="sapi" attribute. The text enclosed in the tags is not spoken; it serves only as a readable note of the word being transcribed.

Here is an example:

    The British pronunciation of "grass" is
    <phoneme alphabet="sapi" ph="g r aa s">grass</phoneme>.
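
The same markup can be produced from script. This is a minimal sketch, assuming the transcription is already a valid SAPI phone string; phoneme and escapeXml are hypothetical helper names, and no checking of the phone symbols is attempted.

```javascript
// Escape characters that are unsafe in element text or attribute values.
function escapeXml(s) {
  return String(s).replace(/&/g, "&amp;")
                  .replace(/</g, "&lt;")
                  .replace(/>/g, "&gt;")
                  .replace(/"/g, "&quot;");
}

// Hypothetical helper: attach a SAPI transcription to a word.
// The word itself is only a label; the ph attribute is what gets spoken.
function phoneme(word, sapiPhones) {
  return '<phoneme alphabet="sapi" ph="' + escapeXml(sapiPhones) + '">' +
         escapeXml(word) + "</phoneme>";
}

console.log(phoneme("grass", "g r aa s"));
```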
    

<prosody> element

Attribute Description
pitch=setting Sets preferred pitch for marked-up content. Pitch values may be specified as:
  • Enumerated: low, medium, high, default
  • Absolute (Hz): 65, 110, 261, 294
  • Relative (Hz): +4, +10.6, -2.0, -6.75
  • Percentage: +2%, +5.5%, -7.0%, -4.25%
  • Semitones: +1st, +2.5st, -0.5st, -1.5st
rate=setting Sets preferred rate for marked-up content. Rate values may be specified as:
  • Enumerated: slow, medium, fast, default
  • Absolute: 0, 1, 3, 7 (scale=0..10)
  • Relative: +5, +1.3, -3.0, -7.22
  • Percentage: +15%, +7.8%, -12.0%, -6.5%
volume=setting Sets preferred volume for marked-up content. Volume values may be specified as:
  • Enumerated: silent, soft, medium, loud, default
  • Absolute: 16, 47, 84, 100 (scale=0..100)
  • Relative: +15, +45.3, -30.0, -13.25
  • Percentage: +21%, +6.5%, -50.0%, -25.5%
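
The value forms in the table can be told apart by their shape alone. The following is a minimal sketch for the pitch attribute, assuming the forms are exactly those listed above; classifyPitch is a hypothetical name for illustration, not an SASDK function, and the speech engine itself may accept other spellings.

```javascript
// Hypothetical helper: report which form from the pitch table a value uses.
// Returns null for anything that matches none of the listed forms.
function classifyPitch(value) {
  var num = "\\d+(\\.\\d+)?"; // digits with an optional decimal part
  if (/^(low|medium|high|default)$/.test(value)) return "enumerated";
  if (new RegExp("^[+-]" + num + "st$").test(value)) return "semitones";
  if (new RegExp("^[+-]" + num + "%$").test(value)) return "percentage";
  if (new RegExp("^[+-]" + num + "$").test(value)) return "relative";
  if (new RegExp("^" + num + "$").test(value)) return "absolute";
  return null;
}

console.log(classifyPitch("+2.5st")); // semitones
console.log(classifyPitch("110"));    // absolute
```

The signed forms are tested before the unsigned one so that a leading + or - always marks a change relative to the current pitch.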

Here is an example:

    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
    <sentence>
    Your order for <prosody pitch="+0.5st" rate="-10%">
    <say-as type="number"> 8 </say-as> books </prosody>
    will be shipped tomorrow.
    </sentence>
    </speak>
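
The sentence in this example can also be assembled in script. The sketch below is illustrative only: prosody is a hypothetical helper name, and it assumes the attribute values and the enclosed content are already valid, so nothing is escaped or checked.

```javascript
// Hypothetical helper: wrap content in <prosody>, emitting only the
// attributes that were supplied, always in pitch, rate, volume order.
// Values and content are inserted as given (assumed to be valid markup).
function prosody(settings, content) {
  var names = ["pitch", "rate", "volume"];
  var attrs = "";
  for (var i = 0; i < names.length; i++) {
    var value = settings[names[i]];
    if (value !== undefined) {
      attrs += " " + names[i] + '="' + value + '"';
    }
  }
  return "<prosody" + attrs + ">" + content + "</prosody>";
}

var order = "Your order for " +
  prosody({ pitch: "+0.5st", rate: "-10%" },
          '<say-as type="number">8</say-as> books') +
  " will be shipped tomorrow.";
console.log(order);
```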
    

A Tutorial Introduction to SALT © 2005 Mark Huckvale, Phonetics and Linguistics, University College London
