VoiceXML and SALT: Approaches for Developing Telephony Applications
- VoiceXML Enables Telephony Applications
- SALT Enables Telephony and Multimodal Applications
- Two Approaches: Which Is Best?
Ever since the VoiceXML Forum published VoiceXML Version 0.9, the telephony world has embraced VoiceXML as a high-level declarative language for developing speech applications. More recently, the SALT Forum published Speech Application Language Tags (SALT) for developing both telephony and multimodal applications by embedding tags into host markup languages such as XHTML and HTML. This article addresses the strengths and weaknesses of these two approaches. For a description of some of the technical differences, see "VoiceXML and SALT: How Are They Different, and Why?" (Speech Technology Magazine, May/June 2002).
VoiceXML Enables Telephony Applications
VoiceXML has had a revolutionary impact on how voice applications have been written. The VoiceXML Forum (founded by AT&T, IBM, Lucent, and Motorola) specified Version 1.0 of VoiceXML. The W3C Voice Browser Working Group took over the job of evolving and standardizing the language, resulting in Version 2.0 and additional languages for developing speech applications:
- Speech Synthesis Markup Language (SSML) describes how a speech synthesis engine pronounces words and phrases.
- Speech Recognition Grammar Specification (SRGS) describes what words and phrases a speech recognition engine can recognize at each point in the dialog.
- Semantic Interpretation Language (a version of ECMAScript) extracts and translates recognized words.
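As a rough illustration of the first language in this list, the fragment below is a minimal SSML sketch. The prompt wording, the emphasis, and the pause length are invented for illustration and do not come from any particular application.

```xml
<?xml version="1.0"?>
<!-- Hypothetical prompt: the wording and timing are assumptions -->
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- Stress one word so the caller hears which city is being asked for -->
  Where would you like to <emphasis>leave</emphasis> from?
  <!-- Pause briefly before offering a hint -->
  <break time="500ms"/>
  You can say the name of a city.
</speak>
```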
A programmer no longer has to manage the details of invoking the speech synthesis engine, speech recognition engine, audio subsystem, and other subsystems, or coordinate the data exchanged among them. Developers need only specify verbal menus and forms. The following VoiceXML code illustrates a verbal form with two fields that solicit the names of the origin and destination cities for a travel application. The <prompt> elements contain the prompts that are presented to the caller via a speech synthesis engine. The <grammar> elements identify the grammars that a speech recognition engine uses to listen for the caller's response; in this example, the actual grammars are stored in files external to the application. The <nomatch> elements are event handlers that take over when the caller fails to respond appropriately. The <submit> element inside <filled> shows how the solicited data is transferred to a backend database.
<?xml version="1.0"?>
<vxml version="2.0">
  <form id="TravelForm">
    <field name="OriginCity">
      <grammar src="city.xml"/>
      <prompt>Where would you like to leave from?</prompt>
      <nomatch>I didn't understand</nomatch>
    </field>
    <field name="DestCity">
      <grammar src="city.xml"/>
      <prompt>Where would you like to go to?</prompt>
      <nomatch>I didn't understand</nomatch>
    </field>
    <filled>
      <submit ... />
    </filled>
  </form>
</vxml>
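The form above refers to an external grammar file, city.xml, whose contents are not shown. A minimal sketch of what such a file might look like follows, assuming the XML form of SRGS. The rule name and the three city names are invented for illustration; a real travel application would list far more cities.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical contents of city.xml: rule name and cities are assumptions -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" mode="voice" root="city">
  <!-- The root rule matches exactly one of the listed city names -->
  <rule id="city" scope="public">
    <one-of>
      <item>Boston</item>
      <item>Chicago</item>
      <item>Seattle</item>
    </one-of>
  </rule>
</grammar>
```

Because both fields point at the same file, a single city list serves as the vocabulary for the origin and the destination.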
VoiceXML owes its declarative nature to its Forms Interpretation Algorithm. This algorithm sequences through the fields of a form, causing a field's prompt to be presented to the user, listening for the caller's response, and invoking the appropriate event handler if the caller fails to respond appropriately. In the previous example, the Forms Interpretation Algorithm performs the following actions for the "OriginCity" field:
- Invokes the speech synthesis engine, and sends it the text "Where would you like to leave from?"
- Invokes the speech recognition engine, and sends it the grammar "city.xml" so it can listen for the names of cities.
- Invokes an event handler that encourages the user to try again if the user fails to say one of the city names listed in the grammar.
Next, the Forms Interpretation Algorithm performs similar tasks for the "DestCity" field.
The Forms Interpretation Algorithm performs many of the control and coordination activities; this leaves the programmer to specify just the prompts, grammars, and event handlers using a declarative style of programming.
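The event handling that the Forms Interpretation Algorithm coordinates can be spelled out in more detail. The sketch below expands the "OriginCity" field under a few assumptions: the <noinput> handler, the <reprompt/> elements, the count="3" escalation, and all handler wording are additions for illustration and are not part of the original example. In VoiceXML 2.0, a handler that does not execute <reprompt/> generally returns to the field without replaying its prompt, so the sketch requests the replay explicitly.

```xml
<?xml version="1.0"?>
<!-- Sketch only: the handlers and wording below are illustrative assumptions -->
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="TravelForm">
    <field name="OriginCity">
      <grammar src="city.xml"/>
      <prompt>Where would you like to leave from?</prompt>
      <!-- The caller said nothing before the timeout expired -->
      <noinput>
        I didn't hear you.
        <reprompt/>
      </noinput>
      <!-- The caller said something the grammar does not cover -->
      <nomatch>
        I didn't understand.
        <reprompt/>
      </nomatch>
      <!-- From the third failed match onward, give up and end the session -->
      <nomatch count="3">
        Sorry, I am having trouble understanding you. Goodbye.
        <exit/>
      </nomatch>
    </field>
  </form>
</vxml>
```

The developer still writes only prompts, grammars, and handlers. Choosing which handler runs, counting repeated failures, and returning to the field remain the interpreter's job.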
More than 50 vendors have announced products based on VoiceXML. Developers have written hundreds of VoiceXML applications that handle thousands of telephone calls every day.