Creating .NET Applications That Talk
- The Microsoft Speech Application SDK (SASDK)
- Business Benefits of Speech
- How the Speech Engine Works
- Installing the SASDK
- Creating a Speech Application
- Debugging and Tuning a Speech Application
- Setting Up a Telephony Server
- Summary
In 1939, Bell Labs demonstrated a talking machine named "Voder" at the New York World's Fair. The machine was not well received because the voice was robotic and unnatural sounding. Since then, many advances have been made in the area of speech synthesis, specifically in the last five years. Also known as text-to-speech, speech synthesis is one of two key technologies in the area of speech applications.
The second technology is speech recognition. For decades, science fiction movies have featured talking computers capable of accepting oral directions from their users. What once existed only in the thoughts of writers and filmmakers may soon become part of everyday life. In the last few years many advances have been made in the area of speech recognition by researchers such as the Speech Technology Group at Microsoft Research.
Speech-based applications have been slowly entering the marketplace. Many banks allow customers to access their account data through automated telephone systems, also known as Interactive Voice Response (IVR) systems. Yahoo and AOL have set up systems that read e-mail aloud to their users, and the National Weather Service (part of NOAA) has an application that reads weather forecasts.
Speech processing is an important technology for enhanced computing because it provides a natural and intuitive interface for the user. People communicate with one another through conversation, so it is comfortable and efficient to use the same method for communication with computers.
Recently Microsoft released Speech Server as part of an effort to make speech more mainstream. Microsoft Speech Server (MSS) has three main components:
- Speech Application SDK (SASDK)
- Speech Engine Services (SES)
- Telephony Application Services (TAS)
All three components are bundled into both the Standard Edition and the Enterprise Edition. The primary difference between the two is the number of concurrent users your application must support.
Speech Engine Services (SES) and Telephony Application Services (TAS) are components that run on the Speech Server, which is responsible for interfacing with the Web server and the telephony hardware. Web-based applications can be accessed from traditional Web browsers, telephones, mobile phones, Pocket PCs, and smart phones.
This chapter will focus primarily on specific components of the SASDK, since this is the component most applicable to developers. The installation files for the SASDK are available as a free download from the Microsoft Speech Web site at http://www.microsoft.com/speech/. Chapters 3 and 4 will expand on the use of the SASDK and will introduce two fictional companies and the speech-based applications they developed.
The Microsoft Speech Application SDK (SASDK)
The Microsoft Speech Application SDK (SASDK), version 1.0, enables developers to create two basic types of applications: telephony (voice-only) and multimodal (text, voice, and visual). This is not the first speech-based SDK Microsoft has developed, but it is fundamentally different from the earlier ones because it is the first to comply with an emerging standard known as Speech Application Language Tags, or SALT (refer to the "SALT Forum" profile box). Run from within the Visual Studio .NET environment, the SASDK is used to create Web-based applications only.
Speech-based applications offer more than just touch-tone access to account information or call center telephone routing. Speech-based applications offer the user a natural interface to a vast amount of information. Interactions with the user involve both the recognition of speech and the reciting of static and dynamic text. Current applications can be enhanced by offering the user a choice to utilize either traditional input methods or a speech-based one.
Development time is significantly reduced by the use of a familiar interface inside Visual Studio .NET. Streamlined wizards allow developers to quickly build grammars and prompts. In addition, applications developed for telephony access can utilize the same code base as those accessed with a Web browser.
The SASDK makes it easy for developers to utilize speech technology. Graphical interfaces and drag-and-drop capabilities hide the underlying complexity. Often, all the .NET developer needs to know about speech recognition is how to interpret the resulting confidence score, a measure of how certain the recognizer is that it heard the user correctly.
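As a concrete illustration, the minimal C# sketch below shows the usual pattern for acting on a confidence score: accept confident results, confirm borderline ones, and reprompt otherwise. The type names and thresholds here are assumptions made for the example, not SASDK APIs.

```csharp
// Illustrative sketch only: RecognitionOutcome and ConfidenceHandler are
// hypothetical placeholder types, not part of the SASDK. The thresholds
// are arbitrary values an application would tune.
public class RecognitionOutcome
{
    public string RecognizedText;
    public double Confidence;   // 0.0 (no confidence) to 1.0 (certain)
}

public class ConfidenceHandler
{
    public const double AcceptThreshold  = 0.8;
    public const double ConfirmThreshold = 0.5;

    // Decide what the dialog should do with a recognition result:
    // accept it, confirm it with the user, or ask the user to repeat.
    public static string Interpret(RecognitionOutcome outcome)
    {
        if (outcome.Confidence >= AcceptThreshold)
            return outcome.RecognizedText;
        if (outcome.Confidence >= ConfirmThreshold)
            return "Did you say " + outcome.RecognizedText + "?";
        return "I'm sorry, please say that again.";
    }
}
```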
Telephony Applications
The Microsoft Speech Application SDK enables developers to create telephony applications, in which data is accessed over a phone. Before the SASDK, one option for creating voice-only applications was the Telephony API (TAPI), version 3.0, which shipped with Windows 2000. This COM-based API let developers build interactive voice systems that communicated over the Public Switched Telephone Network (PSTN) or over existing networks and the Internet, and it was responsible for handling the communication between telephone and computer.
Telephony application development also typically incorporated the Speech Application Programming Interface (SAPI), version 5.1, to provide speech recognition and speech synthesis services. Like TAPI, this API is COM based and designed primarily for desktop applications, and it does not offer the tools and controls available with the new .NET version. Most important, SAPI is not SALT compliant and therefore does not build on a common platform.
Telephony applications built with the SASDK are accessed by clients using telephones, mobile phones, or smartphones. They require a third-party Telephony Interface Manager (TIM) to interpret signals sent from the telephone to the telephony card. The TIM then communicates with Telephony Application Services (TAS), the Speech Server component responsible for handling incoming telephony calls (see Figure 2.1). Depending on which version of Speech Server is used, TAS can handle up to ninety-six telephony ports per node, with the ability to add an unlimited number of additional nodes.
Figure 2.1 The main components involved when telephony calls are received. The user’s telephone communicates directly with the server’s telephony card across the public telephone network. The third-party Telephony Interface Manager (TIM) then communicates with Telephony Application Services (TAS), a key component of Speech Server 2004.
Telephony applications can be voice-only, DTMF (Dual-Tone Multi-Frequency) only, or a mixture of the two. DTMF applications involve the user pressing keys on the telephone keypad. This is useful when the user must enter sensitive numerical sequences such as passwords or account numbers; speaking these sequences aloud could pose a security risk, because someone might overhear the user.
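As a sketch of how an application might treat such input, the following C# fragment validates an account number entered via DTMF. The helper class, its ten-digit rule, and the '#' terminator are assumptions made for illustration, not SASDK behavior.

```csharp
// Hypothetical sketch: the SASDK exposes DTMF input through its own
// controls, so DtmfValidator and its rules are illustrative only. The
// digits are assumed to arrive as a plain string of key presses, with
// '#' as a conventional terminator.
public class DtmfValidator
{
    public static bool TryParseAccountNumber(string keyPresses, out string accountNumber)
    {
        accountNumber = null;

        // Strip the terminating '#' if the caller pressed one.
        if (keyPresses.EndsWith("#"))
            keyPresses = keyPresses.Substring(0, keyPresses.Length - 1);

        // Require exactly ten digits; reject '*' and any other stray keys.
        if (keyPresses.Length != 10)
            return false;
        foreach (char c in keyPresses)
        {
            if (!char.IsDigit(c))
                return false;
        }

        accountNumber = keyPresses;
        return true;
    }
}
```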
Call centers typically use telephony applications to route calls to appropriate areas or to automate some basic function. For instance, a telephony application can be used to reset passwords or request certain information. By automating tasks handled by telephone support employees, telephony applications can offer significant cost savings.
Telephony applications can also be useful when the user needs to iterate through a large list of information. The user hears a shortened version of the item text and can navigate through the list by speaking certain commands. For example, if the telephony application is used to recite e-mail, the user can listen as the e-mail subjects of all unread e-mails are recited. A user who wants to hear the text of a specific e-mail can speak a command such as "Read e-mail." The user can then navigate through the list by speaking commands such as "Next" or "Previous."
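A minimal C# sketch of this navigation pattern follows; the EmailListNavigator class and the command strings are illustrative assumptions, not SASDK types, and the recognizer is assumed to hand the application one of the supported commands.

```csharp
// Illustrative sketch of the command-driven list navigation described
// above. EmailListNavigator is a hypothetical class, not an SASDK type.
public class EmailListNavigator
{
    private string[] subjects;
    private string[] bodies;
    private int current;

    public EmailListNavigator(string[] subjects, string[] bodies)
    {
        this.subjects = subjects;
        this.bodies = bodies;
        this.current = 0;
    }

    // Returns the text the application should speak in response.
    public string HandleCommand(string command)
    {
        switch (command)
        {
            case "Next":
                if (current < subjects.Length - 1) current++;
                return subjects[current];
            case "Previous":
                if (current > 0) current--;
                return subjects[current];
            case "Read e-mail":
                return bodies[current];
            default:
                return "Say Next, Previous, or Read e-mail.";
        }
    }
}
```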
Multimodal Applications
Multimodal applications allow the user to choose the appropriate input method, whether speech or traditional Web controls. Because not every customer will have access to a microphone, a multimodal application is an effective way to offer speech functionality without forcing it on anyone, which opens the application to a larger customer base.
Multimodal applications are accessed via Microsoft Internet Explorer (IE) on the user’s PC or with IE for the Pocket PC (see Figure 2.2). Both versions of IE require the installation of a speech add-in. Users indicate that they wish to utilize speech by triggering an event, such as clicking an icon or button.
Figure 2.2 The high-level process by which multimodal applications communicate with Speech Server. The ASP.NET application is accessed either by a computer running Internet Explorer (IE) with the speech add-in or by Pocket IE with the speech add-in.
The speech add-in for IE, necessary for interpreting SALT, is provided with the SASDK. It should be installed on any computer or Pocket PC device accessing the speech application. In addition to providing SALT recognition, the add-in displays an audio meter that visually indicates the volume level of the audio input.
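To make the multimodal idea concrete, here is a brief C# sketch written under the assumption that typed text and recognized speech are routed into the same application logic; none of the names below are SASDK APIs.

```csharp
// Hypothetical sketch of the multimodal idea: typed and spoken input
// converge on a single code path. CityQuery and its members are
// illustrative placeholders, not SASDK types.
public class CityQuery
{
    // Called when the user types a city name into a textbox.
    public string OnTextEntered(string typedCity)
    {
        return LookupForecast(typedCity);
    }

    // Called when the speech add-in returns a recognized city name
    // after the user clicks the microphone icon or button.
    public string OnSpeechRecognized(string recognizedCity, double confidence)
    {
        // A low-confidence result falls back to the traditional input method.
        if (confidence < 0.5)
            return "Please type the city name instead.";
        return LookupForecast(recognizedCity);
    }

    // Placeholder for the application's real data access.
    private string LookupForecast(string city)
    {
        return "The forecast for " + city + " is...";
    }
}
```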