Summary
In the spring of 2004, Microsoft released Speech Server as part of an initiative to make speech applications mainstream. The Speech Application SDK (SASDK), version 1.0, is the component that allows Visual Studio .NET developers to create two types of applications: telephony and multimodal. The SASDK complies with an emerging standard, Speech Application Language Tags, or SALT. SALT is a lightweight extension of markup languages such as HTML and XML that standardizes the way devices use speech.
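To illustrate, the fragment below shows how SALT tags might be embedded in an ordinary HTML page. The element names (prompt, listen, grammar, bind) come from the SALT 1.0 specification; the ids, grammar file name, and XPath value are hypothetical.

    <html xmlns:salt="http://www.saltforum.org/2002/SALT">
    <body>
      <!-- Speak a question to the caller -->
      <salt:prompt id="askCity">Which city are you calling about?</salt:prompt>
      <!-- Listen for an answer constrained by a grammar,
           then copy the result into the text box -->
      <salt:listen id="recoCity">
        <salt:grammar src="cities.grxml" />
        <salt:bind targetelement="txtCity" value="//city" />
      </salt:listen>
      <input type="text" id="txtCity" />
    </body>
    </html>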
Telephony, or voice-only, applications are accessed from a standard telephone, mobile phone, or smartphone. They accept input either as speech or as touch-tone (DTMF) digits pressed on the keypad. Telephony applications have typically been used to make call centers more efficient, but the built-in controls offered in the SASDK give them the potential to offer much more.
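Keypad input in SALT is collected with the dtmf element, which, like listen, pairs a grammar with bind statements. A minimal sketch, again with hypothetical ids and file names:

    <!-- Collect a keypad choice; the grammar defines which keys are valid -->
    <salt:dtmf id="dtmfMenu">
      <salt:grammar src="menu-keys.grxml" />
      <salt:bind targetelement="txtChoice" value="//choice" />
    </salt:dtmf>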
Multimodal applications are accessed from either a desktop PC or a Pocket PC device. They allow users to select the input mechanism they prefer, whether traditional Web controls or speech. A speech add-in installed with Internet Explorer (IE) allows the client to access speech applications.
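The classic multimodal pattern is tap-and-talk: clicking a control starts recognition, and the result is written back into the page. In SALT, a listen object exposes a Start method that client script can call. The sketch below assumes hypothetical ids and a hypothetical grammar file:

    <!-- Tap-and-talk: clicking the text box starts recognition -->
    <input type="text" id="txtCity" onclick="recoCity.Start()" />
    <salt:listen id="recoCity">
      <salt:grammar src="cities.grxml" />
      <salt:bind targetelement="txtCity" value="//city" />
    </salt:listen>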
There are many opportunities for voice-only applications in today’s society. These applications will lower the barriers to interaction between humans and computers, and they will provide cost-efficient alternatives to traditional information-retrieval methods. The problems that once plagued these applications are being removed, and development has been eased by tools such as the SASDK.
The speech engine recognizes what the user says by applying predefined grammar rules rather than attempting free-form dictation. Although this requires more effort from the developer, it results in more accurate recognition.
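In the SASDK, grammar rules are written in the XML form of the W3C Speech Recognition Grammar Specification and stored in .grxml files. A minimal sketch of a rule that accepts one of three city names (the rule name and phrases are hypothetical):

    <grammar xmlns="http://www.w3.org/2001/06/grammar"
             xml:lang="en-US" root="city" mode="voice">
      <rule id="city" scope="public">
        <!-- The engine will recognize only one of these phrases -->
        <one-of>
          <item>Seattle</item>
          <item>Boston</item>
          <item>Chicago</item>
        </one-of>
      </rule>
    </grammar>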
Once the SASDK is installed, a new speech application is created in Visual Studio using the Speech Web Application template. Depending on the application type selected (multimodal or telephony), certain files are included by default.
The Prompt Editor is a tool for managing an application's prompt database. This database contains prerecorded phrases used throughout the application to prompt the user for input. If a phrase is not recorded in the prompt database, the text-to-speech (TTS) engine speaks it instead. Unfortunately, the TTS engine does not sound very natural, so it is usually best to prerecord prompts.
Just as prompts represent what is spoken to the user, grammars represent what the user can say. An application typically consists of several grammar files, each specifying a range of possible responses. This increases the efficiency and accuracy of the speech engine, since it knows what to expect.
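When an utterance matches a grammar, the engine returns its interpretation as an XML document that page elements can bind against; on the Microsoft platform this result format is called Semantic Markup Language (SML). The shape below is an illustrative sketch only, with hypothetical element names and scores:

    <!-- A possible recognition result for the utterance "Seattle" -->
    <SML text="Seattle" confidence="0.92">
      <City confidence="0.92">Seattle</City>
    </SML>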
A series of speech controls represents the content of a single page. Since a voice-only application has no visible interface, it relies on these controls to direct the flow of the dialog. The question/answer (QA) control is the main control used; it represents a single interaction with the user.
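As a rough sketch, a QA control declared on a speech-enabled ASP.NET page might look like the following. The overall shape (a Prompt, a Reco with grammars, and Answers that bind recognition results) reflects the SASDK speech controls, but the exact attribute names are recalled from the SDK documentation and the ids and paths are hypothetical:

    <!-- A single question-and-answer interaction -->
    <speech:QA id="qaCity" runat="server">
      <Prompt InlinePrompt="Which city are you calling about?" />
      <Reco>
        <Grammars>
          <speech:Grammar Src="Grammars/Cities.grxml" runat="server" />
        </Grammars>
      </Reco>
      <Answers>
        <speech:Answer SemanticItem="siCity" XpathTrigger="/SML/City"
                       runat="server" />
      </Answers>
    </speech:QA>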
The Telephony Application Simulator (TASim) and the Speech Debugging Console are both used when developing and debugging a voice-only application. The TASim is used to simulate the client experience for telephony applications, and the Speech Debugging Console allows developers to view specifics of the dialog flow.
Call Viewer and Speech Application Reports are two tools provided with Microsoft Speech Server that allow you to analyze and report on call event data. Developers do not need to install Speech Server to use these tools; they are available from the Redistributable Installers directory included with the SASDK installation files.
Once a voice-only application is developed and tested, it is deployed on a telephony server. Microsoft Speech Server is the component that allows telephones to access voice-only applications. The telephony server is integrated with Windows and interprets the application's SALT tags.