Creating a Speech Application
The SASDK provides a template for creating new speech applications with Visual Studio .NET. It also provides visual editors for building the prompts (words spoken to the user) and grammars (words spoken by the user). This section will examine the basics of creating a speech application with the SASDK.
To use the template provided with the SASDK, open Visual Studio .NET and follow these steps:
1. Click File, New, Project. In the New Project dialog box, select the desired project type and click the Speech Web Template icon in the Templates window. This template was created when you installed the SASDK. In the Location box, change the default project name to the name you want and click OK.
2. Either accept the default settings and click Finish, or select the Application Settings and Application Resources tabs to specify custom settings.
3. The default application mode is voice-only, so if you want to create a multimodal application, change the mode on the Application Settings tab.
4. The Application Resources tab allows you to specify whether a default grammar library file is created and what it is named. From here you can also indicate that a new prompt project should be created and specify its name.
5. Click Finish at any time to create the new project.
If you choose to build a voice-only application, the project will include a Web page named Default.aspx. This page contains two speech controls, AnswerCall and SemanticMap, which are basic controls used in every voice-only application; their specific functions are covered in the section titled "Using Speech Controls." The default project will also include a folder named Grammars that contains two grammar files, Library.grxml and SpeechWebApplication1.grxml. For voice-only applications, a prompt project and the Grammars folder are included by default.
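The generated markup looks roughly like the following sketch. It is illustrative only: the tag prefix, IDs, and attribute names reflect common SASDK conventions but are assumptions here, not copied from the wizard's output.

    <%-- Illustrative sketch of the speech controls on Default.aspx in a
         voice-only project; tag prefix and attributes are assumptions. --%>
    <speech:AnswerCall ID="AnswerCall1" runat="server" />
    <speech:SemanticMap ID="SemanticMap1" runat="server">
        <%-- SemanticItem controls are added here as the application
             begins collecting information from the user --%>
    </speech:SemanticMap>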
If you choose to build a multimodal application, the Default.aspx page is included, but it will contain no controls. There will be a Grammars folder, but no prompt project will be created.
By default, the Manifest.xml file is included for both project types. It is an XML-based file that contains references to the resources used by the project. References include grammar files and prompt projects. Speech Server will preload and cache these resources to help improve performance.
The Prompt Editor
Microsoft recommends that you prerecord static prompts because recorded human speech sounds more natural than the output of the text-to-speech engine. The prompt editor (see Figure 2.4) is a tool that allows you to specify potential prompts and record the wave files associated with each prompt.
Figure 2.4 Screenshot of the prompt editor in the prompt database project. The prompt editor is used to record the wave files associated with each prompt. The screenshot includes four different prompts.
The utterance "Welcome to my speech application" represents a single prompt. For voice-only applications, you need to make sure you include a wide range of prompts. Since the user relies on these prompts to understand how the application works, they need to be clear and meaningful.
The Prompt Database
An application built with the Speech SDK wizard includes a prompt database project by default. If you want to add another prompt database, you can do so from the File menu by selecting Add Project and then New Project (see Figure 2.5). The new project will be based on the Prompt Project template. Once the project is added, a new prompt database can be added by right-clicking the prompt project and selecting Add, then Add New Item. The Prompt Database item opens a data-grid-style screen that allows you to specify all the potential prompts.
Figure 2.5 Screenshot of the dialog used to add a new prompt project to your speech application. This dialog is accessed by clicking Add Project from the File menu and then clicking New Project.
The prompt database contains all the prerecorded utterances used to communicate with the user. An application can reference more than one prompt database. One reason for doing this is ease of maintenance. Prompts that change often can be placed in a separate prompt database. By restricting the size of the prompt database, the amount of time needed to recompile is minimized.
If you followed the instructions in the last section to create a new speech project, you can now open the default prompt database by double-clicking the prompts file from Solution Explorer.
Transcriptions and Extractions
Figure 2.6 is a screenshot of the Recording pane in the prompt database project. There are two grids in a prompt project: the top one contains transcriptions, and the bottom one contains extractions. Transcriptions are the individual pieces of speech that relate to a single utterance. Extractions combine transcription elements to form phrases and are created by placing square brackets around the transcription elements.
Figure 2.6 Contents of the Recording pane in the prompt database project. Transcriptions are the individual pieces of speech that can be prerecorded. No utterances have been recorded for prompts with a red X in the Has Wave column.
Sometimes a prompt involves more than one transcription element, such as "I heard you say Sara Rea." In this case, the two elements are "I heard you say" and "Sara Rea." In some cases, employee names may also be prerecorded in the prompt database. This adds an additional burden, because every time a new employee is added to the database, someone needs to record the employee’s name. However, doing this prevents the speech engine from falling back to text-to-speech (TTS) to render the prompt, which is preferred because recordings result in a more natural-sounding prompt.
Prompts are controlled by prompt functions, which programmatically indicate what phrases are spoken to the user. When the speech engine is passed a phrase from a prompt function, it first searches the prompt database to see whether any prerecorded utterances are present. It searches the entire database for matches and will string together as many transcription elements as necessary to render the entire phrase.
Because the speech engine pieces transcription elements together to form phrases, you can break phrases up to prevent redundancy. For instance, the phrase "Sorry, I am having trouble hearing you. If you need help, say help" may be spoken when an application encounters silence, while the phrase "Sorry, I am having trouble understanding you. If you need help, say help" is used whenever the speech engine does not recognize the user’s response. The subphrase "If you need help, say help" can therefore be recorded as a separate entry in the prompt database, so it only has to be recorded once and the prompt database stays smaller.
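Prompt functions are authored as client-side script in the SASDK. As a minimal sketch, assuming a hypothetical function name, the no-recognition prompt above could be returned like this; the engine then matches the returned text against the prompt database and stitches together the prerecorded pieces, including the shared subphrase.

    // Hypothetical prompt function (client-side script). The returned text is
    // matched against the prompt database; any portion without a recording
    // falls back to text-to-speech.
    function GetNoRecoPrompt() {
        return "Sorry, I am having trouble understanding you. " +
               "If you need help, say help.";
    }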
The Recording Tool
The Recording tool can be accessed by clicking the red circle icon above the Transcription pane or by clicking Prompt and then Record All. The text from the selected transcription item is displayed in the Display Text textbox (see Figure 2.7). After clicking Record, the person making the recording should speak clearly into the microphone and click Stop as soon as the entire phrase has been spoken. Try to choose a recording location where background noise is minimized.
Figure 2.7 The Recording tool allows you to directly record each prompt associated with a transcription. Prompts can also be recorded by professional voice talent in a studio, made into wave files, and imported.
In some cases, you may want to utilize professional voice talent to make recordings. There are third-party vendors, such as ScanSoft (see the "ScanSoft" profile box), that can provide professional voice talent and assistance with recordings. Wave files created in a recording studio can be associated with a specific transcription element by clicking Import and browsing to the file’s location.
If the speech engine is unable to find a match in any of the prompt databases, it falls back to TTS. The result is a machine-like voice that may work against the natural interface you are trying to create. Speech Server comes bundled with ScanSoft’s Speechify TTS engine (see the "ScanSoft" profile box), but at present the results from a text-to-speech engine are not as natural sounding as a recorded human voice. On the other hand, it will not always be possible or practical to prerecord all utterances. You will have to weigh these options when designing your speech application.
The recording of prompts is a major consideration when designing a speech-enabled application. If professional talent is used, you will want to minimize the need for multiple recording sessions. If the application requires text-to-speech for most prompts, you may want to consider purchasing a third-party TTS add-in.
The Grammar Editor
Grammars are the counterpart of prompts: they represent what the user says to the application. They are a key element of voice-only applications, which rely completely on accurate understanding of the user’s commands. The grammar editor builds Extensible Markup Language (XML) files that are used by the speech-recognition engine to interpret the user’s speech. What is nice about the grammar editor is that you drag and drop controls to build the XML instead of typing it in directly, which helps reduce the time spent building grammars.
A grammar is stored as an XML file with a .grxml extension. Each Question/Answer (QA) control in the application, representing an interaction with the user, is associated with one or more grammars. A single grammar file contains one or more rules that the application uses to interpret the user’s response.
To add a grammar file, click Add New Item on the Project menu, select the Grammar File category, and name the file accordingly. Existing grammars can be viewed by expanding the Grammars folder in Solution Explorer. By default, two grammar files are added when you create a voice-only or multimodal application. The first file, named Library.grxml, contains common grammar rules you may need. For instance, it includes a rule for collecting yes/no responses (see Figure 2.8), as well as rules for handling numbers, dates, and even credit card information. Rules embedded within the library grammar file can be referenced from other grammar files through the RuleRef control.
The second grammar file is named the same as the project file by default. This is where you will place the grammar rules associated with your application. Although you could store all the rules in a single file, you may want to consider adding subfolders within the main Grammars folder. You can then create multiple grammar files to group similar types of grammar rules. This helps to organize code and makes referencing grammar rules easier.
Grammar rules are built by dragging elements onto the page. The controls are available in the Grammar tab of the toolbox; Figure 2.9 is a screenshot of these grammar controls. Most rules consist of one or more of the following elements (a sketch of the underlying XML appears after this list):
Phrase—represents the actual phrase spoken by the user.
List—contains multiple phrase elements that all relate to the same thing. For instance, a yes response could be spoken as "yeah," "ok," or "yes please." A list control allows you to indicate that all these responses are the same as yes.
RuleRef—used to reference other rules through the URI property. This is useful when you have multiple grammar files and want to reuse the logic in existing rules.
Group—used to group related elements. It can contain any element, such as a List, Phrase, or RuleRef.
Wildcard—used to specify which words in a phrase can be ignored.
Halt—used to stop the recognition path.
Skip—used to indicate that a recognition path is optional.
Script Tag—used to get semantic information from the grammar.
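To give a sense of what the editor builds behind the scenes, the following hand-written fragment approximates a simple yes/no rule in .grxml form. Treat it as a sketch: the SASDK editor emits additional attributes, and the exact assignment syntax produced by the Semantic Script Editor may differ from what is shown inside the tag elements.

    <!-- Illustrative yes/no rule; attribute details and the assignment
         syntax inside the tag elements are assumptions. -->
    <grammar version="1.0" xml:lang="en-US" root="YesNo"
             xmlns="http://www.w3.org/2001/06/grammar">
      <rule id="YesNo" scope="public">
        <one-of>                                          <!-- List element -->
          <item>yes <tag>$.YesNo = "Yes"</tag></item>     <!-- Phrase plus Script Tag -->
          <item>yeah <tag>$.YesNo = "Yes"</tag></item>
          <item>no <tag>$.YesNo = "No"</tag></item>
        </one-of>
        <!-- A RuleRef element would appear here as something like
             <ruleref uri="Library.grxml#SomeRule" /> -->
      </rule>
    </grammar>

In practice you build the rule visually with the toolbox elements and rarely need to edit this XML by hand.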
The grammar editor (see Figure 2.8) contains a textbox called Recognition String. It can be used to test a rule without actually running the application, which is especially useful when you are building the initial grammar set or working with complex rules. To use this feature, enter text that you would expect the user to say and click Check. The output window displays the Semantic Markup Language (SML), which is the XML generated by the speech engine and sent to the application. If the text was recognized, you will see "Check Path test successfully complete" at the bottom of the output window.
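For example, checking the recognition string "yes please" against a yes/no rule might return SML along these lines. This is a hand-written approximation rather than captured output, and the attribute names are assumptions:

    <SML text="yes please" confidence="0.92">
      <YesNo confidence="0.92">Yes</YesNo>
    </SML>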
Figure 2.8 Screenshot displaying the yes/no rule inside the grammar editor. This is one of several rules included by default with the Library.grxml file.
Figure 2.9 Screenshot of the Grammar tab, available in the toolbox when creating a new grammar. The elements you will use most often are the List, Phrase, RuleRef, and Script Tag elements.
The Script Tag element is used to assign the user’s response to a semantic item. The properties for a Script Tag include an ellipsis button that opens the Semantic Script Editor. This editor helps you create an assignment so that the correct SML result is returned; you can also switch to the Script tab and edit the script directly. Figure 2.10 is a screenshot of the Semantic Script Editor.
Figure 2.10 Screenshot of the Semantic Script Editor that is available when you use a Script Tag element. The Script Tag is used whenever you need to assign the user’s response to a semantic item.
When building grammars, you will probably not anticipate all possible responses on the initial pass, so grammars require fine-tuning to make the application as efficient and accurate as possible. This process is eased because grammar files do not have to be compiled during development; they remain plain XML files that the application references directly. For this reason, you would not want to compile grammar files until the application has been thoroughly tested and is ready to deploy.
Using Speech Controls
A voice-only application has no visible interface. It runs on IIS as a Web page and is accessed with a telephone. During development and debugging, the application executes within the Web browser, and the Speech Debugging Console provides the developer with information about the application dialog. The user never sees the page, so what is placed on it visually is unimportant; the only elements on the page will be speech controls, and they are seen only by the developer.
The Speech Application SDK includes several speech controls that are available from the Speech tab of the toolbox. These controls are dragged onto the startup form as the application is built. Figure 2.11 is a screenshot of the speech controls available in the Speech tab of the toolbox. Speech controls are the basic units of computer-to-human interaction, and the SASDK contains two varieties of controls: dialog and application speech controls.
Figure 2.11 Screenshot of all the speech controls available in the Speech tab of the toolbox. The QA control is the most basic unit and is utilized in every interaction with the user. SmexMessage, AnswerCall, TransferCall, MakeCall, RecordSound, and DisconnectCall are only applicable to telephony applications.
Dialog Speech Controls
Table 2.1 lists the dialog speech controls used for controlling the conversational flow with the user. The QA control, the most commonly used control, represents a single interaction with the user in the form of a prompt and a response; a markup sketch showing how a QA control works with a SemanticMap follows the table.
Table 2.1 Dialog Speech Controls are used for controlling the conversational flow with the user.
Control Name | Description
SemanticMap | Collection of SemanticItem controls, where a SemanticItem control represents a single piece of information collected from the user, such as a last name.
QA | Question/Answer control. Represents one interaction with the user in the form of a question and then a response.
Command | Often used to navigate the application with unprompted commands such as Help or Main Menu.
SpeechControlSettings | Specifies common settings for a group of controls.
SmexMessage | Sends and receives messages from a computer-supported telephony application (CSTA) that complies with European Computer Manufacturers Association (ECMA) standards.
AnswerCall | Answers calls from a telephony device. Used for inbound telephony applications.
TransferCall | Transfers a call.
MakeCall | Initiates a new call. Used for outbound telephony applications.
DisconnectCall | Ends a call.
CompareValidator | Compares what the user says with some value.
CustomValidator | Validates data with client-side script.
RecordSound | Records what the user says and copies it to the Web server so it can be played back later.
Listen | Represents the listen element from the SALT specification. Considered a basic speech control.
Prompt | Represents the prompt element from the SALT specification. Considered a basic speech control.
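The following sketch shows roughly how a QA control, a SemanticMap, and a grammar fit together in page markup. The IDs, grammar file name, and attribute names here are illustrative assumptions, not generated code:

    <speech:SemanticMap ID="TheSemanticMap" runat="server">
        <speech:SemanticItem ID="siLastName" runat="server" />
    </speech:SemanticMap>

    <speech:QA ID="AskLastName" runat="server">
        <Prompt InlinePrompt="What is your last name?" />
        <Reco>
            <Grammars>
                <speech:Grammar Src="Grammars/LastName.grxml" />
            </Grammars>
        </Reco>
        <Answers>
            <speech:Answer SemanticItem="siLastName" XPathTrigger="/SML/LastName" />
        </Answers>
    </speech:QA>

In practice these settings are usually configured through the designer and property windows rather than typed by hand.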
Speech Application Controls
Speech application controls are extensions of the basic speech controls that anticipate common user-interaction scenarios. Refer to Table 2.2 for a listing of the application controls included with the SASDK. For instance, the Date control is a speech application control that expands on the basic QA control; it is used to retrieve a date and allows for a wide range of input possibilities. Application controls can reduce development time because much of the user interaction is built directly into them. A short markup example follows the table.
Table 2.2 Speech Application Controls available in the Speech tab of the toolbox. These controls can reduce development time by building in typical user interactions.
Control Name | Description
ListSelector | Databound control that presents the user with a list of items and asks the user to select one.
DataTableNavigator | Databound control that the user navigates with commands such as Next, Previous, and Read.
AlphaDigit | Collects an alphanumeric string.
CreditCardDate | Collects a credit card expiration date (month and year); does not ensure that it is a future date.
CreditCardNumber | Collects a credit card number and type. Although it does not validate the number, it ensures that the number matches the format for the particular type of credit card.
Currency | Collects an amount in U.S. dollars that falls within a specified range.
Date | Collects either a complete date or one broken out into month, day, and year.
NaturalNumber | Collects a natural number that falls within a specified range.
Phone | Collects a U.S. phone number where the area code is three numeric digits, the number is seven numeric digits, and the extension is zero to five numeric digits.
SocialSecurityNumber | Collects a U.S. Social Security number.
YesNo | Collects a yes or no answer.
ZipCode | Collects a U.S. zip code where the zip code is five numeric digits and the extension is four numeric digits.
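As a quick illustration, an application control such as YesNo is dropped onto the page like any other server control and then configured through its properties. The markup below is a placeholder sketch; the property names that set its prompts and semantic bindings are omitted because they vary by control and are not given here:

    <speech:YesNo ID="ConfirmAnswer" runat="server" />
    <%-- The question prompt, confirmation behavior, and the semantic item
         that receives the answer are configured through the control's
         properties (names omitted; they vary by control). --%>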
Creating Custom Controls
If no control does everything you need, you have the option of creating a custom control. Custom controls allow you to expand on the functionality already available with the built-in speech controls. They rely on inheritance: a custom control derives from the ApplicationControl class and implements the IDtmf interface. The developer creates a project that is compiled into a separate DLL for each custom control.
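A skeleton for such a control might look like the following. Only the ApplicationControl class and IDtmf interface names come from the SASDK; the namespace, the use of CreateChildControls, and everything else are assumptions, and the IDtmf members are omitted, so treat this as a structural sketch rather than compilable sample code.

    // Structural sketch of a custom speech control. ApplicationControl and
    // IDtmf come from the SASDK; the namespace below and the use of
    // CreateChildControls are assumptions, and the IDtmf members are omitted.
    using Microsoft.Speech.Web.UI;   // assumed SASDK namespace

    namespace MyCompany.SpeechControls
    {
        public class ColorPickerControl : ApplicationControl // add IDtmf for touch-tone support
        {
            protected override void CreateChildControls()
            {
                base.CreateChildControls();
                // Build the child QA controls here and wire up the control's
                // built-in prompts and grammars, as the ColorChooserControl
                // sample does.
            }
        }
    }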
The Samples solution file, installed with the SASDK, includes a project titled ColorChooserControl. This project is installed by default in the C:\Program Files\Microsoft Speech Application SDK 1.0\Applications\Samples\ColorChooserControl directory and can serve as a template for any custom control you wish to create. The Color Chooser control is a complex control that consists of child QA controls used to prompt the user for a color and then confirm the selection. The grammar and prompts associated with the control are built directly in, and this particular control supports voice-only mode.
The ColorChooserControl manages the dialog flow with the user and demonstrates the considerations involved in building this type of control. It is an excellent starting point for anyone wanting to create custom controls.