Where Will Multimodal Applications Be Used?
Users will buy two general types of devices for accessing multimodal Web applications: point and speak devices and multimodal-enabled PCs. Additional devices are possible, but currently these appear the most promising.
Point and Speak Devices
Cell phones and PDAs are converging into a single device. Like cell phones, PDAs can connect wirelessly to a backend server. Like PDAs, cell phones now have small displays and the computational power to support a browser.
The combined PDA/cell phone device will likely support a new style of user interface called point and speak. Users will manipulate a stylus to point to a field on an electronic form displayed on the screen, and speak to enter a value into the field. The "pointing" will turn on the speech recognizer, which uses the grammar specific to that field to interpret the user's utterance and place the result into the field. These "point and speak" forms can be used for data entry (fill in the values of a patient history form), queries (specify the query parameters and constraints), and transactions (specify the amount to be paid, the items being ordered, and the required credit card information and user-verification criteria).
Speech Application Language Tags (SALT) was designed by the SALT Forum to add speech to markup languages that support GUIs, such as HTML. SALT tags enable the user to speak and listen to applications on point and speak devices.
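For example, a point and speak field might be marked up with SALT roughly as in the following sketch. The field name, grammar file, and event wiring are illustrative only, and the details of loading the SALT interpreter depend on the browser add-in:

<html xmlns:salt="http://www.saltforum.org/2002/SALT">
  <body>
    <form id="patientHistory">
      <!-- Pointing at the field with the stylus starts that field's recognizer -->
      <input id="allergy" name="allergy" type="text"
             onclick="allergyListen.Start()" />

      <!-- Each field has its own listen element with a field-specific grammar -->
      <salt:listen id="allergyListen">
        <salt:grammar src="allergies.grxml" />
        <!-- Copy the recognition result into the form field
             (assumes allergies.grxml tags its result as <allergy>) -->
        <salt:bind targetelement="allergy" value="//allergy" />
      </salt:listen>
    </form>
  </body>
</html>

The same pattern repeats for each field on the form, so pointing at a field activates only that field's grammar rather than one large grammar for the whole form.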
Multimodal PCs
The PC is inexpensive and has plenty of computing power for multimodal user interfaces. Peripherals such as pens, handwriting pads, microphones, and cameras are readily available and can be added to most PCs. Microsoft Internet Explorer will soon be SALT-enabled to support speech input and output. Applications that benefit from speech input on the PC include the following:
Data entry. Speak the value for each field rather than typing it with the keyboard. By using a portable microphone, the user can leave the "office position" (sitting in front of the PC, with one hand on the mouse and the other on the keyboard) and move around the office.
Information query. Speak a query, and review the results. Refine the query as necessary to home in on the right information.
Web browsing. Speak the names of the links rather than move the mouse to a hyperlink and click it. If the hyperlink is a graphic, insert a label next to the graphic that is easily pronounced. Conversay's Web browser supports these features very nicely (see the sketch following this list).
Window management. Manage the many windows on a PC screen by name, rather than hunting for and clicking on the desired window. Warnings and error messages can be presented by voice instead of cluttering the screen with message boxes. Windows managed by voice leave your hands free to enter data and requests into the various windows. In a sense, your voice becomes a "third hand."
Eyes busy, hands busy. Users can speak and listen to an application when their eyes, their hands, or both are busy (for example, a recipe application in which cooks listen to a recipe as they use their hands to prepare the dishes). Customers can assemble a toy while they listen to instructions spoken from a PC. In both cases, the users' hands and eyes are busy, yet they can still hear and speak to the computer.
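To illustrate the Web browsing case, the following sketch shows one way SALT tags could drive link selection by voice. This is not Conversay's implementation; the grammar file links.grxml, which would enumerate the spoken link names, is assumed:

<html xmlns:salt="http://www.saltforum.org/2002/SALT">
  <body onload="linkListen.Start()">
    <a href="news.html">News</a>
    <a href="weather.html">Weather</a>

    <!-- One listen for the whole page; its grammar lists the link names -->
    <salt:listen id="linkListen" onreco="followLink()">
      <salt:grammar src="links.grxml" />
    </salt:listen>

    <script type="text/javascript">
      function followLink() {
        // linkListen.text holds the recognized words, for example "weather"
        var spoken = linkListen.text.toLowerCase();
        var links = document.getElementsByTagName("a");
        for (var i = 0; i < links.length; i++) {
          if (links[i].innerText.toLowerCase() == spoken) {
            window.location.href = links[i].href;
            return;
          }
        }
        linkListen.Start();  // nothing matched; listen again
      }
    </script>
  </body>
</html>

A graphic hyperlink would be given a pronounceable text label, and that label, rather than the image, would appear in the grammar.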
Not all PC applications will benefit from speech. Speech increases noise pollution in the workplace, and some PC users do not want to be heard speaking to their computers. However, where speech adds value to an application, developers embed SALT tags into the PC application code. As with all applications involving speech, usability testing is necessary to refine the verbal prompts presented to the user, the grammars used by the speech recognizer to interpret what the user says, and the event handlers that help the user recover from speech recognition problems.
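The sketch below shows how those three pieces (prompts, a grammar, and event handlers) fit together for a single payment field. The prompt wording, grammar file, and retry strategy are placeholders of the kind that usability testing would refine:

<html xmlns:salt="http://www.saltforum.org/2002/SALT">
  <body onload="askAmount.Start()">
    <input id="amount" type="text" />

    <!-- Initial prompt; when it finishes, start listening -->
    <salt:prompt id="askAmount" oncomplete="amountListen.Start()">
      How much would you like to pay?
    </salt:prompt>

    <!-- Reprompt used when recognition fails or the user says nothing -->
    <salt:prompt id="retryAmount" oncomplete="amountListen.Start()">
      Sorry, I did not catch that. Please say an amount, such as fifty dollars.
    </salt:prompt>

    <!-- Event handlers recover from misrecognition and silence -->
    <salt:listen id="amountListen"
                 onnoreco="retryAmount.Start()"
                 onsilence="retryAmount.Start()">
      <salt:grammar src="currency.grxml" />
      <!-- Assumes currency.grxml tags its result as <amount> -->
      <salt:bind targetelement="amount" value="//amount" />
    </salt:listen>
  </body>
</html>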