Text-to-Speech Resource
Text-to-speech (TTS) technology parses a text stream into pronounceable phonemes and synthesizes human speech from them. The phonetic alphabet is derived from studio recordings of the individual sounds; the TTS engine uses specialized algorithms to combine these sounds into words and phrases that a listener can understand. Some vendors implement the TTS engine in firmware on a speech card. Others use a host-based solution, which allows high throughput and makes it easier to integrate different engines into the platform.
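The two stages described above (text to phonemes, then phonemes to concatenated recorded sounds) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the two-word `LEXICON`, the ARPAbet-style phoneme symbols, and the placeholder byte strings standing in for studio recordings. A real engine would also handle out-of-vocabulary words, prosody, and smoothing between units.

```python
from typing import Dict, List

# Hypothetical pronunciation lexicon (ARPAbet-style symbols, stress omitted).
LEXICON: Dict[str, List[str]] = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def text_to_phonemes(text: str) -> List[str]:
    """Parse text into a flat phoneme sequence using the toy lexicon."""
    phonemes: List[str] = []
    for token in text.lower().split():
        word = token.strip(".,!?;:")
        if not word:
            continue  # token was punctuation only
        if word not in LEXICON:
            raise KeyError(f"no pronunciation for {word!r}")
        phonemes.extend(LEXICON[word])
    return phonemes


def synthesize(phonemes: List[str], units: Dict[str, bytes]) -> bytes:
    """Concatenate one recorded unit per phoneme into a single audio buffer."""
    return b"".join(units[p] for p in phonemes)


# Placeholder "recordings": each unit is just the phoneme name as bytes.
UNITS: Dict[str, bytes] = {
    p: p.encode("ascii") for pron in LEXICON.values() for p in pron
}

if __name__ == "__main__":
    seq = text_to_phonemes("Hello, world!")
    print(seq)
    print(synthesize(seq, UNITS))
```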
A TTS playback request is received at one of the TTS channels, and the text is streamed into a buffer on that channel. The buffering allows long prompts to be played as a continuous stream. As the stream is processed into its phonetic representation, the resulting audio is presented to the end user through the telephony resource. Bargein, the capability for the user to interrupt a playing prompt through either dual-tone multifrequency (DTMF) input or spoken input, plays an important role at this point. The goal of an efficient bargein scheme is to minimize the time between the end user's request that the prompt cease playing and the moment the system actually stops it. Bargein times on the order of a few hundred milliseconds are not uncommon.
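The buffering and bargein behavior can be sketched as a small simulation. This is not any vendor's API: the `TtsChannel` class, the 20 ms frame size, and the ten-frames-per-chunk figure are all assumptions chosen to make the timing easy to follow. The point it illustrates is that bargein latency depends on how often the playback loop checks for an interrupt.

```python
from collections import deque
from typing import Deque, Iterable, List, Optional


class TtsChannel:
    """Hypothetical TTS channel that buffers text chunks and plays them in order."""

    def __init__(self, frame_ms: int = 20) -> None:
        self.frame_ms = frame_ms            # assumed duration of one audio frame
        self.buffer: Deque[str] = deque()   # text chunks waiting to be played
        self.played: List[str] = []         # chunks fully played to the caller
        self.clock_ms = 0                   # simulated playback clock

    def stream_text(self, chunks: Iterable[str]) -> None:
        """Append incoming text to the channel buffer."""
        self.buffer.extend(chunks)

    def play(self, bargein_at_ms: Optional[int] = None,
             frames_per_chunk: int = 10) -> Optional[int]:
        """Play the buffer frame by frame.

        Returns the bargein latency in ms (time from the caller's request
        to the moment playback stops), or None if the prompt finished.
        """
        while self.buffer:
            chunk = self.buffer.popleft()
            for _ in range(frames_per_chunk):
                # The interrupt is checked once per frame, before playing it.
                if bargein_at_ms is not None and self.clock_ms >= bargein_at_ms:
                    self.buffer.clear()     # discard the rest of the prompt
                    return self.clock_ms - bargein_at_ms
                self.clock_ms += self.frame_ms
            self.played.append(chunk)
        return None


if __name__ == "__main__":
    channel = TtsChannel()
    channel.stream_text(["Welcome.", "You have mail.", "First message.", "Goodbye."])
    print(channel.play(bargein_at_ms=450), channel.played)
```

In this sketch the latency can never exceed one frame, because the only delay modeled is the polling interval. A real platform adds signal detection, inter-process messaging, and audio buffers that are already queued on the telephony hardware, which is where the few hundred milliseconds mentioned above come from.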
Because most voice portal architectures support DTMF through the telephony resource, passing the TTS output through this interface lends itself to an efficient bargein mechanism. Spoken bargein, however, will in most cases be less efficient than DTMF bargein. It requires an active automatic speech recognition (ASR) resource to be bound to the same port as the TTS resource. Binding to the same port allows utterances to be recognized "immediately" while the prompt is being played. Once an utterance has been validated, the speech processor can be instructed to halt the prompt.
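The validation step for spoken bargein can be sketched as follows. The `Utterance` record, the grammar-membership check, and the 0.5 confidence threshold are assumptions for illustration; actual ASR resources expose their own result formats and validation rules. The sketch shows why spoken bargein tends to be slower than DTMF: the prompt keeps playing until a recognition result is both available and accepted.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass
class Utterance:
    """Hypothetical ASR result delivered while a prompt is playing."""
    text: str
    confidence: float       # recognizer score in [0, 1]
    heard_at_ms: int        # when the caller started speaking
    recognized_at_ms: int   # when the ASR result became available


def first_valid_utterance(utterances: Iterable[Utterance],
                          grammar: Set[str],
                          min_confidence: float = 0.5) -> Optional[Utterance]:
    """Return the first utterance that should halt the prompt, if any.

    Noise, out-of-grammar speech, and low-confidence results are ignored,
    so the prompt keeps playing through them.
    """
    for utt in utterances:
        if utt.text in grammar and utt.confidence >= min_confidence:
            return utt
    return None


def bargein_latency_ms(utt: Utterance) -> int:
    """Time the caller waited between speaking and the halt decision."""
    return utt.recognized_at_ms - utt.heard_at_ms


if __name__ == "__main__":
    grammar = {"stop", "main menu"}
    events = [
        Utterance("(cough)", 0.90, 1000, 1150),  # not in grammar
        Utterance("stop", 0.30, 2000, 2200),     # too uncertain
        Utterance("stop", 0.92, 3000, 3250),     # validated
    ]
    hit = first_valid_utterance(events, grammar)
    if hit is not None:
        print(hit.text, bargein_latency_ms(hit))
```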
Just as bargein capability is an important quality-of-service determinant for voice portal services, the quality of the TTS prompts a user hears is a large factor in that user's experience with a voice portal application. Prompts can be long or short, but if they sound unpleasant, the service provider runs the risk of losing that customer.
Design Considerations
When you're working with TTS, keep the following design considerations in mind:
TTS playback quality varies widely across vendors, so you should ask the voice portal host for TTS output samples in every language in which your application will be deployed. As discussed earlier, the quality that end users hear affects their overall experience with the VoiceXML application.
The quality of the TTS can be fine-tuned through vendor-proprietary speech and/or text controls. However, because of the abstraction afforded by VoiceXML, the standard markup does not map one-to-one onto all the controls that would make the TTS output most acceptable to end users. Therefore, the voice portal vendor should be able to add proprietary extensions to its implementation of the VoiceXML interpreter.
Certain applications, such as email readers, will require a text preprocessor to properly format text prior to sending it to the TTS resource. When you design such an application, you should investigate whether this capability can be added to the voice portal under consideration.
There should be no application "limit" on the length of the string passed to the TTS resource. Testing may be needed to determine whether the vendor's VoiceXML engine can handle unusually large prompts, such as those found in newsreader applications.
Bargein support, whether DTMF or spoken, should be given high consideration when choosing a platform. Users expect bargein response times of a few hundred milliseconds or less.
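Two of the considerations above, the text preprocessor for email readers and the handling of very long prompts, can be illustrated with a minimal sketch. The function names, the abbreviation table, the "a web link" substitution, and the 200-character chunk size are all assumptions for illustration, not features of any particular voice portal. The first function cleans an email body before synthesis; the second splits long text at sentence boundaries, which is also a convenient way to generate test input of increasing size.

```python
import re
from typing import List

# Hypothetical expansions; a real deployment would maintain its own table.
ABBREVIATIONS = {
    "fyi": "for your information",
    "asap": "as soon as possible",
}

URL_RE = re.compile(r"https?://\S+")
WORD_RE = re.compile(r"[A-Za-z]+")


def preprocess_email(body: str) -> str:
    """Format an email body so it reads sensibly when spoken."""
    # Drop quoted reply lines, which begin with ">".
    lines = [ln for ln in body.splitlines() if not ln.lstrip().startswith(">")]
    text = " ".join(lines)
    # URLs are unpleasant to hear spelled out; replace them with a phrase.
    # Note: punctuation directly after a URL is absorbed into the match.
    text = URL_RE.sub("a web link", text)
    # Expand known abbreviations, leaving all other words untouched.
    text = WORD_RE.sub(
        lambda m: ABBREVIATIONS.get(m.group(0).lower(), m.group(0)), text
    )
    # Collapse runs of whitespace left over from the steps above.
    return re.sub(r"\s+", " ", text).strip()


def chunk_prompt(text: str, max_chars: int = 200) -> List[str]:
    """Split text into chunks of whole sentences, each at most max_chars
    long unless a single sentence exceeds that size on its own."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}" if current else sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    raw = "FYI, see http://example.com/x?y=1 ASAP.\n> old quoted text\nThanks"
    print(preprocess_email(raw))
    print(chunk_prompt("One. Two. Three.", max_chars=9))
```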