Automatic Speech Recognition Resource
Automatic Speech Recognition (ASR) is the process by which a computer system receives human speech input and returns the words, phrases, or digits corresponding to that input (see Listing 1). Speaker-independent ASR allows a computer to recognize words spoken by an arbitrary set of speakers; the "independence" lies in the fact that the speech models for a particular language are typically based on a cross-section of native speakers of that language.
Listing 1. Code Snippet for the Speech Recognition Scenario
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- This is a simple example of multiple active grammars that will
       demonstrate the Client-Server ASR Architecture. -->
  <form>
    <prompt bargein="false">
      Hello, and welcome to the Amazon.com online book and music store.
      What would you like to buy, books, music?
    </prompt>
    <help>Say books or music.</help>
    <!-- The field name doubles as an ECMAScript variable, so it uses an
         underscore rather than a hyphen. -->
    <field name="item_choice">
      <grammar type="application/x-jsgf"> books | music </grammar>
      <filled>
        <prompt> I believe I heard you say <value expr="item_choice"/>. </prompt>
      </filled>
    </field>
  </form>
  <!-- At this point in the code, we will have activated the "help" grammar
       and the inline grammar corresponding to "'books' or 'music'" -->
  . . .
</vxml>
The following steps walk through a simplified speech-recognition scenario based on Listing 1:
1. The prompt, "Hello, and welcome to the Amazon.com online book and music store. What would you like to buy, books, music?" is rendered as speech by the TTS resource.
2. The inline JSpeech Grammar Format (JSGF) grammar, "books | music", causes a recognition client to create a grammar object representing that grammar. This grammar defines the vocabulary for which this recognition instance can return valid results. If the caller speaks a word or phrase other than "books" or "music", the recognizer returns a "nomatch", or recognition failure.
3. The reference to the grammar object is passed to the recognition server for compilation and optimization.
4. After the caller speaks (or noise above a certain threshold is received), the conditioned audio signal is passed to the recognition server. The conditioning might take the form of noise filtering or echo cancellation.
5. The recognition server receives the audio input and, because it has a reference to the current recognition request, tries to match the words "books" or "music."
6. Assuming that the spoken word was "books" and the recognition server successfully made a match, the word "books" is returned to the interpreter thread instance.
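The grammar and matching steps of this scenario can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not how any particular recognizer works: the names `Grammar`, `compile_grammar`, `exceeds_threshold`, and `recognize` are hypothetical, the grammar handling covers only flat alternatives such as "books | music" rather than full JSGF, and a text string stands in for the decoded audio.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Grammar:
    """Hypothetical grammar object: the set of phrases that may be returned."""
    phrases: frozenset


def compile_grammar(source: str) -> Grammar:
    """Build a Grammar from flat alternatives such as "books | music".

    Assumption: only top-level alternatives are handled. Real JSGF also has
    rule names, grouping, weights, and optional items.
    """
    phrases = frozenset(
        " ".join(alt.lower().split()) for alt in source.split("|") if alt.strip()
    )
    return Grammar(phrases)


def exceeds_threshold(samples: Sequence[float], threshold: float = 0.1) -> bool:
    """Crude energy gate: True when the RMS level reaches the threshold.

    Stands in for the "noise above a certain threshold" check; real systems
    use more elaborate endpoint detection.
    """
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms >= threshold


def recognize(grammar: Grammar, hypothesis: str) -> Optional[str]:
    """Return the matched phrase, or None to signal a "nomatch".

    `hypothesis` is a placeholder for the engine's best guess at the words;
    a real recognizer scores audio against acoustic models.
    """
    normalized = " ".join(hypothesis.lower().split())
    return normalized if normalized in grammar.phrases else None


grammar = compile_grammar("books | music")
print(recognize(grammar, "Books"))   # books
print(recognize(grammar, "videos"))  # None
```

Returning `None` for an out-of-grammar utterance mirrors the "nomatch" result described in the steps: the recognizer can only return phrases that the active grammar allows.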
Telephony-grade engines are increasingly in demand as the need for high-throughput, multithreaded processing capabilities grows. Speech vendors, such as Nuance, have implemented a client/server ASR architecture that allows the multithreaded distribution of recognition requests across one or more recognition servers.
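To make the idea of distributing recognition requests concrete, here is a minimal Python sketch in which several caller threads share a client that hands each request to one of two recognition servers. Everything in it is an assumption made for illustration: the `RecognitionServer` and `RecognitionClient` classes, the server names, and the round-robin policy are hypothetical and do not describe Nuance's or any other vendor's actual design, and text strings again stand in for audio.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle


class RecognitionServer:
    """Hypothetical stand-in for one recognition server."""

    def __init__(self, name):
        self.name = name

    def recognize(self, phrases, hypothesis):
        # Return the matched word, or None for a "nomatch".
        word = hypothesis.strip().lower()
        return word if word in phrases else None


class RecognitionClient:
    """Hands each request to the next server in round-robin order."""

    def __init__(self, servers):
        self._servers = cycle(servers)
        self._lock = threading.Lock()  # protects the shared iterator

    def submit(self, phrases, hypothesis):
        with self._lock:
            server = next(self._servers)
        return server.name, server.recognize(phrases, hypothesis)


servers = [RecognitionServer("asr-1"), RecognitionServer("asr-2")]
client = RecognitionClient(servers)
phrases = {"books", "music"}
calls = ["books", "music", "videos", "Books"]

# Each worker thread plays the role of one interpreter thread instance.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda h: client.submit(phrases, h), calls))

for call, (server_name, result) in zip(calls, results):
    print(call, "->", server_name, result)
```

Which server handles which call depends on thread scheduling, so only the recognition results are deterministic. Round-robin is the simplest policy to show; a production system would more plausibly weigh server load or availability.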