9.9 Multimodal Techniques
While discussing the various system control techniques, we already mentioned that some techniques could be combined with others. In this section, we go deeper into the underlying principles and effects of combining techniques using different input modalities—multimodal techniques. Such techniques connect multiple input streams: users switch between different techniques while interacting with the system (LaViola et al. 2014). In certain situations, the use of multimodal system control techniques can significantly increase the effectiveness of system control tasks. However, it may also have adverse effects when basic principles are not considered. Here, we will shed light on different aspects of multimodal techniques that will help the developer make appropriate design choices for multimodal 3D UIs.
9.9.1 Potential Advantages
Researchers have identified several advantages of using multimodal system control techniques (mostly in the domain of 2D GUIs) that can also apply to 3D UIs:
Decoupling: Using an input channel that differs from the main input channel used for interaction with the environment can decrease user cognitive load. If users do not have to switch between manipulation and system control actions, they can keep their attention focused on their main activity.
Error reduction and correction: The use of multiple input channels can be very effective when the input is ambiguous or noisy, especially with recognition-based input like speech or gestures. The combination of input from several channels can significantly increase recognition rates (Oviatt 1999; Oviatt and Cohen 2000) and disambiguation in 3D UIs (Kaiser et al. 2003).
Flexibility and complementary behavior: Control is more flexible when users can use multiple input channels to perform the same task. In addition, different modalities can be used in a complementary way based on the perceptual structure of the task (Grasso et al. 1998; Jacob and Sibert 1992).
Control of mental resources: Multimodal interaction can be used to reduce cognitive load (Rosenfeld et al. 2001); on the other hand, it may also lead to less effective interaction because multiple mental resources need to be accessed simultaneously. For example, as Shneiderman (2000) observes, the part of the human brain used for speaking and listening is also the part used for problem solving—speaking consumes precious cognitive resources.
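The error-reduction advantage listed above can be illustrated with a minimal sketch in which two recognizers each return a ranked list of hypotheses and the system picks the best pair that makes sense together. This is a deliberately simplified illustration, not the method used in the cited studies: the function `fuse_nbest`, the compatibility table, and the confidence scores are all hypothetical.

```python
# Hypothetical sketch: pick the best *compatible* speech/gesture pair.
# Scores are made-up recognizer confidences in [0, 1].

def fuse_nbest(speech, gesture, compatible):
    """Return (speech_hyp, gesture_hyp, joint_score) for the highest-scoring
    compatible pair, or None if no pair is compatible."""
    best = None
    for s_hyp, s_score in speech:
        for g_hyp, g_score in gesture:
            if not compatible(s_hyp, g_hyp):
                continue
            joint = s_score * g_score  # assumes the channels err independently
            if best is None or joint > best[2]:
                best = (s_hyp, g_hyp, joint)
    return best


# Assumed task grammar: which gesture each spoken command can go with.
ALLOWED = {"delete": {"point"}, "rotate": {"twist"}}

def is_compatible(command, gesture):
    return gesture in ALLOWED.get(command, set())


speech_nbest = [("delete", 0.5), ("rotate", 0.4)]   # speech alone prefers "delete"
gesture_nbest = [("twist", 0.6), ("point", 0.4)]    # gesture alone prefers "twist"

result = fuse_nbest(speech_nbest, gesture_nbest, is_compatible)
```

In this made-up example the top speech hypothesis ("delete") is incompatible with the top gesture hypothesis ("twist"), so the joint ranking settles on "rotate" with "twist": each channel's evidence corrects the other.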
Probably the best-known multimodal technique is “put-that-there” (Bolt 1980). Using this technique, users perform manipulation actions by combining pointing with speech. Many others have used the same combination of gesture and speech (e.g., Figure 9.21), in which speech specifies the command and gestures specify the spatial parameters of the command, all in one fluid action. In some cases, speech can be used to disambiguate a gesture, and vice versa.
Figure 9.21 A car wheel is selected, rotated, and moved to its correct position using voice and gestures. (Photographs courtesy of Marc Eric Latoschik, AI & VR Lab, University of Bielefeld; Latoschik 2001)
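One way to picture the speech-plus-pointing combination is as a time alignment problem: each deictic word ("that", "there") is resolved to whatever the user was pointing at when the word was spoken. The sketch below makes that assumption explicit. It is not Bolt's implementation, and the class names, the `max_gap` threshold, and the object identifiers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SpeechToken:
    word: str
    time: float      # seconds since the start of the utterance

@dataclass
class PointingSample:
    target: str      # id of the object or location hit by the pointing ray
    time: float

DEICTIC_WORDS = {"that", "there"}

def nearest_pointing(samples, t, max_gap):
    """Return the sample closest in time to t, or None if none is within max_gap."""
    if not samples:
        return None
    best = min(samples, key=lambda s: abs(s.time - t))
    return best if abs(best.time - t) <= max_gap else None

def fuse(tokens, samples, max_gap=0.5):
    """Treat the first word as the command and resolve each deictic word
    against the pointing stream. Returns None if a reference is unresolved."""
    if not tokens:
        return None
    command = {"action": tokens[0].word, "args": []}
    for tok in tokens[1:]:
        if tok.word in DEICTIC_WORDS:
            hit = nearest_pointing(samples, tok.time, max_gap)
            if hit is None:
                return None
            command["args"].append(hit.target)
    return command


utterance = [SpeechToken("put", 0.0), SpeechToken("that", 0.4), SpeechToken("there", 1.2)]
pointing = [PointingSample("wheel", 0.45), PointingSample("axle", 1.25)]
command = fuse(utterance, pointing)
```

Returning `None` for an unresolved reference leaves room for the application to ask the user to repeat the gesture rather than guess a target.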
Another possible technique is to combine gesture-based techniques with traditional menus, as in the “marking menus” technique. This means that novice users can select a command from a visual menu, while more experienced users can access commands directly via gestural input. This redundancy is similar to the use of keyboard shortcuts in desktop interfaces.
9.9.2 Design Principles
Designing multimodal system control techniques can be a complex undertaking. At the level of a single technique, the design guidelines for the techniques discussed in the previous sections still apply. However, combining techniques brings several new issues into play.
First, the combination of modalities will depend on the task structure: How can you match a specific task to a specific modality, and how does the user switch between modalities? Switching may affect the flow of action in an application, and disturbances in that flow may lead to poor performance and lower user acceptance. A good way to verify flow of action is to perform detailed logging that identifies how much time is spent on specific subparts of the task chain and to compare this with single-technique (non-multimodal) performance. When combining two techniques, it can also make sense to do multimapping of tasks to modalities, that is, to allow users to perform a specific task using multiple methods.
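The logging approach described above can be sketched as follows. This is a minimal illustration under stated assumptions: the subtask names, the timestamps, and the baseline figures are invented, and a real study would record timestamps from the running application and aggregate over many trials and participants.

```python
from collections import defaultdict

class TaskLog:
    """Records when each subtask starts. A subtask lasts until the next mark."""

    def __init__(self):
        self._events = []  # list of (timestamp_seconds, subtask_name)

    def mark(self, timestamp, subtask):
        self._events.append((timestamp, subtask))

    def durations(self):
        """Total time per subtask. The final mark only closes the previous one."""
        totals = defaultdict(float)
        for (start, name), (end, _) in zip(self._events, self._events[1:]):
            totals[name] += end - start
        return dict(totals)

def compare(multimodal, baseline):
    """Per-subtask difference in seconds (multimodal minus baseline).
    Positive values mean the multimodal condition was slower."""
    names = set(multimodal) | set(baseline)
    return {n: multimodal.get(n, 0.0) - baseline.get(n, 0.0) for n in names}


# Hypothetical trial: select by pointing, switch modality, issue a voice command.
log = TaskLog()
log.mark(0.0, "select")
log.mark(1.5, "switch")
log.mark(2.0, "command")
log.mark(3.0, "end")

multimodal_times = log.durations()
baseline_times = {"select": 1.0, "command": 2.5}  # assumed single-technique timings
delta = compare(multimodal_times, baseline_times)
```

Logging the modality switch as its own subtask makes its cost visible instead of hiding it inside the neighboring steps.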
Second, while multimodal techniques may free cognitive resources, this is not necessarily the case for all implementations. Cognitive load should thus be evaluated, either through self-assessment by a user (which only provides general indications) or through additional correlation with physiological measures that can assess stress or even brain activity. See Chapter 3, “Human Factors Fundamentals,” section 3.4.3, for more information. In direct relation to cognitive load, attention is also an issue to consider: does the user need to pay much attention to using the combined technique (or accompanying visual or non-visual elements), or can the user remain focused on the main task?
9.9.3 Practical Application
Using multimodal techniques can be useful in many situations. Complex applications can benefit from the complementary nature of multimodal techniques, allowing for more flexible input and potentially reducing errors. The reduction of errors is especially important for applications with limited or no time for user learning. For example, consider a public space installation: by supporting multiple modes of input, discovering the underlying functionality may become easier for users.
Also, some modalities may be easier for certain classes of users: an elderly user may have difficulty with precise motor input but may be able to control an application by voice instead. This points to the general complementary behavior of multimodal techniques: when one input channel is blocked, either due to external factors or user abilities, another channel can be used. For instance, consider bright daylight limiting text legibility in an AR application or environmental noise limiting voice recognition. Being able to perform the task using another input channel can drastically increase performance.
Finally, multimodal techniques are applicable to scenarios that mimic natural behavior. Both in realistic games and in applications that use a natural interaction approach, combinations of input modalities that mirror the ways we interact with other humans can improve the user experience.