The Handbook of Multisensory Processes
Spatial and Temporal Constraints on Audiovisual Speech Perception

Introduction

Grasping the full meaning of a message during face-to-face conversation requires the perception of diverse auditory and visual information. When people listen to someone speaking, their processing of the utterances extends beyond the linguistic analysis of the words to include the perception of visible speech movements, facial expression, body posture, manual gestures, and the tone and timing of the voice. This extensive audiovisual information is produced in parallel by the talker and must be processed and integrated by the listener in order to understand the talker's full intent. Thus, human communication in its most natural form is cross-modal and multidimensional. In this chapter we focus on one part of this complex sensory processing, the audiovisual perception of speech.

It has been known for many years that the information for speech is both auditory and visual. With the exception of some rare individuals who can identify most of what is said from speech-reading alone (e.g., Bernstein, Demorest, & Tucker, 2000), most people are quite limited in their ability to identify speech from visual-only signals. A more pervasive phenomenon is the ability of visible speech to enhance the intelligibility of auditory speech in noise. When a talker's face is visible in a noisy environment, auditory speech is significantly more intelligible than when the auditory signal is presented alone (Sumby & Pollack, 1954). Visible speech can even alter the perception of perfectly audible speech sounds when the visual and auditory speech stimuli are mismatched, as demonstrated by the McGurk effect (McGurk & MacDonald, 1976).

In this chapter we address the audiovisual integration of information for speech. We first summarize the spatial and temporal features of normal speech communication and the conditions that characterize face-to-face communication. Next we discuss the spatial and temporal limits of audiovisual speech integration. Finally, we summarize our production-based animation projects that provide audiovisual stimulus control for the examination of speech production and perception as coordinated processes.

Before we consider the existing data, we present the different experimental tasks that are used in visual and audiovisual speech perception research. Data from three types of tasks are discussed in this chapter: speech-reading, speech-in-noise, and McGurk effect tasks. Speech-reading in its strictest definition involves visual-only (silent) speech presentation, as would be the case when a person watches television with the sound turned off. “Speech-reading” is also used to refer to the speech perception of individuals with profound hearing impairments. However, the use of the term in this context can be misleading because in many cases the visual signals of speech are being used to augment residual hearing. In such cases, speech perception is more akin to understanding speech in noise than to silent speech-reading, because the visual speech complements a weak or distorted auditory signal. The final task concerns the audiovisual illusion known as the McGurk effect (McGurk & MacDonald, 1976), in which a visual consonant is dubbed onto a different auditory consonant and the incongruent visual stimulus modifies the auditory percept. For example, when a visual /g/ is dubbed onto an auditory /b/, subjects frequently hear a different consonant, such as /d/ or /th/. Of course, subjects would clearly perceive the /b/ if they closed their eyes.

Although these three tasks clearly involve related information and similar information-processing skills, they are not equivalent. All of the tasks exhibit considerable individual differences that are not strongly correlated within subjects (Munhall, 2002; Watson, Qiu, Chamberlain, & Li, 1996; cf. MacLeod & Summerfield, 1987). In particular, performance on silent speech-reading does not correlate with performance on either perceiving sentences in noise or McGurk effect tasks. Speech-reading proficiency correlates with a range of cognitive and perceptual abilities that are not necessarily related to speech (Rönnberg et al., 1999; cf. Summerfield, 1992). There are other indications that the three visual tasks employ different perceptual and cognitive components. Speech-reading abilities are very limited in the general population.1 The visual enhancement of the perception of speech in noise, on the other hand, is a more general and widespread phenomenon. There are also hints that the McGurk effect should be distinguished from the perception of congruent audiovisual speech (Jordan & Sergeant, 2000). For instance, the two tasks show different thresholds for the influence of degraded visual information (cf. Massaro, 1998).

A second task factor complicates the literature on audiovisual speech: the stimulus corpus. In some tasks, such as most McGurk effect studies, subjects choose answers from a small set of nonsense syllables. At the other extreme, subjects identify words in sentences, with the words drawn from the full corpus of the language. Experiments that use tasks from different parts of this range invoke different strategies in subjects and engage different parts of the linguistic system. Not surprisingly, individual differences in the audiovisual recognition of nonsense syllables do not correlate with individual differences in the audiovisual perception of sentence materials (Grant & Seitz, 1998).

Unfortunately, the issue of task differences in audiovisual speech perception cannot be resolved without data from direct experimental comparisons. Thus, for the purposes of this chapter, we review data from all three tasks together, taking care to indicate the experimental task in each case and to caution that the conclusions may not generalize across tasks.

 