Speechreading (lipreading, visual speech perception), a form of information processing, is defined by Boothroyd (1988) as a “process of perceiving spoken language using vision as the sole source of sensory evidence” (p. 77). Speechreading, a natural process in everyday communication, is especially helpful when communicating in noisy and reverberant conditions because facial motion in speech production may augment or replace degraded auditory information (Erber, 1969). Also, visual cues have been shown to influence speech perception in infants with normal hearing (Kuhl and Meltzoff, 1982), and speech perception phenomena, such as the McGurk effect (MacDonald and McGurk, 1978), demonstrate the influence of vision on auditory speech perception.
To understand language, the speechreader directs attention to, extracts, and uses linguistically relevant information from a talker's face movements, facial expressions, and body gestures. This information, which may vary within and across talkers (for a review, see Kricos, 1996), is integrated with other available sensory cues, such as auditory cues, and with knowledge about speech production and language in general to make sense of what is seen. However, the visual information may be ambiguous because many sounds look alike on the lips, are hidden in the mouth, or are co-articulated during speech production. In addition, expectations about linguistic context may influence understanding. Nevertheless, some individuals are expert speechreaders who score more than 80% correct on words in unrelated sentences and demonstrate enhanced visual phonetic perception (Bernstein, Demorest, and Tucker, 2000). Attempts to relate speechreading proficiency to other sensory, perceptual, and cognitive functions, including neurophysiological responsiveness, have met with limited success (for a review, see Summerfield, 1992).
From a historical perspective, speechreading was initially developed in Europe as a method to teach speech production to young children with hearing loss. Until the 1890s it was limited to children and was characterized by a vision-only (unisensory) approach. Speechreading training was based on analytic methods, which encouraged perceivers to analyze mouth position to recognize sounds, words, and sentences, or synthetic methods, which encouraged perceivers to grasp a speaker's whole meaning (Gestalt). O'Neill and Oyer (1961) reviewed several early distinctive methods that were adopted in America: Bruhn's method (characterized by syllable drill and close observation of lip movements), Nitchie's method (which shifted from an analytical to a synthetic method), Kinzie's method (Bruhn's classification of sounds plus Nitchie's basic psychological ideas), the Jena method (kinesthetic and visual cues), and film techniques (Mason's visual hearing, Markovin and Moore's contextual systemic approach). Gagné (1994) reviewed present-day approaches: multimodal speech perception (integration of available auditory cues with those from vision and other modalities), computer-based activities (interactive learning using full-motion video), and conversational contexts (question-answer approach, effective communication strategies, and training for talkers with normal hearing to improve communication behavior).
When indicated, speechreading training is included in comprehensive programs of aural rehabilitation. At best, post-treatment gains from speechreading training are modest, in the range of approximately 15%. Unfortunately, data on improvements related to visual speech perception training are limited, and little is known about the efficacy of various approaches. However, some individuals demonstrate significant gains. Walden et al. (1977) reported that, following practice, consonants are recognized in terms of a larger number of visually distinctive groups of phonemes (visemes). The identification responses after practice suggest that the distinctiveness of visual phonetic cues related to place-of-articulation information is increased. Although it is not clear what factors account for such improvements in performance, these results provide evidence for instances of learning in which perception is modified. Other studies show that results are variable, both in performance on speechreading tasks and in gains related to learning programs. Improvements observed by Massaro, Cohen, and Gesi (1993) suggest that repeated testing experience may be as beneficial as structured training. Initial changes may be due to nonsensory factors such as increased familiarity with the task or improved viewing strategies. In contrast, Bernstein et al. (1991) suggest that speechreaders learn the visual phonetic characteristics of specific talkers after long periods of practice. Treatment efficacy studies may be enhanced by enrolling larger numbers of participants, specifying training methods, using separate materials for training and testing, evaluating asymptotic performance and long-term effects, determining whether the effects of the intervention generalize to nontherapy situations, and designing studies to control for factors such as participants' motivation and test-taking behaviors, as well as the personal attention directed toward participants (for a review, see Gagné, 1994; Walden and Grant, 1993).
Current research has also focused on visual speech perception performance in psychophysical experiments. One research theme has centered on determining what regions of the face contain critical motion that is used in visual speech perception. Subjective comments from expert lipreaders suggest that movement in the cheek areas may aid lipreading. Data from Lansing and McConkie (1994) illustrate that eye gaze may shift from the mouth to the cheeks, chin, or jaw during lipreading. Results from Greenberg and Bode (1968) support the usefulness of the entire face for consonant recognition. In contrast, results from Ijsseldijk (1992) and Marassa and Lansing (1995) indicate that information from the lips and mouth region alone is sufficient for word recognition. Massaro (1998) reports that some individuals can discriminate among a small set of test syllables without directly gazing at the mouth of the talker, and Preminger et al. (1998) demonstrate that 70% of visemes in /a/ and /aw/ vowel contexts can be recognized when the mouth is masked. These diverse research findings underscore the presence of useful observable visual cues for spoken language at the mouth and in other face regions.
Eye-monitoring technology may be useful in understanding the role of visual processes in speechreading (Lansing and McConkie, 1994). It provides information about on-line processing of visual information and the tendency for perceivers to direct their eyes to regions of interest on a talker's face. By moving the eyes, a perceiver can take advantage of the densely packed, highly specialized cone receptor cells in the fovea to inspect visual detail (Hallett, 1986). Eye monitoring has been used to study a variety of cognitive tasks, such as reading, picture perception, and face recognition, and to study human-computer interaction (for a review, see Rayner, 1984). The basic data obtained from eye monitoring reveal sequences of periods associated with perception in which the eye is relatively stable (fixations) and high-velocity jumps in eye position (saccades) during which perception is inhibited. Distributions of saccadic information (“where” decisions) are quantified in terms of length and direction, and distributions of fixations (“when” decisions) are quantified in terms of duration and location. Experiments are designed to evaluate the variance in these measures that is associated with cognitive processes. For example, distributions of fixation duration and of saccade length differ across cognitive tasks such as reading versus picture perception. Various types of instrumentation are available for eye-monitoring research, ranging from systems that require direct physical contact with the eyes to camera-based video systems that determine eye rotation from changes in the location of landmarks such as the center of the pupil or corneal reflections, free of error induced by translation related to head movement (for a review, see Young and Sheena, 1975). Factors such as cost, accuracy, ease of calibration, response mode, and the demands placed on participants by the experimental task must be considered in selecting an appropriate eye-monitoring system. An example of a system used in speechreading research is shown in Figure 1. The system is used to record the eye movements of the perceiver and to obtain a detailed record of the sequence and duration of fixations. A scan plot and sample record of eye movements are shown in Figure 2. Simultaneously, measurements are made of the accuracy of perception, efficiency of processing, or judgments of stimulus difficulty. For interpretation, the eye movement records are linked to the spatial and temporal characteristics of face motion for each video frame or speech event (e.g., lips opening).
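The separation of raw gaze samples into fixations and saccades is commonly based on eye-movement velocity. The sketch below (in Python) is an illustrative outline of that kind of processing, not the software of any particular eye-monitoring system; the 250 Hz sampling rate echoes the prototype described in Figure 1, while the 30 deg/s velocity cutoff and the synthetic gaze record are assumed values chosen for the example.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

SAMPLE_RATE_HZ = 250        # sampling rate matching the prototype in Figure 1
VELOCITY_THRESHOLD = 30.0   # deg/s; assumed cutoff separating saccades from fixations


@dataclass
class Fixation:
    start_s: float     # onset time in seconds
    duration_s: float  # "when" measure
    x: float           # mean horizontal position (deg): "where" measure
    y: float           # mean vertical position (deg)


def segment_gaze(samples: List[Tuple[float, float]]) -> List[Fixation]:
    """Group consecutive low-velocity gaze samples into fixations.

    `samples` holds (x, y) eye positions in degrees, one per 1/250 s tick.
    High-velocity intervals (saccades) separate successive fixations.
    """
    dt = 1.0 / SAMPLE_RATE_HZ
    fixations: List[Fixation] = []
    run: List[Tuple[float, float]] = []
    run_start = 0.0
    for i, (x, y) in enumerate(samples):
        if i == 0:
            velocity = 0.0
        else:
            px, py = samples[i - 1]
            velocity = math.hypot(x - px, y - py) / dt  # deg/s
        if velocity < VELOCITY_THRESHOLD:
            if not run:
                run_start = i * dt
            run.append((x, y))
        elif run:
            fixations.append(_summarize(run, run_start, dt))
            run = []
    if run:
        fixations.append(_summarize(run, run_start, dt))
    return fixations


def _summarize(run: List[Tuple[float, float]], start_s: float, dt: float) -> Fixation:
    xs = [p[0] for p in run]
    ys = [p[1] for p in run]
    return Fixation(start_s, len(run) * dt, sum(xs) / len(xs), sum(ys) / len(ys))


def saccade_metrics(fixations: List[Fixation]) -> List[Tuple[float, float]]:
    """Return (length in deg, direction in deg) of the jump between each
    pair of consecutive fixations: the "where" decisions."""
    metrics = []
    for a, b in zip(fixations, fixations[1:]):
        dx, dy = b.x - a.x, b.y - a.y
        metrics.append((math.hypot(dx, dy), math.degrees(math.atan2(dy, dx))))
    return metrics


if __name__ == "__main__":
    # Two illustrative fixations separated by a rapid rightward jump.
    record = ([(1.0, 2.0)] * 50
              + [(1.0 + 0.4 * k, 2.0) for k in range(1, 6)]
              + [(3.0, 2.0)] * 50)
    fixes = segment_gaze(record)
    for f in fixes:
        print(f"fixation at ({f.x:.1f}, {f.y:.1f}) deg, {f.duration_s * 1000:.0f} ms")
    print("saccades (length deg, direction deg):", saccade_metrics(fixes))
```

Summaries of the resulting fixation durations and locations, and of the saccade lengths and directions, correspond to the “when” and “where” measures described above.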
Figure 1.
The photograph at the top shows a profile of the head-mounted hardware of the prototype S&R (Stampe and Reingold) Eyelink system. The lightweight headband holds two custom-built, ultra-miniature, high-speed cameras mounted on adjustable rods to provide binocular eye monitoring with high spatial resolution (0.01 degrees) and fast sampling rates (250 Hz). Each camera uses two infrared light-emitting diodes (IR LEDs) to illuminate the eye so that pupil position can be determined. The power supply is worn at the back of the head and is coupled to a specialized image-processing card housed in a computer that runs the eye-tracking software. The photograph at the bottom shows the third camera on the headband, which tracks the relative location of banks of IR LEDs affixed to each corner of the computer monitor that displays full-motion video or text. The relative location of the LEDs changes with head movement and distance from the display and is used to compensate for head motion in determining x, y eye-fixation locations. An Ethernet link connects the experimental display computer to the eye-tracking computer, supporting real-time data transfer and control.
Figure 2.
The graph at the top shows the sequence and location of x, y eye fixations for a perceiver who is speechreading the sentence, “You can catch the bus across the street.” The size of the markers is scaled in relative units to illustrate differences in the total fixation time directed at each x, y location. The asterisk-shaped markers enclosed in a circle show x, y fixation locations during observable face motion associated with speech, and the square-shaped markers show locations prior to and following speech motion. The rectangles illustrate the regions of the talker's face, ordered from top to bottom, left to right: eye, left cheek, nose, mouth, right cheek, chin. Region boundaries accommodate dynamic face movements for the production of the entire sentence. The graph in the bottom half shows the corresponding data record and includes speechreading followed by the reading of text. The y-axis of the graph is scaled in units corresponding to the measurement identified by the superimposed number: 1 = x (pixel) location of horizontal eye movements; 2 = y (pixel) location of vertical eye movements; 3 = pupil size/10; 4 = eye movement velocity. The darker vertical bars show periods in which no eye data are available due to eye blinks, and the light gray vertical bars indicate saccades, which are defined by high-velocity eye movements. Eye fixations are identified by lines 1 and 2 and are separated from one another by a vertical bar.
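As a companion to the scan plot in Figure 2, the following sketch (again Python, with placeholder pixel coordinates rather than measured boundaries) illustrates how total fixation time might be tallied for each face region of interest, the kind of summary conveyed by the scaled markers and region rectangles in the top graph.

```python
from typing import Dict, List, Tuple

# Face-region rectangles as (x_min, y_min, x_max, y_max) in screen pixels.
# These coordinates are illustrative assumptions, not measured boundaries.
FACE_REGIONS: Dict[str, Tuple[int, int, int, int]] = {
    "eye":         (120, 60, 360, 140),
    "left cheek":  (120, 140, 210, 260),
    "nose":        (210, 140, 290, 240),
    "right cheek": (290, 140, 360, 260),
    "mouth":       (190, 240, 310, 320),
    "chin":        (190, 320, 310, 380),
}


def fixation_time_by_region(
    fixations: List[Tuple[float, float, float]]  # (x px, y px, duration in ms)
) -> Dict[str, float]:
    """Sum fixation durations that fall inside each face-region rectangle."""
    totals = {name: 0.0 for name in FACE_REGIONS}
    totals["other"] = 0.0
    for x, y, duration in fixations:
        for name, (x0, y0, x1, y1) in FACE_REGIONS.items():
            if x0 <= x <= x1 and y0 <= y <= y1:
                totals[name] += duration
                break
        else:
            totals["other"] += duration  # gaze outside all defined regions
    return totals


if __name__ == "__main__":
    # Illustrative record: most fixation time on the mouth, some on the eyes.
    record = [(250, 280, 450.0), (255, 275, 600.0), (240, 100, 220.0)]
    for region, ms in fixation_time_by_region(record).items():
        if ms > 0:
            print(f"{region}: {ms:.0f} ms")
```

In practice, the region boundaries would be defined to accommodate the dynamic face movements noted in the caption, for example by adjusting the rectangles frame by frame.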
Results from eye-monitoring studies demonstrate that speechreaders make successive eye gazes (fixations) to inspect the talker's face or to track facial motion. The talker's eyes attract attention prior to and following speech production (Lansing and McConkie, 2003) and in the presence of auditory cues (Vatikiotis-Bateson, Eigsti, and Munhall, 1998). If auditory cues are not available, perceivers with at least average proficiency attend to the talker's mouth region for accurate sentence perception (Lansing and McConkie, 2003). However, some gazes are directed toward the regions adjacent to the lips as well as toward the eyes of the talker. Similarly, for word understanding, speechreaders direct their gaze most often, and for longer periods of time, toward the talker's mouth than toward any other region of the face. Motion in regions other than the mouth may add redundancy to the signal available at the mouth and afford a natural context for observing detailed mouth motion. Task characteristics also influence where people look for information on the face of the talker (Lansing and McConkie, 1999). Eye gaze is directed toward secondary facial cues, located in the upper part of the face, with greater frequency for the recognition of intonation information than for phonemic or stress recognition. Phonemic and word stress information can be recognized from cues located in the middle and lower parts of the face.
Finally, new findings from brain imaging studies may provide valuable insights into the neural underpinnings of basic processes in the visual perception of spoken language (Calvert et al., 1997) and individual differences in speechreading proficiency (Ludman et al., 2000). Although preliminary results have not yet identified speechreading-specific regions, measures in perceivers with normal hearing indicate bilateral activation of the auditory cortex during silent speechreading (Calvert et al., 1997). Results from measures in perceivers with congenital onset of profound bilateral deafness (who rely on speechreading for understanding spoken language) do not indicate strong left temporal activation (MacSweeney et al., 2002). Functional magnetic resonance imaging may prove to be a useful tool for testing hypotheses about task differences and the activation of primary sensory processing areas, the role of auditory experience and plasticity, and the neural mechanisms and sites of cross-modal integration in the understanding of spoken language.
Continued research on the basic processes of speechreading is needed to determine research-based approaches to intervention, the relative advantages of different approaches, and how specific approaches relate to individual needs. Additional insight into the basic processes of visual speech perception is needed to develop and test a model of spoken word recognition that incorporates visual information, to optimize sensory-prosthetic aids, and to enhance the design of human-computer interfaces.