Voice quality is the auditory perception of acoustic elements of phonation that characterize an individual speaker. Thus, it is an interaction between the acoustic speech signal and a listener's perception of that signal. Voice quality has been of interest to scholars for as long as people have studied speech. The ancient Greeks associated certain kinds of voices with specific character traits; for example, a nasal voice indicated a spiteful and immoral character. Ancient writers on oratory emphasized voice quality as an essential component of polished speech and described methods for conveying a range of emotions appropriately, for cultivating power, brilliance, and sweetness, and for avoiding undesirable characteristics like roughness, brassiness, or shrillness (see Laver, 1981, for review).
Evaluation of vocal quality is an important part of the diagnosis and treatment of voice disorders. Patients usually seek clinical care because of their own perception of a voice quality deviation, and most often they judge the success of treatment for the voice problem by improvement in their voice quality. A clinician may also judge success by documenting changes in laryngeal anatomy or physiology, but in general, patients are more concerned with how their voices sound after treatment. Researchers from other disciplines are also interested in measuring vocal quality. For example, linguists are interested in how changes in voice quality can signal changes in meaning; psychologists are concerned with the perception of emotion and other personal information encoded in voice; engineers seek to develop algorithms for signal compression and transmission that preserve voice quality; and law enforcement officials need to assess the accuracy of speaker identifications.
Despite this long intellectual history and the substantial cross-disciplinary importance of voice quality, measurement of voice quality is problematic, both clinically and experimentally. Most techniques for assessing voice quality fall into one of two general categories: perceptual assessment protocols, or protocols employing an acoustic or physiologic measurement as an index of quality. In perceptual assessments, a listener (or listeners) rates a voice on a numerical scale or a set of scales representing the extent to which the voice is characterized by critical aspects of voice quality. For example, Fairbanks (1960) recommended that voices be assessed on 5-point scales for the qualities harshness, hoarseness, and breathiness. In the GRBAS protocol (Hirano, 1981), listeners evaluate voices on the scales Grade (or extent of pathology), Roughness, Breathiness, Asthenicity (weakness or lack of power in the voice), and Strain, with each scale ranging from 0 (normal) to 3 (severely disordered). A recent revision to this protocol (Dejonckere et al., 1998) has expanded it to GIRBAS by adding a scale for Instability. Many other similar protocols have been proposed. For example, the Wilson Voice Profile System (Wilson, 1977) includes 7-point scales for laryngeal tone, laryngeal tension, vocal abuse, loudness, pitch, vocal inflections, pitch breaks, diplophonia (perception of two pitches in the voice), resonance, nasal emission, rate, and overall vocal efficiency. A 13-scale protocol proposed by Hammarberg and Gauffin (1995) includes scales for assessing aphonia (lack of voice), breathiness, tension, laxness, creakiness, roughness, gratings, pitch instability, voice breaks, diplophonia, falsetto, pitch, and loudness. Even more elaborate protocols have been proposed by Gelfer (1988; 17 parameters) and Laver (approximately 50 parameters; e.g., Greene and Mathieson, 1989).
Methods like visual-analog scaling (making a mark on an undifferentiated line to indicate the amount of a quality present) or direct magnitude estimation (assigning any number—as opposed to one of a finite number of scale values—to indicate the amount of a quality present) have also been applied in efforts to quantify voice quality. Ratings may be made with reference to “anchor” stimuli that exemplify the different scale values, or with reference to a listener's own internal standards for the different levels of a quality.
The usefulness of such protocols for perceptual assessment is limited by difficulties in establishing the correct and adequate set of scales needed to document the sound of a voice. Researchers have never agreed on a standardized set of scales for assessing voice quality, and some evidence suggests that differences between listeners in perceptual strategies are so large that standardization efforts are doomed to failure (Kreiman and Gerratt, 1996). In addition, listeners are apparently unable to agree in their ratings of voices. On average, more than 60% of the variance in ratings of voice quality appears to be due to factors other than differences between voices in the quality being rated. For example, scale ratings may vary with fluctuations in listener attention, with difficulty isolating single perceptual dimensions within a complex acoustic stimulus, and with differences in listeners' previous experience with a class of voices (Kreiman and Gerratt, 1998). Traditional perceptual scaling methods appear to be, in effect, matching tasks, in which external stimuli (the voices) are compared to stored mental representations that serve as internal standards for the various rating scales. These idiosyncratic internal standards appear to vary with listeners' previous experience with voices (Verdonck de Leeuw, 1998) and with the context in which a judgment is made, and may vary substantially across listeners as well as within a given listener. Severity of vocal deviation can also influence perceptual measures of voice, as can difficulty isolating individual dimensions in complex perceptual contexts and lapses in attention (de Krom, 1994). These factors (and possibly others) presumably all add uncontrolled variability to scalar ratings of vocal quality and contribute to listener disagreement (see Gerratt and Kreiman, 2001, for review).
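The variance figure cited above can be made concrete with a small calculation. The following Python sketch is illustrative only and is not the analysis used in the cited studies. It assumes a made-up table of ratings on a 5-point scale (one row per listener, one column per voice) and estimates the share of total rating variance attributable to differences between voices as the between-voice sum of squares divided by the total sum of squares. The function name and the data are hypothetical.

```python
from statistics import mean


def voice_variance_share(ratings):
    """Fraction of total rating variance explained by differences between voices.

    ratings: list of rows, one per listener; each row holds that listener's
    rating of every voice, with voices in the same order in every row.
    """
    n_listeners = len(ratings)
    n_voices = len(ratings[0])
    grand_mean = mean(r for row in ratings for r in row)
    # Mean rating of each voice, averaged over listeners.
    voice_means = [mean(row[v] for row in ratings) for v in range(n_voices)]
    ss_total = sum((r - grand_mean) ** 2 for row in ratings for r in row)
    ss_voices = n_listeners * sum((m - grand_mean) ** 2 for m in voice_means)
    return ss_voices / ss_total


# Hypothetical roughness ratings (0 to 4) from three listeners for four voices.
ratings = [
    [0, 1, 3, 4],
    [2, 1, 1, 3],
    [1, 3, 2, 2],
]

share = voice_variance_share(ratings)
print(f"share of variance due to voices:        {share:.2f}")
print(f"share of variance due to other factors: {1 - share:.2f}")
```

In this toy table most of the variance comes from sources other than the voices themselves, which is the pattern the paragraph above describes. If every listener gave identical ratings, the share due to voices would be 1.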
In response to these substantial difficulties, some researchers suggest substituting objective measures of physiologic function, airflow, or the acoustic signal for these flawed perceptual measures, for example, using a measure of acoustic frequency perturbation as a de facto measure of perceived roughness (see acoustic assessment of voice). This approach reflects the prevailing view that listeners are inherently unable to agree in their perception of such complex auditory stimuli. Theoretical and practical difficulties also beset this approach. Theoretically, we cannot know the perceptual importance of particular aspects of the acoustic signal without valid measures of that perceptual response, because voice quality is by definition the perceptual response to a particular acoustic stimulus. Thus, acoustic measures that purport to quantify vocal quality can only derive their validity as measures of voice quality from their causal association with auditory perception. Practically, consistent correlations have never been found between perceptual and instrumental measures of voice, suggesting that such instrumental measures are not stable indices of perceived quality. Finally, correlation does not imply causality: simply knowing the relationship of an acoustic variable to a perceptual one does not necessarily illuminate its contribution to perceived quality. Even if an acoustic variable were important to a listener's judgment of vocal quality, the nature of that contribution would not be revealed by a correlation coefficient. Further, given the great variability in perceptual strategies and habits that individual listeners demonstrate in their use of traditional rating scales, the overall correlation between acoustic and perceptual variables, averaged across samples of listeners and voices, fails to provide useful insight into the perceptual process. (See Kreiman and Gerratt, 2000, for an extended review of these issues.)
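The point about averaged correlations can be illustrated with a short Python sketch. It is an illustration under stated assumptions, not a reconstruction of any cited study. The function `local_jitter` uses one common definition of frequency perturbation (mean absolute difference between consecutive glottal periods, divided by the mean period), and the jitter values, roughness ratings, and listener names are all invented for the example.

```python
from math import sqrt
from statistics import mean


def local_jitter(periods):
    """Mean absolute difference between consecutive periods, relative to the
    mean period (one common definition of frequency perturbation)."""
    diffs = [abs(a - b) for a, b in zip(periods, periods[1:])]
    return mean(diffs) / mean(periods)


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Jitter of a single hypothetical voice from its glottal periods (ms).
example_jitter = local_jitter([10.0, 10.2, 9.9, 10.1])

# Hypothetical jitter values (percent) for five voices.
jitter = [0.4, 0.8, 1.5, 2.2, 3.0]

# Hypothetical roughness ratings of the same five voices by two listeners.
roughness = {
    "listener_A": [0, 1, 2, 3, 4],  # ratings follow jitter closely
    "listener_B": [2, 0, 3, 1, 2],  # ratings driven by something else
}

# Correlation computed separately for each listener.
per_listener = {name: pearson_r(jitter, r) for name, r in roughness.items()}

# Correlation computed on ratings averaged across listeners.
mean_rating = [mean(col) for col in zip(*roughness.values())]
pooled = pearson_r(jitter, mean_rating)

for name, r in per_listener.items():
    print(f"{name}: r = {r:.2f}")
print(f"averaged ratings: r = {pooled:.2f}")
```

With these invented numbers the correlation on averaged ratings is fairly high even though one listener's ratings are nearly unrelated to jitter. The pooled coefficient therefore says little about how any individual listener uses the acoustic cue.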
Gerratt and Kreiman (2001) proposed an alternative solution to this dilemma. They measured vocal quality by asking listeners to copy natural voice samples with a speech synthesizer. In this method, listeners vary speech synthesis parameters to create an acceptable auditory match to a natural voice stimulus. When a listener chooses the best match to a test stimulus, the synthesis settings parametrically represent the listener's perception of voice quality. Because listeners directly compare each synthetic token they create with the target natural voice, they need not refer to internal standards for particular voice qualities. Further, listeners can manipulate acoustic parameters and hear the result of their manipulations immediately. This process helps listeners focus attention on individual acoustic dimensions, reducing the perceptual complexity of the assessment task and the associated response variability. Preliminary evaluation of this method demonstrated near-perfect agreement among listeners in their assessments of voice quality, presumably because this analysis-synthesis method controls the major sources of variance in quality judgments while avoiding the use of dubiously valid scales for quality. These results indicate that listeners do in fact agree in their perceptual assessments of pathological voice quality, and that tools can be devised to measure perception reliably. However, how such protocols will function in clinical (rather than research) applications remains to be demonstrated. Much more research is certainly needed to determine a meaningful, parsimonious set of acoustic parameters that successfully characterizes all possible normal and pathological voice qualities. 
Such a set could obviate the need for voice quality labels, allowing researchers and clinicians to replace quality labels with acoustic parameters that are causally linked to auditory perception, and whose levels objectively, completely, and validly specify the voice quality of interest.
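In the synthesis-matching method described above, each listener's response is a set of synthesizer settings rather than a scale value, so agreement can be examined parameter by parameter. The Python sketch below shows one way such responses might be stored and compared. The parameter names (`f0_hz`, `noise_db`, `jitter_pct`), the settings, and the spread measure are all assumptions made for illustration; the source does not specify the synthesizer's parameters or how agreement was quantified.

```python
from statistics import mean

# Hypothetical synthesizer settings chosen by three listeners as their best
# match to the same natural voice sample. Parameter names are illustrative.
matches = {
    "listener_A": {"f0_hz": 182.0, "noise_db": -21.0, "jitter_pct": 1.4},
    "listener_B": {"f0_hz": 180.0, "noise_db": -20.0, "jitter_pct": 1.5},
    "listener_C": {"f0_hz": 181.0, "noise_db": -22.0, "jitter_pct": 1.6},
}


def parameter_spread(matches):
    """For each parameter, return (mean setting, largest absolute deviation
    of any listener's setting from that mean)."""
    names = next(iter(matches.values())).keys()
    spread = {}
    for name in names:
        values = [settings[name] for settings in matches.values()]
        centre = mean(values)
        spread[name] = (centre, max(abs(v - centre) for v in values))
    return spread


spread = parameter_spread(matches)
for name, (centre, deviation) in spread.items():
    print(f"{name}: mean setting {centre:.2f}, max deviation {deviation:.2f}")
```

Small deviations on every parameter would correspond to the close listener agreement reported for the method. Whether a given deviation is perceptually negligible depends on the listener's sensitivity to that parameter, which this sketch does not model.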