The MIT Encyclopedia of Communication Disorders: Acoustic Assessment of Voice
 

Acoustic assessment of voice in clinical applications is dominated by measures of fundamental frequency (f0), cycle-to-cycle perturbations of period (jitter) and intensity (shimmer), and other measures of irregularity, such as noise-to-harmonics ratio (NHR). These measures are widely used, in part because of the availability of electronic and microcomputer-based instruments (e.g., Kay Elemetrics Computerized Speech Laboratory [CSL] or Multispeech, Real-Time Pitch, Multi-Dimensional Voice Program [MDVP], and other software/hardware systems), and in part because of long-term precedent for perturbation (Lieberman, 1961) and spectral noise measurements (Yanagihara, 1967). Absolute measures of vocal intensity are equally basic but require calibrations and associated instrumentation (Winholtz and Titze, 1997).
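As an illustration of the cycle-to-cycle perturbation measures named above, the following minimal sketch computes percent jitter and percent shimmer in one common "local" formulation: the mean absolute difference between consecutive cycles, divided by the mean cycle value. This formulation is an assumption chosen for illustration; it is not necessarily the computation used by MDVP, CSL, or any other instrument mentioned here. The function names and the cycle data are hypothetical, and the sketch presumes that glottal cycles have already been identified.

```python
def percent_perturbation(values):
    """Mean absolute cycle-to-cycle difference as a percentage of the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return 100.0 * mean_diff / mean_value


def percent_jitter(periods):
    """Local jitter: perturbation of successive cycle periods."""
    return percent_perturbation(periods)


def percent_shimmer(peak_amplitudes):
    """Local shimmer: perturbation of successive cycle peak amplitudes."""
    return percent_perturbation(peak_amplitudes)


# Hypothetical cycle data, invented for illustration (not from any recording).
periods_ms = [4.0, 4.1, 3.9, 4.0]          # cycle durations near 250 Hz
peak_amplitudes = [1.00, 0.98, 1.02, 1.00]  # arbitrary linear units

print(f"jitter:  {percent_jitter(periods_ms):.2f} %")
print(f"shimmer: {percent_shimmer(peak_amplitudes):.2f} %")
```

Because both measures are normalized by the mean cycle value, they are unit-free, but they remain only as reliable as the cycle identification that precedes them.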

Independently, these basic acoustic descriptors (f0, intensity, jitter, shimmer, and NHR) can provide some basic characterizations of vocal health. The first two, f0 and intensity, have clear perceptual correlates (pitch and loudness, respectively) and should be assessed for both stability and variability and compared to age and sex norms (Kent, 1994; Baken and Orlikoff, 2000). Ideally, the voice samples are recorded over headset microphones with direct digital acquisition at very high sampling rates (at least 48 kHz). The materials to be assessed should be obtained following standardized elicitation protocols that include sustained vowel phonations at habitual levels, levels spanning a client's vocal range in both f0 and intensity, running speech, and speech tasks designed to elicit variation (Titze, 1995; Awan, 2001). Note, however, that not all measures will be appropriate for all tasks; perturbation statistics, for example, are usually valid only when extracted from sustained vowel phonations.

These basic descriptors do not exhaust the range of available measures or of the signal properties and dimensions that can be measured. Table 1 categorizes measures (Buder, 2000) according to the primary signal representations from which they are derived. Although these categories are intended to be exhaustive and mutually exclusive, some more modern algorithms process components through several types. (For more detail on the measurement types, see Buder, 2000, and Baken and Orlikoff, 2000.) Modern algorithmic approaches should be selected for (1) interpretability with respect to aerodynamic and physiological models of phonation and (2) the incorporation of multivariate measures to characterize vocal function.







Table 1 : Outline of Traditional Acoustic Algorithm Types

f0 statistics
 Short-term perturbations
 Long-term perturbations
Amplitude statistics
 Short-term perturbations
 Long-term perturbations
f0/amplitude covariations
Waveform perturbations
Spectral measures
 Spectrographic measures
 Fourier and LPC spectra
 Long-term average spectra
 Cepstra
Inverse filter measures
 Radiated signal
 Flow-mask signals
Dynamic measures

Interdependence of Basic Measures

The interdependence between f0 and intensity is mapped in a voice range profile, or phonetogram, which is an especially valuable assessment for the professional voice user (Coleman, 1993). Furthermore, the dependence of perturbations and signal-to-noise ratios on both f0 and intensity is well known (Klingholz, 1990; Pabon, 1991). This dependence is not often assessed rigorously, perhaps because of the time-consuming and strenuous nature of a full voice profile. However, an abbreviated or focused profiling, using samples related to habitual f0 by a set number of semitones or to habitual intensity by a set number of decibels, could be standardized to control for this dependence efficiently. Finally, it should be understood that perturbations and NHR-type measures will usually covary for many reasons, the simplest ones being methodological (Hillenbrand, 1987): an increase in any one of the underlying phenomena detected by a single measure will also affect the other measures.
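The abbreviated profiling idea can be sketched as a small grid of elicitation targets around a speaker's habitual levels. The sketch below uses the standard equal-tempered relation (a shift of n semitones multiplies frequency by 2^(n/12)) and simple addition for decibel offsets. The step sizes, the habitual values, and the function names are hypothetical choices for illustration; the text does not prescribe particular steps.

```python
def semitone_offset(f0_hz, semitones):
    """Frequency that lies `semitones` above (or below, if negative) f0_hz."""
    return f0_hz * 2.0 ** (semitones / 12.0)


def profile_targets(habitual_f0_hz, habitual_db, semitone_steps, db_steps):
    """Grid of (target f0 in Hz, target level in dB) pairs around habitual levels."""
    return [
        (semitone_offset(habitual_f0_hz, st), habitual_db + db)
        for st in semitone_steps
        for db in db_steps
    ]


# Hypothetical habitual levels and step sizes, chosen only for illustration.
targets = profile_targets(
    habitual_f0_hz=200.0,
    habitual_db=70.0,
    semitone_steps=[-4, 0, 4],
    db_steps=[-6, 0, 6],
)
for f0, level in targets:
    print(f"{f0:7.1f} Hz at {level:5.1f} dB")
```

Perturbation and noise measures taken at each grid point could then be compared across speakers at matched relative f0 and intensity rather than at arbitrary levels.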

Periodicity as a Reference

The chief problem with nearly all acoustic assessments of voice is the determination of f0. Most voice quality algorithms are based on the prior identification of the periodic component in the signal (based on glottal pulses in the time domain or harmonic structure in the frequency domain). Because phonation is ideally a nearly periodic process, it is logical to conceive of voice measures in terms of the degree to which a given sample deviates from pure periodicity. There are many conceptual problems with this simplification, however. At the physiological level, glottal morphology is multidimensional—superior-inferior asymmetry is a basic feature of the two-mass model (Ishizaka and Flanagan, 1972), and some anterior-posterior asymmetry is also inevitable—rendering it unlikely that a glottal pulse will be marked by a discrete or even a single instant of glottal closure. At the level of the signal, the deviations from periodicity may be either random or correlated, and in many cases they are so extreme as to preclude identification of a regular period. Finally, at the perceptual level, many factors related to deviations from a pure f0 can contribute to pitch perception (Zwicker and Fastl, 1990).

At any or all of these levels, it becomes questionable to characterize deviations with pure periodicity as a reference. In acoustic assessment, the primary level of concern is the signal. The National Center for Voice and Speech issued a summary statement (Titze, 1995) recommending a typology for categorizing deviations from periodicity in voices (see also Baken and Orlikoff, 2000, for further subtypes). This typology capitalizes on the categorical nature of dynamic states in nonlinear systems; all the major categories, including stable points, limit cycles, period-doubling/tripling/…, and chaos can be observed in voice signals (Herzel et al., 1994; Sataloff and Hawkshaw, 2001). As in most highly nonlinear dynamic systems, deviations from periodicity can be categorized on the basis of bifurcations, or sudden qualitative changes in vibratory pattern from one of these states to another.

Figure 1 displays a common form for one such bifurcation and illustrates the importance of accounting for its presence in the application of perturbation measures. In this sustained vowel phonation by a middle-aged woman with spasmodic dysphonia, a transition to subharmonics is clearly visible in segment b (similar patterns occur in individuals without dysphonias). Two f0 extractions are presented for this segment, one at the targeted level of approximately 250 Hz and another which the tracker finds one octave below this; inspection of the waveform and a perceived biphonia both justify this 125-Hz analysis as a new fundamental frequency, although it can also be understood in this context as a subharmonic to the original fundamental. There is therefore some ambiguity as to which fundamental is valid during this episode, and an automatic analysis could plausibly identify either frequency. (Here the waveform-matching algorithm implemented in CSpeechSP [Milenkovic, 1997] does identify either frequency, depending on where in the waveform the algorithm is applied; initiating the algorithm within the subharmonic segment predisposes it to identify the lower fundamental.)
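The octave ambiguity described here can be reproduced with a generic autocorrelation-based f0 estimate on a synthetic signal. This is a sketch under stated assumptions: it is not the waveform-matching algorithm of CSpeechSP, the signal is an invented sum of a 250-Hz component and a weaker 125-Hz subharmonic, and the function names, sampling rate, window length, and search ranges are all hypothetical. The point is only that the reported fundamental depends on how the analysis is constrained.

```python
import math


def normalized_autocorrelation(signal, lag, window):
    """Correlation between a window of the signal and the same window shifted by `lag`."""
    a = signal[:window]
    b = signal[lag:lag + window]
    numerator = sum(x * y for x, y in zip(a, b))
    denominator = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return numerator / denominator


def estimate_f0(signal, fs, fmin, fmax, window):
    """Return fs / lag for the lag with the highest correlation in the search range."""
    min_lag = int(fs / fmax)
    max_lag = int(fs / fmin)
    best_lag = max(
        range(min_lag, max_lag + 1),
        key=lambda lag: normalized_autocorrelation(signal, lag, window),
    )
    return fs / best_lag


# Synthetic signal (invented): 250 Hz plus a weaker subharmonic at 125 Hz.
fs = 8000
signal = [
    math.sin(2 * math.pi * 250 * n / fs) + 0.5 * math.sin(2 * math.pi * 125 * n / fs)
    for n in range(800)
]

f0_narrow = estimate_f0(signal, fs, fmin=200, fmax=400, window=640)
f0_wide = estimate_f0(signal, fs, fmin=100, fmax=400, window=640)
print(f0_narrow, f0_wide)
```

With the search restricted to 200 to 400 Hz the estimate stays at the original fundamental, whereas widening the range to include 125 Hz lets the stronger correlation at the doubled period win, which mirrors the two f0 tracks in Figure 1.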

Figure 1.

Approximately 900 ms of a sustained vowel phonation waveform (top panel) with two fundamental frequency analyses (bottom panel). Average f0, %jitter, %shimmer, and SNR results for selected segments were obtained from the “newjit” routine of the TF32 program (Milenkovic, 2001).


The acoustic measures of the segments displayed in Figure 1 reveal the nontrivial differences that result, depending on the basic glottal pulse form under consideration. When the pulses of segment a are considered, the perturbations around the base period associated with the high f0 are low and normative; in segment b, perturbations around the longer periods of the lower f0 are still low (jitter is improved, while shimmer and the signal-to-noise ratio show some degradation). However, when all segments are considered together to include the perturbations around the high f0 tracked through segment b and into c, the perturbation statistics are all increased by an order of magnitude. Many important methodological and theoretical questions should be raised by such common scenarios in which we must consider not just voice typing, but the segment-by-segment validity of applying perturbation measures with a particular f0 as reference. If, as is often assumed, jitter and shimmer are ascribed to “random” variations, then the correlated modulations of a strong subharmonic episode should be excluded. Alternatively, the perturbations might be analyzed with respect to the subharmonic f0. In any case, assessment by means of perturbation statistics with no consideration of their underlying sources is unwise.
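The effect of the chosen reference period on perturbation statistics can be illustrated with an idealized period-doubling pattern. In the sketch below, cycle durations alternate between a short and a long value; jitter is computed once with every cycle treated as a period of the high f0, and once after merging each short-long pair into a single period of the subharmonic f0. The period values are invented, the jitter formula is one common local formulation (mean absolute consecutive difference over mean period) rather than the "newjit" routine, and the function names are hypothetical.

```python
def percent_jitter(periods):
    """Mean absolute difference of consecutive periods, as a % of the mean period."""
    if len(periods) < 2:
        raise ValueError("need at least two periods")
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    return 100.0 * (sum(diffs) / len(diffs)) / (sum(periods) / len(periods))


def merge_pairs(periods):
    """Treat each consecutive pair of cycles as one period of the subharmonic f0."""
    return [periods[i] + periods[i + 1] for i in range(0, len(periods) - 1, 2)]


# Idealized subharmonic episode (invented values): cycles alternate around 4 ms.
alternating_ms = [3.6, 4.4] * 10

jitter_high_f0 = percent_jitter(alternating_ms)
jitter_subharmonic_f0 = percent_jitter(merge_pairs(alternating_ms))

print(f"referenced to the high f0:        {jitter_high_f0:.1f} %")
print(f"referenced to the subharmonic f0: {jitter_subharmonic_f0:.1f} %")
```

In this idealized case the alternation is perfectly correlated, so it appears as very large jitter against the high f0 and vanishes against the subharmonic f0; real episodes fall between these extremes, but the sketch shows why the statistic cannot be interpreted without knowing which reference was used.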

Perceptual, Aerodynamic, and Physiological Correlates of Acoustic Measures

Regarding perceptual voice ratings, Gerratt and Kreiman (2000) have critiqued traditional assessments on several important methodological and theoretical points. However, these points may not apply to acoustic analysis if (1) acoustic analysis is validated on its own success and not exclusively in relation to the problematic perceptual classifications, and (2) acoustic analysis is thoroughly grounded for interpretation in some clear aerodynamic or physiological model of phonation. Gerratt and Kreiman also argue that clinical classification may not be derived along a continuum that is defined with reference to normal qualities, but again, this argument may need to be reversed for the acoustic domain. It is only by reference to a specific model that any assessment on acoustic grounds can be interpreted (though this does not preclude development of an independent model for a pathological phonatory mechanism). In clinical settings, acoustic voice assessment often serves to corroborate perceptual assessment. However, as guided by auditory experience and in conjunction with the ear and other instrumental assessments, careful acoustic analysis can be oriented to the identification of physiological status.

For drawing safe and reasonably direct inferences from the acoustic signal, aerodynamic models of glottal behavior provide important links to the physiological domain. Attempts to recover the glottal flow waveform, either from a face mask-transduced flow recording (Rothenberg, 1973) or a microphone-transduced acoustic recording (Davis, 1975), have proved to be labor-intensive and prone to error (Ní Chasaide and Gobl, 1997). Rather than attempting to eliminate the effects of the vocal tract, it may be more fruitful to understand its in situ relationship with phonation, and infer, via the types of features displayed in Figure 2, the status of the glottis as a sound source. Interpretation of spectral features, such as the amplitudes of the first harmonics and at the formant frequencies, may be an effective alternative when guided by knowledge of glottal aerodynamics and acoustics (Hanson, 1997; Ní Chasaide and Gobl, 1997; Hanson and Chuang, 1999). Deep familiarity with acoustic mechanisms is essential for such interpretations (Titze, 1994; Stevens, 1998), as is a model with clear and meaningful parameters, such as the Liljencrants-Fant (LF) model (Fant, Liljencrants, and Lin, 1985). The parameters of the LF model have proved to be meaningful in acoustic studies (Gauffin and Sundberg, 1989) and useful in refined efforts at inverse filtering (Fröhlich, Michaelis, and Strube, 2001). Figure 2 summarizes selected parameters of the LF source model following Ní Chasaide and Gobl (1997) and the glottal turbulence source following Stevens (1998); see also voice acoustics for other approaches relating glottal status to spectral measures.

Figure 2.

Spectral features associated with models of phonation, including the Liljencrants-Fant (LF) model of glottal flow and aperiodicity source models developed by Stevens. The LF model of glottal flow is shown at top left. At bottom left is the LF model of glottal flow derivative, showing the rate of change in flow. At right is a spectrum schematic showing four effects. These effects include three derived parameters of the LF model: (a) excitation strength (the maximum negative amplitude of the flow derivative, which is positively correlated with overall harmonic energy), (b) dynamic leakage or non-zero return phase following the point of maximum excitation (which is negatively correlated with high-frequency harmonic energy), and (c) pulse skewing (which is negatively correlated with low-frequency harmonic energy; this low-frequency region is also positively correlated with open quotient and peak volume velocity measures of the glottal flow waveform). The effect of turbulence due to high airflow through the glottis is schematized by (d), indicating the associated appearance of high-frequency aperiodic energy in the spectrum. See voice acoustics for other graphical and quantitative associations between glottal status and spectral characteristics.


Other spectral-based measures implement similar model-based strategies by selecting spectral component ratios (e.g., the VTI and SPI parameters of MDVP). Sophisticated spectral noise characterizations control for perturbations and modulations (Murphy, 1999; Qi, Hillman, and Milstein, 1999), or employ curve-fitting and statistical models to produce more robust measures (Alku, Strik, and Vilkman, 1997; Michaelis, Fröhlich, and Strube, 1998; Schoentgen, Bensaid, and Bucella, 2000). A particularly valuable modern technique for detecting turbulence at the glottis, the glottal-to-noise-excitation ratio (Michaelis, Gramss, and Strube, 1997), has been especially successful in combination with other measures (Fröhlich et al., 2000). The use of acoustic techniques for voice will only improve with the inclusion of more knowledge-based measures in multivariate representations (Wolfe, Cornell, and Palmer, 1991; Callen et al., 2000; Wuyts et al., 2000).

 


© 2010 The MIT Press