|
The basic acoustic source during normal phonation is a waveform consisting of a quasi-periodic sequence of pulses of volume velocity $U_s(t)$ that pass between the vibrating vocal folds (Fig. 1A). For modal vocal fold vibration, the volume velocity is zero in the time interval between the pulses, and there is a relatively abrupt discontinuity in slope at the time the volume velocity decreases to zero. The periodic nature of this waveform is reflected in the harmonic structure of the spectrum (Fig. 1B). The amplitudes of the harmonics at high frequencies decrease as $1/f^2$, where $f$ is frequency, i.e., at about −12 dB per octave. The fundamental frequency of this source waveform varies from one individual to another and within an utterance. In the time domain (Fig. 1A), the frequency is reflected in the number of pulses per second; in the frequency domain (Fig. 1B), it is reflected in the spacing between the harmonics. The shape of the individual pulses can also vary with the speaker, and during an utterance the shape can be modified depending on the position within the utterance and the prominence of the syllable.
Figure 1. A, Idealized waveform of glottal volume velocity $U_s(t)$ for modal vocal fold vibration for an adult male speaker. B, Spectrum of waveform in A.
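The relation between a power-law roll-off and a slope in dB per octave, and between the pulse rate and the harmonic spacing, can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the 125 Hz fundamental is an assumed value chosen for the example, not a figure taken from the text.

```python
import math

def rolloff_db_per_octave(exponent):
    """Level change in dB per doubling of frequency for amplitude ~ 1/f**exponent."""
    return 20.0 * math.log10(2.0 ** (-exponent))

def harmonic_frequencies(f0_hz, count):
    """Harmonics of a periodic pulse train lie at integer multiples of f0."""
    return [k * f0_hz for k in range(1, count + 1)]

# A 1/f^2 amplitude roll-off corresponds to roughly -12 dB per octave.
print(round(rolloff_db_per_octave(2), 2))   # -12.04

# Assumed fundamental of 125 Hz: harmonics are spaced 125 Hz apart.
print(harmonic_frequencies(125.0, 4))       # [125.0, 250.0, 375.0, 500.0]
```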
When the position and tension of the vocal folds are properly adjusted, a positive pressure below the glottis will cause the vocal folds to vibrate. As the cross-sectional area of the glottis changes during a cycle of vibration, the airflow is modulated. During the open phase of the cycle, the impedance of the glottal opening is usually large compared with the impedance looking into the vocal tract from the glottis. Thus, in most cases it is reasonable to represent the glottal source as a volume-velocity source that produces similar glottal pulses for different vocal tract configurations.
This source $U_s(t)$ is filtered by the vocal tract, as depicted in Figure 2. The volume velocity at the lips is $U_m(t)$, and the output sound pressure at a distance $r$ from the lips is $p_r(t)$. The magnitudes of the spectral components of $U_s(f)$ and $p_r(f)$ are shown below the corresponding waveforms in Figure 2. When a non-nasal vowel is produced, the vocal tract transfer function $T(f)$, defined as the ratio $U_m(f)/U_s(f)$, is an all-pole transfer function. The sound pressure $p_r$ is related to $U_m$ by a radiation characteristic $R(f)$.
Figure 2. Schema showing how the acoustic source at the glottis is filtered by the vocal tract to yield a volume velocity $U_m(t)$ at the lips, which is radiated to obtain the sound pressure $p_r(t)$ at some distance from the lips. At the left of the figure both the source waveform $U_s(t)$ and its spectrum $U_s(f)$ are shown. At the right is the waveform $p_r(t)$ and spectrum $p_r(f)$ of the sound pressure. (Adapted with permission from Stevens, 1994.)
The magnitude of this radiation characteristic is approximately
$$|R(f)| = \frac{\rho \cdot 2\pi f}{4\pi r} \tag{1}$$
where ρ = density of air. Thus we have
$$p_r(f) = U_s(f) \cdot T(f) \cdot R(f). \tag{2}$$
The magnitude of pr(f) can be written as
$$|p_r(f)| = |U_s(f) \cdot 2\pi f| \, |T(f)| \cdot \frac{\rho}{4\pi r}. \tag{3}$$
The expression $|U_s(f) \cdot 2\pi f|$ is the magnitude of the Fourier transform of the derivative $U_s'(t)$. Thus the output sound pressure can be considered to be the result of filtering $U_s'(t)$ by the vocal tract transfer function $T(f)$, multiplied by a constant. That is, the derivative $U_s'(t)$ can be viewed as the effective excitation of the vocal tract.
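As a rough numerical check, the two groupings of terms in Equations (2) and (3) can be evaluated at a single frequency and compared. This is a minimal sketch under stated assumptions: the function names are hypothetical, the air density of 1.2 kg/m³ is an assumed nominal value in SI units, and the source magnitude, transfer-function magnitude, frequency, and distance are made-up numbers.

```python
import math

RHO_AIR = 1.2  # kg/m^3; assumed nominal density of air (not given in the text)

def radiation_magnitude(f_hz, r_m, rho=RHO_AIR):
    """|R(f)| as in Eq. (1): rho * 2*pi*f / (4*pi*r)."""
    return rho * 2.0 * math.pi * f_hz / (4.0 * math.pi * r_m)

def pressure_magnitude(us_mag, t_mag, f_hz, r_m, rho=RHO_AIR):
    """|p_r(f)| as in Eq. (3): |U_s(f)*2*pi*f| * |T(f)| * rho / (4*pi*r)."""
    derivative_mag = us_mag * 2.0 * math.pi * f_hz  # spectrum magnitude of U_s'(t)
    return derivative_mag * t_mag * rho / (4.0 * math.pi * r_m)

# Hypothetical values for one spectral component.
us_mag, t_mag, f_hz, r_m = 1e-5, 3.0, 500.0, 0.2

via_eq2 = us_mag * t_mag * radiation_magnitude(f_hz, r_m)   # Eq. (2), magnitudes
via_eq3 = pressure_magnitude(us_mag, t_mag, f_hz, r_m)      # Eq. (3)
print(math.isclose(via_eq2, via_eq3))  # True
```

Moving the factor $2\pi f$ from the radiation term to the source term changes nothing numerically; it only changes which waveform is regarded as the excitation.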
For the ideal or modal volume-velocity waveform (Fig. 1), this derivative has the form shown in Figure 3A (Fant, Liljencrants, and Lin, 1985). Each pulse has a sequence of two components: (1) an initial smooth portion where the waveform is first positive, then passes through zero (corresponding to the peak of the pulse in Fig. 1), and then reaches a maximum negative value; and (2) a second portion where the waveform returns abruptly to zero, corresponding to the discontinuity in slope of the original waveform $U_s(t)$ at the time the vocal folds come together. The principal acoustic excitation of the vocal tract occurs at the time of this discontinuity. For this ideal or modal derivative waveform, the spectrum (Fig. 3B) at high frequencies decreases as $1/f$, i.e., at −6 dB/octave, reflecting the discontinuity at closure.
Figure 3. A, Derivative $U_s'(t)$ of the modal volume-velocity waveform in Figure 1. B, Spectrum of waveform in A.
For normal speech production, there are several ways in which the glottal waveform can differ from the modal waveform (or its derivative). One obvious attribute is the frequency $f_0$ of the glottal pulses, which is controlled primarily by changing the tension of the vocal folds, although the subglottal pressure also influences the frequency, particularly when the folds are relatively slack (Titze, 1989). Increasing or decreasing the subglottal pressure $P_s$ causes increases or decreases in the amplitude of the glottal pulses, or, more specifically, in the magnitude of the discontinuity in slope at the time of glottal closure. The magnitude of the glottal excitation increases roughly as $P_s^{3/2}$ (Ladefoged and McKinney, 1963; Isshiki, 1964; Tanaka and Gould, 1983).
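To get a feel for the three-halves power relation, the change in excitation level for a given change in subglottal pressure can be estimated as follows. This is a sketch, not a model from the cited studies: the function name is hypothetical, and it assumes the excitation magnitude is an amplitude-like quantity, so that level changes are computed as 20 log10 of the ratio.

```python
import math

def excitation_change_db(pressure_ratio, exponent=1.5):
    """Approximate level change (dB) of the glottal excitation when the
    subglottal pressure is multiplied by pressure_ratio, assuming
    excitation ~ Ps**exponent and an amplitude-like quantity."""
    return 20.0 * math.log10(pressure_ratio ** exponent)

# Doubling the subglottal pressure raises the excitation by about 9 dB.
print(round(excitation_change_db(2.0), 2))  # 9.03
```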
Changes in the configuration of the membranous and cartilaginous portions of the vocal folds relative to the modal configuration can lead to changes in the waveform and spectrum of the glottal source. For some speakers and for some styles of speaking, the vocal folds and arytenoid cartilages are configured such that the glottis is never completely closed during a cycle of vibration, introducing several acoustic consequences. First, the speed with which the vocal folds approach the midline is reduced; the effect on the derivative waveform $U_s'(t)$ is that the maximum negative value is reduced (that is, it is less negative). Thus, the excitation of the vocal tract and the overall amplitude of the output are decreased. Second, there is continuing airflow throughout the cycle. The inertia of the air in the glottis and supraglottal airways prevents the occurrence of the abrupt discontinuity in $U_s'(t)$ that occurs at the time of vocal fold closure in modal phonation (Rothenberg, 1981). Rather, there is a non-zero return phase following the maximum negative peak, during which $U_s'(t)$ gradually returns to zero (Fant, Liljencrants, and Lin, 1985). The derivative waveform $U_s'(t)$ then has a shape that is schematized in Figure 4. The corresponding waveform $U_s(t)$ is shown below the waveform $U_s'(t)$. The spectral consequence of this non-zero return phase is a reduction in the high-frequency spectrum amplitude of $U_s'(t)$ relative to the low-frequency spectrum amplitude. A third consequence of a somewhat abducted glottal configuration is an increased loss of acoustic energy from the vocal tract through the partially open glottis and into the subglottal airways. This energy loss affects the vocal tract filter rather than the source waveform. It is most apparent in the first formant range and results in an increased bandwidth of F1, causing a reduction in A1, the amplitude of the first-formant prominence in the spectrum (Hanson, 1997). The three consequences just described lead to a vowel for which the spectrum amplitude A1 in the F1 range is reduced and the amplitudes of the spectral prominences due to higher formants are reduced relative to A1.
Figure 4. Schematized representation of volume velocity waveform $U_s(t)$ and its derivative $U_s'(t)$ when the glottis is never completely closed within a cycle of vibration.
Still another consequence of glottal vibration with a partially open glottis is that there is increased average airflow through the glottis, as shown in the $U_s(t)$ waveform in Figure 4. This increased flow causes an increased amplitude of noise generated by turbulence in the vicinity of the glottis. Thus, in addition to the quasi-periodic source, there is an aspiration noise source with a continuous spectrum (Klatt and Klatt, 1990). Since the flow is modulated by the periodic fluctuation in glottal area, the noise source is also modulated. This type of phonation has been called “breathy-voiced.”
The aspiration noise source can be represented as an equivalent acoustic volume-velocity source that is added to the periodic source. In contrast to the periodic source, the noise source has a spectrum that tilts upward with increasing frequency. It appears to have a broad peak at high frequencies, around 2–4 kHz (Stevens, 1998). Figure 5A shows estimated spectra of the periodic and noise components that would occur during modal phonation. The noise component is relatively weak, and is generated only during the open phase of glottal vibration. Phonation with a more abducted glottis of the type represented in Figure 4 leads to greater noise energy and reduced high-frequency amplitude of the periodic component, and the noise component may dominate the periodic component at high frequencies (Fig. 5B). With breathy-voiced phonation, the individual harmonics corresponding to the periodic component may be obscured by the noise component at high frequencies. At low frequencies, however, the harmonics are well defined, since the noise component is weak in this frequency region.
Figure 5. Schematized representation of spectra of the effective periodic and noise components of the glottal source for modal vibration (A) and breathy voicing (B). The spectrum of the periodic component is represented by the amplitudes of the harmonics. The spectrum of the noise is calculated with a bandwidth of about 300 Hz.
Figure 6 shows spectra of a vowel produced by a speaker with modal glottal vibration (A) and the same vowel produced by a speaker with a somewhat abducted glottis (B). Below the spectra are waveforms of the vowel before and after being filtered by a broad bandpass filter (bandwidth of 600 Hz) centered on the third-formant frequency F3. Filtered waveforms of this type have been used to highlight the presence of noise at high frequencies during phonation by a speaker with a breathy voice (Klatt and Klatt, 1990). The noise is also evident in the spectrum at high frequencies for the speaker of Figure 6B. Comparison of the two spectra in Figure 6 also shows the greater spectrum tilt and the reduced prominence of the first formant peak associated with an abducted glottis, as already noted.
Figure 6. A, Spectrum of the vowel /ε/ produced by a male speaker with approximately modal phonation. Below the spectrum are waveforms of this vowel before and after being filtered with a bandpass filter centered on F3, with a bandwidth of 600 Hz. The individual glottal pulses as filtered by F3 of the vowel are evident. B, Spectrum of the vowel /ε/ produced by a male speaker who apparently phonated with a glottal chink. The waveforms below are as described in A. The noise in the waveform in the F3 region (and above) obscures the individual glottal pulses. The spectra are from Hanson and Chuang (1999). See text.
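The text does not say how the 600 Hz bandpass filter centered on F3 was implemented. The sketch below swaps in a standard second-order (biquad) bandpass section with unity gain at the center frequency, purely to illustrate the idea. The function names, the 2500 Hz value for F3, and the 16 kHz sampling rate are all assumptions made for the example, and the 600 Hz bandwidth is only approximate for a digital filter of this kind.

```python
import math

def bandpass_coefficients(center_hz, bandwidth_hz, sample_rate_hz):
    """Second-order bandpass (unity gain at center), normalized so a[0] == 1."""
    w0 = 2.0 * math.pi * center_hz / sample_rate_hz
    q = center_hz / bandwidth_hz            # quality factor from center/bandwidth
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)
    a = (1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a

def apply_biquad(samples, b, a):
    """Filter a sequence of samples with a direct-form I biquad."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x0 in samples:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        out.append(y0)
        x2, x1 = x1, x0
        y2, y1 = y1, y0
    return out

def steady_state_gain(freq_hz, b, a, sample_rate_hz=16000, n=1600):
    """Amplitude gain for a unit sine, measured after the transient has decayed."""
    tone = [math.sin(2.0 * math.pi * freq_hz * i / sample_rate_hz) for i in range(n)]
    tail = apply_biquad(tone, b, a)[n // 2:]
    rms = math.sqrt(sum(v * v for v in tail) / len(tail))
    return rms * math.sqrt(2.0)

# Assumed F3 of 2500 Hz, 600 Hz bandwidth, 16 kHz sampling rate.
b, a = bandpass_coefficients(center_hz=2500.0, bandwidth_hz=600.0, sample_rate_hz=16000)
gain_at_f3 = steady_state_gain(2500.0, b, a)   # close to 1
gain_at_low = steady_state_gain(300.0, b, a)   # strongly attenuated
```

Applied to a vowel waveform, a filter of this type passes the F3 region, where aspiration noise is most visible, and suppresses the low-frequency harmonics that otherwise dominate the waveform.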
As the average glottal area increases, the transglottal pressure required to maintain vibration (phonation threshold pressure) increases. Therefore, for a given subglottal pressure, an increase in the glottal area can lead to cessation of vocal fold vibration.
Adduction of the vocal folds relative to their modal configuration can also lead to changes in the source waveform. As the vocal folds are adducted, pressed voicing occurs, in which the glottal pulses are narrower and of lower amplitude than in modal phonation, and may occur aperiodically (glottalization). In addition, phonation-threshold pressure increases, eventually reaching a point where the folds no longer vibrate.
The above description of the glottal vibration pattern for various degrees of glottal abduction and adduction suggests that there is an optimum glottal width that gives rise to a maximum in sound energy (Hanson and Stevens, 2002). This optimum configuration has been examined experimentally by Verdolini et al. (1998).
There are substantial individual and sex differences in the degree to which the folds are abducted or adducted during phonation. These differences lead to significant differences in the waveform and spectrum of the glottal source, and consequently in the spectral characteristics of vowels generated by these sources (Hanson, 1997; Hanson and Chuang, 1999). Similar observations have also been made by Holmberg, Hillman, and Perkell (1988) using different measurement techniques. One acoustic measure that reflects the reduction of the high-frequency spectrum amplitude relative to the low-frequency spectrum amplitude is the difference H1*-A3* (in dB) between the amplitude of the first harmonic and the amplitude of the third-formant spectrum prominence. (The asterisks indicate that corrections are made in H1 due to the possible influence of the first formant, and in A3 due to the influence of the frequencies of the first and second formants.) Distributions of values of H1*-A3* are given in Figure 7 for a population of 22 female and 21 male speakers. The female speakers appear to have a greater spectrum tilt on average, suggesting a somewhat less abrupt glottal closure during a cycle and a greater tendency for lack of complete closure throughout the cycle. Note the substantial ranges of 20 dB or more within each sex.
Figure 7. Distributions of H1*-A3*, a measure that reflects the reduction of the high-frequency spectrum relative to the low-frequency spectrum, for male (black bars) and female (gray bars) speakers. H1 is the amplitude of the first harmonic and A3 is the amplitude of the strongest harmonic in the F3 peak. The asterisks indicate that corrections have been applied to H1 and A3, as described in the text. (Adapted with permission from Hanson and Chuang, 1999.)
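The uncorrected form of this measure, H1 − A3, can be sketched directly from the definitions in the caption. The example below is a simplified illustration: the function name and the ±300 Hz search band around F3 are hypothetical choices, the harmonic levels are made-up numbers, and the formant corrections indicated by the asterisks are deliberately omitted because their form is not given here.

```python
def h1_minus_a3_db(harmonics_db, f3_hz, search_halfwidth_hz=300.0):
    """Uncorrected H1 - A3 in dB.

    harmonics_db: list of (frequency_hz, level_db) pairs, first harmonic first.
    A3 is taken as the strongest harmonic within the search band around F3.
    """
    h1 = harmonics_db[0][1]
    near_f3 = [level for freq, level in harmonics_db
               if abs(freq - f3_hz) <= search_halfwidth_hz]
    if not near_f3:
        raise ValueError("no harmonic falls within the F3 search band")
    return h1 - max(near_f3)

# Made-up harmonic levels (dB) for a 200 Hz fundamental, harmonics 1 to 15.
levels = [60.0, 55.0, 58.0, 50.0, 45.0, 40.0, 38.0, 36.0,
          34.0, 33.0, 35.0, 38.0, 41.0, 39.0, 36.0]
harmonics = [(200.0 * k, level) for k, level in enumerate(levels, start=1)]

# Assumed F3 of 2600 Hz: the strongest nearby harmonic is at 2600 Hz (41 dB).
print(h1_minus_a3_db(harmonics, f3_hz=2600.0))  # 19.0
```

A larger value indicates a steeper fall-off of the spectrum toward high frequencies, which is the sense in which the measure reflects spectrum tilt.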
|