Abstract:
A 3-D physiological articulatory model has been developed for
human-mimetic speech synthesis. The articulatory model is
constructed based on volumetric MRI data that were collected from
one male speaker, and is driven by muscle activation signals. The
muscle activation signals are computed by successively moving the
current position toward the articulatory target position for
several observation points on the articulators. In this study,
the multipoint control strategy involves three points on the
articulators: the tongue tip, tongue dorsum, and mandibular
incisor. A set of weight coefficients is defined for each muscle
to control each point independently in the model's geometric
space. A muscle workspace is proposed to predict the contribution
of muscle contraction to movement of each control point in the
geometric space. Muscle activation signals are generated by this
successive process and fed to the muscles to drive the model.
Thus, a vocal tract shape is generated from model simulation
based on the articulatory targets. In the acoustic part of the
articulatory speech synthesizer, area functions are estimated
from vocal tract widths in the midsagittal and parasagittal
planes based on an improved alpha-beta model. Speech sounds are
generated from the area functions by the transmission line model.
This study applies the articulatory model to estimate vocal tract
shape from vowel sounds. Average formant patterns of five
Japanese vowels were employed as a template to classify an input
vowel, and five typical articulatory target patterns were
determined to produce the formant patterns. An input vowel is
first classified according to the average formant patterns, and
then its articulatory targets are calculated using the typical
target patterns of the first two candidate vowels. The synthetic
vowels based on the calculated targets demonstrated a high
similarity with the input vowels. In this process, acoustical
parameters are considered as a function of the articulatory
targets for the model simulation. The location of articulatory
targets is estimated based on a comparison of acoustical
parameters between a real speech sound and the synthetic sound
corresponding to the targets. A realistic vocal tract shape is
achieved when the difference is minimized. The estimation was
performed on five Japanese vowels for two subjects, and its
accuracy was evaluated by examining acoustical and articulatory
data. The results show that reliable acoustic-to-articulatory
inversion is obtainable using a physiological model.
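
The template-based initialization described above (classify an input vowel against average formant patterns, then derive its articulatory targets from the two best candidate vowels) might be sketched as follows. This is only an illustrative sketch under stated assumptions: the formant and target values are hypothetical placeholders rather than data from the study, only the first two formants are used, and the inverse-distance blending of the two candidates is an assumed rule, since the abstract does not specify how the candidates are combined. The names FORMANT_TEMPLATES, TARGET_PATTERNS, rank_vowels, and estimate_targets are likewise introduced here for illustration.

```python
import math

# Hypothetical placeholder (F1, F2) templates in Hz for five Japanese vowels.
# These are NOT the average formant patterns measured in the study.
FORMANT_TEMPLATES = {
    "a": (750.0, 1200.0),
    "i": (300.0, 2300.0),
    "u": (350.0, 1300.0),
    "e": (500.0, 1900.0),
    "o": (500.0, 900.0),
}

# Hypothetical typical target patterns, one per vowel, in arbitrary units:
# (x, y) for the tongue tip, tongue dorsum, and mandibular incisor.
TARGET_PATTERNS = {
    "a": (1.0, 0.5, 3.0, 1.0, 0.0, -1.0),
    "i": (2.0, 1.5, 3.5, 2.5, 0.2, -0.3),
    "u": (1.2, 1.0, 4.0, 2.2, 0.1, -0.4),
    "e": (1.8, 1.2, 3.4, 2.0, 0.1, -0.6),
    "o": (0.8, 0.6, 4.2, 1.6, 0.0, -0.8),
}


def rank_vowels(formants):
    """Return (distance, vowel) pairs sorted from nearest to farthest template."""
    return sorted(
        (math.dist(formants, template), vowel)
        for vowel, template in FORMANT_TEMPLATES.items()
    )


def estimate_targets(formants):
    """Blend the target patterns of the two nearest vowels.

    Assumed rule: weights are inversely proportional to formant distance,
    so an input that coincides with a template returns that template's targets.
    """
    (d1, v1), (d2, v2) = rank_vowels(formants)[:2]
    w1 = d2 / (d1 + d2)  # the nearer candidate gets the larger weight
    w2 = 1.0 - w1
    targets = tuple(
        w1 * a + w2 * b
        for a, b in zip(TARGET_PATTERNS[v1], TARGET_PATTERNS[v2])
    )
    return v1, v2, targets


if __name__ == "__main__":
    first, second, targets = estimate_targets((600.0, 1100.0))
    print(first, second, targets)
```

In the procedure the abstract describes, targets obtained this way would then be refined by comparing the acoustical parameters of the synthetic sound with those of the real speech sound; that refinement step is not shown in this sketch.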