Abstract:
A well-known difficulty in using the articulatory
representation for applications in the areas of speech coding,
synthesis and recognition is the poor accuracy in the estimation
of the articulatory parameters from the acoustic signal of
speech. The difficulty is especially serious for most classes of
consonantal sounds. This paper presents a statistical method of
estimating the articulatory trajectories from the speech signal
based on training databases of physiological measurements of
articulatory and acoustic parameters obtained from continuous
speech utterances. The estimation of articulatory trajectories
uses the extended Kalman filtering technique and is based on new
linguistic constraints imposed on acoustic-to-articulatory
inversion. These new constraints are mainly implemented by
dividing the whole articulatory-acoustic function into a number
of phonological sub-functions, each corresponding to a unit of
speech defined as the pattern of continuous transition between
two consecutive phonemes. A state-space dynamical model
represents each phonological unit of speech, and a distinct
articulatory-acoustic sub-function is modeled as part of the
state-space model for that unit. An automatic method of
segmenting the speech signal and
recognizing the phonological units was developed based on
likelihood computation from Kalman filtering with different
models. The final estimate of the articulatory trajectories was
obtained from a Kalman smoother using the parameters of the
recognized models. The whole speech inversion method was
developed using synthesized speech data obtained with an
articulatory synthesizer. Then the method was evaluated on real
speech data recorded with an articulograph and an X-ray microbeam
system. Estimation results are presented and compared with the
articulographic and X-ray microbeam data. Average RMS errors of
about 2 mm were obtained between estimated and actual
articulatory trajectories.
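The recognition step described above, choosing among competing unit models by the likelihood that Kalman filtering assigns to the observed signal, can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not taken from the paper: it uses an ordinary linear Kalman filter rather than the extended one, a one-dimensional state, and two toy models labeled "a" and "b". The function names `kalman_loglik` and `pick_model` and all parameter values are hypothetical.

```python
import numpy as np

def kalman_loglik(y, A, C, Q, R, x0, P0):
    """Log-likelihood of observations y (T x m) under a linear-Gaussian
    state-space model, accumulated from the filter innovations.
    Assumption: linear model, standing in for the paper's extended filter."""
    x, P = x0.copy(), P0.copy()
    ll = 0.0
    for yt in y:
        # Prediction step
        x = A @ x
        P = A @ P @ A.T + Q
        # Innovation and its covariance
        e = yt - C @ x
        S = C @ P @ C.T + R
        ll += -0.5 * (e @ np.linalg.solve(S, e)
                      + np.linalg.slogdet(S)[1]
                      + len(yt) * np.log(2.0 * np.pi))
        # Measurement update
        K = P @ C.T @ np.linalg.inv(S)
        x = x + K @ e
        P = (np.eye(len(x)) - K @ C) @ P
    return ll

def pick_model(y, models):
    """Return the name of the highest-likelihood model and all scores."""
    scores = {name: kalman_loglik(y, **m) for name, m in models.items()}
    return max(scores, key=scores.get), scores

# Toy data simulated from model "a" (hypothetical parameters).
rng = np.random.default_rng(0)
C = np.array([[1.0]])
Q = np.array([[0.01]])
R = np.array([[0.01]])
A_true = np.array([[0.95]])
x = np.array([1.0])
ys = []
for _ in range(200):
    x = A_true @ x + rng.normal(0.0, 0.1, size=1)
    ys.append(C @ x + rng.normal(0.0, 0.1, size=1))
y = np.array(ys)

shared = dict(C=C, Q=Q, R=R, x0=np.array([1.0]), P0=np.array([[1.0]]))
models = {
    "a": dict(A=np.array([[0.95]]), **shared),
    "b": dict(A=np.array([[-0.9]]), **shared),
}
best, scores = pick_model(y, models)
print(best, scores)
```

In a segmentation setting, the same score would be computed over candidate segments of the signal for each unit model; the sketch only shows the per-segment comparison.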