Abstract:
Although stochastic models of speech signals (e.g., hidden
Markov models, trigrams, etc.) have led to impressive
improvements in speech recognition accuracy, it has been noted
that these models have little relationship to speech production,
and their recognition performance on some important tasks is far
from perfect. However, there have been recent attempts to bridge
the gap between speech production and speech recognition using
models that are stochastic and yet make more reasonable
assumptions about the mechanisms underlying speech production.
One of these models, Multiple Observable Maximum Likelihood
Continuity Mapping (MO-MALCOM), is described in this paper. There
are theoretical and experimental reasons to believe that
MO-MALCOM learns a stochastic mapping between articulator
positions and speech acoustics, even though no articulator
measurements are available during training. Furthermore,
MO-MALCOM can be combined with standard speech recognition
algorithms to form a speech recognition approach based on a
production model. Results of experiments related to MO-MALCOM's
recovery of articulator positions are summarized. It is pointed
out that MO-MALCOM's ability to learn a mapping between acoustics
and articulation without a training signal has important
theoretical implications for theories of speech perception and
speech production. For an example of the implications for speech
production models, consider that it has been argued that phoneme
targets must be acoustic because there is no teaching signal to
help learn the mapping between acoustics and tract variables.
Such an argument is not valid if a teaching signal is not
required to learn the mapping.