Abstract:
The MOCHA (Multi-CHannel Articulatory) database is being
created to provide a resource for training speaker-independent
continuous ASR systems and for general co-articulatory studies.
The planned dataset includes 40 speakers of English, each reading
up to 460 British-English TIMIT sentences. The articulatory
channels currently include Electromagnetic Articulograph (EMA)
sensors directly attached to the vermilion border of the upper
and lower lips, lower incisor (jaw), tongue tip (5-10mm from the
tip), tongue blade (approximately 2-3cm posterior to the tongue
tip sensor), tongue dorsum (approximately 2-3cm posterior to the
tongue blade sensor) and soft palate (approximately 10-20mm from
the edge of the hard palate). A Laryngograph provides voicing
information and an Electropalatograph (EPG) provides
tongue-palate contact data. This paper describes work in progress
using this database to determine a set of articulatory parameters
which are optimised for the task of automatic phone recognition.
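As a concrete illustration of how these channels might be organised per utterance, the sketch below (in Python) defines a hypothetical container; the field names, array shapes and the 62-electrode EPG layout are assumptions for illustration, not the database's actual file format:

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class MochaUtterance:
        """One utterance with all recorded channels (hypothetical layout)."""
        audio: np.ndarray         # acoustic waveform, shape (n_audio_samples,)
        laryngograph: np.ndarray  # voicing signal, shape (n_lar_samples,)
        epg: np.ndarray           # tongue-palate contacts, shape (n_epg_frames, 62)
        ema: np.ndarray           # x/y midsagittal coil coordinates,
                                  # shape (n_ema_frames, 2 * number of coils)
        # The seven sensor coils named in the channel description above.
        coil_names: tuple = ("upper_lip", "lower_lip", "lower_incisor",
                             "tongue_tip", "tongue_blade", "tongue_dorsum",
                             "soft_palate")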
The articulatory feature vector is created by applying principal
components analysis to data provided by EPG and EMA, supplemented
with a voicing energy feature extracted from the Laryngograph.
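A minimal sketch of this feature construction, assuming the EPG and EMA channels have already been time-aligned into per-frame row vectors (the file names, dimensionalities and the use of scikit-learn are illustrative assumptions; whether PCA is applied jointly or per channel is not specified here):

    import numpy as np
    from sklearn.decomposition import PCA

    # Hypothetical pre-aligned frame matrices, one row per analysis frame.
    ema_frames = np.load("ema_frames.npy")   # shape (n_frames, n_ema_dims)
    epg_frames = np.load("epg_frames.npy")   # shape (n_frames, n_epg_dims)
    voicing = np.load("voicing_energy.npy")  # shape (n_frames, 1), Laryngograph

    # Reduce the concatenated articulatory channels with PCA; the
    # component count (16) is an arbitrary illustrative choice.
    articulatory = np.hstack([ema_frames, epg_frames])
    reduced = PCA(n_components=16).fit_transform(articulatory)

    # Append the voicing energy to form the articulatory feature vector.
    features = np.hstack([reduced, voicing])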
The results show relatively poor recognition performance when
compared with a standard acoustic feature vector. Dental stops
are the only phonetic category in which the articulatory feature
vector outperforms the acoustic standard. Examination of the
phone level behaviour of the recogniser indicates several areas
in which it may be improved. The system's strong tendency to
delete schwa and other central vowels may indicate a many-to-one
mapping problem, since the tongue can be in many positions and
still define a uniform acoustic tube. Alternatively, it may
indicate that a targetless schwa has no distinct gesture
associated with it, which calls into question its legitimacy as a
distinct segment: in some instances, the acoustic percept of a
schwa may simply result from coarticulation between adjacent
consonant segments. The
failure to improve on voiced/voiceless consonant discrimination
suggests that the addition of an instrument to measure glottal
opening may be required. More generally, principal component
analysis is probably not the best method for dimension reduction.
It accounts for the variance in the data, which naturally favours
tongue movement, but it is not optimised for discriminating
between phone classes. Linear discriminant analysis, which
maximises class separability directly, would be a better choice.
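As a sketch of that alternative, the unsupervised projection above could be replaced by a supervised one (again using scikit-learn as an illustrative assumption; frame-level phone labels would be required):

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    articulatory = np.hstack([np.load("ema_frames.npy"),
                              np.load("epg_frames.npy")])
    phone_labels = np.load("frame_phone_labels.npy")  # hypothetical labels

    # LDA chooses directions that maximise between-class scatter relative
    # to within-class scatter, so the reduced features target phone
    # discrimination rather than raw variance; n_components must be at
    # most (number of phone classes - 1).
    lda = LinearDiscriminantAnalysis(n_components=16)
    reduced = lda.fit_transform(articulatory, phone_labels)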
Two speakers from the MOCHA database, along with MATLAB display
macros, can be found at http://www.cstr.ed.ac.uk/artic/