
 

Estimation of Vocal Tract Shape from Speech Sounds Via a Physiological Articulatory Model

 Jianwu Dang and Kiyoshi Honda
  
 

Abstract:

A 3-D physiological articulatory model has been developed for human-mimetic speech synthesis. The articulatory model is constructed from volumetric MRI data collected from one male speaker, and is driven by muscle activation signals. The muscle activation signals are computed by successively moving the current position toward the articulatory target position for several observation points on the articulators. In this study, the multipoint control strategy involves three points on the articulators: the tongue tip, tongue dorsum, and mandibular incisor. A set of weight coefficients is defined for each muscle to control each point independently in the model's geometric space. A muscle workspace is proposed to predict the contribution of muscle contraction to the movement of each control point in the geometric space. Muscle activation signals are generated by this successive process and fed to the muscles to drive the model. Thus, a vocal tract shape is generated by model simulation from the articulatory targets. In the acoustic part of the articulatory speech synthesizer, area functions are estimated from vocal tract widths in the midsagittal and parasagittal planes based on an improved alpha-beta model. Speech sounds are generated from the area functions by the transmission line model. This study applies the articulatory model to estimate vocal tract shape from vowel sounds. Average formant patterns of five Japanese vowels were employed as a template to classify an input vowel, and five typical articulatory target patterns were determined to produce the formant patterns. An input vowel is first classified according to the average formant patterns, and then its articulatory targets are calculated using the typical target patterns of the first two candidate vowels. The synthetic vowels based on the calculated targets showed high similarity to the input vowels.
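The classify-then-combine step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the two candidate target patterns are combined, so the inverse-distance weighting used here is an assumption. The formant templates and target vectors are placeholder numbers chosen only for illustration, and the names `TEMPLATES`, `TARGETS`, `rank_vowels`, and `estimate_targets` are hypothetical.

```python
import math

# Placeholder (F1, F2) templates in Hz. Illustrative values only,
# NOT the average formant patterns measured in the study.
TEMPLATES = {
    "a": (750.0, 1200.0),
    "i": (300.0, 2300.0),
    "u": (350.0, 1300.0),
    "e": (500.0, 1900.0),
    "o": (500.0, 850.0),
}

# Placeholder target patterns: (x, y) for tongue tip, tongue dorsum,
# and mandibular incisor, flattened to six numbers in arbitrary units.
TARGETS = {
    "a": (1.0, 0.2, 0.5, 0.1, 0.0, -0.8),
    "i": (1.2, 0.9, 0.7, 1.0, 0.0, -0.2),
    "u": (0.8, 0.6, 0.2, 0.9, 0.0, -0.3),
    "e": (1.1, 0.6, 0.6, 0.7, 0.0, -0.4),
    "o": (0.7, 0.3, 0.1, 0.5, 0.0, -0.6),
}


def rank_vowels(formants):
    """Return vowel labels ordered by Euclidean distance to the input formants."""
    return sorted(TEMPLATES, key=lambda v: math.dist(formants, TEMPLATES[v]))


def estimate_targets(formants):
    """Blend the target patterns of the two closest vowels.

    Assumption: the closer candidate gets the larger weight
    (inverse-distance weighting over the two candidates).
    """
    first, second = rank_vowels(formants)[:2]
    d1 = math.dist(formants, TEMPLATES[first])
    d2 = math.dist(formants, TEMPLATES[second])
    w1 = d2 / (d1 + d2)  # equals 1.0 when the input matches the first template
    blended = tuple(
        w1 * a + (1.0 - w1) * b
        for a, b in zip(TARGETS[first], TARGETS[second])
    )
    return first, second, blended
```

Calling `estimate_targets((625.0, 1025.0))` would rank the five placeholder templates, pick the two nearest vowels, and return their labels together with a six-element blended target vector that could then be passed to an articulatory model.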
In the estimation process, acoustical parameters are treated as a function of the articulatory targets used in the model simulation. The location of the articulatory targets is estimated by comparing the acoustical parameters of a real speech sound with those of the synthetic sound corresponding to the targets. A realistic vocal tract shape is obtained when the difference is minimized. The estimation was performed on five Japanese vowels for two subjects, and its accuracy was evaluated by examining acoustical and articulatory data. The results show that reliable acoustic-to-articulatory inversion is obtainable using a physiological model.
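The minimization described above follows an analysis-by-synthesis pattern, which the sketch below illustrates under strong simplifying assumptions. The function `toy_synthesize` is a hypothetical linear stand-in for the physiological model and acoustic stage, not the model from the paper. The abstract also does not name an optimizer, so the derivative-free coordinate search in `invert` is an assumption made for this example.

```python
import math


def toy_synthesize(target):
    """Hypothetical forward model: map a 2-D articulatory target
    (height, frontness) to (F1, F2) in Hz.
    A stand-in for the physiological model, for illustration only."""
    height, front = target
    f1 = 800.0 - 500.0 * height
    f2 = 900.0 + 1400.0 * front
    return (f1, f2)


def invert(observed, start, step=0.25, tol=1e-6, max_iter=10_000):
    """Search for the target whose synthetic formants best match `observed`.

    Coordinate search: try moving each coordinate by +/- step, keep any
    move that lowers the acoustic distance, and halve the step when no
    move helps.
    """
    best = list(start)
    best_err = math.dist(toy_synthesize(best), observed)
    for _ in range(max_iter):
        if step <= tol:
            break
        improved = False
        for i in range(len(best)):
            for delta in (step, -step):
                cand = best.copy()
                cand[i] += delta
                err = math.dist(toy_synthesize(cand), observed)
                if err < best_err:
                    best, best_err = cand, err
                    improved = True
        if not improved:
            step /= 2.0
    return tuple(best), best_err
```

As a usage example, formants generated from a known target, `observed = toy_synthesize((0.8, 0.3))`, can be passed to `invert(observed, start=(0.5, 0.5))` to check that the search recovers a target close to the one that produced them. With a real articulatory model the mapping is nonlinear and many-to-one, so a search like this would need good starting points, such as template-based targets.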

 
 


© 2010 The MIT Press