ISSN: 0891-2017
E-ISSN: 1530-9312
2014 Impact factor: 1.23

Computational Linguistics

Paola Merlo, Editor
September 2012, Vol. 38, No. 3, Pages 631-671
(doi: 10.1162/COLI_a_00107)
© 2012 Association for Computational Linguistics
A Scalable Distributed Syntactic, Semantic, and Lexical Language Model
Abstract

This paper presents an attempt at building a large-scale distributed composite language model. The model seamlessly integrates an n-gram model, a structured language model, and probabilistic latent semantic analysis under a directed Markov random field paradigm, so that it simultaneously accounts for local word lexical information, mid-range sentence syntactic structure, and long-span document semantic content. The composite language model is trained with a convergent N-best list approximate EM algorithm, followed by a further EM algorithm to improve word prediction power, on corpora of up to a billion tokens stored on a supercomputer. The large-scale distributed composite language model gives a drastic perplexity reduction over n-grams and, when applied to re-ranking the N-best list from a state-of-the-art parsing-based machine translation system, achieves significantly better translation quality as measured by the BLEU score and the “readability” of translations.
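
The two evaluation settings named in the abstract, perplexity and N-best re-ranking, can be illustrated with a minimal Python sketch. This is not the paper's composite model or its training procedure: the function names `perplexity` and `rerank`, the log-linear combination with a single `lm_weight`, and the toy scoring function in the usage lines are all assumptions made for illustration, with any real language model standing in for `lm_logprob`.

```python
import math
from typing import Callable, List, Sequence, Tuple


def perplexity(log_probs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities.

    PPL = exp(-(1/N) * sum_i log p(w_i | history_i)).
    """
    if not log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(log_probs) / len(log_probs))


def rerank(
    nbest: List[Tuple[str, float]],
    lm_logprob: Callable[[str], float],
    lm_weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Re-rank (hypothesis, base_score) pairs, best first.

    The new score is base_score + lm_weight * lm_logprob(hypothesis).
    This simple weighted sum is an assumed combination rule, chosen
    only to show the shape of an N-best re-ranking step.
    """
    rescored = [(hyp, base + lm_weight * lm_logprob(hyp)) for hyp, base in nbest]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Hypothetical usage: a toy scorer that prefers hypotheses starting with "the".
def toy_lm(hypothesis: str) -> float:
    return 0.0 if hypothesis.startswith("the") else -5.0


candidates = [("cat the sat", -0.9), ("the cat sat", -1.0)]
best_hypothesis, best_score = rerank(candidates, toy_lm)[0]
```

In this sketch the translation system's own score ranks "cat the sat" first, and the added language model term moves "the cat sat" to the top, which is the effect re-ranking is meant to have when the language model is a better judge of fluency than the base system.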