ISSN: 0891-2017
E-ISSN: 1530-9312

Computational Linguistics

Paola Merlo, Editor
September 2002, Vol. 28, No. 3, Pages 289-318
(doi: 10.1162/089120102760275992)
© 2002 Association for Computational Linguistics
Periods, Capitalized Words, etc.
Abstract

In this article we present an approach for tackling three important aspects of text normalization: sentence boundary disambiguation, disambiguation of capitalized words in positions where capitalization is expected, and identification of abbreviations. In contrast to the two dominant techniques, computing statistics and writing specialized grammars, our document-centered approach works by considering suggestive local contexts and repetitions of individual words within a document. This approach proved to be robust to domain shifts and new lexica and performed on a par with the highest reported results. When incorporated into a part-of-speech tagger, it helped reduce the error rate significantly on capitalized words and sentence boundaries. We also investigated its portability to other languages and obtained encouraging results.
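
The idea of using repetitions of a word within a single document can be illustrated with a minimal sketch. It is not the article's implementation: the function name `classify_sentence_initial_words`, the three labels, and the decision rule are hypothetical, and the sketch naively treats every period as a sentence boundary, which is precisely the simplification the article sets out to avoid. It shows only how unambiguous occurrences elsewhere in the document might inform the reading of a capitalized word in sentence-initial position.

```python
import re
from collections import Counter


def classify_sentence_initial_words(text):
    """Label sentence-initial capitalized words using evidence from the same document.

    Hypothetical illustration only. Assumes every '.', '!' or '?' ends a sentence.
    """
    tokens = re.findall(r"[A-Za-z]+|[.!?]", text)

    lower_seen = Counter()    # words observed in lowercase anywhere
    cap_mid_seen = Counter()  # words observed capitalized in mid-sentence position
    ambiguous = []            # capitalized words where capitalization is expected

    sentence_start = True
    for tok in tokens:
        if tok in ".!?":
            sentence_start = True
            continue
        if tok[0].isupper():
            if sentence_start:
                ambiguous.append(tok)
            else:
                cap_mid_seen[tok.lower()] += 1
        else:
            lower_seen[tok] += 1
        sentence_start = False

    labels = {}
    for word in ambiguous:
        key = word.lower()
        if cap_mid_seen[key] and not lower_seen[key]:
            labels[word] = "proper"    # only ever capitalized mid-sentence
        elif lower_seen[key] and not cap_mid_seen[key]:
            labels[word] = "common"    # seen in lowercase elsewhere
        else:
            labels[word] = "unknown"   # no evidence, or conflicting evidence
    return labels


sample = (
    "Brown arrived late. We asked Brown why. "
    "Dogs barked outside. The dogs were loud."
)
labels = classify_sentence_initial_words(sample)
print(labels)
```

In the sample text, "Brown" reappears capitalized in mid-sentence and "dogs" reappears in lowercase, so the document itself supplies the evidence. Words with no other occurrence stay unresolved, and a fuller system would need further sources of evidence for them.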