## Computational Linguistics

December 2010, Vol. 36, No. 4, Pages 631-637
(doi: 10.1162/coli_a_00013)
© 2010 Association for Computational Linguistics
An Asymptotic Model for the English Hapax/Vocabulary Ratio
Abstract

According to the existing literature, hapax legomena (words that occur exactly once) in an English text or a collection of texts account for roughly 50% of the vocabulary. This sort of constancy is baffling. The 100-million-word British National Corpus was used to study this phenomenon. The results reveal that the hapax/vocabulary ratio follows a U-shaped pattern. Initially, as text size increases, the hapax/vocabulary ratio decreases; however, after the text size reaches about 3,000,000 words, the ratio starts to increase steadily. A computer simulation shows that as the text size continues to increase, the hapax/vocabulary ratio would approach 1.
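
The quantity discussed in the abstract can be illustrated with a minimal Python sketch. It is not the article's procedure: the function names (`tokenize`, `hapax_vocabulary_ratio`, `ratio_curve`) are hypothetical, and the tokenizer (lowercased runs of letters and apostrophes) is an assumption made only for illustration. The sketch computes the hapax/vocabulary ratio for a token list and tracks it over growing prefixes of a text, which is one way the U-shaped pattern might be observed on a sufficiently large corpus.

```python
import re
from collections import Counter


def tokenize(text):
    """Assumed tokenizer: lowercased runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def hapax_vocabulary_ratio(tokens):
    """Number of word types occurring exactly once, divided by the number of types."""
    counts = Counter(tokens)
    if not counts:
        raise ValueError("tokens must not be empty")
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)


def ratio_curve(tokens, sizes):
    """Return (n, ratio) for the first n tokens, for each n in sizes.

    Counts are updated incrementally so the token list is scanned only once.
    """
    sizes = sorted(sizes)
    if not sizes or sizes[0] < 1 or sizes[-1] > len(tokens):
        raise ValueError("sizes must lie between 1 and len(tokens)")
    counts = Counter()
    hapaxes = 0
    position = 0
    curve = []
    for n in sizes:
        for token in tokens[position:n]:
            counts[token] += 1
            if counts[token] == 1:
                hapaxes += 1      # a new type starts out as a hapax
            elif counts[token] == 2:
                hapaxes -= 1      # a second occurrence ends its hapax status
        position = n
        curve.append((n, hapaxes / len(counts)))
    return curve


if __name__ == "__main__":
    sample = tokenize("The cat sat on the mat. The end.")
    print(hapax_vocabulary_ratio(sample))
    print(ratio_curve(sample, [4, 8]))
```

On a real corpus, `sizes` would be a series of increasing text sizes (for example, powers of ten up to the corpus length), and the resulting curve could be plotted against text size.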