| |
Abstract:
Classification of finite sequences without explicit knowledge
of their statistical origins is of great importance to various
applications, such as text categorization and biological sequence
analysis. We propose a new information theoretic algorithm for this
problem based on two important ingredients. The first, motivated by
the "two sample problem", provides an (almost) optimal model
independent criterion for comparing two statistical sources. The
second ingredient is a novel idea for agnostic "statistical
merging" of sequences in an asymptotically optimal way.
In this paper we introduce the algorithm and illustrate its
application for synthesis of text sources and for hierarchical
classification of short text segments.
|