| |
Abstract:
Many domains are naturally organized in an abstraction
hierarchy or taxonomy, where the instances in ``nearby'' classes
in the taxonomy are similar. In this paper, we provide a general
probabilistic framework for clustering data into a set of classes
organized as a taxonomy, where each class is associated with a
probabilistic model from which the data was generated. The
clustering algorithm simultaneously optimizes three things: the
assignment of data instances to clusters, the models associated
with the clusters, and the structure of the abstraction
hierarchy. A unique feature of our approach is that it utilizes
global optimization algorithms for both of the last two steps,
reducing the sensitivity to noise and the propensity to local
maxima that are characteristic of algorithms such as hierarchical
agglomerative clustering that only take local steps. We provide a
theoretical analysis for our algorithm, showing that it converges
to a local maximum of the joint likelihood of model and data. We
present experimental results on synthetic data, and on real data
in the domains of gene expression and text.
|