| |
Abstract:
Clustering is important in many fields including
manufacturing, biology, finance, and astronomy. Mixture models are
a popular approach due to their statistical foundations, and EM is
a very popular method for finding mixture models. EM, however,
requires many accesses of the data, and thus has been dismissed as
impractical for data mining of enormous datasets. We present a new
algorithm, based on the multiresolution kd-trees of our earlier
work, which dramatically reduces the cost of EM-based clustering,
with savings rising linearly with the number of datapoints.
Although presented here for maximum likelihood estimation of
Gaussian mixture models, it is also applicable to non-Gaussian
models (provided class densities are monotonic in Mahalanobis
distance), mixed categorical/numeric clusters, and Bayesian methods
such as Autoclass.
|