Abstract:
In this paper we describe an approach to model selection in
unsupervised learning. This approach determines both the feature
set and the number of clusters. To this end, we first derive an
objective function that explicitly accounts for both choices.
We then evaluate two schemes for model selection: the first uses
this objective function in a Bayesian estimation scheme that
selects the best model structure by its marginal (integrated)
likelihood, and the second uses a cross-validated likelihood
criterion. In the first scheme, for a
particular application in document clustering, we derive a
closed-form solution of the integrated likelihood by assuming an
appropriate form of the likelihood function and prior. Extensive
experiments are carried out to assess the validity of both
approaches, and all results are verified by comparison against
ground truth. In our experiments, the Bayesian scheme using our
objective function gave better results than cross-validation.