Abstract:
This paper develops, from first principles, a general machine
learning approach to learning the similarity between text
documents. We use a statistical latent
class model to generate a decomposition of document collections in
terms of topic factors. From this model a canonical kernel, the
Fisher kernel, is derived within the theoretical framework of
information geometry. The Fisher kernel provides a similarity
function that can be used for unsupervised and supervised learning
problems alike. In particular, this covers the interesting case
where both labeled and unlabeled data are available. Experiments in
automated indexing and text categorization demonstrate the
advantages of this approach.