MIT CogNet, The Brain Sciences Connection, from the MIT Press

Learning the Similarity of Documents: An Information-Geometric Approach to Document Retrieval and Categorization

Thomas Hofmann

Abstract:
This paper develops, from first principles, a general machine learning approach for learning the similarity between text documents. We use a statistical latent class model to decompose document collections into topic factors. From this model a canonical kernel, the Fisher kernel, is derived within the theoretical framework of information geometry. The Fisher kernel provides a similarity function that can be used for unsupervised and supervised learning problems alike. In particular, this covers the interesting case where both labeled and unlabeled data are available. Experiments in automated indexing and text categorization verify the advantages of this approach.
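
To make the idea of a Fisher kernel concrete, the following is a minimal sketch under a deliberately simplified assumption: a single multinomial (unigram) model over words stands in for the latent class model named in the abstract, so this is not the paper's kernel. Under the square-root parameterization of the multinomial, the Fisher information is the identity, and the kernel between two documents reduces to a sum over words of the product of their empirical word frequencies divided by the corpus word probability. The function name `fisher_kernel_unigram`, the `eps` parameter, and the toy count matrix are hypothetical and chosen only for illustration.

```python
import numpy as np

def fisher_kernel_unigram(counts, eps=1e-12):
    """Fisher kernel for a unigram (single multinomial) document model.

    Assumption: this simplified model replaces the latent class model
    of the abstract. `counts` is a (documents x words) term-count
    matrix in which every document contains at least one word.
    """
    counts = np.asarray(counts, dtype=float)
    # Empirical word distribution of each document.
    doc_dist = counts / counts.sum(axis=1, keepdims=True)
    # Maximum-likelihood word probabilities over the whole collection.
    theta = counts.sum(axis=0) / counts.sum()
    # Fisher scores in the square-root parameterization, where the
    # Fisher information matrix is the identity.
    scores = doc_dist / np.sqrt(theta + eps)
    # Kernel matrix: K[i, j] = sum_w doc_dist[i, w] * doc_dist[j, w] / theta[w]
    return scores @ scores.T

# Toy collection: three documents over a two-word vocabulary.
K = fisher_kernel_unigram([[2, 0], [0, 2], [1, 1]])
print(K)
```

Rare words receive a larger weight because of the division by the corpus probability, which gives the kernel an inverse-frequency flavor. A kernel derived from a latent class model would add terms that depend on the topic factors, which this sketch omits.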

 
 


© 2010 The MIT Press