Abstract:
We propose a generative model for text and other collections
of discrete data that generalizes or improves on several previous
models including naive Bayes/unigram, mixture of unigrams [6],
and Hofmann's aspect model, also known as probabilistic latent
semantic indexing (pLSI) [3]. In the context of text modeling,
our model posits that each document is generated as a mixture of
topics, where the continuous-valued mixture proportions are
distributed as a latent Dirichlet random variable. Inference and
learning are carried out efficiently via variational algorithms.
We present empirical results on applications of this model to
problems in text modeling, collaborative filtering, and text
classification.
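The generative process sketched in the abstract (draw topic proportions from a Dirichlet, then draw each word from a topic chosen by those proportions) can be illustrated with a minimal simulation. All parameter values below (number of topics, vocabulary size, document length, and the topic-word distributions) are hypothetical placeholders, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

n_topics = 3     # K: number of topics (assumed value)
vocab_size = 5   # V: vocabulary size (assumed value)
doc_length = 8   # N: words per document (assumed value)

# Dirichlet prior over per-document topic proportions.
alpha = np.ones(n_topics)
# Per-topic distributions over the vocabulary (here drawn randomly
# for illustration; the model would learn these from data).
beta = rng.dirichlet(np.ones(vocab_size), size=n_topics)

def generate_document():
    # 1. Draw continuous-valued topic proportions theta ~ Dirichlet(alpha).
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(doc_length):
        # 2. Choose a topic z for this word according to theta.
        z = rng.choice(n_topics, p=theta)
        # 3. Draw the word from topic z's distribution over the vocabulary.
        words.append(int(rng.choice(vocab_size, p=beta[z])))
    return words

doc = generate_document()
```

This only demonstrates the forward (generative) direction; inference over the latent proportions given observed documents is what the paper's variational algorithms address.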
References
[3] T. Hofmann. Probabilistic latent semantic indexing. Proceedings of the Twenty-Second Annual International SIGIR Conference, 1999.
[6] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103-134, 2000.