The Latent Dirichlet Allocation (LDA) model, which comes from the machine
learning community, models documents as a mixture of topics. A topic is a language
model, just as we defined previously. In a retrieval model such as query likelihood, each document is assumed to be associated with a single topic. There are,
in effect, as many topics as there are documents in the collection. In the LDA approach,
in contrast, the assumption is that there is a fixed number of underlying
(or latent) topics that can be used to describe the contents of documents. Each
document is represented as a mixture of these topics, which achieves a smoothing
effect that is similar to LSI. In the LDA model, a document is generated by first
picking a distribution over topics and then, for each word in the document,
choosing a topic from that distribution and generating the word from that topic's language model.
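The generative process just described can be sketched in a few lines of code. This is an illustrative sketch only: the toy topics, vocabulary, and Dirichlet hyperparameter values below are assumptions made for the example, not part of the model definition above.

```python
import random

def sample_dirichlet(alpha, rng):
    # Sample a probability vector from a Dirichlet distribution
    # by drawing from Gamma distributions and normalizing.
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, doc_length, alpha, rng):
    # topics: list of topic language models, each a dict of word -> probability.
    # Step 1: pick this document's distribution over topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(doc_length):
        # Step 2: for each word position, choose a topic
        # according to theta...
        k = rng.choices(range(len(topics)), weights=theta)[0]
        # ...then generate a word from that topic's language model.
        vocab, probs = zip(*topics[k].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return words

# Two toy unigram language models standing in for latent topics
# (hypothetical; any number of topics and any vocabulary would do).
topics = [
    {"game": 0.5, "team": 0.4, "bank": 0.1},
    {"bank": 0.5, "stock": 0.4, "game": 0.1},
]
rng = random.Random(42)
doc = generate_document(topics, doc_length=8, alpha=[1.0, 1.0], rng=rng)
print(doc)
```

Because each document draws its own topic mixture, words from several topics can appear in one document; this is the sense in which a document is a mixture of topics rather than a single topic, as in query likelihood.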