For search applications, we use language models to represent the topical content
of a document. A topic is something that is talked about often but rarely defined
in information retrieval discussions. In this approach, we define a topic as a
probability distribution over words (in other words, a language model). For example,
if a document is about fishing in Alaska, we would expect to see words associated
with fishing and locations in Alaska with high probabilities in the language
model. If it is about fishing in Florida, some of the high-probability words will be
the same, but there will be more high probability words associated with locations
in Florida. If instead the document is about fishing games for computers, most of
the high-probability words will be associated with game manufacturers and computer
use, although there will still be some important words about fishing. Note
that a topic language model, or topic model for short, contains probabilities for all
words, not just the most important ones. Most of the words will have “default” probabilities
that will be the same for any text, but the words that are important for the
topic will have unusually high probabilities.
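One minimal sketch of such a topic model mixes a document's word frequencies with a collection-wide background distribution, so every vocabulary word keeps a small "default" probability while topic words stand out. The function and variable names here (`unigram_model`, `background`, `lam`) and the toy vocabulary are our own illustrative choices, not anything defined in the text.

```python
from collections import Counter

def unigram_model(doc_words, background, lam=0.8):
    """Build a unigram topic model as a dict of word -> probability.

    Mixes the document's relative frequencies with a background
    distribution (linear smoothing), so words absent from the
    document still get the same small default probability they
    would get for any text.  Illustrative sketch only.
    """
    counts = Counter(doc_words)
    total = len(doc_words)
    return {w: lam * counts[w] / total + (1 - lam) * background[w]
            for w in background}

# Hypothetical toy collection: uniform background over a tiny vocabulary.
vocab = ["fishing", "alaska", "salmon", "the", "of", "computer"]
background = {w: 1 / len(vocab) for w in vocab}

doc = ["fishing", "alaska", "salmon", "fishing", "the", "of"]
model = unigram_model(doc, background)
# Topic words ("fishing") get unusually high probability; words the
# document never uses ("computer") keep only the default background mass.
```

The probabilities still sum to one, and the smoothing weight `lam` controls how far the topic words rise above the default level.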
A language model representation of a document can be used to “generate” new
text by sampling words according to the probability distribution. If we imagine
the language model as a big bucket of words, where the probabilities determine how many instances of a word are in the bucket, then we can generate text by
reaching in (without looking), drawing out a word, writing it down, putting the
word back in the bucket, and drawing again. Note that we are not saying that we
can generate the original document by this process. In fact, because we are only
using a unigram model, the generated text is going to look pretty bad, with no
syntactic structure. Important words for the topic of the document will, however,
appear often. Intuitively, we are using the language model as a very approximate
model for the topic the author of the document had in mind while writing it.
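The bucket-drawing process above is just sampling with replacement from the model's distribution. A short sketch, using the same hypothetical toy model as before (the names `generate` and `model` are ours):

```python
import random

def generate(model, n, seed=0):
    """Generate n words by sampling with replacement from a unigram
    language model (a dict of word -> probability): each draw is one
    reach into the 'bucket', and the word goes back in afterwards.
    Sketch only; a fixed seed keeps the output reproducible."""
    rng = random.Random(seed)
    words = list(model)
    probs = [model[w] for w in words]
    return rng.choices(words, weights=probs, k=n)

# Hypothetical toy model for a document about fishing in Alaska.
model = {"fishing": 0.3, "alaska": 0.2, "salmon": 0.2,
         "the": 0.15, "of": 0.1, "computer": 0.05}

text = " ".join(generate(model, 12))
# The generated "text" has no syntactic structure, but the important
# topic words appear often because they have high probability.
```

Because the draws are independent, the word order is meaningless, which is exactly why unigram-generated text looks so bad while still reflecting the topic.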