An important issue in information retrieval is vocabulary mismatch. This
refers to a situation where relevant documents do not match a query because
they use different words to describe the same topic. In the web environment,
many documents will contain all the query words, so this may not appear
to be an issue. In search applications with smaller collections, however, it is
a real problem, and even in web search, TREC experiments have shown that topical
queries produce better results using query expansion. Query expansion (using,
for example, pseudo-relevance feedback) is the standard technique for reducing
vocabulary mismatch, although stemming also addresses this issue to some extent.
A different approach would be to expand the documents by adding related terms.
For documents represented as language models, this is equivalent to smoothing
the probabilities in the language model so that words that did not occur in the
text have non-zero probabilities. Note that this differs from smoothing with
the collection probabilities, which are the same for all documents. Instead, we
need some way of increasing the probabilities of words that are associated with
the topic of the document.
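
The contrast between the two kinds of smoothing can be illustrated with a small sketch. It is only an illustration under stated assumptions: the toy documents, the mixture weights, and the helper names `mle` and `mix` are all hypothetical, and the set of documents related to `d1` is simply assumed to be known (the text above does not say how such associations would be found). Collection smoothing gives every unseen word the same boost regardless of topic, whereas mixing in a model estimated from related text raises the probability of topically associated words such as "automobile" above unrelated words such as "fruit".

```python
from collections import Counter

def mle(tokens):
    """Maximum likelihood unigram model: P(w) = count(w) / length."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def mix(primary, background, lam):
    """Linear interpolation: (1 - lam) * primary + lam * background."""
    vocab = set(primary) | set(background)
    return {
        w: (1 - lam) * primary.get(w, 0.0) + lam * background.get(w, 0.0)
        for w in vocab
    }

# Toy collection (hypothetical data, for illustration only).
docs = {
    "d1": "car engine repair".split(),
    "d2": "automobile engine maintenance".split(),
    "d3": "fruit salad recipe".split(),
}

# Collection model: identical for every document.
collection = mle([w for tokens in docs.values() for w in tokens])

# Document model for d1, estimated from its own text only.
d1 = mle(docs["d1"])

# Assumption: d2 is already known to be topically related to d1.
related = mle(docs["d2"])

# Smoothing with collection probabilities only.
collection_smoothed = mix(d1, collection, lam=0.2)

# Document-specific smoothing: first mix in the related-text model,
# then fall back to the collection so every word stays non-zero.
topic_smoothed = mix(mix(d1, related, lam=0.3), collection, lam=0.2)

for word in ("automobile", "fruit"):
    print(word, collection_smoothed[word], topic_smoothed[word])
```

With collection smoothing alone, "automobile" and "fruit" receive the same probability in the model for `d1`, since neither occurs in the document and both occur once in the collection. After the document-specific step, "automobile" is more probable than "fruit", which is the behavior needed to reduce vocabulary mismatch for a query that uses "automobile" instead of "car".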