where fqi,D is the number of times word qi occurs in document D, and |D| is the number of words in D. For a multinomial distribution, this is the maximum likelihood estimate, which means it is the estimate that makes the observed value of fqi,D most likely. The major problem with this estimate is that if any of the query words are missing from the document, the score given by the query likelihood model for P(Q|D) will be zero. This is clearly not appropriate for longer queries. For example, missing one word out of six should not produce a score of zero. We will also not be able to distinguish between documents that have different numbers of query words missing. Additionally, because we are building a topic model for a document, words associated with that topic should have some probability
of occurring, even if they were not mentioned in the document. For example,
a language model representing a document about computer games should
have some non-zero probability for the word “RPG” even if that word was not
mentioned in the document. A small probability for that word will enable the
document to receive a non-zero score for the query “RPG computer games”, although
it will be lower than the score for a document that contains all three words.
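
To make the zero-probability problem concrete, here is a minimal Python sketch of the maximum likelihood query likelihood score; the sketch is ours, not the text's, and the toy document and queries are invented for illustration:

    from collections import Counter

    def query_likelihood_mle(query, document):
        # P(Q|D) as a product of maximum likelihood estimates fqi,D / |D|.
        counts = Counter(document)
        score = 1.0
        for word in query:
            score *= counts[word] / len(document)  # factor is 0 for an absent word
        return score

    doc = "computer games hardware and gaming reviews".split()
    print(query_likelihood_mle("computer games".split(), doc))      # 1/36, non-zero
    print(query_likelihood_mle("RPG computer games".split(), doc))  # 0.0: "RPG" is unseen

A single missing query word drives the whole product to zero, no matter how well the document matches the remaining words.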
Smoothing is a technique for avoiding this estimation problem and overcoming
data sparsity, which means that we typically do not have large amounts of
text to use for the language model probability estimates. The general approach
to smoothing is to lower (or discount) the probability estimates for words that
are seen in the document text, and assign that “leftover” probability to the estimates
for the words that are not seen in the text. The estimates for unseen words
are usually based on the frequency of occurrence of words in the whole document
collection. If P(qi|C) is the probability for query word i in the collection language
model for document collection C, then the estimate we use for an unseen word in
a document is αDP(qi|C), where αD is a coefficient controlling the probability
assigned to unseen words. In general, αD can depend on the document. In order
that the probabilities sum to one, the probability estimate for a word that is seen
in a document is (1 − αD)P(qi|D) + αDP(qi|C).
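
In code, the two cases collapse into a single expression, because the maximum likelihood estimate P(w|D) is zero for a word that does not occur in the document. A minimal sketch under our own naming, where alpha_d is the coefficient αD and collection_prob is assumed to be a precomputed mapping of collection language model probabilities:

    from collections import Counter

    def smoothed_prob(word, document, collection_prob, alpha_d):
        # Smoothed estimate: (1 - alpha_d) * P(w|D) + alpha_d * P(w|C).
        # For a word unseen in the document, P(w|D) = 0 and the expression
        # reduces to alpha_d * P(w|C), matching the unseen-word case above.
        p_doc = Counter(document)[word] / len(document)  # maximum likelihood P(w|D)
        return (1 - alpha_d) * p_doc + alpha_d * collection_prob[word]

A retrieval function would then combine these smoothed probabilities across the query words (in practice by summing their logarithms, to avoid numerical underflow).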
To make this clear, consider a simple example where there are only three words,
w1, w2, and w3, in our index vocabulary. If the collection probabilities for these
three words, based on maximum likelihood estimates, are 0.3, 0.5, and 0.2, and the
document probabilities based on maximum likelihood estimates are 0.5, 0.5, and
0.0, then the smoothed probability estimates for the document language model
are:

P(w1|D) = (1 − αD) × 0.5 + αD × 0.3
P(w2|D) = (1 − αD) × 0.5 + αD × 0.5
P(w3|D) = αD × 0.2

since w3 does not occur in the document. Whatever the value of αD, these estimates sum to (1 − αD) × 1.0 + αD × 1.0 = 1, and the unseen word w3 now has a non-zero probability.
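
Plugging in a concrete coefficient, say αD = 0.5 (an illustrative value; the text leaves αD unspecified), we can check that the smoothed estimates still form a probability distribution:

    alpha_d = 0.5  # illustrative; the text does not fix a value for alpha_D
    p_collection = {"w1": 0.3, "w2": 0.5, "w3": 0.2}  # collection MLEs
    p_document = {"w1": 0.5, "w2": 0.5, "w3": 0.0}    # document MLEs

    smoothed = {w: (1 - alpha_d) * p_document[w] + alpha_d * p_collection[w]
                for w in p_collection}
    print(smoothed)                # {'w1': 0.4, 'w2': 0.5, 'w3': 0.1}
    print(sum(smoothed.values()))  # 1.0 -- the estimates sum to one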