7.3.2 Relevance Models and Pseudo-Relevance Feedback
Although the basic query likelihood model has a number of advantages, it is limited
in terms of how it models information needs and queries. It is difficult, for
example, to incorporate information about relevant documents into the ranking
algorithm, or to represent the fact that a query is just one of many possible queries
that could be used to describe a particular information need. In this section, we
show how this can be done by extending the basic model.
In the introduction to section 7.3, we mentioned that it is possible to represent
the topic of a query as a language model. Instead of calling this the query language
model, we use the name relevance model since it represents the topic covered by
relevant documents. The query can be viewed as a very small sample of text generated
from the relevance model, and relevant documents are much larger samples
of text from the same model. Given some examples of relevant documents for a
query, we could estimate the probabilities in the relevance model and then use
this model to predict the relevance of new documents. In fact, this is a version of
the classification model presented in section 7.2.1, where we interpret P(D|R) as
the probability of generating the text in a document given a relevance model. This
is also called the document likelihood model. Although this model, unlike the
binary independence model, directly incorporates term frequency, it turns out
that P(D|R) is difficult to calculate and compare across documents. This is
because
documents contain a large and extremely variable number of words compared
to a query. Consider two documents Da and Db, for example, containing 5 and
500 words respectively. Because of the large difference in the number of words
involved, the comparison of P(Da|R) and P(Db|R) for ranking will be more
difficult than comparing P(Q|Da) and P(Q|Db), which use the same query and
smoothed representations for the documents. In addition, we still have the problem
of obtaining examples of relevant documents.
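A toy sketch can make the length problem concrete: under any unigram relevance model, log P(D|R) is a sum of per-word log probabilities over every word in the document, so a long document receives a far more negative score than a short one regardless of how well it matches the topic. The documents, query topic, and function names below are hypothetical illustrations, not the estimation method developed later in the chapter.

```python
from collections import Counter
import math

def relevance_model(relevant_docs, smoothing=0.01):
    """Estimate a smoothed unigram relevance model P(w|R) from a few
    example relevant documents (a toy stand-in for real estimation)."""
    counts = Counter(w for d in relevant_docs for w in d.split())
    total = sum(counts.values())
    vocab_size = len(counts)

    def prob(w):
        # Additive smoothing so unseen words get a small nonzero probability
        return (counts[w] + smoothing) / (total + smoothing * vocab_size)

    return prob

def log_doc_likelihood(doc, p_w_given_r):
    """log P(D|R): sum of per-word log probabilities under the model."""
    return sum(math.log(p_w_given_r(w)) for w in doc.split())

# Hypothetical relevant documents for the query "tropical fish"
p = relevance_model(["tropical fish swim in warm water",
                     "aquarium fish need a heated tank"])

short_doc = "tropical fish tank"                       # 3 words, on topic
long_doc = "tropical fish swim in warm water " * 80    # 480 words, on topic

print(log_doc_likelihood(short_doc, p))
print(log_doc_likelihood(long_doc, p))
# Both documents are clearly relevant, but the long document's
# log-likelihood is far more negative -- length dominates the score.
```

Because the sum grows with every additional word, comparing P(Da|R) for a 5-word document against P(Db|R) for a 500-word one mostly measures length, which is exactly the difficulty described above.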
There is, however, another alternative. If we can estimate a relevance model
from a query, we can compare this language model directly with the model for a
document. Documents would then be ranked by the similarity of the document
model to the relevance model. A document with a model that is very similar to
the relevance model is likely to be on the same topic. The obvious next question
is how to compare two language models. A well-known measure from probability
theory and information theory, the Kullback-Leibler divergence (referred to
as KL-divergence in this book), measures the difference between two probability
distributions. Given the true probability distribution P and another distribution
Q that is an approximation to P, the KL-divergence is defined as: