7.2 Probabilistic Models
One of thefeatures that a retrieval model should provide is a clear statement about
the assumptions upon which it is based.The Boolean and vector space approaches
make implicit assumptions about relevance and text representation that impact
the design and effectiveness of ranking algorithms. The ideal situation would be
to show that, given the assumptions, a ranking algorithm based on the retrieval
model will achieve better effectiveness than any other approach. Such proofs are
actually very hard to come by in information retrieval, since we are trying to formalize
a complex human activity. The validity of a retrieval model generally has
to be validated empirically, rather than theoretically.
One early theoretical statement about effectiveness, known as the Probability
Ranking Principle (Robertson, 1977/1997), encouraged the development of
probabilistic retrieval models, which are the dominant paradigm today. These
models have achieved this status because probability theory is a strong foundation
for representing and manipulating the uncertainty that is an inherent partof the information retrieval process. The Probability Ranking Principle, as originally
stated, is as follows:
If a reference retrieval system’s1
response to each request is a ranking of
the documents in the collection in order of decreasing probability of relevance
to the user who submitted the request, where the probabilities are
estimated as accurately as possible on the basis of whatever data have been
made available to the system for this purpose, the overall effectiveness of
the system to its user will be the best that is obtainable on the basis of those
data.
Given some assumptions, such as that the relevance of a document to a query
is independent of other documents, it is possible to show that this statement is
true, in the sense that ranking by probability of relevance will maximize precision,
which is the proportion of relevant documents, at any given rank (for example,
in the top 10 documents). Unfortunately, the Probability Ranking Principle
doesn’t tell us how to calculate or estimate the probability of relevance. There are
many probabilistic retrieval models, and each one proposes a different method for
estimating this probability. Most of the rest of this chapter discusses some of the
most important probabilistic models.
In this section, we start with a simple probabilistic model based on treating
information retrieval as a classification problem. We then describe a popular and
effective ranking algorithm that is based on this model.
7.2.1 Information Retrieval as Classification
In any retrieval model that assumes relevance is binary, there will be two sets of
documents, the relevant documents and the non-relevant documents, for each
query. Given a new document, the task of a search engine could be described as
deciding whether the document belongs in the relevant set or the non-relevant2
set. That is, the system should classify the document as relevant or non-relevant,
and retrieve it if it is relevant.
Given some way of calculating the probability that the document is relevant
and the probability that it is non-relevant, then it would seem reasonable to classify
the document into the set that has the highest probability. In other words,
1 A “reference retrieval system” would now be called a search engine.
2 Note that we never talk about “irrelevant” documents in information retrieval; instead
they are “non-relevant.we would decide that a document D is relevant if P(R|D) > P(NR|D), where
P(R|D) is a conditional probability representing the probability of relevance
given the representation of that document, and P(NR|D) is the conditional
probability of non-relevance (Figure 7.3). This is known as the Bayes Decision
Rule, and a system that classifies documents this way is called a Bayes classifier.
In Chapter 9, we discuss other applications of classification (such as spam filtering)
and other classification techniques, but here we focus on the ranking algorithm
that results from this probabilistic retrieval model based on classification.