In any retrieval model that assumes relevance is binary, there will be two sets of
documents, the relevant documents and the non-relevant documents, for each
query. Given a new document, the task of a search engine could be described as
deciding whether the document belongs in the relevant set or the non-relevant
set. That is, the system should classify the document as relevant or non-relevant,
and retrieve it if it is relevant.
Given some way of calculating the probability that the document is relevant
and the probability that it is non-relevant, it would seem reasonable to classify
the document into the set that has the higher probability. In other words, we
would decide that a document D is relevant if P(R|D) > P(NR|D), where
P(R|D) is a conditional probability representing the probability of relevance
given the representation of that document, and P(NR|D) is the conditional
probability of non-relevance (Figure 7.3). This is known as the Bayes Decision
Rule, and a system that classifies documents this way is called a Bayes classifier.
In Chapter 9, we discuss other applications of classification (such as spam filtering)
and other classification techniques, but here we focus on the ranking algorithm
that results from this probabilistic retrieval model based on classification.
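
As a minimal sketch of the Bayes Decision Rule described above, the following Python function compares the two conditional probabilities and returns the chosen class. The function name bayes_classify and its parameter names are hypothetical, and the probability values are assumed to come from some estimation method that is not shown here.

```python
def bayes_classify(p_relevant: float, p_non_relevant: float) -> str:
    """Classify a document with the Bayes Decision Rule.

    p_relevant     -- assumed estimate of P(R|D)
    p_non_relevant -- assumed estimate of P(NR|D)
    """
    for p in (p_relevant, p_non_relevant):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    # The document is relevant only if P(R|D) is strictly greater than P(NR|D).
    return "relevant" if p_relevant > p_non_relevant else "non-relevant"


if __name__ == "__main__":
    # Illustrative values only; they are not taken from any real collection.
    print(bayes_classify(0.7, 0.3))
    print(bayes_classify(0.2, 0.8))
```

Because the rule uses a strict inequality, this sketch treats a tie as non-relevant; that choice is an assumption of the example.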