All of the probabilistic retrieval models presented so far fall into the category of
generative models. A generative model for text classification assumes that documents
were generated from some underlying model (in this case, usually a multinomial
distribution) and uses training data to estimate the parameters of the
model. The probability that a document belongs to a class (in this case, the class of
documents relevant to a query) is then estimated using Bayes’ Rule and the document model. A discriminative
model, in contrast, estimates the probability of belonging to a class directly
from the observed features of the document based on the training data. In general
classification problems, a generative model performs better with small numbers
of training examples, but a discriminative model usually has the advantage
given enough data. Given the amount of potential training data available to web
search engines, discriminative models may be expected to have some advantages
in this application. It is also easier to incorporate new features into a discriminative
model and, as we have mentioned, there can be hundreds of features that are
considered for web ranking.
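
To make the distinction concrete, the following sketch contrasts the two approaches on a small, invented relevance-classification task using the scikit-learn library. The documents, labels, and query terms are hypothetical and only illustrate the two estimation strategies, not the features an actual web search engine would use.

# Sketch (not from the text): generative vs. discriminative classification of
# documents as relevant (1) or non-relevant (0), using scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB          # generative model
from sklearn.linear_model import LogisticRegression    # discriminative model

# Hypothetical training collection with relevance labels.
train_docs = ["tropical fish tank care", "keeping tropical fish healthy",
              "stock market report", "quarterly earnings report"]
train_labels = [1, 1, 0, 0]

# Represent each document as a vector of term counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_docs)

# Generative: estimate a multinomial term distribution P(w | class) and a prior
# P(class) from the training data, then apply Bayes' Rule at prediction time.
generative = MultinomialNB().fit(X, train_labels)

# Discriminative: estimate P(class | features) directly from the observed
# document features in the training data.
discriminative = LogisticRegression().fit(X, train_labels)

# Both models produce a probability of relevance for a new document.
new_doc = vectorizer.transform(["tropical fish food"])
print(generative.predict_proba(new_doc)[0, 1])       # P(relevant | doc) via Bayes' Rule
print(discriminative.predict_proba(new_doc)[0, 1])   # P(relevant | doc) estimated directly

Note that adding a new feature to the discriminative model only requires appending another column to the feature vectors, whereas the generative model must specify how that feature is generated by each class, which is one reason discriminative models accommodate the large feature sets used in web ranking more easily.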