You might be wondering where the query has gone, given that this is a document
ranking algorithm for a specific query. In many cases, the query provides
us with the only information we have about the relevant set. We can assume that,
in the absence of other information, terms that are not in the query will have the
same probability of occurrence in the relevant and non-relevant documents (i.e.,
p_i = s_i). In that case, the summation will only be over terms that are both in
the query and in the document. This means that, given a query, the score for a
document is simply the sum of the term weights for all matching terms.
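
The matching-term scoring just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `term_weight` and `score` are hypothetical, the weight is assumed to take the log-odds form of the scoring function described earlier, and the probabilities used (called `p` and `s` in the code, for occurrence in relevant and non-relevant documents) are made-up values rather than estimates from any real collection.

```python
import math


def term_weight(p, s):
    """Log-odds weight for one term.

    Assumes the form log(p(1 - s) / (s(1 - p))), where p and s are the
    probabilities of the term occurring in relevant and non-relevant
    documents. Both must lie strictly between 0 and 1.
    """
    return math.log((p * (1 - s)) / (s * (1 - p)))


def score(query_terms, doc_terms, weights):
    """Sum the weights of terms that occur in both the query and the document."""
    matching = set(query_terms) & set(doc_terms)
    # Terms without a known weight contribute nothing to the score.
    return sum(weights.get(term, 0.0) for term in matching)


# Illustrative (invented) probabilities for two query terms.
weights = {
    "tropical": term_weight(0.5, 0.1),
    "fish": term_weight(0.5, 0.25),
}

query = ["tropical", "fish", "food"]
document = ["tropical", "fish", "tank"]

# Only "tropical" and "fish" appear in both, so only they contribute.
print(score(query, document, weights))

# A term that is equally likely in relevant and non-relevant documents
# gets a weight of zero, which is why non-query terms drop out of the sum.
print(term_weight(0.3, 0.3))
```

Keeping the weights in a dictionary separates the estimation of the probabilities from the ranking step, so a different estimate can be swapped in without changing `score`.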
If we have no other information about the relevant set, we could make the
additional assumptions that p_i is a constant and that s_i can be estimated from
the term occurrences in the whole collection. We make
the second assumption based on the fact that the number of relevant documents is
much smaller than the total number of documents in the collection. With a value
of 0.5 for p_i in the scoring function described earlier, this gives a term weight for
term i of