Given this table, the obvious estimates for p_i and s_i would be p_i = r_i/R (the
number of relevant documents that contain term i divided by the total number of
relevant documents) and s_i = (n_i − r_i)/(N − R) (the number of non-relevant
documents that contain term i divided by the total number of non-relevant documents).
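
The two ratio estimates just given can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name ratio_estimates and all of the counts are hypothetical, chosen to mirror the contingency table symbols r_i, n_i, R, and N.

```python
def ratio_estimates(r_i, n_i, R, N):
    """Return (p_i, s_i) as plain ratios of contingency table counts.

    r_i: relevant documents containing term i
    n_i: all documents containing term i
    R:   total relevant documents
    N:   total documents in the collection
    """
    if R <= 0 or N - R <= 0:
        raise ValueError("need at least one relevant and one non-relevant document")
    p_i = r_i / R                # fraction of relevant documents with the term
    s_i = (n_i - r_i) / (N - R)  # fraction of non-relevant documents with the term
    return p_i, s_i


# Hypothetical counts, for illustration only.
p_i, s_i = ratio_estimates(r_i=3, n_i=20, R=10, N=1000)
print(p_i, s_i)

# A term that appears in no relevant document gets p_i equal to exactly 0.
p_zero, _ = ratio_estimates(r_i=0, n_i=5, R=10, N=1000)
print(p_zero)
```
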
Using these estimates could cause a problem, however, if some of the
entries in the contingency table were zeros. If r_i were zero, for example, the term
weight would be log 0. To avoid this, a standard solution is to add 0.5 to each
count (and 1 to the totals), which gives us estimates of p_i = (r_i + 0.5)/(R + 1)
and s_i = (n_i − r_i + 0.5)/(N − R + 1). Putting these estimates into the scoring
function gives us: