5.3.2 Counts
Remember that our abstract model of ranking considers each document to be
composed of features. With an inverted index, each word in the index corresponds to a document feature. This feature data can be processed by a ranking
function into a document score. In an inverted index that contains only document information, the features are binary, meaning they are 1 if the document
contains a term, 0 otherwise. This is important information, but it is too coarse
to find the best few documents when there are a lot of possible matches.
For instance, consider the query “tropical fish”. Three sentences match this
query: S1, S2, and S3. The data in the document-based index (Figure 5.3) gives
us no reason to prefer any of these sentences over any other.
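
The tie can be seen in a minimal sketch of a document-only index. The three toy documents below are hypothetical stand-ins, not the sentences indexed in Figure 5.3, and the names build_document_index and binary_score are illustrative assumptions rather than anything defined in the text. The score simply sums the binary features of the query terms.

```python
from collections import defaultdict

# Hypothetical toy documents; NOT the sentences from Figure 5.3.
docs = {
    1: "tropical fish include fish found in tropical environments",
    2: "fishkeepers often use the term tropical fish",
    3: "tropical fish are popular aquarium fish",
}


def build_document_index(docs):
    """Map each word to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def binary_score(index, query, doc_id):
    """Sum the binary features: 1 for each query term present in the document."""
    return sum(1 for term in query.split() if doc_id in index.get(term, []))


index = build_document_index(docs)
scores = {doc_id: binary_score(index, "tropical fish", doc_id) for doc_id in docs}
print(scores)
```

Every document that contains both query terms receives the same score here, so this index alone gives no basis for ordering the matches.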
Now look at the index in Figure 5.4. This index looks similar to the previous
one. We still have the same words and the same number of postings, and the first
number in each posting is the same as in the previous index. However, each posting now has a second number. This second number is the number of times the
word appears in the document. This small amount of additional data allows us to
prefer S2 over S1 and S3 for the query “tropical fish”, since S2 contains
“tropical” twice and “fish” three times.
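
As a rough illustration of postings that carry counts, the sketch below stores a (document, count) pair in each posting and scores a document by summing the counts of the query terms. The toy documents are hypothetical, chosen only so that document 2 contains “tropical” twice and “fish” three times. They are not the sentences of Figure 5.4, and summing raw counts is an assumed stand-in for a real ranking function.

```python
from collections import Counter, defaultdict

# Hypothetical toy documents; NOT the sentences from Figure 5.4.
docs = {
    1: "tropical fish live in warm water",
    2: "tropical fish are fish from tropical waters and many fish are colorful",
    3: "some fish are sold as tropical pets",
}


def build_count_index(docs):
    """Map each word to a list of (doc_id, count) postings, ordered by doc_id."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        counts = Counter(docs[doc_id].lower().split())
        for word, count in counts.items():
            index[word].append((doc_id, count))
    return dict(index)


def count_score(index, query):
    """Score each document by summing the counts of the query terms (an assumption)."""
    scores = defaultdict(int)
    for term in query.split():
        for doc_id, count in index.get(term, []):
            scores[doc_id] += count
    return dict(scores)


index = build_count_index(docs)
scores = count_score(index, "tropical fish")
best = max(scores, key=scores.get)
print(index["tropical"], index["fish"], scores, best)
```

The first number in each posting is still the document identifier, so the count adds only one small integer per posting while making it possible to break the tie among matching documents.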
In this example, it may not be obvious that S2 is much better than S1 or S3,
but in general, word counts can be a powerful predictor of document relevance. In
particular, word counts can help distinguish documents that are about a particular