The task of the document statistics component is simply to gather and record
statistical information about words, features, and documents. This information
is used by the ranking component to compute scores for documents. The types
of data generally required are the counts of index term occurrences (both words
and more complex features) in individual documents, the positions in the documents
where the index terms occur, the counts of occurrences over groups
of documents (such as all documents labeled “sports” or the entire collection of
documents), and the lengths of documents in terms of the number of tokens. The
actual data required is determined by the retrieval model and associated ranking
algorithm. The document statistics are stored in lookup tables, which are data
structures designed for fast retrieval.
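
As a rough illustration of the statistics listed above, the following Python sketch keeps each kind of count in its own in-memory lookup table. The class name `DocumentStatistics`, its attribute names, and the `add_document` method are hypothetical and chosen only for this example. The sketch also assumes that documents arrive already tokenized and that index terms are single words; a real component would store whatever its retrieval model requires, typically in more compact on-disk structures.

```python
from collections import Counter, defaultdict


class DocumentStatistics:
    """Hypothetical in-memory lookup tables for document statistics."""

    def __init__(self):
        # doc_id -> term -> number of occurrences in that document
        self.term_counts = defaultdict(Counter)
        # doc_id -> term -> list of token positions where the term occurs
        self.positions = defaultdict(lambda: defaultdict(list))
        # group label (e.g. "sports") -> term -> occurrences in that group
        self.group_counts = defaultdict(Counter)
        # term -> occurrences over the entire collection
        self.collection_counts = Counter()
        # doc_id -> document length in tokens
        self.doc_lengths = {}

    def add_document(self, doc_id, tokens, labels=()):
        """Record statistics for one tokenized document."""
        self.doc_lengths[doc_id] = len(tokens)
        for position, term in enumerate(tokens):
            self.term_counts[doc_id][term] += 1
            self.positions[doc_id][term].append(position)
            self.collection_counts[term] += 1
            for label in labels:
                self.group_counts[label][term] += 1


# Example usage with two made-up documents.
stats = DocumentStatistics()
stats.add_document("d1", ["tropical", "fish", "tank", "fish"], labels=["pets"])
stats.add_document("d2", ["fish", "score", "sports"], labels=["sports"])

print(stats.term_counts["d1"]["fish"])        # count in one document
print(stats.positions["d1"]["fish"])          # positions in that document
print(stats.group_counts["sports"]["fish"])   # count over a labeled group
print(stats.collection_counts["fish"])        # count over the collection
print(stats.doc_lengths["d1"])                # length in tokens
```

Hash-based dictionaries are used here because they give constant-time lookups on average, which matches the goal of fast retrieval by the ranking component. Which of these tables are actually needed depends on the retrieval model in use.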