where tf_{ik} is the term frequency weight of term k in document D_i, and f_{ik} is
the number of occurrences of term k in the document. In the vector space model,
normalization is part of the cosine measure. A document collection can contain
documents of many different lengths. Although normalization accounts for this
to some degree, long documents can have many terms occurring once and others
occurring hundreds of times. Retrieval experiments have shown that to reduce the
impact of these frequent terms, it is effective to use the logarithm of the number
of term occurrences in tf weights rather than the raw count.
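The damping effect of the logarithm can be illustrated with a minimal sketch. The specific form used here, 1 + ln(f) for f > 0 and 0 otherwise, is an assumption chosen for illustration (it is one common variant, not necessarily the exact weight intended in the text), and the function names and example counts are hypothetical.

```python
import math


def raw_tf(count: int) -> float:
    """Term frequency weight using the raw occurrence count."""
    return float(count)


def log_tf(count: int) -> float:
    """Log-scaled term frequency weight.

    Assumed form: 1 + ln(count) when the term occurs, 0 when it does not.
    The "+ 1" keeps a term that occurs once from getting a weight of zero.
    """
    if count <= 0:
        return 0.0
    return 1.0 + math.log(count)


if __name__ == "__main__":
    # Hypothetical counts from one long document: one term appears once,
    # another appears hundreds of times.
    counts = {"rare_term": 1, "frequent_term": 300}
    for term, f in counts.items():
        print(f"{term}: raw={raw_tf(f):.1f}, log-scaled={log_tf(f):.2f}")
```

With raw counts the frequent term outweighs the rare one by a factor of 300, whereas with the log-scaled weight the factor is below 7, so a single very frequent term no longer dominates the document's score.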
The inverse document frequency component (idf) reflects the importance of
the term in the collection of documents. The more documents that a term occurs
in, the less discriminating the term is between documents and, consequently, the
less useful it will be in retrieval. The typical form of this weight is