One of the basic requirements for evaluation is that the results from different
techniques can be compared. To do this comparison fairly and to ensure that experiments
are repeatable, the experimental settings and data used must be fixed.
Starting with the earliest large-scale evaluations of search performance in the
1960s, generally referred to as the Cranfield experiments (Cleverdon, 1970), researchers
assembled test collections consisting of documents, queries, and relevance
judgments to address this requirement.
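As a concrete illustration (not part of the original Cranfield experiments), a test collection can be viewed as three related sets of data: the documents, the queries, and the relevance judgments that link them. The sketch below, in Python with hypothetical names, shows one minimal way such a collection might be represented:

```python
from dataclasses import dataclass, field

# Minimal sketch of a test collection. The class and method names here are
# illustrative assumptions, not taken from any particular IR toolkit.

@dataclass
class TestCollection:
    documents: dict[str, str] = field(default_factory=dict)    # doc_id -> text
    queries: dict[str, str] = field(default_factory=dict)      # query_id -> text
    # (query_id, doc_id) -> relevance grade (here 1 = relevant, 0 = not relevant)
    judgments: dict[tuple[str, str], int] = field(default_factory=dict)

    def relevant_docs(self, query_id: str) -> set[str]:
        """Return the ids of documents judged relevant for a query."""
        return {doc_id for (qid, doc_id), rel in self.judgments.items()
                if qid == query_id and rel > 0}


collection = TestCollection()
collection.documents["d1"] = "aerodynamic properties of slender bodies"
collection.queries["q1"] = "slender body aerodynamics"
collection.judgments[("q1", "d1")] = 1

print(collection.relevant_docs("q1"))   # {'d1'}
```

Because the documents, queries, and judgments are fixed, any two ranking techniques evaluated against the same collection can be compared directly, and the experiment can be repeated later with identical inputs.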