When test collections are built, it is common for one person, or a small number of people, to make relevance assessments of the information objects returned for each topic. These assessments form the gold standard on which evaluation measures, such as mean average precision and discounted cumulated gain, are computed. It is well known that such assessments are not always generalizable; that is, different people often make different assessments of the same information objects for the same topic. However, this is generally accepted as one of the limitations of test collection-based evaluation.
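To make the dependence on the assessors concrete, the following sketch shows how a single assessor's judgements of a ranked list might be turned into scores; the function names, the simplified average-precision definition (normalised by the relevant documents retrieved), and the toy judgement lists are illustrative assumptions, not part of any particular test collection's procedure.

```python
# Illustrative sketch: gold-standard relevance judgements feeding into
# evaluation measures. Definitions follow the standard formulations of
# average precision and discounted cumulated gain (DCG); the normalisation
# of AP here is a simplification (relevant documents retrieved, not all
# relevant documents in the collection).
import math

def average_precision(ranked_rels):
    """Average precision for one topic, given binary judgements of a ranked list."""
    hits, precisions = 0, []
    for i, rel in enumerate(ranked_rels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / hits if hits else 0.0

def dcg(ranked_gains, k=None):
    """Discounted cumulated gain over graded judgements of a ranked list."""
    gains = ranked_gains[:k] if k is not None else ranked_gains
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

# Hypothetical judgements: two assessors judging the same five ranked documents.
assessor_a = [1, 0, 1, 1, 0]
assessor_b = [1, 0, 0, 1, 0]
print(average_precision(assessor_a), average_precision(assessor_b))  # ~0.81 vs 0.75
print(dcg(assessor_a), dcg(assessor_b))
```

Running the sketch on the two hypothetical assessors illustrates the point of the paragraph above: the same ranking receives different scores depending on whose judgements are treated as the gold standard.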