Reusable test collections consist of:
Collection of documents
  Should be “representative”
  Things to consider: size, sources, genre, topics, …
Sample of information needs
  Should be “randomized” and “representative”
  Usually formalized topic statements
Known relevance judgments
  Assessed by humans, for each topic-document pair (topic, not query!)
  Binary judgments make evaluation easier
Measure of effectiveness
  Usually a numeric score for quantifying “performance”
  Used to compare different systems
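
As a rough illustration of how these four parts fit together, the sketch below models a tiny test collection in Python and scores two systems with precision at k, averaged over topics. Everything in it is an assumption made for illustration: the names ReusableCollection, precision_at_k and mean_precision_at_k, the toy documents, topics and judgments, the choice of precision at k as the measure, and the convention that unjudged topic-document pairs count as non-relevant.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ReusableCollection:
    """Hypothetical container for the parts of a test collection."""
    documents: dict[str, str]  # doc_id -> document text
    topics: dict[str, str]     # topic_id -> topic statement
    # Binary human judgments, keyed by (topic_id, doc_id).
    qrels: dict[tuple[str, str], bool] = field(default_factory=dict)

    def is_relevant(self, topic_id: str, doc_id: str) -> bool:
        # Assumption: a pair nobody judged is treated as non-relevant.
        return self.qrels.get((topic_id, doc_id), False)


def precision_at_k(coll: ReusableCollection, topic_id: str,
                   ranking: list[str], k: int) -> float:
    """Fraction of the top-k ranked documents judged relevant for the topic."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(coll.is_relevant(topic_id, doc_id) for doc_id in ranking[:k])
    return hits / k


def mean_precision_at_k(coll: ReusableCollection,
                        run: dict[str, list[str]], k: int) -> float:
    """Average precision-at-k over all topics; a missing topic scores 0."""
    scores = [precision_at_k(coll, t, run.get(t, []), k) for t in coll.topics]
    return sum(scores) / len(scores)


# Toy data, invented for this example.
collection = ReusableCollection(
    documents={"d1": "...", "d2": "...", "d3": "...", "d4": "..."},
    topics={"t1": "first topic statement", "t2": "second topic statement"},
    qrels={
        ("t1", "d1"): True, ("t1", "d2"): False, ("t1", "d3"): True,
        ("t2", "d2"): True, ("t2", "d4"): False,
    },
)

# One ranked result list per topic for each of two hypothetical systems.
run_a = {"t1": ["d1", "d3", "d2"], "t2": ["d2", "d4", "d1"]}
run_b = {"t1": ["d2", "d1", "d3"], "t2": ["d4", "d1", "d2"]}

score_a = mean_precision_at_k(collection, run_a, k=2)
score_b = mean_precision_at_k(collection, run_b, k=2)
print(f"System A: {score_a:.2f}  System B: {score_b:.2f}")
```

Because the judgments are attached to topics rather than to the query strings a system happens to issue, the same qrels can score any number of systems, which is what makes the collection reusable. Binary judgments keep the measure down to simple counting.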