A. Text Retrieval Conference (TREC)
The U.S. National Institute of Standards and Technology
(NIST) has run TREC, a large IR test bed evaluation series,
since 1992. Here we use trec_eval.9.0 and trec_eval.8.1
as sample datasets.
B. Reuters-21578
The collection is distributed as a gzipped tar archive
(8.2 MB; 28.0 MB uncompressed). The UCI KDD archive
also has an entry for the collection, including a copy. Various
researchers have prepared data files useful for work with
Reuters-21578.
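Because the collection ships as a gzipped tar archive, a first step in working with it is usually to inspect the archive contents. The following is a minimal sketch using Python's standard tarfile module. The helper names (list_archive_members, build_demo_archive) and the member file names are hypothetical and chosen for illustration only; the demo archive is built in memory so the sketch runs without downloading anything, and it does not reflect the actual layout of the Reuters-21578 distribution.

```python
import io
import tarfile


def list_archive_members(fileobj):
    """Return the names of the regular files in a gzipped tar archive."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        return [m.name for m in archive.getmembers() if m.isfile()]


def build_demo_archive():
    """Build a tiny in-memory .tar.gz (hypothetical file names and content)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, text in [("demo-000.sgm", "placeholder"), ("README.txt", "demo")]:
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)  # size must be set before adding the member
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)  # rewind so the archive can be read from the start
    return buffer


names = list_archive_members(build_demo_archive())
print(names)
```

For a downloaded copy, the same function could be applied to a file opened in binary mode, e.g. open(path, "rb"), where path is wherever the archive was saved locally.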
C. UCI Knowledge Discovery in Databases