Collection statistics for Reuters-RCV1. Values are rounded for the computations
in this book. The unrounded values are: 806,791 documents, 222 tokens
per document, 391,523 (distinct) terms, 6.04 bytes per token with spaces and punctuation,
4.5 bytes per token without spaces and punctuation, 7.5 bytes per term, and
96,969,056 tokens. The numbers in this table correspond to the third line (“case folding”)
in Table 5.1 (page 87).
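As a rough illustration of how these averages combine with the counts, the following minimal sketch multiplies the unrounded token and term counts by the corresponding average sizes. The helper names (`total_bytes`, `to_megabytes`) and the convention of 1 MB = 10^6 bytes are assumptions made for this example, not part of the original table.

```python
# Unrounded Reuters-RCV1 statistics, copied from the caption above.
TOKENS = 96_969_056
TERMS = 391_523
BYTES_PER_TOKEN_WITH_PUNCT = 6.04  # including spaces and punctuation
BYTES_PER_TOKEN_NO_PUNCT = 4.5     # excluding spaces and punctuation
BYTES_PER_TERM = 7.5


def total_bytes(count, avg_bytes):
    """Estimated total size: number of items times average bytes per item."""
    return count * avg_bytes


def to_megabytes(n_bytes):
    """Convert bytes to megabytes (assumption: 1 MB = 10^6 bytes)."""
    return n_bytes / 1_000_000


text_with_punct = total_bytes(TOKENS, BYTES_PER_TOKEN_WITH_PUNCT)
text_no_punct = total_bytes(TOKENS, BYTES_PER_TOKEN_NO_PUNCT)
all_terms = total_bytes(TERMS, BYTES_PER_TERM)

print(f"tokens incl. spaces/punctuation: {to_megabytes(text_with_punct):.1f} MB")
print(f"tokens excl. spaces/punctuation: {to_megabytes(text_no_punct):.1f} MB")
print(f"distinct terms:                  {to_megabytes(all_terms):.1f} MB")
```

Under these assumptions the token text comes to roughly 586 MB with spaces and punctuation and roughly 436 MB without, while the distinct terms occupy about 2.9 MB.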