Reuters-RCV1 has 100 million tokens. Collecting all termID–docID pairs of
the collection using 4 bytes each for termID and docID therefore requires 0.8
GB of storage. Typical collections today are often one or two orders of magnitude
larger than Reuters-RCV1. It is easy to see how such collections
overwhelm even large computers if we try to sort their termID–docID pairs
in memory. If the size of the intermediate files during index construction is
within a small factor of available memory, then the compression techniques
introduced in Chapter 5 can help; however, the postings file of many large
collections cannot fit into memory even after compression.
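
The storage estimate above can be checked with a short calculation. The following is only a minimal sketch: the function name `pair_storage_bytes` and its parameters are hypothetical, and it assumes one termID–docID pair per token and decimal gigabytes (1 GB = 10^9 bytes), which matches the 0.8 GB figure in the text.

```python
def pair_storage_bytes(num_tokens, termid_bytes=4, docid_bytes=4):
    """Bytes needed to hold one (termID, docID) pair per token.

    Hypothetical helper for illustration; assumes fixed-width IDs
    and no compression.
    """
    return num_tokens * (termid_bytes + docid_bytes)


RCV1_TOKENS = 100_000_000  # token count of Reuters-RCV1, as stated in the text

# Reuters-RCV1 itself, then collections 10x and 100x larger
# ("one or two orders of magnitude").
for scale in (1, 10, 100):
    size_gb = pair_storage_bytes(RCV1_TOKENS * scale) / 1e9
    print(f"{scale:>3}x Reuters-RCV1: {size_gb:g} GB")
```

Under these assumptions, a collection two orders of magnitude larger than Reuters-RCV1 would need about 80 GB for its uncompressed termID–docID pairs alone, before counting any working space needed for sorting.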