We use Reuters-RCV1, a collection with roughly 1 GB of text, as our model
collection in this chapter. It consists of about 800,000
documents that were sent over the Reuters newswire during a 1-year period
between August 20, 1996, and August 19, 1997. A typical document is
shown in Figure 4.1, but note that we ignore multimedia information like
images in this book and are only concerned with text. Reuters-RCV1 covers
a wide range of international topics, including politics, business, sports, and
(as in this example) science. Some key statistics of the collection are shown
in Table 4.2.