4.2 Blocked sort-based indexing 69
Block sizes of 8, 16, 32, and 64 kilobytes (KB) are common. We call the part
of main memory where a block being read or written BUFFER is stored a buffer.
• Data transfers fromdisk tomemory are handled by the systembus, not by
the processor. This means that the processor is available to process data
during disk I/O. We can exploit this fact to speed up data transfers by
storing compressed data on disk. Assuming an efficient decompression
algorithm, the total time of reading and then decompressing compressed
data is usually less than reading uncompressed data.
• Servers used in IR systems typically have several gigabytes (GB) of main
memory, sometimes tens of GB. Available disk space is several orders of
magnitude larger.
4.2 Blocked sort-based indexing
The basic steps in constructing a nonpositional index are depicted in Figure
1.4 (page 8). We first make a pass through the collection assembling all
term–docID pairs. We then sort the pairs with the term as the dominant key
and docID as the secondary key. Finally, we organize the docIDs for each
term into a postings list and compute statistics like term and document frequency.
For small collections, all this can be done in memory. In this chapter,
we describe methods for large collections that require the use of secondary
storage.
To make index construction more efficient, we represent terms as termIDs
TERMID (instead of strings as we did in Figure 1.4), where each termID is a unique
serial number. We can build the mapping from terms to termIDs on the fly
while we are processing the collection; or, in a two-pass approach, we compile
the vocabulary in the first pass and construct the inverted index in the
second pass. The index construction algorithms described in this chapter all
do a single pass through the data. Section 4.7 gives references to multipass
algorithms that are preferable in certain applications, for example,when disk
space is scarce.
REUTERS-RCV1 We work with the Reuters-RCV1 collection as our model collection in this
chapter, a collection with roughly 1 GB of text. It consists of about 800,000
documents that were sent over the Reuters newswire during a 1-year period
between August 20, 1996, and August 19, 1997. A typical document is
shown in Figure 4.1, but note that we ignore multimedia information like
images in this book and are only concerned with text. Reuters-RCV1 covers
a wide range of international topics, including politics, business, sports, and
(as in this example) science. Some key statistics of the collection are shown
in Table 4.2.