list belongs to, so the termIDs of postings need not be stored. As a result, the
blocks that individual calls of SPIMI-INVERT can process are much larger
and the index construction process as a whole is more efficient.
Because we do not know how large the postings list of a term will be when
we first encounter it, we allocate space for a short postings list initially and
double the space each time it is full (lines 8–9). This means that some memory
is wasted, which counteracts the memory savings from the omission of
termIDs in intermediate data structures. However, the overall memory requirements
for the dynamically constructed index of a block in SPIMI are
still lower than in BSBI.
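The following Python sketch makes the in-memory part of one SPIMI-INVERT call concrete. The names spimi_invert_block, token_stream, and max_postings are illustrative choices, not part of Figure 4.4; max_postings simply stands in for the "free memory available" test, and a Python list already grows geometrically on append, so the explicit doubling of lines 8–9 happens implicitly here.

```python
def spimi_invert_block(token_stream, max_postings):
    """One call of SPIMI-INVERT: index a single block in memory (sketch only).

    `token_stream` yields (term, docID) pairs; `max_postings` stands in for
    the "free memory available" test of Figure 4.4.
    """
    dictionary = {}                 # term -> postings list of docIDs
    total = 0
    for term, doc_id in token_stream:
        if term not in dictionary:
            dictionary[term] = []   # allocate a short postings list on first encounter
        postings = dictionary[term]
        # Figure 4.4 doubles the capacity of a full postings list (lines 8-9);
        # a Python list resizes itself geometrically on append, so that
        # doubling is implicit here.
        postings.append(doc_id)
        total += 1
        if total >= max_postings:   # memory "exhausted": caller writes the block to disk
            break
    return dictionary
```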
When memory has been exhausted, we write the index of the block (which
consists of the dictionary and the postings lists) to disk (line 12). We have to
sort the terms (line 11) before doing this because we want to write postings
lists in lexicographic order to facilitate the final merging step. If each block’s
postings lists were written in unsorted order, merging blocks could not be
accomplished by a simple linear scan through each block.
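To make the write-out step concrete, here is a hedged sketch that sorts the terms of one block and writes its postings lists to disk. The tab-separated text format and the name write_block are illustrative assumptions; a real implementation would use a compact binary encoding.

```python
def write_block(dictionary, path):
    """Write one block's index to disk with terms in lexicographic order (sketch)."""
    with open(path, "w", encoding="utf-8") as f:
        for term in sorted(dictionary):                  # sort the terms (line 11)
            doc_ids = ",".join(map(str, dictionary[term]))
            f.write(f"{term}\t{doc_ids}\n")              # write the block (line 12)
```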
Each call of SPIMI-INVERT writes a block to disk, just as in BSBI. The last
step of SPIMI (corresponding to line 7 in Figure 4.2; not shown in Figure 4.4)
is then to merge the blocks into the final inverted index.
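Because every block lists its terms in sorted order, this final merge can be done in a single linear pass over all blocks. The sketch below illustrates the idea with Python's heapq.merge; it assumes the tab-separated block format of the previous sketch and that block_paths are supplied in increasing docID order, so concatenating the postings of a shared term keeps each final postings list sorted.

```python
import heapq
from contextlib import ExitStack

def merge_blocks(block_paths, out_path):
    """Merge sorted block files into the final inverted index with one linear scan."""
    with ExitStack() as stack, open(out_path, "w", encoding="utf-8") as out:
        files = [stack.enter_context(open(p, encoding="utf-8")) for p in block_paths]
        # Each block lists its terms in sorted order, so a k-way merge reads
        # every block exactly once, front to back.
        merged = heapq.merge(*files, key=lambda line: line.split("\t", 1)[0])
        current_term, current_postings = None, []
        for line in merged:
            term, postings = line.rstrip("\n").split("\t")
            if term != current_term:
                if current_term is not None:
                    out.write(f"{current_term}\t{','.join(current_postings)}\n")
                current_term, current_postings = term, []
            current_postings.append(postings)
        if current_term is not None:
            out.write(f"{current_term}\t{','.join(current_postings)}\n")
```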
In addition to constructing a new dictionary structure for each block and
eliminating the expensive sorting step, SPIMI has a third important component:
compression. Both the postings and the dictionary terms can be stored
compactly on disk if we employ compression. Compression further increases
the efficiency of the algorithm because we can process even larger blocks,
and because the individual blocks require less space on disk. We refer readers
to the literature for this aspect of the algorithm (Section 4.7).
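As a flavour of what such compression can look like, here is a sketch of one common scheme for postings lists, not necessarily the one intended here: gap encoding followed by variable-byte codes. Storing the gaps between successive docIDs keeps the numbers small, and variable-byte codes then spend only as many bytes per gap as needed. The function names are illustrative.

```python
def vb_encode_number(n):
    """Variable-byte encode one non-negative integer."""
    digits = []
    while True:
        digits.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    digits[-1] += 128                     # set the continuation bit on the last byte
    return bytes(digits)

def compress_postings(doc_ids):
    """Gap-encode a sorted postings list, then variable-byte encode the gaps (sketch)."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        out += vb_encode_number(doc_id - previous)    # store the gap, not the docID
        previous = doc_id
    return bytes(out)
```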