The algorithm parses documents into termID–docID pairs and accumulates
the pairs in memory until a block of a fixed size is full (PARSENEXTBLOCK
in Figure 4.2). We choose the block size to fit comfortably into memory to
permit a fast in-memory sort. The block is then inverted and written to disk.
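As a minimal Python sketch of this outer loop, under the assumption that parsing has already been reduced to a stream of (termID, docID) pairs; invert_block, write_block, and merge_blocks are hypothetical callables standing in for the steps described next, not an exact rendering of Figure 4.2:

```python
from itertools import islice

def bsbi_index_construction(pair_stream, block_size,
                            invert_block, write_block, merge_blocks):
    """Driver loop: repeatedly fill an in-memory block of (termID, docID)
    pairs, invert it, write it to disk, then merge all block files.
    invert_block, write_block, and merge_blocks are injected callables."""
    block_files = []
    n = 0
    while True:
        # Accumulate the next block of pairs in memory.
        block = list(islice(pair_stream, block_size))
        if not block:
            break
        n += 1
        block_files.append(write_block(invert_block(block), f"block{n}.bin"))
    # Final step: merge the per-block indexes into one index.
    return merge_blocks(block_files)
```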
Inversion involves two steps. First, we sort the termID–docID pairs. Next,
we collect all termID–docID pairs with the same termID into a postings list,
where a posting is simply a docID. The result, an inverted index for the block
we have just read, is then written to disk. Applying this to Reuters-RCV1 and
assuming we can fit 10 million termID–docID pairs into memory, we end up
with ten blocks, each an inverted index of one part of the collection.
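A minimal sketch of the inversion step, assuming a block is simply a Python list of (termID, docID) tuples and leaving the on-disk encoding out:

```python
from itertools import groupby

def invert_block(pairs):
    """Sort (termID, docID) pairs and collect docIDs with the same
    termID into a postings list; returns (termID, postings) entries
    in increasing termID order."""
    pairs.sort()  # step 1: sort by termID, breaking ties by docID
    # step 2: group consecutive pairs that share a termID
    return [(term_id, [doc_id for _, doc_id in group])
            for term_id, group in groupby(pairs, key=lambda p: p[0])]
```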
In the final step, the algorithm simultaneously merges the ten blocks into
one large merged index. An example with two blocks is shown in Figure 4.3,
where we use d_i to denote the i-th document of the collection. To do the merging,
we open all block files simultaneously, and maintain small read buffers
for the ten blocks we are reading and a write buffer for the final merged index
we are writing. In each iteration, we select the lowest termID that has
not been processed yet using a priority queue or a similar data structure. All
postings lists for this termID are read and merged, and the merged list is
written back to disk. Each read buffer is refilled from its file when necessary.
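The merge can be sketched as a heap-driven k-way merge. The version below assumes each block has already been wrapped in an iterator that yields its (termID, postings) entries in termID order, so the buffered file I/O is left out (unlike the driver sketch above, it takes iterators rather than file names):

```python
import heapq
from itertools import groupby

def merge_blocks(block_iterators):
    """k-way merge of per-block inverted indexes. Each input iterator
    yields (termID, postings) entries sorted by termID."""
    # heapq.merge keeps a small heap over the current head of each
    # iterator, so the lowest unprocessed termID is always emitted next.
    merged = heapq.merge(*block_iterators, key=lambda entry: entry[0])
    for term_id, entries in groupby(merged, key=lambda entry: entry[0]):
        # Concatenate this termID's postings lists from all blocks.
        postings = sorted(doc_id for _, plist in entries for doc_id in plist)
        yield term_id, postings
```

In a full implementation each input iterator would be backed by a small read buffer over its block file, and the merged entries would be flushed through a write buffer rather than materialized, as described above.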
How expensive is BSBI? Its time complexity is Θ(T log T) because the step
with the highest time complexity is sorting and T is an upper bound for the
number of items we must sort (i.e., the number of termID–docID pairs).