not be enough to make use of more than a handful of CPU cores. Handling large
collections will require less reliance on memory and improved parallelism.
5.6.2 Merging
The classic way to solve the memory problem in the previous example is by merging. We can build the inverted list structure I until memory runs out. When that
happens, we write the partial index I to disk, then start making a new one. At the
end of this process, the disk is filled with many partial indexes, I_1, I_2, I_3, ..., I_n.
The system then merges these files into a single result.
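The build-and-flush loop described above might be sketched as follows. This is a minimal illustration, not the book's implementation: the function name build_partial_indexes, the max_postings budget (a stand-in for "memory runs out"), the whitespace tokenizer, and the JSON-lines file layout are all assumptions made for the example. Each partial index is written with its terms in sorted order, which is what makes a low-memory merge possible later.

```python
import json
import os
from collections import defaultdict


def build_partial_indexes(docs, max_postings, out_dir):
    """Index (doc_id, text) pairs, writing a partial index to disk whenever
    the number of in-memory postings reaches max_postings.

    max_postings is a hypothetical stand-in for running out of memory.
    Returns the paths of the partial index files, in the order written.
    """
    paths = []
    index = defaultdict(list)  # term -> list of doc ids
    count = 0                  # postings currently held in memory

    def flush():
        nonlocal index, count
        if not index:
            return
        path = os.path.join(out_dir, f"partial_{len(paths)}.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            # One [term, doc_ids] record per line, sorted by term.
            for term in sorted(index):
                f.write(json.dumps([term, index[term]]) + "\n")
        paths.append(path)
        index = defaultdict(list)  # start a fresh partial index
        count = 0

    for doc_id, text in docs:
        # Naive tokenizer (assumption): lowercase, split on whitespace.
        for term in set(text.lower().split()):
            index[term].append(doc_id)
            count += 1
        if count >= max_postings:
            flush()
    flush()  # write whatever is left at the end
    return paths
```

A real indexer would track actual memory use rather than a posting count, and would use a compact binary format instead of JSON.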
By definition, it is not possible to hold even two of the partial index files in
memory at one time, so the input files need to be carefully designed so that they
can be merged in small pieces. One way to do this is to store the partial indexes in
alphabetical order. It is then possible for a merge algorithm to merge the partial
indexes using very little memory.
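As a rough sketch of such a merge, the code below reads term-sorted partial index files one record at a time and writes a single merged file. The file layout (one JSON [term, doc_ids] record per line, sorted by term) and the names read_postings and merge_partial_indexes are assumptions for illustration. It uses Python's heapq.merge, which pulls lazily from each input, so only one record per file plus the posting list currently being assembled is held in memory.

```python
import heapq
import json


def read_postings(path):
    """Yield (term, doc_ids) pairs from one partial index file, line by line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            term, doc_ids = json.loads(line)
            yield term, doc_ids


def merge_partial_indexes(paths, out_path):
    """Merge term-sorted partial index files into one term-sorted file.

    Assumes every input file is sorted by term. Posting lists for the same
    term are concatenated in the order the files are given.
    """
    streams = [read_postings(p) for p in paths]
    # heapq.merge is lazy and stable: for equal terms, records from earlier
    # files come out first.
    merged = heapq.merge(*streams, key=lambda pair: pair[0])
    with open(out_path, "w", encoding="utf-8") as out:
        current_term, current_ids = None, []
        for term, doc_ids in merged:
            if term != current_term:
                if current_term is not None:
                    out.write(json.dumps([current_term, current_ids]) + "\n")
                current_term, current_ids = term, []
            current_ids.extend(doc_ids)
        if current_term is not None:
            out.write(json.dumps([current_term, current_ids]) + "\n")
```

This sketch still assumes that the merged posting list for any single term fits in memory; a production system would stream postings to the output as well.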
Figure 5.9 shows an example of this kind of merging procedure. Even though
this figure shows only two indexes, it is possible to merge many at once. The algorithm is essentially the same as the standard merge sort algorithm. Since both I_1
and I_2 are sorted, at least one of them points to the next piece of data necessary
to write to I. The data from the two files is interleaved to produce a sorted result