partitioning the keys into j term partitions and having the parsers write keyvalue
pairs for each term partition into a separate segment file. In Figure 4.5,
the term partitions are according to first letter: a–f, g–p, q–z, and j = 3. (We
chose these key ranges for ease of exposition. In general, key ranges need not
correspond to contiguous terms or termIDs.) The term partitions are defined
by the person who operates the indexing system (Exercise 4.10). The parsers
then write corresponding segment files, one for each term partition. Each
term partition thus corresponds to r segments files, where r is the number
of parsers. For instance, Figure 4.5 shows three a–f segment files of the a–f
partition, corresponding to the three parsers shown in the figure.
Collecting all values (here: docIDs) for a given key (here: termID) into one
INVERTER list is the task of the inverters in the reduce phase. The master assigns each
term partition to a different inverter – and, as in the case of parsers, reassigns
term partitions in case of failing or slow inverters. Each term partition
(corresponding to r segment files, one on each parser) is processed by one inverter.
We assume here that segment files are of a size that a single machine
can handle (Exercise 4.9). Finally, the list of values is sorted for each key and
written to the final sorted postings list (“postings” in the figure). (Note that
postings in Figure 4.6 include term frequencies, whereas each posting in the
other sections of this chapter is simply a docID without term frequency information.)
The data flow is shown for a–f in Figure 4.5. This completes the
construction of the inverted index.