Parsers and inverters are not separate sets of machines. The master identifies
idle machines and assigns tasks to them. The same machine can be a
parser in the map phase and an inverter in the reduce phase. And there are
often other jobs that run in parallel with index construction, so in between
being a parser and an inverter a machine might do some crawling or another
unrelated task.
Tominimizewrite times before inverters reduce the data, each parserwrites
its segment files to its local disk. In the reduce phase, the master communicates
to an inverter the locations of the relevant segment files (e.g., of the r
segment files of the a–f partition). Each segment file only requires one sequential
read because all data relevant to a particular inverter were written
to a single segment file by the parser. This setup minimizes the amount of
network traffic needed during indexing.
Figure 4.6 shows the general schema of the MapReduce functions. Input
and output are often lists of key-value pairs themselves, so that several
MapReduce jobs can run in sequence. In fact, this was the design of the
Google indexing system in 2004. What we describe in this section corresponds
to only one of five to ten MapReduce operations in that indexing
system. Another MapReduce operation transforms the term-partitioned index
we just created into a document-partitioned one.
MapReduce offers a robust and conceptually simple framework for implementing
index construction in a distributed environment. By providing a
semiautomatic method for splitting index construction into smaller tasks, it
can scale to almost arbitrarily large collections, given computer clusters of sufficient size.