In general, MapReduce breaks a large computing problem into smaller
parts by recasting it in terms of manipulation of key-value pairs. For indexing,
a key-value pair has the form (termID,docID). In distributed indexing,
the mapping from terms to termIDs is also distributed and therefore more
complex than in single-machine indexing. A simple solution is to maintain
a (perhaps precomputed) mapping for frequent terms that is copied to all
nodes and to use terms directly (instead of termIDs) for infrequent terms.
We do not address this problem here and assume that all nodes share a consistent
term → termID mapping.
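
The simple solution mentioned above (termIDs for frequent terms, the term itself for infrequent ones) can be sketched as follows. This is only an illustration under stated assumptions: the table `FREQUENT_TERM_IDS`, its contents, and the helpers `term_key` and `to_pairs` are hypothetical names that do not appear in the text.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical precomputed table for frequent terms. In the scheme described
# above, an identical copy of this table would be shipped to every node.
FREQUENT_TERM_IDS: Dict[str, int] = {"the": 0, "of": 1, "caesar": 2}

# A key is a termID (int) for frequent terms or the raw term (str) otherwise.
Key = Union[int, str]


def term_key(term: str, frequent: Dict[str, int] = FREQUENT_TERM_IDS) -> Key:
    """Return the shared termID for a frequent term, else the term itself."""
    return frequent.get(term, term)


def to_pairs(doc_id: int, tokens: List[str]) -> List[Tuple[Key, int]]:
    """Turn one tokenized document into (key, docID) pairs."""
    return [(term_key(token), doc_id) for token in tokens]
```

Because infrequent terms are used directly as keys, no node has to coordinate with any other to assign them an ID; only the frequent-term table must be kept consistent across nodes.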
The map phase of MapReduce consists of mapping splits of the input data
to key-value pairs. This is the same parsing task we encountered in BSBI
and SPIMI, and we therefore call the machines that execute the map phase
parsers. Each parser writes its output to local intermediate files, the segment
files (shown as a-f, g-p, q-z in Figure 4.5).
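
As a rough sketch of what a parser does in the map phase, the code below turns one split into key-value pairs and groups them by term partition. It makes several simplifying assumptions that are not in the text: keys are terms rather than termIDs, segment files are stood in for by in-memory lists, tokenization is a bare regular expression, and terms that do not start with a letter go to a catch-all partition named `other`. The names `PARTITIONS`, `partition_of`, and `parse_split` are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Term partitions matching the a-f / g-p / q-z example; ranges are inclusive.
PARTITIONS: List[Tuple[str, str]] = [("a", "f"), ("g", "p"), ("q", "z")]


def partition_of(term: str) -> str:
    """Return the name of the partition whose letter range covers the term."""
    first = term[0]
    for low, high in PARTITIONS:
        if low <= first <= high:
            return f"{low}-{high}"
    return "other"  # assumption: catch-all for digits and other characters


def parse_split(
    split: Iterable[Tuple[int, str]],
) -> Dict[str, List[Tuple[str, int]]]:
    """Map one split of (docID, text) records to (term, docID) pairs,
    grouped by partition. Each list stands in for one local segment file."""
    segments: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for doc_id, text in split:
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            segments[partition_of(term)].append((term, doc_id))
    return dict(segments)
```

For example, `parse_split([(1, "Caesar came"), (2, "Caesar died in Rome")])` places the pairs for `caesar`, `came`, and `died` in the `a-f` segment, `in` in `g-p`, and `rome` in `q-z`.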
For the reduce phase, we want all values for a given key to be stored close
together, so that they can be read and processed quickly. This is achieved by