partitioned index (which can be easily generated from a term-partitioned
index). We discuss this topic further in Section 20.3 (page 454).
The distributed index construction method we describe in this section is an
application of MapReduce, a general architecture for distributed computing.
MapReduce is designed for large computer clusters. The point of a cluster is
to solve large computing problems on cheap commodity machines or nodes
that are built from standard parts (processor, memory, disk) as opposed to on
a supercomputer with specialized hardware. Although hundreds or thousands
of machines are available in such clusters, individual machines can
fail at any time. One requirement for robust distributed indexing is, therefore,
that we divide the work up into chunks that we can easily assign and
– in case of failure – reassign. A master node directs the process of assigning
and reassigning tasks to individual worker nodes.
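The master's bookkeeping can be sketched as follows. This is a hypothetical minimal model, not the actual MapReduce implementation: it only tracks which chunk each worker holds, so that a failed worker's chunk goes back into the pending queue for reassignment.

```python
from collections import deque

class Master:
    """Toy sketch of a master node: hands out chunks of work to idle
    workers and reassigns a chunk if its worker fails."""

    def __init__(self, chunks):
        self.pending = deque(chunks)   # chunks not yet assigned
        self.in_progress = {}          # worker -> chunk it is processing
        self.done = []                 # successfully completed chunks

    def assign(self, worker):
        """Give the next pending chunk to an idle worker, if any remain."""
        if not self.pending:
            return None
        chunk = self.pending.popleft()
        self.in_progress[worker] = chunk
        return chunk

    def complete(self, worker):
        """Worker reports that it finished its chunk."""
        self.done.append(self.in_progress.pop(worker))

    def fail(self, worker):
        """Worker died or stalled: return its chunk for reassignment."""
        self.pending.appendleft(self.in_progress.pop(worker))
```

For example, if worker `w1` dies while holding a chunk, `fail("w1")` puts that chunk back at the front of the queue and the next idle worker picks it up.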
The map and reduce phases of MapReduce split up the computing job
into chunks that standard machines can process in a short time. The various
steps of MapReduce are shown in Figure 4.5 and an example on a collection
consisting of two documents is shown in Figure 4.6. First, the input data,
in our case a collection of web pages, are split into n splits where the size of
the split is chosen to ensure that the work can be distributed evenly (chunks
should not be too large) and efficiently (the total number of chunks we need
to manage should not be too large); 16 or 64 MB are good sizes in distributed
indexing. Splits are not preassigned to machines, but are instead assigned
by the master node on an ongoing basis: As a machine finishes processing
one split, it is assigned the next one. If a machine dies or becomes a laggard
due to hardware problems, the split it is working on is simply reassigned to
another machine.
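The splitting step itself can be sketched as grouping consecutive documents until a target byte size is reached. The function name and the 16 MB default are illustrative assumptions, not part of any particular MapReduce implementation:

```python
# Sketch: group a stream of documents into consecutive splits of roughly
# split_size bytes each, so the master can hand them out evenly.
SPLIT_SIZE = 16 * 1024 * 1024  # 16 MB, one of the sizes suggested above

def make_splits(docs, split_size=SPLIT_SIZE):
    """Cut an iterable of document strings into splits of about
    split_size bytes; a split is closed once it reaches the target."""
    splits, current, current_bytes = [], [], 0
    for doc in docs:
        current.append(doc)
        current_bytes += len(doc)
        if current_bytes >= split_size:
            splits.append(current)
            current, current_bytes = [], 0
    if current:  # flush the final, possibly undersized split
        splits.append(current)
    return splits
```

Note that splits are built from consecutive documents and only their total size matters; which machine processes a given split is decided later, by the master, as workers become idle.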