MapReduce was introduced by Dean et al. in 2004 [8].
A detailed understanding of how MapReduce works is not a
prerequisite for this paper. In short,
MapReduce processes data distributed (and replicated) across
many nodes in a shared-nothing cluster via three basic operations.
First, a set of Map tasks is processed in parallel by each node in
the cluster without communicating with other nodes. Next, data is
repartitioned across all nodes of the cluster. Finally, a set of Reduce
tasks is executed in parallel by each node on the partition it
receives. This can be followed by an arbitrary number of additional
Map-repartition-Reduce cycles as necessary. MapReduce does not
create a detailed query execution plan that specifies which nodes
will run which tasks in advance; instead, this is determined at
runtime. This allows MapReduce to adjust to node failures and
slow nodes on the fly by assigning more tasks to faster nodes and
reassigning tasks from failed nodes. MapReduce also checkpoints
the output of each Map task to local disk to minimize the amount
of work that must be redone after a failure.
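To make this data flow concrete, the following is a minimal, single-process sketch of one Map-repartition-Reduce cycle, using word count as the example. The function names (map_task, repartition, reduce_task) and the hash-based partitioning scheme are illustrative assumptions, not the interface defined in [8].

```python
# Illustrative sketch of one Map-repartition-Reduce cycle (word count).
# In a real deployment each task runs on a separate node and the
# repartition step moves data across the network.
from collections import defaultdict

def map_task(document):
    """Map: emit (word, 1) pairs from one input split; Map tasks do not
    communicate with one another."""
    for word in document.split():
        yield (word, 1)

def repartition(pairs, num_partitions):
    """Repartition (shuffle): hash-partition intermediate pairs by key so
    that all values for a given key land in the same partition."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key, value in pairs:
        partitions[hash(key) % num_partitions][key].append(value)
    return partitions

def reduce_task(partition):
    """Reduce: aggregate all values for each key in one partition."""
    return {key: sum(values) for key, values in partition.items()}

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]
intermediate = [pair for doc in documents for pair in map_task(doc)]
results = [reduce_task(p) for p in repartition(intermediate, num_partitions=2)]
print(results)  # e.g., [{'the': 3, 'quick': 2, ...}, {...}]
```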
Of the desired properties of large scale data analysis workloads,