MapReduce best meets the fault tolerance and ability to operate in
heterogeneous environment properties. It achieves fault tolerance
by detecting and reassigning Map tasks of failed nodes to other
nodes in the cluster (preferably nodes with replicas of the input Map
data). It achieves the ability to operate in a heterogeneous environment
via redundant task execution. Tasks that are taking a long time
to complete on slow nodes get redundantly executed on other nodes
that have completed their assigned tasks. The time to complete the
task becomes equal to the time for the fastest node to complete the
redundantly executed task. By breaking tasks into small, granular
tasks, the effect of faults and “straggler” nodes can be minimized.
MapReduce has a flexible query interface; Map and Reduce functions
are just arbitrary computations written in a general-purpose
language. Therefore, it is possible for each task to do anything on
its input, just as long as its output follows the conventions defined
by the model. In general, most MapReduce-based systems (such as
Hadoop, which directly implements the systems-level details of the
MapReduce paper) do not accept declarative SQL. However, there
are some exceptions (such as Hive).
As shown in previous work, the biggest issue with MapReduce
is performance [23]. By not requiring the user to first model and
load data before processing, many of the performance enhancing
tools listed above that are used by database systems are not possible.
Traditional business data analytical processing, that have standard
reports and many repeated queries, is particularly, poorly suited for
the one-time query processing model of MapReduce.
Ideally, the fault tolerance and ability to operate in heterogeneous
environment properties of MapReduce could be combined with the
performance of parallel databases systems. In the following sections,
we will describe our attempt to build such a hybrid system.