MapReduce best meets the fault tolerance and ability to operate in
heterogeneous environment properties. It achieves fault tolerance
by detecting and reassigning Map tasks of failed nodes to other
nodes in the cluster (preferably nodes with replicas of the input Map
data). It achieves the ability to operate in a heterogeneous environment
via redundant task execution. Tasks that are taking a long time
to complete on slow nodes get redundantly executed on other nodes
that have completed their assigned tasks. The time to complete the
task becomes equal to the time for the fastest node to complete the
redundantly executed task. By breaking tasks into small, granular
tasks, the effect of faults and “straggler” nodes can be minimized.
MapReduce has a flexible query interface; Map and Reduce functions
are just arbitrary computations written in a general-purpose
language. Therefore, it is possible for each task to do anything on
its input, just as long as its output follows the conventions defined
by the model. In general, most MapReduce-based systems (such as
Hadoop, which directly implements the systems-level details of the
MapReduce paper) do not accept declarative SQL. However, there
are some exceptions (such as Hive).
As shown in previous work, the biggest issue with MapReduce
is performance [23]. By not requiring the user to first model and
load data before processing, many of the performance enhancing
tools listed above that are used by database systems are not possible.
Traditional business data analytical processing, that have standard
reports and many repeated queries, is particularly, poorly suited for
the one-time query processing model of MapReduce.
Ideally, the fault tolerance and ability to operate in heterogeneous
environment properties of MapReduce could be combined with the
performance of parallel databases systems. In the following sections,
we will describe our attempt to build such a hybrid system.
MapReduce best meets the fault tolerance and ability to operate inheterogeneous environment properties. It achieves fault toleranceby detecting and reassigning Map tasks of failed nodes to othernodes in the cluster (preferably nodes with replicas of the input Mapdata). It achieves the ability to operate in a heterogeneous environmentvia redundant task execution. Tasks that are taking a long timeto complete on slow nodes get redundantly executed on other nodesthat have completed their assigned tasks. The time to complete thetask becomes equal to the time for the fastest node to complete theredundantly executed task. By breaking tasks into small, granulartasks, the effect of faults and “straggler” nodes can be minimized.MapReduce has a flexible query interface; Map and Reduce functionsare just arbitrary computations written in a general-purposelanguage. Therefore, it is possible for each task to do anything onits input, just as long as its output follows the conventions definedby the model. In general, most MapReduce-based systems (such asHadoop, which directly implements the systems-level details of theMapReduce paper) do not accept declarative SQL. However, thereare some exceptions (such as Hive).As shown in previous work, the biggest issue with MapReduceis performance [23]. By not requiring the user to first model andload data before processing, many of the performance enhancingtools listed above that are used by database systems are not possible.Traditional business data analytical processing, that have standardreports and many repeated queries, is particularly, poorly suited forthe one-time query processing model of MapReduce.Ideally, the fault tolerance and ability to operate in heterogeneousenvironment properties of MapReduce could be combined with theperformance of parallel databases systems. In the following sections,we will describe our attempt to build such a hybrid system.
การแปล กรุณารอสักครู่..