Given a query, Shark uses the Hive query compiler to parse the
query and generate an abstract syntax tree. The tree is then turned
into a logical plan and basic logical optimization, such as predicate
pushdown, is applied. Up to this point, Shark and Hive share
an identical approach. Hive would then convert the operator tree into a
physical plan consisting of multiple MapReduce stages. In the case
of Shark, its optimizer applies additional rule-based optimizations,
such as pushing LIMIT down to individual partitions, and creates
a physical plan consisting of transformations on RDDs rather than
MapReduce jobs. We use a variety of operators already present in
Spark, such as map and reduce, as well as new operators we implemented
for Shark, such as broadcast joins. Spark’s master then executes
this graph using standard MapReduce scheduling techniques,
such as placing tasks close to their input data, rerunning lost tasks,
and performing straggler mitigation [39].
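To make the planning step concrete, the sketch below (in Scala, Spark's implementation language) shows how a query such as SELECT key FROM t WHERE val > 10 LIMIT 5 might be expressed as RDD transformations. The object and method names are hypothetical and do not correspond to Shark's actual operator classes, but the per-partition LIMIT mirrors the pushdown rule described above.

    import org.apache.spark.rdd.RDD

    // Hypothetical sketch of an RDD-based physical plan for:
    //   SELECT key FROM t WHERE val > 10 LIMIT 5
    object RddPlanSketch {
      def plan(table: RDD[(String, Int)]): Array[String] = {
        val filtered  = table.filter { case (_, v) => v > 10 } // predicate applied early
        val projected = filtered.map { case (k, _) => k }      // column projection
        // LIMIT pushed down to individual partitions: each partition emits at
        // most 5 rows, so full partitions are never materialized for the limit.
        val limited = projected.mapPartitions(_.take(5))
        limited.take(5)                                        // global LIMIT at the driver
      }
    }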
