Given a query, Shark uses the Hive query compiler to parse the
query and generate an abstract syntax tree. The tree is then turned
into a logical plan and basic logical optimizations, such as predicate
pushdown, are applied. Up to this point, Shark and Hive share
an identical approach. Hive would then convert the operator tree into a
physical plan consisting of multiple MapReduce stages. In the case
of Shark, its optimizer applies additional rule-based optimizations,
such as pushing LIMIT down to individual partitions, and creates
a physical plan consisting of transformations on RDDs rather than
MapReduce jobs. We use a variety of operators already present in
Spark, such as map and reduce, as well as new operators we implemented
for Shark, such as broadcast joins. Spark’s master then executes
this graph using standard MapReduce scheduling techniques,
such as placing tasks close to their input data, rerunning lost tasks,
and performing straggler mitigation [39].
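As an illustration, the following sketch shows roughly how a simple aggregation query might look if hand-translated to RDD transformations; the query, the Row case class, and the method names here are hypothetical stand-ins, not Shark's actual operator classes.

    // Sketch: SELECT key, COUNT(*) FROM t WHERE v > 10 GROUP BY key LIMIT 10
    // hand-written against the Spark RDD API (illustrative, not Shark's planner output)
    import org.apache.spark.rdd.RDD

    case class Row(key: String, v: Int)

    def plan(table: RDD[Row]): Array[(String, Long)] =
      table
        .filter(_.v > 10)            // predicate pushdown: filter before aggregating
        .map(r => (r.key, 1L))       // map stage: emit (group key, partial count)
        .reduceByKey(_ + _)          // shuffle + reduce: sum counts per key
        .mapPartitions(_.take(10))   // LIMIT pushed down to individual partitions
        .take(10)                    // final LIMIT applied at the master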
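Similarly, a broadcast join avoids a shuffle by replicating the smaller relation to every worker and joining via local hash lookups. The sketch below assumes the small side fits in memory; the names are illustrative rather than Shark's API.

    // Sketch of a broadcast (map-side) join on RDDs
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def broadcastJoin(sc: SparkContext,
                      large: RDD[(Int, String)],
                      small: Seq[(Int, String)]): RDD[(Int, (String, String))] = {
      val smallSide = sc.broadcast(small.toMap)  // ship the small table to each worker once
      large.flatMap { case (k, v) =>
        smallSide.value.get(k).map(w => (k, (v, w)))  // local hash lookup; no shuffle
      }
    }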