phases of a query plan. The Hive optimizer is a simple, na¨ıve, rulebased
optimizer. It does not use cost-based optimization techniques.
Therefore, it does not always generate efficient query plans. This is
another advantage of pushing as much as possible of the query processing
logic into DBMSs that have more sophisticated, adaptive or
cost-based optimizers.
(5) Finally, the physical plan generator converts the logical query
plan into a physical plan executable by one or more MapReduce
jobs. The first and every other Reduce Sink operator marks a transition
from a Map phase to a Reduce phase of a MapReduce job and
the remaining Reduce Sink operators mark the start of new MapReduce
jobs. The above SQL query results in a single MapReduce
job with the physical query plan illustrated in Fig. 2(a). The boxes
stand for the operators and the arrows represent the flow of data.
(6) Each DAG enclosed within a MapReduce job is serialized
into an XML plan. The Hive driver then executes a Hadoop job.
The job reads the XML plan and creates all the necessary operator
objects that scan data from a table in HDFS, and parse and process
one tuple at a time.
The SMS planner modifies Hive. In particular we intercept the
normal Hive flow in two main areas:
(i) Before any query execution, we update the MetaStore with
references to our database tables. Hive allows tables to exist externally,
outside HDFS. The HadoopDB catalog, Section 5.2.2, provides
information about the table schemas and required Deserializer
and InputFormat classes to the MetaStore. We implemented
these specialized classes.
(ii) After the physical query plan generation and before the execution
of the MapReduce jobs, we perform two passes over the
physical plan. In the first pass, we retrieve data fields that are actually
processed by the plan and we determine the partitioning keys
used by the Reduce Sink (Repartition) operators. In the second
pass, we traverse the DAG bottom-up from table scan operators to
the output or File Sink operator. All operators until the first repartition
operator with a partitioning key different from the database’s
key are converted into one or more SQL queries and pushed into
the database layer. SMS uses a rule-based SQL generator to recreate
SQL from the relational operators. The query processing logic
that could be pushed into the database layer ranges from none (each