table is scanned independently and tuples are pushed one at a time
into the DAG of operators) to all (only a Map task is required to
output the results into an HDFS file).
Given the above GroupBy query, SMS produces one of two different
plans. If the sales table is partitioned by YEAR(saleDate),
it produces the query plan in Fig. 2(b): this plan pushes the entire
query processing logic into the database layer. Only a Map task
is required to output results into an HDFS file. Otherwise, SMS
produces the query plan in Fig. 2(c) in which the database layer
partially aggregates data and eliminates the selection and group-by
operator used in the Map phase of the Hive generated query plan
(Fig. 2(a)). The final aggregation step in the Reduce phase of the
MapReduce job, however, is still required in order to merge partial
results from each node.
For join queries, Hive assumes that tables are not collocated.
Therefore, the Hive generated plan scans each table independently
and computes the join after repartitioning data by the join key. In
contrast, if the join key matches the database partitioning key, SMS
pushes the entire join sub-tree into the database layer.
So far, we only support filter, select (project) and aggregation
operators. Currently, the partitioning features supported by Hive
are extremely na¨ıve and do not support expression-based partitioning.
Therefore, we cannot detect if the sales table is partitioned
by YEAR(saleDate) or not, therefore we have to make the pessimistic
assumption that the data is not partitioned by this attribute.
The Hive build [15] we extended is a little buggy; as explained in
Section 6.2.5, it fails to execute the join task used in our benchmark,
even when running over HDFS tables3. However, we use the
SMS planner to automatically push SQL queries into HadoopDB’s
DBMS layer for all other benchmark queries presented in our experiments
for this paper.
5.3 Summary
HadoopDB does not replace Hadoop. Both systems coexist enabling
the analyst to choose the appropriate tools for a given dataset
and task. Through the performance benchmarks in the following
sections, we show that using an efficient database storage layer cuts
down on data processing time especially on tasks that require complex
query processing over structured data such as joins. We also
show that HadoopDB is able to take advantage of the fault-tolerance
and the ability to run on heterogeneous environments that comes
naturally with Hadoop-style systems.
6. BENCHMARKS
In this section we evaluate HadoopDB, comparing it with a
MapReduce implementation and two parallel database implementations,
using a benchmark first presented in [23]4. This
benchmark consists of five tasks. The first task is taken directly
from the original MapReduce paper [8] whose authors claim is
representative of common MR tasks. The next four tasks are
analytical queries designed to be representative of traditional
structured data analysis workloads that HadoopDB targets.
We ran our experiments on Amazon EC2 “large” instances (zone:
us-east-1b). Each instance has 7.5 GB memory, 4 EC2 Compute
Units (2 virtual cores), 850 GB instance storage (2 x 420 GB plus
10 GB root partition) and runs 64-bit platform Linux Fedora 8 OS.