In this section, we describe the design of HadoopDB. The goal of
this design is to achieve all of the properties described in Section 3.
The basic idea behind HadoopDB is to connect multiple
single-node database systems using Hadoop as the task coordinator
and network communication layer. Queries are parallelized across
nodes using the MapReduce framework; however, as much of the
single-node query work as possible is pushed inside of the corresponding
node databases. HadoopDB achieves fault tolerance and
the ability to operate in heterogeneous environments by inheriting
the scheduling and job tracking implementation from Hadoop, yet
it achieves the performance of parallel databases by doing much of
the query processing inside of the database engine.
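
To make the division of labor concrete, the following is a minimal sketch of the idea, not HadoopDB's actual implementation. It assumes in-memory SQLite databases as stand-ins for the single-node database systems and a plain sequential loop as a stand-in for Hadoop's task scheduling. The table `sales`, its columns, and the functions `make_node_db`, `map_task`, `reduce_task`, and `run_query` are hypothetical names chosen for illustration. Each map task pushes the filter and partial aggregation into its node's database as SQL, and the reduce step only merges the small partial results.

```python
import sqlite3
from collections import defaultdict


def make_node_db(rows):
    """Create one in-memory database holding a single partition of the data.

    Assumption: SQLite stands in for a real single-node database system.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def map_task(conn):
    """Map phase: push filtering and partial aggregation into the node database."""
    sql = (
        "SELECT region, SUM(amount) FROM sales "
        "WHERE amount > 0 GROUP BY region"
    )
    return conn.execute(sql).fetchall()


def reduce_task(partials):
    """Reduce phase: merge the per-node partial sums into global totals."""
    totals = defaultdict(int)
    for region, subtotal in partials:
        totals[region] += subtotal
    return dict(totals)


def run_query(nodes):
    """Run one map task per node, then a single reduce over all partial results.

    Assumption: this sequential loop replaces the scheduling, job tracking,
    and network communication that Hadoop would provide.
    """
    partials = []
    for conn in nodes:
        partials.extend(map_task(conn))
    return reduce_task(partials)


# Hypothetical data, split across three "nodes".
partitions = [
    [("east", 10), ("west", 5)],
    [("east", 7), ("north", 3)],
    [("west", 1), ("north", -4)],
]
nodes = [make_node_db(p) for p in partitions]
result = run_query(nodes)
print(result)
```

In this sketch only one row per region per node crosses the node boundary, rather than every raw tuple, which illustrates why pushing the single-node work into the database engine reduces the work left for the MapReduce layer.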