HadoopDB extends the Hadoop framework (see Fig. 1) by providing
the following four components:
5.2.1 Database Connector
The Database Connector is the interface between TaskTrackers and the
independent database systems residing on nodes in the cluster.
It extends Hadoop’s InputFormat class and is part of the InputFormat
Implementations library. Each MapReduce job supplies the
Connector with an SQL query and connection parameters: the JDBC
driver to use, the query fetch size, and other query tuning
parameters. The Connector connects to the database, executes the
SQL query, and returns the results as key-value pairs. The Connector
could theoretically connect to any JDBC-compliant database that
resides in the cluster. However, different databases require different
read query optimizations. We implemented connectors for MySQL
and PostgreSQL. In the future we plan to integrate other databases
including open-source column-store databases such as MonetDB
and InfoBright. By extending Hadoop’s InputFormat, we integrate
seamlessly with Hadoop’s MapReduce framework. To the framework,
the databases are data sources similar to data blocks in HDFS.
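The Connector’s role can be pictured with a toy sketch in which the JDBC ResultSet and Hadoop’s RecordReader machinery are replaced by plain Java types (ConnectorSketch, Row, and toKeyValuePairs are invented names for illustration; the real Connector extends Hadoop’s InputFormat and reads rows from a live database over JDBC):

```java
import java.util.*;

// Toy sketch of the Connector's record-reader loop. A real HadoopDB
// Connector would execute an SQL query over JDBC and hand each fetched
// row to the MapReduce framework as a key-value pair; here the fetched
// rows are stubbed with a plain Java record.
public class ConnectorSketch {
    // Stand-in for one row fetched from the database.
    record Row(long id, String value) {}

    // Converts fetched rows into (record number, row text) pairs,
    // mirroring how a RecordReader hands records to map tasks.
    static List<Map.Entry<Long, String>> toKeyValuePairs(List<Row> resultSet) {
        List<Map.Entry<Long, String>> pairs = new ArrayList<>();
        long recordNum = 0;
        for (Row row : resultSet) {
            pairs.add(Map.entry(recordNum++, row.id() + "\t" + row.value()));
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(new Row(1, "alice"), new Row(2, "bob"));
        for (Map.Entry<Long, String> kv : toKeyValuePairs(rows)) {
            System.out.println(kv.getKey() + " -> " + kv.getValue());
        }
    }
}
```

To the framework this loop is indistinguishable from one reading data blocks out of HDFS, which is what lets database-backed nodes plug into unmodified MapReduce jobs.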
5.2.2 Catalog
The catalog maintains metainformation about the databases. This
includes: (i) connection parameters such as database location,
driver class, and credentials, and (ii) metadata such as the data
sets contained in the cluster, replica locations, and data partitioning
properties.
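For illustration only, a catalog entry covering both kinds of information might look like the following sketch (all element and attribute names here are invented; the paper does not specify the catalog schema):

```xml
<catalog>
  <node host="node1.cluster">
    <!-- (i) connection parameters -->
    <connection driver="org.postgresql.Driver"
                url="jdbc:postgresql://node1.cluster:5432/hadoopdb"
                user="hdb" password="***"/>
    <!-- (ii) metadata: data sets, replica locations, partitioning -->
    <dataset name="visits" partitionedBy="sourceIP"
             replicas="node2.cluster,node3.cluster"/>
  </node>
</catalog>
```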
The current implementation of the HadoopDB catalog stores its
metainformation as an XML file in HDFS. This file is accessed by