We observed that disk I/O performance on EC2 nodes was initially
quite slow (25MB/s). Consequently, we initialized some additional
space on each node so that intermediate files and the output of
the tasks did not suffer from this initial write slow-down. Once disk
space is initialized, subsequent writes are much faster (86MB/s).
Network speed is approximately 100-110MB/s. We execute each
task three times and report the average of the trials. The final results
from all parallel database queries are piped from the shell command
into a file. Hadoop and HadoopDB store results in Hadoop’s
distributed file system (HDFS). In this section, we report results only
from trials in which all nodes are available, operating correctly,
and running no concurrent tasks during benchmark execution (we drop
these requirements in Section 7). For each task, we benchmark performance
on cluster sizes of 10, 50, and 100 nodes.
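The text above does not say how the additional disk space was initialized. A minimal sketch of one plausible approach, written in Java and entirely hypothetical (the DiskWarmup class and warm method are our own illustration, not part of the benchmark code), is to write and sync a large file of zeros on each data partition so that later task output does not pay the first-write penalty:

import java.io.FileOutputStream;
import java.io.IOException;

public class DiskWarmup {
    // Write `bytes` bytes of zeros to `path` and force them to disk,
    // so that the underlying EC2 volume blocks are touched once before
    // the benchmark writes intermediate files and task output to them.
    public static void warm(String path, long bytes) throws IOException {
        byte[] block = new byte[4 * 1024 * 1024]; // 4MB buffer of zeros
        FileOutputStream out = new FileOutputStream(path);
        try {
            long written = 0;
            while (written < bytes) {
                out.write(block);
                written += block.length;
            }
            out.getFD().sync(); // flush the data to the physical device
        } finally {
            out.close();
        }
    }
}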
6.1 Benchmarked Systems
Our experiments compare the performance of Hadoop, HadoopDB
(with PostgreSQL as the underlying database), and two commercial
parallel DBMSs.
6.1.1 Hadoop
Hadoop is an open-source version of the MapReduce framework,
implemented by directly following the ideas described in the original
MapReduce paper, and is used today by dozens of businesses
to perform data analysis [1]. For our experiments in this paper, we
use Hadoop version 0.19.1 running on Java 1.6.0. We deployed the
system with several changes to the default configuration settings.
Data in HDFS was stored using 256MB data blocks instead of the default
64MB. Each MR executor ran with a maximum heap size of
1024MB. We allowed two Map instances and a single Reduce instance
to execute concurrently on each node. We also allowed more
buffer space for file read/write operations (132MB) and increased
the sort buffer to 200MB with 100 concurrent streams for merging.
Additionally, we set both the number of parallel transfers run by
Reduce during the shuffle phase and the number of worker threads
for each TaskTracker's HTTP server to 50. These adjustments
follow the guidelines on high-performance Hadoop clusters [13].
Moreover, we enabled task JVMs to be reused.
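For concreteness, the following sketch shows how settings of this kind could be applied through Hadoop's Configuration API. It is our own reconstruction, not the authors' configuration; the property names are the ones commonly used in the Hadoop 0.19 line (in practice these values would typically live in hadoop-site.xml) and should be read as assumptions rather than a record of the exact keys the experiments used:

import org.apache.hadoop.conf.Configuration;

public class BenchmarkHadoopConf {
    public static Configuration create() {
        Configuration conf = new Configuration();
        conf.setLong("dfs.block.size", 256L * 1024 * 1024);        // 256MB HDFS blocks instead of 64MB
        conf.set("mapred.child.java.opts", "-Xmx1024m");           // 1024MB maximum heap per task
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 2);    // two concurrent Map instances per node
        conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 1); // a single Reduce instance per node
        conf.setInt("io.file.buffer.size", 132 * 1024 * 1024);     // larger file read/write buffers (132MB)
        conf.setInt("io.sort.mb", 200);                            // 200MB sort buffer
        conf.setInt("io.sort.factor", 100);                        // 100 concurrent streams for merging
        conf.setInt("mapred.reduce.parallel.copies", 50);          // parallel transfers during the shuffle
        conf.setInt("tasktracker.http.threads", 50);               // worker threads per TaskTracker HTTP server
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);         // reuse task JVMs across tasks
        return conf;
    }
}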
For each benchmark trial, we stored all input and output data in
HDFS with no replication (we add replication in Section 7). After
benchmarking a particular cluster size, we deleted the data directories
on each node, then reformatted and reloaded HDFS to ensure
uniform data distribution across all nodes.
We present results of both hand-coded Hadoop and Hive-coded
Hadoop (i.e., Hadoop plans generated automatically via Hive's SQL
interface). These separate results for Hadoop are displayed as split
bars in the graphs. The bottom, colored segment of each bar represents
the time taken by hand-coded Hadoop; the rest of the
bar indicates the additional overhead of Hive's automatic
plan generation, as well as operator function calls and dynamic
data-type resolution through Java's Reflection API for each tuple
processed in Hive-coded jobs.
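To illustrate the per-tuple cost just described, the sketch below contrasts a direct, statically typed accessor call (roughly what a hand-coded job does) with a reflective lookup and invocation (an approximation of the dynamic resolution in Hive-generated operators). The Tuple class and its getPageRank accessor are hypothetical and serve only to make the difference concrete:

import java.lang.reflect.Method;

public class ReflectionOverheadSketch {
    public static class Tuple {
        private final long pageRank;
        public Tuple(long pageRank) { this.pageRank = pageRank; }
        public long getPageRank() { return pageRank; }
    }

    public static void main(String[] args) throws Exception {
        Tuple t = new Tuple(42L);

        // Hand-coded path: one direct, statically resolved call per tuple.
        long direct = t.getPageRank();

        // Hive-style path (approximated): the accessor is resolved and
        // invoked reflectively, and the primitive result is boxed,
        // for every tuple processed.
        Method accessor = Tuple.class.getMethod("getPageRank");
        long reflective = ((Long) accessor.invoke(t)).longValue();

        System.out.println(direct == reflective); // same value, higher per-tuple cost on the reflective path
    }
}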
6.1.2 HadoopDB
The Hadoop part of HadoopDB was configured identically to the
description above except for the number of concurrent Map tasks,
which we set to one. Additionally, on each worker node, PostgreSQL
version 8.2.5 was installed. We increased memory used by
the PostgreSQL shared buffers to 512MB and the working memory