The Hadoop system is the most popular open-source implementation of the MapReduce framework, under development
by Yahoo! and the Apache Software Foundation [1]. Unlike the
Google implementation of the original MR framework written in
C++, the core Hadoop system is written entirely in Java. For our
experiments in this paper, we use Hadoop version 0.19.0 running
on Java 1.6.0. We deployed the system with the default configuration settings, except for the following changes that we found yielded
better performance without diverging from core MR fundamentals:
(1) data is stored using 256MB data blocks instead of the default
64MB, (2) each task executor JVM ran with a maximum heap size
of 512MB and the DataNode/JobTracker JVMs ran with a maximum heap size of a 1024MB (for a total size of 3.5GB per node),
(3) we enabled Hadoop’s “rack awareness” feature for data locality
in the cluster, and (4) we allowed Hadoop to reuse the task JVM
executor instead starting a new process for each Map/Reduce task.
Moreover, we configured the system to run two Map instances and
a single Reduce instance concurrently on each node.