5.1.2 Task Start-up
We found that our MR programs took some time before all nodes
were running at full capacity. On a cluster of 100 nodes, it takes 10
seconds from the moment that a job is submitted to the JobTracker
before the first Map task begins to execute and 25 seconds until all
the nodes in the cluster are executing the job. This coincides with
the results in [8], where the data processing rate does not reach its
peak for nearly 60 seconds on a cluster of 1800 nodes. The “cold
start” nature is symptomatic to Hadoop’s (and apparently Google’s)
implementation and not inherent to the actual MR model itself. For
example, we also found that prior versions of Hadoop would create
a new JVM process for each Map and Reduce instance on a node,
which we found increased the overhead of running jobs on large
data sets; enabling the JVM reuse feature in the latest version of
Hadoop improved our results for MR by 10–15%.
In contrast, parallel DBMSs are started at OS boot time, and thus
are considered to always be “warm”, waiting for a query to execute.
Moreover, all modern DBMSs are designed to execute using multiple threads and processes, which allows the currently running code
to accept additional tasks and further optimize its execution schedule. Minimizing start-up time was one of the early optimizations of
DBMSs, and is certainly something that MR systems should be able
to incorporate without a large rewrite of the underlying architecture.