THE past decade has witnessed the explosive growth of
data for processing. Large Internet companies routinely
generate hundreds of tera-bytes of logs and operation
records. MapReduce has proven itself to be an effective tool
to process such large datasets [1]. It divides a job into multiple
small tasks and assign them to a large number of nodes
for parallel processing. Due to its remarkable simplicity and
fault tolerance, MapReduce has been widely used in various
applications, including web indexing, log analysis, data
mining, scientific simulations, machine translation, etc..
There are several parallel computing frameworks that support
MapReduce, such as Apache Hadoop [2], Google Map-
Reduce [1], and Microsoft Dryad [3], of which Hadoop is
open-source and widely used.