THE past decade has witnessed the explosive growth of data for processing. Large Internet companies routinely
generate hundreds of terabytes of logs and operation
records. MapReduce has proven itself to be an effective tool
to process such large datasets [1]. It divides a job into multiple small tasks and assigns them to a large number of nodes
for parallel processing. Due to its remarkable simplicity and
fault tolerance, MapReduce has been widely used in various
applications, including web indexing, log analysis, data
mining, scientific simulations, and machine translation.
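
As an illustration of this divide-and-assign model, the following minimal sketch counts words in a small input. It is not taken from any of the frameworks cited here: the names `map_task`, `reduce_task`, and `run_job` are hypothetical, and a local thread pool is assumed as a stand-in for the cluster nodes. Fault tolerance and data locality are not modeled.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_task(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def reduce_task(item):
    """Reduce: sum all counts collected for a single key."""
    key, values = item
    return key, sum(values)


def run_job(lines, num_splits=3, num_workers=3):
    """Run a word-count job; worker threads stand in for nodes (assumption)."""
    # Divide the job: each split becomes one independent map task.
    splits = [lines[i::num_splits] for i in range(num_splits)]

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Assign the map tasks to the workers for parallel processing.
        mapped = list(pool.map(map_task, splits))

        # Shuffle: group intermediate values by key.
        groups = defaultdict(list)
        for pairs in mapped:
            for key, value in pairs:
                groups[key].append(value)

        # Assign one reduce task per key to the workers.
        return dict(pool.map(reduce_task, groups.items()))


if __name__ == "__main__":
    print(run_job(["a b a", "b c", "a"]))
```

Each map task reads only its own split and each reduce task reads only the values of its own key, which is what allows the tasks to be placed on different nodes in a real deployment.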
There are several parallel computing frameworks that support MapReduce, such as Apache Hadoop [2], Google MapReduce [1], and Microsoft Dryad [3], of which Hadoop is
open-source and widely used