Building MapReduce applications using the Message-Passing
Interface (MPI) enables us to exploit the performance of
large HPC clusters for big data analytics. However, due
to the lack of native fault tolerance support in MPI and
the incompatibility between the MapReduce fault tolerance
model and HPC schedulers, it is difficult to provide a fault-tolerant
MapReduce runtime for HPC clusters. We propose
and develop FT-MRMPI, the first fault-tolerant MapReduce
framework on MPI for HPC clusters. We discover a unique
way to perform failure detection and recovery by exploiting
the current MPI semantics and the new user-level
failure mitigation (ULFM) proposal. We design and develop a checkpoint/restart
model for fault-tolerant MapReduce in MPI.
We further tailor the detect/resume model to conserve work
for more efficient fault tolerance. The experimental results
on a 256-node HPC cluster show that FT-MRMPI effectively
masks failures and reduces the job completion time by 39%.
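
As an illustration only (not code from FT-MRMPI itself), the following C sketch shows the style of failure detection and recovery that the ULFM proposal enables inside an MPI job: communication errors return the MPIX_ERR_PROC_FAILED error class instead of aborting, the survivors identify the failed ranks, and the communicator is shrunk so the runtime can reassign the lost work. The workload here (a single MPI_Allreduce) is a stand-in for the framework's actual task exchange.

/* Minimal sketch of ULFM-style failure detection and recovery.
 * Assumes an MPI library implementing the ULFM extensions
 * (MPIX_Comm_failure_ack, MPIX_Comm_failure_get_acked, MPIX_Comm_shrink). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm world = MPI_COMM_WORLD;
    /* Let MPI calls return error codes instead of aborting the job. */
    MPI_Comm_set_errhandler(world, MPI_ERRORS_RETURN);

    int rank;
    MPI_Comm_rank(world, &rank);

    /* A collective that returns an error if any participating rank has died. */
    int token = rank;
    int rc = MPI_Allreduce(MPI_IN_PLACE, &token, 1, MPI_INT, MPI_SUM, world);

    int eclass = MPI_SUCCESS;
    if (rc != MPI_SUCCESS)
        MPI_Error_class(rc, &eclass);

    if (eclass == MPIX_ERR_PROC_FAILED) {
        /* ULFM: acknowledge the failure and enumerate the dead ranks. */
        MPIX_Comm_failure_ack(world);
        MPI_Group failed;
        MPIX_Comm_failure_get_acked(world, &failed);

        int nfailed;
        MPI_Group_size(failed, &nfailed);
        if (rank == 0)
            printf("detected %d failed rank(s); shrinking communicator\n", nfailed);
        MPI_Group_free(&failed);

        /* Build a new communicator of survivors and continue,
         * e.g. by reassigning the failed workers' tasks. */
        MPI_Comm survivors;
        MPIX_Comm_shrink(world, &survivors);
        world = survivors;
    }

    MPI_Finalize();
    return 0;
}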