The job completion time in MapReduce depends on the
slowest-running task in the job. If one task takes significantly
longer to finish than the others (a so-called straggler), it can
delay the progress of the entire job. Stragglers can occur for
various reasons, among which data skew is an important
one. Data skew refers to the imbalance in the amount of data
assigned to each task, or the imbalance in the amount of
work required to process such data. The fundamental cause of
data skew is that datasets in the real world are often
skewed and that we do not know the distribution of the
data beforehand. Note that this problem cannot be solved
by the speculative execution strategy in MapReduce