The upper segment of each Hadoop bar in the graphs represents
the execution time of the additional MapReduce job that combines the output
into a single file. Because we ran this as a separate MapReduce job,
these segments consume a larger percentage of the overall time in Figure 4,
where the fixed start-up overhead again dominates the work
needed to perform the rest of the task. Even though the Grep task is
selective, the results in Figure 5 show how this combine phase can
still take hundreds of seconds due to the need to open and combine
many small output files. Each Map instance writes its output to
a separate HDFS file, so although each file is small, the large
number of Map tasks leaves many files on each node.
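
The two effects described above (a fixed start-up cost that dominates small workloads, and a per-file cost that grows with the number of Map outputs) can be illustrated with a toy cost model. This is only a sketch under stated assumptions: the class name, the three parameters, and the additive form of the model are hypothetical and are not taken from our measurements or from Hadoop itself.

```python
from dataclasses import dataclass


@dataclass
class CombineCostModel:
    """Toy model of the extra combine job. All parameters are
    illustrative assumptions, not measured values."""
    startup_s: float        # fixed job start-up overhead, in seconds
    per_file_open_s: float  # assumed cost of opening one small output file
    bytes_per_s: float      # assumed sequential copy throughput

    def combine_time(self, n_files: int, total_bytes: int) -> float:
        # Fixed overhead + one open per Map output file + time to copy the data.
        return (self.startup_s
                + n_files * self.per_file_open_s
                + total_bytes / self.bytes_per_s)

    def startup_fraction(self, n_files: int, total_bytes: int) -> float:
        # Share of the combine time spent on fixed start-up overhead.
        return self.startup_s / self.combine_time(n_files, total_bytes)


if __name__ == "__main__":
    # Hypothetical parameter values chosen only to exercise the model.
    model = CombineCostModel(startup_s=10.0, per_file_open_s=0.25, bytes_per_s=1e6)
    for n_files in (10, 100, 1000):
        t = model.combine_time(n_files, total_bytes=2_000_000)
        print(f"{n_files} files: {t:.1f} s")
```

In this model, holding the total output size fixed while increasing the number of files raises the combine time linearly, which mirrors the explanation that many small files, rather than the volume of data, drive the cost of the combine phase.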