Aside from the initial decision on where to schedule Map instances, an MR programmer must perform these tasks manually. For
example, suppose a user writes an MR program to process a collection of documents in two parts. First, the Map function scans the
documents and creates a histogram of frequently occurring words.
The documents are then passed to a Reduce function that groups
files by their site of origin. Using this data, the user, or another
user building on the first user’s work, now wants to find sites with
a document that contains more than five occurrences of the word
‘Google’ or the word ‘IBM’. In the naive implementation of this
query, where the Map is executed over the accumulated statistics,
the filtering is applied only after the statistics for all documents have been computed and shipped to the Reduce workers, even though only a small subset of documents satisfies the keyword filter.
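The inefficiency described above can be sketched in a toy simulation of the two-stage pipeline. This is a minimal illustration, not the paper's implementation: the corpus, function names (`map_histogram`, `reduce_group_by_site`, `map_filter`), and single-process "shuffle" are all hypothetical stand-ins for a real MR job.

```python
from collections import defaultdict

# Hypothetical toy corpus of (doc_id, site, text) records; purely illustrative.
DOCS = [
    ("d1", "siteA", "google google google google google google cloud"),
    ("d2", "siteA", "ibm mainframe history"),
    ("d3", "siteB", "weather report for tuesday"),
]

def map_histogram(doc_id, site, text):
    # First job's Map: build a per-document word histogram.
    counts = defaultdict(int)
    for word in text.lower().split():
        counts[word] += 1
    yield (site, (doc_id, dict(counts)))

def reduce_group_by_site(site, doc_stats):
    # First job's Reduce: group the document statistics by site of origin.
    yield (site, list(doc_stats))

def map_filter(site, doc_stats):
    # Naive second pass: the keyword filter runs only HERE, i.e. after the
    # statistics for every document have been computed and shipped, even
    # though most documents will fail the filter.
    for doc_id, counts in doc_stats:
        if counts.get("google", 0) > 5 or counts.get("ibm", 0) > 5:
            yield site
            break

def run_naive():
    # Simulate the shuffle between the first job's Map and Reduce phases.
    shuffled = defaultdict(list)
    for doc in DOCS:
        for site, stats in map_histogram(*doc):
            shuffled[site].append(stats)
    grouped = [kv for s, stats in shuffled.items()
               for kv in reduce_group_by_site(s, stats)]
    # Second Map executed over the accumulated statistics applies the filter.
    hits = set()
    for site, stats in grouped:
        hits.update(map_filter(site, stats))
    return sorted(hits)

print(run_naive())  # ['siteA'] on the toy corpus above
```

Pushing the keyword test into `map_histogram` (emitting a record only when the per-document count already exceeds five) would avoid materializing and shipping statistics for documents that can never satisfy the filter, which is precisely the optimization the naive plan misses.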