The credit card data example we saw in the previous section works well as
a MapReduce task. In the Mapper (Figure 5.11), each record is split into a key
(the credit card number) and a value (the money amount in the transaction). The
shuffle stage sorts the data so that the records with the same credit card number
end up next to each other. The reduce stage emits a record for each unique credit
card number, so the total number of unique credit card numbers is the number of
records emitted by the reducer (Figure 5.12).
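The pipeline described above can be sketched in a few lines of Python. This is a minimal in-memory simulation, not the actual MapReduce library; the transaction data and function names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample transactions: (credit card number, amount)
transactions = [
    ("4532-0001", 25.00),
    ("4532-0002", 40.00),
    ("4532-0001", 10.50),
    ("4532-0003", 99.99),
    ("4532-0002", 5.25),
]

def mapper(record):
    # Split each record into a key (card number) and a value (amount).
    card, amount = record
    yield card, amount

def shuffle(pairs):
    # Group values by key, mimicking the sort in the shuffle stage:
    # records with the same card number end up next to each other.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(card, amounts):
    # Emit one record per unique card number (here, with its total spend).
    yield card, sum(amounts)

mapped = [pair for rec in transactions for pair in mapper(rec)]
results = [out for key, vals in shuffle(mapped) for out in reducer(key, vals)]
# The number of unique credit card numbers is the number of emitted records.
print(len(results))  # → 3
```

Note that the mapper and reducer touch only their arguments, which is exactly the property the next paragraph relies on.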
Typically, we assume that both the Mapper and Reducer are idempotent. By
idempotent, we mean that if the Mapper or Reducer is called multiple times on
the same input, the output will always be the same. This idempotence allows the
MapReduce library to be fault tolerant. If any part of the computation fails, perhaps because of a hardware failure, the MapReduce library can simply process that part of the input again on a different machine. Even when machines
don’t fail, they can run slowly because of misconfiguration or slowly
failing parts. In this case, a machine that appears to be normal could return results much more slowly than the other machines in a cluster. To guard against this,
as the computation nears completion, the MapReduce library issues backup Mappers and Reducers that duplicate the processing done on the slowest machines.
This ensures that slow machines don’t become the bottleneck of a computation.
The idempotence of the Mapper and Reducer is what makes this possible. If the
Mapper or Reducer modified files directly, for example, multiple copies of them
could not be run simultaneously.
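The contrast can be made concrete with a small sketch. The function names below are hypothetical; the point is only that a pure Reducer can be re-executed or duplicated safely, while one with side effects cannot.

```python
# An idempotent Reducer: its output depends only on its input, so the
# MapReduce library may re-run it after a failure, or run a backup copy
# on another machine, without changing the result.
def reduce_count(card_number, amounts):
    return (card_number, len(amounts))

# A non-idempotent Reducer (illustrative): it modifies a file directly,
# so re-running it after a failure would append duplicate records, and
# two simultaneous copies would interfere with each other.
def reduce_count_unsafe(card_number, amounts, log_file):
    log_file.write(f"{card_number}\n")  # side effect on shared state
    return (card_number, len(amounts))

# Calling the idempotent version twice on the same input gives the same output.
assert reduce_count("4532-0001", [10, 20]) == reduce_count("4532-0001", [10, 20])
```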
Let’s look at the problem of indexing a corpus with MapReduce. In our simple
indexer, we will store inverted lists with word positions.
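A minimal sketch of such an indexer, again as an in-memory simulation: the Mapper emits a (word, (document, position)) pair for each word occurrence, the shuffle groups pairs by word, and the Reducer assembles each word's inverted list. The toy corpus and document numbers are invented for illustration.

```python
from collections import defaultdict

# Hypothetical toy corpus: document number -> text
corpus = {
    1: "tropical fish include fish found in tropical environments",
    2: "fish tropical",
}

def mapper(doc_id, text):
    # Emit (word, (doc_id, position)) for every word occurrence.
    for position, word in enumerate(text.split()):
        yield word, (doc_id, position)

def reducer(word, postings):
    # The inverted list for a word: sorted (doc_id, position) pairs.
    yield word, sorted(postings)

# Shuffle stage: group the emitted pairs by word.
groups = defaultdict(list)
for doc_id, text in corpus.items():
    for word, posting in mapper(doc_id, text):
        groups[word].append(posting)

index = dict(out for word in sorted(groups) for out in reducer(word, groups[word]))
print(index["tropical"])  # → [(1, 0), (1, 6), (2, 1)]
```

Because the Mapper emits one pair per occurrence rather than per word, the Reducer recovers every position, which is what allows the index to support phrase queries.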