Data placement
Before diving into the mechanics of distributed processing, consider the problems
of handling huge amounts of data on a single computer. Distributed processing
and large-scale data processing have one major aspect in common, which is that
not all of the input data is available at once. In distributed processing, the data
might be scattered among many machines. In large-scale data processing, most of
the data is on the disk. In both cases, the key to efficient data processing is placing
the data correctly.
Let’s take a simple example. Suppose you have a text file that contains data
about credit card transactions. Each line of the file contains a credit card number
and an amount of money. How might you determine the number of unique credit
card numbers in the file?
If the file is not very big, you could read each line, parse the credit card number,
and store it in a hash table. Once the entire file had been read, the hash table
would contain one entry for each unique credit card number, and counting those
entries would give you the answer. Unfortunately, for a big file, the hash table
would be too large to fit in memory.
Now suppose you had the very same credit card data, but the transactions in
the file were ordered by credit card number. Counting the number of unique
credit card numbers in this case is very simple. Each line is read and the credit
card number on it is parsed. If that number differs from the one on the previous
line, a counter is incremented. When the end of the file is reached, the counter
holds the number of unique credit card numbers in the file. No hash table is
necessary.
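A sketch of this single-pass count, under the same assumed line format, shows
how little state is needed: just the previous card number and the counter.

```python
def count_unique_cards_sorted(path):
    count = 0
    previous = None  # card number seen on the previous line
    with open(path) as f:
        for line in f:
            card_number = line.split(",", 1)[0]
            if card_number != previous:  # a new card number begins here
                count += 1
                previous = card_number
    return count
```

Because the file is sorted, all lines for a given card number are adjacent, so
comparing each line only to its predecessor is enough to detect every unique
number. Memory use is constant regardless of the file size.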
Now, back to distributed computation. Suppose you have more than one computer to
use for this counting task. You can split the big file of transactions into small
batches, have each computer count its share, and then merge the results to
produce a final answer.
Initially, we have an unordered file of transactions. We split that file into
small batches and count the unique credit card numbers in each batch. How do we
combine the results? We could add up the number of credit card numbers found in
each batch, but that would be incorrect: the same credit card number might appear
in more than one batch, and would therefore be counted more than once in the final
total. Instead, we would need to keep a list of the unique credit card numbers
found in each batch, and then merge those lists into a final result list. The size
of this final list is the number of unique credit card numbers in the whole set.
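One way to sketch this merge step, assuming each batch is an in-memory list of
(card number, amount) records, is to take the union of the per-batch sets of
card numbers:

```python
def count_unique_across_batches(batches):
    # batches: iterable of lists of (card_number, amount) records
    merged = set()
    for batch in batches:
        batch_cards = {card for card, _amount in batch}  # per-batch uniques
        merged |= batch_cards  # union removes duplicates across batches
    return len(merged)
```

Note that the merged set can grow as large as the hash table from the
single-machine approach, so the merge itself becomes the bottleneck.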
In contrast, suppose the transactions are split into batches with more care, so
that all transactions made with the same credit card end up in the same batch.
With this extra restriction, each batch can be counted individually, and then the
counts from each batch can be added to make a final result. No merge is necessary,
because there is no possibility of double-counting. Each credit card number will
appear in precisely one batch.
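A sketch of this careful split, again assuming in-memory records, partitions by
a hash of the card number so that every transaction for a given card lands in
the same batch; the per-batch counts can then simply be added. (A real
distributed system would use a stable hash function rather than Python's
per-process hash.)

```python
def partition_by_card(records, num_batches):
    # Same card number -> same hash -> same batch, by construction.
    batches = [[] for _ in range(num_batches)]
    for card, amount in records:
        batches[hash(card) % num_batches].append((card, amount))
    return batches

def count_unique_partitioned(records, num_batches=4):
    batches = partition_by_card(records, num_batches)
    # Each batch could be counted on its own machine; a plain sum is
    # correct because no card number appears in two batches.
    return sum(len({card for card, _ in batch}) for batch in batches)
```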
These examples might be a little tedious, but the point is that proper data
grouping can radically change the performance characteristics of a task. Using a
sorted input file made the counting task easy, reduced the amount of memory
needed to nearly zero, and made it possible to distribute the computation easily.