most users expect. This leaves a huge gulf between the size of the Web and what we
can handle with current single-computer technology. Note that this problem is
not restricted to a few major web search companies; many more companies want
to analyze the content of the Web instead of making it available for public search.
These companies have the same scalability problem.
The second factor is simple economics. The incredible popularity of personal
computers has made them very powerful and inexpensive. In contrast, large computers serve a very small market, and therefore have fewer opportunities to develop economies of scale. Over time, this difference in scale has made it difficult to build a computer that is much more powerful than a personal computer yet still sells for a reasonable price. Many large information retrieval
systems ran on mainframes in the past, but today’s platform of choice consists of
many inexpensive commodity servers.
Inexpensive servers have a few disadvantages when compared to mainframes.
First, they are more likely to break, and the likelihood of at least one server failure goes up as you add more servers; for example, if each server fails in a given month with probability 1%, a cluster of 100 servers will see at least one failure in roughly 63% of months. Second, they are difficult to program. Most
programmers are well trained for single-threaded programming, less well trained
for threaded or multi-process programming, and not well trained at all for cooperative network programming. Many programming toolkits have been developed
to help address this kind of problem. RPC, CORBA, Java RMI, and SOAP have
been developed to allow function calls across machine boundaries. MPI provides
a different abstraction, called message passing, which is popular for many scientific
tasks. None of these techniques are particularly robust against system failures, and
the programming models can be complex. In particular, these systems do not help
distribute data evenly among machines; that is the programmer’s job.
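To make the message-passing abstraction concrete, the following sketch uses MPI through the mpi4py binding for Python; the library choice, the document list, and the two-process layout are illustrative assumptions rather than anything prescribed by these toolkits. Process 0 splits a small list of documents and sends part of it to process 1:

    # A minimal sketch of message passing with mpi4py.
    # Run with, for example:  mpiexec -n 2 python send_recv.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()          # each process gets a unique rank: 0, 1, ...

    if rank == 0:
        # Process 0 decides how to split the data and sends one piece away.
        documents = ["doc1", "doc2", "doc3", "doc4"]
        comm.send(documents[2:], dest=1, tag=0)
        my_share = documents[:2]
    else:
        # Process 1 blocks until the message from process 0 arrives.
        my_share = comm.recv(source=0, tag=0)

    print("rank", rank, "processes", my_share)

Even in this tiny example the programmer decides how the data is partitioned, and if either process crashes the whole job typically aborts, which is exactly the kind of fragility and manual bookkeeping described above.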