Since I1 and I2
may have used the same document numbers, the merge function
renumbers documents in I2.
This merging process can succeed even if there is only enough memory to store
two words (w1and w2), a single inverted list posting, and a few file pointers. In
practice, a real merge function would read large chunks of I1 and I2, and then
write large chunks to I in order to use the disk most efficiently.
This merging strategy also shows a possible parallel indexing strategy. If many
machines build their own partial indexes, a single machine can combine all of
those indexes together into a single, final index. However, in the next section,
we will explore more recent distributed indexing frameworks that are becoming
popular.
5.6.3 Parallelism and Distribution
The traditional model for search engines has been to use a single, fast machine to
create the index and process queries. This is still the appropriate choice for a large
number of applications, but it is no longer a good choice for the largest systems.
Instead, for these large systems, it is increasingly popular to use many inexpensive servers together and use distributed processing software to coordinate their
activities. MapReduce is a distributed processing tool that makes this possible.
Two factors have forced this shift. First, the amount of data to index in the
largest systems is exploding. Modern web search engines already index tens of billions of pages, but even larger indexes are coming. Consider that if each person on
earth wrote one blog post each day, the Web would increase in size by over two
trillion pages every year. Optimistically, one typical modern computer can handle
a few hundred million pages, although not with the kind of response times that