concatenated together. When ReducePostingsToLists (Figure 5.14) is called, the
emitted postings have been shuffled so that all postings for the same word are
together. The Reducer calls WriteWord to start writing an inverted list and then
uses EncodePosting to write each posting.
5.6.4 Update
So far, we have assumed that indexing is a batch process. This means that a set of
documents is given to the indexer as input, the indexer builds the index, and then
the system allows users to run queries. In practice, most interesting document col
lections are constantly changing. At the very least, collections tend to get bigger
over time; every day there is more news and more email. In other cases, such as
web search or file system search, the contents of documents can change over time
as well. A useful search engine needs to be able to respond to dynamic collections.
We can solve the problem of update with two techniques: index merging and
result merging. If the index is stored in memory, there are many options for quick
index update. However, even if the search engine is evaluating queries in mem
ory, typically the index is stored on a disk. Inserting data in the middle of a file
is not supported by any common file system, so direct disk-based update is not
straightforward. We do know how to merge indexes together, though, as we saw
in section 5.6.2. This gives us a simple approach for adding data to the index: make
a new, smaller index (I
2
) and merge it with the old index (I
1
) to make a new in
dex containing all of the data (I ). Postings in I
1
for any deleted documents can
be ignored during the merge phase so they do not appear in I .
Index merging is a reasonable update strategy when index updates come in
large batches, perhaps many thousands of documents at a time. For single docu
ment updates, it isn’t a very good strategy, since it is time-consuming to write the
entire index to disk. For these small updates, it is better to just build a small index