(Chapter 6 explains this in more detail). In order to get this kind of information,
the text of the document needs to be retrieved.
In Chapter 3, we saw some ways that documents can be stored for fast access.
There are many ways to approach this problem, but in the end, a separate system is
necessary to convert search engine results from numbers into something readable
by people.
5.6 Index Construction
Before an index can be used for query processing, it has to be created from the text
collection. Building a small index is not particularly difficult, but as input sizes
grow, some index construction tricks can be useful. In this section, we will look at
simple in-memory index construction first, and then consider the case where the
input data does not fit in memory. Finally, we will consider how to build indexes
using more than one computer.
5.6.1 Simple Construction
Pseudocode for a simple indexer is shown in Figure 5.8. The process involves only
a few steps. A list of documents is passed to the BuildIndex function, and the
function parses each document into tokens, as discussed in Chapter 4. These
tokens are words, perhaps with some additional processing, such as downcasing
or stemming. The function removes duplicate tokens, using, for example, a hash
table. Then, for each token, the function determines whether a new inverted list
needs to be created in I, and creates one if necessary. Finally, the current
document number, n, is added to the inverted list.
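The steps above can be sketched in Python. This is not the pseudocode of Figure 5.8 itself, just a minimal sketch of the same process; the tokenizer here is a stand-in for the Chapter 4 parsing step, and the names are illustrative:

```python
def parse_to_tokens(document):
    # Stand-in for the Chapter 4 parser: downcase and split on whitespace
    return document.lower().split()

def build_index(documents):
    """Build an in-memory index mapping each token to its inverted list."""
    index = {}  # the hash table I from the text
    for n, document in enumerate(documents):
        # A set removes duplicate tokens within the document
        for token in set(parse_to_tokens(document)):
            # Create a new inverted list for previously unseen tokens
            if token not in index:
                index[token] = []
            # Add the current document number to the token's inverted list
            index[token].append(n)
    return index
```

Because documents are processed in order of their document numbers, each inverted list comes out sorted without any extra work.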
The result is a hash table of tokens and inverted lists. The inverted lists are
just lists of integer document numbers and contain no special information. This
is enough to do very simple kinds of retrieval, as we saw in section 5.3.1.
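For instance, a conjunctive (AND) query can be answered by intersecting the inverted lists of the query terms. A small sketch, assuming the lists are sorted document numbers as produced above:

```python
def intersect(list_a, list_b):
    """Merge-intersect two sorted lists of document numbers."""
    result = []
    i = j = 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            # Document appears in both lists, so it matches the query
            result.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return result
```

Since both lists are scanned once, the intersection takes time linear in their combined length.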
As described, this indexer can be used for many small tasks—for example,
indexing fewer than a few thousand documents. However, it is limited in two ways.
First, it requires that all of the inverted lists be stored in memory, which may not
be practical for larger collections. Second, this algorithm is sequential, with no
obvious way to parallelize it. The primary barrier to parallelizing this algorithm is
the hash table, which is accessed constantly in the inner loop. Adding locks to the
hash table would allow parallelism for parsing, but that improvement alone will