Conjunctive processing
The simplest kind of query optimization is conjunctive processing. By conjunctive
processing, we just mean that every document returned to the user needs to contain all of the query terms. Conjunctive processing is the default mode for many
web search engines, in part because of speed and in part because users have come to
expect it. With short queries, conjunctive processing can actually improve effectiveness and efficiency simultaneously. In contrast, search engines that use longer
queries, such as entire paragraphs, will not be good candidates for conjunctive
processing.
Conjunctive processing works best when one of the query terms is rare, as in
the query “fish locomotion”. The word “fish” occurs about 100 times as often as
the word “locomotion”. Since we are only interested in documents that contain
both words, the system can skip over most of the inverted list for “fish” in order
to find only the postings in documents that also contain the word “locomotion”.
Conjunctive processing can be employed with both term-at-a-time and document-at-a-time systems. Figure 5.20 shows the updated term-at-a-time algorithm
for conjunctive processing. When processing the first term, (i = 0), processing
proceeds normally. However, for the remaining terms, (i > 0), the algorithm
processes postings starting at line ??. It checks the accumulator table for the next
document that contains all of the previous query terms, and instructs list l
i
to
skip forward to that document if there is a posting for it (line ??). If there is a
posting, the accumulator is updated. Ifthe posting does not exist, the accumulator
is deleted (line ??).
The document-at-a-time version (Figure 5.21) is similar to the old documentat-a-time version, except in the inner loop. It begins by finding the largest document d currently pointed to by an inverted list (line 13). This document d is not
guaranteed to contain all the query terms, but it is a reasonable candidate. The
next loop tries to skip all lists forward to point at d (line 16). If this is not successful, the loop terminates and another document d is chosen. If it is successful, the
document is scored and added to the priority queue.
In both algorithms, the system runs fastest when the first list (l
0
) is the shortest
and the last list (ln) is the longest. This results in the biggest possible skip distances
in the last list, which is where skipping will help most.