1 System Design
We followed the structure of the starter code and kept the predefined interface. To summarize: • Index class implements the BSBI indexing algorithm
• Query class implements boolean conjunctive query processing algorithm
• BaseIndex defines the interface for writing/reading posting lists
• BasicIndex class writes uncompressed posting lists to/reads posting lists from disk • VBIndex writes/reads posting lists with variable byte encoding for gaps
• GammaIndex writes/reads posting lists with gamma encoding for gaps
There are also some helper classes, e.g. ListLengthComparator which is used for sorting posting lists when processing the query.
Some key algorithm are explained in detail as follows.
1.1 Indexing Algorithm
Key steps in the indexing algorithm are: 1. Process blocks one by one:
(a) For each block, traverse each file and generate pairs
(b) After processing all files in the block, sort the pairs first by Term
ID then by Doc ID and organize pairs with the same Term ID into a posting list
(c) Create a file and store the posting lists with specified posting list writing algorithm
(d) Push the file into the merge queue
2. Merge blocks into a unified index:
(a) Pop two block files from the queue
(b) Merge the postings lists with the same Term ID (if possible) into a single one with standard merging algorithm
1
(c) Write merged posting lists into a merged file (d) Push the merged file into the merge queue
(e) Proceed until the size of queue is 1
3. Write supplemental information (Term to Term ID mapping, Doc Name to Doc ID mapping, posting list positions) out to files
The indexing algorithm guarantees that:
1. Term ID and Doc ID are always greater than 0 (so there’s no problem for γ encoding)
2. When indexing the blocks, only one block is loaded into memory at a time
3. When merging two blocks, the indexes are loaded in in a stream fashion. That is, for each index to merge, only a posting list is loaded at a time.