While this is not a very effective or efficient
method of classification, it proved adequate for
creating document classes (subsets of documents in
a corpus) that match an information need expressed
with Boolean operators, since the Boolean model
simply views a document as a set of words; a document
class matching an information need is formed by
combining terms with the Boolean operators AND, OR and NOT.
However, the limitations of such an approach to classification
were realised quite early, and refinements followed.
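To make the set-of-words view concrete, the following is a minimal sketch of Boolean retrieval implemented directly over document word sets; the tiny corpus and the query are illustrative assumptions, not taken from the text.

```python
# Minimal sketch of Boolean retrieval: each document is treated purely as a
# set of words, and a document class is formed with AND, OR and NOT.
# The corpus and query below are illustrative assumptions.

corpus = {
    "d1": "information retrieval with boolean operators",
    "d2": "database management systems and inverted files",
    "d3": "boolean classification of documents in a corpus",
}

# Represent every document as the set of its words (the Boolean model's view).
doc_sets = {doc_id: set(text.split()) for doc_id, text in corpus.items()}

def docs_with(term):
    """Return the class (set of document ids) of documents containing the term."""
    return {doc_id for doc_id, words in doc_sets.items() if term in words}

# Query: boolean AND classification, NOT database
result = (docs_with("boolean") & docs_with("classification")) - docs_with("database")
print(sorted(result))   # -> ['d3']
```

Set intersection, union and difference correspond directly to AND, OR and NOT, which is why the Boolean model reduces classification to elementary set operations.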
Let us consider a realistic situation: a corpus
of 100,000 documents requiring classification to
support retrieval. If each document is, on average,
1,000 words long and there are 100,000 unique
words in the corpus, the term-document matrix will
have 100,000 × 100,000 = 10 billion cells of '0s' and '1s'.
Since a 1,000-word document can contain at most 1,000
distinct terms, the '1s' can account for no more than
1 % of these cells, leaving well over 99 % of the
matrix as '0s'. A more efficient approach is to record
only the '1s', which is exactly what an inverted file
or back-of-the-book index (with terms arranged
alphabetically) does; this classificatory approach is
extensively used in today's database management
systems (a minimal sketch follows below). This still
left unsolved the problem of 'dodging' irrelevant
documents, a problem frequently faced while using
Web search engines.
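As an illustration of recording only the '1s', the sketch below builds a toy inverted file in which each term maps to the documents containing it; the corpus is an assumed example, and real systems store document identifiers, positions and frequencies far more compactly than this.

```python
# Minimal sketch of an inverted file: only the '1s' of the term-document
# matrix are stored, as a mapping from each term to the documents containing it.
# The corpus below is an illustrative assumption.
from collections import defaultdict

corpus = {
    "d1": "information retrieval with boolean operators",
    "d2": "database management systems and inverted files",
    "d3": "boolean classification of documents in a corpus",
}

inverted_index = defaultdict(set)
for doc_id, text in corpus.items():
    for term in set(text.split()):        # each distinct term contributes one '1'
        inverted_index[term].add(doc_id)

# Terms arranged alphabetically, as in a back-of-the-book index.
for term in sorted(inverted_index):
    print(term, "->", sorted(inverted_index[term]))

# Space comparison: postings stored versus cells in the full term-document matrix.
postings = sum(len(docs) for docs in inverted_index.values())
full_matrix_cells = len(inverted_index) * len(corpus)
print(postings, "postings instead of", full_matrix_cells, "matrix cells")
```

Even on this toy corpus the index stores far fewer entries than the full matrix has cells, and the gap widens dramatically at the corpus sizes discussed above.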
The problems could be traced to: