Several issues have made clustering algorithms less widely used in practice than
classification algorithms. These include their computational cost and the
difficulty of interpreting and evaluating the clusters they produce. Clustering
has nevertheless been used in a number of search engines for organizing
the results, as we discussed in section 6.3.3. There are very few results for a search
compared to the size of the document collection, so the efficiency of clustering is
less of a problem. Clustering can also discover structure in the result set for
arbitrary queries, something that is not possible with a classification algorithm,
which requires its categories to be defined in advance.
Topic modeling, which we discussed in section 7.6.2, can also be viewed as
an application of clustering with the goal of improving the ranking effectiveness
of the search engine. In fact, most of the information retrieval research involving
clustering has focused on this goal. The basis for this research is the well-known
cluster hypothesis. As originally stated by van Rijsbergen (1979), the cluster hypothesis
is:
Closely associated documents tend to be relevant to the same requests.
Note that this hypothesis doesn’t actually mention clusters. However, “closely associated”
or similar documents will generally be in the same cluster. So the hypothesis
is usually interpreted as saying that documents in the same cluster tend
to be relevant to the same queries.
Two different tests have been used to verify whether the cluster hypothesis
holds for a given collection of documents. The first compares the distribution of
similarity scores for pairs of relevant documents (for a set of queries) to the distribution
for pairs consisting of a non-relevant and a relevant document. If the
cluster hypothesis holds, we might expect to see a separation between these two
distributions. On some smaller corpora, such as the CACM corpus mentioned
in Chapter 8, this is indeed the case. If the relevant documents form several
clusters that are not similar to each other, however, this test may fail to show
any separation, because many pairs of relevant documents will span two clusters
and have low similarity. To address this potential problem, Voorhees (1985)
proposed a test based on the assumption that if the cluster hypothesis holds, relevant
documents would have high local precision, even if they were scattered in
many clusters. Local precision simply measures, for each relevant document, the
number of relevant documents found among its five nearest neighbors.
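
The two tests can be sketched in a few lines of Python. This is only an illustration under stated assumptions: documents are represented as sparse term-weight dictionaries and cosine similarity is used as the similarity measure, neither of which is specified above. The function names and the toy collection are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts (assumed measure)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pair_similarities(docs, relevant):
    """First test: similarity scores for relevant-relevant pairs and for
    relevant-nonrelevant pairs, for a single query."""
    rel = [d for d in docs if d in relevant]
    non = [d for d in docs if d not in relevant]
    rel_rel = [cosine(docs[a], docs[b]) for a, b in combinations(rel, 2)]
    rel_non = [cosine(docs[a], docs[b]) for a in rel for b in non]
    return rel_rel, rel_non


def local_precision(docs, relevant, k=5):
    """Second test: for each relevant document, count the relevant documents
    among its k nearest neighbors (k=5 in the description above)."""
    counts = {}
    for d in relevant:
        others = [o for o in docs if o != d]
        # Rank all other documents by decreasing similarity to d.
        neighbors = sorted(others,
                           key=lambda o: cosine(docs[d], docs[o]),
                           reverse=True)[:k]
        counts[d] = sum(1 for o in neighbors if o in relevant)
    return counts


# Hypothetical toy collection: d1-d3 are relevant to one query, d4-d6 are not.
docs = {
    "d1": {"tropical": 1.0, "fish": 1.0},
    "d2": {"tropical": 1.0, "fish": 1.0, "tank": 1.0},
    "d3": {"fish": 1.0, "tank": 1.0},
    "d4": {"stock": 1.0, "market": 1.0},
    "d5": {"stock": 1.0, "price": 1.0},
    "d6": {"market": 1.0, "price": 1.0},
}
relevant = {"d1", "d2", "d3"}

rel_rel, rel_non = pair_similarities(docs, relevant)
print("relevant-relevant similarities:   ", sorted(rel_rel))
print("relevant-nonrelevant similarities:", sorted(rel_non))
# k=2 is used here only because the toy collection is so small.
print("local precision counts:", local_precision(docs, relevant, k=2))
```

In a real evaluation the two lists returned by `pair_similarities` would be pooled over a set of queries and compared as distributions (for example, as histograms), and the local precision counts would be summarized over all relevant documents.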