RECENT TRENDS IN HIERARCHIC DOCUMENT
CLUSTERING: A CRITICAL REVIEW
Abstract - This article reviews recent research into the use of hierarchic agglomerative
clustering methods for document retrieval. After an introduction to the calculation of
interdocument similarities and to clustering methods that are appropriate for document
clustering, the article discusses algorithms that can be used to allow the implementation
of these methods on databases of nontrivial size. The validation of document hierarchies
is described using tests based on the theory of random graphs and on empirical characteristics
of document collections that are to be clustered. A range of search strategies
is available for retrieval from document hierarchies, and the article presents the results
of a series of research projects that have used these strategies to search the clusters
resulting from several different types of hierarchic agglomerative clustering method. It is suggested
that the complete linkage method is probably the most effective method in terms
of retrieval performance; however, it is also difficult to implement in an efficient manner.
Other applications of document clustering techniques are discussed briefly; experimental
evidence suggests that nearest neighbor clusters, possibly represented as a
network model, provide a reasonably efficient and effective means of including interdocument
similarity information in document retrieval systems.
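
As an illustration of the kind of method reviewed here, the following is a minimal sketch of complete linkage clustering over interdocument similarities. It is not taken from the article: the function names (cosine, complete_linkage), the use of raw term frequencies, and the choice of the cosine coefficient are all assumptions made for the example. The naive search over all cluster pairs is deliberately simple and would not scale to databases of nontrivial size.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters (assumed measure)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def complete_linkage(docs):
    """Naive complete linkage agglomeration (hypothetical helper).

    Returns the merges in order as (left_members, right_members, similarity),
    where members are tuples of document indices.
    """
    vecs = [Counter(d.lower().split()) for d in docs]
    n = len(vecs)
    sim = [[cosine(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]

    clusters = {i: (i,) for i in range(n)}  # cluster id -> member indices
    merges = []
    next_id = n
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = ids[x], ids[y]
                # Complete linkage: a pair of clusters is only as similar
                # as its least similar pair of documents.
                s = min(sim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or s > best[0]:
                    best = (s, a, b)
        s, a, b = best
        merges.append((clusters[a], clusters[b], s))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges
```

Calling complete_linkage on a list of document strings yields the sequence of fusions that defines the hierarchy, with the most similar pair of documents merged first.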