Abstract— Document clustering is an unsupervised approach
extensively used to navigate, filter, summarize, and manage large
document collections such as the World Wide Web (WWW). Recently,
the focus in this domain has shifted from traditional vector-based
document similarity to suffix-tree-based document similarity, as
the latter offers a more semantic representation of the text in a
document. In this paper, we compare and contrast two recently
introduced approaches to document clustering based on the suffix
tree data model. The first is Efficient Phrase-based document
clustering, which extracts phrases from documents to form a
compact document representation and uses a similarity measure
based on a common suffix tree to cluster the documents. The
second is frequent word/word meaning sequence based document
clustering, which similarly extracts common word sequences from
the documents, uses the common word sequences or common word
meaning sequences to build the compact representation, and
finally applies a document clustering approach to the compact
documents. Both algorithms use agglomerative hierarchical
clustering for the actual clustering step; they differ mainly in
how phrases are extracted, how the compact document model is
represented, and which similarity measure is used for clustering.
This paper investigates the computational aspects of the two
algorithms and the quality of the results they produce.
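
The pipeline shared by both approaches (extract phrases, build a compact representation, cluster agglomeratively) can be illustrated with a minimal sketch. It is not either of the compared algorithms: word n-grams stand in for suffix-tree phrases, Jaccard overlap stands in for the papers' similarity measures, and average linkage is an assumed merge rule. All function names and parameters (`phrases`, `compact`, `agglomerate`, `min_docs`, `max_len`) are hypothetical.

```python
from itertools import combinations


def phrases(text, max_len=3):
    """Return the set of word n-grams (1..max_len words) found in text."""
    words = text.lower().split()
    return {
        tuple(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    }


def compact(docs, min_docs=2, max_len=3):
    """Compact representation: keep only phrases shared by >= min_docs documents."""
    sets = [phrases(d, max_len) for d in docs]
    counts = {}
    for s in sets:
        for p in s:
            counts[p] = counts.get(p, 0) + 1
    return [{p for p in s if counts[p] >= min_docs} for s in sets]


def jaccard(a, b):
    """Assumed similarity measure: overlap of two phrase sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def agglomerate(reps, k):
    """Average-linkage agglomerative clustering, merging until k clusters remain."""
    clusters = [[i] for i in range(len(reps))]

    def linkage(c1, c2):
        total = sum(jaccard(reps[i], reps[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Pick the most similar pair of clusters (a < b) and merge b into a.
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


docs = [
    "suffix tree document clustering",
    "document clustering with suffix tree",
    "stock market prices fall",
    "stock market prices rise",
]
print(agglomerate(compact(docs), k=2))  # [[0, 1], [2, 3]]
```

Enumerating n-grams directly keeps the sketch short; a suffix tree would find the same shared word sequences without fixing a maximum phrase length in advance.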
