Although existing approaches rely on term
semantic similarity, few studies have evaluated
the effects of different similarity measures
on document clustering for a specific
domain. Yoo, Hu, and Song (2006)
employed a single similarity measure that counts
the number of shared ancestor concepts and the
number of co-occurring documents. Jing et al.
(2006) compared two ontology-based term similarity
measures. These approaches rely heavily
on term similarity information, and all of
these similarity measures are domain independent.
To date, however, relatively little work has
evaluated term similarity measures for document
clustering in the biomedical domain, where a
growing number of ontologies, such as the MeSH
ontology, organize medical concepts into
hierarchies.
In our previous study (Zhang et al., 2007), we
conducted a comparison on a selected
PubMed document set. However, conclusions drawn
from a single dataset may not generalize.
Moreover, the similarity score threshold applied
in that study makes term reweighting unfair,
since the distribution of similarity
scores differs from one similarity
measure to another. Therefore, for a fair comparison,
we use the minimum path length between two
documents as the threshold