Overlapping correlation clustering
Abstract—We introduce a new approach to the problem of
overlapping clustering. The main idea is to formulate overlapping
clustering as an optimization problem in which each data point
is mapped to a small set of labels, representing membership in
different clusters. The objective is to find a mapping so that the
distances between data points agree as much as possible with
distances taken over their label sets. To define distances between
label sets, we consider two measures: a set-intersection indicator
function and the Jaccard coefficient.
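As a minimal sketch of the two label-set measures and of the agreement objective, consider the following Python fragment. It assumes that pairwise similarities lie in [0, 1] and that disagreement is measured by a sum of absolute differences over all pairs; the function names, the toy data, and the convention that two empty label sets have Jaccard coefficient 0 are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations

def intersection_indicator(a, b):
    """Return 1.0 if the two label sets share at least one label, else 0.0."""
    return 1.0 if set(a) & set(b) else 0.0

def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| (0.0 for two empty sets, by assumption)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def disagreement(similarity, labels, measure):
    """Sum over all pairs of |measure(labels[u], labels[v]) - similarity[(u, v)]|.

    The absolute-difference form is an assumption made for this sketch.
    """
    total = 0.0
    for u, v in combinations(sorted(labels), 2):
        total += abs(measure(labels[u], labels[v]) - similarity[(u, v)])
    return total

# Hypothetical toy input: three points, similarities keyed by sorted pairs.
similarity = {("a", "b"): 1.0, ("a", "c"): 0.0, ("b", "c"): 0.5}
labels = {"a": {1}, "b": {1, 2}, "c": {2}}

cost_jaccard = disagreement(similarity, labels, jaccard)
cost_indicator = disagreement(similarity, labels, intersection_indicator)
```

In this toy labeling, point b belongs to both clusters, which is what allows its label set to partially match those of a and c.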
To solve the main optimization problem we propose a local-search
algorithm. The iterative step of our algorithm requires
solving non-trivial optimization subproblems, which, for the
measures of set-intersection and Jaccard, we solve using a greedy
method and non-negative least squares, respectively.
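The overall shape of such a local search can be sketched as follows. This is only an illustration under stated assumptions: the per-point subproblem is solved here by brute-force enumeration of all label sets of size at most p, not by the greedy or non-negative least-squares methods mentioned above, and the absolute-difference cost, the round-robin initialization, and all names are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard coefficient of two label sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def point_cost(u, candidate, labels, sim):
    """Disagreement contributed by point u if it carried the label set `candidate`."""
    return sum(abs(jaccard(candidate, labels[v]) - sim[frozenset((u, v))])
               for v in labels if v != u)

def local_search(points, sim, k, p, max_rounds=20):
    """Relabel one point at a time until no single relabeling lowers the cost."""
    # All non-empty label sets over k labels with at most p labels each.
    candidates = [frozenset(c) for r in range(1, p + 1)
                  for c in combinations(range(k), r)]
    # Deterministic round-robin start (an assumption made for reproducibility).
    labels = {u: frozenset({i % k}) for i, u in enumerate(points)}
    for _ in range(max_rounds):
        changed = False
        for u in points:
            best, best_cost = labels[u], point_cost(u, labels[u], labels, sim)
            for cand in candidates:
                cost = point_cost(u, cand, labels, sim)
                if cost < best_cost - 1e-12:  # accept strict improvements only
                    best, best_cost = cand, cost
            if best != labels[u]:
                labels[u] = best
                changed = True
        if not changed:
            break
    return labels

# Hypothetical toy input: two tight pairs that are dissimilar to each other.
points = ["a", "b", "c", "d"]
sim = {frozenset(pair): 0.0 for pair in combinations(points, 2)}
sim[frozenset(("a", "b"))] = 1.0
sim[frozenset(("c", "d"))] = 1.0

result = local_search(points, sim, k=2, p=2)
```

Brute-force enumeration is exponential in p and is used here only to keep the sketch short; it stands in for the dedicated subproblem solvers.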
Since our framework uses pairwise similarities of objects
as the input, it lends itself naturally to the task of clustering
structured objects for which feature vectors can be difficult to
obtain. As a proof of concept we show how easily our framework
can be applied in two different complex application domains.
Firstly, we develop overlapping clustering of animal trajectories,
obtaining zoologically meaningful results. Secondly, we apply
our framework for overlapping clustering of proteins based on
pairwise similarities of amino-acid sequences, outperforming a
state-of-the-art method in matching a ground-truth taxonomy.