In its basic form, the input to our algorithms contains all
|V| × |V| pairwise similarities. However, it turns out that
there can be a lot of redundancy in this input. Often we can
prune most of the pairwise comparisons with negligible loss
in quality. This property is important, as it allows us
to apply the algorithm to larger data sets as well. Selecting the
best set of edges to prune is an interesting problem in its own
right. In this experiment we take a simple approach and
prune edges at random: each edge is taken into consideration with
probability q (called the pruning threshold), independently of
the other edges. Figure 3 shows the edge-specific cost as well
as precision and recall as a function of q for the OCC-JACC
algorithm (the curves are again medians over 30 trials). Clearly,
with these example data sets the pruning threshold can be set
very low. There is also a noticeable “threshold effect” in the
cost per edge, which may serve as an indicator for choosing the pruning
threshold in settings where no ground truth is available.
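To illustrate the random pruning scheme, the following minimal Python sketch keeps each candidate edge independently with probability q. It is not the implementation used in the experiments. It assumes that similarities are defined on unordered pairs of distinct nodes, and the function name `prune_edges` and the `seed` parameter are hypothetical, introduced here only for illustration.

```python
import random
from itertools import combinations


def prune_edges(nodes, q, seed=None):
    """Keep each unordered node pair independently with probability q.

    Assumption: edges are unordered pairs of distinct nodes. The names
    `prune_edges` and `seed` are illustrative, not taken from the paper.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie in [0, 1]")
    rng = random.Random(seed)  # local generator, so runs are reproducible
    kept = []
    for u, v in combinations(nodes, 2):
        # rng.random() is uniform on [0, 1), so the pair survives with
        # probability q, independently of all other pairs.
        if rng.random() < q:
            kept.append((u, v))
    return kept


if __name__ == "__main__":
    nodes = list(range(100))
    edges = prune_edges(nodes, q=0.1, seed=0)
    total = len(nodes) * (len(nodes) - 1) // 2
    print(f"kept {len(edges)} of {total} candidate edges")
```

With q = 1 every pair is kept and with q = 0 none is; in between, the expected number of retained edges is q times the number of candidate pairs, so only that fraction of the pairwise similarities needs to be computed.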
This suggests that in practice it is not necessary to use all