5.3.2 Top-K Correlation Method
We evaluate the individual performance of TCC and CCC by measuring their average compression error and its variance while varying the top-k size from 1 to 10 times its initial length. To construct the top-k list, we computed correlations over disjoint pairs of demographic nodes for fixed time intervals (with local averages), taken from 17 different topics in MovieLens. We then filtered the correlations according to the significance and minimum correlation criteria, obtaining lists of approximately 12K high (l > 0.50) and 2K very high (l > 0.75) top-k correlations from an initial set of 140K disjoint pairs.
Compression error is computed as the root mean squared error (RMSE) between the actual correlations and those retrieved according to Algorithm 5. We note that the optimal compression (the one with the smallest error) for the CCC clustering method was sometimes achieved with a size smaller than that required by the compression ratio parameter. In such cases, the remaining space was filled with the highest non-clustered correlations.
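The error measure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `compression_rmse`, the dictionary representation of correlations, and the `missing` default used for pairs that cannot be retrieved from the compressed list are all assumptions made for the example.

```python
import math

def compression_rmse(actual, retrieved, missing=0.0):
    """RMSE between actual correlations and those recovered from a compressed list.

    actual:    dict mapping a pair key, e.g. ("a", "b"), to its true correlation
    retrieved: dict mapping pair keys to the correlation recovered after compression
    missing:   value used when a pair cannot be recovered (an assumption of this sketch)
    """
    if not actual:
        raise ValueError("at least one pair is required")
    squared_error = sum(
        (value - retrieved.get(pair, missing)) ** 2
        for pair, value in actual.items()
    )
    return math.sqrt(squared_error / len(actual))

# Hypothetical values, for illustration only.
actual = {("a", "b"): 0.8, ("a", "c"): 0.6}
retrieved = {("a", "b"): 0.8, ("a", "c"): 0.2}
error = compression_rmse(actual, retrieved)
```

Treating unrecoverable pairs as `missing` makes them contribute to the error, which matches the observation that correlations absent from the compressed list raise the RMSE.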
Figure 8 presents the results of our evaluation. We observe that TCC triangulation compression performs better when the list can fit all the high correlations needed to describe the remaining ones, which happens for the large initial top-k list (l > 0.50). When all correlations are high (l > 0.75) and the compression ratio is high, a large portion of the correlations neither fit into the compressed top-k list nor can be triangulated from the correlations present in it. The error is highest in this case. CCC clustering compression, on the other hand, benefits from compressing higher correlations, provided that the top-k list has enough space to store an optimal number of clusters. In this case, most of the high correlations appear within clusters, and the number of correlations that are not approximated by cluster-cluster distances becomes relatively small. Nevertheless, CCC can become inefficient because of the clustering information overhead if there are many small, mutually distant clusters of 3 items. Since the TCC method can approximate the third correlation from the other two, it is a good complement in such cases. We recommend the hybrid CCC+TCC method as the most universally applicable approach to compressing correlations, especially for moderate compression ratios.
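To illustrate how a third correlation can be approximated from the other two, the sketch below uses the standard constraint that a 3x3 correlation matrix must be positive semidefinite. It is not the paper's TCC estimator, which is not reproduced here; the function names and the choice of the interval midpoint as the estimate are assumptions of this example.

```python
import math

def third_correlation_bounds(r_ab, r_bc):
    """Feasible interval for r_ac given r_ab and r_bc.

    Follows from requiring the 3x3 correlation matrix to be positive
    semidefinite: r_ab*r_bc +/- sqrt((1 - r_ab^2) * (1 - r_bc^2)).
    """
    center = r_ab * r_bc
    slack = math.sqrt(max(0.0, (1.0 - r_ab ** 2) * (1.0 - r_bc ** 2)))
    return center - slack, center + slack

def triangulate(r_ab, r_bc):
    """Point estimate of r_ac: the midpoint of the feasible interval (an assumption)."""
    low, high = third_correlation_bounds(r_ab, r_bc)
    return (low + high) / 2.0

# Two very high correlations constrain the third one tightly.
low, high = third_correlation_bounds(0.9, 0.9)   # roughly (0.62, 1.0)
estimate = triangulate(0.9, 0.9)                 # roughly 0.81
```

The interval narrows as the two known correlations approach 1, which is consistent with triangulation being most useful when the stored correlations are high.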
We now look at the efficiency of correlation extraction with our top-k method, using the same synthetic dataset that we used to evaluate the baseline methods. We calculated lists of high correlations (l_min = 0.5) for each of the fixed time intervals, containing 20-50K top-k values out of 15M (disjoint) and 40M (total) group pairs, and compressed them using the hybrid CCC+TCC method with clustering parameters optimized individually for each top-k size. We varied the top-k size from 4 to 16 four-kilobyte disk pages (each page can hold up to 800 correlations or cluster distances). Figure 9 demonstrates that, with a sufficiently large list of top-k correlations computed for fixed windows, it is possible to match the accuracy of conventional methods. A more detailed inspection shows that the drop in accuracy for smaller top-k sizes is caused mainly by decreasing recall, due to the inability of the top-k list to fit all the high correlations, of which our synthetic dataset is mainly composed.
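The storage budgets implied by these page counts can be worked out with a short calculation. The figure of 800 entries per page comes from the text; the helper names and the definition of the compression ratio as uncompressed size divided by budget are assumptions of this sketch.

```python
ENTRIES_PER_PAGE = 800  # correlations or cluster distances per 4 KB page (from the text)

def page_budget(pages):
    """Number of entries that fit into the given number of disk pages."""
    return pages * ENTRIES_PER_PAGE

def compression_ratio(topk_size, pages):
    """Uncompressed top-k size divided by the page budget (definition assumed here)."""
    return topk_size / page_budget(pages)

smallest_budget = page_budget(4)    # 3200 entries
largest_budget = page_budget(16)    # 12800 entries

# Extremes for lists of 20K to 50K top-k values.
mildest = compression_ratio(20_000, 16)    # 1.5625
harshest = compression_ratio(50_000, 4)    # 15.625
```

Under these assumptions, even the largest budget holds fewer entries than the smallest uncompressed list, so every configuration in this experiment requires some compression.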