I'm trying to figure out how to calculate the Rand index of a clustering algorithm, but I'm stuck on how to calculate the true negatives and false negatives.
At the moment I'm using the example from the book Introduction to Information Retrieval (Manning, Raghavan & Schütze, 2009). On page 359 they explain how to calculate the Rand index. Their example uses three clusters that contain the following objects:
a a a a a b
a b b b b c
a a c c c
I replaced the objects (the original symbols became letters, but the idea and the counts stay the same). I'll quote the exact words from the book so you can see what they are talking about:
We first compute TP + FP. The three clusters contain 6, 6, and 5 points, respectively, so the total number of “positives” or pairs of documents that are in the same cluster is:
$TP + FP = \binom{6}{2} + \binom{6}{2} + \binom{5}{2} = 15 + 15 + 10 = 40$
Of these, the a pairs in cluster 1, the b pairs in cluster 2, the c pairs in cluster 3, and the a pair in cluster 3 are true positives:
$TP = \binom{5}{2} + \binom{4}{2} + \binom{3}{2} + \binom{2}{2} = 10 + 6 + 3 + 1 = 20$
Thus, FP = 40 − 20 = 20.
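As a sanity check on the two sums above, here is a minimal Python sketch that reproduces them. The names (`clusters`, `tp_plus_fp`, `tp`, `fp`) are my own and not from the book; each cluster is written as a string of class labels.

```python
from collections import Counter
from math import comb

# The three clusters from the example, one class label per object.
clusters = ["aaaaab", "abbbbc", "aaccc"]

# Pairs that share a cluster: one binomial coefficient per cluster size.
tp_plus_fp = sum(comb(len(members), 2) for members in clusters)

# Pairs that share a cluster AND a label: one binomial coefficient
# per label count inside each cluster (comb(1, 2) is 0, so singletons add nothing).
tp = sum(comb(count, 2)
         for members in clusters
         for count in Counter(members).values())

fp = tp_plus_fp - tp

print(tp_plus_fp, tp, fp)  # 40 20 20
```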
Up to this point their calculations are clear, and if I try other examples I get the same results. But when it comes to the false negatives and true negatives, Manning et al. state the following:
FN and TN are computed similarly, resulting in the following contingency table:
The contingency table looks as follows:
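+-------------------+--------------+--------------------+
|                   | Same cluster | Different clusters |
+-------------------+--------------+--------------------+
| Same class        | TP: 20       | FN: 24             |
+-------------------+--------------+--------------------+
| Different classes | FP: 20       | TN: 72             |
+-------------------+--------------+--------------------+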
The sentence "FN and TN are computed similarly" is not clear to me, and I don't understand which numbers I need in order to calculate TN and FN. I can work out the sum of the right-hand column of the table as follows:
$TP + FP + FN + TN = \binom{n}{2} = \binom{17}{2} = 136$
Source: http://en.wikipedia.org/wiki/Rand_index
Thus, FN + TN = 136 - (TP + FP) = 136 - 40 = 96, but this doesn't really help me figure out how to calculate the two values separately, especially since the authors say "FN and TN are computed similarly". I don't see how. Also, the other examples I have looked at calculate each cell of the contingency table by checking every pair individually.
For example: http://www.otlet-institute.org/wikics/Clustering_Problems.html#toc-Subsection-4.1
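For comparison, the pair-by-pair approach can be sketched in Python as below. This is a brute-force check that I wrote myself, not the shortcut the book hints at; the names `points`, `counts` and `rand_index` are my own, and I am assuming the usual definitions (a pair is "positive" when both objects are in the same cluster, and "true" when that agrees with their class labels).

```python
from itertools import combinations

# The three clusters from the example, one class label per object.
clusters = ["aaaaab", "abbbbc", "aaccc"]

# One (cluster_id, label) tuple per object, 17 in total.
points = [(cluster_id, label)
          for cluster_id, members in enumerate(clusters)
          for label in members]

counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}

# Visit each of the 136 unordered pairs exactly once.
for (cluster_1, label_1), (cluster_2, label_2) in combinations(points, 2):
    same_cluster = cluster_1 == cluster_2
    same_label = label_1 == label_2
    if same_cluster and same_label:
        counts["TP"] += 1
    elif same_cluster:
        counts["FP"] += 1
    elif same_label:
        counts["FN"] += 1
    else:
        counts["TN"] += 1

# Rand index: fraction of pairs on which clustering and labels agree.
rand_index = (counts["TP"] + counts["TN"]) / sum(counts.values())

print(counts)  # {'TP': 20, 'FP': 20, 'FN': 24, 'TN': 72}
print(rand_index)
```

This reproduces the book's table, but it inspects every pair, which is exactly what I would like to avoid.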
My question, based on the example of Manning et al. (2009): is it possible to calculate TN and FN if you only know TP and FP? And if so, what does the "similar" calculation look like for the given example?