The goal is to partition a set of records into groups such that records within a group are similar to each other and records that belong to two different groups are dissimilar. Each such group is called a cluster, and each record belongs to exactly one cluster.1 Similarity between records is measured computationally by a distance function. A distance function takes two input records and returns a value that measures their similarity: the smaller the distance, the more similar the two records. Different applications have different notions of similarity, and there is no one measure that works for all domains.
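For numeric records, Euclidean distance is a common choice of distance function. A minimal sketch in Python (the tuple representation of a record is our own assumption, not something fixed by the text):

```python
import math

def euclidean_distance(r1, r2):
    """Euclidean distance between two records given as equal-length
    numeric tuples; smaller values mean the records are more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))

# Two hypothetical CustomerInfo-style records as (age, salary) pairs:
print(euclidean_distance((25, 30000.0), (27, 32000.0)))
```

Note that Euclidean distance is sensitive to the scale of each field; in practice, fields such as age and salary are often normalized before distances are computed.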
As an example, consider the schema of the CustomerInfo view:
CustomerInfo(age: int, salary: real)
The two coordinates of a record are the values of its salary and age fields. We can visually identify three clusters: young customers with low salaries, young customers with high salaries, and older customers with high salaries. Usually, the output of a clustering algorithm is a summarized representation of each cluster; the form of this summary depends strongly on the type and shape of clusters the algorithm computes. For example, assume that we have spherical clusters, as in the example shown in Figure 24.12. We can summarize each cluster by its center (often also called the mean) and its radius. Given a collection of records r1, r2, ..., rn, the center C and radius R are defined as follows:
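The equations referred to above are not reproduced in this excerpt; the standard definitions, treating each record $r_i$ as a point, are the centroid and the root-mean-square distance to it:

```latex
C = \frac{1}{n}\sum_{i=1}^{n} r_i ,
\qquad
R = \sqrt{\frac{\sum_{i=1}^{n} (r_i - C)^2}{n}}
```

Intuitively, C is the average of the points in the cluster, and R measures how far the points spread around that center.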
There are two types of clustering algorithms. A partitional clustering algorithm partitions the data into k groups such that some criterion that evaluates the clustering quality is optimized. The number of clusters k is a parameter whose value is specified by the user. A hierarchical clustering algorithm generates a sequence of partitions of the records. Starting with a partition in which each record forms its own cluster, the algorithm merges two clusters in each step until only one single cluster remains.
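The hierarchical (agglomerative) scheme just described can be sketched in Python. This is an illustrative sketch only, merging the two clusters whose centroids are closest at each step (centroid distance is one of several possible merge criteria, chosen here as an assumption):

```python
import math

def centroid(cluster):
    """Mean point of a cluster; each record is a numeric tuple."""
    n = len(cluster)
    return tuple(sum(coord) / n for coord in zip(*cluster))

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def hierarchical(records):
    """Agglomerative clustering: start with one singleton cluster per
    record, then repeatedly merge the two clusters whose centroids are
    closest, recording the partition produced after each merge."""
    clusters = [[r] for r in records]
    partitions = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the closest pair of clusters by centroid distance.
        i, j = min(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: dist(centroid(clusters[ab[0]]),
                                centroid(clusters[ab[1]])),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        partitions.append([list(c) for c in clusters])
    return partitions

# Hypothetical (age, salary) records: two young customers and one older one.
records = [(25, 30000.0), (26, 31000.0), (60, 95000.0)]
seq = hierarchical(records)
# seq runs from len(records) singleton clusters down to a single cluster.
```

On this tiny example, the first merge combines the two young customers, since their centroids are far closer to each other than to the older customer's.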