Clustering is a very old problem, and numerous algorithms have been developed to cluster a collection of records. Traditionally, the number of records in the input database was assumed to be relatively small, and the complete database was assumed to fit into main memory. In this section we describe a clustering algorithm called BIRCH that handles very large databases.
The design of BIRCH reflects the following two assumptions:
The number of records is potentially very large, and therefore we want to make only one scan over the database.
We have only a limited amount of main memory available.
A user can set two parameters to control the BIRCH algorithm. The first parameter is a threshold on the amount of main memory available. This main memory threshold translates into a maximum number of cluster summaries k that can be maintained in memory. The second parameter, ε, is an initial threshold for the radius of any cluster. The value of ε is an upper bound on the radius of any cluster and controls the number of clusters that the algorithm discovers. If ε is small, we discover many small clusters; if ε is large, we discover very few clusters, each of which is relatively large. We say that a cluster is compact if its radius is smaller than ε.
BIRCH always maintains k or fewer cluster summaries (C_i, R_i) in main memory, where C_i is the center of cluster i and R_i is the radius of cluster i. The algorithm always maintains compact clusters, i.e., the radius of each cluster is less than ε. If this invariant cannot be maintained with the given amount of main memory, ε is increased as described below.
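To make the notion of a cluster summary concrete, the following is a minimal sketch of one way to keep a center and a radius up to date without storing the records themselves. It assumes details the text does not state: the summary stores a record count, a per-dimension sum, and a sum of squared norms, and the radius is taken to be the root-mean-square distance of the members from the center. The class and method names (ClusterSummary, add, is_compact) are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ClusterSummary:
    """Fixed-size summary of one cluster (assumed representation)."""
    n: int = 0                                       # number of records absorbed
    linear_sum: list = field(default_factory=list)   # per-dimension sum of records
    square_sum: float = 0.0                          # sum of squared norms of records

    def add(self, record):
        """Absorb one record by updating the three running totals."""
        if self.n == 0:
            self.linear_sum = [0.0] * len(record)
        self.n += 1
        self.linear_sum = [s + x for s, x in zip(self.linear_sum, record)]
        self.square_sum += sum(x * x for x in record)

    @property
    def center(self):
        """C_i: the mean of the absorbed records (requires n > 0)."""
        return [s / self.n for s in self.linear_sum]

    @property
    def radius(self):
        """R_i: assumed to be the RMS distance of members from the center."""
        center_norm_sq = sum(c * c for c in self.center)
        # max(..., 0.0) guards against tiny negative values from rounding.
        return math.sqrt(max(self.square_sum / self.n - center_norm_sq, 0.0))

    def is_compact(self, eps):
        """A cluster is compact if its radius is smaller than eps."""
        return self.radius < eps
```

For example, after absorbing the records (0, 0) and (2, 0), the summary has center (1, 0) and radius 1, so it is compact for ε = 1.5 but not for ε = 0.5. The summary occupies the same amount of memory regardless of how many records it has absorbed, which is what allows k summaries to fit within a fixed memory budget.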
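The algorithm reads records from the database sequentially and processes each record r as follows:
1. Compute the distance between r and each existing cluster center. Let i be the index of the cluster whose center C_i is closest to r.
2. Compute the radius R'_i that cluster i would have if r were inserted into it. If cluster i would remain compact with radius R'_i, assign r to cluster i by updating its center and setting its radius to R'_i. Otherwise, start a new cluster containing only r.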
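The per-record processing can be sketched as follows: find the nearest cluster center, tentatively add the record, and keep the result only if the cluster stays compact; otherwise open a new cluster. This is an illustrative sketch under assumptions, not the book's code. Each summary is assumed to be a tuple (n, linear_sum, square_sum), the radius is assumed to be the root-mean-square distance from the center, the nearest center is found by a linear scan, and all function names are hypothetical. The memory limit k is deliberately ignored here.

```python
import math


def center(s):
    """Center of a summary s = (n, linear_sum, square_sum)."""
    n, ls, _ = s
    return [x / n for x in ls]


def radius(s):
    """Assumed radius: RMS distance of the members from the center."""
    n, ls, ss = s
    return math.sqrt(max(ss / n - sum((x / n) ** 2 for x in ls), 0.0))


def distance(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def with_record(s, r):
    """Summary that results from adding record r to summary s."""
    n, ls, ss = s
    return (n + 1, [a + b for a, b in zip(ls, r)], ss + sum(x * x for x in r))


def process_record(summaries, r, eps):
    """Assign r to its nearest cluster if that cluster stays compact,
    otherwise start a new cluster. Returns the index of r's cluster."""
    if summaries:
        # Step 1: locate the closest existing center (linear scan).
        i = min(range(len(summaries)),
                key=lambda j: distance(center(summaries[j]), r))
        # Step 2: tentatively insert r and check compactness.
        candidate = with_record(summaries[i], r)
        if radius(candidate) < eps:
            summaries[i] = candidate
            return i
    # No cluster can absorb r: start a new cluster holding only r.
    summaries.append((1, list(r), sum(x * x for x in r)))
    return len(summaries) - 1
```

Called in a loop over the records (0, 0), (0.2, 0), (5, 5), (5.2, 5) with eps = 0.5, this sketch ends with two clusters of two records each.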
The second step above presents a problem if we already have the maximum number of cluster summaries, k. If we now read a record that requires us to create a new cluster, we do not have the main memory required to hold its summary. In this case, we increase the radius threshold ε, using some heuristic to determine the increase, in order to merge existing clusters. An increase of ε has two consequences. First, existing clusters can accommodate "more" records, since their maximum radius has increased. Second, it might be possible to merge existing clusters such that the resulting cluster is still compact. Thus, an increase in ε usually reduces the number of existing clusters.
The complete BIRCH algorithm uses a balanced in-memory tree, which is similar to a B+ tree in structure, to quickly identify the closest cluster center for a new record.
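The following sketch illustrates how raising the radius threshold can free a slot when all k summaries are in use. The text leaves the heuristic for the increase unspecified, so the multiplicative factor used here (the growth parameter) is purely an assumption, as are the greedy pairwise merge order, the tuple representation (n, linear_sum, square_sum), the root-mean-square radius, and the function names. It uses a plain list rather than the balanced tree mentioned above.

```python
import math


def radius(s):
    """Assumed radius: RMS distance of the members from the center."""
    n, ls, ss = s
    return math.sqrt(max(ss / n - sum((x / n) ** 2 for x in ls), 0.0))


def merge(a, b):
    """Summary of the union of two clusters: the totals simply add up."""
    return (a[0] + b[0], [x + y for x, y in zip(a[1], b[1])], a[2] + b[2])


def merge_compact_pairs(summaries, eps):
    """Greedily merge pairs of clusters whose union is still compact."""
    merged = True
    while merged:
        merged = False
        for i in range(len(summaries)):
            for j in range(i + 1, len(summaries)):
                candidate = merge(summaries[i], summaries[j])
                if radius(candidate) < eps:
                    summaries[i] = candidate
                    del summaries[j]
                    merged = True
                    break
            if merged:
                break  # restart the scan, since the list has changed


def make_room(summaries, eps, k, growth=2.0):
    """Raise eps until fewer than k summaries remain; return the new eps.
    The multiplicative growth factor is a hypothetical heuristic."""
    if eps <= 0 or growth <= 1 or k < 2:
        raise ValueError("need eps > 0, growth > 1, and k >= 2")
    while len(summaries) >= k:
        eps *= growth
        merge_compact_pairs(summaries, eps)
    return eps
```

With three single-record clusters at (0, 0), (1, 0), and (10, 0), k = 3, and ε = 0.3, one doubling of ε to 0.6 is enough to merge the first two clusters (their union has radius 0.5), leaving two summaries and room for a new one. A more careful heuristic would pick the smallest increase that permits at least one merge; the doubling used here only keeps the sketch short.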