Partitioning a large set of objects into homogeneous clusters is a
fundamental operation in data mining. The k-means algorithm is
best suited for implementing this operation because of its
efficiency in clustering large data sets. However, because it works
only on numeric values, its use in data mining is limited: data sets
in this field often contain categorical values. In this paper we
present an algorithm, called k-modes, to extend the k-means
paradigm to categorical domains. We introduce new dissimilarity
measures to deal with categorical objects, replace means of
clusters with modes, and use a frequency based method to update
modes in the clustering process to minimise the clustering cost
function. Tested with the well-known soybean disease data set,
the algorithm has demonstrated very good classification
performance. Experiments on a very large health insurance data
set consisting of half a million records and 34 categorical
attributes show that the algorithm is scalable in terms of both the
number of clusters and the number of records.
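
The three ingredients named above (a dissimilarity measure for categorical objects, modes in place of means, and a frequency-based mode update) can be illustrated with a minimal sketch. It rests on assumptions not stated in this abstract: the dissimilarity is taken to be a simple count of mismatched attributes, modes are recomputed in batch after each full assignment pass, and the function names, the `init_modes` parameter, and the toy records are all hypothetical, chosen only for illustration.

```python
from collections import Counter


def matching_dissimilarity(x, y):
    """Number of attributes on which two categorical objects differ
    (assumed measure; the abstract does not specify one)."""
    return sum(a != b for a, b in zip(x, y))


def column_modes(rows):
    """Frequency-based mode: the most frequent category of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(data, init_modes, max_iter=100):
    """Batch k-modes sketch: assign each object to its nearest mode,
    then recompute every mode, until the assignment stops changing."""
    modes = [tuple(m) for m in init_modes]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(modes)),
                key=lambda j: matching_dissimilarity(x, modes[j]))
            for x in data
        ]
        if new_labels == labels:  # converged: no object changed cluster
            break
        labels = new_labels
        for j in range(len(modes)):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # keep the old mode if a cluster becomes empty
                modes[j] = column_modes(members)
    # Clustering cost: total dissimilarity of objects to their cluster modes.
    cost = sum(matching_dissimilarity(x, modes[lab])
               for x, lab in zip(data, labels))
    return labels, modes, cost


# Hypothetical toy records with three categorical attributes.
toy = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "oval"),
    ("green", "large", "square"),
]
labels, modes, cost = k_modes(toy, init_modes=[toy[0], toy[3]])
print(labels, modes, cost)
```

Seeding the modes with two of the records is only one possible initialisation; the result of a procedure like this generally depends on that choice.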