In (Huang 1997) we proposed an algorithm, called k-prototypes, to cluster large data sets with mixed numeric and categorical values. The k-prototypes algorithm uses a dissimilarity measure that takes into account both numeric and categorical attributes. Let SN be the dissimilarity measure on numeric attributes, defined as the squared Euclidean distance, and let SC be the dissimilarity measure on categorical attributes, defined as the number of category mismatches between two objects. We define the dissimilarity between two objects as SN + γ·SC, where γ is a weight that balances the two parts so that neither type of attribute is favored. The clustering process of the k-prototypes algorithm is similar to that of the k-means algorithm, except that a new method is used to update the categorical attribute values of cluster prototypes. One problem in using the algorithm is choosing a proper weight. We have suggested using the average standard deviation of the numeric attributes as a guide in choosing the weight.
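
The mixed dissimilarity measure and the weight heuristic described above can be illustrated with a minimal Python sketch. It is not the implementation from (Huang 1997). The function names and the parameter name `gamma` (for the weight) are hypothetical. The sketch also makes two assumptions that the text above does not state: the categorical part of a prototype is updated to the most frequent category in the cluster, and the population standard deviation is used for the weight guide.

```python
from collections import Counter
from statistics import pstdev


def mixed_dissimilarity(x_num, x_cat, y_num, y_cat, gamma):
    """Return SN + gamma * SC for two objects with mixed attributes."""
    # SN: squared Euclidean distance over the numeric attributes.
    s_n = sum((a - b) ** 2 for a, b in zip(x_num, y_num))
    # SC: number of categorical attributes on which the objects differ.
    s_c = sum(a != b for a, b in zip(x_cat, y_cat))
    return s_n + gamma * s_c


def update_prototype(members_num, members_cat):
    """Recompute a cluster prototype from its member objects.

    Numeric attributes take the mean, as in k-means. Categorical
    attributes take the most frequent category (an assumption here).
    """
    n = len(members_num)
    proto_num = [sum(col) / n for col in zip(*members_num)]
    proto_cat = [Counter(col).most_common(1)[0][0] for col in zip(*members_cat)]
    return proto_num, proto_cat


def average_std(rows_num):
    """Average standard deviation of the numeric attributes.

    Intended only as a starting guide for the weight. Population
    standard deviation is an assumption of this sketch.
    """
    columns = list(zip(*rows_num))
    return sum(pstdev(col) for col in columns) / len(columns)


if __name__ == "__main__":
    numeric = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    categorical = [["a", "x"], ["a", "y"], ["b", "y"]]

    gamma = average_std(numeric)
    prototype = update_prototype(numeric, categorical)
    distance = mixed_dissimilarity(
        numeric[0], categorical[0], prototype[0], prototype[1], gamma
    )
    print(gamma, prototype, distance)
```

A larger `gamma` makes category mismatches dominate the assignment of objects to clusters, while a value near zero reduces the measure to the squared Euclidean distance used by k-means on the numeric attributes alone.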