Distances for categorical data
In our introductory example we have only one categorical variable (sediment), so the
question of computing distance is fairly trivial: if two samples have the same sediment then
their distance is 0, and if the sediments differ it is 1. But what if there were several categorical
variables, say K of them? There are several possibilities, one of the simplest being to
extend the ‘matching’ idea and count how many matches and mismatches there are
between samples, with optional averaging over variables. For example, suppose that there
are five categorical variables, C1 to C5, each with three categories, which we denote by
a/b/c, and that there are two samples with the following characteristics: