The contribution of the second variable, depth, to this calculation is huge – one could say
that the distance is practically just the absolute difference in the depth values
(|51 – 99| = 48), with only tiny additional contributions from pollution and temperature. This
is the problem of standardization discussed in Chapter 3 – the three variables are on
completely different scales of measurement and the larger depth values have larger intersample
differences, so they will dominate in the calculation of Euclidean distances.
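
To make the dominance of depth concrete, the following minimal sketch computes a Euclidean distance between two samples and the share of the squared distance that each variable contributes. Only the depth values 51 and 99 come from the text; the pollution and temperature values are hypothetical placeholders chosen to sit on much smaller scales, and the names `sample_a`, `sample_b` and `shares` are illustrative.

```python
import math

# Depth values (51 and 99) are taken from the text. The pollution and
# temperature values are HYPOTHETICAL, chosen only to be on smaller scales.
sample_a = {"depth": 51.0, "pollution": 6.0, "temperature": 3.0}
sample_b = {"depth": 99.0, "pollution": 1.9, "temperature": 2.9}

# Squared difference for each variable: the terms summed under the square root.
squared_diffs = {name: (sample_a[name] - sample_b[name]) ** 2 for name in sample_a}

total = sum(squared_diffs.values())
distance = math.sqrt(total)

# Fraction of the squared distance attributable to each variable.
shares = {name: sq / total for name, sq in squared_diffs.items()}

print(f"Euclidean distance: {distance:.2f}")
for name, share in shares.items():
    print(f"{name:12s} share of squared distance: {share:.4f}")
```

With these assumed values, depth accounts for more than 99% of the squared distance, so the result is barely distinguishable from the depth difference of 48 on its own.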
Some form of standardization is necessary to balance out the contributions, and the
conventional way to do this is to transform the variables so they all have the same variance
of 1. At the same time we centre the variables at their means – this centring is not
necessary for calculating distance, but it makes the variables all have mean zero and thus
easier to compare. The transformation commonly called standardization is thus as follows:

    standardized value = (original value – mean) / standard deviation        (4.5)

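
The transformation in (4.5) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the book's own computation: the depth readings are hypothetical, the function name `standardize` is invented here, and the sample standard deviation (divisor n – 1) is assumed because the formula does not say which divisor is intended.

```python
import statistics

def standardize(values):
    """Return (value - mean) / standard deviation for each value, as in (4.5)."""
    mean = statistics.mean(values)
    # ASSUMPTION: sample standard deviation (n - 1 divisor).
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# HYPOTHETICAL depth readings, for illustration only.
depth = [51.0, 99.0, 72.0, 60.0, 88.0]
z = standardize(depth)

print([round(v, 3) for v in z])
print("mean:", round(statistics.mean(z), 6))
print("variance:", round(statistics.variance(z), 6))
```

Whatever the original scale, the standardized values have mean zero and variance 1, so applying the same function to each variable puts them all on a comparable footing before distances are computed.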
The means and standard deviations of the three variables are: