Detecting influential observations
Before undertaking the regression analysis, it is important to screen the data for
outliers. Outliers that act as influential observations can distort the slope and/or intercept of the regression model (Tabachnick and
Fidell, 2001). Including these observations will thus result in the estimates being
biased. The vertical distance of a point from the regression line is called the
‘residual’ (or ‘deviation’), while the horizontal distance of a point from the mean of the predictor is
called ‘leverage’. ‘Influence’ can be thought of, loosely, as the product of the residual and the
leverage. The influence of a point is the amount by which the slope of the
regression line changes when that point is removed from the dataset and
a new slope is computed. A measure of this change was developed by Cook and
is called Cook’s influence or Cook’s distance. Intuitively, Cook’s distance
measures the sum of squared changes in the fitted values of all observations
when the relevant point is removed, scaled by the residual variance. In regression diagnostics, values
greater than 1 are of concern and should be carefully investigated. When the
regressions are run, the Cook’s influence values appear in the data set as
new variables labeled coo_1, coo_2. We delete the observations for which this
distance is greater than one and run the regression analysis on the
reduced sample.
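
The screening step described above can be illustrated with a minimal sketch in plain Python. It assumes a simple regression with a single predictor and uses the standard closed-form expression for Cook's distance in terms of the residual and the leverage. The function names, the data, and the default threshold argument are hypothetical and chosen only for illustration; they are not the software or data used in the analysis.

```python
def cooks_distances(x, y):
    """Cook's distance of each point for the simple regression y = a + b*x."""
    n = len(x)
    p = 2  # number of fitted coefficients: intercept and slope
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean

    # Residual: vertical distance of each point from the fitted line.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual variance estimate (mean squared error).
    mse = sum(e ** 2 for e in residuals) / (n - p)
    # Leverage: grows with the horizontal distance from the predictor mean.
    leverages = [1 / n + (xi - x_mean) ** 2 / sxx for xi in x]

    # Cook's distance combines the squared residual with the leverage.
    return [
        (e ** 2 / (p * mse)) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverages)
    ]


def drop_influential(x, y, threshold=1.0):
    """Return the sample without points whose Cook's distance exceeds threshold."""
    distances = cooks_distances(x, y)
    kept = [(xi, yi) for xi, yi, d in zip(x, y, distances) if d <= threshold]
    x_kept = [xi for xi, _ in kept]
    y_kept = [yi for _, yi in kept]
    return x_kept, y_kept, distances


# Hypothetical data: nine points close to y = x plus one far-out point at x = 20.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.2, 7.9, 9.1, 5.0]

x_kept, y_kept, distances = drop_influential(x, y)
```

In this made-up sample only the point at x = 20 has a Cook's distance above 1, so it is the only observation dropped before the regression is rerun on the reduced sample. With several predictors the leverage would instead come from the diagonal of the hat matrix, which this sketch does not cover.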