data fed to each member. The distribution is recalculated in each round, taking into account
the errors of the immediately previous network. Oza [103] presents a variant of AdaBoost
that calculates the distribution with respect to all networks trained so far. In this way the
data received by each successive network is explicitly ‘designed’ so that the networks’ errors
are diverse and compensate for one another.
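As a concrete illustration, the sketch below shows one boosting-style reweighting step in Python. The multiplicative update follows the familiar AdaBoost rule of scaling down correctly classified examples by ε/(1−ε); the helper name reweight and the toy usage are assumptions for illustration, and the comment marks the point where the variant just described would differ, computing the misclassifications from the vote of all members trained so far rather than from the most recent member alone.

```python
import numpy as np

def reweight(weights, misclassified):
    """One boosting-style reweighting step: scale down the weights of the
    correctly classified examples, then renormalise to a distribution."""
    eps = np.average(misclassified, weights=weights)   # weighted error rate
    eps = np.clip(eps, 1e-10, 1 - 1e-10)
    beta = eps / (1.0 - eps)
    new_w = weights * np.where(misclassified, 1.0, beta)
    return new_w / new_w.sum()

# Standard AdaBoost computes `misclassified` from the member trained in the
# current round only; the variant described above would instead compute it
# from the combined vote of all members trained so far, pushing the next
# member towards examples the current ensemble still gets wrong.
weights = np.full(6, 1.0 / 6)
misclassified = np.array([True, False, False, True, False, False])
print(reweight(weights, misclassified))
```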
Zenobi and Cunningham [157] use their own metric of classification diversity, as defined
in equation (3.17), to select subsets of features for each learner. They build an ensemble
by adding predictors successively, and use their metric to estimate how much diversity is
in the ensemble so far. The feature subset used to train a predictor is determined by a
hill-climbing strategy, based on the individual error and estimated ensemble diversity. A
predictor is rejected if it causes a reduction in diversity beyond a pre-defined threshold,
or an increase in overall ensemble error. In this case a new feature subset is generated and
another predictor trained. The DECORATE algorithm, by Melville and Mooney [97], utilises
the same metric to decide whether to accept or reject predictors to be added to the ensemble.
Predictors here are generated by training on the original data, plus a ‘diversity set’ of
artificially generated new examples. The input vectors of this set are first passed through
the current ensemble to see what its decision would be. Each pattern in the diversity set has
its output re-labelled as the opposite of whatever the ensemble predicted. A new predictor
trained on this set will therefore have a high disagreement with the ensemble, increasing
diversity and hopefully decreasing ensemble error. If ensemble error is not reduced, a new
diversity set is produced and a new predictor trained.
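A compact sketch of this accept/reject loop for a two-class problem (labels 0/1) is given below. The function name decorate_style, the decision-tree base learner, the uniform sampling of artificial inputs, and the plain label flip are simplifying assumptions for illustration rather than the exact choices made in [97].

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def ensemble_vote(ensemble, X):
    """Majority vote of the current ensemble over binary labels {0, 1}."""
    votes = np.mean([m.predict(X) for m in ensemble], axis=0)
    return (votes >= 0.5).astype(int)

def decorate_style(X, y, n_members=10, n_artificial=50, max_tries=30, seed=0):
    rng = np.random.default_rng(seed)
    ensemble = [DecisionTreeClassifier(random_state=0).fit(X, y)]
    best_err = np.mean(ensemble_vote(ensemble, X) != y)
    tries = 0
    while len(ensemble) < n_members and tries < max_tries:
        tries += 1
        # Artificial 'diversity set': random points within the range of the
        # training inputs, labelled as the opposite of the current ensemble's
        # prediction, so a learner fitting them must disagree with the ensemble.
        X_art = rng.uniform(X.min(axis=0), X.max(axis=0),
                            size=(n_artificial, X.shape[1]))
        y_art = 1 - ensemble_vote(ensemble, X_art)
        candidate = DecisionTreeClassifier(random_state=tries).fit(
            np.vstack([X, X_art]), np.concatenate([y, y_art]))
        ensemble.append(candidate)
        err = np.mean(ensemble_vote(ensemble, X) != y)
        if err <= best_err:
            best_err = err      # accept: ensemble error did not increase
        else:
            ensemble.pop()      # reject the candidate
    return ensemble
```

On rejection the loop simply draws a fresh diversity set on the next pass and trains another candidate, mirroring the behaviour described above.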
Oza and Tumer [104] present Input Decimation Ensembles, which seek to reduce the
correlations among individual estimators by using different subsets of the input features.
Feature selection is achieved by calculating the correlation of each feature individually with
each class, then training predictors to be specialists for particular classes or groups of classes.
This showed significant benefits on several real [137] and artificial [104] datasets.
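A rough sketch of the decimation idea follows, assuming numeric inputs, a one-versus-rest correlation measure per class, a logistic-regression base learner, and combination by averaged class probabilities; all of these specifics are assumptions for illustration rather than the choices made in [104].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def decimated_members(X, y, n_keep=5):
    """One member per class, each trained only on the n_keep features whose
    absolute correlation with that class's 0/1 indicator is largest."""
    members = []
    for c in np.unique(y):
        indicator = (y == c).astype(float)
        # Correlation of each individual feature with the class indicator.
        corr = np.nan_to_num(np.array(
            [abs(np.corrcoef(X[:, j], indicator)[0, 1]) for j in range(X.shape[1])]))
        keep = np.argsort(corr)[::-1][:n_keep]   # most class-relevant features
        model = LogisticRegression(max_iter=1000).fit(X[:, keep], y)
        members.append((keep, model))
    return members

def decimated_predict(members, X):
    """Average the members' class-probability estimates and take the best class."""
    probs = np.mean([m.predict_proba(X[:, keep]) for keep, m in members], axis=0)
    return members[0][1].classes_[np.argmax(probs, axis=1)]
```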
Liao and Moody [84] demonstrate an information-theoretic technique for feature selection, where all input variables are first grouped based on their mutual information [51,
p492]. Statistically similar variables are assigned to the same group, and each member’s
input set is then formed from variables drawn from different groups. Experiments
on a noisy and nonstationary economic forecasting problem show it outperforms Bagging