From this viewpoint we can immediately see the parallels to an ensemble system. Each
member of an ensemble is a realisation of the random variable defined by this distribution over all possible training datasets and weight initialisations. The ensemble output is
the average of this set of realisations; all our diversity-promoting mechanisms are there to
encourage our sample mean f¯ to be a closer approximation to E{f}. If we have a large
ensemble, we have a large sample from this space, and can therefore expect the sample
mean to be a good approximation to E{f}. If we have a smaller ensemble, we cannot expect
this: the sample mean remains unbiased, but its variance is higher, so it may lie far from
E{f} in either direction. In
order to correct this, some methods, such as Bagging (see section 2.3.1), construct our networks from different training datasets, allowing us to sample a more representative portion
of the space. This illustration assumes that the expected value of our estimator is equal to
the true target value, i.e. that it is an unbiased estimator. If this is not the case, we may
have the situation in figure 3.2.
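The point above can be illustrated with a minimal numerical sketch (not from the text: the Gaussian noise model, the `sample_member` function, and the use of bootstrap index resampling to stand in for Bagging are all illustrative assumptions). Each "ensemble member" is an unbiased but noisy realisation of f; averaging more of them pulls the sample mean f¯ towards E{f}.

```python
import numpy as np

rng = np.random.default_rng(0)

# The true expectation E{f} that the ensemble average should recover.
true_target = 0.0

def sample_member():
    # One ensemble member: an unbiased, noisy realisation of f,
    # standing in for a network trained on a random dataset and
    # random weight initialisation (illustrative Gaussian noise model).
    return true_target + rng.normal(scale=1.0)

# Larger ensembles give a sample mean f_bar closer to E{f} on average.
for m in (2, 10, 1000):
    members = np.array([sample_member() for _ in range(m)])
    f_bar = members.mean()
    print(f"M={m:5d}  |f_bar - E{{f}}| = {abs(f_bar - true_target):.4f}")

# Bagging-style bootstrap resampling of a training set (indices only):
# sampling with replacement gives each member a different, overlapping
# view of the data, so the realisations sample more of the space.
data = np.arange(100)  # stand-in training set of 100 examples
bootstrap = rng.choice(data, size=data.size, replace=True)
print("distinct examples in one bootstrap sample:", np.unique(bootstrap).size)
```

Because each draw is sampled with replacement, a bootstrap set of the same size as the original typically contains only about 63% of the distinct examples, which is what gives the Bagging members their differing training sets.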