In effect, we are evaluating how well our learning
algorithm L moves the parameter vector toward the optimal vector, across all possible
training sets and random parameter initializations. Just as we approximated (2.5) with
(2.7), we can approximate the expected error at a single testing point x with an average
over several training sets and several random parameter initializations: