The two assumptions we need about the cost function
The goal of backpropagation is to compute the partial derivatives ∂C/∂w and ∂C/∂b of the cost function C with respect to any weight w or bias b in the network. For backpropagation to work we need to make two main assumptions about the form of the cost function. Before stating those assumptions, though, it's useful to have an example cost function in mind. We'll use the quadratic cost function from last chapter (c.f. Equation (6)). In the notation of the last section, the quadratic cost has the form
\begin{eqnarray}
  C = \frac{1}{2n} \sum_x \|y(x)-a^L(x)\|^2,
\tag{26}\end{eqnarray}
where: n is the total number of training examples; the sum is over individual training examples, x; y = y(x) is the corresponding desired output; L denotes the number of layers in the network; and a^L = a^L(x) is the vector of activations output from the network when x is input.
Okay, so what assumptions do we need to make about our cost function, C, in order that backpropagation can be applied? The first assumption we need is that the cost function can be written as an average C = (1/n) ∑_x C_x over cost functions C_x for individual training examples, x. This is the case for the quadratic cost function, where the cost for a single training example is C_x = ½‖y−a^L‖². This assumption will also hold true for all the other cost functions we'll meet in this book.
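To make the decomposition concrete, here is a minimal numpy sketch with made-up data: it computes the per-example cost C_x = ½‖y(x)−a^L(x)‖² and checks that averaging those costs is exactly the quadratic cost (26). The names and data are illustrative, not part of the book's code.

```python
import numpy as np

def per_example_cost(y, aL):
    """Quadratic cost C_x = 0.5 * ||y - aL||^2 for one training example."""
    return 0.5 * np.linalg.norm(y - aL)**2

# Made-up data: n = 4 training examples, each with a 3-dimensional output.
ys  = [np.random.rand(3) for _ in range(4)]   # desired outputs y(x)
aLs = [np.random.rand(3) for _ in range(4)]   # network outputs a^L(x)

# Total cost as the average of per-example costs, C = (1/n) sum_x C_x,
# which is the quadratic cost of Equation (26).
n = len(ys)
C = sum(per_example_cost(y, aL) for y, aL in zip(ys, aLs)) / n
print(C)
```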
The reason we need this assumption is because what backpropagation actually lets us do is compute the partial derivatives ∂C_x/∂w and ∂C_x/∂b for a single training example. We then recover ∂C/∂w and ∂C/∂b by averaging over training examples. In fact, with this assumption in mind, we'll suppose the training example x has been fixed, and drop the x subscript, writing the cost C_x as C. We'll eventually put the x back in, but for now it's a notational nuisance that is better left implicit.
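As a rough illustration of that averaging step, here is a sketch for an assumed toy model, a single sigmoid neuron with quadratic cost, where the per-example gradients can be written down directly. It is not the book's network code; the point is only the final averaging, which recovers ∂C/∂w and ∂C/∂b from the per-example quantities.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_example_grads(x, y, w, b):
    """dC_x/dw and dC_x/db for one sigmoid neuron with quadratic cost
    C_x = 0.5 * (y - a)^2, where a = sigmoid(w.x + b)."""
    a = sigmoid(np.dot(w, x) + b)
    delta = (a - y) * a * (1 - a)        # dC_x/dz
    return delta * x, delta              # dC_x/dw, dC_x/db

# Made-up training set: 5 examples with 2 inputs and 1 desired output each.
training_data = [(np.random.rand(2), np.random.rand()) for _ in range(5)]
w, b = np.random.rand(2), 0.1

# Recover dC/dw and dC/db by averaging the per-example gradients.
n = len(training_data)
grads = [per_example_grads(x, y, w, b) for x, y in training_data]
dC_dw = sum(g[0] for g in grads) / n
dC_db = sum(g[1] for g in grads) / n
```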
The second assumption we make about the cost is that it can be written as a function of the outputs from the neural network, that is, of the output activations a^L.
For example, the quadratic cost function satisfies this requirement, since the quadratic cost for a single training example x may be written as
\begin{eqnarray}
  C = \frac{1}{2} \|y-a^L\|^2 = \frac{1}{2} \sum_j (y_j-a^L_j)^2,
\tag{27}\end{eqnarray}
and thus is a function of the output activations. Of course, this cost function also depends on the desired output y, and you may wonder why we're not regarding the cost also as a function of y. Remember, though, that the input training example x is fixed, and so the output y is also a fixed parameter. In particular, it's not something we can modify by changing the weights and biases in any way, i.e., it's not something which the neural network learns. And so it makes sense to regard C as a function of the output activations a^L alone, with y merely a parameter that helps define that function.
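One way to see this viewpoint is to fix y and package the quadratic cost as a function of a^L alone. A small sketch, with an invented desired output:

```python
import numpy as np

def make_cost(y):
    """Fix the desired output y and return the quadratic cost as a function of
    the output activations aL alone; y is merely a parameter of that function."""
    def C(aL):
        return 0.5 * np.linalg.norm(y - aL)**2
    return C

y = np.array([0.0, 1.0, 0.0])            # fixed desired output for one example
C = make_cost(y)
print(C(np.array([0.1, 0.8, 0.2])))      # the cost varies with aL; y stays fixed
```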