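where the gradient $\nabla C$ is the vector
$$\nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \ldots, \frac{\partial C}{\partial v_m} \right)^T.$$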
Just as for the two-variable case, we can choose
$$\Delta v = -\eta \nabla C,$$
and we're guaranteed that our (approximate) expression (12) for $\Delta C$ will be negative. This gives us a way of following the gradient to a minimum, even when $C$ is a function of many variables, by repeatedly applying the update rule
$$v \rightarrow v' = v - \eta \nabla C.$$
You can think of this update rule as defining the gradient descent algorithm. It gives us a way of repeatedly changing the position $v$ in order to find a minimum of the function $C$. The rule doesn't always work: several things can go wrong and prevent gradient descent from finding the global minimum of $C$, a point we'll return to explore in later chapters. But in practice, gradient descent often works extremely well, and in neural networks we'll find that it's a powerful way of minimizing the cost function, and so helping the net learn.
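To make the update rule concrete, here is a minimal sketch in Python using NumPy. The function name `gradient_descent`, the example cost $C(v) = \sum_j v_j^2$, the starting point, and the values of `eta` and `steps` are all illustrative assumptions, not anything taken from the text.

```python
import numpy as np

def gradient_descent(grad_C, v0, eta=0.1, steps=100):
    """Repeatedly apply the update v -> v - eta * grad_C(v).

    grad_C: function returning the gradient of the cost at v
    v0:     starting position (any sequence of floats)
    eta:    learning rate (assumed small and positive)
    steps:  number of updates to apply
    """
    v = np.asarray(v0, dtype=float)
    for _ in range(steps):
        v = v - eta * grad_C(v)
    return v

# Hypothetical example cost: C(v) = sum of squares, minimized at v = 0.
def C(v):
    return float(np.sum(np.asarray(v, dtype=float) ** 2))

def grad_C(v):
    # The gradient of sum(v_j^2) is the vector 2 * v.
    return 2.0 * v

v_start = [3.0, -4.0, 1.0]
v_final = gradient_descent(grad_C, v_start, eta=0.1, steps=100)
print("cost before:", C(v_start))
print("cost after: ", C(v_final))
```

For this particular cost, each update shrinks every component of $v$ by the same constant factor, so the position moves steadily toward the minimum at the origin. For less well-behaved costs, or for a poorly chosen `eta`, the same loop may converge slowly, settle in a local minimum, or diverge.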
Indeed, there's even a sense in which gradient descent is the optimal strategy for searching for a minimum. Let's suppose that we're trying to make a move $\Delta v$ in position so as to decrease $C$ as much as possible. This is equivalent to minimizing $\Delta C \approx \nabla C \cdot \Delta v$. We'll constrain the size of the move so that $\|\Delta v\| = \epsilon$ for some small fixed $\epsilon > 0$. In other words, we want a move that is a small step of a fixed size, and we're trying to find the movement direction which decreases $C$ as much as possible. It can be proved that the choice of $\Delta v$ which minimizes $\nabla C \cdot \Delta v$ is $\Delta v = -\eta \nabla C$, where $\eta = \epsilon / \|\nabla C\|$ is determined by the size constraint $\|\Delta v\| = \epsilon$. So gradient descent can be viewed as a way of taking small steps in the direction which does the most to immediately decrease $C$.
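The proof is not given here. One standard route, sketched below, uses the Cauchy-Schwarz inequality. This is an assumption about which argument is intended, and it requires $\nabla C \neq 0$.

```latex
% Cauchy-Schwarz bounds the inner product for any move of size epsilon:
|\nabla C \cdot \Delta v| \;\le\; \|\nabla C\| \, \|\Delta v\| \;=\; \epsilon \, \|\nabla C\|
\quad\Longrightarrow\quad
\nabla C \cdot \Delta v \;\ge\; -\epsilon \, \|\nabla C\|.

% The choice \Delta v = -\eta \nabla C with \eta = \epsilon / \|\nabla C\| attains this lower bound:
\nabla C \cdot \left( -\frac{\epsilon}{\|\nabla C\|} \nabla C \right)
\;=\; -\epsilon \, \frac{\|\nabla C\|^2}{\|\nabla C\|}
\;=\; -\epsilon \, \|\nabla C\|.
```

Since no move with $\|\Delta v\| = \epsilon$ can make $\nabla C \cdot \Delta v$ smaller than $-\epsilon \|\nabla C\|$, and the step along $-\nabla C$ reaches exactly that value, it is a minimizer under the size constraint.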