Backpropagation
Although stochastic gradient descent operates well on its own to adjust parameters and minimize cost in many types of machine learning models, for deep learning models in particular there is an extra hurdle: We need to be able to efficiently adjust parameters through multiple layers of artificial neurons. To do this, stochastic gradient descent is partnered up with a technique called backpropagation.
Backpropagation—or backprop for short—is an elegant application of the “chain rule” from calculus.16 As shown along the bottom of Figure 8.6 and as suggested by its very name, backpropagation courses through a neural network in the opposite direction of forward propagation. Whereas forward propagation carries information about the input x through successive layers of neurons to approximate y with ŷ, backpropagation carries information about the cost C backwards through the layers in reverse order and, with the overarching aim of reducing cost, adjusts neuron parameters throughout the network.
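To make the chain rule concrete, here is a minimal sketch of a single sigmoid neuron; the toy values and names (x, y, w, b) are our own illustrative assumptions, not code from this book. The forward pass produces ŷ and the cost, and the backward pass splits the cost’s gradient with respect to the weight into three simple local derivatives:

import numpy as np

# Forward propagation through a single sigmoid neuron (toy values)
x, y = 1.5, 0.0                  # one input and its true label
w, b = 0.6, 0.1                  # the neuron's weight and bias
z = w * x + b                    # the "most important equation": w · x + b
y_hat = 1 / (1 + np.exp(-z))     # sigmoid activation, the neuron's estimate of y
C = (y_hat - y) ** 2             # squared-error cost of that estimate

# Backpropagation: the chain rule splits dC/dw into three local derivatives
dC_dyhat = 2 * (y_hat - y)       # how the cost changes with the activation
dyhat_dz = y_hat * (1 - y_hat)   # how the sigmoid output changes with its input z
dz_dw = x                        # how z = w*x + b changes with the weight w
dC_dw = dC_dyhat * dyhat_dz * dz_dw   # gradient of the cost with respect to w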
Although the nitty-gritty of backpropagation has been relegated to Appendix B, it’s worth understanding (in broad strokes) what the backpropagation algorithm does: Any given neural network model is randomly initialized with parameter (w and b) values (such initialization is detailed in Chapter 9). Thus, prior to any training, when the first x value is fed in, the network outputs a random guess at ŷ. This is unlikely to be a good guess, and the cost associated with this random guess will probably be high. At this point, we need to update the weights in order to minimize the cost—the very essence of machine learning. To do this within a neural network, we use backpropagation to calculate the gradient of the cost function with respect to each weight in the network.
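To picture that starting point, here is a rough sketch of our own (the two-input, one-output architecture, the names W1, b1, W2, b2, and the squared-error cost are all assumptions made purely for illustration): a randomly initialized network forward propagates a single x and produces an essentially arbitrary ŷ with, most likely, a high cost:

import numpy as np

np.random.seed(42)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Random initialization: 2 inputs -> 4 hidden sigmoid neurons -> 1 output neuron
W1, b1 = np.random.randn(4, 2), np.zeros(4)
W2, b2 = np.random.randn(1, 4), np.zeros(1)

# Forward propagation of a single training example
x, y = np.array([0.5, -1.2]), 1.0
h = sigmoid(W1 @ x + b1)             # hidden-layer activations
y_hat = sigmoid(W2 @ h + b2)         # the network's first, essentially random, guess at y
C = float((y_hat[0] - y) ** 2)       # the (probably high) cost of that guess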
Recall from our mountaineering analogies earlier that the cost function represents a hiking trail, and our trilobite is trying to reach basecamp. At each step along the way, the trilobite finds the gradient (or the slope) of the cost function and moves down that gradient. That movement corresponds to a weight update: By adjusting each weight in proportion to the cost function’s gradient with respect to that weight, we nudge that weight in the direction that reduces the cost.
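In symbols, with η denoting the learning rate, each of those steps is simply:

w ← w − η · ∂C/∂w

The partial derivative ∂C/∂w is exactly the per-weight gradient that backpropagation supplies. (In the single-neuron sketch above, this step would read w = w - eta * dC_dw, for some small illustrative learning-rate value eta.)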
Reflecting back on the “most important equation” from Figure 6.7 (w · x + b), and remembering that neural networks are stacks of layers through which information forward propagates, we can grasp that any given weight in the network contributes to the final ŷ output, and thus to the cost C. Using backpropagation, we move layer by layer backwards through the network, starting at the cost in the output layer, and we find the gradient of every single parameter. A given parameter’s gradient can then be used to adjust the parameter up or down (by an increment corresponding to the learning rate η), whichever of the two directions is associated with a reduction in cost.
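Putting the pieces together, here is a compact sketch of one full round of forward propagation, layer-by-layer backpropagation of gradients, and a gradient-descent update of every parameter. It reuses the toy two-layer network from above, with sigmoid activations, a squared-error cost, and a learning-rate value (eta) that are all our own illustrative choices rather than anything prescribed by the text:

import numpy as np

np.random.seed(42)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# The same toy network: 2 inputs -> 4 hidden sigmoid neurons -> 1 output neuron
W1, b1 = np.random.randn(4, 2), np.zeros(4)
W2, b2 = np.random.randn(1, 4), np.zeros(1)
x, y = np.array([0.5, -1.2]), 1.0
eta = 0.1                                        # learning rate

# Forward propagation: input -> hidden layer -> output layer -> cost
h = sigmoid(W1 @ x + b1)
y_hat = sigmoid(W2 @ h + b2)
C = float((y_hat[0] - y) ** 2)

# Backpropagation: start at the cost and work backwards, one layer at a time
delta2 = 2 * (y_hat - y) * y_hat * (1 - y_hat)   # dC/dz at the output layer
dW2 = np.outer(delta2, h)                        # gradients of the output-layer weights
db2 = delta2                                     # gradient of the output-layer bias

delta1 = (W2.T @ delta2) * h * (1 - h)           # error signal passed back through W2
dW1 = np.outer(delta1, x)                        # gradients of the hidden-layer weights
db1 = delta1                                     # gradients of the hidden-layer biases

# Gradient descent: nudge every parameter down its own gradient by a step of size eta
W2 -= eta * dW2
b2 -= eta * db2
W1 -= eta * dW1
b1 -= eta * db1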
We appreciate that this is not the lightest section of this book. If there’s only one thing you take away, let it be this: Backpropagation uses cost to calculate the relative contribution of every single parameter to the total cost, and then it updates each parameter accordingly. In this way, the network iteratively reduces cost and, well . . . learns!