- Programming Example: Moving to a DL Framework
- The Problem of Saturated Neurons and Vanishing Gradients
- Initialization and Normalization Techniques to Avoid Saturated Neurons
- Cross-Entropy Loss Function to Mitigate Effect of Saturated Output Neurons
- Different Activation Functions to Avoid Vanishing Gradient in Hidden Layers
- Variations on Gradient Descent to Improve Learning
- Experiment: Tweaking Network and Learning Parameters
- Hyperparameter Tuning and Cross-Validation
- Concluding Remarks on the Path Toward Deep Learning
Variations on Gradient Descent to Improve Learning
There are a number of variations on gradient descent that aim to enable better and faster learning. One such technique is momentum, where in addition to computing a new gradient every iteration, the new gradient is combined with the gradient from the previous iteration. This can be likened to a ball rolling down a hill, where the direction is determined not only by the slope at the current point but also by the momentum the ball has picked up from the slope at previous points. Momentum can enable faster convergence by taking a more direct path in cases where the gradient oscillates slightly from point to point. It can also help the algorithm escape a local minimum. One example of a momentum algorithm is Nesterov momentum (Nesterov, 1983).
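As a minimal sketch, assuming the same Keras version used in the code snippets later in this section, momentum is enabled through arguments to the optimizer; the value 0.9 is a common choice used here for illustration, not a prescription:

# Illustrative values: momentum=0.9 is a commonly used setting;
# nesterov=True switches from standard momentum to Nesterov momentum.
opt = keras.optimizers.SGD(lr=0.01, momentum=0.9, nesterov=True)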
Another variation is to use an adaptive learning rate instead of a fixed learning rate, as we have used previously. The learning rate then adapts over time on the basis of historical values of the gradient. Two algorithms that use an adaptive learning rate are adaptive gradient, known as AdaGrad (Duchi, Hazan, and Singer, 2011), and RMSProp (Hinton, n.d.). Finally, adaptive moments, known as Adam (Kingma and Ba, 2015), combines an adaptive learning rate with momentum. Although these algorithms adaptively modify the learning rate, we still have to set an initial learning rate. They also introduce a number of additional parameters that control how they behave, so we now have even more parameters to tune for our model. However, in many cases, the default values work well.
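To make the notion of a history-based learning rate concrete, here is a sketch of the AdaGrad update in standard notation (the symbols follow the original paper and are assumptions of this sketch, not notation established earlier in the chapter). Each weight accumulates the sum of its squared gradients, and its effective step size shrinks as that sum grows:

r_t = r_{t-1} + g_t \odot g_t, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{r_t} + \epsilon} \odot g_t

where g_t is the gradient at iteration t, \eta is the initial learning rate, \epsilon is a small constant for numerical stability, and \odot denotes elementwise multiplication.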
Finally, we discussed earlier how to avoid vanishing gradients, but there can also be a problem with exploding gradients, where the gradient becomes too large at some point, resulting in a huge step size. This can cause weight updates that completely throw off the model. Gradient clipping is a technique to avoid exploding gradients by simply capping the magnitude of the gradient in the weight update step. Gradient clipping is available for all optimizers in Keras.
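As a minimal sketch, assuming the same Keras API as in the snippets below, clipping is enabled through the clipnorm or clipvalue arguments that Keras optimizers accept; the specific thresholds here are illustrative:

# Rescale the gradient so that its L2 norm is at most 1.0.
opt = keras.optimizers.SGD(lr=0.01, clipnorm=1.0)
# Alternatively, clip each gradient component to the range [-0.5, 0.5].
opt = keras.optimizers.SGD(lr=0.01, clipvalue=0.5)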
Code Snippet 5-11 shows how we set an optimizer for our model in Keras. The example shows stochastic gradient descent with a learning rate of 0.01 and no other bells and whistles.
Code Snippet 5-11 Setting an Optimizer for the Model
opt = keras.optimizers.SGD(lr=0.01, momentum=0.0, decay=0.0, nesterov=False)
model.compile(loss='mean_squared_error', optimizer=opt, metrics=['accuracy'])
Just as we can for initializers, we can choose a different optimizer by instantiating any one of the optimizers supported in TensorFlow, such as the three we just described:
opt = keras.optimizers.Adagrad(lr=0.01, epsilon=None)
opt = keras.optimizers.RMSprop(lr=0.001, rho=0.8, epsilon=None)
opt = keras.optimizers.Adam(lr=0.01, epsilon=0.1, decay=0.0)
In the example, we freely modified some of the arguments and left out others, which then take on their default values. If we do not feel the need to modify the defaults, we can simply pass the name of the optimizer as a string to the model compile function, as in Code Snippet 5-12.
Code Snippet 5-12 Passing the Optimizer as a String to the Compile Function
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
We now do an experiment in which we apply some of these techniques to our neural network.