- Cost Functions
- Optimization: Learning to Minimize Cost
- Backpropagation
- Tuning Hidden-Layer Count and Neuron Count
- An Intermediate Net in Keras
- Summary
- Key Concepts
An Intermediate Net in Keras
To wrap up this chapter, let’s incorporate the new theory we’ve covered into a neural network to see if we can outperform our previous Shallow Net in Keras model at classifying handwritten digits.
The first few stages of our Intermediate Net in Keras Jupyter notebook are identical to those of its Shallow Net predecessor. We load the same Keras dependencies, load the MNIST dataset in the same way, and preprocess the data in the same way. As shown in Example 8.1, the situation begins to get interesting when we design our neural network architecture.
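For readers without the earlier notebook at hand, those identical opening stages look roughly like the following recap sketch before we turn to the new architecture code. The exact cells live in the Shallow Net in Keras notebook; this sketch simply loads the dependencies the rest of this section relies on (Sequential, Dense, and the SGD optimizer), fetches MNIST, flattens each 28×28 image into a 784-element vector of pixel values scaled to the range [0, 1], and one-hot encodes the ten digit labels.
# Recap sketch of the shared setup cells; see the Shallow Net in Keras notebook for the originals
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras.utils import to_categorical

# Load MNIST, keeping the 10,000 held-out images for validation
(X_train, y_train), (X_valid, y_valid) = mnist.load_data()

# Flatten 28x28 images into 784-element vectors and scale pixel intensities to [0, 1]
X_train = X_train.reshape(60000, 784).astype('float32') / 255
X_valid = X_valid.reshape(10000, 784).astype('float32') / 255

# One-hot encode the ten digit classes
n_classes = 10
y_train = to_categorical(y_train, n_classes)
y_valid = to_categorical(y_valid, n_classes)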
Example 8.1 Keras code to architect an intermediate-depth neural network
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(784,)))
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))
The first line of this code chunk, model = Sequential(), is the same as before (refer to Example 5.6); this is our instantiation of a neural network model object. It's in the second line that we begin to diverge: There, we specify that we'll replace the sigmoid activation function in the first hidden layer with our most highly recommended neuron from Chapter 6, the relu. Other than this activation-function swap, the first hidden layer remains the same: It still consists of 64 neurons, and the dimensionality of the 784-neuron input layer is unchanged.
The other significant change in Example 8.1 relative to the shallow architecture of Example 5.6 is that we specify a second hidden layer of artificial neurons. By calling the model.add() method, we nearly effortlessly add a second Dense layer of 64 relu neurons, providing us with the notebook’s namesake: an intermediate-depth neural network. With a call to model.summary(), you can see from Figure 8.9 that this additional layer corresponds to an additional 4,160 trainable parameters relative to our shallow architecture (refer to Figure 7.5). We can break these parameters down into:
4,096 weights, corresponding to each of the 64 neurons in the second hidden layer densely receiving input from each of the 64 neurons in the first hidden layer (64 × 64 = 4,096)
Plus 64 biases, one for each of the neurons in the second hidden layer
Giving us a total of 4,160 parameters: n_parameters = n_w + n_b = 4,096 + 64 = 4,160
FIGURE 8.9 A summary of the model object from our Intermediate Net in Keras Jupyter notebook
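As a quick sanity check on that arithmetic (an illustrative calculation, not a cell from the notebook), you can reproduce the per-layer parameter counts reported by model.summary() directly from the layer sizes: each Dense layer has one weight per input-neuron pair plus one bias per neuron.
# Parameters in a Dense layer: (inputs x neurons) weights plus one bias per neuron
def dense_params(n_inputs, n_neurons):
    return n_inputs * n_neurons + n_neurons

print(dense_params(784, 64))  # first hidden layer: 50,240
print(dense_params(64, 64))   # second hidden layer: 4,096 + 64 = 4,160
print(dense_params(64, 10))   # softmax output layer: 650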
In addition to changes to the model architecture, we’ve also made changes to the parameters we specify when compiling our model, as shown in Example 8.2.
Example 8.2 Keras code to compile our intermediate-depth neural network
model.compile(loss='categorical_crossentropy',
              optimizer=SGD(lr=0.1),
              metrics=['accuracy'])
With these lines from Example 8.2, we:
Set our loss function to cross-entropy cost by using loss='categorical_crossentropy'; in Shallow Net in Keras, we used quadratic cost by using loss='mean_squared_error'. (A brief sketch of what cross-entropy computes follows this list.)
Set our cost-minimizing method to stochastic gradient descent by using optimizer=SGD
Specify our SGD learning rate hyperparameter η by setting lr=0.1
Indicate that, by setting metrics=['accuracy'], we'd also like to receive feedback on model accuracy in addition to the Keras default of providing feedback on loss
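To make the first of those bullet points concrete, here is a minimal NumPy illustration (not from the notebook; the probability values are made up) of what categorical cross-entropy computes for a single one-hot-encoded label. The cost reduces to the negative log of the probability that the softmax layer assigns to the correct class, so a confident, correct prediction is cheap while a confident, incorrect one is expensive.
import numpy as np

# Hypothetical softmax output for one MNIST digit, and its one-hot label (true class: 4)
y_hat = np.array([0.02, 0.01, 0.03, 0.05, 0.70, 0.04, 0.05, 0.03, 0.04, 0.03])
y_true = np.array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0])

# Categorical cross-entropy: -sum(y_true * log(y_hat)), which here is just -log(0.70)
cost = -np.sum(y_true * np.log(y_hat))
print(cost)  # ~0.357; had the correct class received only 0.10 probability, the cost would be ~2.3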
Finally, we train our intermediate net by running the code in Example 8.3.
Example 8.3 Keras code to train our intermediate-depth neural network
model.fit(X_train, y_train,
          batch_size=128, epochs=20,
          verbose=1,
          validation_data=(X_valid, y_valid))
Relative to the way we trained our shallow net (see Example 5.7), the only change we’ve made is reducing our epochs hyperparameter from 200 down by an order of magnitude to 20. As you’ll see, our much-more-efficient intermediate architecture required far fewer epochs to train.
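One detail worth keeping in mind before reading the training output: because fit() processes the 60,000 training images in mini-batches of 128, each epoch consists of ⌈60,000/128⌉ = 469 rounds of training, a figure that reappears in the output breakdown below. Here is a one-line check of that arithmetic (illustrative only):
import math

# Rounds (mini-batches) per epoch: training-set size divided by batch size, rounded up
print(math.ceil(60000 / 128))  # 469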
Figure 8.10 provides the results of the first four epochs of training the network. Recalling that our shallow architecture plateaued as it approached 86 percent accuracy on the validation dataset after 200 epochs, our intermediate-depth network is clearly superior: The val_acc field shows that we attained 92.34 percent accuracy after a single epoch of training. This accuracy climbs to more than 95 percent by the third epoch and appears to plateau around 97.6 percent by the twentieth. My, how far we’ve come already!
FIGURE 8.10 The performance of our intermediate-depth neural network over its first four epochs of training
Let’s break down the verbose model.fit() output shown in Figure 8.10 in further detail:
The progress bar shown next fills in over the course of the 469 “rounds of training” (Figure 8.5):
60000/60000 [==============================]
1s 15us/step indicates that all 469 rounds of training in the first epoch required a total of about 1 second, with the 15us/step figure corresponding to an average of 15 microseconds per training sample (60,000 samples × 15 μs ≈ 0.9 s).
loss shows the average cost on our training data for the epoch. For the first epoch this is 0.4744, and, epoch over epoch, this cost is reliably minimized via stochastic gradient descent (SGD) and backpropagation, eventually diminishing to 0.0332 by the twentieth epoch.
acc is the classification accuracy on training data for the epoch. The model correctly classified 86.37 percent for the first epoch, increasing to more than 99 percent by the twentieth. Because a model can overfit to the training data, one shouldn’t be overly impressed by high accuracy on the training data.
Thankfully, our cost on the validation dataset (val_loss) does generally decrease as well, eventually plateauing as it approaches 0.08 over the final five epochs of training.
Corresponding to the decreasing cost of the validation data is an increase in accuracy (val_acc). As mentioned, validation accuracy plateaued at about 97.6 percent, which is a vast improvement over the 86 percent of our shallow net.
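If you'd like to confirm the final validation performance yourself rather than reading it off the last line of the training log, one convenient option is Keras's evaluate() method, which recomputes cost and accuracy on the validation data in a single call; a usage sketch follows, and your exact numbers will vary a little from run to run.
# Recompute validation cost and accuracy for the trained model
val_loss, val_acc = model.evaluate(X_valid, y_valid, verbose=0)
print(val_loss, val_acc)  # expect roughly 0.08 and 0.976, in line with the training log above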