These are my notes from the book Grokking Deep Learning by Andrew Trask. Feel free to check my first post on this book to get my overall thoughts and recommendations on how to approach this series. The rest of my notes for this book can be found here.
Regularization and batching
This chapter focuses on making the network home in on the signal and ignore the noise. Key concepts:
- Batch gradient descent
An example of overfitting
Let’s train a three-layer network on the MNIST dataset. We’ll use this dataset to produce an example of overfitting, and then apply regularization to combat it.
import sys
import numpy as np
from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
images, labels = (x_train[0:1000].reshape(1000,28*28) / 255, y_train[0:1000])
one_hot_labels = np.zeros((len(labels), 10))
for i,l in enumerate(labels):
    one_hot_labels[i][l] = 1
labels = one_hot_labels
array([[0., 0., 0., ..., 0., 0., 0.], [1., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], ..., [1., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.]])
test_images = x_test.reshape(len(x_test), 28*28) / 255
test_labels = np.zeros((len(y_test), 10))
for i,l in enumerate(y_test):
    test_labels[i][l] = 1
np.random.seed(1)
relu = lambda x:(x>=0) * x # returns x if x > 0, return 0 otherwise
relu2deriv = lambda x: x>=0 # returns 1 for input > 0, return 0 otherwise
alpha, iterations, hidden_size, pixels_per_image, num_labels = (0.005, 350, 40, 784, 10)

weights_0_1 = 0.2*np.random.random((pixels_per_image,hidden_size)) - 0.1
weights_1_2 = 0.2*np.random.random((hidden_size,num_labels)) - 0.1

for j in range(iterations):
    error, correct_cnt = (0.0, 0)

    for i in range(len(images)):
        layer_0 = images[i:i+1]
        layer_1 = relu(np.dot(layer_0,weights_0_1))
        layer_2 = np.dot(layer_1,weights_1_2)

        error += np.sum((labels[i:i+1] - layer_2) ** 2)
        correct_cnt += int(np.argmax(layer_2) == \
                           np.argmax(labels[i:i+1]))

        layer_2_delta = (labels[i:i+1] - layer_2)
        layer_1_delta = layer_2_delta.dot(weights_1_2.T)\
                        * relu2deriv(layer_1)
        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

    sys.stdout.write("\r I:"+str(j)+ \
                     " Train-Err:" + str(error/float(len(images)))[0:5] +\
                     " Train-Acc:" + str(correct_cnt/float(len(images))))
I:349 Train-Err:0.108 Train-Acc:1.099
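The neural network above learned to predict all 1,000 training images perfectly. (The trailing "99" in the printed accuracy is left over from an earlier, longer line that the carriage return did not fully overwrite; the actual value is 1.0.)
The training accuracy above shows 100%. Unfortunately, this is likely due to overfitting. If we run the network against test_images and test_labels, we can see how it performs on images it has never seen before.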
    # This block sits inside the `for j in range(iterations)` loop above,
    # after the pass over the training images.
    if(j % 10 == 0 or j == iterations-1):
        error, correct_cnt = (0.0, 0)

        for i in range(len(test_images)):
            layer_0 = test_images[i:i+1]
            layer_1 = relu(np.dot(layer_0,weights_0_1))
            layer_2 = np.dot(layer_1,weights_1_2)

            error += np.sum((test_labels[i:i+1] - layer_2) ** 2)
            correct_cnt += int(np.argmax(layer_2) == \
                               np.argmax(test_labels[i:i+1]))

        sys.stdout.write(" Test-Err:" + str(error/float(len(test_images)))[0:5] +\
                         " Test-Acc:" + str(correct_cnt/float(len(test_images))) + "\n")
        print()
The network reached an accuracy of only 70% on the test set. This test accuracy is important because it simulates how well the network will perform in the real world. So why did the network do so well on the training set, but so terribly on the test set?
Memorization vs. generalization
Neural networks are only useful if they generalize. If the network overfits (meaning it is trained to the point where it exactly matches the training data), then it is basically memorizing the pre-labeled images. This makes it kind of pointless, because we already know the labels of those images. We want the neural network to be general enough that it can predict images it has not seen before.
Overfitting in neural networks
Neural networks can get worse if you train them too much.
Fork mold example:
- Say we are creating a mold for a dinner fork as a tool to determine whether a particular utensil is a fork.
- If an object fits in the mold, then we say it’s a fork
- Start with clay, and a bucket of 3-pronged forks, spoons, and knives
- Press each fork into the same place to create an outline
- Let the mold dry. None of the knives or spoons fit. Only 3-pronged forks fit.
- What happens if you try a 4-pronged fork?
- It won’t fit… even though it’s a fork. The mold only has 3 prongs.
- The mold has been overfit to 3-pronged forks!
What causes networks to overfit?
In the fork example, what if we only pushed in 1 or 2 forks? Assuming the clay was very thick, the mold wouldn’t have much detail, just the general shape of a fork. This shape might be compatible with both 3- and 4-pronged forks.
The mold got worse on the test data as more forks were imprinted, because it learned detailed information that was too specific to the forks being used (the training set). In this case, it was the number of prongs. In images, this kind of overly specific detail is generally referred to as noise. How do we get a neural network to train only on the signal (the shape of the fork), and not the noise (the prongs)?
Simplest regularization: Early Stopping
Stop training the network when it starts getting worse! Early stopping is the cheapest form of regularization.
Regularization is a way of getting models to generalize to new datapoints as opposed to just memorizing the training data. It helps neural networks learn the signal and ignore the noise.
The only real way to know when to stop training is to run the model on a validation set. Don’t use the test set for this, because the network may end up overfitting to the test set.
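One way to picture the stopping rule is the small sketch below. It is not from the book: the function name `early_stopping_epoch`, the `patience` parameter, and the list of validation errors are all made up for illustration. It walks through per-epoch validation errors and stops once the error has failed to improve for a few epochs in a row.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return (best_epoch, best_error), stopping once the validation error
    has not improved for `patience` consecutive epochs.

    `val_errors` is a hypothetical list of validation errors, one per epoch.
    """
    best_err = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0

    for epoch, err in enumerate(val_errors):
        if err < best_err:
            # New best: remember it and reset the counter.
            best_err = err
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation error keeps getting worse, stop training

    return best_epoch, best_err


# Made-up validation errors: they improve, then rise as overfitting sets in.
val_errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55, 0.38]
best_epoch, best_err = early_stopping_epoch(val_errors, patience=3)
print(best_epoch, best_err)
```

In a real training loop you would also keep a copy of the weights from the best epoch, so that you can roll back to them after stopping.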
Industry standard regularization: Dropout
During training, randomly set neurons in the network to 0. This forces the network to train using only random subsections of itself.
The smaller the network, the less it’s able to overfit. Going back to the clay example, imagine clay made of very fine-grained sand vs. larger rocks. The larger rocks would not be able to express the same amount of detail as the fine-grained sand. Larger networks are like fine-grained sand: they have more room, or capacity, to capture detail.
Randomly turning off nodes makes a big network behave like a small one during training, while the network as a whole still keeps its full expressive power!
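Here is a minimal sketch of what "randomly set neurons to 0" can look like on a hidden layer. The helper `dropout` and the `keep_prob` parameter are my own names, and the stand-in values for `layer_1` and `layer_1_delta` are made up. It rescales the surviving activations by 1/keep_prob so the layer's expected output stays the same, which is one common way to do it and may differ in detail from the book's code.

```python
import numpy as np

np.random.seed(1)

def dropout(layer, keep_prob=0.5):
    """Zero each activation with probability 1 - keep_prob, rescale the rest."""
    mask = np.random.random(layer.shape) < keep_prob  # True = neuron is kept
    # Dividing by keep_prob keeps the expected activation unchanged.
    return layer * mask / keep_prob, mask

layer_1 = np.ones((1, 8))                  # stand-in hidden activations
layer_1_dropped, mask = dropout(layer_1)   # forward pass during training

layer_1_delta = np.full((1, 8), 0.1)       # stand-in backpropagated delta
layer_1_delta = layer_1_delta * mask       # dropped neurons get no gradient

print(layer_1_dropped)
print(layer_1_delta)
```

The mask is only applied during training. When evaluating on test images, the full network is used with no neurons turned off.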
Why dropout works
If you train 100 randomly initialized neural networks, they will each latch onto different noise but similar signal. When they make mistakes, the mistakes will differ. Averaged together, their noise tends to cancel out, revealing only what they all learned in common: the signal. Dropout approximates this ensemble by training a different random sub-network on each step.
- It’s likely that large unregularized networks will overfit to noise, but it’s unlikely that they will all overfit to the same noise.
- Neural networks start by learning the biggest, most broadly sweeping features before learning much about noise.
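The "noise cancels out" idea can be illustrated with a toy simulation. This is only a sketch under a strong assumption: each "network" is modeled as the shared signal plus its own independent Gaussian noise, which real networks only approximate. The names and numbers are made up.

```python
import random

random.seed(0)

signal = 1.0      # what every model should predict
n_models = 100

# Each model's prediction = shared signal + its own random noise (assumption).
predictions = [signal + random.gauss(0, 0.5) for _ in range(n_models)]

# Average error of the individual models.
avg_individual_error = sum(abs(p - signal) for p in predictions) / n_models

# Error of the averaged (ensemble) prediction.
ensemble_prediction = sum(predictions) / n_models
ensemble_error = abs(ensemble_prediction - signal)

print("average individual error:", avg_individual_error)
print("ensemble error:          ", ensemble_error)
```

Because the individual errors point in different directions, the error of the average can never be larger than the average of the errors, and with independent noise it is usually much smaller.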
Batch gradient descent
Batch gradient descent is a method for increasing the speed of training and the rate of convergence.
Rather than training on one example at a time and updating the weights after each example, we train on 100 examples at a time and average the weight updates across all 100 examples.
Individual training examples are very noisy in terms of the weight updates they generate. Averaging over a batch smooths this out.
Running in batches is also much faster. Each np.dot call now performs 100 vector dot products at a time.
Batch gradient descent also allows for higher learning rates (alpha): because each update is an average over a noisy signal, it is more reliable, so the network can take bigger steps.
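To convince myself that the batched version really is "the average of 100 individual updates", here is a small check. It uses a single linear layer and random stand-in data (both are simplifying assumptions, not the chapter's MNIST network) and compares the one-at-a-time average with the update computed in one batched np.dot.

```python
import numpy as np

np.random.seed(1)

batch_size, n_inputs, n_outputs = 100, 784, 10
batch = np.random.random((batch_size, n_inputs))      # stand-in images
targets = np.random.random((batch_size, n_outputs))   # stand-in labels
weights = 0.2 * np.random.random((n_inputs, n_outputs)) - 0.1

# One example at a time: compute each weight update, then average them.
per_example_updates = []
for i in range(batch_size):
    layer_0 = batch[i:i+1]
    delta = targets[i:i+1] - np.dot(layer_0, weights)
    per_example_updates.append(layer_0.T.dot(delta))
averaged_update = np.mean(per_example_updates, axis=0)

# Whole batch at once: one np.dot handles all 100 examples,
# and dividing the delta by batch_size does the averaging.
batch_delta = (targets - np.dot(batch, weights)) / batch_size
batched_update = batch.T.dot(batch_delta)

print(np.allclose(averaged_update, batched_update))
```

In the training loop, this amounts to slicing `images[batch_start:batch_end]` instead of `images[i:i+1]` and dividing `layer_2_delta` by the batch size before backpropagating.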