Blog by Jason Brownlee. Jason started this blog because he is passionate about helping professional developers to get started and confidently apply machine learning to address complex problems.
Training a neural network with a small dataset can cause the network to memorize all training examples, in turn leading to poor performance on a holdout dataset.
Small datasets may also represent a harder mapping problem for neural networks to learn, given the patchy or sparse sampling of points in the high-dimensional input space.
One approach to making the input space smoother and easier to learn is to add noise to inputs during training.
In this post, you will discover that adding noise to a neural network during training can improve the robustness of the network, resulting in better generalization and faster learning.
After reading this post, you will know:
Small datasets can make learning challenging for neural networks, and the training examples may simply be memorized.
Adding noise during training can make the training process more robust and reduce generalization error.
Noise is traditionally added to the inputs, but can also be added to weights, gradients, and even activation functions.
Let’s get started.
Train Neural Networks With Noise to Reduce Overfitting Photo by John Flannery, some rights reserved.
Overview
This tutorial is divided into five parts; they are:
Challenge of Small Training Datasets
Add Random Noise During Training
How and Where to Add Noise
Examples of Adding Noise During Training
Tips for Adding Noise During Training
Challenge of Small Training Datasets
Small datasets can introduce problems when training large neural networks.
The first problem is that the network may effectively memorize the training dataset. Instead of learning a general mapping from inputs to outputs, the model may learn the specific input examples and their associated outputs. This will result in a model that performs well on the training dataset, and poor on new data, such as a holdout dataset.
The second problem is that a small dataset provides less opportunity to describe the structure of the input space and its relationship to the output. More training data provides a richer description of the problem from which the model may learn. Fewer data points means that rather than a smooth input space, the points may represent a jarring and disjointed structure that may result in a difficult, if not unlearnable, mapping function.
It is not always possible to acquire more data. Further, even acquiring more data may not address these problems.
Add Random Noise During Training
One approach to improving generalization error and to improving the structure of the mapping problem is to add random noise.
Many studies […] have noted that adding small amounts of input noise (jitter) to the training data often aids generalization and fault tolerance.
At first, this sounds like a recipe for making learning more challenging. It is a counterintuitive suggestion for improving performance because one would expect noise to degrade the performance of the model during training.
Heuristically, we might expect that the noise will ‘smear out’ each data point and make it difficult for the network to fit individual data points precisely, and hence will reduce over-fitting. In practice, it has been demonstrated that training with noise can indeed lead to improvements in network generalization.
The addition of noise during the training of a neural network model has a regularization effect and, in turn, improves the robustness of the model. It has been shown to have a similar impact on the loss function as the addition of a penalty term, as in the case of weight regularization methods.
It is well known that the addition of noise to the input data of a neural network during training can, in some circumstances, lead to significant improvements in generalization performance. Previous work has shown that such training with noise is equivalent to a form of regularization in which an extra term is added to the error function.
In effect, adding noise expands the size of the training dataset. Each time a training sample is exposed to the model, random noise is added to the input variables, making the sample different every time. In this way, adding noise to input samples is a simple form of data augmentation.
Injecting noise in the input to a neural network can also be seen as a form of data augmentation.
Adding noise means that the network is less able to memorize training samples because they are changing all of the time, resulting in smaller network weights and a more robust network that has lower generalization error.
The noise means that it is as though new samples are being drawn from the domain in the vicinity of known samples, smoothing the structure of the input space. This smoothing may mean that the mapping function is easier for the network to learn, resulting in better and faster learning.
… input noise and weight noise encourage the neural-network output to be a smooth function of the input or its weights, respectively.
The most common type of noise used during training is the addition of Gaussian noise to input variables.
Gaussian noise, or white noise, has a mean of zero and a standard deviation of one and can be generated as needed using a pseudorandom number generator. The addition of Gaussian noise to the inputs to a neural network was traditionally referred to as “jitter” or “random jitter” after the use of the term in signal processing to refer to the uncorrelated random noise in electrical circuits.
The amount of noise added (e.g. the spread or standard deviation) is a configurable hyperparameter. Too little noise has no effect, whereas too much noise makes the mapping function too challenging to learn.
This is generally done by adding a random vector onto each input pattern before it is presented to the network, so that, if the patterns are being recycled, a different random vector is added each time.
The standard deviation of the random noise controls the amount of spread and can be adjusted based on the scale of each input variable. It can be easier to configure if the scale of the input variables has first been normalized.
Noise is only added during training. No noise is added during the evaluation of the model or when the model is used to make predictions on new data.
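The pattern described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a specific library API; the function and parameter names are made up for the example. Note the training flag: noise is injected only during training, and inputs pass through unchanged at evaluation or prediction time.

```python
import numpy as np

def add_input_jitter(X, stddev, rng, training=True):
    # Add zero-mean Gaussian noise ("jitter") to the inputs, but only
    # during training; at evaluation/prediction time, return X unchanged.
    if not training or stddev <= 0.0:
        return X
    return X + rng.normal(loc=0.0, scale=stddev, size=X.shape)

rng = np.random.default_rng(1)
X = np.zeros((4, 2))
noisy = add_input_jitter(X, stddev=0.1, rng=rng)                   # fresh noise every call
clean = add_input_jitter(X, stddev=0.1, rng=rng, training=False)   # unchanged
```

Because a fresh noise vector is drawn each time, recycling the same training sample across epochs presents the network with a slightly different input each time, which is the data-augmentation effect described above.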
The addition of noise is also an important part of automatic feature learning, as in the case of so-called denoising autoencoders, which explicitly require models to learn robust features in the presence of noise added to the inputs.
We have seen that the reconstruction criterion alone is unable to guarantee the extraction of useful features as it can lead to the obvious solution “simply copy the input” or similarly uninteresting ones that trivially maximizes mutual information. […] we change the reconstruction criterion for a both more challenging and more interesting objective: cleaning partially corrupted input, or in short denoising.
Although additional noise to the inputs is the most common and widely studied approach, random noise can be added to other parts of the network during training. Some examples include:
Add noise to activations, i.e. the outputs of each layer.
Add noise to weights, i.e. as an alternative to adding noise to the inputs.
Add noise to the gradients, i.e. the direction to update weights.
Add noise to the outputs, i.e. the labels or target variables.
The addition of noise to the layer activations allows noise to be used at any point in the network. This can be beneficial for very deep networks. Noise can be added to the layer outputs themselves, but this is more likely achieved via the use of a noisy activation function.
The addition of noise to weights allows the approach to be used throughout the network in a consistent way instead of adding noise to inputs and layer activations. This is particularly useful in recurrent neural networks.
Another way that noise has been used in the service of regularizing models is by adding it to the weights. This technique has been used primarily in the context of recurrent neural networks. […] Noise applied to the weights can also be interpreted as equivalent (under some assumptions) to a more traditional form of regularization, encouraging stability of the function to be learned.
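Weight noise can be sketched in the same style as input jitter: draw one fresh Gaussian perturbation per weight array (e.g. once per training sequence, as in the LSTM work discussed later in this post). This is an illustrative sketch, not a specific framework's API.

```python
import numpy as np

def perturb_weights(weights, stddev, rng):
    # Draw fresh zero-mean Gaussian noise for each weight array.
    # With stddev=0.0 the weights are returned unchanged.
    return [w + rng.normal(0.0, stddev, size=w.shape) for w in weights]

rng = np.random.default_rng(1)
weights = [np.ones((3, 2)), np.zeros(2)]   # toy layer weights and biases
noisy = perturb_weights(weights, stddev=0.01, rng=rng)
```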
The addition of noise to gradients focuses more on improving the robustness of the optimization process itself rather than the structure of the input domain. The amount of noise can start high at the beginning of training and decrease over time, much like a decaying learning rate. This approach has proven to be an effective method for very deep networks and for a variety of different network types.
We consistently see improvement from injected gradient noise when optimizing a wide variety of models, including very deep fully-connected networks, and special-purpose architectures for question answering and algorithm learning. […] Our experiments indicate that adding annealed Gaussian noise by decaying the variance works better than using fixed Gaussian noise
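The annealing idea in the quote above can be sketched as a noise variance that decays with the training step, e.g. as eta / (1 + step)^gamma. The schedule form and default constants here are assumptions for illustration, not prescriptions.

```python
import numpy as np

def annealed_stddev(step, eta=1.0, gamma=0.55):
    # Noise variance decays as eta / (1 + step)**gamma, so the standard
    # deviation is its square root: high early in training, shrinking over time.
    return (eta / (1.0 + step) ** gamma) ** 0.5

def noisy_gradient(grad, step, rng, eta=1.0, gamma=0.55):
    # Add annealed Gaussian noise to a gradient array before the update.
    return grad + rng.normal(0.0, annealed_stddev(step, eta, gamma), size=grad.shape)

rng = np.random.default_rng(0)
g = noisy_gradient(np.zeros(3), step=10, rng=rng)
```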
Adding noise to the activations, weights, or gradients all provide a more generic approach to adding noise that is invariant to the types of input variables provided to the model.
If the problem domain is believed or expected to have mislabeled examples, then the addition of noise to the class label can improve the model’s robustness to this type of error. However, this can easily derail the learning process.
Adding noise to a continuous target variable in the case of regression or time series forecasting is much like the addition of noise to the input variables and may be a better use case.
Examples of Adding Noise During Training
This section summarizes some examples where the addition of noise during training has been used.
Lasse Holmström studied the addition of random noise both analytically and experimentally with MLPs in the 1992 paper titled “Using Additive Noise in Back-Propagation Training.” They recommend first standardizing input variables and then using cross-validation to choose the amount of noise to use during training.
If a single general-purpose noise design method should be suggested, we would pick maximizing the cross-validated likelihood function. This method is easy to implement, is completely data-driven, and has a validity that is supported by theoretical consistency results
Klaus Greff, et al. in their 2016 paper titled “LSTM: A Search Space Odyssey” used a hyperparameter search for the standard deviation for Gaussian noise on the input variables for a suite of sequence prediction tasks and found that it almost universally resulted in worse performance.
Additive Gaussian noise on the inputs, a traditional regularizer for neural networks, has been used for LSTM as well. However, we find that not only does it almost always hurt performance, it also slightly increases training times.
Alex Graves, et al., in their groundbreaking 2013 paper titled “Speech recognition with deep recurrent neural networks,” which achieved then state-of-the-art results for speech recognition, added noise to the weights of LSTMs during training.
… weight noise [was used] (the addition of Gaussian noise to the network weights during training). Weight noise was added once per training sequence, rather than at every timestep. Weight noise tends to ‘simplify’ neural networks, in the sense of reducing the amount of information required to transmit the parameters, which improves generalisation.
In a prior 2011 paper that studies different types of static and adaptive weight noise titled “Practical Variational Inference for Neural Networks,” Graves recommends using early stopping in conjunction with the addition of weight noise with LSTMs.
… in practice early stopping is required to prevent overfitting when training with weight noise.
Tips for Adding Noise During Training
This section provides some tips for adding noise during training with your neural network.
Problem Types for Adding Noise
Noise can be added to training regardless of the type of problem that is being addressed.
It is appropriate to try adding noise to both classification and regression type problems.
The type of noise can be specialized to the types of data used as input to the model, for example, two-dimensional noise in the case of images and signal noise in the case of audio data.
Add Noise to Different Network Types
Adding noise during training is a generic method that can be used regardless of the type of neural network that is being used.
It was a method used primarily with multilayer Perceptrons given their prior dominance, but can be and is used with Convolutional and Recurrent Neural Networks.
Rescale Data First
It is important that the addition of noise has a consistent effect on the model.
This requires that the input data is rescaled so that all variables have the same scale, so that when noise is added to the inputs with a fixed variance, it has the same effect. This also applies to adding noise to weights and gradients as they too are affected by the scale of the inputs.
This can be achieved via standardization or normalization of input variables.
If random noise is added after data scaling, then the variables may need to be rescaled again, perhaps per mini-batch.
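A minimal sketch of this ordering, standardizing each mini-batch before adding noise. This is a simplification for illustration: in practice, the scaling statistics would usually be computed once from the whole training set rather than per batch.

```python
import numpy as np

def standardize_then_jitter(X_batch, stddev, rng):
    # Standardize so every variable has roughly zero mean and unit variance,
    # then add Gaussian noise with a single, fixed standard deviation so the
    # noise has a consistent effect across all input variables.
    mu = X_batch.mean(axis=0)
    sigma = X_batch.std(axis=0) + 1e-8  # avoid division by zero
    Xs = (X_batch - mu) / sigma
    return Xs + rng.normal(0.0, stddev, size=Xs.shape)

rng = np.random.default_rng(0)
batch = rng.uniform(0, 100, size=(32, 2))  # variables on an arbitrary scale
noisy = standardize_then_jitter(batch, stddev=0.05, rng=rng)
```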
Test the Amount of Noise
You cannot know how much noise will benefit your specific model on your training dataset.
Experiment with different amounts, and even different types of noise, in order to discover what works best.
Be systematic and use controlled experiments, perhaps on smaller datasets across a range of values.
Noisy Training Only
Noise is only added during the training of your model.
Be sure that any source of noise is not added during the evaluation of your model, or when your model is used to make predictions on new data.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
In this post, you discovered that adding noise to a neural network during training can improve the robustness of the network, resulting in better generalization and faster learning.
Specifically, you learned:
Small datasets can make learning challenging for neural networks, and the training examples may simply be memorized.
Adding noise during training can make the training process more robust and reduce generalization error.
Noise is traditionally added to the inputs, but can also be added to weights, gradients, and even activation functions.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
A problem with training neural networks is choosing the number of training epochs to use.
Too many epochs can lead to overfitting of the training dataset, whereas too few may result in an underfit model. Early stopping is a method that allows you to specify an arbitrary large number of training epochs and stop training once the model performance stops improving on a hold out validation dataset.
In this tutorial, you will discover the Keras API for adding early stopping to overfit deep learning neural network models.
After completing this tutorial, you will know:
How to monitor the performance of a model during training using the Keras API.
How to create and configure early stopping and model checkpoint callbacks using the Keras API.
How to reduce overfitting by adding an early stopping to an existing model.
Let’s get started.
How to Stop Training Deep Neural Networks at the Right Time Using Early Stopping Photo by Ian D. Keating, some rights reserved.
Tutorial Overview
This tutorial is divided into six parts; they are:
Using Callbacks in Keras
Evaluating a Validation Dataset
Monitoring Model Performance
Early Stopping in Keras
Checkpointing in Keras
Early Stopping Case Study
Using Callbacks in Keras
Callbacks provide a way to execute code and interact with the model training process automatically.
Callbacks can be provided to the fit() function via the “callbacks” argument.
First, callbacks must be instantiated.
...
cb = Callback(...)
Then, one or more callbacks that you intend to use must be added to a Python list.
...
cb_list = [cb, ...]
Finally, the list of callbacks is provided to the callback argument when fitting the model.
...
model.fit(..., callbacks=cb_list)
Evaluating a Validation Dataset in Keras
Early stopping requires that a validation dataset is evaluated during training.
This can be achieved by specifying the validation dataset to the fit() function when training your model.
There are two ways of doing this.
The first involves you manually splitting your training data into a train and validation dataset and specifying the validation dataset to the fit() function via the validation_data argument. For example:
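A sketch of the manual split, using scikit-learn’s train_test_split on toy data; the fit() call is shown as a comment because the model definition is omitted here, and the variable names are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 2)
y = np.random.randint(0, 2, 100)
# hold back 30% of the training data as a validation set
trainX, valX, trainy, valy = train_test_split(X, y, test_size=0.3, random_state=1)
# then: model.fit(trainX, trainy, validation_data=(valX, valy), epochs=100)
```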
Alternately, the fit() function can automatically split your training dataset into train and validation sets based on a percentage split specified via the validation_split argument.
The validation_split is a value between 0 and 1 and defines the percentage amount of the training dataset to use for the validation dataset. For example:
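For instance, with validation_split=0.3, Keras holds back the last 30% of the provided samples (taken from the end of the arrays, before any shuffling) for validation. The bookkeeping is roughly equivalent to:

```python
import numpy as np

X = np.random.rand(100, 2)
validation_split = 0.3
# Keras takes the validation samples from the end of the arrays
n_val = int(len(X) * validation_split)
trainX, valX = X[:-n_val], X[-n_val:]
# then: model.fit(X, y, validation_split=0.3, epochs=100)
```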
In both cases, the model is not trained on the validation dataset. Instead, the model is evaluated on the validation dataset at the end of each training epoch.
Monitoring Model Performance
The loss function chosen to be optimized for your model is calculated at the end of each epoch.
This is made available to callbacks via the name “loss.”
If a validation dataset is specified to the fit() function via the validation_data or validation_split arguments, then the loss on the validation dataset will be made available via the name “val_loss.”
Additional metrics can be monitored during the training of the model.
They can be specified when compiling the model via the “metrics” argument to the compile function. This argument takes a Python list of known metric functions, such as ‘mse‘ for mean squared error and ‘acc‘ for accuracy. For example:
...
model.compile(..., metrics=['acc'])
If additional metrics are monitored during training, they are also available to the callbacks via the same name, such as ‘acc‘ for accuracy on the training dataset and ‘val_acc‘ for the accuracy on the validation dataset. Or, ‘mse‘ for mean squared error on the training dataset and ‘val_mse‘ on the validation dataset.
Early Stopping in Keras
Keras supports the early stopping of training via a callback called EarlyStopping.
This callback allows you to specify the performance measure to monitor and the trigger; once triggered, it will stop the training process.
The EarlyStopping callback is configured when instantiated via arguments.
The “monitor” allows you to specify the performance measure to monitor in order to end training. Recall from the previous section that the calculation of measures on the validation dataset will have the ‘val_‘ prefix, such as ‘val_loss‘ for the loss on the validation dataset.
es = EarlyStopping(monitor='val_loss')
Based on the choice of performance measure, the “mode” argument will need to specify whether the objective of the chosen metric is to increase (maximize or ‘max‘) or to decrease (minimize or ‘min‘).
For example, we would seek a minimum for validation loss and a minimum for validation mean squared error, whereas we would seek a maximum for validation accuracy.
es = EarlyStopping(monitor='val_loss', mode='min')
By default, mode is set to ‘auto‘ and knows that you want to minimize loss or maximize accuracy.
That is all that is needed for the simplest form of early stopping. Training will stop when the chosen performance measure stops improving. To discover the training epoch on which training was stopped, the “verbose” argument can be set to 1. Once stopped, the callback will print the epoch number.
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)
Often, the first sign of no further improvement may not be the best time to stop training. This is because the model may coast into a plateau of no improvement or even get slightly worse before getting much better.
We can account for this by adding a delay to the trigger in terms of the number of epochs on which we would like to see no improvement. This can be done by setting the “patience” argument.
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=50)
The exact amount of patience will vary between models and problems. Reviewing plots of your performance measure can be very useful to get an idea of how noisy the optimization process for your model on your data may be.
By default, any change in the performance measure, no matter how fractional, will be considered an improvement. You may want to consider an improvement that is a specific increment, such as 1 unit for mean squared error or 1% for accuracy. This can be specified via the “min_delta” argument.
es = EarlyStopping(monitor='val_acc', mode='max', min_delta=1)
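Putting the monitor, patience, and min_delta pieces together, the core bookkeeping inside such a callback can be sketched in plain Python. This handles mode='min' only and is a simplification of what Keras actually does; the class name is made up for the example.

```python
class SimpleEarlyStopping:
    # Minimal sketch of EarlyStopping bookkeeping for mode='min': an epoch
    # counts as an improvement only if it beats the best value seen so far
    # by more than min_delta; otherwise a patience counter ticks up.
    def __init__(self, patience=0, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float('inf')
        self.wait = 0

    def update(self, value):
        """Return True when training should stop."""
        if value < self.best - self.min_delta:
            self.best = value
            self.wait = 0
            return False
        self.wait += 1
        return self.wait > self.patience

es = SimpleEarlyStopping(patience=1)
stops = [es.update(v) for v in [1.0, 0.9, 0.95, 0.96]]
```

With patience=1 on the losses above, the single bad epoch (0.95) is tolerated, and training stops only after a second consecutive epoch without improvement.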
Finally, it may be desirable to only stop training if performance stays above or below a given threshold or baseline. For example, if you have familiarity with the training of the model (e.g. from learning curves) and know that once a validation loss of a given value is achieved, there is no point in continuing training. This can be specified by setting the “baseline” argument.
This might be more useful when fine tuning a model, after the initial wild fluctuations in the performance measure seen in the early stages of training a new model are past.
es = EarlyStopping(monitor='val_loss', mode='min', baseline=0.4)
Checkpointing in Keras
The EarlyStopping callback will stop training once triggered, but the model at the end of training may not be the model with the best performance on the validation dataset.
An additional callback is required that will save the best model observed during training for later use. This is the ModelCheckpoint callback.
The ModelCheckpoint callback is flexible in the way it can be used, but in this case we will use it only to save the best model observed during training as defined by a chosen performance measure on the validation dataset.
Saving and loading models requires that HDF5 support has been installed on your workstation. For example, using the pip Python installer, this can be achieved as follows:
pip install h5py
The callback will save the model to file, which requires that a path and filename be specified via the first argument.
mc = ModelCheckpoint('best_model.h5')
The preferred loss function to be monitored can be specified via the monitor argument, in the same way as the EarlyStopping callback. For example, loss on the validation dataset (the default).
mc = ModelCheckpoint('best_model.h5', monitor='val_loss')
Also, as with the EarlyStopping callback, we must specify the “mode” as either minimizing or maximizing the performance measure. Again, the default is ‘auto,’ which is aware of the standard performance measures.
mc = ModelCheckpoint('best_model.h5', monitor='val_loss', mode='min')
Finally, we are interested in only the very best model observed during training, rather than the best compared to the previous epoch, which might not be the best overall if training is noisy. This can be achieved by setting the “save_best_only” argument to True.
mc = ModelCheckpoint('best_model.h5', monitor='val_loss', mode='min', save_best_only=True)
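The effect of save_best_only can be sketched the same way: the model is written to file only when the monitored value improves on the best seen so far. Again, this is mode='min' only, a simplification of the real callback, with an illustrative class name.

```python
class BestModelTracker:
    # Minimal sketch of save_best_only bookkeeping for mode='min':
    # track the best monitored value and the epoch it occurred.
    def __init__(self):
        self.best = float('inf')
        self.best_epoch = None

    def update(self, epoch, value):
        """Return True when the model at this epoch would be saved."""
        if value < self.best:
            self.best = value
            self.best_epoch = epoch
            return True  # the real callback would overwrite the file here
        return False

mc = BestModelTracker()
saved = [mc.update(e, v) for e, v in enumerate([0.5, 0.4, 0.45])]
```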
That is all that is needed to ensure the model with the best performance is saved when using early stopping, or in general.
It may be interesting to know the value of the performance measure and at what epoch the model was saved. This can be printed by the callback by setting the “verbose” argument to 1.
mc = ModelCheckpoint('best_model.h5', monitor='val_loss', mode='min', verbose=1)
The saved model can then be loaded and evaluated any time by calling the load_model() function.
# load a saved model
from keras.models import load_model
saved_model = load_model('best_model.h5')
Now that we know how to use the early stopping and model checkpoint APIs, let’s look at a worked example.
Early Stopping Case Study
In this section, we will demonstrate how to use early stopping to reduce overfitting of an MLP on a simple binary classification problem.
This example provides a template for applying early stopping to your own neural network for classification and regression problems.
Binary Classification Problem
We will use a standard binary classification problem that defines two semi-circles of observations, one semi-circle for each class.
Each observation has two input variables with the same scale and a class output value of either 0 or 1. This dataset is called the “moons” dataset because of the shape of the observations in each class when plotted.
We can use the make_moons() function to generate observations from this problem. We will add noise to the data and seed the random number generator so that the same samples are generated each time the code is run.
We can plot the dataset where the two variables are taken as x and y coordinates on a graph and the class value is taken as the color of the observation.
The complete example of generating the dataset and plotting it is listed below.
# generate two moons dataset
from sklearn.datasets import make_moons
from matplotlib import pyplot
from pandas import DataFrame
# generate 2d classification dataset
X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
# scatter plot, dots colored by class value
df = DataFrame(dict(x=X[:,0], y=X[:,1], label=y))
colors = {0:'red', 1:'blue'}
fig, ax = pyplot.subplots()
grouped = df.groupby('label')
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', label=key, color=colors[key])
pyplot.show()
Running the example creates a scatter plot showing the semi-circle or moon shape of the observations in each class. We can see the noise in the dispersal of the points making the moons less obvious.
Scatter Plot of Moons Dataset With Color Showing the Class Value of Each Sample
This is a good test problem because the classes cannot be separated by a line, i.e. they are not linearly separable, requiring a nonlinear method such as a neural network to address.
We have only generated 100 samples, which is small for a neural network, providing the opportunity to overfit the training dataset and have higher error on the test dataset: a good case for using regularization. Further, the samples have noise, giving the model an opportunity to learn aspects of the samples that don’t generalize.
Overfit Multilayer Perceptron
We can develop an MLP model to address this binary classification problem.
The model will have one hidden layer with more nodes than may be required to solve this problem, providing an opportunity to overfit. We will also train the model for longer than is required to ensure the model overfits.
Before we define the model, we will split the dataset into train and test sets, using 30 examples to train the model and 70 to evaluate the fit model’s performance.
# generate 2d classification dataset
X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
Next, we can define the model.
The hidden layer uses 500 nodes and the rectified linear activation function. A sigmoid activation function is used in the output layer in order to predict class values of 0 or 1. The model is optimized using the binary cross entropy loss function, suitable for binary classification problems and the efficient Adam version of gradient descent.
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
The defined model is then fit on the training data for 4,000 epochs and the default batch size of 32.
We will also use the test dataset as a validation dataset. This is just a simplification for this example. In practice, you would split the training set into train and validation and also hold back a test set for final model evaluation.
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
We can evaluate the performance of the model on the test dataset and report the result.
Finally, we will plot the loss of the model on both the train and test set each epoch.
If the model does indeed overfit the training dataset, we would expect the line plot of loss on the training set to continue to decrease, and loss on the test set to decrease to a point and then begin to rise again as the model learns statistical noise in the training dataset.
# plot training history
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
We can tie all of these pieces together; the complete example is listed below.
# mlp overfit on the moons dataset
from sklearn.datasets import make_moons
from keras.layers import Dense
from keras.models import Sequential
from matplotlib import pyplot
# generate 2d classification dataset
X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot training history
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
Running the example reports the model performance on the train and test datasets.
We can see that the model has better performance on the training dataset than the test dataset, one possible sign of overfitting.
Your specific results may vary given the stochastic nature of the neural network and the training algorithm. Because the model is severely overfit, we generally would not expect much, if any, variance in the accuracy across repeated runs of the model on the same dataset.
Train: 1.000, Test: 0.914
A figure is created showing line plots of the model loss on the train and test sets.
We can see the expected shape of an overfit model, where test loss decreases to a point and then begins to increase again.
Reviewing the figure, we can also see flat spots in the ups and downs in the validation loss. Any early stopping will have to account for these behaviors. We would also expect that a good time to stop training might be around epoch 800.
Line Plots of Loss on Train and Test Datasets While Training Showing an Overfit Model
Overfit MLP With Early Stopping
We can update the example and add very simple early stopping.
As soon as the loss of the model begins to increase on the test dataset, we will stop training.
First, we can define the early stopping callback.
# simple early stopping
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)
We can then update the call to the fit() function and specify a list of callbacks via the “callbacks” argument.
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0, callbacks=[es])
The complete example with the addition of simple early stopping is listed below.
# mlp overfit on the moons dataset with simple early stopping
from sklearn.datasets import make_moons
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping
from matplotlib import pyplot
# generate 2d classification dataset
X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# simple early stopping
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1)
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0, callbacks=[es])
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot training history
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
Running the example reports the model performance on the train and test datasets.
We can also see that the callback stopped training at epoch 219. This is too early, as we would expect a good time to stop training to be around epoch 800. This is also highlighted by the classification accuracy on both the train and test sets, which is worse than without early stopping.
Epoch 00219: early stopping
Train: 0.967, Test: 0.814
Reviewing the line plot of train and test loss, we can indeed see that training was stopped at the point when validation loss began to plateau for the first time.
A major challenge in training neural networks is how long to train them.
Too little training will mean that the model will underfit the train and the test sets. Too much training will mean that the model will overfit the training dataset and have poor performance on the test set.
A compromise is to train on the training dataset but to stop training at the point when performance on a validation dataset starts to degrade. This simple, effective, and widely used approach to training neural networks is called early stopping.
In this post, you will discover that stopping the training of a neural network early before it has overfit the training dataset can reduce overfitting and improve the generalization of deep neural networks.
After reading this post, you will know:
The challenge of training a neural network long enough to learn the mapping, but not so long that it overfits the training data.
Model performance on a holdout validation dataset can be monitored during training and training stopped when generalization error starts to increase.
The use of early stopping requires the selection of a performance measure to monitor, a trigger to stop training, and a selection of the model weights to use.
Let’s get started.
A Gentle Introduction to Early Stopping for Avoiding Overtraining Neural Network Models Photo by Benson Kua, some rights reserved.
Overview
This tutorial is divided into five parts; they are:
The Problem of Training Just Enough
Stop Training When Generalization Error Increases
How to Stop Training Early
Examples of Early Stopping
Tips for Early Stopping
The Problem of Training Just Enough
Training neural networks is challenging.
When training a large network, there will be a point during training when the model will stop generalizing and start learning the statistical noise in the training dataset.
This overfitting of the training dataset will result in an increase in generalization error, making the model less useful at making predictions on new data.
The challenge is to train the network long enough that it is capable of learning the mapping from inputs to outputs, but not training the model so long that it overfits the training data.
However, all standard neural network architectures such as the fully connected multi-layer perceptron are prone to overfitting [10]: While the network seems to get better and better, i.e., the error on the training set decreases, at some point during training it actually begins to get worse again, i.e., the error on unseen examples increases.
One approach to solving this problem is to treat the number of training epochs as a hyperparameter and train the model multiple times with different values, then select the number of epochs that results in the best performance on the train or a holdout test dataset.
The downside of this approach is that it requires multiple models to be trained and discarded. This can be computationally inefficient and time-consuming, especially for large models trained on large datasets over days or weeks.
Stop Training When Generalization Error Increases
An alternative approach is to train the model once for a large number of training epochs.
During training, the model is evaluated on a holdout validation dataset after each epoch. If the performance of the model on the validation dataset starts to degrade (e.g. loss begins to increase or accuracy begins to decrease), then the training process is stopped.
… the error measured with respect to independent data, generally called a validation set, often shows a decrease at first, followed by an increase as the network starts to over-fit. Training can therefore be stopped at the point of smallest error with respect to the validation data set
The model at the time that training is stopped is then used and is known to have good generalization performance.
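This procedure can be sketched outside of any framework. The sketch below, with hypothetical train_one_epoch and validation_loss functions standing in for a framework's training and evaluation calls, halts at the first increase in validation loss and reports the prior epoch:

```python
# Minimal sketch of early stopping with the simplest trigger: halt as soon
# as the validation loss increases from one epoch to the next, and report
# the prior epoch as the one whose model should be kept.
def early_stop_training(train_one_epoch, validation_loss, max_epochs):
    prev_loss = float('inf')
    for epoch in range(max_epochs):
        train_one_epoch()         # one pass over the training data
        loss = validation_loss()  # evaluate on the holdout validation set
        if loss > prev_loss:
            return epoch - 1      # validation loss rose: keep prior epoch
        prev_loss = loss
    return max_epochs - 1         # never triggered: trained to the limit
```

As we will see below, this simplest trigger is often too sensitive in practice, because validation loss is noisy from one epoch to the next.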
This procedure is called “early stopping” and is perhaps one of the oldest and most widely used forms of neural network regularization.
This strategy is known as early stopping. It is probably the most commonly used form of regularization in deep learning. Its popularity is due both to its effectiveness and its simplicity.
If regularization methods like weight decay that update the loss function to encourage less complex models are considered “explicit” regularization, then early stopping may be thought of as a type of “implicit” regularization, much like using a smaller network that has less capacity.
Regularization may also be implicit as is the case with early stopping.
Early stopping requires that you configure your network to be under-constrained, meaning that it has more capacity than is required for the problem.
When training the network, a larger number of training epochs is used than may normally be required, to give the network plenty of opportunity to fit, then begin to overfit the training dataset.
There are three elements to using early stopping; they are:
Monitoring model performance.
Trigger to stop training.
The choice of model to use.
Monitoring Performance
The performance of the model must be monitored during training.
This requires the choice of a dataset that is used to evaluate the model and a metric used to evaluate the model.
It is common to split the training dataset and use a subset, such as 30%, as a validation dataset used to monitor performance of the model during training. This validation set is not used to train the model. It is also common to use the loss on a validation dataset as the metric to monitor, although you may also use prediction error in the case of regression, or accuracy in the case of classification.
The loss of the model on the training dataset will also be available as part of the training procedure, and additional metrics may also be calculated and monitored on the training dataset.
Performance of the model is evaluated on the validation set at the end of each epoch, which adds an additional computational cost during training. This can be reduced by evaluating the model less frequently, such as every 2, 5, or 10 training epochs.
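As a small sketch of this idea, a modulus test on the epoch number is enough to evaluate only every few epochs (should_evaluate and eval_every are illustrative names, not a framework API):

```python
# Sketch: reduce monitoring cost by evaluating on the validation set
# only every `eval_every` epochs rather than after every epoch.
def should_evaluate(epoch, eval_every=5):
    return epoch % eval_every == 0

# which of the first 20 epochs would be evaluated
evaluated = [e for e in range(20) if should_evaluate(e, eval_every=5)]
```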
Early Stopping Trigger
Once a scheme for evaluating the model is selected, a trigger for stopping the training process must be chosen.
The trigger will use a monitored performance metric to decide when to stop training. This is often the performance of the model on the holdout dataset, such as the loss.
In the simplest case, training is stopped as soon as the performance on the validation dataset degrades compared to the prior training epoch (e.g. an increase in loss).
More elaborate triggers may be required in practice. This is because the training of a neural network is stochastic and can be noisy. Plotted on a graph, the performance of a model on a validation dataset may go up and down many times. This means that the first sign of overfitting may not be a good place to stop training.
… the validation error can still go further down after it has begun to increase […] Real validation error curves almost always have more than one local minimum.
Some more elaborate triggers include:
No change in metric over a given number of epochs.
An absolute change in a metric.
A decrease in performance observed over a given number of epochs.
Average change in metric over a given number of epochs.
Some delay or “patience” in stopping is almost always a good idea.
… results indicate that “slower” criteria, which stop later than others, on the average lead to improved generalization compared to “faster” ones. However, the training time that has to be expended for such improvements is rather large on average and also varies dramatically when slow criteria are used.
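The “patience” idea can be sketched as a trigger that only halts after the monitored loss has failed to improve by at least some minimum amount for a given number of consecutive epochs (the function and argument names here are illustrative; Keras exposes the same knobs via the patience and min_delta arguments of its EarlyStopping callback):

```python
# Sketch of a "patience" trigger: halt only after the monitored validation
# loss has failed to improve by at least `min_delta` for `patience`
# consecutive epochs. Returns (stop_epoch, best_epoch).
def patience_trigger(val_losses, patience=10, min_delta=0.0):
    best = float('inf')
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, best_epoch, wait = loss, epoch, 0  # new best: reset wait
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # halt; best weights were earlier
    return len(val_losses) - 1, best_epoch
```

Note that the epoch where training halts and the epoch with the best weights differ by up to the patience value, which motivates the choice of model weights discussed next.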
At the time that training is halted, the model is known to have slightly worse generalization error than a model at a prior epoch.
As such, some consideration may need to be given as to exactly which model is saved; specifically, the training epoch from which the model weights are saved to file.
This will depend on the trigger chosen to stop the training process. For example, if the trigger is a simple decrease in performance from one epoch to the next, then the weights for the model at the prior epoch will be preferred.
If the trigger is required to observe a decrease in performance over a fixed number of epochs, then the model at the beginning of the trigger period will be preferred.
Perhaps a simple approach is to always save the model weights if the performance of the model on a holdout dataset is better than the best seen at any previous epoch. That way, you will always have the model with the best performance on the holdout set.
Every time the error on the validation set improves, we store a copy of the model parameters. When the training algorithm terminates, we return these parameters, rather than the latest parameters.
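This best-weights strategy can be sketched with a plain dictionary standing in for real model weights (the function name is illustrative; in Keras, the ModelCheckpoint callback with save_best_only=True plays this role):

```python
import copy

# Sketch of "save the best model seen so far": every time validation loss
# improves on the best so far, store a copy of the parameters; return that
# copy at the end, rather than the final parameters.
def train_with_best_checkpoint(epoch_results):
    # epoch_results: iterable of (params, val_loss) pairs, one per epoch
    best_loss = float('inf')
    best_params = None
    for params, val_loss in epoch_results:
        if val_loss < best_loss:
            best_loss = val_loss
            best_params = copy.deepcopy(params)  # snapshot, not a reference
    return best_params, best_loss
```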
Examples of Early Stopping
This section summarizes some examples where early stopping has been used.
Yoon Kim, in his seminal application of convolutional neural networks to sentiment analysis in the 2014 paper titled “Convolutional Neural Networks for Sentence Classification,” used early stopping with 10% of the training dataset as the validation holdout set.
We do not otherwise perform any dataset-specific tuning other than early stopping on dev sets. For datasets without a standard dev set we randomly select 10% of the training data as the dev set.
Chiyuan Zhang, et al. from MIT, Berkeley, and Google, in their 2017 paper titled “Understanding deep learning requires rethinking generalization,” highlight that for very deep convolutional neural networks for photo classification trained on an abundant dataset, early stopping may not always offer a benefit, as the model is less likely to overfit such large datasets.
[regarding] the training and testing accuracy on ImageNet [results suggest] a reference of potential performance gain for early stopping. However, on the CIFAR10 dataset, we do not observe any potential benefit of early stopping.
Lack of regularisation in RNN models makes it difficult to handle small data, and to avoid overfitting researchers often use early stopping, or small and under-specified models
Alex Graves, et al., in their famous 2013 paper titled “Speech recognition with deep recurrent neural networks” achieved state-of-the-art results with LSTMs for speech recognition, while making use of early stopping.
Regularisation is vital for good performance with RNNs, as their flexibility makes them prone to overfitting. Two regularisers were used in this paper: early stopping and weight noise …
Tips for Early Stopping
This section provides some tips for using early stopping regularization with your neural network.
When to Use Early Stopping
Early stopping is so easy to use, e.g. with the simplest trigger, that there is little reason not to use it when training neural networks.
Use of early stopping may be a staple of the modern training of deep neural networks.
Before using early stopping, it may be interesting to fit an under-constrained model and monitor its performance on the train and validation datasets.
Plotting the performance of the model in real-time or at the end of a long run will show how noisy the training process is with your specific model and dataset.
This may help in the choice of a trigger for early stopping.
Monitor an Important Metric
Loss is an easy metric to monitor during training and to trigger early stopping.
The problem is that loss does not always capture what is most important about the model to you and your project.
It may be better to choose a performance metric to monitor that best defines the performance of the model in terms of the way you intend to use it. This may be the metric that you intend to use to report the performance of the model.
Suggested Training Epochs
A problem with early stopping is that the model does not make use of all available training data.
It may be desirable to avoid overfitting and to train on all possible data, especially on problems where the amount of training data is very limited.
A recommended approach would be to treat the number of training epochs as a hyperparameter and to grid search a range of different values, perhaps using k-fold cross-validation. This will allow you to fix the number of training epochs and fit a final model on all available data.
Early stopping could be used instead. The early stopping procedure could be repeated a number of times. The epoch number at which training was stopped could be recorded. Then, the average of the epoch number across all repeats of early stopping could be used when fitting a final model on all available training data.
This process could be performed using a different split of the training set into train and validation steps each time early stopping is run.
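The averaging idea above can be sketched directly (the function name is illustrative):

```python
# Sketch of fixing the final epoch count from repeated early-stopping runs:
# run early stopping several times (e.g. on different train/validation
# splits), record the epoch at which each run stopped, then train the
# final model on all available data for the average of those epochs.
def final_epoch_count(stop_epochs):
    return round(sum(stop_epochs) / len(stop_epochs))

# e.g. five repeats of early stopping halted at these epochs
stop_epochs = [780, 820, 760, 850, 800]
n_final = final_epoch_count(stop_epochs)  # train the final model this long
```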
An alternative might be to use early stopping with a validation dataset, then update the final model with further training on the held out validation set.
Early Stopping With Cross-Validation
Early stopping could be used with k-fold cross-validation, although it is not recommended.
The k-fold cross-validation procedure is designed to estimate the generalization error of a model by repeatedly refitting and evaluating it on different subsets of a dataset.
Early stopping is designed to monitor the generalization error of one model and stop training when generalization error begins to degrade.
They are at odds because cross-validation assumes you don’t know the generalization error and early stopping is trying to give you the best model based on knowledge of generalization error.
It may be desirable to use cross-validation to estimate the performance of models with different hyperparameter values, such as learning rate or network structure, whilst also using early stopping.
In this case, if you have the resources to repeatedly evaluate the performance of the model, then perhaps the number of training epochs may also be treated as a hyperparameter to be optimized, instead of using early stopping.
Instead of using cross-validation with early stopping, early stopping may be used directly without repeated evaluation when evaluating different hyperparameter values for the model (e.g. different learning rates).
One possible point of confusion is that early stopping is sometimes referred to as “cross-validated training.” Further, research into early stopping that compares triggers may use cross-validation to compare the impact of different triggers.
Overfit Validation
Repeating the early stopping procedure many times may result in the model overfitting the validation dataset.
This can happen just as easily as overfitting the training dataset.
One approach is to only use early stopping once all other hyperparameters of the model have been chosen.
Another strategy may be to use a different split of the training dataset into train and validation sets each time early stopping is used.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
In this post, you discovered that stopping the training of a neural network early, before it has overfit the training dataset, can reduce overfitting and improve the generalization of deep neural networks.
Specifically, you learned:
The challenge of training a neural network long enough to learn the mapping, but not so long that it overfits the training data.
Model performance on a holdout validation dataset can be monitored during training and training stopped when generalization error starts to increase.
The use of early stopping requires the selection of a performance measure to monitor, a trigger for stopping training, and a selection of the model weights to use.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Dropout regularization is a computationally cheap way to regularize a deep neural network.
Dropout works by probabilistically removing, or “dropping out,” inputs to a layer, which may be input variables in the data sample or activations from a previous layer. It has the effect of simulating a large number of networks with very different network structure and, in turn, making nodes in the network generally more robust to the inputs.
In this tutorial, you will discover the Keras API for adding dropout regularization to deep learning neural network models.
After completing this tutorial, you will know:
How to create a dropout layer using the Keras API.
How to add dropout regularization to MLP, CNN, and RNN layers using the Keras API.
How to reduce overfitting by adding a dropout regularization to an existing model.
Let’s get started.
How to Reduce Overfitting With Dropout Regularization in Keras Photo by PROJorge Láscar, some rights reserved.
Tutorial Overview
This tutorial is divided into three parts; they are:
Dropout Regularization in Keras
Dropout Regularization on Layers
Dropout Regularization Case Study
Dropout Regularization in Keras
Keras supports dropout regularization.
The simplest form of dropout in Keras is provided by a Dropout core layer.
When created, the dropout rate can be specified to the layer as the probability of setting each input to the layer to zero. This is different from the definition of dropout rate from the papers, in which the rate refers to the probability of retaining an input.
Therefore, when a dropout rate of 0.8 is suggested in a paper (retain 80%), this will, in fact, be a dropout rate of 0.2 (set 20% of inputs to zero).
Below is an example of creating a dropout layer with a 50% chance of setting inputs to zero.
layer = Dropout(0.5)
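To make the meaning of the rate concrete, the sketch below mimics what a dropout layer does to a batch at training time, using the “inverted dropout” formulation that Keras uses: each input is zeroed with probability rate and the survivors are scaled up by 1/(1 - rate) so the expected activation is unchanged (dropout_forward is an illustrative name, not a Keras API):

```python
import numpy as np

# Mimic Dropout(rate) at training time: zero each input with probability
# `rate`, then scale the survivors by 1/(1 - rate) ("inverted dropout")
# so the expected value of each activation is unchanged.
def dropout_forward(x, rate, rng):
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob  # True means keep this input
    return x * mask / keep_prob

rng = np.random.default_rng(1)
x = np.ones(1000)
y = dropout_forward(x, rate=0.5, rng=rng)
# roughly half of y is zeroed; the surviving values are scaled to 2.0
```

At test time no inputs are dropped and no scaling is applied, which is why the Keras layer is only active during training.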
Dropout Regularization on Layers
The Dropout layer is added to a model between existing layers and applies to outputs of the prior layer that are fed to the subsequent layer.
For example, given two dense layers, we can insert a dropout layer between them, in which case the outputs or activations of the first layer have dropout applied to them, and these are then taken as input to the second layer. In effect, it is the input to the second layer that has dropout applied.
Dropout can also be applied to the visible layer, e.g. the inputs to the network.
This requires that you define the network with the Dropout layer as the first layer and add the input_shape argument to the layer to specify the expected shape of the input samples.
...
model.add(Dropout(0.5, input_shape=(2,)))
...
Let’s take a look at how dropout regularization can be used with some common network types.
MLP Dropout Regularization
The example below adds dropout between two dense fully connected layers.
# example of dropout between fully connected layers
from keras.layers import Dense
from keras.layers import Dropout
...
model.add(Dense(32))
model.add(Dropout(0.5))
model.add(Dense(1))
...
CNN Dropout Regularization
Dropout can be used after convolutional layers (e.g. Conv2D) and after pooling layers (e.g. MaxPooling2D).
Often, dropout is only used after the pooling layers, but this is just a rough heuristic.
# example of dropout for a CNN
from keras.layers import Dense
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Dropout
...
model.add(Conv2D(32, (3,3)))
model.add(Conv2D(32, (3,3)))
model.add(MaxPooling2D())
model.add(Dropout(0.5))
model.add(Dense(1))
...
In this case, dropout is applied to each element or cell within the feature maps.
An alternative way to use dropout with convolutional neural networks is to dropout entire feature maps from the convolutional layer which are then not used during pooling. This is called spatial dropout (or “SpatialDropout“).
Instead we formulate a new dropout method which we call SpatialDropout. For a given convolution feature tensor […] [we] extend the dropout value across the entire feature map.
Spatial Dropout is provided in Keras via the SpatialDropout2D layer (as well as 1D and 3D versions).
# example of spatial dropout for a CNN
from keras.layers import Dense
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import SpatialDropout2D
...
model.add(Conv2D(32, (3,3)))
model.add(Conv2D(32, (3,3)))
model.add(SpatialDropout2D(0.5))
model.add(MaxPooling2D())
model.add(Dense(1))
...
RNN Dropout Regularization
The example below adds dropout between two layers: an LSTM recurrent layer and a dense fully connected layer.
# example of dropout between LSTM and fully connected layers
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
...
model.add(LSTM(32))
model.add(Dropout(0.5))
model.add(Dense(1))
...
This example applies dropout to the 32 outputs from the LSTM layer that are provided as input to the Dense layer.
Alternately, the inputs to the LSTM may be subjected to dropout. In this case, a different dropout mask is applied to each time step within each sample presented to the LSTM.
# example of dropout before LSTM layer
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
...
model.add(Dropout(0.5, input_shape=(...)))
model.add(LSTM(32))
model.add(Dense(1))
...
There is an alternative way to use dropout with recurrent layers like the LSTM. The same dropout mask may be used by the LSTM for all inputs within a sample. The same approach may be used for recurrent input connections across the time steps of the sample. This approach to dropout with recurrent models is called a Variational RNN.
The proposed technique (Variational RNN […]) uses the same dropout mask at each time step, including the recurrent layers. […] Implementing our approximate inference is identical to implementing dropout in RNNs with the same network units dropped at each time step, randomly dropping inputs, outputs, and recurrent connections. This is in contrast to existing techniques, where different network units would be dropped at different time steps, and no dropout would be applied to the recurrent connections
Keras supports Variational RNNs (i.e. consistent dropout across the time steps of a sample for inputs and recurrent inputs) via two arguments on the recurrent layers, namely “dropout” for inputs and “recurrent_dropout” for recurrent inputs.
# example of variational LSTM dropout
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
...
model.add(LSTM(32, dropout=0.5, recurrent_dropout=0.5))
model.add(Dense(1))
...
Dropout Regularization Case Study
In this section, we will demonstrate how to use dropout regularization to reduce overfitting of an MLP on a simple binary classification problem.
This example provides a template for applying dropout regularization to your own neural network for classification and regression problems.
Binary Classification Problem
We will use a standard binary classification problem that defines two two-dimensional concentric circles of observations, one circle for each class.
Each observation has two input variables with the same scale and a class output value of either 0 or 1. This dataset is called the “circles” dataset because of the shape of the observations in each class when plotted.
We can use the make_circles() function to generate observations from this problem. We will add noise to the data and seed the random number generator so that the same samples are generated each time the code is run.
We can plot the dataset where the two variables are taken as x and y coordinates on a graph and the class value is taken as the color of the observation.
The complete example of generating the dataset and plotting it is listed below.
# generate two circles dataset
from sklearn.datasets import make_circles
from matplotlib import pyplot
from pandas import DataFrame
# generate 2d classification dataset
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
# scatter plot, dots colored by class value
df = DataFrame(dict(x=X[:,0], y=X[:,1], label=y))
colors = {0:'red', 1:'blue'}
fig, ax = pyplot.subplots()
grouped = df.groupby('label')
for key, group in grouped:
group.plot(ax=ax, kind='scatter', x='x', y='y', label=key, color=colors[key])
pyplot.show()
Running the example creates a scatter plot showing the concentric circles shape of the observations in each class. We can see the noise in the dispersal of the points making the circles less obvious.
Scatter Plot of Circles Dataset with Color Showing the Class Value of Each Sample
This is a good test problem because the classes cannot be separated by a line, i.e. they are not linearly separable, requiring a nonlinear method such as a neural network to address.
We have only generated 100 samples, which is small for a neural network, providing the opportunity to overfit the training dataset and have higher error on the test dataset: a good case for using regularization. Further, the samples have noise, giving the model an opportunity to learn aspects of the samples that don’t generalize.
Overfit Multilayer Perceptron
We can develop an MLP model to address this binary classification problem.
The model will have one hidden layer with more nodes than may be required to solve this problem, providing an opportunity to overfit. We will also train the model for longer than is required to ensure the model overfits.
Before we define the model, we will split the dataset into train and test sets, using 30 examples to train the model and 70 to evaluate the fit model’s performance.
# generate 2d classification dataset
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
Next, we can define the model.
The hidden layer has 500 nodes and uses the rectified linear activation function. A sigmoid activation function is used in the output layer in order to predict class values of 0 or 1.
The model is optimized using the binary cross entropy loss function, suitable for binary classification problems, and the efficient Adam version of gradient descent.
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
The defined model is then fit on the training data for 4,000 epochs and the default batch size of 32.
We will also use the test dataset as a validation dataset.
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
We can evaluate the performance of the model on the test dataset and report the result.
Finally, we will plot the performance of the model on both the train and test set each epoch.
If the model does indeed overfit the training dataset, we would expect the line plot of accuracy on the training set to continue to increase and the test set to rise and then fall again as the model learns statistical noise in the training dataset.
# plot history
pyplot.plot(history.history['acc'], label='train')
pyplot.plot(history.history['val_acc'], label='test')
pyplot.legend()
pyplot.show()
We can tie all of these pieces together; the complete example is listed below.
# mlp overfit on the two circles dataset
from sklearn.datasets import make_circles
from keras.layers import Dense
from keras.models import Sequential
from matplotlib import pyplot
# generate 2d classification dataset
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot history
pyplot.plot(history.history['acc'], label='train')
pyplot.plot(history.history['val_acc'], label='test')
pyplot.legend()
pyplot.show()
Running the example reports the model performance on the train and test datasets.
We can see that the model has better performance on the training dataset than the test dataset, one possible sign of overfitting.
Your specific results may vary given the stochastic nature of the neural network and the training algorithm. Because the model is severely overfit, we generally would not expect much, if any, variance in the accuracy across repeated runs of the model on the same dataset.
Train: 1.000, Test: 0.757
A figure is created showing line plots of the model accuracy on the train and test sets.
We can see the expected shape of an overfit model, where test accuracy increases to a point and then begins to decrease again.
Line Plots of Accuracy on Train and Test Datasets While Training Showing an Overfit
Overfit MLP With Dropout Regularization
We can update the example to use dropout regularization.
We can do this by simply inserting a new Dropout layer between the hidden layer and the output layer. In this case, we will specify a dropout rate (the probability of setting outputs from the hidden layer to zero) of 40%, or 0.4.
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu'))
model.add(Dropout(0.4))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
The complete updated example with the addition of dropout after the hidden layer is listed below.
# mlp with dropout on the two circles dataset
from sklearn.datasets import make_circles
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from matplotlib import pyplot
# generate 2d classification dataset
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu'))
model.add(Dropout(0.4))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot history
pyplot.plot(history.history['acc'], label='train')
pyplot.plot(history.history['val_acc'], label='test')
pyplot.legend()
pyplot.show()
Running the example reports the model performance on the train and test datasets.
Your specific results may vary given the stochastic nature of the neural network and the training algorithm.
In this specific case, we can see that dropout resulted in a slight drop in accuracy on the training dataset, down from 100% to about 97%, and a lift in accuracy on the test set, up from 75% to about 81%.
Train: 0.967, Test: 0.814
Reviewing the line plot of train and test accuracy during training, we can see that it no longer appears that the model has overfit the training dataset.
Model accuracy on both the train and test sets continues to increase to a plateau, albeit with a lot of noise given the use of dropout during training.
Line Plots of Accuracy on Train and Test Datasets While Training With Dropout Regularization
Extensions
This section lists some ideas for extending the tutorial that you may wish to explore.
Input Dropout. Update the example to use dropout on the input variables and compare results.
Weight Constraint. Update the example to add a max-norm weight constraint to the hidden layer and compare results.
Repeated Evaluation. Update the example to repeat the evaluation of the overfit and dropout model and summarize and compare the average results.
Grid Search Rate. Develop a grid search of dropout probabilities and report the relationship between dropout rate and test set accuracy.
If you explore any of these extensions, I’d love to know.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Deep learning neural networks are likely to quickly overfit a training dataset with few examples.
Ensembles of neural networks with different model configurations are known to reduce overfitting, but require the additional computational expense of training and maintaining multiple models.
A single model can be used to simulate having a large number of different network architectures by randomly dropping out nodes during training. This is called dropout and offers a very computationally cheap and remarkably effective regularization method to reduce overfitting and generalization error in deep neural networks of all kinds.
In this post, you will discover the use of dropout regularization for reducing overfitting and improving the generalization of deep neural networks.
After reading this post, you will know:
Large weights in a neural network are a sign of a more complex network that has overfit the training data.
Probabilistically dropping out nodes in the network is a simple and effective regularization method.
A large network, more training epochs, and the use of a weight constraint are suggested when using dropout.
Let’s get started.
A Gentle Introduction to Dropout for Regularizing Deep Neural Networks Photo by Jocelyn Kinghorn, some rights reserved.
Overview
This tutorial is divided into five parts; they are:
Problem With Overfitting
Randomly Drop Nodes
How to Dropout
Examples of Using Dropout
Tips for Using Dropout Regularization
Problem With Overfitting
Large neural nets trained on relatively small datasets can overfit the training data.
This has the effect of the model learning the statistical noise in the training data, which results in poor performance when the model is evaluated on new data, e.g. a test dataset. Generalization error increases due to overfitting.
One approach to reduce overfitting is to fit all possible different neural networks on the same dataset and to average the predictions from each model. This is not feasible in practice, and can be approximated using a small collection of different models, called an ensemble.
With unlimited computation, the best way to “regularize” a fixed-sized model is to average the predictions of all possible settings of the parameters, weighting each setting by its posterior probability given the training data.
A problem even with the ensemble approximation is that it requires multiple models to be fit and stored, which can be a challenge if the models are large, requiring days or weeks to train and tune.
Randomly Drop Nodes
Dropout is a regularization method that approximates training a large number of neural networks with different architectures in parallel.
During training, some number of layer outputs are randomly ignored or “dropped out.” This has the effect of making the layer look like and be treated like a layer with a different number of nodes and connectivity to the prior layer. In effect, each update to a layer during training is performed with a different “view” of the configured layer.
By dropping a unit out, we mean temporarily removing it from the network, along with all its incoming and outgoing connections
Dropout has the effect of making the training process noisy, forcing nodes within a layer to probabilistically take on more or less responsibility for the inputs.
This conceptualization suggests that perhaps dropout breaks-up situations where network layers co-adapt to correct mistakes from prior layers, in turn making the model more robust.
… units may change in a way that they fix up the mistakes of the other units. This may lead to complex co-adaptations. This in turn leads to overfitting because these co-adaptations do not generalize to unseen data. […]
Dropout simulates a sparse activation from a given layer, which interestingly, in turn, encourages the network to actually learn a sparse representation as a side-effect. As such, it may be used as an alternative to activity regularization for encouraging sparse representations in autoencoder models.
We found that as a side-effect of doing dropout, the activations of the hidden units become sparse, even when no sparsity inducing regularizers are present.
Because the outputs of a layer under dropout are randomly subsampled, it has the effect of reducing the capacity or thinning the network during training. As such, a wider network, e.g. more nodes, may be required when using dropout.
How to Dropout
Dropout is implemented per-layer in a neural network.
It can be used with most types of layers, such as dense fully connected layers, convolutional layers, and recurrent layers such as the long short-term memory network layer.
Dropout may be implemented on any or all hidden layers in the network as well as the visible or input layer. It is not used on the output layer.
The term “dropout” refers to dropping out units (hidden and visible) in a neural network.
A new hyperparameter is introduced that specifies the probability at which outputs of the layer are dropped out, or inversely, the probability at which outputs of the layer are retained. The interpretation is an implementation detail that can differ from paper to code library.
A common value is a probability of 0.5 for retaining the output of each node in a hidden layer and a value close to 1.0, such as 0.8, for retaining inputs from the visible layer.
In the simplest case, each unit is retained with a fixed probability p independent of other units, where p can be chosen using a validation set or can simply be set at 0.5, which seems to be close to optimal for a wide range of networks and tasks. For the input units, however, the optimal probability of retention is usually closer to 1 than to 0.5.
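Note that the quote above describes the probability of retaining a unit (p). In the Keras library, by contrast, the rate argument to the Dropout layer is the probability of dropping a unit, i.e. 1 - p. A minimal fragment, following the style of the examples in this post:

```python
# in Keras, the Dropout rate is the probability of dropping a unit,
# i.e. 1 - p, where p is the retention probability described above
from keras.layers import Dropout
...
model.add(Dropout(0.2))  # retain inputs with probability p = 0.8
...
model.add(Dropout(0.5))  # retain hidden units with probability p = 0.5
...
```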
Dropout is not used after training when making a prediction with the fit network.
The weights of the network will be larger than normal because of dropout. Therefore, before finalizing the network, the weights are first scaled down by the probability at which units were retained during training. The network can then be used as per normal to make predictions.
If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time
The rescaling of the weights can be performed at training time instead, after each weight update at the end of the mini-batch. This is sometimes called “inverse dropout” and does not require any modification of weights during training. Both the Keras and PyTorch deep learning libraries implement dropout in this way.
At test time, we scale down the output by the dropout rate. […] Note that this process can be implemented by doing both operations at training time and leaving the output unchanged at test time, which is often the way it’s implemented in practice
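As an illustration, this “inverse dropout” scaling can be sketched in a few lines of NumPy: surviving activations are scaled up by 1/p during training so that no change is needed at test time. The array values here are purely for demonstration:

```python
# sketch of "inverse dropout": drop units during training and scale
# the survivors so the expected output magnitude is unchanged
from numpy import array
from numpy.random import binomial

# toy outputs from a layer
activations = array([1.0, 2.0, 3.0, 4.0, 5.0])
p = 0.8  # probability of retaining each unit
# binary mask: 1 = retain, 0 = drop
mask = binomial(1, p, size=activations.shape[0])
# zero out dropped units and scale up the rest by 1/p
output = activations * mask / p
# at test time the layer is used as-is, with no mask and no scaling
```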
Dropout works well in practice, perhaps replacing the need for weight regularization (e.g. weight decay) and activation regularization (e.g. representation sparsity).
… dropout is more effective than other standard computationally inexpensive regularizers, such as weight decay, filter norm constraints and sparse activity regularization. Dropout may also be combined with other forms of regularization to yield a further improvement.
Examples of Using Dropout
This section summarizes some examples where dropout was used in recent research papers to provide a suggestion for how and where it may be used.
Geoffrey Hinton, et al. in their 2012 paper that first introduced dropout titled “Improving neural networks by preventing co-adaptation of feature detectors” applied the method with a range of different neural networks on different problem types, achieving improved results, including handwritten digit recognition (MNIST), photo classification (CIFAR-10), and speech recognition (TIMIT).
… we use the same dropout rates – 50% dropout for all hidden units and 20% dropout for visible units
Nitish Srivastava, et al. in their 2014 journal paper introducing dropout titled “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” used dropout on a wide range of computer vision, speech recognition, and text classification tasks and found that it consistently improved performance on each problem.
We trained dropout neural networks for classification problems on data sets in different domains. We found that dropout improved generalization performance on all data sets compared to neural networks that did not use dropout.
On the computer vision problems, different dropout rates were used down through the layers of the network in conjunction with a max-norm weight constraint.
Dropout was applied to all the layers of the network with the probability of retaining the unit being p = (0.9, 0.75, 0.75, 0.5, 0.5, 0.5) for the different layers of the network (going from input to convolutional layers to fully connected layers). In addition, the max-norm constraint with c = 4 was used for all the weights. […]
A simpler configuration was used for the text classification task.
We used probability of retention p = 0.8 in the input layers and 0.5 in the hidden layers. Max-norm constraint with c = 4 was used in all the layers.
Alex Krizhevsky, et al. in their famous 2012 paper titled “ImageNet Classification with Deep Convolutional Neural Networks” achieved (at the time) state-of-the-art results for photo classification on the ImageNet dataset with deep convolutional neural networks and dropout regularization.
We use dropout in the first two fully-connected layers [of the model]. Without dropout, our network exhibits substantial overfitting. Dropout roughly doubles the number of iterations required to converge.
George Dahl, et al. in their 2013 paper titled “Improving deep neural networks for LVCSR using rectified linear units and dropout” used a deep neural network with rectified linear activation functions and dropout to achieve (at the time) state-of-the-art results on a standard speech recognition task. They used a Bayesian optimization procedure to configure the choice of activation function and the amount of dropout.
… the Bayesian optimization procedure learned that dropout wasn’t helpful for sigmoid nets of the sizes we trained. In general, ReLUs and dropout seem to work quite well together.
Tips for Using Dropout Regularization
This section provides some tips for using dropout regularization with your neural network.
Use With All Network Types
Dropout regularization is a generic approach.
It can be used with most, perhaps all, types of neural network models, not least the most common network types of Multilayer Perceptrons, Convolutional Neural Networks, and Long Short-Term Memory Recurrent Neural Networks.
In the case of LSTMs, it may be desirable to use different dropout rates for the input and recurrent connections.
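In Keras, for example, the LSTM layer exposes separate arguments for this: dropout for the input connections and recurrent_dropout for the recurrent connections. A minimal fragment:

```python
# separate dropout rates for the input and recurrent connections of an LSTM
from keras.layers import LSTM
...
model.add(LSTM(32, dropout=0.4, recurrent_dropout=0.2))
...
```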
Dropout Rate
The default interpretation of the dropout hyperparameter is the probability of training a given node in a layer, where 1.0 means no dropout, and 0.0 means no outputs from the layer.
A good value for dropout in a hidden layer is between 0.5 and 0.8. Input layers use a larger dropout rate, such as 0.8.
Use a Larger Network
It is common for larger networks (more layers or more nodes) to more easily overfit the training data.
When using dropout regularization, it is possible to use larger networks with less risk of overfitting. In fact, a large network (more nodes per layer) may be required as dropout will probabilistically reduce the capacity of the network.
A good rule of thumb is to divide the number of nodes in the layer before dropout by the proposed dropout rate and use that as the number of nodes in the new network that uses dropout. For example, a network with 100 nodes and a proposed dropout rate of 0.5 will require 200 nodes (100 / 0.5) when using dropout.
If n is the number of hidden units in any layer and p is the probability of retaining a unit […] a good dropout net should have at least n/p units
Rather than guess at a suitable dropout rate for your network, test different rates systematically.
For example, test values between 1.0 and 0.1 in increments of 0.1.
This will both help you discover what works best for your specific model and dataset, as well as how sensitive the model is to the dropout rate. A more sensitive model may be unstable and could benefit from an increase in size.
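A sketch of such a systematic test is below. It assumes a hypothetical build_model() helper that creates and compiles the model with a given dropout rate, and the train and test arrays prepared as in the examples in this post:

```python
# sketch: evaluate a range of dropout rates systematically
# build_model() is a hypothetical helper returning a compiled model
rates = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
for rate in rates:
    model = build_model(rate)
    model.fit(trainX, trainy, epochs=4000, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    print('rate=%.1f, test=%.3f' % (rate, test_acc))
```

Repeating the evaluation at each rate and averaging the results would give a more reliable comparison given the stochastic nature of training.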
Use a Weight Constraint
Network weights will increase in size in response to the probabilistic removal of layer activations.
Large weight size can be a sign of an unstable network.
To counter this effect, a weight constraint can be imposed to force the norm (magnitude) of all weights in a layer to be below a specified value. For example, a maximum norm constraint with a value between 3 and 4 is recommended.
[…] we can use max-norm regularization. This constrains the norm of the vector of incoming weights at each hidden unit to be bound by a constant c. Typical values of c range from 3 to 4.
This does introduce an additional hyperparameter that may require tuning for the model.
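In Keras, for example, a max-norm constraint can be placed on a layer’s weights via the kernel_constraint argument and used together with dropout. A minimal fragment:

```python
# max-norm weight constraint on a hidden layer used with dropout
from keras.constraints import max_norm
from keras.layers import Dense, Dropout
...
model.add(Dense(500, input_dim=2, activation='relu', kernel_constraint=max_norm(3)))
model.add(Dropout(0.4))
...
```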
Use With Smaller Datasets
Like other regularization methods, dropout is more effective on those problems where there is a limited amount of training data and the model is likely to overfit the training data.
Problems where there is a large amount of training data may see less benefit from using dropout.
For very large datasets, regularization confers little reduction in generalization error. In these cases, the computational cost of using dropout and larger models may outweigh the benefit of regularization.
Activity regularization provides an approach to encourage a neural network to learn sparse features or internal representations of raw observations.
It is common to seek sparse learned representations in autoencoders, called sparse autoencoders, and in encoder-decoder models, although the approach can also be used generally to reduce overfitting and improve a model’s ability to generalize to new observations.
In this tutorial, you will discover the Keras API for adding activity regularization to deep learning neural network models.
After completing this tutorial, you will know:
How to create vector norm regularizers using the Keras API.
How to add activity regularization to MLP, CNN, and RNN layers using the Keras API.
How to reduce overfitting by adding activity regularization to an existing model.
Let’s get started.
How to Reduce Generalization Error in Deep Neural Networks With Activity Regularization in Keras Photo by Johan Neven, some rights reserved.
Tutorial Overview
This tutorial is divided into three parts; they are:
Activity Regularization in Keras
Activity Regularization on Layers
Activity Regularization Case Study
Activity Regularization in Keras
Keras supports activity regularization.
There are three different regularization techniques supported, each provided as a class in the keras.regularizers module:
l1: Activity is calculated as the sum of absolute values.
l2: Activity is calculated as the sum of the squared values.
l1_l2: Activity is calculated as the sum of absolute values and the sum of squared values.
Each of the l1 and l2 regularizers takes a single hyperparameter that controls the amount that each activity contributes to the sum. The l1_l2 regularizer takes two hyperparameters, one for each of the l1 and l2 methods.
The regularizer class must be imported and then instantiated; for example:
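For example, an l1 regularizer can be imported and instantiated with its hyperparameter as follows:

```python
# import and instantiate an activity regularizer
from keras.regularizers import l1
reg = l1(0.001)
```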
Activity Regularization on Layers
Activity regularization is specified on a layer in Keras.
This can be achieved by setting the activity_regularizer argument on the layer to an instantiated and configured regularizer class.
The regularizer is applied to the output of the layer, but you have control over what the “output” of the layer actually means. Specifically, you have flexibility as to whether the layer output means that the regularization is applied before or after the ‘activation‘ function.
For example, you can specify the activation function and the regularization on the layer, in which case activation regularization is applied to the output of the activation function, in this case, relu.
Alternately, you can specify a linear activation function (the default, which does not perform any transform), which means that the activation regularization is applied to the raw outputs; the activation function can then be added as a subsequent layer.
The latter is probably the preferred usage of activation regularization as described in “Deep Sparse Rectifier Neural Networks” in order to allow the model to learn to take activations to a true zero value in conjunction with the rectified linear activation function. Nevertheless, the two possible uses of activation regularization may be explored in order to discover what works best for your specific model and dataset.
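The two placements can be sketched as follows, using a Dense layer with l1 activity regularization as in the examples in this tutorial:

```python
# regularization applied after the activation function: specify the
# activation and the regularizer on the same layer
from keras.layers import Activation, Dense
from keras.regularizers import l1
...
model.add(Dense(32, activation='relu', activity_regularizer=l1(0.001)))
...
# regularization applied to the raw (linear) outputs, with the
# activation function added as a subsequent layer
model.add(Dense(32, activation='linear', activity_regularizer=l1(0.001)))
model.add(Activation('relu'))
...
```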
Let’s take a look at how activity regularization can be used with some common layer types.
MLP Activity Regularization
The example below sets l1 norm activity regularization on a Dense fully connected layer.
# example of l1 norm on activity from a dense layer
from keras.layers import Dense
from keras.regularizers import l1
...
model.add(Dense(32, activity_regularizer=l1(0.001)))
...
CNN Activity Regularization
The example below sets l1 norm activity regularization on a Conv2D convolutional layer.
# example of l1 norm on activity from a cnn layer
from keras.layers import Conv2D
from keras.regularizers import l1
...
model.add(Conv2D(32, (3,3), activity_regularizer=l1(0.001)))
...
RNN Activity Regularization
The example below sets l1 norm activity regularization on an LSTM recurrent layer.
# example of l1 norm on activity from an lstm layer
from keras.layers import LSTM
from keras.regularizers import l1
...
model.add(LSTM(32, activity_regularizer=l1(0.001)))
...
Now that we know how to use the activity regularization API, let’s look at a worked example.
Activity Regularization Case Study
In this section, we will demonstrate how to use activity regularization to reduce overfitting of an MLP on a simple binary classification problem.
Although activity regularization is most often used to encourage sparse learned representations in autoencoder and encoder-decoder models, it can also be used directly within normal neural networks to achieve the same effect and improve the generalization of the model.
This example provides a template for applying activity regularization to your own neural network for classification and regression problems.
Binary Classification Problem
We will use a standard binary classification problem that defines two two-dimensional concentric circles of observations, one circle for each class.
Each observation has two input variables with the same scale and a class output value of either 0 or 1. This dataset is called the “circles” dataset because of the shape of the observations in each class when plotted.
We can use the make_circles() function to generate observations from this problem. We will add noise to the data and seed the random number generator so that the same samples are generated each time the code is run.
We can plot the dataset where the two variables are taken as x and y coordinates on a graph and the class value is taken as the color of the observation.
The complete example of generating the dataset and plotting it is listed below.
# generate two circles dataset
from sklearn.datasets import make_circles
from matplotlib import pyplot
from pandas import DataFrame
# generate 2d classification dataset
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
# scatter plot, dots colored by class value
df = DataFrame(dict(x=X[:,0], y=X[:,1], label=y))
colors = {0:'red', 1:'blue'}
fig, ax = pyplot.subplots()
grouped = df.groupby('label')
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', label=key, color=colors[key])
pyplot.show()
Running the example creates a scatter plot showing the concentric circles shape of the observations in each class.
We can see the noise in the dispersal of the points making the circles less obvious.
Scatter Plot of Circles Dataset with Color Showing the Class Value of Each Sample
This is a good test problem because the classes cannot be separated by a line, e.g. are not linearly separable, requiring a nonlinear method such as a neural network to address.
We have only generated 100 samples, which is small for a neural network, providing the opportunity to overfit the training dataset and have higher error on the test dataset: a good case for using regularization.
Further, the samples have noise, giving the model an opportunity to learn aspects of the samples that don’t generalize.
Overfit Multilayer Perceptron
We can develop an MLP model to address this binary classification problem.
The model will have one hidden layer with more nodes than may be required to solve this problem, providing an opportunity to overfit. We will also train the model for longer than is required to ensure the model overfits.
Before we define the model, we will split the dataset into train and test sets, using 30 examples to train the model and 70 to evaluate the fit model’s performance.
# generate 2d classification dataset
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
Next, we can define the model.
The hidden layer uses 500 nodes and the rectified linear activation function. A sigmoid activation function is used in the output layer in order to predict class values of 0 or 1.
The model is optimized using the binary cross entropy loss function, suitable for binary classification problems, and the efficient Adam version of gradient descent.
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
The defined model is then fit on the training data for 4,000 epochs and the default batch size of 32.
We will also use the test dataset as a validation dataset.
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
We can evaluate the performance of the model on the test dataset and report the result.
Finally, we will plot the performance of the model on both the train and test set each epoch.
If the model does indeed overfit the training dataset, we would expect the line plot of accuracy on the training set to continue to increase and the test set to rise and then fall again as the model learns statistical noise in the training dataset.
# plot history
pyplot.plot(history.history['acc'], label='train')
pyplot.plot(history.history['val_acc'], label='test')
pyplot.legend()
pyplot.show()
We can tie all of these pieces together; the complete example is listed below.
# mlp overfit on the two circles dataset
from sklearn.datasets import make_circles
from keras.layers import Dense
from keras.models import Sequential
from matplotlib import pyplot
# generate 2d classification dataset
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot history
pyplot.plot(history.history['acc'], label='train')
pyplot.plot(history.history['val_acc'], label='test')
pyplot.legend()
pyplot.show()
Running the example reports the model performance on the train and test datasets.
We can see that the model has better performance on the training dataset than the test dataset, one possible sign of overfitting.
Your specific results may vary given the stochastic nature of the neural network and the training algorithm. Because the model is severely overfit, we generally would not expect much, if any, variance in the accuracy across repeated runs of the model on the same dataset.
Train: 1.000, Test: 0.786
A figure is created showing line plots of the model accuracy on the train and test sets.
We can see the expected shape of an overfit model where test accuracy increases to a point and then begins to decrease again.
Line Plots of Accuracy on Train and Test Datasets While Training Showing an Overfit
Overfit MLP With Activation Regularization
We can update the example to use activation regularization.
There are a few different regularization methods to choose from, but it is probably a good idea to use the most common, which is the L1 vector norm.
This regularization has the effect of encouraging a sparse representation (lots of zeros), which is supported by the rectified linear activation function that permits true zero values.
We can do this by using the keras.regularizers.l1 class in Keras.
We will configure the layer to use the linear activation function so that we can regularize the raw outputs, then add a relu activation layer after the regularized outputs of the layer. We will set the regularization hyperparameter to 1E-4 or 0.0001, found with a little trial and error.
The complete updated example with the L1 norm constraint is listed below:
# mlp overfit on the two circles dataset with activation regularization
from sklearn.datasets import make_circles
from keras.layers import Dense
from keras.models import Sequential
from keras.regularizers import l1
from keras.layers import Activation
from matplotlib import pyplot
# generate 2d classification dataset
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='linear', activity_regularizer=l1(0.0001)))
model.add(Activation('relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot history
pyplot.plot(history.history['acc'], label='train')
pyplot.plot(history.history['val_acc'], label='test')
pyplot.legend()
pyplot.show()
Running the example reports the model performance on the train and test datasets.
We can see that activity regularization resulted in a slight drop in accuracy on the training dataset, down from 100% to about 97%, and a lift in accuracy on the test set, up from about 79% to 83%.
Train: 0.967, Test: 0.829
Reviewing the line plot of train and test accuracy, we can see that it no longer appears that the model has overfit the training dataset.
Model accuracy on both the train and test sets continues to increase to a plateau.
Line Plots of Accuracy on Train and Test Datasets While Training With Activity Regularization
For completeness, we can compare results to a version of the model where activity regularization is applied after the relu activation function.
# mlp overfit on the two circles dataset with activation regularization
from sklearn.datasets import make_circles
from keras.layers import Dense
from keras.models import Sequential
from keras.regularizers import l1
from matplotlib import pyplot
# generate 2d classification dataset
X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu', activity_regularizer=l1(0.0001)))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot history
pyplot.plot(history.history['acc'], label='train')
pyplot.plot(history.history['val_acc'], label='test')
pyplot.legend()
pyplot.show()
Running the example reports the model performance on the train and test datasets.
We can see that, at least on this problem and with this model, activation regularization after the activation function did not improve generalization error; in fact, it made it worse.
Train: 1.000, Test: 0.743
Reviewing the line plot of train and test accuracy, we can see that indeed the model still shows the signs of having overfit the training dataset.
Line Plots of Accuracy on Train and Test Datasets While Training With Activity Regularization, Still Overfit
This suggests that it may be worth experimenting with both approaches for implementing activity regularization with your own dataset, to confirm that you are getting the most out of the method.
Extensions
This section lists some ideas for extending the tutorial that you may wish to explore.
Report Activation Mean. Update the example to calculate the mean activation of the regularized layer and confirm that indeed the activations have been made more sparse.
Grid Search. Update the example to grid search different values for the regularization hyperparameter.
Alternate Norm. Update the example to evaluate the L2 or L1_L2 vector norm for regularizing the hidden layer outputs.
Repeated Evaluation. Update the example to fit and evaluate the model multiple times and report the mean and standard deviation of model performance.
If you explore any of these extensions, I’d love to know.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Deep learning models are capable of automatically learning a rich internal representation from raw input data.
This is called feature or representation learning. Better learned representations, in turn, can lead to better insights into the domain, e.g. via visualization of learned features, and to better predictive models that make use of the learned features.
A problem with learned features is that they can be too specialized to the training data, or overfit, and not generalize well to new examples. Large values in the learned representation can be a sign of the representation being overfit. Activity or representation regularization provides a technique to encourage the learned representations, the output or activation of the hidden layer or layers of the network, to stay small and sparse.
In this post, you will discover activation regularization as a technique to improve the generalization of learned features in neural networks.
After reading this post, you will know:
Neural networks learn features from data and models, such as autoencoders and encoder-decoder models, explicitly seek effective learned representations.
Similar to weights, large values in learned features, e.g. large activations, may indicate an overfit model.
The addition of penalties to the loss function that penalize a model in proportion to the magnitude of the activations may result in more robust and generalized learned features.
Let’s get started.
Activation Regularization for Reducing Generalization Error in Deep Learning Neural Networks Photo by Nicholas A. Tonelli, some rights reserved.
Overview
This tutorial is divided into five parts; they are:
Problem With Learned Features
Encourage Small Activations
How to Encourage Small Activations
Examples of Activation Regularization
Tips for Using Activation Regularization
Problem With Learned Features
Deep learning models are able to perform feature learning.
That is, during the training of the network, the model will automatically extract the salient features from the input patterns or “learn features.” These features may be used in the network in order to predict a quantity for regression or predict a class value for classification.
These internal representations are tangible things. The output of a hidden layer within the network represents the features learned by the model at that point in the network.
There is a field of study focused on the efficient and effective automatic learning of features, often investigated by having a network reduce an input to a small learned feature before using a second network to reconstruct the original input from the learned feature. Models of this type are called auto-encoders, or encoder-decoders, and their learned features can be useful to learn more about the domain (e.g. via visualization) and in predictive models.
The learned features, or “encoded inputs,” must be large enough to capture the salient features of the input but also focused enough to not over-fit the specific examples in the training dataset. As such, there is a tension between the expressiveness and the generalization of the learned features.
More importantly, when the dimension of the code in an encoder-decoder architecture is larger than the input, it is necessary to limit the amount of information carried by the code, lest the encoder-decoder may simply learn the identity function in a trivial way and produce uninteresting features.
In the same way that large weights in the network can signify an unstable and overfit model, large output values in the learned features can signify the same problems.
It is desirable to have small values in the learned features, e.g. small outputs or activations from the encoder network.
Encourage Small Activations
The loss function of the network can be updated to penalize models in proportion to the magnitude of their activation.
This is similar to “weight regularization” where the loss function is updated to penalize the model in proportion to the magnitude of the weights. The output of a layer is referred to as its ‘activation,’ as such, this form of penalty or regularization is referred to as ‘activation regularization.’
… place a penalty on the activations of the units in a neural network, encouraging their activations to be sparse.
The output of an encoder or, generally, the output of a hidden layer in a neural network may be considered the representation of the problem at that point in the model. As such, this type of penalty may also be referred to as ‘representation regularization.’
The desire to have small activations or even very few activations with mostly zero values is also called a desire for sparsity. As such, this type of penalty is also referred to as ‘sparse feature learning.’
One way to limit the information content of an overcomplete code is to make it sparse.
The encouragement of sparse learned features in autoencoder models is referred to as ‘sparse autoencoders.’
A sparse autoencoder is simply an autoencoder whose training criterion involves a sparsity penalty on the code layer, in addition to the reconstruction error
Sparsity is most commonly sought when a larger-than-required hidden layer (e.g. over-complete) is used to learn features that may encourage over-fitting. The introduction of a sparsity penalty counters this problem and encourages better generalization.
A sparse overcomplete learned feature has been shown to be more effective than other types of learned features, offering better robustness to noise and even to transformations of the input, e.g. learned features of images may have improved invariance to the position of objects in the image.
Sparse-overcomplete representations have a number of theoretical and practical advantages, as demonstrated in a number of recent studies. In particular, they have good robustness to noise, and provide a good tiling of the joint space of location and frequency. In addition, they are advantageous for classifiers because classification is more likely to be easier in higher dimensional spaces.
There is a general focus on sparsity of the representations rather than small vector magnitudes. A study of these representations that is more general than the use of neural networks is known as ‘sparse coding.’
Sparse coding provides a class of algorithms for finding succinct representations of stimuli; given only unlabeled input data, it learns basis functions that capture higher-level features in the data.
An activation penalty can be applied per-layer, perhaps only at one layer that is the focus of the learned representation, such as the output of the encoder model or the middle (bottleneck) of an autoencoder model.
A constraint can be applied that adds a penalty proportional to the magnitude of the vector output of the layer.
The activation values may be positive or negative, so we cannot simply sum the values.
Two common methods for calculating the magnitude of the activation are:
Sum of the absolute activation values, called the L1 vector norm.
Sum of the squared activation values, called the L2 vector norm.
The L1 norm encourages sparsity, e.g. allows some activations to become zero, whereas the L2 norm encourages small activation values in general. The L1 norm may be the more commonly used penalty for activation regularization.
A hyperparameter must be specified that indicates the amount or degree that the loss function will weight or pay attention to the penalty. Common values are on a logarithmic scale between 0 and 0.1, such as 0.1, 0.001, 0.0001, etc.
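The two penalties and the weighting hyperparameter described above can be sketched in a few lines of plain Python. This is a toy illustration only; the activation values are made up and no neural network library is involved.

```python
# Toy sketch of activation regularization penalties (illustrative values only).
activations = [0.5, -1.5, 0.0, 2.0]  # hypothetical outputs of a hidden layer

# L1 vector norm: sum of absolute activation values (encourages exact zeros)
l1_penalty = sum(abs(a) for a in activations)

# L2 vector norm: sum of squared activation values (encourages small values)
l2_penalty = sum(a * a for a in activations)

# the regularization hyperparameter weights how much the penalty
# contributes to the loss function being optimized
coef = 0.001
loss_addition = coef * l1_penalty
```

For the values above, the L1 penalty is 4.0 and the L2 penalty is 6.5; during training the weighted penalty would simply be added to the base loss.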
Activity regularization can be used in conjunction with other regularization techniques, such as weight regularization.
Examples of Activation Regularization
This section provides some examples of activation regularization in order to provide some context for how the technique may be used in practice.
Regularized or sparse activations were originally sought as an approach to support the development of much deeper neural networks, early in the history of deep learning. As such, many examples may make use of architectures like restricted Boltzmann machines (RBMs) that have been replaced by more modern methods. Another big application of activation regularization is in autoencoders trained with semi-labeled or unlabeled data, so-called sparse autoencoders.
Xavier Glorot, et al. at the University of Montreal introduced the use of the rectified linear activation function to encourage sparsity of representation. They used an L1 penalty and evaluated deep supervised MLPs on a range of classical computer vision classification tasks such as MNIST and CIFAR10.
Additionally, an L1 penalty on the activations with a coefficient of 0.001 was added to the cost function during pre-training and fine-tuning in order to increase the amount of sparsity in the learned representations
Stephen Merity, et al. from Salesforce Research used L2 activation regularization with LSTMs on outputs and recurrent outputs for natural language processing in conjunction with dropout regularization. They tested a suite of different activation regularization coefficient values on a range of language modeling problems.
While simple to implement, activity regularization and temporal activity regularization are competitive with other far more complex regularization techniques and offer equivalent or better results.
Tips for Using Activation Regularization
This section provides some tips for using activation regularization with your neural network.
Use With All Network Types
Activation regularization is a generic approach.
It can be used with most, perhaps all, types of neural network models, not least the most common network types of Multilayer Perceptrons, Convolutional Neural Networks, and Long Short-Term Memory Recurrent Neural Networks.
Use With Autoencoders and Encoder-Decoders
Activity regularization may be best suited to those model types that explicitly seek an efficient learned representation.
These include models such as autoencoders (i.e. sparse autoencoders) and encoder-decoder models, such as encoder-decoder LSTMs used for sequence-to-sequence prediction problems.
Experiment With Different Norms
The most common activation regularization is the L1 norm as it encourages sparsity.
Experiment with other types of regularization such as the L2 norm or using both the L1 and L2 norms at the same time, e.g. like the Elastic Net linear regression algorithm.
Use Rectified Linear
The rectified linear activation function, also called relu, is an activation function that is now widely used in the hidden layer of deep neural networks.
Unlike classical activation functions such as tanh (hyperbolic tangent function) and sigmoid (logistic function), the relu function allows exact zero values easily. This makes it a good candidate when learning sparse representations, such as with the L1 vector norm activation regularization.
Grid Search Parameters
It is common to use small values for the regularization hyperparameter that controls the contribution of each activation to the penalty.
Perhaps start by testing values on a log scale, such as 0.1, 0.001, and 0.0001. Then use a grid search at the order of magnitude that shows the most promise.
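The coarse-to-fine search described above might be sketched as follows, where evaluate() is a hypothetical stand-in for fitting and scoring a model with a given coefficient; in practice it would train the network and return validation accuracy.

```python
import math

def evaluate(coef):
    # hypothetical scoring function for illustration only; a real version
    # would fit a model with this regularization coefficient and score it
    return -abs(math.log10(coef) - math.log10(0.001))

# coarse pass over orders of magnitude
coarse = [0.1, 0.01, 0.001, 0.0001]
best = max(coarse, key=evaluate)

# finer grid around the most promising order of magnitude
fine = [best * f for f in (0.25, 0.5, 1.0, 2.0, 4.0)]
best_fine = max(fine, key=evaluate)
```

The same two-stage pattern applies whatever the scoring function: locate the right order of magnitude first, then search within it.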
Standardize Input Data
It is a generally good practice to rescale input variables to have the same scale.
When input variables have different scales, the scale of the weights of the network will, in turn, vary accordingly. Large weights can saturate the nonlinear transfer function and reduce the variance in the output from the layer. This may introduce a problem when using activation regularization.
This problem can be addressed by either normalizing or standardizing input variables.
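As a reminder of the difference between the two rescaling schemes, both can be sketched for a single input variable in plain Python (illustrative values only):

```python
values = [1.0, 2.0, 3.0, 4.0]  # hypothetical input variable

# normalization: rescale values to the range [0, 1]
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# standardization: rescale values to zero mean and unit standard deviation
mean = sum(values) / len(values)
std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
standardized = [(v - mean) / std for v in values]
```

In practice you would fit the rescaling statistics (min/max or mean/std) on the training set only and apply them to the test set.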
Use an Overcomplete Representation
Configure the layer chosen to be the learned features, e.g. the output of the encoder or the bottleneck in the autoencoder, to have more nodes than may be required.
This is called an overcomplete representation that will encourage the network to overfit the training examples. This can be countered with a strong activation regularization in order to encourage a rich learned representation that is also sparse.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
In this post, you discovered activation regularization as a technique to improve the generalization of learned features.
Specifically, you learned:
Neural networks learn features from data and models, such as autoencoders and encoder-decoder models, explicitly seek effective learned representations.
Similar to weights, large values in learned features, e.g. large activations, may indicate an overfit model.
The addition of penalties to the loss function that penalize a model in proportion to the magnitude of the activations may result in more robust and generalized learned features.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Weight constraints provide an approach to reduce the overfitting of a deep learning neural network model on the training data and improve the performance of the model on new data, such as the holdout test set.
There are multiple types of weight constraints, such as maximum and unit vector norms, and some require a hyperparameter that must be configured.
In this tutorial, you will discover the Keras API for adding weight constraints to deep learning neural network models to reduce overfitting.
After completing this tutorial, you will know:
How to create vector norm constraints using the Keras API.
How to add weight constraints to MLP, CNN, and RNN layers using the Keras API.
How to reduce overfitting by adding a weight constraint to an existing model.
Let’s get started.
How to Reduce Overfitting in Deep Neural Networks With Weight Constraints in Keras Photo by Ian Sane, some rights reserved.
Tutorial Overview
This tutorial is divided into three parts; they are:
Weight Constraints in Keras
Weight Constraints on Layers
Weight Constraint Case Study
Weight Constraints in Keras
The Keras API supports weight constraints.
The constraints are specified per-layer, but applied and enforced per-node within the layer.
Using a constraint generally involves setting the kernel_constraint argument on the layer for the input weights and the bias_constraint for the bias weights.
Generally, weight constraints are not used on the bias weights.
A suite of different vector norms can be used as constraints, provided as classes in the keras.constraints module. They are:
Maximum norm (max_norm), to force weights to have a magnitude at or below a given limit.
Non-negative norm (non_neg), to force weights to be non-negative.
Unit norm (unit_norm), to force weights to have a magnitude of 1.0.
Min-Max norm (min_max_norm), to force weights to have a magnitude between a range.
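The intent of these constraints can be sketched in plain Python for a single node's vector of incoming weights. Note this is a rough illustration of the rescaling logic only, not Keras's actual implementation (which, for example, applies constraints along an axis and supports a rate argument for min_max_norm).

```python
def l2(w):
    # L2 vector norm of a node's incoming weights
    return sum(x * x for x in w) ** 0.5

def max_norm(w, c=3.0):
    # rescale the weight vector only if its norm exceeds the limit c
    n = l2(w)
    return [x * c / n for x in w] if n > c else w

def non_neg(w):
    # clip negative weights to zero
    return [max(x, 0.0) for x in w]

def unit_norm(w):
    # rescale the weight vector so its norm is exactly 1.0
    n = l2(w)
    return [x / n for x in w]

w = [3.0, 4.0]  # norm 5.0
```

For the example vector, max_norm shrinks the norm from 5.0 down to 3.0, while unit_norm rescales it to exactly 1.0.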
For example, a constraint can be imported from the keras.constraints module and instantiated with its hyperparameter, e.g. max_norm(3).
The weight norms can be used with most layers in Keras.
In this section, we will look at some common examples.
MLP Weight Constraint
The example below sets a maximum norm weight constraint on a Dense fully connected layer.
# example of max norm on a dense layer
from keras.layers import Dense
from keras.constraints import max_norm
...
model.add(Dense(32, kernel_constraint=max_norm(3), bias_constraint=max_norm(3)))
...
CNN Weight Constraint
The example below sets a maximum norm weight constraint on a convolutional layer.
# example of max norm on a cnn layer
from keras.layers import Conv2D
from keras.constraints import max_norm
...
model.add(Conv2D(32, (3,3), kernel_constraint=max_norm(3), bias_constraint=max_norm(3)))
...
RNN Weight Constraint
Unlike other layer types, recurrent neural networks allow you to set a weight constraint on the input weights and bias, as well as on the recurrent input weights.
The constraint for the recurrent weights is set via the recurrent_constraint argument to the layer.
The example below sets a maximum norm weight constraint on an LSTM layer.
# example of max norm on an lstm layer
from keras.layers import LSTM
from keras.constraints import max_norm
...
model.add(LSTM(32, kernel_constraint=max_norm(3), recurrent_constraint=max_norm(3), bias_constraint=max_norm(3)))
...
Now that we know how to use the weight constraint API, let’s look at a worked example.
Weight Constraint Case Study
In this section, we will demonstrate how to use weight constraints to reduce overfitting of an MLP on a simple binary classification problem.
This example provides a template for applying weight constraints to your own neural network for classification and regression problems.
Binary Classification Problem
We will use a standard binary classification problem that defines two semi-circles of observations, one semi-circle for each class.
Each observation has two input variables with the same scale and a class output value of either 0 or 1. This dataset is called the “moons” dataset because of the shape of the observations in each class when plotted.
We can use the make_moons() function to generate observations from this problem. We will add noise to the data and seed the random number generator so that the same samples are generated each time the code is run.
We can plot the dataset where the two variables are taken as x and y coordinates on a graph and the class value is taken as the color of the observation.
The complete example of generating the dataset and plotting it is listed below.
# generate two moons dataset
from sklearn.datasets import make_moons
from matplotlib import pyplot
from pandas import DataFrame
# generate 2d classification dataset
X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
# scatter plot, dots colored by class value
df = DataFrame(dict(x=X[:,0], y=X[:,1], label=y))
colors = {0:'red', 1:'blue'}
fig, ax = pyplot.subplots()
grouped = df.groupby('label')
for key, group in grouped:
group.plot(ax=ax, kind='scatter', x='x', y='y', label=key, color=colors[key])
pyplot.show()
Running the example creates a scatter plot showing the semi-circle or moon shape of the observations in each class. We can see the noise in the dispersal of the points making the moons less obvious.
Scatter Plot of Moons Dataset With Color Showing the Class Value of Each Sample
This is a good test problem because the classes cannot be separated by a line, i.e. they are not linearly separable, requiring a nonlinear method such as a neural network to address.
We have only generated 100 samples, which is small for a neural network, providing the opportunity to overfit the training dataset and have higher error on the test dataset: a good case for using regularization. Further, the samples have noise, giving the model an opportunity to learn aspects of the samples that don’t generalize.
Overfit Multilayer Perceptron
We can develop an MLP model to address this binary classification problem.
The model will have one hidden layer with more nodes than may be required to solve this problem, providing an opportunity to overfit. We will also train the model for longer than is required to ensure the model overfits.
Before we define the model, we will split the dataset into train and test sets, using 30 examples to train the model and 70 to evaluate the fit model’s performance.
# generate 2d classification dataset
X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
Next, we can define the model.
The hidden layer uses 500 nodes and the rectified linear activation function. A sigmoid activation function is used in the output layer in order to predict class values of 0 or 1.
The model is optimized using the binary cross entropy loss function, suitable for binary classification problems, and the efficient Adam version of gradient descent.
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
The defined model is then fit on the training data for 4,000 epochs and the default batch size of 32.
We will also use the test dataset as a validation dataset.
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
We can evaluate the performance of the model on the test dataset and report the result.
Finally, we will plot the performance of the model on both the train and test set each epoch.
If the model does indeed overfit the training dataset, we would expect the line plot of accuracy on the training set to continue to increase and the test set to rise and then fall again as the model learns statistical noise in the training dataset.
# plot history
pyplot.plot(history.history['acc'], label='train')
pyplot.plot(history.history['val_acc'], label='test')
pyplot.legend()
pyplot.show()
We can tie all of these pieces together; the complete example is listed below.
# mlp overfit on the moons dataset
from sklearn.datasets import make_moons
from keras.layers import Dense
from keras.models import Sequential
from matplotlib import pyplot
# generate 2d classification dataset
X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot history
pyplot.plot(history.history['acc'], label='train')
pyplot.plot(history.history['val_acc'], label='test')
pyplot.legend()
pyplot.show()
Running the example reports the model performance on the train and test datasets.
We can see that the model has better performance on the training dataset than the test dataset, one possible sign of overfitting.
Your specific results may vary given the stochastic nature of the neural network and the training algorithm. Because the model is overfit, we generally would not expect much, if any, variance in the accuracy across repeated runs of the model on the same dataset.
Train: 1.000, Test: 0.914
A figure is created showing line plots of the model accuracy on the train and test sets.
We can see the expected shape of an overfit model, where test accuracy increases to a point and then begins to decrease again.
Line Plots of Accuracy on Train and Test Datasets While Training Showing an Overfit
Overfit MLP With Weight Constraint
We can update the example to use a weight constraint.
There are a few different weight constraints to choose from. A good simple constraint for this model is to normalize the weights so that the norm is equal to 1.0.
This constraint has the effect of forcing all incoming weights to be small.
We can do this by using the unit_norm constraint in Keras, added to the first hidden layer via the kernel_constraint argument.
The complete updated example with the unit norm constraint is listed below:
# mlp overfit on the moons dataset with a unit norm constraint
from sklearn.datasets import make_moons
from keras.layers import Dense
from keras.models import Sequential
from keras.constraints import unit_norm
from matplotlib import pyplot
# generate 2d classification dataset
X, y = make_moons(n_samples=100, noise=0.2, random_state=1)
# split into train and test
n_train = 30
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(500, input_dim=2, activation='relu', kernel_constraint=unit_norm()))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot history
pyplot.plot(history.history['acc'], label='train')
pyplot.plot(history.history['val_acc'], label='test')
pyplot.legend()
pyplot.show()
Running the example reports the model performance on the train and test datasets.
We can see that indeed the strict constraint on the size of the weights has improved the performance of the model on the holdout set without impacting performance on the training set.
Train: 1.000, Test: 0.943
Reviewing the line plot of train and test accuracy, we can see that it no longer appears that the model has overfit the training dataset.
Model accuracy on both the train and test sets continues to increase to a plateau.
Line Plots of Accuracy on Train and Test Datasets While Training With Weight Constraints
Extensions
This section lists some ideas for extending the tutorial that you may wish to explore.
Report Weight Norm. Update the example to calculate the magnitude of the network weights and demonstrate that the constraint indeed made the magnitude smaller.
Constrain Output Layer. Update the example to add a constraint to the output layer of the model and compare the results.
Constrain Bias. Update the example to add a constraint to the bias weight and compare the results.
Repeated Evaluation. Update the example to fit and evaluate the model multiple times and report the mean and standard deviation of model performance.
If you explore any of these extensions, I’d love to know.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Weight regularization methods like weight decay introduce a penalty to the loss function when training a neural network to encourage the network to use small weights.
Smaller weights in a neural network can result in a model that is more stable and less likely to overfit the training dataset, in turn having better performance when making a prediction on new data.
Unlike weight regularization, a weight constraint is a trigger that checks the size or magnitude of the weights and scales them so that they are all below a pre-defined threshold. The constraint forces weights to be small and can be used instead of weight decay and in conjunction with more aggressive network configurations, such as very large learning rates.
In this post, you will discover the use of weight constraint regularization as an alternative to weight penalties to reduce overfitting in deep neural networks.
After reading this post, you will know:
Weight penalties encourage but do not require neural networks to have small weights.
Weight constraints, such as the L2 norm and maximum norm, can be used to force neural networks to have small weights during training.
Weight constraints can improve generalization when used in conjunction with other regularization methods like dropout.
Let’s get started.
A Gentle Introduction to Weight Constraints to Reduce Generalization Error in Deep Learning Photo by Dawn Ellner, some rights reserved.
Overview
This tutorial is divided into five parts; they are:
Alternative to Penalties for Large Weights
Force Small Weights
How to Use a Weight Constraint
Example Uses of Weight Constraints
Tips for Using Weight Constraints
Alternative to Penalties for Large Weights
Large weights in a neural network are a sign of overfitting.
A network with large weights has very likely learned the statistical noise in the training data. This results in a model that is unstable, and very sensitive to changes to the input variables. In turn, the overfit network has poor performance when making predictions on new unseen data.
A popular and effective technique to address the problem is to update the loss function that is optimized during training to take the size of the weights into account.
This is called a penalty, as the larger the weights of the network become, the more the network is penalized, resulting in larger loss and, in turn, larger updates. The effect is that the penalty encourages weights to be small, or no larger than is required during the training process, in turn reducing overfitting.
A problem in using a penalty is that although it does encourage the network toward smaller weights, it does not force smaller weights.
A neural network trained with weight regularization penalty may still allow large weights, in some cases very large weights.
Force Small Weights
An alternate solution to using a penalty for the size of network weights is to use a weight constraint.
A weight constraint is an update to the network that checks the size of the weights, and if the size exceeds a predefined limit, the weights are rescaled so that their size is below the limit or between a range.
You can think of a weight constraint as an if-then rule checking the size of the weights while the network is being trained and only coming into effect and making weights small when required. Note, for efficiency, it does not have to be implemented as an if-then rule and often is not.
Unlike adding a penalty to the loss function, a weight constraint ensures the weights of the network are small, instead of merely encouraging them to be small.
It can be useful on those problems or with networks that resist other regularization methods, such as weight penalties.
Weight constraints prove especially useful when you have configured your network to use alternative regularization methods to weight regularization and yet still desire the network to have small weights in order to reduce overfitting. One often-cited example is the use of a weight constraint regularization with dropout regularization.
Although dropout alone gives significant improvements, using dropout along with [weight constraint] regularization, […] provides a significant boost over just using dropout.
How to Use a Weight Constraint
A constraint is enforced on each node within a layer.
All nodes within the layer use the same constraint, and often multiple hidden layers within the same network will use the same constraint.
Recall that when we talk about the vector norm in general, this is the magnitude of the vector of weights in a node, which by default is calculated as the L2 norm, e.g. the square root of the sum of the squared values in the vector.
Some examples of constraints that could be used include:
Force the vector norm to be 1.0 (e.g. the unit norm).
Limit the maximum size of the vector norm (e.g. the maximum norm).
Limit the minimum and maximum size of the vector norm (e.g. the min_max norm).
The maximum norm, also called max-norm or maxnorm, is a popular constraint because it is less aggressive than other norms such as the unit norm, simply setting an upper bound.
Max-norm regularization has been previously used […] It typically improves the performance of stochastic gradient descent training of deep neural nets …
When using a limit or a range, a hyperparameter must be specified. Given that weights are small, the hyperparameter too is often a small integer value, such as a value between 1 and 4.
… we can use max-norm regularization. This constrains the norm of the vector of incoming weights at each hidden unit to be bound by a constant c. Typical values of c range from 3 to 4.
If the norm exceeds the specified range or limit, the weights are rescaled or normalized such that their magnitude is below the specified parameter or within the specified range.
If a weight-update violates this constraint, we renormalize the weights of the hidden unit by division. Using a constraint rather than a penalty prevents weights from growing very large no matter how large the proposed weight-update is.
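The renormalization by division described in the quote above can be sketched in a few lines of plain Python. This is a toy illustration of the rule for one node's incoming weight vector, not any library's implementation.

```python
def constrain_max_norm(w, c=3.0):
    # if the L2 norm of the node's incoming weights exceeds the constant c,
    # divide the weights so the norm equals c; otherwise leave them unchanged
    norm = sum(x * x for x in w) ** 0.5
    if norm > c:
        w = [x * c / norm for x in w]
    return w
```

For example, constrain_max_norm([6.0, 8.0]) rescales a vector of norm 10.0 down to norm 3.0, while a vector already below the limit passes through untouched, no matter how large the preceding weight update was.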
All layers had L2 weight constraints on the incoming weights of each hidden unit.
Example Uses of Weight Constraints
Nitish Srivastava, et al. in their 2014 paper titled “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” used a maxnorm constraint with an MLP on the MNIST handwritten digit classification task and with CNNs on the streetview house numbers dataset with the parameter configured via a holdout validation set.
Max-norm regularization was used for weights in both convolutional and fully connected layers.
Jan Chorowski, et al. in their 2015 paper titled “Attention-Based Models for Speech Recognition” use LSTM and attention models for speech recognition with a max norm constraint set to 1.
We first trained our models with a column norm constraint with the maximum norm 1 …
Tips for Using Weight Constraints
This section provides some tips for using weight constraints with your neural network.
Use With All Network Types
Weight constraints are a generic approach.
They can be used with most, perhaps all, types of neural network models, not least the most common network types of Multilayer Perceptrons, Convolutional Neural Networks, and Long Short-Term Memory Recurrent Neural Networks.
In the case of LSTMs, it may be desirable to use different constraints or constraint configurations for the input and recurrent connections.
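As a rough sketch of this idea, the input-to-hidden and hidden-to-hidden weight matrices of an LSTM can be constrained separately, with a different limit for each. The NumPy code below is illustrative only; the matrix shapes and the c values of 3.0 and 1.0 are assumptions, not recommendations. (In Keras, the same split is exposed via the kernel_constraint and recurrent_constraint arguments of the LSTM layer.)

```python
import numpy as np

def max_norm(weights, c):
    # rescale any column whose L2 norm exceeds c
    norms = np.linalg.norm(weights, axis=0, keepdims=True)
    return weights * (np.clip(norms, 0, c) / (norms + 1e-12))

n_input, n_hidden = 8, 16
rng = np.random.default_rng(7)
# LSTM weights: input-to-hidden (kernel) and hidden-to-hidden (recurrent),
# with the four gates stacked along the second axis
W_x = rng.normal(scale=1.5, size=(n_input, 4 * n_hidden))
W_h = rng.normal(scale=1.5, size=(n_hidden, 4 * n_hidden))

# looser constraint on input connections, tighter on recurrent connections
W_x = max_norm(W_x, c=3.0)
W_h = max_norm(W_h, c=1.0)
```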
Standardize Input Data
It is a good general practice to rescale input variables to have the same scale.
When input variables have different scales, the scale of the weights of the network will, in turn, vary accordingly. This introduces a problem when using weight constraints because large weights will cause the constraint to trigger more frequently.
This problem can be addressed by either normalizing or standardizing the input variables.
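For example, standardization rescales each input variable to zero mean and unit variance, so that no single variable forces the weights onto a different scale. A minimal sketch with two variables on very different scales (the feature values here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# two input variables on very different scales
X = np.column_stack([
    rng.normal(loc=50.0, scale=10.0, size=100),    # a large-scale feature
    rng.normal(loc=0.001, scale=0.0002, size=100), # a tiny-scale feature
])

# standardize: zero mean, unit variance per column
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```

After this transform, the incoming weights for both variables live on comparable scales, so a single constraint value can be applied uniformly across the layer.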
Use a Larger Learning Rate
The use of a weight constraint allows you to be more aggressive during the training of the network.
Specifically, a larger learning rate can be used, allowing the network to, in turn, make larger updates to the weights each update.
This is cited as an important benefit of using weight constraints, for example when a constraint is used in conjunction with dropout:
Using a constraint rather than a penalty prevents weights from growing very large no matter how large the proposed weight-update is. This makes it possible to start with a very large learning rate which decays during learning, thus allowing a far more thorough search of the weight-space than methods that start with small weights and use a small learning rate.
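The interaction between a large learning rate and the constraint can be sketched as follows: each update is allowed to be large, and the projection afterwards keeps the weights bounded regardless. The random "gradients" below are a stand-in for real gradients, and the learning rate and limit are arbitrary illustrative values.

```python
import numpy as np

def max_norm(weights, c):
    # rescale any column whose L2 norm exceeds c
    norms = np.linalg.norm(weights, axis=0, keepdims=True)
    return weights * (np.clip(norms, 0, c) / (norms + 1e-12))

rng = np.random.default_rng(42)
W = rng.normal(size=(5, 3))
c, lr = 2.0, 1.0  # aggressively large learning rate, illustrative limit

for step in range(100):
    grad = rng.normal(size=W.shape)  # stand-in for a real gradient
    W -= lr * grad                   # the large update is allowed...
    W = max_norm(W, c)               # ...then projected back inside the ball
```

No matter how large the proposed update, the weight norms never exceed c after the projection, which is what makes the aggressive learning rate safe.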
Try Other Constraints
Explore the use of other weight constraints, such as a minimum and maximum range, non-negative weights, and more.
You may also choose to use constraints on some weights and not others, such as not using constraints on bias weights in an MLP or not using constraints on recurrent connections in an LSTM.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Books
7.2 Norm Penalties as Constrained Optimization, Deep Learning, 2016.
Summary
In this post, you discovered the use of weight constraint regularization as an alternative to weight penalties to reduce overfitting in deep neural networks.
Specifically, you learned:
Weight penalties encourage but do not require neural networks to have small weights.
Weight constraints such as the L2 norm and maximum norm can be used to force neural networks to have small weights during training.
Weight constraints can improve generalization when used in conjunction with other regularization methods, like dropout.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.