In this post, I’m going to cover the very important deep learning concept called transfer learning. Transfer learning is the process whereby one uses neural network models trained in a related domain to accelerate the development of accurate models in your more specific domain of interest. For instance, a deep learning practitioner can use one of the state-of-the-art image classification models, already trained, as a starting point for their own, more specialized, image classification task. In this tutorial, I’ll be showing you how to perform transfer learning using an advanced, pre-trained image classification model – ResNet50 – to improve a more specific image classification task – the cats vs dogs classification problem. In particular, I’ll be showing you how to do this using TensorFlow 2. The code for this tutorial, in a Google Colaboratory notebook format, can be found on this site’s Github repository here. This code borrows some components from the official TensorFlow tutorial.
Eager to build deep learning systems? Get the book here
What are the benefits of transfer learning?
Transfer learning has several key benefits:
It speeds up learning: For state-of-the-art results in deep learning, one often needs to build very deep networks with many layers. Training such networks from scratch requires lots of data, computational power and time – three things that are often not readily available. Starting from a pre-trained model sidesteps most of this cost.
It needs less data: As will be shown, transfer learning usually only adds a few extra layers to the pre-trained model, and the weights in the pre-trained model are generally fixed. Therefore, during the fine-tuning of the model, only those few extra layers, or a small subset of the total number of layers, are subjected to training. This requires much less data to get good results.
You can leverage the expert tuning of state-of-the-art models: As anyone who has been involved in building deep learning systems can tell you, it requires a lot of patience and tuning of the models to get the best results. By utilizing pre-trained, state-of-the-art models, you can skip a lot of this arduous work and rely on the efforts of experts in the field.
For these reasons, if you are performing some image recognition task, it may be worth using some of the pre-trained, state-of-the-art image classification models, like ResNet, DenseNet, InceptionNet and so on. How does one use these pre-trained models?
How to create a transfer learning model
To create a transfer learning model, all that is required is to take the pre-trained layers and “bolt on” your own network. This could be either at the beginning or end of the pre-trained model. Usually, one disables the pre-trained layer weights and only trains the “bolted on” layers which have been added. For image classification transfer learning, one usually takes the convolutional neural network (CNN) layers from the pre-trained model and adds one or more densely connected “classification” layers at the end (for more on convolutional neural networks, see this tutorial). The pre-trained CNN layers act as feature extractors / maps, and the classification layer/s at the end can be “taught” to “interpret” these image features. The transfer learning model architecture that will be used in this example is shown below:
ResNet50 transfer learning architecture
The full ResNet50 model shown in the image above, in addition to a Global Average Pooling (GAP) layer, contains a 1000 node dense / fully connected layer which acts as a “classifier” of the 2048 feature maps (4 x 4 each, at the input size used in this example) output from the ResNet CNN layers. For more on Global Average Pooling, see my tutorial. In this transfer learning task, we’ll be removing these last two layers (GAP and Dense layer) and replacing these with our own GAP and dense layer (in this example, we have a binary classification task – hence the output size is only 1). The GAP layer has no trainable parameters, but the dense layer obviously does – these will be the only parameters trained in this example. All of this is performed quite easily in TensorFlow 2, as will be shown in the next section.
Transfer learning in TensorFlow 2
In this example, we’ll be using the pre-trained ResNet50 model and transfer learning to perform the cats vs dogs image classification task. I’ll also train a smaller CNN from scratch to show the benefits of transfer learning. To access the image dataset, we’ll be using the tensorflow_datasets package which contains a number of common machine learning datasets. To load the data, the following commands can be run:
import tensorflow as tf
from tensorflow.keras import layers
import tensorflow_datasets as tfds
split = (80, 10, 10)
splits = tfds.Split.TRAIN.subsplit(weighted=split)
(cat_train, cat_valid, cat_test), info = tfds.load('cats_vs_dogs', split=list(splits), with_info=True, as_supervised=True)
A few things to note about the code snippet above. First, the split tuple (80, 10, 10) signifies the (training, validation, test) split as percentages of the dataset. This is then passed to the tensorflow_datasets split object which tells the dataset loader how to break up the data. Finally, the tfds.load() function is invoked. The first argument is a string specifying the dataset name to load. Following arguments relate to whether a split should be used, whether to return an argument with information about the dataset (info) and whether the dataset is intended to be used in a supervised learning problem, with labels being included. The variables cat_train, cat_valid and cat_test are TensorFlow Dataset objects – to learn more about these, check out my previous post. In order to examine the images in the data set, the following code can be run:
import matplotlib.pylab as plt
for image, label in cat_train.take(2):
    plt.figure()
    plt.imshow(image)
This produces the following images: As can be observed, the images are of varying sizes. This will need to be rectified so that the images have a consistent size to feed into our model. As usual, the image pixel values (which range from 0 to 255) need to be normalized – in this case, to between 0 and 1. The function below performs these tasks:
In this example, we’ll be resizing the images to 100 x 100 using tf.image.resize. To get state of the art levels of accuracy, you would probably want a larger image size, say 200 x 200, but in this case I’ve chosen speed over accuracy for demonstration purposes. As can be observed, the image values are also cast into the tf.float32 datatype and normalized by dividing by 255. Next we apply this function to the datasets, and also shuffle and batch where appropriate:
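The preprocessing function might look like the following sketch – the function and constant names here are my own, but the operations (cast, normalize, resize) are those described above:

```python
import tensorflow as tf

IMAGE_SIZE = 100  # resize target discussed above

def pre_process_image(image, label):
    # Cast to float32 and normalize the pixel values to between 0 and 1
    image = tf.cast(image, tf.float32) / 255.0
    # Resize so that every image has a consistent size
    image = tf.image.resize(image, (IMAGE_SIZE, IMAGE_SIZE))
    return image, label

# Applied to the datasets loaded earlier, along with shuffling and batching, e.g.:
# cat_train = cat_train.map(pre_process_image).shuffle(1000).batch(64)
# cat_valid = cat_valid.map(pre_process_image).batch(64)
```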
First, we’ll build a smaller CNN image classifier which will be trained from scratch.
A smaller CNN model
In the code below, a 3 x CNN layer head, a GAP layer and a final densely connected output layer is created. The Keras API, which is the encouraged approach for TensorFlow 2, is used in the model definition below. For more on Keras, see this and this tutorial.
Note that the loss function is ‘binary cross-entropy’, due to the fact that the cats vs dogs image classification task is a binary classification problem (i.e. 0 = cat, 1 = dog or vice-versa). Running the code above, after 7 epochs, gives a training accuracy of around 89% and a validation accuracy of around 85%. Next we’ll see how this compares to the transfer learning case.
ResNet50 transfer learning example
To download the ResNet50 model, you can utilize the tf.keras.applications object to download the ResNet50 model in Keras format with trained parameters. To do so, run the following code:
The weights argument ‘imagenet’ denotes that the weights to be used are those generated by training on the ImageNet dataset. The include_top argument states that we only want the CNN feature-map part of the ResNet50 model – not its final GAP and densely connected layers. Finally, we need to specify the input shape we want the model to receive. Next, we need to disable the training of the parameters within this Keras model. This is performed really easily:
res_net.trainable = False
Next we create a Global Average Pooling layer, along with a final densely connected output layer with sigmoid activation. Then the model is combined using the Keras sequential framework where Keras models can be chained together:
As can be observed, while the total number of parameters is large (approximately 23 million), the number of trainable parameters, corresponding to the weights of the final output layer, is only 2,049.
The graphs below from TensorBoard show the relative performance of the small CNN model trained from scratch and the ResNet50 transfer learning model:
Transfer learning training accuracy comparison (blue – ResNet50, pink – smaller CNN model)
Transfer learning validation accuracy comparison (red – ResNet50 model, green – smaller CNN model)
The results above show that the ResNet50 model reaches higher levels of both training and validation accuracy much quicker than the smaller CNN model that was trained from scratch. This illustrates the benefit of using these powerful pre-trained models as a starting point for your more domain specific deep learning tasks. I hope this post has been a help and given you a good understanding of the benefits of transfer learning, and also how to implement it easily in TensorFlow 2.
For those familiar with convolutional neural networks (if you’re not, check out this post), you will know that, for many architectures, the final set of layers is often of the fully connected variety. This is like bolting a standard neural network classifier onto the end of an image processor. The convolutional neural network starts with a series of convolutional (and, potentially, pooling) layers which create feature maps representing different components of the input images. The fully connected layers at the end then “interpret” the output of these feature maps and make category predictions. However, as with many things in the fast moving world of deep learning research, this practice is starting to fall by the wayside in favor of something called Global Average Pooling (GAP). In this post, I’ll introduce the benefits of Global Average Pooling and apply it to the Cats vs Dogs image classification task using TensorFlow 2. In the process, I’ll compare its performance to the standard fully connected layer paradigm. The code for this tutorial can be found in a Jupyter Notebook on this site’s Github repository, ready for use in Google Colaboratory.
Global Average Pooling
Global Average Pooling is an operation that calculates the average output of each feature map in the previous layer. This fairly simple operation reduces the data significantly and prepares the model for the final classification layer. It also has no trainable parameters – just like Max Pooling (see here for more details). The diagram below shows how it is commonly used in a convolutional neural network:
Global Average Pooling in a CNN architecture
As can be observed, the final layers consist simply of a Global Average Pooling layer and a final softmax output layer. In the architecture above, there are 64 averaging calculations corresponding to the 64 7 x 7 channels at the output of the second convolutional layer. The GAP layer transforms the dimensions from (7, 7, 64) to (1, 1, 64) by averaging across the 7 x 7 channel values. Global Average Pooling has the following advantages over the fully connected final layers paradigm:
The removal of a large number of trainable parameters from the model. Fully connected or dense layers have lots of parameters. A 7 x 7 x 64 CNN output being flattened and fed into a 500 node dense layer yields 1.56 million weights which need to be trained. Removing these layers speeds up the training of your model.
The elimination of all these trainable parameters also reduces the tendency to over-fit, which otherwise needs to be managed in fully connected layers by the use of dropout.
The authors argue in the original paper that removing the fully connected classification layers forces the feature maps to be more closely related to the classification categories – so that each feature map becomes a kind of “category confidence map”.
Finally, the authors also argue that, due to the averaging operation over the feature maps, this makes the model more robust to spatial translations in the data. In other words, as long as the requisite feature is included / or activated in the feature map somewhere, it will still be “picked up” by the averaging operation.
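The averaging operation itself is easy to verify directly. Here is a quick sketch using the (7, 7, 64) dimensions from the architecture above:

```python
import tensorflow as tf

# A dummy batch containing one stack of 64 feature maps, each 7 x 7
feature_maps = tf.random.uniform((1, 7, 7, 64))

gap = tf.keras.layers.GlobalAveragePooling2D()
pooled = gap(feature_maps)

print(pooled.shape)  # (1, 64): one average value per feature map
```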
To test out these ideas in practice, in the next section I’ll show you an example comparing the benefits of the Global Average Pooling with the historical paradigm. This example problem will be the Cats vs Dogs image classification task and I’ll be using TensorFlow 2 to build the models. At the time of writing, only TensorFlow 2 Alpha is available, and the reader can follow this link to find out how to install it.
Global Average Pooling with TensorFlow 2 and Cats vs Dogs
To download the Cats vs Dogs data for this example, you can use the following code:
import tensorflow as tf
from tensorflow.keras import layers
import tensorflow_datasets as tfds
split = (80, 10, 10)
splits = tfds.Split.TRAIN.subsplit(weighted=split)
(cat_train, cat_valid, cat_test), info = tfds.load('cats_vs_dogs', split=list(splits), with_info=True, as_supervised=True)
The code above utilizes the TensorFlow Datasets repository which allows you to import common machine learning datasets into TF Dataset objects. For more on using Dataset objects in TensorFlow 2, check out this post. A few things to note. First, the split tuple (80, 10, 10) signifies the (training, validation, test) split as percentages of the dataset. This is then passed to the tensorflow_datasets split object which tells the dataset loader how to break up the data. Finally, the tfds.load() function is invoked. The first argument is a string specifying the dataset name to load. Following arguments relate to whether a split should be used, whether to return an argument with information about the dataset (info) and whether the dataset is intended to be used in a supervised learning problem, with labels being included. In order to examine the images in the data set, the following code can be run:
import matplotlib.pylab as plt
for image, label in cat_train.take(2):
    plt.figure()
    plt.imshow(image)
This produces the following images: As can be observed, the images are of varying sizes. This will need to be rectified so that the images have a consistent size to feed into our model. As usual, the image pixel values (which range from 0 to 255) need to be normalized – in this case, to between 0 and 1. The function below performs these tasks:
In this example, we’ll be resizing the images to 100 x 100 using tf.image.resize. To get state of the art levels of accuracy, you would probably want a larger image size, say 200 x 200, but in this case I’ve chosen speed over accuracy for demonstration purposes. As can be observed, the image values are also cast into the tf.float32 datatype and normalized by dividing by 255. Next we apply this function to the datasets, and also shuffle and batch where appropriate:
For more on TensorFlow datasets, see this post. Now it is time to build the model – in this example, we’ll be using the Keras API in TensorFlow 2. In this example, I’ll be using a common “head” model, which consists of layers of standard convolutional operations – convolution and max pooling, with batch normalization and ReLU activations:
Next, we need to add the “back-end” of the network to perform the classification.
Standard fully connected classifier results
In the first instance, I’ll show the results of a standard fully connected classifier, without dropout. Because, for this example, there are only two possible classes – “cat” or “dog” – the final output layer is a dense / fully connected layer with a single node and a sigmoid activation.
As can be observed, in this case, the output classification layers includes 2 x 100 node dense layers. To combine the head model and this standard classifier, the following commands can be run:
Note that the loss used is binary crossentropy, due to the binary classes for this example. The training progress over 7 epochs can be seen in the figure below:
Standard classifier accuracy (red – training, blue – validation)
Standard classifier loss (red – training, blue – validation)
As can be observed, with a standard fully connected classifier back-end to the model (without dropout), the training accuracy reaches high values but it overfits with respect to the validation dataset. The validation dataset accuracy stagnates around 80% and the loss begins to increase – a sure sign of overfitting.
Global Average Pooling results
The next step is to test the results of the Global Average Pooling in TensorFlow 2. To build the GAP layer and associated model, the following code is added:
The accuracy results for this model, along with the results of the standard fully connected classifier model, are shown below:
Global average pooling accuracy vs standard fully connected classifier model (pink – GAP training, green – GAP validation, blue – FC classifier validation)
As can be observed from the graph above, the Global Average Pooling model has a higher validation accuracy by the 7th epoch than the fully connected model. The training accuracy is lower than the FC model, but this is clearly due to overfitting being reduced in the GAP model. A final comparison including the case of the FC model with a dropout layer inserted is shown below:
Global average pooling validation accuracy vs FC classifier with and without dropout (green – GAP model, blue – FC model without DO, orange – FC model with DO)
As can be seen, of the three model options sharing the same convolutional front end, the GAP model has the best validation accuracy after 7 epochs of training (the x-axis in the graph above is the number of batches). Dropout improves the validation accuracy of the FC model, but the GAP model is still narrowly out in front. Further tuning could be performed on the fully connected models and results may improve. However, one would expect Global Average Pooling to be at least equivalent to an FC model with dropout – even though it has hundreds of thousands fewer parameters. I hope this short tutorial gives you a good understanding of Global Average Pooling and its benefits. You may want to consider it in the architecture of your next image classifier design.
In this relatively short post, I’m going to show you how to deal with metrics and summaries in TensorFlow 2. Metrics, which can be used to monitor various important variables during the training of deep learning networks (such as accuracy or various losses), were somewhat unwieldy in TensorFlow 1.X. Thankfully in the new TensorFlow 2.0 they are much easier to use. Summary logging, for visualization of training in the TensorBoard interface, has also undergone some changes in TensorFlow 2 that I will be demonstrating. Please note – at time of writing, only the alpha version of TensorFlow 2 is available, but it is probably safe to assume that the syntax and forms demonstrated in this tutorial will remain the same in TensorFlow 2.0. To install the alpha version, use the following command:
pip install tensorflow==2.0.0-alpha0
In this tutorial, I’ll be using a generic MNIST Convolutional Neural Network example, but utilizing full TensorFlow 2 design paradigms. To learn more about CNNs, see this tutorial – to understand more about TensorFlow 2 paradigms, see this tutorial. All the code for this tutorial can be found as a Google Colaboratory file on my Github repository.
TensorFlow 2 metrics
Metrics in TensorFlow 2 can be found in the TensorFlow Keras distribution – tf.keras.metrics. Metrics, along with the rest of TensorFlow 2, are now computed in an Eager fashion. In TensorFlow 1.X, metrics were gathered and computed using the declarative, tf.Session-based style. All that is required now is to declare a metric as a Python variable, use the method update_state() to add state to the metric, result() to summarize the metric, and finally reset_states() to reset all the states of the metric. The code below shows a simple implementation of a Mean metric:
This will print the average result -> 3.0. As can be observed, there is an internal memory for the metric, which can be appended to using update_state(). The Mean metric operation is executed when result() is called. Finally, to reset the memory of the metric, we can use reset_states() as follows:
This will print the default response of an empty metric – 0.0.
TensorFlow 2 summaries
Metrics fit hand-in-glove with summaries in TensorFlow 2. In order to log summaries in TensorFlow 2, the developer uses the with Python context manager. First, one creates a summary_writer object like so:
To log something to the summary writer, the developer must first enclose the “space” within your code which does the logging with a Python with statement. The logging looks like so:
with summary_writer.as_default():
    tf.summary.scalar('mean', mean_metric.result(), step=1)
The with context can surround the full training loop, or just the area of the code where you are storing the summaries. As can be observed, the logged scalar value is set by using the metric result() method. The step value needs to be provided to the summary – this allows TensorBoard to plot the variation of various values, images etc. between training steps. The step number can be tracked manually, but the easiest way is to use the iterations property of whatever optimizer you are using. This will be demonstrated in the example below.
TensorFlow 2 metrics and summaries – CNN example
In this example, I’ll show how to use metrics and summaries in the context of a CNN MNIST classification example, using a custom training loop rather than a Keras fit loop. In the next section, I’ll show you how to implement custom metrics even within the Keras fit functionality.
As usual for any machine learning task, the first step is to prepare the training and validation data. In this case, we’ll be using the prepackaged Keras MNIST dataset, then converting the numpy data arrays into a TensorFlow dataset (for more on TensorFlow datasets, see here and here). This looks like the following:
In the lines above, some preprocessing is applied to the image data to normalize it (dividing the pixel values by 255); the extra channel dimension the CNN layers require, making the tensors 4D, is added batch-by-batch within the training loop. Next I define the CNN model, using the Keras sequential paradigm:
The model declaration above is all standard Keras – for more on the sequential model type of Keras, see here. Next, we create a custom training loop function in TensorFlow. It is now best practice to encapsulate core parts of your code in Python functions – this is so that the @tf.function decorator can be applied easily to the function. This signals to TensorFlow to perform Just In Time (JIT) compilation of the relevant code into a graph, which allows the performance benefits of a static graph as per TensorFlow 1.X. Otherwise, the code will execute eagerly, which is not a big deal, but if one is building production or performance-dependent code it is better to decorate with @tf.function.
Here’s the training loop and optimization/loss function definitions:
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
def train(ds_train, optimizer, loss_fn, model, num_batches, log_freq=10):
    avg_loss = tf.keras.metrics.Mean()
    avg_acc = tf.keras.metrics.SparseCategoricalAccuracy()
    batch_idx = 0
    for batch_idx, (images, labels) in enumerate(ds_train):
        images = tf.expand_dims(images, -1)
        with tf.GradientTape() as tape:
            logits = model(images)
            loss_value = loss_fn(labels, logits)
        grads = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        avg_loss.update_state(loss_value)
        avg_acc.update_state(labels, logits)
        if batch_idx % log_freq == 0:
            print(f"Batch {batch_idx}, average loss is {avg_loss.result().numpy()}, average accuracy is {avg_acc.result().numpy()}")
            tf.summary.scalar('loss', avg_loss.result(), step=optimizer.iterations)
            tf.summary.scalar('acc', avg_acc.result(), step=optimizer.iterations)
            avg_loss.reset_states()
            avg_acc.reset_states()
        if batch_idx > num_batches:
            break
As can be observed, I have created two metrics for use in this training loop – avg_loss and avg_acc. These are Mean and SparseCategoricalAccuracy metrics, respectively. The Mean metric has been discussed previously. The SparseCategoricalAccuracy metric takes, as input, the training labels and logits (raw, unactivated outputs from your model). Because it is a sparse categorical accuracy measure, it can take the training labels in scalar integer form, rather than one-hot encoded label vectors. Calling result() on this metric will calculate the average accuracy of all the labels/logits pairs passed during the update_state() call – see line 15 above.
Every log_freq number of batches, the results of the metrics are printed and also passed as summary scalars. After the metrics are logged in the summaries, their states are reset. You will notice that I have not provided a with context for these summaries – this is applied in the outer epoch loop, shown below:
import datetime as dt

num_epochs = 10
summary_writer = tf.summary.create_file_writer('./log/{}'.format(dt.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")))
for i in range(num_epochs):
    print(f"Epoch {i + 1} of {num_epochs}")
    with summary_writer.as_default():
        train(train_dataset, optimizer, loss_fn, model, 10000//BATCH_SIZE)
As can be observed, the summary_writer.as_default() is supplied as context to the whole train function.
So far so good. However, this utilizes a “manual” TensorFlow training loop, which is no longer the easiest way to train in TensorFlow 2, given the tight Keras integration. In the next example, I’ll show you how to include run-of-the-mill metrics in the Keras API, along with custom metrics.
TensorFlow 2 Keras metrics and summaries
To include normal metrics such as the accuracy in Keras is straight-forward – one supplies a list of metrics to be logged in the compile statement like so:
However, if one wishes to log more complicated or custom metrics, it becomes difficult to see how to set this up in Keras. One easy way of doing so is by creating a custom Keras layer whose sole purpose is to add a metric to the model / training. In the example below, I have created a custom layer which adds the standard deviation of the kernel weights as a metric:
A few things to notice about the creation of the custom layer above. First, notice that the layer is defined as a Python class object which inherits from the keras.layers.Layer object. The only variable passed to the initialization of this custom class is the layer with the kernel weights which we wish to log. The call method tells Keras / TensorFlow what to do when the layer is called in a feed forward pass. In this case, the input is passed straight through to the output – it is, in essence, a dummy layer. However, you’ll notice that, within the call, a metric is added.
The value of the metric is the standard deviation of layer_to_log.variables[0]. For a CNN layer, the zero index [0] of the layer variables is the kernel weights. A name is provided to the metric for ease of viewing during training, and finally the aggregation method of the metric is specified – in this case, a ‘mean’ aggregation of the standard deviations.
To include this layer, one can just add it as a sequential element in the Keras model. In the below I take the existing CNN model created in the previous example, and create a new model with the custom metric layer appended to the end:
As can be observed in the above, the first layer of the previous model is passed to the custom MetricLayer. Running the fit training method on this model will now generate both the SparseCategoricalAccuracy metric, along with the custom standard deviation from the first layer. To monitor in TensorBoard, one must also include the TensorBoard callback. All of this looks like the following:
The code above will perform the training and ensure all the metrics (including the metric added in the custom metric layer) are output to TensorBoard via the TensorBoard callback.
This concludes my quick introduction to metrics and summaries in TensorFlow 2. Watch out for future posts and updates of existing posts as the transition to TensorFlow 2 develops.
If you’ve been involved with neural networks and have been using them for classification, you almost certainly will have used a cross entropy loss function. However, have you really understood what cross-entropy means? Do you know what entropy means, in the context of machine learning? If not, then this post is for you. In this introduction, I’ll carefully unpack the concepts and mathematics behind entropy, cross entropy and a related concept, KL divergence, to give you a better foundational understanding of these important ideas. For starters, let’s look at the concept of entropy.
Entropy
The term entropy originated in statistical thermodynamics, which is a sub-domain of physics. However, for machine learning, we are more interested in the entropy as defined in information theory or Shannon entropy. This formulation of entropy is closely tied to the allied idea of information. Entropy is the average rate of information produced from a certain stochastic process (see here). As such, we first need to unpack what the term “information” means in an information theory context. If you’re feeling a bit lost at this stage, don’t worry, things will become much clearer soon.
Information content
Information I in information theory is generally measured in bits, and can loosely, yet instructively, be defined as the amount of “surprise” arising from a given event. To take a simple example – imagine we have an extremely unfair coin which, when flipped, has a 99% chance of landing heads and only 1% chance of landing tails. We can represent this using set notation as {0.99, 0.01}.
If we toss the coin once, and it lands heads, we aren’t very surprised and hence the information “transmitted” by such an event is low. Alternatively, if it lands tails, we would be very surprised (given what we know about the coin) and therefore the information of such an event would be high. This can be represented mathematically by the following formula:
$$I(E) = -log[Pr(E)] = -log(P)$$
This equation gives the information entailed in a stochastic event E as the negative log of the probability of the event, expressed more simply as $-log(P)$. One thing to note – if we are dealing with information expressed in bits (i.e. each bit is either 0 or 1), the logarithm has a base of 2, so $I(E) = -log_{2}(P)$. An alternative unit often used in machine learning is the nat, which applies when the natural logarithm is used instead.
For the coin toss with our unfair coin, the information entailed by heads would be $-log_{2}(0.99) = 0.0145$ bits, which is quite low. Alternatively, for tails the information is equal to $-log_{2}(0.01) = 6.64$ bits. So this lines up nicely with the interpretation of information as “surprise” discussed above.
Now, recall that entropy is defined as the average rate of information produced from a stochastic process. What is the average or expected rate of information produced from the process of flipping our very unfair coin?
Entropy and information
How do we calculate the expected or average value of something? Recall that the expected value of a variable X is given by:
$$E[X] = \sum_{i=1}^{n} x_{i}p_{i}$$
Where $x_{i}$ is the i-th possible value of $x$ and $p_{i}$ is the probability of that $x$ occurring. Likewise, we can define the entropy as the expected value of the information, so that it looks like this:

$$H(X) = E[I] = -\sum_{i=1}^{n} p_{i}log_{2}(p_{i})$$
It can therefore be said that the unfair coin is a stochastic information generator which has an average information delivery rate of 0.08 bits. This is quite small, because it is dominated by the high probability of the {heads} outcome. By contrast, a fair coin yields an entropy of 1 bit. In any case, this should give you a good idea of what entropy is, what it measures and how to calculate it.
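The calculation above can be sketched in a few lines of plain Python (a minimal illustration; the `entropy` helper is my own, not from this post):

```python
import math

def entropy(probs, base=2):
    # H(X) = -sum_i p_i * log(p_i); zero-probability outcomes contribute nothing
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(round(entropy([0.99, 0.01]), 4))  # unfair coin -> 0.0808 bits
print(round(entropy([0.5, 0.5]), 4))    # fair coin -> 1.0 bit
```

Note that a completely certain outcome, e.g. `entropy([1.0])`, gives 0 bits – no surprise, no information.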
So far so good. However, what does this have to do with machine learning?
Entropy and machine learning
You might be wondering at this point where entropy is used in machine learning. Well, first of all, it is a central concept of cross entropy which you are probably already familiar with. More on that later. However, entropy is also used in its own right within machine learning. One notable and instructive instance is its use in policy gradient optimization in reinforcement learning. In such a case, a neural network is trained to control an agent, and its output consists of a softmax layer. This softmax output layer is a probability distribution of what the best action for the agent is.
The output, for an environment with an action size of 4, may look something like this for a given game state:
{0.9, 0.05, 0.025, 0.025}
In the case above, the agent will most likely choose the first action (i.e. p = 0.9). So far so good – but how does entropy come into this? One of the key problems which needs to be addressed in reinforcement learning is making sure the agent doesn’t learn to converge on one set of actions or strategies too quickly. This is called encouraging exploration. In the policy gradient version of reinforcement learning, exploration can be encouraged by putting the negative of the entropy of the output layer into the loss function. Thus, as the loss is minimized, any “narrowing down” of the probabilities of the agent’s actions must be strong enough to counteract the increase in the negative entropy.
For instance, the entropy of the example output above, given its predilection for choosing the first action (i.e. p = 0.9), is quite low – approximately 0.62 bits. Consider an alternative output of the softmax layer however:
{0.3, 0.2, 0.25, 0.25}
In this case, the entropy is larger – approximately 1.99 bits – given that there is more uncertainty in what action the agent should choose. If the negative of entropy is included in the loss function, a higher entropy will act to reduce the loss value more than a lower entropy, and hence there will be a tendency not to converge too quickly on a definitive set of actions (i.e. low entropy).
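Again in plain Python (variable names here are my own, for illustration), the two softmax outputs can be compared directly:

```python
import math

def entropy(probs):
    # Shannon entropy in bits: H = -sum_i p_i * log2(p_i)
    return -sum(p * math.log2(p) for p in probs if p > 0)

confident = [0.9, 0.05, 0.025, 0.025]  # agent strongly prefers the first action
uncertain = [0.3, 0.2, 0.25, 0.25]     # agent is unsure which action is best

print(round(entropy(confident), 2))  # -> 0.62
print(round(entropy(uncertain), 2))  # -> 1.99
```

Adding the negative of these values to the loss therefore penalizes the confident distribution more lightly than it rewards the uncertain one, encouraging exploration.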
Entropy is also used in certain Bayesian methods in machine learning, but these won’t be discussed here. It is now time to consider the commonly used cross entropy loss function.
Cross entropy and KL divergence
Cross entropy is, at its core, a way of measuring the “distance” between two probability distributions P and Q. As you observed, entropy on its own is just a measure of a single probability distribution. As such, if we are trying to find a way to model a true probability distribution, P, using, say, a neural network to produce an approximate probability distribution Q, then there is the need for some sort of distance or difference measure which can be minimized.
The difference measure in cross entropy arises from something called Kullback–Leibler (KL) divergence. This can be seen in the definition of the cross entropy function:
$$H(p, q) = H(p) + D_{KL}(p \parallel q)$$
The first term, the entropy of the true probability distribution p, is fixed during optimization – it reduces to an additive constant. It is only the parameters of the second, approximating distribution, q, that can be varied – and hence the core of the cross entropy measure of distance is the KL divergence function.
KL divergence
The KL divergence between two distributions has many different interpretations from an information theoretic perspective. It is also, in simplified terms, an expression of “surprise” – under the assumption that P and Q are close, it is surprising if it turns out that they are not, hence in those cases the KL divergence will be high. If they are close together, then the KL divergence will be low.
Another interpretation of KL divergence, from a Bayesian perspective, is intuitive – this interpretation says KL divergence is the information gained when we move from a prior distribution Q to a posterior distribution P. The expression for KL divergence can also be derived by using a likelihood ratio approach.
The likelihood ratio can be expressed as:
$$LR = \frac{p(x)}{q(x)}$$
This can be interpreted as follows: if a value x is sampled from some unknown distribution, the likelihood ratio expresses how much more likely the sample has come from distribution p than from distribution q. If it is more likely from p, the LR > 1, otherwise if it is more likely from q, the LR < 1.
So far so good. Let’s say we have lots of independent samples and we want to estimate the likelihood function taking into account all this evidence – it then becomes:
$$LR = \prod_{i=0}^{n}\frac{p(x_{i})}{q(x_{i})}$$
If we convert the ratio to log, it’s possible to turn the product in the above definition into a summation:

$$log(LR) = \sum_{i=0}^{n}log\left(\frac{p(x_{i})}{q(x_{i})}\right)$$
So now we have the likelihood ratio as a summation. Let’s say we want to answer the question of how much, on average, each sample gives evidence of $p(x)$ over $q(x)$. To do this, we can take the expected value of the log likelihood ratio and arrive at:

$$D_{KL}(p \parallel q) = \sum_{i=0}^{n}p(x_{i})log\left(\frac{p(x_{i})}{q(x_{i})}\right)$$
The expression above is the definition of KL divergence. It is basically the expected value of the likelihood ratio – where the likelihood ratio expresses how much more likely the sampled data is from distribution P instead of distribution Q. Another way of expressing the above definition is as follows (using log rules):

$$D_{KL}(p \parallel q) = \sum_{i=0}^{n}p(x_{i})log(p(x_{i})) - \sum_{i=0}^{n}p(x_{i})log(q(x_{i}))$$
The first term in the above equation is the negative of the entropy of the distribution P – recall that the entropy is the expected value of the information content of P. The second term ($\sum_{i=0}^{n}p(x_{i})log (q(x_{i}))$) is the information content of Q, but weighted by the distribution P. This yields the interpretation of the KL divergence to be something like the following – if P is the “true” distribution, then the KL divergence is the amount of information “lost” when expressing it via Q.
However you wish to interpret the KL divergence, it is clearly a difference measure between the probability distributions P and Q. It is only a “quasi” distance measure however, as $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$.
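The asymmetry is easy to verify numerically – a quick sketch, with example distributions of my own choosing:

```python
import math

def kl_divergence(p, q):
    # D_KL(p || q) = sum_i p_i * log(p_i / q_i), in nats
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]
print(round(kl_divergence(p, q), 4))  # -> 0.3681
print(round(kl_divergence(q, p), 4))  # -> 0.5108, not equal - hence not a true distance
```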
Now we need to show how the KL divergence generates the cross-entropy function.
Cross entropy
As explained previously, the cross entropy is a combination of the entropy of the “true” distribution P and the KL divergence between P and Q:
$$H(p, q) = H(p) + D_{KL}(p \parallel q)$$
Using the definition of the entropy and KL divergence, and log rules, we can arrive at the following cross entropy definition:

$$H(p, q) = -\sum_{i=0}^{n}p(x_{i})log(q(x_{i}))$$
What does the utilization of this function look like in practice for a classification task in neural networks? In such a task, we are usually dealing with the true distribution P being a one-hot encoded vector. So, for instance, in the MNIST hand-written digit classification task, if the image represents a hand-written digit of “2”, P will look like:
{0, 0, 1, 0, 0, 0, 0, 0, 0, 0}
The output layer of our neural network in such a task will be a softmax layer, where all outputs have been normalized so they sum to one – representing a quasi probability distribution. The output layer, Q, for this image could be:

{0.01, 0.02, 0.75, 0.05, 0.02, 0.03, 0.01, 0.04, 0.03, 0.04}
To get the predicted class, one would run an np.argmax over the output and, in this example, we would get the correct prediction. However, observe how the cross entropy loss function works in this instance. For all values apart from i=2, $p(x_{i}) = 0$, so the value within the summation for these indices falls to 0. The only index which doesn’t have a zero value is i=2. As such, for one-hot encoded vectors, the cross entropy collapses to:
$$H(p,q) = -log(q(x_{i}))$$
In this example, the cross entropy loss would be $-log(0.75) = 0.288$ (using nats as the information unit). The closer the Q value gets to 1 for the i=2 index, the lower the loss would get. This is because the KL divergence between P and Q is reducing for this index.
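For a concrete check in plain Python (the q vector here is hypothetical – any softmax output with $q(x_{2}) = 0.75$ gives the same loss):

```python
import math

def cross_entropy(p, q):
    # H(p, q) = -sum_i p_i * log(q_i), natural log (nats)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]  # one-hot "true" label for the digit 2
q = [0.01, 0.02, 0.75, 0.05, 0.02, 0.03, 0.01, 0.04, 0.03, 0.04]  # hypothetical softmax output
print(round(cross_entropy(p, q), 3))  # -> 0.288
```

Because p is one-hot, only the i=2 term survives the summation, exactly as described above.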
One might wonder – if the cross entropy loss for classification tasks reduces to a single output node calculation, how does the neural network learn to both increase the softmax value that corresponds to the true index, and decrease the values of all the other nodes? It does this via the cross interaction of nodes through the weights, but also, through the nature of the softmax function itself – if a single index is encouraged to increase, all the other indices/output classes will be encouraged to decrease in the softmax function.
In TensorFlow 2.0, the function to use to calculate the cross entropy loss is the tf.keras.losses.CategoricalCrossentropy() function, where the P values are one-hot encoded. If you’d prefer to leave your true classification values as integers which designate the true values (rather than one-hot encoded vectors), you can use instead the tf.losses.SparseCategoricalCrossentropy() function. In PyTorch, the function to use is torch.nn.CrossEntropyLoss() – however, note that this function performs a softmax transformation of the input before calculating the cross entropy – as such, one should supply only the “logits” (the raw, pre-activated output layer values) from your classifier network. The TensorFlow functions above require a softmax activation to already be applied to the output of the neural network.
That concludes this tutorial on the important concepts of entropy, cross entropy and KL divergence. I hope it will help you deepen your understanding of these commonly used functions, and thereby deepen your understanding of machine learning and neural networks.
Eager to build deep learning systems? Get the book here
It’s a great time to be practising deep learning. The main existing deep learning frameworks like TensorFlow, Keras and PyTorch are maturing and offer a lot of functionality to streamline the deep learning process. There are also other great tool sets emerging for the deep learning practitioner. One of these is the Google Colaboratory environment. This environment, based on Python Jupyter notebooks, gives the user free access to Tesla K80 GPUs. If your local machine lacks a GPU, there is now no need to hire out GPU time on Amazon AWS, at least for prototyping smaller learning tasks. This opens up the ability of anybody to experiment with deep learning beyond simple datasets like MNIST. Google has also just recently opened up the free use of TPUs (Tensor Processing Units) within the environment.
Free access to GPUs and TPUs are just one benefit of Google Colaboratory. This post will explore the capabilities of the environment and show you how to efficiently and effectively use it as a deep learning “home base”. I’ll also be running through an example CIFAR10 classifier built in TensorFlow to demonstrate its use. The Google Colaboratory notebook for this tutorial can be found here.
Eager to build deep learning systems? Learn how here
Google Colaboratory basics
Google Colaboratory is based on the Jupyter notebook design and operation paradigm. I won’t be reviewing how Jupyter works, as I imagine most Python and TensorFlow users are already aware of this package. To access the environment, you must have a Google Drive account and be signed in. The .ipynb files that you create will be saved in your Google Drive account.
To access Google Colaboratory, go here. Once you open up a new file, the first thing to do is rename the file (File -> Rename) and setup your running environment (i.e. whether to use a standard CPU, GPU or TPU). Whenever you change your running environment, the current notebook session will restart – so it is best to do this first up. To do so, go to Runtime -> Change runtime type.
One of the most important and useful components of Google Colaboratory is the ability to share your notebooks with others, and also allow others to comment on your work. The ability to do this can be seen in the Share and Comment buttons on the top right of the screen, see below:
The Comment functionality allows users to make comments on individual cells within the notebook, which is useful for remote collaboration.
Each cell can be selected as either a “code” cell, or a “text” cell. The text cells allow the developer to create commentary surrounding the code, which is useful for explaining what is going on or creating document-like implementation of various algorithms. This is all similar to standard Jupyter notebooks.
Local bash commands can be run from the cells also, which interact with the virtual machine (VM) that has been created as part of your session. For instance, to install the PyDrive package which will be used later in this introduction, run the following straight into one of the cells:
!pip install -U -q PyDrive
This runs a normal pip installation command on the VM and installs the PyDrive package. One important thing to note is that Google Colaboratory will time-out and erase your session environment after a period of inactivity. This is to free up VM space for other users. Therefore, it may be necessary in some cases to run a series of pip install commands at the beginning of your notebook each time to get your local environment ready for your particular use. However, deep learning checkpoints, data and result summaries can be exported to various other locations with permanent storage such as Google Drive, your local hard-drive and so on.
You can also run other common Linux commands such as ls, mkdir, rmdir and curl. A fuller list of bash commands and functionality available on Google Colaboratory can be found by running !ls /bin.
Now that these basics have been covered, it is time to examine how to access TensorBoard while in Google Colaboratory.
Accessing TensorBoard
TensorBoard is a useful visualization tool for TensorFlow, see my previous post for more details on how to use it. It can be accessed by calling on log files written during or after the training process. It is most useful to have access to these files during training, in my view, so that one can observe whether your current approach is yielding results or not. On a PC it is easy to call up the TensorBoard server and access through the web-browser, however this can’t be done in a straight-forward fashion in Google Colaboratory.
Nevertheless, there is an easy solution – ngrok. This package creates secure tunnels through firewalls and other network blocking limitations and allows access to the public internet. I am indebted for this solution to the writer of this blog post. To download and install ngrok to Google Colaboratory, run the following commands:
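The commands, reconstructed from the description that follows, look something like this – note that the download URL is the ngrok stable Linux release path used at the time (check ngrok’s site for the current one), and LOG_DIR is assumed to point at wherever your TensorFlow summary files are written:

```
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip

LOG_DIR = './log'
get_ipython().system_raw(
    'tensorboard --logdir {} --port 6006 &'.format(LOG_DIR)
)
```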
The get_ipython() command allows one to access IPython commands, and system_raw() executes commands in the native command prompt / terminal. The string argument passed to system_raw() starts a TensorBoard session which searches for log files in LOG_DIR, and runs on port 6006.
The next step is to execute ngrok and print out the link which will take the user to the TensorBoard portal:
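Reconstructed to match the explanation below (and assuming ngrok was unzipped into the working directory), these two commands look like:

```
get_ipython().system_raw('./ngrok http 6006 &')
!curl -s http://localhost:4040/api/tunnels | python3 -c \
    "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"
```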
The first line in the above starts the ngrok tunnel via the http protocol to port 6006 – the same port that TensorBoard can be accessed from. The second line is a complicated looking curl command. The curl command in Linux is used to run http requests. In this case, a request is being made to “http://localhost:4040/api/tunnels”. This is an ngrok API running locally that contains information about the tunnels that are operating.
The information received from that curl http request is then sent to the local Python 3 application via the Linux pipe “|” operator. The results come into Python via sys.stdin in json format – and the public URL of the tunnel that has been created is printed to screen. Running this command will return a URL in Google Colaboratory that looks something like:
https://712a59dd.ngrok.io
Clicking on the link in Google Colaboratory will send your browser to your TensorBoard portal. So now you can use TensorBoard in Google Colaboratory – which is very handy.
Next we’ll be looking at saving models and loading files from Google Drive – this is the easiest way to checkpoint and restore models while using Google Colaboratory.
Saving and restoring files in Google Colaboratory
In this post, I’m going to introduce the two ways of working with files in Google Colaboratory that I believe will be of most common use. The files to be worked with generally consist of training or testing data, and saved model data (i.e. checkpoints or fully trained model data).
The simplest way of loading and downloading files in Google Colaboratory is using the inbuilt file browser. Clicking on View -> Table of contents in the menu will launch a left hand pane/menu. At the top of this pane, there will be a tab called “Files” – selecting this will show you the file structure of your current runtime session, from which you can upload and download from your local PC.
Alternatively, this can be performed programmatically by running the following commands:
from google.colab import files
uploaded = files.upload()
The code above will launch a dialog box which allows you to navigate to a local file to upload to your session. The following code will download a specified file to your downloads area on your PC (if you’re using Windows):
files.download("downloaded_weights.12-1.05.hdf5")
So far so good. However, this is a very manual way of playing around with files. This won’t be possible during training, so storing checkpoints to your local drive from Google Colaboratory isn’t feasible using this method. Another issue arises when you are running a long-running training session on Google Colaboratory. If your training finishes and you don’t interact with the console for a while (i.e. you run an overnight training session and you’re asleep when the training finishes), your runtime will be automatically ended and released to free up resources.
Unfortunately this means that you will also lose all your training progress and model data up to that point. In other words, it is important to be able to programmatically store files / checkpoints while training. In the example below, I’m going to show you how to setup a training callback which automatically stores checkpoints to your Google Drive account, which can then be downloaded and used again later. I’ll demonstrate it in the context of training a TensorFlow/Keras model to classify CIFAR-10 images. For more details on that, see my tutorial or my book.
A file saving example using Keras and callbacks
First off, I’ll show you the imports required, the data preparation using the Dataset API and then the Keras model development. I won’t explain these, as the details are outlined in the aforementioned tutorial, so check that out if you’d like to understand the model better. The Google Colaboratory notebook for this tutorial can be found here. We’re going to use the PyDrive package to do all the talking to Google Drive, so first you have to install it in your session:
!pip install -U -q PyDrive
Next comes all the imports, data stuff and Keras model development:
import tensorflow as tf
from tensorflow import keras
import datetime as dt
import os
import numpy as np
from google.colab import files
from google.colab import drive
# these are all the Google Drive and authentication libraries required
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# import the CIFAR-10 data then load into TensorFlow datasets
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
# the training set with data augmentation
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(256).shuffle(10000)
train_dataset = train_dataset.map(lambda x, y: (tf.div(tf.cast(x, tf.float32), 255.0), tf.reshape(tf.one_hot(y, 10), (-1, 10))))
train_dataset = train_dataset.map(lambda x, y: (tf.image.central_crop(x, 0.75), y))
train_dataset = train_dataset.map(lambda x, y: (tf.image.random_flip_left_right(x), y))
train_dataset = train_dataset.repeat()
# the validation set
valid_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(5000).shuffle(10000)
valid_dataset = valid_dataset.map(lambda x, y: (tf.div(tf.cast(x, tf.float32),255.0), tf.reshape(tf.one_hot(y, 10), (-1, 10))))
valid_dataset = valid_dataset.map(lambda x, y: (tf.image.central_crop(x, 0.75), y))
valid_dataset = valid_dataset.repeat()
# now the model creation function
def create_model():
    model = keras.models.Sequential([
        keras.layers.Conv2D(96, 3, padding='same', activation=tf.nn.relu,
                            kernel_initializer=keras.initializers.VarianceScaling(distribution='truncated_normal'),
                            kernel_regularizer=keras.regularizers.l2(l=0.001),
                            input_shape=(24, 24, 3)),
        keras.layers.Conv2D(96, 3, 2, padding='same', activation=tf.nn.relu,
                            kernel_initializer=keras.initializers.VarianceScaling(distribution='truncated_normal'),
                            kernel_regularizer=keras.regularizers.l2(l=0.001)),
        keras.layers.Dropout(0.2),
        keras.layers.Conv2D(192, 3, padding='same', activation=tf.nn.relu,
                            kernel_initializer=keras.initializers.VarianceScaling(distribution='truncated_normal'),
                            kernel_regularizer=keras.regularizers.l2(l=0.001)),
        keras.layers.Conv2D(192, 3, 2, padding='same', activation=tf.nn.relu,
                            kernel_regularizer=keras.regularizers.l2(l=0.001)),
        keras.layers.BatchNormalization(),
        keras.layers.Dropout(0.5),
        keras.layers.Flatten(),
        keras.layers.Dense(256, activation=tf.nn.relu,
                           kernel_initializer=keras.initializers.VarianceScaling(),
                           kernel_regularizer=keras.regularizers.l2(l=0.001)),
        keras.layers.Dense(10),
        keras.layers.Softmax()
    ])
    model.compile(optimizer=tf.train.AdamOptimizer(),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# finally create the model
model = create_model()
The next step in the code is to create the “GoogleDriveStore” callback. This callback inherits from the general keras.callbacks.Callback super class, and this allows the class definition below to create functions which are run at the beginning of the training, after each epoch etc. The code below is the initialization step of the callback which fires at the beginning of the training:
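A sketch of what this initialization looks like – the class name comes from the post itself, while the attribute names are inferred from the epoch-begin code shown further below, so treat the details as illustrative:

```python
class GoogleDriveStore(keras.callbacks.Callback):
    def __init__(self, model_folder):
        self.model_folder = model_folder  # local folder holding the hdf5 checkpoints
        self.first = True                 # used to skip the upload on the very first epoch
        self.init_date = dt.datetime.now()

    def on_train_begin(self, logs={}):
        # the four Google Drive authentication lines discussed below
        auth.authenticate_user()
        gauth = GoogleAuth()
        gauth.credentials = GoogleCredentials.get_application_default()
        self.drive = GoogleDrive(gauth)
```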
You can observe that first a few initialization variables are set, whose purpose will become clear shortly. After that, there are four lines related to setting up the Google Drive connection. I’ll confess that I am not 100% sure how these authentication functions work in detail, but I’ll give explaining them a shot.
First, the auth.authenticate_user() function is run – this is a native Google Colaborotory function which will supply a link for the user to click on. This will lead the user to logon to a Google account and will supply a token that needs to be entered into the Colaboratory notebook to complete the authentication.
Next, this authentication needs to be loaded into the PyDrive Google Drive connection. First an OAuth authentication object is created (gauth). Then the credentials for this connection are supplied via the get_application_default() method of GoogleCredentials. I’m not sure how this method works exactly, but it seems to pick up the authentication that was performed in the first step by running auth.authenticate_user(). Finally, a GoogleDrive object is created, and the authentication credentials are passed to this object creation.
Now the callback has an authenticated Google Drive connection at its disposal. The next step is to create the checkpoint storage to Google Drive after the end of each epoch:
def on_epoch_begin(self, batch, logs={}):
    if not self.first:
        # get the latest checkpoint file
        model_files = os.listdir(self.model_folder)
        max_date = self.init_date
        for f in model_files:
            if os.path.isfile(self.model_folder + "/" + f):
                if f.split(".")[-1] == 'hdf5':
                    creation_date = dt.datetime.fromtimestamp(
                        os.path.getmtime((self.model_folder + "/" + f)))
                    if creation_date > max_date:
                        file_name = f
                        latest_file_path = self.model_folder + "/" + f
                        max_date = creation_date
        uploaded = self.drive.CreateFile({'title': file_name})
        uploaded.SetContentFile(latest_file_path)
        uploaded.Upload()
    else:
        self.first = False
The function above simply loops through all the files within the self.model_folder directory. It searches for files with the hdf5 extension, which is the Keras model save format. It then finds the hdf5 file with the most recent creation date (latest_file_path). Once this has been found, a file is created using the CreateFile method of the GoogleDrive object within PyDrive, and a name is assigned to the file. On the next line, the content of this file is set to be equal to the latest hdf5 file stored locally. Finally, using the Upload() method, the file is saved to Google Drive.
The remainder of the training code looks like the following:
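A sketch of that training code, consistent with the three callbacks described below (the epoch and step counts, and the log and checkpoint folder names, are placeholders of my own choosing):

```python
model_folder = './checkpoints'
os.makedirs(model_folder, exist_ok=True)
callbacks = [
    # updates the TensorBoard result file
    keras.callbacks.TensorBoard(log_dir='./log'),
    # writes an hdf5 checkpoint at the end of each epoch
    keras.callbacks.ModelCheckpoint(
        model_folder + "/weights.{epoch:02d}-{val_loss:.2f}.hdf5"),
    # the GoogleDriveStore callback, which uploads the latest checkpoint
    GoogleDriveStore(model_folder)
]
model.fit(train_dataset, epochs=50, steps_per_epoch=195,
          validation_data=valid_dataset, validation_steps=2,
          callbacks=callbacks)
```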
As can be observed in the code above, there are in total three callbacks being included in the training – a TensorBoard callback (updates the TensorBoard result file), a Model Checkpoint callback (which creates the hdf5 model checkpoints) and finally the Google Drive callback which I created.
The next thing to cover is how to load a trained model from Google Drive back into your Google Colaboratory session.
Loading a trained model from Google Drive
The easiest way to access the files from Google Drive and make them available to your Google Colaboratory session is to mount your Google Drive into the session. This can be performed easily (though, you’ll have to authenticate if you haven’t already, using the token code as explained previously) by calling the drive.mount() method. The code below shows you how to use this method to reload the data into a model and run some predictions:
drive.mount('/content/gdrive')
model = create_model()
model.load_weights("./gdrive/My Drive/weights.12-1.05.hdf5")
model.predict(valid_dataset, steps=1)
The final line will print out an array of predictions from the newly loaded Keras model. Note the argument supplied to drive.mount() is the location in the session’s file structure where the drive contents should be mounted. This location is then accessed to load the weights in the code above.
That gives you a quick overview of Google Colaboratory, plus a couple of handy code snippets which will allow you to run TensorBoard normally from Colaboratory, and how to save and load files from Google Drive – a must for long training sessions. I hope this allows you to get the most out of this great way to test and build deep learning models, with free GPU time!
Eager to build deep learning systems? Learn how here
A recent announcement from the TensorFlow development team has informed the world that some major new changes are coming to TensorFlow, resulting in a new major version TensorFlow 2.0. The three biggest changes include:
TensorFlow Eager execution to be standard
The cleaning up of a lot of redundant APIs
The removal of tf.contrib with select components being transferred into the TensorFlow core
The most important change is the shift from static graph definitions to the imperative Eager framework as standard. If you aren’t familiar with the Eager way of doing things, I’d recommend checking out my introduction to Eager. It will still be possible to create static graph definitions in TensorFlow 2.0, but the push to standardize Eager is a substantial change and will alter the way TensorFlow is used. However, it is already possible to use TensorFlow in this new paradigm before the transition to TensorFlow 2.0. With the newest releases of TensorFlow (i.e. version 1.11), we can already preview what the new TensorFlow 2.0 will be like. The major new TensorFlow paradigm will include the biggest APIs already available – the Dataset API, the Keras API and Eager.
This post will give you an overview of the approach that (I believe) the TensorFlow developers are pushing, and the most effective way of building and training networks in this new and upcoming TensorFlow 2.0 paradigm.
Eager to build deep learning systems? Get the book here
Bringing together Keras, Dataset and Eager
Many people know and love the Keras deep learning framework. In recent versions, Keras has been extensively integrated into the core TensorFlow package. This is a good thing – gone are the days of “manually” constructing common deep learning layers such as convolutional layers, LSTM and other recurrent layers, max pooling, batch normalization and so on. For an introduction into the “bare” Keras framework, see my Keras tutorial. The Keras layers API makes all of this really straight-forward, and the good news is that Keras layers integrate with Eager execution. These two factors combined make rapid model development and easy debugging a reality in TensorFlow.
A final piece of the puzzle is the flexible and effective Dataset API. This API streamlines the pre-processing, batching and consumption of data efficiently into your deep learning model. For more on the capabilities of this API, check out my TensorFlow Dataset tutorial. The good news is that Datasets are able to be consumed in the Keras fit function. This means that all three components (Dataset, Keras and Eager) now fit together seamlessly. In this post, I’ll give an example of what I believe will be an easy, clear and efficient way of developing your deep learning models in the new TensorFlow 2.0 (and existing TensorFlow 1.11). This example will involve creating a CIFAR-10 convolutional neural network image classifier.
Building an image classifier in TensorFlow 2.0
In this example, I’ll show you how to build a TensorFlow image classifier using the convolutional neural network deep learning architecture. If you’re not familiar with CNNs, check out my convolutional neural network tutorial. The structure of the network will consist of the following:
Conv2D layer – 64 filters, 5 x 5 filter, 1 x 1 strides – ReLU activation
Max pooling – 3 x 3 window, 2 x 2 strides
Batch normalization
Conv2D layer – 64 filters, 5 x 5 filter, 1 x 1 strides – ReLU activation
Max pooling – 3 x 3 window, 2 x 2 strides
Batch normalization
Flatten layer
Dense layer – 750 nodes, with dropout
10 node output layer with Softmax activation
I’ll go through each step of the example, and discuss the new way of doing things as I go.
import tensorflow as tf
from tensorflow import keras
import datetime as dt
tf.enable_eager_execution()
First things first, in TensorFlow 2.0 it is not expected that the tf.enable_eager_execution() line will need to be executed. For the time being however, in TensorFlow 1.10+ we still need to enable the Eager execution mode. In the next code segment, I set up the training dataset using the Dataset API in TensorFlow, extracting the CIFAR-10 data from the Keras datasets library:
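This pipeline mirrors the CIFAR-10 example from earlier in this post; a version consistent with the explanation below is:

```python
# extract the CIFAR-10 data from the Keras datasets library
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(256).shuffle(10000)
# scale the images to [0, 1] and one-hot encode the labels
train_dataset = train_dataset.map(lambda x, y: (tf.cast(x, tf.float32) / 255.0,
                                                tf.reshape(tf.one_hot(y, 10), (-1, 10))))
# random distortion (horizontal flipping) to augment the training data
train_dataset = train_dataset.map(lambda x, y: (tf.image.random_flip_left_right(x), y))
# repeat indefinitely so the data can be resampled
train_dataset = train_dataset.repeat()
```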
The lines of code above show how useful and streamlined the Dataset API can be. First, numpy arrays extracted from the Keras dataset library are loaded directly into the Dataset object using the from_tensor_slices method. Batching and shuffling operations are then added to the Dataset pipeline. On the following line, the Dataset map method is used to scale the input images x to between 0 and 1, and to transform the labels into a one-hot vector of length 10. After this a random distortion (randomly flipping the image) is applied to the pre-processing pipeline – this effectively increases the number of image training samples. Finally, the dataset is set to repeat indefinitely i.e. it does not halt extraction after all the dataset has been fed through the model – rather it allows the dataset to be resampled.
The following code shows the same treatment applied to the validation dataset, however with a larger batch size and no random distortion of the images:
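Assuming the same variable names as above, the validation pipeline would be:

```python
valid_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(5000).shuffle(10000)
valid_dataset = valid_dataset.map(lambda x, y: (tf.cast(x, tf.float32) / 255.0,
                                                tf.reshape(tf.one_hot(y, 10), (-1, 10))))
valid_dataset = valid_dataset.repeat()
```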
def call(self, x):
    # first convolutional block: conv -> max pool -> batch normalization
    x = self.max_pool2d(self.conv1(x))
    x = self.batch_norm(x)
    # second convolutional block, re-using the same max pooling definition
    x = self.max_pool2d(self.conv2(x))
    x = self.batch_norm(x)
    x = self.flatten(x)
    # dense layer with dropout, then the 10 node softmax output
    x = self.dropout(self.fc1(x))
    x = self.fc2(x)
    return self.softmax(x)
The code above shows how Keras can now be integrated into the Eager framework. Notice that the class definition inherits from the Keras Model class; this is called model subclassing. The structure of such a model definition involves first creating the layer definitions in the class __init__ function. These layer definitions are then utilized in the call method of the class. Note that this way of doing things allows easy layer sharing: two different inputs could use the same layer definitions, which would also involve sharing weights. This occurs in networks such as Siamese networks and others. In this case, I've utilized the same max pooling definition for both convolutional layers. The feed forward pass during the Keras fit function will automatically use the call method of the model passed to it. The Keras compile function can also be called directly from the instantiated model.
This is one way of doing things. Another is to use the Keras Sequential model type. This is possible in Eager because the Sequential model does not require placeholders, which don't work in Eager mode (given that they operate on a deferred execution basis). However, the Keras Functional API cannot currently be used in Eager mode, as the Functional API requires some form of placeholder inputs. The Sequential model type in Keras is handy but not overly flexible, so it may be best to stick with the model subclassing approach shown above; it really depends on your application.
After the model definition, it is then time to instantiate the model class, compile in Keras and run the training:
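The compile and fit steps might look something like the sketch below. To keep it self-contained, I've used a tiny Sequential stand-in model and synthetic data in place of the subclassed model and the CIFAR-10 Dataset objects, and the Keras Adam optimizer for portability (in TensorFlow 1.10+ you could pass tf.train.AdamOptimizer directly instead):

```python
import datetime as dt
import numpy as np
import tensorflow as tf
from tensorflow import keras

# tiny stand-in model; the post's subclassed model would be used instead
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(32, 32, 3)),
    keras.layers.Dense(10, activation='softmax')])

# synthetic stand-ins for the CIFAR-10 Dataset pipelines
x = np.random.rand(64, 32, 32, 3).astype(np.float32)
y = np.eye(10)[np.random.randint(0, 10, 64)].astype(np.float32)
train_dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(32).repeat()
valid_dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(64).repeat()

# Adam optimizer, categorical cross-entropy loss, and the built-in accuracy metric
model.compile(optimizer=keras.optimizers.Adam(),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# log to TensorBoard via the callback, and train straight from the Dataset objects
callbacks = [keras.callbacks.TensorBoard(
    log_dir='./logs/' + dt.datetime.now().strftime('%d%m%Y%H%M'))]
history = model.fit(train_dataset, epochs=1, steps_per_epoch=2,
                    validation_data=valid_dataset, validation_steps=1,
                    callbacks=callbacks)
```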
In this example, we are using the TensorFlow Adam Optimizer and the Keras categorical cross-entropy loss to train the network. Rather than having to define common metrics such as accuracy in TensorFlow, we can simply use the existing Keras metrics. Keras can also log to TensorBoard easily using the TensorBoard callback. Finally, in the Keras fit method, you can observe that it is possible to simply supply the Dataset objects, train_dataset and the valid_dataset, directly to the Keras function.
Running the above code in Google Colaboratory on a Tesla K80 GPU yields a training accuracy of around 78% and a validation accuracy of around 60% after 200 epochs. This result could be improved with some more iterations and tuning. Note, if you want to learn how to get the most out of Google Colaboratory, check out my introduction here.
The plots below from TensorBoard show the progress of the per-iteration training accuracy and the per-epoch validation accuracy:
Batch CIFAR-10 training accuracy
Epoch CIFAR-10 validation accuracy
In this tutorial, I’ve presented what I believe to be the direction the TensorFlow developers are heading in with respect to the forthcoming release of TensorFlow 2.0. This direction includes three key themes which are already available – the Dataset API, the Keras API and Eager execution. Because these themes are already available for use in TensorFlow 1.10+, this post will hopefully aid you in preparing for the release of TensorFlow 2.0.
Deep learning can be complicated…and sometimes frustrating. Why is my lousy 10 layer conv-net only achieving 95% accuracy on MNIST!? We’ve all been there – something is wrong and it can be hard to figure out why. Often the best solution to a problem can be found by visualizing the issue. This is why TensorFlow’s TensorBoard add-on is such a useful tool, and one reason why TensorFlow is a more mature solution than other deep learning frameworks. It also produces pretty pictures, and let’s face it, everybody loves pretty pictures.
In the latest release of TensorFlow (v1.10 as of writing), TensorBoard has been released with a whole new bunch of functionality. This tutorial is going to cover the basics, so that future tutorials can cover more specific (and complicated) features on TensorBoard. The code for this tutorial can be found on this site’s Github page.
Eager to learn more? Get the book here
Visualizing the graph in TensorBoard
As you are likely to be aware, TensorFlow calculations are performed in the context of a computational graph (if you're not aware of this, check out my TensorFlow tutorial). To communicate the structure of your network, and to check it in the case of complicated networks, it is useful to be able to visualize the computational graph. Visualizing the graph of your network is very straight-forward in TensorBoard. To do so, all that is required is to build your network, create a session, then create a TensorFlow FileWriter object.
The FileWriter definition takes the file path of the location you want to store the TensorBoard file in as the first argument, and the TensorFlow graph object, sess.graph, as the second argument. This can be observed in the code below:
The same FileWriter that can be used to display your computational graph in TensorBoard will also be used for other visualization functions, as will be shown below. In this example, a simple, single hidden layer neural network will be created in TensorFlow to classify MNIST hand-written digits. The graph for this network is what will be visualized. The network, as defined in TensorFlow, looks like:
# declare the training data placeholders
x = tf.placeholder(tf.float32, [None, 28, 28])
# reshape input x - for 28 x 28 pixels = 784
x_rs = tf.reshape(x, [-1, 784])
# scale the input data (maximum is 255.0, minimum is 0.0)
x_sc = tf.div(x_rs, 255.0)
# now declare the output data placeholder - 10 digits
y = tf.placeholder(tf.int64, [None, 1])
# convert the y data to one hot values
y_one_hot = tf.reshape(tf.one_hot(y, 10), [-1, 10])
# hidden layer with 300 nodes and sigmoid activation
W1 = tf.Variable(tf.random_normal([784, 300], stddev=0.01), name='W')
b1 = tf.Variable(tf.random_normal([300]), name='b')
hidden_out = tf.nn.sigmoid(tf.add(tf.matmul(x_sc, W1), b1))
# output layer of 10 nodes - the "raw" logits
W2 = tf.Variable(tf.random_normal([300, 10], stddev=0.05), name='W')
b2 = tf.Variable(tf.random_normal([10]), name='b')
logits = tf.add(tf.matmul(hidden_out, W2), b2)
# now let's define the cost function which we are going to train the model on
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=y_one_hot,
                                                                          logits=logits))
# add an optimiser
optimiser = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(cross_entropy)
I won’t go through the code in detail, but here is a summary of what is going on:
Input placeholders are created
The x input image data, which is of size (-1, 28, 28) is flattened to (-1, 784) and scaled from a range of 0-255 (greyscale pixels) to 0-1
The y labels are converted to one-hot format
A hidden layer is created with 300 nodes and sigmoid activation
An output layer of 10 nodes is created
The logits of this output layer are sent to the TensorFlow softmax + cross entropy loss function
Gradient descent optimization is used
Some accuracy calculations are performed
So when a TensorFlow session is created, and the FileWriter defined and run, you can start TensorBoard to visualize the graph. To define the FileWriter and send it the graph, run the following:
# start the session
with tf.Session() as sess:
    writer = tf.summary.FileWriter(STORE_PATH, sess.graph)
After running the above code, it is time to start TensorBoard. To do this, run the following command in the command prompt:
tensorboard --logdir=STORE_PATH
This will create a local server and the text output in the command prompt will let you know what web address to type into your browser to access TensorBoard. After doing this and loading up TensorBoard, you can click on the Graph tab and observe something like the following:
TensorBoard computational graph
As you can see, there is a lot going on in the graph above. The most obvious major components are the weight variable blocks (W, W_1, b, b_1 etc.), the weight initialization operations (random_normal) and the softmax_cross_entropy nodes. The larger rectangular boxes with rounded edges are called “namespaces”. These are like sub-diagrams in the graph, which contain child operations and can be expanded. More on these shortly.
Surrounding these larger colored blocks are a host of other operations – MatMul, Add, Sigmoid and so on – these operations are shown as ovals. Other nodes which you can probably see are the small circles which represent constants. Finally, if you look carefully, you will be able to observe some ovals and rounded rectangles with dotted outlines. These are an automatic simplification by TensorBoard to reduce clutter in the graph. They show common linkages which apply to many of the nodes – such as all of the nodes requiring initialization (init), those nodes which have gradients associated, and those nodes which will be trained by gradient descent. If you look at the upper right hand side of the diagram, you’ll be able to see these linkages to the gradients, GradientDescent and init nodes:
Nodes with many linkages
One final thing to observe within the graph are the linkages or edges connecting the nodes – these are actually tensors flowing around the computational graph. Zooming in more closely reveals these linkages:
Tensor edges in the graph
As can be observed, the edges between the nodes display the dimensions of the Tensors flowing around the graph. This is handy when debugging more complicated graphs. Now that these basics have been reviewed, we'll examine how to reduce the clutter of your graph visualizations.
Namespaces
Namespaces are scopes which you can surround your graph components with to group them together. By doing so, the detail within the namespace will be collapsed into a single Namespace node within the computational graph visualization in TensorBoard. To create a namespace in TensorFlow, you use the Python with functionality like so:
with tf.name_scope("layer_1"):
    # now declare the weights connecting the input to the hidden layer
    W1 = tf.Variable(tf.random_normal([784, 300], stddev=0.01), name='W')
    b1 = tf.Variable(tf.random_normal([300]), name='b')
    hidden_logits = tf.add(tf.matmul(x_sc, W1), b1)
    hidden_out = tf.nn.sigmoid(hidden_logits)
As you can observe in the code above, the first layer variables and operations have been surrounded with tf.name_scope("layer_1"). This will group all of the operations / variables within the scope together in the graph. Doing the same for the input placeholders and associated operations, the second layer and the accuracy operations, and re-running, we can generate the following much cleaner visualization in TensorBoard:
TensorBoard visualization with namespaces
As you can see, the use of namespaces drastically cleans up the clutter of a TensorBoard visualization. You can still access the detail within the namespace nodes by double clicking on the block to expand.
Before we move onto other visualization options within TensorBoard, it’s worth noting the following:
tf.variable_scope() can also be used instead of tf.name_scope(). Variable scope is used as part of the get_variable() variable sharing mechanism in TensorFlow.
You'll notice in the first cluttered visualization of the graph, the weight and bias variables/operations had underscored numbers following them, i.e. W_1 and b_1. When operations share the same name outside of a name scope, TensorFlow automatically appends a number to the operation name so that no two operations are labelled the same. However, when a name or variable scope is added, the scope is prepended to the operation name, so you can name operations the same thing within different scopes. For instance, the weight variable in the first layer is called 'W' in the definition, but given it is now in the name scope "layer_1", it becomes "layer_1/W". The "W" in layer 2 is called "layer_2/W".
Now that visualization of the computational graph has been covered, it’s time to move onto other visualization functions which can aid in debugging and analyzing your networks.
Scalar summaries
At any point within your network, you can log scalar (i.e. single, real valued) quantities to display in TensorBoard. This is useful for tracking things like the improvement of accuracy or the reduction in the loss function during training, or studying the standard deviation of your weight distributions and so on. It is executed very easily. For instance, the code below shows you how to log the accuracy scalar within this graph:
# add a summary to store the accuracy
tf.summary.scalar('acc_summary', accuracy)
The first argument is the name you choose to give the quantity within the TensorBoard visualization, and the second is the operation (which must return a single real value) you want to log. The output of the tf.summary.scalar() call is itself an operation; in the code above, I have not assigned this operation to any variable within Python, though you can do so if you desire. However, as with everything else in TensorFlow, these summary operations will not do anything until they are run. Given that there are often many summaries in any given graph, depending on what the developer wants to observe, there is a handy helper function called merge_all(). This merges together all the summary calls within the graph, so that you only have to run the single merged operation and it will gather all the other summary operations for you and log the data. It looks like this:
merged = tf.summary.merge_all()
During execution within a Session, the developer can then simply run merged. A collection of summary objects will be returned from running this merging operation, and these can then be output to the FileWriter mentioned previously. The training code for the network looks like the following, and you can check to see where the merged operation has been called:
# start the session
with tf.Session() as sess:
    # initialise the variables
    sess.run(init_op)
    writer = tf.summary.FileWriter(STORE_PATH, sess.graph)
    total_batch = int(len(y_train) / batch_size)
    for epoch in range(epochs):
        avg_cost = 0
        for i in range(total_batch):
            batch_x, batch_y = get_batch(x_train, y_train, batch_size=batch_size)
            _, c = sess.run([optimiser, cross_entropy], feed_dict={x: batch_x, y: batch_y.reshape(-1, 1)})
            avg_cost += c / total_batch
        acc, summary = sess.run([accuracy, merged], feed_dict={x: x_test, y: y_test.reshape(-1, 1)})
        print("Epoch: {}, cost={:.3f}, test set accuracy={:.3f}%".format(epoch + 1, avg_cost, acc * 100))
        writer.add_summary(summary, epoch)
print("\nTraining complete!")
I won’t go through the training details of the code above – it is similar to that shown in other tutorials of mine like my TensorFlow tutorial. However, there are a couple of lines to note. First, you can observe that after every training epoch two operations are run – accuracy and merged. The merged operation returns the combined summary data, stored in summary, ready for writing to the FileWriter. This data is then written out by running writer.add_summary(). The first argument to this function is the summary data, and the second is an optional argument which logs the global training step along with the summary data.
Before showing you the results of this code, it is important to note something. TensorBoard starts to behave badly when there are multiple output files within the same folder that you launched the TensorBoard instance from. Therefore, if you are running your code multiple times you have two options:
Delete the FileWriter output file after each run or,
Use the fact that TensorBoard can perform sub-folder searches for TensorBoard files. So for instance, you could create a separate sub-folder for each run i.e. “Run_1”, “Run_2” etc. and then launch TensorBoard from the command line, pointing it to the parent directory. This is recommended when you are doing multiple runs for cross validation, or other diagnostic and testing runs.
To access the accuracy scalar summary that was logged, launch TensorBoard again and click on the Scalar tab. You’ll see something like this:
TensorBoard accuracy summary
The scalar page in TensorBoard has a few features which are worth checking out. Of particular note is the smoothing slider. This explains why there are two lines in the graph above – the thicker orange line is the smoothed values, and the lighter orange line is the actual accuracy values which were logged. This smoothing can be useful for displaying the overall trend when the summary logging frequency is higher i.e. after every training step rather than after every epoch as in this example.
The next useful data visualization in TensorBoard is the histogram summary.
Histogram summaries
Histogram summaries are useful for examining the statistical distributions of your network variables and operation outputs. For instance, in my weight initialization tutorial, I have used TensorBoard histograms to show how poor weight initialization can lead to sub-optimal network learning due to less than optimal outputs from the node activation function. To log histogram summaries, all the developer needs to do is create a similar operation to the scalar summaries:
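The histogram calls might look like the sketch below. Here, hidden_logits and hidden_out are stand-in tensors for the hidden layer input and output; in the tutorial's graph they would be the tensors created inside the layer_1 name scope:

```python
import numpy as np
import tensorflow as tf

# stand-ins for the hidden layer input (pre-activation logits) and output tensors
hidden_logits = tf.constant(np.random.randn(100).astype(np.float32))
hidden_out = tf.sigmoid(hidden_logits)

# log the statistical distributions of these tensors for the histogram tab
tf.summary.histogram('hidden_logits', hidden_logits)
tf.summary.histogram('hidden_out', hidden_out)
```

As with the scalar summary, these operations are gathered up by merge_all() and only produce data when run and passed to a FileWriter.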
I have added these summaries so that we can examine how the distribution of the inputs and outputs of the hidden layer progress over training epochs. After running the code, you can open TensorBoard again. Clicking on the histogram tab in TensorBoard will give you something like the following:
TensorBoard histogram outputs – offset
This view in the histogram tab is the offset view, so that you can clearly observe how the distribution changes through the training epochs. On the left hand side of the histogram page in TensorBoard you can choose another option called “Overlay” which looks like the following:
TensorBoard histogram outputs – overlay
Another view of the statistical distribution of your histogram summaries can be accessed through the “Distributions” tab in TensorBoard. An example of the graphs available in this tab is below:
TensorBoard histogram outputs – Distributions tab
This graph gives another way of visualizing the distribution of the data in your histogram summaries. The x-axis is the number of steps or epochs, and the different shadings represent varying multiples of the standard deviation from the mean.
That covers histogram summaries; it is now time to review the last summary type covered in this tutorial: the image summary.
Image summaries
The final summary to be reviewed is the image summary. This summary object allows the developer to capture images of interest during training to visualize them. These images can be either grayscale or RGB. One possible application of image summaries is to use them to visualize which images are classified well by the classifier, and which ones are classified poorly. This application will be demonstrated in this example – the additional code can be observed below:
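A sketch of how such a split might be logged is below. The images tensor and the correct mask are hypothetical stand-ins (in the tutorial, the mask would come from comparing predictions to labels); tf.boolean_mask separates the well and poorly classified images before they are passed to tf.summary.image:

```python
import numpy as np
import tensorflow as tf

# a hypothetical batch of MNIST-style images in the 4D
# (batch, height, width, channels) format the image summary expects
images = tf.constant(np.random.rand(4, 28, 28, 1).astype(np.float32))
# a hypothetical mask of which images the classifier got right
correct = tf.constant([True, False, True, False])

# separate out the correctly and incorrectly classified images
misclassified = tf.boolean_mask(images, tf.logical_not(correct))
# log up to three images of each kind to TensorBoard
tf.summary.image('correct_images', tf.boolean_mask(images, correct), max_outputs=3)
tf.summary.image('incorrect_images', misclassified, max_outputs=3)
```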
TensorFlow is a great deep learning framework. In fact, it is still the reigning monarch within the deep learning framework kingdom. However, it has some frustrating limitations. One of these is the difficulties that arise during debugging. In TensorFlow, it’s difficult to diagnose what is happening in your model. This is due to its static graph structure (for details, see my TensorFlow tutorial) – in TensorFlow the developer has to first create the full set of graph operations, and only then are these operations compiled with a TensorFlow session object and fed data. Wouldn’t it be great if you could define operations, then immediately run data through them to observe what the output was? Or wouldn’t it be great to set standard Python debug breakpoints within your code, so you can step into your deep learning training loops wherever and whenever you like and examine the tensors and arrays in your models? This is now possible using the TensorFlow Eager API, available in the latest version of TensorFlow.
The TensorFlow Eager API allows you to dynamically create your model in an imperative programming framework. In other words, you can create tensors, operations and other TensorFlow objects by typing the command into Python, and run them straight away without the need to set up the usual session infrastructure. This is useful for debugging, as mentioned above, but it also allows dynamic adjustment of deep learning models as training progresses. In fact, in natural language processing, the ability to create dynamic graphs is useful, given that sentences and other utterances in natural language have varying lengths. In this TensorFlow Eager tutorial, I'll show you the basics of the new API and also show how you can use it to create a fully fledged convolutional neural network.
The first thing you need to do to use TensorFlow Eager is to enable Eager execution. To do so, you can run the following (note, you can type this directly into your Python interpreter):
import tensorflow as tf
tf.enable_eager_execution()
Now you can define TensorFlow operations and run them on the fly. In the code below, a numpy range from 0 to 9 is multiplied by a scalar value of 10, using the TensorFlow multiply operation:
import numpy as np

# simple example
z = tf.constant(np.arange(10))
z_tf = tf.multiply(z, np.array(10))
print(z_tf)
Notice we can immediately access the results of the operation. If we ran the above without running the tf.enable_eager_execution() command, we would instead see the definition of the TensorFlow operation i.e.:
Tensor("Mul:0", shape=(10,), dtype=int32)
Notice also how easily TensorFlow Eager interacts with the numpy framework. So far, so good. Now, the main component of any deep learning API is how gradients are handled – this will be addressed in the next section.
Gradients in TensorFlow Eager
Gradient calculation is necessary in neural networks during the back-propagation stage (if you’d like to know more, check out my neural networks tutorial). The gradient calculations in the TensorFlow Eager API work similarly to the autograd package used in PyTorch. To calculate the gradient of an operation using Eager, you can use the gradients_function() operation. The code below calculates the gradient for an $x^3$ function:
import tensorflow.contrib.eager as tfe

def f_cubed(x):
    return x**3

grad = tfe.gradients_function(f_cubed)
grad(3.)[0].numpy()
Notice the use of tfe.gradients_function(f_cubed) – when called, this operation will return the gradient df/dx at the supplied x value. The code above returns the value 27, which makes sense, as the derivative of $x^3$ is $3x^2$, and evaluating this at $x = 3$ gives $3 \times 3^2 = 27$. The final line shows the grad operation, and then the conversion of the output to a numpy scalar, i.e. a float value.
We can show the use of this gradients_function in a more complicated example – polynomial line fitting. In this example, we will use TensorFlow Eager to discover the weights of a noisy 3rd order polynomial. This is what the line looks like:
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 5, 0.1)
y = x**3 - 4*x**2 - 2*x + 2
y_noise = y + np.random.normal(0, 1.5, size=(len(x),))
plt.close("all")
plt.plot(x, y)
plt.scatter(x, y_noise)
A noisy polynomial to fit
As can be observed from the code, the polynomial is expressed as $x^3 - 4x^2 - 2x + 2$ with some random noise added. Therefore, we want our code to find a “weight” vector of approximately [1, -4, -2, 2]. First, let’s define a few functions:
def get_batch(x, y, batch_size=20):
    idxs = np.random.randint(0, len(x), (batch_size))
    return x[idxs], y[idxs]

class PolyModel(object):
    def __init__(self):
        self.w = tfe.Variable(tf.random_normal([4]))

    def f(self, x):
        return self.w[0] * x ** 3 + self.w[1] * x ** 2 + self.w[2] * x + self.w[3]

def loss(model, x, y):
    err = model.f(x) - y
    return tf.reduce_mean(tf.square(err))
The first function is a simple randomized batching function. The second is a class definition for our polynomial model. Upon initialization, we create a weight variable self.w, a TensorFlow Eager variable randomly initialized as a vector of length 4. Next, we define a function f which combines the weights with x in third order polynomial form and returns the result. Finally, we have a loss function which returns the mean squared error between the current model output and the noisy y vector.
To train the model, we can run the following:
model = PolyModel()
grad = tfe.implicit_gradients(loss)
optimizer = tf.train.AdamOptimizer()
iters = 20000
for i in range(iters):
    x_batch, y_batch = get_batch(x, y_noise)
    optimizer.apply_gradients(grad(model, x_batch, y_batch))
    if i % 1000 == 0:
        print("Iteration {}, loss: {}".format(i + 1, loss(model, x_batch, y_batch).numpy()))
First, we create a model and then use a TensorFlow Eager function called implicit_gradients. This function will detect any upstream or parent gradients involved in calculating the loss, which is handy. We are using a standard Adam optimizer for this task. Finally a loop begins, which supplies the batch data and the model to the gradient function. Then the program applies the returned gradients to the optimizer to perform the optimizing step.
After running this code, we get the following output graph:
The orange line is the fitted line, the blue is the “ground truth”. Not perfect, but not too bad.
Next, I’ll show you how to use TensorFlow Eager to create a proper neural network classifier trained on the MNIST dataset.
A neural network with TensorFlow Eager
In the code below, I’ll show you how to create a Convolutional Neural Network to classify MNIST images using TensorFlow Eager. If you’re not sure about Convolutional Neural Networks, you can check out my tutorial here. The first part of the code shows you how to extract the MNIST dataset:
In the code above, we are making use of the Keras datasets now available in TensorFlow (by the way, the Keras deep learning framework is now heavily embedded within TensorFlow – to learn more about Keras see my tutorial). The raw MNIST image dataset has values ranging from 0 to 255 which represent the grayscale values – these need to be scaled to between 0 and 1. The function below accomplishes this:
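A minimal sketch of such a scaling function (the name scale is my choice here) might be:

```python
import tensorflow as tf

def scale(image, label):
    # cast from uint8 grayscale values (0-255) to float32 in [0, 1]
    return tf.cast(image, tf.float32) / 255.0, label
```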
Next, in order to setup the Keras image dataset into a TensorFlow Dataset object, we use the following code. This code creates a scaled training and testing dataset. This dataset is also randomly shuffled and ready for batch extraction. It also applies the tf.one_hot function to the labels to convert the integer label to a one hot vector of length 10 (one for each hand-written digit). If you’re not familiar with the TensorFlow Dataset API, check out my TensorFlow Dataset tutorial.
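The setup described above might look something like the sketch below. I've used small random arrays as stand-ins for the (x_train, y_train) arrays from tf.keras.datasets.mnist.load_data(), so the pipeline runs on its own:

```python
import numpy as np
import tensorflow as tf

# random stand-ins for the arrays from tf.keras.datasets.mnist.load_data()
x_train = np.random.randint(0, 256, size=(100, 28, 28)).astype(np.uint8)
y_train = np.random.randint(0, 10, size=(100,)).astype(np.int64)

train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
# scale the images and convert each integer label to a one-hot vector of length 10
train_ds = train_ds.map(
    lambda x, y: (tf.cast(x, tf.float32) / 255.0, tf.one_hot(y, 10)))
# randomly shuffle and make ready for batch extraction
train_ds = train_ds.shuffle(100).batch(32)
```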
The next section of code creates the MNIST model itself, which will be trained. The best practice at the moment for TensorFlow Eager is to create a class definition for the model which inherits from the tf.keras.Model class. This is useful for a number of reasons, but the main one for our purposes is the ability to call on the model.variables property when determining Eager gradients, and this “gathers together” all the trainable variables within the model. The code looks like:
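A sketch of such a class definition is below, implementing the layer structure listed next. The dropout rate of 0.5 is an assumption, and the call method here expects images already shaped as 4D (batch, height, width, channels) tensors:

```python
import tensorflow as tf

class MNISTModel(tf.keras.Model):
    def __init__(self):
        super(MNISTModel, self).__init__()
        # 32 channel, 5x5 convolutional layer with ReLU activation
        self.conv1 = tf.keras.layers.Conv2D(32, 5, padding='same', activation=tf.nn.relu)
        # 2x2 max pooling with (2, 2) strides
        self.max_pool = tf.keras.layers.MaxPooling2D((2, 2), (2, 2), padding='same')
        # 64 channel, 5x5 convolutional layer with ReLU activation
        self.conv2 = tf.keras.layers.Conv2D(64, 5, padding='same', activation=tf.nn.relu)
        self.flatten = tf.keras.layers.Flatten()
        # dense layer with 750 nodes, followed by dropout
        self.fc1 = tf.keras.layers.Dense(750, activation=tf.nn.relu)
        self.dropout = tf.keras.layers.Dropout(0.5)  # rate is an assumption
        # 10 node output layer, no activation - the "raw" logits
        self.fc2 = tf.keras.layers.Dense(10)

    def call(self, x):
        x = self.max_pool(self.conv1(x))
        x = self.conv2(x)
        x = self.flatten(x)
        x = self.dropout(self.fc1(x))
        return self.fc2(x)
```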
In the model definition, we create layers to implement the following network structure:
32 channel, 5×5 convolutional layer with ReLU activation
2×2 max pooling, with (2,2) strides
64 channel 5×5 convolutional layer with ReLU activation
Flattening
Dense/Fully connected layer with 750 nodes, ReLU activation
Dropout layer
Dense/Fully connected layer with 10 nodes, no activation
As stated above, if you’re not sure what these terms mean, see my Convolutional Neural Network tutorial. Note that the call method is a mandatory method for the tf.keras.Model superclass – it is where the forward pass through the model is defined.
The next function is the loss function for the optimization:
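A sketch of the loss function is below. Note that I've used the tf.nn.softmax_cross_entropy_with_logits name for portability; in TensorFlow 1.10 the post's preferred op is the _v2 variant of the same function:

```python
import tensorflow as tf

def loss_fn(model, images, labels):
    # forward pass through the model gives the "raw" (pre-softmax) output
    logits = model(images)
    # softmax activation plus cross entropy loss in one numerically stable call
    # (tf.nn.softmax_cross_entropy_with_logits_v2 in TensorFlow 1.10)
    return tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))
```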
Note that this function calls the forward pass through the model (which is an instance of our MNISTModel) and calculates the “raw” output. This raw output, along with the labels, passes through to the TensorFlow function softmax_cross_entropy_with_logits_v2. This applies the softmax activation to the “raw” output from the model, then creates a cross entropy loss.
Next, I define an accuracy function below, to keep track of how the training is progressing regarding training set accuracy, and also to check test set accuracy:
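A minimal sketch of such an accuracy function (the name get_accuracy matches its use in the training code below):

```python
import tensorflow as tf

def get_accuracy(model, images, labels):
    logits = model(images)
    # a prediction is correct when the largest logit lines up with the one-hot label
    correct = tf.equal(tf.argmax(logits, axis=1), tf.argmax(labels, axis=1))
    return tf.reduce_mean(tf.cast(correct, tf.float32))
```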
Finally, the full training code for the model is shown below:
model = MNISTModel()
optimizer = tf.train.AdamOptimizer()
epochs = 1000
for (batch, (images, labels)) in enumerate(train_ds):
    with tfe.GradientTape() as tape:
        loss = loss_fn(model, images, labels)
    grads = tape.gradient(loss, model.variables)
    optimizer.apply_gradients(zip(grads, model.variables),
                              global_step=tf.train.get_or_create_global_step())
    if batch % 10 == 0:
        acc = get_accuracy(model, images, labels).numpy()
        print("Iteration {}, loss: {:.3f}, train accuracy: {:.2f}%".format(batch, loss_fn(model, images, labels).numpy(), acc * 100))
    if batch > epochs:
        break
In the code above, we create the model along with an optimizer. The code then enters the training loop, by iterating over the training dataset train_ds. Then follows the definition of the gradients for the model. Here we are using the TensorFlow Eager object called GradientTape(). This is an efficient way of defining the gradients over all the variables involved in the forward pass. It will track all the operations during the forward pass and will efficiently “play back” these operations during back-propagation.
Using the Python with functionality, we can include the loss_fn function, and all associated upstream variables and operations, within the tape to be recorded. Then, to extract the gradients of the relevant model variables, we call tape.gradient. The first argument is the “target” for the calculation, i.e. the loss, and the second argument is the “source” i.e. all the model variables.
We then pass the gradients and the variables zipped together to the Adam optimizer for a training step. Every 10 iterations some results are printed and the training loop exits if the iterations number exceeds the maximum number of epochs.
Running this code for 1000 iterations will give you a loss < 0.05, and training set accuracy approaching 100%. The code below calculates the test set accuracy:
avg_acc = 0
test_epochs = 20
for (batch, (images, labels)) in enumerate(test_ds):
    avg_acc += get_accuracy(model, images, labels).numpy()
    if batch % 100 == 0 and batch != 0:
        print("Iteration:{}, Average test accuracy: {:.2f}%".format(batch, (avg_acc / batch) * 100))
print("Final test accuracy: {:.2f}%".format(avg_acc / batch * 100))
You should be able to get a test set accuracy, using the code defined above, on the order of 98% or greater for the trained model.
In this post, I’ve shown you the basics of using the TensorFlow Eager API for imperative deep learning. I’ve also shown you how to use the autograd-like functionality to perform a polynomial line fitting task and build a convolutional neural network which achieves relatively high test set accuracy for the MNIST classification task. Hopefully you can now use this new TensorFlow paradigm to reduce development time and enhance debugging for your future TensorFlow projects. All the best!
Reinforcement learning has gained significant attention with the relatively recent success of DeepMind’s AlphaGo system defeating the world champion Go player. The AlphaGo system was trained in part by reinforcement learning on deep neural networks. This type of learning is a different aspect of machine learning from the classical supervised and unsupervised paradigms. In reinforcement learning using deep neural networks, the network reacts to environmental data (called the state) and controls the actions of an agent to attempt to maximize a reward. This process allows a network to learn to play games, such as Atari or other video games, or any other problem that can be recast as some form of game. In this tutorial, I’ll introduce the broad concepts of Q learning, a popular reinforcement learning paradigm, and I’ll show how to implement deep Q learning in TensorFlow. If you need to get up to speed in TensorFlow, check out my introductory tutorial.
As stated above, reinforcement learning comprises a few fundamental entities or concepts: an environment, which produces a state and reward, and an agent, which performs actions in the given environment. This interaction can be seen in the diagram below:
Reinforcement learning environment
The goal of the agent in such an environment is to examine the state and the reward information it receives, and choose an action which maximizes the reward feedback it receives. The agent learns by repeated interaction with the environment, or, in other words, repeated playing of the game.
To be successful, the agent needs to:
1. Learn the interaction between states, actions and subsequent rewards
2. Determine the best action to choose given (1)
The implementation of (1) involves determining some set of values which can be used to inform (2), and (2) is called the action policy. One of the most common ways of implementing (1) and (2) using deep learning is via the Deep Q network and the epsilon-greedy policy. I’ll cover both of these concepts in the next two sections.
Q learning
Q learning is a value-based method of supplying information to inform which action an agent should take. An intuitive first idea for creating values upon which to base actions is to build a table which sums up the rewards of taking action a in state s over multiple game plays. This could keep track of which moves are the most advantageous. For instance, let’s consider a simple game which has 3 states and two possible actions in each state – the rewards for this game can be represented in a table:
Simple state-action-reward table
In the table above, you can see that for this simple game, when the agent is in State 1 and takes Action 2, it will receive a reward of 10, but zero reward if it takes Action 1. In State 2, the situation is reversed, and finally State 3 resembles State 1. If an agent randomly explored this game and summed up which actions received the most reward in each of the three states (storing this information in an array, say), then it would effectively learn the functional form of the table above.
In other words, if the agent simply chooses the action which it learnt had yielded the highest reward in the past (effectively learning some form of the table above) it would have learnt how to play the game successfully. Why do we need fancy concepts such as Q learning and neural networks then, when simply creating tables by summation is sufficient?
Deferred reward
Well, the first obvious answer is that the game above is clearly very simple, with only 3 states and 2 actions per state. Real games are significantly more complex. The other significant concept that is missing in the example above is the idea of deferred reward. To adequately play most realistic games, an agent needs to learn to be able to take actions which may not immediately lead to a reward, but may result in a large reward further down the track.
Consider another game, defined by the table below:
Simple delayed reward value table
In the game defined above, in all states, if Action 2 is taken, the agent moves back to State 1 i.e. it goes back to the beginning. In States 1 to 3, it also receives a reward of 5 when it does so. However, in all States 1 – 3, if Action 1 is taken, the agent moves forward to the next state, but doesn’t receive a reward until it reaches State 4 – at which point it receives a reward of 20. In other words, an agent is better off if it doesn’t take Action 2 to get an instantaneous reward of 5, but rather it should choose Action 1 consistently to progress through the states to get the reward of 20. The agent needs to be able to select actions which result in a delayed reward, if the delayed reward value is sufficiently large.
The Q learning rule
This allows us to define the Q learning rule. In deep Q learning, the neural network needs to take the current state, s, as a variable and return a Q value for each possible action, a, in that state – i.e. it needs to return $Q(s,a)$ for all s and a. This $Q(s,a)$ needs to be updated in training via the following rule:
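$$ Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] $$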
This updating rule needs a bit of unpacking. First, you can see that the new value of $Q(s,a)$ involves updating its current value by adding the extra terms on the right-hand side of the equation above. Moving left to right, ignore the $\alpha$ for the moment. Inside the square brackets, the first term is r, which stands for the reward that is received for taking action a in state s. This is the immediate reward; no delayed gratification is involved yet.
The next term is the delayed reward calculation. First, we have the $\gamma$ value which discounts the delayed reward impact – it is always between 0 and 1. More on that in a second. The next term $\max_{a'} Q(s', a')$ is the maximum Q value possible in the next state. Let’s make that a bit clearer – the agent starts in state s, takes action a, ends up in state s' and then the code determines the maximum Q value in state s' i.e. $\max_{a'} Q(s', a')$.
So why is the value $\max_{a'} Q(s', a')$ considered? It is considered because it represents the maximum future reward coming to the agent if it takes action a in state s. However, this value is discounted by $\gamma$ to take into account that it isn’t ideal for the agent to wait forever for a future reward – it is best for the agent to aim for the maximum reward in the least period of time. Note that the value $Q(s',a')$ implicitly also holds the maximum discounted reward for the state after that, i.e. $Q(s'', a'')$, and likewise it holds the discounted reward for the state after that, $Q(s''', a''')$, and so on. This is how the agent can choose the action a based not just on the immediate reward r, but also on possible future discounted rewards.
The final components in the formula above are the $\alpha$ value, which is the learning rate during the updating, and the current Q value, $Q(s,a)$, which is subtracted from the sum inside the square brackets. This subtraction normalizes the update. Neither $\alpha$ nor the $Q(s,a)$ subtraction needs to be explicitly defined in deep Q learning, as the neural network will take care of them during its optimized learning process. This process will be discussed in the next section.
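The update rule can be demonstrated in tabular form on the delayed-reward game from the previous section. The sketch below makes one assumption about the game dynamics: reaching State 4 yields the reward of 20 and restarts the game at State 1 (rather than ending it), so that consistently moving forward beats repeatedly grabbing the reward of 5:

```python
import numpy as np

# Assumed dynamics for the delayed-reward game: States 1-3 are indices 0-2.
# Action index 0 ("Action 1") moves forward; reaching State 4 pays 20 and
# restarts the game. Action index 1 ("Action 2") returns to State 1 and pays 5.
def step(s, a):
    if a == 1:
        return 0, 5.0          # back to the start, immediate reward of 5
    if s == 2:
        return 0, 20.0         # reached the flag: reward 20, game restarts
    return s + 1, 0.0          # move forward, no reward yet

alpha, gamma = 0.5, 0.95
Q = np.zeros((3, 2))           # one row per state, one column per action
rng = np.random.default_rng(0)
s = 0
for _ in range(20000):         # explore with purely random actions
    a = int(rng.integers(2))
    s2, r = step(s, a)
    # the Q learning update rule
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
    s = s2

print(Q)  # with gamma = 0.95, moving forward (column 0) wins in every state
```

With a high enough $\gamma$ the table converges so that Action 1 has the larger Q value in all three states, i.e. the agent learns to forgo the immediate reward of 5 for the delayed reward of 20.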
Deep Q learning
Deep Q learning applies the Q learning updating rule during the training process. In other words, a neural network is created which takes the state s as its input, and then the network is trained to output appropriate Q(s,a) values for each action in state s. The action a of the agent can then be chosen by taking the action with the greatest Q(s,a) value (by taking an argmax of the output of the neural network). This can be seen in the first step of the diagram below:
Action selecting and training steps – Deep Q learning
Once this step has been taken and an action has been selected, the agent can perform that action. The agent will then receive feedback on what reward is received by taking that action from that state. Now, the next step that we want to perform is to train the network according to the Q learning rule. This can be seen in the second part of the diagram above. The x input array for training the network is the state vector s, and the y output training sample is the Q(s,a) vector retrieved during the action selection step. However, one of the Q(s,a) values, corresponding to action a, is set to have a target of $r + \gamma \max_{a'} Q(s', a')$ – this can be observed in the figure above.
By training the network in this way, the Q(s,a) output vector from the network will over time become better at informing the agent what action will be the best to select for its long term gain. There is a bit more to the story about action selection, however, which will be discussed in the next section.
The epsilon-greedy policy
In the explanation above, the action selection policy was simply the action which corresponded to the highest Q output from the neural network. However, this policy isn’t the most effective. Why is that? It is because, when the neural network is randomly initialized, it will be predisposed to select certain sub-optimal actions randomly. This may cause the agent to fall into sub-optimal behavior patterns without thoroughly exploring the game and action / reward space. As such, the agent won’t find the best strategies to play the game.
It is useful here to introduce two concepts – exploration and exploitation. At the beginning of an optimization problem, it is best to allow the problem space to be explored extensively in the hope of finding good local (or even global) minima. However, once the problem space has been adequately searched, it is now best for the optimization algorithm to focus on exploiting what it has found by converging on the best minima to arrive at a good solution.
Therefore, in reinforcement learning, it is best to allow some randomness in the action selection at the beginning of the training. This randomness is determined by the epsilon parameter. Essentially, a random number is drawn between 0 and 1, and if it is less than epsilon, then a random action is selected. If not, an action is selected based on the output of the neural network. The epsilon variable usually starts somewhere close to 1, and is slowly decayed to somewhere around 0 during training. This allows a large exploration of the game at the beginning, but then the decay of the epsilon value allows the network to zero in on a good solution.
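A minimal sketch of such an epsilon-greedy selection with decay (the function and parameter names here are my own, not from the tutorial code):

```python
import numpy as np

def choose_action(q_values, eps, rng):
    # with probability eps, explore: pick a random action
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))
    # otherwise exploit: pick the action with the highest Q value
    return int(np.argmax(q_values))

# a typical decay schedule: start near 1 and decay toward a small floor
eps, eps_min, decay = 1.0, 0.01, 0.995
for _ in range(1000):
    eps = max(eps_min, eps * decay)
print(eps)  # clamped at the floor of 0.01
```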
We’re almost at the point where we can check out the game that will be used in this example, and begin to build our deep Q network. However, there is just one final important point to consider.
Batching in reinforcement learning
If a deep Q network is trained at each step in the game i.e. after each action is performed and the reward collected, there is a strong risk of over-fitting in the network. This is because game play is highly correlated i.e. if the game starts from the same place and the agent performs the same actions, there will likely be similar results each time (not exactly the same though, because of randomness in some games). Therefore, after each action it is a good idea to add all the data about the state, reward, action and the new state into some sort of memory. This memory can then be randomly sampled in batches to avoid the risk of over-fitting.
The network can therefore still be trained after each step if you desire (or less frequently, it’s up to the developer), but it is extracting the training data not from the agent’s ordered steps through the game, but rather a randomized memory of previous steps and outcomes that the agent has experienced. You’ll be able to see how this works in the code below.
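A minimal sketch of such a memory store (class and method names are assumed for illustration; each sample is a (state, action, reward, next state) tuple, as described above):

```python
import random

class Memory:
    def __init__(self, max_memory):
        # cap the memory so old, stale samples are eventually discarded
        self._max_memory = max_memory
        self._samples = []

    def add_sample(self, sample):
        # sample: a (state, action, reward, next_state) tuple
        self._samples.append(sample)
        if len(self._samples) > self._max_memory:
            self._samples.pop(0)  # drop the oldest sample

    def sample(self, num_samples):
        # draw a random batch, breaking the temporal correlation of game play
        num_samples = min(num_samples, len(self._samples))
        return random.sample(self._samples, num_samples)
```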
We are now ready to examine the game/environment that we will develop our network to learn.
The Mountain Car Environment and Open AI Gym
In this reinforcement learning tutorial, the deep Q network that will be created will be trained on the Mountain Car environment/game. This can be accessed through the open source reinforcement learning library called Open AI Gym. A screen capture from the rendered game can be observed below:
Mountain Car game
The object of this game is to get the car to go up the right-side hill to get to the flag. There’s one problem, however: the car doesn’t have enough power to motor all the way up the hill. Instead, the car / agent needs to learn that it must motor up one hill for a bit, then accelerate down the hill and back up the other side, and repeat until it builds up enough momentum to make it to the top of the hill.
As stated above, Open AI Gym is an open source reinforcement learning package that allows developers to interact easily with games such as the Mountain Car environment. You can find details about the Mountain Car environment here. Basically, the environment is represented by a two-element state vector, detailed below:
Mountain Car state vector
As can be observed, the agent’s state is represented by the car’s position and velocity. The goal/flag is sitting at a position = 0.5. The actions available to the agent are shown below:
Mountain Car actions
As can be observed, there are three actions available to the agent – accelerate to the left, right and no acceleration.
In the game’s default arrangement, for each time step where the car’s position is <0.5, it receives a reward of -1, up to a maximum of 200 time steps. So the incentive for the agent is to get the car’s position to >0.5 as soon as possible, after which the game ends. This will minimize the negative reward, which is the aim of the game.
However, in this default arrangement, it will take a significant period of time of random exploration before the car stumbles across the positive feedback of getting to the flag. As such, to speed things up a bit, in this example we’ll alter the reward structure to:
Position > 0.1: r += 10
Position > 0.25: r += 20
Position > 0.5: r += 100
This new reward structure gives the agent better positive feedback when it starts learning how to ascend the hill on the right hand side toward the flag. The position of 0.1 is just over half way up the right-hand hill.
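As a sketch, this shaping could sit in the environment step loop as follows (the function name is assumed; I read the thresholds as stacking, i.e. a position beyond 0.25 collects both the first and second bonus – whether the bonuses stack or are exclusive is a design choice):

```python
def shaped_reward(reward, position):
    # add bonuses for progress up the right-hand hill (thresholds from above)
    if position > 0.1:
        reward += 10
    if position > 0.25:
        reward += 20
    if position > 0.5:
        reward += 100
    return reward
```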
Ok, so now you know the environment, let’s write some code!
Reinforcement learning in TensorFlow
In this reinforcement learning implementation in TensorFlow, I’m going to split the code up into three main classes:
Model: This class holds the TensorFlow operations and model definitions
Memory: This class is where the memory of the actions, rewards and states are stored and retrieved from
GameRunner: This class is the main training and agent control class
As stated before, I’ll be assuming some prior knowledge of TensorFlow here. If you’re not up to speed, you’re welcome to wing it – otherwise, check out my TensorFlow tutorial. All the code for this tutorial can be found on this site’s Github repository.
I’ll go through each of the classes in turn in the sub-sections below.
The Model class
class Model:
    def __init__(self, num_states, num_actions, batch_size):
        self._num_states = num_states
        self._num_actions = num_actions
        self._batch_size = batch_size
        # define the placeholders
        self._states = None
        self._actions = None
        # the output operations
        self._logits = None
        self._optimizer = None
        self._var_init = None
        # now setup the model
        self._define_model()

    def _define_model(self):
        self._states = tf.placeholder(shape=[None, self._num_states], dtype=tf.float32)
        self._q_s_a = tf.placeholder(shape=[None, self._num_actions], dtype=tf.float32)
        # create a couple of fully connected hidden layers
        fc1 = tf.layers.dense(self._states, 50, activation=tf.nn.relu)
        fc2 = tf.layers.dense(fc1, 50, activation=tf.nn.relu)
        self._logits = tf.layers.dense(fc2, self._num_actions)
        loss = tf.losses.mean_squared_error(self._q_s_a, self._logits)
        self._optimizer = tf.train.AdamOptimizer().minimize(loss)
        self._var_init = tf.global_variables_initializer()
The first function within the class is of course the initialization function. All you need to pass into the Model definition is the number of states of the environment (2 in this game), the number of possible actions (3 in this game) and the batch size. The function simply sets up a few internal variables and operations, some of which are exposed as public properties later in the class definition. At the end of the initialization, the second method displayed above _define_model() is called. This method sets up the model structure and the main operations.
First, two placeholders are created _states and _q_s_a – these hold the state data and the $Q(s,a)$ training data respectively. The first dimension of these placeholders is set to None, so that it will automatically adapt when a batch of training data is fed into the model and also when single predictions from the model are required. The next lines create two fully connected layers fc1 and fc2 using the handy TensorFlow layers module. These hidden layers have 50 nodes each, and they are activated using the ReLU activation function (if you want to know more about the ReLU, check out my vanishing gradient and ReLU tutorial).
The next layer is the output layer _logits – this is another fully connected or dense layer, but with no activation supplied. When no activation function is supplied to the dense layer API in TensorFlow, it defaults to a ‘linear’ activation i.e. no activation. This is what we want, as we want the network to learn continuous $Q(s,a)$ values across all possible real numbers.
Next comes the loss – this isn’t a classification problem, so a good loss to use is simply a mean squared error loss. The next line specifies the optimizer – in this example, we’ll just use the generic Adam optimizer. Finally, the TensorFlow boiler plate global variable initializer operation is assigned to _var_init.
So far so good. Next, some methods of the Model class are created to perform prediction and training:
The first method predict_one simply returns the output of the network (i.e. by calling the _logits operation) with an input of a single state. Note the reshaping operation that is used to ensure that the data has a size (1, num_states). This is called whenever action selection by the agent is required. The next method, predict_batch predicts a whole batch of outputs when given a whole bunch of input states – this is used to perform batch evaluation of $Q(s,a)$ and $Q(s’,a’)$ values for training. Finally, there is a method called train_batch which takes a batch training step of the network.
That’s the Model class, now it is time to consider the Memory class.
The Memory class
The next class to consider in the code is the Memory class – this class stores all the results of the..
In the late 80’s and 90’s, neural network research stalled due to a lack of good performance. There were a number of reasons for this, outlined by the prominent AI researcher Geoffrey Hinton – these reasons included poor computing speeds, lack of data, using the wrong type of non-linear activation functions and poor initialization of the weights in neural networks. My post on the vanishing gradient problem and ReLUs addresses the problem of the wrong kind of non-linear activation functions, and this post will deal with proper weight initialization. In particular, in this post we’ll be examining the problem with a naive normal distribution when initializing weights, and examine Xavier and He initialization as a remedy to this problem. This will be empirically studied using TensorFlow and some associated TensorBoard visualizations. Note: to run the code in this tutorial, you’ll need TensorFlow 1.8 or greater installed.
Eager to learn more? Get the book here
The problem with a naive initialization of weights
The random initialization of weights is critical to learning good mappings from input to output in neural networks. Because the search space over many weights during training is very large, there are multiple local minima within which back-propagation may become trapped. Effective randomization of weights ensures that the search space is adequately explored during training, giving the best chance of a good minimum being found (for more on back-propagation, see my neural networks tutorial). However, the weight initialization randomization function needs to be carefully chosen and specified, otherwise there is a large risk that training progress will be slowed to the point of impracticality.
This is especially the case when using the historical “squashing” non-linear activation functions such as the sigmoid function and the tanh function, though it is still an issue with the ReLU function, as will be seen later. The reason for this problem is that, if the weights are such that the activation functions of nodes are pushed into the “flat” regions of their curves, they are “saturated” and impede learning. Consider the plot below showing the tanh function and its first derivative:
Tanh function and its first derivative
It can be observed that when abs(x) > 2, the derivative of the tanh function approaches zero. Now because the back-propagation method of updating the weight values in a neural network depends on the derivative of the activation functions, this means that when nodes are pushed into such “saturation” regions, slow or no learning will take place. Therefore, we don’t want to start with weight values that push some or all of the nodes into those saturation regions, as that network just won’t work very well. The sigmoid function operates similarly, as can be observed in my vanishing gradient post.
A naive initialization of weights might be to simply use a normal distribution of mean zero and unit standard deviation (i.e. 1.0). Let’s consider how this might play out using a bit of simple statistical theory. Recall that the input to a neuron in the first layer of a neural network looks like:
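$$ in = \sum_{i=1}^{n} X_i W_i $$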
The input, in other words, is a summation of the respective weights and their inputs. The variance (the square of the standard deviation) of each element in this sum can be explained by the product of independent variables law:
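$$ Var(X_i W_i) = Var(X_i) Var(W_i) $$

(this form of the law holds when $X_i$ and $W_i$ are independent with zero means, which is the case assumed here)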
If we assume that the input has been appropriately scaled with a mean of 0 and a unit variance, and likewise we initialize the weights for a mean 0 and unit variance, then this results in:
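$$ Var(X_i W_i) = 1 \times 1 = 1 $$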
So each product within the total sum of in has a variance of 1. What is the total variance of the node input variable in? We can make the assumption that each product (i.e. each $X_i W_i$) is statistically independent (not quite correct for things like images, but close enough for our purposes) and then apply the sum of uncorrelated independent variables law:
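$$ Var(in) = \sum_{i=1}^{n} Var(X_i W_i) = n $$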
Where n is the number of inputs. So here, we can observe that if there are, say, 784 inputs (equal to the input size of the MNIST problem), the variance will be large and the standard deviation will be $\sqrt{Var(in)} = \sqrt{784} = 28$. This will result in the vast majority of neurons in the first layer being saturated, as most node inputs will have a magnitude well beyond 2 (i.e. they land in the saturation regions of the activation functions).
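This back-of-the-envelope result is easy to confirm empirically. The sketch below draws 784 unit-variance inputs and naive unit-variance weights and measures the spread of the resulting node inputs:

```python
import numpy as np

rng = np.random.default_rng(42)
n, trials = 784, 2000
X = rng.standard_normal((trials, n))   # unit-variance inputs
W = rng.standard_normal((trials, n))   # naive unit-variance weights
# each row is one draw of the pre-activation input to a single node
pre_activations = (X * W).sum(axis=1)
print(pre_activations.std())  # close to sqrt(784) = 28
```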
Clearly this is not ideal, and so another way of initializing our weight variables is desirable.
Xavier or variance scaling for weight initialization
The Xavier method of weight initialization is a big improvement on the naive way of weight scaling shown in the section above. This method has helped accelerate the field of deep learning in a big way. It takes into account the problems shown above and bases the standard deviation or the variance of the weight initialization on the number of variables involved. It thereby adjusts itself based on the number of weight values. It works on the idea that if you can keep the variance constant from layer to layer in both the feed forward direction and back-propagation direction, your network will learn optimally. This makes sense, as if the variance increases or decreases as you go through the layers, your weights will eventually saturate your non-linear neurons in either the positive or negative direction.
So, how do we use this idea to work out what variance should be used to best initialize the weights? First, because the network will be learning effectively when it is operating in the linear regions of the tanh and sigmoid functions, the activation function can be approximated by a linear activation, i.e.:
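$$ f(x) \approx x $$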
Therefore, with this linear activation function, we can use the same result that was arrived at above using the product of independent variables and sum of uncorrelated independent variables, namely:
$$ Var(Y) = n_{in} Var(W_i)Var(X_i)$$
Where $n_{in}$ is the number of inputs to each node. If we want the variance of the input ($Var(X_i)$) to be equal to the variance of the output ($Var(Y)$) this reduces to:
$$ Var(W_i) = \frac{1}{n_{in}} $$
Which is a preliminary result for a good initialization variance for the weights in your network. However, this only keeps the variance constant during the forward pass. What about keeping the variance constant during back-propagation as well? It turns out that to do this you need:
$$ n_{out} Var(W_i) = 1 $$
Or:
$$ Var(W_i) = \frac{1}{n_{out}} $$
Now there are two different ways of calculating the variance, one depending on the value of the number of inputs and the other on the number of outputs. The authors of the original paper on Xavier initialization take the average of the two:
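$$ Var(W_i) = \frac{2}{n_{in} + n_{out}} $$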
That is the final result in the Xavier initialization of weights for squashing activation functions i.e. tanh and sigmoid. However, it turns out this isn’t quite as optimal for ReLU functions.
ReLU activations and the He initialization
Consider the ReLU function – for all values less than zero, the output of the activation function is also zero. For values greater than zero, the ReLU function simply returns its input. In other words, half of the output is linear, like the assumption made in the analysis above – so that’s easy. However, for the other half of the inputs, for input values < 0, the output is zero. If we assume that the inputs to the ReLU neurons are approximately centered about 0, then, roughly speaking, half the variance will be in line with the Xavier initialization result, and the other half will be 0.
This is basically equivalent to halving the number of input nodes. So if we return to our Xavier calculations, but with half the number of input nodes, we have:
$$ Var(Y) = \frac{n_{in}}{2} Var(W_i)Var(X_i) $$
Again, if we want the variance of the input ($Var(X_i)$) to be equal to the variance of the output ($Var(Y)$) this reduces to:
$$ Var(W_i) = \frac{2}{n_{in}} $$
This is He initialization, and this initialization has been found to generally work better with ReLU activation functions.
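The practical difference is easy to see by pushing unit-variance data through a stack of ReLU layers: with He scaling the pre-activation variance stays of order one, while naive unit-variance weights blow it up. The layer sizes and depth below are arbitrary choices for illustration:

```python
import numpy as np

def final_preact_variance(init_std, seed=0):
    """Push unit-variance data through 10 ReLU layers and return the
    variance of the final layer's pre-activations."""
    rng = np.random.default_rng(seed)
    n = 256
    x = rng.standard_normal((512, n))   # unit-variance input batch
    z = x
    for _ in range(10):
        W = rng.standard_normal((n, n)) * init_std
        z = x @ W                        # pre-activation values
        x = np.maximum(z, 0)             # ReLU
    return z.var()

naive_var = final_preact_variance(1.0)                # naive N(0, 1) weights
he_var = final_preact_variance(np.sqrt(2.0 / 256))    # He initialization
print(naive_var, he_var)  # naive explodes; He stays of order one
```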
Now that we’ve reviewed the theory, let’s get to the code.
Weight initialization in TensorFlow
This section will show you how to initialize weights easily in TensorFlow. The full code can be found on this site’s Github page. Performing Xavier and He initialization in TensorFlow is now really straight-forward using the tf.contrib.layers.variance_scaling_initializer. By adjusting the available parameters, we can create either Xavier, He or other types of modern weight initializations. In this TensorFlow example, I’ll be creating a simple MNIST classifier using TensorFlow’s packaged MNIST dataset, with a simple three layer fully connected neural network architecture. I’ll also be logging various quantities so that we can visualize the variance, activations and so on in TensorBoard.
First, we define a Model class to hold the neural network model:
class Model(object):
    def __init__(self, input_size, label_size, initialization, activation, num_layers=3,
                 hidden_size=100):
        self._input_size = input_size
        self._label_size = label_size
        self._init = initialization
        self._activation = activation
        # num layers does not include the input layer
        self._num_layers = num_layers
        self._hidden_size = hidden_size
        self._model_def()
The above code is the class initialization function – notice that various initialization and activation functions can be passed to the model. Later on, we’ll cycle through different weight initialization and activation functions and see how they perform.
In the next section, I define the model creation function inside the Model class:
def _model_def(self):
    # create placeholder variables
    self.input_images = tf.placeholder(tf.float32, shape=[None, self._input_size])
    self.labels = tf.placeholder(tf.float32, shape=[None, self._label_size])
    # create self._num_layers dense layers as the model
    input = self.input_images
    tf.summary.scalar("input_var", self._calculate_variance(input))
    for i in range(self._num_layers - 1):
        input = tf.layers.dense(input, self._hidden_size, kernel_initializer=self._init,
                                activation=self._activation, name='layer{}'.format(i + 1))
        # get the input to the nodes (sans bias)
        mat_mul_in = tf.get_default_graph().get_tensor_by_name("layer{}/MatMul:0".format(i + 1))
        # log pre and post activation function histograms
        tf.summary.histogram("mat_mul_hist_{}".format(i + 1), mat_mul_in)
        tf.summary.histogram("fc_out_{}".format(i + 1), input)
        # also log the variance of mat mul
        tf.summary.scalar("mat_mul_var_{}".format(i + 1), self._calculate_variance(mat_mul_in))
    # don't supply an activation for the final layer - the loss definition will
    # supply softmax activation. This defaults to a linear activation i.e. f(x) = x
    logits = tf.layers.dense(input, 10, name='layer{}'.format(self._num_layers))
    mat_mul_in = tf.get_default_graph().get_tensor_by_name("layer{}/MatMul:0".format(self._num_layers))
    tf.summary.histogram("mat_mul_hist_{}".format(self._num_layers), mat_mul_in)
    tf.summary.histogram("fc_out_{}".format(self._num_layers), logits)
    # use softmax cross entropy with logits - no need to apply softmax activation to
    # logits
    self.loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(logits=logits,
                                                                          labels=self.labels))
    # add the loss to the summary
    tf.summary.scalar('loss', self.loss)
    self.optimizer = tf.train.AdamOptimizer().minimize(self.loss)
    self.accuracy = self._compute_accuracy(logits, self.labels)
    tf.summary.scalar('acc', self.accuracy)
    self.merged = tf.summary.merge_all()
    self.init_op = tf.global_variables_initializer()
I’ll step through the major points in this function. First, there are the usual placeholders to hold the training input and output data – if you’re unfamiliar with the basics of TensorFlow, check out my introductory tutorial here. Then, a scalar variable is logged called “input_var” which logs the variance of the input images, calculated via the _calculate_variance function – this will be presented later (if TensorFlow logging and visualization is unfamiliar to you, check out my TensorFlow visualization tutorial). The next step involves a loop through the layers, and here I have used the TensorFlow layers API which allows us to create densely connected layers easily. Notice that the kernel_initializer argument is what will initialize the weights of the layer, and activation is the activation function which the layer neurons will use.
Next, I access the values of the matrix multiplication between the weights and inputs for each layer, and log the values. This way we can observe what the values of the inputs to each neuron are, and the variance of these inputs. We log these values as histograms. Finally, within the layer loop, the variance of the matrix multiplication input is also logged as a scalar.
The remainder of this model construction function is all the standard TensorFlow operations which define the loss, the optimizer and variable initialization, and also some additional logging of variables. The next function to take notice of within the Model class is the _calculate_variance function – it looks like:
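The helper is not reproduced here, but its logic is just the textbook variance formula. A NumPy equivalent looks like the following (the actual class method would apply tf.reduce_mean and tf.square to a tensor instead):

```python
import numpy as np

def calculate_variance(x):
    # variance = mean of the squared deviations from the mean
    mean = np.mean(x)
    return np.mean((x - mean) ** 2)
```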
The function above is just a simple calculation of the variance of x.
The main code block creates a list of various scenarios to run through, each with a different folder name in which to store the results, a different weight initialization function and finally a different activation function to supply to the neurons. The main training / analysis loop first runs a single batch of data through the network to examine initial variances. Thereafter it performs a full training run of the network so that performance indicators can be analysed.
if __name__ == "__main__":
    sub_folders = ['first_pass_normal', 'first_pass_variance',
                   'full_train_normal', 'full_train_variance',
                   'full_train_normal_relu', 'full_train_variance_relu',
                   'full_train_he_relu']
    initializers = [tf.random_normal_initializer,
                    tf.contrib.layers.variance_scaling_initializer(factor=1.0, mode='FAN_AVG', uniform=False),
                    tf.random_normal_initializer,
                    tf.contrib.layers.variance_scaling_initializer(factor=1.0, mode='FAN_AVG', uniform=False),
                    tf.random_normal_initializer,
                    tf.contrib.layers.variance_scaling_initializer(factor=1.0, mode='FAN_AVG', uniform=False),
                    tf.contrib.layers.variance_scaling_initializer(factor=2.0, mode='FAN_IN', uniform=False)]
    activations = [tf.sigmoid, tf.sigmoid, tf.sigmoid, tf.sigmoid, tf.nn.relu, tf.nn.relu, tf.nn.relu]
    assert len(sub_folders) == len(initializers) == len(activations)
    maybe_create_folder_structure(sub_folders)
    for i in range(len(sub_folders)):
        tf.reset_default_graph()
        model = Model(784, 10, initializers[i], activations[i])
        if "first_pass" in sub_folders[i]:
            init_pass_through(model, sub_folders[i])
        else:
            train_model(model, sub_folders[i], 30, 1000)
The most important things to consider in the code above are the Xavier and He weight initialization definitions. The function used to create these is tf.contrib.layers.variance_scaling_initializer, which allows us to create weight initializers that scale according to the number of input and output connections, in order to implement the Xavier and He initialization schemes discussed previously.
The three arguments used in this function are:
The factor argument, which is a multiplicative factor applied to the variance scaling. This is 1.0 for Xavier weight initialization, and 2.0 for He weight initialization.
The mode argument: this defines what appears in the denominator of the variance calculation. If ‘FAN_IN’, the variance scaling is based solely on the number of inputs to the node; if ‘FAN_OUT’, it is based solely on the number of outputs; if ‘FAN_AVG’, it is based on the average of the two, i.e. Xavier initialization. For He initialization, use ‘FAN_IN’.
The uniform argument: this defines whether to use a uniform distribution or a normal distribution to sample the weights from during initialization. For both Xavier and He weight initialization, you can use a normal distribution, so set this argument to False
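Putting these arguments together: the initializer draws weights with variance factor / n, where n is determined by the mode. A NumPy sketch of this variance rule (an illustrative helper, not the TensorFlow API – with uniform=False the real implementation actually samples from a truncated normal with a small correction factor, but the target variance is the same):

```python
import numpy as np

def variance_scaling_init(fan_in, fan_out, factor=1.0, mode='FAN_AVG', seed=0):
    # Reproduces the variance rule of tf.contrib.layers.variance_scaling_initializer
    if mode == 'FAN_IN':
        n = fan_in
    elif mode == 'FAN_OUT':
        n = fan_out
    else:  # 'FAN_AVG'
        n = (fan_in + fan_out) / 2.0
    stddev = np.sqrt(factor / n)
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, stddev, size=(fan_in, fan_out))

# Xavier: factor=1.0, FAN_AVG  ->  Var(W) = 2 / (fan_in + fan_out)
w_xavier = variance_scaling_init(784, 300, factor=1.0, mode='FAN_AVG')
# He:     factor=2.0, FAN_IN   ->  Var(W) = 2 / fan_in
w_he = variance_scaling_init(784, 300, factor=2.0, mode='FAN_IN')
```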
The other weight initialization function used in the scenarios is tf.random_normal_initializer with default parameters. The defaults for this initializer are a mean of zero and a standard deviation of 1.0 (and therefore a unit variance).
After running this code, a number of interesting results are obtained.
Visualizing the TensorFlow model variables
The first thing that we want to look at is the “first pass” model results, where only one batch is passed through the model. If we look at the distribution of inputs into the first layer in TensorBoard, with our naive normally distributed weight values with a unit variance, we can see the following (if TensorBoard visualization is unfamiliar to you, check out my TensorFlow visualization tutorial):
First pass distribution of inputs to the first layer
As can be observed, the matrix multiplication input into the first layer is approximately normally distributed, with a standard deviation of around 10. If you recall, the variance scalar of the matrix multiplication input was also logged, and it gives a value of approximately 88. Does this make sense? I mentioned earlier that with 784 inputs (i.e. the input size of the MNIST dataset), we should expect a variance of approximately 784. What’s the explanation of this discrepancy? Well, remember I also logged the variance of the input data – it turns out that the MNIST TensorFlow dataset has a variance of 0.094. You’ll recall that we assumed a unit variance in the calculations previously shown. In this case, though, we should expect a variance of (remember that $Var(W_i)$, for the normal distribution initializer we are currently considering, is equal to 1.0):

$$n \cdot Var(W_i) \cdot Var(x_i) = 784 \times 1.0 \times 0.094 \approx 73.7$$
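This expected variance can be sanity-checked with a quick simulation (a sketch, assuming independent zero-mean inputs with variance 0.094 and unit-variance weights as above):

```python
import numpy as np

rng = np.random.default_rng(42)
n_in, n_samples = 784, 20000

# inputs with variance ~0.094 (as measured for the MNIST data)
x = rng.normal(0.0, np.sqrt(0.094), size=(n_samples, n_in))
# weights from N(0, 1), as with the default tf.random_normal_initializer
W = rng.normal(0.0, 1.0, size=n_in)

z = x @ W  # the matrix multiplication input to a single neuron
print(np.var(z))  # expected around 784 * 1.0 * 0.094 ~= 73.7
```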
This is roughly in line with the observed variance – so we can be happy that we are on the right track. The distribution shown above is the distribution into the first layer neurons. In the first set of scenarios, we’re using a sigmoid activation function – so what does the first layer output distribution look like for this type of input distribution?
Distribution of outputs from first layer – sigmoid activations and normal weight initialization
As can be observed, the input distribution with such a relatively large variance completely saturates the first layer – with the output distribution being squeezed to the saturated regions of the sigmoid curve i.e. outputs close to 0 and 1 (we’d observe the same thing with a tanh activation). This confirms our previous analysis of the problems with a naive normally distributed weight initialization.
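This saturation is easy to reproduce numerically: feeding inputs with a standard deviation of around 10 through a sigmoid pushes most activations to the extremes (a NumPy sketch, with 0.05 / 0.95 as an illustrative cut-off for "saturated"):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
z = rng.normal(0.0, 10.0, size=100000)  # layer inputs with std ~10
a = sigmoid(z)

# fraction of outputs squeezed towards 0 or 1
saturated = np.mean((a < 0.05) | (a > 0.95))
print(saturated)  # roughly three quarters of all activations
```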
What happens when we use the Xavier initialization configuration of the variance scaler?