For an RGB input, we obtain a saliency value for each channel of every pixel; we can either take the maximum or the average across the channels, or keep all three channels.
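As a minimal sketch of the channel reduction (assuming a hypothetical saliency array of per-channel gradient magnitudes, which is not from the original code):

import numpy as np

# Hypothetical per-channel saliency map of shape (height, width, 3),
# e.g. absolute input gradients for an RGB image.
saliency = np.abs(np.random.randn(224, 224, 3))

saliency_max = saliency.max(axis=-1)    # strongest response across the channels, shape (224, 224)
saliency_mean = saliency.mean(axis=-1)  # average response across the channels, shape (224, 224)
# Alternatively, keep all three channels and display the map directly as an RGB image.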
Two good papers outlining the functioning of saliency maps are "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps" and "Attention-based Extraction of Structured Information from Street View Imagery". There is a GitHub repository associated with this article in which I show how to generate saliency maps (the repository can be found here).
Here is a snippet of the code from the Jupyter notebook (it assumes the notebook has already defined model, test_images, and test_labels):

import numpy as np
import matplotlib.pyplot as plt
from vis.visualization import visualize_saliency
from vis.utils import utils
from keras import activations

# Utility to search for layer index by name.
# Alternatively, we can specify this as -1 since it corresponds to the last layer.
layer_idx = utils.find_layer_idx(model, 'preds')

plt.rcParams["figure.figsize"] = (5, 5)

from vis.visualization import visualize_cam
import warnings
warnings.filterwarnings('ignore')

# This corresponds to the Dense linear layer.
for class_idx in np.arange(10):
    # Pick the first test example belonging to this class.
    indices = np.where(test_labels[:, class_idx] == 1.)[0]
    idx = indices[0]

    f, ax = plt.subplots(1, 4)
    ax[0].imshow(test_images[idx][..., 0])

    # Generate the map with the vanilla, guided, and rectified backprop modifiers.
    for i, modifier in enumerate([None, 'guided', 'relu']):
        grads = visualize_cam(model, layer_idx, filter_indices=class_idx,
                              seed_input=test_images[idx], backprop_modifier=modifier)
        if modifier is None:
            modifier = 'vanilla'
        ax[i+1].set_title(modifier)
        ax[i+1].imshow(grads, cmap='jet')

This code results in the following saliency maps being generated (assuming that the relevant libraries vis.utils and vis.visualization, provided by the keras-vis package, are installed).
Please see the notebook if you want a fuller walkthrough of the implementation.
In the next section, we will discuss the idea of upsampling through the use of transposed convolutions.
Transposed Convolution

So far, the convolutions we have looked at either maintain the size of their input or make it smaller.
We can use the same technique to make the input tensor larger.
This process is called upsampling.
When we do it inside of a convolution step, it is called transposed convolution or fractional striding.
Note: Some authors refer to upsampling while convolving as "deconvolution", but that name is already taken by a different idea, outlined in the following paper: https://arxiv.org/pdf/1311.2901.pdf
To illustrate how the transposed convolution works, we will look at some illustrated examples of convolutions.
The first is an example of a typical convolutional layer with no padding, acting on an image of size 5 × 5.
After the convolution, we end up with a 3 × 3 image.
Image taken from A. Glassner, Deep Learning, Vol. 2: From Basics to Practice
Now we look at a convolutional layer with a padding of 1.
The original image is 5 × 5, and the output image after the convolution is also 5 × 5.
Image taken from A. Glassner, Deep Learning, Vol. 2: From Basics to Practice
Now we look at a convolutional layer with a padding of 2. The original image is 3 × 3, and the output image after the convolution is 5 × 5, which is larger than the input.
Image taken from A. Glassner, Deep Learning, Vol. 2: From Basics to Practice
When used in Keras, such as in the development of a variational autoencoder, these are implemented using an upsampling layer.
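As a rough Keras sketch of the idea (not the layers of any particular model), a transposed convolution and an upsampling layer can both enlarge a feature map:

from keras.models import Sequential
from keras.layers import Conv2DTranspose, UpSampling2D

# Transposed convolution: a learnable layer that enlarges a 3x3 feature map to 5x5
# (3x3 kernel, stride 1, 'valid' padding), mirroring the illustration above.
upconv = Sequential([
    Conv2DTranspose(filters=1, kernel_size=3, strides=1, padding='valid',
                    input_shape=(3, 3, 1))
])
upconv.summary()  # output shape: (None, 5, 5, 1)

# UpSampling2D: a non-learnable alternative that simply repeats rows and columns.
upsample = Sequential([UpSampling2D(size=(2, 2), input_shape=(3, 3, 1))])
upsample.summary()  # output shape: (None, 6, 6, 1)

The transposed convolution learns its upsampling weights during training, whereas UpSampling2D is a fixed operation.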
Hopefully, if you have seen these layers before, it now makes sense how they are able to increase the size of an image through the use of transposed convolutions.
In the next section, we will discuss the architectures of some of the classic networks.
Each of these networks was revolutionary in some sense in forwarding the field of deep convolutional networks.
Classic Networks

In this section, I will go over some of the classic CNN architectures.
These networks were utilized in some of the seminal work done in the field of deep learning, and are often used for transfer learning purposes (this is a topic for a future article).
The first piece of research proposing something similar to a convolutional neural network was authored by Kunihiko Fukushima in 1980 and was called the NeoCognitron. Fukushima was inspired by discoveries about the visual cortex of mammals.
Fukushima applied the NeoCognitron to hand-written character recognition.
By the end of the 1980s, several papers were produced that considerably advanced the field. The idea of backpropagation was published in French by Yann LeCun in 1985 (it was also independently discovered by other researchers), followed shortly by the time-delay neural network (TDNN) of Waibel et al. in 1989, a convolution-like network trained with backpropagation.
One of the first applications was by LeCun et al.
in 1989, using backpropagation applied to handwritten zip code recognition.
LeNet-5

The formulation of LeNet-5 is a bit outdated in comparison to current practices. It is one of the first neural architectures developed during the nascent phase of deep learning at the end of the 20th century. In November 1998, LeCun published one of his most recognized papers, describing a "modern" CNN architecture for document recognition, called LeNet. This was not his first iteration; it was, in fact, LeNet-5, but this paper is the commonly cited publication when talking about LeNet.
It uses convolutional layers followed by pooling layers and finishes with fully connected layers. The network starts with high-resolution feature maps and progressively reduces their spatial size while increasing the number of channels.
There are around 60,000 parameters in this network.
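As a rough Keras sketch of a LeNet-5-style network (using modern ReLU and softmax conventions rather than the original tanh activations and RBF output layer):

from keras.models import Sequential
from keras.layers import Conv2D, AveragePooling2D, Flatten, Dense

# LeNet-5-style architecture (a sketch, not a faithful reproduction of the 1998 model).
lenet = Sequential([
    Conv2D(6, kernel_size=5, activation='relu', input_shape=(32, 32, 1)),  # 32x32x1 -> 28x28x6
    AveragePooling2D(pool_size=2),                                         # 28x28x6 -> 14x14x6
    Conv2D(16, kernel_size=5, activation='relu'),                          # 14x14x6 -> 10x10x16
    AveragePooling2D(pool_size=2),                                         # 10x10x16 -> 5x5x16
    Flatten(),
    Dense(120, activation='relu'),
    Dense(84, activation='relu'),
    Dense(10, activation='softmax'),
])
lenet.summary()  # roughly 60,000 parameters, in line with the figure quoted above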
LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278–2324.
AlexNet

The AlexNet architecture is one of the most important architectures in deep learning, with more than 25,000 citations; this is practically unheard of in research literature.
Developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton at the University of Toronto in 2012, AlexNet destroyed the competition in the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC).
The network was trained on the ImageNet dataset, a collection of 1.2 million high-resolution images spanning 1000 different classes (cropped to 227×227×3 inputs), using data augmentation. The model was deeper than any other network at the time and was trained on GPUs for 5–6 days.
The network consists of eight learned layers (five convolutional and three fully connected), utilized dropout for regularization, and was one of the first networks to use the ReLU activation function, which is still widely used today.
The network had more than 60 million parameters to optimize (~255 MB).
This network almost single-handedly kickstarted the AI revolution by showing the impressive performance and potential benefits of CNNs.
The network won the ImageNet contest with a top-5 error of 15.3%, more than 10.8 percentage points lower than that of the runner-up.
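To make those numbers concrete, here is a rough single-GPU Keras sketch of an AlexNet-style stack (the original was split across two GPUs and included local response normalization, omitted here); it comes out at roughly 60 million parameters:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

# AlexNet-style sketch: five convolutional layers followed by three fully connected layers.
alexnet = Sequential([
    Conv2D(96, 11, strides=4, activation='relu', input_shape=(227, 227, 3)),
    MaxPooling2D(pool_size=3, strides=2),
    Conv2D(256, 5, padding='same', activation='relu'),
    MaxPooling2D(pool_size=3, strides=2),
    Conv2D(384, 3, padding='same', activation='relu'),
    Conv2D(384, 3, padding='same', activation='relu'),
    Conv2D(256, 3, padding='same', activation='relu'),
    MaxPooling2D(pool_size=3, strides=2),
    Flatten(),
    Dense(4096, activation='relu'),
    Dropout(0.5),
    Dense(4096, activation='relu'),
    Dropout(0.5),
    Dense(1000, activation='softmax'),
])
alexnet.summary()  # roughly 60 million parameters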
ImageNet results from 2011 to 2016 (Source).
We will be discussing the remaining networks that have won the ILSVRC, since most of these were revolutionary networks at the forefront of deep learning research.
ZFNet

This network was introduced by Matthew Zeiler and Rob Fergus of New York University and won ILSVRC 2013 with an 11.2% error rate. The network decreased the sizes of the filters and was trained for 12 days.
The paper presented a visualization technique named “deconvolutional network”, which helps to examine different feature activations and their relation to the input space.
VGG16 and VGG19

The VGG network was introduced by Simonyan and Zisserman (Oxford) in 2014. This network is revolutionary in its inherent simplicity and structure. It consists of 16 or 19 layers (hence the name) with a total of 138 million parameters (522 MB) and uses 3×3 convolutional filters exclusively, with 'same' padding and a stride of 1, together with 2×2 max-pooling layers with a stride of 2.
The authors showed that two stacked 3×3 filters have an effective receptive field of 5×5, and that the depth increases as the spatial size decreases.
The network was trained for two to three weeks and is still used to this day, mainly for transfer learning.
The network was originally developed for the ImageNet Challenge in 2014.
In summary:
ImageNet Challenge 2014; 16 or 19 layers; 138 million parameters (522 MB).
Convolutional layers use 'same' padding and a stride s = 1.
Max-pooling layers use a filter size f = 2 and a stride s = 2.
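Because VGG16 ships with Keras, a quick sketch of loading the pretrained ImageNet weights looks like this (the weights are downloaded on first use; the image path is hypothetical):

from keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions
from keras.preprocessing import image
import numpy as np

# Load VGG16 with pretrained ImageNet weights.
model = VGG16(weights='imagenet')
model.summary()  # 138 million parameters, as quoted above

# Classify a single image (the filename is only an example).
img = image.load_img('elephant.jpg', target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
preds = model.predict(x)
print(decode_predictions(preds, top=3))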
GoogLeNet (Inception-v1)

The GoogLeNet network was introduced by Szegedy et al. (Google) in 2014. The network was the winner of ILSVRC 2014, beating the VGG architecture.
The network introduces the concept of the inception module — parallel convolutional layers with different filter sizes.
The idea here is that we do not a priori know which filter size is best, so we just let the network decide.
The Inception network is formed by stacking multiple inception modules on top of one another. It also includes several auxiliary softmax outputs to provide regularization.
This was a key idea which has been important in the development of future architectures.
Another interesting feature is that there is no fully connected layer at the end, and this is instead replaced with an average-pooling layer.
The removal of this fully connected layer results in a network with 12x fewer parameters than AlexNet, making it much faster to train.
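A minimal sketch of an inception-style module in the Keras functional API (the filter counts here are illustrative choices, not values taken from this article):

from keras.layers import Input, Conv2D, MaxPooling2D, concatenate
from keras.models import Model

inputs = Input(shape=(28, 28, 192))

# Parallel branches with different filter sizes; 1x1 convolutions reduce depth first.
branch1 = Conv2D(64, 1, padding='same', activation='relu')(inputs)
branch2 = Conv2D(96, 1, padding='same', activation='relu')(inputs)
branch2 = Conv2D(128, 3, padding='same', activation='relu')(branch2)
branch3 = Conv2D(16, 1, padding='same', activation='relu')(inputs)
branch3 = Conv2D(32, 5, padding='same', activation='relu')(branch3)
branch4 = MaxPooling2D(pool_size=3, strides=1, padding='same')(inputs)
branch4 = Conv2D(32, 1, padding='same', activation='relu')(branch4)

# Concatenate the branches along the channel axis; the network "decides" which filter size matters.
outputs = concatenate([branch1, branch2, branch3, branch4])
module = Model(inputs, outputs)
module.summary()  # output shape: (None, 28, 28, 256)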
Residual Networks

The first residual network was presented by He et al. (Microsoft) in 2015. This network won ILSVRC 2015 in multiple categories.
The main idea behind this network is the residual block.
The network allows for the development of extremely deep neural networks, which can contain 100 layers or more.
This is revolutionary since up to this point, the development of deep neural networks was inhibited by the vanishing gradient problem, which occurs when propagating and multiplying small gradients across a large number of layers.
The authors argue that it is easier to optimize the residual mapping than the original, unreferenced mapping. Furthermore, a residual block can effectively "shut itself down" if needed, by learning weights close to zero so that it simply passes its input through.
Let’s compare the network structure for a plain network and a residual network.
The plain network structure passes activations straight through a stack of layers; a residual network adds a shortcut (skip) connection that jumps over two layers. The equations describing the residual block are:

$z^{[l+1]} = W^{[l+1]} a^{[l]} + b^{[l+1]}, \quad a^{[l+1]} = g(z^{[l+1]})$
$z^{[l+2]} = W^{[l+2]} a^{[l+1]} + b^{[l+2]}, \quad a^{[l+2]} = g(z^{[l+2]} + a^{[l]})$

With this extra connection, gradients can travel backward more easily.
It becomes a flexible block that can expand the capacity of the network, or simply transform into an identity function that would not affect training.
Example training of an 18- and 34-layer residual network.
A residual network stacks residual blocks sequentially.
The idea is to allow the network to become deeper without increasing the training complexity.
Residual networks implement blocks with convolutional layers that use the 'same' padding option (even when max-pooling). This allows the block to learn the identity function. The designer may instead want to reduce the size of the features and use 'valid' padding; in that case, the shortcut path can implement a new set of convolutional layers that reduces the size appropriately.
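A minimal Keras sketch of a basic residual block with an identity shortcut (a 1×1 convolution on the shortcut, not shown here, would handle any change in size or depth):

from keras.layers import Input, Conv2D, BatchNormalization, Activation, add
from keras.models import Model

def residual_block(x, filters):
    """Two 3x3 convolutions with 'same' padding, plus an identity shortcut."""
    shortcut = x
    y = Conv2D(filters, 3, padding='same')(x)
    y = BatchNormalization()(y)
    y = Activation('relu')(y)
    y = Conv2D(filters, 3, padding='same')(y)
    y = BatchNormalization()(y)
    y = add([y, shortcut])  # the skip connection: add, not concatenate
    return Activation('relu')(y)

inputs = Input(shape=(56, 56, 64))
outputs = residual_block(inputs, 64)
model = Model(inputs, outputs)
model.summary()  # output shape matches the input: (None, 56, 56, 64)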
These networks can get huge and extremely complicated, and their diagrams begin to look akin to those that describe the functioning of a power plant.
Here is an example of such a network.
Comparing the error values of the previous ImageNet winners to those of the ResNet formulations, we can see a clear improvement in performance. AlexNet (2012) achieved a top-5 error of 15.3% (second place was 26.2%), followed by ZFNet (2013) with a top-5 error of 14.8% (along with its feature visualizations), then GoogLeNet (2014) with an error of 7.8%, and finally ResNet (2015), which achieved top-5 error rates below 5% for the first time.
Dense Networks

DenseNets were initially proposed by Huang et al. in 2016 as a radical extension of the ResNet philosophy.
Within a dense block, each layer uses all preceding feature maps as input, effectively concatenating them. These connections mean that a network with L layers has L(L+1)/2 direct connections; for example, a 5-layer block has 15 direct connections.
One can think of the architecture as an unrolled recurrent neural network.
Each layer adds k feature-maps of its own to this state.
The growth rate regulates how much new information each layer contributes to the global state.
The idea here is that we have all the previous information available at each point.
Counter-intuitively, this architecture reduces the total number of parameters needed.
A 5-layer dense block with a growth rate of k = 4.
Each layer takes all preceding feature-maps as inputs.
The network works by allowing maximum information (and gradient) flow at each layer by connecting every layer directly with every other layer.
In this way, DenseNets exploit the potential of the network through feature reuse, which means there is no need to learn redundant feature maps.
DenseNet layers are relatively narrow (e.g. 12 filters), and they just add a small set of new feature maps.
The DenseNet architecture typically has superior performance to the ResNet architecture and can achieve the same or better accuracy with fewer parameters overall, and the networks are easier to train.
Performance comparison of various ResNet and DenseNet architectures.
The network formulation may be a bit confusing at first, but it is essentially a ResNet architecture in which the residual blocks are replaced by dense blocks.
The dense connections have a regularizing effect, which reduces overfitting on tasks with smaller training set sizes.
It is important to note that DenseNets do not sum the output feature maps of a layer with the incoming feature maps; they concatenate them: $x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$. The spatial dimensions of the feature maps remain constant within a block, while the number of feature maps grows by k (the growth rate) with every layer.
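A minimal Keras sketch of a dense block, assuming a growth rate of k = 4 as in the figure above (the batch normalization and bottleneck layers of the full architecture are omitted for brevity):

from keras.layers import Input, Conv2D, concatenate
from keras.models import Model

def dense_block(x, num_layers, growth_rate):
    """Each layer sees the concatenation of all preceding feature maps."""
    for _ in range(num_layers):
        new_features = Conv2D(growth_rate, 3, padding='same', activation='relu')(x)
        x = concatenate([x, new_features])  # concatenate, not add
    return x

inputs = Input(shape=(32, 32, 16))
outputs = dense_block(inputs, num_layers=5, growth_rate=4)
model = Model(inputs, outputs)
model.summary()  # channels grow from 16 to 16 + 5 * 4 = 36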
Below is the full architecture of a dense network.
It is fairly involved when we look at the network in its full resolution, which is why it is typically easier to visualize in an abstracted form (like we did above).
For more information on DenseNet, I recommend the DenseNet review article on towardsdatascience.com.

Summary of Networks

As we can see, over the course of just a few years, we have gone from an error rate of around 15% on the ImageNet dataset (which, if you remember, consists of 1.2 million images) to an error rate of around 3–4%.
Nowadays, state-of-the-art networks are able to get below 3% fairly consistently. There is still quite a long way to go before these networks achieve perfect scores, but the rate of progress over the past decade has been staggering. It should be apparent from this why we are currently undergoing a deep learning revolution: we have gone from a stage where humans had superior visual recognition to one where these networks have superior vision (a human cannot achieve 3% error on the ImageNet dataset).
This has fueled the transition of machine learning algorithms into various commercial fields that require heavy use of image analysis, such as medical imaging (examining brain scans, x-rays, mammography scans) and self-driving cars (computer vision).
Image analysis is easily extended to video since this is just a rapid succession of multiple image frames every second — although this requires more computing power.
Transfer Learning

Transfer learning is an important topic, and it is definitely worthy of an article all to itself.
However, for now, I will outline the basic idea behind transfer learning so that the reader is able to do more research on it if they are interested.
How do you make an image classifier that can be trained in a few hours (or minutes) on a CPU? Normally, image classification models can take hours, days, or even weeks to train, especially for exceptionally large networks and datasets.
However, we know that companies such as Google and Microsoft have dedicated teams of data scientists that have spent years developing exceptional networks for image classification, so why not use these networks as a starting point for your own projects? This is the idea behind transfer learning: use pre-trained models, i.e. models with known weights, and apply them to a different machine learning problem.
Obviously, purely transferring the model will not be enough; you must still train the network on your new data. However, it is common to freeze the weights of the earlier layers, as these capture more general features that are likely to remain unchanged during training.
You can think of this as an intelligent way of generating a pre-initialized network, as opposed to having a randomly initialized network (the default case when training a network in Keras).
Smaller learning rates are typically used in transfer learning than in normal network training, as we are essentially fine-tuning the network.
If large learning rates are used and the early layers in the network are not frozen, transfer learning may not provide any benefit.
Often, it is only the last layer or the last couple of layers that are trained in a transfer learning problem.
Transfer learning works best for fairly general problems for which pre-trained networks are freely available online (such as image analysis), and when the user has a dataset too small to train a deep network from scratch, which is a fairly common situation.
To summarize the main idea: earlier layers of a network learn low-level features, which can be adapted to new domains by changing weights at later and fully-connected layers.
An example of this would be to take a sophisticated network pre-trained on ImageNet and retrain it on a few thousand hot dog images, and you end up with a very good hot dog classifier.
The steps involved in transfer learning are as follows (see the sketch after this list):
1. Get the existing network weights.
2. Unfreeze the "head" fully connected layers and train them on your new images.
3. Unfreeze the latest convolutional layers and train at a very low learning rate, starting from the previously trained weights. This will adjust the late convolutional weights without triggering the large gradient updates that would have occurred had we not done step 2.
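A sketch of this recipe in Keras using VGG16 as the pre-trained base (the layer counts, head sizes, and learning rates are illustrative assumptions, not prescriptions from the article):

from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense
from keras.optimizers import Adam

# Step 1: load the pretrained convolutional base without the original classification head.
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False  # freeze all convolutional layers

# Step 2: add a new fully connected head and train it on the new dataset.
x = Flatten()(base.output)
x = Dense(256, activation='relu')(x)
outputs = Dense(2, activation='softmax')(x)  # e.g. hot dog / not hot dog
model = Model(base.input, outputs)
model.compile(optimizer=Adam(lr=1e-3), loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(new_images, new_labels, epochs=5)

# Step 3: unfreeze the last few convolutional layers and fine-tune at a very low learning rate.
for layer in base.layers[-4:]:
    layer.trainable = True
model.compile(optimizer=Adam(lr=1e-5), loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(new_images, new_labels, epochs=5)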
(Left) Typical imported network to be utilized for transfer learning, (right) newly tuned network with the first four convolutional blocks frozen.
For more information, there are other Medium articles I recommend, such as "How HBO's Silicon Valley built 'Not Hotdog' with mobile TensorFlow, Keras & React Native" (medium.com).

Final Comments

Congratulations on making it to the end of this article! This was a long article that touched on multiple facets of deep learning.
The reader should now be fairly well equipped to venture into deep convolutional learning and computer vision literature.
I encourage the reader to do more individual research on the topics that I have discussed here so that they can deepen their knowledge.
I have added links to some further reading in the next section, as well as some of the references to research articles that I borrowed images from during this article.
Thanks for reading and happy deep learning!

Further Reading

MobileNetV2 (https://arxiv.org/abs/1801.04381)
Inception-ResNet, v1 and v2 (https://arxiv.org/abs/1602.07261)
Wide ResNet (https://arxiv.org/abs/1605.07146)
Xception (https://arxiv.org/abs/1610.02357)
ResNeXt (https://arxiv.org/pdf/1611.05431)
ShuffleNet, v1 and v2 (https://arxiv.org/abs/1707.01083)
Squeeze-and-Excitation Nets (https://arxiv.org/abs/1709.01507)
Original DenseNet paper (https://arxiv.org/pdf/1608.06993v3.pdf)
DenseNet Semantic Segmentation (https://arxiv.org/pdf/1611.09326v2.pdf)
DenseNet for Optical Flow (https://arxiv.org/pdf/1707.06316v1.pdf)

References

Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
Karen Simonyan and Andrew Zisserman, "Very deep convolutional networks for large-scale image recognition," 2014.
Min Lin, Qiang Chen, and Shuicheng Yan, "Network in network," 2013.
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.
Florian Schroff, Dmitry Kalenichenko, and James Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823, 2015.
Jonathan Long, Evan Shelhamer, and Trevor Darrell, "Fully convolutional networks for semantic segmentation," 2014. http://arxiv.org/abs/1411.4038v1
Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille, "Semantic image segmentation with deep convolutional nets and fully connected CRFs," ICLR, 2014. http://arxiv.org/abs/1412.7062
Fisher Yu and Vladlen Koltun, "Multi-scale context aggregation by dilated convolutions," ICLR, 2016.
Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, et al., "WaveNet: A generative model for raw audio," 2016. http://arxiv.org/abs/1609.03499
Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aäron van den Oord, Alex Graves, and Koray Kavukcuoglu, "Neural machine translation in linear time," 2016. http://arxiv.org/abs/1610.10099