# A journey into Convolutional Neural Network visualization

In reality, the network learns to recognize the weather, not the enemy tanks.

The source code can be found here.

This article is also available as an interactive Jupyter notebook.

## Nosce te ipsum

With this article, we are going to see different techniques to understand what is going on inside a Convolutional Neural Network, so as to avoid making the same mistake the US Army made.

We are going to use PyTorch.

All the code can be found here.

Most of the visualizations were developed from scratch; however, some inspiration and parts were taken from here.

We will first introduce each technique by briefly explaining it and by showing some examples and comparisons between different classic computer vision models: alexnet, vgg16 and resnet.

Then we will try to better understand a model used in robotics to predict the readings of the local distance sensors using only the frontal camera's images.

Our goal is not to explain in detail how each technique works, since this is already done extremely well by each paper, but to use them to help the reader visualize different models with different inputs, and to better understand and highlight how different models react to a given input.

Later on, we show a workflow in which we use some of the techniques you will learn in this journey to test the robustness of a model. This is extremely useful to understand and fix its limitations.

The curious reader can further improve their understanding by looking at the source code of each visualization and by reading the references.

## Preamble

Let’s start our journey by selecting a network.

Our first model will be the old school alexnet.

It is already available in the torchvision.models package from PyTorch.

    AlexNet(
      (features): Sequential(
        (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
        (1): ReLU(inplace)
        (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
        (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
        (4): ReLU(inplace)
        (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
        (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (7): ReLU(inplace)
        (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (9): ReLU(inplace)
        (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
        (11): ReLU(inplace)
        (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
      )
      (classifier): Sequential(
        (0): Dropout(p=0.5)
        (1): Linear(in_features=9216, out_features=4096, bias=True)
        (2): ReLU(inplace)
        (3): Dropout(p=0.5)
        (4): Linear(in_features=4096, out_features=4096, bias=True)
        (5): ReLU(inplace)
        (6): Linear(in_features=4096, out_features=1000, bias=True)
      )
    )

## Now we need some inputs

Now we need some input images.

We are going to use three pictures, a cat, the beautiful Basilica di San Pietro and an image with a dog and a cat.

In utils there are several utility functions to create the plots.

Since all of our models were trained on imagenet, a huge dataset with 1000 different classes, we need to parse and normalize the input images accordingly.
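
As a rough illustration of the normalization step, the sketch below applies the channel statistics commonly used with ImageNet-trained torchvision models. The `normalize` helper and the random stand-in image are assumptions made for this example, not the article's own preprocessing code.

```python
import torch

# Channel statistics commonly used for ImageNet-trained models (RGB order).
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)


def normalize(image: torch.Tensor) -> torch.Tensor:
    """Normalize a (3, H, W) image with values in [0, 1] and add a batch dimension."""
    return ((image - IMAGENET_MEAN) / IMAGENET_STD).unsqueeze(0)


# A random tensor stands in for a real 224x224 RGB picture.
fake_image = torch.rand(3, 224, 224)
batch = normalize(fake_image)
print(batch.shape)
```

With a real picture, the tensor would first be resized and converted to the `[0, 1]` range before this step.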

In PyTorch, we have to manually send the data to a device.

In this case, the device is the first GPU if you have one; otherwise the CPU is selected.

Be aware that Jupyter does not garbage collect for us, so we will need to manually free the GPU memory.
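
A minimal sketch of both steps, device selection and manual cleanup, could look like the following. The helper name `free_gpu_memory` is hypothetical and is not taken from the article's code.

```python
import gc

import torch

# Use the first GPU when one is available, otherwise fall back to the CPU.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')


def free_gpu_memory() -> None:
    """Hypothetical helper: collect unreachable objects, then release cached GPU memory."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


x = torch.rand(1, 3, 224, 224).to(device)  # data must be moved explicitly
del x                                      # drop our reference first
free_gpu_memory()
print(device)
```

Note that the cache can only be released for tensors that are no longer referenced, hence the `del` before the call.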

We also define a utility function to clean the GPU cache.

As we said, imagenet is a huge dataset with 1000 classes, each represented by an integer that is not very human-interpretable.

We can associate each class id with its label by loading imaganet2human.txt and creating a Python dictionary.
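
The mapping can be sketched as below. The file layout is an assumption (one `<id> <label>` pair per line), and an in-memory string stands in for the real file, whose format may differ.

```python
import io

# Assumed layout: "<id> <label>" on each line. The real file may be formatted differently.
fake_file = io.StringIO(
    "0 tench Tinca tinca\n"
    "1 goldfish Carassius auratus\n"
)


def load_id2label(lines):
    """Build a {class_id: label} dictionary from an iterable of text lines."""
    id2label = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        class_id, label = line.split(" ", 1)
        id2label[int(class_id)] = label
    return id2label


imagenet2human = load_id2label(fake_file)
print(list(imagenet2human.items())[:2])
```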

    [(0, 'tench Tinca tinca'), (1, 'goldfish Carassius auratus')]

## Weights Visualization

The first straightforward visualization is to just plot the weights of a target layer.

Obviously, the deeper we go the smaller each image becomes, while the number of channels increases.

We are going to show each channel as a greyscale image.

Unfortunately, PyTorch modules can be nested arbitrarily deep, so to make our code as general as possible we first need to trace each sub-module that the input traverses and then store each layer in order. This way we can select a target layer without following the nested structure of the model.

In other words, we are flattening the model's layers; this is implemented in the module2traced function.
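
One way to flatten a nested model is to register a forward hook on every leaf module and record the order in which the hooks fire. The sketch below illustrates that idea under the name `trace_layers`; it is not the article's `module2traced` implementation, and the small nested model is made up for the example.

```python
import torch
import torch.nn as nn


def trace_layers(model: nn.Module, inputs: torch.Tensor) -> list:
    """Return the leaf modules of `model` in the order the input traverses them."""
    traced, handles = [], []

    def record(module, _inputs, _outputs):
        traced.append(module)

    for module in model.modules():
        if len(list(module.children())) == 0:  # leaf module: no sub-modules
            handles.append(module.register_forward_hook(record))

    with torch.no_grad():
        model(inputs)  # one forward pass fires the hooks in execution order

    for handle in handles:
        handle.remove()  # leave the model untouched afterwards
    return traced


# A deliberately nested toy model.
model = nn.Sequential(
    nn.Sequential(nn.Conv2d(3, 4, kernel_size=3), nn.ReLU()),
    nn.Sequential(nn.Flatten(), nn.Linear(4 * 6 * 6, 2)),
)
layers = trace_layers(model, torch.rand(1, 3, 8, 8))
print([type(layer).__name__ for layer in layers])
```

Tracing with a real forward pass, rather than walking `model.children()`, records the layers in the order they are actually executed.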

    [Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2)),
     ReLU(inplace),
     MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False),
     Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2)),
     ReLU(inplace),
     MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False),
     Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),
     ReLU(inplace),
     Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),
     ReLU(inplace),
     Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)),
     ReLU(inplace),
     MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False),
     Dropout(p=0.5),
     Linear(in_features=9216, out_features=4096, bias=True),
     ReLU(inplace),
     Dropout(p=0.5),
     Linear(in_features=4096, out_features=4096, bias=True),
     ReLU(inplace),
     Linear(in_features=4096, out_features=1000, bias=True)]

Let’s plot the first layer’s weights.

We also print the shape of the weight to give the reader a correct idea of the dimensional reduction.

    torch.Size([1, 55, 55])

Let’s stop for a minute to explain what those images represent.

We traced the input through the computational graph in order to find all the layers of our model, in this case alexnet.

Then we instantiate the Weights class implemented in visualization.core and we call it by passing the current input (the cat image) and a target layer.

As outputs, we get all the current layer's weights as grey images.

Then, we plot 16 of them.

We can notice that they, in some way, make sense; for example, some pixels are brighter at the edges of the images.

Let’s plot the first MaxPool layer to better see this effect: dimensional reduction and brighter pixels in some interesting areas.
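
The dimensional reduction can be checked by hand. The sketch below applies the standard output-size formula for convolution and pooling layers (dilation 1, no ceil mode) to the AlexNet `features` stack printed earlier, assuming a 224x224 input; that input size is an assumption, although it is consistent with the sizes printed in this section. These sizes match the spatial size of each layer's output, which suggests the plotted images are the layer's responses to the input rather than raw kernels.

```python
def output_size(n: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a conv/pool layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding), read from the AlexNet printout above.
layers = [
    ("conv0", 11, 4, 2), ("pool2", 3, 2, 0),
    ("conv3", 5, 1, 2), ("pool5", 3, 2, 0),
    ("conv6", 3, 1, 1), ("conv8", 3, 1, 1), ("conv10", 3, 1, 1),
    ("pool12", 3, 2, 0),
]

size = 224  # assumed input resolution
sizes = {}
for name, kernel, stride, padding in layers:
    size = output_size(size, kernel, stride, padding)
    sizes[name] = size

print(sizes)
print(256 * sizes["pool12"] ** 2)  # flattened size fed to the first Linear layer
```

The final product matches the `in_features=9216` of the first linear layer in the classifier.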

If you are wondering what the max-pooling operation is doing, check this awesome repo.

    torch.Size([1, 27, 27])

Let’s try with another input, the San Pietro Basilica.

    torch.Size([1, 27, 27])

By looking at them, these images somehow make sense; they highlight the basilica layout, but it is hard to understand what the model is actually doing.

We got the idea that it is computing something correctly, but we could ask some questions, for example: is it looking at the cupola? Which are the most important features of the Basilica?

Moreover, the deeper we go the harder it becomes to even recognize the input.

    torch.Size([1, 13, 13])

In this case, we have no idea of what is going on.

It can be argued that weights visualization does not carry any useful information about the model. Even if this is almost true, there is one good reason for plotting the weights, especially at the first layer.

When a model is poorly trained or not trained at all, the first weights have lots of noise, since they are just randomly initialized, and they are a lot more similar to the input images than the trained ones.

This feature can be useful to understand on the fly whether a model is trained or not.

However, except for this, weights visualization is not the way to go to understand what your black box is thinking.

Below we plot the first layer’s weights, first for the untrained version of alexnet and then for the trained one.

    torch.Size([1, 55, 55])
    torch.Size([1, 55, 55])

You can notice that in the first image it is simpler to see the input image.

However, this is not a general rule, but in some cases it can help.

### Similarities with other models

We have seen alexnet's weights, but are they similar across models? Below we plot the first 4 channels of each first layer's weights for alexnet, vgg and resnet.

The resnet and vgg weights look more similar to the input images than alexnet's.

But, again, what does it mean? Remember that at least resnet is initialized in a different way than the other two models.

## Saliency visualization

One idea, proposed by Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, is to backpropagate the output of the network with respect to a target class all the way to the input and plot the computed gradient.

This will highlight the part of the image responsible for that class.
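
A minimal sketch of this idea follows. The tiny classifier and the random input are stand-ins chosen so the example runs without downloading weights, and `saliency_map` is a hypothetical name, not the article's class. Taking the maximum absolute gradient over the colour channels follows the cited paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-in classifier; any differentiable model is handled the same way.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 5),
).eval()


def saliency_map(model, image, target_class=None):
    """Gradient of the target class score with respect to the input pixels."""
    image = image.detach().clone().requires_grad_(True)
    logits = model(image)
    if target_class is None:
        target_class = int(logits.argmax(dim=1))
    # One-hot vector selecting the class whose score is back-propagated.
    one_hot = torch.zeros_like(logits)
    one_hot[0, target_class] = 1.0
    model.zero_grad()
    logits.backward(gradient=one_hot)
    # Collapse the colour channels: maximum absolute gradient per pixel.
    return image.grad.abs().max(dim=1)[0].squeeze(0), target_class


image = torch.rand(1, 3, 16, 16)
saliency, predicted = saliency_map(model, image)
print(saliency.shape, predicted)
```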

Let’s first print the prediction of the network (this could change if you re-run the cell).

    predicted class tiger cat

Each visualization is implemented in its own class.

You can find the code here.

It will backpropagate the output with respect to the one-hot encoding of the number corresponding to the class tiger cat.

We can see that alexnet gets excited by the cat.

We can do even better! We can set each negative relu gradient to 0 during backprop.

This technique is called guided backpropagation.
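
One possible way to zero the negative ReLU gradients in PyTorch is a backward hook on every ReLU module, as sketched below. The toy model, the function names and the choice of hooks are assumptions for illustration; the article's own implementation may differ.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model with non in-place ReLUs, so backward hooks can be attached safely.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 5),
).eval()


def guided_relu_hook(module, grad_input, grad_output):
    # ReLU's backward already zeroes positions whose forward input was negative;
    # clamping additionally discards the negative gradients.
    return (torch.clamp(grad_input[0], min=0.0),)


def guided_backprop(model, image, target_class):
    handles = [
        m.register_full_backward_hook(guided_relu_hook)
        for m in model.modules() if isinstance(m, nn.ReLU)
    ]
    image = image.detach().clone().requires_grad_(True)
    logits = model(image)
    one_hot = torch.zeros_like(logits)
    one_hot[0, target_class] = 1.0
    model.zero_grad()
    logits.backward(gradient=one_hot)
    for handle in handles:
        handle.remove()  # restore the normal backward behaviour
    return image.grad.squeeze(0)


guided = guided_backprop(model, torch.rand(1, 3, 16, 16), target_class=2)
print(guided.shape)
```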

Now we can clearly see that the network is looking at the eyes and the nose of the cat.

We can try to compare different models.

Alexnet seems more interested in the eyes, while VGG looks at the ears and resnet is similar to alexnet.

Now we can clearly understand which parts of the input help the network give that prediction.

While guiding yields a more human-interpretable image, the vanilla implementation can be used for localizing an object of interest.

In other words, we can find objects of interest for free by cropping out of the input image the region corresponding to the gradient.

Let’s plot each input image for each model.

The Basilica is very interesting: all four networks correctly classify it as a dome, but only resnet152 is more interested in the sky than in the cupola.

In the last column, we have an image with two classes, dog and cat.

All the networks highlighted both, like the eyes of the dog and the ears of the cat in vgg16.

What if we would like to discover only the regions of the input that are related to a specific class? With this technique it is impossible.

## Class Activation Mapping

Class Activation Mapping is a technique presented in Learning Deep Features for Discriminative Localization.

The idea is to use the output of the last convolutional layer and the weights of the neurons in the linear layer of the model responsible for a target class; the map is generated by taking the dot product of those.

However, to make this work the model has to satisfy some constraints.

First of all, the output of the convolution must first go through a global average pooling layer, and the feature maps must directly precede the softmax layer.
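
To make the dot product concrete, here is a small sketch on a toy network that satisfies the constraint (convolution, global average pooling, then a linear layer). `TinyCamNet` and `class_activation_map` are hypothetical names used only for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class TinyCamNet(nn.Module):
    """Toy network with the layout CAM needs: conv -> global average pooling -> linear."""

    def __init__(self, channels=8, num_classes=5):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.ReLU())
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, x):
        maps = self.features(x)         # (B, C, H, W)
        pooled = maps.mean(dim=(2, 3))  # global average pooling -> (B, C)
        return self.fc(pooled), maps


def class_activation_map(model, image, target_class):
    with torch.no_grad():
        logits, maps = model(image)
        weights = model.fc.weight[target_class]  # (C,) weights of the target class
        # Weighted sum of the feature maps over the channel dimension.
        cam = torch.einsum('c,chw->hw', weights, maps[0])
    return cam, logits


model = TinyCamNet().eval()
target = 3
cam, logits = class_activation_map(model, torch.rand(1, 3, 16, 16), target)
print(cam.shape)
```

Because pooling and the linear layer are both linear, the spatial mean of the map plus the class bias equals the class score. This is why the map can be read as a per-location contribution to that score.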

To make it work with other architectures, such as alexnet and vgg, we have to change some layers in the model and retrain it.

This is a major drawback that will be solved in the next section.

For now, we can use it for free with resnet, since its architecture is perfect.

The implementation can be found here.

We can pass to the visualization a target_class parameter to get the relative weights from the fc layer.

Notice that by changing the target class, we can see a different part of the image highlighted.

The first image uses the predicted class, the second another type of cat and the last one bookcase, just to see what the model will do with a wrong class.

It makes sense; the only thing is that in the last row we still have some parts of the cat highlighted for bookcase.

Let’s plot the CAM on the cat image for different resnet architectures.

For resnet > 34 the Bottleneck module is used.

    Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).

They are all very similar, as expected.

One big drawback of this technique is that it forces you to use a network with a specific architecture: global pooling before the decoder part.

The next technique generalizes this approach by taking advantage of the gradient at one specific layer.

Remember that with the class activation mapping we are using the weights of the feature map as a scaling factor for the channels of the last layer.

The feature maps must be before a softmax layer and right after the average pooling.

The next technique proposes a more general approach.

The idea is actually simple: we backprop the output with respect to a target class while storing the gradient and the output at a given layer, in our case the last convolution.

Then we perform a global average of the saved gradient, keeping the channel dimension, in order to get a 1-d tensor; this will represent the importance of each channel in the target convolutional layer.

We then multiply each element of the convolutional layer outputs by the averaged gradients to create the grad cam.

This whole procedure is fast and it is architecture independent.
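
The steps above can be sketched with a forward hook that stores the layer output and a tensor hook that stores its gradient. The toy model and the `grad_cam` function are assumptions for illustration. The final channel sum, ReLU and upsampling follow the Grad-CAM paper rather than anything stated in the text above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy model without global average pooling, to show architecture independence.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),  # model[3] is the last convolution
    nn.Flatten(),
    nn.Linear(16 * 8 * 8, 5),
).eval()


def grad_cam(model, layer, image, target_class):
    stored = {}

    def forward_hook(module, inputs, output):
        stored['activations'] = output
        # Tensor hook: captures the gradient flowing back into this output.
        output.register_hook(lambda grad: stored.__setitem__('gradients', grad))

    handle = layer.register_forward_hook(forward_hook)
    logits = model(image)
    handle.remove()

    one_hot = torch.zeros_like(logits)
    one_hot[0, target_class] = 1.0
    model.zero_grad()
    logits.backward(gradient=one_hot)

    # Global average of the gradient per channel: the channel importance.
    weights = stored['gradients'].mean(dim=(2, 3), keepdim=True)  # (1, C, 1, 1)
    cam = F.relu((weights * stored['activations']).sum(dim=1))    # (1, H, W)
    # Upsample the coarse map to the input resolution for overlaying.
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[-2:],
                        mode='bilinear', align_corners=False)
    return cam.squeeze().detach()


cam = grad_cam(model, model[3], torch.rand(1, 3, 16, 16), target_class=1)
print(cam.shape)
```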

Interestingly, the authors show that it is a generalization of the previous technique.

The code is here.

We can use it to highlight what different models are looking at.

It is really interesting to see how alexnet looks at the nose, while vgg looks at the ears and resnet at the whole cat.

It is also interesting to see that the two resnet versions look at different parts of the cat.

Below we plot the same input for resnet34, but we change the target class in each column to show the reader how the grad cam changes accordingly.

Notice how similar to the CAM output they are.

To better compare our three models, below we plot the grad cam for each input with respect to each model.

The reader can immediately notice the difference across the models.

### Interesting region

We talked before about interesting region localization.

Grad-cam can also be used to extract the class object out of the image.

Once we have the grad-cam image, we can easily use it as a mask to crop out of the input image what we want.

The reader can play with the TR parameter to see different effects.
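
A sketch of the masking step is shown below. It assumes that the TR parameter is a threshold applied to the normalized heatmap; that reading, the `threshold` argument and the `crop_with_mask` helper are guesses made for this example.

```python
import torch


def crop_with_mask(image, cam, threshold=0.5):
    """Keep only the pixels whose normalized heatmap value reaches `threshold`."""
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # scale to [0, 1]
    mask = (cam >= threshold).to(image.dtype)                 # (H, W) binary mask
    return image * mask, mask                                 # broadcast over channels


# Synthetic data: a 4x4 white image and a heatmap that is hot in the centre.
image = torch.ones(3, 4, 4)
cam = torch.zeros(4, 4)
cam[1:3, 1:3] = 1.0

cropped, mask = crop_with_mask(image, cam, threshold=0.5)
print(int(mask.sum()))
```

Raising the threshold keeps a smaller, more confident region; lowering it keeps more context around the object.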

Et voilà! We can also change the class again and crop the region of interest for that class.

## Different models

We have seen all these techniques used with classic classification models trained on imagenet.

What about using them on a different domain? I have ported this paper to PyTorch and retrained it.

The model learns from the frontal camera's images of a robot to predict the local distance sensors in order to avoid obstacles.

Let's see if, by using those techniques, we can better understand what is going on inside the model.

### Learning Long-range Perception using Self-Supervision from Short-Range Sensors and Odometry

The idea is to predict the future outputs of a short-range sensor (such as a proximity sensor) given the current outputs of a long-range sensor (such as a camera).

They trained a very simple CNN from the robot’s camera images to predict the proximity sensor values.

If you are interested in their work, you can read the full paper here.

I have made a PyTorch implementation and retrained the model from scratch.

Be aware that I did not fine-tune or try different sets of hyper-parameters, so my model is probably not performing as well as the authors’ one.

Let’s import it.

We now need some inputs to test the model; they are taken directly from the test set.

Then the authors normalize each image; this is done by calling pre_processing.

For some reason the input images are different on Mac and Ubuntu: they should not look like these, and if you run the notebook on a Mac the result is different.

This is probably due to the warning message.

We are going to use the SaliencyMap and the GradCam since those are the best.

We can clearly see that the model looks at the objects.

In the GradCam row, on the second picture, the plan is basically segmented by the heatmap.

There is one problem: if you look at the third picture, the white box in front of the camera is not clearly highlighted.

This is probably due to the white color of the floor that is very similar to the box's color.

Let's investigate this problem.

In the second row, the SaliencyMaps highlight all the objects, including the white box.

The reader can notice that the reflection in the first picture on the left seems to excite the network in that region.

We should also investigate this case but due to time limitations, we will leave it as an exercise for the curious reader.

For completeness, let’s also print the predicted sensor output.

The model tries to predict five frontal distance sensors given the camera image.

If you compare with the authors' pictures, my predictions are worse.

This is due to the fact that, to speed everything up, I did not use the whole training set and I did not perform any hyper-parameter optimisation.

All the code can be found here.

Let’s now investigate the first problem: objects with a color similar to the ground.

### Similar colors

To test if the model has a problem with obstacles of the same color as the ground, we created in Blender four different scenarios with an obstacle.

They are shown in the picture below.

There are four different light configurations and two different cube colors, one equal to the ground and the other different.

The first column represents a realistic situation, while the second has a really strong light from behind that generates a shadow in front of the camera.

The third column has a shadow on the left and the last one has a little shadow on the left.

This is a perfect scenario to use gradcam to see what the model is looking at in each image.

In the picture below we plotted the gradcam results.

The big black shadow in the second column definitely confuses the model.

In the first and last columns, the grad cam better highlights the corners of the red cube, especially in the first picture.

We can definitely say that this model has a hard time with objects of the same colour as the ground.

Thanks to this consideration, we could increase the number of examples with equal object/ground colours in the dataset, perform a better preprocessing, change the model structure, etc., and hopefully increase the robustness of the network.