# Intuitive Deep Learning Part 2: CNNs for Computer Vision

There is one more layer we will introduce, and then we’ll put all the layers together in one big architecture and discuss the intuition behind that!

Summary: A layer common in CNNs is the Conv layer, which is defined by the filter size, stride, depth and padding.

The Conv layer uses the same parameters and applies the same neuron(s) across different regions of the image, thereby reducing the number of parameters needed.

The next layer we will go through is called the pooling layer, which corresponds roughly to Steps 4 and 5 in the algorithm laid out at the start.

If you recall, we had four numbers in our basic algorithm after applying the conv layer, and we wanted to reduce them to one number.

We simply took the four input numbers and output the maximum as our output number.

This is an example of max-pooling, which as its name suggests, takes the maximum of the numbers it looks at.

More generally, a pooling layer has a filter size and a stride, similar to a convolution layer.

Let’s take the simple example of an input with depth 1 (i.e. it only has 1 depth slice).

If we apply a max-pool with filter size 2×2 and stride 2, so there is no overlapping region, we get:

Max-pooling with filter 2 and stride 2.

Note that a max-pool layer of filter 2 and stride 2 is commonly seen in many models today.

Image taken from CS231N notes: http://cs231n.github.io/convolutional-networks/

This max-pool seems very similar to a conv layer, except that there are no parameters (since it just takes the maximum of the four numbers it sees within the filter).
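To make this concrete, here is a minimal NumPy sketch of a 2×2 max-pool with stride 2 on a single depth slice. The 4×4 input values are made up for illustration:

```python
import numpy as np

# A hypothetical 4x4 input with depth 1 (values chosen for illustration).
x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)

def max_pool_2x2(x):
    """Max-pool with a 2x2 filter and stride 2 (non-overlapping regions)."""
    h, w = x.shape
    # Group the pixels into non-overlapping 2x2 blocks,
    # then take the maximum within each block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

print(max_pool_2x2(x))
# [[6. 8.]
#  [3. 4.]]
```

Each output number is simply the maximum of one 2×2 region of the input, and the layer has no parameters to learn.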

When we introduce depth, however, we see more differences between the pooling layer and the conv layer.

The pooling layer applies to each individual depth channel separately.

That is, the max-pooling operation does not take the maximum across the different depths; it only takes the maximum in a single depth channel.

This is unlike the conv layer, which combines inputs from all the depth channels.

This also means that the depth size of our output layer does not and cannot change, unlike the conv layer where the output depth might be different from input depth.

The purpose of the pooling layer, ultimately, is to reduce the spatial size (width and height) of the layers; it does not touch the depth at all.

This reduces the number of parameters (and thus computation) required in future layers after this pooling layer.

To give a quick example, let’s suppose that after our first conv layer we have an output of dimensions 256 * 256 * 64.

We now apply a max-pooling operation (with filter size 2×2 and stride 2) to this. What are the output dimensions after the max-pooling layer?

Answer: 128 * 128 * 64, since the max-pool operator halves the width and height while leaving the depth dimension unchanged.
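We can check this with a small NumPy sketch. Because pooling is applied to each depth channel separately, the depth dimension (64 here) passes through untouched:

```python
import numpy as np

# Hypothetical conv-layer output: 256 x 256 spatial, 64 depth channels.
x = np.random.rand(256, 256, 64)

def max_pool_2x2_per_channel(x):
    """Max-pool each depth channel independently (2x2 filter, stride 2)."""
    h, w, d = x.shape
    # Split width and height into non-overlapping 2x2 blocks;
    # the depth axis is left alone, so output depth == input depth.
    return x.reshape(h // 2, 2, w // 2, 2, d).max(axis=(1, 3))

print(max_pool_2x2_per_channel(x).shape)  # (128, 128, 64)
```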

Summary: Another common layer in CNNs is the max-pooling layer, defined by the filter size and stride, which reduces the spatial size by taking the maximum of the numbers within its filter.

The last layer that commonly appears in CNNs is one that we’ve seen before in earlier parts — and that is the Fully-Connected (FC) layer.

The FC layer is the same as our standard neural network — every neuron in the next layer takes as input every neuron in the previous layer’s output.

Hence, the name Fully Connected, since all neurons in the next layer are always connected to all the neurons in the previous layer.

To show a familiar diagram we’ve seen in Part 1a:

Image taken from CS231N Notes (http://cs231n.github.io/neural-networks-1/)

We usually use FC layers at the very end of our CNNs.

So when we reach this stage, we can flatten the neurons into a one-dimensional array of features.

If the output of the previous layer was 7 * 7 * 5, we can flatten them into a row of 7*7*5 = 245 features as our input layer in the above diagram.

Then, we apply the hidden layers as per usual.
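As a sketch of that flattening step, here is a minimal NumPy version (the 10-neuron hidden layer size is illustrative, not from any particular model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical output from the last conv/pool stage: 7 x 7 spatial, 5 depth channels.
features = rng.random((7, 7, 5))

# Flatten into a one-dimensional row of 7*7*5 = 245 features.
flat = features.reshape(-1)

# A fully-connected layer: every output neuron takes every input feature.
W = rng.random((10, 245))  # 10 hidden neurons (illustrative size)
b = rng.random(10)
hidden = np.maximum(0, W @ flat + b)  # ReLU activation, as in earlier parts

print(flat.shape, hidden.shape)  # (245,) (10,)
```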

Summary: We also typically use our traditional Fully-Connected layers at the end of our CNNs.

Now let’s put them all together.

One important benchmark commonly used amongst researchers in Computer Vision is a challenge called the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).

ImageNet refers to a huge database of images, and the challenge of ILSVRC is to accurately classify an input image into 1,000 separate object categories.

One of the models hailed as the turning point in using deep learning is AlexNet, which won the ILSVRC in 2012.

In a paper titled “The History Began from AlexNet: A Comprehensive Survey on Deep Learning Approaches”, I quote:

AlexNet achieved state-of-the-art recognition accuracy against all the traditional machine learning and computer vision approaches.

It was a significant breakthrough in the field of machine learning and computer vision for visual recognition and classification tasks and is the point in history where interest in deep learning increased rapidly.

AlexNet showed that amazing improvements in accuracy can be achieved when we go deep, i.e. stack more and more layers together like we’ve seen.

In fact, architectures after AlexNet decided to keep going deeper, with more than a hundred layers!

AlexNet’s architecture can be summarized somewhat as follows:

As you can see, AlexNet is simply made out of the building blocks of:

- Conv layers (with ReLU activations)
- Max-pool layers
- FC layers
- Softmax layers

These are all layers we’ve seen in one way or another thus far! We’ve already covered the building blocks for powerful Deep Learning models, and all we need to do is stack many of these layers together.
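To see how the spatial size shrinks as these blocks are stacked, we can trace the standard output-size formulas through the first few layers. The filter sizes and strides below follow the commonly quoted AlexNet configuration, but treat them as an illustrative sketch rather than an exact specification:

```python
def conv_out(size, filt, stride, pad):
    """Spatial output size of a conv layer: (W - F + 2P) / S + 1."""
    return (size - filt + 2 * pad) // stride + 1

def pool_out(size, filt, stride):
    """Spatial output size of a pooling layer: (W - F) / S + 1."""
    return (size - filt) // stride + 1

# Rough AlexNet-style walkthrough of the early layers.
size = 227                       # input image: 227 x 227 x 3
size = conv_out(size, 11, 4, 0)  # Conv 11x11, stride 4  -> 55
size = pool_out(size, 3, 2)      # Max-pool 3x3, stride 2 -> 27
size = conv_out(size, 5, 1, 2)   # Conv 5x5, pad 2        -> 27
size = pool_out(size, 3, 2)      # Max-pool 3x3, stride 2 -> 13

print(size)  # 13
```

Each conv or pool layer steadily shrinks the width and height until the features are small enough to flatten and feed into the FC layers at the end.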

Why does stacking so many layers together work, and what is each layer really doing?

We can visualize some of the intermediate layers.

This is a visualization of the first conv layer of AlexNet:

A visualization of the first conv layer in AlexNet. Image taken from CS231N notes: http://cs231n.github.io/understanding-cnn/

We can see that in the first few layers, the neural network is trying to extract out some low-level features.

These first few layers then combine in subsequent layers to form more and more complex features, and in the end, figure out what represents objects like cats, dogs etc.

Why did the neural network pick out those features in particular in the first layer? It just figured out that these are the best parameters for the first few layers; they simply produced the minimal loss.

Summary: AlexNet was a CNN which revolutionized the field of Deep Learning, and is built from conv layers, max-pooling layers and FC layers.

When many layers are put together, the earlier layers learn low-level features and combine them in later layers for more complex representations.

Consolidated Summary: Images are a 3-dimensional array of features: each pixel in the 2-D space contains three numbers from 0–255 (inclusive) corresponding to the Red, Green and Blue channels.

Often, image data contains a lot of input features.

A layer common in CNNs is the Conv layer, which is defined by the filter size, stride, depth and padding.

The Conv layer uses the same parameters and applies the same neuron(s) across different regions of the image, thereby reducing the number of parameters needed.

Another common layer in CNNs is the max-pooling layer, defined by the filter size and stride, which reduces the spatial size by taking the maximum of the numbers within its filter.

We also typically use our traditional Fully-Connected layers at the end of our CNNs.

AlexNet was a CNN which revolutionized the field of Deep Learning, and is built from conv layers, max-pooling layers and FC layers.

When many layers are put together, the earlier layers learn low-level features and combine them in later layers for more complex representations.

What’s Next: Deep Learning has not just transformed the way we think about image recognition, it has also revolutionized the way we process language.

But dealing with language comes with its own set of challenges.

How do we represent words as numbers? Furthermore, a sentence has varying length.

How would we use neural networks to approach sequences where the input might have varying lengths? If you’re curious, Intuitive Deep Learning Part 3 applies neural networks to natural language, tackling the problem of learning how to translate an English sentence to a French sentence.

This post originally appeared as the third post in the introductory series of Intuitive Deep Learning.

My mission is to explain deep learning concepts in a purely intuitive way! If you are a non-technical beginner, I want to provide you with the intuition behind the inner workings of Deep Learning and let you communicate with technical engineers using the same language and jargon, even if you don’t know the math or code behind it.

If you are a student of Deep Learning, I believe that gaining a solid foundation in intuition will help you make better sense of all the math and code in the courses you’re taking, providing you a less painful way to learn these concepts.

About the author: Hi there, I’m Joseph! I recently graduated from Stanford University, where I worked with Andrew Ng in the Stanford Machine Learning Group.

I want to make Deep Learning concepts as intuitive and as easily understandable as possible by everyone, which has motivated my publication: Intuitive Deep Learning.
