# 10 New Things I Learnt from fast.ai Course V3

I only got to know this recently.

This is the foundation of deep learning.

If you have stacks of affine functions (or matrix multiplications) and nonlinear functions, the thing you end up with can approximate any function arbitrarily closely.

It is the reason behind the race for different combinations of affine functions and nonlinearities.

It’s the reason why architectures are getting deeper.
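To make the "stack of affine functions and nonlinearities" idea concrete, here is a minimal sketch of my own (not code from the course). The helper names, layer sizes and the choice of ReLU as the nonlinearity are all assumptions for illustration.

```python
import random

def affine(x, weights, bias):
    """Compute weights @ x + bias for a single input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    """Elementwise nonlinearity (one common choice, assumed here)."""
    return [max(0.0, v) for v in x]

def random_layer(n_in, n_out, rng):
    """Hypothetical helper: a randomly initialised affine layer."""
    weights = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias

rng = random.Random(0)
w1, b1 = random_layer(2, 8, rng)   # 2 inputs -> 8 hidden units
w2, b2 = random_layer(8, 1, rng)   # 8 hidden units -> 1 output

def tiny_net(x):
    # affine -> nonlinearity -> affine: the basic building pattern
    return affine(relu(affine(x, w1, b1)), w2, b2)

output = tiny_net([0.5, -1.0])
print(output)
```

Adding more hidden units or more affine/nonlinearity pairs gives the stack more capacity, which is roughly what "deeper architectures" means here.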

In this section, I will highlight the architectures that were in the limelight during the course, and certain designs incorporated into state-of-the-art (SOTA) models like dropout.

Fig. 3.1: Loss landscapes; the left landscape has many bumps, the right is a smooth landscape. Source: https://arxiv.org/abs/1712.09913

Loss functions usually have bumpy and flat areas (if you visualise them in 2D or 3D diagrams).

Have a look at Fig. 3.2.

If you end up in a bumpy area, that solution will tend not to generalise very well.

This is because you found a solution that is good in one place, but not very good in other places.

But if you found a solution in a flat area, you probably will generalise well.

And that’s because you found a solution that is not only good at one spot, but around it as well.

Fig. 3.2: Loss landscape visualised in a 2D diagram. Screenshot from course.fast.ai.

Most of the above paragraph is quoted from Jeremy Howard.

Such a simple and beautiful explanation.

The new thing I learnt was that the RMSprop optimiser acts as an “accelerator”.

Intuition: if your gradient has been small for the past few steps, obviously you need to go a little bit faster now.
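Here is a small sketch of that intuition using the standard RMSprop update rule. The function names and hyperparameter values are my own picks for illustration. The step scale is the learning rate divided by the root of a running average of squared gradients, so a history of small gradients gives a larger scale.

```python
import math

def rmsprop_step(w, grad, sq_avg, lr=0.01, alpha=0.9, eps=1e-8):
    """One RMSprop update for a single scalar weight."""
    # Exponentially weighted moving average of the squared gradient.
    sq_avg = alpha * sq_avg + (1 - alpha) * grad ** 2
    # Small recent gradients -> small sq_avg -> large step scale.
    step_scale = lr / (math.sqrt(sq_avg) + eps)
    return w - step_scale * grad, sq_avg, step_scale

def scale_after(grad, n_steps=50):
    """Hypothetical helper: step scale after n_steps of a constant gradient."""
    w, sq_avg, scale = 0.0, 0.0, 0.0
    for _ in range(n_steps):
        w, sq_avg, scale = rmsprop_step(w, grad, sq_avg)
    return scale

scale_small_grads = scale_after(0.01)
scale_large_grads = scale_after(10.0)
print(scale_small_grads, scale_large_grads)
```

The weight whose gradients have been small ends up with a much larger multiplier than the one whose gradients have been large, which is the "accelerator" effect.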

(For an overview of gradient descent optimisers, I have written a post titled 10 Gradient Descent Optimisation Algorithms.)

Learnt 2 new loss functions:

This section looks into a combination of tweaks for:

### Transfer learning

Model weights can either be (i) randomly initialised, or (ii) transferred from a pre-trained model in a process called transfer learning.

Transfer learning makes use of pre-trained weights.

Pre-trained weights have useful information.

The usual model fitting for transfer learning works like this: train the weights that are closer to the output and freeze the other layers.

For transfer learning, it is important to use the same ‘stats’ that the pre-trained model was trained with, e.g. correcting the image RGB values with a certain bias.
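As a rough illustration, here is what applying such stats to one pixel could look like. I am assuming the pre-trained model was trained on ImageNet, whose commonly used per-channel mean and standard deviation are shown below. The function name is hypothetical.

```python
# Commonly used ImageNet channel statistics (for inputs scaled to [0, 1]).
# Assumption: the pre-trained model was trained on ImageNet with these stats.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalise_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Scale a 0-255 RGB pixel to [0, 1], then standardise each channel."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, mean, std))

# A pixel close to the ImageNet mean colour maps to values near zero.
pixel = normalise_pixel((124, 116, 104))
print(pixel)
```

If you fed the model images normalised with different stats, the early layers would see inputs on a scale they were never trained on.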

### ❤️ 1cycle policy ❤️

This is truly the best thing I learnt in this course.

I am guilty of taking learning rates for granted all this while.

Finding a good learning rate is important, because we can at the very least provide our gradient descent with an educated guess of a learning rate, rather than some gut feeling value that might just be suboptimal.

Jeremy Howard keeps using lr_find() and fit_one_cycle() in his code and it bothers me that it works well but I don’t know why it works.

So I read the paper by Leslie Smith and Sylvain Gugger’s blog post (recommended readings!), and this is how 1cycle works:

1. Perform an LR range test: train the model with (linearly) increasing learning rates from a small number (10e-8) to a high number (1 or 10). Plot a loss vs. learning rate graph like below.

2. Choose the minimum and maximum learning rates. To choose the maximum learning rate, look at the graph and pick a learning rate that is high enough and gives lower loss values (not too high, not too low). Here you’d pick 10e-2. The minimum can be about ten times lower. Here it’d be 10e-3.

3. Fit the model for the chosen number of cycles of the cyclical learning rate. One cycle is when your training runs through the learning rates from the chosen minimum learning rate to the chosen maximum, then back to the minimum.
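The shape of one cycle can be sketched as a simple function of the training step. This is a simplified linear version that I wrote for illustration, not fastai's implementation (its `fit_one_cycle` may shape the curve differently), and the minimum and maximum values below are just placeholders.

```python
def one_cycle_lr(step, total_steps, lr_min, lr_max):
    """Linear ramp from lr_min up to lr_max, then back down to lr_min."""
    half = total_steps / 2
    if step <= half:
        frac = step / half                   # warm-up: 0 -> 1
    else:
        frac = (total_steps - step) / half   # cool-down: 1 -> 0
    return lr_min + frac * (lr_max - lr_min)

total_steps = 100
schedule = [one_cycle_lr(s, total_steps, lr_min=1e-3, lr_max=1e-2)
            for s in range(total_steps + 1)]

# Learning rate at the start, the midpoint and the end of the cycle.
print(schedule[0], schedule[50], schedule[100])
```

In a real training loop you would set the optimiser's learning rate to `schedule[step]` before each batch.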

So why do we do it this way? The whole idea is the following.

In a loss landscape, we want to jump over the bumps (because we don’t want to get stuck at some trench).

So increasing the learning rate at the start helps the model to jump out of that trench, explore the function surface and try to find areas where the loss is low and the region is not bumpy (because if it’s bumpy, it gets kicked out again).

This enables us to train the model more quickly.

We also tend to end up with much more generalisable solutions.

### Discriminative learning rates for pre-trained models

Train earlier layer(s) with a super low learning rate, and train later layers with a higher learning rate.

The idea is to not drastically alter the almost-perfect pre-trained weights except for minuscule amounts, and to be more aggressive with teaching the layers near the outputs.

Discriminative learning rate was introduced in ULMFiT.
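One way to picture this is a helper that hands out one learning rate per layer group, smallest for the earliest layers. The helper below is hypothetical, and spacing the rates geometrically is my assumption rather than something stated in the course notes above.

```python
def discriminative_lrs(lr_low, lr_high, n_groups):
    """Spread learning rates geometrically across layer groups.

    lr_low goes to the earliest layers, lr_high to the layers near the output.
    """
    if n_groups == 1:
        return [lr_high]
    ratio = (lr_high / lr_low) ** (1 / (n_groups - 1))
    return [lr_low * ratio ** i for i in range(n_groups)]

# Three layer groups: early, middle and head (illustrative values).
lrs = discriminative_lrs(1e-5, 1e-3, n_groups=3)
print(lrs)
```

Each optimiser parameter group would then get its own entry from `lrs`, so the pre-trained early layers barely move while the head learns quickly.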

### A magic number divisor

In the 1cycle fitting, to get the minimum learning rate, divide the maximum by 2.6⁴. This number works for NLP tasks. See https://course.fast.
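As a quick worked example of that arithmetic (the maximum learning rate here is just an illustrative value I picked):

```python
lr_max = 1e-2            # illustrative maximum learning rate
divisor = 2.6 ** 4       # the "magic number" divisor
lr_min = lr_max / divisor

print(divisor, lr_min)
```

So the divisor is a little under 46, and the minimum learning rate ends up roughly 46 times smaller than the maximum.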

### Random forest for hyperparameter search

It was mentioned that random forest can be used to search for hyperparameters.

### Using default values

When using a library or implementing a paper’s code, use the default hyperparameter values and “don’t be a hero”.

### Model fine-tuning for pre-trained models

I notice Jeremy’s style: after training the last layers, unfreeze all layers and train all weights.

However, this step is experimental because it may or may not improve accuracy.

If it doesn’t, I hope you have saved your last trained weights.

### Progressive resizing

This is most applicable to image-related tasks.

Start training using smaller versions of the images.

Then, train using larger versions.

To do this, use transfer learning to port the trained weights to a model with the same architecture that accepts a different input size.

Genius.

### Mixed precision training

A simplified version of what this does: use half precision (float16) for the forward pass and gradient computation, but keep single precision (float32) for the weight update.

Use the magic number 0.1 for weight decay.

If you use too much weight decay, your model won’t train well enough (underfitting).

If too little, you’ll tend to overfit but that’s okay because you can stop the training early.

Note that not all tasks covered in the course are mentioned here.

### a) Multi-label classification

I’ve always wondered how you can carry out an [image] classification task whose number of labels can vary, i.e. multi-label classification (not to be confused with multi-class classification/multinomial classification whose sibling is binary classification).

It was not mentioned how the loss function works for multi-label classification in detail.

But after googling, I found out that the labels should be a multi-hot encoded vector.

This means that a sigmoid function must be applied to each element of the final model output.

The loss function, which is a function of the output and ground truth, is calculated using binary cross entropy to penalise each element independently.
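Here is a minimal sketch of that setup in plain Python. The label names, logits and function names are made up for illustration. A real framework would provide a numerically safer version of the same loss.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def multi_label_bce(logits, targets):
    """Mean binary cross entropy, penalising each label independently."""
    losses = []
    for z, y in zip(logits, targets):
        p = sigmoid(z)  # per-label probability, independent of other labels
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)

labels = ["cat", "dog", "person"]  # hypothetical label set
target = [1, 0, 1]                 # multi-hot: image contains a cat and a person

good_logits = [3.0, -3.0, 2.0]     # agrees with the target
bad_logits = [-3.0, 3.0, -2.0]     # disagrees on every label

good_loss = multi_label_bce(good_logits, target)
bad_loss = multi_label_bce(bad_logits, target)
print(good_loss, bad_loss)
```

Unlike softmax, the sigmoid outputs do not have to sum to 1, which is what lets the number of active labels vary per image.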

### b) Language Modelling

For this language modelling task, I like how ‘language model’ is defined (rephrased):

A language model is a model that learns to predict the next word of a sentence.

In order to do so, you need to know quite a lot about English and world knowledge.

This means you need to train the model with a lot of data.

This is the part where the course introduces ULMFiT, a model that can be reused based on pre-training (transfer learning, in other words).

### c) Tabular Data

This is my first encounter of using deep learning for tabular data with categorical variables! I didn’t know you could do that. Anyway, what we can do is create embeddings from categorical variables.

A little googling got me to a post by Rachel Thomas, An Introduction to Deep Learning for Tabular Data, on the use of such embeddings.
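To show what "an embedding for a categorical variable" means mechanically, here is a tiny sketch with a made-up day-of-week column. The column, the embedding size and the helper name are my own assumptions. In a real model the embedding values would be learnt during training rather than left random.

```python
import random

rng = random.Random(0)

# Hypothetical categorical column: day of week.
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
emb_dim = 4  # illustrative embedding size

# Each category gets an integer index and one row in the embedding matrix.
day_to_index = {day: i for i, day in enumerate(days)}
embedding_matrix = [[rng.gauss(0, 0.1) for _ in range(emb_dim)] for _ in days]

def embed(day):
    """Look up the embedding vector for a category."""
    return embedding_matrix[day_to_index[day]]

vec = embed("Wed")
print(vec)
```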

So then, the question is how do you combine (a) the vector of continuous variables and (b) the embeddings from categorical variables? The course didn’t mention anything about this but this StackOverflow post highlights 3 possible ways:

### d) Collaborative Filtering

Collaborative filtering is when you’re tasked to predict how much a user is going to like a certain item (in this example, let’s say we’re using movie ratings).

The course introduced the use of embeddings to solve this.

This is my first encounter of collaborative filtering using deep learning (as if I had much experience with collaborative filtering in the first place)!

The goal is to create an embedding of size n for each user and item.

To do that, we initialise each embedding vector randomly.

Then, for every user rating for a movie, we compare it with the dot product of their respective embeddings, using MSE, for example.

Then we perform gradient descent optimisation.
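The recipe above can be sketched end to end in plain Python. The toy ratings, embedding size, learning rate and helper names are all my own assumptions. This is only meant to show the dot product, the MSE and the gradient descent loop, not the course's actual code.

```python
import random

rng = random.Random(0)
n_users, n_movies, n_factors = 4, 5, 3

# Hypothetical toy data: (user index, movie index, rating).
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
           (2, 3, 2.0), (2, 4, 5.0), (3, 1, 4.0), (3, 4, 3.0)]

def random_embeddings(n_rows):
    """Randomly initialise one embedding vector per row."""
    return [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_rows)]

user_emb = random_embeddings(n_users)
movie_emb = random_embeddings(n_movies)

def predict(u, m):
    """Predicted rating: dot product of the user and movie embeddings."""
    return sum(a * b for a, b in zip(user_emb[u], movie_emb[m]))

def mse():
    return sum((predict(u, m) - r) ** 2 for u, m, r in ratings) / len(ratings)

initial_loss = mse()
lr = 0.05
for _ in range(500):
    # Accumulate full-batch gradients of the MSE for both embedding tables.
    grad_u = [[0.0] * n_factors for _ in range(n_users)]
    grad_m = [[0.0] * n_factors for _ in range(n_movies)]
    for u, m, r in ratings:
        err = 2 * (predict(u, m) - r) / len(ratings)
        for k in range(n_factors):
            grad_u[u][k] += err * movie_emb[m][k]
            grad_m[m][k] += err * user_emb[u][k]
    # Gradient descent step on every embedding value.
    for table, grads in ((user_emb, grad_u), (movie_emb, grad_m)):
        for row, grad_row in zip(table, grads):
            for k in range(n_factors):
                row[k] -= lr * grad_row[k]

final_loss = mse()
print(initial_loss, final_loss)
```

After training, `predict(u, m)` can be called for a user and movie pair that has no rating yet, which is the whole point of the exercise.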

### e) Image Generation

Here are some things I learnt:

In one of the lessons, Jeremy Howard showed an activation heat-map of an image for an image classification task.

This heat map displays the pixels that were ‘activated’.

This kind of visualisation will help us understand what features or parts of an image resulted in the outputs of the model.

I transcribed this part of the course (Lesson 5) because the intuition is just so compelling ❤️.

Here Jeremy first rounds up people who think that increasing model complexity is not the way to go, then reshapes their perspective, then brings them to L2 regularisation.

Oh and I was from Statistics so he caught me off guard there.

And so if any of you are unlucky enough to have been brainwashed by a background in statistics or psychology or econometrics or any of these kinds of courses, you’re gonna have to unlearn the idea that you need less parameters because what you instead need to realise is you were fed this lie that you need less parameters because it’s a convenient fiction for the real truth which is you don’t want your function to be too complex.

And having less parameters is one way of making it less complex.

But what if you had a thousand parameters and 999 of those parameters were 1e-9.

Or what if there was 0? If there’s 0 then they’re not really there.

Or if they’re 1e-9, they’re hardly there.

So why can’t I have lots of parameters if lots of them are really small? And the answer is you can.

So this thing, [where] counting the number of parameters is how we limit complexity, is actually extremely limiting.

It’s a fiction that really has a lot of problems, right? And so, if in your head complexity is scored by how many parameters you have, you’re doing it all wrong.

Score it properly.

So why do we care? Why would I want to use more parameters? Because more parameters means more nonlinearities, more interactions, more curvy bits, right? And real life (of loss landscape) is full of curvy bits.

Real life does not look like this [underfitted line].

But we don’t want them to be more curvy than necessary, or more interacting than necessary.

So therefore let’s use lots of parameters and then penalise complexity.

Okay so one way to penalise complexity is, as I kind of suggested before: Let’s sum up the value of your parameters.

Now that doesn’t quite work because some parameters are positive and some are negative, right? So what if we sum up the square of the parameters.

And that’s actually a really good idea.

Let’s actually create a model and in the loss function we’re gonna add the sum of the square of the parameters.

Now here’s a problem with that though.

Maybe that number is way too big and it’s so big that the best loss is to set all of the parameters to 0.

Now that would be no good.

So actually we wanna make sure that doesn’t happen.

So therefore let’s not just add the sum of the squares of the parameters to the model but let’s multiply that by some number that we choose.

And that number that we choose in fastai is called wd.
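In code, the idea from this transcript could look something like the sketch below. The function name and the parameter values are mine, chosen only to illustrate adding `wd` times the sum of squared parameters to a base loss.

```python
def l2_penalised_loss(base_loss, params, wd):
    """Base loss plus wd times the sum of the squared parameters."""
    return base_loss + wd * sum(p ** 2 for p in params)

# Illustrative parameters: one of them is tiny and contributes almost nothing.
params = [0.5, -2.0, 1e-9]
loss = l2_penalised_loss(base_loss=1.0, params=params, wd=0.1)
print(loss)
```

A parameter of 1e-9 adds practically nothing to the penalty, which matches the point that lots of tiny parameters are "hardly there", while the -2.0 dominates it.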

You might also like to check out my article Intuitions on L1 and L2 Regularisation, where I explain these two regularisation techniques using gradient descent.

I really love this course.

Here are some reasons why:

Looking forward to the next part of the course!

Bio: Raimi Bin Karim is an AI Engineer at AI Singapore.

Original. Reposted with permission.
