Transfer learning with a small data set: “nanos gigantum humeris insidentes”

Before we try to answer this question, let’s do another experiment.

This time we’ll retrain the entire network, no pre-trained weights.

[Figure: Train all layers, loss]

[Figure: Train all layers, accuracy]

And still, the loss plateaus at some point and the accuracy is stuck.

Ok so now let’s try to understand why.

Possible reasons I can think of:

(1) The data set is too small.

(2) The data set is not similar to ImageNet.

(3) The architecture is not suitable.

I don’t think that my data set is too small (1), since in this architecture there aren’t many more trainable parameters than in my simple CNN.

It could be that the data domain is not similar (2), at least as far as the upper layers are concerned; this needs to be tested.

Or maybe it’s the architecture (3).

It is similar to mine, but it stacks many more Conv layers, which could lead to different issues.

Let’s start by testing explanation (2).

I’ll freeze some of the lower layers, to keep the low-level features, and train the upper layers, to fit them to my data.

Freeze lower, train higher, replace top.

Let’s freeze the first six layers (first two stacks) and train the rest.

Note that the way to fine-tune is to first train only the top, and then unfreeze the needed layers and retrain.
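The freeze-then-train split can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the code used for the experiments: `freeze_first` is a hypothetical helper, the layer names are made up, and the `Layer`/`Model` stand-ins only mimic the ordered `layers` list and per-layer `trainable` flag that Keras models expose, so the snippet runs without downloading any weights.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Layer:
    """Stand-in for a Keras layer: only the name and the trainable flag."""
    name: str
    trainable: bool = True


@dataclass
class Model:
    """Stand-in for a Keras model: an ordered list of layers."""
    layers: List[Layer] = field(default_factory=list)


def freeze_first(model, n_frozen):
    """Freeze the first n_frozen layers and leave the rest trainable."""
    for index, layer in enumerate(model.layers):
        layer.trainable = index >= n_frozen
    return model


# Hypothetical layer names: two lower stacks, then the layers we want to train.
names = ["block1_conv1", "block1_conv2", "block1_pool",
         "block2_conv1", "block2_conv2", "block2_pool",
         "block3_conv1", "block3_conv2", "top_dense"]
model = freeze_first(Model([Layer(n) for n in names]), n_frozen=6)

frozen = [layer.name for layer in model.layers if not layer.trainable]
trainable = [layer.name for layer in model.layers if layer.trainable]
print("frozen:", frozen)
print("trainable:", trainable)
```

With a real Keras model the same loop over `model.layers` should apply; keep in mind that the model has to be compiled again after changing `trainable` flags for the change to take effect.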

[Figure: Freeze 6 first, train the rest, loss]

Well, that did not improve things.

I’ve tried it with other layers as well, but with no improvement; you can find all of my experiments here.

[Figure: Freeze 6 first, train the rest, accuracy]

OK, so it’s not the size (1) and it’s not the similarity (2), so it must be the architecture (3).

I still can’t explain why, but I can try to prove it.

Let’s try another one and compare the results.

ResNet50: Residual value

Residual networks were created to solve the problems that come with very deep networks, which might be what affects my VGG model.

We’ll go through the same process we used with the VGG: we’ll start by freezing all the layers and continue according to the results.

[Figure: Freeze all layers, loss]

Finally, some progress.

Training accuracy is ~0.97 and it looks like the model is finally learning something.

Validation accuracy is flat at 0.5.

[Figure: Freeze all layers, accuracy]

It seems that, since I’m training only the top, I’m probably heavily overfitting to my training set.

Let’s continue and unfreeze all layers as we did with the VGG and see if that helps.

I’ll reinitialize the model and this time I won’t freeze anything.

[Figure: Train all layers, accuracy]

What happened here?

At some point, after ~21 epochs, the validation loss dropped and the accuracy jumped as high as 0.91. That’s about the same result I got with my simple CNN and way better than the VGG.

Another important point is that although the network is completely retrained, the initial weights are the ‘ImageNet’ weights and not random ones.

This makes a difference and is probably what gave the model its boost.

The point I want to make here after all these experiments is that:

Picking the right architecture is a crucial part of utilizing transfer learning.

I’m now on par with my original CNN model, but where is the benefit of transfer learning?

Can I squeeze more juice from the ResNet or should I try another architecture?

Well, I did try freezing only parts of the model and even removing some of the upper layers, but I didn’t get far.

I think I’ll move on to the Inception architecture and see if things go smoother.

InceptionV3: Divide and conquer

Inception is another evolution of the CNN classifiers.

Here I’ll use V3, which is not the latest version but it’s still very evolved.

[Figure: Freeze all layers, accuracy]

I quickly ran two experiments, as I did with the previous two architectures: freeze all layers and train all layers. In both I got ~0.7 on the validation set, very similar to the ResNet results, but the trend still doesn’t look good.

[Figure: Train all layers, accuracy]

I want to try something new: Inception is a very big network, so let’s remove part of the layers and use only a section of the network. Specifically, I’ll remove the higher stacks of Conv layers and train the lower layers.

[Figure: Inception v1]

The motivation is clear by now: use the lower-level features of the network and fine-tune them to my problem.

I have to say it was a game of trial and error, but with some logic.

If you look at the Inception architecture in the figure, you’ll see that it is constructed from concatenated blocks.

So it makes sense to “dissect” the network at one of these points and insert the classifier.

It even seems that it was designed for it.

Realizing this makes the trial and error game a lot easier.

So what I’m going to do is find the best block from which I’ll split the network and try to fine tune from that point.

After some trials I found that splitting at “mixed_5” gives the best results. That’s the name of one of the concatenation layers; you can see the names in the output of the model.summary() method.
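The search for a split point can be sketched on plain layer names. This is only an illustration under stated assumptions: `cut_points` and `truncate_at` are hypothetical helpers, the layer listing is made up and heavily shortened, and the `new_top` names are placeholders for whatever classifier you attach.

```python
def cut_points(layer_names, prefix="mixed"):
    """Names of the concatenation layers, i.e. the places where a cut makes sense."""
    return [name for name in layer_names if name.startswith(prefix)]


def truncate_at(layer_names, cut_name, new_top=("global_pool", "dense_softmax")):
    """Keep everything up to and including cut_name, then attach a new top."""
    if cut_name not in layer_names:
        raise ValueError(f"no layer named {cut_name!r}")
    kept = layer_names[: layer_names.index(cut_name) + 1]
    return kept + list(new_top)


# Hypothetical, shortened listing; a real summary has hundreds of rows.
layer_names = ["conv_1", "conv_2", "mixed_3", "conv_3", "mixed_4",
               "conv_4", "mixed_5", "conv_5", "mixed_6", "old_top"]

candidates = cut_points(layer_names)
small_model = truncate_at(layer_names, "mixed_5")
print(candidates)
print(small_model)
```

On a real Keras model the equivalent cut is usually built from the layer's output, along the lines of `Model(inputs=base.input, outputs=base.get_layer(name).output)`. The exact layer names can differ between library versions, so check them in the summary first.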

After some more parameter optimization I got the validation accuracy to ~0.


That’s a great improvement. It seems that the Inception architecture fits my task best so far, or maybe it’s just the most evolved one I’ve used so far; I can’t rule that out yet.

Before I sum up let’s try another tweak: The best epoch of the last experiment was #11 out of 50.

From there the validation accuracy was slowly dropping, a sign of overfitting.
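Spotting the best epoch, and deciding when further training is no longer worth it, can be sketched as follows. The history values are made up for illustration (they are not my curve), and `best_epoch` and `stop_epoch` are hypothetical helpers that mirror the usual patience-based early stopping idea.

```python
def best_epoch(val_accuracy):
    """1-based epoch with the highest validation accuracy (first one on ties)."""
    return val_accuracy.index(max(val_accuracy)) + 1


def stop_epoch(val_accuracy, patience):
    """1-based epoch at which training would stop: the first epoch that is
    `patience` epochs past the best value seen so far without improving on it.
    Returns the last epoch if that never happens."""
    best, best_at = float("-inf"), 0
    for epoch, value in enumerate(val_accuracy, start=1):
        if value > best:
            best, best_at = value, epoch
        elif epoch - best_at >= patience:
            return epoch
    return len(val_accuracy)


# Made-up validation accuracy per epoch: rises, peaks, then slowly drops.
history = [0.50, 0.58, 0.66, 0.70, 0.69, 0.68, 0.67, 0.66]

print("best epoch:", best_epoch(history))
print("stop epoch with patience 3:", stop_epoch(history, patience=3))
```

Keras ships an `EarlyStopping` callback with a `patience` argument that does this during training, so in practice there is no need to run all 50 epochs to find out that the peak came early.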

Assuming we’ve exhausted the parameter tweaking, let’s try unfreezing some of the layers.

We’ve tried it with ResNet and it didn’t work so well; maybe the Inception architecture will be more tolerant.

I’ll unfreeze all the layers from “mixed_4” to “mixed_5”, meaning one concatenated stack.

This will enable the upper layers of the new model to train on the new data and hopefully increase accuracy.
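Unfreezing one stack by name can be sketched like this. It is a minimal sketch, assuming that "from mixed_4 to mixed_5" means every layer after “mixed_4” up to and including “mixed_5”; `unfreeze_between` is a hypothetical helper, the layer names are made up, and `SimpleNamespace` objects stand in for Keras layers with a `name` and a `trainable` flag.

```python
from types import SimpleNamespace


def unfreeze_between(layers, start_name, end_name):
    """Freeze every layer, then unfreeze those after start_name up to and
    including end_name."""
    names = [layer.name for layer in layers]
    start, end = names.index(start_name), names.index(end_name)
    for index, layer in enumerate(layers):
        layer.trainable = start < index <= end
    return layers


# Hypothetical, shortened layer listing of the truncated network.
layer_names = ["conv_a", "mixed_3", "conv_b", "mixed_4",
               "conv_c", "conv_d", "mixed_5"]
layers = [SimpleNamespace(name=n, trainable=True) for n in layer_names]

unfreeze_between(layers, "mixed_4", "mixed_5")
unfrozen = [layer.name for layer in layers if layer.trainable]
print(unfrozen)
```

Everything below “mixed_4” stays frozen and keeps its pre-trained low-level features, while the last stack and the new top are free to adapt to the new data.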

Freeze lower, unfreeze middle, remove higher, replace top.

Validation accuracy is the highest so far ~0.


We’ve managed to refine the parts that we need and get rid of the others.

Transfer learning has proven itself useful, but all the above experiments took a lot of time.

Let’s think about what can be improved in the next project.

What can be deduced?

Choose the right architecture for your needs: next time I would start with the latest and greatest, since it probably has the best chances.

Take only what you need: try to understand the architecture, at least at a high level. You don’t need to read every article, but make sure you understand its strengths and especially its weaknesses.

Optimize parameters as a final step: this saves a lot of time and iterations.

Document your experiments: that’s a great way to save time and keep track of your work. You can find all of mine with the code below.
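One lightweight way to document experiments is an append-only log with one JSON record per run. This is a sketch of the idea rather than the format I actually used: `log_experiment` and the field names are assumptions, the in-memory stream stands in for a file opened in append mode, and the three records are the results quoted earlier in this post.

```python
import io
import json


def log_experiment(stream, architecture, setup, val_accuracy, note=""):
    """Append one experiment as a JSON line and return the record."""
    record = {"architecture": architecture, "setup": setup,
              "val_accuracy": val_accuracy, "note": note}
    stream.write(json.dumps(record) + "\n")
    return record


log = io.StringIO()  # stands in for a file opened in append mode
log_experiment(log, "ResNet50", "freeze all layers", 0.5,
               "training accuracy ~0.97, overfits")
log_experiment(log, "ResNet50", "train all layers", 0.91,
               "jump after ~21 epochs")
log_experiment(log, "InceptionV3", "freeze all layers", 0.7)

# Reading the log back makes it easy to compare runs later.
records = [json.loads(line) for line in log.getvalue().splitlines()]
best = max(records, key=lambda r: r["val_accuracy"])
print(best["architecture"], best["setup"], best["val_accuracy"])
```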

Please share your thoughts. Code and documentation are here.

Follow me on Medium or Twitter for updates on my blog posts!
