We’ve gone through it.
Now let me guide you too.
Mistake #1: Storing the dynamic graph in inference mode

If you used TensorFlow back in the day, you are probably aware of the key difference between TF and PT: static versus dynamic graphs.
It was extremely hard to debug TensorFlow because the graph had to be rebuilt every time your model changed.
It took time, effort and your hope away too.
Of course, TensorFlow is better now.
Overall, to make debugging easier, ML frameworks use dynamic graphs, which in PyTorch are built from so-called Variables.
Every variable you use links to the previous one, building the relationships needed for back-propagation.
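To make those links visible, here is a minimal sketch, assuming a recent PyTorch where Variables have been merged into plain tensors. The tensors `x`, `y` and `z` are made up for illustration; `grad_fn` and `next_functions` are the attributes that expose the chain used for back-propagation.

```python
import torch

# A leaf tensor that asks autograd to track operations on it.
x = torch.ones(2, 2, requires_grad=True)

# Each operation produces a tensor that remembers how it was created.
y = x * 2
z = y.mean()

print(x.grad_fn)         # None: a leaf tensor was not produced by an operation
print(y.grad_fn.name())  # name of the node that produced y
print(z.grad_fn.name())  # name of the node that produced z

# The node behind z points back to the node behind y.
parent = z.grad_fn.next_functions[0][0]
print(parent.name())

# Back-propagation walks these links from z down to x.
z.backward()
print(x.grad)  # every element is 0.5, since z = mean(2 * x) over 4 elements
```

Every tracked intermediate result keeps such a link alive, which is what costs memory when you only want predictions.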
In most cases, you want to optimize all computations after the model has been trained.
If you look at the torch interface, there are a lot of options, especially for optimization.
A lot of confusion is caused by eval mode, detach and no_grad.
Let me clarify how they work.
After the model is trained and deployed, here are the things you care about: speed, speed and the CUDA Out of Memory exception.
To speed up a PyTorch model, first switch it into eval mode with model.eval().
This tells all layers, batchnorm and dropout in particular, to work in inference mode (simply put, it deactivates dropout).
Next, there is the detach method, which cuts a variable from its computational graph.
It is useful when you are building a model from scratch, but not so much when you want to reuse a state-of-the-art model.
A more global solution is to wrap the forward pass in the torch.no_grad context, which reduces memory consumption by not storing graph links in the results.
It saves memory and simplifies computations, so you get more speed and less memory used.
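The three tools can be compared side by side. This is a rough sketch, not a benchmark: the tiny `model` below is a hypothetical stand-in for whatever trained network you deploy, and the shapes are arbitrary.

```python
import torch
from torch import nn

# Hypothetical toy model standing in for a trained network.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.BatchNorm1d(32),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(32, 4),
)

# eval mode: dropout is disabled and batchnorm uses its running statistics.
model.eval()

batch = torch.randn(8, 16)

# no_grad: nothing inside this block is recorded in the graph.
with torch.no_grad():
    out = model(batch)

print(out.requires_grad)  # False
print(out.grad_fn)        # None

# Without no_grad the output still carries graph links, even in eval mode.
tracked = model(batch)
print(tracked.requires_grad)  # True

# detach cuts a single tensor from its graph after the fact.
detached = tracked.detach()
print(detached.requires_grad)  # False
```

Note that eval mode alone does not stop graph building; it only changes layer behaviour, so you usually want it together with no_grad.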
Bingo!

Mistake #2: Not enabling cudnn optimization algorithms

There are a lot of boolean flags you can set in PyTorch; the ones you must be aware of are stored in the torch.backends.cudnn namespace.
To enable cudnn optimization, set cudnn.benchmark = True.
To make sure cudnn is used at all, and so can look for optimal algorithms, keep it enabled by setting cudnn.enabled = True (it is on by default).
NVIDIA does a lot of optimization magic for you, and you can benefit from it.
Please be aware that your data must be on the GPU and the model input size should not vary.
The more variety in the shape of your data, the fewer optimizations can be done.
To normalize the shapes you can pre-process images, for instance by resizing them to one size.
Overall, be creative, but not too much.
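Put together, the setup for this section might look like the sketch below. The convolution layer and the input shape are assumptions made up for the example, and the code falls back to CPU when no GPU is present, in which case the cudnn flags are set but have no effect.

```python
import torch
from torch import nn

# Keep cudnn on and let it benchmark algorithms for the shapes it sees.
torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = True

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical convolutional layer standing in for a real model.
model = nn.Conv2d(3, 8, kernel_size=3, padding=1).to(device).eval()

# One fixed input shape, so the algorithm chosen on the first call can be reused.
fixed_shape = (4, 3, 64, 64)

with torch.no_grad():
    for _ in range(3):
        batch = torch.randn(fixed_shape, device=device)
        out = model(batch)

print(out.shape)
```

With benchmark mode on, a new input shape can trigger a new search for the best algorithm, which is why inputs of varying size eat into the gain.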
Mistake #3: Re-using JIT-compilation

PyTorch provides an easy way to optimize your models and reuse them from different languages (read Python-to-C++).
You might be more creative and inject your model into other languages if you are brave enough (I am not, CUDA: Out of memory is my motto).

JIT-compilation allows optimizing the computational graph if the input does not change in shape.
This means that if your data does not vary too much (see Mistake #2), JIT is the way to go.
To be honest, it did not make a huge difference compared to no_grad and cudnn mentioned above, but it might for you.
This is only the first version and it has huge potential.
Please be aware that it does not work if you have conditions in your model, which is a common case in RNNs.
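As an illustration, here is a minimal sketch using tracing via `torch.jit.trace`, which is one of the ways to JIT-compile a model. `TinyNet` and its sizes are hypothetical. Tracing records the operations executed for the example input, which is one reason input-dependent conditions are a problem for it.

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical model without any input-dependent branching."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(16, 4)

    def forward(self, x):
        return torch.relu(self.fc(x))


model = TinyNet().eval()

# An example input with the shape the model will see in production.
example = torch.randn(8, 16)

# Record the operations run on the example input into a compiled module.
traced = torch.jit.trace(model, example)

with torch.no_grad():
    batch = torch.randn(8, 16)
    same = torch.allclose(traced(batch), model(batch), atol=1e-6)

print(same)

# The traced module can be saved and later loaded from C++ with torch::jit::load.
# traced.save("tiny_net.pt")
```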
Full documentation can be found at pytorch.org/docs/stable/jit

Mistake #4: Trying to scale using CPU instances

GPUs are expensive, and so are VMs in the cloud.
Even on AWS, one instance will cost you around $100/day (the lowest price is $0.7/h). Reference: aws.
Another useful cheatsheet I use is www. info
Every person who graduated from 3rd grade could think: “Ok, what if I buy 5 CPU instances instead of 1 GPU?”
Everyone who has tried to run an NN model on a CPU knows this is a dead end.
Yes, you could optimize a model for CPU, but in the end it will still be slower than on a GPU.
I strongly recommend that you relax and forget about this idea, trust me.
Mistake #5: Processing vectors instead of matrices

cudnn – check
no_grad – check
GPU with correct version of CUDA – check
JIT-compilation – check

Everything is ready, what else can be done?

Now it’s time to use a bit of math.
As you may remember, most NNs are trained using so-called Tensor(s).
A Tensor is an N-dimensional array or, mathematically speaking, a multi-linear geometric vector.
What you could do is group inputs (if you have the luxury to) into a tensor or matrix and feed that into your model.
For instance, send an array of images to PyTorch as a single matrix.
In the best case, the performance gain is proportional to the number of objects passed simultaneously.
This is an obvious solution, but few people actually use it, as most of the time objects are processed one by one and it might be a bit hard to set up such a flow architecturally.
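A small sketch of the difference, assuming a made-up classifier and randomly generated "images". `torch.stack` turns a list of equally shaped tensors into one batch tensor, so the model runs once instead of sixteen times.

```python
import torch
from torch import nn

# Hypothetical classifier for 3x32x32 images.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).eval()

# Sixteen separate inputs, as they might arrive one by one.
images = [torch.randn(3, 32, 32) for _ in range(16)]

with torch.no_grad():
    # One forward pass per image.
    one_by_one = [model(img.unsqueeze(0)) for img in images]

    # One forward pass for all images: shape (16, 3, 32, 32).
    batch = torch.stack(images)
    batched = model(batch)

# Both routes give the same predictions, up to floating point noise.
match = torch.allclose(torch.cat(one_by_one), batched, atol=1e-4)
print(batched.shape, match)
```

The actual speed-up depends on the model, the hardware and how much memory a batch takes, so it is worth measuring on your own workload.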
Do not worry, you’ll make it!

What’s next?

There are definitely more tips on how to optimize models in PyTorch.
I will continue posting about our experience using Facebook’s kid in the wild.
What about you, what are your tips for achieving better performance on inference?

Originally published at tarasmatsyk.com on February 18, 2019.