Applying Occam’s razor to Deep Learning

By Mehmet Suzen, Theoretical Physicist, Research Scientist.

Occam's razor, or the principle of parsimony, has long been the guiding principle in statistical model selection.

In comparing two models that provide similar predictions or descriptions of reality, we would favour the less complex one.

This boils down to the problem of how to measure the complexity of a statistical model and how to perform model selection.

What constitutes a model, as discussed by McCullagh (2002) in a statistical model context, is a different discussion, but here we assume that a machine learning algorithm counts as a statistical model.

Classically, the complexity of statistical models is usually measured with the Akaike information criterion (AIC) or a similar measure.

Using such a complexity measure, one would choose the less complex model to use in practice, all other things being fixed.
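As a toy illustration of this selection step (not from the original article; the data and model names here are hypothetical), the sketch below computes AIC = 2k − 2 ln L under a Gaussian noise model for a linear and a degree-5 polynomial fit to near-linear data. When the fits are comparable, the parameter penalty typically favours the simpler model:

```python
import numpy as np

def aic(n_params, log_likelihood):
    """Akaike information criterion: 2k - 2 ln L, lower is better."""
    return 2 * n_params - 2 * log_likelihood

def gaussian_log_likelihood(y, y_hat):
    """Maximised Gaussian log-likelihood with the MLE noise variance."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    sigma2 = rss / n
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

# Hypothetical near-linear data with small Gaussian noise.
rng = np.random.default_rng(42)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 0.1 * rng.standard_normal(50)

aics = {}
for degree in (1, 5):
    coeffs = np.polyfit(x, y, degree)
    y_hat = np.polyval(coeffs, x)
    k = degree + 1  # polynomial coefficients as free parameters
    aics[degree] = aic(k, gaussian_log_likelihood(y, y_hat))

# Occam's razor via AIC: with similar fits, the model with the
# lower AIC (usually the lower-degree one here) is preferred.
print(aics)
```

The degree-5 fit always attains at least as high a likelihood (it nests the linear model), so the comparison is driven entirely by the 2k penalty.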

Figure 1. An arbitrary architecture: each node represents a layer of a given deep neural network, such as a convolution or a set of units. Suzen et al. (2019).



The surge of interest in complex neural network architectures, i.e., deep learning, due to their unprecedented success in certain tasks, pushes the boundaries of “standard” statistical concepts such as overfitting/overtraining and regularisation.

Overfitting/overtraining is now often used as an umbrella term for any unwanted drop in the performance of a machine learning model, Roelofs et al. (2019), and nearly anything that improves generalization is called regularization, Martin and Mahoney (2019).

Deep learning practitioners have largely let go of model selection and omit the practice of Occam's razor.

With the advent of Neural Architecture Search and new complexity measures that take the structure of the network into account, it is becoming possible to practice Occam's razor in deep learning.

Here, we cover one very practical and simple measure called cPSE, i.e., cascading periodic spectral ergodicity.

This measure takes into account the depth of the neural network and computes the fluctuations of the weight structure over the entire network, Suzen et al. (2019), Figure 1.

It is shown that the measure correlates almost perfectly with generalization performance, see Figure 2.

The cPSE measure is implemented in the Bristol Python package, starting from version 0.6, and can be applied to a trained network provided as a PyTorch model object.

Figure 2. Evolution of PSE, periodic spectral ergodicity, showing that the complexity measure cPSE saturates after a certain depth. Suzen et al. (2019).

An example of usage takes only a couple of lines; example measurements for VGG and ResNet are given in Suzen et al. (2019). Using a less complex deep neural network that gives similar performance is not common practice in the deep learning community, owing to the cost of training and designing new architectures.

However, quantifying the complexity of similarly performing neural network architectures would bring the advantage of needing less computing power to train and deploy such less complex models in production.

Bringing Occam's razor back to modern connectionist machine learning is not only theoretically and philosophically satisfying; the practical advantages for the environment and for computing time are immense.


Reposted with permission.
