Could the tanh term be the reason behind this? Additionally, we noticed that after the ReLU activation, case H’s layera values were sparser than those of the other cases.

This means a healthy number of zeros are mixed into the internal representation.
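One simple way to quantify this kind of sparsity is the fraction of exactly-zero entries after the ReLU. The snippet below is a minimal sketch, not the code used in the experiments: the helper names `relu` and `zero_fraction` and the random input are assumptions made for illustration, and only `numpy.count_nonzero` (listed in the references) is a real library call.

```python
import numpy as np

def relu(x):
    # Element-wise ReLU: negative values become zero.
    return np.maximum(x, 0.0)

def zero_fraction(activations):
    # Fraction of entries that are exactly zero (hypothetical helper).
    total = activations.size
    nonzero = np.count_nonzero(activations)
    return (total - nonzero) / total

# Hypothetical pre-activation values standing in for one layer's output.
rng = np.random.default_rng(0)
pre_activation = rng.standard_normal((4, 8))
layera = relu(pre_activation)  # post-ReLU values, named after the article's "layera"
print("fraction of zeros:", zero_fraction(layera))
```

A higher value means a sparser representation; comparing this number across cases is one way to back up the observation above.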

I really wonder why case H gave the best test accuracy. The hand-wavy explanation, ‘because it was well regularized’, will do for now, but I know that there is more to it.

Conclusion / Code

In conclusion, the right balance between weight magnitude and sparsity seems to give good generalization performance (in a hand-wavy sense).

To access the code in Google Colab, please click here; to access the code on my GitHub, please click here.

Final Words

While doing these experiments, I couldn’t help but think that these types of regularization have one (not so critical) flaw.

They are all passive methods of regularization.

More specifically, every weight in the network is regularized, regardless of where it is located or what kind of features it tends to capture.

I certainly believe that this is an interesting area of study. For example, we could have a simple conditional statement that says “if the magnitude of the weight is greater than x, perform regularization”, or a more complex one: “if the mutual information between this weight and the class label is already high, perform regularization”.

I wonder how those methods will change the overall dynamics of training.
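The simpler of the two conditions, regularizing only weights whose magnitude exceeds x, can be sketched in NumPy as follows. This is an illustration under stated assumptions, not something that was run in these experiments: the function names, the choice of an L2 penalty, and the values of `threshold`, `lam`, and `lr` are all hypothetical.

```python
import numpy as np

def conditional_l2_grad(weights, threshold, lam):
    # Gradient of the penalty (lam / 2) * w**2, applied only where |w| > threshold.
    # Weights at or below the threshold receive no regularization gradient.
    mask = np.abs(weights) > threshold
    return lam * weights * mask

def sgd_step(weights, loss_grad, lr=0.1, threshold=1.0, lam=0.01):
    # One plain gradient-descent update with the conditional penalty added
    # to the gradient of the data loss (hypothetical helper).
    reg_grad = conditional_l2_grad(weights, threshold, lam)
    return weights - lr * (loss_grad + reg_grad)

w = np.array([0.5, -2.0, 3.0])
g = np.zeros_like(w)  # pretend the data loss has zero gradient here
print(sgd_step(w, g, lr=1.0, threshold=1.0, lam=0.1))
```

With a zero loss gradient, only the two weights whose magnitude exceeds the threshold shrink toward zero, while the small weight is left untouched. The mutual-information variant would need an estimator for that quantity and is not sketched here.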

Finally, I want to thank my supervisor, Dr. Bruce, for recommending the paper “Comparing Measures of Sparsity” to me.

For more articles please visit my website.

Appendix (GIF animation for each layer for each case)

Case Z: In order of gradientp, gradientw, layer, layera, moment, weight

Case A: In order of gradientp, gradientw, layer, layera, moment, weight

Case B: In order of gradientp, gradientw, layer, layera, moment, weight

Case C: In order of gradientp, gradientw, layer, layera, moment, weight

Case D: In order of gradientp, gradientw, layer, layera, moment, weight

Case E: In order of gradientp, gradientw, layer, layera, moment, weight

Case F: In order of gradientp, gradientw, layer, layera, moment, weight

Case G: In order of gradientp, gradientw, layer, layera, moment, weight

Case H: In order of gradientp, gradientw, layer, layera, moment, weight

Case I: In order of gradientp, gradientw, layer, layera, moment, weight

Case J: In order of gradientp, gradientw, layer, layera, moment, weight

Appendix (Derivatives)

Reference

Bruce, N. (2016). Neil D. B. Bruce. Neil D. B. Bruce. Retrieved 4 January 2019, from http://www.scs.ryerson.ca/~bruce/

Hurley, N., & Rickard, S. (2008). Comparing Measures of Sparsity. arXiv.org. Retrieved 4 January 2019, from https://arxiv.org/abs/0811.4706

Numerical & Scientific Computing with Python: Creating Subplots with Python and Matplotlib. (2019). Python-course.eu. Retrieved 4 January 2019, from https://www.python-course.eu/matplotlib_multiple_figures.php

scipy.stats.kurtosis - SciPy v1.2.0 Reference Guide. (2019). Docs.scipy.org. Retrieved 4 January 2019, from https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kurtosis.html

scipy.stats.skew - SciPy v0.13.0 Reference Guide. (2019). Docs.scipy.org. Retrieved 4 January 2019, from https://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.stats.skew.html

numpy.count_nonzero - NumPy v1.15 Manual. (2019). Docs.scipy.org. Retrieved 4 January 2019, from https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.count_nonzero.html

Efficiently count zero elements in numpy array?. (2017). Stack Overflow. Retrieved 4 January 2019, from https://stackoverflow.com/questions/42916330/efficiently-count-zero-elements-in-numpy-array

SyntaxError: unexpected EOF while parsing. (2013). Stack Overflow. Retrieved 4 January 2019, from https://stackoverflow.com/questions/16327405/syntaxerror-unexpected-eof-while-parsing

matplotlib.pyplot.legend - Matplotlib 3.0.2 documentation. (2019). Matplotlib.org. Retrieved 4 January 2019, from https://matplotlib.org/api/_as_gen/matplotlib.pyplot.legend.html

row vs column - Google Search. (2019). Google.com. Retrieved 4 January 2019, from https://www.google.com/search?q=row+vs+column&rlz=1C1CHBF_enCA771CA771&oq=row+vs+col&aqs=chrome.0.35i39j69i60j69i57j0l3.2289j1j7&sourceid=chrome&ie=UTF-8

Legend guide - Matplotlib 2.0.2 documentation. (2019). Matplotlib.org. Retrieved 4 January 2019, from https://matplotlib.org/users/legend_guide.html

[ Archived Post ] Random Notes for Derivative for Regularization Terms. (2019). Medium. Retrieved 4 January 2019, from https://medium.com/@SeoJaeDuk/archived-post-random-notes-for-derivative-for-regularization-terms-1859b1faada

Releasing memory of huge numpy array in IPython. (2013). Stack Overflow. Retrieved 5 January 2019, from https://stackoverflow.com/questions/16261240/releasing-memory-of-huge-numpy-array-in-ipython

Built-in magic commands - IPython 7.2.0 documentation. (2019). Ipython.readthedocs.io. Retrieved 5 January 2019, from https://ipython.readthedocs.io/en/stable/interactive/magics.html

Chapman, J. (2017). How to Make Gifs Using Python. Superfluous Sextant. Retrieved 5 January 2019, from http://superfluoussextant.com/making-gifs-with-python.html

What is the difference between Ridge Regression, the LASSO, and ElasticNet?. (2017). Eclectic Esoterica. Retrieved 5 January 2019, from https://blog.alexlenail.me/what-is-the-difference-between-ridge-regression-the-lasso-and-elasticnet-ec19c71c9028

(2019). Web.stanford.edu. Retrieved 5 January 2019, from https://web.stanford.edu/~hastie/Papers/B67.2%20%282005%29%20301-320%20Zou%20&%20Hastie.pdf

L1 and L2 Regularization Methods - Towards Data Science. (2017). Towards Data Science. Retrieved 5 January 2019, from https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c

STL-10 dataset. (2019). Cs.stanford.edu. Retrieved 5 January 2019, from https://cs.stanford.edu/~acoates/stl10/

How to change the font size on a matplotlib plot. (2010). Stack Overflow. Retrieved 5 January 2019, from https://stackoverflow.com/questions/3899980/how-to-change-the-font-size-on-a-matplotlib-plot

How can I plot multiple figure in the same line with matplotlib?. (2015). Stack Overflow. Retrieved 5 January 2019, from https://stackoverflow.com/questions/34291260/how-can-i-plot-multiple-figure-in-the-same-line-with-matplotlib

IPython Notebook keyboard shortcut search for text. (2016). Stack Overflow. Retrieved 5 January 2019, from https://stackoverflow.com/questions/35119831/ipython-notebook-keyboard-shortcut-search-for-text

Creating and exporting video clips - MoviePy 0.2.3.2 documentation. (2019). Zulko.github.io. Retrieved 5 January 2019, from https://zulko.github.io/moviepy/getting_started/videoclips.html

Making GIFs from Video Files with Python - __del__( self ). (2014). Zulko.github.io. Retrieved 5 January 2019, from http://zulko.github.io/blog/2014/01/23/making-animated-gifs-from-video-files-with-python/