Neural Networks: Tricks of the Trade Review

Deep learning neural networks are challenging to configure and train.

There are decades of tips and tricks spread across hundreds of research papers, source code, and in the heads of academics and practitioners.

The book “Neural Networks: Tricks of the Trade” originally published in 1998 and updated in 2012 at the cusp of the deep learning renaissance ties together the disparate tips and tricks into a single volume.

It includes advice that is required reading for all deep learning neural network practitioners.

In this post, you will discover the book “Neural Networks: Tricks of the Trade” that provides advice by neural network academics and practitioners on how to get the most out of your models.

After reading this post, you will know:Let’s get started.

Neural Networks – Tricks of the TradeNeural Networks: Tricks of the Trade is a collection of papers on techniques to get better performance from neural network models.

The first edition was published in 1998 comprised of five parts and 17 chapters.

The second edition was published right on the cusp of the new deep learning renaissance in 2012 and includes three more parts and 13 new chapters.

If you are a deep learning practitioner, then it is a must read book.

I own and reference both editions.

The motivation for the book was to collate the empirical and theoretically grounded tips, tricks, and best practices used to get the best performance from neural network models in practice.

The author’s concern is that many of the useful tips and tricks are tacit knowledge in the field, trapped in peoples heads, code bases, or at the end of conference papers and that beginners to the field should be aware of them.

It is our belief that researchers and practitioners acquire, through experience and word-of-mouth, techniques and heuristics that help them successfully apply neural networks to difficult real-world problems.

[…] they are usually hidden in people’s heads or in the back pages of space-constrained conference papers.

The book is an effort to try to group the tricks together, after the success of a workshop at the 1996 NIPS conference with the same name.

This book is an outgrowth of a 1996 NIPS workshop called Tricks of the Trade whose goal was to begin the process of gathering and documenting these tricks.

The interest that the workshop generated motivated us to expand our collection and compile it into this book.

— Page 1, Neural Networks: Tricks of the Trade, Second Edition, 2012.

The first edition of the book was put together (edited) by Genevieve Orr and Klaus-Robert Muller comprised of five parts and 17 chapters and was published 20 years ago in 1998.

Each part includes a useful preface that summarizes what to expect in the upcoming chapters, and each chapter written by one or more academics in the field.

The breakdown of this first edition was as follows:It is an expensive book, and if you can pick-up a cheap second-hand copy of this first edition, then I highly recommend it.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Download Your FREE Mini-CourseThe second edition of the book was released in 2012, seemingly right at the beginning of the large push that became “deep learning.

” As such, the book captures the new techniques at the time such as layer-wise pretraining and restricted Boltzmann machines.

It was too early to focus on the ReLU, ImageNet with CNNs, and use of large LSTMs.

Nevertheless, the second edition included three new parts and 13 new chapters.

The breakdown of the additions in the second edition are as follows:The whole book is a good read, although I don’t recommend reading all of it if you are looking for quick and useful tips that you can use immediately.

This is because many of the chapters focus on the writers’ pet projects, or on highly specialized methods.

Instead, I recommend reading four specific chapters, two from the first edition and two from the second.

The second edition of the book is worth purchasing for these four chapters alone, and I highly recommend picking up a copy for yourself, your team, or your office.

Fortunately, there are pre-print PDFs of these chapters available for free online.

The recommended chapters are:Let’s take a closer look at each of these chapters in turn.

This chapter focuses on providing very specific tips to get the most out of the stochastic gradient descent optimization algorithm and the backpropagation weight update algorithm.

Many undesirable behaviors of backprop can be avoided with tricks that are rarely exposed in serious technical publications.

This paper gives some of those tricks, and offers explanations of why they work.

— Page 9, Neural Networks: Tricks of the Trade, First Edition, 1998.

The chapter proceeds to provide a dense and theoretically supported list of tips for configuring the algorithm, preparing input data, and more.

The chapter is so dense that it is hard to summarize, although a good list of recommendations is provided in the “Discussion and Conclusion” section at the end, quoted from the book below:– shuffle the examples – center the input variables by subtracting the mean – normalize the input variable to a standard deviation of 1 – if possible, decorrelate the input variables.

– pick a network with the sigmoid function shown in figure 1.

4 – set the target values within the range of the sigmoid, typically +1 and -1.

– initialize the weights to random values as prescribed by 1.


The preferred method for training the network should be picked as follows: – if the training set is large (more than a few hundred samples) and redundant, and if the task is classification, use stochastic gradient with careful tuning, or use the stochastic diagonal Levenberg Marquardt method.

– if the training set is not too large, or if the task is regression, use conjugate gradient.

— Pages 47-48, Neural Networks: Tricks of the Trade, First Edition, 1998.

The field of applied neural networks has come a long way in the twenty years since this was published (e.


the comments on sigmoid activation functions are no longer relevant), yet the basics have not changed.

This chapter is required reading for all deep learning practitioners.

This chapter describes the simple yet powerful regularization method called early stopping that will halt the training of a neural network when the performance of the model begins to degrade on a hold-out validation dataset.

Validation can be used to detect when overfitting starts during supervised training of a neural network; training is then stopped before convergence to avoid the overfitting (“early stopping”)— Page 55, Neural Networks: Tricks of the Trade, First Edition, 1998.

The challenge of early stopping is the choice and configuration of the trigger used to stop the training process, and the systematic configuration of early stopping is the focus of the chapter.

The general early stopping criteria are described as:Three recommendations are provided, e.


“the trick“:1.

Use fast stopping criteria unless small improvements of network performance (e.


4%) are worth large increases of training time (e.


factor 4).


To maximize the probability of finding a “good” solution (as opposed to maximizing the average quality of solutions), use a GL criterion.


To maximize the average quality of solutions, use a PQ criterion if the net- work overfits only very little or an UP criterion otherwise.

— Page 60, Neural Networks: Tricks of the Trade, First Edition, 1998.

The rules are analyzed empirically over a large number of training runs and test problems.

The crux of the finding is that being more patient with the early stopping criteria results in better hold-out performance at the cost of additional computational complexity.

I conclude slower stopping criteria allow for small improvements in generalization (here: about 4% on average), but cost much more training time (here: about factor 4 longer on average).

— Page 55, Neural Networks: Tricks of the Trade, First Edition, 1998.

This chapter focuses on a detailed review of the stochastic gradient descent optimization algorithm and tips to help get the most out of it.

This chapter provides background material, explains why SGD is a good learning algorithm when the training set is large, and provides useful recommendations.

— Page 421, Neural Networks: Tricks of the Trade, Second Edition, 2012.

There is a lot of overlap with Chapter 1: Efficient BackProp, and although the chapter calls out tips along the way with boxes, a useful list of tips is not summarized at the end of the chapter.

Nevertheless, it is a compulsory read for all neural network practitioners.

Below is my own summary of the tips called out in boxes throughout the chapter, mostly quoting directly from the second edition:Some of these tips are pithy without context; I recommend reading the chapter.

This chapter focuses on the effective training of neural networks and early deep learning models.

It ties together the classical advice from Chapters 1 and 29 but adds comments on (at the time) recent deep learning developments like greedy layer-wise pretraining, modern hardware like GPUs, modern efficient code libraries like BLAS, and advice from real projects tuning the training of models, like the order to train hyperparameters.

This chapter is meant as a practical guide with recommendations for some of the most commonly used hyper-parameters, in particular in the context of learning algorithms based on backpropagated gradient and gradient-based optimization.

— Page 437, Neural Networks: Tricks of the Trade, Second Edition, 2012.

It’s also long, divided into six main sections:There’s far too much for me to summarize; the chapter is dense with useful advice for configuring and tuning neural network models.

Without a doubt, this is required reading and provided the seeds for the recommendations later described in the 2016 book Deep Learning, of which Yoshua Bengio was one of three authors.

The chapter finishes on a strong, optimistic note.

The practice summarized here, coupled with the increase in available computing power, now allows researchers to train neural networks on a scale that is far beyond what was possible at the time of the first edition of this book, helping to move us closer to artificial intelligence.

— Page 473, Neural Networks: Tricks of the Trade, Second Edition, 2012.

In this post, you discovered the book “Neural Networks: Tricks of the Trade” that provides advice from neural network academics and practitioners on how to get the most out of your models.

Have you read some or all of this book?.What do you think of it?.Let me know in the comments below.

…with just a few lines of python codeDiscover how in my new Ebook: Better Deep LearningIt provides self-study tutorials on topics like: weight decay, batch normalization, dropout, model stacking and much more…Skip the Academics.

Just Results.

Click to learn more.


. More details

Leave a Reply