Ensemble Methods for Deep Learning Neural Networks to Reduce Variance and Improve Performance

Importantly, the models must be good in different ways; they must make different prediction errors.

The reason that model averaging works is that different models will usually not make all the same errors on the test set.
— Page 256, Deep Learning, 2016.

Combining the predictions from multiple neural networks adds a bias that in turn counters the variance of a single trained neural network model. The results are predictions that are less sensitive to the specifics of the training data, the choice of training scheme, and the serendipity of a single training run.

In addition to reducing the variance in the prediction, the ensemble can also result in better predictions than any single best model.

… the performance of a committee can be better than the performance of the best single network used in isolation.
— Page 365, Neural Networks for Pattern Recognition, 1995.

This approach belongs to a general class of methods called "ensemble learning" that describes methods that attempt to make the best use of the predictions from multiple models prepared for the same problem.

Generally, ensemble learning involves training more than one network on the same dataset, then using each of the trained models to make a prediction before combining the predictions in some way to make a final outcome or prediction.

In fact, ensembling of models is a standard approach in applied machine learning to ensure that the most stable and best possible prediction is made.

For example, Alex Krizhevsky, et al. in their famous 2012 paper titled "Imagenet classification with deep convolutional neural networks," which introduced very deep convolutional neural networks for photo classification (i.e., AlexNet), used model averaging across multiple well-performing CNN models to achieve state-of-the-art results at the time. The performance of one model was compared to ensemble predictions averaged over two, five, and seven different models.

Averaging the predictions of five similar CNNs gives an error rate of 16.4%. […] Averaging the predictions of two CNNs that were pre-trained […] with the aforementioned five CNNs gives an error rate of 15.3%.

Ensembling is also the approach used by winners in machine learning competitions.

Another powerful technique for obtaining the best possible results on a task is model ensembling. […] If you look at machine-learning competitions, in particular on Kaggle, you'll see that the winners use very large ensembles of models that inevitably beat any single model, no matter how good.
— Page 264, Deep Learning With Python, 2017.

Perhaps the oldest and still most commonly used ensembling approach for neural networks is called a "committee of networks." A collection of networks with the same configuration and different initial random weights is trained on the same dataset. Each model is then used to make a prediction, and the actual prediction is calculated as the average of the predictions.
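As a rough illustration of this committee approach, the sketch below trains several identically configured Keras networks that differ only in their random initial weights and averages their predicted probabilities. The make_classification toy problem, the small two-layer network, and the choice of five members are assumptions made for the example, not details from the text.

```python
# Sketch of a committee of networks: same configuration, different random
# initial weights, predictions combined by averaging. Dataset and network
# are illustrative placeholders.
from numpy import mean
from sklearn.datasets import make_classification
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

# toy binary classification problem standing in for a real dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
trainX, testX = X[:800], X[800:]
trainy, testy = y[:800], y[800:]

def fit_model(trainX, trainy):
    # identical configuration each time; only the random initialization differs
    model = Sequential()
    model.add(Input(shape=(trainX.shape[1],)))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam')
    model.fit(trainX, trainy, epochs=50, verbose=0)
    return model

# train a small committee
n_members = 5
members = [fit_model(trainX, trainy) for _ in range(n_members)]

# average the predicted probabilities across the committee, then threshold
yhats = [model.predict(testX, verbose=0) for model in members]
avg_yhat = mean(yhats, axis=0)
final_classes = (avg_yhat > 0.5).astype('int32')
```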
The number of models in the ensemble is often kept small, both because of the computational expense of training models and because of the diminishing returns in performance from adding more ensemble members. Ensembles may be as small as three, five, or 10 trained models.

The field of ensemble learning is well studied and there are many variations on this simple theme. It can be helpful to think of varying each of the three major elements of the ensemble method: the data used to train each model, the models themselves, and the way their predictions are combined. Let's take a closer look at each element in turn.

The data used to train each member of the ensemble can be varied. The simplest approach would be to use k-fold cross-validation to estimate the generalization error of the chosen model configuration. In this procedure, k different models are trained on k different subsets of the training data. These k models can then be saved and used as members of an ensemble, as in the sketch below.
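A minimal sketch of this idea follows, assuming scikit-learn's KFold for the data splits and the same kind of toy problem and small Keras network as above; the fold count and network configuration are illustrative assumptions.

```python
# Sketch of a k-fold cross-validation ensemble: the k models trained while
# estimating generalization error are kept as ensemble members.
# Dataset and network configuration are illustrative placeholders.
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

def fit_model(trainX, trainy):
    model = Sequential()
    model.add(Input(shape=(trainX.shape[1],)))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(trainX, trainy, epochs=50, verbose=0)
    return model

members, scores = [], []
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
for train_ix, test_ix in kfold.split(X):
    # fit on the training folds, score on the held-out fold
    model = fit_model(X[train_ix], y[train_ix])
    _, acc = model.evaluate(X[test_ix], y[test_ix], verbose=0)
    scores.append(acc)
    # keep the trained model as an ensemble member
    members.append(model)

print('Estimated accuracy: %.3f' % mean(scores))
print('Ensemble members collected: %d' % len(members))
```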

Another popular approach involves resampling the training dataset with replacement, then training a network using the resampled dataset. The resampling procedure means that the composition of each training dataset is different, with the possibility of duplicated examples, allowing the model trained on the dataset to have a slightly different expectation of the density of the samples, and in turn a different generalization error.

This approach is called bootstrap aggregation, or bagging for short, and was designed for use with unpruned decision trees that have high variance and low bias. Typically a large number of decision trees are used, such as hundreds or thousands, given that they are fast to prepare.

… a natural way to reduce the variance and hence increase the prediction accuracy of a statistical learning method is to take many training sets from the population, build a separate prediction model using each training set, and average the resulting predictions.
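The sketch below applies the same idea to a neural network: each member is fit on a bootstrap resample of the training set, drawn with replacement via scikit-learn's resample, and the members' predicted probabilities are averaged. The dataset, network, and number of members are again illustrative assumptions.

```python
# Sketch of bootstrap aggregation (bagging) for a neural network: each member
# is trained on a resample of the training data drawn with replacement, and
# the predicted probabilities are averaged. Placeholders as in the sketches above.
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.utils import resample
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
trainX, testX = X[:800], X[800:]
trainy, testy = y[:800], y[800:]

def fit_model(trainX, trainy):
    model = Sequential()
    model.add(Input(shape=(trainX.shape[1],)))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam')
    model.fit(trainX, trainy, epochs=50, verbose=0)
    return model

n_members = 10
members = []
for i in range(n_members):
    # bootstrap sample: same size as the training set, drawn with replacement
    sampleX, sampley = resample(trainX, trainy, replace=True,
                                n_samples=len(trainX), random_state=i)
    members.append(fit_model(sampleX, sampley))

# soft vote: average the members' predicted probabilities, then threshold
yhat = mean([m.predict(testX, verbose=0) for m in members], axis=0)
final_classes = (yhat > 0.5).astype('int32')
```

Averaging predicted probabilities is a "soft" vote; taking a majority vote over the members' predicted classes is a common alternative way to combine the predictions.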