How to Reduce the Variance of Deep Learning Models in Keras Using Model Averaging Ensembles

The model has a single hidden layer with 15 nodes and a rectified linear activation function, followed by an output layer with 3 nodes to predict the probability of each of the 3 classes and a softmax activation function.

Because the problem is multi-class, we will use the categorical cross-entropy loss function to optimize the model and the efficient Adam flavor of stochastic gradient descent.

The model is fit for 200 training epochs and we will evaluate the model each epoch on the test set, using the test set as a validation set. At the end of the run, we will evaluate the performance of the model on both the train and test sets. Finally, we will plot learning curves of the model accuracy over each training epoch on both the training and test datasets.

The complete example is listed below.
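The listing below is a minimal sketch of this complete example using the tensorflow.keras API. The make_blobs() arguments and the train/test split size are assumptions standing in for the dataset preparation described earlier in the tutorial; adjust them to match your own configuration.

```python
# MLP with high variance on the blobs problem (minimal sketch)
from sklearn.datasets import make_blobs
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from matplotlib import pyplot

# generate a 2D, 3-class classification dataset (parameters are assumptions)
X, y = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=2, random_state=2)
y = to_categorical(y)
# split into train and test sets (the 30/70 split is an assumption)
n_train = int(0.3 * X.shape[0])
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define the model: one hidden layer with 15 nodes, softmax output for 3 classes
model = Sequential()
model.add(Dense(15, input_dim=2, activation='relu'))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit for 200 epochs, using the test set as a validation set
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=200, verbose=0)
# evaluate the final model on the train and test sets
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot learning curves of accuracy over training epochs
pyplot.plot(history.history['accuracy'], label='train')
pyplot.plot(history.history['val_accuracy'], label='test')
pyplot.legend()
pyplot.show()
```

Note that the history keys 'accuracy' and 'val_accuracy' assume TensorFlow 2.x; older standalone Keras used 'acc' and 'val_acc'.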
Running the example first prints the performance of the final model on the train and test datasets. Your specific results will vary (by design!) given the high-variance nature of the model. In this case, we can see that the model achieved about 84% accuracy on the training dataset and about 76% accuracy on the test dataset; not terrible.

A line plot is also created showing the learning curves for the model accuracy on the train and test sets over each training epoch. We can see that the model is not really overfit, but is perhaps a little underfit and may benefit from an increase in capacity, more training, and perhaps some regularization. We intentionally hold back all of these improvements to force high variance for our case study.

Line Plot of Learning Curves of Model Accuracy on Train and Test Datasets Over Each Training Epoch

It is important to demonstrate that the model indeed has variance in its predictions. We can demonstrate this by repeating the fit and evaluation of the same model configuration on the same dataset and summarizing the final performance of the model.

To do this, we first split the fit and evaluation of the model out as a function that we can call repeatedly. The evaluate_model() function below takes the train and test datasets, fits a model, then evaluates it, returning the accuracy of the model on the test dataset.

We can call this function 30 times, saving the test accuracy scores. Once collected, we can summarize the distribution of scores, first in terms of the mean and standard deviation, assuming the distribution is Gaussian, which is very reasonable. We can then summarize the distribution both as a histogram to show the shape of the distribution and as a box and whisker plot to show the spread and body of the distribution.

The complete example of summarizing the variance of the MLP model on the chosen blobs dataset is listed below.
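A sketch of this repeated-evaluation experiment is given below, again with the same assumed dataset parameters as the previous listing.

```python
# repeated evaluation of the MLP to summarize the variance of its test accuracy (sketch)
from sklearn.datasets import make_blobs
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from numpy import mean, std
from matplotlib import pyplot

def evaluate_model(trainX, trainy, testX, testy):
    # fit a fresh model and return its accuracy on the test set
    model = Sequential()
    model.add(Dense(15, input_dim=2, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(trainX, trainy, epochs=200, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    return test_acc

# generate and split the dataset (parameters are assumptions, as before)
X, y = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=2, random_state=2)
y = to_categorical(y)
n_train = int(0.3 * X.shape[0])
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# repeat the fit-and-evaluate experiment 30 times, saving the test accuracy scores
n_repeats = 30
scores = [evaluate_model(trainX, trainy, testX, testy) for _ in range(n_repeats)]
for i, score in enumerate(scores):
    print('> Run %d: %.3f' % (i + 1, score))
# summarize the distribution of scores
print('Scores Mean: %.3f, Standard Deviation: %.3f' % (mean(scores), std(scores)))
# histogram of the distribution
pyplot.hist(scores, bins=10)
pyplot.show()
# box and whisker plot of the distribution
pyplot.boxplot(scores)
pyplot.show()
```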
Running the example first prints the accuracy of each model on the test set, finishing with the mean and standard deviation of the sample of accuracy scores. The specifics of your sample may differ, but the summary statistics should be similar.

In this case, we can see that the average of the sample is 77% with a standard deviation of about 1.4%. Assuming a Gaussian distribution, we would expect 99% of accuracy scores to fall between about 73% and 81% (i.e. within 3 standard deviations above and below the mean). We can take the standard deviation of the accuracy of the model on the test set as an estimate of the variance of the predictions made by the model.

A histogram of the accuracy scores is also created, showing a very rough Gaussian shape, perhaps with a longer right tail. A larger sample and a different number of bins on the plot might better expose the true underlying shape of the distribution.

Histogram of Model Test Accuracy Over 30 Repeats

A box and whisker plot is also created, showing a line at the median at about 76.5% accuracy on the test set and the interquartile range, or middle 50% of the samples, between about 76% and 78%.

Box and Whisker Plot of Model Test Accuracy Over 30 Repeats

The analysis of the sample of test scores clearly demonstrates a variance in the performance of the same model trained on the same dataset. A spread of likely scores of about 8 percentage points (81% – 73%) on the test set could reasonably be considered large, that is, a high-variance result.

We can use model averaging to both reduce the variance of the model and possibly reduce the generalization error of the model. Specifically, this should result in a smaller standard deviation of accuracy on the holdout test set and perhaps better average performance on the test set. We can check both of these assumptions.

First, we must develop a function to prepare and return a model fit on the training dataset.

Next, we need a function that can take a list of ensemble members and make a prediction for an out-of-sample dataset. This could be one or more samples arranged in a two-dimensional array of samples and input features. Hint: you can use this function yourself for testing ensembles and for making predictions with ensembles on new data.

We don't know how many ensemble members will be appropriate for this problem. Therefore, we can perform a sensitivity analysis of the number of ensemble members and how it impacts test accuracy. This means we need a function that can evaluate a specified number of ensemble members and return the accuracy of a prediction combined from those members.

Finally, we can create a line plot of the number of ensemble members (x-axis) versus the accuracy of a prediction averaged across that many members on the test dataset (y-axis).

The complete example is listed below.
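The sketch below pulls these pieces together, with the same assumed dataset parameters as before: fit_model() prepares and returns a model fit on the training set, ensemble_predictions() sums the predicted class probabilities across the members and returns the argmax class labels, and evaluate_n_members() scores an ensemble built from the first n members.

```python
# sensitivity of test accuracy to the number of averaged ensemble members (sketch)
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from numpy import array, argmax
from matplotlib import pyplot

def fit_model(trainX, trainy):
    # prepare and return a model fit on the training dataset
    trainy_enc = to_categorical(trainy)
    model = Sequential()
    model.add(Dense(15, input_dim=2, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(trainX, trainy_enc, epochs=200, verbose=0)
    return model

def ensemble_predictions(members, testX):
    # sum the predicted probabilities across members and return the class labels
    yhats = array([model.predict(testX, verbose=0) for model in members])
    summed = yhats.sum(axis=0)
    return argmax(summed, axis=1)

def evaluate_n_members(members, n_members, testX, testy):
    # evaluate an ensemble built from the first n_members models
    subset = members[:n_members]
    yhat = ensemble_predictions(subset, testX)
    return accuracy_score(testy, yhat)

# generate and split the dataset (parameters are assumptions, as before)
X, y = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=2, random_state=2)
n_train = int(0.3 * X.shape[0])
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# fit a pool of 20 models, then score ensembles of increasing size
n_members = 20
members = [fit_model(trainX, trainy) for _ in range(n_members)]
scores = [evaluate_n_members(members, i, testX, testy) for i in range(1, n_members + 1)]
for i, score in enumerate(scores):
    print('> %d members: %.3f' % (i + 1, score))
# line plot of ensemble size (x-axis) versus test accuracy (y-axis)
pyplot.plot(range(1, n_members + 1), scores)
pyplot.show()
```

Summing the per-member probability vectors and taking the argmax is equivalent to averaging them, since dividing by the number of members does not change which class has the largest value.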
Running the example first fits 20 models on the same training dataset, which may take less than a minute on modern hardware. Then, ensembles of different sizes are tested, from 1 member to all 20 members, and the test accuracy for each ensemble size is printed. Finally, a line plot is created showing the relationship between ensemble size and performance on the test set.

We can see that performance improves up to about five members, after which it plateaus around 76% accuracy. This is close to the average test set performance observed during the analysis of the repeated evaluation of the model.

Line Plot of Ensemble Size Versus Model Test Accuracy

Finally, we can update the repeated evaluation experiment to use an ensemble of five models instead of a single model and compare the distribution of scores. The complete example of a repeatedly evaluated five-member ensemble on the blobs dataset is listed below.
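A self-contained sketch of this final experiment follows, reusing the same helpers and assumed dataset parameters as the previous listing.

```python
# repeated evaluation of a five-member model averaging ensemble (sketch)
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from numpy import array, argmax, mean, std

def fit_model(trainX, trainy):
    # prepare and return a model fit on the training dataset
    trainy_enc = to_categorical(trainy)
    model = Sequential()
    model.add(Dense(15, input_dim=2, activation='relu'))
    model.add(Dense(3, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(trainX, trainy_enc, epochs=200, verbose=0)
    return model

def ensemble_predictions(members, testX):
    # sum the predicted probabilities across members and return the class labels
    yhats = array([model.predict(testX, verbose=0) for model in members])
    return argmax(yhats.sum(axis=0), axis=1)

def evaluate_ensemble(n_members, trainX, trainy, testX, testy):
    # fit n_members fresh models and return the accuracy of their averaged prediction
    members = [fit_model(trainX, trainy) for _ in range(n_members)]
    yhat = ensemble_predictions(members, testX)
    return accuracy_score(testy, yhat)

# generate and split the dataset (parameters are assumptions, as before)
X, y = make_blobs(n_samples=500, centers=3, n_features=2, cluster_std=2, random_state=2)
n_train = int(0.3 * X.shape[0])
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# repeat the five-member ensemble experiment 30 times and summarize the scores
n_repeats, n_members = 30, 5
scores = [evaluate_ensemble(n_members, trainX, trainy, testX, testy) for _ in range(n_repeats)]
for i, score in enumerate(scores):
    print('> Repeat %d: %.3f' % (i + 1, score))
print('Scores Mean: %.3f, Standard Deviation: %.3f' % (mean(scores), std(scores)))
```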
Running the example may take a few minutes, as five models are fit and evaluated and this process is repeated 30 times. The ensemble's performance on the test set is printed for each repeat to provide an indication of progress, and the mean and standard deviation of the scores are printed at the end of the run. Your specific results may vary, but not by much.

In this case, we can see that the average performance of a five-member ensemble on the dataset is 76%. This is very close to the average of 77% seen for a single model. The important difference is the standard deviation shrinking from 1.4% for a single model to 0.6% with an ensemble of five models. We might expect a given ensemble of five models on this problem to have a performance that falls between about 74% and about 78% with a likelihood of 99%.

Averaging the same model trained on the same dataset gives us a narrower spread and therefore improved reliability, a property often highly desired in a final model to be used operationally. More models in the ensemble would further decrease the standard deviation of the accuracy of the ensemble on the test dataset, given the law of large numbers, at least to a point of diminishing returns.

This demonstrates that, for this specific model and prediction problem, a model averaging ensemble with five members is sufficient to reduce the variance of the model.