The k-fold cross-validation procedure is used to estimate the performance of machine learning models when making predictions on data not used during training.
This procedure can be used both when optimizing the hyperparameters of a model on a dataset, and when comparing and selecting a model for the dataset.
When the same cross-validation procedure and dataset are used to both tune and select a model, it is likely to lead to an optimistically biased evaluation of the model performance.
One approach to overcoming this bias is to nest the hyperparameter optimization procedure under the model selection procedure.
This is called double cross-validation or nested cross-validation and is the preferred way to evaluate and compare tuned machine learning models.
In this tutorial, you will discover nested cross-validation for evaluating tuned machine learning models.
Let’s get started.
Nested Cross-Validation for Machine Learning with Python. Photo by Andrew Bone, some rights reserved.
This tutorial is divided into three parts.
It is common to evaluate machine learning models on a dataset using k-fold cross-validation.
The k-fold cross-validation procedure divides a limited dataset into k non-overlapping folds.
Each of the k folds is given an opportunity to be used as a held back test set whilst all other folds collectively are used as a training dataset.
A total of k models are fit and evaluated on the k holdout test sets and the mean performance is reported.
The procedure provides an estimate of the model performance on the dataset when making a prediction on data not used during training.
It is less biased than some other techniques, such as a single train-test split, for small to modestly sized datasets.
Common values for k are k=3, k=5, and k=10.
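As a minimal sketch of the procedure described above, we can evaluate a single model with 10-fold cross-validation. The synthetic dataset, the choice of LogisticRegression, and the random seeds are assumptions made purely for illustration.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# synthetic binary classification dataset (an assumption for illustration)
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

# k=10 non-overlapping folds; rows are shuffled before splitting
cv = KFold(n_splits=10, shuffle=True, random_state=1)

# one model is fit per fold and scored on that fold's held-out rows
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)

# report the mean performance across the k holdout folds
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```

Each of the 10 scores comes from a fold the model did not see during training, and the mean is the reported estimate.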
Each machine learning algorithm includes one or more hyperparameters that allow the algorithm behavior to be tailored to a specific dataset.
The trouble is, there are rarely, if ever, good heuristics for how to configure the model hyperparameters for a dataset.
Instead, an optimization procedure is used to discover a set of hyperparameters that perform well or best on the dataset.
Common examples of optimization algorithms include grid search and random search, and each distinct set of model hyperparameters is typically evaluated using k-fold cross-validation.
This highlights that the k-fold cross-validation procedure is used both in the selection of model hyperparameters to configure each model and in the selection of configured models.
The k-fold cross-validation procedure is an effective approach for estimating the performance of a model.
Nevertheless, a limitation of the procedure is that if it is used multiple times with the same algorithm, it can lead to overfitting.
Each time a model with different model hyperparameters is evaluated on a dataset, it provides information about the dataset, specifically an often noisy model performance score.
This knowledge about the model on the dataset can be exploited in the model configuration procedure to find the best performing configuration for the dataset.
The k-fold cross-validation procedure attempts to reduce this effect, yet it cannot be removed completely, and some form of hill-climbing or overfitting of the model hyperparameters to the dataset will be performed.
This is the normal case for hyperparameter optimization.
The problem is that if this score alone is used to then select a model, or the same dataset is used to evaluate the tuned models, then the selection process will be biased by this inadvertent overfitting.
The result is an overly optimistic estimate of model performance that does not generalize to new data.
A procedure is required that allows both the selection of well-performing hyperparameters for each model and the selection among a collection of well-configured models on a dataset.
One approach to this problem is called nested cross-validation.
Nested cross-validation is an approach to model hyperparameter optimization and model selection that attempts to overcome the problem of overfitting the training dataset.
In order to overcome the bias in performance evaluation, model selection should be viewed as an integral part of the model fitting procedure, and should be conducted independently in each trial in order to prevent selection bias and because it reflects best practice in operational use.
— On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation, 2010.
The procedure involves treating model hyperparameter optimization as part of the model itself and evaluating it within the broader k-fold cross-validation procedure for evaluating models for comparison and selection.
As such, the k-fold cross-validation procedure for model hyperparameter optimization is nested inside the k-fold cross-validation procedure for model selection.
The use of two cross-validation loops also leads the procedure to be called “double cross-validation.”
Typically, the k-fold cross-validation procedure involves fitting a model on all folds but one and evaluating the fit model on the holdout fold.
Let’s refer to the aggregate of folds used to train the model as the “train dataset” and the held-out fold as the “test dataset.”
Each training dataset is then provided to a hyperparameter optimization procedure, such as grid search or random search, that finds an optimal set of hyperparameters for the model.
The evaluation of each set of hyperparameters is performed using k-fold cross-validation that splits up the provided train dataset into k folds, not the original dataset.
This is termed the “internal” protocol as the model selection process is performed independently within each fold of the resampling procedure.
— On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation, 2010.
Under this procedure, hyperparameter search does not have an opportunity to overfit the dataset as it is only exposed to a subset of the dataset provided by the outer cross-validation procedure.
This reduces, if not eliminates, the risk of the search procedure overfitting the original dataset and should provide a less biased estimate of a tuned model’s performance on the dataset.
In this way, the performance estimate includes a component properly accounting for the error introduced by overfitting the model selection criterion.
— On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation, 2010.
A downside of nested cross-validation is the dramatic increase in the number of model evaluations performed.
If n * k models are fit and evaluated as part of a traditional cross-validation hyperparameter search for a given model, then this is increased to k * n * k as the procedure is then performed k more times for each fold in the outer loop of nested cross-validation.
To make this concrete, you might use k=5 for the hyperparameter search and test 100 combinations of model hyperparameters.
A traditional hyperparameter search would, therefore, fit and evaluate 5 * 100 or 500 models.
Nested cross-validation with k=10 folds in the outer loop would fit and evaluate 10 * 500 or 5,000 models, a 10x increase in this case.
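The arithmetic above can be checked with a short sketch. The helper names search_fits() and nested_fits() are hypothetical, and the counts ignore any extra model that a search refits on all of its training data at the end.

```python
def search_fits(n_combinations, inner_k):
    """Models fit by one cross-validated hyperparameter search."""
    return n_combinations * inner_k


def nested_fits(n_combinations, inner_k, outer_k):
    """Models fit when the search is repeated once per outer fold."""
    return outer_k * search_fits(n_combinations, inner_k)


# 100 hyperparameter combinations, k=5 inner folds, k=10 outer folds
print(search_fits(100, 5))      # 500
print(nested_fits(100, 5, 10))  # 5000
```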
The k value for the inner loop and the outer loop should be set as you would set the k-value for a single k-fold cross-validation procedure.
You must choose a k-value for your dataset that balances the computational cost of the evaluation procedure (not too many model evaluations) against the need for an unbiased estimate of model performance.
It is common to use k=10 for the outer loop and a smaller value of k for the inner loop, such as k=3 or k=5.
The final model is configured and fit using the procedure applied internally to the outer loop.
This model can then be used to make predictions on new data.
We know how well it will perform on average based on the score provided by the outer loop of the nested cross-validation procedure.
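One way to read the final-model step is sketched below: the same search used in the inner loop is applied once to the entire dataset, and the refit best configuration becomes the final model. The dataset, the RandomForestClassifier, and the small grid are assumptions for illustration, not a prescribed configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# the full dataset (synthetic here, as an assumption)
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_redundant=10, random_state=1)

# the same kind of search that would run inside each outer fold
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
space = {'n_estimators': [10, 50], 'max_features': [2, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=1), space,
                      scoring='accuracy', cv=cv_inner, refit=True)

# apply the search to all available data; refit=True fits the best
# configuration on the entire dataset afterwards
search.fit(X, y)
final_model = search.best_estimator_

# the first five rows stand in for new data in this sketch
yhat = final_model.predict(X[:5])
print(search.best_params_, yhat)
```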
Now that we are familiar with nested cross-validation, let’s review how we can implement it in practice.
The k-fold cross-validation procedure is available in the scikit-learn Python machine learning library via the KFold class.
The class is configured with the number of folds (splits), then the split() function is called, passing in the dataset.
The results of the split() function are enumerated to give the row indexes for the train and test sets for each fold.
This class can be used to perform the outer loop of the nested cross-validation procedure.
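A minimal sketch of using KFold this way is shown below. The synthetic dataset and the seed values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold

# synthetic dataset with 100 rows (an assumption for illustration)
X, y = make_classification(n_samples=100, n_features=20, random_state=1)

# configure the procedure with the number of folds
cv = KFold(n_splits=10, shuffle=True, random_state=1)

# split() yields arrays of row indexes for the train and test sets of each fold
for fold, (train_ix, test_ix) in enumerate(cv.split(X)):
    X_train, X_test = X[train_ix, :], X[test_ix, :]
    y_train, y_test = y[train_ix], y[test_ix]
    print('fold %d: train=%d rows, test=%d rows' % (fold, len(train_ix), len(test_ix)))
```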
The scikit-learn library provides cross-validation random search and grid search hyperparameter optimization via the RandomizedSearchCV and GridSearchCV classes respectively.
The procedure is configured by creating the class and specifying the model, hyperparameters to search, and cross-validation procedure, then calling fit() with the dataset.
These classes can be used for the inner loop of nested cross-validation where the train dataset defined by the outer loop is used as the dataset for the inner loop.
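A minimal sketch of configuring and running a grid search is shown below. Here X_train and y_train stand in for the train dataset handed down by one outer fold; the dataset, the model, and the grid values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# stand-in for the train dataset of one outer fold (an assumption)
X_train, y_train = make_classification(n_samples=200, n_features=20, random_state=1)

# the model to tune and the hyperparameter values to try
model = RandomForestClassifier(random_state=1)
space = {'n_estimators': [10, 50], 'max_features': [2, 4]}

# the inner cross-validation used to score each combination
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)

# define and run the search; fit() returns the fitted search object
search = GridSearchCV(model, space, scoring='accuracy', cv=cv_inner)
result = search.fit(X_train, y_train)
print('best score=%.3f, best config=%s' % (result.best_score_, result.best_params_))
```

RandomizedSearchCV can be swapped in with the same overall pattern, taking the search space via its param_distributions argument and a number of samples via n_iter.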
We can tie these elements together and implement the nested cross-validation procedure.
Importantly, we can configure the hyperparameter search to refit a final model with the entire training dataset using the best hyperparameters found during the search.
This can be achieved by setting the “refit” argument to True, then retrieving the model via the “best_estimator_” attribute on the search result.
This model can then be used to make predictions on the holdout data from the outer loop and estimate the performance of the model.
Tying all of this together, we can demonstrate nested cross-validation for the RandomForestClassifier on a synthetic classification dataset.
We will keep things simple and tune just two hyperparameters with three values each, i.e. (3 * 3) or 9 combinations.
We will use 10 folds in the outer cross-validation and three folds for the inner cross-validation, resulting in (10 * 9 * 3) or 270 model evaluations.
The complete example is listed below.
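A minimal sketch of such an example follows. The dataset size, the grid values for n_estimators and max_features, and the random seeds are assumptions chosen to keep the run short, so the scores it prints will depend on those choices.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, KFold

# synthetic classification dataset (sizes are assumptions)
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_redundant=10, random_state=1)

# outer loop: 10 folds used to evaluate the tuned model
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)
outer_results = []
for train_ix, test_ix in cv_outer.split(X):
    # split the data for this outer fold
    X_train, X_test = X[train_ix, :], X[test_ix, :]
    y_train, y_test = y[train_ix], y[test_ix]
    # inner loop: 3 folds used to score each hyperparameter combination
    cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
    model = RandomForestClassifier(random_state=1)
    # two hyperparameters with three values each: 9 combinations
    space = {'n_estimators': [10, 30, 50], 'max_features': [2, 4, 6]}
    # the search only ever sees the outer train dataset
    search = GridSearchCV(model, space, scoring='accuracy', cv=cv_inner, refit=True)
    result = search.fit(X_train, y_train)
    # best configuration, refit on the whole outer train dataset
    best_model = result.best_estimator_
    # evaluate on the outer holdout fold
    yhat = best_model.predict(X_test)
    acc = accuracy_score(y_test, yhat)
    outer_results.append(acc)
    print('>acc=%.3f, est=%.3f, cfg=%s' % (acc, result.best_score_, result.best_params_))

# summarize the estimated performance of the tuned model
print('Accuracy: %.3f (%.3f)' % (mean(outer_results), std(outer_results)))
```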
Running the example evaluates random forest using nested cross-validation on a synthetic classification dataset.
You can use the example as a starting point and adapt it to evaluate different algorithm hyperparameters, different algorithms, or a different dataset.
Each iteration of the outer cross-validation procedure reports the estimated performance of the best performing model (using 3-fold cross-validation) and the hyperparameters found to perform the best, as well as the accuracy on the holdout dataset.
This is insightful as we can see that the actual and estimated accuracies are different, but in this case, similar.
We can also see that different hyperparameters are found on each iteration, showing that good hyperparameters on this dataset are dependent on the specifics of the dataset.
A final mean classification accuracy is then reported.
A simpler way to perform the same procedure is to use the cross_val_score() function to execute the outer cross-validation procedure.
This can be applied directly to the configured GridSearchCV, which will automatically use the refit best-performing model on the test set from the outer loop.
This greatly reduces the amount of code required to perform the nested cross-validation.
The complete example is listed below.
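A minimal sketch of this shorter version follows, under the same assumptions as before: a synthetic dataset, an illustrative grid for n_estimators and max_features, and fixed random seeds.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# synthetic classification dataset (sizes are assumptions)
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_redundant=10, random_state=1)

# inner loop: the hyperparameter search, treated as the "model"
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
model = RandomForestClassifier(random_state=1)
space = {'n_estimators': [10, 30, 50], 'max_features': [2, 4, 6]}
search = GridSearchCV(model, space, scoring='accuracy', cv=cv_inner, refit=True)

# outer loop: cross_val_score fits the search on each outer train dataset
# and scores its refit best model on the matching holdout fold
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(search, X, y, scoring='accuracy', cv=cv_outer)

# summarize the estimated performance of the tuned model
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```

Both GridSearchCV and cross_val_score() accept an n_jobs argument if we want to spread the model fits across CPU cores.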
Running the example performs the nested cross-validation on the random forest algorithm, achieving a mean accuracy that matches our manual procedure.
In this tutorial, you discovered nested cross-validation for evaluating tuned machine learning models.