Cross Validation: A Beginner’s Guide

An introduction to LOO, K-Fold, and Holdout model validation

By: Caleb Neale, Demetri Workman, Abhinay Dommalapati

In beginning your journey into the world of machine learning and data science, there is often a temptation to jump into algorithms and model creation without gaining an understanding of how to test the effectiveness of a generated model on real-world data.

Cross validation is a form of model validation which attempts to improve on the basic methods of hold-out validation by leveraging subsets of our data and an understanding of the bias/variance trade-off in order to gain a better understanding of how our models will actually perform when applied outside of the data they were trained on.

Don’t worry, it’ll all be explained! This article seeks to be a beginning-to-execution guide for three methods of model validation (hold-out, k-fold, and LOOCV) and the concepts behind them, with links and references to guide you to further reading.

We make use of scikit-learn, pandas, NumPy and other Python libraries in the given examples.

What will be addressed in this article:

What is model validation?
Why is it important?
What are bias and variance in the context of model validation?
What is cross validation?
What are common methods?
Where and when should different methods be implemented?
How do various methods of cross validation work?
How can we leverage cross validation to create better models?

What is model validation?

Model validation is the process by which we ensure that our models can perform acceptably in “the real world.” In more technical terms, model validation allows you to predict how your model will perform on datasets not used in the training (model validation is a big part of why preventing data leakage is so important).

Model validation is important because we don’t actually care how well the model predicts data we trained it on.

We already know the target values for the data we used to train a model, and as such it is much more important to consider how robust and capable a model is when tasked with modeling new datasets of the same distribution and characteristics, but with different individual values from our training set.
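As a quick illustration of why this matters, here is a minimal sketch of my own (using hypothetical toy data rather than anything from this article) showing that a model’s error on its own training data is usually more flattering than its error on held-out data:

# illustrative sketch (hypothetical toy data): training error vs. held-out error
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X_toy = rng.rand(200, 3)
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

# the error on the training data is typically the more optimistic of the two
print("MSE on training data:", mean_squared_error(y_tr, model.predict(X_tr)))
print("MSE on held-out data:", mean_squared_error(y_te, model.predict(X_te)))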

The first form of model validation introduced is usually what is known as holdout validation, often considered to be the simplest form of cross validation and thus the easiest to implement.

Let’s work through an example below.

Holdout validation

For this example, we’ll use a linear regression on the scikit-learn California housing dataset.

# import scikit learn datasets
from sklearn import datasets

# import california housing data from sklearn and store data into a variable
calihouses = datasets.fetch_california_housing()
calidata = calihouses.data

Once the data is stored in a variable we can more easily work with, we’ll convert it into a pandas dataframe so we can more easily view and work with the data.

# import pandas and numpy
import pandas as pd
import numpy as np

# define the column names of the data then convert to dataframe
headers = calihouses.feature_names
df = pd.DataFrame(calidata, columns=headers)

# print the df and shape to get a better understanding of the data
print(df.shape)
print(df)

Now that we’ve seen the data we’re working with, we can begin the process of generating a model and cross validation.

In holdout validation, we split the data into a training and testing set.

The training set will be what the model is created on and the testing data will be used to validate the generated model.

Though there are (fairly easy) ways to do this using pandas methods, we can make use of scikit-learn’s train_test_split method to accomplish this.

# first store all target data to a variable
y = calihouses.target

# create testing and training sets for hold-out verification using scikit learn method
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size = 0.25)

# validate set shapes
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

As you can see, we use train_test_split with three parameters: the input (X) data, the target (y) data, and the percentage of data we’d like to remove and put into the test dataset, in this case 25% (a common split is usually 70–30, depending on a multitude of factors about your data).

We then assign the split X and y data to a set of new variables to work with later.
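One practical note of my own (not from the original walkthrough): train_test_split shuffles the rows randomly, so passing a random_state gives you a reproducible split, which is handy when you want to compare runs:

# optional: fix the random seed so the hold-out split is reproducible across runs
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.25, random_state=42)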

Your output should appear at this point as:

(15480, 8) (15480,)
(5160, 8) (5160,)

Now that we’ve created our test/train split we can create a model and generate some predictions based on the train data.

Though there are other methods of creating a model which show more of the nitty gritty, we’ll use scikit-learn to make our lives a little easier.

I’ve included a few lines to time the runtime of the function, which we will use for later comparison.

# time the function for later comparison
from timeit import default_timer as timer
start_ho = timer()

# fit a model using linear model method from sklearn
from sklearn import linear_model
lm = linear_model.LinearRegression()
model = lm.fit(X_train, y_train)

# generate predictions
predictions = lm.predict(X_test)
end_ho = timer()

# calculate function runtime
time_ho = (end_ho - start_ho)

# show predictions
print(predictions)

Let’s pause here for a moment and look at what we’ve done.

Everything up to this point is just setup in creating a linear model and using it to make predictions on a dataset.

This is how far you get without model validation.

In other words, we have yet to look at how the model performs on its predictions of the test data when compared to the actual target values in the test data.

The test/train split we did earlier was necessary to divide the data such that we can now test the model on data that was not used in training (see: data leakage).

Now that we have a model, and have created some predictions, let’s go through with our holdout validation.

We’ll start by graphing our given target data vs our predicted target data to give us a visualization of how our model performs.

# import seaborn and matplotlib
import matplotlib
from matplotlib import pyplot as plt
import seaborn as sns

# set viz style
sns.set_style('dark')

# plot the model
plot = sns.scatterplot(y_test, predictions)
plot.set(xlabel='Given', ylabel='Prediction')

# generate and graph y = x line
x_plot = np.linspace(0,5,100)
y_plot = x_plot
plt.plot(x_plot, y_plot, color='r')

Output: a scatter plot of given vs predicted data, with the y = x line charted in red.

In a perfect model (overfit, maybe), all our data points would be on that red line, but as our data points approximate that trend, we can see the model is roughly appropriate for the test data.

Now, let’s get a score for the model to evaluate it against later methods.

start_ho_score = timer()

# model score (neg_mean_squared_error)
from sklearn import metrics
ho_score = -1*metrics.mean_squared_error(y_test, predictions)
print(ho_score)

end_ho_score = timer()
ho_score_time = (end_ho_score - start_ho_score)

Output:

-0.5201754311947533

That’s model validation! We created a model using training data, used it to predict outcomes on a split segment of test data, then used a scoring method to determine a measure of effectiveness of the model on the testing data.

This gives us an approximation of how well the model will perform on other similar datasets.

Now, a few things to consider.

We validated our model once.

What if the split we made just happened to be very conducive to this model? What if the split we made introduced a large skew into the data? Didn’t we significantly reduce the size of our training dataset by splitting it like that? These are a few questions we’ll consider as we move into cross validation, but first a few background concepts.

What are bias and variance in the context of model validation?

To understand bias and variance, let’s first address overfit and underfit models.

An overfit model is generated when the model is so tightly fit to the training data that it may account for random noise or unwanted trends which will not be present or useful in predicting targets for subsequent datasets.

Underfitting occurs when the model is not complex enough to account for general trends in the data which would be useful in predicting targets in subsequent datasets, such as using a linear fit on a polynomial trend (an awesome visualization and further explanation of this concept from AWS can be found here).
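To make the over/underfitting idea concrete, here is a small sketch of my own (hypothetical noisy quadratic data, not part of the article’s example) comparing polynomial fits of different degrees; degree 1 underfits the curve, while a very high degree starts chasing the noise:

# illustrative sketch (hypothetical data): underfitting vs. overfitting with polynomial fits
import numpy as np

rng = np.random.RandomState(1)
x = np.linspace(-3, 3, 30)
y_true = x**2                                        # the underlying trend
y_noisy = y_true + rng.normal(scale=1.0, size=x.shape)

for degree in (1, 2, 10):
    coeffs = np.polyfit(x, y_noisy, degree)          # fit a degree-d polynomial
    fit = np.polyval(coeffs, x)
    train_mse = np.mean((fit - y_noisy) ** 2)        # error against the noisy training data
    trend_mse = np.mean((fit - y_true) ** 2)         # error against the true trend
    print(f"degree {degree}: train MSE {train_mse:.2f}, MSE vs. true trend {trend_mse:.2f}")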

Source: https://docs.aws.amazon.com/machine-learning/latest/dg/model-fit-underfitting-vs-overfitting.html

When creating a model, we account for a few types of error: validation error, testing error, error due to bias, and error due to variance, in a relationship known as the bias-variance trade-off (another great visual here).

Source: http://www.luigifreda.com/2017/03/22/bias-variance-tradeoff/

As mentioned earlier, we want to know how the model will perform “in the real world.” Part of that is validation error, which is comprised of error due to bias and error due to variance (training error does not provide information on how the model will perform on future datasets, and can be set aside for now).

Minimizing model validation error requires finding the point of model complexity where the combination of bias and variance error is minimized, as shown in the linked visual.

As model complexity increases, error due to bias decreases, while error due to variance increases, creating the bias-variance trade-off, which we will seek to address later with various methods of cross validation.

Now let’s define bias and variance:

Bias

Bias is the error resulting from the difference between the expected value(s) of a model and the actual (or “correct”) value(s) for which we want to predict over multiple iterations.

In the scientific concepts of accuracy and precision, bias is very similar to accuracy.

Variance

Variance is defined as the error resulting from the variability between different data predictions in a model.

In variance, the correct value(s) don’t matter as much as the range of differences in value between the predictions.

Variance also comes into play more when we run multiple model creation trials.
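One way to build intuition for these definitions is a quick simulation, sketched below with hypothetical data (my own addition, not from the article): repeatedly resample a training set, fit a model each time, and look at its predictions at one fixed query point. The systematic offset of the average prediction from the true value behaves like bias, and the spread of the predictions across trials behaves like variance:

# rough sketch (hypothetical data): bias-like and variance-like quantities from repeated trials
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(2)
true_fn = np.sin                          # the "real world" relationship
x_query = np.array([[1.5]])               # fixed point at which we inspect predictions
preds = []

for trial in range(200):
    X_sample = rng.uniform(0, 3, size=(40, 1))
    y_sample = true_fn(X_sample).ravel() + rng.normal(scale=0.2, size=40)
    model = LinearRegression().fit(X_sample, y_sample)   # deliberately too simple for a sine curve
    preds.append(model.predict(x_query)[0])

preds = np.array(preds)
print("bias-like offset:", preds.mean() - true_fn(x_query)[0, 0])
print("variance-like spread:", preds.var())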

More complete definitions and visuals here.

Source: http://scott.fortmann-roe.com/docs/BiasVariance.html

In machine learning, bias and variance are often discussed together as a “bias-variance tradeoff”: minimizing one error effectively makes the other more likely to be present when creating and assessing a model.

Ideally, we would seek a model whose tradeoff results in both low bias and low variance, and we would look to achieve this by using cross validation.

Depending on the characteristics of the dataset, one method of cross validation is likely to be better suited than the others to balancing the bias-variance tradeoff when creating and assessing a model.

What is cross validation?

What if the split we made just happened to be very conducive to this model? What if the split we made introduced a large skew into the data? Didn’t we significantly reduce the size of our training dataset by splitting it like that?

Cross validation is a method of model validation which splits the data in creative ways in order to obtain better estimates of “real world” model performance and minimize validation error.

Remember those questions we asked about hold out validation? Cross validation is our answer.
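Before moving on, here is a quick sketch (my own addition, reusing the df and y variables from the holdout example above) that repeats the holdout split with several different random seeds; the spread in the resulting scores is exactly the kind of split-to-split variability that cross validation tries to average away:

# sketch: how sensitive is the holdout score to the particular split? (reuses df and y from above)
from sklearn.model_selection import train_test_split
from sklearn import linear_model, metrics

split_scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(df, y, test_size=0.25, random_state=seed)
    lm_seed = linear_model.LinearRegression().fit(X_tr, y_tr)
    split_scores.append(-metrics.mean_squared_error(y_te, lm_seed.predict(X_te)))

print("best split:", max(split_scores), "worst split:", min(split_scores))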

K-Fold Cross Validation

K-fold validation is a popular method of cross validation which shuffles the data and splits it into k number of folds (groups).

In general K-fold validation is performed by taking one group as the test data set, and the other k-1 groups as the training data, fitting and evaluating a model, and recording the chosen score.

This process is then repeated with each fold (group) as the test data and all the scores averaged to obtain a more comprehensive model validation score.

(More reading and a helpful visualization here).
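If it helps to see the mechanics spelled out, here is a rough sketch of my own (not the article’s code) of the K-Fold loop written by hand with scikit-learn’s KFold splitter; the cross_val_score helper used later performs essentially this loop for you:

# rough sketch: the K-Fold loop written out by hand (expects numpy arrays X and y)
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

def kfold_mse(X, y, k=5):
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    fold_scores = []
    for train_idx, test_idx in kf.split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        preds = model.predict(X[test_idx])
        fold_scores.append(mean_squared_error(y[test_idx], preds))
    return np.mean(fold_scores)      # average the per-fold scores into one estimate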

Source: http://www.ebc.cat/2017/01/31/cross-validation-strategies/#k-fold

When choosing a value for k, each fold should be large enough to be representative of the dataset (commonly k=10 or k=5).

Depending on the dataset size, different k values can sometimes be experimented with.

As a general rule, as k increases, bias decreases and variance increases.
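If you want to experiment with k on the running example, a simple loop over candidate values (a sketch of my own, reusing df and y from earlier) makes the comparison easy:

# sketch: comparing a few candidate k values on the running example (reuses df and y)
from sklearn.model_selection import cross_val_score
from sklearn import linear_model

for k in (3, 5, 10, 20):
    scores = cross_val_score(linear_model.LinearRegression(), df, y,
                             cv=k, scoring='neg_mean_squared_error')
    print("k =", k, "mean score:", scores.mean())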

Let’s work through an example with our dataset from earlier.

We’ll make use of a linear model again, but this time do model validation with scikit-learn’s cross_val_predict method, which will do most of the heavy lifting in generating K-Fold predictions.

In this case, I chose to set k=10.

# store data as an array
X = np.array(df)

# again, timing the function for comparison
start_kfold = timer()

# use cross_val_predict to generate K-Fold predictions
from sklearn.model_selection import cross_val_predict
lm_k = linear_model.LinearRegression()
k_predictions = cross_val_predict(lm_k, X, y, cv=10)
print(k_predictions)

end_kfold = timer()
kfold_time = (end_kfold - start_kfold)

Output (or approximate):

[4.22358985 4.04800271 3.75534521 ... 0.14474758 0.29600522 0.49525933]

cross_val_predict takes the model used on the data, the input and target data, as well as a cv argument (which is essentially our k value), and returns the predicted values for each input.
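One detail worth knowing (my note, based on scikit-learn’s documented behavior rather than anything stated in the article): when cv is given as a plain integer for a regression problem, scikit-learn uses a KFold splitter that does not shuffle by default, so if you want the shuffling described above you can pass a splitter object explicitly:

# sketch: passing an explicit, shuffled KFold splitter instead of a bare integer
from sklearn.model_selection import KFold, cross_val_predict

shuffled_cv = KFold(n_splits=10, shuffle=True, random_state=0)
k_predictions_shuffled = cross_val_predict(lm_k, X, y, cv=shuffled_cv)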

Now we can plot the predictions as we did with the hold out method.

# plot k-fold predictions against actual
plot_k = sns.scatterplot(y, k_predictions)
plot_k.set(xlabel='Given', ylabel='Prediction')

# generate and graph y = x line
x_plot = np.linspace(0,5,100)
y_plot = x_plot
plt.plot(x_plot, y_plot, color='r')

Output: a scatter plot of given vs predicted values, with the y = x line charted in red.

Now let’s get the scores of the 10 generated models and plot them into a visualization.

kfold_score_start = timer()

# find the mean score from the k-fold models using cross_val_score
from sklearn.model_selection import cross_val_score
kfold_scores = cross_val_score(lm_k, X, y, cv=10, scoring='neg_mean_squared_error')
print(kfold_scores.mean())

kfold_score_end = timer()
kfold_score_time = (kfold_score_end - kfold_score_start)

# plot scores
sns.distplot(kfold_scores, bins=5)

Output:

-0.5509524296956634

You’ll notice that the score is a little farther from zero than the holdout method (not good).

We’ll discuss that later.

Leave One Out Cross Validation

Leave One Out Cross Validation (LOOCV) can be considered a type of K-Fold validation where k=n, given n is the number of rows in the dataset.

Other than that, the methods are quite similar.

You will notice, however, that running the following code will take much longer than previous methods.

We’ll dig into that later.
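As an aside (my note, not from the article): scikit-learn also provides a LeaveOneOut splitter, which is equivalent to setting cv to the number of rows and can be passed as the cv argument in the same way:

# sketch: LOOCV via the LeaveOneOut splitter (equivalent to cv=len(X); slow on a dataset this size)
from sklearn.model_selection import LeaveOneOut, cross_val_score

loo_scores = cross_val_score(lm_k, X, y, cv=LeaveOneOut(),
                             scoring='neg_mean_squared_error')
print(loo_scores.mean())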

Let’s work an example with the same dataset, following the same process and modifying k:

Generate predictions:

start_LOO = timer()

# generate LOO predictions
LOO_predictions = cross_val_predict(lm_k, X, y, cv=(len(X)))

end_LOO = timer()
LOO_time = (end_LOO - start_LOO)

Plot the predictions:

# plot LOO predictions against actual
plot_LOO = sns.scatterplot(y, LOO_predictions)
plot_LOO.set(xlabel='Given', ylabel='Prediction')

# generate and graph y = x line
x_plot = np.linspace(0,5,100)
y_plot = x_plot
plt.plot(x_plot, y_plot, color='r')

Output: a scatter plot of given vs predicted values, with the y = x line charted in red.

Generate and average scores:

LOO_score_start = timer()

# find the mean score from the LOO models using cross_val_score
LOO_score = cross_val_score(lm_k, X, y, cv=(len(X)), scoring='neg_mean_squared_error').mean()
print(LOO_score)

LOO_score_end = timer()
LOO_score_time = (LOO_score_end - LOO_score_start)

Now let’s compare the run times and scores of our three methods:

print("Hold out method took", time_ho, "seconds to generate a model and", ho_score_time, "seconds to generate a MSE of", ho_score)
print("K-Fold method took", kfold_time, 'seconds to generate a model and', kfold_score_time, 'seconds to generate a MSE of', kfold_scores.mean())
print("Leave One Out Cross Validation method took", LOO_time, 'seconds to generate a model and', LOO_score_time, 'seconds to generate a MSE of', LOO_score)

Output:

Hold out method took 0.03958953900000495 seconds to generate a model and 0.002666198000042641 seconds to generate a MSE of -0.5201754311947533
K-Fold method took 0.07809067700000583 seconds to generate a model and 0.1253743699999177 seconds to generate a MSE of -0.5509524296956634
Leave One Out Cross Validation method took 152.00629317099992 seconds to generate a model and 161.83364986200013 seconds to generate a MSE of -0.5282462043712458

Let’s dig into these results a little, as well as some of the points raised earlier.

Where and when should different methods be implemented?

As we noticed in the results of our comparison, we can see that the LOOCV method takes way longer to complete than our other two.

This is because that method creates and evaluates a model for each row in the dataset, in this case over 20,000.

Even though our MSE is a little lower, this may not be worth it given the additional computational requirements.

Here are some heuristics which can help in choosing a method.

Hold out method

The hold out method can be effective and computationally inexpensive on very large datasets, or when computational resources are limited.

It is also often easier to implement and understand for beginners.

However, it is very rarely good to apply to small datasets as it can significantly reduce the training data available and hurt model performance.

K-Fold Cross Validation

K-Fold can be very effective on medium-sized datasets, though adjusting the k value can significantly alter the results of the validation.

Let’s add to our rule from earlier; as k increases, bias decreases, and variance and computational requirements increase.

K-Fold cross validation is likely the most common of the three methods due to the versatility of adjusting K-values.

LOOCV

LOOCV is most useful in small datasets as it allows for the smallest amount of data to be removed from the training data in each iteration.

However, the process of generating a model for each row in the dataset can be incredibly computationally expensive, and thus prohibitive for larger datasets.

What are some advantages and disadvantages of the different cross validation techniques?

Holdout Validation

In holdout validation, we are doing nothing more than performing a simple train/test split in which we fit our model to our training data and apply it to our testing data to generate predicted values.

We “hold out” the testing data to be strictly used for prediction purposes only.

Holdout validation is NOT a cross validation technique.

But we must discuss the standard method of model evaluation so that we can compare its attributes with the actual cross validation techniques.

When it comes to code, holdout validation is easy to use.

The implementation is simple and doesn’t require large dedications to computational power and time complexity.

Moreover, we can interpret and understand the results of holdout validation better as they don’t require us to figure out how the iterations are performing in the grand scheme of things.

However, holdout validation does not preserve the statistical integrity of the dataset in many cases.

For instance, a holdout validation that splits the data into training and testing segments causes bias by not incorporating the testing data into the model.

The testing data could contain some important observations.

This would result in a detriment to the accuracy of the model.

Furthermore, this can lead to underfitting or overfitting of the data, in addition to introducing validation and/or training error.

K-fold

In K-fold cross validation, we answer many of the problems inherent in holdout validation, such as underfitting/overfitting and validation and training error.

This is done by using all of the observations in our validation set at some iteration.

We compute an average accuracy score of all the accuracy scores that are calculated in each k iteration.

By doing so, we minimize bias and variation that may be present in our initial model evaluation technique, holdout validation.

 However, in terms of computational power, k-fold cross validation is very costly.

The computer has to perform several iterations to generate a proper accuracy score.

The accuracy score of the model will in theory increase with each added k iteration.

This will decrease bias while increasing variation.

We will see an example of this later in this article when we attempt to apply k-fold validation to a very large dataset that contains about 580,000 observations.

LOOCV

LOOCV is very similar to K-fold, with a special case in which k is equal to the length (or number of samples/rows) of the whole dataset.

Thus the training set will be of length k-1, and the testing set will be a single sample of the data.

LOOCV is particularly useful in the case that our data set is not large enough to sensibly do K-Fold.

LOOCV is also less computationally expensive in general, although that is usually due to the inherently smaller datasets that tend to utilize it.

However, LOOCV tends to yield high variance due to the fact that the method would pick up on all of the possible noise and outlier values in the data through the individual testing values.

LOOCV would be very computationally expensive for very large data sets; in this case, it would be better to use regular k-fold.

When would you not want to use cross validation?

Cross validation becomes a computationally expensive and taxing method of model evaluation when dealing with large datasets.

Generating prediction values ends up taking a very long time because the validation method has to run k times in the K-Fold strategy, iterating through the entire dataset.

Thus cross validation becomes a very costly model evaluation strategy in terms of time complexity.

We will examine this phenomenon by performing a normal holdout validation and a K-Fold cross validation on a very large dataset with approximately 580,000 rows.

See if you can figure out why it works the way it does (and check out the new data visualizations), and comment any questions.

Good luck!

# upload dataset from kaggle (we're using google colab here, adapt to your IDE)
from google.colab import files
uploaded = files.upload()

# imports for the model and metrics used below
import time
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# initialize data frame
df = pd.read_csv("covtype.csv")
print(df.head())
print(df.tail())
# that's a lot of rows!

# notice that we use all features of our dataset so that we can illustrate how taxing cross validation will be
X = df.loc[:,'Elevation':'Soil_Type40']
y = df['Cover_Type']

# some nan values happen to sneak into our dataset so we will fill them up
X = X.fillna(method='ffill')
y = y.fillna(method='ffill')

# use a K-nearest neighbors machine learning algorithm
neigh = KNeighborsClassifier(n_neighbors=5)

# only with 200 folds are we able to generate an accuracy of 80%
neigh.fit(X, y)
kFoldStart = time.time()
y_pred = cross_val_predict(neigh, X, y, cv=200)
kFoldEnd = time.time()
kFoldTime = kFoldEnd - kFoldStart
print("K Fold Validation Accuracy is ", accuracy_score(y, y_pred))

# it takes 16 minutes to run the K-Fold cross validation!!!!
print(kFoldTime)

# generate a heatmap of a confusion matrix with predicted and true values of the type of trees
labels = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
cm = confusion_matrix(y_pred, y, labels=labels)
print(cm)
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(cm, vmin=0, vmax=19000)
fig.colorbar(cax)
ax.set_xticklabels([''] + labels)
ax.set_yticklabels([''] + labels)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

Holdout validation:

# split our dataset into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

# some nan values happen to sneak into our dataset so we will fill them up
X_train = X_train.fillna(method='ffill')
y_train = y_train.fillna(method='ffill')

# run the holdout validation and make predictions
# it takes only 30 seconds for a normal validation which is still pretty long
neigh.fit(X_train, y_train)
holdOutStart = time.time()
holdOutPredictions = neigh.predict(X_test)
holdOutEnd = time.time()
holdOutTime = holdOutEnd - holdOutStart
print("Hold Out Validation takes ", holdOutTime, " seconds")
print(accuracy_score(y_test, holdOutPredictions))
# notice how much more accurate the holdout validation is compared to the k-fold cross validation

# generate a heatmap of a confusion matrix with predicted and true values of the type of trees
labels = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
cm = confusion_matrix(holdOutPredictions, y_test, labels=labels)
print(cm)
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(cm, vmin=0, vmax=8000)
fig.colorbar(cax)
ax.set_xticklabels([''] + labels)
ax.set_yticklabels([''] + labels)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()
