5 Reasons Why You Should Use Cross-Validation in Your Data Science Projects

Before I present my five reasons to use cross-validation, I want to briefly go over what cross-validation is and show some common strategies.

When we're building a machine learning model using some data, we often split our data into training and validation/test sets. The training set is used to train the model, and the validation/test set is used to validate it on data it has never seen before.

K-Folds Cross Validation — We split our data into K parts; for K=3, we then build three different models, where each model is trained on two parts and tested on the third.

Leave One Out — For each instance in our dataset, we build a model using all other instances and then test it on the selected instance.

Stratified Cross Validation — When we split our data into folds, we want to make sure that each fold is a good representative of the whole data.

When our dataset is small, holding out a test set leaves very little to test on, and testing anything on only 2 examples can't lead to any real conclusion. If we use cross-validation in this case, we build K different models, so we are able to make predictions on all of our data. After we have evaluated our learning algorithm (see #2 below), we can train our model on all our data, because if our 5 models had similar performance using different training sets, we can assume that training it on all the data will give similar performance. By doing cross-validation, we're able to use all our 100 examples both for training and for testing, while evaluating our learning algorithm on examples it has never seen before.
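The "use all your data" idea above can be sketched in a few lines. This is a minimal illustration using scikit-learn (the article doesn't name a library, so the dataset, the classifier choice, and all parameters here are assumptions for demonstration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# A synthetic dataset of 100 examples, matching the article's running example.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Stratified folds: each fold keeps roughly the same class balance as the whole data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

test_indices = []
for train_idx, test_idx in cv.split(X, y):
    # Each of the 5 models is trained on 4 folds and tested on the held-out fold.
    model = RandomForestClassifier(n_estimators=10, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    test_indices.extend(test_idx)

# Every one of the 100 examples ends up in a test fold exactly once,
# so each example is used both for training (in 4 folds) and for testing.
assert sorted(test_indices) == list(range(100))
```

The final assertion is the whole point: no example is wasted, yet every prediction is made by a model that never saw that example.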
As mentioned in #1, when we create five different models using our learning algorithm and test them on five different test sets, we can be more confident in our algorithm's performance. If all five scores are similar, it means that our algorithm (and our data) is consistent, and we can be confident that training it on the whole data set and deploying it in production will lead to similar performance. However, we could end up in a slightly different scenario, say 92.0, 44.0, 91.5, 92.5 and 91.8. Here it looks like our algorithm or our data (or both) is not consistent; it could be that our algorithm is unable to learn, or that our data is very complicated. By using cross-validation, we are able to get more metrics and draw important conclusions both about our algorithm and our data.

Sometimes we want to (or have to) build a pipeline of models to solve something. In Neural Networks, the stages of such a pipeline are trained together by passing the error backwards through the network; when we do something similar but not using Neural Networks, we can't train it in the same way, because there isn't always a clear "error" (or derivative) that we can pass back. For example, we may create a Random Forest model that predicts something for us, and right after that, we want to do a Linear Regression that relies on the previous predictions and produces some real number. The critical part here is that our second model must learn on the predictions of our first model.
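Getting five scores instead of one is straightforward in practice. Here is a minimal sketch using scikit-learn's `cross_val_score` (the dataset and model are placeholders, not from the article):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data and model, for illustration only.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
clf = RandomForestClassifier(n_estimators=10, random_state=0)

# One accuracy score per fold instead of a single number.
scores = cross_val_score(clf, X, y, cv=5)

# A small spread suggests a consistent algorithm; a large spread
# (like the 44.0 outlier in the text) is a warning sign to investigate.
print(scores, scores.mean(), scores.std())
```

The standard deviation across folds is a cheap, useful summary of how consistent the algorithm is on this data.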
The best solution here is to use a different dataset for each model: we train our first model (the Random Forest) on dataset A, then we use its predictions on dataset B to train our second model (the Linear Regression), and finally we use dataset C to evaluate our complete solution. We make predictions using the first model, pass them to our second model and then compare them to the ground truth. When we have limited data (as in most cases), we can't really do that. Also, we can't train both models on the same dataset, because then our second model learns on predictions that our first model has already seen; this may lead to different effects in our final evaluation that will be hard to understand. By using cross-validation, we can make predictions on our dataset in the same way as described before, so our second model's input will be real predictions on data that our first model has never seen before.

When we perform a random train-test split of our data, we assume that our examples are independent.
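The stacking trick above maps directly onto scikit-learn's `cross_val_predict`, which returns out-of-fold predictions for every example. This is a minimal sketch under assumed data and models (the article doesn't specify a library or parameters):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# Placeholder regression data, for illustration only.
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Out-of-fold predictions: each prediction comes from a model
# that never saw that example during training.
first_model = RandomForestRegressor(n_estimators=10, random_state=0)
oof_preds = cross_val_predict(first_model, X, y, cv=5)

# The second model learns on honest predictions, not memorized ones.
second_model = LinearRegression()
second_model.fit(oof_preds.reshape(-1, 1), y)
```

Because each prediction in `oof_preds` was made by a model trained without that example, the second model sees the same kind of input it will see in production.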
