Data Validation for Machine Learning

Data is the sustenance that keeps machine learning going.

No matter how powerful a machine learning or deep learning model is, it cannot do what we want it to do when it is fed bad data.

Random noise (i.e., data points that make it difficult to see a pattern), low frequency of a certain categorical variable, low frequency of the target category (if the target variable is categorical), and incorrect numeric values are just some of the ways data can mess up a model.
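Two of the problems listed above, rare categories and incorrect numeric values, can be screened for with very little code. The following is only a minimal sketch: the helper names, the 5% threshold, the valid range, and the toy data are all assumptions made for illustration, and real thresholds depend on the dataset.

```python
from collections import Counter


def rare_categories(values, min_fraction=0.05):
    """Return {category: share} for categories rarer than min_fraction."""
    counts = Counter(values)
    total = len(values)
    return {cat: n / total for cat, n in counts.items() if n / total < min_fraction}


def out_of_range(values, low, high):
    """Return (index, value) pairs for numeric entries outside [low, high]."""
    return [(i, v) for i, v in enumerate(values) if not (low <= v <= high)]


# Hypothetical toy data: an imbalanced target and an age column with bad entries.
labels = ["no"] * 97 + ["yes"] * 3
ages = [34, 29, -1, 41, 230, 56]

print(rare_categories(labels))      # {'yes': 0.03}
print(out_of_range(ages, 0, 120))   # [(2, -1), (4, 230)]
```

Checks like these only flag candidates for a closer look; they do not decide whether a rare category or an extreme value is actually wrong.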

While the validation process cannot directly pinpoint what is wrong, it can sometimes show us that there is a problem with the stability of the model.

  The most basic method of validating your data (i.e., tuning your hyperparameters before testing the model) is to perform a train/validate/test split on the data.

A typical ratio for this might be 80/10/10 to make sure you still have enough training data.

After training the model on the training set, the user moves on to validating the results and tuning the hyperparameters with the validation set until a satisfactory performance metric is reached.

Once this stage is completed, the user moves on to the test set, using the model to predict on it and evaluating the performance.
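The split described above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than a reference implementation: the function name, the fixed seed, and the toy data of 1,000 integer rows are assumptions, and in practice a library routine is usually used instead.

```python
import random


def train_validate_test_split(rows, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle rows and cut them into train/validate/test parts by ratio."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_train = int(len(shuffled) * ratios[0])
    n_validate = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    validate = shuffled[n_train:n_train + n_validate]
    test = shuffled[n_train + n_validate:]  # whatever remains goes to the test set
    return train, validate, test


# Hypothetical dataset of 1,000 rows.
rows = list(range(1000))
train, validate, test = train_validate_test_split(rows)
print(len(train), len(validate), len(test))  # 800 100 100
```

The validation part is the only one consulted while tuning hyperparameters; the test part is held back until the very end so that its score is not influenced by the tuning.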

... and the dataset is split so that the remaining n-1 data points form the training data and the one that was removed is the test data.

Performance is measured the same way as in k-fold cross-validation.
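Assuming the scheme described here is leave-one-out cross-validation (each of the n points is held out once while the other n-1 are used for training), a minimal sketch looks like this. The "model" is a hypothetical stand-in that always predicts the mean of its training targets, and the error metric (mean absolute error) and toy data are likewise assumptions; as with k-fold, the per-split scores are averaged into one number.

```python
from statistics import mean


def leave_one_out_splits(rows):
    """Yield (training_rows, held_out_row) once for every row."""
    for i in range(len(rows)):
        yield rows[:i] + rows[i + 1:], rows[i]


def fit_mean_predictor(train_targets):
    """Hypothetical stand-in model: always predicts the training mean."""
    avg = mean(train_targets)
    return lambda: avg


# Hypothetical toy targets.
targets = [2.0, 4.0, 6.0, 8.0]

errors = []
for train, held_out in leave_one_out_splits(targets):
    predict = fit_mean_predictor(train)          # train on the n-1 remaining points
    errors.append(abs(predict() - held_out))     # score on the single held-out point

loo_mae = mean(errors)  # average over all n splits, as with k-fold
print(f"leave-one-out MAE over {len(errors)} splits: {loo_mae:.3f}")
```

Because the model is refit n times, this approach is usually practical only for small datasets or cheap models.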

Validating a dataset gives reassurance to the user about the stability of their model.

With machine learning penetrating facets of society and being used in our daily lives, it becomes more imperative that the models are representative of our society.

Overfitting and underfitting are the two most common pitfalls that a data scientist can face during the model-building process.

Validation is the gateway to your model being optimized for performance and being stable for a period of time before needing to be retrained.
