Dealing with the Lack of Data in Machine Learning

This is where data generation can play a role.

It is used when no data is available, or when you need to create more data than you could amass even through aggregation.

In this case, the small amount of data that does exist is modified to create variations on that data to train the model.

For example, many images of a car can be generated by cropping and downsizing a single image of that car.

Unfortunately, the lack of quality labeled data is also one of the largest challenges facing data science teams, but by using techniques such as transfer learning and data generation it is possible to overcome this scarcity.

Another common application of transfer learning is training models on cross-customer datasets to overcome the cold-start problem, which I have noticed SaaS companies often have to deal with when onboarding new customers to their ML products.

Indeed, until a new customer has collected enough data to achieve good model performance (which could take several months), it is hard to provide value.

Data Augmentation

Data augmentation means increasing the number of data points.

In my latest project, we used data augmentation techniques to increase the number of images in our dataset.

For traditional data in row/column format, it means increasing the number of rows or objects.

We had no choice but to rely on data augmentation, for two reasons: time and accuracy.

Every data collection process is associated with a cost.

This cost can be in terms of dollars, human effort, computational resources and, of course, the time consumed in the process.

As a consequence, we had to augment the existing data to increase the size of the dataset we fed to our ML classifiers and to compensate for the cost of further data collection.

There are many ways to augment data.

In our case, we rotated the original image, changed the lighting conditions, and cropped it in different ways, so from one image we could generate several sub-samples.

This way you can reduce overfitting in your classifier.
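As a rough illustration, here is a minimal sketch of those augmentations using Pillow. This is not our actual pipeline; the angles, brightness factors, and crop offsets are placeholder values.

```python
# A minimal sketch of image augmentation with Pillow: rotations,
# lighting changes, and crops turn one image into many samples.
from PIL import Image, ImageEnhance

def augment(img):
    """Yield several augmented variants of a single PIL image."""
    # Rotations: small angles keep the subject recognizable.
    for angle in (-15, 15, 30):
        yield img.rotate(angle)

    # Lighting changes: factor < 1 darkens, > 1 brightens.
    for factor in (0.6, 1.4):
        yield ImageEnhance.Brightness(img).enhance(factor)

    # Crops: four overlapping corner crops, resized back to the
    # original size so the classifier sees a fixed input shape.
    w, h = img.size
    for left, top in ((0, 0), (w // 4, 0), (0, h // 4), (w // 4, h // 4)):
        crop = img.crop((left, top, left + 3 * w // 4, top + 3 * h // 4))
        yield crop.resize((w, h))

# Usage with a dummy image; in practice you would Image.open(...) a
# real photo and feed each variant to training as a new sample.
source = Image.new("RGB", (256, 256), color="gray")
variants = list(augment(source))
print(len(variants))  # 9 augmented samples from one image
```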

However, if you are generating artificial data using over-sampling methods such as SMOTE, there is a fair chance you may introduce overfitting.

Overfitting: an overfitted model is one whose fitted curve reflects the errors in the data it was trained on instead of the underlying trend, so it fails to accurately predict unseen data.
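To make the definition concrete, here is a small self-contained illustration on made-up data (not from the project above): a degree-9 polynomial fits ten noisy training points almost perfectly, yet predicts held-out points far worse than a simple straight line does.

```python
# Overfitting in miniature: compare train vs. test error for a
# straight-line fit and a degree-9 polynomial on noisy data.
import numpy as np

def trend(x):
    """The true underlying relationship."""
    return 2 * x + 1

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
x_test = np.linspace(0.05, 0.95, 10)        # held-out points
y_train = trend(x_train) + rng.normal(0, 0.3, 10)  # noisy observations
y_test = trend(x_test) + rng.normal(0, 0.3, 10)

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # Degree 9 interpolates the 10 training points (train MSE ~ 0)
    # but typically has a much larger test MSE than degree 1.
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```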

This is something you must take into consideration when developing your AI solution.

Synthetic Data

Synthetic data means fake data that has the same schema and statistical properties as its “real” counterpart.

Basically, it looks so real that it’s nearly impossible to tell that it’s not.

So what’s the point of synthetic data, and why does it matter if we already have access to the real thing? I have seen synthetic data applied especially when dealing with private data (banking, healthcare, etc.), which makes synthetic data a more secure approach to development in certain instances.

Synthetic data is used mostly when there is not enough real data overall, or not enough real data for specific patterns you know about.

Its usage is mostly the same for training and testing datasets.
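As a sketch of one naive way to synthesize tabular data, the snippet below samples each column independently from its empirical distribution, so the schema and per-column statistics match the real data while cross-column correlations are ignored. The columns here are hypothetical; dedicated synthetic-data tools model the joint distribution properly.

```python
# Naive tabular synthesis: match schema and per-column statistics.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Stand-in for the "real" private dataset.
real = pd.DataFrame({
    "age": rng.normal(45, 12, 1000).round().clip(18, 90),
    "balance": rng.lognormal(8, 1, 1000).round(2),
    "segment": rng.choice(["retail", "premium", "business"], 1000,
                          p=[0.7, 0.2, 0.1]),
})

def synthesize(df, n, rng):
    """Build n synthetic rows matching df's schema and column stats."""
    out = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            # Numeric column: fit a normal to the observed mean/std
            # (crude; real tools fit better-suited distributions).
            out[col] = rng.normal(df[col].mean(), df[col].std(), n)
        else:
            # Categorical column: draw with the observed frequencies.
            freqs = df[col].value_counts(normalize=True)
            out[col] = rng.choice(freqs.index, n, p=freqs.values)
    return pd.DataFrame(out)

fake = synthesize(real, 500, rng)
print(fake.describe())  # per-column stats track the real data
```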

Synthetic Minority Over-sampling Technique (SMOTE) and Modified-SMOTE are two techniques that generate synthetic data.

Simply put, SMOTE takes the minority class data points and creates new data points that lie on the straight line joining any two nearest data points.

In order to do this, the algorithm calculates the distance between two data points in the feature space, multiplies the distance by a random number between 0 and 1 and places the new data point at this new distance from one of the data points used for distance calculation.
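A minimal NumPy sketch of that interpolation step might look like the following. For real use, the SMOTE implementation in the imbalanced-learn library also handles efficient neighbor search, multiple classes, and edge cases.

```python
# Sketch of SMOTE's core step: place a new point a random fraction
# of the way between a minority point and one of its k neighbors.
import numpy as np

def smote_sample(minority, n_new, k=5, rng=None):
    """Create n_new synthetic points from a minority-class array."""
    if rng is None:
        rng = np.random.default_rng()
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        point = minority[i]
        # Distances from this point to every other minority point.
        dists = np.linalg.norm(minority - point, axis=1)
        # Its k nearest neighbors (index 0 is the point itself).
        neighbors = np.argsort(dists)[1:k + 1]
        neighbor = minority[rng.choice(neighbors)]
        # New point at a random fraction of the segment's length.
        gap = rng.random()
        synthetic.append(point + gap * (neighbor - point))
    return np.array(synthetic)

# Usage: 20 minority-class points in 2-D, oversampled by 50 extras.
rng = np.random.default_rng(1)
minority = rng.normal(0, 1, (20, 2))
extra = smote_sample(minority, 50, rng=rng)
print(extra.shape)  # (50, 2)
```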

In order to generate synthetic data, you have to use a training set to define a model, which would require at least a validation step; then, by changing the parameters of interest, you can generate synthetic data through simulation.
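As a minimal sketch of that workflow, assuming a simple linear relationship (all numbers hypothetical): fit the model on the training set, then simulate new data from it, optionally varying the parameters of interest.

```python
# Simulation-based synthesis: fit a model, then sample from it.
import numpy as np

rng = np.random.default_rng(7)

# "Real" training set: y depends linearly on x, plus noise.
x_train = rng.uniform(0, 10, 200)
y_train = 3.0 * x_train + 5.0 + rng.normal(0, 2.0, 200)

# Define the model from the training set: slope, intercept, noise.
slope, intercept = np.polyfit(x_train, y_train, 1)
noise_std = np.std(y_train - (slope * x_train + intercept))

def simulate(n, slope, intercept, noise_std, rng):
    """Generate n synthetic (x, y) pairs from the fitted model."""
    x = rng.uniform(0, 10, n)
    y = slope * x + intercept + rng.normal(0, noise_std, n)
    return x, y

# Synthetic data under the fitted parameters...
x_syn, y_syn = simulate(500, slope, intercept, noise_std, rng)
# ...and under a deliberately steeper slope, to cover a pattern we
# expect but have little real data for.
x_what_if, y_what_if = simulate(500, slope * 1.5, intercept, noise_std, rng)
```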

The domain and data type significantly affect the complexity of the entire process.

In my opinion, asking yourself whether you have enough data will reveal inconsistencies that you likely never realized were there.

It will probably highlight issues in business processes you thought were perfect, or make you understand why it is key to have a data strategy within your organization.
