The minimum viable data set

To answer this question, one should know the concept of different data sets for machine learning.

There are three types of data sets for the training of algorithms:

Training data is used to train an algorithm.

It is an initial set of data used to help a program learn and produce sophisticated results.

Validation data is a portion of the data used to assess how well the models fit, to adjust some models, and to select the best one.

Testing data is a portion of the data used to assess how well the final model might perform on additional data.

Every machine learning algorithm will only be as good as the underlying data that we are feeding it.

Therefore, the importance of that data cannot be overestimated.

Generally speaking, you’d have one data set and split it up into those three subsets:

[Figure: Splitting your Data into Training, Validation and Testing Data]

This 60/20/20 split is of course only a rule of thumb.

It always varies depending on your use case, the number of variables and the size of your sample.
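As a minimal sketch of such a split, assuming the 60/20/20 ratio refers to training, validation and testing data in that order, the following uses only the Python standard library. The function name `split_dataset` and the fixed seed are illustrative choices, not something prescribed by the article.

```python
import random

def split_dataset(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle rows and split them into training, validation and testing subsets.

    The testing share is whatever remains after the first two shares.
    """
    if train <= 0 or validation <= 0 or train + validation >= 1:
        raise ValueError("shares must be positive and leave room for testing data")
    shuffled = list(rows)
    # Shuffle first so that any ordering in the source data does not bias a subset.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validation],
        shuffled[n_train + n_validation:],
    )

samples = list(range(100))  # stand-in for 100 real records
train_set, validation_set, test_set = split_dataset(samples)
print(len(train_set), len(validation_set), len(test_set))
```

Changing the `train` and `validation` arguments is enough to try other ratios when the use case or sample size calls for it.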

In supervised learning, training data basically consists of pairs of input and expected output.

With that in mind, different types of algorithms need the data to be structured in different ways.

Examples are computer vision, where the training set consists of a large number of images, or sequential decision trees, where it would be alphanumerical data.

(Cross-)validation data is used to improve the accuracy and efficiency of the algorithm.

It is well suited for tuning parameters and helps to avoid ‘overfitting’, which means fitting the algorithm too closely to the training data.

Testing data evaluates the final model on how well it performs when confronted with previously unknown data input.

It is used for a final, unbiased comparison of the finished models in order to decide which one to go with.

Validation data can’t be used for this step because it has already influenced tuning and model selection, which makes it part of the training process itself.
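The division of labour between the three subsets can be sketched as follows. This is an illustrative example under stated assumptions, not the article's own method: the data is synthetic, the model is a simple one-dimensional k-nearest-neighbour classifier written for this sketch, and the candidate values for `k` are arbitrary.

```python
import random

def knn_predict(train, x, k):
    """Predict the majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0

def accuracy(train, data, k):
    """Share of points in data that the k-NN model built on train labels correctly."""
    hits = sum(knn_predict(train, x, k) == label for x, label in data)
    return hits / len(data)

# Synthetic data: the label tends to be 1 when x is above 5, with some noise.
rng = random.Random(0)
data = []
for _ in range(300):
    x = rng.uniform(0, 10)
    label = 1 if x + rng.gauss(0, 1.5) > 5 else 0
    data.append((x, label))

# 60/20/20 split; the data is already in random order.
train, validation, test = data[:180], data[180:240], data[240:]

# Tune the parameter k on the validation data only.
candidates = [1, 5, 15]
validation_scores = {k: accuracy(train, validation, k) for k in candidates}
best_k = max(validation_scores, key=validation_scores.get)

# Touch the testing data once, after all tuning decisions are made.
test_accuracy = accuracy(train, test, best_k)
print(best_k, validation_scores, test_accuracy)
```

Because `best_k` was picked for its validation score, that score is an optimistic estimate; the accuracy on the untouched testing data is the figure to report.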

As you can see, it is crucial to test in order to derive reliable results and avoid misinterpretation due to incomplete or biased training data sets.

[Figure: The importance of testing data]

The concept of a minimum viable data set

Now the key question arises: how can I minimize my data input if I need to split all of the data into different sets to train my model, and if it is so important to have a valid, unbiased sample? There are different approaches to reduce the required amount of data from millions of data points to significantly fewer:

1) Data pooling: Join forces with other data vendors, e.g. business partners, suppliers or non-competitive market participants.

This requires a well-organized process of standardization, ideally managed by an independent third party.

2) Data enrichment: Enrich your existing data set with public data sets or with data bought from dedicated data vendors. The prerequisite is that this makes your data set more meaningful without losing sight of the question you originally wanted to answer.

3) Knowledge transfer: Use pre-trained models, or train your model on suitable but more generic sample data and then refine it with smaller samples of your proprietary data.

Cloud vendors and specialized service providers are likely to grow this segment very quickly in the near future.

4) Iterative data generation: Having only rather small data sets does not exclude you from starting to build machine learning models.

Take the data you have or can easily extract from your systems to get a very rough first indication of whether your optimization idea works, and accept a high degree of uncertainty at first.

Then start to build up your data resources over time by kicking off corresponding business intelligence processes, and iteratively adjust your initially very simple model.

This way you can justify every financial decision regarding AI and BI projects and work in a very focused way towards one goal, instead of making huge efforts in either one of these fields without knowing exactly what to do with the end result.
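The knowledge transfer idea from approach 3 can be illustrated with a deliberately tiny sketch: a linear model is first fitted on plentiful generic data and then refined for a few steps on a handful of proprietary samples. Everything here is an assumption made for illustration, including the synthetic data, the slopes and intercepts, the learning rate and the step counts; real transfer learning usually involves far larger pre-trained models.

```python
import random

def make_data(rng, n, slope, intercept, noise=0.1):
    """Draw n noisy (x, y) pairs from y = slope * x + intercept."""
    data = []
    for _ in range(n):
        x = rng.uniform(0, 1)
        data.append((x, slope * x + intercept + rng.gauss(0, noise)))
    return data

def train(data, w=0.0, b=0.0, steps=20, lr=0.1):
    """Run plain gradient descent on mean squared error, starting from (w, b)."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def mse(params, data):
    """Mean squared error of the linear model params = (w, b) on data."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

rng = random.Random(1)
generic = make_data(rng, 500, slope=2.0, intercept=1.0)     # large, generic data
proprietary = make_data(rng, 10, slope=2.2, intercept=0.8)  # small, own data
holdout = make_data(rng, 200, slope=2.2, intercept=0.8)     # used for evaluation only

# Pre-train on the generic data, then refine briefly on the proprietary samples.
pretrained = train(generic, steps=1000)
fine_tuned = train(proprietary, w=pretrained[0], b=pretrained[1], steps=20)

# Baseline: the same small training budget, but starting from zero.
from_scratch = train(proprietary, steps=20)

fine_tuned_error = mse(fine_tuned, holdout)
scratch_error = mse(from_scratch, holdout)
print(fine_tuned_error, scratch_error)
```

In this setup the refined model starts close to a good solution, so the same small budget of data and training steps takes it further than a model trained from scratch. How much a real project gains depends on how similar the generic data is to the proprietary problem.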

Conclusion

So, what does all of that mean in practice? Whether or not your company already works with AI, there will be situations where data becomes a scarce resource.

Here are some aspects to consider in such a scenario:

1) AI strongly depends on high-quality data, which is even more important than large volumes.

2) A lack of data does not necessarily kill ideas or projects; there are several ways to deal with a data shortage.

3) Transfer learning will be a massive adoption driver for AI in the coming years, also for medium-sized companies.

4) Cloud-based Machine-Learning-as-a-Service offerings will provide a suitable infrastructure for rather generic enterprise functions.

5) Specialized service providers will fill the gap for niche but high-value use cases such as optical quality control or a specific supply chain optimization problem.

6) Iterative development on both the data side and the model side creates higher uncertainty but is far better than not starting to develop your data infrastructure.

The first step is to define what business problem you want to solve, or to evaluate what strategic potential can be tapped with systems based on machine learning.

And as the quality of your data is critical, this should be the starting point of the efforts on an operational level: getting your data as clean as possible, for example in the form of integrated and (near) real-time data warehouse systems.

If you don’t have that yet, there are some free example data sets that give an idea of what a nice and clean data basis could look like:

AWS Public Data sets

You’ll need an AWS (Amazon Web Services) account, but Amazon offers a free access tier for new accounts that will enable you to explore the data without being charged.

Google Public Data sets

You’ll need a GCP (Google Cloud Platform) account here as well.

The first TB of queries you make is free.

Kaggle

This amazing data science community regularly hosts machine learning competitions.

You can get free data sets when entering competitions or from contributions by the community.


Data.gov

You can browse data sets from various US government agencies directly on the website, without signing up.


data.world

You can find numerous data sets tagged with the relevant keywords while navigating in a GitHub-like environment.

With a free account, you can easily start off and use up to 100MB per project.

Quandl

Alongside traditional financial data, you can find a vast amount of ‘alternative’ data on the platform, meaning data drawn from different pools (e.g. from satellites or IoT devices).

Pricing depends on the data set and your intended use.

After that — and most likely after choosing the tools to use out of the vast landscape of technology vendors — the process of data preparation can begin.

For more insights on how to kick off your AI projects, check our other blog posts or our website: appanion.com.
