Data Scientist’s Dilemma: The Cold Start Problem – Ten Machine Learning Examples

How? This can be a real challenge.

Of course nobody said the “cold start” problem would be easy.

Anyone who has ever tried to start a very cold car on a frozen morning knows the pain of a cold start challenge.

Nothing can be more frustrating on such a morning.

But, nothing can be more exhilarating and uplifting on such a morning than that moment when the engine starts and the car begins moving forward with increasing performance.

The experiences for data scientists who face cold-start problems in machine learning can be very similar to those, especially the excitement when our models begin moving forward with increasing performance.

We will itemize several examples at the end.

But before we do that, let’s address the objective function.

That is the true key that unlocks performance in a cold-start challenge.

That’s the magic ingredient in most of the examples that we will list.

The objective function (also known as cost function, or benefit function) provides an objective measure of model performance.

It might be as simple as the percentage of class labels that the model got right (in a classification model), or the sum of the squares of the deviations of the points from the model curve (in a regression model), or the compactness of the clusters relative to their separation (in a clustering analysis).
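
As a minimal sketch of the first two objective functions just described (the function names and the toy data are our own illustrative assumptions, not taken from any particular library), they could be written like this:

```python
def accuracy(y_true, y_pred):
    """Benefit function: fraction of class labels the model got right."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def sum_squared_error(y_true, y_pred):
    """Cost function: sum of squared deviations of the points from the model curve."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))


# Hypothetical toy data, for illustration only.
class_score = accuracy(["cat", "dog", "dog", "cat"], ["cat", "dog", "cat", "cat"])
regression_cost = sum_squared_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
print(class_score, regression_cost)
```

A higher accuracy is better and a lower sum of squared errors is better, which is why the same quantity is sometimes called a benefit function and sometimes a cost function.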

The value of the objective function is not only in its final value (i.e., giving us a quantitative overall model performance rating), but its great (perhaps greatest) value is realized in guiding our progression from the initial random model (cold-start zero point) to that final successful (hopefully, optimal) model.

In those intermediate steps it serves as an evaluation (or validation) metric.

By measuring the evaluation metric at step zero (cold-start), then measuring it again after making adjustments to the model parameters, we learn whether our adjustments led to a better performing model or worse performance.

We then know whether to continue making model parameter adjustments in the same direction or in the opposite direction.

This is called gradient descent.

Gradient descent methods basically find the slope (i.e., the gradient) of the performance error curve as we progress from one model to the next.

As we learned in grade school algebra class, we need two points to find the slope of a curve.

Therefore, it is only after we have run and evaluated two models that we will have two performance points — the slope of the curve at the latest point then informs our next choice of model parameter adjustments: either (a) keep adjusting in the same direction as the previous step (if the performance error decreased) to continue descending the error curve; or (b) adjust in the opposite direction (if the performance error increased) to turn around and start descending the error curve.
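
The two-point procedure described above can be sketched for a model with a single parameter. This is only an illustration under stated assumptions: the error curve `error`, the starting value, the size of the first adjustment, and the learning rate are all hypothetical choices of ours.

```python
def error(w):
    # Hypothetical performance error curve with its minimum at w = 3.0.
    return (w - 3.0) ** 2


def two_point_gradient_descent(error_fn, w_start, first_step=0.5,
                               learning_rate=0.1, n_steps=100):
    """Estimate the slope from the two latest models, then step downhill."""
    w_prev, e_prev = w_start, error_fn(w_start)   # model 1: the cold start
    w = w_start + first_step                      # model 2: an arbitrary first adjustment
    for _ in range(n_steps):
        e = error_fn(w)
        if w == w_prev:
            break                                 # no change left to measure a slope from
        slope = (e - e_prev) / (w - w_prev)       # slope through the two latest points
        w_prev, e_prev = w, e
        # A negative slope means the error fell as w grew, so keep moving the same way;
        # a positive slope means the error rose, so turn around.
        w = w - learning_rate * slope
    return w


best_w = two_point_gradient_descent(error, w_start=0.0)
print(best_w)
```

Subtracting the slope times a learning rate covers both cases (a) and (b) in one line: the sign of the slope decides the direction, and its size decides how far to move.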

Note that hill-climbing is the opposite of gradient descent, but essentially the same thing.

Instead of minimizing error (a cost function), hill-climbing focuses on maximizing accuracy (a benefit function).

Again, we measure the slope of the performance curve from two models, then proceed in the direction of better-performing models.

In both cases (hill-climbing and gradient descent), we hope to reach an optimal point (maximum accuracy or minimum error), and then declare that to be the best solution.
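
Hill-climbing can be sketched in the same spirit. The benefit curve `benefit`, the starting point, and the rule of halving the step when a move fails are illustrative assumptions of ours, one simple variant among many.

```python
def benefit(w):
    # Hypothetical accuracy-like benefit curve that peaks at w = 2.0.
    return 1.0 - (w - 2.0) ** 2


def hill_climb(benefit_fn, w_start, step=0.5, n_steps=100):
    """Keep any adjustment that improves the benefit; otherwise turn around."""
    w, best = w_start, benefit_fn(w_start)
    for _ in range(n_steps):
        candidate = w + step
        score = benefit_fn(candidate)
        if score > best:
            w, best = candidate, score    # uphill: accept and keep this direction
        else:
            step = -step / 2.0            # not uphill: reverse with a smaller step
    return w, best


peak_w, peak_value = hill_climb(benefit, w_start=0.0)
print(peak_w, peak_value)
```

The only difference from the descent case is the sign of the comparison: we accept a move when the benefit goes up instead of when the error goes down.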

And that is amazing and satisfying when we remember that we started (as a cold-start) with an initial random guess at the solution.

When our machine learning model has many parameters (which could be thousands or more for a deep neural network), the calculations are more complex (perhaps involving a multi-dimensional gradient calculation, with the values stored in a tensor).

But the principle is the same: quantitatively discover at each step in the model-building progression which adjustments (size and direction) are needed in each one of the model parameters in order to progress towards the optimal value of the objective function (e.g., minimize errors, maximize accuracy, maximize goodness of fit, maximize precision, minimize false positives, etc.).
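
To make the many-parameter case concrete, here is a minimal sketch with just two parameters (the slope and intercept of a line). The gradient is estimated numerically with small finite differences; the toy data, learning rate, and step count are hypothetical choices for illustration, and a real model would typically use an analytic or automatically computed gradient instead.

```python
def sum_squared_error(params, xs, ys):
    slope, intercept = params
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


def numerical_gradient(fn, params, eps=1e-6):
    """One partial derivative per parameter, via central differences."""
    grad = []
    for i in range(len(params)):
        up = list(params)
        up[i] += eps
        down = list(params)
        down[i] -= eps
        grad.append((fn(up) - fn(down)) / (2.0 * eps))
    return grad


def gradient_descent(fn, params, learning_rate=0.01, n_steps=2000):
    params = list(params)
    for _ in range(n_steps):
        grad = numerical_gradient(fn, params)
        # Adjust every parameter at once: size and direction both come from the gradient.
        params = [p - learning_rate * g for p, g in zip(params, grad)]
    return params


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]          # hypothetical data lying on y = 2x + 1
fitted = gradient_descent(lambda p: sum_squared_error(p, xs, ys), [0.0, 0.0])
print(fitted)
```

Starting from the cold-start guess of zero for both parameters, the fitted values should approach a slope of 2 and an intercept of 1.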

In deep learning, as in typical neural network models, the method by which those adjustments to the model parameters are estimated (i.e., for each of the edge weights between the network nodes) is called backpropagation.

That is still based on gradient descent.
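
As a rough sketch of that idea, consider a deliberately tiny network with one input, one hidden node, and one output node. The architecture, the target function, the random seed, and the learning rate are all assumptions we chose for illustration; real networks have many more weights, but the chain-rule bookkeeping is the same.

```python
import math
import random


def forward(params, x):
    w1, b1, w2, b2 = params
    hidden = math.tanh(w1 * x + b1)    # hidden-node activation
    output = w2 * hidden + b2          # linear output node
    return hidden, output


def loss(params, data):
    """Mean squared error over the data set."""
    return sum((forward(params, x)[1] - y) ** 2 for x, y in data) / len(data)


def backprop_gradients(params, data):
    """Pass the output error backward through each edge weight (chain rule)."""
    w1, b1, w2, b2 = params
    grads = [0.0, 0.0, 0.0, 0.0]
    for x, y in data:
        hidden, output = forward(params, x)
        d_output = 2.0 * (output - y) / len(data)   # dLoss/dOutput
        d_hidden = d_output * w2                    # back across the w2 edge
        d_pre = d_hidden * (1.0 - hidden ** 2)      # back through tanh
        grads[0] += d_pre * x                       # dLoss/dw1
        grads[1] += d_pre                           # dLoss/db1
        grads[2] += d_output * hidden               # dLoss/dw2
        grads[3] += d_output                        # dLoss/db2
    return grads


def train(params, data, learning_rate=0.1, n_steps=500):
    params = list(params)
    for _ in range(n_steps):
        grads = backprop_gradients(params, data)
        params = [p - learning_rate * g for p, g in zip(params, grads)]  # gradient descent
    return params


random.seed(0)
# Hypothetical target curve y = tanh(2x), sampled on [-1, 1].
data = [(i / 10.0, math.tanh(2.0 * i / 10.0)) for i in range(-10, 11)]
initial = [random.uniform(-1.0, 1.0) for _ in range(4)]   # cold start: random weights
trained = train(initial, data)
print(loss(initial, data), loss(trained, data))
```

The printed pair compares the error of the random cold-start model with the error after training, so we can see whether the adjustments led to a better-performing model.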

One way to think about gradient descent, backpropagation, and perhaps all machine learning is this: “Machine Learning is the set of mathematical algorithms that learn from experience. Good judgment comes from experience. And experience comes from bad judgment.”

In our case, the initial guess for our random cold-start model can be considered “bad judgment”, but then experience (i.e., the feedback from validation metrics, as used in gradient descent) brings “good judgment” (better models) into our model-building workflow.

Here are ten examples of cold-start problems in data science where the algorithms and techniques of machine learning produce the good judgment in model progression toward the optimal solution:

Finally, as a bonus, we mention a special case, Recommender Engines, where the cold-start problem is a subject of ongoing research.

The research challenge is to find the optimal recommendation for a new customer or for a new product that has not been seen before.

Check out these articles related to this challenge:

We started this article mentioning Confucius and his wisdom.

Here is another form of wisdom: https://rapidminer.com/wisdom/ (the RapidMiner Wisdom conference).

It is a wonderful conference, with many excellent tutorials, use cases, applications, and customer testimonials.

I was honored to be the keynote speaker for their 2018 conference in New Orleans, where I spoke about “Clearing the Fog around Data Science and Machine Learning: The Usual Suspects in Some Unusual Places”.

You can find my slide presentation here: KirkBorne-RMWisdom2018.pdf

NOTE: Genetic Algorithms (GAs) are an example of meta-learning.

They are not machine learning algorithms in themselves, but GAs can be applied across ensembles of machine learning models and tasks, in order to find the optimal model (perhaps globally optimal model) across a collection of locally optimal solutions.

Original.

Reposted with permission.

Bio: Kirk D. Borne is a Principal Data Scientist and Executive Advisor at Booz Allen Hamilton.
