Probability is a field of mathematics that quantifies uncertainty.
It is undeniably a pillar of the field of machine learning, and many recommend it as a prerequisite subject to study prior to getting started.
This is misleading advice, as probability makes more sense to a practitioner once they have the context of the applied machine learning process in which to interpret it.
In this post, you will discover why machine learning practitioners should study probabilities to improve their skills and capabilities.
After reading this post, you will know:
Let’s get started.
5 Reasons to Learn Probability for Machine Learning
Photo by Marco Verch, some rights reserved.
This tutorial is divided into seven parts; they are:
Before we go through the reasons that you should learn probability, let’s start off by taking a small look at the reason why you should not.
I think you should not study probability if you are just getting started with applied machine learning.
I recommend a breadth-first approach to getting started in applied machine learning.
I call this the results-first approach.
It is where you start by learning and practicing the steps for working through a predictive modeling problem end-to-end (e.g. how to get results) with a tool (such as scikit-learn and Pandas in Python).
This process then provides the skeleton and context for progressively deepening your knowledge, such as how algorithms work and, eventually, the math that underlies them.
After you know how to work through a predictive modeling problem, let’s look at why you should deepen your understanding of probability.
Classification predictive modeling problems are those where an example is assigned a given label.
An example that you may be familiar with is the iris flowers dataset where we have four measurements of a flower and the goal is to assign one of three different known species of iris flower to the observation.
We can model the problem as directly assigning a class label to each observation.
A more common approach is to frame the problem as a probabilistic class membership, where the probability of an observation belonging to each known class is predicted.
Framing the problem as a prediction of class membership simplifies the modeling problem and makes it easier for a model to learn.
It allows the model to capture ambiguity in the data, which allows a downstream process, such as the user, to interpret the probabilities in the context of the domain.
The probabilities can be transformed into a crisp class label by choosing the class with the largest probability.
The probabilities can also be scaled or transformed using a probability calibration process.
Choosing a class membership framing of the problem and interpreting the predictions made by the model both require a basic understanding of probability.
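As a minimal sketch of turning predicted probabilities into crisp labels, the snippet below picks the class with the largest probability for each observation. The probability values are hypothetical and chosen only for illustration; they are not the output of a fitted model, and the helper name `to_label` is an assumption of this sketch.

```python
# Hypothetical predicted class membership probabilities for three iris
# observations (illustrative values, not the output of a real model).
# Each row gives the probability of each species, in the order of `classes`.
classes = ["setosa", "versicolor", "virginica"]
predicted = [
    [0.97, 0.02, 0.01],  # a confident prediction
    [0.05, 0.55, 0.40],  # ambiguous between two species
    [0.10, 0.30, 0.60],
]


def to_label(probs, names):
    """Return the name of the class with the largest predicted probability."""
    best_index = max(range(len(probs)), key=lambda i: probs[i])
    return names[best_index]


labels = [to_label(row, classes) for row in predicted]

for row, label in zip(predicted, labels):
    print(row, "->", label)
```

Note that the second observation receives the same kind of crisp label as the first, even though its probabilities are far less decisive. That lost ambiguity is exactly what the probabilistic framing preserves.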
There are algorithms that are specifically designed to harness the tools and methods from probability.
These include individual algorithms, like the Naive Bayes algorithm, which is constructed using Bayes Theorem with some simplifying assumptions.
This also extends to whole fields of study, such as probabilistic graphical models, often called graphical models or PGMs for short, which are designed around Bayes Theorem.
A notable graphical model is the Bayesian Belief Network, or Bayes Net, which is capable of capturing the conditional dependencies between variables.
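Bayes Theorem itself is a short calculation: P(A|B) = P(B|A) * P(A) / P(B). The sketch below works through it for a spam-filter scenario. The scenario, the numbers, and the function name `bayes_posterior` are all assumptions made for illustration.

```python
def bayes_posterior(prior, likelihood, likelihood_given_not):
    """Compute P(A|B) with Bayes Theorem.

    prior:                P(A)
    likelihood:           P(B|A)
    likelihood_given_not: P(B|not A)
    P(B) is expanded with the law of total probability.
    """
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers, chosen only for illustration.
p_spam = 0.2             # P(spam)
p_word_given_spam = 0.6  # P(word appears | spam)
p_word_given_ham = 0.05  # P(word appears | not spam)

posterior = bayes_posterior(p_spam, p_word_given_spam, p_word_given_ham)
print(round(posterior, 3))  # probability of spam given the word appears
```

Naive Bayes applies the same rule with one likelihood term per input feature, under the simplifying assumption that the features are conditionally independent given the class.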
Many machine learning models are trained using an iterative algorithm designed under a probabilistic framework.
Perhaps the most common is the framework of maximum likelihood estimation, sometimes shortened to MLE.
This is a framework for estimating model parameters (e.g. weights) given observed data.
This is the framework that underlies the ordinary least squares estimate of a linear regression model.
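The link between maximum likelihood and least squares can be checked numerically. The sketch below assumes Gaussian noise with a fixed standard deviation, fits a one-variable linear regression with the closed-form least squares solution, and then shows that nudging either coefficient away from that solution increases the negative log-likelihood. The toy data and function names are hypothetical.

```python
import math

# Hypothetical toy data, roughly following y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def ols_fit(x, y):
    """Closed-form ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def gaussian_nll(slope, intercept, x, y, sigma=1.0):
    """Negative log-likelihood of the data, assuming Gaussian noise
    with standard deviation sigma around the regression line."""
    n = len(x)
    sse = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    return 0.5 * n * math.log(2 * math.pi * sigma ** 2) + sse / (2 * sigma ** 2)


slope, intercept = ols_fit(x, y)
best_nll = gaussian_nll(slope, intercept, x, y)

# Evaluate the negative log-likelihood at small perturbations of the solution.
shifts = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]
perturbed = [gaussian_nll(slope + ds, intercept + di, x, y) for ds, di in shifts]

print("OLS estimate:", slope, intercept)
print("NLL at OLS estimate:", best_nll)
print("NLL at perturbed estimates:", perturbed)
```

Because the only term of the negative log-likelihood that depends on the coefficients is the sum of squared errors, minimizing one minimizes the other.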
The expectation-maximization algorithm, or EM for short, is an approach for maximum likelihood estimation often used for unsupervised data clustering, e.g. estimating k means for k clusters, also known as the k-Means clustering algorithm.
For models that predict class membership, maximum likelihood estimation provides the framework for minimizing the difference or divergence between an observed and predicted probability distribution.
This is used in classification algorithms like logistic regression as well as deep learning neural networks.
It is common to measure this difference in probability distribution during training using entropy, e.g. via cross-entropy.
Entropy, cross-entropy, and differences between distributions measured via KL divergence come from the field of information theory, which builds directly upon probability theory.
For example, the entropy of a distribution is calculated directly as the expected negative log of the probability of its events.
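These three quantities are simple to compute for discrete distributions. The sketch below uses base-2 logarithms (so results are in bits) and two hypothetical distributions, `p` and `q`. It assumes `q` assigns non-zero probability wherever `p` does.

```python
import math


def entropy(p):
    """Expected negative log probability of events drawn from p, in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Expected negative log probability under q of events drawn from p.
    Assumes q[i] > 0 wherever p[i] > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """Extra bits needed to encode events from p using a code built for q.
    Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical observed (p) and predicted (q) distributions over three events.
p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]

print("entropy of p:", entropy(p))
print("cross-entropy of p and q:", cross_entropy(p, q))
print("KL divergence of p from q:", kl_divergence(p, q))
```

The three are related: the cross-entropy equals the entropy of `p` plus the KL divergence of `p` from `q`, which is why minimizing cross-entropy during training also minimizes the KL divergence when the observed distribution is fixed.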
It is common to tune the hyperparameters of a machine learning model, such as k for kNN or the learning rate in a neural network.
Typical approaches include grid searching ranges of hyperparameters or randomly sampling hyperparameter combinations.
Bayesian optimization is a more efficient approach to hyperparameter optimization that involves a directed search of the space of possible configurations based on those configurations that are most likely to result in better performance.
As its name suggests, the approach was devised from and harnesses Bayes Theorem when sampling the space of possible configurations.
For those algorithms where a prediction of probabilities is made, evaluation measures are required to summarize the performance of the model.
There are many measures used to summarize the performance of a model based on predicted probabilities.
Common examples include aggregate measures like log loss and Brier score.
For binary classification tasks where a single probability score is predicted, Receiver Operating Characteristic (ROC) curves can be constructed to explore the different cut-offs that can be used when interpreting the prediction, each of which results in a different trade-off between the true positive rate and the false positive rate.
The area under the ROC curve, or ROC AUC, can also be calculated as an aggregate measure.
Choice and interpretation of these scoring methods require a foundational understanding of probability theory.
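To make these measures concrete, here is a minimal pure-Python sketch of log loss, Brier score, and ROC AUC for a binary task. The labels and predicted probabilities are hypothetical. The AUC is computed through its ranking interpretation (the chance that a randomly chosen positive example is scored above a randomly chosen negative one) rather than by tracing the curve, and the function names are assumptions of this sketch.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of the true labels.
    Probabilities are clipped to avoid taking log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true, y_prob):
    """Fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [p for y, p in zip(y_true, y_prob) if y == 1]
    negatives = [p for y, p in zip(y_true, y_prob) if y == 0]
    score = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                score += 1.0
            elif pos == neg:
                score += 0.5
    return score / (len(positives) * len(negatives))


# Hypothetical true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

print("log loss:", log_loss(y_true, y_prob))
print("Brier score:", brier_score(y_true, y_prob))
print("ROC AUC:", roc_auc(y_true, y_prob))
```

Log loss and Brier score reward probabilities that are close to the true outcome, while ROC AUC only depends on how the examples are ranked, so a model can score well on one and poorly on another.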
If I could give one more reason, it would be: Because it is fun.
Seriously.
Learning probability, at least the way I teach it with practical examples and executable code, is a lot of fun.
Once you can see how the operations work on real data, it is hard to avoid developing a strong intuition for a subject that is often quite unintuitive.
Do you have more reasons why it is critical for an intermediate machine learning practitioner to learn probability?
Let me know in the comments below.
This section provides more resources on the topic if you are looking to go deeper.
In this post, you discovered why, as a machine learning practitioner, you should deepen your understanding of probability.
Specifically, you learned:
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.