Last Updated on October 7, 2019 Classification is a predictive modeling problem that involves assigning a label to a given input data sample.

The problem of classification predictive modeling can be framed as calculating the conditional probability of a class label given a data sample.

Bayes Theorem provides a principled way for calculating this conditional probability, although in practice requires an enormous number of samples (very large-sized dataset) and is computationally expensive.

Instead, the calculation of Bayes Theorem can be simplified by making some assumptions, such as each input variable is independent of all other input variables.

Although a dramatic and unrealistic assumption, this has the effect of making the calculations of the conditional probability tractable and results in an effective classification model referred to as Naive Bayes.

In this tutorial, you will discover the Naive Bayes algorithm for classification predictive modeling.

After completing this tutorial, you will know:Discover bayes opimization, naive bayes, maximum likelihood, distributions, cross entropy, and much more in my new book, with 28 step-by-step tutorials and full Python source code.

Let’s get started.

How to Develop a Naive Bayes Classifier from Scratch in PythonPhoto by Ryan Dickey, some rights reserved.

This tutorial is divided into five parts; they are:In machine learning, we are often interested in a predictive modeling problem where we want to predict a class label for a given observation.

For example, classifying the species of plant based on measurements of the flower.

Problems of this type are referred to as classification predictive modeling problems, as opposed to regression problems that involve predicting a numerical value.

The observation or input to the model is referred to as X and the class label or output of the model is referred to as y.

Together, X and y represent observations collected from the domain, i.

e.

a table or matrix (columns and rows or features and samples) of training data used to fit a model.

The model must learn how to map specific examples to class labels or y = f(X) that minimized the error of misclassification.

One approach to solving this problem is to develop a probabilistic model.

From a probabilistic perspective, we are interested in estimating the conditional probability of the class label, given the observation.

For example, a classification problem may have k class labels y1, y2, …, yk and n input variables, X1, X2, …, Xn.

We can calculate the conditional probability for a class label with a given instance or set of input values for each column x1, x2, …, xn as follows:The conditional probability can then be calculated for each class label in the problem and the label with the highest probability can be returned as the most likely classification.

The conditional probability can be calculated using the joint probability, although it would be intractable.

Bayes Theorem provides a principled way for calculating the conditional probability.

The simple form of the calculation for Bayes Theorem is as follows:Where the probability that we are interested in calculating P(A|B) is called the posterior probability and the marginal probability of the event P(A) is called the prior.

We can frame classification as a conditional classification problem with Bayes Theorem as follows:The prior P(yi) is easy to estimate from a dataset, but the conditional probability of the observation based on the class P(x1, x2, …, xn | yi) is not feasible unless the number of examples is extraordinarily large, e.

g.

large enough to effectively estimate the probability distribution for all different possible combinations of values.

As such, the direct application of Bayes Theorem also becomes intractable, especially as the number of variables or features (n) increases.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Download Your FREE Mini-CourseThe solution to using Bayes Theorem for a conditional probability classification model is to simplify the calculation.

The Bayes Theorem assumes that each input variable is dependent upon all other variables.

This is a cause of complexity in the calculation.

We can remove this assumption and consider each input variable as being independent from each other.

This changes the model from a dependent conditional probability model to an independent conditional probability model and dramatically simplifies the calculation.

First, the denominator is removed from the calculation P(x1, x2, …, xn) as it is a constant used in calculating the conditional probability of each class for a given instance and has the effect of normalizing the result.

Next, the conditional probability of all variables given the class label is changed into separate conditional probabilities of each variable value given the class label.

These independent conditional variables are then multiplied together.

For example:This calculation can be performed for each of the class labels, and the label with the largest probability can be selected as the classification for the given instance.

This decision rule is referred to as the maximum a posteriori, or MAP, decision rule.

This simplification of Bayes Theorem is common and widely used for classification predictive modeling problems and is generally referred to as Naive Bayes.

The word “naive” is French and typically has a diaeresis (umlaut) over the “i”, which is commonly left out for simplicity, and “Bayes” is capitalized as it is named for Reverend Thomas Bayes.

Now that we know what Naive Bayes is, we can take a closer look at how to calculate the elements of the equation.

The calculation of the prior P(yi) is straightforward.

It can be estimated by dividing the frequency of observations in the training dataset that have the class label by the total number of examples (rows) in the training dataset.

For example:The conditional probability for a feature value given the class label can also be estimated from the data.

Specifically, those data examples that belong to a given class, and one data distribution per variable.

This means that if there are K classes and n variables, that k * n different probability distributions must be created and maintained.

A different approach is required depending on the data type of each feature.

Specifically, the data is used to estimate the parameters of one of three standard probability distributions.

In the case of categorical variables, such as counts or labels, a multinomial distribution can be used.

If the variables are binary, such as yes/no or true/false, a binomial distribution can be used.

If a variable is numerical, such as a measurement, often a Gaussian distribution is used.

These three distributions are so common that the Naive Bayes implementation is often named after the distribution.

For example:A dataset with mixed data types for the input variables may require the selection of different types of data distributions for each variable.

Using one of the three common distributions is not mandatory; for example, if a real-valued variable is known to have a different specific distribution, such as exponential, then that specific distribution may be used instead.

If a real-valued variable does not have a well-defined distribution, such as bimodal or multimodal, then a kernel density estimator can be used to estimate the probability distribution instead.

The Naive Bayes algorithm has proven effective and therefore is popular for text classification tasks.

The words in a document may be encoded as binary (word present), count (word occurrence), or frequency (tf/idf) input vectors and binary, multinomial, or Gaussian probability distributions used respectively.

In this section, we will make the Naive Bayes calculation concrete with a small example on a machine learning dataset.

We can generate a small contrived binary (2 class) classification problem using the make_blobs() function from the scikit-learn API.

The example below generates 100 examples with two numerical input variables, each assigned one of two classes.

Running the example generates the dataset and summarizes the size, confirming the dataset was generated as expected.

The “random_state” argument is set to 1, ensuring that the same random sample of observations is generated each time the code is run.

The input and output elements of the first five examples are also printed, showing that indeed, the two input variables are numeric and the class labels are either 0 or 1 for each example.

We will model the numerical input variables using a Gaussian probability distribution.

This can be achieved using the norm SciPy API.

First, the distribution can be constructed by specifying the parameters of the distribution, e.

g.

the mean and standard deviation, then the probability density function can be sampled for specific values using the norm.

pdf() function.

We can estimate the parameters of the distribution from the dataset using the mean() and std() NumPy functions.

The fit_distribution() function below takes a sample of data for one variable and fits a data distribution.

Recall that we are interested in the conditional probability of each input variable.

This means we need one distribution for each of the input variables, and one set of distributions for each of the class labels, or four distributions in total.

First, we must split the data into groups of samples for each of the class labels.

We can then use these groups to calculate the prior probabilities for a data sample belonging to each group.

This will be 50% exactly given that we have created the same number of examples in each of the two classes; nevertheless, we will calculate these priors for completeness.

Finally, we can call the fit_distribution() function that we defined to prepare a probability distribution for each variable, for each class label.

Tying this all together, the complete probabilistic model of the dataset is listed below.

Running the example first splits the dataset into two groups for the two class labels and confirms the size of each group is even and the priors are 50%.

Probability distributions are then prepared for each variable for each class label and the mean and standard deviation parameters of each distribution are reported, confirming that the distributions differ.

Next, we can use the prepared probabilistic model to make a prediction.

The independent conditional probability for each class label can be calculated using the prior for the class (50%) and the conditional probability of the value for each variable.

The probability() function below performs this calculation for one input example (array of two values) given the prior and conditional probability distribution for each variable.

The value returned is a score rather than a probability as the quantity is not normalized, a simplification often performed when implementing naive bayes.

We can use this function to calculate the probability for an example belonging to each class.

First, we can select an example to be classified; in this case, the first example in the dataset.

Next, we can calculate the score of the example belonging to the first class, then the second class, then report the results.

The class with the largest score will be the resulting classification.

Tying this together, the complete example of fitting the Naive Bayes model and using it to make one prediction is listed below.

Running the example first prepares the prior and conditional probabilities as before, then uses them to make a prediction for one example.

The score of the example belonging to y=0 is about 0.

3 (recall this is an unnormalized probability), whereas the score of the example belonging to y=1 is 0.

0.

Therefore, we would classify the example as belonging to y=0.

In this case, the true or actual outcome is known, y=0, which matches the prediction by our Naive Bayes model.

In practice, it is a good idea to use optimized implementations of the Naive Bayes algorithm.

The scikit-learn library provides three implementations, one for each of the three main probability distributions; for example, BernoulliNB, MultinomialNB, and GaussianNB for binomial, multinomial and Gaussian distributed input variables respectively.

To use a scikit-learn Naive Bayes model, first the model is defined, then it is fit on the training dataset.

Once fit, probabilities can be predicted via the predict_proba() function and class labels can be predicted directly via the predict() function.

The complete example of fitting a Gaussian Naive Bayes model (GaussianNB) to the same test dataset is listed below.

Running the example fits the model on the training dataset, then makes predictions for the same first example that we used in the prior example.

In this case, the probability of the example belonging to y=0 is 1.

0 or a certainty.

The probability of y=1 is a very small value close to 0.

0.

Finally, the class label is predicted directly, again matching the ground truth for the example.

This section lists some practical tips when working with Naive Bayes models.

If the probability distribution for a variable is complex or unknown, it can be a good idea to use a kernel density estimator or KDE to approximate the distribution directly from the data samples.

A good example would be the Gaussian KDE.

By definition, Naive Bayes assumes the input variables are independent of each other.

This works well most of the time, even when some or most of the variables are in fact dependent.

Nevertheless, the performance of the algorithm degrades the more dependent the input variables happen to be.

The calculation of the independent conditional probability for one example for one class label involves multiplying many probabilities together, one for the class and one for each input variable.

As such, the multiplication of many small numbers together can become numerically unstable, especially as the number of input variables increases.

To overcome this problem, it is common to change the calculation from the product of probabilities to the sum of log probabilities.

For example:Calculating the natural logarithm of probabilities has the effect of creating larger (negative) numbers and adding the numbers together will mean that larger probabilities will be closer to zero.

The resulting values can still be compared and maximized to give the most likely class label.

This is often called the log-trick when multiplying probabilities.

As new data becomes available, it can be relatively straightforward to use this new data with the old data to update the estimates of the parameters for each variable’s probability distribution.

This allows the model to easily make use of new data or the changing distributions of data over time.

The probability distributions will summarize the conditional probability of each input variable value for each class label.

These probability distributions can be useful more generally beyond use in a classification model.

For example, the prepared probability distributions can be randomly sampled in order to create new plausible data instances.

The conditional independence assumption assumed may mean that the examples are more or less plausible based on how much actual interdependence exists between the input variables in the dataset.

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered the Naive Bayes algorithm for classification predictive modeling.

Specifically, you learned:Do you have any questions?.Ask your questions in the comments below and I will do my best to answer.

Develop Your Understanding of Probability .

with just a few lines of python codeDiscover how in my new Ebook: Probability for Machine LearningIt provides self-study tutorials and end-to-end projects on: Bayes Theorem, Bayesian Optimization, Distributions, Maximum Likelihood, Cross-Entropy, Calibrating Models and much more.

Finally Harness Uncertainty in Your Projects Skip the Academics.

Just Results.

.