A Gentle Introduction to Probability Density Estimation

Probability density is the relationship between observations and their probability.

Some outcomes of a random variable will have a low probability density and other outcomes will have a high probability density.

The overall shape of the probability density is referred to as a probability distribution, and the calculation of probabilities for specific outcomes of a random variable is performed by a probability density function, or PDF for short.

It is useful to know the probability density function for a sample of data in order to know whether a given observation is unlikely, or so unlikely as to be considered an outlier or anomaly and whether it should be removed.

It is also helpful in order to choose appropriate learning methods that require input data to have a specific probability distribution.

It is unlikely that the probability density function for a random sample of data is known.

As such, the probability density must be approximated using a process known as probability density estimation.

In this tutorial, you will discover a gentle introduction to probability density estimation.

After completing this tutorial, you will know:

- Histogram plots provide a fast and reliable way to visualize the probability density of a data sample.
- Parametric probability density estimation involves selecting a common distribution and estimating the parameters of the density function from a data sample.
- Nonparametric probability density estimation involves using a technique to fit a model to the arbitrary distribution of the data, like kernel density estimation.

Let’s get started.

A Gentle Introduction to Probability Density Estimation. Photo by Alistair Paterson, some rights reserved.

This tutorial is divided into four parts; they are: probability density, summarizing density with a histogram, parametric density estimation, and nonparametric density estimation.

A random variable x has a probability distribution p(x).

The relationship between the outcomes of a random variable and its probability is referred to as the probability density, or simply the “density.”

If a random variable is continuous, then the probability can be calculated via its probability density function, or PDF for short.

The shape of the probability density function across the domain for a random variable is referred to as the probability distribution, and common probability distributions have names, such as uniform, normal, exponential, and so on.

Given a random variable, we are interested in the density of its probabilities.

For example, given a random sample of a variable, we might want to know things like the shape of the probability distribution, the most likely value, the spread of values, and other properties.

Knowing the probability distribution for a random variable can help to calculate moments of the distribution, like the mean and variance, but can also be useful for other more general considerations, like determining whether an observation is unlikely or very unlikely and might be an outlier or anomaly.

The problem is, we may not know the probability distribution for a random variable.

We rarely do know the distribution because we don’t have access to all possible outcomes for a random variable.

In fact, all we have access to is a sample of observations.

As such, we must estimate the probability distribution from the sample of observations we do have.

This problem is referred to as probability density estimation, or simply “density estimation,” as we are using the observations in a random sample to estimate the general density of probabilities beyond just the sample of data we have available.

There are a few steps in the process of density estimation for a random variable.

The first step is to review the density of observations in the random sample with a simple histogram.

From the histogram, we might be able to identify a common and well-understood probability distribution that can be used, such as a normal distribution.

If not, we may have to fit a model to estimate the distribution.

In the following sections, we will take a closer look at each one of these steps in turn.

We will focus on univariate data, e.g. one random variable, in this post for simplicity.

Although the steps are applicable for multivariate data, they can become more challenging as the number of variables increases.

The first step in density estimation is to create a histogram of the observations in the random sample.

A histogram is a plot that involves first grouping the observations into bins and counting the number of events that fall into each bin.

The counts, or frequencies of observations, in each bin are then plotted as a bar graph with the bins on the x-axis and the frequency on the y-axis.

The choice of the number of bins is important as it controls the coarseness of the distribution (number of bars) and, in turn, how well the density of the observations is plotted.

It is a good idea to experiment with different bin sizes for a given data sample to get multiple perspectives or views on the same data.

For example, observations between 1 and 100 could be split into 3 bins (1-33, 34-66, 67-100), which might be too coarse, or 10 bins (1-10, 11-20, … 91-100), which might better capture the density.

A histogram can be created using the Matplotlib library and the hist() function.

The data is provided as the first argument, and the number of bins is specified via the “bins” argument either as an integer (e.g. 10) or as a sequence of the boundaries of each bin (e.g. [1, 34, 67, 100]).

The snippet below creates a histogram with 10 bins for a data sample.
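For example, a minimal sketch of the call might look as follows (a placeholder sample is generated here purely for illustration; substitute your own observations):

# sketch: histogram of a data sample with 10 bins
from numpy.random import normal
from matplotlib import pyplot
data = normal(size=1000)  # placeholder observations for illustration
pyplot.hist(data, bins=10)
pyplot.show()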

We can create a random sample drawn from a normal distribution and pretend we don’t know the distribution, then create a histogram of the data.

The normal() NumPy function will achieve this, and we will generate 1,000 samples with a mean of 0 and a standard deviation of 1, e.g. a standard Gaussian.

The complete example is listed below.
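A minimal sketch of such an example, using NumPy’s normal() function and Matplotlib’s hist() function, might look as follows:

# histogram of a random sample from a standard Gaussian (a minimal sketch)
from numpy.random import normal
from matplotlib import pyplot
# generate 1,000 observations with a mean of 0 and a standard deviation of 1
sample = normal(loc=0, scale=1, size=1000)
# plot a histogram of the sample with 10 bins
pyplot.hist(sample, bins=10)
pyplot.show()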

Running the example draws a sample of random observations and creates the histogram with 10 bins.

We can clearly see the shape of the normal distribution.

Note that your results will differ given the random nature of the data sample.

Try running the example a few times.

Histogram Plot With 10 Bins of a Random Data Sample

Running the example with bins set to 3 makes the normal distribution less obvious.

Histogram Plot With 3 Bins of a Random Data Sample

Reviewing a histogram of a data sample with a range of different numbers of bins will help to identify whether the density looks like a common probability distribution or not.

In most cases, you will see a unimodal distribution, such as the familiar bell shape of the normal, the flat shape of the uniform, or the descending or ascending shape of an exponential or Pareto distribution.

You might also see complex distributions, such as two peaks that don’t disappear with different numbers of bins (a bimodal distribution), or more than two peaks (a multimodal distribution).

You might also see a large spike in density for a given value or small range of values indicating outliers, often occurring on the tail of a distribution far away from the rest of the density.

The shape of a histogram of most random samples will match a well-known probability distribution.

The common distributions are common because they occur again and again in different and sometimes unexpected domains.

Get familiar with the common probability distributions as it will help you to identify a given distribution from a histogram.

Once identified, you can attempt to estimate the density of the random variable with a chosen probability distribution.

This can be achieved by estimating the parameters of the distribution from a random sample of data.

For example, the normal distribution has two parameters: the mean and the standard deviation.

Given these two parameters, we now know the probability density function.

These parameters can be estimated from data by calculating the sample mean and sample standard deviation.
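For example, NumPy’s mean() and std() functions can be used directly; a small sketch with a hypothetical sample is shown below:

# sketch: estimate normal distribution parameters from a data sample
from numpy import mean, std
from numpy.random import normal
# hypothetical sample of observations
sample = normal(loc=50, scale=5, size=100)
# calculate the sample mean and sample standard deviation
sample_mean = mean(sample)
sample_std = std(sample)
print('Mean=%.3f, Standard Deviation=%.3f' % (sample_mean, sample_std))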

We refer to this process as parametric density estimation.

The reason is that we are using predefined functions to summarize the relationship between observations and their probability that can be controlled or configured with parameters, hence “parametric.”

Once we have estimated the density, we can check if it is a good fit.

This can be done in many ways, such as plotting the estimated density function and comparing its shape to the histogram, or using a statistical test to confirm the data fits the chosen distribution.

We can demonstrate this with an example.

We can generate a random sample of 100 observations from a normal distribution with a mean of 50 and a standard deviation of 5.

We can then pretend that we don’t know the probability distribution and maybe look at a histogram and guess that it is normal.

Assuming that it is normal, we can then calculate the parameters of the distribution, specifically the mean and standard deviation.

We would not expect the mean and standard deviation to be 50 and 5 exactly given the small sample size and noise in the sampling process.

We can then fit the distribution with these parameters, the so-called parametric density estimation of our data sample.

In this case, we can use the norm() SciPy function.
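A sketch of this step, assuming sample_mean and sample_std have already been estimated from the data:

# sketch: define the estimated normal distribution with SciPy
from scipy.stats import norm
sample_mean, sample_std = 50.0, 5.0  # assumed parameter estimates for illustration
dist = norm(sample_mean, sample_std)
# query the probability density at a single value
print(dist.pdf(50))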

We can then sample the probabilities from this distribution for a range of values in our domain, in this case between 30 and 70.

Finally, we can plot a histogram of the data sample and overlay a line plot of the probabilities calculated for the range of values from the PDF.

Importantly, we can convert the counts or frequencies in each bin of the histogram to a normalized probability to ensure the y-axis of the histogram matches the y-axis of the line plot.

This can be achieved by setting the “density” argument to “True” in the call to hist().

Tying these snippets together, the complete example of parametric density estimation is listed below.
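A sketch of such a complete example might look like the following:

# parametric density estimation with a normal distribution (a minimal sketch)
from numpy import mean, std, arange
from numpy.random import normal
from scipy.stats import norm
from matplotlib import pyplot
# generate a sample of 100 observations with a mean of 50 and a standard deviation of 5
sample = normal(loc=50, scale=5, size=100)
# estimate the distribution parameters from the sample
sample_mean, sample_std = mean(sample), std(sample)
print('Mean=%.3f, Standard Deviation=%.3f' % (sample_mean, sample_std))
# define the estimated distribution
dist = norm(sample_mean, sample_std)
# sample the PDF across the range of the domain
values = arange(30, 70, 0.5)
probabilities = [dist.pdf(value) for value in values]
# plot the normalized histogram and the PDF overlay
pyplot.hist(sample, bins=10, density=True)
pyplot.plot(values, probabilities)
pyplot.show()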

Running the example first generates the data sample, then estimates the parameters of the normal probability distribution.

Note that your results will differ given the random nature of the data sample.

Try running the example a few times.

In this case, we can see that the mean and standard deviation have some noise and are slightly different from the expected values of 50 and 5 respectively.

The noise is minor and the distribution is expected to still be a good fit.

Next, the PDF is fit using the estimated parameters and the histogram of the data with 10 bins is compared to probabilities for a range of values sampled from the PDF.

We can see that the PDF is a good match for our data.

Data Sample Histogram With Probability Density Function Overlay for the Normal Distribution

It is possible that the data does match a common probability distribution, but requires a transformation before parametric density estimation.

For example, you may have outlier values that are far from the mean or center of mass of the distribution.

This may have the effect of giving incorrect estimates of the distribution parameters and, in turn, causing a poor fit to the data.

These outliers should be removed prior to estimating the distribution parameters.

Another example is that the data may be skewed or shifted left or right.

In this case, you might need to transform the data prior to estimating the parameters, such as taking the log or square root, or more generally, using a power transform like the Box-Cox transform.

These types of modifications to the data may not be obvious, and effective parametric density estimation may require an iterative process of estimating the distribution parameters, reviewing the resulting PDF against the data, and transforming the data to better fit the distribution, repeated until the fit is good enough.

In some cases, a data sample may not resemble a common probability distribution or cannot be easily made to fit the distribution.

This is often the case when the data has two peaks (bimodal distribution) or many peaks (multimodal distribution).

In this case, parametric density estimation is not feasible and alternative methods can be used that do not use a common distribution.

Instead, an algorithm is used to approximate the probability distribution of the data without a pre-defined distribution, referred to as a nonparametric method.

The distributions will still have parameters but are not directly controllable in the same way as simple probability distributions.

For example, a nonparametric method might estimate the density using all observations in a random sample, in effect making all observations in the sample “parameters.”

Perhaps the most common nonparametric approach for estimating the probability density function of a continuous random variable is called kernel smoothing, or kernel density estimation, KDE for short.

In this case, a kernel is a mathematical function that returns a probability for a given value of a random variable.

The kernel effectively smooths or interpolates the probabilities across the range of outcomes for a random variable such that the resulting density integrates to one, a requirement of well-behaved probabilities.

The kernel function weights the contribution of observations from a data sample based on their relationship or distance to a given query sample for which the probability is requested.

A parameter, called the smoothing parameter or the bandwidth, controls the scope, or window of observations, from the data sample that contributes to estimating the probability for a given sample.

As such, kernel density estimation is sometimes referred to as a Parzen-Rosenblatt window, or simply a Parzen window, after the developers of the method.

A large window may result in a coarse density with little details, whereas a small window may have too much detail and not be smooth or general enough to correctly cover new or unseen examples.

The contribution of samples within the window can be shaped using different functions, sometimes referred to as basis functions, e.g. uniform, normal, etc., with different effects on the smoothness of the resulting density function.

As such, it may be useful to experiment with different window sizes and different contribution functions and evaluate the results against histograms of the data.

We can demonstrate this with an example.

First, we can construct a bimodal distribution by combining samples from two different normal distributions.

Specifically, 300 examples with a mean of 20 and a standard deviation of 5 (the smaller peak), and 700 examples with a mean of 40 and a standard deviation of 5 (the larger peak).

The means were chosen close together to ensure the distributions overlap in the combined sample.

The complete example of creating this sample with a bimodal probability distribution and plotting the histogram is listed below.
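A sketch of such an example (the number of histogram bins here is an arbitrary choice):

# bimodal data sample built from two normal distributions (a minimal sketch)
from numpy import hstack
from numpy.random import normal
from matplotlib import pyplot
# generate the smaller and larger peaks
sample1 = normal(loc=20, scale=5, size=300)
sample2 = normal(loc=40, scale=5, size=700)
sample = hstack((sample1, sample2))
# plot the histogram of the combined sample
pyplot.hist(sample, bins=50)
pyplot.show()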

Running the example creates the data sample and plots the histogram.

Note that your results will differ given the random nature of the data sample.

Try running the example a few times.

We have fewer samples with a mean of 20 than samples with a mean of 40, which we can see reflected in the histogram with a larger density of samples around 40 than around 20.

Data with this distribution does not nicely fit into a common probability distribution, by design.

It is a good case for using a nonparametric kernel density estimation method.

Histogram Plot of Data Sample With a Bimodal Probability Distribution

The scikit-learn machine learning library provides the KernelDensity class that implements kernel density estimation.

First, the class is constructed with the desired bandwidth (window size) and kernel (basis function) arguments.

It is a good idea to test different configurations on your data.

In this case, we will try a bandwidth of 2 and a Gaussian kernel.

The class is then fit on a data sample via the fit() function.

The function expects the data to have a 2D shape with the form [rows, columns], therefore we can reshape our data sample to have 1,000 rows and 1 column.
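A sketch of these steps, reusing the bimodal sample constructed above:

# sketch: fit a kernel density estimation model with scikit-learn
from numpy import hstack
from numpy.random import normal
from sklearn.neighbors import KernelDensity
# bimodal sample, as constructed earlier
sample = hstack((normal(loc=20, scale=5, size=300), normal(loc=40, scale=5, size=700)))
# reshape to [rows, columns] as expected by fit()
sample = sample.reshape((len(sample), 1))
# bandwidth (window size) of 2 and a Gaussian kernel (basis function)
model = KernelDensity(bandwidth=2, kernel='gaussian')
model.fit(sample)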

We can then evaluate how well the density estimate matches our data by calculating the probabilities for a range of observations and comparing the shape to the histogram, just like we did for the parametric case in the prior section.

The score_samples() function on the KernelDensity will calculate the log probability for an array of samples.

We can create a range of samples from 1 to 60, about the range of our domain, calculate the log probabilities, then invert the log operation by calculating the exponent, or exp(), to recover the original probability densities.
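Continuing from the fitted model in the previous sketch:

# sketch: evaluate the estimated density over a range of values
from numpy import asarray, exp
values = asarray([value for value in range(1, 60)])
values = values.reshape((len(values), 1))
# log probability density for each value, then invert the log with exp()
log_probabilities = model.score_samples(values)
probabilities = exp(log_probabilities)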

Finally, we can create a histogram with normalized frequencies and an overlay line plot of values to estimated probabilities.

Tying this together, the complete example of kernel density estimation for a bimodal data sample is listed below.
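A sketch of such a complete example might look like the following:

# kernel density estimation for a bimodal data sample (a minimal sketch)
from numpy import hstack, asarray, exp
from numpy.random import normal
from matplotlib import pyplot
from sklearn.neighbors import KernelDensity
# construct the bimodal sample
sample1 = normal(loc=20, scale=5, size=300)
sample2 = normal(loc=40, scale=5, size=700)
sample = hstack((sample1, sample2))
# fit the kernel density model
model = KernelDensity(bandwidth=2, kernel='gaussian')
sample = sample.reshape((len(sample), 1))
model.fit(sample)
# evaluate the estimated density over a range of values
values = asarray([value for value in range(1, 60)])
values = values.reshape((len(values), 1))
probabilities = exp(model.score_samples(values))
# plot the normalized histogram and the estimated PDF
pyplot.hist(sample, bins=50, density=True)
pyplot.plot(values, probabilities)
pyplot.show()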

Running the example creates the data distribution, fits the kernel density estimation model, then plots the histogram of the data sample and the PDF from the KDE model.

Note that your results will differ given the random nature of the data sample.

Try running the example a few times.

In this case, we can see that the PDF is a good fit for the histogram.

It is not very smooth and could be made more so by setting the “bandwidth” argument to 3 samples or higher.

Experiment with different values of the bandwidth and the kernel function.

Histogram and Probability Density Function Plot Estimated via Kernel Density Estimation for a Bimodal Data Sample

The KernelDensity class is powerful and does support estimating the PDF for multidimensional data.

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered a gentle introduction to probability density estimation.

Specifically, you learned:

- Histogram plots provide a fast and reliable way to visualize the probability density of a data sample.
- Parametric probability density estimation involves selecting a common distribution and estimating the parameters of the density function from a data sample.
- Nonparametric probability density estimation involves using a technique to fit a model to the arbitrary distribution of the data, like kernel density estimation.

Do you have any questions? Ask your questions in the comments below and I will do my best to answer.
