Deep Learning Book Series 3.1 to 3.3 Probability Mass and Density Functions

The probabilities corresponding to every pair of values are written P(x = x, y = y) or P(x, y).

This is what we call the joint probability.

Example 3.

For example, let’s calculate the probability of getting a 1 with the first die and a 2 with the second. The two rolls are independent, so the probabilities multiply: P(x = 1, y = 2) = 1/6 × 1/6 = 1/36 ≈ 0.028.

Properties of a probability mass function

A function is a probability mass function if:

∀ x ∈ x, 0 ≤ P(x) ≤ 1

The symbol ∀ means “for any”.

This means that for every possible value x in the range of x (in the example of a die rolling experiment, all possible values are 1, 2, 3, 4, 5 and 6), the probability that the outcome corresponds to this value is between 0 and 1.

A probability of 0 means that the event is impossible and a probability of 1 means that you can be sure that the outcome will correspond to this value.

In the example of the dice, the probability of each possible value is 1/6 which is between 0 and 1.

This property is fulfilled.

The second property is ∑ P(x) = 1, where the sum runs over every possible value x.

This means that the sum of the probabilities associated with each possible value is equal to 1.

In the example of the dice experiment, we can see that there are 6 possible outcomes, each with a probability of 1/6 giving a total of 1/6 * 6 = 1.

This property is fulfilled.
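If you want to check these numbers yourself, here is a minimal sketch of the dice example. It assumes two fair, independent dice (which is what the 1/6 × 1/6 calculation in Example 3 relies on); the names die_pmf and joint_pmf are just illustrative.

```python
from fractions import Fraction
from itertools import product

# Probability mass function of one fair six-sided die (assumption: fair die).
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

# Joint probability of two independent dice: P(x, y) = P(x) * P(y).
joint_pmf = {
    (x, y): die_pmf[x] * die_pmf[y]
    for x, y in product(die_pmf, repeat=2)
}

# Probability of a 1 with the first die and a 2 with the second.
print(joint_pmf[(1, 2)])                    # 1/36
print(round(float(joint_pmf[(1, 2)]), 3))   # 0.028

# Property 1: every probability lies between 0 and 1.
print(all(0 <= p <= 1 for p in die_pmf.values()))  # True

# Property 2: the probabilities sum to 1.
print(sum(die_pmf.values()))                # 1
```

Using Fraction instead of floats keeps the sums exact, so the second property can be tested with a plain equality.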



3.2 Continuous Variable and Probability Density Function

Some variables are not discrete.

They can take an infinite number of values in a certain range.

But we still need to describe the probability associated with outcomes.

The equivalent of the probability mass function for a continuous variable is called the probability density function.

In the case of the probability mass function, we saw that the y-axis gives a probability.

For instance, in the plot we created with Python, the probability to get a 1 was equal to 1/6 ≈ 0.17 (check on the plot above).

It is 1/6 because it is one possibility over 6 total possibilities.

However, we can’t do this for continuous variables because the total number of possibilities is infinite.

For instance, if we draw a number between 0 and 1, we have an infinite number of possible outcomes.


In the example above, we had 6 possible outcomes, leading to probabilities around 1/6.

Now, we have each probability equal to 1/+∞ = 0.

Such a function would not be very useful.

For that reason, the y-axis of the probability density function doesn’t represent probability values.

To get the probability, we need to calculate the area under the curve (we will see below some details about the area under the curve).

The advantage is that it leads to the probabilities according to a certain range (on the x-axis): the area under the curve increases if the range increases.

Let’s see some examples to clarify all of this.

Example 4.

Let’s say that we have a random variable x that can take values between 0 and 1.

Here is its probability density function:

[Figure: Probability density function]

We can see that 0 seems to be impossible (density around 0), and so does 1.

The peak around 0.3 means that we will get a lot of outcomes around this value.

The probability that the outcome falls in a certain range of values is obtained by calculating the area under the curve over this range.

For example, the probability of drawing a value between 0.5 and 0.6 corresponds to the following area:

[Figure: Probability density function and area under the curve between 0.5 and 0.6]


We can easily see that if we increase the range, the probability (the area under the curve) will increase as well.

For instance, for the range of 0.5 to 0.7:

[Figure: Probability density function and area under the curve between 0.5 and 0.7]


We will see in a moment how to calculate the area under the curve and get the probability associated with a specific range.

Properties of the probability density function

These differences between the probability mass function and the probability density function lead to different properties for the probability density function:

∀ x ∈ x, p(x) ≥ 0

In this case, p(x) is not necessarily less than 1 because it doesn’t correspond to the probability (the probability itself will still need to be between 0 and 1).

Example 5.

For instance, let’s say that we have a continuous random variable that can take values between 0 and 0.5.

This variable is described by a uniform distribution, so we will have the following probability density function:

[Figure: Probability density function (uniform distribution)]

We can see that the y-values are greater than 1, yet the area under the curve is equal to 1 (2 * 0.5).

The probability is given by the area under the curve and thus it depends on the x-axis as well.
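To make the difference between density and probability concrete, here is a small sketch for this uniform distribution on [0, 0.5]. The helper names uniform_pdf and uniform_probability are made up for this illustration; the probability is simply the width of the range multiplied by the height of the density.

```python
def uniform_pdf(x, low=0.0, high=0.5):
    """Density of a uniform distribution on [low, high]."""
    return 1.0 / (high - low) if low <= x <= high else 0.0


def uniform_probability(a, b, low=0.0, high=0.5):
    """P(a <= x <= b): width of the overlap with [low, high] times the height."""
    overlap = max(0.0, min(b, high) - max(a, low))
    return overlap / (high - low)


# The density (y-value) is greater than 1...
print(uniform_pdf(0.25))               # 2.0

# ...but the probabilities (areas) stay between 0 and 1.
print(uniform_probability(0.1, 0.2))   # close to 0.2
print(uniform_probability(0.0, 0.5))   # 1.0
```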

If you would like to see this for yourself, we will reproduce this example in Python.

To do that, we will create a random variable x that takes a random value between 0 and 0.5.

We will sample from the uniform distribution with the Numpy function random.uniform().

The parameters of this function are the lowest value (included), the highest value (not included) and the number of samples.

So np.random.uniform(0, 0.5, 10000) will create 10000 values randomly chosen to be ≥ 0 and < 0.5.
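Here is a minimal sketch of this sampling step. Instead of a plotting function, it uses np.histogram with density=True to get the bar heights as numbers, so it runs without any plotting library. The seed and the choice of 10 bins are my assumptions, only there to make the run reproducible.

```python
import numpy as np

np.random.seed(0)  # assumption: fixed seed, only for reproducibility
x = np.random.uniform(0, 0.5, 10000)

# All samples fall in [0, 0.5): lowest value included, highest excluded.
print(x.min() >= 0, x.max() < 0.5)   # True True

# density=True rescales the bars so that their total area is 1.
heights, edges = np.histogram(x, bins=10, range=(0, 0.5), density=True)
print(np.round(heights, 2))          # bar heights, all close to 2

# Total area of the bars: sum of height * width.
area = np.sum(heights * np.diff(edges))
print(area)                          # close to 1
```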


Looks good! We can see that the shape looks like what I drew above, with y-axis values around 2 for all x between 0 and 0.5.


However, one thing can be intriguing in this plot.

We talked about a continuous variable, yet here we have represented the distribution with bars.

The explanation is the same as before: we need to discretise the function to count the number of outcomes in each interval.

Actually, the number of intervals is a parameter of the function distplot().

Let's try to use a lot of bins:

We can see that we are still around 2 but that the variability is greater than before (the bars can go from 1 to 4, which was not the case in the last plot).

Any idea why?

This is because we took more bins, so fewer values fell into each bin, leading to a less accurate estimate.
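The effect of the number of bins can also be seen without a plot. The sketch below is only an illustration: bar_height_spread is a hypothetical helper, and the seed and bin counts are my assumptions. It returns the lowest and highest bar of a density histogram built from uniform samples.

```python
import numpy as np


def bar_height_spread(n_samples, n_bins, seed=0):
    """Return (lowest, highest) bar height of a density histogram
    built from n_samples uniform samples on [0, 0.5)."""
    rng = np.random.RandomState(seed)
    x = rng.uniform(0, 0.5, n_samples)
    heights, _ = np.histogram(x, bins=n_bins, range=(0, 0.5), density=True)
    return heights.min(), heights.max()


# Few bins: many samples per bin, so every bar is close to 2.
print(bar_height_spread(10000, 10))

# Many bins: only about 10 samples per bin, so the bars vary a lot more.
print(bar_height_spread(10000, 1000))
```

Since n_samples is a parameter, the same helper can be reused with other sample sizes.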

If this hypothesis is true, we could reduce this variability by increasing the number of samples.

Let’s try that:

That’s great! We can now go to the second property:

∫ p(x) dx = 1

For the probability mass function, we have seen that the sum of the probabilities has to be equal to 1.

This is not the case for the probability density functions since the probability corresponds to the area under the curve and not directly to y values.

However, this means that the area under the curve has to be equal to 1.

We saw in the last example that the area was actually equal to 1.

It can be easily obtained and visualised because of the rectangular shape of the uniform distribution.

It is thus possible to multiply the height by the width: 2 * 0.5 = 1.

However, in many cases, the shape is not a rectangle and we still need to calculate the area under the curve.

Let’s see how to do this!

Area under the curve

The area under the curve of a function for a specific range of values can be calculated with the integral of the function.
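As a rough numerical illustration, the sketch below approximates such an area with the trapezoid rule. The density used here is a hypothetical stand-in (a Beta(4, 8) shape, which lives on [0, 1] and peaks at 0.3); it is not the exact curve plotted in Example 4. The function names are mine as well.

```python
import numpy as np


def pdf(x):
    """Hypothetical density on [0, 1] with a peak at 0.3 (Beta(4, 8) shape)."""
    return 1320 * x**3 * (1 - x)**7


def area_under_curve(f, a, b, n_points=10001):
    """Approximate the integral of f over [a, b] with the trapezoid rule."""
    x = np.linspace(a, b, n_points)
    y = f(x)
    return np.sum((y[:-1] + y[1:]) / 2 * np.diff(x))


p_small = area_under_curve(pdf, 0.5, 0.6)   # probability of [0.5, 0.6]
p_large = area_under_curve(pdf, 0.5, 0.7)   # probability of [0.5, 0.7]
total = area_under_curve(pdf, 0.0, 1.0)     # whole range

print(p_small, p_large)   # the wider range gives the larger probability
print(total)              # close to 1
```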

We will see that calculating the integral of a function is the opposite of calculating the derivative.

This means that if you differentiate a function f(x) and calculate the integral of the resulting function f’(x), you will get back f(x) (up to a constant, as we will see).

The derivative of a function at a point gives its rate of change at that point.

What is the link between the function describing the rate of change of another function (the derivative) and the area under the curve?

Let’s start with a point on derivatives! Then, with the next graphical example, it will be crystal clear.

Example 6.

We want to model the speed of a vehicle.

Let’s say that the function f(x) = x² defines its speed (y-axis) as a function of time (x-axis).

First, we will plot the function f(x) = x² to see its shape:

The shape is a parabola! It shows that the speed increases slowly at the beginning but then increases by larger and larger amounts over equal durations.

I have created a variable x (with the function arange() from Numpy) that contains all the points of the x-axis.

So it is just all values from -10 to 10 (10 excluded) with a step of 0.1.

Let's see the first 10 values:

array([-10. , -9.9, -9.8, -9.7, -9.6, -9.5, -9.4, -9.3, -9.2, -9.1])

Here is the doc of the arange() function from Numpy.

In our example, the function defines the speed of the vehicle as a function of time so it doesn’t make sense to have negative values.

Let’s take only the positive part of the x-axis to avoid negative time (we’ll say that 0 is the start of the experiment).
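Here is a minimal sketch of how these x-axis values can be built with arange(). The plotting call is left out, and the variable names x_pos and speed are my own. I build the non-negative axis directly from 0 (rather than filtering x), which is my choice to avoid tiny floating point residues around 0.

```python
import numpy as np

# All x-axis points from -10 up to (but not including) 10, with a step of 0.1.
x = np.arange(-10, 10, 0.1)
print(x[:10])

# Keep only non-negative times: 0 is the start of the experiment.
x_pos = np.arange(0, 10, 0.1)

# Speed of the vehicle: f(x) = x².
speed = x_pos ** 2
print(x_pos[:3], speed[:3])
```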

Ok, that’s better!

The derivative of this function is f’(x) = 2x.

To have more information about derivative rules, check here.

Here is a plot of f’(x):

Derivative

This representation of the derivative shows the acceleration.

f(x) describes the speed of the vehicle as a function of time, and the derivative f’(x) shows the rate of change of the speed as a function of time, that is, the acceleration.

We can see that the acceleration of the vehicle increases linearly with time.

The derivative tells us that the rate of change of the vehicle speed is 2x.

For instance, when x=0, the rate of change is equal to 2 * 0 = 0, so the speed is not changing.

When x=3, the rate of change is 2 * 3 = 6.

This means that at this point, the speed is increasing at a rate of 6 units per unit of time.
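These rates of change can be checked numerically. The sketch below estimates the derivative with a central finite difference; numerical_derivative and the step h are illustrative choices of mine, not something from the book.

```python
def f(x):
    """Speed of the vehicle as a function of time: f(x) = x²."""
    return x ** 2


def numerical_derivative(func, x, h=1e-5):
    """Central finite-difference estimate of the derivative of func at x."""
    return (func(x + h) - func(x - h)) / (2 * h)


# Rate of change at x = 0 and x = 3; f'(x) = 2x predicts 0 and 6.
for t in (0.0, 3.0):
    print(t, numerical_derivative(f, t))
```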

To summarise, the derivative of a function gives its rate of change.

In our example, the rate of change was another function (f’(x) = 2x) but it can be a constant (the rate of change is always the same, e.g. f’(x) = 2) or a cubic function for instance (e.g. f’(x) = x³).

Integrals

Being able to calculate derivatives is very powerful, but is it possible to do the reverse: going from the rate of change to the change itself?

Whoah, this is cool! The answer is given by the integral of a function.

The integral of f’(x) gives us f(x) back.

The notation is the following:

∫ f’(x) dx = f(x)

This means that we integrate f’(x) to get back f(x).

The notation dx here means that we integrate over x, that is to say, that we sum slices weighted by y (see here).

If we take the last example again, we have:

∫ 2x dx = x² + c

We can see that there is a difference: the addition of a constant c.

This is because an infinite number of functions could have given the derivative 2x (for instance x² + 1 or x² + 294…).

We lose a bit of information and we can’t recover it.

And now, the graphical explanation (I love this one): we have seen that 2x is the function describing the rate of change (the slope) of the function x².

Now if we go from f’(x) to f(x), we can see that the area under the curve of f’(x) corresponds to f(x):

[Figure: The area under the curve of f’(x) corresponds to f(x)]

This shows how the integral and derivative are reverse operations.

This plot shows the function f’(x) = 2x, and we can see that the area under the curve grows faster and faster as the range widens (quadratically, not linearly).

This area is represented for different ranges ([0–0], [0–1], [0–2], [0–3]).
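These areas can also be approximated numerically by summing thin slices. The sketch below uses the trapezoid rule, which happens to be exact for a straight line like 2x; the helper name area_by_slices and the number of slices are my assumptions.

```python
def f_prime(x):
    """Rate of change of the speed: f'(x) = 2x."""
    return 2 * x


def area_by_slices(func, a, b, n_slices=1000):
    """Approximate the area under func between a and b by summing
    thin trapezoid slices."""
    width = (b - a) / n_slices
    total = 0.0
    for i in range(n_slices):
        left = a + i * width
        right = left + width
        total += (func(left) + func(right)) / 2 * width
    return total


# Areas for the ranges [0, 0], [0, 1], [0, 2], [0, 3]: close to 0, 1, 4, 9.
for upper in (0, 1, 2, 3):
    print(upper, round(area_by_slices(f_prime, 0, upper), 6))
```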

We can calculate the area under the curve (each area is a right triangle, so it is the base times the height divided by 2) and find the following values: 0, 1, 4, 9… This corresponds to the original function f(x) = x²!

Conclusion

To summarise, we have seen what a random variable is and how the distribution of probabilities can be expressed for discrete (probability mass function) and continuous variables (probability density function).

We also studied the concept of joint probability distribution and bedrock math tools like derivatives and integrals.

You now have all the tools to dive more into probability.

The next part will be about Chapters 3.4 and following.

We will see what we call marginal and conditional probability, the chain rule and the concept of independence.


I hope that this helped you to gain a better intuition on all of this! Feel free to contact me about any question/note/correction!
