How to measure distances in machine learning

Why do we need other types of distance? There are situations where Euclidean distance fails to give us the proper metric.

In those cases, we will need to make use of different distance functions.

2. Manhattan distance: Let’s say that we again want to calculate the distance between two points.

But this time, we want to do it along a grid-like path, like the purple line in the figure.

In this case, the relevant metric is Manhattan distance.

It is defined as the sum of the absolute differences of their Cartesian coordinates.

Let’s clarify this.

A data point has a set of numerical Cartesian coordinates that uniquely specify that point.

These coordinates are the signed distances from the point to two fixed perpendicular oriented lines (the axes), such as the ones displayed in the figure below.

This may also bring back some memories of math class, right?

[Figure: Cartesian coordinate system]

So, in our example, Manhattan distance is calculated as follows: get the difference in the x-axis (Δx = x2 − x1) and the difference in the y-axis (Δy = y2 − y1).

Then take their absolute values, |Δx| and |Δy|, and finally sum up both values.

In general, the formula is:

Manhattan distance = |Δx| + |Δy|, or for n-dimensional points, d(A, B) = Σ |ai − bi|

The Manhattan distance metric is also called L1 distance or the L1 norm.

If you are familiar with machine learning regularization, you probably heard this before.

It is advisable to use it when dealing with high dimensional data.

Also, if you are calculating errors, it is useful when you do not want to over-emphasize outliers, because its linear nature weights all differences equally.
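To make this concrete, here is a minimal NumPy sketch of the calculation (the points a and b are made-up examples):

import numpy as np

# Two hypothetical 2-D points.
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Manhattan (L1) distance: sum of absolute coordinate differences.
manhattan = np.sum(np.abs(a - b))
print(manhattan)  # |4-1| + |6-2| = 3.0 + 4.0 = 7.0

The same result comes from np.linalg.norm(a - b, ord=1), which computes the L1 norm of the difference vector.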

3. Minkowski distance: First of all, we will define some mathematical terms in order to define Minkowski distance afterward.

A vector space is a collection of objects called vectors that can be added together and multiplied by numbers (also called scalars).

A norm is a function that assigns a strictly positive length to each vector in a vector space (the only exception is the zero vector, whose length is zero).

It is usually represented as ∥x∥.

A normed vector space is a vector space over the real or complex numbers on which a norm is defined.
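As a quick sketch of these definitions, NumPy’s np.linalg.norm illustrates both properties (the vectors are made-up examples):

import numpy as np

v = np.array([3.0, 4.0])
zero = np.zeros(2)

# np.linalg.norm computes the Euclidean (L2) norm by default.
print(np.linalg.norm(v))     # 5.0: strictly positive for any non-zero vector
print(np.linalg.norm(zero))  # 0.0: the zero vector is the only exception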

What does this have to do with Minkowski distance? Minkowski distance is defined as a distance metric between two points in a normed vector space (an N-dimensional real space).

It also represents a generalized metric that includes Euclidean and Manhattan distances as special cases.

What does the formula look like?

D(X, Y) = (Σ |xi − yi|^λ)^(1/λ)

If we pay attention, when λ = 1 we have the Manhattan distance.

If λ = 2, we are in the presence of Euclidean distance.

There is another distance, called Chebyshev distance, that arises in the limit as λ → ∞.

Overall, we can change the value of λ to calculate the distance between two points in many ways.

When do we use it? Minkowski distance is frequently used when the variables of interest are measured on ratio scales with an absolute zero value.
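As an illustration, here is a minimal sketch of the general formula, using two made-up points; note how different values of λ recover the metrics above:

import numpy as np

def minkowski(x, y, lam):
    # Minkowski distance of order lam between two equal-length vectors.
    return np.sum(np.abs(x - y) ** lam) ** (1.0 / lam)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

print(minkowski(a, b, 1))     # 7.0 -> Manhattan distance (λ = 1)
print(minkowski(a, b, 2))     # 5.0 -> Euclidean distance (λ = 2)
print(np.max(np.abs(a - b)))  # 4.0 -> Chebyshev distance (the λ → ∞ limit)

SciPy users can get the same result with scipy.spatial.distance.minkowski(a, b, p=lam).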

4. Mahalanobis distance: When we need to calculate the distance between two points in multivariate space, we can use the Mahalanobis distance.

We talked before about the Cartesian coordinate system.

We drew perpendicular lines.

Then we calculated distances according to that axis-system.

This is very easy to do if our variables are not correlated, because the distances can be measured with a straight line.

Let’s say that two or more correlated variables are present.

We will also add that we are working with more than 3 dimensions.

Now, the problem gets complicated.

In those cases, Mahalanobis distance comes to the rescue.

It measures distance relative to the centroid for the multivariate data.

At this point, the means of all variables intersect.

Its formula is the following:

d(Xa, Xb) = √((Xa − Xb)ᵀ C⁻¹ (Xa − Xb))

where Xa and Xb are a pair of objects and C is the sample covariance matrix.
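A minimal NumPy sketch of this formula, assuming some made-up correlated 2-D data, looks like this:

import numpy as np

# Hypothetical correlated data: 500 observations of 2 variables.
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[2.0, 1.5], [1.5, 2.0]], size=500)

C = np.cov(X, rowvar=False)  # sample covariance matrix
C_inv = np.linalg.inv(C)

xa, xb = X[0], X[1]          # a pair of objects
diff = xa - xb
d = np.sqrt(diff @ C_inv @ diff)
print(d)

The inverse covariance matrix rescales and de-correlates the axes, so the distance accounts for how the variables vary together.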

5. Cosine similarity: Let’s imagine that you need to determine how similar two documents or corpus of text are.

Which distance metric would you use? The answer is cosine similarity.

In order to calculate it, we need to measure the cosine of the angle between two vectors.

Cosine similarity is then the normalized dot product of those vectors.

A normalized vector is a vector in the same direction but with norm 1.

The dot product is the operation in which two equal-length vectors are multiplied element-wise and summed, resulting in a single scalar.

[Figure: Cosine similarity]

So, the formula for the cosine similarity is:

similarity = cos θ = (A · B) / (∥A∥ ∥B∥)

where A and B are vectors, ∥A∥ and ∥B∥ are the norms of A and B, and cos θ is the cosine of the angle between A and B.

This can also be written in terms of the vector components:

cos θ = (Σ AiBi) / (√(Σ Ai²) · √(Σ Bi²))

Cosine similarity is very useful when we are interested in the orientation but not the magnitude of the vectors.

Two vectors with the same orientation have a cosine similarity of 1.

Two vectors at 90° have a similarity of 0.

Two vectors diametrically opposed have a similarity of -1.

All independent of their magnitude.
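A small sketch that reproduces these three cases (the vectors are made-up examples):

import numpy as np

def cosine_similarity(a, b):
    # Normalized dot product of two vectors.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v = np.array([1.0, 1.0])

print(cosine_similarity(v, 3 * v))                  #  1.0: same orientation, different magnitude
print(cosine_similarity(v, np.array([1.0, -1.0])))  #  0.0: vectors at 90°
print(cosine_similarity(v, -v))                     # -1.0: diametrically opposed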

6. Jaccard distance: Lastly, we will shift our focus.

Instead of calculating distances between vectors, we will work with sets.

A set is an unordered collection of objects.

So for example, {1, 2, 3, 4} is equal to {2, 4, 3, 1}.

We can calculate its cardinality (represented as |set|), which is simply the number of elements contained in the set.

Let’s say we have two sets of objects, A and B.

We wonder how many elements they have in common.

This is called Intersection.

It is represented mathematically as A ∩ B.

Maybe, we want to get all items regardless of the set they belong to.

This is called Union.

It is represented mathematically as A ∪ B.

We can picture this better using Venn Diagrams.

[Figure: intersection and union represented in light blue in the Venn diagrams]
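In code, Python’s built-in set type gives us all of these operations directly (the sets here are made-up examples):

A = {1, 2, 3}
B = {2, 3, 4}

print(A & B)   # intersection A ∩ B: {2, 3}
print(A | B)   # union A ∪ B: {1, 2, 3, 4}
print(len(A))  # cardinality |A| = 3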

How does this relate to Jaccard similarity? Jaccard similarity is defined as the cardinality of the intersection of the sets divided by the cardinality of their union.

It can only be applied to finite sample sets.

Jaccard similarity = |A ∩ B| / |A ∪ B|

Imagine we have the sets A = {“flower”, “dog”, “cat”, 1, 3} and B = {“flower”, “cat”, “boat”}.

Then |A ∩ B| = 2 and |A ∪ B| = 6.

As a result, the Jaccard similarity is 2/6 ≈ 0.33.
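The same worked example, as a short sketch:

def jaccard_similarity(a, b):
    # |A ∩ B| / |A ∪ B| for two finite sets.
    return len(a & b) / len(a | b)

A = {"flower", "dog", "cat", 1, 3}
B = {"flower", "cat", "boat"}

print(jaccard_similarity(A, B))  # 2 / 6 ≈ 0.33

Jaccard distance is then simply 1 minus this similarity, 1 − 0.33 ≈ 0.67.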

As we stated before, all of these metrics are used in several machine learning algorithms.

A clear example is clustering algorithms, such as k-means, where we need to determine if two data points are similar.

You can read my post about clustering to learn more about that.

The take-home message is that several distance metrics exist.

Each of them has a particular context in which they are more suitable.

Learning to choose the correct one will improve the outcome of your machine learning algorithm.
