This question takes us to the next similarity metric.

Jaccard Index:

Let's consider another situation.
An insurance company wants to segment the claims filed by its customers based on some similarity.
They have a database of claims with 100 attributes, on the basis of which the company decides whether a claim is fraudulent or not. The attributes can be the driving skill of a person, the car inspection record, purchase records, etc. Each attribute can generate a red flag for the claim. In most cases, only a few attributes generate a red flag; the other attributes rarely change. Here, the presence of a red flag provides more information to the insurance company than a green flag does (this is the asymmetry). If we use SMC, the scores will be dominated by the many attributes that rarely generate red flags.
In such cases, Jaccard index is used.
Let’s check that with numbers.
Consider three claims A, B & C with 20 binary attributes:

Claim A = (R,R,R,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G)
Claim B = (R,R,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G)
Claim C = (R,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G,G)

The Jaccard index for each pair is calculated as

J = M11 / (M11 + M10 + M01)

where M11 is the number of attributes where both claims have the red flag, and M10, M01 are the numbers of attributes where one claim has the red flag and the other has the green flag.
For claims A and B, the Jaccard index is 2/3, i.e. 0.66, and SMC is 19/20, i.e. 0.95. For claims A and C, the Jaccard index is 1/3, i.e. 0.33, and SMC is 18/20, i.e. 0.90. For claims B and C, the Jaccard index is 1/2, i.e. 0.5, and SMC is 19/20, i.e. 0.95.
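These numbers are easy to verify in code. Here is a minimal sketch in plain Python (encoding the flags as "R"/"G" strings is my own choice for illustration):

```python
def jaccard_smc(a, b):
    """Return (Jaccard index, SMC) for two equal-length flag vectors."""
    m11 = sum(1 for x, y in zip(a, b) if x == "R" and y == "R")  # both red
    m00 = sum(1 for x, y in zip(a, b) if x == "G" and y == "G")  # both green
    mismatch = sum(1 for x, y in zip(a, b) if x != y)            # M10 + M01
    return m11 / (m11 + mismatch), (m11 + m00) / len(a)

A = ["R"] * 3 + ["G"] * 17
B = ["R"] * 2 + ["G"] * 18
C = ["R"] * 1 + ["G"] * 19

print(jaccard_smc(A, B))  # (0.666..., 0.95)
print(jaccard_smc(A, C))  # (0.333..., 0.90)
print(jaccard_smc(B, C))  # (0.5, 0.95)
```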
We see that the SMC scores of all three pairs are close to each other, while the Jaccard index shows a significant difference. This is the problem with SMC when the classes do not carry equal information. For example, in our case the R class carries more information than G, but SMC treats them as equal.
Jaccard index is also called the IoU (intersection over union) metric, which is used in semantic segmentation of images. The similarity is calculated as the number of highlighted pixels in the intersection of the two segmentations divided by the number of highlighted pixels in their union.
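As a rough sketch of that computation, assuming the segmentations are given as boolean numpy masks (an assumption of mine, not something fixed by the article):

```python
import numpy as np

def iou(mask_a, mask_b):
    """Intersection over union of two boolean segmentation masks."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return intersection / union

# Two toy 3x3 masks; True marks a highlighted pixel.
a = np.array([[1, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=bool)
b = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 0]], dtype=bool)
print(iou(a, b))  # 2 shared pixels / 4 highlighted overall = 0.5
```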
Jaccard index can also be thought of as a generalization of SMC. In cases where we have multiple symmetric classes (multiple classes carrying equal weight), we cannot use SMC, as it works only with binary symmetric classes. In that case, we can create a dummy variable for each class; the individual dummy variables are asymmetric, because the presence of a class in its dummy variable provides more information than its absence. We can then use the Jaccard index to find the similarity score, as the sketch below shows. In short, we convert multiple symmetric classes into binary asymmetric dummy variables and then calculate the Jaccard index.
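A minimal sketch of that conversion in plain Python (the three class labels and the one_hot helper are made up for illustration):

```python
def one_hot(values, classes):
    """Expand a categorical vector into binary dummy variables, one per class."""
    return [1 if v == c else 0 for v in values for c in classes]

CLASSES = ["red", "green", "blue"]      # three symmetric classes
x = ["red", "green", "green", "blue"]   # two categorical records
y = ["red", "blue", "green", "blue"]

xh, yh = one_hot(x, CLASSES), one_hot(y, CLASSES)

# Jaccard index on the asymmetric dummies.
m11 = sum(a & b for a, b in zip(xh, yh))
mismatch = sum(a != b for a, b in zip(xh, yh))
print(m11 / (m11 + mismatch))  # 3 shared 1s / (3 + 2 mismatches) = 0.6
```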
Until now, we were discussing vectors with binary attributes. What if the attributes are continuous/numeric? This is where we turn to distance- and angle-based similarity scores.
Euclidean Distance:

Euclidean distance is more of a dissimilarity measure, like the Minkowski and Mahalanobis distances.
I have included this as it forms the basis of discussion for the upcoming metrics.
We know that the points which are closer in space will have smaller distance between them than the points which are far from each other.
So a smaller distance means more similarity; this is the thought behind using Euclidean distance as a similarity metric.
Euclidean distance between vectors p and q is calculated as

d(p, q) = √( Σᵢ (pᵢ − qᵢ)² )

Consider three users A, B and C. They have rated a few movies; each rating can range from 1 to 5, and 0 means the user hasn't watched that movie.

User A = (0.5, 1, 1.5, 2, 2.5, 3, 3.5)
User B = (0.5, 1.5, 2.5, 0, 0, 0, 0)
User C = (0.5, 1, 1.5, 2, 0, 0, 0)

Using the above formula, we get the distance between A & B as 5.72, between B & C as 2.29, and between A & C as 5.24.
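These numbers are easy to check with a few lines of numpy:

```python
import numpy as np

A = np.array([0.5, 1, 1.5, 2, 2.5, 3, 3.5])
B = np.array([0.5, 1.5, 2.5, 0, 0, 0, 0])
C = np.array([0.5, 1, 1.5, 2, 0, 0, 0])

# Euclidean distance: square root of the summed squared differences.
print(np.linalg.norm(A - B))  # 5.72
print(np.linalg.norm(B - C))  # 2.29
print(np.linalg.norm(A - C))  # 5.24
```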
If you look carefully, A & C have given the same ratings to the first four movies, which tells us that both have a similar liking for those movies; but since C has not seen a few of the movies, we get a significant distance between them.
Since the above vectors have seven dimensions, we cannot visualize them here.
Instead, let’s look at similar vectors on two axes where each axis represents one movie.
In the plot, the red vector represents user A, the green vector represents user B, and the blue vector represents user C. All the vectors have their tails at the origin.
As per the above plot, we should expect the blue and red vectors to show high similarity, since they are collinear.
But we get significant distance between them when we calculate Euclidean distance.
What if, instead of using the distance between the vectors, we calculate the cosine of the angle between them? Vectors can be shorter or longer, but the angle between them remains the same.
This takes us to the next similarity metric.

Cosine Similarity:

In our academics, we have come across the dot product and cross product of two vectors. The dot product of two vectors is the product of the magnitudes of the vectors and the cosine of the angle between them, i.e.

A · B = |A| |B| cos(θ)

where |A| and |B| represent the lengths of vectors A and B, i.e. their distances from the origin. A · B is also obtained by summing the element-wise products of vectors A & B, i.e.

A · B = Σᵢ Aᵢ Bᵢ
Cosine similarity is calculated as

cos(θ) = (A · B) / (|A| |B|)

Since the ratings are positive, our vectors will always lie in the first quadrant. So we will get cosine similarity in the range [0, 1], with 1 being highly similar.
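To see the scale-invariance point concretely, here is a small sketch (the 2-D vectors are made up for illustration): a vector and a stretched copy of it are far apart in Euclidean terms, yet their cosine similarity is a perfect 1.

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| |b|)"""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v = np.array([1.0, 2.0])
w = 3 * v  # same direction, three times the length

print(np.linalg.norm(v - w))    # 4.47 -- large Euclidean distance
print(cosine_similarity(v, w))  # 1.0  -- same direction, perfectly similar
```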
We thought of using cosine similarity because we knew that the angle between the vectors remains the same irrespective of their lengths. But can we improve it further? Do you see any problem yet? Let's see!

Centered or Adjusted Cosine Similarity:

Centered? What's that? Up until now, we were trying to find similarity between smaller apples and bigger apples.
How's that? We know that there are some people who will always be strict when giving ratings, and then there are the generous ones (I belong to this category!). If we try to find similarity between them, we will always get some bias because of this behavior. This can be handled by subtracting the average rating a user gives from all the movie ratings of that user, thereby centering the ratings around the mean. Once all the vectors are centered, we calculate the cosine similarity. This is nothing but centered or adjusted cosine similarity! It is also known by the popular name Pearson's correlation. To prove the point made above, I created two arrays, where the second array is obtained by adding a constant offset to the first array, keeping all the variations of the first array the same.
A short implementation sketch is below. We get the correlation as 1 and the cosine similarity as 0.85, which shows that correlation performs better than plain cosine similarity here. This is because of the centering of the vectors.
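Here is a minimal sketch of that experiment (the arrays are stand-ins of my own, so the cosine value differs from the 0.85 in the original notebook):

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = a + 10  # same variations, constant offset (a "generous" rater)

print(np.corrcoef(a, b)[0, 1])  # 1.0   -- correlation ignores the offset
print(cosine_similarity(a, b))  # ~0.95 -- plain cosine is dragged down by it

# Pearson's correlation is just cosine similarity on mean-centered vectors:
print(cosine_similarity(a - a.mean(), b - b.mean()))  # 1.0
```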
There are a few other similarity metrics available too, but the metrics we discussed so far are the ones we encounter most of the time while working on a data science problem.
Below are some reference links where you can read more about these metrics and their use cases.
Cosine similarity for vector space model
Comparison study of similarity and dissimilarity measures
Evaluating image segmentation models
User-Item similarity (ResearchGate)
Pearson's correlation & Salton's cosine measure

Thanks for reading till the end.
I hope you enjoyed it.
Happy learning and see you soon!