
Cosine Similarity
- By Bruce Nielson
- ML & AI Specialist
In this blog post I’m going to answer the question I know has been burning in your mind: What is Cosine Similarity and how does it affect your life?
Yes, I know you’ve all been hearing about Cosine Similarity everywhere! You can hardly go 15 minutes before someone brings it up in casual conversation, acting like you should know what it is. You feel stupid that you don’t know what they are talking about. You are too ashamed to admit your ignorance.
Well, fear not dear reader! After reading this article you’ll be able to end those feelings of foolishness and be able to join in on the conversation!
Wikipedia defines Cosine Similarity as the cosine of the angle between two vectors, which works out to be the dot product of the vectors divided by the product of their magnitudes. As a formula:

cos(θ) = (A · B) / (||A|| ||B||)
There you go! Now you’ve been educated! Do you feel better? No?
Okay, let’s use an example to make it easier to make sense of.
Let’s say we want to measure how similar two sentences are. Let’s use the following two sentences:
“This blog post sucks”
Vs
“This blog post is awesome”
There are various ways we might represent these sentences mathematically so that a Machine Learning model can make sense of them, but let’s stick with something pretty simple. We’ll imagine the following numbered list of words:
1. This
2. Blog
3. Post
4. Is
5. Sucks
6. Awesome
So, let’s imagine a ‘vector’ (which is really just an ordered list) where we put a 1 if the word exists in the sentence and a zero if it doesn’t. So:
This blog post sucks = [1, 1, 1, 0, 1, 0]
This blog post is awesome = [1, 1, 1, 1, 0, 1]
Here it should be obvious that we’re only comparing which words appear in each sentence, not the order of the words. This becomes clear if we realize the following:
Post this blog sucks = [1, 1, 1, 0, 1, 0] just like “This blog post sucks”
If we were doing this for real, we might want to take order into consideration, since clearly the order of words matters to the meaning of a sentence. When we drop the order of the words like this, we call it a “Bag of Words”. For this simple example, we’re going to ignore word order, because otherwise the math becomes too complicated to keep things simple.
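Just to make this concrete, here is a minimal sketch of how you might build these bag-of-words vectors in python (the vocabulary list and function name here are made up for this example):

vocabulary = ["this", "blog", "post", "is", "sucks", "awesome"]

def bag_of_words(sentence):
    # Lowercase the sentence and mark which vocabulary words appear in it.
    words = sentence.lower().split()
    return [1 if word in words else 0 for word in vocabulary]

print(bag_of_words("This blog post sucks"))       # [1, 1, 1, 0, 1, 0]
print(bag_of_words("This blog post is awesome"))  # [1, 1, 1, 1, 0, 1]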
So, now we have two ‘vectors’:
[1, 1, 1, 0, 1, 0] and [1, 1, 1, 1, 0, 1]
How ‘similar’ are these two vectors? You could probably come up with some sort of measure if you stopped and thought about it, and there is no one ‘True’ way to measure the similarity and difference of these two vectors.
But here is a clever idea: let’s treat these vectors as if they were geometrical vectors and then measure the angle between them.
To see why this works, let’s imagine two vectors on a 2D plane (since it is hard to imagine six-dimensional space like in our bag-of-words example, though a computer doesn’t care how many dimensions it is calculating in).
Let’s imagine two unit vectors (basically two line segments of length 1) on a grid. The first is at 45 degrees and the second is at 75 degrees.
How might we measure the ‘similarity’ between these two lines?
One obvious idea is to measure how far they are rotated from each other. That is to say:
75 - 45 = 30
Now take the Cosine of that angle to get a value between 0 and 1:
Cos(30) = 0.8660254
Or in other words, these lines are 86.6% the same. (1)
Just to prove the point, let’s try this again with 45 degrees and 100 degrees:
Cos(100 - 45) = Cos(55) = 0.57357644
So those are not as similar.
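If you want to check these numbers yourself, a couple of lines of python will do it. Just remember that python’s math.cos expects radians rather than degrees, so we convert first:

import math

# Cosine of the 30-degree angle between the 45- and 75-degree vectors.
print(math.cos(math.radians(75 - 45)))   # 0.8660254037844387
# Cosine of the 55-degree angle between the 45- and 100-degree vectors.
print(math.cos(math.radians(100 - 45)))  # 0.5735764363510462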
Okay, but how can we use this same idea with comparing sentences?
Well, to the computer a vector can be treated as a line in space, in this case 6-dimensional space. So, we just calculate the cosine of the angle between those two vectors, and it will effectively be a rating of how similar the two vectors are.
Let’s now break down that intimidating Wikipedia formula and work this out for our vectors so that we can compare sentences. Here is some python code that does the job:

import numpy as np

def cosine_similarity(x, y):
    # x is a single 1-D vector; y is a 2-D array with one vector per row.
    assert x.shape[0] == y.shape[1], "Dimension mismatch: x vector size should match the number of columns in y"
    # Dot product of x with every row of y.
    dot_products = np.dot(y, x)
    x_magnitude = np.linalg.norm(x)
    y_magnitudes = np.linalg.norm(y, axis=1)
    # Cosine similarity of x against each row of y.
    cosine_similarities = dot_products / (x_magnitude * y_magnitudes)
    return cosine_similarities
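Note that, as written, this version assumes x is a single vector while y is a two-dimensional array with one vector per row, which lets it compare one vector against many at once. Here is a quick sketch of calling it on our two sentence vectors:

x = np.array([1, 1, 1, 0, 1, 0])    # "This blog post sucks"
y = np.array([[1, 1, 1, 1, 0, 1]])  # "This blog post is awesome" as a single row
print(cosine_similarity(x, y))      # [0.67082039]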
Let’s walk through this one part at a time. The numerator of the Wikipedia formula, the dot product A · B, is equivalent to this from the python code:
dot_products = np.dot(y, x)
We want to take the dot product of the two vectors. You can look it up if you’re curious, but it’s a function built into NumPy, so don’t worry too much about what it is. It’s a fairly standard matrix operation. Note that this dot operation will only work if the length of x matches the number of columns of y, thus our assertion checking for that.
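For instance, using the x and y arrays from the sketch above, the dot product works out to:

print(np.dot(y, x))  # [3]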
Then there is the denominator of the Wikipedia formula, ||A|| ||B||. It is saying to take the magnitudes of the two vectors and multiply them together. (Do you recall how to find a magnitude from high school geometry? Finding a magnitude is exactly the same as finding the length of a line, except you are doing it in any number of dimensions.) Again, this is built into NumPy, so:
x_magnitude = np.linalg.norm(x)
y_magnitudes = np.linalg.norm(y, axis=1)
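Continuing with our x and y arrays from above, the magnitudes come out to:

print(np.linalg.norm(x))          # 2.0
print(np.linalg.norm(y, axis=1))  # [2.23606798]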
Finally, we take that dot product and divide it by the product of the magnitudes, ||A|| ||B||, which is this in the python code:
cosine_similarities = dot_products / (x_magnitude * y_magnitudes)
If you really wanted to skip the built-in NumPy functions and calculate everything out yourself, here is what the revised python code would look like:

import math

def cosine_similarity(x, y):
    assert len(x) == len(y), "Dimension mismatch: Vectors must have the same length"
    # Dot product: multiply matching elements and add them up.
    dot_products = sum(xi * yi for xi, yi in zip(x, y))
    # Magnitude: the square root of the sum of the squared elements.
    x_magnitude = math.sqrt(sum(xi ** 2 for xi in x))
    y_magnitude = math.sqrt(sum(yi ** 2 for yi in y))
    # Divide, guarding against division by zero for all-zero vectors.
    cosine_similarity = dot_products / (x_magnitude * y_magnitude) if x_magnitude * y_magnitude != 0 else 0
    return cosine_similarity
And there you go: we now have a function with which to calculate the cosine between two vectors. Let’s actually run it on our simple example. Recall our two vectors were:
[1, 1, 1, 0, 1, 0] and [1, 1, 1, 1, 0, 1]
So take the dot product:
(1 × 1) + (1 × 1) + (1 × 1) + (0 × 1) + (1 × 0) + (0 × 1) = 3
And take the magnitudes:
Sqrt(1^2 + 1^2 + 1^2 + 0^2 + 1^2 + 0^2) = Sqrt(4) = 2
Vs
Sqrt(1^2 + 1^2 + 1^2 + 1^2 + 0^2 + 1^2) = Sqrt(5) ≈ 2.24
The result:
3 / (2 × 2.24) = 3 / 4.48 ≈ 0.6696
And our python code calculates 0.6708. The tiny difference comes from rounding the magnitudes in our hand calculation, but it’s basically the same.
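If you want to verify that yourself, just call the pure-python version from above on our two vectors:

a = [1, 1, 1, 0, 1, 0]  # "This blog post sucks"
b = [1, 1, 1, 1, 0, 1]  # "This blog post is awesome"
print(cosine_similarity(a, b))  # 0.6708203932499369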
So, these two sentences (as measured via a bag of words) are 67% the same.
The beauty of cosine similarity is that you can use it on anything that you can represent as vectors! This will turn out to be really useful in our next post where we tackle using cosine similarity with Large Language Models (LLMs).
Note:
(1) Note that I’m actually sort of lying here. To call these lines 86.6% the same, I’m making an additional assumption that we’re specifically talking about Cosine Similarity for text. Cosines actually range not from 0 to 1 but from -1 to 1. So, a Cosine of 0.866 can’t really be considered to mean 86.6% ‘the same.’ However, text frequencies can’t be negative, so we’d expect the results to range from 0 to 1. Within that context, we can think of the results as a percentage of similarity.
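For instance, feeding two vectors that point in exactly opposite directions into our pure-python function gives -1:

print(cosine_similarity([1, -1], [-1, 1]))  # -1.0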
Also be sure to check out my next article addressing how to use Cosine Similarity for Semantic Searches.
To stay in the loop, make sure to follow us on LinkedIn, and also be sure to have a look at our other articles here on the Mindfire Blog.