MachineX: Cosine Similarity for Item-Based Collaborative Filtering

Reading Time: 4 minutes

“A recommender system or a recommendation system (sometimes replacing ‘system’ with a synonym such as platform or engine) is a subclass of information filtering system that seeks to predict the ‘rating’ or ‘preference’ a user would give to an item.” – Wikipedia

In simple terms, a recommender system is a system capable of producing a list of recommendations with respect to an item. One way to build a recommender system is Collaborative Filtering, where information is filtered by looking at the activity of other users. Many companies use recommender systems to give their users better recommendations.

Some examples are Amazon using a recommender system to recommend items, and Netflix recommending which movies to watch next after a user has seen a movie.

Collaborative Filtering is further divided into two types:

  1. User-Based Collaborative Filtering (UB-CF): Recommendations based on calculating the similarity of two users.
  2. Item-Based Collaborative Filtering (IB-CF): Recommendations based on calculating the similarity of two items from the ratings people have given to both items.

In this post we will look at a method named Cosine Similarity for Item-Based Collaborative Filtering.

NOTE: Item-Based similarity doesn’t imply that the two items resemble each other in terms of their attributes. Rather, it is similarity in how individuals treat the two items in terms of like or dislike.


Cosine similarity is a metric used to measure how similar two items or documents are, irrespective of their size. It measures the cosine of the angle between two vectors projected in a multi-dimensional space. This allows us to measure the similarity of documents of any type. Because the vectors are multi-dimensional, any number of variables (each treated as a dimension) can be used, which in turn supports large documents.

Mathematically, the cosine of the angle between two vectors is the dot product of the two vectors divided by the product of their magnitudes.
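The definition above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation; the function name `cosine_similarity` and the sample vectors are made up for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))       # dot product
    norm_a = math.sqrt(sum(x * x for x in a))    # magnitude of a
    norm_b = math.sqrt(sum(y * y for y in b))    # magnitude of b
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Scaling a vector changes its magnitude but not its direction,
# so these two vectors are (up to rounding) maximally similar.
print(cosine_similarity([1, 2, 3], [2, 4, 6]))

# Same values in a different order: similar, but not identical direction.
print(cosine_similarity([1, 2, 3], [3, 2, 1]))
```

The first call illustrates the "irrespective of their size" property: doubling every component leaves the similarity unchanged.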

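Since we are computing the cosine of the angle between two vectors, the output always lies between -1 and 1, where -1 means the two items are completely dissimilar (the vectors point in opposite directions) and 1 means the two items are completely similar. When all ratings are non-negative, as in the example that follows, the similarity never drops below 0. We will now see how the Cosine Similarity measure can be used to determine how similar movies are.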
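To make the range concrete, here is a small sketch that evaluates a few hand-picked vector pairs. The helper `cos_sim` and the vectors are illustrative assumptions, not data from this post.

```python
import math

def cos_sim(a, b):
    """Compact cosine similarity for non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

same = cos_sim([1, 2], [2, 4])                # same direction: close to 1
orthogonal = cos_sim([1, 0], [0, 1])          # perpendicular: 0
opposite = cos_sim([1, 2], [-1, -2])          # opposite direction: close to -1
ratings_like = cos_sim([5, 3, 1], [1, 3, 5])  # no negative entries: stays >= 0

for name, value in [("same", same), ("orthogonal", orthogonal),
                    ("opposite", opposite), ("ratings_like", ratings_like)]:
    print(f"{name}: {value:.3f}")
```

The last pair mimics two rating vectors that disagree strongly. Because no component is negative, the result stays above 0 even though the tastes are reversed.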

Why Cos(Θ)?

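Two lengths can be multiplied meaningfully only when they lie along the same direction. Multiplying the length of one vector by Cos(Θ) gives the length of its projection onto the other vector, which effectively makes it “point in the same direction” as the other. Multiplying that projected length by the length of the other vector gives the dot product of the two vectors: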

A · B = |A| |B| Cos(Θ)
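The identity can be checked numerically for a pair of 2-D vectors. The sketch below uses two arbitrarily chosen vectors (an assumption for illustration) and measures Θ independently from each vector's direction angle.

```python
import math

a = (3.0, 0.0)
b = (2.0, 2.0)

# Left-hand side: component-wise dot product.
dot = a[0] * b[0] + a[1] * b[1]

# Angle between the vectors, taken as the difference of their direction angles.
theta = math.atan2(b[1], b[0]) - math.atan2(a[1], a[0])

# Right-hand side: |A| |B| cos(theta).
rhs = math.hypot(*a) * math.hypot(*b) * math.cos(theta)

# Length of b's projection onto a: |B| cos(theta).
projection = math.hypot(*b) * math.cos(theta)

print(dot, rhs, projection)
```

Here the projection of `b` onto `a` has length 2 (up to rounding), and multiplying it by the length of `a` (3) recovers the dot product.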


Suppose we have movie ratings given by different users in table format, as shown below.

Step 1: We create a sparse matrix that holds the user-item ratings.

In this matrix, the user Amy has already watched and rated the movies Pulp Fiction and The Godfather but hasn’t watched the movie Forrest Gump. We will use the above matrix for our example and build an item-item similarity matrix using the Cosine Similarity method to determine how similar the movies are to each other.
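One way to hold such a sparse matrix in code is a dictionary of dictionaries that stores only the ratings that exist. The sketch below is an assumption-laden illustration: it contains only the ratings that the worked calculations in this post state or imply (and assumes the first vector in Step 2 belongs to Pulp Fiction), so the other users' rows are incomplete compared with the original table. The names `ratings`, `MOVIES` and `unrated` are hypothetical.

```python
MOVIES = ["Pulp Fiction", "The Godfather", "Forrest Gump"]

# Sparse user-item ratings: a missing key means "no rating stored".
# NOTE: only ratings recoverable from the worked example are listed;
# the original table has further entries that are not reproduced here.
ratings = {
    "Amy":     {"Pulp Fiction": 4, "The Godfather": 5},
    "Calvin":  {"Pulp Fiction": 5, "Forrest Gump": 2},
    "Robert":  {"Pulp Fiction": 3, "Forrest Gump": 3},
    "Bradley": {"Pulp Fiction": 1, "Forrest Gump": 3},
}

def unrated(user):
    """Movies for which no rating by `user` is stored."""
    seen = ratings.get(user, {})
    return [movie for movie in MOVIES if movie not in seen]

print(unrated("Amy"))  # ['Forrest Gump']
```

Storing only existing ratings keeps the structure small when most users have rated only a few of the available items.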

Step 2: To calculate the similarity between the movies Pulp Fiction (P) and Forrest Gump (F), we first find all the users who have rated both movies. In our case, Calvin (C), Robert (R) and Bradley (B) have rated both. We now create two vectors, one per movie, with one component per user:

v1 = 5 C + 3 R + 1 B

v2 = 2 C + 3 R + 3 B
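Selecting the co-rated users and building the two vectors could look like the sketch below. It assumes v1 holds the Pulp Fiction ratings and v2 the Forrest Gump ratings, and that Amy's Pulp Fiction rating is 4 as implied by the prediction in Step 3. The function name `co_rated_vectors` is hypothetical.

```python
# Ratings per movie, keyed by user (only values recoverable from this post).
pulp_fiction = {"Amy": 4, "Calvin": 5, "Robert": 3, "Bradley": 1}
forrest_gump = {"Calvin": 2, "Robert": 3, "Bradley": 3}

def co_rated_vectors(item_a, item_b):
    """Return the users who rated both items plus their two rating vectors."""
    common = [user for user in item_a if user in item_b]  # keeps item_a's order
    vec_a = [item_a[user] for user in common]
    vec_b = [item_b[user] for user in common]
    return common, vec_a, vec_b

users, v1, v2 = co_rated_vectors(pulp_fiction, forrest_gump)
print(users)  # ['Calvin', 'Robert', 'Bradley']
print(v1)     # [5, 3, 1]
print(v2)     # [2, 3, 3]
```

Amy is dropped from both vectors because she has not rated Forrest Gump, so only users with a rating for both movies contribute to the similarity.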

Therefore Cosine Similarity between movies Pulp Fiction and Forrest Gump is:

cos(v1, v2) = (5*2 + 3*3 + 1*3) / sqrt[(25+9+1) * (4+9+9)] = 22 / sqrt(770) ≈ 0.792
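The same arithmetic, spelled out step by step as a quick check. The variable names are chosen for this illustration only.

```python
import math

v1 = [5, 3, 1]
v2 = [2, 3, 3]

dot = sum(x * y for x, y in zip(v1, v2))  # 5*2 + 3*3 + 1*3 = 22
sq1 = sum(x * x for x in v1)              # 25 + 9 + 1 = 35
sq2 = sum(y * y for y in v2)              # 4 + 9 + 9 = 22

similarity = dot / math.sqrt(sq1 * sq2)   # 22 / sqrt(770)
print(round(similarity, 4))               # 0.7928
```

The value 0.792 used in this post is this result cut off after three decimal places.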

Similarly, we can calculate the cosine similarity for every pair of movies, and our final similarity matrix will be:

Step 3: Now we can predict and fill in a user's ratings for the items they haven’t rated yet. To calculate Amy's rating for the movie Forrest Gump, we take the ratings she has already given and weight each one by the similarity between that movie and Forrest Gump, then divide by the sum of those similarities. Therefore, the rating would be

(4*0.792 + 5*0.8) / (0.792 + 0.8) ≈ 4.5
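The weighted average above can be written as a small function. This is a sketch under the assumption that Amy rated Pulp Fiction 4 and The Godfather 5, and that their similarities to Forrest Gump are 0.792 and 0.8, as the calculation implies. The name `predict_rating` is hypothetical.

```python
def predict_rating(user_ratings, similarities):
    """Similarity-weighted average of the ratings a user has already given.

    user_ratings: {item: rating} for the items the user has rated
    similarities: {item: similarity of that item to the target item}
    """
    rated = [item for item in user_ratings if item in similarities]
    weight_sum = sum(similarities[item] for item in rated)
    if weight_sum == 0:
        raise ValueError("no similar rated items to base a prediction on")
    weighted = sum(user_ratings[item] * similarities[item] for item in rated)
    return weighted / weight_sum

amy = {"Pulp Fiction": 4, "The Godfather": 5}
sim_to_forrest_gump = {"Pulp Fiction": 0.792, "The Godfather": 0.8}

prediction = predict_rating(amy, sim_to_forrest_gump)
print(round(prediction, 2))  # 4.5
```

Because the two similarities are nearly equal, the prediction lands almost exactly halfway between Amy's two existing ratings.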

Hence, our final matrix would look like this:

Hope you enjoyed the post.

Written by 

Rahul Khanna is a software consultant with 1+ years of experience. In the past, Rahul worked with Python, where his main focus was handling and analyzing data using libraries such as pandas and numpy. Rahul is currently working on reactive technologies like Scala, Akka and Spark, along with machine learning algorithms.