In this blog I will explain what is a recommendation engine in general, and How to build a personalized recommendation model using Scala and Spark Collaborative filtering algorithm.
What is a Recommendation Engine?
I assume you’ve shopped online for books or visited movie review sites to pick top rated movies to watch. You must have been seen top rated movie lists which have been voted as best movies. That is not not a recommendation. When you browse through several categories or clicked several catchy posters and you get lines like, “People who bought/watched this also bought/watched that”. That is an example of recommendation. It assumes that if you share a similar taste with someone, you are going to like what they liked.
Preparing the Data Set
MovieLens is a database which was prepared by the GrpupLens Research Project at the University of Minnesota. The data set we are going to use for building the recommendation engine contains over hundred thousands rated movies. Each user has rated at least 20 movies. Ratings of different sizes (1 million,
|User ID||Movie ID||Rating|
|1||Toy Story (1995)||Animation|Children’s|Comedy|
|3||Grumpier Old Men (1995)||Comedy|Romance|
|4||Waiting to Exhale (1995)||Comedy|Drama|
About the Algorithm:
To make a preference prediction for any user, Collaborative filtering uses a preference by other users of similar interests and predicts movies of your interests which is unknown to you. Spark MLlib uses Alternate Least Squares (ALS) to make recommendation. Here is a glimpse of collaborative filtering method:
Here user ratings on movies are represented as a matrix where a cell represents ratings for a particular movie by a user. The cell with “?” represents the movies which is user U4 is not aware of or haven’t seen. Based on the current preference of user U4 the cell with “?” can be filled with approximated rating of users which are having similar interest as user U4.
Building Recommendation Model
Spark MLlib uses Alternate Least Squares (ALS) to build recommendation model.
Loading the Data as Rating
Training the Model
The rating data is split into two part. The variable training contains 47% of data for training. Rest is kept for evaluating the model. We are going to join the Rating of user U4 (the user for which we are going to predict) for the movies he has seen so far so that we can learn taste of user U4 about movies.
Evaluating the Model
Now let us evaluate the model on test data and get the prediction error. The prediction error is calculated as Root Mean Square Error (RMSE)
The RMSE value was: 0.95.
Making Recommendation for User U4
Here a new user id (0) is associated with each movie user u4 has not seen so far. And MatrixFactorization model is used to predict values for the unknown movies. Method predict returns Rating(user,item,rating) RDD which is again sorted by Rating to recommend best movies out of entire list.
The above code predict values of (user,Item) where user id is 0 i.e. of U4 and item is all movies he hasn’t watched. The result is further sorted by the rating and the top 20 movie Ids are returned.
You can view entire code example at: GitHub
 Collaborative Filtering Apache Spark Documentation.