Build your personalized movie recommender with Scala and Spark

In this blog I will explain what a recommendation engine is in general, and how to build a personalized recommendation model using Scala and Spark’s collaborative filtering algorithm.

What is a Recommendation Engine?

I assume you’ve shopped online for books or visited movie review sites to pick top-rated movies to watch. You must have seen lists of top-rated movies that were voted the best. That is not a recommendation. When you browse through several categories or click a few catchy posters and then see lines like “People who bought/watched this also bought/watched that”, that is an example of a recommendation. It assumes that if you share a similar taste with someone, you are going to like what they liked.

Preparing the Data Set

MovieLens is a database prepared by the GroupLens Research Project at the University of Minnesota. The data set we are going to use for building the recommendation engine contains over a hundred thousand movie ratings. Each user has rated at least 20 movies. Ratings of different sizes (1 million,

Ratings

User ID	Movie ID	Rating
1	1	5
1	2	3
2	2	3
2	4	4

Movies

Movie ID	Name	Genre
1	Toy Story (1995)	Animation|Children’s|Comedy
2	Jumanji (1995)	Adventure|Children’s|Fantasy
3	Grumpier Old Men (1995)	Comedy\|Romance
4	Waiting to Exhale (1995)	Comedy\|Drama

About the Algorithm:

To predict a preference for a user, collaborative filtering uses the preferences of other users with similar interests to predict movies you are likely to enjoy but have not yet discovered. Spark MLlib uses Alternating Least Squares (ALS) to make recommendations. Here is a glimpse of the collaborative filtering method:

	Movies
Users	M1	M2	M3	M4
U1	2	4	3	1
U2	0	0	4	4
U3	3	2	2	3
U4	2	?	3	?

Here user ratings are represented as a matrix in which each cell holds one user’s rating for one movie. The cells marked “?” are movies that user U4 is not aware of or has not seen. Based on U4’s existing ratings, these cells can be filled with approximate ratings derived from users whose interests are similar to U4’s.
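To make the idea of filling the “?” cells concrete, here is a minimal sketch in plain Scala (no Spark needed). Note that it is not ALS: it uses a simpler neighbourhood approach, a similarity-weighted average, purely for illustration. The names cosine and predict are made up for this example, and the 0 entries of U2 are treated as real ratings exactly as they appear in the table, which is an assumption.

```scala
// Ratings matrix from the table above; None marks the "?" cells.
// Assumption: the 0 entries for U2 are treated as actual ratings.
val ratings: Map[String, Vector[Option[Double]]] = Map(
  "U1" -> Vector(Some(2.0), Some(4.0), Some(3.0), Some(1.0)),
  "U2" -> Vector(Some(0.0), Some(0.0), Some(4.0), Some(4.0)),
  "U3" -> Vector(Some(3.0), Some(2.0), Some(2.0), Some(3.0)),
  "U4" -> Vector(Some(2.0), None, Some(3.0), None)
)

// Cosine similarity computed only over the movies both users have rated.
def cosine(a: Vector[Option[Double]], b: Vector[Option[Double]]): Double = {
  val shared = a.zip(b).collect { case (Some(x), Some(y)) => (x, y) }
  val dot = shared.map { case (x, y) => x * y }.sum
  val normA = math.sqrt(shared.map { case (x, _) => x * x }.sum)
  val normB = math.sqrt(shared.map { case (_, y) => y * y }.sum)
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}

// Similarity-weighted average of the other users' ratings for one movie.
// The movie index is 0-based: M1 is 0, M2 is 1, and so on.
def predict(user: String, movie: Int): Option[Double] = {
  val target = ratings(user)
  val neighbours = (ratings - user).values.toList.flatMap { row =>
    row(movie).map(r => (cosine(target, row), r)).toList
  }
  val totalWeight = neighbours.map(_._1).sum
  if (totalWeight == 0.0) None
  else Some(neighbours.map { case (s, r) => s * r }.sum / totalWeight)
}

val m2ForU4 = predict("U4", 1) // estimate for the "?" under M2
val m4ForU4 = predict("U4", 3) // estimate for the "?" under M4
```

On this tiny matrix U1 is the closest match for U4 on the movies they share, so U1’s ratings pull the estimates the most. ALS reaches a similar goal differently, by learning latent factors for users and movies.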

Building Recommendation Model

Spark MLlib uses Alternating Least Squares (ALS) to build the recommendation model.
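For orientation, the textbook form of the objective that ALS minimizes can be written as follows. The notation is chosen here and does not come from the Spark documentation: r_{ui} is the known rating of user u for movie i, K is the set of known (user, movie) pairs, and x_u and y_i are the user and movie factor vectors of length k. In the training code of this post, k corresponds to rank and λ to lambda. Spark’s actual implementation may scale the regularization term differently, so treat this as a sketch of the idea only.

```latex
\min_{X,\,Y} \sum_{(u,i) \in \mathcal{K}} \left( r_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_{u} \lVert x_u \rVert^2 + \sum_{i} \lVert y_i \rVert^2 \right)
```

The “alternating” part refers to how this is solved: with the movie factors Y held fixed, each x_u is an ordinary regularized least squares problem, and vice versa, so the algorithm switches between the two for a fixed number of iterations.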

Loading the Data as Rating

import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating

val data = sc.textFile("your_dir/ratings.dat")
// Each line has the form user::movie::rating::timestamp; the timestamp is ignored.
val ratings = data.map(_.split("::") match {
  case Array(user, item, rate, timestamp) =>
    Rating(user.toInt, item.toInt, rate.toDouble)
})
val movies = movieRecommendationHelper.getMovieRDD.map(_.split("::"))
  .map { case Array(movieId, movieName, genre) => (movieId.toInt, movieName) }
// Gets 10 popular movies. For details see
// https://github.com/akashsethi24/Machine-Learning/blob/master/src/main/scala/movie/RecommendMovie.scala
val myRatingsRDD = movieRecommendationHelper.topTenMovies
// Deterministic split based on the last digit of userId * movieId.
val training = ratings.filter { case Rating(userId, movieId, rating) => (userId * movieId) % 10 <= 3 }.persist()
val test = ratings.filter { case Rating(userId, movieId, rating) => (userId * movieId) % 10 > 3 }.persist()
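One thing to be aware of: the pattern match used when loading the ratings has a single case, so a line that does not split into exactly four fields would throw a scala.MatchError. A more forgiving parser could look like the sketch below. It is plain Scala so it runs without Spark; the Rating case class is a hypothetical stand-in for Spark’s class of the same name, parseRating is a made-up helper name, and the sample lines are illustrative.

```scala
// Hypothetical stand-in for org.apache.spark.mllib.recommendation.Rating,
// so the sketch runs without Spark on the classpath.
case class Rating(user: Int, product: Int, rating: Double)

// Parse one "user::movie::rating::timestamp" line.
// Returns None for lines with the wrong number of fields or non-numeric values.
def parseRating(line: String): Option[Rating] =
  line.split("::") match {
    case Array(user, item, rate, _) =>
      scala.util.Try(Rating(user.toInt, item.toInt, rate.toDouble)).toOption
    case _ => None
  }

// Illustrative input: one well-formed line and two malformed ones.
val sample = Seq("1::1193::5::978300760", "not a rating line", "2::4::x::0")

// Malformed lines are silently dropped; only valid ratings remain.
val parsed = sample.flatMap(line => parseRating(line).toList)
```

In the Spark job, the same idea would presumably mean using flatMap instead of map on the RDD of lines, so that bad lines are skipped rather than failing the whole job.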

Training the Model

The rating data is split into two parts. The variable training holds about 47% of the data for training; the rest is kept for evaluating the model. We add the ratings of user U4 (the user for whom we are going to predict, stored in myRatingsRDD) for the movies he has seen so far, so that the model can learn U4’s taste in movies.

val rank = 8        // number of latent factors
val iteration = 10  // number of ALS iterations
val lambda = 0.01   // regularization parameter
// Train on the training split plus U4's own ratings.
val model = ALS.train(training.union(myRatingsRDD), rank, iteration, lambda)
val moviesIHaveSeen = myRatingsRDD.map(x => x.product).collect().toList
val moviesIHaveNotSeen = movies.filter { case (movieId, name) => !moviesIHaveSeen.contains(movieId) }.map(_._1)
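The 47% training share mentioned above can be checked with a few lines of plain Scala. The filter only looks at the last digit of userId * movieId, which depends only on the last digits of the two IDs. The sketch below assumes those last digits are spread evenly over 0 to 9, which is an assumption about the data rather than something the data set guarantees.

```scala
// All 100 combinations of last digits for (userId, movieId).
val digitPairs = for (u <- 0 to 9; m <- 0 to 9) yield (u, m)

// Same predicate as the training filter: (userId * movieId) % 10 <= 3.
val trainingCount = digitPairs.count { case (u, m) => (u * m) % 10 <= 3 }

// Share of pairs that end up in the training set: 47 of 100.
val trainingShare = trainingCount.toDouble / digitPairs.size
```

Because the split is deterministic, rerunning the job gives the same training and test sets, but the exact share on real data depends on how the IDs are actually distributed.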

Evaluating the Model

Now let us evaluate the model on the test data and compute the prediction error, measured as Root Mean Square Error (RMSE).

val predictedRates =
  model.predict(test.map { case Rating(user, item, rating) => (user, item) }).map {
    case Rating(user, product, rate) => ((user, product), rate)
  }.persist()

// Join actual and predicted ratings on the (user, product) key.
val ratesAndPreds = test.map { case Rating(user, product, rate) =>
  ((user, product), rate)
}.join(predictedRates)

val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) => Math.pow((r1 - r2), 2) }.mean()
// RMSE is the square root of the mean squared error.
val RMSE = math.sqrt(MSE)

The RMSE value was: 0.95.
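To see what the metric measures, here is the same calculation on a handful of (actual, predicted) pairs in plain Scala. The numbers are made up for illustration and have nothing to do with the MovieLens result above.

```scala
// (actual, predicted) rating pairs; made-up values for illustration only.
val pairs = Seq((5.0, 4.0), (3.0, 3.0), (4.0, 2.0), (1.0, 2.0))

// Mean of the squared differences between actual and predicted ratings.
val mse = pairs.map { case (actual, predicted) => math.pow(actual - predicted, 2) }.sum / pairs.size

// RMSE is the square root of the MSE, so it is in the same units as the ratings.
val rmse = math.sqrt(mse)
```

An RMSE close to 1 therefore means predictions are typically off by about one rating point on a 1 to 5 scale.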

Making Recommendation for User U4

Here a new user ID (0) stands for U4 and is paired with each movie U4 has not seen so far. The MatrixFactorizationModel is then used to predict ratings for those unseen movies. The predict method returns an RDD of Rating(user, item, rating), which is sorted by rating to recommend the best movies from the entire list.

val recommendedMoviesId = model.predict(moviesIHaveNotSeen.map { product =>
(0, product)}).map { case Rating(user,movie,rating) => (movie,rating) }
.sortBy( x => x._2, ascending = false).take(20).map( x => x._1)

The code above predicts ratings for (user, item) pairs where the user ID is 0, i.e. U4, and the items are all the movies he hasn’t watched. The results are then sorted by rating and the top 20 movie IDs are returned.
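Movie IDs alone are not very readable, so a natural last step is to look up the titles. The sketch below shows the sort, take and lookup steps on ordinary Scala collections. The predicted scores are made up, the titles come from the Movies table earlier in this post, and the names predicted, titles and topN are chosen for this example only.

```scala
// Hypothetical predicted scores keyed by movie ID (made-up values).
val predicted = Seq((1, 4.2), (2, 3.1), (3, 4.8), (4, 2.5))

// Titles taken from the Movies table earlier in the post.
val titles = Map(
  1 -> "Toy Story (1995)",
  2 -> "Jumanji (1995)",
  3 -> "Grumpier Old Men (1995)",
  4 -> "Waiting to Exhale (1995)"
)

val topN = 2

// Sort by score in descending order, keep the best topN, then map IDs to titles.
// IDs without a known title are skipped.
val recommended = predicted
  .sortBy { case (_, score) => -score }
  .take(topN)
  .flatMap { case (movieId, _) => titles.get(movieId).toList }
```

With the RDDs from this post, the equivalent would presumably be to collect the movies RDD into a map of ID to name and apply it to the recommended IDs.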
You can view the entire code example on GitHub.