Build your personalized movie recommender with Scala and Spark

In this blog I will explain what is a recommendation engine in general, and How to build a personalized recommendation model using Scala and Spark Collaborative filtering algorithm.

What is a Recommendation Engine?

I assume you’ve shopped online for books or visited movie review sites to pick top rated movies to watch. You must have been seen top rated movie lists which have been voted as best movies. That is not not a recommendation. When you browse through several categories or clicked several catchy posters and you get lines like, “People who bought/watched this also bought/watched that”. That is an example of recommendation. It assumes that if you share a similar taste with someone, you are going to like what they liked.


Preparing the Data Set

MovieLens is a database which was prepared by the GrpupLens Research Project at the University of Minnesota. The data set we are going to use for building the recommendation engine contains over hundred thousands rated movies. Each user has rated at least 20 movies. Ratings of different sizes (1 million,


User ID Movie ID Rating
1 1 5
1 2 3
2 2 3
2 4 4


Movie ID Name Genre
1 Toy Story (1995) Animation|Children’s|Comedy
2 Jumanji (1995) Adventure|Children’s|Fantasy
3 Grumpier Old Men (1995) Comedy|Romance
4 Waiting to Exhale (1995) Comedy|Drama

About the Algorithm:

To make a preference prediction for any user, Collaborative filtering  uses a preference by other users of  similar interests and predicts movies of your interests which is unknown to you. Spark MLlib uses Alternate Least Squares (ALS) to make recommendation. Here is a glimpse of collaborative filtering method:

Users M1 M2 M3 M4
U1 2 4 3 1
U2 0 0 4 4
U3 3 2 2 3
U4 2 ? 3 ?

Here user ratings on movies are represented as a matrix where a cell represents ratings for a particular movie by a user. The cell with “?” represents the movies which is user U4 is not aware of or haven’t seen. Based on the current preference of user U4 the cell with “?” can be filled with approximated rating of users which are having similar interest as user U4.

Building Recommendation Model

Spark MLlib uses Alternate Least Squares (ALS) to build recommendation model.

Loading the Data as Rating

import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating

val data = sc.textFile("your_dir/ratings.dat")
val ratings ="::") match { case Array(user, item, rate, timestamp) =>
  Rating(user.toInt, item.toInt, rate.toDouble)
val movies = _.split("::"))
  .map { case Array(movieId,movieName,genre) => (movieId.toInt ,movieName) }
val myRatingsRDD = movieRecommendationHelper.topTenMovies //gets 10 popular movies. See <a href="">Code</a> for for details
val training = ratings.filter { case Rating(userId, movieId, rating) => (userId * movieId) % 10 <= 3 }.persist val test = ratings.filter { case Rating(userId, movieId, rating) => (userId * movieId) % 10 > 3}.persist

Training the Model

The rating data is split into two part. The variable training contains 47% of data for training. Rest is kept for evaluating the model. We are going to join the Rating of user U4 (the user for which we are going to predict) for the movies he has seen so far so that we can learn taste of user U4 about movies.

val rank = 8
val iteration = 10
val lambda = 0.01
val model = ALS.train(training.union(myRatingsRDD), rank, iteration, lambda)
val moviesIHaveSeen = => x.product).collect().toList
val moviesIHaveNotSeen = movies.filter { case (movieId, name) => !moviesIHaveSeen.contains(movieId) }.map( _._1)

Evaluating the Model

Now let us evaluate the model on test data and get the prediction error. The prediction error is calculated as Root Mean Square Error (RMSE)

val predictedRates =
    model.predict( { case Rating(user,item,rating) => (user,item)} ).map { case Rating(user, product, rate) =>
      ((user, product), rate)

  val ratesAndPreds = { case Rating(user, product, rate) =>
    ((user, product), rate)

  val MSE = { case ((user, product), (r1, r2)) => Math.pow((r1 - r2), 2) }.mean()

The RMSE value was: 0.95.

Making Recommendation for User U4

Here a new user id (0) is associated with each movie user u4 has not seen so far. And MatrixFactorization model is used to predict values for the unknown movies. Method predict returns Rating(user,item,rating) RDD which is again sorted by Rating to recommend best movies out of entire list.

 val recommendedMoviesId = model.predict( { product =>
    (0, product)}).map { case Rating(user,movie,rating) => (movie,rating) }
    .sortBy( x => x._2, ascending = false).take(20).map( x => x._1)

The above code predict values of (user,Item) where user id is 0 i.e. of U4 and item is all movies he hasn’t watched. The result is further sorted by the rating and the top 20 movie Ids are returned.
You can view entire code example at: GitHub


[1] Collaborative Filtering  Apache Spark Documentation.

About Manish Mishra

Manish is a Scala Developer at Knoldus Software LLP. He loves to learn and share about Functional Programming, Scala, Akka, Spark.
This entry was posted in Scala and tagged , , , , , . Bookmark the permalink.

One Response to Build your personalized movie recommender with Scala and Spark

  1. sanketkleit says:

    Thanks for the article! Much helpful.
    Can you also do some other MLLib algorithms as recommender systems have already been done many times.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s