Introduction to Machine Learning with Spark (Clustering)

In this blog, we will learn how to group similar data objects using K-means clustering offered by Spark Machine Learning Library.


The code example needs only Spark Shell to execute.

What is Clustering

Clustering is like grouping data objects in some random clusters (with no initial class of group defined) on the basis of similarity or the natural closeness to each other. The “closeness” will be clear later in the blog.

Why Clustering

The reason I chose Clustering as beginning of ML that I find it most basic algorithm which requires minimal domain knowledge about your data and produce good analytic over it. Before diving deep into terminologies, lets look at a simple problem which we are going to solve with K means algorithm.

A certain real estate firm wants to build homes targeting different buyers. The only information that firm has collected is the income and the
current property locations of their potential buyers. The firm wants to group their potential buyers into some group which is not obvious by dataset itself.

Let us solve this case by using K Means Clustering algorithm offered by MLLib of Apache Spark.

This is how our dataset looks like:


Determining K

K is the number of groups that we want to partition our dataset into. For the case given above, it may be the number of different group of buyers we want out of our dataset.

How it forms different groups

1. Algorithm selects k random centroids (data objects) at first
2. Groups each data object into the closest centroid. The closeness is the distance between features of data objects.
3. Re calculate the value of centroid by taking mean of features of all of the data object that belong to the group.
4. Repeat 2 to 3 until number of iterations exhaust.

Preparing Input Data

K-means algorithm works only on numerical features and it is obvious because it needs to quantify the features in order to calculate distance between data objects. For simplicity our data set contains only numerical features. To run K-means on categorical data, we need to quantify categorical data by some encoding technique which is out of scope of this blog.

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors

val rawData = sc.textFile("file path")
val extractedFeatureVector = { row=> 

Setting parameters

We need to set the following parameters to run K means algorithm:

val numberOfClusters = 3 
val numberOfInterations = 50

The number of resultant clusters may be equal to numberOfCluster or less than number of clusters specified. It totally dependent on how dissimilarity of groups and similarity of objects within a group.

Running K-means

We can run K-means algorithm as:

//We use KMeans object provided by MLLib to run 
val model = KMeans.train(extractedFeatureVector, numberOfClusters, numberOfInterations)

Visualizing Results

model.train returns KMeansModel object by which we can get the following information about the clusters of the datasets.
We can get centres of the clusters as:


//results in


This result is the centre of three clusters having three different means of locations (long,lat) and three different means of income.

We can predict cluster of each data point it belongs to as:

//Get cluster index for each buyer Id
val buyersByCluster = {

results as tuple of cluster Index and Buyer Id as:


Now we can easily group the buyers according to the derived clusters.

About Manish Mishra

Manish is a Scala Developer at Knoldus Software LLP. He loves to learn and share about Functional Programming, Scala, Akka, Spark.
This entry was posted in Scala and tagged , , , , , , , . Bookmark the permalink.