Data Science & Spark :- Logistic Regression implementation for spam dataset

We are all somewhat familiar with the term Data Science, a field with great potential for new discoveries. The challenges of Data Science will only grow, given that the amount of data in the world keeps increasing. However, the development of tools and libraries to deal with these challenges can fairly be called revolutionary. One such product of this revolution is Spark. Over time we will discuss its implementations and experiments in detail.

Data Science, a very broad term in itself, is about retrieving information from the trails of data people leave behind in the virtual or physical world: for example, your product browsing history, or the list of items you have bought from a grocery store. As Alpaydin wrote, “What we lack in knowledge, we make up for in data”, and data is considered the cheapest raw material ever found 😉.

Now the question arises: what do we do with this data? How does analysing it help multinational corporations make a fortune out of it? The main purpose of the field, in general terms, is to understand the nature of the data in order to build better visualizations, structures and models, and so achieve highly accurate results or predictions.

Spark, for its part, provides MLlib, a library of machine learning functions that lets you invoke various algorithms on distributed datasets, since data in Spark is represented as RDDs.
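As a quick reminder of what an RDD looks like in practice, here is a minimal sketch (it assumes a SparkContext `sc` is already available, as it is in spark-shell; the sample strings are illustrative):

```scala
// Build a small RDD from an in-memory collection.
val messages = sc.parallelize(Seq("win a free prize now", "see you at lunch"))

// Transformations and actions run in a distributed fashion over the RDD.
val wordCounts = => m.split(" ").length)
println(wordCounts.collect().mkString(", "))
```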

In general, the machine learning pipeline can be pictured like this:


As the diagram above suggests, the most fundamental task is to understand the data and the relationships among its elements in order to extract features from it. Below is a simple example of using a logistic regression algorithm to differentiate between spam and non-spam messages. The dataset here consists of two text files containing plain-text spam and non-spam messages. The choice of algorithm for a given dataset depends on several factors, such as the problem type (classification, clustering, regression, etc.), the structure of the dataset, feature weight analysis and more. In our case, whatever the structure of the dataset, we want a binary output from the input variables.

Logistic regression is a statistical model that calculates the probability of an object belonging to a particular class. For our purposes, a text message can belong either to the spam category or to the normal-message category.
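Under the hood, the model passes a weighted sum of the features through the logistic (sigmoid) function to get that probability. Here is a minimal sketch in plain Scala (the names `sigmoid` and `predictProbability` are illustrative helpers, not part of MLlib):

```scala
// Logistic function: maps any real number into the interval (0, 1).
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

// Probability that a feature vector belongs to the positive class,
// given a learned weight vector of the same length.
def predictProbability(weights: Array[Double], features: Array[Double]): Double = {
  val z ={ case (w, x) => w * x }.sum // dot product w . x
  sigmoid(z)
}
```

A predicted probability above 0.5 is then mapped to class 1 (spam) and anything below it to class 0 (normal).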

The implementation steps are as follows :-

1. Add the MLlib dependency to the build.sbt file

name := """spark-examples"""

version := "1.0"

scalaVersion := "2.11.5"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.2.1",
  "org.apache.spark" %% "spark-mllib" % "1.2.1")

2. Load the data :-

Since the dataset consists of two files, each can be loaded with a command of the form

val data = sc.textFile("path to the file")

giving, say, a `spam` RDD and a `normal` RDD.

3. Feature Extraction :-

Feature extraction is the process of finding the key elements of the data that play a major role in the outcome. We can use HashingTF to map the text data to feature vectors, since LabeledPoint expects its features as a Vector.

val hashingTF = new HashingTF()

Map and transform the data for each of the two RDDs:

val spamFeatures = => hashingTF.transform(line.split(" ")))

val normalFeatures = => hashingTF.transform(line.split(" ")))

4. Labeled Points :-

Labeled points are local vectors paired with a label denoting the target value. For binary classification, each example must be labeled either 1 (positive) or 0 (negative).

val positive = => LabeledPoint(1, features))

val negative = => LabeledPoint(0, features))

val trainingData = positive.union(negative)

trainingData.cache() // logistic regression is iterative, so cache the training set

5. Model :-

The model we are implementing is a basic logistic regression classifier. We are using a binary classifier here; for a multi-class classifier, the labels would start from 0, 1, 2, 3 and so on.

val model = new LogisticRegressionWithSGD().run(trainingData)

6. Test Data :-

Now we just need to feed it a test message, transformed with the same HashingTF:

val message = hashingTF.transform("You have virus please reset your password".split(" "))


7. Result :-

Simply go to the project directory in a terminal and run sbt run.
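One detail the steps above leave implicit is the actual prediction call. Assuming the `model` and `message` values from steps 5 and 6, it would look like this:

```scala
// predict returns the class label: 1.0 for spam, 0.0 for a normal message
val prediction = model.predict(message)
println(s"Predicted class: $prediction")
```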


The source code can be downloaded from here

This entry was posted in Scala. Bookmark the permalink.
