Email Spam Detection Using Apache Spark MLlib


In this blog post we look at a real use case for Spark MLlib: email spam detection. Using the Apache Spark MLlib component, we will predict whether an email belongs in the spam folder or the primary folder.

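Let's jump into the code and see how it is implemented. First we load the training data from the spam dataset and the primary (ham) dataset. Each line of a file becomes one record in the resulting RDD, and the second argument asks for at least 4 partitions: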

val spam = sc.textFile("/home/sandy/Spark/enron1/spam/0052.2003-12-20.GP.spam.txt", 4)
val normal = sc.textFile("/home/sandy/Spark/enron1/ham/0022.1999-12-16.farmer.ham.txt", 4)

Next we use HashingTF to count how often each word occurs in a mail and turn those counts into a feature Vector (IDF could additionally be applied to down-weight very common words). These vectors are what we need to create the LabeledPoints for training:

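import org.apache.spark.mllib.feature.HashingTF
val tf = new HashingTF(numFeatures = 10000) // 10000 is an assumed feature-vector size; pick what suits your data
val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
val normalFeatures = normal.map(email => tf.transform(email.split(" ")))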
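
To make the hashing step less of a black box, here is a rough sketch of the idea behind HashingTF in plain Scala, with no Spark dependency. The names HashingTfSketch, featureIndex and termFrequencies are made up for illustration, and the hash is simply String.hashCode, which is not necessarily what Spark uses internally.

```scala
object HashingTfSketch {
  // Map a term to a bucket in [0, numFeatures) using its hash code.
  // Math.floorMod keeps the index non-negative even for negative hash codes.
  def featureIndex(term: String, numFeatures: Int): Int =
    Math.floorMod(term.hashCode, numFeatures)

  // Count how many words land in each bucket: a sparse term-frequency vector.
  // Different words can collide in one bucket; their counts are then added up.
  def termFrequencies(words: Seq[String], numFeatures: Int): Map[Int, Double] =
    words
      .groupBy(word => featureIndex(word, numFeatures))
      .map { case (index, hits) => index -> hits.size.toDouble }

  def main(args: Array[String]): Unit = {
    // Hypothetical mail text, split on spaces as in the post.
    val words = "free money free offer".split(" ").toSeq
    println(termFrequencies(words, numFeatures = 1000))
  }
}
```

The point of the trick is that no dictionary of words has to be built up front: every mail maps to a vector of the same fixed length, at the cost of occasional collisions.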

From these vectors we create LabeledPoints, which are the input to our model. Each LabeledPoint pairs a label (1 for spam, 0 for primary mail) with a feature vector:

import org.apache.spark.mllib.regression.LabeledPoint
val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features)) // LabeledPoint(label, features): 1 = spam
val negativeExamples = normalFeatures.map(features => LabeledPoint(0, features)) // LabeledPoint(label, features): 0 = primary

Now we union both sets of labeled points into a single training set for the model:

val trainingData = positiveExamples.union(negativeExamples)
trainingData.cache()

We cache the data because logistic regression is an iterative algorithm: it passes over the same RDD many times, and caching keeps the RDD in memory so it does not have to be recomputed on every iteration.

Next we create a LogisticRegressionWithSGD instance and train it on the data:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
val model = new LogisticRegressionWithSGD().run(trainingData)
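
For a feel of what training does under the hood, here is a minimal sketch of logistic regression fitted with plain full-batch gradient descent in ordinary Scala. It is a simplification and not MLlib's implementation: there is no intercept, no regularization and no mini-batching, and the names (LogisticSgdSketch, Example, train) as well as the step size and iteration count are assumptions chosen for illustration.

```scala
object LogisticSgdSketch {
  // One training example: label is 1.0 (spam) or 0.0 (primary).
  final case class Example(label: Double, features: Array[Double])

  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Full-batch gradient descent on the average logistic loss.
  // stepSize and iterations are illustrative values, not MLlib's defaults.
  def train(data: Seq[Example], stepSize: Double = 0.5, iterations: Int = 200): Array[Double] = {
    val dim = data.head.features.length
    var weights = Array.fill(dim)(0.0)
    for (_ <- 1 to iterations) {
      // Gradient: mean over all examples of (prediction - label) * features.
      val gradient = Array.fill(dim)(0.0)
      for (ex <- data) {
        val error = sigmoid(dot(weights, ex.features)) - ex.label
        for (j <- 0 until dim) gradient(j) += error * ex.features(j) / data.size
      }
      weights = weights.zip(gradient).map { case (w, g) => w - stepSize * g }
    }
    weights
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical two-feature data: counts of ("free", "meeting") per message.
    val data = Seq(
      Example(1.0, Array(3.0, 0.0)),
      Example(1.0, Array(2.0, 0.0)),
      Example(0.0, Array(0.0, 2.0)),
      Example(0.0, Array(0.0, 1.0))
    )
    println(train(data).mkString(", "))
  }
}
```

Every iteration walks over the whole dataset again, which is exactly the access pattern that makes caching the training RDD worthwhile.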

Now we have a model that predicts whether an email goes to the spam folder or the primary folder. In the example below we pass the text of two emails and let the model predict:

val posTest = tf.transform(
  "insurance plan which change your life ...".split(" "))
val negTest = tf.transform(
  "hi sorry yaar i forget tell you i cant come today".split(" "))
println("Prediction for positive test example: " + model.predict(posTest))
println("Prediction for negative test example: " + model.predict(negTest))

Above we predict for both a spam mail and a primary mail. Because spam was labeled 1 and primary mail 0 during training, an output of 1 means the mail is spam and an output of 0 means it is primary mail.
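
As a rough illustration of where the 0/1 output comes from, a logistic regression model turns a weighted sum of the features into a probability and compares it to a cut-off. The sketch below is plain Scala, not MLlib's code; the weights, the feature values and the 0.5 threshold are hypothetical, and the labels follow the ones used when building the LabeledPoints (1 for spam, 0 for primary mail).

```scala
object SpamPredictionSketch {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Probability of the positive class (label 1, i.e. spam).
  def spamProbability(weights: Array[Double], features: Array[Double],
                      intercept: Double = 0.0): Double = {
    val margin = weights.zip(features).map { case (w, x) => w * x }.sum + intercept
    sigmoid(margin)
  }

  // Turn the probability into a 0/1 label; 0.5 is an assumed cut-off.
  def predict(weights: Array[Double], features: Array[Double],
              threshold: Double = 0.5): Double =
    if (spamProbability(weights, features) > threshold) 1.0 else 0.0

  def main(args: Array[String]): Unit = {
    // Hypothetical weights for two features: counts of ("free", "meeting").
    val weights = Array(1.2, -0.8)
    println(predict(weights, Array(2.0, 0.0))) // margin 2.4, probability above 0.5: prints 1.0 (spam)
    println(predict(weights, Array(0.0, 3.0))) // margin -2.4, probability below 0.5: prints 0.0 (primary)
  }
}
```

Raising the threshold makes the classifier more cautious about calling a mail spam, which can be useful when a lost legitimate mail is worse than a missed spam.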


To play with the source code, just grab it from here.

About sandeep

I'm working as a software consultant at Knoldus Software LLP. I work with Scala, Play, Spark, Hive, HDFS, Hadoop and many other big data technologies.