Email Spam Detection Using Apache Spark MLlib

In this blog we will look at a real use case for Spark MLlib: email spam detection. Using the Apache Spark MLlib component, we will predict whether an email should go to the spam folder or the primary folder.

Let's jump into the code and see how it is implemented. First we load the training data from the spam dataset and the primary (ham) dataset as follows:

val spam = sc.textFile("/home/sandy/Spark/enron1/spam/0052.2003-12-20.GP.spam.txt", 4)
val normal = sc.textFile("/home/sandy/Spark/enron1/ham/0022.1999-12-16.farmer.ham.txt", 4)

Next we use HashingTF (term frequency), optionally followed by IDF weighting, to turn the words of each mail into a feature Vector. These vectors are then used to create the LabeledPoints for training.

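import org.apache.spark.mllib.feature.HashingTF
val tf = new HashingTF(numFeatures = 10000) // tf was not defined in the original snippet; 10000 is an assumed size
val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
val normalFeatures = normal.map(email => tf.transform(email.split(" ")))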
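To make the idea behind the hashing step more concrete, here is a minimal plain-Scala sketch of the hashing trick. It is an illustration only and not Spark's HashingTF implementation: the names HashingSketch, nonNegativeMod and termFrequencies are hypothetical, and it uses Scala's built-in hashCode, whereas Spark uses its own hash function, so the bucket indices will generally differ.

```scala
// Illustrative sketch of the hashing trick; not Spark's HashingTF code.
object HashingSketch {
  // Map a hash code into the range [0, mod), even when the hash is negative.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Count how often each bucket index occurs among the words of one email.
  // The result is a sparse vector stored as index -> count.
  def termFrequencies(words: Seq[String], numFeatures: Int): Map[Int, Double] =
    words
      .map(word => nonNegativeMod(word.hashCode, numFeatures))
      .groupBy(identity)
      .map { case (index, hits) => index -> hits.size.toDouble }
}

// Example input (made up for illustration): "buy" appears twice.
val counts = HashingSketch.termFrequencies(
  "buy cheap insurance buy now".split(" ").toSeq, 1000)
```

Two different words can land in the same bucket (a collision), which is why a larger number of features usually gives a cleaner representation at the cost of bigger vectors.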

With these vectors we create the LabeledPoints, which are the input for our model. We create the labeled points as follows:

import org.apache.spark.mllib.regression.LabeledPoint
val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features)) // LabeledPoint(label, features): label 1 = spam
val negativeExamples = normalFeatures.map(features => LabeledPoint(0, features)) // LabeledPoint(label, features): label 0 = primary

Now we union both sets of labeled points to form the input to the model, so that it is trained on both spam and primary emails:

val trainingData = positiveExamples.union(negativeExamples)
trainingData.cache()

We cache the data because logistic regression is an iterative algorithm that makes repeated passes over the same RDD. With the RDD cached, Spark does not have to recompute it on every iteration, which makes training faster.

Next we create a LogisticRegressionWithSGD and train it on the data as below.

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
val model = new LogisticRegressionWithSGD().run(trainingData)
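As a rough picture of what a trained logistic regression model does at prediction time, here is a minimal plain-Scala sketch. It is not MLlib's code: LogisticSketch, its predict signature, the weights and the 0.5 threshold are assumptions chosen for illustration. The idea is to take a weighted sum of the features, squash it with the sigmoid function, and compare the result against a threshold.

```scala
// Illustrative sketch of logistic regression prediction; not MLlib's implementation.
object LogisticSketch {
  // Squash any real number into the interval (0, 1).
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // features is a sparse vector stored as index -> value.
  // Returns 1.0 when the estimated probability exceeds the threshold, else 0.0.
  def predict(weights: Array[Double],
              intercept: Double,
              features: Map[Int, Double],
              threshold: Double = 0.5): Double = {
    val margin = features.foldLeft(intercept) {
      case (acc, (index, value)) => acc + weights(index) * value
    }
    if (sigmoid(margin) > threshold) 1.0 else 0.0
  }
}

// Hypothetical weights: bucket 0 pushes towards spam, bucket 1 towards primary.
val weights = Array(2.0, -1.5, 0.0)
val spamLike = LogisticSketch.predict(weights, 0.0, Map(0 -> 2.0)) // margin is 4.0
val hamLike = LogisticSketch.predict(weights, 0.0, Map(1 -> 1.0))  // margin is -1.5
```

Training is the process of finding weights like these from the labeled points, which is what the run call above does using stochastic gradient descent.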

Now we have a model that can predict whether an email will go to the spam folder or the primary folder. In the example below we pass in the text of an email and let the model predict.

val posTest = tf.transform(
  "insurance plan which change your life ...".split(" "))
val negTest = tf.transform(
  "hi sorry yaar i forget tell you i cant come today".split(" "))
println("Prediction for positive test example: " + model.predict(posTest))
println("Prediction for negative test example: " + model.predict(negTest))

Above we predict for both a spam mail and a primary mail. Because spam examples were labeled 1 and primary examples 0 when we built the LabeledPoints, an output of 1 means the mail is spam and an output of 0 means it is a primary mail.
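Since the labeled points were built with label 1 for spam and label 0 for primary mail, the numeric prediction can be turned into a folder name with a small helper. The function folderFor below is a hypothetical name used only for illustration; it is not part of Spark.

```scala
// Hypothetical helper: translate the model's numeric output into a folder name.
// Assumes the labeling used when the LabeledPoints were created: 1.0 = spam, 0.0 = primary.
def folderFor(prediction: Double): String =
  if (prediction == 1.0) "spam" else "primary"

val spamFolder = folderFor(1.0)
val primaryFolder = folderFor(0.0)
```

With the model from this post it could be called as folderFor(model.predict(posTest)).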