We are all somewhat familiar with the term Data Science, as it is turning out to be a field with the potential for new discoveries. The challenges of Data Science will only keep evolving, given that the amount of data in the world is only going to increase. However, the development of tools and libraries to deal with these challenges can fairly be called revolutionary. One such product of this revolution is Spark. Over time, we will be discussing its implementation and experiments in detail.
Data Science, a very broad term in itself, is about retrieving information from the trails of data people leave behind in the virtual or physical world. For example, it could be your product browsing history, or the list of items you have bought from a grocery store. As Alpaydin wrote, “What we lack in knowledge, we make up for in data”, and data is considered to be the cheapest raw material ever found 😉.
Now the question arises: what do we do with this data? How does analysing it help multinational corporations make a fortune out of it? The main purpose of this field, generally speaking, is to understand the nature of the data in order to build better visualizations, structures and models that achieve highly accurate results or predictions.
Spark, for its part, provides MLlib, a library of machine learning functions that lets one invoke various algorithms on distributed datasets, since data in Spark is represented in the form of RDDs.
In general, the machine learning pipeline can be pictured as follows:
As the diagram above suggests, the most fundamental task is to understand the data and the relationships among its elements in order to extract features from it. Here is a simple example of the logistic regression algorithm used to differentiate between spam and non-spam messages. The datasets provided here are two text files containing plain examples of spam and non-spam messages. The choice of algorithm for a model depends on several factors, such as the problem statement (classification, clustering, regression etc.), the structure of the dataset, feature weight analysis and more. Here, given the structure of our dataset, we want a binary output from the input variables.
Logistic regression is a statistical model that calculates the probability of an object belonging to a particular class. In our case, a text message can belong to either the spam or the normal message category.
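At the heart of logistic regression is the sigmoid function, which squashes any real-valued score into a probability between 0 and 1. A minimal plain-Scala sketch (the function name and values are illustrative, not part of MLlib's API):

```scala
// The logistic (sigmoid) function: maps any real score to a probability in (0, 1).
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

// A score of 0 sits exactly on the decision boundary between the two classes.
val p = sigmoid(0.0) // 0.5
```

Scores far above zero yield probabilities close to 1 (spam), and scores far below zero yield probabilities close to 0 (normal message).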
Following are the steps for the implementation :-
1. Introduce the MLlib dependency in the build.sbt file
name := """spark-examples"""

version := "1.0"

scalaVersion := "2.11.5"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.2.1",
  "org.apache.spark" %% "spark-mllib" % "1.2.1")
2. Load the data :-
Each dataset can be loaded with the textFile command
val spam = sc.textFile("Path to the spam file")
val normal = sc.textFile("Path to the non-spam file")
3. Feature Extraction :-
Feature extraction is the process of finding the key elements of the data that will play a major role in the outcome. We can use HashingTF to map the text data to feature vectors, since labeled points expect their features as a Vector.
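Under the hood, HashingTF applies the "hashing trick": each term is hashed into one of a fixed number of buckets, and the vector stores the term frequency per bucket. A hypothetical plain-Scala sketch of the idea (function name, bucket count and sample text are illustrative, not MLlib's actual implementation):

```scala
// Hashing-trick term frequencies: hash each word into one of numFeatures
// buckets and count occurrences per bucket.
def termFrequencies(words: Seq[String], numFeatures: Int = 16): Array[Double] = {
  val vec = Array.fill(numFeatures)(0.0)
  for (w <- words) {
    // Non-negative bucket index, safe even for negative hash codes.
    val idx = ((w.hashCode % numFeatures) + numFeatures) % numFeatures
    vec(idx) += 1.0
  }
  vec
}

val v = termFrequencies("free free prize".split(" "))
// "free" appears twice, so its bucket holds at least 2.0
```

The fixed vector size is what makes this scalable: no dictionary of all possible words needs to be built or shipped to the cluster.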
val hashingTF = new HashingTF()
Map and transform the data
val spamFeatures = spam.map(data => hashingTF.transform(data.split(" ")))
val normalFeatures = normal.map(data => hashingTF.transform(data.split(" ")))
4. Labeled Points :-
Labeled points are local vectors paired with a label, used to denote the target values. In binary classification the labels should be either 1 (positive) or 0 (negative).
val positive = spamFeatures.map(features => LabeledPoint(1, features))
val negative = normalFeatures.map(features => LabeledPoint(0, features))
val trainingData = positive.union(negative)
5. Model :-
The model we are implementing is a basic logistic regression classifier. We are using a binary classifier here; for a multi-class classifier, the labels would start from 0, 1, 2, 3 and so on.
val model = new LogisticRegressionWithSGD().run(trainingData)
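The SGD in LogisticRegressionWithSGD stands for stochastic gradient descent: the weights are nudged against the gradient of the log loss for each example. A hypothetical plain-Scala sketch of a single update step (the function name, learning rate and toy values are illustrative, not MLlib internals):

```scala
// One stochastic-gradient-descent step for logistic regression:
// move the weights against the log-loss gradient for a single example.
def sgdStep(weights: Array[Double], features: Array[Double],
            label: Double, lr: Double = 0.1): Array[Double] = {
  val z = weights.zip(features).map { case (w, x) => w * x }.sum
  val p = 1.0 / (1.0 + math.exp(-z))       // predicted probability of class 1
  val err = p - label                      // gradient scale for log loss
  weights.zip(features).map { case (w, x) => w - lr * err * x }
}

// Starting from zero weights, a positive example pushes the weight up...
val up = sgdStep(Array(0.0), Array(1.0), label = 1.0)
// ...and a negative example pushes it down.
val down = sgdStep(Array(0.0), Array(1.0), label = 0.0)
```

Repeating such steps over many passes of the training RDD is, conceptually, what the run(trainingData) call does in parallel across the cluster.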
6. Test Data :-
Now we just need to feed it a test message, transformed with the same HashingTF
val message = hashingTF.transform("You have virus please reset your password".split(" "))
7. Result :-
Simply go to the project directory in a terminal and run sbt run
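The prediction itself comes from calling model.predict(message), which returns 1.0 for spam and 0.0 for a normal message. Internally, that call amounts to roughly the following plain-Scala sketch (the weights and feature values here are toy numbers for illustration, not output from a real trained model):

```scala
// What predict boils down to: a dot product of the learned weights with the
// feature vector, passed through the sigmoid and thresholded at 0.5.
def predict(weights: Array[Double], features: Array[Double]): Double = {
  val z = weights.zip(features).map { case (w, x) => w * x }.sum
  if (1.0 / (1.0 + math.exp(-z)) >= 0.5) 1.0 else 0.0
}

val weights  = Array(1.5, -0.5)   // toy weights, not from a real model
val features = Array(2.0, 1.0)    // toy feature vector
predict(weights, features)        // 1.0 => classified as spam
```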
The source code can be downloaded from here