A machine learning task often involves several steps, such as extracting features from raw data, training models on those features, and running predictions with the trained models. The pipeline API provided by Spark makes it easier to combine and tune multiple ML algorithms in a single workflow.
What is in this blog?
We will create a sample ML pipeline that extracts features from raw data and applies the K-Means clustering algorithm to group the data points.
The code examples used in this blog can be executed on a spark-shell running Spark 1.5 or higher (the pipeline version of K-Means, org.apache.spark.ml.clustering.KMeans, was added in Spark 1.5).
Basics of Spark ML pipeline API
A DataFrame is the Spark SQL data type used to represent datasets in an ML pipeline. A DataFrame stores structured data in named columns, and can be created from structured data files, Hive tables, external databases, or existing RDDs.
A Transformer converts a DataFrame into another DataFrame with one or more columns added to it. For example, OneHotEncoder transforms a column of label indices into a column of binary vectors. Every Transformer has a transform() method, which is called to transform one DataFrame into another.
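A minimal sketch of calling a Transformer, assuming a DataFrame named indexed that already has a numeric genderIndex column (both names are assumptions for illustration):

```scala
import org.apache.spark.ml.feature.OneHotEncoder

// OneHotEncoder is a Transformer: it needs no training
val encoder = new OneHotEncoder()
  .setInputCol("genderIndex")
  .setOutputCol("genderVec")

// transform() returns a new DataFrame with the extra "genderVec" column added
val encoded = encoder.transform(indexed)
```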
An Estimator is a learning algorithm that learns from the training data. An Estimator accepts a dataset to be trained on and produces a model, which is a Transformer. For example, K-Means is an Estimator that accepts a training DataFrame and produces a K-Means model, which is a Transformer. Every Estimator has a fit() method, which is invoked to learn from the data.
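The Estimator pattern looks like this, assuming a DataFrame named training with a vector column called features (again, names chosen for illustration):

```scala
import org.apache.spark.ml.clustering.KMeans

// KMeans is an Estimator: it must be fitted before it can make predictions
val kmeans = new KMeans()
  .setK(2)
  .setFeaturesCol("features")

// fit() learns from the data and returns a KMeansModel, which is a Transformer
val model = kmeans.fit(training)
```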
Each pipeline consists of an array of stages, where each stage is either an Estimator or a Transformer that operates on DataFrames.
Input Dataset for Pipeline
We will use the following sample DataFrame as our input data. Each row in the DataFrame represents a customer with three attributes: email, income, and gender.
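A sample input DataFrame can be built in spark-shell as follows; the customer values here are made up for illustration (sqlContext and sc are provided by the shell):

```scala
import sqlContext.implicits._

// Build a small customer DataFrame with columns: email, income, gender
val df = sc.parallelize(Seq(
  ("alice@example.com", 45000.0, "F"),
  ("bob@example.com",   52000.0, "M"),
  ("carol@example.com", 38000.0, "F"),
  ("dave@example.com",  61000.0, "M")
)).toDF("email", "income", "gender")

df.show()
```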
The aim is to cluster this dataset into similar groups using the K-Means clustering algorithm available in Spark MLlib. The sequence of tasks involves:
- Converting categorical attribute labels into label indexes
- Converting categorical label indexes into numerical vectors
- Combining all numerical features into a single feature vector
- Fitting a K-Means model on the extracted features
- Predicting with the K-Means model to get a cluster for each data row
Creating Pipeline of Tasks:
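The five tasks above can be sketched as a single pipeline. This is one possible implementation, assuming the input DataFrame from the previous section is named df; the stage output column names (genderIndex, genderVec, features, cluster) are assumptions, not fixed by the API:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.ml.clustering.KMeans

// Stage 1: convert the categorical "gender" labels into label indexes
val indexer = new StringIndexer()
  .setInputCol("gender")
  .setOutputCol("genderIndex")

// Stage 2: convert the label indexes into a numerical (one-hot) vector
val encoder = new OneHotEncoder()
  .setInputCol("genderIndex")
  .setOutputCol("genderVec")

// Stage 3: combine the numerical features into a single feature vector
// (email is an identifier, so it is left out of the features)
val assembler = new VectorAssembler()
  .setInputCols(Array("income", "genderVec"))
  .setOutputCol("features")

// Stage 4: K-Means estimator; fitting it produces a KMeansModel transformer
val kmeans = new KMeans()
  .setK(2)
  .setFeaturesCol("features")
  .setPredictionCol("cluster")

// Chain the stages; fit() runs each stage in order on the input DataFrame
val pipeline = new Pipeline()
  .setStages(Array(indexer, encoder, assembler, kmeans))

val model = pipeline.fit(df)
val clustered = model.transform(df)
```

Note that StringIndexer is itself an Estimator (it must scan the data to build the label index), while OneHotEncoder and VectorAssembler are plain Transformers; the Pipeline handles both kinds uniformly.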
Pipeline stages and Output:
We have the following pipeline stages generated from the above code. At each stage, the DataFrame is transformed and becomes the input to the next stage:
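As an illustrative check, assuming the transformed DataFrame is named clustered and the stage output columns are named as in the sketch above, each stage's contribution can be seen in the final schema:

```scala
// Each stage adds one column (names are assumptions for illustration):
//   StringIndexer:    gender            -> genderIndex
//   OneHotEncoder:    genderIndex       -> genderVec
//   VectorAssembler:  income, genderVec -> features
//   KMeansModel:      features          -> cluster
clustered.printSchema()
clustered.select("email", "income", "gender", "cluster").show()
```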