One of the most common uses of machine learning algorithms is classification, which comes in a couple of varieties. Binary classification assigns a given set of inputs to one of two classes; if there are more than two classes, it is multiclass classification. AWS ML supports both kinds.
To try out AWS Machine Learning, we downloaded a dataset from Kaggle. It contains data about a large data network in which events occur constantly at various locations. Each event has a set of properties like event type and severity type, plus a few other unlabelled features identified only by indexes. Given these features, the aim is to build a model that classifies an event as belonging to one of three severity levels. The dataset contains a total of about 7,000 training examples.
As the first step in the process, we decided the input format of each feature in the training dataset. We took a naive approach and mapped the provided data directly to input features, without introducing any higher-order variables. AWS ML supports numeric, binary, categorical, and text data types. The prepared training data was as follows:
- Number of attributes: 458
- Binary: 69
- Categorical: 3
- Numeric: 386
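AWS ML is told about these feature types through a DataSchema JSON document attached to the DataSource. A minimal sketch of what such a schema looks like for this dataset; the attribute names here are illustrative stand-ins, not the real 458 columns:

```python
import json

# Sketch of an AWS ML DataSchema for the CSV described above.
# Attribute names are hypothetical; the real dataset has 458 attributes.
schema = {
    "version": "1.0",
    "rowId": "id",                      # example identifier, excluded from training
    "targetAttributeName": "severity",  # the multiclass target y
    "dataFormat": "CSV",
    "dataFileContainsHeader": True,
    "attributes": [
        {"attributeName": "id", "attributeType": "CATEGORICAL"},
        {"attributeName": "event_type", "attributeType": "CATEGORICAL"},
        {"attributeName": "severity_type", "attributeType": "CATEGORICAL"},
        {"attributeName": "feature_1", "attributeType": "NUMERIC"},
        {"attributeName": "feature_2", "attributeType": "BINARY"},
        {"attributeName": "severity", "attributeType": "CATEGORICAL"},
    ],
}

schema_json = json.dumps(schema, indent=2)
print(schema_json)
```

The schema file can sit alongside the CSV in S3 and be referenced when the DataSource is created.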
The target variable y of the model is categorical, representing the severity level of the network event.
The training data was prepared and saved in a CSV file. The AWS ML workflow is as follows: once the input data is in the AWS ML format, we upload it to an S3 bucket and then create a DataSource in the AWS ML console. A DataSource holds the metadata about the training examples and their features, and is backed by a data store, in this case the CSV file in the S3 bucket; Amazon Redshift is also supported as a data store for training examples. When specifying the DataSource, we identify the type of each feature (binary, categorical, and so on), the target variable y, and any feature that serves as an identifier for the example. AWS takes a few minutes to process the data and prepare the DataSource.
Once the DataSource is prepared, we are ready to move on to the next step, which is to create a model. We created a multiclass classifier ML model using the default feature transformation recipe that Amazon generates. AWS also gives us the option to write the recipe ourselves, specifying how a particular feature should be transformed into a format suitable for the ML model.
Once we have the model, we can use it to classify examples. AWS ML can automatically divide the example data into a training set and an evaluation set; by default the split is 70:30, though we can specify a different ratio and control which examples are used for training and which for evaluation. The model AWS ML generated for around 5,000 training examples with 458 features was about 100 MB. By default, AWS ML used L2 regularization with λ = 1e-6 and the stochastic gradient descent (SGD) algorithm to discover the model parameters, making 10 passes over the data during training. If we choose, we can customize training by specifying the recipe for feature transformation, the type of regularization (L1 or L2), the regularization amount, and the number of passes over the data.
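These defaults correspond to concrete values in the AWS ML API: training parameters passed to `create_ml_model` (with `MLModelType="MULTICLASS"`), and a DataRearrangement JSON string that controls the split on a DataSource. A sketch of both, mirroring the defaults described above:

```python
import json

# Training parameters accepted by AWS ML's create_ml_model call.
# Values mirror the defaults mentioned in the text.
training_parameters = {
    "sgd.maxPasses": "10",                   # passes of SGD over the data
    "sgd.l2RegularizationAmount": "1e-6",    # λ for L2 regularization
    # "sgd.l1RegularizationAmount": "1e-6",  # use this key instead for L1
}

# DataRearrangement controls which slice of a DataSource is used.
# percentBegin/percentEnd select the slice; 70:30 is the default split.
train_split = json.dumps({"splitting": {"percentBegin": 0, "percentEnd": 70}})
eval_split = json.dumps({"splitting": {"percentBegin": 70, "percentEnd": 100}})

print(training_parameters)
print(train_split, eval_split)
```

These dictionaries and strings would be passed as the `Parameters` argument of `create_ml_model` and the `DataRearrangement` field of the two DataSources, respectively.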
Model training uses 70% of the examples. After the model is trained, the remaining 30% are run through it to evaluate its quality: each example's feature values are given as input, the model outputs a classification, and that output is compared against the example's known y value. The results of the evaluation for our model are shown in the screenshot below.
As we can see above, we have three categories, labelled 0, 1, and 2. The effectiveness of classification algorithms is generally not measured with the traditional accuracy measure. Instead we calculate two values, precision and recall, which are in turn used to compute another quantity called the F1 score (see https://en.wikipedia.org/wiki/Precision_and_recall for an excellent description of precision, recall, and F1). A higher F1 score indicates better performance of the classification algorithm. A binary classification problem normally has a single F1 value; since we are doing multiclass classification, we get as many F1 values as there are classes: F1 = 0.83, F1 = 0.44, and F1 = 0.58 for the three classes. The average of these per-class values is called the macro-average F1 score, which comes out to 0.62.
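The arithmetic behind these numbers is simple. A small sketch computing F1 from precision and recall (the 0.80/0.86 pair is an invented example, not taken from our evaluation), and the macro average from the three per-class scores reported above:

```python
def f1(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Illustrative only: a class with precision 0.80 and recall 0.86 gives F1 ≈ 0.83.
example_f1 = f1(0.80, 0.86)

# Macro-average F1: the unweighted mean of the per-class F1 scores.
per_class_f1 = [0.83, 0.44, 0.58]
macro_f1 = sum(per_class_f1) / len(per_class_f1)
print(round(macro_f1, 2))  # → 0.62
```

Because the macro average weights every class equally, the weak class 1 score (0.44) pulls the overall figure well below the strong class 0 score.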
Training and evaluating this model cost us $1.19 on AWS ML. Creating the DataSource, training the ML model, and evaluating it took about 30 minutes. The resulting model can be used to process batch data or to classify examples one at a time.
We took a very straightforward and simple approach to this ML problem, using the defaults offered by AWS ML. Our aim here was to see how we could achieve this using Amazon's AWS ML, how easy it would be, how long it would take, and how much it would cost. Obviously, a lot more could be done: feature selection, Principal Component Analysis (PCA) to reduce the number of dimensions, introducing higher-order terms as features, and trying out various values of λ to discover an optimal model that neither overfits nor underfits, among other things. In conclusion, we found that AWS ML gives us a very simple interface that can be used to train a model and start using it quickly.