Machine Learning with Decision Trees: Implementation


Smile is a machine learning library written in Java that also provides a Scala API, although most of its classes are implemented in Java. To know more about Smile, please go through this introductory blog.

In this blog, we will implement a decision tree using the Smile library. To know more about decision trees, you can check out this blog.

Quickstart 

To use Smile, we need to include the following dependency in our sbt project:

libraryDependencies += "com.github.haifengl" %% "smile-scala" % "1.4.0"

Ingredients

The main ingredient for a decision tree is the training data, on the basis of which the nodes and branches of the tree are created. The training data must be in attribute-value format. In this blog, I’ll refer to the collection of attribute values that yields a single response as an instance.

So let’s start. Smile can read Weka’s ARFF format, CSV format, plain text files, JDBC result sets and many other formats. The read object in the smile package of the Scala API provides methods like arff(), csv() and jdbc() to parse our data. The syntax goes like this:

val weather = read.arff("src/main/resources/weather.nominal.arff", 4)

So here we are reading the weather data in ARFF format. You might be wondering what the second parameter (value 4) of this arff() method is. It is the index at which the response value of each training instance is found. To explain this, let’s take a quick look at the data we are using:

[Screenshot: the weather.nominal.arff file, attribute declarations followed by the data instances]

In this file, the data has two parts: the first part declares the attributes and the second part contains the instances. The order of the attribute declarations is important, since each instance stores its values in the same order.

Indexing starts at position 0, so the response attribute ‘play’ is at index 4.

For example, in {sunny, hot, high, FALSE, no} the response ‘no‘ is at index 4.

After parsing the data, we get an object of the AttributeDataset class, in which all values have been converted to numbers. The response values are converted to Int (for attribute play, {yes, no} => {0, 1}) and all other attribute values to Double (for attribute windy, {TRUE, FALSE} => {0.0, 1.0}). Hence our instance {sunny, hot, high, FALSE, no} converts to {0.0, 0.0, 0.0, 1.0, 1}.
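To make that mapping concrete, here is a plain-Scala sketch of the encoding (illustration only, not Smile’s API; the value orderings mirror the attribute declarations in the ARFF file above):

```scala
// Illustration of how nominal values become numeric codes: each attribute's
// declared values are numbered in declaration order, and an instance is
// encoded by replacing every value with its index.
object EncodingSketch {
  // Each nominal attribute's values, in ARFF declaration order (index = code).
  val attributeValues: Map[String, Vector[String]] = Map(
    "outlook"     -> Vector("sunny", "overcast", "rainy"),
    "temperature" -> Vector("hot", "mild", "cool"),
    "humidity"    -> Vector("high", "normal"),
    "windy"       -> Vector("TRUE", "FALSE"),
    "play"        -> Vector("yes", "no")
  )

  def encode(attribute: String, value: String): Int =
    attributeValues(attribute).indexOf(value)

  def main(args: Array[String]): Unit = {
    val instance = Vector("outlook" -> "sunny", "temperature" -> "hot",
      "humidity" -> "high", "windy" -> "FALSE", "play" -> "no")
    val encoded = instance.map { case (attr, v) => encode(attr, v) }
    println(encoded.mkString("{", ", ", "}")) // prints {0, 0, 0, 1, 1}
  }
}
```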

The next step is to extract the training instances and the response values separately. For this we have the overloaded toArray() methods of the AttributeDataset class:

val trainingInstances = weather.toArray(Array(new Array[Double](weather.size())))
val responseValues = weather.toArray(new Array[Int](weather.size()))

Here we are using two overloaded forms of the toArray() method:

1. The first form takes a parameter of type Array[Array[Double]] and returns the training instances, i.e. an array containing one array of attribute values per instance.

If that sounds confusing, printing the result makes it clearer: the output looks like Array(Array(0.0, 0.0, 0.0, 1.0), …), with one inner array per row of the data and one element per attribute column. (If it is still unclear, the link to the full code is at the end of this blog.)
2. The second form takes a simple Array[Int] and returns all the response values in one array.
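To make the two shapes concrete, here is a small plain-Scala illustration (hypothetical hand-encoded rows, not produced by Smile):

```scala
// The two array shapes described above, built by hand for illustration.
object ShapesSketch {
  // trainingInstances: one inner array per data row, one element per attribute.
  val trainingInstances: Array[Array[Double]] = Array(
    Array(0.0, 0.0, 0.0, 1.0), // {sunny, hot, high, FALSE}
    Array(0.0, 0.0, 0.0, 0.0)  // {sunny, hot, high, TRUE}
  )
  // responseValues: one Int per data row.
  val responseValues: Array[Int] = Array(1, 1) // {no, no}

  def main(args: Array[String]): Unit = {
    println(trainingInstances.map(_.mkString("[", ", ", "]")).mkString(" "))
    println(responseValues.mkString(", "))
  }
}
```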

Actual Training

After getting the training instances and their response values, we can build the decision tree. For that, we have the cart() method from the smile.classification package, which returns an object of the DecisionTree class. It needs the following additional parameters:

val maxNodes = 200
val splitRule = DecisionTree.SplitRule.ENTROPY
val attributes = weather.attributes()

The variable ‘maxNodes‘ is the maximum number of leaf nodes the tree may have; capping it keeps the tree from growing too large (and overfitting) when the training data is big.

The split rule we use to measure information gain while building the tree is ENTROPY. The last parameter is an array of Attribute objects.

‘attributes’ is optional here, as cart() can derive the attributes from the training data itself. So here is the cart() call:

val decisionTree = cart(trainingInstances, responseValues, maxNodes, attributes, splitRule)

This gives us a trained instance of the DecisionTree class, built from the data we provided.
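The ENTROPY split rule scores candidate splits using the entropy of the class distribution. Here is a minimal plain-Scala sketch of that measure (just the formula, not Smile’s implementation):

```scala
// Entropy of a set of class labels: H = -sum(p_i * log2(p_i)).
// A pure split (all one class) scores 0; a 50/50 split scores 1.
object EntropySketch {
  def entropy(labels: Seq[Int]): Double = {
    val n = labels.size.toDouble
    labels.groupBy(identity).values.map { group =>
      val p = group.size / n
      -p * math.log(p) / math.log(2)
    }.sum
  }

  def main(args: Array[String]): Unit = {
    // The classic weather data has 9 "yes" and 5 "no" responses.
    val responses = Seq.fill(9)(0) ++ Seq.fill(5)(1)
    println(f"${entropy(responses)}%.3f") // prints 0.940
  }
}
```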

Prediction

Once training is complete, we can predict the responses for our test data. The DecisionTree class provides a predict() method, which takes a single instance (an array of attribute values) and returns the predicted response.
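Conceptually, predict() walks from the root to a leaf, testing one attribute at each internal node. Here is a toy sketch of that walk (illustration only, not Smile’s actual data structures):

```scala
// A toy decision-tree: internal nodes test one attribute against a threshold,
// leaves hold a response. predict() recurses until it reaches a leaf.
sealed trait Node
case class Leaf(response: Int) extends Node
case class Split(attributeIndex: Int, threshold: Double, left: Node, right: Node) extends Node

object TreeSketch {
  def predict(node: Node, instance: Array[Double]): Int = node match {
    case Leaf(response) => response
    case Split(i, t, left, right) =>
      if (instance(i) <= t) predict(left, instance) else predict(right, instance)
  }

  // A tiny hand-built tree that splits on attribute 3 ("windy" in our encoding).
  val tree: Node = Split(3, 0.5, Leaf(0), Leaf(1))

  def main(args: Array[String]): Unit = {
    println(predict(tree, Array(0.0, 0.0, 0.0, 1.0))) // prints 1
  }
}
```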

Now we need test data to verify that our decision tree actually works:

[Screenshot: the data section of the weatherTest.nominal.arff file]

This is the data section of our weatherTest.nominal.arff file. I have deliberately included some instances with wrong responses (12 in total), so our decision tree should catch these and report the number of errors as 12.

First, let’s load the test data like we did for training data.

val weatherTest = read.arff("src/main/resources/weatherTest.nominal.arff", 4)
val testInstances = weatherTest.toArray(Array(new Array[Double](weatherTest.size())))
val testResponseValues = weatherTest.toArray(new Array[Int](weatherTest.size()))

Strictly speaking, we do not need the variable ‘testResponseValues‘, since the trained tree can predict a response for each instance on its own. We use these values only to check the tree’s predictions.

So here we predict the outcome for each instance and compare it with the corresponding expected response:

val error = testInstances.zip(testResponseValues).count {
  case (testInstance, response) => decisionTree.predict(testInstance) != response
}
println("Number of errors in test data is " + error)

And the output is:

[Screenshot: console output reporting the number of errors as 12]

So we can see here that our decision tree is working fine.

And if we only want the predictions, we do not need the array of response values at all:

val decisions = testInstances.map { instance =>
  decisionTree.predict(instance) match {
    case 0 => "play"
    case 1 => "not playable weather"
  }
}.toList

From this, we get the list of decisions, based on our instances.

Where is the Tree?

So the decision tree is trained, and we have tested it against correct and incorrect inputs. But can we actually see where the splitting happens and what the tree looks like, as we would in a data mining tool?

Yes, we can!

The DecisionTree class has another method, dot(), which per the documentation “Returns the graphic representation in Graphviz dot format”. We can take the string returned by dot() and paste it into Viz.js, which is “a makefile for building Graphviz with Emscripten and a simple wrapper for using it in any browser”. So let’s do that.

digraph DecisionTree {
 node [shape=box, style="filled, rounded", color="black", fontname=helvetica];
 edge [fontname=helvetica];
 0 [label=<… score = 0.3476>, fillcolor="#00000000"];
 1 [label=<…>, fillcolor="#00000000", shape=ellipse];
 0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"];
 2 [label=<… score = 0.4200>, fillcolor="#00000000"];
 0 -> 2 [labeldistance=2.5, labelangle=-45, headlabel="False"];
 3 [label=<…>, fillcolor="#00000000", shape=ellipse];
 2 -> 3;
 4 [label=<… score = 0.9183>, fillcolor="#00000000"];
 2 -> 4;
 5 [label=<…>, fillcolor="#00000000", shape=ellipse];
 4 -> 5;
 6 [label=<…>, fillcolor="#00000000", shape=ellipse];
 4 -> 6;
}

This is the output of the dot() method. Now paste it into Viz.js:

[Screenshot: the decision tree rendered by Viz.js]

So that is our decision tree right there.

So that’s all for this simple implementation of a decision tree using Smile. Here is the link to the repository with the sample code.

I hope it helped you!

#mlforscalalovers

Thank You 🙂




5 Responses to Machine Learning with Decision Trees: Implementation

  1. Haifeng Li says:

    Thanks for the great article! A couple of minor improvements are:

    1. If you just download the prebuilt smile package, the test data used here can be loaded by the path “data/weka/weather.nominal.arff”.

    2. To extract the data out of the AttributeDataset, an easier way is to use unzipInt (or the unzipDouble method when the response variable is real-valued). For example: val (trainingInstances, responseValues) = weather.unzipInt

    3. There are several helper functions in the smile.validation package, for example test(), test2(), test2soft(), loocv(), cv(), bootstrap(), etc. They are handy for testing the trained model and reporting many metrics.
