Machine learning with Decision Trees

Reading Time: 6 minutes

Machine learning is the subfield of computer science that gives computers the ability to learn without being programmed.”    – Arthur Samuel, 1959

Machine learning is a buzzword in the technology world right now. It is fun, challenging, puzzling, and even a bit scary if you’re one of those people that believe robots will someday steal our jobs and rule the world. Whether we like it or not, we are surrounded by adaptive smart things that can fix some of our most common daily queries in a split of a second.


Machine Learning or we can call it the famous ‘SKYNET’ from the Terminator franchise. Some are afraid of the fantasy becoming real and some are excited about a whole new world of opportunities we’ll get in the AI world. This future AI may be want to eradicate the entire human race (Just a pun) but for now, we can achieve much more than we can imagine with it.

Google self-drive car, Facebook face recognition, Amazon recommendations, speech recognition- Siri and Cortana, Fraud detection by PayPal. The list is long!

So that was a brief about ML, now let’s begin with Decision trees – Machine learning technique.

What are Decision trees?

Simply put, a decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision.

It is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems and works for both categorical and continuous input and output variables. It is one of the most widely used and practical methods for Inductive Inference. (Inductive inference is the process of reaching a general conclusion from specific examples.)

Decision trees learn and train itself from given examples and predict for unseen examples.


Graphical representation of a sample decision tree could be like this

Image result for decision tree

Decision tree algorithm- ID3

ID3 stands for Iterative Dichotomiser 3. The ID3 algorithm was invented by Ross Quinlan. It builds a decision tree from a fixed set of examples and the resulting tree is used to classify future samples.

The basic idea is to construct the decision tree by employing a top-down, greedy search through the given sets to test each attribute at every tree node.

Sounds simple? But which node to select to build the correct and precise decision tree? How would we decide that?

Well, we have some measures that would help us in selecting the best choice!

Entropy – In information theory, entropy is a measure of the uncertainty about a source of messages. It gives us the degree of disorganization in our data.

Given a collection S, containing positive and negative examples of some target concept, the entropy of S relative to this boolean classification is

Screenshot from 2017-08-12 22-58-25

Here, p+ and p- are the proportion of positive and negative examples in S.

Consider this entropy function relative to a boolean classification,
as the proportion, p+, of positive examples varies
between 0 and 1.

Screenshot from 2017-08-12 23-11-21

Now, notice that the entropy is 0 if all members of S belong to the same class. For example, if all members are positive (p+ = 1), then p- is 0, and

Entropy(S) = -1 . log2(1) – 0 . log2 0 = -1 . 0 – 0 . log2 0 = 0.

The entropy is 1 when the collection contains an equal number of positive and negative examples.

If the collection contains unequal numbers of positive and negative examples, the entropy is between 0 and 1.

Information gain – It measures the expected reduction in entropy. It decides which attribute goes into a decision node. To minimize the decision tree depth, the attribute with the most entropy reduction is the best choice!

More precisely, the information gain, Gain(S, A) of an attribute A, relative to a collection of examples S, is defined as

Screenshot from 2017-08-12 23-53-05

S is each value v of all possible values of attribute A
Sv = subset of S for which attribute A has value v
|Sv| = number of elements in Sv
|S| = number of elements in S

Let’s see how these measures work!

Suppose we want ID3 to decide whether the weather is amenable to playing baseball. Over the course of 2 weeks, data is collected to help ID3 build a decision tree. See the following table:

The target classification is “should we play baseball?” which can be yes or no.






play baseball

D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

The weather attributes are outlook, temperature, humidity, and wind speed. They can have the following values:

outlook = { sunny, overcast, rain }

temperature = {hot, mild, cool }

humidity = { high, normal }

wind = {weak, strong }

We need to find which attribute will be the root node in our decision tree.

Entropy(S) = – (9/14) Log2 (9/14) – (5/14) Log2 (5/14) = 0.940

Gain(S,Wind) = Entropy(S) – (8/14)*Entropy(Sweak)  – (6/14)*Entropy(Sstrong)

= 0.940 – (8/14)*0.811 – (6/14)*1.00

= 0.048
Entropy(Sweak)  = - (6/8)*log2(6/8) – (2/8)*log2(2/8) = 0.811

Entropy(Sstrong) = - (3/6)*log2(3/6) – (3/6)*log2(3/6) = 1.00

For each attribute, the gain is calculated and the highest gain is used in the decision.

Gain(S, Outlook) = 0.246

Gain(S, Temperature) = 0.029

Gain(S, Humidity) = 0.151

Clearly, Outlook attribute has the highest gain, therefore it is used as the decision attribute in the root node.

Since Outlook has three possible values, the root node has three branches (sunny, overcast, rain). The next question is “what attribute should be tested at the Sunny branch node?” Since we=92ve used Outlook at the root, we only decide on the remaining three attributes: Humidity, Temperature, or Wind.

Ssunny = {D1, D2, D8, D9, D11} = 5 examples from the table with outlook = sunny

Gain(Ssunny, Humidity) = 0.970

Gain(Ssunny, Temperature) = 0.570

Gain(Ssunny, Wind) = 0.019

Humidity has the highest gain; therefore, it is used as the decision node. This process goes on until all data is classified perfectly or we run out of attributes.

The decision tree can also be expressed in rule format:

IF outlook = sunny AND humidity = high THEN play baseball = no

IF outlook = rain AND humidity = high THEN play baseball = no

IF outlook = rain AND wind = strong THEN play baseball = yes

IF outlook = overcast THEN play baseball = yes

IF outlook = rain AND wind = weak THEN play baseball = yes

So, that was a quick start to decision trees. It’s as simple as depicted above. I hope it would help you! 🙂


  1. Machine Learning – Tom Mitchell

Written by 

Tech Enthusiast