In this blog, we are going to discuss the Decision Tree algorithm, a supervised learning algorithm that can be used to solve both regression and classification problems.
A classification algorithm, in general, is a function that weighs the input features so that the output separates one class into positive values and the other into negative values.
Introduction to Decision Tree Algorithm
A decision tree is a graphical representation of all possible solutions to a decision.
The objective of using a Decision Tree is to create a training model that can be used to predict the class or value of the target variable by learning simple decision rules inferred from the training data.
It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome.
Let’s understand this using a scenario. Suppose a person plans to go out on the weekend. The following terms describe the parts of the resulting tree:
- Root Node: Represents the entire population, sample, or dataset, which is then divided into two or more homogeneous sets.
- Decision Node: A sub-node that splits into further sub-nodes.
- Leaf / Terminal Node: A final output node. It cannot be split any further.
- Splitting: The process of dividing a node into two or more sub-nodes.
- Pruning: The process of removing unwanted sub-nodes of a decision node. It is the opposite of splitting.
- Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
- Parent and Child Node: A node that is divided into sub-nodes is called the parent of those sub-nodes, and the sub-nodes are its children.
Building a Decision Tree Classifier
Before building the decision tree classifier, let's first understand how it works.
- Step 1: Begin the tree with the root node, say S, which contains the complete dataset.
- Step 2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
- Step 3: Divide S into subsets, one for each possible value of the best attribute.
- Step 4: Generate the decision tree node that contains the best attribute.
- Step 5: Recursively build new decision trees from the subsets of the dataset created in Step 3.
Continue this process until the nodes cannot be split any further. Each such final node is called a leaf node.
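The steps above can be sketched in plain Python. This is only a minimal illustration, not how scikit-learn builds its trees: the function names (`entropy`, `best_attribute`, `build_tree`) and the tiny "go out on the weekend" dataset are made up for this example, and information gain (described in the next section) is assumed as the ASM.

```python
from collections import Counter
import math


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_attribute(rows, labels, attributes):
    """Pick the attribute with the highest information gain (the assumed ASM)."""
    parent = entropy(labels)

    def gain(attr):
        remainder = 0.0
        for value in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return parent - remainder

    return max(attributes, key=gain)


def build_tree(rows, labels, attributes):
    """Recursively build a tree as nested dicts; leaves are class labels."""
    # Stop when the node is pure or there is nothing left to split on.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    attr = best_attribute(rows, labels, attributes)
    remaining = [a for a in attributes if a != attr]
    node = {attr: {}}
    # One branch (subset) per value of the chosen attribute.
    for value in sorted(set(row[attr] for row in rows)):
        idx = [i for i, row in enumerate(rows) if row[attr] == value]
        node[attr][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining
        )
    return node


# Hypothetical toy data: should the person go out on the weekend?
rows = [
    {"weather": "sunny", "money": "yes"},
    {"weather": "sunny", "money": "no"},
    {"weather": "rainy", "money": "yes"},
    {"weather": "rainy", "money": "no"},
]
labels = ["go out", "go out", "stay in", "stay in"]

tree = build_tree(rows, labels, ["weather", "money"])
print(tree)
```

In this toy data the weather alone decides the outcome, so the sketch places `weather` at the root and turns both branches into leaf nodes.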
Attribute Selection Measures
If the dataset consists of N attributes, deciding which attribute to place at the root and at each internal level of the tree is a complicated step. Picking an attribute at random for the root node tends to produce poor splits and low accuracy.
So a big question here is how to select the best attribute for the root node and for sub-nodes?
The answer is Attribute Selection Measures (ASM).
Using an ASM, we select the best attribute for each node of the tree. There are several ASM techniques, such as Entropy, Information Gain, Gini Index, Gain Ratio, Reduction in Variance, and Chi-Square. Two of the most popular are:
- Information Gain
- Gini Index
Information gain is the reduction in entropy (or surprise) obtained by splitting a dataset on an attribute. It is commonly used in training decision trees.
Constructing a decision tree is all about finding the attribute that returns the highest information gain, that is, the split whose child nodes have the smallest entropy.
Information Gain = entropy(parent) - [weighted average entropy(children)]
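Written out with the standard definition of entropy, the formula looks as follows. The symbols ($S$, $A$, $S_v$, $p_j$) and the numbers in the worked example are illustrative assumptions, not values from the car dataset: a parent node with 5 positive and 5 negative samples is split into two children of 5 samples each, one with 4 positives and 1 negative and the other with 1 positive and 4 negatives.

```latex
\begin{align*}
H(S) &= -\sum_{j} p_j \log_2 p_j \\
IG(S, A) &= H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v) \\[6pt]
H(\text{parent}) &= -\left(0.5 \log_2 0.5 + 0.5 \log_2 0.5\right) = 1 \\
H(\text{child}_1) = H(\text{child}_2) &= -\left(0.8 \log_2 0.8 + 0.2 \log_2 0.2\right) \approx 0.722 \\
IG &\approx 1 - \left(\tfrac{5}{10}\cdot 0.722 + \tfrac{5}{10}\cdot 0.722\right) = 0.278
\end{align*}
```

A split that produced two perfectly pure children would instead give an information gain of 1, the full entropy of the parent.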
The Gini index is a function that measures how well a split separates the classes. It helps us determine which split is best so that we can build a decision tree with pure nodes.
For a two-class problem, Gini impurity ranges from 0 to 0.5.
An attribute with a lower Gini index should be preferred over one with a higher Gini index.
$$\text{Gini Index} = 1 - \sum_j P_j^2$$
where $P_j$ is the proportion of samples in the node that belong to class $j$.
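As a quick check of the formula, here is a small sketch that computes the Gini index for a list of class labels. The function name `gini` and the example labels are hypothetical and only serve to illustrate the two extremes for a two-class node.

```python
from collections import Counter


def gini(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


pure_node = ["acc", "acc", "acc", "acc"]       # only one class present
mixed_node = ["acc", "acc", "unacc", "unacc"]  # two classes, evenly mixed

print(gini(pure_node))   # a pure node has the lowest possible impurity
print(gini(mixed_node))  # an even two-class mix has the highest
```

The pure node gives 0 and the evenly mixed two-class node gives 0.5, which matches the range stated above.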
Now, let's build our decision tree.
We are using the Car Evaluation Data Set to build a decision tree classifier model that will predict the safety of the car.
You can download the data from here.
Let's load the dataset into a pandas DataFrame.
import pandas as pd

# header=None tells pandas that the file has no header row
data = 'car_evaluation.csv'
df = pd.read_csv(data, header=None)
df.head()
After loading the dataset, we will do some data pre-processing, such as changing the column names.
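Renaming the columns could look like the sketch below. The column names are an assumption taken from the UCI description of the Car Evaluation data (six attributes plus the `class` label), and the two-row DataFrame is a made-up stand-in for the real file so that the snippet runs on its own.

```python
import pandas as pd

# Stand-in for the headerless CSV (two illustrative rows, not real data).
df = pd.DataFrame([
    ["vhigh", "vhigh", "2", "2", "small", "low", "unacc"],
    ["low", "low", "4", "4", "big", "high", "vgood"],
])

# Assumed column names, following the UCI dataset description.
col_names = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]
df.columns = col_names

print(df.head())
```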
Splitting the data
Now we will define our target variable and split our dataset.
Our target variable will be ‘class’, and the remaining columns form our feature vector.
X = df.drop(['class'], axis=1)
y = df['class']
Now, let's split our dataset 8:2, i.e., 80% for training and 20% for testing.
The shape of training and testing data will be:
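A minimal sketch of the 80/20 split, using scikit-learn's `train_test_split`, is shown here. The ten-row DataFrame is a made-up stand-in for the car data, and `random_state=42` is an arbitrary choice to make the split reproducible, so the shapes printed here are not those of the real dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative stand-in for the car data (ten made-up rows).
df = pd.DataFrame({
    "buying": ["vhigh", "high", "med", "low", "low", "vhigh", "high", "med", "low", "med"],
    "safety": ["low", "med", "high", "high", "low", "med", "high", "low", "med", "high"],
    "class": ["unacc", "unacc", "acc", "vgood", "unacc", "unacc", "acc", "unacc", "acc", "good"],
})

X = df.drop(["class"], axis=1)
y = df["class"]

# test_size=0.2 keeps 20% of the rows for testing and 80% for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```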
Training Decision Tree Classifier Model
We will train our classifier model using two ASMs (Attribute Selection Measures).
Before training, we need to encode the categorical variables of the training dataset.
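One way to do the encoding is scikit-learn's `OrdinalEncoder`, which maps each category to a number. This is an assumption for illustration, since the post does not say which encoder was used, and the small frames below are made-up stand-ins for `X_train` and `X_test`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative stand-ins for the categorical train and test features.
X_train = pd.DataFrame({"buying": ["low", "high", "med"],
                        "safety": ["low", "high", "med"]})
X_test = pd.DataFrame({"buying": ["med"], "safety": ["high"]})

encoder = OrdinalEncoder()

# Learn the category-to-number mapping on the training data only,
# then apply the same mapping to the test data.
X_train_enc = pd.DataFrame(encoder.fit_transform(X_train),
                           columns=X_train.columns, index=X_train.index)
X_test_enc = pd.DataFrame(encoder.transform(X_test),
                          columns=X_test.columns, index=X_test.index)

print(X_train_enc)
print(X_test_enc)
```

By default the encoder orders categories alphabetically, so `high` comes before `low`. If the natural order matters, the `categories` parameter can be used to pass it explicitly.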
Decision Tree Classifier with ASM Gini index
Let's build and train our classifier model using the Gini index as the criterion.
# import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

gini_classifier = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)

# fit the model
gini_classifier.fit(X_train, y_train)
Okay, let's check the accuracy of the trained model on the training and testing data.
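The accuracy check could be sketched with `accuracy_score` as below. The tiny, already-encoded dataset is made up so the snippet runs on its own, which means the numbers it prints say nothing about the accuracy on the real car data.

```python
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Made-up, already-encoded toy data (not the car dataset).
X_train = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y_train = ["unacc", "unacc", "acc", "acc", "acc", "acc"]
X_test = [[0, 1], [2, 0]]
y_test = ["unacc", "acc"]

gini_classifier = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
gini_classifier.fit(X_train, y_train)

# Accuracy = fraction of samples whose predicted label matches the true label.
train_accuracy = accuracy_score(y_train, gini_classifier.predict(X_train))
test_accuracy = accuracy_score(y_test, gini_classifier.predict(X_test))

print(f"Training accuracy: {train_accuracy:.4f}")
print(f"Testing accuracy:  {test_accuracy:.4f}")
```

Comparing the two scores is a simple overfitting check: a training accuracy far above the testing accuracy suggests the tree has memorised the training data.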
Decision Tree Classifier with ASM Entropy
Let's build and train our classifier model using entropy as the criterion.
entropy_classifier = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)

# fit the model
entropy_classifier.fit(X_train, y_train)
As before, we check the accuracy of the trained model on the training and testing data.
Based on the above analysis we can conclude that our classification model accuracy is very good.
Our trained classification model is very good at predicting the class labels.
We can also generate a classification report for the model to evaluate it.
It reports the precision, recall, F1-score, and support for each class, which tells us about the underlying distribution of values and the type of errors our classifier is making.
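A sketch of how such a report can be produced with scikit-learn follows. The true and predicted labels are made up for illustration, and the confusion matrix is an extra added here because it shows the individual error counts directly.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up true and predicted labels (not results from the car dataset).
y_test = ["unacc", "unacc", "acc", "acc", "unacc"]
y_pred = ["unacc", "acc", "acc", "acc", "unacc"]

labels = ["acc", "unacc"]

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_test, y_pred, labels=labels)
print(cm)

# Per-class precision, recall, F1-score, and support.
print(classification_report(y_test, y_pred, labels=labels))
```

In this toy example the only mistake is one `unacc` sample predicted as `acc`, which appears as the off-diagonal count in the second row of the matrix.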
So, in this blog we have learned about the CART algorithm, i.e., the Decision Tree algorithm. It can be used to solve both regression and classification problems. We also saw how the algorithm works, what attribute selection measures are, and what role an ASM plays in building a decision tree classifier. Then we built our own classification model to predict the safety of the car using two ASMs, got good accuracy with it, and prepared a classification report for the model.