In my previous blog, we successfully built a decision tree from the given data. But the story doesn’t end there. We cannot simply fit the data as it comes, because doing so leads to overfitting in the decision tree.
Overfitting – what and why?
If a decision tree is fully grown, it may lose some generalization capability. This is a phenomenon known as overfitting.
“Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h’ ∈ H, such that h has smaller error than h’ over the training examples, but h’ has a smaller error than h over the entire distribution of instances.”
It simply means a hypothesis overfits the training examples if some other hypothesis that fits the training examples less well actually performs better over the entire distribution of instances (i.e., including instances beyond the training examples).
The figure below illustrates the impact of overfitting in a typical application of decision tree learning. Suppose we build our decision tree from the given training examples and it fits every one of them, so it gives 100% accuracy on that data. But when we evaluate the same tree on unseen sample data, the accuracy drops drastically.
What could be the reason for this difference in accuracy?
The answer is “overfitting of the training examples”. In striving for 100% accuracy while building the DTree, we overfitted the data and ended up with lower accuracy on unseen data, or, we could say, a wrong decision tree.
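This accuracy gap is easy to reproduce. Below is a minimal sketch using scikit-learn with a noisy synthetic dataset (the dataset and parameters are illustrative, not the tree from the previous post):

```python
# Demonstrate the train/test accuracy gap of a fully grown decision tree.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Label noise (flip_y) makes the overfitting effect easy to see.
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# A fully grown tree: no depth limit, so it keeps splitting until it
# classifies (essentially) every training example correctly.
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # typically 1.0 (100%)
test_acc = tree.score(X_test, y_test)     # noticeably lower on unseen data
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The tree has memorized the noise in the training set, so its perfect training accuracy does not carry over to unseen instances.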
Causes of Overfitting
There are two major situations that could cause overfitting in DTrees:
- Overfitting Due to Presence of Noise – Mislabeled instances may contradict the class labels of other similar records.
- Overfitting Due to Lack of Representative Instances – Lack of representative instances in the training data can prevent refinement of the learning algorithm.
A good model must not only fit the training data well but also accurately classify records it has never seen.
How to avoid overfitting?
There are two major approaches to avoiding overfitting in DTrees.
- approaches that stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data.
- approaches that allow the tree to overfit the data, and then post-prune the tree.
Before looking into these approaches, let us first understand what pruning is.
PRUNING – what and how?
Pruning is a technique that reduces the size of decision trees by removing sections of the tree that provide little power to classify instances. Pruning reduces the complexity of the final classifier, and hence improves predictive accuracy by the reduction of overfitting.
How? – Consider each of the decision nodes in the tree as a candidate for pruning.
Pruning a decision node consists of removing the subtree rooted at that node, making it a leaf node, and assigning it the most common classification of the training examples affiliated with that node.
Nodes are removed only if the resulting pruned tree performs no worse than the original over the validation set.
Pruning of nodes continues until further pruning is harmful (i.e., decreases the accuracy of the tree over the validation set).
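The validation-set procedure above (often called reduced-error pruning) can be sketched in a few lines. The tree representation below is a hand-rolled dict structure assumed purely for illustration; for simplicity the root itself is never pruned:

```python
def predict(node, x):
    """Walk the tree; leaves are class labels, internal nodes test a feature."""
    while isinstance(node, dict):
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node

def accuracy(tree, X, y):
    return sum(predict(tree, x) == t for x, t in zip(X, y)) / len(y)

def candidates(tree):
    """Yield (parent, side, child) for every internal node below the root."""
    stack = [tree]
    while stack:
        node = stack.pop()
        for side in ("left", "right"):
            child = node[side]
            if isinstance(child, dict):
                yield node, side, child
                stack.append(child)

def prune(tree, X_val, y_val):
    """Replace subtrees with leaves as long as the pruned tree performs
    no worse than the original over the validation set."""
    improved = True
    while improved:
        improved = False
        base = accuracy(tree, X_val, y_val)
        for node, side, child in candidates(tree):
            leaf = child["majority"]   # most common training class at that node
            node[side] = leaf          # tentatively prune the subtree
            if accuracy(tree, X_val, y_val) >= base:
                improved = True        # keep the pruned version and restart
                break
            node[side] = child         # otherwise restore the subtree
```

Each internal node stores the majority class of the training examples that reached it, so pruning a subtree collapses it to that label, exactly as described above.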
Now let’s get back to the approaches to deal with overfitting.
- First one is called pre-pruning. Typical stopping conditions for a node could be:
– stop if all instances belong to the same class.
– stop if all the feature values are the same.
- The second approach is called post-pruning. In this, we grow the decision tree in its entirety and then trim the nodes of the decision tree in a bottom-up fashion.
– If generalization error improves after trimming, replace sub-tree by a leaf node.
– Class label of a leaf node is determined from majority class of instances in the subtree.
Either of these approaches can be used to prune your DTree. Although the first might seem more direct, the second approach of post-pruning overfit trees has been found to be more successful in practice. This is because, in the first approach, it is difficult to estimate precisely when to stop growing the tree.
- Machine Learning – Tom Mitchell