Fundamentals Of Classification Models Part-2

Reading Time: 3 minutes

This article is a continuation of “Fundamentals of Classification Models Part – 1”. You should go through that part before learning about classifier models.

Classifier Models

As discussed in the previous article, “we prepare the data for training the algorithm”: the first step is to pre-process and clean the data. The cleaning this dataset needs is to map the string names of the penguin species to integer values so the algorithm can classify them properly. We also need to drop the observations containing NaN values.

# Map species names to integers; unmapped values are left as NaN
d = {'Adelie': 1, 'Chinstrap': 2, 'Gentoo': 3}
penguins["species"] = penguins["species"].map(d, na_action='ignore')
# Drop rows (axis=0) that contain NaN values
penguins = penguins.dropna(axis=0)
This prepares the data for training the algorithm.

Next, we need to split the data into training and testing sets. By default, 75% of the data becomes the training data and the remaining 25% is the test data.

from sklearn.model_selection import train_test_split
# X (the feature columns) and y (the species labels) were prepared in Part 1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In this article we will go through the Random Forest classifier and calculate its accuracy. Let us dive into the code.

Random Forest Classifier

As a basic introduction to the Random Forest classifier, we can say that the algorithm works on the “insight of people in a group”: a thousand random people will collectively give a more accurate aggregated result than a single expert.

Random Forest classifiers build an ensemble of decision trees and predict the aggregated result of all the trees. Each decision tree uses a random subset of features from the data to perform the classification.

Random Forests use Bagging in their algorithm. Bagging involves drawing random subsets of the dataset with replacement and training each decision tree on one of these subsets. The n_estimators parameter decides the number of decision trees in the Random Forest.

from sklearn.ensemble import RandomForestClassifier
rfclf = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0)
rfclf.fit(X_train, y_train)

We can also find out which features matter most in the classification by using the feature_importances_ attribute.

for name, score in zip(X_train.columns, rfclf.feature_importances_):
    print(name, score)

We can improve the accuracy of these trees by boosting them. Here we use Gradient Boosting.

from sklearn.ensemble import GradientBoostingClassifier
gbclf = GradientBoostingClassifier(max_depth=3, n_estimators=10)
gbclf.fit(X_train, y_train)
gbclf.score(X_test, y_test)

If you run the code, it will print the model's accuracy on the test set.
We can improve our model by fine-tuning the balance between all the parameters of our algorithm. Many techniques exist, but by far the most popular are Cross-Validation and Grid Search.

In the cross-validation method, we split the training dataset further into a smaller training set and a validation set. This process is repeated several times, and what we get is a more reliable estimate of the model's performance, which helps us choose better parameters.
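The repeated split-and-validate process described above can be sketched with scikit-learn's cross_val_score. The dataset here is synthetic (make_classification) so the example is self-contained; with the penguins data you would pass X_train and y_train instead.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the penguins features and labels
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
clf = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0)

# cv=5 splits the data into 5 folds; each fold serves once as the
# validation set while the other 4 folds are used for training
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())

The mean of the fold scores is a steadier measure of performance than a single train/test split.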

Coming to Grid Search, scikit-learn's GridSearchCV provides a great way to fine-tune your parameters: it finds the combination of parameter values that yields the highest cross-validated score.

We just need to specify which parameters we want to experiment with and the list of values we want to juggle between. The function runs the model with every combination of parameter values and returns the one with the best score.
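As a minimal sketch of that workflow, again on a self-contained synthetic dataset (the parameter grid below is illustrative, not tuned):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=4, random_state=0)

# Candidate values to juggle between
param_grid = {"n_estimators": [5, 10, 20], "max_depth": [2, 3, 4]}

# GridSearchCV tries every combination, scoring each with 3-fold CV
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
grid.fit(X, y)

print(grid.best_params_)  # the best combination found
print(grid.best_score_)   # its mean cross-validated score

After fitting, grid.best_estimator_ holds a model refit with the winning parameters, ready for prediction.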

We can evaluate the models using methods such as Confusion Matrices, finding a good Precision-Recall tradeoff, and measuring the Area Under the ROC Curve.
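Two of those evaluation tools, the confusion matrix and per-class precision/recall, can be sketched as follows, again on a synthetic dataset so the example runs on its own:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GradientBoostingClassifier(max_depth=3, n_estimators=10)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are the true classes, columns the predicted classes
print(confusion_matrix(y_test, y_pred))
# Precision, recall and F1 score for each class
print(classification_report(y_test, y_pred))

The off-diagonal cells of the confusion matrix show exactly which classes the model confuses, which is more informative than a single accuracy number.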

Thank you for reading the blog. Hope you have a great learning journey ahead.