MachineX: Evaluation Metrics for Classification Models


In our last blog post, we looked at various evaluation metrics for regression models. Continuing from there, this post looks at the evaluation metrics used for classification models.

Classification is about predicting class labels given input data. In binary classification there are two possible output classes, whereas in multi-class classification there are more than two. We focus here on binary classification; however, all of these metrics can be extended to the multi-class scenario.

A simple example of binary classification is spam detection, where the input is an email and the output is a label stating whether the email is “spam” or “not spam”. Some people also use generic names for the two classes, such as “True” and “False”.

There are many ways to measure classification model performance, such as accuracy, the confusion matrix, log loss, AUC, and precision-recall.


Accuracy

Accuracy, in simple terms, is the number of data points our classifier predicted correctly divided by the total number of data points in the set.
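As a minimal sketch of this definition, the snippet below counts matching predictions and divides by the total. The function name `accuracy` and the toy label lists are assumptions made for illustration (1 stands for “spam”, 0 for “not spam”); they do not come from any particular library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    # Count the positions where the prediction equals the true label.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 1 = "spam", 0 = "not spam"
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]

acc = accuracy(y_true, y_pred)
print(acc)  # 6 of the 8 predictions match, so this prints 0.75
```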

Confusion Matrix

Although accuracy looks good, it does not give a clear picture of whether, and how, the different classes were treated differently. In some scenarios we are fine with knowing only the overall accuracy, whereas in others the cost of misclassifying a single data point is huge. For example, when a bank decides whether a customer is eligible for a loan, it may be acceptable to misclassify a few eligible customers as not eligible. But when a doctor classifies patients as having cancer or not, it would be a blunder to declare a potential cancer patient cancer-free.

This is where the confusion matrix comes in: it gives a more detailed breakdown of correct and incorrect classifications for each class. So let’s try to understand the confusion matrix with minimum confusion.

FALSE POSITIVE: The number of data points our classifier predicted as positive that were actually negative.

FALSE NEGATIVE: The number of data points our classifier predicted as negative that were actually positive.

TRUE POSITIVE: The number of data points for which our classifier correctly predicted the positive class, i.e. the actual output was positive and our classifier also predicted positive.

TRUE NEGATIVE: The number of data points for which our classifier correctly predicted the negative class, i.e. the actual output was negative and our classifier also predicted negative.

Once the confusion matrix is built, we get a much more detailed view of the accuracy for each class. Suppose our test dataset contains 100 points of the positive class and 200 points of the negative class, and the confusion matrix looks like this:

                    Predicted positive    Predicted negative
Actual positive     80 (TP)               20 (FN)
Actual negative     5 (FP)                195 (TN)

The positive class has a much lower accuracy, 80/(20+80) = 80%, than the negative class, 195/(195+5) = 97.5%. This information would have been lost had we looked only at the overall accuracy of the model, which is (80+195)/(100+200) = 91.7%.
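The arithmetic above can be reproduced with a short sketch. The helper `confusion_counts` and the label lists are assumptions built only to match the numbers in this example (100 positives of which 80 are caught, 200 negatives of which 195 are caught); a real project would more likely use a library routine.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if t == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, fp, tn, fn


# Hypothetical data matching the example: 100 positives, then 200 negatives.
y_true = [1] * 100 + [0] * 200
y_pred = [1] * 80 + [0] * 20 + [0] * 195 + [1] * 5

tp, fp, tn, fn = confusion_counts(y_true, y_pred)

positive_acc = tp / (tp + fn)                  # accuracy on the positive class
negative_acc = tn / (tn + fp)                  # accuracy on the negative class
overall_acc = (tp + tn) / (tp + fp + tn + fn)  # overall accuracy

print(tp, fp, tn, fn)
print(round(positive_acc, 3), round(negative_acc, 3), round(overall_acc, 3))
```

The per-class figures come out to 0.8 and 0.975, while the overall figure of roughly 0.917 hides the gap between them.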

Log Loss

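This measure is also known as cross-entropy loss, and it gets into the finer details of a classifier. In particular, if the raw output of the classifier is a numeric probability instead of a class label of 0 or 1, then log loss can be used. If the divergence between the predicted probability and the actual label is large, the log loss is large, and vice versa. In simple terms, say the actual label is 1 and our classifier predicts 0, but with a probability of only 0.58: that is a close miss for our classifier, and log loss treats it as a smaller error than a confident wrong prediction. In mathematical terms, log loss for binary classification is defined as:

$$\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$

where N is the number of data points, y_i is the actual label (0 or 1) of the i-th data point, and p_i is the predicted probability that the i-th data point belongs to class 1.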

Log loss therefore penalizes both types of errors, but especially those predictions that are confident and wrong. Smaller values of log loss mean the predicted probabilities are closer to the actual labels, with 0 being the best possible value.
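To make the penalty concrete, here is a minimal sketch of binary log loss using the standard cross-entropy form. The function name `log_loss`, the clipping constant `eps`, and the example probabilities are assumptions for illustration. Each probability is the predicted chance of class 1, so predicting class 0 with probability 0.58 corresponds to a class 1 probability of 0.42.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Average binary cross-entropy; p_pred holds P(class = 1) per point."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        # Clip to avoid log(0) when a prediction is exactly 0 or 1.
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# Actual label is 1 in all three hypothetical cases.
confident_hit = log_loss([1], [0.99])   # confident and right
close_miss = log_loss([1], [0.42])      # wrong, but only just
confident_miss = log_loss([1], [0.01])  # confident and wrong

print(confident_hit, close_miss, confident_miss)
```

The three values increase in that order: the confident wrong prediction costs far more than the close miss.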

Area Under Curve

AUC, the area under the ROC (Receiver Operating Characteristic) curve, is one of the most important evaluation metrics for checking any classification model’s performance. The ROC curve is a plot of the false positive rate (x-axis) versus the true positive rate (y-axis) for a number of different candidate threshold values between 0.0 and 1.0. Put another way, it plots the false alarm rate versus the hit rate.

The ROC curve shows the sensitivity of the classifier by plotting the rate of true positives against the rate of false positives. In other words, it shows how many correct positive classifications can be gained as you allow for more and more false positives. A perfect classifier that makes no mistakes would hit a true positive rate of 100% immediately, without incurring any false positives; this almost never happens in practice. The curve is particularly useful for comparing different thresholds and different models, and the area under it can be used as a single-number summary of model skill: 1.0 for a perfect classifier and about 0.5 for random guessing. You can also look at the tutorial by Kevin Markham for more understanding.
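The threshold sweep can be sketched in plain Python as follows. The helpers `roc_points` and `auc`, and the four toy scores, are assumptions for illustration; the sketch uses each distinct score as a candidate threshold and the trapezoidal rule for the area, and it assumes the data contains at least one positive and one negative example.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs obtained by lowering the decision threshold."""
    pos = sum(1 for y in y_true if y == 1)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for thr in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is predicted positive.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the piecewise-linear curve, using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
roc_auc = auc(points)
print(points)
print(roc_auc)  # 0.75 for this toy data
```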


Recall

Recall is the fraction of all actual positive instances that the classifier correctly identifies as positive. In simple terms, it is the percentage of the positive class correctly predicted by the classifier. For instance, if the recall is 0.95, then out of 100 positive instances our classifier correctly predicted 95. It is also known as the true positive rate (TPR) or sensitivity. Recall is highly useful in the medical diagnosis of tumors and cancers, where missing a positive case is costly. The formula for recall is:

$$\text{Recall} = \frac{TP}{TP + FN}$$
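A small sketch of this calculation, taking recall as the true positives divided by all actual positives. The function name `recall` and the counts are assumptions chosen to match the 95-out-of-100 example in the text.

```python
def recall(tp, fn):
    """True positives divided by all actual positives (TP + FN)."""
    if tp + fn == 0:
        raise ValueError("recall is undefined when there are no actual positives")
    return tp / (tp + fn)


# Hypothetical counts: 100 actual positives, 95 found and 5 missed.
r = recall(tp=95, fn=5)
print(r)  # 0.95
```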


Precision

Whenever the classifier predicts the positive class, we want that prediction to be trustworthy. Precision answers a simple question: out of the items that the ranker/classifier predicted to be relevant, how many are truly relevant? It is used where we have to minimize false positives. Precision is used in search engine ranking, document classification, and many customer-facing tasks. The formula for precision is:

$$\text{Precision} = \frac{TP}{TP + FP}$$
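A matching sketch for precision, taken as the true positives divided by everything the classifier predicted as positive. The function name `precision` is an assumption for illustration, and the counts reuse the 80 true positives and 5 false positives from the confusion matrix example earlier in the post.

```python
def precision(tp, fp):
    """True positives divided by all predicted positives (TP + FP)."""
    if tp + fp == 0:
        raise ValueError("precision is undefined when nothing is predicted positive")
    return tp / (tp + fp)


# Counts from the confusion matrix example: 80 true positives, 5 false positives.
prec = precision(tp=80, fp=5)
print(round(prec, 3))  # 80 / 85, roughly 0.941
```

Looking at both numbers side by side is usually more informative than either alone, since a classifier can raise one at the expense of the other.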

Let me know your thoughts in the comments.